CN113889135A - Method for estimating direction of arrival of sound source, electronic equipment and chip system - Google Patents

Info

Publication number
CN113889135A
CN113889135A (application number CN202010643053.0A)
Authority
CN
China
Prior art keywords
sound source
signal
target
target sound
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010643053.0A
Other languages
Chinese (zh)
Inventor
朱梦尧
刘志韬
施栋
刘鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority application: CN202010643053.0A
Publication: CN113889135A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L21/0272: Voice signal separating
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L2021/02082: Noise filtering, the noise being echo or reverberation of the speech
    • G10L2021/02161: Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166: Microphone arrays; Beamforming

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

Embodiments of this application provide a method for estimating the direction of arrival of a sound source, an electronic device, and a chip system. The method relates to the field of audio processing, can be applied in multi-source acoustic environments, and yields relatively clean sound source signals. The method comprises: the electronic device acquires an audio signal comprising noise, the sound source signals of one or more target sound sources, and a reverberation signal; the electronic device then performs joint processing on the audio signal to obtain the sound source signal of each target sound source and its direction of arrival, where the joint processing comprises dereverberation, blind source separation, and direction-of-arrival estimation.

Description

Method for estimating direction of arrival of sound source, electronic equipment and chip system
Technical Field
Embodiments of this application relate to the field of audio processing, and in particular to a method for estimating the direction of arrival of a sound source, an electronic device, and a chip system.
Background
Direction of Arrival (DOA) estimation is a research focus of array signal processing. DOA estimation determines the direction of a sound source so that the target sound can be picked up more effectively, which makes it a key technology for human-computer interaction devices such as mobile phones, smart speakers, and large teleconference screens.
Because the audio signal collected by a device is usually a mixture of sound source signals and noise, conventional DOA estimation first treats the noise as interference and screens the audio signal for frequency bins that contain no noise, or in which the influence of noise is small; DOA estimation is then performed on the screened bins. However, this approach cannot handle scenes containing multiple sound sources, and in a complex acoustic environment few bins survive the screening and those that do may still contain interference components, so DOA estimation accuracy is low.
Disclosure of Invention
Embodiments of this application provide a method for estimating the direction of arrival of a sound source, an electronic device, and a chip system, which can be applied in multi-source acoustic environments and can improve the accuracy of DOA estimation in complex acoustic environments.
To achieve this purpose, the following technical solutions are adopted:
In a first aspect, an embodiment of this application provides a method for estimating the direction of arrival of a sound source, comprising: the electronic device acquires an audio signal comprising noise, the sound source signals of one or more target sound sources, and a reverberation signal; the electronic device performs an Nth dereverberation processing on the audio signal to obtain an Nth prediction matrix and an Nth dereverberated signal, where the Nth dereverberated signal comprises the components of the audio signal other than the Nth reverberation signal, the Nth reverberation signal being the reverberation removed in the Nth dereverberation processing; the electronic device performs an Nth blind source separation processing on the Nth dereverberated signal to obtain an Nth demixing matrix and an Nth denoised signal, where the Nth denoised signal is the audio signal with the noise found by the Nth blind source separation removed; the electronic device continues to perform dereverberation processing and blind source separation processing on the Nth denoised signal; when N reaches a preset value, or when the Nth prediction matrix and the Nth demixing matrix have both converged, the electronic device obtains the sound source signals of the one or more target sound sources from the Nth demixing matrix and the Nth dereverberated signal, where N is a positive integer starting from 1; and the electronic device determines the direction of arrival of the sound source signal of each of the one or more target sound sources.
By running this joint iteration of dereverberation and blind source separation on the acquired audio signal, the electronic device can separate the noise, the reverberation signal, and the sound source signals. That is, when the audio signal contains the sound source signals of one or more target sound sources, the joint iteration yields the sound source signal of each target sound source individually, so the direction of arrival of each can be estimated and the method applies to multi-source acoustic environments. A reverberation signal is a delayed copy of a target sound source signal that reaches the microphone after reflection, so its direction differs from that of the direct sound; noise, by contrast, arrives from all directions. Consequently, the per-source signals obtained after the joint iteration contain little of the noise and reverberation that degrade DOA accuracy, so the electronic device estimates the direction of arrival of each target sound source with high precision, even when the audio is captured in a high-noise, high-reverberation acoustic environment.
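The alternation described above can be sketched in a few lines of NumPy. The update rules below are deliberately simplified stand-ins (a damped least-squares reverberation predictor and a whitening-style demixer), not the patent's actual formulas; what the sketch shows is the control flow: alternate the two updates and stop when both the prediction matrix G and the demixing matrix W stop changing.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-channel, single-frequency-bin data:
# observed = mixed sources + a delayed copy (a crude reverberation tail)
T, delay = 200, 3
sources = rng.standard_normal((2, T)) * np.array([[1.0], [0.3]])
mixing = np.array([[1.0, 0.6], [0.4, 1.0]])
clean = mixing @ sources
observed = clean.copy()
observed[:, delay:] += 0.5 * clean[:, :-delay]   # simple reverberation model

def dereverb_step(x, G, mu=0.5):
    """One damped update of the prediction matrix G (2x2, one past tap):
    predict the tail from past frames and subtract it."""
    past, cur = x[:, :-delay], x[:, delay:]
    G_ls = cur @ past.T @ np.linalg.inv(past @ past.T)  # least-squares predictor
    G = (1 - mu) * G + mu * G_ls                        # damped update
    d = x.copy()
    d[:, delay:] = x[:, delay:] - G @ x[:, :-delay]
    return G, d

def bss_step(d, W, mu=0.5):
    """One damped update of the demixing matrix W (a whitening-style
    stand-in for the patent's blind source separation)."""
    C = d @ d.T / d.shape[1]
    W_ls = np.linalg.inv(np.linalg.cholesky(C))         # whitening demixer
    W = (1 - mu) * W + mu * W_ls
    return W, W @ d

G, W = np.zeros((2, 2)), np.eye(2)
for n in range(50):                                     # joint iteration
    G_old, W_old = G.copy(), W.copy()
    G, dereverbed = dereverb_step(observed, G)
    W, separated = bss_step(dereverbed, W)
    if max(np.linalg.norm(G - G_old), np.linalg.norm(W - W_old)) < 1e-6:
        break                                           # both matrices converged
```

With the damped updates both matrices converge geometrically, so the loop exits well before the iteration cap.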
In one possible implementation of the first aspect, for a first target sound source, which is any one of the one or more target sound sources, the direction of arrival of its sound source signal comprises first direction information of that sound source signal at different frequency bins.
For convenience of description, any one of the one or more target sound sources is denoted the first target sound source.
In a possible implementation of the first aspect, the electronic device performs, based on the first direction information of the sound source signal of the first target sound source at different frequency bins, a joint processing of sound source separation and direction-of-arrival estimation on that sound source signal to obtain second direction information at different frequency bins, where the sound source separation processing comprises the joint iteration of dereverberation and blind source separation.
Dereverberation and blind source separation are both estimation algorithms, so the sound source signals obtained from the joint iteration may still contain noise and/or reverberation, and the first direction information derived from such signals may be inaccurate. The electronic device therefore repeats sound source separation and DOA estimation for the first target sound source, constraining the new separation with the current first direction information at each frequency bin. The constrained separation yields a more accurate sound source signal, demixing matrix, and mixing matrix for the first target sound source, from which the electronic device derives second direction information at each frequency bin. This second direction information is naturally more accurate than the first direction information.
In a possible implementation of the first aspect, the electronic device applies smoothing filtering or kernel density estimation to the first direction information of the sound source signal of the first target sound source at different frequency bins to obtain third direction information at those bins, and then fuses the third direction information across bins to obtain the direction of the first target sound source.
In this embodiment, the smoothing filtering and kernel density estimation remove some of the interference in the per-bin estimates, so the direction of the sound source signal of the first target sound source can be obtained from the cleaned third direction information.
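The kernel-density fusion step can be illustrated with a small NumPy sketch. The patent does not specify its kernel or fusion rule, so the data, bandwidth, and function name here are invented for illustration; the idea is to estimate a circular density over the per-bin azimuth estimates and take its peak as the fused direction.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-frequency-bin azimuth estimates (degrees): most bins
# agree on roughly 40 degrees, a few outlier bins are scattered by interference.
bin_angles = np.concatenate([
    rng.normal(40.0, 3.0, size=180),   # bins dominated by the target source
    rng.uniform(0.0, 360.0, size=20),  # interference-corrupted bins
])

def fuse_directions(angles_deg, bandwidth=5.0):
    """Fuse per-bin direction estimates with a circular Gaussian kernel
    density estimate and return the angle at the density peak."""
    grid = np.arange(0.0, 360.0, 1.0)
    # circular distance so 359 deg and 1 deg count as close
    diff = np.abs(grid[:, None] - angles_deg[None, :])
    diff = np.minimum(diff, 360.0 - diff)
    density = np.exp(-0.5 * (diff / bandwidth) ** 2).sum(axis=1)
    return grid[np.argmax(density)]

doa = fuse_directions(bin_angles)
```

Because the density peak is driven by the dense cluster of consistent bins, the scattered outliers barely shift the fused direction.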
In one possible implementation of the first aspect, continuing to perform dereverberation and blind source separation on the Nth denoised signal comprises:
if the pth prediction matrix from the pth dereverberation processing has converged but the pth demixing matrix from the pth blind source separation has not, the electronic device performs only the (p + i)th blind source separation until the (p + i)th demixing matrix converges, or alternately performs the (p + i)th dereverberation and the (p + i)th blind source separation until the (p + i)th prediction matrix and demixing matrix have both converged, where p is a positive integer and i is a positive integer starting from 1;
if the qth prediction matrix from the qth dereverberation processing has not converged but the qth demixing matrix from the qth blind source separation has, the electronic device performs only the (q + i)th dereverberation until the (q + i)th prediction matrix converges, or alternately performs the (q + i)th dereverberation and the (q + i)th blind source separation until the (q + i)th prediction matrix and demixing matrix have both converged, where q is a positive integer and i is a positive integer starting from 1.
In this embodiment, the electronic device may loop over dereverberation and blind source separation until both the prediction matrix and the demixing matrix converge; alternatively, once one of the matrices has converged, it may iterate only the unconverged matrix until both have converged. This avoids repeated computation on an already-converged matrix and improves processing efficiency.
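The branching above reduces to: keep updating whichever matrix has not yet converged. A toy sketch makes the control flow concrete; the scalar "matrices" and their made-up fixed-point updates stand in for the patent's prediction and demixing matrices.

```python
def run_until_converged(update_G, update_W, G, W, tol=1e-6, max_iter=300):
    """Alternate dereverberation (G) and blind-source-separation (W) updates,
    skipping whichever update has already converged, as in the p/q cases.
    update_G / update_W return the next estimate."""
    g_done = w_done = False
    for _ in range(max_iter):
        if not g_done:
            G_new = update_G(G, W)
            g_done = abs(G_new - G) < tol   # prediction matrix converged?
            G = G_new
        if not w_done:
            W_new = update_W(G, W)
            w_done = abs(W_new - W) < tol   # demixing matrix converged?
            W = W_new
        if g_done and w_done:
            break
    return G, W

# Toy updates with different convergence rates: G settles first, after
# which only W keeps iterating until it also converges.
G, W = run_until_converged(
    update_G=lambda G, W: 0.5 * G + 0.5 * 2.0,   # fixed point at 2.0
    update_W=lambda G, W: 0.9 * W + 0.1 * G,     # tracks G's limit, slowly
    G=0.0, W=0.0)
```

Here G converges in about 20 steps and is then frozen, while W continues alone, exactly the asymmetric case the claim describes.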
In a possible implementation of the first aspect, the jth dereverberation processing comprises updating the prediction matrix m times, and the jth blind source separation processing comprises updating the demixing matrix n times, where j, m, and n are positive integers.
Each pass of dereverberation or blind source separation may thus iterate its own matrix several times internally, which reduces the amount of computation and improves processing efficiency.
In a possible implementation of the first aspect, the dereverberated signal from the most recent dereverberation processing serves as the input to blind source separation while the audio signal serves as the input to dereverberation; alternatively, the denoised signal from the most recent blind source separation serves as the input to dereverberation while the audio signal serves as the input to blind source separation.
In a possible implementation of the first aspect, a dereverberated signal obtained in any earlier iteration may serve as the input to the current blind source separation, and a denoised signal obtained in any earlier iteration may serve as the input to the current dereverberation.
During the joint iteration, parameters of the dereverberation process thus influence the blind source separation, and parameters of the blind source separation likewise influence the dereverberation. These parameters may be the dereverberated and/or denoised signals obtained in any earlier iteration.
In one possible implementation of the first aspect, acquiring the audio signal comprises: the electronic device acquires the audio signal through an acoustic vector sensor on the electronic device itself; or the electronic device receives an audio signal collected by an acoustic vector sensor on another electronic device.
In one possible implementation of the first aspect, determining the direction of arrival of the sound source signals of the one or more target sound sources comprises: the electronic device obtains the direction of arrival of the sound source signal of a second target sound source, which is any one of the one or more target sound sources, from one or more of the amplitudes of that sound source signal across multiple channels, the demixing matrix, or the mixing matrix of the sound source signals of the one or more target sound sources; the demixing matrix represents the transformation that separates the audio signal into the sound source signals of the one or more target sound sources, and the mixing matrix represents the transformation that mixes those sound source signals into the audio signal.
Because the channels of an acoustic vector microphone are co-located, the amplitude that a single target sound source's signal produces on each channel is related to the direction of that source, so the electronic device can obtain the direction of arrival of a target sound source's signal from its amplitudes across multiple channels. Likewise, combining the co-located-channel property with the mixing model of multiple target sound sources, the direction of each target sound source's signal is embedded in the transformation that mixes the source signals and noise into the audio signal, so the electronic device can obtain the direction of arrival of any target sound source from the mixing matrix or the demixing matrix of the sound source signals of the one or more target sound sources.
In one possible implementation of the first aspect, obtaining the direction of arrival of the sound source signal of the second target sound source from the mixing matrix comprises: the electronic device identifies a target column of the mixing matrix, the column representing the sound source signal of the second target sound source, together with a first target row and a second target row within that column, these rows being the ones related to the angle of that sound source signal; the electronic device then obtains the direction of arrival from the elements of the first target row and the second target row in the target column.
Because the acoustic vector sensor's channels are co-located, the mixing matrix encodes the ratio of each target source signal's amplitudes across the channels, and hence the angles; each angle can be determined from the angle-related elements in the first and second target rows of the column representing that source's signal.
In a possible implementation of the first aspect, when the first target row corresponds to a first channel of the acoustic vector sensor and the second target row to a second channel, the direction of arrival of the sound source signal of the second target sound source includes its horizontal angle, measured in the coordinate system of the acoustic vector sensor; and/or when the first target row corresponds to a third channel and the second target row to the omnidirectional channel, the direction of arrival includes the pitch angle of the sound source signal, likewise measured in the coordinate system of the acoustic vector sensor.
In this implementation, column 1 of the mixing matrix represents the sound source signal of the first target sound source, column 2 that of the second target sound source, and so on. The rows represent the omnidirectional, X, Y, and Z channels of a three-dimensional four-channel acoustic vector microphone. The electronic device obtains the horizontal angle of the first target sound source's signal from the elements of the X-channel row and the Y-channel row in that source's column, and the pitch angle from the elements of the Z-channel row and the omnidirectional-channel row; both angles are expressed in the coordinate system of the acoustic vector sensor.
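Assuming the common co-located acoustic-vector-sensor steering model, in which a source's mixing-matrix column is proportional to [1, cos(az)cos(el), sin(az)cos(el), sin(el)] for the omni, X, Y, and Z channels (the excerpt does not state the exact model, so this is an assumption), the angles fall out of simple element ratios, exactly as the claim describes:

```python
import numpy as np

def avs_steering(azimuth, elevation):
    """Steering vector of a 3-D four-channel acoustic vector sensor
    (omni, X, Y, Z) under the co-located-sensor model."""
    return np.array([
        1.0,
        np.cos(azimuth) * np.cos(elevation),
        np.sin(azimuth) * np.cos(elevation),
        np.sin(elevation),
    ])

def doa_from_mixing_column(a):
    """Recover (azimuth, elevation) from one column of the mixing matrix:
    azimuth from the X and Y rows, elevation from the Z and omni rows."""
    azimuth = np.arctan2(a[2], a[1])      # atan2(Y-row, X-row)
    elevation = np.arcsin(a[3] / a[0])    # Z-row over omni-row
    return azimuth, elevation

# Round trip: a column built for a known direction, with arbitrary scale
az, el = np.deg2rad(60.0), np.deg2rad(25.0)
column = 0.7 * avs_steering(az, el)       # the scale cancels in the ratios
est_az, est_el = doa_from_mixing_column(column)
```

Note that any per-source scaling of the column cancels in both ratios, which is why the angles survive the inherent scale ambiguity of blind source separation.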
In a possible implementation of the first aspect, after obtaining the sound source signals of the one or more target sound sources from the Nth demixing matrix and the Nth denoised signal, the electronic device applies a first enhancement processing to those signals. The first enhancement comprises interference-spectrum filtering and/or harmonic enhancement: interference-spectrum filtering removes interference components mixed into the sound source signal of any one target sound source based on the spectral energy of that signal; harmonic enhancement produces a harmonic-enhanced signal, that is, a sound source signal containing harmonic components, for the one or more target sound sources.
In this implementation, interference-spectrum filtering yields a cleaner sound source signal, while harmonic enhancement can enrich the sound we hear or better reproduce the true sound emitted by musical instruments and the like.
In a possible implementation of the first aspect, the electronic device applies a second enhancement processing to the sound source signal of the first target sound source based on its first direction information at different frequency bins. The second enhancement comprises interference-direction filtering and/or beamforming directional enhancement: interference-direction filtering removes frequency bins of the first target sound source's signal whose direction angles fall outside an expected angular range; beamforming directional enhancement increases the power of the sound source signal in a desired direction.
Interference-direction filtering thus suppresses sounds arriving from directions other than that of the first target sound source, while beamforming directional enhancement boosts the signal power in the desired direction.
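A hard-mask version of the interference-direction filtering can be sketched as follows; the function name, mask shape, and angle inputs are hypothetical, since the claim leaves the filter design open:

```python
import numpy as np

def direction_filter(stft_bins, bin_angles_deg, center_deg, width_deg):
    """Zero out frequency bins whose estimated direction falls outside the
    desired angular range (a hard-mask reading of interference-direction
    filtering)."""
    diff = np.abs(bin_angles_deg - center_deg)
    diff = np.minimum(diff, 360.0 - diff)       # circular angular distance
    mask = diff <= width_deg
    return stft_bins * mask, mask

# Six toy bins with per-bin direction estimates; keep those near 40 degrees.
bins = np.ones(6, dtype=complex)
angles = np.array([38.0, 41.0, 120.0, 43.0, 355.0, 40.0])
filtered, mask = direction_filter(bins, angles, center_deg=40.0, width_deg=10.0)
```

In practice a soft mask (tapering the gain with angular distance) would avoid musical-noise artifacts, but the hard mask shows the principle.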
In a possible implementation of the first aspect, when N is a preset value or the Nth prediction matrix and the Nth demixing matrix have converged, the method further comprises: the electronic device obtains the noise and the reverberation signals of the one or more target sound sources from the audio signal; the electronic device then adjusts the proportions of the noise, the sound source signal of the first target sound source, and the reverberation signal of the first target sound source, the first target sound source being any one of the one or more target sound sources.
By adjusting these proportions, the electronic device can render sounds with different scene effects, such as a KTV effect, a concert-hall effect, or an open-field effect.
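A minimal sketch of the proportion adjustment: once the direct source, reverberation, and noise have been separated, a scene effect is just a re-mix with preset gains. The preset names and gain triples below are invented for illustration; the patent gives no values.

```python
import numpy as np

# Hypothetical scene presets: relative gains for (direct source, reverb, noise)
PRESETS = {
    "dry":          (1.0, 0.1, 0.05),
    "concert_hall": (1.0, 0.8, 0.05),
    "open_field":   (1.0, 0.2, 0.3),
}

def remix(source, reverb, noise, preset):
    """Re-mix the separated components with preset gains to emulate a scene."""
    g_s, g_r, g_n = PRESETS[preset]
    return g_s * source + g_r * reverb + g_n * noise

# Toy separated components: a tone, its delayed copy, and background noise
t = np.linspace(0, 1, 8000)
source = np.sin(2 * np.pi * 220 * t)
reverb = 0.5 * np.roll(source, 400)
noise = 0.01 * np.random.default_rng(2).standard_normal(t.size)
hall = remix(source, reverb, noise, "concert_hall")
```

The separation step is what makes this possible: without per-component signals, the reverberation level could not be raised or lowered independently of the direct sound.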
In a second aspect, an embodiment of this application provides an electronic device, comprising:
an audio signal acquisition unit, for acquiring an audio signal comprising noise, the sound source signals of one or more target sound sources, and a reverberation signal;
a dereverberation processing unit, for performing the Nth dereverberation processing on the audio signal to obtain an Nth prediction matrix and an Nth dereverberated signal, where the Nth dereverberated signal comprises the components of the audio signal other than the Nth reverberation signal, the Nth reverberation signal being the reverberation removed in the Nth dereverberation processing;
a blind source separation processing unit, for performing the Nth blind source separation processing on the Nth dereverberated signal to obtain an Nth demixing matrix and an Nth denoised signal, where the Nth denoised signal is the audio signal with the noise found by the Nth blind source separation removed;
a sound source signal obtaining unit, for continuing dereverberation processing and blind source separation processing on the Nth denoised signal and, when N is a preset value or the Nth prediction matrix and the Nth demixing matrix have converged, obtaining the sound source signals of the one or more target sound sources from the Nth demixing matrix and the Nth dereverberated signal, where N is a positive integer starting from 1; and
a sound source direction estimation unit, for determining the direction of arrival of the sound source signal of each of the one or more target sound sources.
In a third aspect, an electronic device is provided, comprising a processor configured to execute a computer program stored in a memory, and to implement the method of any of the first aspect of the present application.
In a fourth aspect, a chip system is provided, which includes a processor coupled to a memory, the processor executing a computer program stored in the memory to implement the method of any one of the first aspect of the present application.
In a fifth aspect, there is provided a computer-readable storage medium storing a computer program which, when executed by one or more processors, implements the method of any one of the first aspect of the present application.
In a sixth aspect, the present application provides a computer program product, which when run on an electronic device, causes the electronic device to perform any one of the methods of the first aspect.
It is understood that the beneficial effects of the second to sixth aspects can be seen from the description of the first aspect, and are not described herein again.
Drawings
Fig. 1 is a schematic view of an application scenario of a method for estimating a direction of arrival of a sound source according to an embodiment of the present application;
fig. 2 is a schematic hardware structure diagram of an electronic device that executes a method for estimating a direction of arrival of a sound source according to an embodiment of the present application;
FIG. 3 is a schematic flowchart of a method for estimating a direction of arrival of a sound source according to an embodiment of the present application;
FIG. 4 is a schematic flow chart illustrating another method for estimating the direction of arrival of a sound source according to an embodiment of the present application;
FIG. 5 is a diagram illustrating a second mixing model of sound source signals of a plurality of target sound sources provided in an embodiment of the present application;
fig. 6 is a schematic diagram of a separation model for separating a sound source signal of each target sound source from an audio signal according to an embodiment of the present application;
FIG. 7 is a flow diagram illustrating one embodiment of a joint iterative process of dereverberation and blind source separation in the embodiment of FIG. 4;
FIG. 8 is a schematic flow chart illustrating another method for estimating the direction of arrival of a sound source according to an embodiment of the present application;
fig. 9 is a structural effect diagram of a three-dimensional four-channel acoustic vector sensor according to an embodiment of the present application;
FIG. 10 is a schematic block diagram of a joint process including sound source separation, DOA estimation and enhancement processes provided by embodiments of the present application;
fig. 11 is a schematic block diagram of functional architecture modules of an electronic device that executes a method for estimating a direction of arrival of a sound source according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that in the embodiments of the present application, "one or more" means one, two, or more than two; "and/or" describes an association between the associated objects and indicates that three relationships may exist; for example, "A and/or B" may represent: A alone, both A and B, or B alone, where A and B may each be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.
The embodiment of the present application may be applied in an acoustic environment in which one or more target sound sources exist. Referring to fig. 1, fig. 1 shows an application scenario of the method for estimating a direction of arrival of a sound source provided by the embodiment of the present application. As shown in fig. 1, voice information uttered by one or more users may be acquired by a microphone array, which may include one or more microphones, each for acquiring the voice information uttered by the one or more users. Fig. 1 takes as an example a microphone array including 4 microphones and 4 users; it is understood that in practice the microphone array may include more or fewer than 4 microphones, and the number of users may likewise be more or fewer than 4.
For example, the 4 users may be talking or singing from time to time (hereinafter, the sound made by the users may be referred to as voice information). If at least two of the 4 users are talking during a certain time period, at least two target sound sources may be considered present. In another period, when only one user is talking, one target sound source may be considered present. If all 4 users are talking during a certain period, up to four target sound sources may be considered present at the same time. Taking an acoustic environment in which 4 users are talking as an example, there are 4 target sound sources in the acoustic environment. There are 4 microphones in the microphone array, and the voice information uttered by each user can be collected by all 4 microphones; likewise, each microphone can collect the voice information uttered by every user. In addition to the voice information uttered by each user, the information collected by each microphone in the microphone array may include noise (for example, mixed ambient noise and device noise), a reverberation signal, and the like.
Since each microphone can collect not only voice information sent by each user, but also mixed environmental noise, device noise and reverberation signals, all information collected by the microphone can be referred to as audio signals in the embodiment of the application. The mixed ambient noise and the device noise are collectively referred to as noise in the embodiments of the present application.
The audio signal collected by each microphone in the microphone array is referred to as an audio signal of one channel, and when the audio signal is collected by the microphone array in the application scenario shown in fig. 1, the audio signal collected by the microphone array is an audio signal of 4 channels. The audio signals of one channel may comprise speech signals uttered by different users.
The electronic device can separate the voice information uttered by each user from the audio signal containing the reverberation signal and the noise, and the voice information uttered by each user can be understood as a sound source signal of a target sound source. The electronic device may also obtain a direction of arrival of the sound source signal for each target sound source.
It should be understood that each microphone in the microphone array can collect the voice information respectively uttered by the 4 users.
The embodiment of the present application provides a method for estimating a direction of arrival of a sound source, where the method may be applied to an electronic device, and the electronic device may be: a mobile phone, a tablet computer, a wearable device, a vehicle-mounted device, an augmented reality (AR)/virtual reality (VR) device, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), or the like. The embodiment of the present application does not limit the specific type of the electronic device.
Fig. 2 shows a schematic structural diagram of an electronic device. The electronic device 200 may include a processor 210, an internal memory 221, an antenna 1, an antenna 2, a mobile communication module 250, a wireless communication module 260, an audio module 270, a speaker 270A, a receiver 270B, a microphone 270C, and a headset interface 270D.
It is to be understood that the illustrated structure of the embodiment of the present application does not specifically limit the electronic device 200. In other embodiments of the present application, the electronic device 200 may include more or fewer components than shown, or combine certain components, or split certain components, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Processor 210 may include one or more processing units, such as: the processor 210 may include an Application Processor (AP), a modem processor, a Graphics Processing Unit (GPU), an Image Signal Processor (ISP), a controller, a memory, a video codec, a Digital Signal Processor (DSP), a baseband processor, and/or a neural-Network Processing Unit (NPU), etc. The different processing units may be separate devices or may be integrated into one or more processors. For example, the processor 210 is configured to execute a method of estimating a direction of arrival of a sound source in the embodiment of the present application, for example, the following steps 301 to 302.
A memory may also be provided in processor 210 for storing instructions and data. In some embodiments, the memory in the processor 210 is a cache memory. The memory may hold instructions or data that have just been used or recycled by processor 210. If the processor 210 needs to reuse the instruction or data, it may be called directly from memory. Avoiding repeated accesses reduces the latency of the processor 210, thereby increasing the efficiency of the system.
In some embodiments, processor 210 may include one or more interfaces. The interface may include an inter-integrated circuit (I2C) interface, an inter-IC sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, etc.
The wireless communication function of the electronic device 200 may be implemented by the antenna 1, the antenna 2, the mobile communication module 250, the wireless communication module 260, the modem processor, the baseband processor, and the like.
The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals. Each antenna in the electronic device 200 may be used to cover a single or multiple communication bands. Different antennas can also be multiplexed to improve the utilization of the antennas. For example: the antenna 1 may be multiplexed as a diversity antenna of a wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
As an example, when the electronic device is not provided with a microphone array, the audio signal of the other electronic device may be acquired through the antenna 1 and the mobile communication module 250, and the audio signal of the other electronic device may also be acquired through the antenna 2 and the wireless communication module 260.
The mobile communication module 250 may provide a solution including 2G/3G/4G/5G wireless communication applied on the electronic device 200. The mobile communication module 250 may include at least one filter, a switch, a power amplifier, a Low Noise Amplifier (LNA), and the like. The mobile communication module 250 may receive the electromagnetic wave from the antenna 1, filter, amplify, etc. the received electromagnetic wave, and transmit the electromagnetic wave to the modem processor for demodulation. The mobile communication module 250 may also amplify the signal modulated by the modem processor, and convert the signal into electromagnetic wave through the antenna 1 to radiate the electromagnetic wave.
In some embodiments, at least some of the functional modules of the mobile communication module 250 may be disposed in the processor 210. In some embodiments, at least some of the functional modules of the mobile communication module 250 may be disposed in the same device as at least some of the modules of the processor 210.
The wireless communication module 260 may provide a solution for wireless communication applied to the electronic device 200, including Wireless Local Area Networks (WLANs) (e.g., wireless fidelity (Wi-Fi) networks), bluetooth (bluetooth, BT), Global Navigation Satellite System (GNSS), Frequency Modulation (FM), Near Field Communication (NFC), Infrared (IR), and the like. The wireless communication module 260 may be one or more devices integrating at least one communication processing module. The wireless communication module 260 receives electromagnetic waves via the antenna 2, performs frequency modulation and filtering processing on electromagnetic wave signals, and transmits the processed signals to the processor 210. The wireless communication module 260 may also receive a signal to be transmitted from the processor 210, frequency-modulate and amplify the signal, and convert the signal into electromagnetic waves via the antenna 2 to radiate the electromagnetic waves.
In some embodiments, antenna 1 of electronic device 200 is coupled to mobile communication module 250 and antenna 2 is coupled to wireless communication module 260, such that electronic device 200 may communicate with networks and other devices via wireless communication techniques. The wireless communication technology may include global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), time-division code division multiple access (TD-SCDMA), long term evolution (LTE), BT, GNSS, WLAN, NFC, FM, and/or IR technologies, among others. GNSS may include the global positioning system (GPS), the global navigation satellite system (GLONASS), the BeiDou navigation satellite system (BDS), the quasi-zenith satellite system (QZSS), and/or satellite based augmentation systems (SBAS).
Internal memory 221 may be used to store computer-executable program code, which includes instructions. The processor 210 executes various functional applications of the electronic device 200 and data processing by executing instructions stored in the internal memory 221. The internal memory 221 may include a program storage area and a data storage area. The storage program area may store an operating system, and an application program (such as a sound playing function, an image playing function, etc.) required by at least one function. The storage data area may store data created during use of the electronic device 200.
In addition, the internal memory 221 may include a high-speed random access memory, and may further include a nonvolatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash memory (UFS), and the like.
Electronic device 200 may implement audio functions via audio module 270, speaker 270A, receiver 270B, microphone 270C, headset interface 270D, and an application processor, among other things. Such as music playing, recording, etc.
Audio module 270 is used to convert digital audio signals to analog audio signal outputs and also to convert analog audio inputs to digital audio signals. Audio module 270 may also be used to encode and decode audio signals. In some embodiments, the audio module 270 may be disposed in the processor 210, or some functional modules of the audio module 270 may be disposed in the processor 210.
The speaker 270A, also called a "horn", is used to convert an audio electrical signal into an acoustic signal. The electronic device 200 may play the sound source signal obtained in the embodiment of the present application through the speaker 270A.
The receiver 270B, also called "earpiece", is used to convert the electrical audio signal into an acoustic signal. When the electronic device 200 receives a call or voice information, the user can receive the voice by placing the receiver 270B close to the ear of the user, for example, the user receives the sound source signal obtained in the embodiment of the present application through the receiver in the hearing aid.
The microphone 270C, also referred to as a "mic", is used to convert acoustic signals into electrical signals. When making a call or transmitting voice information, the user can input a voice signal to the microphone 270C by speaking with the mouth close to the microphone 270C. The electronic device 200 may be provided with at least one microphone 270C. In other embodiments, the electronic device 200 may be provided with two microphones 270C to achieve a noise reduction function in addition to collecting sound signals. In other embodiments, the electronic device 200 may further include three, four, or more microphones 270C to form the microphone array provided in the embodiments of the present application, so as to achieve sound signal collection, noise reduction, sound source identification, directional recording, and the like. For example, the microphone 270C may be used to capture the audio signals related to the embodiments of the present application.
It should be noted that, if the electronic device is a server, the server includes a processor and a communication interface.
In the embodiment of the present application, the specific structure of the execution subject of the method for estimating a direction of arrival of a sound source is not particularly limited, as long as it can run a program recording the code of the method for estimating a direction of arrival of a sound source of the embodiment of the present application. For example, the execution subject of the method provided by the embodiment of the present application may be a functional module in an electronic device capable of calling and executing a program, or a communication apparatus applied to the electronic device, such as a chip. The following embodiments are described taking an electronic device as the execution subject of the method for estimating a direction of arrival of a sound source.
Referring to fig. 3, fig. 3 is a schematic flowchart of a method for estimating a direction of arrival of a sound source according to an embodiment of the present application, where as shown in the figure, the method includes:
step 301, the electronic device acquires an audio signal, where the audio signal includes: noise, sound source signals of one or more target sound sources, reverberation signals.
In the embodiment of the present application, the audio signal may be a multi-channel audio signal or a single-channel audio signal, and the multi-channel audio signal indicates that the audio signal is from multiple channels.
The implementation of step 301 differs depending on whether the electronic device has the function of collecting audio signals, and the following description is provided:
example 1, an electronic device has a function of acquiring an audio signal.
In one possible implementation, step 301 may be implemented by: an electronic device captures audio signals through an audio capture device (e.g., an array of microphones) disposed within the electronic device.
For example, the electronic device may be the electronic device of the microphone array shown in fig. 1.
Example 2, the electronic device does not have a function of acquiring an audio signal.
In one possible implementation, step 301 may be implemented by: the electronic device receives audio signals from other devices. The other device is provided with an array of microphones for picking up audio signals.
For example, in example 2 the electronic device may be a server, a cloud platform, or the like.
It should be understood that in the case where the electronic device has a function of acquiring an audio signal, the electronic device may also receive an audio signal transmitted by another device.
Environmental noise exists in most spaces in a natural state, and devices that generate, check, measure, or record signals may also introduce interference unrelated to the signal (referred to as device noise for short). Therefore, the audio signals collected by the microphone array may contain both environmental noise and device noise, which are hereinafter collectively referred to as noise. In addition, since environmental noise in a space is itself emitted by sound sources, sound sources in the space other than those emitting noise may, for convenience of distinction, be regarded as target sound sources. Of course, in practical applications, the number of target sound sources in an audio signal may be one or more.
Since the sound emitted by a target sound source may be reflected from the ground, walls, and the like, the audio signal may include, in addition to noise, both the sound that travels directly from the target sound source to the microphone array and the sound collected by the microphone array after reflection. In the embodiment of the application, the component collected by the microphone directly after the target sound source emits sound is recorded as the sound source signal, and the component collected by the microphone after the emitted sound has been reflected is recorded as the reverberation signal.
The audio signal in the embodiment of the present application is a mixed signal of noise, a reverberation signal, and the sound source signals of one or more target sound sources. Alternatively, the audio signal in the embodiment of the present application is a mixed signal of noise and the sound source signals of one or more target sound sources.
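The mixing model just described (direct sound source components plus delayed reflections plus noise) can be sketched numerically. The gains, delay, and noise level below are illustrative assumptions, not values from this application; `np.roll` stands in for a true room impulse response:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 1000                                     # number of samples
sources = rng.standard_normal((2, T))        # two target sound sources
mix = np.array([[1.0, 0.6],                  # hypothetical mixing gains
                [0.5, 1.0]])                 # (source-to-microphone paths)
direct = mix @ sources                       # sound source components
reverb = 0.3 * np.roll(direct, 50, axis=1)   # delayed (circularly shifted)
                                             # copy standing in for reflections
noise = 0.05 * rng.standard_normal((2, T))   # ambient + device noise
audio = direct + reverb + noise              # signal each microphone records
```

Each row of `audio` is the one-channel audio signal of one microphone, already containing all three components that the joint processing must untangle.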
The audio signal acquired by the electronic device may be a time domain signal or a frequency domain signal, and the actions of the electronic device before performing step 302 differ between the two cases, which are described below:
case 1, the audio signal in the embodiment of the present application is a time domain signal.
Correspondingly, the method provided by the embodiment of the present application may further include, before step 302: the electronic device may perform time-frequency transformation on the audio signal to obtain a frequency domain signal of the audio signal. It is understood that the frequency domain signal corresponding to the audio signal is the processing object in the subsequent step.
For example, the time-frequency transformation may adopt fourier transformation, fast fourier transformation, wavelet transformation, and the like, and what transformation mode is specifically adopted may be determined according to the actual application requirements, and the specific processing procedures of fourier transformation, fast fourier transformation, and wavelet transformation are not described herein again. Of course, in practical application, other time-frequency transformation methods may also be adopted, and are not limited herein.
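As a concrete illustration of the time-frequency transformation in case 1, a minimal short-time Fourier transform can be written with NumPy alone. The frame length, hop size, and Hann window are illustrative choices, not parameters specified by this application:

```python
import numpy as np

def stft(x, frame=256, hop=128):
    """Minimal short-time Fourier transform: split the time-domain
    signal into overlapping windowed frames and FFT each one."""
    window = np.hanning(frame)
    n_frames = 1 + (len(x) - frame) // hop
    frames = np.stack([x[i * hop:i * hop + frame] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)   # shape: (frames, frequency bins)

# 1 second of a 440 Hz tone sampled at 16 kHz
x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
X = stft(x)                              # frequency domain signal
```

Each column of `X` corresponds to one frequency point in the sense used later in this description; with a 256-sample frame at 16 kHz the bins are spaced 62.5 Hz apart.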
Case 2, the audio signal in the embodiment of the present application is a frequency domain signal.
In this case 2, the electronic device does not need to perform the step of time-frequency transforming the audio signals before performing the joint processing on the audio signals.
Step 302, the electronic device performs joint processing on the audio signals to obtain sound source signals of one or more target sound sources and arrival directions of the sound source signals of the one or more target sound sources, where the joint processing includes: dereverberation processing, blind source separation processing and direction of arrival estimation processing.
In this embodiment of the present application, the dereverberation processing and the blind source separation processing may be referred to as sound source separation processing, where the sound source separation processing may be first dereverberation processing and then blind source separation processing, may also be first blind source separation processing and then dereverberation processing, and may also be joint iteration processing of dereverberation processing and blind source separation processing. The dereverberation process, the blind source separation process, and the joint iterative process of the dereverberation process and the blind source separation process refer to the descriptions in the subsequent embodiments.
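For intuition about the blind source separation stage, a compact FastICA-style sketch (whitening followed by a fixed-point de-mixing-matrix update with symmetric decorrelation) is shown below. This is a generic stand-in for blind source separation, not the specific separation algorithm of this application:

```python
import numpy as np

def separate(audio, iters=100):
    """Illustrative FastICA-style blind source separation: whiten the
    mixture, then iterate a fixed-point rule to estimate the
    de-mixing matrix (generic stand-in, not the patent's update)."""
    x = audio - audio.mean(axis=1, keepdims=True)
    cov = x @ x.T / x.shape[1]
    d, e = np.linalg.eigh(cov)
    white = e @ np.diag(d ** -0.5) @ e.T       # whitening matrix
    z = white @ x
    rng = np.random.default_rng(0)
    w = rng.standard_normal((z.shape[0], z.shape[0]))
    for _ in range(iters):
        g = np.tanh(w @ z)
        # fixed-point update: E[g(Wz) z^T] - diag(E[g'(Wz)]) W
        w = (g @ z.T) / z.shape[1] - np.diag((1 - g ** 2).mean(axis=1)) @ w
        u, _, vt = np.linalg.svd(w)
        w = u @ vt                              # symmetric decorrelation
    return w @ z                                # estimated source signals
```

The product of `w` and the whitening matrix plays the role of the de-mixing matrix; applying it to the (dereverberated) mixture yields the de-noised source estimates, up to the usual scaling and permutation ambiguity of blind source separation.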
When the electronic device executes the joint processing, it may perform the sound source separation processing first and then the direction of arrival estimation processing; it may perform the sound source separation processing and the direction of arrival estimation processing cyclically; it may perform the direction of arrival estimation processing first and then the sound source separation processing; or it may perform the direction of arrival estimation processing and the sound source separation processing cyclically. Accordingly, the electronic device performs the joint processing on the audio signals in at least the following ways:
in the first mode, the electronic device performs joint iterative processing of dereverberation processing and blind source separation processing on the audio signal to obtain sound source signals of one or more target sound sources;
the electronic device determines a direction of arrival of a sound source signal of one or more target sound sources.
As another implementation of the first approach, after the electronic device determines the direction of arrival of the sound source signals of the one or more target sound sources, the electronic device performs joint iterative processing of dereverberation processing and blind source separation processing and direction of arrival estimation processing on the audio signals based on the direction of arrival of the sound source signals of the one or more target sound sources, resulting in cleaner sound source signals and more accurate direction of arrival of the one or more target sound sources.
In a second mode, the electronic device performs direction-of-arrival estimation processing on the audio signal to obtain third signals of one or more target sound sources and directions of arrival of the third signals of the one or more target sound sources, where the third signals are sound source signals containing noise;
the electronic equipment performs joint iterative processing of dereverberation and blind source separation on the third signals of the one or more target sound sources according to the arrival directions of the third signals of the one or more target sound sources to obtain sound source signals of the one or more target sound sources;
the electronic device takes the direction of arrival of the third signal of the one or more target sound sources as the direction of arrival of the sound source signals of the one or more target sound sources.
As another implementation of the second mode, after performing the joint iterative processing of dereverberation and blind source separation on the third signals of the one or more target sound sources according to the directions of arrival of the third signals and obtaining the sound source signals of the one or more target sound sources, the electronic device may, instead of taking the directions of arrival of the third signals as the directions of arrival of the sound source signals, continue to determine the directions of arrival of the sound source signals of the one or more target sound sources.
As can be understood from the above description, one difference between the first and second manners in which the electronic device performs the joint processing on the audio signals is whether the sound source separation processing or the direction of arrival estimation processing is performed first.
The present application describes the process in which the electronic device performs joint processing on the audio signal using the first manner as an example; for the second manner, the dereverberation processing, blind source separation processing, and direction-of-arrival estimation processing may refer to the description of the first manner. Referring to fig. 4, fig. 4 is a schematic flowchart of another method for estimating a direction of arrival of a sound source according to an embodiment of the present disclosure. As shown in the figure, the method includes steps 401 to 403, where step 401 is consistent with step 301 and is not repeated.
In step 402, the electronic device performs joint iterative processing of dereverberation and blind source separation on the audio signal to obtain a sound source signal of one or more target sound sources.
The step may specifically refer to the description of the following embodiments, and is not repeated herein.
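The shape of the joint iteration in step 402, alternating a dereverberation pass and a blind source separation pass and stopping when a preset count is reached or when both the prediction matrix and the de-mixing matrix converge, can be sketched as a loop skeleton. The single-step updates below are deliberately trivial stand-ins (the actual update rules belong to the later embodiments); only the control flow mirrors the description:

```python
import numpy as np

def dereverb_step(x, pred):
    # Hypothetical single dereverberation pass: subtract a predicted
    # late-reverberation component (stand-in prediction-matrix update).
    new_pred = 0.5 * pred
    return new_pred, x - new_pred @ x

def bss_step(d, demix):
    # Hypothetical single blind-source-separation pass: one relaxation
    # step toward a fixed de-mixing matrix (stand-in update).
    new_demix = 0.9 * demix + 0.1 * np.eye(len(demix))
    return new_demix, new_demix @ d

def joint_iteration(audio, n_max=10, tol=1e-4):
    """Alternate dereverberation and blind source separation until both
    matrices converge or the preset count n_max is reached."""
    channels = audio.shape[0]
    pred = 0.2 * np.eye(channels)      # prediction matrix (initial guess)
    demix = 2.0 * np.eye(channels)     # de-mixing matrix (initial guess)
    for n in range(1, n_max + 1):
        new_pred, dereverbed = dereverb_step(audio, pred)
        new_demix, denoised = bss_step(dereverbed, demix)
        converged = (np.linalg.norm(new_pred - pred) < tol
                     and np.linalg.norm(new_demix - demix) < tol)
        pred, demix = new_pred, new_demix
        if converged:
            break
    # Sound source signals from the final de-mixing matrix applied to
    # the final dereverberated signal, as the summary above describes.
    return demix @ dereverbed, n
```

The stopping rule implements "N is a preset value or the Nth prediction matrix is converged and the Nth de-mixing matrix is converged" from the summary of the application.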
In step 403, the electronic device determines the direction of arrival of the sound source signals of one or more target sound sources.
The electronic device may perform a DOA estimation process on a sound source signal of each of one or more target sound sources to obtain a respective direction of arrival for each target sound source.
Taking an example that the one or more target sound sources include a first target sound source, the electronic device performs a DOA estimation process on a sound source signal of the first target sound source to obtain a direction of arrival of the first target sound source.
As an example, the direction of arrival of each target sound source includes first direction information of the target sound source at different frequency points. For example, taking the case where the plurality of target sound sources includes a first target sound source, the direction of arrival of the first target sound source includes first direction information of the first target sound source at different frequency points (e.g., frequency point 1, frequency point 2, and frequency point 3). When the audio signal is a frequency domain signal, its frequency values fall within a certain range (e.g., 500 to 3000 Hz), and each frequency value within the range may be recorded as a frequency point. For example, the first direction information (for example, a horizontal angle and a pitch angle) of the first target sound source at frequency point 1 may be the angle values (30°, 60°) of the first target sound source at 500 Hz; the first direction information at frequency point 2 may be the angle values (31°, 59°) at 501 Hz; and the first direction information at frequency point 3 may be the angle values (30°, 59°) at 502 Hz. In practical applications, the frequency values in the frequency range of the audio signal may also be divided into equally spaced frequency segments (for example, one segment every 5 Hz), with each segment recorded as a frequency point; this is not specifically illustrated.
The first target sound source is any one of the plurality of target sound sources; the term carries no special significance. In addition, the frequency points corresponding to different target sound sources may be the same or different. For example, the frequency points corresponding to the first target sound source may be a plurality of frequency points within a first frequency range (e.g., 500-2500 Hz), and the frequency points corresponding to a second target sound source of the plurality of target sound sources may be a plurality of frequency points within a second frequency range (e.g., 600-3000 Hz).
In one possible implementation manner, the electronic device determines the first direction information of the first target sound source at different frequency points, and obtains the direction of arrival of the first target sound source according to that first direction information. Of course, the electronic device may also fuse the amplitude information of the first target sound source at different frequency points and then calculate a direction angle of the first target sound source according to the fused amplitude information.
It should be noted that only the calculation of the direction of arrival of the first target sound source is described here as an example; the directions of arrival of the remaining target sound sources among the one or more target sound sources may be calculated in the same manner, and details are not repeated here.
In the method for estimating the direction of arrival of a sound source provided in the embodiment of the present application, under normal circumstances the audio signal contains, besides noise and the sound source signals of one or more target sound sources, a reverberation signal. The reverberation signal is the signal acquired by the microphone with a delay after the sound source signal of a target sound source has been reflected, so its direction has changed; the noise, meanwhile, arrives from all directions. Therefore, the sound source signal of each target sound source obtained after the joint processing contains hardly any, or only a very small amount of, the noise and reverberation that would degrade the DOA estimation accuracy. As a result, the electronic device achieves high DOA estimation accuracy when estimating the direction of arrival of the sound source signal of each target sound source, even when the acquisition environment of the audio signal is a high-noise, high-reverberation acoustic environment.
It should be noted that the audio signal in the embodiment of the present application may also be a signal that contains no reverberation. After such an audio signal is processed by the joint iterative processing of dereverberation and blind source separation provided in the embodiment of the present application, the obtained reverberation signal may be 0; alternatively, given the finite accuracy of the dereverberation algorithm, a small proportion of the audio signal may be treated as a reverberation signal. Even when a small proportion of reverberation is extracted from an audio signal that contains none, this is merely equivalent to blind source separation operating on a slightly reduced portion of the audio signal, so the accuracy of the final DOA estimation is hardly affected.
As a possible implementation manner, step 402 in the embodiment of the present application may be implemented as follows:
The electronic device performs the Nth dereverberation processing on the audio signal to obtain an Nth prediction matrix and an Nth dereverberation signal, where the Nth dereverberation signal comprises the signals in the audio signal other than the Nth reverberation signal, and the Nth reverberation signal is the reverberation removed in the Nth dereverberation processing. The electronic device then performs the Nth blind source separation processing on the Nth dereverberation signal to obtain an Nth unmixing matrix and an Nth de-noise signal, where the Nth de-noise signal is the audio signal with the noise obtained by the Nth blind source separation processing removed. The electronic device continues to perform dereverberation processing and blind source separation processing on the Nth de-noise signal. When N reaches a preset value, or when the Nth prediction matrix and the Nth unmixing matrix have both converged, the electronic device obtains the sound source signals of the one or more target sound sources according to the Nth unmixing matrix and the Nth dereverberation signal; N is a positive integer starting from 1.
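The data flow above can be sketched as the following loop skeleton. This is a hedged sketch of the control flow only: `dereverb_step` and `bss_step` are placeholder callables standing in for one or more prediction-matrix updates and one or more unmixing-matrix updates respectively, not an implementation of the claimed method.

```python
import numpy as np

def joint_iteration(audio, dereverb_step, bss_step, n_iters=3):
    """Run the Nth dereverberation, then the Nth blind source separation,
    and feed the Nth de-noise signal into the (N+1)th dereverberation."""
    signal = audio                      # processing signal for N = 1
    pred = unmix = None
    dereverbed = audio
    for _ in range(n_iters):            # or: until both matrices converge
        pred, dereverbed = dereverb_step(signal, pred)  # remove reverberation
        unmix, denoised = bss_step(dereverbed, unmix)   # remove noise
        signal = denoised               # input of the next joint iteration
    # the source signals are finally recovered from the last unmixing
    # matrix applied to the last dereverberated signal
    return unmix, dereverbed
```

With trivial stand-ins for the two steps the loop simply attenuates the signal, which is enough to check that each round consumes the previous round's de-noise signal.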
In this embodiment of the present application, the electronic device performs the dereverberation processing to remove the reverberation from the audio signal; the resulting signal, with the reverberation removed, may be recorded as the Nth dereverberation signal (the current dereverberation signal). The electronic device performs the blind source separation processing to separate each sound source signal from the noise; the Nth de-noise signal is the audio signal with the currently estimated noise removed, and may likewise be understood as the current de-noise signal. Theoretically, the sound source signal of a target sound source therefore contains neither reverberation components nor noise. In practice, however, given the finite accuracy of the joint iterative processing of dereverberation and blind source separation, a small amount of reverberation and/or noise may remain in the sound source signal of each target sound source. The sound source signal of a target sound source described in the embodiment of the present application therefore does not mean a signal entirely free of reverberation and/or noise.
As an example of the dereverberation processing, a WPE (Weighted Prediction Error) algorithm may be used. The idea of the WPE algorithm is as follows: the current received signal (which may also be understood as the processed signal during dereverberation) is assumed to be a linear combination of the current clean signal (understood as the sound source signal) and the received signals of several past frames (understood as the reverberation signal); noise in the audio signal is ignored during the dereverberation processing.
As an example, the first mixing model is obtained according to the above-mentioned WPE algorithm:

$$y_l(t) = \sum_{\tau=\Delta}^{\Delta+K_l-1} \mathbf{c}_l^{\mathsf H}(\tau)\,\mathbf{y}_l(t-\tau) + d_l(t),$$

where y_l(t) represents the received signal of the microphone; the summation term represents the reverberation signal, i.e., the part of the reverberation contributed by the received signals of the past Δ to Δ + K_l − 1 frames; d_l(t) indicates the clean signal; l indexes the frequency points and τ the frame numbers; c_l(τ) is called the prediction matrix, and c_l^H(τ) is its conjugate transpose (also called the linear prediction coefficients).
From the above description, it can be understood that the objective of the dereverberation processing is to estimate the prediction matrix, obtain the reverberation signal from the prediction matrix, and subtract the reverberation signal from the current received signal, thereby recovering the current clean signal, i.e., the signal with the reverberation removed.
Note that when noise is ignored, the signal with the reverberation removed is the sound source signal recovered from the received signal; when noise is present, the signal with the reverberation removed is the received signal minus the estimated reverberation signal, and may be understood as a mixture of the sound source signal and the noise.
Therefore, the dereverberation processing performed by the electronic device includes computing a prediction matrix. The electronic device may solve for this matrix iteratively, so that passing the processed signal through the matrix separates the reverberation signal from the dereverberated signal as completely as possible.
The prediction matrix may be solved by means of a maximum likelihood function, and the optimization proceeds as follows.

Step 1: initialize the prediction matrix:

$$\mathbf{c}_l(\tau) = \mathbf{0},$$

where τ denotes the frame number and Δ ≤ τ ≤ Δ + K_l − 1.
Step 2: calculate the dereverberated signal:

$$d_l(t) = y_l(t) - \sum_{\tau=\Delta}^{\Delta+K_l-1} \mathbf{c}_l^{\mathsf H}(\tau)\,\mathbf{y}_l(t-\tau), \qquad t \in \Gamma,$$

where Γ is the set of frame numbers of the received signal; for the other parameters, refer to the explanations of the first mixing model.
Step 3: estimate the spatial correlation coefficient:

$$\lambda_l(t) = \max\bigl(E\bigl[\,|d_l(t)|^2\,\bigr],\ \delta\bigr),$$

where E(·) is the expectation function, d_l(t) denotes the processed (dereverberated) signal, and δ is a predetermined positive constant serving as a lower bound for λ_l(t).
Step 4: calculate the weighted sample correlation matrices. Assuming that the clean signal follows a zero-mean complex Gaussian distribution, i.e.,

$$d_l(t) \sim \mathcal{N}_c\bigl(0,\ \lambda_l(t)\bigr),$$

the following are obtained:

$$\mathbf{R}_l = \sum_{t\in\Gamma} \frac{\boldsymbol{\psi}_l(t-\Delta)\,\boldsymbol{\psi}_l^{\mathsf H}(t-\Delta)}{\lambda_l(t)},$$

$$\mathbf{r}_l = \sum_{t\in\Gamma} \frac{\boldsymbol{\psi}_l(t-\Delta)\,y_l^{*}(t)}{\lambda_l(t)},$$

where y_l^*(t) is the conjugate transpose of y_l(t), ψ_l(t − Δ) = [y_l^T(t − Δ), …, y_l^T(t − Δ − K_l + 1)]^T denotes the stacked received signals of the past frames, ψ_l^H(t − Δ) is the conjugate transpose of ψ_l(t − Δ), and N is the number of microphones in the microphone array.
Step 5: update the prediction matrix parameters:

$$\mathbf{c}_l = \mathbf{R}_l^{-1}\,\mathbf{r}_l,$$

and, by rearranging the entries of c_l, obtain the updated prediction matrix c_l(τ).
Step 6: determine whether the prediction matrix has converged; if not, return to step 2; if so, end.
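Steps 1 to 6 can be sketched in code. The following is a hedged, single-channel, real-valued toy rendition (the actual WPE algorithm operates on multi-channel, complex-valued STFT bins; the tap count, delay, and floor constant here are illustrative assumptions):

```python
import numpy as np

def wpe_bin(y, delay=2, taps=4, n_iter=5, floor=1e-2):
    """Delayed linear prediction for one frequency bin: estimate prediction
    coefficients c, subtract the predicted (late) reverberation, iterate."""
    T = len(y)
    c = np.zeros(taps)                       # step 1: initialise c to zero
    idx = np.arange(delay + taps - 1, T)     # the frame set Γ
    # stacked past frames ψ(t - Δ): rows hold y(t-Δ), ..., y(t-Δ-K+1)
    psi = np.stack([y[idx - delay - k] for k in range(taps)])
    for _ in range(n_iter):
        d = y[idx] - c @ psi                 # step 2: dereverberated signal
        lam = np.maximum(d ** 2, floor)      # step 3: floored power estimate
        R = (psi / lam) @ psi.T              # step 4: weighted correlations
        r = (psi / lam) @ y[idx]
        c = np.linalg.solve(R, r)            # step 5: update coefficients
        # step 6 (convergence check on c) is omitted in this sketch
    out = y.copy()
    out[idx] = y[idx] - c @ psi              # final dereverberated signal
    return out, c
```

On a signal that really is a delayed linear combination of its own past plus an innovation, subtracting the learned prediction brings the output much closer to that innovation than the raw signal was.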
The process by which the electronic device calculates the prediction matrix is described above; in practical applications, the prediction matrix may also be calculated in other iterative manners, which is not limited here.
From the iterative process of the dereverberation processing illustrated in the embodiment of the present application, it can be understood that each update of the prediction matrix uses the prediction matrix obtained in the previous update together with the processing signal (i.e., y_l(t)).
The method is a late-reverberation suppression technique based on delayed linear prediction. It can effectively suppress the late reverberation signal, and it disrupts the short-time correlation of speech, so that the independence among channels is increased to a certain extent.
As an example of the blind source separation processing, the electronic device separates the sound source signal of each target sound source from the received signal (which may also be understood as the processed signal in the blind source separation processing). Blind source separation is the process of recovering each independent component (for example, the sound source signal of each target sound source) from the received signal alone, based on the statistical characteristics of the input sound source signals, without knowing the parameters of the sound source signals or of the transmission channel.
Referring to fig. 5, fig. 5 is a schematic diagram of a second mixing model of the sound source signals of multiple target sound sources. As shown in fig. 5, multiple target sound sources may exist in the environment, so the audio signal may contain the sound source signals of multiple target sound sources. Environmental noise also exists in the environment, and the microphone array acquiring the audio signal may itself introduce device noise into the audio signal. The audio signal acquired by the microphone array can therefore be modeled as a mixture of the sound source signals of the multiple target sound sources and noise; the reverberation signal in the audio signal is ignored during the blind source separation processing. The second mixing model may thus be expressed as:
$$\mathbf{X} = \mathbf{A}\mathbf{S} + \mathbf{N}_s,$$

where X represents the processing signal during the blind source separation processing, A is the mixing matrix, S is the sound source signal, and N_s is the noise.
Of course, if the microphone array that collects the audio signal contains a plurality of microphones, the received signal is a multi-channel audio signal; assume that the number of channels of the processed signal is M.
From the time-domain perspective, the processing signal in the blind source separation processing is represented as:

$$\mathbf{X}(t) = [x_1(t), x_2(t), \ldots, x_M(t)]^{\mathsf T}.$$

Assuming that the processed signal corresponds to N independent source signals, these are represented in the time domain as:

$$\mathbf{S}(t) = [s_1(t), s_2(t), \ldots, s_N(t)]^{\mathsf T}.$$

The second mixing model of the multi-sound-source mixing system is then:

$$\mathbf{X}(t) = \mathbf{A}\,\mathbf{S}(t) + \mathbf{N}_s(t),$$

where X(t) is the M-dimensional observation vector, S(t) is the N-dimensional unknown sound source signal vector, N_s(t) is the M-dimensional noise, and A is the M × N mixing matrix.
Fig. 6 is a schematic diagram of the separation model used when separating the sound source signal of each target sound source from the processed signal. According to this model, after performing blind source separation processing on the processed signal, the electronic device obtains an estimate of each sound source, which can also be regarded as the sound source signal. This can be expressed as:

$$\hat{\mathbf{S}}(t) = \mathbf{W}\,\mathbf{X}(t),$$

where Ŝ(t) represents the estimated vector of the sound source signals and W is the unmixing matrix.
As can be understood from the above formula, the blind source separation processing includes solving for an unmixing matrix. In the embodiment of the present application, the electronic device may solve for the unmixing matrix iteratively, so that passing the processing signal through the matrix during blind source separation separates the components as completely as possible.
When performing the blind source separation processing, the electronic device may separate the target sound source signals using methods such as Independent Vector Analysis (IVA), Independent Component Analysis (ICA), or Independent Low-Rank Matrix Analysis (ILRMA). Separation may be performed by maximizing the independence of the signals, and the cost function used during separation may be the log-likelihood function of a maximum likelihood estimate; the specific separation method, cost function, and optimization algorithm are not limited.
By way of example, in the embodiment of the present application the electronic device may employ independent vector analysis, which essentially extends the independent component analysis technique to multiple data sets, fully exploits the statistical dependence between the data sets, and decomposes them using higher-order and second-order statistics. The goal of independent vector analysis is that the sources within each data set are independent of one another, while a source in one data set is correlated with at most one source in each of the other data sets.
So that the above description holds when the electronic device performs blind source separation processing on the processed signal, in this embodiment of the application the data at each frequency (frequency point) of the frequency-domain signal may be regarded as one data set. There are then multiple data sets, each formed by a linear mixture of multiple independent sound source signals.
In the independent vector analysis method, a Source Component Vector (SCV) is defined: it is composed of the sources at corresponding positions of the different data sets, obtained when the second mixing model is converted from the time-domain representation described above into a frequency-domain representation.
The independent vector analysis method in fact amounts to choosing a cost function containing the unmixing matrix and an optimization algorithm that solves for the unmixing matrix in that cost function. The cost function requires a separation criterion based on an independence measure, such as the non-Gaussianity maximization criterion, the mutual information minimization criterion, the information maximization criterion, or the maximum likelihood criterion. The following description is from the frequency-domain perspective.
By way of example, the cost function may be:

$$J(\mathbf{W}) = \sum_{k=1}^{K} E\bigl[G(\mathbf{s}_k)\bigr] - \sum_{\omega=1}^{N_\omega} \log\bigl|\det \mathbf{W}(\omega)\bigr|,$$

where J(W) represents the mutual information among the SCVs, E[·] denotes expectation, s_k is the vector of the k-th sound source (there are K sound sources in total), and G(·) is a contrast function; if G(s_k) = −log p(s_k), then G(·) is the maximum-entropy contrast function, where p(s_k) is the marginal density function of the SCV. In addition, ω denotes the frequency point (there are N_ω frequency points in total), and det(·) is the determinant of a matrix.
It can be seen from the above cost function that minimizing it also minimizes the entropy between the sound source vectors, so the mutual information between the SCVs is minimized.
For the above cost function, the process of minimizing the cost function is the process of iteratively solving the unmixing matrix W.
By way of example, the process of minimizing the cost function is as follows:

1. Update the weighted covariance V_k(ω):

$$\mathbf{V}_k(\omega) = E\left[\frac{\mathbf{x}(\omega)\,\mathbf{x}^{\mathsf H}(\omega)}{r_k}\right],$$

where E(·) denotes the expectation function and

$$r_k = \sqrt{\sum_{\omega=1}^{N_\omega} \bigl|\mathbf{w}_k^{\mathsf H}(\omega)\,\mathbf{x}(\omega)\bigr|^2}$$

is shared by every frequency point ω; w_k^H(ω) denotes the conjugate transpose of the entry w_k(ω) of the unmixing matrix, x(ω) is the received signal, and x^H(ω) denotes the conjugate transpose of x(ω).

2. Update the unmixing matrix W:

$$\mathbf{w}_k(\omega) \leftarrow \bigl(\mathbf{W}(\omega)\,\mathbf{V}_k(\omega)\bigr)^{-1}\mathbf{e}_k,$$

$$\mathbf{w}_k(\omega) \leftarrow \frac{\mathbf{w}_k(\omega)}{\sqrt{\mathbf{w}_k^{\mathsf H}(\omega)\,\mathbf{V}_k(\omega)\,\mathbf{w}_k(\omega)}},$$

where e_k is the k-th standard basis vector; for the other parameters, refer to the explanations above.

Arranging the w_k(ω) yields the updated unmixing matrix; steps 1 and 2 are executed cyclically until the unmixing matrix converges.
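One sweep of these two updates can be sketched in the auxiliary-function style. This is a hedged sketch: the array shapes, the epsilon regularizer, and replacing the expectation by a time average are assumptions of this sketch rather than details fixed by the text.

```python
import numpy as np

def iva_sweep(X, W, eps=1e-8):
    """One pass of the IVA updates. X has shape (F, M, T): F frequency
    points, M channels, T frames; W has shape (F, K, M) with K = M sources,
    row k of W[f] holding w_k^H(omega)."""
    F, M, T = X.shape
    K = W.shape[1]
    Y = np.einsum('fkm,fmt->fkt', W, X)                 # current estimates
    r = np.sqrt(np.sum(np.abs(Y) ** 2, axis=0)) + eps   # r_k, shared over ω
    for k in range(K):
        for f in range(F):
            Xf = X[f]
            # step 1: weighted covariance V_k(ω) = E[x x^H / r_k]
            V = (Xf / r[k]) @ Xf.conj().T / T
            # step 2: w_k(ω) <- (W(ω) V_k(ω))^{-1} e_k
            w = np.linalg.solve(W[f] @ V, np.eye(M)[:, k])
            # normalisation: w_k <- w_k / sqrt(w_k^H V_k w_k)
            w = w / np.sqrt(np.real(w.conj() @ V @ w) + eps)
            W[f, k] = w.conj()                          # store w_k^H as row k
    return W
```

Repeating `iva_sweep` until `W` stops changing corresponds to cycling steps 1 and 2 until the unmixing matrix converges.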
As can be understood from the above update process, each iterative update of the unmixing matrix requires the unmixing matrix from the previous iteration together with the processing signal x(ω).
In practical applications, the unmixing matrix of each data set may be estimated using a batch algorithm, an adaptive algorithm, a successive-extraction algorithm, gradient descent, or the Newton-Raphson iterative algorithm.
As described above, the dereverberation processing increases the independence between channels. Therefore, in the embodiment of the present application, the individual iterative process of the dereverberation processing and that of the blind source separation processing are combined, so that the two are carried out together. Moreover, when the electronic device executes the joint iterative processing, the processed signal fed to dereverberation is the audio signal with noise removed, and the processed signal fed to blind source separation is the audio signal with reverberation components removed; both processes therefore match their respective algorithm models, and a more accurate separation result can be obtained.
As an example, the joint iterative processing may follow the order "dereverberation processing, blind source separation processing, …" or the order "blind source separation processing, dereverberation processing, …". For convenience of description, in the embodiment of the present application "dereverberation processing - blind source separation processing" is denoted as one joint iteration, but in practical applications "blind source separation processing - dereverberation processing" may equally be regarded as one joint iteration.
Of course, the dereverberation processing performed by the electronic device during one joint iteration does not mean updating the prediction matrix all the way to convergence; it is one or more iterative updates of the prediction matrix, i.e., the prediction matrix obtained from the dereverberation processing of a given joint iteration may not yet have converged. Similarly, the blind source separation processing performed during one joint iteration does not mean updating the unmixing matrix to convergence; it is one or more iterative updates of the unmixing matrix, i.e., the unmixing matrix obtained from the blind source separation processing of a given joint iteration may not yet have converged. Each time the electronic device executes the dereverberation processing, it can obtain the reverberation signal and the dereverberation signal of the current iteration from the prediction matrix of the current iteration; as the iterations proceed, the prediction matrix becomes increasingly accurate, and so do the obtained reverberation and dereverberation signals. Each time the electronic device executes the blind source separation processing, it can obtain the sound source signal of each target sound source and the noise of the current iteration from the unmixing matrix of the current iteration; as the iterations proceed, the unmixing matrix becomes increasingly accurate, and so do the obtained sound source signals and noise.
The loop continues in this way until the stop condition is met. The electronic device may judge whether the stop condition is met by judging whether the number of joint iterations has reached a preset number, or by judging whether the obtained prediction matrix and unmixing matrix have converged. After the stop condition is met, the sound source signals and the noise of the one or more target sound sources are obtained according to the finally obtained unmixing matrix and the finally obtained dereverberation signal.
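The matrix-convergence variant of the stop condition can be sketched as follows (the tolerance value and the relative-change criterion are assumptions of this sketch):

```python
import numpy as np

def has_converged(prev, curr, tol=1e-4):
    """Treat the prediction matrix or unmixing matrix as converged when the
    change between consecutive joint iterations is small relative to its
    size (with a floor of 1.0 so near-zero matrices do not divide by ~0)."""
    return np.linalg.norm(curr - prev) <= tol * max(np.linalg.norm(prev), 1.0)
```

A joint loop would then stop when a preset iteration count is reached or when `has_converged` holds for both the prediction matrix and the unmixing matrix.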
Referring to fig. 7, fig. 7 is a schematic diagram of the process in which the electronic device performs the joint iterative processing of dereverberation and blind source separation, taking 3 joint iterations as an example.
In the 1st joint iteration, in the dereverberation processing stage, the electronic device calculates prediction matrix 1 from the audio signal. It then calculates reverberation signal 1 and dereverberation signal 1 from prediction matrix 1 (dereverberation signal 1 comprises the signals in the audio signal other than reverberation signal 1). In the blind source separation stage, the electronic device uses dereverberation signal 1 as the processing signal, calculates unmixing matrix 1, and obtains the noise and de-noise signal 1 from unmixing matrix 1 (de-noise signal 1 comprises the signals in the audio signal other than noise signal 1).

In the 2nd joint iteration, in the dereverberation processing stage, the electronic device updates prediction matrix 1 according to de-noise signal 1 to obtain prediction matrix 2. It calculates reverberation signal 2 and dereverberation signal 2 from prediction matrix 2 (dereverberation signal 2 comprises the signals in the audio signal other than reverberation signal 2). In the blind source separation stage, the electronic device uses dereverberation signal 2 as the processing signal, calculates unmixing matrix 2, and obtains the noise and de-noise signal 2 from unmixing matrix 2 (de-noise signal 2 comprises the signals in the audio signal other than noise signal 2).

In the 3rd joint iteration, in the dereverberation processing stage, the electronic device updates prediction matrix 2 according to de-noise signal 2 to obtain prediction matrix 3. It calculates reverberation signal 3 and dereverberation signal 3 from prediction matrix 3 (dereverberation signal 3 comprises the signals in the audio signal other than reverberation signal 3). In the blind source separation stage, the electronic device uses dereverberation signal 3 as the processing signal, calculates unmixing matrix 3, and obtains the noise and de-noise signal 3 from unmixing matrix 3 (de-noise signal 3 comprises the signals in the audio signal other than noise signal 3).
After the joint iterative processing ends, the electronic device calculates the sound source signal of each target sound source according to the dereverberation signal and the unmixing matrix obtained in the last joint iteration. Of course, the electronic device may also calculate the sound source signal of each target sound source according to the dereverberation signal obtained in the second-to-last joint iteration and the unmixing matrix obtained last. This is because, at the end of the joint iterations, the prediction matrix and the unmixing matrix have converged, i.e., the differences between the last several prediction matrices are small or within an acceptable range, and likewise for the last several unmixing matrices. Thus, the electronic device may calculate the sound source signal of each target sound source based on any of the dereverberation signals and any of the unmixing matrices obtained in the last several joint iterations.
As another example, in each joint iteration, in the dereverberation processing stage, when the electronic device calculates the prediction matrix of the current dereverberation, it may iterate the prediction matrix independently multiple times, with the result of the last independent iteration taken as the prediction matrix of this joint iteration's dereverberation. Likewise, in the blind source separation processing stage, when calculating the unmixing matrix of the current blind source separation, it may iterate the unmixing matrix independently multiple times, with the result of the last independent iteration taken as the unmixing matrix of this joint iteration's blind source separation. That is, the jth dereverberation processing performed by the electronic device includes updating the prediction matrix m times, and the jth blind source separation processing includes updating the unmixing matrix n times, where j, m, and n are positive integers.
As another example, in each joint iteration, the dereverberation signal obtained from the previous dereverberation processing may be used as the processing signal for blind source separation, and the audio signal may be used as the processing signal for dereverberation. Of course, in each joint iteration, the de-noise signal obtained from the previous blind source separation processing may instead be used as the processing signal for dereverberation, with the audio signal used as the processing signal for blind source separation.
As another example, in each joint iteration process, a dereverberation signal obtained at any time in the history iteration process may be used as a processing signal for the current blind source separation, or a de-noise signal obtained at any time in the history iteration process may be used as a processing signal for the current dereverberation.
As an example, if the pth prediction matrix obtained by the pth dereverberation processing has converged but the pth unmixing matrix obtained by the pth blind source separation processing has not, the electronic device performs the (p + i)th blind source separation processing until the (p + i)th unmixing matrix converges; alternatively, the electronic device alternately performs the (p + i)th dereverberation processing and the (p + i)th blind source separation processing until the (p + i)th prediction matrix and the (p + i)th unmixing matrix converge simultaneously, where p is a positive integer and i is a positive integer starting from 1.
As another example, if in the joint iterative processing the qth prediction matrix obtained by the qth dereverberation processing has not converged but the qth unmixing matrix obtained by the qth blind source separation processing has, the electronic device performs the (q + i)th dereverberation processing until the (q + i)th prediction matrix converges; alternatively, the electronic device alternately performs the (q + i)th dereverberation processing and the (q + i)th blind source separation processing until the (q + i)th prediction matrix and the (q + i)th unmixing matrix converge simultaneously, where q is a positive integer and i is a positive integer starting from 1.
As shown in fig. 8, fig. 8 shows another possible embodiment of the method for estimating the direction of arrival of a sound source provided by the present application, and the embodiment includes steps 801 to 804.
For steps 801 to 803, reference may be made to the descriptions of steps 401 to 403, which are not repeated here.
Step 804: the electronic device performs, according to the first direction information of the sound source signal of the first target sound source at different frequency points, joint processing of sound source separation and direction-of-arrival estimation on the sound source signal of the first target sound source, to obtain second direction information of the sound source signal of the first target sound source at different frequency points, where the sound source separation processing includes the joint iterative processing of dereverberation and blind source separation.
Step 804 may be implemented by the following steps:
Step A: the electronic device performs sound source separation processing on the sound source signal of the first target sound source according to the first direction information of the sound source signal of the first target sound source at different frequency points, to obtain the jointly processed sound source signal of the first target sound source.

Step B: the electronic device performs direction-of-arrival estimation processing on the jointly processed sound source signal of the first target sound source, to obtain the first direction information of the jointly processed first target sound source at different frequency points.

Step C: after the electronic device cyclically executes steps A and B Q times, the first direction information of the first target sound source at different frequency points obtained in the last cycle is recorded as the second direction information of the first target sound source at different frequency points, where Q ≥ 1 and Q is a positive integer.
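Steps A to C can be sketched as the following alternation. This is a control-flow sketch only: the placeholders `separate` and `estimate_doa` stand in for the direction-constrained sound source separation and the DOA estimation, whose internals are described elsewhere in the text.

```python
def second_direction_info(source, first_dir, separate, estimate_doa, Q=2):
    """Alternate step A (separation constrained by the current per-frequency
    direction information) and step B (DOA estimation) Q times; the last
    per-frequency directions are recorded as the second direction info."""
    direction = first_dir
    for _ in range(Q):
        source = separate(source, direction)    # step A
        direction = estimate_doa(source)        # step B
    return direction                            # step C
```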
In the embodiment of the application, dereverberation and blind source separation are both estimation algorithms, so after the audio signal undergoes the joint iterative processing of dereverberation and blind source separation, the resulting sound source signal of a target sound source may still contain noise and/or reverberation, and the first direction information the electronic device obtains from such a signal may be inaccurate. Therefore, the electronic device can continue to perform sound source separation processing and DOA estimation processing on the sound source signal of the first target sound source based on the currently obtained first direction information at different frequency points. Because the renewed sound source separation is constrained by this first direction information, the electronic device obtains a more accurate sound source signal, unmixing matrix, and mixing matrix for the first target sound source, and from these it obtains more accurate second direction information of the sound source signal of the first target sound source at different frequency points. Naturally, the second direction information so obtained is more accurate than the first direction information.
Since dereverberation, blind source separation and DOA estimation are all model-based estimation processes, in the joint iterative process of dereverberation and blind source separation performed by the electronic device, a small amount of reverberation may remain in the dereverberated signal obtained in the last iteration, and the obtained sound source signal of the target sound source may still be mixed with some noise and/or reverberation, or even with a small amount of the sound source signals of other target sound sources. Therefore, the electronic device performs the sound source separation process again on the sound source signal of the first target sound source based on the first direction information of the first target sound source obtained in step 704, so that the resulting sound source signal of the first target sound source is purer, which improves the accuracy of the subsequent calculations on the sound source signal of the first target sound source.
As a possible embodiment, the method for estimating a direction of arrival of a sound source provided by the embodiment of the present application further includes:
the electronic equipment performs smoothing filtering processing or kernel density estimation processing on first direction information of a sound source signal of a first target sound source on different frequency points to obtain third direction information of the first target sound source on different frequency points; and the electronic equipment fuses the third direction information of the first target sound source on different frequency points to obtain the direction of the first target sound source.
For example, the electronic device performs smoothing filtering or kernel density estimation on the first direction information of the sound source signal of the first target sound source at frequency point 1, frequency point 2, …, frequency point L (assuming that there are L frequency points), and obtains the third direction information of the first target sound source at frequency point 1, frequency point 2, …, frequency point L. The purpose of the smoothing filtering processing or the kernel density estimation processing is to remove some interference, so as to obtain more accurate third direction information of the sound source signal of the first target sound source at the different frequency points. Finally, the electronic device fuses the third direction information of the sound source signal of the first target sound source at the different frequency points to determine a more accurate direction of the sound source signal of the first target sound source.
Certainly, in practical applications, the electronic device may perform the smoothing filtering processing or kernel density estimation processing either on the first direction information of the sound source signal of the first target sound source at different frequency points, or on the second direction information of the sound source signal of the first target sound source at different frequency points, so as to obtain the third direction information of the sound source signal of the first target sound source at different frequency points. As can be understood from the above description, the second direction information at different frequency points is more accurate than the first direction information, and the third direction information at different frequency points is more accurate than either the first direction information or the second direction information.
If the first direction information of the sound source signal of the first target sound source at each frequency point (for example, an angle value at each frequency point) forms a first set, the second direction information at each frequency point forms a second set, and the third direction information at each frequency point forms a third set, then the angle values in the first set may be scattered, the angle values in the second set are more concentrated than those in the first set, and the angle values in the third set are more concentrated than those in the second set.
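As an illustration of the smoothing-plus-fusion step above, the sketch below uses a moving-average filter and a histogram peak as simplified stand-ins for the smoothing filtering and kernel density estimation; the 257-bin signal and the 40-degree true direction are invented for the example:

```python
import numpy as np

def smooth_doa(angles_deg, window=5):
    """Moving-average smoothing of the per-frequency-bin DOA estimates."""
    kernel = np.ones(window) / window
    return np.convolve(angles_deg, kernel, mode="valid")

def fuse_doa(angles_deg, bin_width=2.0):
    """Fuse the per-bin angles into one direction by taking the peak of a
    histogram (a crude stand-in for kernel density estimation)."""
    bins = np.arange(angles_deg.min(), angles_deg.max() + bin_width, bin_width)
    hist, edges = np.histogram(angles_deg, bins=bins)
    k = int(np.argmax(hist))
    return 0.5 * (edges[k] + edges[k + 1])

# Per-bin first direction information scattered around a true direction of 40 deg
rng = np.random.default_rng(0)
raw = 40.0 + rng.normal(0.0, 3.0, size=257)
smoothed = smooth_doa(raw)       # per-bin third direction information
direction = fuse_doa(smoothed)   # fused direction of the target sound source
```

After smoothing, the per-bin angle values are visibly more concentrated than the raw set, which mirrors the first-set/third-set comparison above.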
As another embodiment of the present application, the microphone array may be an Acoustic Vector Sensor (AVS), which may also be referred to as an Acoustic Vector microphone.
Since a DOA estimation algorithm is related to the size and arrangement of the microphone array, when performing DOA estimation the electronic device needs to adapt the algorithm to the size and arrangement of the microphone array used to acquire the audio signal. Referring to fig. 9, which is a schematic structural diagram of an acoustic vector microphone, an AVS is composed of 1 omnidirectional microphone and 2 to 3 orthogonal figure-8 microphones. The microphones in the AVS can be considered co-located, so the AVS can be made very small. For a given sound source, the sound source signals received on the individual microphone channels have no phase difference, and the amplitude of the sound source signal of the same sound source received on each microphone channel is related to the direction of the sound source. Therefore, the acoustic vector microphone does not need to consider the size and arrangement of an array when it is applied, and has wider application scenarios.
The sound source signal obtained by subjecting the audio signal to the sound source separation process is a signal from which noise and reverberation signals are removed, and the microphone array is a co-located microphone, so that there is no phase difference in the components of the sound source signal received in each channel for one sound source signal. Thus, when a source signal is acquired by two or three microphone channels in quadrature, the amplitude of the source signal on each channel is related to the direction, or the direction of arrival of the source signal is related to the amplitude of the source signal on each channel.
Taking a three-dimensional four-channel microphone array as an example, the array includes one omnidirectional microphone and 3 orthogonal figure-8 microphones, that is, 4 microphones in total, so the audio signals collected by the microphone array are four-channel audio signals. Assuming that the channel of the omnidirectional microphone is denoted by W and the channels of the 3 orthogonal figure-8 microphones are denoted by X, Y and Z, respectively, the relationship between amplitude and direction for the microphone array is as follows:
W = F
X = F·cos θ·cos φ
Y = F·sin θ·cos φ
Z = F·sin φ

where W represents the amplitude of the signal collected by the omnidirectional microphone channel; X, Y and Z respectively represent the amplitudes of the signals acquired by the channels in the three orthogonal directions of a Cartesian coordinate system; θ represents the horizontal angle of the sound source in the Cartesian coordinate system; φ represents the pitch angle of the sound source; and F represents the amplitude of the signal collected by the omnidirectional channel.
For the sound source signal from which the excessive reverberation signal and the noise are removed, the above-described relationship between the amplitude and the direction is satisfied, and therefore, the electronic device may obtain the direction of arrival of a second target sound source, which is any one of the one or more target sound sources, according to the relationship between the amplitudes of the sound source signal of the second target sound source in the respective channel directions of the acoustic vector sensor.
As an example, in a three-dimensional four-channel acoustic vector microphone, the arctangent of the ratio between the amplitudes on the Y and X channels yields the horizontal angle, and the arcsine of the ratio between the amplitudes on the Z and W channels yields the pitch angle.
Of course, based on the relationship between the amplitude and the direction angle of each channel in the acoustic vector microphone, other relationships between the amplitudes in the directions of each channel may also be evolved, so as to obtain the horizontal angle and the pitch angle, which is not exemplified.
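The amplitude-direction relationships above can be checked numerically. The sketch below assumes the B-format-style model W = F, X = F·cos θ·cos φ, Y = F·sin θ·cos φ, Z = F·sin φ, with angles in radians and F = 1:

```python
import numpy as np

def avs_amplitudes(F, theta, phi):
    """Forward model of a 3-D four-channel acoustic vector sensor: one
    omnidirectional channel (W) plus three orthogonal figure-8 channels."""
    W = F
    X = F * np.cos(theta) * np.cos(phi)
    Y = F * np.sin(theta) * np.cos(phi)
    Z = F * np.sin(phi)
    return W, X, Y, Z

def avs_doa(W, X, Y, Z):
    """Recover the horizontal angle from the Y/X ratio and the pitch
    angle from the Z/W ratio."""
    theta = np.arctan2(Y, X)                      # tan(theta) = Y / X
    phi = np.arcsin(np.clip(Z / W, -1.0, 1.0))    # sin(phi) = Z / W
    return theta, phi

theta0, phi0 = np.deg2rad(30.0), np.deg2rad(10.0)
W, X, Y, Z = avs_amplitudes(1.0, theta0, phi0)
theta, phi = avs_doa(W, X, Y, Z)
```

Using arctan2 rather than a plain arctangent of the ratio keeps the quadrant of the horizontal angle unambiguous.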
In the example of a two-dimensional three-way microphone array, the amplitude and direction of the microphone array are represented as follows:
W = F
X = F·cos θ
Y = F·sin θ
wherein, W represents the amplitude of the signal collected by the omnidirectional microphone channel; x, Y respectively represent the amplitude of the signal acquired by the channel in two horizontally orthogonal directions of a Cartesian coordinate system, theta represents the horizontal angle of the sound source, and F represents the amplitude of the signal acquired by the omni-directional channel.
The arctangent of the ratio between the amplitudes of the sound source signal of the second target sound source on the Y channel and the X channel yields the horizontal angle.
In addition, it should be noted that, taking a two-dimensional three-channel acoustic vector microphone as an example, the relationship between the amplitude and the direction angle on each channel described above is a first-order relationship; to obtain a more accurate DOA estimation result, when the relationship among amplitude, phase and direction angle is considered, a second-order or even higher-order relationship may exist. Of course, the microphone array may also take other forms, such as a spherical microphone array or a ring microphone array. If the spherical or ring microphone array is treated as co-located, the phase differences between the channels can likewise be ignored; if the phase and amplitude differences between the channels of the spherical or ring microphone array need to be considered, the relationship among the amplitude, phase and direction angle of each channel may be of higher order. The order of the relationship among amplitude, phase and direction is not limited in this application.
The above calculation process may be performed by the electronic device on data of each frequency point of the sound source signal of the second target sound source, that is, the obtained horizontal angle and pitch angle are the horizontal angle and pitch angle of the second target sound source at each frequency point. The horizontal angle and the pitch angle of the second target sound source at each frequency point may be referred to as DOA information (e.g., the first direction information, the second direction information, and the third direction information described above) of the second target sound source.
In some embodiments, another way of obtaining first direction information of a second target sound source is also provided.
When describing the second mixing model above, the channels in the microphone array were considered to be co-located, since the amplitude of the sound source signal of the second target sound source on each channel is obtained by mixing the sound source signals through the mixing matrix. That is, there is no phase difference between the components of the sound source signal of the second target sound source on the individual channels; therefore, the ratios between the amplitudes of the sound source signal of the second target sound source in the channel directions of the acoustic vector sensor are implicit in the mixing matrix. The electronic device can thus obtain the direction of arrival of the sound source signal of the second target sound source from the mixing matrix of the sound source signals of the one or more target sound sources.
Taking a two-dimensional microphone as an example, the second mixed model of the sound source signal is:
[X_W, X_X, X_Y]^T = A·[S_1, S_2]^T + N

where X_W, X_X and X_Y are the audio signals received by the three channels of the two-dimensional AVS, S_1 and S_2 are the two sound source signals, A is the mixing matrix, and N = A·N′ is the noise term, N′ being the noise.
Meanwhile, as before, for one sound source signal, the following relationship exists between the amplitude and horizontal angle of each channel:
W = F
X = F·cos θ
Y = F·sin θ
wherein, W represents the amplitude of the signal collected by the omnidirectional microphone channel; x, Y respectively represent the amplitude of the channel acquired signals in two horizontally orthogonal directions of a cartesian coordinate system, theta represents the horizontal angle of the sound source, and F represents the amplitude of the omni-directional channel acquired signal.
As understood from the above two formulas, the 1 st column in the mixing matrix represents the column in which the sound source signal of the first target sound source is located, and the 2 nd column in the mixing matrix represents the column in which the sound source signal of the second target sound source is located. Row 1 in the hybrid matrix represents the omni channel, row 2 represents the X channel, and row 3 represents the Y channel.
Based on the above description, obtaining, by the electronic device, the direction of arrival of the sound source signal of the second target sound source according to the mixing matrix of the sound source signals of the one or more target sound sources includes: the electronic device determines a target column in the mixing matrix, and a first target row and a second target row in the target column, where the target column is the column representing the sound source signal of the second target sound source, and the first target row and the second target row are the rows related to the angle of the sound source signal of the second target sound source; the electronic device then obtains the direction of arrival of the sound source signal of the second target sound source from the element of the first target row and the element of the second target row in the target column. When the first target row is the row representing the X channel of the acoustic vector sensor and the second target row is the row representing the Y channel of the acoustic vector sensor, the direction of arrival of the sound source signal of the second target sound source includes the horizontal angle of the sound source signal of the second target sound source, the horizontal angle being an angle in the coordinate system in which the acoustic vector sensor is located.
By way of example, the electronic device calculates first direction information on each frequency point of a sound source by using the mixing matrix a, and when first direction information on each frequency point of a sound source signal of an α -th target sound source is solved, any one of the following is adopted:
θα = arctan(A3α / A2α), θα = arccos(A2α / A1α), or θα = arcsin(A3α / A1α)

where Aγα represents the element in the γ-th row and α-th column of the mixing matrix corresponding to the α-th sound source, θα is the horizontal angle of the sound source signal at each frequency point, and γ = 1, 2, 3. That is, after the target column is determined, the horizontal angle can be calculated from the elements in the rows corresponding to any two of the three channels of the two-dimensional three-channel acoustic vector microphone.
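As a numerical illustration of reading the horizontal angle out of the mixing matrix, the sketch below builds a 3×2 mixing matrix whose columns are idealized two-dimensional AVS steering vectors; this is an assumption made for the example, since a matrix estimated by blind source separation carries an arbitrary scale per column, which the arctangent of the row ratio cancels:

```python
import numpy as np

def mixing_matrix_2d(thetas):
    """Columns are idealized 2-D AVS steering vectors [1, cos, sin]^T."""
    return np.vstack([np.ones_like(thetas), np.cos(thetas), np.sin(thetas)])

def doa_from_mixing(A):
    """Horizontal angle of each source from the X-channel row (row 2) and
    Y-channel row (row 3) of its column: tan(theta_a) = A[3,a] / A[2,a]."""
    return np.arctan2(A[2, :], A[1, :])

thetas = np.deg2rad(np.array([25.0, -60.0]))
A = mixing_matrix_2d(thetas)
est = doa_from_mixing(A)

# The ratio is insensitive to the per-column scaling left over by blind
# source separation (for positive scale factors):
est_scaled = doa_from_mixing(A * np.array([0.7, 1.3]))
```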
Taking a three-dimensional four-channel acoustic vector microphone as an example, the horizontal angle is calculated as described above for the two-dimensional three-channel case; for the pitch angle, the target column is first determined, and the pitch angle is then obtained from the ratio of the element in the row of the Z channel to the element in the row of the omnidirectional channel in the mixing matrix.
By way of example, when the pitch angle of the sound source signal of the α -th target sound source at each frequency point is solved, any one of the following is adopted:
φα = arcsin(A4α / A1α)
φα = arctan(A4α / √(A2α² + A3α²))

where Aγα represents the element in the γ-th row and α-th column of the mixing matrix corresponding to the α-th sound source, φα is the pitch angle of the sound source signal at each frequency point, and γ = 1, 2, 3, 4.
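The pitch-angle ratios can be illustrated the same way for the three-dimensional four-channel case; the 4×2 mixing matrix below again uses idealized steering-vector columns (an assumption for the example), with row 1 the omnidirectional channel and row 4 the Z channel:

```python
import numpy as np

def mixing_matrix_3d(thetas, phis):
    """Columns are idealized 3-D AVS steering vectors for each source."""
    return np.vstack([
        np.ones_like(thetas),
        np.cos(thetas) * np.cos(phis),
        np.sin(thetas) * np.cos(phis),
        np.sin(phis),
    ])

def pitch_from_mixing(A):
    """Two equivalent readings of the pitch angle from the matrix."""
    phi_a = np.arcsin(A[3, :] / A[0, :])                     # Z row over W row
    phi_b = np.arctan2(A[3, :], np.hypot(A[1, :], A[2, :]))  # Z over horizontal
    return phi_a, phi_b

thetas = np.deg2rad(np.array([40.0, -20.0]))
phis = np.deg2rad(np.array([15.0, 5.0]))
A = mixing_matrix_3d(thetas, phis)
phi_a, phi_b = pitch_from_mixing(A)
```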
As described above, the mixing matrix and the unmixing matrix are inverse matrices, so the electronic device can also calculate the first direction information of the sound source signal of the target sound source at each frequency bin using the unmixing matrix W.
As a possible embodiment, the electronic device obtaining, according to the unmixing matrix of the sound source signals of the one or more target sound sources, the direction of arrival of the sound source signal of the second target sound source includes: the electronic equipment can obtain a horizontal angle of a sound source signal of the second target sound source according to elements of a column of an X channel representing the acoustic vector sensor and elements of a column of a Y channel representing the acoustic vector sensor in a target row representing the second target sound source in the unmixing matrix, wherein the horizontal angle is an angle in a coordinate system where the acoustic vector sensor is located; the electronic device may obtain a pitch angle of a sound source signal of the second target sound source according to an element representing a column of a Z channel of the acoustic vector sensor and an element representing a column of an omni-directional channel of the acoustic vector sensor in a target row representing the second target sound source in the unmixing matrix, where the pitch angle is an angle in a coordinate system in which the acoustic vector sensor is located.
Taking a two-dimensional three-channel acoustic vector microphone as an example, the electronic device first determines the target row, and then calculates the horizontal angle using the element in the column where the X channel is located and the element in the column where the Y channel is located on that target row of the unmixing matrix.
By way of example, when the electronic device solves the first direction information on each frequency point of the α -th sound source, it adopts:
θα = arctan(Wα3 / Wα2)

where Wαγ represents the element in the α-th row and γ-th column of the unmixing matrix corresponding to the α-th sound source, and γ = 1, 2, 3. As before, once the target row is determined, the horizontal angle can be calculated from the elements in the columns corresponding to any two of the three channels of the two-dimensional three-channel acoustic vector microphone.
When the microphone is a three-dimensional four-channel acoustic vector microphone array, the horizontal angle is calculated as described above, and the pitch angle may be calculated as follows: first determine the target row, and then obtain the pitch angle from the ratio of the element in the column where the Z channel is located to the element in the column where the omnidirectional channel is located on that target row of the unmixing matrix. Of course, in practical applications, more calculation methods can be derived, which are not enumerated here.
From the above process it can be understood that, in practice, the dereverberation processing and the blind source separation processing are both estimation methods, and the sound source signal obtained by the electronic device through these estimation methods may still contain some interference. In addition, for different application scenarios, the electronic device may further perform other processing, such as the first enhancement processing, the second enhancement processing, and the processing of adjusting the proportional relationship among the sound source signal, the reverberation signal and the noise, which are described below.
Referring to fig. 10, as a possible implementation, the processing shown in fig. 10 includes dereverberation, blind source separation, DOA estimation, and enhancement. For the dereverberation processing, the blind source separation processing and the DOA estimation, reference may be made to the description in the above embodiments, which is not repeated here. The enhancement processing may be performed in two ways: in one, the electronic device performs first enhancement processing on the sound source signal; in the other, the electronic device performs second enhancement processing on the sound source signal based on the first direction information, second direction information, or third direction information of the sound source signal. The enhancement processing shown in fig. 10 may include the second enhancement processing and may also include the first enhancement processing. In the following, the first target sound source and the first direction information are taken as an example, the first target sound source being any one of the one or more target sound sources. In practical applications, the first enhancement processing, the second enhancement processing, and the proportional-relationship adjustment processing may be performed for each target sound source.
Taking the first enhancement processing and the first target sound source as an example, the electronic device performs the first enhancement processing on the sound source signal of the first target sound source, where the first enhancement processing includes interference spectrum filtering processing and/or harmonic enhancement processing, and the first target sound source is any one of the one or more target sound sources. The interference spectrum filtering processing is used for filtering interference components mixed in the sound source signal of the first target sound source based on the spectral energy of the sound source signal of the first target sound source; the harmonic enhancement processing is used for obtaining a harmonic enhancement signal of the first target sound source, the harmonic enhancement signal being a sound source signal containing harmonic components.
In some embodiments, the interference spectrum filtering process is a process in which the electronic device filters interference components mixed in the sound source signal of the first target sound source based on the spectral energy of the sound source signal of the first target sound source.
For example, the electronic device uses a Gaussian Mixture Model (GMM) to model the spectral energy of the sound source signal of the first target sound source, determines a primary sound source spectral range according to the spectral energy, and then removes, from the sound source signal of the first target sound source, the components whose spectral energy is not within the primary sound source spectral range.
After the electronic device performs the interference spectrum filtering processing, interference signals whose spectral energy differs greatly from that of the primary sound source can be removed from the sound source signal of the first target sound source, so that a purer sound source signal is obtained.
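A minimal sketch of this idea, substituting a single-Gaussian fit on log-spectral energy for the full GMM (an assumption made for brevity; the threshold k·σ and the synthetic frame are illustrative, not from the application):

```python
import numpy as np

def spectral_energy_filter(stft_frame, k=2.0):
    """Model the log-energy of the bins as one Gaussian and zero out bins
    whose energy falls far below the primary sound source range."""
    energy = np.abs(stft_frame) ** 2
    log_e = np.log(energy + 1e-12)
    mu, sigma = log_e.mean(), log_e.std()
    keep = log_e > mu - k * sigma          # bins inside the primary range
    return np.where(keep, stft_frame, 0.0), keep

rng = np.random.default_rng(1)
frame = rng.normal(size=128) + 1j * rng.normal(size=128)
frame[[10, 90]] *= 1e-4                    # two bins of weak interference
filtered, keep = spectral_energy_filter(frame)
```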
When an audio signal acquisition device converts live sound into an audio signal, the full quality of the live sound is generally not recorded and converted adequately, so the resulting audio signal lacks many of the original harmonics. However, many sounds with a pitch or fundamental frequency heard by humans contain harmonics, which produce, for example, the tonal quality of a musical instrument. To enrich the sound we hear, or to reproduce the real sound emitted by a musical instrument or the like, harmonic components need to be added to the audio signal. Harmonic enhancement processing is a technique by which the electronic device adds harmonics to a sound source signal. Of course, harmonic components may or may not already be present in the sound source signal before the harmonic enhancement processing; the purpose of the processing is to obtain a sound source signal containing harmonic components.
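A minimal harmonic-enhancement sketch, assuming the fundamental frequency f0 is already known (in practice it would be estimated from the sound source signal, and the harmonic gains here are illustrative):

```python
import numpy as np

def add_harmonics(signal, sr, f0, gains=(0.5, 0.25)):
    """Mix synthetic harmonics at integer multiples of the fundamental f0
    into the signal."""
    t = np.arange(len(signal)) / sr
    enhanced = np.asarray(signal, dtype=float).copy()
    for k, g in enumerate(gains, start=2):   # 2nd harmonic, 3rd harmonic, ...
        enhanced = enhanced + g * np.sin(2 * np.pi * k * f0 * t)
    return enhanced

sr, f0 = 16000, 200.0
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * f0 * t)      # a bare fundamental at 200 Hz
rich = add_harmonics(tone, sr, f0)     # now also contains 400 Hz and 600 Hz
spec_tone = np.abs(np.fft.rfft(tone))
spec_rich = np.abs(np.fft.rfft(rich))
```

The spectrum of the enhanced signal shows new peaks at the harmonic frequencies while the fundamental is unchanged.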
Taking second enhancement processing based on the first direction information of the sound source signal of the first target sound source as an example, the electronic device performs second enhancement processing on the sound source signal of the first target sound source based on the first direction information of the sound source signal of the first target sound source on different frequency points, wherein the second enhancement processing includes interference direction filtering processing and/or beam forming directional enhancement processing; the interference direction filtering processing is used for filtering frequency points of which the direction angles are not within an expected angle range in the sound source signals of the first target sound source; and a beam forming directional enhancement process for enhancing the power of the sound source signal of the first target sound source in a desired direction.
In some embodiments, the interference direction filtering process is configured to filter frequency points in the sound source signal of the first target sound source, where the direction angle is not within the desired angle range, so as to suppress sounds in directions other than the direction in which the first target sound source is located. In a specific implementation, a frequency domain mask may be performed on the first direction information of the sound source signal of the first target sound source, that is, the θ angle corresponding to the sound source signal is expanded by a certain range, the first direction information at each frequency point is compared with the range, and the component corresponding to the first direction information at the frequency point exceeding the range is removed.
For example, when the first target sound source is the k-th target sound source, the electronic device may set [θk − Δθk, θk + Δθk] as the mask range of the sound source signal of the k-th target sound source when performing the masking processing. The first direction information of the sound source signal of the k-th target sound source at each frequency point is compared with the mask range, and the components corresponding to first direction information that falls outside the mask range are removed, where k is greater than or equal to 1 and k is a positive integer.
By performing the interference direction filtering processing, the electronic device removes from the sound source signal of the first target sound source the interference signals coming from directions far from the direction of the first target sound source, so that a purer sound source signal is obtained.
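The frequency-domain mask can be sketched as follows; the per-bin DOA values, the mask centre θk = 30° and the half-width Δθk = 15° are invented for the example:

```python
import numpy as np

def direction_mask(stft_bins, doa_per_bin, theta_k, delta):
    """Zero out frequency bins whose estimated direction falls outside
    [theta_k - delta, theta_k + delta] (angles in degrees)."""
    inside = np.abs(doa_per_bin - theta_k) <= delta
    return np.where(inside, stft_bins, 0.0)

rng = np.random.default_rng(2)
bins = rng.normal(size=64) + 1j * rng.normal(size=64)
doa = np.full(64, 30.0)
doa[[3, 40]] = [75.0, -10.0]     # two bins dominated by interference
masked = direction_mask(bins, doa, theta_k=30.0, delta=15.0)
```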
The beam forming directional enhancement processing is used for enhancing the power of a sound source signal in a desired direction; in particular implementation, the electronic device adds signals related to the sound source signal of the first target sound source to be enhanced, and does not add the sound source signals and interference of other unrelated target sound sources, so that the power of the sound source signal of the first target sound source to be enhanced is enhanced.
As an example, the electronic device may process the sound source signals using the null-steering method, which uses the angles of the different sound sources in space, rather than measured steering vectors, for customized beamforming. Suppose there are four sound source signals arriving from different angles. According to this method, only the signal in the direction of the first sound source signal is kept, while the signals in the other directions are suppressed.
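A minimal null-steering sketch for a two-dimensional three-channel AVS, assuming the idealized steering vector [1, cos θ, sin θ] and known source angles; with three channels, unit gain toward the target and exact nulls toward two interferers can be imposed:

```python
import numpy as np

def steering(theta):
    """Idealized 2-D AVS steering vector for direction theta (radians)."""
    return np.array([1.0, np.cos(theta), np.sin(theta)])

def null_steering_weights(target, nulls):
    """Solve C^T w = e1: unit gain toward the target direction and an
    exact null toward each interferer (3 channels -> up to 2 nulls)."""
    C = np.column_stack([steering(target)] + [steering(n) for n in nulls])
    e1 = np.zeros(C.shape[1])
    e1[0] = 1.0
    return np.linalg.solve(C.T, e1)

target, nulls = np.deg2rad(20.0), np.deg2rad([70.0, -50.0])
w = null_steering_weights(target, nulls)
```

Applying w to the three channel signals keeps the source at 20 degrees at unit gain and cancels the sources at 70 and -50 degrees.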
In the embodiment of the application, sound source separation, DOA estimation and enhancement processing are combined: after the electronic device performs the joint iterative processing of dereverberation and blind source separation on the audio signal, more accurate sound source signals, reverberation signals and noise are obtained. The electronic device determines the direction of a target sound source by the DOA estimation method and further enhances the sound source signal through the enhancement processing, weakening the interference components, so that the finally obtained sound source signal is purer, has stronger power, or has added harmonic components, thereby achieving a better auditory effect.
As another embodiment of the present application, in a process that an electronic device processes an audio signal by joint iteration of dereverberation and blind source separation to obtain a sound source signal of one or more target sound sources, the electronic device obtains noise and a reverberation signal of the one or more target sound sources from the audio signal;
the electronic device adjusts a proportional relationship between the noise, a sound source signal of a first target sound source, and a reverberation signal of the first target sound source, the first target sound source being any one of the one or more target sound sources.
As an example, when a signal with a KTV scene effect needs to be obtained, the electronic device may set the loudness ratio among the sound source signal, the reverberation signal and the noise to sound source signal : reverberation signal : noise = β11 : β12 : β13. When a signal with a concert-hall scene effect needs to be obtained, the loudness ratio among the sound source signal, the reverberation signal and the noise may be set to β21 : β22 : β23. When a signal with an open-field scene effect needs to be obtained, the loudness ratio among the sound source signal, the reverberation signal and the noise may be set to β31 : β32 : β33.
When the electronic device adjusts the proportional relationship among the three, it may adjust their loudness ratio, their power ratio, or the like; in practical applications, different adjustment parameters can be set according to the specific scene effect.
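A sketch of the proportional-relationship adjustment, rescaling each component to a target RMS-loudness ratio before remixing (the β values and signals are illustrative, not taken from the application):

```python
import numpy as np

def mix_to_ratio(source, reverb, noise, betas):
    """Rescale the three components so their RMS levels follow the target
    ratio beta1 : beta2 : beta3, then sum them into the output signal."""
    scaled = []
    for x, beta in zip((source, reverb, noise), betas):
        rms = np.sqrt(np.mean(x ** 2)) + 1e-12
        scaled.append(x * (beta / rms))
    return sum(scaled), scaled

rng = np.random.default_rng(3)
src = rng.normal(0.0, 1.0, 1000)
rev = rng.normal(0.0, 0.3, 1000)
nse = rng.normal(0.0, 0.1, 1000)
# An illustrative "KTV-style" preset: strong source, moderate reverberation
mixed, scaled = mix_to_ratio(src, rev, nse, betas=(1.0, 0.5, 0.1))
```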
The first enhancement processing, the second enhancement processing and the proportional relationship adjustment processing mentioned above are all post-processing manners, and in practical application, one of the post-processing manners may be selected, or a plurality of post-processing manners may be selected to be combined, and what kind of post-processing is specifically performed may be determined according to a specific application scenario.
For example, when a mobile phone with a microphone array is used for recording or video recording, the microphone array collects multi-channel audio signals, and the electronic device executes the method for estimating the sound source arrival direction provided by the embodiment of the present application, so that sound source signals of one or more target sound sources in a recording site can be separated, DOA estimation can be performed on the separated sound source signals of the target sound sources, and then automatic zooming of the target sound sources or changing of sound effects of the sound source signals of the target sound sources can be realized through one or more post-processes.
As another application scenario, a user may make a video call through an electronic device. Taking a teleconference as an example, a microphone array is disposed on the large screen used in the teleconference and collects the ambient sound to obtain an audio signal, and the electronic device executes any method for estimating a sound source direction of arrival provided in the embodiments of this application, thereby obtaining the sound source signals of one or more target sound sources and the direction of each of these sound source signals. The electronic device may also enhance the power of the signal in a speaker's direction through post-processing, or filter out interfering signals near a speaker's direction when several speakers talk at the same time or sit close together, so that the speech signal of each speaker can be accurately separated and each speaker's direction accurately determined.
Of course, the microphone array may also be applied to a hearing aid. In a complex acoustic environment, the microphone array collects the ambient sound and the hearing aid performs any of the above methods for estimating the sound source direction of arrival; the loudness proportion of the sound source signal can then be increased while the loudness proportions of the reverberation signal and the noise are reduced, so that a user wearing the hearing aid hears a clearer sound source signal.
The electronic device performs the above processing on frequency-domain signals. After one of the above processes, or a combination of several of them, is finished, the electronic device may convert the frequency-domain signal into a time-domain signal before sending or playing it. Converting a frequency-domain signal into a time-domain signal can be understood as the inverse of the time-frequency transform; for this inverse process, reference may be made to existing time-frequency transform methods, which are not described again here.
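As an illustrative sketch of this round trip (using non-overlapping rectangular frames for simplicity, whereas a practical STFT would use overlapping windows with a COLA-compliant synthesis window), the helper names `tf_transform` and `tf_inverse` are hypothetical:

```python
import numpy as np

def tf_transform(x, frame=512):
    """Frame-wise real DFT: time-domain signal -> frequency-domain frames."""
    n = len(x) // frame
    frames = x[: n * frame].reshape(n, frame)
    return np.fft.rfft(frames, axis=1)

def tf_inverse(X, frame=512):
    """Inverse of tf_transform: frequency-domain frames -> time-domain signal."""
    return np.fft.irfft(X, n=frame, axis=1).reshape(-1)

x = np.random.default_rng(1).standard_normal(4096)
X = tf_transform(x)    # domain in which separation/post-processing operates
x_rec = tf_inverse(X)  # back to the time domain for playback or transmission
```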
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an execution order; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
In the embodiments of the present application, the electronic device may be divided into functional modules according to the above method examples; for example, each functional module may correspond to one function, or two or more functions may be integrated into one processing module. The integrated module may be implemented in hardware or as a software functional module. It should be noted that the division of modules in the embodiments of the present application is schematic and is merely a division by logical function; other divisions are possible in actual implementation. The following description takes the case where each functional module corresponds to one function:
referring to fig. 11, the electronic device 1100 includes:
an audio signal acquisition unit 1101, configured to acquire an audio signal that includes noise, sound source signals of one or more target sound sources, and a reverberation signal;
a dereverberation processing unit 1102, configured to perform an Nth dereverberation processing on the audio signal to obtain an Nth prediction matrix and an Nth dereverberation signal, where the Nth dereverberation signal includes the signals in the audio signal other than the Nth reverberation signal, and the Nth reverberation signal is the reverberation signal removed in the Nth dereverberation processing;
a blind source separation processing unit 1103, configured to perform an Nth blind source separation processing on the Nth dereverberation signal to obtain an Nth de-mixing matrix and an Nth denoised signal, where the Nth denoised signal is a signal obtained by removing, from the audio signal, the noise obtained by the Nth blind source separation processing;
a sound source signal obtaining unit 1104, configured to continue to perform dereverberation processing and blind source separation processing on the Nth denoised signal, and, when N is a preset value or both the Nth prediction matrix and the Nth de-mixing matrix have converged, obtain the sound source signals of the one or more target sound sources according to the Nth de-mixing matrix and the Nth dereverberation signal, where N is a positive integer starting from 1;
a sound source direction estimating unit 1105 for determining the direction of arrival of the sound source signals of one or more target sound sources.
As another embodiment of the present application, for a first target sound source, the first target sound source is any one of one or more target sound sources,
the direction of arrival of the sound source signal of the first target sound source includes: first direction information of the sound source signal of the first target sound source on different frequency points.
As another embodiment of the present application, the sound source direction estimation unit 1105 is further configured to:
according to the first direction information of the sound source signal of the first target sound source on different frequency points, performing combined processing of sound source separation and direction of arrival estimation on the sound source signal of the first target sound source to obtain second direction information of the sound source signal of the first target sound source on different frequency points, wherein the sound source separation processing comprises the following steps: joint iterative processing of dereverberation and blind source separation.
As another embodiment of the present application, the sound source direction estimation unit 1105 is further configured to:
performing smoothing filtering processing or kernel density estimation processing on first direction information of a sound source signal of a first target sound source on different frequency points to obtain third direction information of the first target sound source on different frequency points; and fusing the third direction information of the first target sound source on different frequency points to obtain the direction of the first target sound source.
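As a sketch of the smoothing-then-fusion idea above (a moving-average filter standing in for the smoothing filter, and a median standing in for the fusion step; a kernel density estimation variant would instead take the mode of an estimated density over angles), all names are illustrative:

```python
import numpy as np

def fuse_doa(angles_per_bin, kernel=5):
    """Smooth per-frequency-point DOA estimates across frequency, then
    fuse them into a single direction via a robust median."""
    k = np.ones(kernel) / kernel
    smoothed = np.convolve(angles_per_bin, k, mode="same")  # third direction info
    return smoothed, float(np.median(smoothed))             # fused direction

# Per-bin DOA estimates: a source near 40 degrees, with noise and a few
# frequency points captured by an interferer from another direction.
rng = np.random.default_rng(2)
raw = 40 + rng.normal(0, 2, 257)   # first direction information per bin
raw[::50] += 90                    # outlier bins dominated by interference
smoothed, direction = fuse_doa(raw)
```

The median makes the fused direction robust to the outlier frequency points, which is the practical motivation for fusing per-frequency estimates rather than trusting any single bin.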
As another embodiment of the present application, the sound source signal obtaining unit 1104 is further configured to:
if the pth prediction matrix obtained by the pth dereverberation processing has converged and the pth de-mixing matrix obtained by the pth blind source separation processing has not converged, the electronic device performs the (p+i)th blind source separation processing until the (p+i)th de-mixing matrix converges, or alternately performs the (p+i)th dereverberation processing and the (p+i)th blind source separation processing until the (p+i)th prediction matrix and the (p+i)th de-mixing matrix converge simultaneously, where p is a positive integer and i is a positive integer starting from 1;
if the qth prediction matrix obtained by the qth dereverberation processing has not converged and the qth de-mixing matrix obtained by the qth blind source separation processing has converged, the electronic device performs the (q+i)th dereverberation processing until the (q+i)th prediction matrix converges, or alternately performs the (q+i)th dereverberation processing and the (q+i)th blind source separation processing until the (q+i)th prediction matrix and the (q+i)th de-mixing matrix converge simultaneously, where q is a positive integer and i is a positive integer starting from 1.
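The alternating convergence logic above can be sketched as follows. The update callables stand in for real dereverberation (prediction-matrix, e.g. WPE-style) and blind source separation (de-mixing-matrix, e.g. IVA-style) update rules, which are not shown; each matrix keeps being updated only until its own change falls below a tolerance, and iteration stops when both have converged or a preset iteration count is reached:

```python
import numpy as np

def joint_iterate(update_dereverb, update_bss, max_iters=100, tol=1e-6):
    """Alternate dereverberation and BSS updates; skip whichever matrix has
    already converged, and stop when both have (or at max_iters)."""
    G = W = None                    # prediction matrix / de-mixing matrix
    G_done = W_done = False
    for n in range(1, max_iters + 1):
        if not G_done:
            G_new = update_dereverb()
            G_done = G is not None and np.linalg.norm(G_new - G) < tol
            G = G_new
        if not W_done:
            W_new = update_bss()
            W_done = W is not None and np.linalg.norm(W_new - W) < tol
            W = W_new
        if G_done and W_done:
            break
    return G, W, n

# Toy updates that converge geometrically toward the identity matrix
# (stand-ins for the real update rules, purely for demonstration).
g_state = {"v": np.eye(2) * 2}
w_state = {"v": np.eye(2) * 3}
def upd_g():
    g_state["v"] = 0.5 * g_state["v"] + 0.5 * np.eye(2)
    return g_state["v"].copy()
def upd_w():
    w_state["v"] = 0.5 * w_state["v"] + 0.5 * np.eye(2)
    return w_state["v"].copy()

G, W, iters = joint_iterate(upd_g, upd_w)
```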
As another embodiment of the present application, the sound source signal obtaining unit 1104 performs a j-th dereverberation process including a process of updating the prediction matrix m times, and performs a j-th blind source separation process including a process of updating the unmixing matrix n times, where j, m, and n are positive integers.
As another embodiment of the present application, the audio signal acquiring unit 1101 is further configured to:
collecting audio signals through an acoustic vector sensor on the electronic equipment;
and receiving audio signals collected by the acoustic vector sensors on other electronic equipment.
As another embodiment of the present application, the sound source direction estimation unit 1105 is further configured to:
obtaining the arrival direction of the sound source signal of a second target sound source according to one or more of the amplitude values of the sound source signal of the second target sound source on a plurality of channels, the de-mixing matrix or the mixing matrix of the sound source signals of one or more target sound sources, wherein the second target sound source is any one of the one or more target sound sources; wherein the unmixing matrix represents a conversion relation when the audio signal is separated into the sound source signals of the one or more target sound sources, and the mixing matrix represents a conversion relation when the sound source signals of the one or more target sound sources in the audio signal are mixed into the audio signal.
As another embodiment of the present application, the sound source direction estimation unit 1105 is further configured to:
determining a target column in the mixing matrix, and a first target row and a second target row in the target column, wherein the target column is a column representing a sound source signal of a second target sound source, and the first target row and the second target row are rows related to an angle of the sound source signal of the second target sound source; and obtaining the arrival direction of the sound source signal of the second target sound source according to the elements of the first target row and the elements of the second target row in the target column.
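As an illustrative sketch of recovering a direction from two elements of the target column (the channel ordering [omnidirectional, x, y, z] and the steering model proportional to [1, cos(az)cos(el), sin(az)cos(el), sin(el)] are assumptions chosen for this example, not taken from the present application):

```python
import numpy as np

def doa_from_mixing_column(a):
    """Estimate azimuth and elevation from one column of the mixing matrix.

    Assumes acoustic-vector-sensor channel order [omni, x, y, z], so the
    column is proportional to [1, cos(az)cos(el), sin(az)cos(el), sin(el)].
    """
    omni, ax, ay, az_ = a
    azimuth = np.arctan2(ay.real, ax.real)          # ratio of y-row / x-row
    elevation = np.arcsin(np.clip((az_ / omni).real, -1.0, 1.0))  # z vs omni
    return np.degrees(azimuth), np.degrees(elevation)

# Synthetic steering column for a source at azimuth 30 deg, elevation 10 deg.
az, el = np.radians(30), np.radians(10)
col = np.array([1.0,
                np.cos(az) * np.cos(el),
                np.sin(az) * np.cos(el),
                np.sin(el)])
est_az, est_el = doa_from_mixing_column(col)
```

Using the ratio of two rows, as in `arctan2` above, cancels the unknown common scaling of the column, which is why the scheme in the text needs only two target rows per angle.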
As another embodiment of the present application, the electronic device 1100 further includes:
a post-processing unit 1106, configured to perform a first enhancement processing on the sound source signals of the one or more target sound sources, where the first enhancement processing includes: interference spectrum filtering processing and/or harmonic enhancement processing; the interference spectrum filtering processing is used to filter out, based on the spectral energy of the sound source signal of any one of the one or more target sound sources, interference components mixed into that sound source signal; and the harmonic enhancement processing is used to obtain harmonic-enhanced signals of the one or more target sound sources, a harmonic-enhanced signal being a sound source signal containing harmonic components.
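A toy form of the harmonic enhancement component might look like the following: a simple comb gain applied at integer multiples of an assumed fundamental bin. A real implementation would first estimate the fundamental frequency and use a softer, frequency-dependent gain; the `harmonic_enhance` helper and its parameters are hypothetical:

```python
import numpy as np

def harmonic_enhance(mag, f0_bin, n_harmonics=5, gain=2.0):
    """Boost the magnitude spectrum at integer multiples of the source's
    fundamental bin (a comb-gain sketch of harmonic enhancement)."""
    out = mag.copy()
    for h in range(1, n_harmonics + 1):
        b = h * f0_bin
        if b < len(out):
            out[b] *= gain  # emphasize the h-th harmonic component
    return out

mag = np.ones(64)                     # flat magnitude spectrum for the demo
enh = harmonic_enhance(mag, f0_bin=10)
```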
As another embodiment of the present application, the post-processing unit 1106 is further configured to:
performing second enhancement processing on the sound source signal of the first target sound source based on first direction information of the sound source signal of the first target sound source on different frequency points, wherein the second enhancement processing comprises interference direction filtering processing and/or beam forming directional enhancement processing; the interference direction filtering processing is used for filtering frequency points of which the direction angles are not within an expected angle range in the sound source signals of the first target sound source; and a beam forming directional enhancement process for enhancing the power of the sound source signal of the first target sound source in a desired direction.
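The interference direction filtering described above can be sketched as a per-frequency-point mask that zeroes the bins whose first-direction-information angle falls outside the expected angle range; all names here are illustrative:

```python
import numpy as np

def direction_filter(spec, bin_angles, target_angle, half_width):
    """Zero out frequency points whose per-bin DOA estimate falls outside
    the desired range [target_angle - half_width, target_angle + half_width]."""
    keep = np.abs(bin_angles - target_angle) <= half_width
    return spec * keep  # boolean mask: passed bins kept, others set to 0

# 8-bin toy spectrum; two bins point at interferers (95 and 120 degrees).
spec = np.ones(8, dtype=complex)
bin_angles = np.array([38., 41., 95., 40., 120., 39., 42., 37.])
filtered = direction_filter(spec, bin_angles, target_angle=40, half_width=5)
```

A beamforming directional enhancement would instead weight the bins (or microphone channels) to raise power in the desired direction rather than hard-zeroing bins, but the masking form above shows the filtering half of the second enhancement processing.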
As another embodiment of the present application, the sound source signal obtaining unit 1104 may further obtain: noise and reverberation signals of one or more target sound sources;
the electronic device 1100 further comprises:
a scene effect processing unit 1107, configured to adjust a proportional relationship between the noise, the sound source signal of the first target sound source, and the reverberation signal of the first target sound source, where the first target sound source is any one of the one or more target sound sources.
It should be noted that, because the contents of information interaction, execution process, and the like between the electronic devices/units are based on the same concept as the method embodiment of the present application, specific functions and technical effects thereof may be referred to specifically in the method embodiment section, and are not described herein again.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the electronic device is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the steps in the above-mentioned method embodiments may be implemented.
Embodiments of the present application further provide a computer program product, which when run on an electronic device, enables the electronic device to implement the steps in the above method embodiments.
The integrated unit, if implemented as a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the method embodiments described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include at least: any entity or apparatus capable of carrying the computer program code to an electronic device, a recording medium, a computer memory, a read-only memory (ROM), a random-access memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, such as a USB flash drive, a removable hard disk, or a magnetic or optical disk. In certain jurisdictions, in accordance with legislation and patent practice, computer-readable media may not include electrical carrier signals or telecommunications signals.
An embodiment of the present application further provides a chip system, where the chip system includes a processor, the processor is coupled to the memory, and the processor executes a computer program stored in the memory to implement the steps of any of the method embodiments of the present application. The chip system may be a single chip or a chip module composed of a plurality of chips.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (16)

1. A method of estimating a direction of arrival of a sound source, comprising:
an electronic device acquires an audio signal, the audio signal comprising: noise, sound source signals of one or more target sound sources, reverberation signals;
the electronic device performs an Nth dereverberation processing on the audio signal to obtain an Nth prediction matrix and an Nth dereverberation signal, wherein the Nth dereverberation signal comprises the signals in the audio signal other than the Nth reverberation signal, and the Nth reverberation signal is the reverberation signal removed in the Nth dereverberation processing; the electronic device performs an Nth blind source separation processing on the Nth dereverberation signal to obtain an Nth de-mixing matrix and an Nth denoised signal, wherein the Nth denoised signal is a signal obtained by removing, from the audio signal, the noise obtained by the Nth blind source separation processing;
the electronic equipment continues to perform dereverberation processing and blind source separation processing on the Nth de-noised signal;
when N is a preset value or the Nth prediction matrix is converged and the Nth de-mixing matrix is converged, the electronic equipment obtains sound source signals of the one or more target sound sources according to the Nth de-mixing matrix and the Nth de-reverberation signal;
wherein N is a positive integer starting from 1;
the electronic device determines a direction of arrival of a sound source signal of the one or more target sound sources.
2. The method of claim 1, wherein for a first target sound source, the first target sound source is any one of the one or more target sound sources,
the direction of arrival of the sound source signal of the first target sound source comprises: first direction information of the sound source signal of the first target sound source on different frequency points.
3. The method of claim 2, further comprising:
the electronic equipment performs joint processing of sound source separation and direction of arrival estimation on the sound source signal of the first target sound source according to first direction information of the sound source signal of the first target sound source on different frequency points to obtain second direction information of the sound source signal of the first target sound source on different frequency points, wherein the sound source separation processing comprises: joint iterative processing of the dereverberation and blind source separation.
4. The method of claim 2, wherein the method further comprises:
the electronic equipment performs smoothing filtering processing or kernel density estimation processing on first direction information of the sound source signal of the first target sound source on the different frequency points to obtain third direction information of the sound source signal of the first target sound source on the different frequency points;
and the electronic equipment fuses the third direction information of the sound source signal of the first target sound source on the different frequency points to obtain the direction of the first target sound source.
5. The method of claim 1, wherein the electronic device continuing to perform the dereverberation process and the blind source separation process on the nth denoised signal comprises:
if the pth prediction matrix obtained by the pth dereverberation processing has converged and the pth de-mixing matrix obtained by the pth blind source separation processing has not converged, the electronic device performs the (p+i)th blind source separation processing until the (p+i)th de-mixing matrix converges, or alternately performs the (p+i)th dereverberation processing and the (p+i)th blind source separation processing until the (p+i)th prediction matrix and the (p+i)th de-mixing matrix converge simultaneously, wherein p is a positive integer and i is a positive integer starting from 1;
if the qth prediction matrix obtained by the qth dereverberation processing has not converged and the qth de-mixing matrix obtained by the qth blind source separation processing has converged, the electronic device performs the (q+i)th dereverberation processing until the (q+i)th prediction matrix converges, or alternately performs the (q+i)th dereverberation processing and the (q+i)th blind source separation processing until the (q+i)th prediction matrix and the (q+i)th de-mixing matrix converge simultaneously, wherein q is a positive integer and i is a positive integer starting from 1.
6. The method of claim 1, wherein the electronic device performing the j-th dereverberation process comprises a process of updating the prediction matrix m times, and wherein the electronic device performing the j-th blind source separation process comprises a process of updating the unmixing matrix n times, wherein j, m, and n are positive integers.
7. The method of claim 1, wherein the electronic device acquiring an audio signal comprises: the electronic equipment acquires audio signals through an acoustic vector sensor on the electronic equipment;
or the electronic equipment receives audio signals collected by acoustic vector sensors on other equipment.
8. The method of claim 7, wherein the electronic device determining a direction of arrival of a sound source signal of a second target sound source comprises:
the electronic equipment obtains the arrival direction of the sound source signal of the second target sound source according to one or more of the amplitude values of the sound source signal of the second target sound source on a plurality of channels, the de-mixing matrix and the mixing matrix of the sound source signal of the one or more target sound sources, wherein the second target sound source is any one of the one or more target sound sources;
wherein the unmixing matrix represents a transition relationship when the audio signal is separated into the sound source signals of the one or more target sound sources, and the mixing matrix represents a transition relationship when the sound source signals of the one or more target sound sources in the audio signal are mixed into the audio signal.
9. The method of claim 8, wherein the electronic device deriving the direction of arrival of the sound source signal of the second target sound source from the mixing matrix of the sound source signals of the one or more target sound sources comprises:
the electronic device determining a target column in the mixing matrix, wherein the target column is a column representing a sound source signal of the second target sound source, and a first target row and a second target row in the target column, wherein the first target row and the second target row are rows related to an angle of the sound source signal of the second target sound source;
and the electronic equipment obtains the arrival direction of the sound source signal of the second target sound source according to the elements of the first target row and the elements of the second target row in the target column.
10. The method according to claim 9, wherein when the first target row represents a row of a first channel of the acoustic vector sensor and the second target row represents a row of a second channel of the acoustic vector sensor, a direction of arrival of a sound source signal of the second target sound source includes a horizontal angle of the sound source signal of the second target sound source, the horizontal angle being an angle in a coordinate system in which the acoustic vector sensor is located;
and/or,
when the first target row represents a row of a third channel of the acoustic vector sensor and the second target row represents a row of an omnidirectional channel of the acoustic vector sensor, the direction of arrival of the sound source signal of the second target sound source includes a pitch angle of the sound source signal of the second target sound source, the pitch angle being an angle in the coordinate system in which the acoustic vector sensor is located.
11. The method of any one of claims 1 to 10, wherein after the electronic device obtains the sound source signals of the one or more target sound sources according to the Nth de-mixing matrix and the Nth dereverberation signal, the method further comprises:
the electronic device performs a first enhancement process on sound source signals of one or more target sound sources, wherein the first enhancement process includes: interference spectrum filtering processing and/or harmonic enhancement processing;
the interference frequency spectrum filtering processing is used for filtering interference components mixed in the sound source signal of any one target sound source based on the frequency spectrum energy of the sound source signal of any one target sound source in the sound source signals of the one or more target sound sources;
and the harmonic enhancement processing is used for obtaining harmonic enhancement signals of the one or more target sound sources, wherein the harmonic enhancement signals are sound source signals containing harmonic components.
12. The method of claim 2, 3 or 4, further comprising:
the electronic equipment executes second enhancement processing on the sound source signal of the first target sound source based on first direction information of the sound source signal of the first target sound source on different frequency points, wherein the second enhancement processing comprises interference direction filtering processing and/or beam forming directional enhancement processing;
the interference direction filtering processing is used for filtering frequency points of which the direction angles in the sound source signals of the first target sound source are not within an expected angle range;
the beamforming directional enhancement processing is used for enhancing the power of the sound source signal of the first target sound source in a desired direction.
13. The method according to any one of claims 1 to 10, wherein, when N is a preset value or the Nth prediction matrix converges and the Nth de-mixing matrix converges, the method further comprises:
the electronic device deriving the noise and reverberation signals of the one or more target sound sources from the audio signals;
the electronic device adjusts a proportional relationship between the noise, a sound source signal of a first target sound source, and a reverberation signal of the first target sound source, the first target sound source being any one of the one or more target sound sources.
14. An electronic device, comprising:
an audio signal acquisition unit for acquiring an audio signal including noise, a sound source signal of one or more target sound sources, a reverberation signal;
the dereverberation processing unit is used for carrying out the Nth dereverberation processing on the audio signal to obtain an Nth prediction matrix and an Nth dereverberation signal, wherein the Nth dereverberation signal comprises signals except the Nth dereverberation signal in the audio signal, and the Nth dereverberation signal is a dereverberation signal removed in the Nth dereverberation processing;
a blind source separation processing unit, configured to perform an Nth blind source separation processing on the Nth dereverberation signal to obtain an Nth de-mixing matrix and an Nth denoised signal, wherein the Nth denoised signal is a signal obtained by removing, from the audio signal, the noise obtained by the Nth blind source separation processing;
a sound source signal obtaining unit, configured to continue to perform dereverberation processing and blind source separation processing on the Nth denoised signal, and, when N is a preset value or both the Nth prediction matrix and the Nth de-mixing matrix have converged, obtain the sound source signals of the one or more target sound sources according to the Nth de-mixing matrix and the Nth dereverberation signal, wherein N is a positive integer starting from 1;
a sound source direction estimation unit for determining a direction of arrival of a sound source signal of the one or more target sound sources.
15. An electronic device, characterized in that the electronic device comprises a processor for executing a computer program stored in a memory for implementing the method according to any of claims 1 to 13.
16. A chip system, characterized in that the chip system comprises a processor coupled with a memory for executing a computer program stored in the memory for implementing the method according to any of claims 1 to 13.
CN202010643053.0A 2020-07-03 2020-07-03 Method for estimating direction of arrival of sound source, electronic equipment and chip system Pending CN113889135A (en)

Publications (1)

Publication Number Publication Date
CN113889135A true CN113889135A (en) 2022-01-04


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115497500A (en) * 2022-11-14 2022-12-20 北京探境科技有限公司 Audio processing method and device, storage medium and intelligent glasses
WO2024016793A1 (en) * 2022-07-20 2024-01-25 深圳Tcl新技术有限公司 Voice signal processing method and apparatus, device, and computer readable storage medium


Similar Documents

Publication Publication Date Title
CN107534725B (en) Voice signal processing method and device
EP3189521B1 (en) Method and apparatus for enhancing sound sources
CN111044973B (en) MVDR target sound source directional pickup method for microphone matrix
KR20050115857A (en) System and method for speech processing using independent component analysis under stability constraints
CN106887239A (en) For the enhanced blind source separation algorithm of the mixture of height correlation
CN112567763B (en) Apparatus and method for audio signal processing
WO2015184893A1 (en) Mobile terminal call voice noise reduction method and device
CN105264911A (en) Audio apparatus
CN110610718B (en) Method and device for extracting expected sound source voice signal
CN108028979A (en) Cooperate audio frequency process
CN109270493B (en) Sound source positioning method and device
US10917718B2 (en) Audio signal processing method and device
CN107181845A (en) A kind of microphone determines method and terminal
CN113889135A (en) Method for estimating direction of arrival of sound source, electronic equipment and chip system
WO2023108864A1 (en) Regional pickup method and system for miniature microphone array device
US11636866B2 (en) Transform ambisonic coefficients using an adaptive network
CN112802490B (en) Beam forming method and device based on microphone array
JP5190859B2 (en) Sound source separation device, sound source separation method, sound source separation program, and recording medium
CN114220454B (en) Audio noise reduction method, medium and electronic equipment
WO2023118644A1 (en) Apparatus, methods and computer programs for providing spatial audio
US20230319469A1 (en) Suppressing Spatial Noise in Multi-Microphone Devices
US12051429B2 (en) Transform ambisonic coefficients using an adaptive network for preserving spatial direction
CN114093380B (en) Voice enhancement method, electronic equipment, chip system and readable storage medium
CN112863525B (en) Method and device for estimating direction of arrival of voice and electronic equipment
CN113808606B (en) Voice signal processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination