CN112466325B - Sound source positioning method and device and computer storage medium - Google Patents

Sound source positioning method and device and computer storage medium Download PDF

Info

Publication number
CN112466325B
CN112466325B CN202011340094.9A CN202011340094A CN112466325B CN 112466325 B CN112466325 B CN 112466325B CN 202011340094 A CN202011340094 A CN 202011340094A CN 112466325 B CN112466325 B CN 112466325B
Authority
CN
China
Prior art keywords
frequency domain
sound source
signal
domain signal
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011340094.9A
Other languages
Chinese (zh)
Other versions
CN112466325A (en
Inventor
陈喆
胡宁宁
曹冰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd filed Critical Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202011340094.9A priority Critical patent/CN112466325B/en
Publication of CN112466325A publication Critical patent/CN112466325A/en
Application granted granted Critical
Publication of CN112466325B publication Critical patent/CN112466325B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01SRADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S5/18Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
    • G01S5/20Position of source determined by a plurality of spaced direction-finders
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The embodiment of the application discloses a sound source positioning method and device and a computer storage medium, wherein the method comprises the following steps: the method comprises the steps that a first voice signal and a second voice signal corresponding to a sound source to be positioned are respectively collected through a first sound collecting module and a second sound collecting module; frequency domain conversion processing is respectively carried out on the first voice signal and the second voice signal, and a first frequency domain signal and a second frequency domain signal are obtained; if the first frequency domain signal and the second frequency domain signal comprise voice signals, determining the first frequency domain signal and the second frequency domain signal as sound source characteristic signals; determining a target azimuth angle corresponding to a sound source to be positioned according to a preset positioning model, a sound source characteristic signal and a preset angle calculation model; the preset positioning model is used for determining probability values corresponding to different azimuth angles.

Description

Sound source positioning method and device and computer storage medium
Technical Field
The present invention relates to the field of speech recognition technology, and in particular, to a sound source localization method and apparatus, and a computer storage medium.
Background
With the rise of intelligent voice, microphone array pickup technology has gradually developed into a popular technology in the voice recognition process. The sound source positioning method based on the microphone array is widely applied to video conferences, voice enhancement, intelligent robots, intelligent home furnishings, vehicle-mounted communication equipment and the like. For example, in a video conferencing system, sound source localization may enable real-time camera alignment to the speaker; when applied to a hearing aid device, sound source position information may be provided to a hearing impaired person.
However, in order to more accurately identify the sound source target, the positioning accuracy is ensured, and the sound source positioning method in the related art cannot be well balanced between the positioning accuracy and the calculated amount, so that the calculated amount is large, the calculation complexity is high, and the method is not suitable for small electronic equipment.
Disclosure of Invention
The embodiment of the application provides a sound source positioning method and device and a computer storage medium, which can reduce the calculation complexity on the premise of ensuring the positioning accuracy and further realize the good balance between the positioning accuracy and the calculation amount.
The technical scheme of the embodiment of the application is realized as follows:
In a first aspect, an embodiment of the present application provides a sound source positioning method, where the method includes:
The method comprises the steps that a first voice signal and a second voice signal corresponding to a sound source to be positioned are respectively collected through a first sound collecting module and a second sound collecting module;
Frequency domain conversion processing is respectively carried out on the first voice signal and the second voice signal, so as to obtain a first frequency domain signal and a second frequency domain signal;
If the first frequency domain signal and the second frequency domain signal comprise voice signals, determining the first frequency domain signal and the second frequency domain signal as sound source characteristic signals;
Determining a target azimuth angle corresponding to the sound source to be positioned according to a preset positioning model, the sound source characteristic signal and a preset angle calculation model; the preset positioning model is used for determining probability values corresponding to different azimuth angles.
In a second aspect, an embodiment of the present application provides a sound source localization apparatus, which includes an acquisition unit, a conversion unit, and a determination unit,
The acquisition unit is used for respectively acquiring a first voice signal and a second voice signal corresponding to a sound source to be positioned through the first sound acquisition module and the second sound acquisition module;
the conversion unit is used for performing frequency domain conversion processing on the first voice signal and the second voice signal respectively to obtain a first frequency domain signal and a second frequency domain signal;
the determining unit is configured to determine the first frequency domain signal and the second frequency domain signal as sound source characteristic signals if the first frequency domain signal and the second frequency domain signal include a speech signal; determining a target azimuth angle corresponding to the sound source to be positioned according to a preset positioning model, the sound source characteristic signal and a preset angle calculation model; the preset positioning model is used for determining probability values corresponding to different azimuth angles.
In a third aspect, an embodiment of the present application provides a sound source localization apparatus, including a processor, a memory storing instructions executable by the processor, which when executed by the processor, implements a sound source localization method as described above.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium having stored thereon a program which, when executed by a processor, implements a sound source localization method as described above.
The embodiment of the application provides a sound source positioning method and device, and a computer storage medium, wherein the sound source positioning device respectively acquires a first voice signal and a second voice signal corresponding to a sound source to be positioned through a first sound acquisition module and a second sound acquisition module; frequency domain conversion processing is respectively carried out on the first voice signal and the second voice signal, and a first frequency domain signal and a second frequency domain signal are obtained; if the first frequency domain signal and the second frequency domain signal comprise voice signals, determining the first frequency domain signal and the second frequency domain signal as sound source characteristic signals; determining a target azimuth angle corresponding to a sound source to be positioned according to a preset positioning model, a sound source characteristic signal and a preset angle calculation model; the preset positioning model is used for determining probability values corresponding to different azimuth angles. That is, in the embodiment of the present application, after the sound source positioning device collects the audio frames sent by the sound source to be positioned through different sound collection modules and obtains the first voice signal and the second voice signal, the sound source positioning device may perform frequency domain conversion processing on the voice signals to convert the time domain voice signals into the first frequency domain signal and the second frequency domain signal. 
Furthermore, the sound source positioning device can determine the frequency domain signals comprising the voice signals as sound source characteristic signals, so that a target azimuth angle corresponding to the sound source to be positioned is further determined based on the sound source characteristic signals, a preset positioning model capable of determining probability values of different azimuth angles of the sound source to be positioned and a preset angle calculation model, and accurate positioning of the sound source position is achieved. Therefore, the sound source positioning method provided by the application is to perform frequency domain conversion processing on the voice signals obtained by different sound collecting channels, and after the converted frequency domain signals comprise the voice signals, the sound source positioning device is combined with the preset positioning model to further determine the target azimuth angle corresponding to the sound source to be positioned, so that the data required to be processed by the preset positioning model is greatly reduced, the positioning precision is ensured, and good balance between the positioning precision and the calculated amount is further realized. And because the calculation amount of the sound source positioning method is small, the method is applicable to various small-sized devices, and the application flexibility of the positioning method is higher.
Drawings
Fig. 1 is a schematic diagram of an implementation flow of a sound source localization method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a sound source positioning device according to an embodiment of the present application;
fig. 3 is a second schematic implementation flow chart of the sound source positioning method according to the embodiment of the present application;
fig. 4 is a schematic diagram of a third implementation flow of a sound source localization method according to an embodiment of the present application;
fig. 5 is a schematic diagram of an implementation flow of a sound source localization method according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a sound source localization process according to an embodiment of the present application;
Fig. 7 is a schematic diagram of a sound source localization process according to a second embodiment of the present application;
Fig. 8 is a schematic diagram of a sound source positioning device according to an embodiment of the present application;
fig. 9 is a schematic diagram of a sound source positioning device according to a second embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application. It is to be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to be limiting. It should be noted that, for convenience of description, only a portion related to the related application is shown in the drawings.
Before describing embodiments of the present invention in further detail, the terms and terminology involved in the embodiments of the present invention will be described, and the terms and terminology involved in the embodiments of the present invention will be used in the following explanation.
With the rise of intelligent voice, microphone array pickup technology has gradually developed into a popular technology in the voice recognition process. The sound source positioning method based on the microphone array is an important research direction in the voice recognition processing process, and has wide application occasions such as video conferences, voice enhancement, intelligent robots, intelligent home, vehicle-mounted communication equipment and the like. For example, in a video conferencing system, sound source localization may enable real-time camera alignment to the speaker; when applied to a hearing aid device, sound source position information may be provided to a hearing impaired person.
At present, sound source localization methods can be divided into two main categories according to the difference of localization parameters, including: inter-aural difference based positioning and head related transfer function based positioning. However, on one hand, poor positioning between ears is easy to be interfered by reverberation and noise in an actual environment, and the positioning performance is poor; on the other hand, although the positioning based on the head related transfer function realizes good sound source positioning in the three-dimensional space, the method has overlarge calculation complexity, and the individuality of the head related transfer function is stronger, and when different individuals or surrounding environments are different, the actual transfer function and the original transfer function can be possibly caused to have differences, so that accurate positioning cannot be realized.
Further, in order to solve the above problems, a sound source target is identified more accurately, positioning accuracy is guaranteed, a sound source positioning method based on a neural network is proposed in the related art, but the sound source positioning method has the problem that the positioning accuracy and the calculated amount cannot be balanced well, and the positioning accuracy is high, but the calculated amount is often large, the calculated complexity is high, and the method is not suitable for small electronic equipment.
In order to solve the problems of the existing sound source positioning mechanism, the embodiment of the application provides a sound source positioning method and device and a computer storage medium. Specifically, after the sound source positioning device collects the audio frames sent by the sound source to be positioned through different sound collection modules and obtains the first voice signal and the second voice signal, the sound source positioning device can firstly perform frequency domain conversion processing on the voice signals respectively so as to convert the voice signals in the time domain into the first frequency domain signal and the second frequency domain signal. Furthermore, the sound source positioning device can determine the frequency domain signals comprising the voice signals as sound source characteristic signals, so that a target azimuth angle corresponding to the sound source to be positioned is further determined based on the sound source characteristic signals, a preset positioning model capable of determining probability values of different azimuth angles of the sound source to be positioned and a preset angle calculation model, and accurate positioning of the sound source position is achieved. Therefore, the sound source positioning method provided by the application is to perform frequency domain conversion processing on the voice signals obtained by different sound collecting channels, and after the converted frequency domain signals comprise the voice signals, the sound source positioning device is combined with the preset positioning model to further determine the target azimuth angle corresponding to the sound source to be positioned, so that the data required to be processed by the preset positioning model is greatly reduced, the positioning precision is ensured, and good balance between the positioning precision and the calculated amount is further realized. 
And because the calculation amount of the sound source positioning method is small, the method is applicable to various small-sized devices, and the application flexibility of the positioning method is higher.
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application.
An embodiment of the present application provides a sound source positioning method, fig. 1 is a schematic implementation flow diagram of the sound source positioning method provided in the embodiment of the present application, as shown in fig. 1, in the embodiment of the present application, the method for performing sound source positioning by a sound source positioning device may include the following steps:
step 101, a first voice signal and a second voice signal corresponding to a sound source to be positioned are respectively acquired through a first sound acquisition module and a second sound acquisition module.
In the embodiment of the application, the sound source positioning device can acquire and process the audio frames sent by the sound source to be positioned through the first sound acquisition module and the second sound acquisition module respectively, so as to obtain the first voice signal and the second voice signal.
It will be appreciated that in embodiments of the present application, the sound source localization apparatus may be any electronic device configured with a sound sensor capable of achieving sound source localization through the "pick-up" function of the sound sensor. Mobile phones, notebook computers, tablet computers, desktop computers, portable game devices, vehicle-mounted devices, wearable devices (e.g., smart glasses), etc., which are not limited to home appliances and consumer electronics; or may be a security robot, a service robot, and a teleconferencing system capable of recognizing a service object that emits sound, or may be a submarine and warship in the field of the military.
In the embodiment of the present application, the sound source positioning device is an electronic device provided with a sound sensor array. Specifically, the sound source positioning device is provided with a first sound collecting module and a second sound collecting module, and the two sound collecting modules form a sound sensor array.
In particular, the first sound collection module and the second sound collection module may be disposed at different locations on the terminal, i.e. at different locations corresponding to the space. Optionally, the first sound collecting module and the second sound collecting module are disposed at symmetrical positions of the sound source positioning device, and may be disposed at upper and lower symmetrical positions, left and right symmetrical positions, or front and rear symmetrical positions.
In an exemplary embodiment of the present application, fig. 2 is a schematic diagram of a sound source positioning device according to an embodiment of the present application, and as shown in fig. 2, assuming that the sound source positioning device is augmented reality (Augmented Reality, AR) glasses and the sound collecting module may be a microphone, the sound source positioning device may set the first microphone 10 and the second microphone 20 at symmetrical positions on the left and right sides of the AR glasses, where the first microphone 10 and the second microphone 20 form a microphone array.
It should be understood that in the embodiment of the present application, the sound source localization apparatus is not limited to include only the first sound collection module and the second sound collection module, and the sound source localization apparatus may include L sound collection modules; wherein L is equal to or greater than 2. Further, the sound sensor array formed by the plurality of sound collecting modules can be an array which corresponds to different positions in space and is arranged according to a certain shape rule.
It should be noted that, in the embodiment of the present application, the sound source positioning device may perform audio frame acquisition from the same sound source, i.e. the to-be-positioned sound source, through the first sound acquisition module and the second sound acquisition module that are disposed at different positions, i.e. the same audio frame sent by the to-be-positioned sound source is acquired by microphones at different positions in the microphone array.
It should be appreciated that microphones at different spatial locations are not located the same distance from the same sound source, and thus the same audio frame emanating from the same sound source, the speech information collected by microphones at different locations is different. Therefore, in the embodiment of the present application, after an audio frame is sent out at a certain time of the sound source to be positioned, the sound source positioning device acquires a first voice signal through a first sound acquisition module in the sound sensor array, and acquires a second voice signal through a second sound acquisition module, where information of the two voice signals may be different. That is, in the embodiment of the present application, after the sound source positioning device collects the same audio frame sent by the sound source to be positioned through different sound collection channels, different first voice signals and second voice signals can be obtained.
Further, in the embodiment of the present application, after the sound source positioning device collects the first voice signal and the second voice signal corresponding to the sound source to be positioned through the first sound collecting module and the second sound collecting module, the sound source positioning device may further perform frequency domain conversion processing on the voice signals.
Step 102, frequency domain conversion processing is performed on the first voice signal and the second voice signal respectively, so as to obtain a first frequency domain signal and a second frequency domain signal.
In the embodiment of the application, after the sound source positioning device obtains the first voice signal and the second voice signal through the first voice acquisition module and the second voice acquisition module, the sound source positioning device can further perform frequency domain conversion processing on the two paths of voice signals, so as to obtain a first frequency domain signal and a second frequency domain signal.
It should be noted that, in the embodiment of the present application, the voice signal is a voice digital signal. Specifically, the sound source positioning device performs analog/digital conversion of the voice signal during the acquisition of the signal before the frequency domain conversion, that is, when the two sound acquisition modules acquire and process the audio frames sent by the sound source to be positioned respectively, the voice signal is converted into an electric signal, sampled and subjected to a/D conversion, so as to obtain a voice digital signal.
Further, the sound source positioning device performs frequency domain conversion processing on the two paths of voice digital signals, namely converting the voice digital signals from a time domain to a frequency domain.
It should be noted that, in the embodiment of the present application, the sound source positioning device may perform analog/digital conversion and frequency domain conversion on the voice analog signal by using a hardware processor, for example, a functional processor such as a digital signal Processing technology (DIGITAL SIGNAL Processing, DSP) or (ADVANCED RISC MACHINE, ARM).
Specifically, fig. 3 is a second implementation flow chart of a sound source localization method according to an embodiment of the present application, as shown in fig. 3, a method for performing frequency domain conversion processing on a first speech signal and a second speech signal by a sound source localization device to obtain a first frequency domain signal and a second frequency domain signal, where the method includes:
Step 102a, performing windowing processing on the first voice signal and the second voice signal respectively to obtain a windowed first signal and a windowed second signal.
And 102b, performing Fourier transform processing on the windowed first signal and the windowed second signal respectively to obtain a first frequency domain signal and a second frequency domain signal.
Specifically, in the embodiment of the present application, the sound source positioning device may perform windowing processing on two paths of voice digital signals obtained after the analog/digital conversion, so as to obtain two paths of windowed voice signals, and then, the sound source positioning device continues to perform fast fourier transform (Fast Fourier transform, FFT) processing on the two paths of windowed voice signals, so as to further obtain two paths of frequency domain signals.
It should be understood that, in the embodiment of the present application, the sound source positioning device performs the processing of the voice signal in real time based on the frame division manner, that is, performs the real-time acquisition sequentially frame by frame according to the preset frame length N, and performs the frequency domain conversion processing sequentially frame by frame.
The sound source positioning device respectively collects audio frames sent by a sound source to be positioned through two paths of microphones, wherein the length of a frame signal is N, and N is an integer greater than zero. Obtaining a first voice signal x 1 (n) and a second voice signal x 2 (n) through sampling and A/D conversion; further, the sound source positioning device performs windowing processing on the x 1 (n) and the x 2 (n) by using a windowing function omega (n) through a DSP module or an ARM module to obtain a first windowed signal x 1 (n) omega (n) and a second windowed signal x 2 (n) omega (n); then, the sound source positioning device continues to perform FFT conversion processing on the two paths of windowed signals to obtain
First frequency domain signal:
X 1,m(k)=FFT[x1,m (n) ω (n) ] (equation 1)
Second frequency domain signal:
X 2,m(k)=FFT[x2,m (n) ω (n) ] (equation 2)
In formulas 1 and 2, ω (N) is a window function, k is a frequency bin, m is a frame number, and n=1, 2.
Further, in the embodiment of the present application, after performing frequency domain conversion processing on the first voice signal and the second voice signal to obtain the first frequency domain signal and the second frequency domain signal, the sound source positioning device may further determine whether the frequency domain signal includes the voice signal, and further execute corresponding subsequent processing according to the determination result.
Step 103, if the first frequency domain signal and the second frequency domain signal include voice signals, determining the first frequency domain signal and the second frequency domain signal as sound source characteristic signals.
In the embodiment of the present application, after performing frequency domain conversion processing on the first voice signal and the second voice signal to obtain the first frequency domain signal and the second frequency domain signal, if the sound source positioning device determines that the first frequency domain signal and the second frequency domain signal include the voice signal, the sound source positioning device may further determine the two paths of frequency domain signals as sound source characteristic signals.
It will be appreciated that there may be periods of silence during which the sound source to be positioned is playing audio, i.e. for some particular reason the sound source to be positioned stops playing audio for some period of time. For example, during a call, the speaker is listening to the other party, or the speaker pauses for a session due to thinking, a brief message, etc., or pauses between utterances (e.g., hesitation, stuttering, breathing, etc.), so that the current frame does not contain any speech information.
Therefore, in the embodiment of the present application, the sound source positioning device may perform a voice endpoint detection process on the frequency domain signal to determine whether the audio frame sent by the sound source to be positioned includes a voice signal, that is, identify a silence period in the audio frame signal.
Alternatively, in an embodiment of the present application, the sound source localization device may implement recognition of the silence period in the audio frame signal through a voice activity detection (Voice Activity Detection, VAD) module.
Further, in the embodiment of the present application, after performing the voice endpoint detection processing on the frequency domain signal, if the detection result is that the frequency domain signal does not include a voice signal, that is, the sound source to be positioned is currently in a silence period and not emitting speech, the sound source positioning device can discard the two paths of frequency domain signals, continue to collect the next audio frame, perform frequency domain conversion on it, and judge whether that audio frame includes a voice signal.
On the other hand, if the detection result is that the frequency domain signal includes a voice signal, that is, the audio frame currently emitted by the sound source to be positioned is not in a silence period, the sound source localization device can determine the two paths of frequency domain signals as sound source characteristic signals.
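The gating logic described above, in which silent frames are discarded and voiced frames are kept as sound source characteristic signals, can be sketched as follows. The function name and the shape of the frame pairs are illustrative assumptions, not the patent's actual interface:

```python
def select_feature_signals(frames, is_voiced):
    """Sketch of VAD gating (assumed names).

    frames: list of (first_freq_signal, second_freq_signal) pairs, one per audio frame.
    is_voiced: predicate deciding whether a frame pair contains a speech signal.
    Returns only the voiced pairs, which serve as sound source characteristic signals.
    """
    feature_signals = []
    for pair in frames:
        if is_voiced(pair):
            feature_signals.append(pair)  # keep as sound source characteristic signal
        # otherwise the pair is discarded and the next audio frame is processed
    return feature_signals
```

In this sketch, a frame failing the VAD predicate simply does not appear in the output, mirroring the discard-and-continue behavior described above.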
It should be noted that, in the embodiment of the present application, the sound source characteristic signal refers to two paths of frequency domain signals including a voice signal, and may be used as a target parameter for determining the sound source position finally.
Therefore, the frequency domain signals which do not comprise voice signals and belong to silence are discarded, and only two paths of frequency domain signals comprising voice signals are used as sound source characteristic signals, so that the silence period is identified and eliminated from the audio frame signals of the sound source to be positioned, the telephone traffic resources and bandwidth resources are saved under the condition that the service quality is not reduced, and the end-to-end time delay felt by a user is reduced.
Further, in the embodiment of the present application, after determining the frequency domain signal including the voice signal as the sound source characteristic signal, the sound source positioning device may further implement accurate positioning of the sound source to be positioned according to the sound source characteristic signal, the preset positioning model and the preset angle calculation model.
Step 104, determining a target azimuth angle corresponding to the sound source to be positioned according to the preset positioning model, the sound source characteristic signal and the preset angle calculation model.
In the embodiment of the application, after the sound source positioning device determines the frequency domain signal including the voice signal as the sound source characteristic signal, the sound source positioning device can accurately position the sound source to be positioned according to the sound source characteristic signal, the preset positioning model and the preset angle calculation model.
It should be noted that, in the embodiment of the present application, the target azimuth angle refers to a spatial position corresponding to the sound source to be positioned, that is, an offset angle of the sound source to be positioned relative to the sound source positioning device.
Optionally, after the sound source positioning device determines that the two paths of frequency domain signals corresponding to the current audio frame include speech signals, that is, after determining that the two paths of frequency domain signals are sound source characteristic signals, it may determine the azimuth angle corresponding to the sound source to be positioned directly based on this single sound source characteristic signal and the preset positioning model, thereby realizing accurate positioning of the sound source to be positioned, that is, "single-frame positioning".
Optionally, the sound source positioning device may also determine the azimuth angle corresponding to the sound source to be positioned according to multiple sets of frequency domain signals corresponding to multiple moments and determined as the sound source characteristic signals, a preset positioning model and a preset angle calculation model, so as to achieve accurate positioning of the sound source to be positioned, that is, "multi-frame positioning".
Specifically, in "multi-frame positioning", the sound source positioning device may preset a target frame number for performing sound source positioning, and when the number of frequency domain signals determined as sound source characteristic signals reaches the target frame number, the sound source positioning device may determine the target azimuth angle according to the plurality of sound source characteristic signals, the preset positioning model and the preset angle calculation model.
It should be noted that, in the embodiment of the present application, the preset positioning model is used to determine probability values corresponding to different azimuth angles of the sound source to be positioned, that is, probability values of the sound source at each possible position, where the probability values corresponding to different positions are not the same. Furthermore, the sound source positioning device can further combine a plurality of angle probability values and a preset angle calculation model to realize accurate positioning of the sound source to be positioned.
The embodiment of the application provides a sound source positioning method, wherein a sound source positioning device respectively acquires a first voice signal and a second voice signal corresponding to a sound source to be positioned through a first sound acquisition module and a second sound acquisition module; frequency domain conversion processing is respectively carried out on the first voice signal and the second voice signal, and a first frequency domain signal and a second frequency domain signal are obtained; if the first frequency domain signal and the second frequency domain signal comprise voice signals, determining the first frequency domain signal and the second frequency domain signal as sound source characteristic signals; determining a target azimuth angle corresponding to a sound source to be positioned according to a preset positioning model, a sound source characteristic signal and a preset angle calculation model; the preset positioning model is used for determining probability values corresponding to different azimuth angles. That is, in the embodiment of the present application, after the sound source positioning device collects the audio frames sent by the sound source to be positioned through different sound collection modules and obtains the first voice signal and the second voice signal, the sound source positioning device may perform frequency domain conversion processing on the voice signals to convert the time domain voice signals into the first frequency domain signal and the second frequency domain signal. 
Furthermore, the sound source positioning device can determine the frequency domain signals comprising the voice signals as sound source characteristic signals, so that a target azimuth angle corresponding to the sound source to be positioned is further determined based on the sound source characteristic signals, a preset positioning model capable of determining probability values of different azimuth angles of the sound source to be positioned and a preset angle calculation model, and accurate positioning of the sound source position is achieved. Therefore, the sound source positioning method provided by the application is to perform frequency domain conversion processing on the voice signals obtained by different sound collecting channels, and after the converted frequency domain signals comprise the voice signals, the sound source positioning device is combined with the preset positioning model to further determine the target azimuth angle corresponding to the sound source to be positioned, so that the data required to be processed by the preset positioning model is greatly reduced, the positioning precision is ensured, and good balance between the positioning precision and the calculated amount is further realized. And because the calculation amount of the sound source positioning method is small, the method is applicable to various small-sized devices, and the application flexibility of the positioning method is higher.
Further, fig. 4 is a schematic diagram of a third implementation flow chart of a sound source localization method according to an embodiment of the present application, as shown in fig. 4, after a sound source localization device obtains a first frequency domain signal and a second frequency domain signal, i.e. after step 102, and if the first frequency domain signal and the second frequency domain signal include a speech signal, the method for performing sound source localization by the sound source localization device includes:
Step 105, performing a voice endpoint detection process on the first frequency domain signal and the second frequency domain signal, to obtain a first voice energy value corresponding to the first voice signal and a second voice energy value corresponding to the second voice signal.
Step 106, if the first speech energy value is greater than or equal to the preset energy threshold and the second speech energy value is greater than or equal to the preset energy threshold, determining that the first frequency domain signal and the second frequency domain signal include speech signals.
It should be noted that, in the embodiment of the present application, when the sound source positioning device performs the voice endpoint detection processing on the two paths of frequency domain signals to determine whether the corresponding audio frame includes the voice signal, the sound source positioning device may implement the above determination process by calculating the energy value of the voice signal.
Specifically, the calculation method of the voice energy value is shown in the following formula:
In formula 3, X_m(k) is the frequency domain signal corresponding to the current audio frame, X_{m-1}(k) is the frequency domain signal corresponding to the previous audio frame, N is the frame length, and VAD(k, m) is the speech signal energy value.
It should be appreciated that when the speech signal energy is below a certain threshold, the current frame is considered to be in a mute state; conversely, if it is above the threshold, the frame is considered to include a speech signal. Therefore, in the embodiment of the application, the sound source localization device is pre-set with a speech signal energy threshold; when the calculated speech energy value reaches this threshold, it is determined that a voice signal is included. Thus, after calculating the speech energy value based on the frequency domain signal, the device can compare the speech energy value with the preset energy threshold and determine, according to the comparison result, whether a voice signal is included, that is, whether the frequency domain signal can serve as a sound source characteristic signal.
Specifically, after performing voice endpoint detection processing on the two paths of frequency domain signals, the sound source positioning device calculates a first speech energy value corresponding to the first frequency domain signal and a second speech energy value corresponding to the second frequency domain signal; if the first speech energy value is greater than or equal to the preset energy threshold and the second speech energy value is also greater than or equal to the preset energy threshold, the sound source positioning device can determine that the audio frames corresponding to the two paths of frequency domain signals include speech signals.
On the other hand, if the first speech energy value is smaller than the preset energy threshold, or the second speech energy value is smaller than the preset energy threshold, the sound source localization device may determine that the audio frame corresponding to the frequency domain signals does not include a speech signal.
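The two-channel threshold test of steps 105 and 106 can be sketched in a few lines; the function name is an assumption for illustration:

```python
def contains_speech(first_energy, second_energy, threshold):
    """Sketch of the step 106 decision (assumed name).

    Both channels must meet or exceed the preset energy threshold for the
    frame to be treated as containing a speech signal; if either channel
    falls below the threshold, the frame is treated as silence.
    """
    return first_energy >= threshold and second_energy >= threshold
```

The conjunction matters: a frame where only one microphone picks up sufficient energy is still discarded, consistent with the paragraph above.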
The application provides a sound source positioning method, a sound source positioning device can discard a frequency domain signal which does not comprise a voice signal and belongs to silence, only uses two paths of frequency domain signals comprising the voice signal as sound source characteristic signals, realizes the identification and elimination of a silence period from an audio frame signal of a sound source to be positioned, saves telephone traffic resources and bandwidth resources under the condition of not reducing service quality, and is beneficial to reducing end-to-end time delay felt by a user.
Further, fig. 5 is a schematic diagram of a realization flow chart of a sound source positioning method according to an embodiment of the present application, as shown in fig. 5, a method for determining a target azimuth angle corresponding to a sound source to be positioned by a sound source positioning device according to a preset positioning model, the sound source characteristic signal and a preset angle calculation model includes:
104a, inputting a plurality of sound source characteristic signals corresponding to a plurality of moments into a preset positioning model, and outputting a plurality of angle probability combinations corresponding to the plurality of sound source characteristic signals; wherein the angular probability combination characterizes the correspondence of azimuth and probability.
Step 104b, determining the target azimuth based on the multiple angle probability combinations and the preset angle calculation model.
It should be noted that, in the embodiment of the present application, when the sound source positioning device performs "multi-frame positioning", a target buffer area is first set for pre-storing the frequency domain signals determined as sound source characteristic signals, and the preset target frame number corresponding to this buffer area is M.
Specifically, the sound source positioning device may store the frequency domain signal currently determined as the sound source characteristic signal into the target buffer area, and when the number of signals in the target buffer area is equal to M, input multiple frames of frequency domain signals into the preset positioning model. Wherein, the feature matrix formed by two paths of frequency domain signals of each audio frame is as follows:
In formula 4, Imag denotes taking the imaginary part, Real denotes taking the real part, and m denotes the index of the audio frame.
Further, the feature matrix corresponding to the M audio frames is shown in the following formula:
At this time, in formula 5, the dimension of the feature matrix formed by the M audio frames is 10 × 2052 (with M = 10).
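The 10 × 2052 dimension is consistent with each frame contributing the imaginary and real parts of both one-sided spectra. The sketch below assumes a frame length of 1024 samples, so that the one-sided FFT yields 513 bins and 2 channels × 2 parts × 513 bins = 2052 features per frame; the frame length and the ordering of the parts are illustrative assumptions:

```python
import numpy as np

M, frame_len = 10, 1024           # 10 buffered frames; 1024-sample frames (assumed)
bins = frame_len // 2 + 1         # 513 one-sided FFT bins
rng = np.random.default_rng(0)

rows = []
for _ in range(M):
    x1 = rng.standard_normal(frame_len)   # first channel, time domain (placeholder data)
    x2 = rng.standard_normal(frame_len)   # second channel, time domain (placeholder data)
    X1, X2 = np.fft.rfft(x1), np.fft.rfft(x2)
    # One row per frame: imaginary and real parts of both channels (ordering assumed)
    rows.append(np.concatenate([X1.imag, X1.real, X2.imag, X2.real]))

features = np.stack(rows)
print(features.shape)  # (10, 2052), matching the dimension stated in the text
```

Under these assumptions, each row of `features` corresponds to the per-frame feature matrix of formula 4, and the stacked array corresponds to the M-frame matrix of formula 5.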
Further, the preset positioning model applies convolution, pooling and other processing to the feature matrix shown in formula 5, and determines a plurality of initial probability values corresponding to a plurality of azimuth angles for each audio frame of the sound source to be positioned. The preset positioning model then selects, from the plurality of initial probability values, the initial azimuth angle corresponding to the maximum initial probability value as the candidate azimuth angle for that audio frame, and uses the correspondence between the candidate azimuth angle and its probability value as a candidate angle probability combination for determining the target azimuth angle. Correspondingly, the M audio frames yield M candidate angle probability combinations.
It should be noted that, in the embodiment of the present application, the preset positioning model is constructed based on a convolutional neural network (Convolutional Neural Network, CNN).
For example, in the embodiment of the present application, assuming that the preset target frame number M is equal to 10, the neural network corresponding to the preset positioning model consists, in order, of one input layer, three convolution layers alternating with three pooling layers, two fully connected layers, and one output layer. When the number of frames cached in the target buffer area reaches 10, the plurality of sound source characteristic signals are input into the preset positioning model. The input feature of the input layer is the frequency domain signal containing the speech signal determined in step 103, that is, the feature matrix composed of the real and imaginary parts of the frequency domain signals shown in formula 5. Each convolution layer uses 64 filters of size 2 × 2; the convolution stride is not limited, the output of the previous layer is zero-padded before convolution so that the feature size is not reduced by the convolution, and the activation function is the ReLU function. Each pooling layer uses 2 × 2 max pooling; the stride is likewise not limited, and the output of the previous layer is zero-padded before pooling. After the three rounds of convolution and pooling, the multidimensional output of the last pooling layer is flattened into a one-dimensional 512 × 1 feature, which is then mapped by the first and second fully connected layers into 10 results, one for each of the 10 audio frames.
Further, the preset positioning model converts its output into probabilities through Softmax, representing the probability values corresponding to 10 different azimuth angles; that is, for each of the 10 audio frames, the maximum probability value and its corresponding azimuth angle are selected, yielding 10 angle probability combinations.
Furthermore, through the preset angle calculation model, the sound source positioning device determines the azimuth angle corresponding to the maximum probability value among the multiple angle probability combinations as the target azimuth angle corresponding to the sound source to be positioned. Specifically, the target azimuth angle is determined as follows:
In formula 6, θ_i represents the angle value of the sound source localization, and p(θ_ci) represents the maximum probability value corresponding to the i-th audio frame in the input signal.
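The angle selection described for formula 6, taking the azimuth whose per-frame maximum probability is largest across the buffered frames, can be sketched as a simple argmax over the angle probability combinations. The function name and the pair representation are assumptions:

```python
def target_azimuth(angle_prob_combinations):
    """Sketch of the preset angle calculation model (assumed name).

    angle_prob_combinations: list of (azimuth_degrees, probability) pairs,
    one per buffered audio frame, each pair being that frame's candidate
    azimuth and its maximum probability value. Returns the azimuth of the
    pair with the largest probability.
    """
    best_angle, _ = max(angle_prob_combinations, key=lambda pair: pair[1])
    return best_angle
```

With M = 10 buffered frames, the input list has 10 pairs and the output is a single target azimuth angle.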
The embodiment of the application provides a sound source positioning method, wherein a sound source positioning device performs frequency domain conversion processing on a voice signal acquired through a sound sensor array, and after the acquired frequency domain signal comprises the voice signal, a target azimuth angle corresponding to a sound source to be positioned is further determined by combining the frequency domain signal, a preset positioning model and a preset angle calculation model of the sound source positioning device, so that accurate positioning of the sound source is realized, data to be processed by the preset positioning model is greatly reduced, positioning precision can be ensured, and good balance between positioning precision and calculated amount is further realized. And because the calculation amount of the sound source positioning method is small, the method is applicable to various small-sized devices, and the application flexibility of the positioning method is higher.
Further, by way of example, fig. 6 is a schematic diagram of a sound source localization process according to an embodiment of the present application, as shown in fig. 6, assuming that the sound source localization device is smart glasses, the first sound collection module and the second sound collection module are microphones, where the two microphones are symmetrically disposed on a left lens frame and a right lens frame of the smart glasses.
Specifically, the terminal collects current audio frames sent by the sound source to be positioned through the first microphone and the second microphone respectively, and further collects first voice signals through the first microphone, and collects second voice signals through the second microphone (step S1); further, the sound source positioning device performs windowing and FFT conversion processing on the obtained two paths of voice signals, namely, converting the time domain signals into frequency domain signals, so as to obtain a first frequency domain signal and a second frequency domain signal (step S2). The sound source positioning device performs a voice detection process on the two paths of frequency domain signals through a voice endpoint detection module, namely a VAD module (step S3), so as to judge whether the current audio frame comprises a voice signal according to the voice energy values respectively corresponding to the two paths of frequency domain signals (step S4).
Further, if it is determined that the first speech energy value corresponding to the first frequency domain signal and the second speech energy value corresponding to the second frequency domain signal are both greater than or equal to the preset energy threshold, it is determined that the current frame includes speech signals. The sound source positioning device may then input the two paths of frequency domain signals into the preset positioning model (step S5), obtain through the preset positioning model the initial probability values corresponding to different azimuth angles of the current audio frame, and determine the azimuth angle corresponding to the maximum of these initial probability values as the target azimuth angle at which the sound source to be positioned emits the current audio frame.
Further, exemplary, fig. 7 is a schematic diagram of a sound source positioning process according to the embodiment of the present application, as shown in fig. 7, assuming that the sound source positioning device is smart glasses, the first sound collecting module and the second sound collecting module are microphones, where the two microphones are symmetrically disposed on a left lens frame and a right lens frame of the smart glasses.
Specifically, the terminal collects current audio frames sent by the sound source to be positioned through the first microphone and the second microphone respectively, and further collects first voice signals through the first microphone, and collects second voice signals through the second microphone (step S1); further, the sound source positioning device performs windowing and FFT conversion processing on the obtained two paths of voice signals, namely, converting the time domain signals into frequency domain signals, so as to obtain a first frequency domain signal and a second frequency domain signal (step S2). The sound source positioning device performs a voice detection process on the two paths of frequency domain signals through a voice endpoint detection module, namely a VAD module (step S3), so as to judge whether the current audio frame comprises a voice signal according to the voice energy values respectively corresponding to the two paths of frequency domain signals (step S4).
Further, if it is determined that the first frequency domain signal and the second frequency domain signal both include a speech signal, the sound source positioning device may buffer the two paths of frequency domain signals corresponding to the same frame in a preset storage space (step S6). Otherwise, if the voice signal is not included, the next frame of voice signal is continuously collected, namely, the step S1 is executed in a jumping manner.
It should be noted that the number of frame signals buffered in the preset storage space is limited, so after buffering the two paths of frequency domain signals in the preset storage space, the sound source positioning device needs to determine whether the number of buffered frame signals is equal to the target frame number, for example 10 frames (step S7). If it is equal to 10 frames, the sound source positioning device can input the 10 frames of frequency domain signals into the preset positioning model to obtain the probability values corresponding to different azimuth angles over the plurality of audio frames, that is, a plurality of angle probability combinations (step S8); further, the multiple angle probability combinations are input into the preset angle calculation model (step S9), and the azimuth angle corresponding to the maximum probability value is determined as the target azimuth angle corresponding to the sound source to be positioned, that is, the accurate position of the sound source is determined. On the other hand, if it is not yet equal to 10 frames, the sound source localization device may continue to wait and buffer frequency domain signals until the preset storage space holds 10 frames.
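The buffering logic of steps S6 to S8, caching voiced frame pairs until the target frame number is reached and then releasing the batch to the positioning model, can be sketched as follows; the class and method names are assumptions for illustration:

```python
class FrameBuffer:
    """Sketch of the steps S6-S8 buffering behavior (assumed names).

    Voiced frame pairs are cached until the target frame number (e.g. 10)
    is reached; the full batch is then released for the preset positioning
    model, and the buffer is cleared for the next round.
    """

    def __init__(self, target_frames=10):
        self.target_frames = target_frames
        self.frames = []

    def push(self, frame_pair):
        self.frames.append(frame_pair)
        if len(self.frames) == self.target_frames:
            batch, self.frames = self.frames, []
            return batch   # ready to be input into the preset positioning model
        return None        # keep waiting for more voiced frames (step S7 "no" branch)
```

A caller would invoke `push` once per voiced frame and run the positioning model only when `push` returns a non-`None` batch, mirroring the wait-until-full behavior described above.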
The embodiment of the application provides a sound source positioning method, wherein after a sound source positioning device acquires synchronous first voice signals and synchronous second voice signals from the same sound source to be positioned through different sound acquisition modules, the sound source positioning device can respectively perform frequency domain conversion processing on the voice signals so as to convert the time domain voice signals into first frequency domain signals and second frequency domain signals. Furthermore, the sound source positioning device can input the frequency domain signals comprising the voice signals into a preset positioning model for determining probability values of different azimuth angles of the sound source, and meanwhile, the target azimuth angles corresponding to the sound source to be positioned are further determined by combining with the preset calculation model, so that the determination of the sound source position is realized. Therefore, the sound source positioning method provided by the application is to perform frequency domain conversion processing on the voice signal, and after the converted frequency domain signal is determined to comprise the voice signal, the sound source positioning device is combined with the preset positioning model to further determine the target azimuth angle corresponding to the sound source to be positioned, so that the data required to be processed by the preset positioning model is greatly reduced, the positioning precision is ensured, and good balance between the positioning precision and the calculated amount is further realized. And because the calculation amount of the sound source positioning method is small, the method is applicable to various small-sized devices, and the application flexibility of the positioning method is higher.
Based on the above embodiment, in another embodiment of the present application, fig. 8 is a schematic diagram of the composition structure of the sound source positioning device according to the embodiment of the present application, as shown in fig. 8, the sound source positioning device 30 according to the embodiment of the present application may include an acquisition unit 31, a conversion unit 32, a determination unit 33, a detection unit 34 and an execution unit 35,
The acquisition unit 31 is configured to acquire a first voice signal and a second voice signal corresponding to a sound source to be positioned through a first sound acquisition module and a second sound acquisition module respectively;
the converting unit 32 is configured to perform frequency domain conversion processing on the first voice signal and the second voice signal, so as to obtain a first frequency domain signal and a second frequency domain signal;
The determining unit 33 is configured to determine the first frequency domain signal and the second frequency domain signal as sound source characteristic signals if the first frequency domain signal and the second frequency domain signal include a speech signal; determining a target azimuth angle corresponding to the sound source to be positioned according to a preset positioning model, the sound source characteristic signal and a preset angle calculation model; the preset positioning model is used for determining probability values corresponding to different azimuth angles.
Further, in an embodiment of the present application, the first sound collection module and the second sound collection module are disposed at symmetrical positions.
Further, in the embodiment of the present application, the converting unit 32 is specifically configured to perform windowing processing on the first voice signal and the second voice signal, so as to obtain a windowed first signal and a windowed second signal; and perform fast Fourier transform (FFT) processing on the windowed first signal and the windowed second signal respectively, so as to obtain the first frequency domain signal and the second frequency domain signal.
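The converting unit's windowing-then-FFT step can be sketched as below. The patent only says "windowing", so the Hamming window and the 1024-sample frame length are assumptions for illustration:

```python
import numpy as np

def to_frequency_domain(frame, window=None):
    """Sketch of the converting unit: windowing followed by FFT.

    The Hamming window is an assumption; the one-sided FFT of a real
    frame of length N yields N // 2 + 1 complex frequency bins.
    """
    if window is None:
        window = np.hamming(len(frame))
    return np.fft.rfft(frame * window)

frame = np.ones(1024)                 # placeholder time-domain frame (assumed length)
spectrum = to_frequency_domain(frame)
print(spectrum.shape)                 # 513 one-sided frequency bins
```

Applied to the windowed first and second signals, this yields the first and second frequency domain signals whose real and imaginary parts feed the feature matrix.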
Further, in the embodiment of the present application, the detecting unit 34 is configured to, after obtaining a first frequency domain signal and a second frequency domain signal, and before determining the first frequency domain signal and the second frequency domain signal as sound source characteristic signals if the first frequency domain signal and the second frequency domain signal include a speech signal, perform a speech endpoint detection process on the first frequency domain signal and the second frequency domain signal, respectively, to obtain a first speech energy value corresponding to the first speech signal and a second speech energy value corresponding to the second speech signal.
Further, in an embodiment of the present application, the determining unit 33 is further configured to determine that the first frequency domain signal and the second frequency domain signal include the speech signal if the first speech energy value is greater than or equal to a preset energy threshold and the second speech energy value is greater than or equal to the preset energy threshold.
Further, in the embodiment of the present application, the determining unit 33 is specifically configured to input a plurality of sound source feature signals corresponding to a plurality of moments into the preset positioning model, and output a plurality of angle probability combinations corresponding to the plurality of sound source feature signals; wherein, the angle probability combination characterizes the corresponding relation between azimuth and probability; and determining the target azimuth based on the plurality of angle probability combinations and the preset angle calculation model.
Further, in an embodiment of the present application, the determining unit 33 is further specifically configured to determine, by using the angle calculation model, an azimuth angle corresponding to a maximum probability in the plurality of angle probability combinations as the target azimuth angle.
Further, in the embodiment of the present application, the executing unit 35 is configured to, after performing frequency domain conversion processing on the first voice signal and the second voice signal, obtain a first frequency domain signal and a second frequency domain signal, if the first frequency domain signal and the second frequency domain signal do not include a voice signal, continue to perform acquisition processing, frequency domain conversion processing, and voice endpoint detection processing of a next audio frame.
In an embodiment of the present application, further, fig. 9 is a schematic diagram of a second structure of a sound source positioning device according to an embodiment of the present application, as shown in fig. 9, the sound source positioning device 30 according to an embodiment of the present application may further include a processor 36, a memory 37 storing executable instructions of the processor 36, further, the sound source positioning device 30 may further include a communication interface 38, and a bus 39 for connecting the processor 36, the memory 37 and the communication interface 38.
In an embodiment of the present application, the processor 36 may be at least one of an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a Central Processing Unit (CPU), a controller, a microcontroller, and a microprocessor. It will be appreciated that the electronic device for implementing the above-described processor functions may differ for different apparatuses, and the embodiments of the present application are not particularly limited in this respect. The sound source positioning device 30 may further comprise the memory 37, which may be connected to the processor 36, wherein the memory 37 is adapted to store executable program code comprising computer operating instructions; the memory 37 may comprise a high-speed RAM memory and may further comprise a non-volatile memory, e.g., at least two disk memories.
In an embodiment of the application, the bus 39 is used to connect the communication interface 38, the processor 36, and the memory 37, and to carry the communication among these components.
In an embodiment of the application, memory 37 is used to store instructions and data.
Further, in the embodiment of the present application, the processor 36 is configured to: collect, through a first sound collecting module and a second sound collecting module, a first voice signal and a second voice signal corresponding to a sound source to be positioned, respectively; perform frequency domain conversion processing on the first voice signal and the second voice signal respectively to obtain a first frequency domain signal and a second frequency domain signal; if the first frequency domain signal and the second frequency domain signal include voice signals, determine the first frequency domain signal and the second frequency domain signal as sound source characteristic signals; and determine a target azimuth angle corresponding to the sound source to be positioned according to a preset positioning model, the sound source characteristic signals, and a preset angle calculation model, wherein the preset positioning model is used for determining probability values corresponding to different azimuth angles.
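The processing chain described for the processor 36 can be sketched end to end. Everything below is a hedged illustration under stated assumptions, not the patented implementation: the Hann window choice, the energy-based endpoint check, and the callable `model` returning an {azimuth: probability} mapping are all details filled in for the sketch (the patent only specifies windowing, a Fourier transform, an energy threshold comparison, and a preset positioning model).

```python
import numpy as np

def to_frequency_domain(frame):
    # Frequency domain conversion: windowing followed by a Fourier
    # transform (a Hann window is an assumption; the patent specifies
    # only "windowing" plus a Fourier transform).
    return np.fft.rfft(frame * np.hanning(len(frame)))

def includes_speech(freq_signal, energy_threshold):
    # Voice endpoint detection via an energy check, mirroring the
    # comparison of a voice energy value against a preset threshold.
    return np.sum(np.abs(freq_signal) ** 2) >= energy_threshold

def localize_frame(frame_a, frame_b, model, energy_threshold=1.0):
    """One audio frame through the pipeline: convert both channels,
    gate on speech presence, then query the positioning model."""
    fa, fb = to_frequency_domain(frame_a), to_frequency_domain(frame_b)
    if not (includes_speech(fa, energy_threshold) and
            includes_speech(fb, energy_threshold)):
        return None  # no speech: move on to the next audio frame
    # Stack both channels as the "sound source characteristic signal".
    feature = np.concatenate([np.abs(fa), np.abs(fb)])
    probabilities = model(feature)  # assumed to return {azimuth: probability}
    return max(probabilities, key=probabilities.get)
```

For a silent frame `localize_frame` returns `None`, mirroring the fallback of continuing with the acquisition, conversion, and endpoint detection of the next audio frame.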
In practical applications, the memory 37 may be a volatile memory, such as a random-access memory (RAM); or a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); or a combination of the above types of memories, and provides instructions and data to the processor 36.
In addition, each functional module in the present embodiment may be integrated into one processing unit, each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional module.
If the integrated unit is implemented in the form of a software functional module and is not sold or used as a separate product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present embodiment may be embodied, essentially or in part, in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform all or part of the steps of the method of the present embodiment. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The embodiment of the application provides a sound source positioning device, which respectively acquires a first voice signal and a second voice signal corresponding to a sound source to be positioned through a first sound acquisition module and a second sound acquisition module; frequency domain conversion processing is respectively carried out on the first voice signal and the second voice signal, and a first frequency domain signal and a second frequency domain signal are obtained; if the first frequency domain signal and the second frequency domain signal comprise voice signals, determining the first frequency domain signal and the second frequency domain signal as sound source characteristic signals; determining a target azimuth angle corresponding to a sound source to be positioned according to a preset positioning model, a sound source characteristic signal and a preset angle calculation model; the preset positioning model is used for determining probability values corresponding to different azimuth angles. That is, in the embodiment of the present application, after the sound source positioning device collects the audio frames sent by the sound source to be positioned through different sound collection modules and obtains the first voice signal and the second voice signal, the sound source positioning device may perform frequency domain conversion processing on the voice signals to convert the time domain voice signals into the first frequency domain signal and the second frequency domain signal. 
Furthermore, the sound source positioning device can determine the frequency domain signals that include voice signals as sound source characteristic signals, and then further determine the target azimuth angle corresponding to the sound source to be positioned based on the sound source characteristic signals, the preset positioning model capable of determining probability values for different azimuth angles of the sound source to be positioned, and the preset angle calculation model, thereby achieving accurate positioning of the sound source. In other words, in the sound source localization method provided by the application, frequency domain conversion processing is performed on the voice signals obtained through different sound collection channels, and only when the converted frequency domain signals include voice signals does the sound source positioning device, in combination with the preset positioning model, further determine the target azimuth angle corresponding to the sound source to be positioned. This greatly reduces the amount of data the preset positioning model must process while ensuring positioning accuracy, thereby achieving a good balance between positioning precision and computation. Moreover, because the computational cost of this sound source localization method is small, it is applicable to various small-sized devices, giving the localization method high application flexibility.
An embodiment of the present application provides a computer-readable storage medium having stored thereon a program which, when executed by a processor, implements the sound source localization method as described above.
Specifically, the program instructions corresponding to a sound source localization method in the present embodiment may be stored on a storage medium such as an optical disc, a hard disk, or a USB flash drive. When the program instructions corresponding to the sound source localization method in the storage medium are read and executed by an electronic device, the method includes the following steps:
collecting, through a first sound collecting module and a second sound collecting module, a first voice signal and a second voice signal corresponding to a sound source to be positioned, respectively;
performing frequency domain conversion processing on the first voice signal and the second voice signal respectively to obtain a first frequency domain signal and a second frequency domain signal;
if the first frequency domain signal and the second frequency domain signal comprise voice signals, determining the first frequency domain signal and the second frequency domain signal as sound source characteristic signals; and
determining a target azimuth angle corresponding to the sound source to be positioned according to a preset positioning model, the sound source characteristic signals, and a preset angle calculation model, wherein the preset positioning model is used for determining probability values corresponding to different azimuth angles.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart block or blocks and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart block or blocks and/or block diagram block or blocks.
The foregoing description is only of the preferred embodiments of the present application, and is not intended to limit the scope of the present application.

Claims (9)

1. A method of sound source localization, the method comprising:
collecting, through a first sound collecting module and a second sound collecting module, a first voice signal and a second voice signal corresponding to a sound source to be positioned, respectively;
performing frequency domain conversion processing on the first voice signal and the second voice signal respectively to obtain a first frequency domain signal and a second frequency domain signal;
if the first frequency domain signal and the second frequency domain signal comprise voice signals, determining the first frequency domain signal and the second frequency domain signal as sound source characteristic signals; and
determining a target azimuth angle corresponding to the sound source to be positioned according to a preset positioning model, the sound source characteristic signal and a preset angle calculation model; the preset positioning model is used for determining probability values corresponding to different azimuth angles;
wherein the determining, according to the preset positioning model, the sound source characteristic signal and the preset angle calculation model, of the target azimuth angle corresponding to the sound source to be positioned comprises:
inputting a plurality of sound source characteristic signals corresponding to a plurality of moments into the preset positioning model, and outputting a plurality of angle probability combinations corresponding to the plurality of sound source characteristic signals, wherein each angle probability combination characterizes a correspondence between azimuth angles and probabilities; and
determining the target azimuth angle based on the plurality of angle probability combinations and the preset angle calculation model.
2. The method of claim 1, wherein the first sound collection module and the second sound collection module are disposed in symmetrical positions.
3. The method of claim 1, wherein the performing frequency domain conversion processing on the first voice signal and the second voice signal respectively to obtain the first frequency domain signal and the second frequency domain signal comprises:
windowing the first voice signal and the second voice signal respectively to obtain a windowed first signal and a windowed second signal; and
performing Fourier transform processing on the windowed first signal and the windowed second signal respectively to obtain the first frequency domain signal and the second frequency domain signal.
4. The method of claim 1, wherein after obtaining the first frequency domain signal and the second frequency domain signal, and before determining the first frequency domain signal and the second frequency domain signal as sound source characteristic signals if the first frequency domain signal and the second frequency domain signal comprise voice signals, the method further comprises:
performing voice endpoint detection processing on the first frequency domain signal and the second frequency domain signal respectively to obtain a first voice energy value corresponding to the first voice signal and a second voice energy value corresponding to the second voice signal; and
if the first voice energy value is greater than or equal to a preset energy threshold and the second voice energy value is greater than or equal to the preset energy threshold, determining that the first frequency domain signal and the second frequency domain signal comprise the voice signal.
5. The method of claim 1, wherein the determining the target azimuth angle based on the plurality of angle probability combinations and the preset angle calculation model comprises:
determining, by using the preset angle calculation model, the azimuth angle corresponding to the maximum probability among the plurality of angle probability combinations as the target azimuth angle.
6. The method according to any one of claims 1 to 4, wherein after performing frequency domain conversion processing on the first voice signal and the second voice signal respectively to obtain the first frequency domain signal and the second frequency domain signal, the method further comprises:
if the first frequency domain signal and the second frequency domain signal do not comprise voice signals, continuing to perform the acquisition processing, frequency domain conversion processing, and voice endpoint detection processing of the next audio frame.
7. A sound source positioning device, characterized by comprising a collecting unit, a converting unit and a determining unit, wherein:
the collecting unit is used for collecting, through a first sound collecting module and a second sound collecting module, a first voice signal and a second voice signal corresponding to a sound source to be positioned, respectively;
the converting unit is used for performing frequency domain conversion processing on the first voice signal and the second voice signal respectively to obtain a first frequency domain signal and a second frequency domain signal;
The determining unit is configured to determine the first frequency domain signal and the second frequency domain signal as sound source characteristic signals if the first frequency domain signal and the second frequency domain signal include a speech signal; determining a target azimuth angle corresponding to the sound source to be positioned according to a preset positioning model, the sound source characteristic signal and a preset angle calculation model; the preset positioning model is used for determining probability values corresponding to different azimuth angles;
Wherein,
The determining unit is specifically configured to input a plurality of sound source characteristic signals corresponding to a plurality of moments into the preset positioning model and output a plurality of angle probability combinations corresponding to the plurality of sound source characteristic signals, wherein each angle probability combination characterizes a correspondence between azimuth angles and probabilities;
And determining the target azimuth angle based on the plurality of angle probability combinations and the preset angle calculation model.
8. A sound source localization device, comprising a processor and a memory storing instructions executable by the processor, wherein the instructions, when executed by the processor, implement the method of any one of claims 1-6.
9. A computer readable storage medium having stored thereon a program for use in a sound source localization device, wherein the program, when executed by a processor, implements the method according to any of claims 1-6.
CN202011340094.9A 2020-11-25 2020-11-25 Sound source positioning method and device and computer storage medium Active CN112466325B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011340094.9A CN112466325B (en) 2020-11-25 2020-11-25 Sound source positioning method and device and computer storage medium

Publications (2)

Publication Number Publication Date
CN112466325A CN112466325A (en) 2021-03-09
CN112466325B true CN112466325B (en) 2024-06-04

Family

ID=74808124

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011340094.9A Active CN112466325B (en) 2020-11-25 2020-11-25 Sound source positioning method and device and computer storage medium

Country Status (1)

Country Link
CN (1) CN112466325B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008145610A (en) * 2006-12-07 2008-06-26 Univ Of Tokyo Sound source separation and localization method
CN101295015A (en) * 2007-04-23 2008-10-29 财团法人工业技术研究院 Sound source locating system and method
CN103439688A (en) * 2013-08-27 2013-12-11 大连理工大学 Sound source positioning system and method used for distributed microphone arrays
CN108254721A (en) * 2018-04-13 2018-07-06 歌尔科技有限公司 A kind of positioning sound source by robot and robot
CN109285557A (en) * 2017-07-19 2019-01-29 杭州海康威视数字技术股份有限公司 A kind of orientation sound pick-up method, device and electronic equipment
CN109669159A (en) * 2019-02-21 2019-04-23 深圳市友杰智新科技有限公司 Auditory localization tracking device and method based on microphone partition ring array
CN110082726A (en) * 2019-04-10 2019-08-02 北京梧桐车联科技有限责任公司 Sound localization method and device, positioning device and storage medium
CN111323753A (en) * 2018-12-13 2020-06-23 蔚来汽车有限公司 Method for positioning voice source in automobile
CN111624553A (en) * 2020-05-26 2020-09-04 锐迪科微电子科技(上海)有限公司 Sound source positioning method and system, electronic equipment and storage medium
CN111722184A (en) * 2019-03-20 2020-09-29 英特尔公司 Method and system for detecting angle of arrival of sound
CN111885414A (en) * 2020-07-24 2020-11-03 腾讯科技(深圳)有限公司 Data processing method, device and equipment and readable storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7266433B2 (en) * 2019-03-15 2023-04-28 本田技研工業株式会社 Sound source localization device, sound source localization method, and program


Also Published As

Publication number Publication date
CN112466325A (en) 2021-03-09

Similar Documents

Publication Publication Date Title
US11620983B2 (en) Speech recognition method, device, and computer-readable storage medium
CN104103277B (en) A kind of single acoustics vector sensor target voice Enhancement Method based on time-frequency mask
CN107799126A (en) Sound end detecting method and device based on Supervised machine learning
CN110675887B (en) Multi-microphone switching method and system for conference system
WO2014161309A1 (en) Method and apparatus for mobile terminal to implement voice source tracking
CN110364143A (en) Voice awakening method, device and its intelligent electronic device
WO2021244056A1 (en) Data processing method and apparatus, and readable medium
TW202008352A (en) Method, device, audio interaction system, and storage medium for azimuth estimation
CN111868823B (en) Sound source separation method, device and equipment
US20230335148A1 (en) Speech Separation Method, Electronic Device, Chip, and Computer-Readable Storage Medium
WO2020043037A1 (en) Voice transcription device, system and method, and electronic device
CN113744750B (en) Audio processing method and electronic equipment
CN105827793B (en) A kind of speech-oriented output method and mobile terminal
CN111462764A (en) Audio encoding method, apparatus, computer-readable storage medium and device
US11068233B2 (en) Selecting a microphone based on estimated proximity to sound source
CN113949955A (en) Noise reduction processing method and device, electronic equipment, earphone and storage medium
CN110890100B (en) Voice enhancement method, multimedia data acquisition method, multimedia data playing method, device and monitoring system
US11636866B2 (en) Transform ambisonic coefficients using an adaptive network
CN114627899A (en) Sound signal detection method and device, computer readable storage medium and terminal
US11915718B2 (en) Position detection method, apparatus, electronic device and computer readable storage medium
CN112466325B (en) Sound source positioning method and device and computer storage medium
CN112104964B (en) Control method and control system of following type sound amplification robot
CN111144287B (en) Audiovisual auxiliary communication method, device and readable storage medium
WO2023088156A1 (en) Sound velocity correction method and apparatus
CN112073639A (en) Shooting control method and device, computer readable medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant