CN112509584A - Sound source position determining method and device and electronic equipment - Google Patents

Sound source position determining method and device and electronic equipment Download PDF

Info

Publication number
CN112509584A
CN112509584A (application CN202011405877.0A)
Authority
CN
China
Prior art keywords
sound
mixed
signals
source position
signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011405877.0A
Other languages
Chinese (zh)
Inventor
陈孝良
冯大航
常乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing SoundAI Technology Co Ltd
Original Assignee
Beijing SoundAI Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing SoundAI Technology Co Ltd filed Critical Beijing SoundAI Technology Co Ltd
Priority to CN202011405877.0A
Publication of CN112509584A
Legal status: Pending

Classifications

    • G — Physics
    • G10 — Musical instruments; acoustics
    • G10L — Speech analysis or synthesis; speech recognition; speech or voice processing; speech or audio coding or decoding
    • G10L15/26 — Speech recognition; speech-to-text systems
    • G10L21/0208 — Speech enhancement; noise filtering
    • G10L21/0272 — Speech enhancement; voice signal separating
    • G10L25/03 — Speech or voice analysis characterised by the type of extracted parameters
    • G10L25/27 — Speech or voice analysis characterised by the analysis technique
    • G10L25/51 — Speech or voice analysis specially adapted for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

Embodiments of the present disclosure provide a sound source position determining method and apparatus, an electronic device, and a computer-readable storage medium. The sound source position determining method comprises: acquiring a plurality of mixed sound signals collected by a plurality of microphones, wherein each microphone corresponds to one mixed sound signal; performing sound separation on the plurality of mixed sound signals to obtain a plurality of sound signals, wherein each sound signal corresponds to one sound source; performing preset-sound detection on the plurality of sound signals; and determining a sound source position from at least one sound signal in which the preset sound is detected. By separating the mixed sound and determining the sound source position from the preset sound detected in the separated signals, the method solves the prior-art problem that the sound source position cannot be judged accurately.

Description

Sound source position determining method and device and electronic equipment
Technical Field
The present disclosure relates to the field of speech recognition, and in particular, to a method and an apparatus for determining a sound source position, an electronic device, and a computer-readable storage medium.
Background
As a means of human-machine interaction, speech recognition plays a significant role in freeing up users' hands. More and more smart devices support voice wake-up, which has become a bridge between people and their devices, making keyword-spotting (KWS) based voice wake-up technology increasingly important.
The market for in-vehicle intelligent voice interaction keeps growing: while driving, making calls, sending text messages, playing music, navigation, and the like can all be controlled by voice. In-vehicle voice interaction requires microphones arranged inside the car to collect sound. Current in-vehicle microphone arrangements vary: some use a single microphone to pick up the speech signal, others use a microphone array composed of several microphones, and each arrangement calls for a different signal processing approach.
A vehicle generates considerable noise while driving, such as tire noise and wind noise outside the vehicle, and air-conditioning noise, engine noise, and other driving-environment noise inside it, so existing signal processing approaches generally apply noise suppression to the sound collected by the microphones to improve recognition accuracy. However, the prior art can only recognize whether the speech contains the wake-up word; it cannot accurately judge the position from which the wake-up word was spoken. Moreover, since several people may be speaking in the vehicle at once, wake-up word detection is not accurate enough. How to obtain the position of a preset sound from a mixed sound signal has therefore become an urgent problem to solve.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, an embodiment of the present disclosure provides a sound source position determining method, including:
acquiring a plurality of mixed sound signals collected by a plurality of microphones, wherein each microphone corresponds to one mixed sound signal;
performing sound separation on the plurality of mixed sound signals to obtain a plurality of sound signals, wherein each sound signal corresponds to a sound source;
performing preset sound detection on the plurality of sound signals;
and determining the sound source position of the sound signal according to at least one sound signal in which the preset sound is detected.
Further, the acquiring a plurality of mixed sound signals collected by a plurality of microphones includes:
acquiring N original sound signals collected by N microphones, wherein each of the N original sound signals is a mixed signal of N sound sources;
and denoising the N original sound signals to obtain M mixed sound signals corresponding to M microphones, wherein each of the M mixed sound signals is a mixed sound signal of M sound sources, M and N are positive integers greater than 1, and M is less than or equal to N.
Further, the sound separation of the mixed speech signals to obtain a plurality of sound signals includes:
and multiplying the plurality of mixed sound signals by a preset de-mixing matrix to obtain a plurality of sound signals, wherein each sound signal is the sum of products of the plurality of mixed sound signals and a de-mixing coefficient in the de-mixing matrix.
Further, the voice separating the plurality of mixed sound signals to obtain a plurality of sound signals includes:
and multiplying the M mixed sound signals by a preset de-mixing matrix corresponding to the M mixed sound signals to obtain M sound signals, wherein each of the M sound signals is the sum of products of the M mixed sound signals and a de-mixing coefficient in the de-mixing matrix.
Further, the determining a sound source position of the sound signal according to at least one sound signal in which a preset sound is detected includes:
in response to detecting a preset sound in at least one sound signal, calculating an energy value of the sound signal corresponding to the preset sound;
and taking the sound source position corresponding to the sound signal with the energy value higher than the energy threshold value as the sound source position of the sound signal.
Further, the calculating, in response to detecting a preset sound in the plurality of sound signals, an energy value of a sound signal corresponding to the preset sound includes:
calculating an energy value of each sound signal at each time point in the time domain, and storing the energy values in a memory;
in response to detecting the preset sound in the plurality of sound signals, acquiring a start time point and an end time point of the preset sound;
and retrieving from the memory the energy values between the start time point and the end time point.
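The steps above (compute an energy value for each time point of each separated signal, store it, then retrieve the energies between the detected start and end of the preset sound) can be sketched as follows. This is an illustrative sketch only; the frame length, sample rate, and detector output used here are assumptions, not values from the patent.

```python
import numpy as np

def frame_energies(signal: np.ndarray, frame_len: int = 256) -> np.ndarray:
    """Energy of each frame (time point) of one separated sound signal."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    return np.sum(frames ** 2, axis=1)

# "Memory": the energy values are stored as they are computed.
fs, frame_len = 16000, 256
sig = np.sin(2 * np.pi * 300 * np.arange(fs) / fs)   # 1 s hypothetical signal
memory = frame_energies(sig, frame_len)

# When the preset sound is detected, fetch the energies between its
# (hypothetical) start and end time points from the stored values.
start_s, end_s = 0.25, 0.75
start_f, end_f = int(start_s * fs / frame_len), int(end_s * fs / frame_len)
energy_of_preset_sound = memory[start_f:end_f].sum()
```

Storing energies continuously means the detector only needs to report an interval; no audio has to be re-processed after detection.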
Further, the determining, as the sound source position of the sound signal, the position of the sound source corresponding to the sound signal having the energy value higher than the energy threshold includes:
screening, from the sound signals whose energy value is higher than the energy threshold, the sound signals whose preset-sound confidence is greater than a preset sound threshold;
and taking the sound source position corresponding to the sound signal with the preset sound confidence coefficient larger than the preset sound threshold value as the sound source position of the sound signal.
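The two-stage screen described above (first an energy threshold, then a confidence threshold on the preset-sound detection) can be sketched as a small decision rule. All names, positions, and threshold values below are hypothetical, chosen only to illustrate the logic.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    position: str      # sound source position tied to this separated signal
    energy: float      # energy over the detected preset-sound interval
    confidence: float  # detector confidence that the preset sound is present

def wake_positions(cands, energy_thr=1.0, conf_thr=0.8):
    """Positions whose signal passes both the energy and confidence screens."""
    loud = [c for c in cands if c.energy > energy_thr]
    return [c.position for c in loud if c.confidence > conf_thr]

cands = [
    Candidate("driver",    energy=3.2, confidence=0.95),
    Candidate("passenger", energy=0.4, confidence=0.90),  # too quiet
    Candidate("rear-left", energy=2.1, confidence=0.30),  # low confidence
]
print(wake_positions(cands))  # ['driver']
```

Screening by energy first discards cross-talk leaked into a channel during separation; the confidence screen then rejects loud channels where the wake word was not actually spoken.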
Further, the unmixing matrix is obtained in advance by calculating the following steps:
playing a test sound signal at each sound source position to obtain a plurality of mixed test sound signals collected by a microphone;
converting the plurality of mixed test sound signals into a plurality of frequency domain mixed test sound signals by a short time fourier transform;
acquiring a calculation function of the unmixing matrix;
adding direction constraint into the calculation function of the unmixing matrix and selecting a prior probability density function to obtain a cost function;
and performing iterative computation according to the plurality of frequency domain mixed test sound signals and preset iteration times to obtain a de-mixing matrix which enables the value of the cost function to be minimum as the preset de-mixing matrix.
Further, the number of microphones is equal to the number of sound sources.
Further, the sound signal is a voice signal, the preset sound is a wake-up word, and a sound source position of the sound signal is a wake-up position.
Further, the method further comprises:
and executing the functional instruction related to the sound source position in the sound signal according to the sound source position of the sound signal.
In a second aspect, an embodiment of the present disclosure provides an apparatus for determining a sound source position, including:
the mixed sound signal acquisition module is used for acquiring a plurality of mixed sound signals acquired by a plurality of microphones, wherein each microphone corresponds to one mixed sound signal;
the mixed sound signal separation module is used for carrying out sound separation on the mixed sound signals to obtain a plurality of sound signals, wherein each sound signal corresponds to one sound source;
the preset sound detection module is used for carrying out preset sound detection on the plurality of sound signals;
and the sound source position determining module is used for determining the sound source position of the sound signal according to at least one sound signal in which the preset sound is detected.
Further, the mixed sound signal obtaining module is further configured to:
acquiring N original sound signals collected by N microphones, wherein each of the N sound signals is a mixed original sound signal of N sound sources;
and denoising the N original sound signals to obtain M mixed sound signals corresponding to M microphones, wherein each of the M mixed sound signals is a mixed sound signal of M sound sources, M and N are positive integers greater than 1, and M is less than or equal to N.
Further, the mixed sound signal separation module is further configured to:
and multiplying the plurality of mixed sound signals by a preset de-mixing matrix to obtain a plurality of sound signals, wherein each sound signal is the sum of products of the plurality of mixed sound signals and a de-mixing coefficient in the de-mixing matrix.
Further, the mixed sound signal separation module is further configured to:
and multiplying the M mixed sound signals by a preset de-mixing matrix corresponding to the M mixed sound signals to obtain M sound signals, wherein each of the M sound signals is the sum of products of the M mixed sound signals and a de-mixing coefficient in the de-mixing matrix.
Further, the sound source position determining module is further configured to:
in response to detecting a preset sound in at least one sound signal, calculating an energy value of the sound signal corresponding to the preset sound;
and taking the sound source position corresponding to the sound signal with the energy value higher than the energy threshold value as the sound source position of the sound signal.
Further, the sound source position determining module is further configured to:
calculating an energy value of each sound signal at each time point in the time domain, and storing the energy values in a memory;
in response to detecting the preset sound in the plurality of sound signals, acquiring a start time point and an end time point of the preset sound;
and retrieving from the memory the energy values between the start time point and the end time point.
Further, the sound source position determining module is further configured to:
screening, from the sound signals whose energy value is higher than the energy threshold, the sound signals whose preset-sound confidence is greater than a preset sound threshold;
and taking the sound source position corresponding to the sound signal with the preset sound confidence coefficient larger than the preset sound threshold value as the sound source position of the sound signal.
Further, the unmixing matrix is obtained in advance by calculating the following steps:
playing a test sound signal at each sound source position to obtain a plurality of mixed test sound signals collected by a microphone;
converting the plurality of mixed test sound signals into a plurality of frequency domain mixed test sound signals by a short time fourier transform;
acquiring a calculation function of the unmixing matrix;
adding direction constraint into the calculation function of the unmixing matrix and selecting a prior probability density function to obtain a cost function;
and performing iterative computation according to the plurality of frequency domain mixed test sound signals and preset iteration times to obtain a de-mixing matrix which enables the value of the cost function to be minimum as the preset de-mixing matrix.
Further, the number of microphones is equal to the number of sound sources.
Further, the sound signal is a voice signal, the preset sound is a wake-up word, and a sound source position of the sound signal is a wake-up position.
Further, the sound source position determination apparatus further includes:
and the function execution module is used for executing a function instruction related to the sound source position in the sound signal according to the sound source position of the sound signal.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: at least one processor; and
a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the preceding first aspects.
In a fourth aspect, the present disclosure provides a non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium stores computer instructions for causing a computer to execute the method of any one of the foregoing first aspects.
The foregoing is a summary of the present disclosure, and for the purposes of promoting a clear understanding of the technical means of the present disclosure, the present disclosure may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale.
Fig. 1 is a schematic view of an application scenario of an embodiment of the present disclosure;
fig. 2 is a schematic flowchart of a sound source position determining method according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram of a specific implementation manner of step S201 of a sound source position determining method provided by the embodiment of the present disclosure;
fig. 4 is a schematic diagram of the calculation steps of the unmixing matrix in the sound source position determining method of the embodiment of the present disclosure;
fig. 5 is a schematic diagram of a specific implementation manner of step S204 of a sound source position determining method provided by the embodiment of the disclosure;
fig. 6 is a schematic structural diagram of an embodiment of a voice wake-up apparatus according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of an electronic device provided according to an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that references to "a", "an", and "the" modifications in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will recognize that "one or more" may be used unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
Fig. 1 is a schematic view of an application scenario of the embodiment of the present disclosure. Fig. 1 shows an exemplary layout of vehicle-mounted microphones: 4 microphones are arranged in the car, above the driver's seat, the front passenger seat, and the two rear-row seats. The 4 microphones collect the sound signals in the vehicle simultaneously. When voice input is detected, the speech is sent to a speech recognition device for recognition; when the target speech (i.e., the wake-up word) is recognized, the wake-up position is determined from the position of the sound source of the recognized target speech, and the function corresponding to the wake-up word is executed, so that the position at which the function is executed can be determined in combination with the position of the wake-up word.
Fig. 2 is a flowchart of an embodiment of a sound source position determining method provided in this disclosure, which may be executed by a sound source position determining device implemented as software or as a combination of software and hardware, and which may be integrated in a certain device in a sound source position determining system, such as a sound source position determining server or a sound source position determining terminal device. As shown in fig. 2, the method comprises the steps of:
step S201, acquiring a plurality of mixed sound signals collected by a plurality of microphones, wherein each microphone corresponds to one mixed sound signal;
illustratively, the plurality of microphones are a plurality of microphones arranged in a specific space, such as a plurality of microphones arranged in a room or a vehicle. The embodiment of the present disclosure takes a microphone disposed in a vehicle as an example, but it is understood that the plurality of microphones may be a plurality of microphones disposed in any space, and is not limited herein.
In the embodiment of the present disclosure, a voice signal is used as a sound signal for description, but the sound signal in the technical scheme in the present disclosure is not limited to the voice signal, and any other sound signal may determine the sound source position using the scheme in the present disclosure, and is not described herein again.
In an embodiment of the present disclosure, the sound signal collected by each microphone is a mixture of the voice signals received from a plurality of sound sources. In the present disclosure, the number of microphones is equal to the number of sound sources. The mixed speech signal can be represented in the time domain by the following formula (1):

x_j(t) = Σ_{i=1}^{M} a_{ji} · s_i(t)    (1)

where x_j(t) is the mixed speech signal received by the j-th microphone at time t; a_{ji} is a weighting coefficient determined by the impulse response function; s_i(t) is the i-th sound source signal; and M is the number of microphones and sound sources.
As can be seen from formula (1), the mixed speech signal is determined in the time domain by the sound source signals and their weighting coefficients; the weighting coefficients are unknown, as are the sound source signals themselves.
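The time-domain mixing model of formula (1) amounts to a matrix product of the weighting coefficients with the stacked source signals, which the short numpy sketch below illustrates. The two sine-wave "sources" and the random coefficients are assumptions for demonstration only; in practice a_{ji} is set by the room's impulse responses.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 2                                   # number of sources == number of mics
t = np.arange(16000) / 16000.0          # 1 s at a 16 kHz sample rate

# Two hypothetical source signals s_i(t): sines at different pitches.
s = np.stack([np.sin(2 * np.pi * 220 * t),
              np.sin(2 * np.pi * 440 * t)])          # shape (M, T)

# Weighting coefficients a_ji (unknown in practice; random here).
A = rng.uniform(0.2, 1.0, size=(M, M))               # shape (M, M)

# Formula (1): x_j(t) = sum_i a_ji * s_i(t), i.e. a matrix product.
x = A @ s                                            # shape (M, T)

# Each microphone signal is a weighted sum of all sources.
assert np.allclose(x[0], A[0, 0] * s[0] + A[0, 1] * s[1])
```

Because both A and s are unknown, recovering the sources from x alone is the blind source separation problem that the unmixing matrix of step S202 addresses.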
Optionally, the step S201 further includes:
step S301, acquiring N original sound signals collected by N microphones, wherein each of the N sound signals is a mixed original signal of N sound sources;
step S302, performing noise reduction processing on the N original sound signals to obtain M mixed sound signals corresponding to M microphones, where each of the M mixed sound signals is a mixed sound signal of M sound sources, M and N are positive integers greater than 1, and M is less than or equal to N.
In this embodiment, the mixed original signals of the plurality of sound sources must first be denoised after acquisition. In an embodiment with N microphones and N sound sources, M mixed speech signals, each mixing the signals of M sound sources, are obtained after the noise reduction processing. Two cases are possible. If sound zones are not distinguished, each mixed speech signal after noise reduction contains the signals of all M sound sources. If sound zones are distinguished, the noise reduction processing can filter out quieter sounds as noise and retain only part of the sound source signals, the number of retained sound sources being equal to the number of microphones used in that zone. For example, suppose one microphone is arranged at each of the 4 seats in a car, 4 microphones in total, and each microphone receives the signals of the 4 seat sound sources. Without distinguishing sound zones, for each microphone the 4 sound source signals are denoised to obtain a mixture of the 4 source signals, so the numbers of microphones and sound sources are both 4. When sound zones are distinguished, the car can be divided into a front-seat zone and a rear-seat zone. A noise-filtering model then treats the rear-seat source signals received by the two front-seat microphones as noise and filters them out, keeping only the two front-seat sources; likewise for the rear seats, the front-seat source signals are filtered out as noise and only the two rear-seat sources are kept. For the two front-seat (or rear-seat) microphones, the collected mixed speech signals then contain only the front-seat (or rear-seat) source signals, and the numbers of microphones and sound sources per zone are both 2.
It will be appreciated that the noise reduction processing may use a pre-trained noise reduction model. Illustratively, the model is a deep neural network (DNN) trained on measured in-vehicle noise data, which can effectively suppress in-vehicle environmental noises such as tire noise and wind noise outside the vehicle, air-conditioning noise, engine noise, and quieter human voices inside the vehicle.
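The patent's noise reduction uses a trained DNN, whose details are not given here. As a much simpler stand-in that shows where such a step sits in the pipeline, the sketch below gates out low-magnitude spectral bins frame by frame; this is spectral gating, not the patent's model, and every parameter (FFT size, gate level) is an assumption.

```python
import numpy as np

def spectral_gate(x: np.ndarray, n_fft: int = 512, gate_db: float = -30.0) -> np.ndarray:
    """Crude noise suppression: zero each frame's FFT bins whose magnitude
    falls a fixed number of dB below that frame's peak bin.
    A stand-in for the patent's trained DNN noise-reduction model."""
    n_frames = len(x) // n_fft
    out = np.zeros(n_frames * n_fft)
    for i in range(n_frames):
        frame = x[i * n_fft:(i + 1) * n_fft]
        spec = np.fft.rfft(frame)
        mag = np.abs(spec)
        gate = 10 ** (gate_db / 20) * (mag.max() + 1e-12)
        spec[mag < gate] = 0.0                    # discard quiet bins as noise
        out[i * n_fft:(i + 1) * n_fft] = np.fft.irfft(spec, n=n_fft)
    return out
```

A learned model replaces the fixed gate with a mask predicted from the noisy spectrum, which is what lets it remove structured noises such as engine harmonics.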
Step S202, carrying out sound separation on the plurality of mixed sound signals to obtain a plurality of sound signals, wherein each sound signal corresponds to a sound source;
in this step, a plurality of voice signals, which are voice signals corresponding to a sound source, are separated from the plurality of mixed voice signals.
Optionally, the step S202 includes:
and multiplying the plurality of mixed sound signals by a preset de-mixing matrix to obtain a plurality of sound signals, wherein each sound signal is the sum of products of the plurality of mixed sound signals and a de-mixing coefficient in the de-mixing matrix.

The above formula (1) represents the mixed speech signal in the time domain; in actual processing it usually needs to be converted to the frequency domain. Assume that the number of microphones and the number of sound sources are both K, the time-frame index is n ∈ {1, 2, …, N}, where N is the total number of frames of a segment of speech, and the frequency index is f ∈ {1, 2, …, F}. The matrix of speech signals collected by the microphones can then be expressed as formula (2):

x_{f,n} = A_f s_{f,n}  (2)

where x_{f,n} is the result of the short-time Fourier transform of x, s_{f,n} is the result of the short-time Fourier transform of s, and A_f ∈ C^{K×K} is the matrix of acoustic transfer functions at frequency f. Suppose A_f is invertible; then there exists a matrix W_f such that:

y_{f,n} = W_f x_{f,n}  (3)

Then W_f is the de-mixing matrix of x_{f,n}, and y_{f,n} is the de-mixed sound source signal.

Thus, when the de-mixing matrix W_f has been calculated in advance, each mixed speech signal may be separated using it. Here W_f = [w_{f,1}, …, w_{f,K}]^H consists of K de-mixing filters, and the vector corresponding to the k-th filter is denoted w_{f,k} ∈ C^K. Thereby, the sum of products of the plurality of mixed speech signals x_{f,n} and the de-mixing coefficients w_{f,k}^H in the de-mixing matrix W_f is the speech signal of the separated sound source, y_{k,f,n} = w_{f,k}^H x_{f,n}.
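As a concrete illustration of the separation step above, the sketch below applies a precomputed de-mixing matrix per frequency bin to STFT frames. The array layout and the function name are our own assumptions, not part of the disclosure.

```python
import numpy as np

def separate_sources(X, W):
    """Separate sources with a precomputed de-mixing matrix.

    X : complex ndarray, shape (F, N, K) -- STFT of the K microphone
        signals (F frequency bins, N time frames).
    W : complex ndarray, shape (F, K, K) -- de-mixing matrix W_f per bin.
    Returns Y with shape (F, N, K), where each separated frame is
    y_{f,n} = W_f x_{f,n}.
    """
    F, N, K = X.shape
    Y = np.empty_like(X)
    for f in range(F):
        # Row n of X[f] is x_{f,n}; multiplying by W_f^T on the right
        # applies W_f to every frame vector at once.
        Y[f] = X[f] @ W[f].T
    return Y
```

When W_f is the exact inverse of the mixing matrix A_f, this recovers the original source spectra exactly.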
Optionally, the step S202 includes:
and multiplying the M mixed sound signals by a preset de-mixing matrix corresponding to the M mixed sound signals to obtain M sound signals, wherein each of the M sound signals is the sum of products of the M mixed sound signals and a de-mixing coefficient in the de-mixing matrix.
In this alternative embodiment, there are a plurality of preset de-mixing matrices, each corresponding to one sound zone with its M microphones and M sound sources. Exemplarily, the microphones are 4 microphones arranged above the 4 seats of a car, and the sound sources are the 4 sound sources at those seats; the cabin is divided into 2 sound zones: the two front-row microphones form one sound zone and use de-mixing matrix w1, and the two rear-row microphones form the other sound zone and use de-mixing matrix w2. Both w1 and w2 are de-mixing matrices pre-calculated for the specific microphones and specific sound source positions of their zone.
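A minimal sketch of this zone-wise variant, assuming the front/rear microphone grouping of the 4-microphone example; the data layout, function name, and zone format are illustrative assumptions.

```python
import numpy as np

def separate_by_zone(X, zones):
    """Separate each sound zone with its own de-mixing matrix.

    X     : ndarray, shape (F, N, M) -- STFT frames of all M microphones
            in the cabin.
    zones : list of (mic_indices, W) pairs, where mic_indices selects the
            m microphones of one zone and W has shape (F, m, m), e.g.
            ([0, 1], w1) for the front row and ([2, 3], w2) for the rear.
    Returns one (F, N, m) array of separated signals per zone.
    """
    outputs = []
    for mic_idx, W in zones:
        Xz = X[:, :, mic_idx]                    # this zone's mixtures only
        # y_{f,n} = W_f x_{f,n}, applied for every frame of the zone.
        Yz = np.einsum('fkm,fnm->fnk', W, Xz)
        outputs.append(Yz)
    return outputs
```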
Optionally, the step of pre-calculating the unmixing matrix is as follows:
step S401, playing a test sound signal at each sound source position to obtain a plurality of mixed test sound signals collected by a microphone;
step S402, converting the plurality of mixed test sound signals into a plurality of frequency domain mixed test sound signals through short-time Fourier transform;
step S403, acquiring a calculation function of the unmixing matrix;
step S404, adding direction constraint in the calculation function of the unmixing matrix and selecting a prior probability density function to obtain a cost function;
step S405, according to the plurality of frequency domain mixed test sound signals and preset iteration times, performing iterative computation to obtain a de-mixing matrix which enables the value of the cost function to be minimum as the preset de-mixing matrix.
In step S401, a test speech signal, such as a clean speech signal, is played at each sound source position, so that a plurality of mixed test speech signals collected by the plurality of microphones are obtained; these are time-domain signals, as shown in formula (1) above. In step S402, the time-domain signals obtained in step S401 are converted to the frequency domain by short-time Fourier transform, yielding a plurality of frequency-domain mixed test speech signals, as shown in formula (2) above. In step S403, an estimate of the de-mixing matrix is obtained by an approximation method, where the estimate is an approximate calculation function of the de-mixing matrix; for example, the calculation function of the de-mixing matrix obtained from the maximum a posteriori probability optimization problem is:

{W_f} = argmax_{{W_f} ∈ Ω} Σ_{k=1}^{K} Σ_{n=1}^{N} log p(y_{k,n}) + N Σ_{f=1}^{F} log |det W_f|  (4)

where Ω is the set of de-mixing matrices and y_{k,n} = [y_{k,1,n}, …, y_{k,F,n}]^T collects the frequency components of the k-th separated source in frame n. A sound source model G(y_{k,n}) = −log p(y_{k,n}) is introduced.
In step S404, after adding a direction constraint to formula (4) above and selecting the prior probability density function, a cost function of the de-mixing matrix is obtained:

J({W_f}) = Σ_{k=1}^{K} Σ_{n=1}^{N} G(y_{k,n}) − N Σ_{f=1}^{F} log |det W_f| + λ Σ_{f=1}^{F} Σ_{k=1}^{K} |w_{f,k}^H h_{f,k}(θ) − 1|²  (5)

where h_{f,k}(θ) is the steering vector toward direction θ, with elements of the form e^{−j2πf·d_k·sin θ / c}; d_k is the distance between the microphones, c is the speed of sound propagation in air, and θ is the direction of the sound source. λ is a preset weighting value for the direction-constraint term; a typical value is 1.
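As a hedged sketch of the direction constraint, the code below builds a far-field steering vector for a uniform linear array from the quantities named above (microphone spacing, speed of sound c, direction θ) and evaluates a quadratic penalty that is zero when a de-mixing filter has unit response toward θ. The exact constraint form in the disclosure is not shown, so this penalty shape is our assumption.

```python
import numpy as np

def steering_vector(f, d, theta, num_mics, c=343.0):
    """Far-field steering vector of a uniform linear array.

    f        : frequency in Hz
    d        : microphone spacing in metres
    theta    : source direction in radians (0 = broadside)
    num_mics : number of microphones
    c        : speed of sound propagation in air, m/s
    """
    m = np.arange(num_mics)
    # Relative phase delay e^{-j 2*pi*f * m*d*sin(theta) / c} per element.
    return np.exp(-1j * 2 * np.pi * f * m * d * np.sin(theta) / c)

def direction_penalty(w, h, lam=1.0):
    """Quadratic constraint term lam * |w^H h - 1|^2 for one filter w."""
    return lam * np.abs(np.conj(w) @ h - 1.0) ** 2
```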
In step S405, a predetermined method is used to iteratively calculate, according to the plurality of frequency-domain mixed test speech signals and a preset number of iterations, the de-mixing matrix that minimizes the value of formula (5), as the preset de-mixing matrix. The predetermined method may typically be a gradient descent method or the like; alternatively, in the present disclosure, the de-mixing matrix may be calculated using the following iterative method:
Let the weighted covariance matrix of the microphone signals be:

V_{k,f} = (1/N) Σ_{n=1}^{N} φ(r_{k,n}) x_{f,n} x_{f,n}^H  (6)

where φ(r_{k,n}) = G′(r_{k,n}) / r_{k,n} and r_{k,n} = (Σ_f |y_{k,f,n}|²)^{1/2}. Then,

w_{f,k} = (W_f V_{k,f})^{-1} e_k  (7)

where e_k represents the canonical unit vector with the k-th element equal to 1. By initializing W_f^{(0)}, for example to the identity matrix, w_{f,k}^{(l+1)} can be calculated by iterating formula (7), where l is the number of iterations. Finally, with the normalization:

w_{f,k} ← w_{f,k} / (w_{f,k}^H V_{k,f} w_{f,k})^{1/2}  (8)

w_{f,k} can be calculated, thereby obtaining W_f.
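The iterative method above (weighted covariance, unit-vector update, normalization) can be sketched as one pass of the well-known auxiliary-function (iterative-projection) update. We assume a Laplacian source model so that the weight is φ(r) = 1/r; the function name, data layout, and that particular weight choice are our assumptions, not the disclosure's.

```python
import numpy as np

def auxiva_update(X, W, eps=1e-8):
    """One auxiliary-function pass over all sources and frequency bins.

    X : complex ndarray, shape (F, N, K) -- mixed STFT frames.
    W : complex ndarray, shape (F, K, K) -- current de-mixing matrices
        (initialise with the identity matrix per bin).
    Returns the updated W.
    """
    F, N, K = X.shape
    # Separate with the current de-mixing matrices: y_{f,n} = W_f x_{f,n}.
    Y = np.einsum('fkm,fnm->fnk', W, X)
    # Frame-wise source magnitude r_{k,n}, pooled over all frequencies.
    r = np.sqrt(np.sum(np.abs(Y) ** 2, axis=0)) + eps        # shape (N, K)
    for k in range(K):
        phi = 1.0 / r[:, k]            # Laplacian-model weight phi(r) = 1/r
        for f in range(F):
            # Weighted covariance of the microphone signals.
            V = (X[f].T * phi) @ np.conj(X[f]) / N           # shape (K, K)
            # Unit-vector update: w = (W_f V)^{-1} e_k.
            w = np.linalg.solve(W[f] @ V, np.eye(K)[k])
            # Normalisation: w <- w / sqrt(w^H V w).
            w = w / np.sqrt(np.real(np.conj(w) @ V @ w) + eps)
            W[f, k] = np.conj(w)       # row k of W_f holds w_{f,k}^H
    return W
```

In practice the pass is repeated for the preset number of iterations until the cost stops decreasing.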
Step S203, carrying out preset sound detection on the plurality of sound signals;
in the embodiments of the present disclosure, the wake-up word is used as the preset sound for description, but the preset sound in the technical solution of the present disclosure is not limited to the wake-up word; any other preset sound may have its sound source position determined by the scheme of the present disclosure, which is not described herein again.
After the sound signal corresponding to each sound source is separated in step S202, wake-up word detection is performed on each sound signal. Illustratively, the detection can be performed by a pre-trained wake-up word detection model, trained on a large number of samples of the target wake-up word. Optionally, the output of the wake-up word detection model indicates whether a wake-up word is included. Optionally, the wake-up word detection model may be a two-classification model, which detects whether the speech contains one specific wake-up word and outputs whether that wake-up word is included; or the wake-up word detection model may be a multi-classification model, which detects whether the speech includes one of a plurality of wake-up words and outputs either the included wake-up word and its type, or that no wake-up word is included.
In one embodiment, when the output result of the wake-up word detection model is that a wake-up word is included, the model also outputs the confidence of the wake-up word, i.e. the probability that the speech includes the wake-up word. When classifying, the model classifies according to the calculated confidence of the wake-up word. For example, if the model is set to report that the speech contains a wake-up word when the confidence of that wake-up word is greater than 70%, and the calculated confidence of the wake-up word is 80%, the model outputs a recognition result identifying the wake-up word.
Step S204, determining the sound source position of the sound signal according to at least one sound signal of the detected preset sound.
The sound signals and the sound sources have a correspondence, which can be preset; for example, the first separated sound signal corresponds to the driver's seat, the second separated sound signal corresponds to the front passenger seat, and so on. Therefore, the sound source position of the sound signal can be determined from the sound signal in which the preset sound is detected; that is, the voice wake-up position can be determined from the detected speech signal.
However, in the scenario of the vehicle-mounted microphone, since the sound signal is complicated, a situation may occur in which the preset sound is erroneously recognized. To solve this problem, optionally, the step S204 includes:
step S501, responding to the detection of a preset sound in at least one sound signal, and calculating the energy value of the sound signal corresponding to the preset sound;
in step S502, the sound source position corresponding to the sound signal having the energy value higher than the energy threshold is set as the sound source position of the sound signal.
A speech signal has energy, and the louder the speech, the larger its energy. Therefore, to prevent misrecognition, whether a person actually uttered speech can be judged by detecting the energy of the sound signal, and low-energy sound signals are filtered out as misrecognitions, improving recognition accuracy. Illustratively, the energy value of a signal may be calculated in the time domain: if x_t is the amplitude of the speech signal at time point t, the energy of the speech signal over a time period is E = Σ_t x_t².
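A minimal sketch of this energy gate; the threshold is left as a parameter because the disclosure does not fix a value, and the function name is our assumption.

```python
import numpy as np

def is_valid_wake(segment, energy_threshold):
    """Return (valid, energy) for a detected wake-word segment.

    The segment energy is the sum of squared sample amplitudes,
    E = sum_t x_t^2; detections whose energy does not exceed the
    threshold are treated as likely misrecognitions.
    """
    x = np.asarray(segment, dtype=float)
    energy = float(np.sum(x ** 2))
    return energy > energy_threshold, energy
```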
And when the energy in the time period corresponding to the awakening word is larger than a preset energy threshold value, determining that the awakening word is effective, and using the position of the sound source corresponding to the voice signal as the voice awakening position.
Optionally, the step S501 further includes:
calculating an energy value of each sound signal at each time point in the time domain, and storing the energy values in a memory;
in response to detecting the preset sound in the plurality of sound signals, acquiring a start time point and an end time point of the preset sound;
and retrieving, from the memory, the energy values between the start time point and the end time point.
In this embodiment, the energy value of each speech signal at each time point in the time domain is calculated in real time and stored in a memory such as a buffer memory. When the voice signal is detected to include the awakening word, the position of the awakening word in the voice signal, namely the starting time point and the ending time point, can be acquired at this time, then the energy value corresponding to each time point between the two time points is acquired from the memory according to the starting time point and the ending time point, and the sum of the energy values is calculated to serve as the energy value of the awakening word.
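The buffering scheme described above might look like the sketch below: per-frame energies are appended in real time, and once the wake word's start and end frames become known, the energy of that span is read back and summed. The class name and frame indexing are illustrative assumptions.

```python
from collections import deque

class EnergyBuffer:
    """Ring buffer of per-frame energies, indexed by frame number.

    Energies are appended in real time; when a wake word is detected,
    the energy over its [start, end] frame span is read back.
    """
    def __init__(self, capacity):
        self.energies = deque(maxlen=capacity)
        self.next_index = 0            # index of the next frame to append

    def append(self, energy):
        self.energies.append(energy)
        self.next_index += 1

    def span_energy(self, start, end):
        """Sum of stored energies for frame indices start..end inclusive."""
        first = self.next_index - len(self.energies)   # oldest stored index
        lo = max(start, first)
        return sum(self.energies[i - first] for i in range(lo, end + 1))
```

A bounded capacity keeps memory constant; frames older than the buffer are simply unavailable, which is acceptable because the wake word was only just spoken.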
In another embodiment, even if the energy value is higher than the preset energy threshold, the wake-up word detection may still be wrong due to a misjudgment by the detection model; therefore, optionally, step S502 further includes:
screening out, from the sound signals with the energy value higher than the energy threshold, the sound signals with the confidence of the preset sound greater than a preset sound threshold;
and taking the sound source position corresponding to the sound signal with the preset sound confidence coefficient larger than the preset sound threshold value as the sound source position of the sound signal.
For example, suppose the microphones collect a mixed speech signal from 4 sound sources, and the passenger on the rear-left seat speaks a wake-up word loudly. Because the sound is relatively loud, the wake-up word is also detected in the sound signals corresponding to the driver's seat and the rear-right positions, but with different confidences: say the confidence of the wake-up word detected at the driver's seat is 70%, at the rear-right 75%, and at the rear-left 90%. A wake-up threshold on the wake-up word confidence can then be set, say 85%; the driver's seat and rear-right positions are filtered out by this threshold, and the rear-left position is finally determined as the voice wake-up position.
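The two-stage filtering (energy gate first, then the confidence threshold) can be sketched as follows; the detection-triple format is our assumption.

```python
def wake_positions(detections, energy_threshold, confidence_threshold):
    """Return the sound source positions that pass both gates.

    detections: iterable of (position, energy, confidence) triples, one
    per separated sound signal in which the wake word was detected.
    """
    return [pos for pos, energy, conf in detections
            if energy > energy_threshold and conf > confidence_threshold]
```

With confidences of 70%, 75%, and 90% and a wake-up threshold of 85%, only one position survives, matching the example above.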
It is understood that when there are a plurality of wake-up words in the voice signal that satisfy the condition of the energy value or satisfy the condition of the energy value and the confidence level, a plurality of positions can be determined as the wake-up positions of the voice.
Through the step S204, the emitting position of the preset sound can be more accurately determined, so that the function corresponding to the preset sound can be more accurately executed.
Further, the method further comprises: executing, according to the sound source position of the sound signal, the functional instruction related to the sound source position in the sound signal. For example, in a voice-control scenario in an automobile, when a user in the car says "open the window", the wake-up position of the wake-up voice can be determined and only the window corresponding to that position opened, instead of all windows; when the user says "turn down the air-conditioner temperature", it can be judged whether the user is located in the front row or the back row of the car, and the temperature can be controlled by zone.
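A toy dispatch showing how the determined wake-up position could scope command execution; the command strings, seat names, and zone mapping are all hypothetical.

```python
def execute_command(command, position):
    """Route a voice command to the zone of the wake-up position."""
    if command == "open the window":
        # Open only the window at the speaker's seat, not all windows.
        return f"open window at {position}"
    if command == "turn down the temperature":
        # Pick the climate zone from the seat position.
        zone = "front" if position in ("driver", "passenger") else "rear"
        return f"lower temperature in {zone} zone"
    return "unknown command"
```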
The embodiment of the disclosure discloses a sound source position determining method, which comprises the following steps: acquiring a plurality of mixed sound signals collected by a plurality of microphones, wherein each microphone corresponds to one mixed sound signal; performing sound separation on the plurality of mixed sound signals to obtain a plurality of sound signals, wherein each sound signal corresponds to a sound source; performing preset sound detection on the plurality of sound signals; determining a sound source position of the sound signal according to at least one sound signal of which a preset sound is detected. According to the method, the mixed sound is separated, and the sound source position is determined according to the preset sound detected by the separated sound, so that the technical problem that the sound source position cannot be accurately judged in the prior art is solved.
In the above, although the steps in the above method embodiments are described in the above sequence, it should be clear to those skilled in the art that the steps in the embodiments of the present disclosure are not necessarily performed in the above sequence, and may also be performed in other sequences such as reverse, parallel, and cross, and further, on the basis of the above steps, other steps may also be added by those skilled in the art, and these obvious modifications or equivalents should also be included in the protection scope of the present disclosure, and are not described herein again.
Fig. 6 is a schematic structural diagram of an embodiment of a sound source position determining apparatus provided in an embodiment of the present disclosure. As shown in fig. 6, the apparatus 600 includes: a mixed sound signal obtaining module 601, a mixed sound signal separating module 602, a preset sound detecting module 603, and a sound source position determining module 604. Specifically:
a mixed sound signal obtaining module 601, configured to obtain multiple mixed sound signals collected by multiple microphones, where each microphone corresponds to one mixed sound signal;
a mixed sound signal separation module 602, configured to perform sound separation on the multiple mixed sound signals to obtain multiple sound signals, where each sound signal corresponds to a sound source;
a preset sound detection module 603, configured to perform preset sound detection on the multiple sound signals;
a sound source position determining module 604, configured to determine a sound source position of the sound signal according to at least one sound signal detected as the preset sound.
Further, the mixed sound signal obtaining module 601 is further configured to:
acquiring N original sound signals collected by N microphones, wherein each of the N original sound signals is a mixed original signal of N sound sources;
and denoising the N original sound signals to obtain M mixed sound signals corresponding to M microphones, wherein each of the M mixed sound signals is a mixed sound signal of M sound sources, M and N are positive integers greater than 1, and M is less than or equal to N.
Further, the mixed sound signal separation module 602 is further configured to:
and multiplying the plurality of mixed sound signals by a preset de-mixing matrix to obtain a plurality of sound signals, wherein each sound signal is the sum of products of the plurality of mixed sound signals and a de-mixing coefficient in the de-mixing matrix.
Further, the mixed sound signal separation module 602 is further configured to:
and multiplying the M mixed sound signals by a preset de-mixing matrix corresponding to the M mixed sound signals to obtain M sound signals, wherein each of the M sound signals is the sum of products of the M mixed sound signals and a de-mixing coefficient in the de-mixing matrix.
Further, the sound source position determining module 604 is further configured to:
in response to detecting a preset sound in at least one sound signal, calculating an energy value of the sound signal corresponding to the preset sound;
and taking the sound source position corresponding to the sound signal with the energy value higher than the energy threshold value as the sound source position of the sound signal.
Further, the sound source position determining module 604 is further configured to:
calculating an energy value of each sound signal at each time point in the time domain, and storing the energy values in a memory;
in response to detecting the preset sound in the plurality of sound signals, acquiring a start time point and an end time point of the preset sound;
and retrieving, from the memory, the energy values between the start time point and the end time point.
Further, the sound source position determining module 604 is further configured to:
screening out, from the sound signals with the energy value higher than the energy threshold, the sound signals with the confidence of the preset sound greater than a preset sound threshold;
and taking the sound source position corresponding to the sound signal with the preset sound confidence coefficient larger than the preset sound threshold value as the sound source position of the sound signal.
Further, the unmixing matrix is obtained in advance by calculating the following steps:
playing a test sound signal at each sound source position to obtain a plurality of mixed test sound signals collected by a microphone;
converting the plurality of mixed test sound signals into a plurality of frequency domain mixed test sound signals by a short time fourier transform;
acquiring a calculation function of the unmixing matrix;
adding direction constraint into the calculation function of the unmixing matrix and selecting a prior probability density function to obtain a cost function;
and performing iterative computation according to the plurality of frequency domain mixed test sound signals and preset iteration times to obtain a de-mixing matrix which enables the value of the cost function to be minimum as the preset de-mixing matrix.
Further, the number of microphones is equal to the number of sound sources.
Further, the sound signal is a voice signal, the preset sound is a wake-up word, and a sound source position of the sound signal is a wake-up position.
Further, the sound source position determination apparatus further includes:
and the function execution module is used for executing a function instruction related to the sound source position in the sound signal according to the sound source position of the sound signal.
The apparatus shown in fig. 6 can perform the method of the embodiment shown in fig. 2-5, and the detailed description of this embodiment can refer to the related description of the embodiment shown in fig. 2-5. The implementation process and technical effect of the technical solution refer to the descriptions in the embodiments shown in fig. 2 to fig. 5, and are not described herein again.
Referring now to FIG. 7, shown is a schematic diagram of an electronic device 700 suitable for use in implementing embodiments of the present disclosure. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 7, electronic device 700 may include a processing means (e.g., central processing unit, graphics processor, etc.) 701 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)702 or a program loaded from storage 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data necessary for the operation of the electronic apparatus 700 are also stored. The processing device 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Generally, the following devices may be connected to the I/O interface 705: input devices 706 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 707 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 708 including, for example, magnetic tape, hard disk, etc.; and a communication device 709. The communication means 709 may allow the electronic device 700 to communicate wirelessly or by wire with other devices to exchange data. While fig. 7 illustrates an electronic device 700 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such embodiments, the computer program may be downloaded and installed from a network via the communication means 709, or may be installed from the storage means 708, or may be installed from the ROM 702. The computer program, when executed by the processing device 701, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communications network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: the sound source position determination method described above is performed.
Computer program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. Where the name of an element does not in some cases constitute a limitation on the element itself.
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing description is only an explanation of the preferred embodiments of the disclosure and the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure is not limited to the particular combination of features described above, and also encompasses other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example, technical solutions formed by replacing the above features with (but not limited to) features with similar functions disclosed in this disclosure.

Claims (14)

1. A method for determining a location of a sound source, comprising:
acquiring a plurality of mixed sound signals collected by a plurality of microphones, wherein each microphone corresponds to one mixed sound signal;
performing sound separation on the plurality of mixed sound signals to obtain a plurality of sound signals, wherein each sound signal corresponds to a sound source;
performing preset sound detection on the plurality of sound signals;
and determining the sound source position of the sound signal according to at least one sound signal of the detected preset sound.
2. The sound source position determination method according to claim 1, wherein the acquiring a plurality of mixed sound signals collected by a plurality of microphones comprises:
acquiring N original sound signals collected by N microphones, wherein each of the N original sound signals is a mixed original signal of N sound sources;
and denoising the N original sound signals to obtain M mixed sound signals corresponding to M microphones, wherein each of the M mixed sound signals is a mixed sound signal of M sound sources, M and N are positive integers greater than 1, and M is less than or equal to N.
3. The sound source position determination method according to claim 1, wherein the performing sound separation on the plurality of mixed sound signals to obtain a plurality of sound signals comprises:
and multiplying the plurality of mixed sound signals by a preset de-mixing matrix to obtain a plurality of sound signals, wherein each sound signal is the sum of products of the plurality of mixed sound signals and a de-mixing coefficient in the de-mixing matrix.
4. The sound source position determination method according to claim 2, wherein the performing sound separation on the plurality of mixed sound signals to obtain a plurality of sound signals comprises:
and multiplying the M mixed sound signals by a preset de-mixing matrix corresponding to the M mixed sound signals to obtain M sound signals, wherein each of the M sound signals is the sum of products of the M mixed sound signals and a de-mixing coefficient in the de-mixing matrix.
5. The sound source position determination method according to claim 1, wherein the determining of the sound source position of the sound signal based on at least one sound signal in which a preset sound is detected comprises:
in response to detecting a preset sound in at least one sound signal, calculating an energy value of the sound signal corresponding to the preset sound;
and taking the sound source position corresponding to the sound signal with the energy value higher than the energy threshold value as the sound source position of the sound signal.
6. The sound source position determination method according to claim 5, wherein the calculating, in response to detecting the preset sound in at least one sound signal, the energy value of the sound signal corresponding to the preset sound comprises:
calculating an energy value of each sound signal at each time point in the time domain, and storing the energy values in a memory;
in response to detecting the preset sound in at least one sound signal, acquiring a start time point and an end time point of the preset sound;
and retrieving from the memory the energy values between the start time point and the end time point.
7. The sound source position determination method according to claim 6, wherein the taking the sound source position corresponding to the sound signal with the energy value higher than the energy threshold as the sound source position of the sound signal comprises:
screening, from the sound signals with energy values higher than the energy threshold, sound signals whose preset sound confidence is greater than a preset sound threshold;
and taking the sound source position corresponding to a sound signal whose preset sound confidence is greater than the preset sound threshold as the sound source position of the sound signal.
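Claims 5-7 combine a cached per-frame energy with a detection confidence to pick the source. A sketch of that selection logic, with all function names, thresholds, and the frame-indexed detection window being illustrative assumptions:

```python
import numpy as np

def frame_energy(signal, frame_len=256):
    """Per-frame energy of a time-domain signal (the values claim 6 caches)."""
    n = len(signal) // frame_len
    frames = np.asarray(signal[: n * frame_len], dtype=float).reshape(n, frame_len)
    return (frames ** 2).sum(axis=1)

def pick_sources(energies, confidences, start, end, energy_thresh, conf_thresh):
    """Return indices of separated signals passing both screens.

    energies: list of per-frame energy arrays, one per separated signal.
    confidences: preset-sound (e.g. wake-word) confidence per signal.
    start, end: frame indices bounding the detected preset sound.
    """
    candidates = []
    for i, (e, c) in enumerate(zip(energies, confidences)):
        seg = e[start:end].sum()  # energy inside the detection window only
        if seg > energy_thresh and c > conf_thresh:
            candidates.append(i)
    return candidates
```

The energy screen discards quiet channels where the detector may have fired on leakage from another source; the confidence screen then resolves ties among the loud ones.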
8. The sound source position determination method according to claim 3, wherein the preset unmixing matrix is calculated in advance by:
playing a test sound signal at each sound source position to obtain a plurality of mixed test sound signals collected by the plurality of microphones;
converting the plurality of mixed test sound signals into a plurality of frequency-domain mixed test sound signals by a short-time Fourier transform;
acquiring a calculation function of the unmixing matrix;
adding a direction constraint to the calculation function of the unmixing matrix and selecting a prior probability density function to obtain a cost function;
and performing iterative calculation on the plurality of frequency-domain mixed test sound signals for a preset number of iterations to obtain, as the preset unmixing matrix, the unmixing matrix that minimizes the value of the cost function.
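Claim 8's offline estimation can be illustrated with a toy frequency-domain ICA loop under a Laplacian-like prior. This sketch omits the claim's direction constraint and uses a generic natural-gradient update; the function, step size, and cost are assumptions standing in for the unspecified calculation function:

```python
import numpy as np

def estimate_unmixing(X, n_iter=20, step=0.05, eps=1e-9):
    """Toy iterative estimate of per-frequency unmixing matrices.

    X: complex STFT of the mixed test signals, shape (F, M, T)
       (F frequency bins, M channels, T frames).
    Runs a fixed number of iterations (the claim's preset iteration
    count) and returns the matrices that gave the smallest cost, here
    a negative log-likelihood under a Laplacian-like prior.
    """
    F, M, T = X.shape
    W = np.tile(np.eye(M, dtype=complex), (F, 1, 1))
    best_W, best_cost = W.copy(), np.inf
    I = np.eye(M)
    for _ in range(n_iter):
        cost = 0.0
        for f in range(F):
            Y = W[f] @ X[f]                # current source estimates
            phi = Y / (np.abs(Y) + eps)    # score of the Laplacian prior
            # Natural-gradient step toward statistically independent outputs.
            W[f] = W[f] + step * (I - (phi @ Y.conj().T) / T) @ W[f]
            Y = W[f] @ X[f]
            cost += np.abs(Y).mean() - np.log(np.abs(np.linalg.det(W[f])) + eps)
        if cost < best_cost:
            best_cost, best_W = cost, W.copy()
    return best_W
```

A production system would add the direction constraint from the known test-source positions and resolve the per-bin permutation ambiguity (as in IVA); this only shows the STFT-in, best-cost-matrix-out skeleton.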
9. The sound source position determination method according to claim 1, wherein the number of microphones is equal to the number of sound sources.
10. The sound source position determination method according to any one of claims 1 to 9, wherein the sound signal is a voice signal, the preset sound is a wake-up word, and the sound source position of the sound signal is a wake-up position.
11. The sound source position determination method according to claim 1, characterized by further comprising:
and executing the functional instruction related to the sound source position in the sound signal according to the sound source position of the sound signal.
12. An apparatus for determining a position of a sound source, comprising:
the mixed sound signal acquisition module is used for acquiring a plurality of mixed sound signals acquired by a plurality of microphones, wherein each microphone corresponds to one mixed sound signal;
the mixed sound signal separation module is used for carrying out sound separation on the mixed sound signals to obtain a plurality of sound signals, wherein each sound signal corresponds to one sound source;
the preset sound detection module is used for carrying out preset sound detection on the plurality of sound signals;
and the sound source position determining module is used for determining the sound source position of the sound signal according to at least one sound signal of the detected preset sound.
13. An electronic device, comprising: a memory configured to store computer readable instructions; and
a processor configured to execute the computer readable instructions, wherein the processor, when executing the computer readable instructions, implements the method of any one of claims 1-11.
14. A non-transitory computer readable storage medium storing computer readable instructions which, when executed by a computer, cause the computer to perform the method of any one of claims 1-11.
CN202011405877.0A 2020-12-03 2020-12-03 Sound source position determining method and device and electronic equipment Pending CN112509584A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011405877.0A CN112509584A (en) 2020-12-03 2020-12-03 Sound source position determining method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN112509584A true CN112509584A (en) 2021-03-16

Family

ID=74969946

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011405877.0A Pending CN112509584A (en) 2020-12-03 2020-12-03 Sound source position determining method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN112509584A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109308909A (en) * 2018-11-06 2019-02-05 北京智能管家科技有限公司 A kind of signal separating method, device, electronic equipment and storage medium
CN109410978A (en) * 2018-11-06 2019-03-01 北京智能管家科技有限公司 A kind of speech signal separation method, apparatus, electronic equipment and storage medium
CN110673096A (en) * 2019-09-30 2020-01-10 北京地平线机器人技术研发有限公司 Voice positioning method and device, computer readable storage medium and electronic equipment
CN110992977A (en) * 2019-12-03 2020-04-10 北京声智科技有限公司 Method and device for extracting target sound source
US20200184994A1 (en) * 2018-12-07 2020-06-11 Nuance Communications, Inc. System and method for acoustic localization of multiple sources using spatial pre-filtering

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113380267A (en) * 2021-04-30 2021-09-10 深圳地平线机器人科技有限公司 Method and device for positioning sound zone, storage medium and electronic equipment
CN113380267B (en) * 2021-04-30 2024-04-19 深圳地平线机器人科技有限公司 Method and device for positioning voice zone, storage medium and electronic equipment
CN113223548A (en) * 2021-05-07 2021-08-06 北京小米移动软件有限公司 Sound source positioning method and device
CN113362864A (en) * 2021-06-16 2021-09-07 北京字节跳动网络技术有限公司 Audio signal processing method, device, storage medium and electronic equipment
CN113362864B (en) * 2021-06-16 2022-08-02 北京字节跳动网络技术有限公司 Audio signal processing method, device, storage medium and electronic equipment
DE102021120246A1 (en) 2021-08-04 2023-02-09 Bayerische Motoren Werke Aktiengesellschaft voice recognition system
CN114678021A (en) * 2022-03-23 2022-06-28 小米汽车科技有限公司 Audio signal processing method and device, storage medium and vehicle
CN114678021B (en) * 2022-03-23 2023-03-10 小米汽车科技有限公司 Audio signal processing method and device, storage medium and vehicle
CN115440208A (en) * 2022-04-15 2022-12-06 北京罗克维尔斯科技有限公司 Vehicle control method, device, equipment and computer readable storage medium
CN115346527A (en) * 2022-08-08 2022-11-15 科大讯飞股份有限公司 Voice control method, device, system, vehicle and storage medium
WO2024061372A1 (en) * 2022-09-23 2024-03-28 中国第一汽车股份有限公司 Data transmission method and device, storage medium, and target vehicle

Similar Documents

Publication Publication Date Title
CN112509584A (en) Sound source position determining method and device and electronic equipment
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
CN110556103B (en) Audio signal processing method, device, system, equipment and storage medium
CN105448303B (en) Voice signal processing method and device
CN113986187B (en) Audio region amplitude acquisition method and device, electronic equipment and storage medium
US20220139389A1 (en) Speech Interaction Method and Apparatus, Computer Readable Storage Medium and Electronic Device
CN108335694B (en) Far-field environment noise processing method, device, equipment and storage medium
US9311930B2 (en) Audio based system and method for in-vehicle context classification
CN111343410A (en) Mute prompt method and device, electronic equipment and storage medium
CN114678021B (en) Audio signal processing method and device, storage medium and vehicle
CN111883135A (en) Voice transcription method and device and electronic equipment
CN113628612A (en) Voice recognition method and device, electronic equipment and computer readable storage medium
CN110767215A (en) Method and device for training voice recognition model and recognizing voice
CN113205820A (en) Method for generating voice coder for voice event detection
CN115331656A (en) Non-instruction voice rejection method, vehicle-mounted voice recognition system and automobile
US10757248B1 (en) Identifying location of mobile phones in a vehicle
CN116913258B (en) Speech signal recognition method, device, electronic equipment and computer readable medium
US10991363B2 (en) Priors adaptation for conservative training of acoustic model
CN112382266A (en) Voice synthesis method and device, electronic equipment and storage medium
CN109243457B (en) Voice-based control method, device, equipment and storage medium
CN111653271B (en) Sample data acquisition and model training method and device and computer equipment
JP2019124976A (en) Recommendation apparatus, recommendation method and recommendation program
CN113763976B (en) Noise reduction method and device for audio signal, readable medium and electronic equipment
CN117063229A (en) Interactive voice signal processing method, related equipment and system
CN110941455B (en) Active wake-up method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination