CN108538306B - Method and device for improving DOA estimation of voice equipment - Google Patents

Method and device for improving DOA estimation of voice equipment

Info

Publication number
CN108538306B
Authority
CN
China
Prior art keywords
voice
microphone signals
frame
signal
estimation value
Prior art date
Legal status
Active
Application number
CN201711498690.8A
Other languages
Chinese (zh)
Other versions
CN108538306A (en)
Inventor
朱振岭
陈孝良
冯大航
苏少炜
Current Assignee
Beijing SoundAI Technology Co Ltd
Original Assignee
Beijing SoundAI Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing SoundAI Technology Co Ltd filed Critical Beijing SoundAI Technology Co Ltd
Priority to CN201711498690.8A priority Critical patent/CN108538306B/en
Publication of CN108538306A publication Critical patent/CN108538306A/en
Application granted granted Critical
Publication of CN108538306B publication Critical patent/CN108538306B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • G: PHYSICS
    • G01: MEASURING; TESTING
    • G01S: RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S3/00: Direction-finders for determining the direction from which infrasonic, sonic, ultrasonic, or electromagnetic waves, or particle emission, not having a directional significance, are being received
    • G01S3/80: Direction-finders for determining the direction from which infrasonic, sonic, ultrasonic, or electromagnetic waves, or particle emission, not having a directional significance, are being received using ultrasonic, sonic or infrasonic waves
    • G01S3/802: Systems for determining direction or deviation from predetermined direction
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L21/0232: Processing in the frequency domain
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78: Detection of presence or absence of voice signals
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161: Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166: Microphone arrays; Beamforming
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78: Detection of presence or absence of voice signals
    • G10L2025/783: Detection of presence or absence of voice signals based on threshold decision

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • General Physics & Mathematics (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)

Abstract

The invention provides a method for improving DOA estimation of a voice device, which comprises the following steps: collecting microphone signals when the voice device is woken up, and determining the voice wake-up confidence of each frame; determining a wideband azimuth spectrum function of each frame according to the microphone signals, and determining an angle estimation value of each frame; and determining the statistical result of each angle estimation value according to the voice wake-up confidence, the angle estimation value with the largest statistical result being the DOA estimation result. The statistical result of an angle estimation value is either the count of frames, among those whose voice wake-up confidence exceeds a threshold, that correspond to that angle estimation value, or the sum of the voice wake-up confidences of the frames corresponding to the same angle estimation value, or the sum of the products of those voice wake-up confidences and the wideband azimuth spectrum function. In this way the azimuth of the voice signal source can be determined more accurately, the signal-to-noise ratio is improved, and damage to the speech is reduced.

Description

Method and device for improving DOA estimation of voice equipment
Technical Field
The invention relates to the field of voice processing, in particular to a method and a device for improving DOA estimation of voice equipment.
Background
Intelligent voice hardware devices, such as smart speakers and robots, are now widely used. These smart voice devices generally perform speech recognition after signal processing with a microphone array, thereby improving the recognition rate under far-field conditions. Their functions generally include being woken up by a specific keyword, finding the direction of the speaker after wake-up, performing speech enhancement toward the speaker's direction, and being interrupted by the speaker with a wake-up word while music or speech is playing. The main techniques involved therefore include echo cancellation, direction-of-arrival (DOA) estimation, beamforming, dereverberation, and so on.
One problem with current intelligent voice interaction devices is a low far-field recognition rate. Recognition depends on the quality of the speech signal: the recognition rate for clean speech is high, while the recognition rate for far-field speech affected by reverberation, noise, and interference is low. This is because the current approach performs DOA estimation at wake-up time, and when multiple sound sources are present simultaneously, or in the presence of strong reflecting surfaces such as walls or display screens, the wake-up-time DOA estimation (estimation of the direction of arrival of the sound wave based on the array) is inaccurate, so the subsequent beamforming stage erroneously cancels the speaker's voice as noise and the device cannot understand the speaker's instruction.
Disclosure of Invention
Technical problem to be solved
The present invention is directed to a method and an apparatus for improving DOA estimation of a speech device, so as to solve at least one of the above technical problems.
(II) technical scheme
In one aspect of the present invention, a method for improving DOA estimation of a speech device is provided, including:
collecting microphone signals when a voice device is awakened, and determining the voice awakening confidence coefficient of each frame of microphone signals;
determining a broadband azimuth spectrum function of each frame according to the microphone signals, and determining an angle estimation value of each frame of microphone signals; and
determining the statistical result of each angle estimation value according to the voice awakening confidence coefficient, wherein the angle estimation value with the largest statistical result is the DOA estimation result.
In some embodiments of the present invention, determining the statistical result of each angle estimation value according to the voice wakeup confidence level comprises the sub-steps of:
setting a threshold value;
removing the frame microphone signals with the voice awakening confidence coefficient smaller than the threshold value, and reserving the frame microphone signals with the voice awakening confidence coefficient larger than or equal to the threshold value so as to determine reserved frame microphone signals; and
determining the statistical result of the frame microphone signals corresponding to the same angle estimation value among the reserved frame microphone signals.
In some embodiments of the present invention, the statistical result of each angle estimation value refers to a sum of the voice wakeup confidences of the frame microphone signals corresponding to the same angle estimation value, or a sum of products of the voice wakeup confidences of the frame microphone signals corresponding to the same angle estimation value and the wideband azimuth spectrum function.
In some embodiments of the present invention, the microphone signals are obtained by a microphone array of the voice device; the number of microphones of the microphone array is N, and the microphone signals are
X(t) = [x_1(t), ..., x_N(t)]^T
where t is the time-domain index, the superscript T denotes transposition, and N ≥ 1.
In some embodiments of the invention, before determining the wideband azimuth spectral function from the microphone signals, the method further comprises the steps of:
performing a Fourier transform on the microphone signals to determine the frequency-domain microphone signals:
X(k) = [x_1(k), ..., x_N(k)]^T, k = 1, ..., K
where k is the frequency-bin index and K ≥ 1.
In some embodiments of the present invention, the number of voice signal sources in the microphone signals is A and the number of interference signal sources is D - A, where D is the total number of signal sources and D ≥ A; the microphone signals contain a noise signal, a voice signal, and an interference signal, and are written as
X(k) = A(k, Θ_D)S(k) + N(k)
where the N × D array manifold matrix is A(k, Θ_D) = [a(k, θ_1), ..., a(k, θ_m), ..., a(k, θ_D)], a(k, θ_m) is an array manifold vector with m ≤ D, the source signal vector is S(k) = [s(k, θ_1), ..., s(k, θ_m), ..., s(k, θ_D)]^T, Θ_D = [θ_1, ..., θ_m, ..., θ_D] denotes the set of D source azimuths, and N(k) = [n_1(k), ..., n_m(k), ..., n_N(k)]^T is the noise signal.
In some embodiments of the invention, determining a wideband azimuth spectral function from the microphone signals comprises the sub-steps of:
determining a data covariance matrix according to the microphone signals;
decomposing the data covariance matrix to determine a voice subspace and a noise subspace; and
determining the broadband azimuth spectrum function according to the voice subspace and the noise subspace.
In some embodiments of the invention, the data covariance matrix is:
R(k)_xx = E{X(k)X(k)^H} = R(k)_ss + R(k)_nn
where R(k)_ss = E{S(k)S(k)^H} and R(k)_nn = E{N(k)N(k)^H} are the speech-signal covariance matrix and the noise-signal covariance matrix, respectively, and the superscript H denotes conjugate transpose;
decomposing the data covariance matrix gives R(k)_xx = E Λ E^H, where Λ is the diagonal matrix of eigenvalues in descending order and E(k) = [E(k)_s, E(k)_n] holds the corresponding eigenvectors; E(k)_s and E(k)_n are the signal subspace and the noise subspace, composed of the eigenvectors corresponding to the D larger eigenvalues and the N - D smaller eigenvalues, respectively;
the frequency azimuth spectrum function corresponding to frequency k is:
P(k, θ_m) = 1 / (a^H(k, θ_m) E(k)_n E(k)_n^H a(k, θ_m)), θ_m ∈ Θ_D
and averaging all frequency azimuth spectra yields the broadband azimuth spectrum function:
P(θ_m) = (1/K) Σ_{k=1}^{K} P(k, θ_m)
In some embodiments of the present invention, the angle estimation value of each frame of microphone signals is determined by computing the broadband azimuth spectrum function of that frame; the θ_m corresponding to the maximum of the broadband azimuth spectrum function is the angle estimation value of that frame.
In another aspect of the present invention, an apparatus for improving DOA estimation of a speech device is further provided, including:
a memory for storing operating instructions;
and a processor for executing the method for improving DOA estimation of the voice device according to the operating instructions in the memory.
(III) advantageous effects
Compared with the prior art, the method and the device for improving the DOA estimation of the voice equipment have at least one of the following advantages:
1. the voice awakening confidence coefficient is introduced into the calculation of the DOA estimation, so that the DOA estimation result is more accurate, the azimuth angle of the voice signal source can be more accurately determined, the influence of interference signals and noise signals on the voice signals is reduced, more accurate direction information is provided for a subsequent beam forming algorithm, the signal-to-noise ratio is improved, the voice damage is reduced, and the recognition rate can be further improved.
2. The statistical result of each angle estimation value can be determined by one of three algorithms: the count of frames, among those whose voice wake-up confidence exceeds a threshold, that correspond to the same angle estimation value; the sum of the voice wake-up confidences of the frames corresponding to the same angle estimation value; or the sum of the products of those voice wake-up confidences and the broadband azimuth spectrum function. The angle estimation value with the largest statistical result is the DOA estimation result, and different DOA estimation algorithms can be selected according to the requirements of users.
Drawings
Fig. 1 is a schematic diagram illustrating steps of a method for improving DOA estimation of a speech device according to an embodiment of the present invention.
Fig. 2 is a schematic diagram illustrating specific steps of determining a broadband azimuth spectrum function according to the microphone signal according to the embodiment of the present invention.
Fig. 3 is a schematic structural diagram of an apparatus for improving DOA estimation of a speech device according to an embodiment of the present invention.
FIG. 4A is a spectrogram of a speech signal according to an embodiment of the present invention.
Fig. 4B is a spectrogram of a microphone signal according to an embodiment of the present invention.
FIG. 5 is a diagram illustrating DOA estimation of all frames of a speech signal and a microphone signal according to an embodiment of the invention.
Fig. 6 is a schematic diagram illustrating a voice wakeup confidence curve according to an embodiment of the present invention.
Detailed Description
At present there are two conventional processing methods. The first is to count the occurrences of the angle estimation values θ_est over the frames; the angle estimation value with the largest count is taken as the target azimuth (for example, if estimates near 30° occur most often, 30° is the target direction). The second is to accumulate, for the different angle estimation values θ_est, the corresponding azimuth spectrum values P(θ_est); the angle corresponding to the maximum is the target azimuth. When several sound sources are present simultaneously, some frames easily produce wrong angle estimates, so the final statistical angle result is biased.
In view of this, the invention introduces the voice wake-up confidence into the DOA estimation, so that the DOA estimation result is more accurate, the azimuth of the voice signal source can be determined more precisely, the influence of interference and noise on the voice signal is reduced, the signal-to-noise ratio is improved, damage to the speech is reduced, and the recognition rate is further improved. In addition, the invention can determine the statistical result of each angle estimation value with one of three algorithms: the count of frames whose voice wake-up confidence exceeds a threshold among the frames of microphone signals (which contain voice, interference, and noise) corresponding to the same angle estimation value; the sum of the voice wake-up confidences of the frames corresponding to the same angle estimation value; or the sum of the products of those voice wake-up confidences and the broadband azimuth spectrum function. The angle estimation value with the largest statistical result is the DOA estimation result. Different DOA estimation algorithms can be selected according to the user's needs, so as to determine the direction of the voice signal source (the source of the voice signal that wakes up the voice device).
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
Fig. 1 is a schematic diagram illustrating steps of a method for improving DOA estimation of a speech device according to an embodiment of the present invention, as shown in fig. 1, the method includes the following steps:
S1, collecting microphone signals when the voice device is woken up, and determining the voice wake-up confidence of each frame of microphone signals;
S2, determining a broadband azimuth spectrum function of each frame according to the microphone signals, and determining an angle estimation value of each frame of microphone signals; and
S3, determining the statistical result of each angle estimation value according to the voice wake-up confidence, wherein the angle estimation value with the largest statistical result is the DOA estimation result.
The following describes each step of the method for improving DOA estimation of voice devices according to the embodiment of the present invention in detail.
S1, collecting microphone signals when the voice device is woken up, and determining the voice wake-up confidence of each frame of microphone signals.
The microphone signals may be obtained by a microphone array of the voice device; the number of microphones of the microphone array is N, and the microphone signals are
X(t) = [x_1(t), ..., x_N(t)]^T
where t is the time-domain index, the superscript T denotes transposition, and N ≥ 1.
It should be noted that, because the acquired microphone signals are time-domain signals while the subsequent signal processing is mainly performed in the frequency domain, and an overlap-add framing scheme is adopted for the speech processing, the time-domain signals need to be transformed to the frequency domain by an FFT to obtain the frequency-domain microphone signals:
X(k) = [x_1(k), ..., x_N(k)]^T, k = 1, ..., K
where k is the frequency-bin index and K ≥ 1.
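As an illustration only (not part of the patent text), the following Python sketch shows one way the N-channel time-domain recording could be split into overlapping frames and transformed with an FFT to obtain the frequency-domain microphone signals X(k) for every frame; the frame length, hop size, and window are assumed example values.

```python
import numpy as np

def stft_frames(x, frame_len=512, hop=256):
    """x: array of shape (num_samples, n_mics), the time-domain microphone
    signals X(t) = [x_1(t), ..., x_N(t)]^T sample by sample.
    Returns an array of shape (num_frames, K, n_mics) holding the
    frequency-domain microphone signals X(k) of every frame."""
    num_samples, n_mics = x.shape
    win = np.hanning(frame_len)
    frames = []
    for start in range(0, num_samples - frame_len + 1, hop):
        seg = x[start:start + frame_len, :] * win[:, None]  # window each channel
        frames.append(np.fft.rfft(seg, axis=0))             # K = frame_len // 2 + 1 bins
    return np.stack(frames)
```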
The number of voice signal sources in the microphone signals is A and the number of interference signal sources is D - A, where D is the total number of signal sources and D ≥ A. Since the microphone signals are a superposition of the voice signal, noise, and interference, they can be written as:
X(k) = A(k, Θ_D)S(k) + N(k)
where the N × D array manifold matrix is A(k, Θ_D) = [a(k, θ_1), ..., a(k, θ_m), ..., a(k, θ_D)], a(k, θ_m) is an array manifold vector with m ≤ D, the source signal is the D × 1 vector S(k) = [s(k, θ_1), ..., s(k, θ_m), ..., s(k, θ_D)]^T, Θ_D = [θ_1, ..., θ_m, ..., θ_D] denotes the set of D source azimuths, and N(k) = [n_1(k), ..., n_m(k), ..., n_N(k)]^T is the noise signal.
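The patent does not specify the array geometry, so purely as an assumed example, the sketch below constructs the array manifold (steering) vector a(k, θ) for a far-field source and a uniform linear array; the microphone spacing, sampling rate, and FFT size are likewise assumed values.

```python
import numpy as np

def steering_vector(theta_deg, k_bin, n_mics=4, spacing=0.035,
                    fs=16000, n_fft=512, c=343.0):
    """Far-field array manifold vector a(k, theta) for a uniform linear array.
    The geometry, spacing, sampling rate, and FFT size are assumed example
    values; the patent itself does not fix them."""
    freq = k_bin * fs / n_fft                    # centre frequency of bin k in Hz
    theta = np.deg2rad(theta_deg)
    positions = np.arange(n_mics) * spacing      # microphone positions along the array axis
    delays = positions * np.cos(theta) / c       # relative propagation delays
    return np.exp(-2j * np.pi * freq * delays)   # shape (n_mics,)
```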
The voice wake-up confidence c_i of each frame of microphone signals, where i denotes the frame index, can be determined by modeling and training on the wake-up word and feeding the microphone signals to the resulting model. In general, frames with a high voice wake-up confidence contain the voice signal, while frames with a low voice wake-up confidence contain interference, pauses in the speech, or noise.
S2, determining a broadband azimuth spectrum function according to the microphone signals, and determining an angle estimation value of each frame of microphone signals.
Fig. 2 is a schematic diagram of specific steps of determining a wideband azimuth spectrum function according to the microphone signals according to the embodiment of the present invention, and as shown in fig. 2, the determining the wideband azimuth spectrum function according to the microphone signals specifically includes the following sub-steps:
S21, determining a data covariance matrix according to the microphone signals:
R(k)_xx = E{X(k)X(k)^H} = R(k)_ss + R(k)_nn
where R(k)_ss = E{S(k)S(k)^H} and R(k)_nn = E{N(k)N(k)^H} are the speech-signal covariance matrix and the noise-signal covariance matrix, respectively, and the superscript H denotes conjugate transpose;
S22, decomposing the data covariance matrix to determine a voice subspace and a noise subspace;
decomposing the data covariance matrix gives R(k)_xx = E Λ E^H, where Λ is the diagonal matrix of eigenvalues in descending order and E(k) = [E(k)_s, E(k)_n] holds the corresponding eigenvectors; E(k)_s and E(k)_n are the signal subspace and the noise subspace, composed of the eigenvectors corresponding to the D larger eigenvalues and the N - D smaller eigenvalues, respectively;
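A minimal Python sketch of steps S21 and S22 (not taken from the patent): the expectation E{X(k)X(k)^H} is approximated here by averaging outer products over frames for a single frequency bin, which is one common estimator, and the eigenvectors are then split into the signal and noise subspaces.

```python
import numpy as np

def signal_noise_subspaces(X_frames, k_bin, n_sources):
    """X_frames: (num_frames, K, N) frequency-domain microphone signals.
    Approximates R(k)_xx for one bin by averaging outer products over frames,
    then splits the eigenvectors into the signal subspace E(k)_s and the
    noise subspace E(k)_n."""
    Xk = X_frames[:, k_bin, :]                                 # (num_frames, N)
    R = (Xk[:, :, None] @ Xk[:, None, :].conj()).mean(axis=0)  # sample covariance R(k)_xx

    eigvals, eigvecs = np.linalg.eigh(R)       # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]          # re-sort into descending order
    eigvecs = eigvecs[:, order]

    E_s = eigvecs[:, :n_sources]               # D eigenvectors of the larger eigenvalues
    E_n = eigvecs[:, n_sources:]               # N - D eigenvectors of the smaller eigenvalues
    return E_s, E_n
```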
S23, determining the broadband azimuth spectrum function according to the voice subspace and the noise subspace;
the MUSIC frequency azimuth spectrum function corresponding to the frequency k is:
P(k, θ_m) = 1 / (a^H(k, θ_m) E(k)_n E(k)_n^H a(k, θ_m)), θ_m ∈ Θ_D
averaging all frequency azimuth spectra yields the broadband azimuth spectrum function over the different azimuth angles:
P(θ_m) = (1/K) Σ_{k=1}^{K} P(k, θ_m)
It should be noted that the angle estimation value of each frame of microphone signals is determined by computing the broadband azimuth spectrum function of that frame; the θ_m corresponding to the maximum of the broadband azimuth spectrum function is the angle estimation value of that frame, and θ_est denotes the resulting set of per-frame angle estimation values.
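For illustration, a sketch of step S23 and the per-frame angle estimate, under the same assumptions as the hypothetical helpers above (steering_vector and signal_noise_subspaces are from the earlier sketches, not from the patent): it evaluates P(k, θ_m), averages over frequency to obtain the broadband azimuth spectrum, and returns the θ_m with the largest value.

```python
import numpy as np

def wideband_music_estimate(E_n_per_bin, k_bins, theta_grid_deg, steering_vector):
    """Computes P(theta_m) = (1/K) * sum_k 1 / (a^H(k, theta_m) E(k)_n E(k)_n^H a(k, theta_m))
    over a grid of candidate azimuths and returns the spectrum and the argmax.
    E_n_per_bin maps a frequency bin k to its noise-subspace matrix E(k)_n;
    steering_vector(theta, k) returns a(k, theta)."""
    P = np.zeros(len(theta_grid_deg))
    for i, theta in enumerate(theta_grid_deg):
        vals = []
        for k in k_bins:
            a = steering_vector(theta, k)
            E_n = E_n_per_bin[k]
            denom = np.abs(a.conj() @ E_n @ E_n.conj().T @ a)  # a^H E_n E_n^H a
            vals.append(1.0 / max(denom, 1e-12))               # narrowband spectrum P(k, theta)
        P[i] = np.mean(vals)                                   # average over frequencies
    theta_est = theta_grid_deg[int(np.argmax(P))]              # per-frame angle estimate
    return P, theta_est
```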
S3, determining the statistical result of each angle estimation value according to the voice wake-up confidence, wherein the angle estimation value with the largest statistical result is the DOA estimation result.
In order to provide a method for estimating DOA which can be selected according to the actual requirements of users, the statistical result of each angle estimation value in the invention has three algorithms:
(1) setting a threshold value, removing the frame microphone signals with the voice awakening confidence coefficient smaller than the threshold value, reserving the frame microphone signals with the voice awakening confidence coefficient larger than or equal to the threshold value to determine reserved frame microphone signals, and then determining the statistical result of the frame microphone signals corresponding to the same angle estimation value in the reserved frame microphone signals;
(2) the sum of the voice awakening confidence degrees of the frame microphone signals corresponding to the same angle estimation value;
(3) the sum of the products of the voice awakening confidence degrees of the frame microphone signals corresponding to the same angle estimation value and the broadband azimuth spectrum function.
It should be noted that, in other embodiments, the ratio of the sum in the method (2) to the sum of the voice wakeup confidences of the frame microphone signals corresponding to all the angle estimation values, or the ratio of the sum in the method (3) to the sum of the products of the voice wakeup confidences of the frame microphone signals corresponding to all the angle estimation values and the wideband azimuth spectrum function may also be obtained, and the ratio is used as a statistical result, or other similar methods may also be used, which are not described herein again.
By way of comparison, there are two conventional processing methods. The first is to count the occurrences of the angle estimation values θ_est; the angle estimation value with the largest count is the target azimuth (the DOA estimation result), e.g. if estimates near 30° occur most often, 30° is the target azimuth. The second is to accumulate, for the different angle estimation values θ_est, the corresponding azimuth spectrum values P(θ_est); the angle estimation value corresponding to the maximum is taken as the direction of the voice signal source. When several sound sources are present simultaneously, some frames easily produce wrong angle estimates, so the final statistical angle result is biased.
The invention introduces the voice wake-up confidence c_i to provide a more accurate method for improving the DOA estimation of the voice device.
Corresponding to algorithm (1), a threshold c_lim is set. If the voice wake-up confidence of the i-th frame of microphone signals satisfies c_i < c_lim, the angle estimation value of that frame is discarded; the frames whose voice wake-up confidence satisfies c_i ≥ c_lim are retained, determining the retained frame microphone signals. The statistical result of the retained frames corresponding to the same angle estimation value is then determined, and the angle estimation value with the largest statistical result is the direction of the voice signal source.
Corresponding to algorithm (2), the angle estimation values are weighted by the voice wake-up confidence c_i. For example, if the angle estimation value θ_1 occurs in frames 1 and 6 and θ_2 occurs in frames 4, 8, and 10, the statistical result of θ_1 is P_1 = c_1 + c_6 and the statistical result of θ_2 is P_2 = c_4 + c_8 + c_10; the statistical results of the other candidate azimuths are computed in the same way, and if the result for some θ_m is the largest, that θ_m is the target azimuth. Compared with the conventional method, which would simply record P_1 = 2 and P_2 = 3, the algorithm of the invention is clearly more reasonable because it incorporates the voice wake-up confidence.
Corresponding to algorithm (3), similarly assume that θ_1 occurs in frames 1 and 6 and θ_2 occurs in frames 4, 8, and 10; then P_1 = P(θ_1, 1)·c_1 + P(θ_1, 6)·c_6 and P_2 = P(θ_2, 4)·c_4 + P(θ_2, 8)·c_8 + P(θ_2, 10)·c_10, where P(θ, i) denotes the broadband azimuth spectrum value of the i-th frame at angle θ. When the speech in some frame is not the keyword, or is a keyword polluted by noise, the voice wake-up confidence of that frame is low, so its influence on the final statistical result is weakened. Compared with the traditional statistics P_1 = P(θ_1, 1) + P(θ_1, 6) and P_2 = P(θ_2, 4) + P(θ_2, 8) + P(θ_2, 10), introducing the voice wake-up confidence suppresses the contribution of frames that do not contain the keyword or whose keyword is polluted by noise and interference, so the azimuth of the voice signal source is estimated more accurately.
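The three statistics can be summarized in a small sketch (illustrative only; the variable names and the threshold value are assumptions, not taken from the patent): per candidate angle it accumulates the count of retained frames, the sum of confidences, and the confidence-weighted sum of spectrum values, and returns the angle with the largest statistic for each algorithm.

```python
from collections import defaultdict

def doa_statistics(frame_angles, confidences, frame_spectra, threshold=0.5):
    """frame_angles[i]: angle estimate of frame i; confidences[i]: wake-up
    confidence c_i of frame i; frame_spectra[i][theta]: broadband azimuth
    spectrum value of frame i at angle theta. The threshold is an assumed
    example value. Returns the angle with the largest statistical result
    under each of the three algorithms."""
    counts = defaultdict(int)       # algorithm (1): count retained frames per angle
    conf_sums = defaultdict(float)  # algorithm (2): sum of confidences per angle
    weighted = defaultdict(float)   # algorithm (3): sum of c_i * P(theta, i) per angle

    for theta, c, spectrum in zip(frame_angles, confidences, frame_spectra):
        if c >= threshold:
            counts[theta] += 1
        conf_sums[theta] += c
        weighted[theta] += c * spectrum[theta]

    def pick(stats):
        return max(stats, key=stats.get) if stats else None

    return pick(counts), pick(conf_sums), pick(weighted)
```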
In another aspect of the embodiments of the present invention, a device for improving DOA estimation of a speech device is further provided, and fig. 3 is a schematic structural diagram of the device for improving DOA estimation of a speech device according to the embodiments of the present invention, as shown in fig. 3, the device includes:
a memory 31 for storing operation instructions;
and a processor 32 for executing the method for improving DOA estimation of the voice device according to the operating instructions in the memory 31.
Next, the effects of the present invention will be described with reference to the drawings.
Fig. 4A is a spectrogram of a speech signal according to an embodiment of the present invention, and Fig. 4B is a spectrogram of a microphone signal according to an embodiment of the present invention. As shown in Figs. 4A and 4B, the spectrogram in Fig. 4A is cleaner because the microphone signal in Fig. 4B contains interference and noise in addition to the speech signal. Both recordings are taken from the first channel of the microphone array; the target sound source is located at 70° and the interfering speech comes from 120°.
Fig. 5 is a schematic diagram of the per-frame DOA estimates of the speech signal and of the microphone signal according to an embodiment of the present invention; the number of frames of the microphone signal is 61. As shown in Fig. 5, the DOA estimates of the microphone signal deviate considerably from the direction of the speech signal: the final statistical result for the speech signal gives a source direction of about 71°, while the direction estimated from the microphone signal is about 80°.
Fig. 6 is a schematic diagram of the voice wake-up confidence curve according to an embodiment of the present invention. As shown in Fig. 6, the voice wake-up confidence is high for the voice signal and low for the noise and interference signals; with the method for improving DOA estimation of the voice device of the present invention, the direction of the voice signal source obtained from the statistics is about 73°.
In summary, in the method and apparatus of the present invention the angle estimation value with the largest statistical result is taken as the DOA estimation result, where the statistical result of each angle estimation value is either the count of frames whose voice wake-up confidence exceeds a threshold among the frames corresponding to that angle estimation value, or the sum of the voice wake-up confidences of the frames corresponding to the same angle estimation value, or the sum of the products of those voice wake-up confidences and the broadband azimuth spectrum function. As a result, the DOA estimation is more accurate, the azimuth of the voice signal source can be determined more precisely, the influence of interference and noise on the voice signal is reduced, the signal-to-noise ratio is improved, damage to the speech is reduced, and the recognition rate is further improved.
Unless otherwise indicated, the numerical parameters set forth in the specification and appended claims are approximations that can vary depending on the desired properties sought by the present invention. In particular, all numbers expressing quantities of ingredients, reaction conditions, and so forth used in the specification and claims are to be understood as being modified in all instances by the term "about". Generally, this expression is meant to encompass a variation of ±10% in some embodiments, ±5% in some embodiments, ±1% in some embodiments, or ±0.5% in some embodiments of the specified value.
Furthermore, "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention, and are not intended to limit the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (7)

1. A method of improving DOA estimation for a voice device, comprising:
collecting microphone signals when a voice device is awakened, and determining a voice awakening confidence coefficient of each frame of microphone signals, wherein the microphone signals are acquired through a microphone array of the voice device;
determining a wideband azimuth spectrum function of each frame according to the microphone signals, and determining the angle estimation value of each frame of microphone signals, wherein the angle estimation value of each frame of microphone signals is determined by computing the wideband azimuth spectrum function of that frame, and the θ_m corresponding to the maximum of the wideband azimuth spectrum function is the angle estimation value of that frame; the wideband azimuth spectrum function is:
P(θ_m) = (1/K) Σ_{k=1}^{K} 1 / (a^H(k, θ_m) E(k)_n E(k)_n^H a(k, θ_m))
where k is the frequency-bin index, k = 1, ..., K, with K ≥ 1; a(k, θ_m) is the array manifold vector; E(k)_n is the noise subspace, composed of the eigenvectors corresponding to the N - D smaller eigenvalues (the D larger eigenvalues correspond to the signal subspace), D being the number of signal sources; N is the number of microphones of the microphone array, with N ≥ 1; and H denotes conjugate transposition; and
determining a statistical result of each angle estimation value according to the voice awakening confidence coefficient, wherein the angle estimation value with the largest statistical result is a DOA estimation result;
wherein, the determination of the statistical result of each angle estimation value according to the voice awakening confidence comprises one of the following three algorithms:
the algorithm one comprises the substeps of:
setting a threshold value;
removing the frame microphone signals with the voice awakening confidence coefficient smaller than the threshold value, and reserving the frame microphone signals with the voice awakening confidence coefficient larger than or equal to the threshold value so as to determine reserved frame microphone signals; and
determining a statistical result of frame microphone signals corresponding to the same angle estimation value in the reserved frame microphone signals;
the second algorithm comprises the following steps: the statistical result of each angle estimation value refers to the sum of the voice awakening confidence degrees of the frame microphone signals corresponding to the same angle estimation value; and
the third algorithm comprises the following steps: and the sum of the products of the voice awakening confidence degrees of the frame microphone signals corresponding to the same angle estimation value and the broadband azimuth spectrum function.
2. The method of claim 1, wherein the microphone signal is
X(t) = [x_1(t), ..., x_N(t)]^T
where t is the time-domain index and the superscript T denotes transposition.
3. The method of claim 2, wherein prior to determining a wideband azimuth spectral function from the microphone signals, further comprising the steps of:
performing Fourier transform on the microphone signals to determine frequency domain microphone signals:
X(k) = [x_1(k), ..., x_N(k)]^T, k = 1, ..., K.
4. The method according to claim 3, wherein the number of voice signal sources in the microphone signals is A and the number of interference signal sources is D - A, where D ≥ A; the microphone signals include a noise signal, a voice signal, and an interference signal, and are written as X(k) = A(k, Θ_D)S(k) + N(k),
where the N × D array manifold matrix is A(k, Θ_D) = [a(k, θ_1), ..., a(k, θ_m), ..., a(k, θ_D)] with m ≤ D, the source signal is S(k) = [s(k, θ_1), ..., s(k, θ_m), ..., s(k, θ_D)]^T, Θ_D = [θ_1, ..., θ_m, ..., θ_D] denotes the set of D source azimuths, and N(k) = [n_1(k), ..., n_m(k), ..., n_N(k)]^T is the noise signal.
5. A method according to claim 4, wherein determining a broadband azimuthal spectral function from the microphone signals comprises the sub-steps of:
determining a data covariance matrix according to the microphone signals;
decomposing the data covariance matrix to determine a voice subspace and a noise subspace; and
and determining the broadband azimuth spectrum function according to the voice subspace and the noise subspace.
6. The method of claim 5, wherein the data covariance matrix is:
R(k)_xx = E{X(k)X(k)^H} = R(k)_ss + R(k)_nn
where R(k)_ss = E{S(k)S(k)^H} and R(k)_nn = E{N(k)N(k)^H} are the speech-signal covariance matrix and the noise-signal covariance matrix, respectively;
decomposing the data covariance matrix gives R(k)_xx = E Λ E^H, where Λ is the diagonal matrix of eigenvalues in descending order and E(k) = [E(k)_s, E(k)_n] holds the corresponding eigenvectors; E(k)_s is the signal subspace, composed of the eigenvectors corresponding to the D larger eigenvalues, and E(k)_n is the noise subspace, composed of the eigenvectors corresponding to the N - D smaller eigenvalues;
the frequency azimuth spectrum function corresponding to the frequency k is:
P(k, θ_m) = 1 / (a^H(k, θ_m) E(k)_n E(k)_n^H a(k, θ_m)), θ_m ∈ Θ_D
averaging all frequency azimuth spectrums to obtain a broadband azimuth spectrum function.
7. An apparatus for improving DOA estimation for a speech device, comprising:
a memory for storing operating instructions;
a processor configured to execute the method for improving DOA estimation of a speech device according to any one of claims 1 to 6 according to the operating instructions in the memory.
CN201711498690.8A 2017-12-29 2017-12-29 Method and device for improving DOA estimation of voice equipment Active CN108538306B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711498690.8A CN108538306B (en) 2017-12-29 2017-12-29 Method and device for improving DOA estimation of voice equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711498690.8A CN108538306B (en) 2017-12-29 2017-12-29 Method and device for improving DOA estimation of voice equipment

Publications (2)

Publication Number Publication Date
CN108538306A CN108538306A (en) 2018-09-14
CN108538306B true CN108538306B (en) 2020-05-26

Family

ID=63489870

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711498690.8A Active CN108538306B (en) 2017-12-29 2017-12-29 Method and device for improving DOA estimation of voice equipment

Country Status (1)

Country Link
CN (1) CN108538306B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110223708B (en) * 2019-05-07 2023-05-30 平安科技(深圳)有限公司 Speech enhancement method based on speech processing and related equipment
CN111103568A (en) * 2019-12-10 2020-05-05 北京声智科技有限公司 Sound source positioning method, device, medium and equipment
CN111883162B (en) * 2020-07-24 2021-03-23 杨汉丹 Awakening method and device and computer equipment
KR20230146605A (en) * 2021-12-20 2023-10-19 썬전 샥 컴퍼니 리미티드 Voice activity detection method, system, voice enhancement method and system
CN114639398B (en) * 2022-03-10 2023-05-26 电子科技大学 Broadband DOA estimation method based on microphone array

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101865758A (en) * 2010-06-12 2010-10-20 南京航空航天大学 Impact load location method based on multiple signal classification algorithm
CN102866385A (en) * 2012-09-10 2013-01-09 上海大学 Multi-sound-source locating method based on spherical microphone array
CN104599679A (en) * 2015-01-30 2015-05-06 华为技术有限公司 Speech signal based focus covariance matrix construction method and device
CN104931928A (en) * 2015-07-01 2015-09-23 西北工业大学 Signal source positioning method and apparatus
CN105096939A (en) * 2015-07-08 2015-11-25 百度在线网络技术(北京)有限公司 Voice wake-up method and device
CN106950542A (en) * 2016-01-06 2017-07-14 中兴通讯股份有限公司 The localization method of sound source, apparatus and system
JP2017228978A (en) * 2016-06-23 2017-12-28 キヤノン株式会社 Signal processing apparatus, signal processing method, and program
CN107976651A (en) * 2016-10-21 2018-05-01 杭州海康威视数字技术股份有限公司 A kind of sound localization method and device based on microphone array
CN107159435A (en) * 2017-05-25 2017-09-15 洛阳语音云创新研究院 Method and device for adjusting working state of mill
CN107316648A (en) * 2017-07-24 2017-11-03 厦门理工学院 A kind of sound enhancement method based on coloured noise
CN108122563A (en) * 2017-12-19 2018-06-05 北京声智科技有限公司 Improve voice wake-up rate and the method for correcting DOA

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on sound source localization technology based on microphone arrays; 赵秀粉; 《中国优秀硕士学位论文全文数据库 信息科技辑》 (China Master's Theses Full-text Database, Information Science and Technology); 2014-01-15; pp. 36-59 *

Also Published As

Publication number Publication date
CN108538306A (en) 2018-09-14

Similar Documents

Publication Publication Date Title
CN108538306B (en) Method and device for improving DOA estimation of voice equipment
CN108122563B (en) Method for improving voice awakening rate and correcting DOA
US10901063B2 (en) Localization algorithm for sound sources with known statistics
Zhang et al. A speech enhancement algorithm by iterating single-and multi-microphone processing and its application to robust ASR
Nesta et al. Convolutive BSS of short mixtures by ICA recursively regularized across frequencies
US10123113B2 (en) Selective audio source enhancement
US9100734B2 (en) Systems, methods, apparatus, and computer-readable media for far-field multi-source tracking and separation
Liu et al. Neural network based time-frequency masking and steering vector estimation for two-channel MVDR beamforming
Wang et al. Noise power spectral density estimation using MaxNSR blocking matrix
Martín-Doñas et al. Dual-channel DNN-based speech enhancement for smartphones
Kim Hearing aid speech enhancement using phase difference-controlled dual-microphone generalized sidelobe canceller
CN110992977B (en) Method and device for extracting target sound source
Vincent An experimental evaluation of Wiener filter smoothing techniques applied to under-determined audio source separation
Sharma et al. Adaptive and hybrid Kronecker product beamforming for far-field speech signals
WO2020078210A1 (en) Adaptive estimation method and device for post-reverberation power spectrum in reverberation speech signal
Tammen et al. Complexity reduction of eigenvalue decomposition-based diffuse power spectral density estimators using the power method
Kim et al. Sound source separation using phase difference and reliable mask selection selection
Lee et al. Deep neural network-based speech separation combining with MVDR beamformer for automatic speech recognition system
CN114242104A (en) Method, device and equipment for voice noise reduction and storage medium
CN113223552A (en) Speech enhancement method, speech enhancement device, speech enhancement apparatus, storage medium, and program
Jukić et al. Speech dereverberation with convolutive transfer function approximation using MAP and variational deconvolution approaches
McCowan et al. Multi-channel sub-band speech recognition
Malek et al. Speaker extraction using LCMV beamformer with DNN-based SPP and RTF identification scheme
Meng et al. A robust maximum likelihood distortionless response beamformer based on a complex generalized Gaussian distribution
US20240212701A1 (en) Estimating an optimized mask for processing acquired sound data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant