CN117630818A - Feature preprocessing and extracting method for multi-channel audio positioning - Google Patents


Info

Publication number
CN117630818A
Authority
CN
China
Prior art keywords
dsb
channel
audio
delay
mel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311601140.XA
Other languages
Chinese (zh)
Inventor
贺长江
程思瑶
刘劼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202311601140.XA
Publication of CN117630818A
Legal status: Pending

Landscapes

  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)

Abstract

The invention relates to a feature preprocessing and extraction method for multi-channel audio localization, in the technical field of audio preprocessing. The method comprises extracting LogMel features and intensity vector (IV) features from DSB-processed audio, performing global normalization on the extracted features, and training and testing a deep learning model on the normalized data; a DSB edge-side acceleration method speeds up the computation, and the extracted features are used to train a CRNN model for estimating the direction of the sound source. The invention extracts the LogMel and IV features of the audio processed by DSB. During feature extraction, the invention uses the edge GPU of the Jetson Nano to accelerate the processing. Finally, a CRNN is trained using the extracted features. Compared with features extracted without DSB, the CRNN trained with the DSB method effectively reduces DoA error and still runs at the edge in real time.

Description

Feature preprocessing and extracting method for multi-channel audio positioning
Technical Field
The invention relates to the technical field of audio preprocessing, and in particular to a feature preprocessing and extraction method for multi-channel audio localization.
Background
With increasing computing power and the popularity of network connections, algorithms based on artificial intelligence (Artificial Intelligence, AI) are increasingly being used in industrial environments. In modern factories, the manufacture of products depends largely on the correct and reliable operation of machines and tools. When machines and tools fail, however, the cost of maintenance and production outages can lead to significant revenue losses. Therefore, condition-based maintenance and early failure detection are critical in today's factories. Sensors have been widely studied and used in equipment failure detection. Although images are very effective for fault detection, they are prone to information-leakage problems, which limits their use in sensitive fields such as the defense industry. Therefore, non-image sensing data, such as current and voltage information, are often preferred. However, such data typically either rely on sensors on board the machine or require expensive instrumentation to acquire. In contrast, acoustic methods using microphones provide a low-cost way to collect data for machine diagnostics. These microphones may be installed near the machine or placed on the factory floor. The faulty sound source may come from various directions, so determining the position of the sound source becomes crucial. Sound source localization (Sound Source Localization, SSL) typically uses multiple microphone audio channels to analyze the position of one or more sound sources relative to a microphone array. In many cases, the SSL problem is reduced to estimating the direction of arrival (Direction of Arrival, DOA) of the sound source. Methods for solving the DOA problem mainly use deep neural networks to learn the correlations between multi-channel audio signals. However, existing machine-learning-based DOA estimation solutions typically use temporal and spectral features as inputs. These features capture spatial information only implicitly, e.g., through inter-channel time differences, which makes it difficult for convolutional neural networks (Convolutional Neural Networks, CNN) to extract the information relevant to DOA efficiently.
Disclosure of Invention
The present invention may find application in many fields requiring multi-source sound localization, including auditory scene analysis, fault detection and diagnosis in manufacturing, augmented reality, and the like. In the far field, three-dimensional sound source localization amounts to finding the DOA of the sound source, i.e., its azimuth and elevation. Recent DOA estimation pipelines take multi-channel audio input and extract spectral features from each channel, which are then fed into a deep neural network. Unfortunately, spectral features contain only time-frequency information of the audio signal, whereas spatial information is captured only implicitly in the signals of the different channels and is highly dependent on the geometry of the acoustic array. In order to embed spatial information of sound sources into the spectral feature representation, a spatial mapping method based on delay-and-sum beamforming (Delay and Sum Beamforming, DSB) is proposed for encoding sound source location information. It can be combined with different feature extraction methods and machine learning models for DOA estimation. On this basis, a redundancy-removal method is provided to accelerate the DSB computation, so that the pipeline can run in real time on embedded GPUs such as the NVIDIA Jetson Nano. The present invention uses two neural network models and the DSB method to conduct extensive experiments on two data sets. Experiments show that the DSB method effectively reduces DOA error. When DSB is combined with feature extraction, the DOA error is reduced by 19.24%. In addition, the speed of the feature extraction process is increased by 30.42% after redundancy removal is applied. On this basis, the invention provides a feature preprocessing and extraction method for multi-channel audio localization.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The invention provides a feature preprocessing and extraction method for multi-channel audio localization, with the following technical scheme:
A method for feature preprocessing and extraction for multi-channel audio localization, the method comprising the steps of:
Step 1: extracting LogMel features and IV features from the DSB features, performing global normalization on the extracted features, and training and testing a deep learning model on the normalized data;
Step 2: performing acceleration using a DSB edge-side acceleration method, and using the extracted features to train a CRNN model for estimating the direction of the sound source.
Preferably, the step 1 specifically includes:
Let P_1, P_2, …, P_H be the set of H positions on a unit sphere centered at the array center. For an assumed sound source at polar coordinates (θ_h, φ_h), whose delays on the array are τ(P_h), and given the received signals y_m(t), m = 1, 2, …, M, the DSB output of the h-th channel is represented by the following formula:

z_h(t) = Σ_{m=1}^{M} y_m(t + τ_m(P_h))

Preferably, when P_h coincides with source i, x_i(t) accumulates coherently in the array output.
Preferably, the DSB-based spatial mapping is as follows:
selecting H virtual sound source positions in the monitored area, denoted P_1, P_2, …, P_H, the points being uniformly distributed on a unit sphere;
for each virtual source P_h at (θ_h, φ_h), calculating its array delay vector τ(P_h);
given the array input y_m(k), m = 1, 2, …, M, k = 0, 1, …, T·f_s, calculating z_h(k) = Σ_{m=1}^{M} y_m(k + round(τ_m(P_h)·f_s)), where f_s is the sampling frequency, k is the sample index, T is the duration of each (windowed) segment, and fractional delays are rounded to the nearest sample;
the DSB-based spatial mapping thus outputs an H-channel time series (called H channels), one channel per virtual sound source position, and frequency-domain audio features can be further extracted from the data in each channel.
Preferably, the amplitude and phase spectra are determined by applying a discrete short-time Fourier transform (STFT) to the windowed data; the discrete STFT of the data in channel h is expressed by:

Y_h(k, ω) = Σ_n w(n) · z_h(kR + n) · e^{−jωn}

where w(·) is a window function, k is the frame index, and R is the hop between frames. In the implementation, the present invention uses a Hamming window of length 2048.
The Log-Mel spectrogram is determined by passing the incoming signal Y_h(k, ω) through a set of filter banks simulating the human auditory front end; the log-Mel spectrum is calculated by taking the logarithm of the Mel spectrum, and the l-th log-Mel coefficient of channel h is:

LM_h(k, l) = log( Σ_ω |Y_h(k, ω)|² · W_mel(ω, l) )

where W_mel(ω, l) is a Mel filter bank.
The intensity vector (IV) is determined as follows: the intensity vector of the multi-channel audio signal is obtained by computing the correlation between the signal in each channel and a reference channel. For the multi-channel signal after DSB, the received signals are accumulated directly without delay to obtain the reference channel audio. For any channel h, the initial intensity vector of the audio after DSB is represented by the following formula:

I_h(k, ω) = Re{ Y_ref*(k, ω) · Y_h(k, ω) }

where Y_h(k, ω) is the STFT feature of channel h, Y_ref(k, ω) is the STFT feature of the reference audio, (·)* denotes the element-wise complex conjugate, and Re{·} returns the element-wise real part. The extracted intensity vector features may be further normalized by dividing each element by the square root of the sum of the squares of the audio after all DSBs; the square root of the sum of the squares of the audio after DSB is:

‖Y(k, ω)‖ = √( Σ_{h=1}^{H} |Y_h(k, ω)|² )

The intensity vector features are transformed into the Mel domain by multiplication with a Mel filter bank; after DSB, the audio intensity vector of any channel h is:

IV_h(k, l) = Σ_ω ( I_h(k, ω) / ‖Y(k, ω)‖ ) · W_mel(ω, l)

where W_mel(ω, l) is a Mel filter bank.
Preferably, the step 2 specifically includes:
calculating the delay-and-sum, STFT and LogMel features on the GPU using the CuPy library; in the delay-and-sum, when the summands are delayed by the same amount of time, the resulting sum is also delayed by that amount; considering the signals y_1(t) and y_2(t) received by the two microphones located at q_1 and q_2, for any position P_h on the intersection of the unit sphere with the hyperbola satisfying τ_1(P_h) − τ_2(P_h) = constant, the partial delay sum takes the form:

y_1(t + τ_1(P_h)) + y_2(t + τ_2(P_h))

For different P_h on the same hyperbola, this partial sum differs only by a time shift; these partial sums are stored and reused at different positions to further accelerate the computation of the DSB.
Preferably, when there are four microphones and two points P_i and P_j (1 ≤ i ≤ j ≤ H) in space, and the delay calculation gives the delays of the two points as [Δ_{i1}, Δ_{i2}, Δ_{i3}, Δ_{i4}] and [Δ_{j1}, Δ_{j2}, Δ_{j3}, Δ_{j4}], redundancy exists when:
Case 1: when […, Δ_{ix}, …, Δ_{iy}, …] = […, Δ_{jx}, …, Δ_{jy}, …], the delay sum of the x-th and y-th microphones at point P_i is calculated and stored in memory, and then the delayed results of the other microphones are added; when calculating point P_j, the stored sum of the x-th and y-th microphones is called directly, and then the delayed results of the other microphones are added;
Case 2: when […, Δ_{ix}, …, Δ_{iy}, …, Δ_{iz}, …] = […, Δ_{jx}, …, Δ_{jy}, …, Δ_{jz}, …], the calculation process is similar to case 1; case 1 is computed first in the overall calculation process, and case 2 can also call the results of case 1 directly.
A multi-channel audio localization oriented feature preprocessing and extraction system, the system comprising:
the feature extraction module is used for extracting the LogMel features and the IV features from the DSB features, global normalization is carried out on the extracted features, and training and testing are carried out on the normalized data on deep learning;
and the sound source estimation module performs acceleration processing through a DSB edge end acceleration method, and uses the extracted features to train a CRNN model for estimating the direction of the sound source.
A computer readable storage medium having stored thereon a computer program, the program being executed by a processor to implement the feature preprocessing and extraction method for multi-channel audio localization.
A computer device comprising a memory storing a computer program and a processor which, when executing the computer program, implements the feature preprocessing and extraction method for multi-channel audio localization.
The invention has the following beneficial effects:
compared with the prior art, the invention has the advantages that:
the invention provides an audio feature preprocessing and extracting method for multi-sound source localization. This approach uses DSB algorithms to beamform multi-channel audio at multiple points in space. Meanwhile, in order to accelerate the beam forming process, the invention adopts a redundancy removing algorithm in the DSB edge accelerating algorithm, thereby reducing unnecessary calculation in the beam forming. Subsequently, the present invention performs log mel feature and IV feature extraction on the DSB-processed audio. In the feature extraction process, the invention utilizes the edge-side GPU to accelerate the processing on the Nano. Finally, a CRNN is trained using the extracted features. Compared with the feature of not performing DSB, the CRNN trained by the DSB method can effectively reduce DoA errors and still can run at the edge in real time.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
Fig. 1 is an example of sound source localization in space.
Detailed Description
The following description of the embodiments of the present invention is made clearly and completely with reference to the accompanying drawings, in which some, but not all, embodiments of the invention are shown. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort are intended to fall within the scope of the invention.
In the description of the present invention, it should be noted that the directions or positional relationships indicated by the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. are based on the directions or positional relationships shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should be noted that, unless explicitly specified and limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be either fixedly connected, detachably connected, or integrally connected, for example; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.
In addition, the technical features of the different embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
The present invention will be described in detail with reference to specific examples.
First embodiment:
According to the embodiment shown in Fig. 1, the specific technical scheme adopted by the invention to solve the technical problem is as follows: a feature preprocessing and extraction method for multi-channel audio localization.
Step 1: extracting LogMel characteristics and IV characteristics from DSB characteristics, performing global normalization on the extracted characteristics, and training and testing normalized data on deep learning;
step 2: and (3) performing acceleration processing by using a DSB edge end acceleration method, and using the extracted features to train a CRNN model for estimating the direction of the sound source.
The step 1 specifically comprises the following steps:
Let P_1, P_2, …, P_H be the set of H positions on a unit sphere centered at the array center. For an assumed sound source at polar coordinates (θ_h, φ_h), whose delays on the array are τ(P_h), and given the received signals y_m(t), m = 1, 2, …, M, the DSB output of the h-th channel is represented by the following formula:

z_h(t) = Σ_{m=1}^{M} y_m(t + τ_m(P_h))

When P_h coincides with source i, x_i(t) accumulates coherently in the array output.
The DSB-based spatial mapping is as follows:
Selecting H virtual sound source positions in the monitored area, denoted P_1, P_2, …, P_H, the points being uniformly distributed on a unit sphere;
for each virtual source P_h at (θ_h, φ_h), calculating its array delay vector τ(P_h);
given the array input y_m(k), m = 1, 2, …, M, k = 0, 1, …, T·f_s, calculating z_h(k) = Σ_{m=1}^{M} y_m(k + round(τ_m(P_h)·f_s)), where f_s is the sampling frequency, k is the sample index, T is the duration of each (windowed) segment, and fractional delays are rounded to the nearest sample;
the DSB-based spatial mapping thus outputs an H-channel time series (called H channels), one channel per virtual sound source position, and frequency-domain audio features can be further extracted from the data in each channel.
Determining the amplitude and phase spectra: a discrete short-time Fourier transform (STFT) is applied to the windowed data to calculate the amplitude and phase spectra; the discrete STFT of the data in channel h is expressed by:

Y_h(k, ω) = Σ_n w(n) · z_h(kR + n) · e^{−jωn}

where w(·) is a window function, k is the frame index, and R is the hop between frames. In the implementation, the present invention uses a Hamming window of length 2048.
Determining the Log-Mel spectrogram: the Mel spectrogram is obtained by passing the incoming signal Y_h(k, ω) through a set of filter banks simulating the human auditory front end; the log-Mel spectrum is calculated by taking the logarithm of the Mel spectrum, and the l-th log-Mel coefficient of channel h is:

LM_h(k, l) = log( Σ_ω |Y_h(k, ω)|² · W_mel(ω, l) )

where W_mel(ω, l) is a Mel filter bank.
Determining the intensity vector (IV): the intensity vector of the multi-channel audio signal is obtained by computing the correlation between the signal in each channel and a reference channel. For the multi-channel signal after DSB, the received signals are accumulated directly without delay to obtain the reference channel audio. For any channel h, the initial intensity vector of the audio after DSB is represented by the following formula:

I_h(k, ω) = Re{ Y_ref*(k, ω) · Y_h(k, ω) }

where Y_h(k, ω) is the STFT feature of channel h, Y_ref(k, ω) is the STFT feature of the reference audio, (·)* denotes the element-wise complex conjugate, and Re{·} returns the element-wise real part. The extracted intensity vector features may be further normalized by dividing each element by the square root of the sum of the squares of the audio after all DSBs; the square root of the sum of the squares of the audio after DSB is:

‖Y(k, ω)‖ = √( Σ_{h=1}^{H} |Y_h(k, ω)|² )

The intensity vector features are transformed into the Mel domain by multiplication with a Mel filter bank; after DSB, the audio intensity vector of any channel h is:

IV_h(k, l) = Σ_ω ( I_h(k, ω) / ‖Y(k, ω)‖ ) · W_mel(ω, l)

where W_mel(ω, l) is a Mel filter bank.
The step 2 specifically comprises the following steps:
Calculating the delay-and-sum, STFT and LogMel features on the GPU using the CuPy library; in the delay-and-sum, when the summands are delayed by the same amount of time, the resulting sum is also delayed by that amount. Considering the signals y_1(t) and y_2(t) received by the two microphones located at q_1 and q_2, for any position P_h on the intersection of the unit sphere with the hyperbola satisfying τ_1(P_h) − τ_2(P_h) = constant, the partial delay sum takes the form:

y_1(t + τ_1(P_h)) + y_2(t + τ_2(P_h))

For different P_h on the same hyperbola, this partial sum differs only by a time shift; these partial sums are stored and reused at different positions to further accelerate the computation of the DSB.
When there are four microphones and two points P in the space h And P j (i is more than or equal to 1 is less than or equal to j is more than or equal to H), and through delay calculation, the corresponding delays of the two points are [ delta ] i1i2i3i4 ]And [ delta ] j1j2j3j4 ]Redundancy exists when:
case 1: when it is [ …, delta ix ,…,Δ iy ,…]=[…,Δ jx ,…,Δ jy ,…]Calculate P i The delay sums of the x and y microphones of the point are stored in the memory, and then the delay results of other microphones are added; at the calculation point P j When directly calling P i The result of the sum of the x and y microphones at that location, and then adding the delay results of the other microphones;
case 2: when [ …, delta ix ,…,Δ iy ,…,Δ iz ,…]=[…,Δ jx ,…,Δ jy ,…,Δ jz ,…]The calculation process is similar to case 1, with case 1 being prioritized in the overall calculation process, while case 2 may also call the results in case 1 directly.
Specific embodiment II:
the second embodiment of the present application differs from the first embodiment only in that:
description of the DoA problem
Assume that M microphones are located at q_1, q_2, …, q_M. Without loss of generality, the present invention assumes that the centroid of the microphones is at the origin of the 3D coordinate space, i.e., (1/M)·Σ_{m=1}^{M} q_m = 0. Let n, n ≤ N, be the (unknown) number of active sound sources during the time interval [0, T]. The set of sound sources is expressed as S = {S_1, S_2, …, S_n}. The apertures of commercially available microphone arrays are typically on the order of a few centimeters, so sound sources several meters away from the center of the array may be considered to be in the far field. Thus, the sound waves arriving at the array are approximated as plane waves; in other words, the DOA of the sound waves from a single sound source is the same at each microphone. In a far-field environment, only the DOA of the sound source can be estimated, while its range remains indeterminate.
Thus, it suffices to use (θ_i, φ_i) to represent the azimuth and elevation of sound source i in polar coordinates. Let x_i(t) be the wave propagating from sound source i and y_m(t) the wave received by microphone m, i.e.,

y_m(t) = Σ_{i=1}^{n} a_i · x_i(t − τ_{i,m})

where τ_{i,m} is the propagation delay from source i to microphone m (the source-to-microphone distance divided by c), and c is the speed at which sound propagates in air. Note that under the plane-wave assumption and a time-invariant channel, the sound wave from source i undergoes the same attenuation a_i at all microphones. Let δ_i be the delay from source i to the center of the array. When source i is in the far field,

τ_{i,m} − δ_i = −(q_m · u(θ_i, φ_i)) / c = −(q_{m,x} cos θ_i cos φ_i + q_{m,y} sin θ_i cos φ_i + q_{m,z} sin φ_i) / c

where u(θ_i, φ_i) is the unit vector pointing from the array center toward source i; the last term follows from the transformation of polar to Cartesian coordinates. An important implication of the above is that the delay differences across the microphones depend only on the DOA of the sound source and the geometry of the array (i.e., q_1, …, q_M), not on the range of the sound source, which is absorbed into δ_i and cannot be determined. Thus, the present invention can define a steering delay vector τ(θ, φ) ∈ R^M whose m-th element is τ_m(θ, φ) = −(q_m · u(θ, φ)) / c.
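For illustration, the following sketch computes the steering delays τ_m(θ, φ) for a given array geometry under the far-field plane-wave model described above. The microphone layout, the value of the speed of sound, and the function names are illustrative assumptions, not part of the invention.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed value at room temperature

def steering_delays(mic_positions, azimuth, elevation, c=SPEED_OF_SOUND):
    """Per-microphone delays (seconds) relative to the array center for a
    far-field source at (azimuth, elevation), under the plane-wave model."""
    # Unit vector pointing from the array center toward the source.
    u = np.array([
        np.cos(azimuth) * np.cos(elevation),
        np.sin(azimuth) * np.cos(elevation),
        np.sin(elevation),
    ])
    # tau_m = -(q_m . u) / c: microphones closer to the source receive earlier.
    return -(mic_positions @ u) / c

# Hypothetical 4-microphone square array with a ~4 cm aperture.
mics = np.array([
    [ 0.02,  0.02, 0.0],
    [-0.02,  0.02, 0.0],
    [-0.02, -0.02, 0.0],
    [ 0.02, -0.02, 0.0],
])
taus = steering_delays(mics, azimuth=np.deg2rad(30), elevation=np.deg2rad(10))
print(taus)  # delay offsets on the order of tens of microseconds
```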
As shown in Fig. 1, during the time interval t ∈ [0, T], four microphones capture acoustic signals from three different spatial directions. The acquired acoustic data is feature-extracted in real time on an embedded device and then passed through a neural network for prediction. Finally, the embedded device provides an output indicating the number of sounds detected and their corresponding directions. Given y_m(t), t ∈ [0, T], m = 1, 2, …, M, the multi-source localization problem is to determine the number n of sound sources and their DOAs (θ_i, φ_i), i = 1, 2, …, n.
Audio feature extraction
Traditional sound preprocessing mainly removes noise and reverberation from the audio, where noise refers to unwanted additional sound components that interfere with the original sound signal. Noise may be generated by the environment, equipment, or other electromagnetic interference; common types include background noise, power-supply noise, and electromagnetic interference. Noise removal may be performed by static noise removal, adaptive filtering, and spectral subtraction. Reverberation is the delay and frequency-response change caused by reflection and diffraction of sound within an enclosed space, often adding tails to the sound or making it sound hollow. Reverberation may be removed by Fourier-transforming the audio signal to convert the time-domain signal into a frequency-domain signal, which facilitates analysis and processing of the reverberant components. Digital filters can also be used to compensate for the effect of reverberation on the sound spectrum so that the original signal's properties are restored as far as possible. With the development of deep learning, such preprocessing is now generally folded into the stage before feature extraction rather than treated separately. Taking log-Mel spectral features (LogMel spectrum) as an example, the original audio signal is first preprocessed in order to extract more meaningful features in the subsequent steps. Preprocessing typically includes removing silent segments, reducing noise, eliminating the direct-current component, normalizing the audio amplitude, and so on. The preprocessed audio signal is split into short frames, each typically 20-30 milliseconds long; this is done because the characteristics of the sound can be assumed stationary within each such period while short-time variations in speech are still captured. A window function is applied to each frame, Hamming and Hanning windows being common choices; this step helps reduce spectral leakage. A fast Fourier transform (FFT) is performed on each windowed audio frame to convert the time-domain signal into the frequency domain, producing a spectrum for each frame. A set of Mel filters is then applied to the frequency-domain signal. The Mel filter bank is a set of triangular filters that are dense in the low-frequency region and sparse in the high-frequency region, because the human ear is more sensitive to low-frequency sounds and resolves them more finely than high-frequency sounds. Finally, the logarithm of the filter-bank outputs is taken, because human auditory perception of loudness is approximately logarithmic, so the logarithm better mimics the perception of the human ear.
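As a concrete reference for the pipeline just described (framing, windowing, FFT, Mel filtering, logarithm), a minimal sketch follows. It assumes the librosa library and illustrative parameter values (hop length, number of Mel bands) that the text does not fix; only the 48 kHz rate and the length-2048 Hamming window come from the description.

```python
import numpy as np
import librosa

def logmel_from_waveform(y, sr=48000, n_fft=2048, hop=1024, n_mels=64):
    """Frame -> window -> FFT -> Mel filter bank -> log, as described above."""
    # STFT with a Hamming window (the text mentions a length-2048 window).
    spec = librosa.stft(y, n_fft=n_fft, hop_length=hop, window="hamming")
    power = np.abs(spec) ** 2                       # power spectrogram, (freq, frames)
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    mel = mel_fb @ power                            # Mel spectrogram
    return np.log(mel + 1e-10)                      # log compression, small eps for stability
```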
DSB feature extraction
DSB is a well-known array processing technique that aligns the delays of received signals to a target direction and then sums the corresponding delayed signals across the array. Signals from the target direction are coherently superimposed, while signals from other directions have distortion and incoherent aggregation. The DSB algorithm can be used directly for sound source localization, searching all possible candidate sound source directions, and finding the sound source direction of maximum power. Although the DSB algorithm is simple, its computation amount increases linearly with the increase of the search area. The present invention uses DSBs to generate a spatial representation of a signal in a fixed set of directions. These directions are typically not aligned with the actual source locations, but by introducing different delayed versions of the original signal, spatial information is embedded directly into the resulting representation.
Specifically, let P_1, P_2, …, P_H be the set of H positions on a unit sphere centered at the array center. For an assumed sound source at polar coordinates (θ_h, φ_h), its delays on the array are given by the steering delay vector τ(P_h) defined above. Given the received signals y_m(t), m = 1, 2, …, M, the DSB output of channel h is calculated as follows:

z_h(t) = Σ_{m=1}^{M} y_m(t + τ_m(P_h))

The effect of DSB is evident: if P_h coincides with source i, then x_i(t) accumulates coherently in the array output.
A slight complication arises in DSB implementations using discrete samples. To handle fractional delays that are not multiples of the sampling interval, one approach is to interpolate the received waveforms. However, at a sampling frequency of 48 kHz, it is sufficient to round the resulting delay to the nearest sampling interval without significant impairment of the final performance. As a back-of-the-envelope check, an arrival-time difference of 1/48000 second corresponds approximately to a DOA difference of 0.0071 rad, or 0.41°, for sound sources located on the same side of a linear array. The steps of the DSB-based spatial mapping are as follows:
Select H virtual sound source positions in the monitored area, denoted P_1, P_2, …, P_H, with the points uniformly distributed on a unit sphere.
For each virtual source P_h at (θ_h, φ_h), calculate its array delay vector τ(P_h).
Given the array input y_m(k), m = 1, 2, …, M, k = 0, 1, …, T·f_s, calculate z_h(k) = Σ_{m=1}^{M} y_m(k + round(τ_m(P_h)·f_s)), where f_s is the sampling frequency, k is the sample index, T is the duration of each (windowed) segment, and fractional delays are rounded to the nearest sample.
The DSB-based spatial mapping thus outputs an H-channel time series (called H channels), one channel per virtual sound source position. Frequency-domain audio features may be further extracted from the data in each channel.
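A minimal sketch of these mapping steps is given below. The Fibonacci-sphere sampling of the virtual directions, the array layout, and the wrap-around behaviour of np.roll at segment edges are simplifying assumptions; fractional delays are rounded to the nearest sample as described in the text.

```python
import numpy as np

def fibonacci_sphere(h):
    """H roughly uniform directions on the unit sphere (one common choice;
    the text only requires the points to be uniformly distributed)."""
    idx = np.arange(h) + 0.5
    phi = np.arccos(1.0 - 2.0 * idx / h)            # polar angle
    theta = np.pi * (1.0 + 5 ** 0.5) * idx          # golden-angle azimuth
    return np.stack([np.sin(phi) * np.cos(theta),
                     np.sin(phi) * np.sin(theta),
                     np.cos(phi)], axis=1)          # shape (H, 3)

def dsb_spatial_map(y, mic_positions, fs=48000, h=16, c=343.0):
    """Map an (M, T) multi-channel signal to an (H, T) DSB representation."""
    directions = fibonacci_sphere(h)
    m, t = y.shape
    out = np.zeros((h, t), dtype=y.dtype)
    for hi, u in enumerate(directions):
        # Integer sample delays for each microphone toward virtual source P_h.
        delays = np.rint(-(mic_positions @ u) / c * fs).astype(int)
        for mi in range(m):
            # np.roll wraps at the edges, an acceptable approximation for short windows.
            out[hi] += np.roll(y[mi], -delays[mi])
    return out
```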
(1) Amplitude and phase spectrograms: a discrete short-time Fourier transform (STFT) may be applied to the windowed data to calculate the amplitude and phase spectra. The discrete STFT of the data in channel h is given by

Y_h(k, ω) = Σ_n w(n) · z_h(kR + n) · e^{−jωn}

where w(·) is a window function, k is the frame index, and R is the hop between frames. In the implementation, the present invention uses a Hamming window of length 2048.
(2) Log-Mel spectrogram: the Mel spectrogram is obtained by passing the incoming signal Y_h(k, ω) through a set of filter banks simulating the human auditory front end. The logarithmic Mel spectrogram is then calculated by taking the logarithm of the Mel spectrogram. Specifically, the l-th log-Mel coefficient of channel h is

LM_h(k, l) = log( Σ_ω |Y_h(k, ω)|² · W_mel(ω, l) )

where W_mel(ω, l) is a Mel filter bank.
(3) Intensity vector (IV): the intensity vector of the multi-channel audio signal is obtained by computing the correlation between the signal in each channel and a reference channel. For the multi-channel signal after DSB, the received signals are accumulated directly without delay to obtain the reference channel audio. For any channel h, the initial intensity vector of the audio after DSB is:

I_h(k, ω) = Re{ Y_ref*(k, ω) · Y_h(k, ω) }

where Y_h(k, ω) is the STFT feature of channel h, Y_ref(k, ω) is the STFT feature of the reference audio, (·)* denotes the element-wise complex conjugate, and Re{·} returns the element-wise real part. The extracted intensity vector features may be further normalized by dividing each element by the square root of the sum of the squares of the audio after all DSBs. The square root of the sum of the squares of the audio after DSB is:

‖Y(k, ω)‖ = √( Σ_{h=1}^{H} |Y_h(k, ω)|² )

Finally, the intensity vector features are transformed into the Mel domain by multiplication with a Mel filter bank. After DSB, the audio intensity vector of any channel h is:

IV_h(k, l) = Σ_ω ( I_h(k, ω) / ‖Y(k, ω)‖ ) · W_mel(ω, l)

where W_mel(ω, l) is a Mel filter bank.
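The intensity-vector computation can be sketched as follows: the reference channel is formed by delay-free accumulation of the raw microphones, the normalizer is taken as the square root of the summed squared magnitudes over all DSB channels, and the result is projected through a Mel filter bank, as described above. The use of librosa and the parameter values are assumptions.

```python
import numpy as np
import librosa

def intensity_vectors(y_dsb, y_raw, fs=48000, n_fft=2048, hop=1024, n_mels=64):
    """Mel-domain intensity-vector features from DSB channels (a sketch)."""
    ref = y_raw.sum(axis=0)                                    # delay-free accumulation
    Y_ref = librosa.stft(ref, n_fft=n_fft, hop_length=hop, window="hamming")
    Y = np.stack([librosa.stft(ch, n_fft=n_fft, hop_length=hop, window="hamming")
                  for ch in y_dsb])                            # (H, freq, frames)
    iv = np.real(np.conj(Y_ref)[None] * Y)                     # initial intensity vectors
    norm = np.sqrt((np.abs(Y) ** 2).sum(axis=0)) + 1e-10       # sqrt of sum of squares over channels
    iv = iv / norm[None]
    mel_fb = librosa.filters.mel(sr=fs, n_fft=n_fft, n_mels=n_mels)
    return np.einsum("lf,hft->hlt", mel_fb, iv)                # project to the Mel domain
```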
The LogMel and IV features may be extracted after DSB of the multi-channel audio. The extracted features require global normalization, and the normalized data are then used to train and test the deep learning model. Experimental results show that, for data sets of the same scale and models of the same size, the DSB and IV features effectively reduce the DOA error.
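One possible form of the global normalization step is sketched below, assuming the features are stacked as (clips, feature channels, time, mel) and that the statistics are computed on the training set only; the exact layout is not specified in the text.

```python
import numpy as np

def global_normalize(train_feats, test_feats, eps=1e-8):
    """Global normalization: per-feature-channel statistics are computed once
    over the whole training set and re-used unchanged for the test set."""
    mu = train_feats.mean(axis=(0, 2, 3), keepdims=True)
    sd = train_feats.std(axis=(0, 2, 3), keepdims=True) + eps
    return (train_feats - mu) / sd, (test_feats - mu) / sd
```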
DSB edge end acceleration algorithm
In order for DoA estimation to run on an embedded device in real time or near real time, the present invention uses the NVIDIA Jetson Nano GPU to accelerate the spatial mapping and feature extraction, and proposes a redundancy-removal algorithm.
First, to speed up computation on the Nano device, the present invention computes the delay-and-sum, STFT and LogMel features on its GPU using the CuPy library. CuPy is an open-source array library for GPU-accelerated computation. Second, in the delay-and-sum, when the summands are delayed by the same amount of time, the resulting sum is also delayed by that amount (assuming negligible attenuation differences). Consider the signals y_1(t) and y_2(t) received by two microphones located at q_1 and q_2. For any position P_h on the intersection of the unit sphere with the hyperbola satisfying

τ_1(P_h) − τ_2(P_h) = constant

the partial delay sum has the form

y_1(t + τ_1(P_h)) + y_2(t + τ_2(P_h))
For different P_h on the same hyperbola, this partial sum differs only by an overall time shift. Thus, the present invention can store and reuse these partial sums at different positions to further accelerate the computation of the DSB.
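A minimal CuPy sketch of the GPU delay-and-sum is given below; the integer sample delays are assumed to be precomputed on the host, and the explicit loops are kept for clarity rather than performance.

```python
import cupy as cp  # open-source array library for GPU-accelerated computation

def dsb_gpu(y, delays_samples):
    """Delay-and-sum on the GPU. y: (M, T) microphone samples;
    delays_samples: (H, M) integer delays already rounded to the nearest sample.
    Returns the (H, T) DSB representation on the host."""
    y_gpu = cp.asarray(y)
    h, m = delays_samples.shape
    out = cp.zeros((h, y_gpu.shape[1]), dtype=y_gpu.dtype)
    for hi in range(h):
        for mi in range(m):
            # cp.roll wraps at the segment edges; acceptable for short windows.
            out[hi] += cp.roll(y_gpu[mi], -int(delays_samples[hi, mi]))
    return cp.asnumpy(out)
```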
For example, assume that there are four microphones and two points P_i and P_j (1 ≤ i ≤ j ≤ H) in space. The delay calculation gives the delays of the two points as [Δ_{i1}, Δ_{i2}, Δ_{i3}, Δ_{i4}] and [Δ_{j1}, Δ_{j2}, Δ_{j3}, Δ_{j4}], respectively. Redundancy exists when:
(1) If […, Δ_{ix}, …, Δ_{iy}, …] = […, Δ_{jx}, …, Δ_{jy}, …], the delay sum of the x-th and y-th microphones at point P_i is calculated and stored in memory, and then the delayed results of the other microphones are added. When calculating point P_j, the stored sum of the x-th and y-th microphones is called directly, and then the delayed results of the other microphones are added.
(2) If […, Δ_{ix}, …, Δ_{iy}, …, Δ_{iz}, …] = […, Δ_{jx}, …, Δ_{jy}, …, Δ_{jz}, …], the calculation process is similar to (1); the partial sums of (1) are computed first in the overall process, and (2) can also call the results of (1) directly.
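The partial-sum reuse can be sketched as follows. For simplicity the example caches only the partial sum of the first microphone pair, whereas a full implementation would detect all matching delay pairs and triples, as in cases (1) and (2) above.

```python
import numpy as np

def dsb_with_partial_sum_reuse(y, delays_samples):
    """Delay-and-sum with reuse of two-microphone partial sums (a sketch)."""
    h, m = delays_samples.shape
    t = y.shape[1]
    out = np.zeros((h, t), dtype=y.dtype)
    cache = {}  # key: ((mic_x, mic_y), (delay_x, delay_y)) -> cached partial sum
    for hi in range(h):
        d = delays_samples[hi]
        # Cache only the first microphone pair here; a full implementation
        # would search all pairs/triples with matching delays.
        key = ((0, 1), (int(d[0]), int(d[1])))
        if key not in cache:
            cache[key] = np.roll(y[0], -int(d[0])) + np.roll(y[1], -int(d[1]))
        partial = cache[key]
        rest = sum(np.roll(y[mi], -int(d[mi])) for mi in range(2, m))
        out[hi] = partial + rest
    return out
```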
Third embodiment:
the difference between the third embodiment and the second embodiment of the present application is only that:
the present invention uses microphones to collect multi-channel audio data. The present invention then applies DSB algorithms to beamform these multi-channel audio data to improve the accuracy of sound source localization. In order to accelerate the speed of beam forming, the invention introduces a DSB edge acceleration algorithm, which effectively removes redundant calculation and optimizes the calculation efficiency. Next, the present invention performs feature extraction on the DSB-processed audio data, including calculation of LogMel features and IV features. The feature extraction process utilizes the edge GPU to ensure real-time performance, and is particularly suitable for embedded equipment such as Nano. Finally, the present invention uses these extracted features to train a CRNN model for estimation of the direction of the sound source. Compared with a method without using DSB, the novel method can remarkably reduce the DOA error, thereby improving the accuracy of sound source positioning. More importantly, the method has the capability of running the edge end in real time, and is suitable for audio processing scenes needing quick response, such as fault detection in industrial environment. The comprehensive method realizes high-efficiency and high-quality sound source localization by optimizing pretreatment, feature extraction and model training of multi-sound source localization.
Fourth embodiment:
the fourth embodiment of the present application differs from the third embodiment only in that:
the present invention provides a computer-readable storage medium having stored thereon a computer program for execution by a processor for implementing a strain sensor fault diagnosis method for a wind tunnel strain balance.
Fifth embodiment:
the fifth embodiment differs from the fourth embodiment only in that:
The invention provides a computer device comprising a memory and a processor, wherein the memory stores a computer program and the processor, when executing the computer program, implements the feature preprocessing and extraction method for multi-channel audio localization.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or N embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction. Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "N" means at least two, for example, two, three, etc., unless specifically defined otherwise. Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more N executable instructions for implementing specific logical functions or steps of the process, and further implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the embodiments of the present invention. Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., a ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or N wires, a portable computer cartridge (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). 
In addition, the computer readable medium may even be paper or other suitable medium on which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory. It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the N steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. As with the other embodiments, if implemented in hardware, may be implemented using any one or combination of the following techniques, as is well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application specific integrated circuits having suitable combinational logic gates, programmable Gate Arrays (PGAs), field Programmable Gate Arrays (FPGAs), and the like.
The above description is only a preferred implementation manner of the feature preprocessing and extracting method for multi-channel audio positioning, and the protection scope of the feature preprocessing and extracting method for multi-channel audio positioning is not limited to the above embodiments, and all technical solutions under the concept belong to the protection scope of the present invention. It should be noted that modifications and variations can be made by those skilled in the art without departing from the principles of the present invention, which is also considered to be within the scope of the present invention.

Claims (10)

1. A characteristic preprocessing and extracting method for multi-channel audio positioning is characterized in that: the method comprises the following steps:
step 1: extracting LogMel features and IV features from the DSB features, performing global normalization on the extracted features, and training and testing a deep learning model on the normalized data;
step 2: performing acceleration using a DSB edge-side acceleration method, and using the extracted features to train a CRNN model for estimating the direction of the sound source.
2. The method according to claim 1, characterized in that: the step 1 specifically comprises the following steps:
Let P_1, P_2, …, P_H be the set of H positions on a unit sphere centered at the array center; for an assumed sound source at polar coordinates (θ_h, φ_h), whose delays on the array are τ(P_h), and given the received signals y_m(t), m = 1, 2, …, M, the DSB output of the h-th channel is represented by the following formula:

z_h(t) = Σ_{m=1}^{M} y_m(t + τ_m(P_h))
3. the method according to claim 2, characterized in that:
when P_h coincides with source i, x_i(t) accumulates coherently in the array output.
4. A method according to claim 3, characterized in that
The DSB-based spatial mapping is as follows:
selecting H virtual sound source positions in the monitored area, denoted P_1, P_2, …, P_H, the points being uniformly distributed on a unit sphere;
for each virtual source P_h at (θ_h, φ_h), calculating its array delay vector τ(P_h);
given the array input y_m(k), m = 1, 2, …, M, k = 0, 1, …, T·f_s, calculating z_h(k) = Σ_{m=1}^{M} y_m(k + round(τ_m(P_h)·f_s)), where f_s is the sampling frequency, k is the sample index, T is the duration of each (windowed) segment, and fractional delays are rounded to the nearest sample;
the DSB-based spatial mapping thus outputs an H-channel time series (called H channels), one channel per virtual sound source position, and frequency-domain audio features can be further extracted from the data in each channel.
5. The method according to claim 4, characterized in that:
determining the amplitude and phase spectra by applying a discrete short-time Fourier transform, STFT, to the windowed data, the discrete STFT of the data in channel h being expressed by:

Y_h(k, ω) = Σ_n w(n) · z_h(kR + n) · e^{−jωn}

where w(·) is a window function, k is the frame index, and R is the hop between frames; in the implementation, a Hamming window of length 2048 is used;
determining the Log-Mel spectrogram, wherein the Mel spectrogram is obtained by passing the incoming signal Y_h(k, ω) through a set of filter banks simulating the human auditory front end, the log-Mel spectrum being calculated by taking the logarithm of the Mel spectrum, and the l-th log-Mel coefficient of channel h being:

LM_h(k, l) = log( Σ_ω |Y_h(k, ω)|² · W_mel(ω, l) )

where W_mel(ω, l) is a Mel filter bank;
determining the intensity vector IV, wherein the intensity vector of the multi-channel audio signal is obtained by computing the correlation between the signal in each channel and a reference channel; for the multi-channel signal after DSB, the received signals are accumulated directly without delay to obtain the reference channel audio, and for any channel h, the initial intensity vector of the audio after DSB is represented by the following formula:

I_h(k, ω) = Re{ Y_ref*(k, ω) · Y_h(k, ω) }

where Y_h(k, ω) is the STFT feature of channel h, Y_ref(k, ω) is the STFT feature of the reference audio, (·)* denotes the element-wise complex conjugate, and Re{·} returns the element-wise real part; the extracted intensity vector features may be further normalized by dividing each element by the square root of the sum of the squares of the audio after all DSBs; the square root of the sum of the squares of the audio after DSB is:

‖Y(k, ω)‖ = √( Σ_{h=1}^{H} |Y_h(k, ω)|² )

the intensity vector features are transformed into the Mel domain by multiplication with a Mel filter bank, and after DSB, the audio intensity vector of any channel h is:

IV_h(k, l) = Σ_ω ( I_h(k, ω) / ‖Y(k, ω)‖ ) · W_mel(ω, l)

where W_mel(ω, l) is a Mel filter bank.
6. The method according to claim 5, characterized in that: the step 2 specifically comprises the following steps:
calculating the delay-and-sum, STFT and LogMel features on the GPU using the CuPy library; in the delay-and-sum, when the summands are delayed by the same amount of time, the resulting sum is also delayed by that amount; considering the signals y_1(t) and y_2(t) received by the two microphones located at q_1 and q_2, for any position P_h on the intersection of the unit sphere with the hyperbola satisfying τ_1(P_h) − τ_2(P_h) = constant, the partial delay sum takes the form:

y_1(t + τ_1(P_h)) + y_2(t + τ_2(P_h))

for different P_h on the same hyperbola, this partial sum differs only by a time shift; these partial sums are stored and reused at different positions to further accelerate the computation of the DSB.
7. The method according to claim 6, characterized in that:
when there are four microphones and two points P_i and P_j (1 ≤ i ≤ j ≤ H) in space, and the delay calculation gives the delays of the two points as [Δ_{i1}, Δ_{i2}, Δ_{i3}, Δ_{i4}] and [Δ_{j1}, Δ_{j2}, Δ_{j3}, Δ_{j4}], redundancy exists when:
case 1: when […, Δ_{ix}, …, Δ_{iy}, …] = […, Δ_{jx}, …, Δ_{jy}, …], the delay sum of the x-th and y-th microphones at point P_i is calculated and stored in memory, and then the delayed results of the other microphones are added; when calculating point P_j, the stored sum of the x-th and y-th microphones is called directly, and then the delayed results of the other microphones are added;
case 2: when […, Δ_{ix}, …, Δ_{iy}, …, Δ_{iz}, …] = […, Δ_{jx}, …, Δ_{jy}, …, Δ_{jz}, …], the calculation process is similar to case 1; case 1 is computed first in the overall calculation process, and case 2 can also call the results of case 1 directly.
8. A characteristic preprocessing and extracting system for multi-channel audio positioning is characterized in that: the system comprises:
the feature extraction module is used for extracting the LogMel features and the IV features from the DSB features, global normalization is carried out on the extracted features, and training and testing are carried out on the normalized data on deep learning;
and the sound source estimation module performs acceleration processing through a DSB edge end acceleration method, and uses the extracted features to train a CRNN model for estimating the direction of the sound source.
9. A computer readable storage medium having stored thereon a computer program, characterized in that the program is executed by a processor for implementing the method according to any of claims 1-7.
10. A computer device comprising a memory and a processor, the memory storing a computer program, characterized by: the processor, when executing the computer program, implements the method of any of claims 1-7.
CN202311601140.XA 2023-11-28 2023-11-28 Feature preprocessing and extracting method for multi-channel audio positioning Pending CN117630818A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311601140.XA CN117630818A (en) 2023-11-28 2023-11-28 Feature preprocessing and extracting method for multi-channel audio positioning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311601140.XA CN117630818A (en) 2023-11-28 2023-11-28 Feature preprocessing and extracting method for multi-channel audio positioning

Publications (1)

Publication Number Publication Date
CN117630818A true CN117630818A (en) 2024-03-01

Family

ID=90029811

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311601140.XA Pending CN117630818A (en) 2023-11-28 2023-11-28 Feature preprocessing and extracting method for multi-channel audio positioning

Country Status (1)

Country Link
CN (1) CN117630818A (en)

Similar Documents

Publication Publication Date Title
Vecchiotti et al. End-to-end binaural sound localisation from the raw waveform
Yegnanarayana et al. Processing of reverberant speech for time-delay estimation
Pavlidi et al. 3D localization of multiple sound sources with intensity vector estimates in single source zones
Moore et al. Direction of arrival estimation using pseudo-intensity vectors with direct-path dominance test
MX2014006499A (en) Apparatus and method for microphone positioning based on a spatial power density.
JP6225245B2 (en) Signal processing apparatus, method and program
CN109188362A (en) A kind of microphone array auditory localization signal processing method
Di Carlo et al. Mirage: 2d source localization using microphone pair augmentation with echoes
KR20210137146A (en) Speech augmentation using clustering of queues
Bologni et al. Acoustic reflectors localization from stereo recordings using neural networks
Traa et al. Blind multi-channel source separation by circular-linear statistical modeling of phase differences
Dumortier et al. Blind RT60 estimation robust across room sizes and source distances
Pfeifenberger et al. Multi-channel speech processing architectures for noise robust speech recognition: 3rd CHiME challenge results
Hu et al. Decoupled direction-of-arrival estimations using relative harmonic coefficients
Hu et al. Closed-form single source direction-of-arrival estimator using first-order relative harmonic coefficients
KR20090128221A (en) Method for sound source localization and system thereof
Deppisch et al. Spatial subtraction of reflections from room impulse responses measured with a spherical microphone array
Maazaoui et al. Adaptive blind source separation with HRTFs beamforming preprocessing
Hu et al. Evaluation and comparison of three source direction-of-arrival estimators using relative harmonic coefficients
CN117630818A (en) Feature preprocessing and extracting method for multi-channel audio positioning
Nakano et al. Automatic estimation of position and orientation of an acoustic source by a microphone array network
Firoozabadi et al. Combination of nested microphone array and subband processing for multiple simultaneous speaker localization
Dwivedi et al. Far-field source localization in spherical harmonics domain using acoustic intensity vector
Peng et al. Sound Source Localization Based on Convolutional Neural Network
SongGong et al. Multi-Speaker Localization in the Circular Harmonic Domain on Small Aperture Microphone Arrays Using Deep Convolutional Networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination