CN116259313A - Sound event positioning and detecting method based on time domain convolution network - Google Patents

Sound event positioning and detecting method based on time domain convolution network

Info

Publication number
CN116259313A
CN116259313A
Authority
CN
China
Prior art keywords
time domain
tcn
network
tasks
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310245354.1A
Other languages
Chinese (zh)
Inventor
刘一欣
王玫
杨松铭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin University of Technology
Original Assignee
Guilin University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin University of Technology filed Critical Guilin University of Technology
Priority to CN202310245354.1A priority Critical patent/CN116259313A/en
Publication of CN116259313A publication Critical patent/CN116259313A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/16 - Speech classification or search using artificial neural networks
    • G10L 15/06 - Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G10L 2015/0631 - Creating reference templates; clustering
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/03 - Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/18 - The extracted parameters being spectral information of each sub-band
    • G10L 25/78 - Detection of presence or absence of voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Abstract

The invention provides a sound event localization and detection method based on a time domain convolutional network. The method uses TCN layers to strengthen the model's ability to capture long-term temporal and spatial information, performs sound event detection and sound event localization at the same time, reduces algorithmic complexity and computational cost, optimizes the loss functions of the detection and localization tasks with a joint training scheme, and improves the generalization ability and stability of the model.

Description

Sound event positioning and detecting method based on time domain convolution network
Technical Field
The invention relates to the field of audio signal processing, and in particular to a sound event localization and detection method based on a time domain convolutional network.
Background
Recognizing the categories of sound events in audio recordings and detecting when they occur is an active research topic, commonly referred to as sound event detection, with a wide range of applications. While sound event detection can reveal a great deal about the recording environment, the spatial location of each event brings additional valuable information to many applications. Sound source localization, on the other hand, is a classical multi-channel signal processing task that relies on the propagation characteristics of sound and the signal relationships between channels, irrespective of the type of sound emitted by the source. Sound event localization and detection systems aim to characterize sound scenes more completely in both space and time by combining sound event detection with sound source localization. The spatial dimension makes sound event localization and detection suitable for a wide range of machine listening tasks, such as inferring the type of environment, simultaneous localization and mapping by robots, navigation without visual input or with occluded targets, tracking of sound sources of interest, and audio surveillance. It can also support human-machine interaction, scene information visualization systems, scene-based service deployment, and hearing assistance devices.
Furthermore, with the development of artificial intelligence, more and more audio signal processing tasks, such as sound source localization and separation, speech recognition, and speech synthesis, can be accomplished with deep learning models. These applications all require accurate processing and analysis of the audio signal to extract useful information, and deep learning models perform well in such tasks, especially on large-scale data sets. However, training deep learning models requires substantial computational resources and time and involves many parameters to tune. In addition, deep learning models often need large amounts of labeled data to achieve good performance, which makes some applications harder to realize.
Disclosure of Invention
To address these problems, the invention provides a sound event localization and detection method based on a time domain convolutional network. The method uses TCN layers to strengthen the model's ability to capture long-term temporal and spatial information, performs sound event detection and sound event localization at the same time, reduces algorithmic complexity and computational cost, optimizes the loss functions of the detection and localization tasks with a joint training scheme, and improves the generalization ability and stability of the model.
The invention is realized by the following technical scheme:
a sound event localization and detection method based on a time domain convolutional network (SELD-TCN), comprising the following steps:
step one: collecting and preprocessing an acoustic event data set, collecting a group of multi-channel audio data sets, preprocessing the data sets, and converting an audio file into a Log-melplgram (Log-melplctrograms) and generalized cross correlation (GCC-PHAT) among all audio channels;
step two: feature extraction: the log-mel spectrograms and generalized cross-correlations obtained in step one are fed into a time domain convolutional network to extract the required features;
step three: audio event detection (SED): and (3) classifying the characteristics obtained in the step two by using a fully connected neural network to classify the characteristics at each moment, and outputting a binary classification label by the SED task on each time step to indicate whether a sound event exists in the time step or not so as to determine whether an audio event exists or not.
Step four: audio event localization (SEL): and (3) carrying out regression task on the feature obtained in the step two by using another feature at each moment, and outputting a quadruple by the SEL task on each time step to represent the position and duration of the sound event in the three-dimensional space.
Step five: multitasking learning: dividing the audio data into a training set, a verification set and a test set, constructing a time domain convolutional neural network to train the audio data, combining the loss functions of SED and SEL tasks, and optimizing by using a combined training method.
Step six: the model is evaluated using predefined evaluation criteria and compared to other methods to determine if its performance is sufficiently good.
In the first step, the collected data are stored in four-channel MIC format at a sampling frequency of 24 kHz; a four-channel spectrogram is computed with a 1024-point FFT, a 40 ms Hann window and a 20 ms hop length, and a log-mel spectrogram with 64 mel bands is extracted from it; for each frame, 6 generalized cross-correlation sequences are computed and truncated to the same number of lag values as mel bands; the 4-channel mel spectrograms are then stacked along the channel dimension with the corresponding spatial features to form the network input.
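The preprocessing above can be sketched as follows under the stated parameters (24 kHz sampling, 1024-point FFT, 40 ms Hann window, 20 ms hop, 64 mel bands, 6 GCC-PHAT pairs truncated to 64 lags). The use of librosa, the microphone pairing order and the centring of the lags are assumptions, so this is an illustrative sketch rather than the exact pipeline of the invention.

```python
import itertools
import numpy as np
import librosa

SR = 24000             # 24 kHz sampling rate
N_FFT = 1024           # 1024-point FFT
WIN = int(0.040 * SR)  # 40 ms Hann window -> 960 samples
HOP = int(0.020 * SR)  # 20 ms hop length  -> 480 samples
N_MELS = 64            # 64 mel bands

def logmel_gcc_features(audio):
    """audio: (4, num_samples) four-channel MIC-format recording."""
    specs, logmels = [], []
    for ch in audio:
        s = librosa.stft(ch, n_fft=N_FFT, hop_length=HOP,
                         win_length=WIN, window="hann")
        specs.append(s)
        mel = librosa.feature.melspectrogram(S=np.abs(s) ** 2,
                                             sr=SR, n_mels=N_MELS)
        logmels.append(librosa.power_to_db(mel))           # (64, T)

    # GCC-PHAT for each of the 6 microphone pairs, truncated to 64 lags
    # so that it matches the number of mel bands.
    gccs = []
    for i, j in itertools.combinations(range(4), 2):
        cross = specs[i] * np.conj(specs[j])
        cross /= np.abs(cross) + 1e-8                       # PHAT weighting
        cc = np.fft.irfft(cross, n=N_FFT, axis=0)           # (1024, T) lags
        cc = np.concatenate([cc[-N_MELS // 2:], cc[:N_MELS // 2]], axis=0)
        gccs.append(cc)                                     # (64, T)

    # Stack 4 log-mel maps and 6 GCC maps along the channel dimension.
    feats = np.stack(logmels + gccs, axis=0)                # (10, 64, T)
    return feats.transpose(0, 2, 1)                         # (10, T, 64)
```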
As a preferred embodiment, in step three a Softmax classifier is used for the classification in the SELD task. For each time frame, the features of all input frames are passed into the TCN module, and the probability of each class for that frame is then predicted by the Softmax classifier. To integrate temporal information into the classification, a sliding window is used to predict the class at each frame, where the window length equals the duration of one frame. Finally, for each frame, the position and duration of each event are computed from the predicted class probabilities.
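The following sketch illustrates one way the sliding-window classification described above could turn per-frame class probabilities into event classes, onsets and durations; the window length, activity threshold and hop duration are assumed values, and the smoothing strategy is a plausible reading of this paragraph rather than the patent's exact procedure.

```python
import numpy as np

def framewise_events(class_probs, win=5, threshold=0.5, hop_s=0.02):
    """class_probs: (T, C) per-frame class probabilities from the Softmax
    classifier. Returns a list of (class, onset_s, duration_s) tuples.
    Window length, threshold and hop duration are illustrative values."""
    T, C = class_probs.shape
    # A sliding-window average integrates temporal context into each frame.
    kernel = np.ones(win) / win
    smoothed = np.stack(
        [np.convolve(class_probs[:, c], kernel, mode="same") for c in range(C)],
        axis=1)
    active = smoothed > threshold                  # (T, C) frame-level activity

    events = []
    for c in range(C):
        t = 0
        while t < T:
            if active[t, c]:
                onset = t
                while t < T and active[t, c]:
                    t += 1
                events.append((c, onset * hop_s, (t - onset) * hop_s))
            else:
                t += 1
    return events
```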
In the fifth step, as a preferred embodiment, the network structure mainly comprises two parts: convolutional layers and TCN blocks. The convolutional layers extract local features from the input audio signal; both one-dimensional and two-dimensional convolutions are used to handle different types of input features, with one-dimensional convolutions mainly applied to channel features and two-dimensional convolutions mainly applied to time-frequency features. The output of the convolutional layers is fed into the TCN blocks for further feature extraction. Each TCN block consists of several time domain convolutional layers, batch normalization layers and residual connections: the time domain convolutional layers extract long-term dependencies in the signal, the batch normalization layers normalize the intermediate features, and the residual connections prevent vanishing gradients and speed up training. By stacking multiple TCN blocks, the model can extract features at different levels of abstraction and thus achieve better performance.
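A minimal PyTorch sketch of such a TCN block and a stack of blocks is given below. The channel width, kernel size and dilation schedule are assumptions (the description does not fix them), but the block follows the stated structure of time domain convolutions, batch normalization and a residual connection.

```python
import torch
import torch.nn as nn

class TCNBlock(nn.Module):
    """One time domain convolution block: dilated 1-D convolutions over the
    time axis, batch normalization, and a residual connection."""

    def __init__(self, channels=128, kernel_size=3, dilation=1):
        super().__init__()
        pad = (kernel_size - 1) // 2 * dilation        # keep the time length
        self.net = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size,
                      padding=pad, dilation=dilation),
            nn.BatchNorm1d(channels),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size,
                      padding=pad, dilation=dilation),
            nn.BatchNorm1d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):                              # x: (batch, channels, time)
        return self.relu(self.net(x) + x)              # residual connection

class TCNTrunk(nn.Module):
    """Stack of TCN blocks with growing dilation, so that features are
    extracted at several levels of temporal abstraction."""

    def __init__(self, channels=128, num_blocks=4):
        super().__init__()
        self.blocks = nn.Sequential(
            *[TCNBlock(channels, dilation=2 ** i) for i in range(num_blocks)])

    def forward(self, x):
        return self.blocks(x)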
As a preferred solution, in step five the model uses a multi-task learning framework to learn the SED and SEL tasks simultaneously. Specifically, two different output branches are used, one predicting the SED task and the other predicting the SEL task. Both branches share the same input features and learn a common feature representation through shared convolutional layers, while different output layers handle the SED classification and the SEL regression. During training, a multi-task loss function optimizes both tasks at once, so that the model can handle SED and SEL simultaneously.
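The two output branches and the joint loss can be sketched as follows. The sigmoid SED activation, the masking of the SEL regression term and the loss weight are assumed design choices consistent with, but not quoted from, the description above.

```python
import torch
import torch.nn as nn

class SELDTCNHeads(nn.Module):
    """Two output branches on top of a shared feature trunk (e.g. the CNN +
    TCN stack sketched above). Dimensions are illustrative assumptions."""

    def __init__(self, feat_dim=128, num_classes=14):
        super().__init__()
        self.sed_head = nn.Linear(feat_dim, num_classes)        # activity labels
        self.sel_head = nn.Linear(feat_dim, num_classes * 4)    # quadruples

    def forward(self, feats):                  # feats: (batch, time, feat_dim)
        sed = torch.sigmoid(self.sed_head(feats))               # (B, T, C)
        sel = self.sel_head(feats)                               # (B, T, C*4)
        return sed, sel.reshape(*sel.shape[:2], -1, 4)           # (B, T, C, 4)

def joint_loss(sed_pred, sel_pred, sed_true, sel_true, w_sel=1.0):
    """Combined SED + SEL objective for joint training. The SEL regression
    term is masked so that only frames where an event is actually active
    contribute (an assumed, commonly used design choice)."""
    sed_loss = nn.functional.binary_cross_entropy(sed_pred, sed_true)
    mask = sed_true.unsqueeze(-1)                                # (B, T, C, 1)
    sel_loss = ((sel_pred - sel_true) ** 2 * mask).sum() / mask.sum().clamp(min=1)
    return sed_loss + w_sel * sel_loss
```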
Compared with the prior art, the invention has the advantages that:
1. The invention provides a sound event localization and detection method based on a time domain convolutional network for accurately identifying and locating multiple sound sources in complex environments. The method processes the input audio signal with a temporal convolutional neural network, in which residual connections, dilated convolutions and other techniques are adopted to improve the performance of the model. The method can not only identify the category of each sound source but also estimate its position in space, thereby achieving localization and separation of multiple sound sources.
2. The method does not require complex feature engineering to extract useful information from the audio signal; instead, the time domain convolutional neural network learns the features automatically, which improves the performance of the model without increasing its complexity.
Drawings
The following are the main drawings of the method.
Fig. 1 is a schematic diagram of a data processing flow provided in the present invention.
Fig. 2 is a schematic diagram of the overall structure of the time domain convolution network according to the present invention.
Fig. 3 is a diagram of a time domain convolution block structure in accordance with the present invention.
Fig. 4 is a diagram of a residual block structure according to the present invention.
Detailed Description
The present invention is further illustrated by the following examples, which are only intended to illustrate the present invention and not to limit the scope of the present invention.
As shown in fig. 1, the present invention provides a sound event localization and detection method based on a time domain convolutional network, which comprises the following steps:
The audio data set is collected with the four-channel microphone array of a Kinect 2.0 device; the array is mounted on a tripod 1.2 m above the ground, fourteen types of sounds, such as an alarm clock, a crying baby, knocking, dog barking, footsteps and piano playing, are recorded indoors, and the category and position information of each sound is annotated.
(1) Preprocessing of audio data: the collected data are stored in four-channel MIC format at a sampling frequency of 24 kHz; a four-channel spectrogram is computed with a 1024-point FFT, a 40 ms Hann window and a 20 ms hop length, and a log-mel spectrogram with 64 mel bands is extracted from it; for each frame, 6 generalized cross-correlation sequences are computed and truncated to the same number of lag values as mel bands; the 4-channel mel spectrograms are then stacked along the channel dimension with the corresponding spatial features to provide the network input;
(2) Feature extraction of audio data: the spectrogram is fed into the network model for feature extraction; three CNN layers first learn features from the spectrogram, the features are then passed into the TCN blocks to capture their temporal relationships, and finally a fully connected network performs regression and classification on the features for detection and localization;
(3) Multi-task learning: the audio data are divided into a training set, a validation set and a test set, organized as 6 cross-validation folds of 100 recordings each: 400 recordings are used for training, 100 for validation and 100 for testing; a time domain convolutional neural network is built and trained on the audio data, the loss functions of the SED and SEL tasks are combined, and optimization is carried out with a joint training method;
(4) Prediction and model evaluation: the model is evaluated with predefined evaluation metrics and compared with other methods to determine whether its performance is good enough;
(5) Evaluation metrics: for sound event detection, the F-score and error rate (ER) computed on 1-second segments are used; for sound event localization, two frame-wise metrics are used, the localization error (DE) and the frame recall (FR).
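For reference, these evaluation metrics can be sketched as follows; the segment-based F-score and error rate follow the commonly used definitions, and the frame-wise localization error and frame recall use a simplified formulation, all of which are assumptions rather than formulas quoted from the patent.

```python
import numpy as np

def segment_metrics(ref, est):
    """Segment-based F-score and error rate (ER) on 1-second segments.
    ref and est are binary arrays of shape (num_segments, num_classes)."""
    tp = np.logical_and(ref == 1, est == 1).sum()
    fp = np.logical_and(ref == 0, est == 1).sum(axis=1)   # per segment
    fn = np.logical_and(ref == 1, est == 0).sum(axis=1)

    f_score = 2 * tp / (2 * tp + fp.sum() + fn.sum() + 1e-12)

    subs = np.minimum(fp, fn).sum()
    dels = np.maximum(0, fn - fp).sum()
    ins = np.maximum(0, fp - fn).sum()
    error_rate = (subs + dels + ins) / max(ref.sum(), 1)
    return f_score, error_rate

def localization_metrics(loc_err_per_frame, ref_active, est_active):
    """Frame-wise localization error (DE) and frame recall (FR): DE averages
    the localization error over frames where both the reference and the
    estimate report activity; FR is the fraction of reference-active frames
    for which the system also reports activity (simplified formulation)."""
    both = np.logical_and(ref_active, est_active)
    de = loc_err_per_frame[both].mean() if both.any() else float("nan")
    fr = both.sum() / max(ref_active.sum(), 1)
    return de, fr
```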
TABLE 1 Sound event localization and detection results of different models
(Table 1 is provided as an image in the original publication.)
TABLE 2 Performance analysis of different models
(Table 2 is provided as an image in the original publication.)
Table 1 gives a quantitative comparison between the proposed SELD-TCN and the baseline SELD model; the performance of SELD-TCN is clearly improved on the collected data set. As shown in Table 2, although the number of parameters increases, the model training time is greatly reduced. The method uses TCN layers to strengthen the model's ability to capture long-term temporal and spatial information, performs sound event detection and sound event localization at the same time, reduces algorithmic complexity and computational cost, and improves the generalization ability and stability of the model.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.

Claims (5)

1. A sound event localization and detection method based on a time domain convolutional network (SELD-TCN), comprising the steps of:
step one: collecting and preprocessing an acoustic event data set, collecting a group of multi-channel audio data sets, preprocessing the data sets, and converting an audio file into a Log-mel-map (Log-mel-map) and generalized cross-correlation (GCC-PHAT) among all audio channels;
step two: feature extraction: the log-mel spectrograms and generalized cross-correlations obtained in step one are fed into a time domain convolutional network to extract the required features;
step three: audio event detection (SED): classifying the characteristics obtained in the second step by using a fully connected neural network to classify the characteristics at each moment, and outputting a binary classification label by the SED task on each time step to indicate whether sound events exist in the time step or not so as to determine whether audio events exist or not;
step four: audio event localization (SEL): carrying out regression task on the feature obtained in the second step by using another feature at each moment, and outputting a quadruple by the SEL task on each time step to represent the position and duration of the sound event in the three-dimensional space;
step five: multitasking learning: dividing the audio data into a training set, a verification set and a test set, constructing a time domain convolutional neural network to train the audio data, combining the loss functions of SED and SEL tasks, and optimizing by using a combined training method;
step six: the model is evaluated using predefined evaluation criteria and compared to other methods to determine if its performance is sufficiently good.
2. The sound event localization and detection method based on a time domain convolutional network (SELD-TCN) according to claim 1, wherein in step one the audio data are preprocessed by storing the collected data in four-channel MIC format at a sampling frequency of 24 kHz; a four-channel spectrogram is computed with a 1024-point FFT, a 40 ms Hann window and a 20 ms hop length, and a log-mel spectrogram with 64 mel bands is extracted from it; for each frame, 6 generalized cross-correlation sequences are computed and truncated to the same number of lag values as mel bands; the 4-channel mel spectrograms are then stacked along the channel dimension with the corresponding spatial features to provide the network input.
3. The sound event localization and detection method based on a time domain convolutional network (SELD-TCN) according to claim 1, wherein in step three a Softmax classifier is used for the classification in the SELD task; for each time frame, the features of all input frames are passed into the TCN module, and the probability of each class for that frame is then predicted by the Softmax classifier; to integrate temporal information into the classification, a sliding window is used to predict the class at each frame, where the window length equals the duration of one frame; finally, for each frame, the position and duration of each event are computed from the predicted class probabilities.
4. The sound event localization and detection method based on a time domain convolutional network (SELD-TCN) according to claim 1, wherein in step five the network structure mainly comprises two parts: convolutional layers and TCN blocks; the convolutional layers extract local features from the input audio signal; both one-dimensional and two-dimensional convolutions are used to handle different types of input features, with one-dimensional convolutions mainly applied to channel features and two-dimensional convolutions mainly applied to time-frequency features; the output of the convolutional layers is fed into the TCN blocks for further feature extraction; each TCN block consists of several time domain convolutional layers, batch normalization layers and residual connections, wherein the time domain convolutional layers extract long-term dependencies in the signal, the batch normalization layers normalize the intermediate features, and the residual connections prevent vanishing gradients and speed up training; by stacking multiple TCN blocks, the model can extract features at different levels of abstraction and thus achieve better performance.
5. The sound event localization and detection method based on a time domain convolutional network (SELD-TCN) according to claim 1, wherein in step five the model uses a multi-task learning framework to learn the SED and SEL tasks simultaneously; specifically, two different output branches are used, one predicting the SED task and the other predicting the SEL task; both branches share the same input features and learn a common feature representation through shared convolutional layers, while different output layers handle the SED classification and the SEL regression; during training, a multi-task loss function optimizes both tasks at once, so that the model can handle SED and SEL simultaneously.
CN202310245354.1A 2023-03-14 2023-03-14 Sound event positioning and detecting method based on time domain convolution network Pending CN116259313A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310245354.1A CN116259313A (en) 2023-03-14 2023-03-14 Sound event positioning and detecting method based on time domain convolution network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310245354.1A CN116259313A (en) 2023-03-14 2023-03-14 Sound event positioning and detecting method based on time domain convolution network

Publications (1)

Publication Number Publication Date
CN116259313A true CN116259313A (en) 2023-06-13

Family

ID=86679117

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310245354.1A Pending CN116259313A (en) 2023-03-14 2023-03-14 Sound event positioning and detecting method based on time domain convolution network

Country Status (1)

Country Link
CN (1) CN116259313A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117238298A (en) * 2023-11-13 2023-12-15 四川师范大学 Method and system for identifying and positioning animals based on sound event
CN117238298B (en) * 2023-11-13 2024-02-06 四川师范大学 Method and system for identifying and positioning animals based on sound event


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination