CN116259313A - Sound event positioning and detecting method based on time domain convolution network - Google Patents

Sound event positioning and detecting method based on time domain convolution network

Info

Publication number
CN116259313A
CN116259313A
Authority
CN
China
Prior art keywords
time domain
tcn
network
tasks
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310245354.1A
Other languages
Chinese (zh)
Inventor
刘一欣
王玫
杨松铭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin University of Technology
Original Assignee
Guilin University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin University of Technology filed Critical Guilin University of Technology
Priority to CN202310245354.1A priority Critical patent/CN116259313A/en
Publication of CN116259313A publication Critical patent/CN116259313A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/16 - Speech classification or search using artificial neural networks
    • G10L 15/06 - Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G10L 2015/0631 - Creating reference templates; clustering
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/03 - Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/18 - The extracted parameters being spectral information of each sub-band
    • G10L 25/78 - Detection of presence or absence of voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Abstract

The invention provides a sound event localization and detection method based on a time domain convolutional network. The method uses TCN layers to strengthen the model's ability to capture long-term temporal and spatial information, performs sound event detection and sound event localization at the same time, reduces algorithmic complexity and computational cost, optimizes the loss functions of the detection and localization tasks with a joint training scheme, and improves the generalization ability and stability of the model.

Description

Sound event positioning and detecting method based on time domain convolution network
Technical Field
The invention relates to the field of audio signal processing, and in particular to a sound event localization and detection method based on a time domain convolutional network.
Background
Recognizing the categories of sound events in audio recordings and detecting when they occur is an active research topic, commonly referred to as sound event detection, with a wide range of applications. While sound event detection can reveal a great deal about the recording environment, the spatial location of each event brings additional valuable information to many applications. Sound source localization, on the other hand, is a classical multi-channel signal processing task that relies on the propagation characteristics of sound and the signal relationships between channels, irrespective of the type of sound emitted by the source. Sound event localization and detection systems aim to characterize sound scenes more completely in both space and time by combining sound event detection with sound source localization. The spatial dimension makes sound event localization and detection suitable for a wide range of machine listening tasks, such as inferring the type of environment, simultaneous localization and mapping by robots, navigation without visual input or with occluded targets, tracking of sound sources of interest, and audio surveillance. It can also support human-machine interaction, scene information visualization systems, scene-based service deployment, and hearing assistance devices.
Furthermore, with the development of artificial intelligence, more and more audio signal processing tasks, such as sound source localization and separation, speech recognition, and speech synthesis, can be accomplished with deep learning models. These applications all require accurate processing and analysis of the audio signal to extract useful information, and deep learning models perform well in such tasks, especially on large-scale data sets. However, training deep learning models requires substantial computational resources and time and involves many parameters to tune. In addition, deep learning models often need large amounts of labeled data to achieve good performance, which makes some applications harder to realize.
Disclosure of Invention
To address these problems, the invention provides a sound event localization and detection method based on a time domain convolutional network. The method uses TCN layers to strengthen the model's ability to capture long-term temporal and spatial information, performs sound event detection and sound event localization at the same time, reduces algorithmic complexity and computational cost, optimizes the loss functions of the detection and localization tasks with a joint training scheme, and improves the generalization ability and stability of the model.
The invention is realized by the following technical scheme:
a sound event localization and detection method based on a time domain convolutional network (SELD-TCN), comprising the following steps:
step one: collecting and preprocessing an acoustic event data set, collecting a group of multi-channel audio data sets, preprocessing the data sets, and converting an audio file into a Log-melplgram (Log-melplctrograms) and generalized cross correlation (GCC-PHAT) among all audio channels;
step two: feature extraction: the log-mel spectrograms and generalized cross-correlations obtained in step one are fed into a time domain convolutional network to extract the required features;
step three: audio event detection (SED): and (3) classifying the characteristics obtained in the step two by using a fully connected neural network to classify the characteristics at each moment, and outputting a binary classification label by the SED task on each time step to indicate whether a sound event exists in the time step or not so as to determine whether an audio event exists or not.
Step four: audio event localization (SEL): and (3) carrying out regression task on the feature obtained in the step two by using another feature at each moment, and outputting a quadruple by the SEL task on each time step to represent the position and duration of the sound event in the three-dimensional space.
Step five: multitasking learning: dividing the audio data into a training set, a verification set and a test set, constructing a time domain convolutional neural network to train the audio data, combining the loss functions of SED and SEL tasks, and optimizing by using a combined training method.
Step six: the model is evaluated using predefined evaluation criteria and compared to other methods to determine if its performance is sufficiently good.
In the first step, the collected data are stored in four-channel MIC format at a sampling frequency of 24 kHz; a four-channel spectrogram is computed with a 1024-point FFT, a 40 ms Hann window and a 20 ms hop length, and a log-mel spectrogram with 64 mel bands is extracted from it; for each frame, 6 generalized cross-correlation sequences are computed and truncated to the same number of lag values as mel bands; the 4-channel mel spectrograms are then stacked along the channel dimension with the corresponding spatial features to form the network input.
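The preprocessing above can be sketched as follows under the stated parameters (24 kHz sampling, 1024-point FFT, 40 ms Hann window, 20 ms hop, 64 mel bands, 6 GCC-PHAT pairs truncated to 64 lags). The use of librosa, the microphone pairing order and the centring of the lags are assumptions, so this is an illustrative sketch rather than the exact pipeline of the invention.

```python
import itertools
import numpy as np
import librosa

SR = 24000             # 24 kHz sampling rate
N_FFT = 1024           # 1024-point FFT
WIN = int(0.040 * SR)  # 40 ms Hann window -> 960 samples
HOP = int(0.020 * SR)  # 20 ms hop length  -> 480 samples
N_MELS = 64            # 64 mel bands

def logmel_gcc_features(audio):
    """audio: (4, num_samples) four-channel MIC-format recording."""
    specs, logmels = [], []
    for ch in audio:
        s = librosa.stft(ch, n_fft=N_FFT, hop_length=HOP,
                         win_length=WIN, window="hann")
        specs.append(s)
        mel = librosa.feature.melspectrogram(S=np.abs(s) ** 2,
                                             sr=SR, n_mels=N_MELS)
        logmels.append(librosa.power_to_db(mel))           # (64, T)

    # GCC-PHAT for each of the 6 microphone pairs, truncated to 64 lags
    # so that it matches the number of mel bands.
    gccs = []
    for i, j in itertools.combinations(range(4), 2):
        cross = specs[i] * np.conj(specs[j])
        cross /= np.abs(cross) + 1e-8                       # PHAT weighting
        cc = np.fft.irfft(cross, n=N_FFT, axis=0)           # (1024, T) lags
        cc = np.concatenate([cc[-N_MELS // 2:], cc[:N_MELS // 2]], axis=0)
        gccs.append(cc)                                     # (64, T)

    # Stack 4 log-mel maps and 6 GCC maps along the channel dimension.
    feats = np.stack(logmels + gccs, axis=0)                # (10, 64, T)
    return feats.transpose(0, 2, 1)                         # (10, T, 64)
```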
As a preferred embodiment, in step three a Softmax classifier is used for the classification in the SELD task. For each time frame, the features of all input frames are passed into the TCN module, and the probability of each class for that frame is then predicted by the Softmax classifier. To integrate temporal information into the classification, a sliding window is used to predict the class at each frame, where the window length equals the duration of one frame. Finally, for each frame, the position and duration of each event are computed from the predicted class probabilities.
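The following sketch illustrates one way the sliding-window classification described above could turn per-frame class probabilities into event classes, onsets and durations; the window length, activity threshold and hop duration are assumed values, and the smoothing strategy is a plausible reading of this paragraph rather than the patent's exact procedure.

```python
import numpy as np

def framewise_events(class_probs, win=5, threshold=0.5, hop_s=0.02):
    """class_probs: (T, C) per-frame class probabilities from the Softmax
    classifier. Returns a list of (class, onset_s, duration_s) tuples.
    Window length, threshold and hop duration are illustrative values."""
    T, C = class_probs.shape
    # A sliding-window average integrates temporal context into each frame.
    kernel = np.ones(win) / win
    smoothed = np.stack(
        [np.convolve(class_probs[:, c], kernel, mode="same") for c in range(C)],
        axis=1)
    active = smoothed > threshold                  # (T, C) frame-level activity

    events = []
    for c in range(C):
        t = 0
        while t < T:
            if active[t, c]:
                onset = t
                while t < T and active[t, c]:
                    t += 1
                events.append((c, onset * hop_s, (t - onset) * hop_s))
            else:
                t += 1
    return events
```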
In the fifth step, as a preferred embodiment, the network structure mainly comprises two parts: convolutional layers and TCN blocks. The convolutional layers extract local features from the input audio signal; both one-dimensional and two-dimensional convolutions are used to handle different types of input features, with one-dimensional convolutions mainly applied to channel features and two-dimensional convolutions mainly applied to time-frequency features. The output of the convolutional layers is fed into the TCN blocks for further feature extraction. Each TCN block consists of several time domain convolutional layers, batch normalization layers and residual connections: the time domain convolutional layers extract long-term dependencies in the signal, the batch normalization layers normalize the intermediate features, and the residual connections prevent vanishing gradients and speed up training. By stacking multiple TCN blocks, the model can extract features at different levels of abstraction and thus achieve better performance.
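A minimal PyTorch sketch of such a TCN block and a stack of blocks is given below. The channel width, kernel size and dilation schedule are assumptions (the description does not fix them), but the block follows the stated structure of time domain convolutions, batch normalization and a residual connection.

```python
import torch
import torch.nn as nn

class TCNBlock(nn.Module):
    """One time domain convolution block: dilated 1-D convolutions over the
    time axis, batch normalization, and a residual connection."""

    def __init__(self, channels=128, kernel_size=3, dilation=1):
        super().__init__()
        pad = (kernel_size - 1) // 2 * dilation        # keep the time length
        self.net = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size,
                      padding=pad, dilation=dilation),
            nn.BatchNorm1d(channels),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size,
                      padding=pad, dilation=dilation),
            nn.BatchNorm1d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):                              # x: (batch, channels, time)
        return self.relu(self.net(x) + x)              # residual connection

class TCNTrunk(nn.Module):
    """Stack of TCN blocks with growing dilation, so that features are
    extracted at several levels of temporal abstraction."""

    def __init__(self, channels=128, num_blocks=4):
        super().__init__()
        self.blocks = nn.Sequential(
            *[TCNBlock(channels, dilation=2 ** i) for i in range(num_blocks)])

    def forward(self, x):
        return self.blocks(x)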
As a preferred solution, in step five the model uses a multi-task learning framework to learn the SED and SEL tasks simultaneously. Specifically, two different output branches are used, one predicting the SED task and the other predicting the SEL task. Both branches share the same input features and learn a common feature representation through shared convolutional layers, while different output layers handle the SED classification and the SEL regression. During training, a multi-task loss function optimizes both tasks at once, so that the model can handle SED and SEL simultaneously.
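The two output branches and the joint loss can be sketched as follows. The sigmoid SED activation, the masking of the SEL regression term and the loss weight are assumed design choices consistent with, but not quoted from, the description above.

```python
import torch
import torch.nn as nn

class SELDTCNHeads(nn.Module):
    """Two output branches on top of a shared feature trunk (e.g. the CNN +
    TCN stack sketched above). Dimensions are illustrative assumptions."""

    def __init__(self, feat_dim=128, num_classes=14):
        super().__init__()
        self.sed_head = nn.Linear(feat_dim, num_classes)        # activity labels
        self.sel_head = nn.Linear(feat_dim, num_classes * 4)    # quadruples

    def forward(self, feats):                  # feats: (batch, time, feat_dim)
        sed = torch.sigmoid(self.sed_head(feats))               # (B, T, C)
        sel = self.sel_head(feats)                               # (B, T, C*4)
        return sed, sel.reshape(*sel.shape[:2], -1, 4)           # (B, T, C, 4)

def joint_loss(sed_pred, sel_pred, sed_true, sel_true, w_sel=1.0):
    """Combined SED + SEL objective for joint training. The SEL regression
    term is masked so that only frames where an event is actually active
    contribute (an assumed, commonly used design choice)."""
    sed_loss = nn.functional.binary_cross_entropy(sed_pred, sed_true)
    mask = sed_true.unsqueeze(-1)                                # (B, T, C, 1)
    sel_loss = ((sel_pred - sel_true) ** 2 * mask).sum() / mask.sum().clamp(min=1)
    return sed_loss + w_sel * sel_loss
```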
Compared with the prior art, the invention has the advantages that:
1. The invention provides a sound event localization and detection method based on a time domain convolutional network for accurately identifying and locating multiple sound sources in complex environments. The method processes the input audio signal with a temporal convolutional neural network, in which residual connections, dilated convolutions and other techniques are adopted to improve the performance of the model. The method can not only identify the category of each sound source but also estimate its position in space, thereby achieving localization and separation of multiple sound sources.
2. The method does not require complex feature engineering to extract useful information from the audio signal; instead, the time domain convolutional neural network learns the features automatically, which improves the performance of the model without increasing its complexity.
Drawings
The following are the main drawings of the method.
Fig. 1 is a schematic diagram of a data processing flow provided in the present invention.
Fig. 2 is a schematic diagram of the overall structure of the time domain convolution network according to the present invention.
Fig. 3 is a diagram of a time domain convolution block structure in accordance with the present invention.
Fig. 4 is a diagram of a residual block structure according to the present invention.
Detailed Description
The present invention is further illustrated by the following examples, which are only intended to illustrate the present invention and not to limit the scope of the present invention.
As shown in fig. 1, the present invention provides a sound event localization and detection method based on a time domain convolutional network, which comprises the following steps:
The audio data set is collected with the four-channel microphone array of a Kinect 2.0 device; the array is mounted on a tripod 1.2 m above the ground, fourteen types of sounds, such as an alarm clock, a crying baby, knocking, dog barking, footsteps and piano playing, are recorded indoors, and the category and position information of each sound is annotated.
(1) Preprocessing of audio data: the collected data are stored in four-channel MIC format at a sampling frequency of 24 kHz; a four-channel spectrogram is computed with a 1024-point FFT, a 40 ms Hann window and a 20 ms hop length, and a log-mel spectrogram with 64 mel bands is extracted from it; for each frame, 6 generalized cross-correlation sequences are computed and truncated to the same number of lag values as mel bands; the 4-channel mel spectrograms are then stacked along the channel dimension with the corresponding spatial features to provide the network input;
(2) Feature extraction of audio data: the spectrogram is fed into the network model for feature extraction; three CNN layers first learn features from the spectrogram, the features are then passed into the TCN blocks to capture their temporal relationships, and finally a fully connected network performs regression and classification on the features for detection and localization;
(3) Multi-task learning: the audio data are divided into a training set, a validation set and a test set, organized as 6 cross-validation folds of 100 recordings each: 400 recordings are used for training, 100 for validation and 100 for testing; a time domain convolutional neural network is built and trained on the audio data, the loss functions of the SED and SEL tasks are combined, and optimization is carried out with a joint training method;
(4) Prediction and model evaluation: the model is evaluated with predefined evaluation metrics and compared with other methods to determine whether its performance is good enough;
(5) Evaluation metrics: for sound event detection, the F-score and error rate (ER) computed on 1-second segments are used; for sound event localization, two frame-wise metrics are used, the localization error (DE) and the frame recall (FR).
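For reference, these evaluation metrics can be sketched as follows; the segment-based F-score and error rate follow the commonly used definitions, and the frame-wise localization error and frame recall use a simplified formulation, all of which are assumptions rather than formulas quoted from the patent.

```python
import numpy as np

def segment_metrics(ref, est):
    """Segment-based F-score and error rate (ER) on 1-second segments.
    ref and est are binary arrays of shape (num_segments, num_classes)."""
    tp = np.logical_and(ref == 1, est == 1).sum()
    fp = np.logical_and(ref == 0, est == 1).sum(axis=1)   # per segment
    fn = np.logical_and(ref == 1, est == 0).sum(axis=1)

    f_score = 2 * tp / (2 * tp + fp.sum() + fn.sum() + 1e-12)

    subs = np.minimum(fp, fn).sum()
    dels = np.maximum(0, fn - fp).sum()
    ins = np.maximum(0, fp - fn).sum()
    error_rate = (subs + dels + ins) / max(ref.sum(), 1)
    return f_score, error_rate

def localization_metrics(loc_err_per_frame, ref_active, est_active):
    """Frame-wise localization error (DE) and frame recall (FR): DE averages
    the localization error over frames where both the reference and the
    estimate report activity; FR is the fraction of reference-active frames
    for which the system also reports activity (simplified formulation)."""
    both = np.logical_and(ref_active, est_active)
    de = loc_err_per_frame[both].mean() if both.any() else float("nan")
    fr = both.sum() / max(ref_active.sum(), 1)
    return de, fr
```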
TABLE 1 Sound event localization and detection results of different models
(Table 1 is provided as an image in the original publication.)
TABLE 2 Performance analysis of different models
(Table 2 is provided as an image in the original publication.)
Table 1 gives a quantitative comparison between the proposed SELD-TCN and the baseline SELD model; the performance of SELD-TCN is clearly improved on the collected data set. As shown in Table 2, although the number of parameters increases, the model training time is greatly reduced. The method uses TCN layers to strengthen the model's ability to capture long-term temporal and spatial information, performs sound event detection and sound event localization at the same time, reduces algorithmic complexity and computational cost, and improves the generalization ability and stability of the model.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.

Claims (5)

1. A sound event localization and detection method based on a time domain convolutional network (SELD-TCN), comprising the steps of:
step one: collecting and preprocessing an acoustic event data set, collecting a group of multi-channel audio data sets, preprocessing the data sets, and converting an audio file into a Log-mel-map (Log-mel-map) and generalized cross-correlation (GCC-PHAT) among all audio channels;
step two: feature extraction: the log-mel spectrograms and generalized cross-correlations obtained in step one are fed into a time domain convolutional network to extract the required features;
step three: audio event detection (SED): classifying the characteristics obtained in the second step by using a fully connected neural network to classify the characteristics at each moment, and outputting a binary classification label by the SED task on each time step to indicate whether sound events exist in the time step or not so as to determine whether audio events exist or not;
step four: audio event localization (SEL): carrying out regression task on the feature obtained in the second step by using another feature at each moment, and outputting a quadruple by the SEL task on each time step to represent the position and duration of the sound event in the three-dimensional space;
step five: multitasking learning: dividing the audio data into a training set, a verification set and a test set, constructing a time domain convolutional neural network to train the audio data, combining the loss functions of SED and SEL tasks, and optimizing by using a combined training method;
step six: the model is evaluated using predefined evaluation criteria and compared to other methods to determine if its performance is sufficiently good.
2. The sound event localization and detection method based on a time domain convolutional network (SELD-TCN) according to claim 1, wherein in step one the audio data are preprocessed by storing the collected data in four-channel MIC format at a sampling frequency of 24 kHz; a four-channel spectrogram is computed with a 1024-point FFT, a 40 ms Hann window and a 20 ms hop length, and a log-mel spectrogram with 64 mel bands is extracted from it; for each frame, 6 generalized cross-correlation sequences are computed and truncated to the same number of lag values as mel bands; the 4-channel mel spectrograms are then stacked along the channel dimension with the corresponding spatial features to provide the network input.
3. The sound event localization and detection method based on a time domain convolutional network (SELD-TCN) according to claim 1, wherein in step three a Softmax classifier is used for the classification in the SELD task; for each time frame, the features of all input frames are passed into the TCN module, and the probability of each class for that frame is then predicted by the Softmax classifier; to integrate temporal information into the classification, a sliding window is used to predict the class at each frame, where the window length equals the duration of one frame; finally, for each frame, the position and duration of each event are computed from the predicted class probabilities.
4. The sound event localization and detection method based on a time domain convolutional network (SELD-TCN) according to claim 1, wherein in step five the network structure mainly comprises two parts: convolutional layers and TCN blocks; the convolutional layers extract local features from the input audio signal; both one-dimensional and two-dimensional convolutions are used to handle different types of input features, with one-dimensional convolutions mainly applied to channel features and two-dimensional convolutions mainly applied to time-frequency features; the output of the convolutional layers is fed into the TCN blocks for further feature extraction; each TCN block consists of several time domain convolutional layers, batch normalization layers and residual connections, wherein the time domain convolutional layers extract long-term dependencies in the signal, the batch normalization layers normalize the intermediate features, and the residual connections prevent vanishing gradients and speed up training; by stacking multiple TCN blocks, the model can extract features at different levels of abstraction and thus achieve better performance.
5. The sound event localization and detection method based on a time domain convolutional network (SELD-TCN) according to claim 1, wherein in step five the model uses a multi-task learning framework to learn the SED and SEL tasks simultaneously; specifically, two different output branches are used, one predicting the SED task and the other predicting the SEL task; both branches share the same input features and learn a common feature representation through shared convolutional layers, while different output layers handle the SED classification and the SEL regression; during training, a multi-task loss function optimizes both tasks at once, so that the model can handle SED and SEL simultaneously.
CN202310245354.1A 2023-03-14 2023-03-14 Sound event positioning and detecting method based on time domain convolution network Pending CN116259313A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310245354.1A CN116259313A (en) 2023-03-14 2023-03-14 Sound event positioning and detecting method based on time domain convolution network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310245354.1A CN116259313A (en) 2023-03-14 2023-03-14 Sound event positioning and detecting method based on time domain convolution network

Publications (1)

Publication Number Publication Date
CN116259313A true CN116259313A (en) 2023-06-13

Family

ID=86679117

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310245354.1A Pending CN116259313A (en) 2023-03-14 2023-03-14 Sound event positioning and detecting method based on time domain convolution network

Country Status (1)

Country Link
CN (1) CN116259313A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117238298A (en) * 2023-11-13 2023-12-15 四川师范大学 Method and system for identifying and positioning animals based on sound event
CN117238298B (en) * 2023-11-13 2024-02-06 四川师范大学 Method and system for identifying and positioning animals based on sound event


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination