CN113744758B - Sound event detection method based on 2-DenseGRUNet model - Google Patents
Sound event detection method based on 2-DenseGRUNet model
- Publication number: CN113744758B (application CN202111089655.7A)
- Authority: CN (China)
- Legal status: Active
Classifications
- G10L25/24 — speech or voice analysis, the extracted parameters being the cepstrum
- G06F18/241 — classification techniques relating to the classification model
- G06F18/253 — fusion techniques of extracted features
- G06N3/045 — combinations of networks
- G06N3/048 — activation functions
- G06N3/08 — learning methods
- G10L25/30 — speech or voice analysis using neural networks
- G10L25/48 — speech or voice analysis specially adapted for particular use
- G10L25/51 — speech or voice analysis for comparison or discrimination
Abstract
The sound event detection method based on the 2-DenseGRUNet model builds a sound event detection model from a 2nd-order DenseNet network with a gated recurrent unit (GRU) network appended to it. Compared with conventional convolutional and recurrent neural network models, the sound event detection model in this scheme combines the advantages of 2-DenseNet and the GRU: it fuses feature information more efficiently, obtains more effective feature information, and performs effective time-series modeling. When detecting urban sound events, the model achieves a lower average segment error rate and a higher F-Score, so the sound classification results of the method are more accurate.
Description
Technical Field
The invention relates to the technical field of sound detection, and in particular to a sound event detection method based on a 2-DenseGRUNet model.
Background
Sound carries a large amount of information about life scenes and physical events in a city; automatically extracting this information by intelligently sensing each sound source with deep learning methods has great potential and application prospects for building smart cities. In smart cities, sound event detection is an important basis for recognizing and semantically understanding environmental sound scenes. Research on urban sound event detection is mainly applied to environmental perception, factory equipment monitoring, urban security, autonomous driving, and similar areas. Prior-art urban sound event detection is mainly realized with MLP, CNN, or LSTM network models. However, when these three network models are evaluated with the F-Score, the harmonic mean of Precision and Recall, their scores are low because of their high average segment error rates, which limits their range of practical applications.
Disclosure of Invention
In order to solve the problem of the high average segment error rate in prior-art urban sound event detection, the invention provides a sound event detection method based on a 2-DenseGRUNet model that can extract more effective acoustic information when processing audio data and has better time-series modeling capability, so that the model achieves a lower average segment error rate and higher usability when detecting urban sound events.
The technical scheme of the invention is as follows: the sound event detection method based on the 2-DenseGRUNet model comprises the following steps:
S1: collecting the audio data to be processed, preprocessing the original audio signal of the audio data to be processed, and outputting an audio frame sequence;
the preprocessing operations include: sampling and quantization, pre-emphasis, and windowing;
S2: performing time-domain and frequency-domain analysis on the audio frame sequence, extracting Mel-frequency cepstral coefficients, and outputting an original feature vector sequence;
S3: reconstructing the feature information and labels of the original feature vector sequence, and outputting a reconstructed feature vector sequence after the reconstruction processing;
converting the start time, end time, and category of each sound event in the original feature vector sequence into the start frame, end frame, and event label corresponding to the reconstructed feature vector sequence;
S4: constructing a sound event detection model and training it iteratively to obtain a trained sound event detection model;
S5: after processing the reconstructed feature vector sequence, inputting it into the trained sound event detection model for recognition and detection to obtain the sound event detection result of the audio data to be processed;
the method is characterized in that:
the sound event detection model includes: an input layer, a 2nd-order DenseNet model, and GRU units; all GRU units are connected in series after the 2nd-order DenseNet model;
a convolution layer and a pooling layer are arranged in sequence between the input layer and the 2nd-order DenseNet model;
the 2nd-order DenseNet model comprises consecutive 2-DenseBlock structures, with a Transition layer structure arranged after each 2-DenseBlock structure;
in each 2-DenseBlock structure, feature layers are connected according to the correlation of a 2nd-order Markov model, the input of the current feature layer being related to the outputs of the preceding 2 feature layers; each Transition layer structure comprises a convolution layer and a pooling layer.
It is further characterized by:
the 2nd-order DenseNet model is connected in series with the GRU units in sequence, the 2nd-order DenseNet model being connected to the first GRU unit through a reshape layer; after the GRU units come, in sequence, a TimeDistributed layer, a fully connected layer, and an output layer;
each 2-DenseBlock comprises feature layers connected in sequence; each feature layer comprises a 1×1 convolution layer and a 3×3 convolution layer, the input data undergoing batch normalization and activation-function processing before entering the convolution layers for convolution; the last convolution layer in each feature layer is merged and cascaded with the next through concatenation; a dropout layer is added between the first and second feature layers in each 2-DenseBlock;
in the feature extraction of step S2, the extracted Mel-frequency cepstral coefficients have a specific dimension of 40 (40-dimensional MFCC); the original feature vector sequence obtained in step S2 is 2-dimensional, its first dimension being the number of frames after sampling the audio data to be processed and its second dimension being the MFCC dimension;
in step S3, when reconstructing the feature information and labels of the original feature vector sequence, each 1 s of data in the original feature vector sequence corresponds to 128 MFCC frames in the reconstructed feature vector sequence, and the original sound event label of the form [start time, end time] is reconstructed into [start frame number, end frame number];
in step S5, before the reconstructed feature vector sequence is input into the trained sound event detection model, the feature vectors must be converted from 2-dimensional to 3-dimensional, the added third dimension being the number of channels in the sound event detection model;
the feature input of each GRU unit is a two-dimensional feature vector;
the Transition Layer includes: a convolution layer with a 1×1 kernel and a max pooling layer with pooling size [2,2];
the number of neurons in the fully connected layer is set to 256 or 128;
the output layer is realized with a Sigmoid function.
Drawings
FIG. 1 is a flow chart of feature reconstruction in sound event detection according to the present invention;
FIG. 2 is a network block diagram of the 2-DenseGRUNet model of the present invention;
FIG. 3 is a schematic diagram of a 2-DenseBlock in a 2-DenseNet network according to the present invention;
- FIG. 4 is a schematic diagram of the gated recurrent unit (GRU) according to the present invention.
Detailed Description
As shown in figs. 1 to 4, the sound event detection method based on the 2-DenseGRUNet model of the present invention specifically includes the following steps.
S1: collecting the audio data to be processed, preprocessing the original audio signal of the audio data to be processed, and outputting an audio frame sequence;
the preprocessing operations comprise: sampling and quantization, pre-emphasis, and windowing.
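As an illustration of these preprocessing operations, the sketch below applies pre-emphasis, framing, and windowing in NumPy; the frame and hop sizes (2048/1024 samples) follow the embodiment described later, while the pre-emphasis coefficient 0.97 and the Hamming window are common defaults, not values specified by the patent:

```python
import numpy as np

def preprocess(signal, frame_len=2048, hop_len=1024, alpha=0.97):
    """Pre-emphasis, framing, and Hamming windowing of a raw audio signal."""
    # Pre-emphasis filter: y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Split into overlapping frames of frame_len samples, hop_len apart
    n_frames = 1 + (len(emphasized) - frame_len) // hop_len
    frames = np.stack([emphasized[i * hop_len: i * hop_len + frame_len]
                       for i in range(n_frames)])
    # Window each frame to reduce spectral leakage
    return frames * np.hamming(frame_len)

frames = preprocess(np.random.randn(44100))  # 1 s of audio at 44.1 kHz
print(frames.shape)  # (42, 2048)
```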
S2: performing time-domain and frequency-domain analysis on the audio frame sequence, extracting Mel-frequency cepstral coefficients, and outputting an original feature vector sequence;
the extracted Mel-frequency cepstral coefficients have a specific dimension of 40 (40-dimensional MFCC); the original feature vector sequence obtained in step S2 is 2-dimensional, the first dimension being the number of frames of the audio data to be processed after sampling and the second dimension being the MFCC dimension; in this embodiment the MFCC dimension is 40, so the second dimension is 40.
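For illustration, the standard Hz-to-mel mapping that underlies MFCC extraction, together with the resulting (frames, 40) feature shape, can be sketched as follows; the frame-count formula is a simplifying assumption, since the exact framing depends on the extractor used:

```python
import numpy as np

def hz_to_mel(f_hz):
    """Standard mel-scale mapping used when spacing the mel filters."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mfcc_shape(n_samples, hop_len=1024, n_mfcc=40):
    """Shape of the original feature vector sequence: (frames, n_mfcc)."""
    n_frames = 1 + n_samples // hop_len
    return (n_frames, n_mfcc)

print(hz_to_mel(700.0))    # about 781.2 mel
print(mfcc_shape(130048))  # (128, 40)
```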
S3: reconstructing the feature information and labels of the original feature vector sequence, and outputting a reconstructed feature vector sequence after the reconstruction processing;
converting the start time, end time, and category of each sound event in the original feature vector sequence into the start frame, end frame, and event label corresponding to the reconstructed feature vector sequence;
when reconstructing the feature information and labels of the original feature vector sequence, each 1 s of data in the original feature vector sequence corresponds to 128 MFCC frames in the reconstructed feature vector sequence, and the original sound event label of the form [start time, end time] is reconstructed into [start frame number, end frame number].
As shown in fig. 1, in step S3, the specific implementation process of reconstructing the feature information and the tag of the original feature vector sequence is as follows:
Suppose the original audio samples comprise files named a013, a010, a129, b099, b008, and b100, with durations of 3min44s, 3min30s, 4min0s, 4min0s, 3min30s, and 3min01s respectively. The audio sample files of different lengths are spliced along the time dimension to construct the total audio data, whose total duration T is:

T = 3min44s + 3min30s + 4min0s + 4min0s + 3min30s + 3min01s
Label information is then extracted from the total audio data; each file segment in the total audio data is annotated as [time_start, time_end, category], representing the [start time, end time, category] of the audio event in the corresponding original audio sample.
Features are extracted from the annotated total audio data with f = 44.1 kHz, nfft = 2048, win_len = 2048, hop_len = 1024, a frame duration of 0.0232 s, and segment_len = 128;
where f is the sampling frequency, nfft the length of the fast Fourier transform, win_len the number of sampling points per frame, hop_len the number of sampling points between two frames, and segment_len = 128 means that after feature extraction the whole audio of duration T is divided into segments of 128 frames each;
after feature reconstruction, the total audio data of length T contains a number of samples, and the label of the audio fragment corresponding to each sample is [frame_start, frame_end, one-hot],
representing for each segmented audio fragment its [start frame number, end frame number, one-hot class encoding];
finally, the reconstructed feature vector sequence is output after the reconstruction processing, i.e. it is represented by [frame_start, frame_end, one-hot].
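A minimal sketch of this label reconstruction, assuming frames are indexed by the hop length at the sampling rate of the embodiment (44.1 kHz, hop_len = 1024, i.e. roughly 0.0232 s per frame); the helper name and the truncation-based rounding are illustrative assumptions:

```python
import numpy as np

def reconstruct_label(time_start, time_end, category, n_classes=6,
                      sr=44100, hop_len=1024):
    """Convert a [start time, end time, category] annotation into
    [start frame, end frame, one-hot class encoding]."""
    frame_start = int(time_start * sr / hop_len)  # frame duration ~0.0232 s
    frame_end = int(time_end * sr / hop_len)
    one_hot = np.eye(n_classes, dtype=int)[category]
    return frame_start, frame_end, one_hot

fs, fe, oh = reconstruct_label(1.0, 2.5, category=3)
print(fs, fe, list(oh))  # 43 107 [0, 0, 0, 1, 0, 0]
```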
S4: constructing a sound event detection model and training it iteratively to obtain a trained sound event detection model.
The network model built on a 2nd-order DenseNet model combined with the characteristics of the gated recurrent unit (GRU) model is the sound event detection model, called the 2-DenseGRUNet model for short. The sound event detection model uses the 2nd-order DenseNet model as the feature extraction network at the front end, with GRU units, which have good time-series modeling capability, connected in series at the end of the network, so that the feature information of sound events can be fused efficiently, more effective feature information can be obtained, time-series modeling is performed effectively, and the average segment error rate of sound segments is reduced.
The sound event detection model includes: an input layer, a 2nd-order DenseNet model, and GRU units; all GRU units are connected in series after the 2nd-order DenseNet model;
a convolution layer and a pooling layer are arranged in sequence between the input layer and the 2nd-order DenseNet model;
the 2nd-order DenseNet model includes m consecutive 2-DenseBlock structures, where m is a natural number greater than or equal to 1;
a Transition layer structure is arranged after each 2-DenseBlock structure;
the 2nd-order DenseNet model is connected in series with the consecutive GRU units, the 2nd-order DenseNet model being connected to the first GRU unit through a reshape layer; after the GRU units come, in sequence, a TimeDistributed layer, a fully connected layer, and an output layer; the feature input of each GRU unit is a two-dimensional feature vector; the output layer is realized with a Sigmoid function; the Transition Layer includes a convolution layer with a 1×1 kernel and a max pooling layer with pooling size [2,2]. The number of neurons in the fully connected layer is set to 256 or 128, the network parameters are tuned and over-fitting is suppressed, and the final detection result is output after Sigmoid processing.
The structure of a feature layer in 2-DenseBlock is shown in Table 1 below:

Table 1: structure of a feature layer in 2-DenseBlock
- Conv(1×1)
- BN(·)
- ReLU activation function
- Conv(3×3)
- BN(·)
- ReLU activation function
- Concat operation
- dropout layer
Each 2-DenseBlock structure comprises l feature layers connected in sequence;
as shown in Table 1, each feature layer includes 2 convolution layers; within a feature layer, after the input data are convolved by a convolution layer, batch normalization (BN) and ReLU activation are further applied;
the last convolution layer in each feature layer is merged and cascaded with the next through a Concat operation; a dropout layer between the first and second feature layers in each 2-DenseBlock applies mild over-fitting suppression, which facilitates later tuning of the network model.
DenseBlock is the basic structure of a DenseNet model, and the DenseBlocks in the 2nd-order DenseNet are likewise of second order; that is, in each 2-DenseBlock structure, feature layers are connected according to the correlation of a 2nd-order Markov model, the input of the current feature layer being related to the outputs of the preceding 2 feature layers. Each Transition layer structure comprises a convolution layer and a pooling layer; the feature layers share weights in the Transition layer structure, maximizing temporal discrimination so that the start and end times of a sound event can be detected.
Each 2-DenseBlock comprises l feature layers connected in sequence, where l is a natural number greater than or equal to 1; each feature layer comprises a 1×1 convolution layer and a 3×3 convolution layer, the input data undergoing batch normalization and activation before entering a convolution layer for convolution; the last convolution layer in each feature layer is merged and cascaded with the next through concatenation; a dropout layer is added between the first and second feature layers in each 2-DenseBlock.
In the 2-DenseBlock network architecture, the 1×1 and 3×3 convolution layers in each feature layer form a group of nonlinearly transforming feature layers. As shown in fig. 3, the 2-DenseBlock includes 4 feature layers; the input of each nonlinearly transforming feature layer is denoted X_1, X_2, …, X_l, U1-U8 are the convolution layers in the feature layers, and W2-W9 are the weight-coefficient matrices corresponding to the convolution layers.
In 2-DenseBlock, the feature output U_c of the network convolution transform started by layer 3 can be defined as:

U_c = f(BN(W_{3×3} · f(BN(W_{1×1} · [X_l, X_{l-1}, X_{l-2}])))) + B

where [X_l, X_{l-1}, X_{l-2}] indicates that the current layer performs a channel-merging concatenation in a 2nd-order correlation scheme, using the feature maps of the previous two layers as input to the current layer; W_{1×1} and W_{3×3} are convolution kernels of size 1×1 and 3×3 respectively; BN(·) denotes batch normalization; f(·) is the ReLU activation function; and B is a bias coefficient.
Each Transition layer structure comprises a convolution layer and a pooling layer; the convolution layer, whose kernel is 1×1, performs feature dimensionality reduction and is followed by the pooling layer, which reduces the size of the feature matrix and hence the parameters of the final fully connected layer. The expression is:

X̃_k^l = ( Σ_{i=1}^{m} Σ_{j=1}^{n} X_k(i,j)^p )^{1/p}

where X̃_k^l is the output of the pooling layer, which takes the maximum value over the pooling region; l is the number of feature layers included in each 2-DenseBlock structure; k is the number of channels of the feature map; m and n are the sizes of the convolution kernel; X_k(i,j) corresponds to a pixel on the feature map; and p is a pre-specified parameter: as p tends to infinity, X̃_k^l takes the maximum over the pooling region, which is max pooling.
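The p → ∞ (max pooling) case of the pooling formula, with the [2,2] pooling size used in the Transition layers, can be sketched for a single-channel feature map as:

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling over a (H, W) feature map: each output pixel is the
    maximum of a non-overlapping 2x2 region, i.e. the p -> infinity case
    of Lp pooling."""
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2  # crop odd edges
    x = x[:h, :w]
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.]])
pooled = max_pool_2x2(x)
print(pooled)  # [[4. 8.]]
```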
As in the embodiment shown in fig. 3, when the number of layers l = 4: the output of layer 1 is X_1; the input propagated forward to layer 2 without a concatenation layer is X_2; the input feature map of layer 3 relates only to the output feature maps of layers 2 and 1, i.e. X_3 = f([X_3, X_2, X_1]); and the input feature map of layer 4 relates only to the output feature maps of layers 3 and 2, i.e. X_4 = f([X_4, X_3, X_2]).
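The concatenation pattern of this 2nd-order connectivity can be traced with a simplified sketch: each feature layer here is replaced by a single random linear projection (the real layers apply 1×1 and 3×3 convolutions with BN and ReLU), so only the rule that a layer's input concatenates at most the two most recent feature maps is illustrated:

```python
import numpy as np

def dense_block_2nd_order(x, n_layers=4, growth=16, seed=0):
    """Trace channel counts through a 2-DenseBlock-style structure in which
    each layer's input concatenates only the two most recent feature maps
    (2nd-order Markov connectivity), not all predecessors as in DenseNet."""
    rng = np.random.default_rng(seed)
    feats = [x]  # running list of feature maps
    for _ in range(n_layers):
        inp = np.concatenate(feats[-2:], axis=-1)  # at most 2 predecessors
        out = inp @ rng.standard_normal((inp.shape[-1], growth))
        feats.append(out)
    return feats

feats = dense_block_2nd_order(np.random.randn(8, 32))
print([f.shape[-1] for f in feats])  # [32, 16, 16, 16, 16]
```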
As shown in fig. 4, the GRU unit is a gated recurrent unit, a gating mechanism in recurrent neural networks. The gated recurrent unit has an update gate z_t and a reset gate r_t. The update gate z_t controls how much information from the previous state h_{t-1} is carried into the current state; the reset gate r_t controls how much information from the previous state is written to the current candidate set, giving the GRU good performance in time-series modeling. The expressions are:

z_t = σ(W_z · [h_{t-1}, x_t])
r_t = σ(W_r · [h_{t-1}, x_t])
h̃_t = tanh(W · [r_t ⊙ h_{t-1}, x_t])
h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t

where h̃_t denotes the hidden candidate state, h_t the output, and ⊙ element-wise multiplication; W_z is the weight coefficient of the update gate, W_r the weight coefficient of the reset gate, σ(·) denotes the fully connected layer with sigmoid activation, and tanh(·) the tanh activation function.
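A single GRU step following these gate equations can be sketched in plain NumPy; bias terms are omitted and the weight shapes are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell(x_t, h_prev, Wz, Wr, Wh):
    """One gated-recurrent-unit step:
    update gate  z_t = sigmoid(Wz . [h_prev, x_t])
    reset gate   r_t = sigmoid(Wr . [h_prev, x_t])
    candidate    h~  = tanh(Wh . [r_t * h_prev, x_t])
    output       h_t = (1 - z_t) * h_prev + z_t * h~"""
    concat = np.concatenate([h_prev, x_t])
    z = sigmoid(Wz @ concat)
    r = sigmoid(Wr @ concat)
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]))
    return (1.0 - z) * h_prev + z * h_tilde

rng = np.random.default_rng(1)
H, D = 4, 3                      # hidden size, input size
W = lambda: rng.standard_normal((H, H + D))
h_t = gru_cell(rng.standard_normal(D), np.zeros(H), W(), W(), W())
print(h_t.shape)  # (4,)
```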
As shown in fig. 2, the feature vector sequence input into the sound event detection model undergoes one convolution-layer operation and one pooling operation in sequence, and is then input into m consecutive 2-DenseBlocks, each followed by a Transition Layer; after processing by the 2-DenseNet(m), comprising m consecutive 2-DenseBlock structures and Transition layers, the three-dimensional feature vector is converted into a two-dimensional feature vector by a reshape layer, then n gated recurrent units (GRU) perform time-series modeling, the result is input into a TimeDistributed layer for time-series tensor operations and then into the fully connected layer for detection, and finally the detection result is output after Sigmoid processing. The number m of 2-DenseBlocks and the number l of feature layers are chosen according to the actual hardware conditions and data complexity. In the embodiment shown in figs. 2 and 3, n is 2, m is 2, and l is 4.
The Sigmoid function f(z) is:

f(z) = 1 / (1 + e^{−z})
S5: after the reconstructed feature vector sequence is processed, it is input into the trained sound event detection model for recognition and detection, yielding the sound event detection result of the audio data to be processed; before the reconstructed feature vector sequence is input into the trained sound event detection model, the feature vectors must be converted from 2-dimensional to 3-dimensional, the added third dimension being the number of channels in the sound event detection model.
In this technical scheme, the 2nd-order DenseNet can extract the information in the feature maps more effectively, and the gated recurrent unit (GRU) model is simple, making it better suited for building larger networks; combining the 2nd-order DenseNet network with 2 GRU models is also more computationally efficient. On this basis, the scheme can not only extract the frequency-domain information of the feature maps effectively but also capture the temporal features of long audio sequences, while realizing the classification and regression tasks in detection more efficiently.
An example of the network structure of the 2-DenseGRUNet model is shown in Table 2 below.

Table 2: example of the 2-DenseGRUNet model
- Input: mfcc [128, 80, 1]
- Conv(3×3): [128, 80, 32]
- Pooling(2×2): [64, 80, 32]
- 2-DenseBlock(1): [32, 80, 16]
- Transition Layer: [16, 80, 8]
- 2-DenseBlock(2): [16, 80, 8]
- Transition Layer: [16, 80, 8]
- Reshape: [64, 160]
- GRU(1): [64, 64]
- GRU(2): [64, 32]
- TimeDistributed: [64, 6]
- Fully connected layer: [64, 6]
- Output (Sigmoid): [64, 6]
The Dcase2017 dataset is used; it contains 6 detection categories, and each time label consists of a start time and an end time. Using the 2-DenseGRUNet model shown in Table 2, m is set to 2, i.e. the input feature vector sequence passes through one convolution operation and one pooling operation and is then input into 2 consecutive 2-DenseBlock structures; according to the data conditions and equipment performance, l is set to 4, i.e. each 2-DenseBlock contains 4 feature layers. n=2, i.e. the number of gated recurrent unit (GRU) layers is 2.
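The label reconstruction of step S3 (start/end times to start/end frames) reduces to scaling by the frame rate; a minimal sketch, assuming 128 feature frames per 1 s clip as in this embodiment (the helper name and rounding policy are illustrative assumptions):

```python
def label_to_frames(start_s, end_s, clip_s=1.0, n_frames=128):
    """Map an event's (start time, end time) in seconds to
    (start frame, end frame) indices in the feature sequence."""
    fps = n_frames / clip_s                       # frames per second
    start_f = int(round(start_s * fps))
    end_f = min(int(round(end_s * fps)), n_frames - 1)
    return start_f, end_f

# e.g. an event lasting from 0.25 s to 0.75 s inside a 1 s clip
print(label_to_frames(0.25, 0.75))   # → (32, 96)
```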
Time-domain and frequency-domain analysis is performed on the audio frame sequence, and mel frequency cepstrum coefficient (MFCC) feature vector sequences are extracted. The number of sampled frames per input audio clip in the Dcase2017 dataset is 128, and the MFCC feature dimension is chosen as 40mfcc: 40-dimensional MFCC features are extracted with 40 mel filter banks, yielding output MFCC feature vector sequences of shape (128, 40).
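The front end of this feature extraction (pre-emphasis, framing, Hamming windowing, power spectrum) can be sketched in pure NumPy; the frame length, hop and sample rate below are illustrative assumptions, and the remaining mel filter bank and DCT stages of MFCC extraction are typically delegated to a signal-processing library:

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=125, alpha=0.97):
    """Pre-emphasise, split into overlapping frames, apply a Hamming
    window, and return the per-frame power spectrum."""
    x = np.append(x[0], x[1:] - alpha * x[:-1])      # pre-emphasis filter
    n = 1 + (len(x) - frame_len) // hop              # number of full frames
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n)[:, None]
    frames = x[idx] * np.hamming(frame_len)          # windowed frames
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2  # power spectrum

sr = 16000                        # assumed sample rate
signal = np.random.randn(sr)      # 1 s of dummy audio
power = frame_signal(signal)
print(power.shape)                # → (125, 201)
```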
The 2-dimensional vector is first converted into 3-dimensional data through a reshape, because the number of Input channels in the network structure of the 2-DenseNet model is 1; after this conversion, the Dcase2017 feature vectors have shape (128, 40, 1). Later, the three-dimensional feature vectors produced by the network are converted into two-dimensional feature vectors (256, 40) and input into the gated recurrent unit (GRU) for time-series modeling.
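Both dimensionality conversions mentioned here are plain reshapes that preserve the element count; a NumPy sketch using shapes from this embodiment and Table 2:

```python
import numpy as np

# 2-D MFCC features -> 3-D single-channel tensor (the model's Input)
mfcc = np.zeros((128, 40))
mfcc_3d = mfcc[..., np.newaxis]   # shape (128, 40, 1)

# 3-D feature maps -> 2-D sequence for the GRU; the element count must be
# preserved, e.g. [16, 80, 8] -> [64, 160] as in Table 2 (16*80*8 = 64*160)
fmap = np.zeros((16, 80, 8))
seq = fmap.reshape(64, 160)
print(mfcc_3d.shape, seq.shape)   # → (128, 40, 1) (64, 160)
```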
The feature vectors are input into the 2-DenseNet model: the input feature map sequence first passes through a convolution layer, is then pooled by a pooling layer, and the resulting three-dimensional data are input into 2 consecutive 2-DenseBlock structures in sequence.
Each 2-DenseBlock contains 4 feature map layers, i.e. 4 applications of the 2-DenseBlock feature-layer function, whose input is a feature map sequence. Within the 2-DenseBlock function, batch normalization (BN) is applied first, with ReLU as the activation function, and the result is then passed to a convolution layer; this procedure is performed twice within the function, the first convolution kernel being of size 1×1 and the second of size 3×3. The specific operation of the l-th feature layer (denoted 2-DenseBlock in the formula) is therefore: x_l = 2-DenseBlock([x_(l-1), x_(l-2)]), where [·] denotes concatenation of the outputs of the previous two feature layers.
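The 2nd-order connectivity can be illustrated with a toy NumPy sketch in which each feature layer consumes the concatenation of the previous two layers' outputs along the channel axis. The layers here are random linear maps standing in for the BN–ReLU–convolution pairs, and seeding the history with two copies of the block input is an assumption of the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def feature_layer(x, growth=16):
    """Stand-in for one feature layer: maps its input to `growth` output
    channels (a real layer would be BN -> ReLU -> 1x1 conv -> BN -> ReLU
    -> 3x3 conv)."""
    h, w, c = x.shape
    mix = rng.standard_normal((c, growth))        # plays the role of kernels
    return (x.reshape(h * w, c) @ mix).reshape(h, w, growth)

def two_dense_block(x, n_layers=4, growth=16):
    """2nd-order dense block: the input of layer l is the concatenation
    of the outputs of layers l-1 and l-2."""
    outs = [x, x]                                 # seed the two-step history
    for _ in range(n_layers):
        inp = np.concatenate(outs[-2:], axis=-1)  # 2nd-order Markov link
        outs.append(feature_layer(inp, growth))
    return outs[-1]

y = two_dense_block(rng.standard_normal((8, 8, 3)))
print(y.shape)   # → (8, 8, 16)
```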
The three-dimensional data processed by the two consecutive 2-DenseBlock and Transition layers are first converted into two-dimensional feature vectors and input into the gated recurrent units (GRU) for time-series modeling; they then enter the Time Distributed layer for time-series tensor operations, where weight sharing allows time-series modeling to the greatest extent and the start and end times of a sound event to be detected; finally, the detection result is output after Sigmoid function processing.
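The gate arithmetic of a gated recurrent unit (update gate z, reset gate r, candidate state) can be sketched in NumPy; the weights below are random stand-ins rather than trained parameters, and the dimensions loosely follow Table 2:

```python
import numpy as np

def gru_step(x, h, Wz, Wr, Wh, Uz, Ur, Uh):
    """One GRU time step: returns the new hidden state."""
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sig(x @ Wz + h @ Uz)                   # update gate
    r = sig(x @ Wr + h @ Ur)                   # reset gate
    h_cand = np.tanh(x @ Wh + (r * h) @ Uh)    # candidate hidden state
    return (1 - z) * h + z * h_cand            # interpolate old and new

rng = np.random.default_rng(0)
d_in, d_h, T = 160, 64, 64                     # dims loosely following Table 2
Wz, Wr, Wh = (rng.standard_normal((d_in, d_h)) * 0.1 for _ in range(3))
Uz, Ur, Uh = (rng.standard_normal((d_h, d_h)) * 0.1 for _ in range(3))
h = np.zeros(d_h)
for t in range(T):                             # unroll over the time axis
    h = gru_step(rng.standard_normal(d_in), h, Wz, Wr, Wh, Uz, Ur, Uh)
print(h.shape)   # → (64,)
```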
The experiments were run under a Windows 10 system with a GTX1060 graphics card, an i7-8750H CPU and 16 GB of memory, using Keras+TensorFlow as the deep learning framework and the Dcase2017 dataset. First, comparison experiments with different features, different feature dimensions, and network structures with and without GRU units were performed on Dcase2017 to verify the segment error rate and F-score of the 2-DenseGRUNet model; the model was then compared with existing research models to verify its good performance.
Audio data detection experiments were performed on the Dcase2017 dataset by extracting 40- and 128-dimensional mel frequency cepstrum coefficient (mfcc) features and 40-dimensional gammatone frequency cepstrum coefficient (gfcc) features in the 2-DenseNet and 2-DenseGRUNet network models, with the specific results shown in Table 3 below.
Table 3: results of the fusion experiment of the 2-DenseGRUNet model
Model | Feature | Average segment error rate | F-score |
---|---|---|---|
2-DenseNet | 40mfcc | 0.566 | 0.610 |
2-DenseNet | 128mfcc | 0.562 | 0.613 |
2-DenseNet | 40gfcc | 0.586 | 0.607 |
2-DenseGRUNet | 40mfcc | 0.543 | 0.651 |
2-DenseGRUNet | 128mfcc | 0.541 | 0.648 |
2-DenseGRUNet | 40gfcc | 0.563 | 0.631 |
According to the experimental results, compared with the 2-DenseNet model without the gated recurrent unit (GRU), the 2-DenseGRUNet model reduces the average segment error rate by 2.3%, 2.1% and 2.3% under the 40mfcc, 128mfcc and 40gfcc features respectively, and improves the F-score by 4.1%, 3.5% and 2.4% respectively. Within the 2-DenseGRUNet model, using the 40mfcc feature rather than 40gfcc reduces the average segment error rate by 2.0% and improves the F-score by 2.0%; under 40mfcc relative to 128mfcc, the average segment error rate and F-score vary by only about 0.1%, but using 40mfcc effectively reduces model training time and the computational demands on the computer. In summary, the 2-DenseGRUNet model under 40mfcc can more efficiently exploit feature-information fusion to obtain more effective feature information and can model the time series effectively. The best average segment error rate is 0.543 and the F-score is 65.12%.
Further experiments on the 2-DenseGRUNet model were performed on the Dcase2017 dataset, and the test results were compared with the accuracy of existing models from researchers at home and abroad; the comparison results are shown in Table 4.
Table 4: Detection results of different models
Comparing the 2-DenseGRUNet model adopted in the technical scheme of the invention with the test results of researchers at home and abroad: compared with the baseline MLP model, the average segment error rate is reduced by 14.7% and the F-score is improved by 8.4%; compared with the LSTM model, the average segment error rate is reduced by 1.9% and the F-score is improved by 3.9%. The average segment error rate of the technical scheme of the invention is thus significantly reduced and the F-score significantly improved.
In summary, when processing audio data, the technical scheme provided by the invention can more efficiently exploit feature-information fusion to obtain more effective feature information, and the model has a lower average segment error rate and a higher F-score, i.e. the sound classification realized by this method is more accurate.
Claims (8)
1. A sound event detection method based on a 2-DenseGRUNet model, comprising the following steps:
s1: collecting audio data to be processed, preprocessing an original audio signal of the audio data to be processed, and outputting an audio frame sequence;
the preprocessing operation includes: sampling and quantizing, pre-emphasis processing and windowing;
s2: performing time domain and frequency domain analysis on the audio frame sequence, extracting a Mel frequency cepstrum coefficient, and outputting an original feature vector sequence;
s3: reconstructing the feature information and the tag of the original feature vector sequence, and outputting a reconstructed feature vector sequence after the reconstructed feature processing;
converting the starting time, the ending time and the category of the sound event in the original feature vector sequence into a starting frame, an ending frame and an event label corresponding to the reconstructed feature vector sequence;
s4: constructing a sound event detection model, and performing iterative training on the model to obtain a trained sound event detection model;
s5: after the reconstructed feature vector sequence is processed, inputting the processed reconstructed feature vector sequence into the trained sound event detection model for recognition detection, and obtaining a sound event detection result of the audio data to be processed;
the method is characterized in that:
the sound event detection model comprises: an input layer, a 2nd-order DenseNet model, and GRU units; all the GRU units are connected in series after the 2nd-order DenseNet model;
a convolution layer and a pooling layer are sequentially arranged between the input layer and the 2nd-order DenseNet model;
the 2nd-order DenseNet model comprises consecutive 2-DenseBlock structures, wherein a Transition layer structure is arranged after each 2-DenseBlock structure;
in each 2-DenseBlock structure, feature layers are connected based on the correlation connection of a 2nd-order Markov model, and the input of the current feature layer is related to the outputs of the previous 2 feature layers; each Transition layer structure comprises a convolution layer and a pooling layer;
the 2nd-order DenseNet model is connected in series with the GRU units in sequence, and is connected with the first GRU unit through a reshape layer; a Time Distributed layer, a fully connected layer and an output layer are sequentially arranged after the GRU units;
each 2-DenseBlock comprises feature layers connected in sequence; each feature layer comprises a 1×1 convolution layer and a 3×3 convolution layer, wherein the input data undergoes batch normalization and activation-function processing before entering each convolution layer; the last convolution layer in each feature layer is merged and cascaded with the next feature layer through concatenation; a dropout layer is added between the first and second feature layers in each 2-DenseBlock.
2. The sound event detection method based on the 2-DenseGRUNet model according to claim 1, wherein: in the mel frequency cepstrum coefficient feature extraction of step S2, the mel frequency cepstrum coefficient extracted in a specific dimension is 40mfcc; the original feature vector sequence obtained in step S2 is a 2-dimensional vector, where the first dimension is the number of frames after sampling the audio data to be processed and the second dimension is the dimension of the mel frequency cepstrum coefficient.
3. The sound event detection method based on the 2-DenseGRUNet model according to claim 1, wherein: in step S3, when reconstructing the feature information and labels of the original feature vector sequence, the duration of the data in the original feature vector sequence is 1 s, the corresponding reconstructed feature vector sequence has 128 mfcc frames, and the original sound event label of the form start time to end time is reconstructed into the form start frame number to end frame number.
4. The sound event detection method based on the 2-DenseGRUNet model according to claim 1, wherein: in step S5, before the reconstructed feature vector sequence is input into the trained sound event detection model, the feature vector needs to be converted from a 2-dimensional vector to a 3-dimensional vector, where the third dimension is the number of channels in the sound event detection model.
5. The sound event detection method based on the 2-DenseGRUNet model according to claim 1, wherein: the feature input of the GRU unit is a two-dimensional feature vector.
6. The sound event detection method based on the 2-DenseGRUNet model according to claim 1, wherein: the Transition Layer comprises: one 1×1 convolution layer and one max pooling layer with a pooling size of [2,2].
7. The sound event detection method based on the 2-DenseGRUNet model according to claim 1, wherein: the number of neurons of the full connection layer is set to 256 or 128.
8. The sound event detection method based on the 2-DenseGRUNet model according to claim 1, wherein: the output layer is realized based on a Sigmoid function.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111089655.7A CN113744758B (en) | 2021-09-16 | 2021-09-16 | Sound event detection method based on 2-DenseGRUNet model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113744758A CN113744758A (en) | 2021-12-03 |
CN113744758B true CN113744758B (en) | 2023-12-01 |
Family
ID=78739499
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111089655.7A Active CN113744758B (en) | 2021-09-16 | 2021-09-16 | Sound event detection method based on 2-DenseGRUNet model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113744758B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109065075A (en) * | 2018-09-26 | 2018-12-21 | 广州势必可赢网络科技有限公司 | A kind of method of speech processing, device, system and computer readable storage medium |
CN109949824A (en) * | 2019-01-24 | 2019-06-28 | 江南大学 | City sound event classification method based on N-DenseNet and higher-dimension mfcc feature |
CN110084292A (en) * | 2019-04-18 | 2019-08-02 | 江南大学 | Object detection method based on DenseNet and multi-scale feature fusion |
CN112890828A (en) * | 2021-01-14 | 2021-06-04 | 重庆兆琨智医科技有限公司 | Electroencephalogram signal identification method and system for densely connecting gating network |
CN113012714A (en) * | 2021-02-22 | 2021-06-22 | 哈尔滨工程大学 | Acoustic event detection method based on pixel attention mechanism capsule network model |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11049018B2 (en) * | 2017-06-23 | 2021-06-29 | Nvidia Corporation | Transforming convolutional neural networks for visual sequence learning |
Non-Patent Citations (1)
Title |
---|
N-DenseNet urban sound event classification model; Cao Yi; Journal of Xidian University; Vol. 46, No. 6; 10-15 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
AU2019213369B2 (en) | Non-local memory network for semi-supervised video object segmentation | |
CN110390952B (en) | City sound event classification method based on dual-feature 2-DenseNet parallel connection | |
CN111243579B (en) | Time domain single-channel multi-speaker voice recognition method and system | |
CN109448703B (en) | Audio scene recognition method and system combining deep neural network and topic model | |
CN111832440B (en) | Face feature extraction model construction method, computer storage medium and equipment | |
CN113012714B (en) | Acoustic event detection method based on pixel attention mechanism capsule network model | |
Ming et al. | 3D-TDC: A 3D temporal dilation convolution framework for video action recognition | |
CN114694255B (en) | Sentence-level lip language recognition method based on channel attention and time convolution network | |
CN111833906B (en) | Sound scene classification method based on multi-path acoustic characteristic data enhancement | |
CN113129908A (en) | End-to-end macaque voiceprint verification method and system based on cycle frame level feature fusion | |
Zhu et al. | Semantic image segmentation with improved position attention and feature fusion | |
CN115641533A (en) | Target object emotion recognition method and device and computer equipment | |
Zhang et al. | Learning audio sequence representations for acoustic event classification | |
CN114037699B (en) | Pathological image classification method, equipment, system and storage medium | |
CN111666996A (en) | High-precision equipment source identification method based on attention mechanism | |
CN114943937A (en) | Pedestrian re-identification method and device, storage medium and electronic equipment | |
CN114428860A (en) | Pre-hospital emergency case text recognition method and device, terminal and storage medium | |
CN113744758B (en) | Sound event detection method based on 2-DenseGRUNet model | |
CN111488486B (en) | Electronic music classification method and system based on multi-sound-source separation | |
CN112766368A (en) | Data classification method, equipment and readable storage medium | |
CN117095460A (en) | Self-supervision group behavior recognition method and system based on long-short time relation predictive coding | |
CN113488069B (en) | Speech high-dimensional characteristic rapid extraction method and device based on generation type countermeasure network | |
CN112861949B (en) | Emotion prediction method and system based on face and sound | |
CN111382761B (en) | CNN-based detector, image detection method and terminal | |
Shin et al. | Performance Analysis of a Chunk-Based Speech Emotion Recognition Model Using RNN. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||