CN113744758B - Sound event detection method based on 2-DenseGRUNet model - Google Patents

Sound event detection method based on 2-DenseGRUNet model Download PDF

Info

Publication number
CN113744758B
CN113744758B (application CN202111089655.7A)
Authority
CN
China
Prior art keywords
layer
model
event detection
sound event
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111089655.7A
Other languages
Chinese (zh)
Other versions
CN113744758A (en)
Inventor
曹毅
黄子龙
费鸿博
吴伟官
夏宇
周辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangnan University
Original Assignee
Jiangnan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangnan University filed Critical Jiangnan University
Priority to CN202111089655.7A priority Critical patent/CN113744758B/en
Publication of CN113744758A publication Critical patent/CN113744758A/en
Application granted granted Critical
Publication of CN113744758B publication Critical patent/CN113744758B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/048 - Activation functions
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Complex Calculations (AREA)

Abstract

The sound event detection method based on the 2-DenseGRUNet model builds a sound event detection model by adding a gated recurrent unit (GRU) network to a 2nd-order DenseNet network model. Compared with conventional convolutional neural network and recurrent neural network models, the sound event detection model in this technical scheme combines the advantages of 2-DenseNet and the GRU: it fuses feature information more efficiently, obtains more effective feature information, and performs temporal modeling effectively. When detecting urban sound events, the sound event detection model of this technical scheme achieves a lower average segment error rate and a higher F-score, so the sound classification results obtained with the method of the invention are more accurate.

Description

Sound event detection method based on 2-DenseGRUNet model
Technical Field
The invention relates to the technical field of sound detection, in particular to a sound event detection method based on a 2-DenseGRUNet model.
Background
Sound carries a large amount of information about life scenes and physical events in a city. Automatically extracting this information by intelligently sensing each sound source with deep learning methods has great potential and application prospects in building smart cities. In a smart city, sound event detection is an important basis for recognition and semantic understanding of environmental sound scenes. Research on urban sound event detection is mainly applied to environment perception, factory equipment monitoring, urban security, automatic driving and the like. Urban sound event detection in the prior art is mainly realized with MLP, CNN and LSTM network models. However, when these 3 network models are evaluated with the F-score, the index that combines Precision and Recall into a single harmonic-mean value, their F-scores are low because their average segment error rates are high, which limits their range of application in practice.
Disclosure of Invention
In order to solve the problem of the high average segment error rate of urban sound event detection in the prior art, the sound event detection method based on the 2-DenseGRUNet model provided by the invention can extract more effective acoustic information when processing audio data and has better temporal modeling capability, so that the model achieves a lower average segment error rate and higher usability when detecting urban sound events.
The technical scheme of the invention is as follows: the sound event detection method based on the 2-DenseGRUNet model comprises the following steps:
s1: collecting audio data to be processed, preprocessing an original audio signal of the audio data to be processed, and outputting an audio frame sequence;
the preprocessing operation includes: sampling and quantizing, pre-emphasis processing and windowing;
s2: performing time domain and frequency domain analysis on the audio frame sequence, extracting a Mel frequency cepstrum coefficient, and outputting an original feature vector sequence;
s3: reconstructing the feature information and the tag of the original feature vector sequence, and outputting a reconstructed feature vector sequence after the reconstructed feature processing;
converting the starting time, the ending time and the category of the sound event in the original feature vector sequence into a starting frame, an ending frame and an event label corresponding to the reconstructed feature vector sequence;
s4: constructing a sound event detection model, and performing iterative training on the model to obtain a trained sound event detection model;
s5: after the reconstructed feature vector sequence is processed, inputting the processed reconstructed feature vector sequence into the trained sound event detection model for recognition detection, and obtaining a sound event detection result of the audio data to be processed;
the method is characterized in that:
the sound event detection model includes: an input layer, a 2nd-order DenseNet model and GRU units; all of the GRU units are connected in series after the 2nd-order DenseNet model;
a convolution layer and a pooling layer are arranged in sequence between the input layer and the 2nd-order DenseNet model;
the 2nd-order DenseNet model comprises consecutive 2-DenseBlock structures, a Transition layer structure being arranged after each 2-DenseBlock structure;
in each 2-DenseBlock structure, the connections between feature layers are correlation connections based on a 2nd-order Markov model, and the input of the current feature layer is related to the outputs of the previous 2 feature layers; each Transition layer structure comprises a convolution layer and a pooling layer.
It is further characterized by:
the 2nd-order DenseNet model is connected in series with the GRU units in sequence, and is connected to the first GRU unit through a reshape layer; after the GRU units there are, in sequence, a TimeDistributed layer, a fully connected layer and an output layer;
each 2-DenseBlock comprises feature layers connected in sequence; each feature layer comprises a 1×1 convolution layer and a 3×3 convolution layer, and the input data undergoes batch normalization and activation-function processing before entering the convolution layers for convolution; the last convolution layer of each feature layer is merged and cascaded with the next one through concatenation; a dropout layer is added between the first feature layer and the second feature layer in each 2-DenseBlock;
in the Mel-frequency cepstral coefficient feature extraction of step S2, the specific dimension of the extracted Mel-frequency cepstral coefficients is 40 (40mfcc); the original feature vector sequence obtained in step S2 is a 2-dimensional vector, the first dimension being the number of frames of the audio data to be processed after sampling and the second dimension being the Mel-frequency cepstral coefficient dimension;
in step S3, when reconstructing the feature information and labels of the original feature vector sequence, the specific duration of one data segment in the original feature vector sequence is 1 s and the corresponding reconstructed feature vector sequence contains 128 mfcc frames; the original sound event label of [start time, end time] is reconstructed into [start frame number, end frame number];
in step S5, before the reconstructed feature vector sequence is input into the trained sound event detection model, the feature vector needs to be converted from a 2-dimensional vector into a 3-dimensional vector, where the third dimension is the number of channels in the sound event detection model;
the feature input of the GRU unit is a two-dimensional feature vector;
the Transition Layer includes: a convolution layer with a 1×1 convolution kernel and a max pooling layer with pooling size [2, 2];
the number of the neurons of the full-connection layer is set to 256 or 128;
the output layer is realized based on a Sigmoid function.
The sound event detection method based on the 2-DenseGRUNet model builds a sound event detection model by adding a gated recurrent unit (GRU) network to a 2nd-order DenseNet network model. Compared with conventional convolutional neural network and recurrent neural network models, the sound event detection model in this technical scheme combines the advantages of 2-DenseNet and the GRU: it fuses feature information more efficiently, obtains more effective feature information, and performs temporal modeling effectively. When detecting urban sound events, the sound event detection model of this technical scheme achieves a lower average segment error rate and a higher F-score, so the sound classification results obtained with the method of the invention are more accurate.
Drawings
FIG. 1 is a flow chart of feature reconstruction in sound event detection according to the present invention;
FIG. 2 is a network block diagram of the 2-DenseGRUNet model of the present invention;
FIG. 3 is a schematic diagram of a 2-DenseBlock in a 2-DenseNet network according to the present invention;
FIG. 4 is a schematic diagram of a gate cycle unit GRU according to the present invention.
Detailed Description
As shown in Figs. 1 to 4, the sound event detection method based on the 2-DenseGRUNet model of the present invention specifically includes the following steps.
S1: collecting audio data to be processed, preprocessing an original audio signal of the audio data to be processed, and outputting an audio frame sequence;
the preprocessing operation comprises the following steps: sampling and quantization, pre-emphasis processing, windowing.
S2: performing time domain and frequency domain analysis on the audio frame sequence, extracting a Mel frequency cepstrum coefficient, and outputting an original feature vector sequence;
Mel-frequency cepstral coefficient (MFCC) features are extracted; the specific feature dimension is 40 (40mfcc). The original feature vector sequence obtained in step S2 is a 2-dimensional vector: the first dimension is the number of frames of the audio data to be processed after sampling, and the second dimension is the Mel-frequency cepstral coefficient dimension; in this embodiment the MFCC dimension is 40, so the second dimension is 40.
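As an illustration of steps S1-S2, the following is a minimal sketch of the pre-emphasis and 40-dimensional MFCC extraction, assuming the sampling and framing parameters used in the embodiment below (44.1 kHz, nfft = 2048, hop_len = 1024, 40 Mel filters); the use of librosa, the 0.97 pre-emphasis coefficient and the function name extract_mfcc are illustrative assumptions rather than details given in the patent.

```python
import numpy as np
import librosa

def extract_mfcc(wav_path, sr=44100, n_fft=2048, hop_length=1024, n_mfcc=40):
    """S1-S2 sketch: sampling, pre-emphasis, windowed FFT and 40-dim MFCC extraction."""
    y, _ = librosa.load(wav_path, sr=sr)                 # S1: sampling / quantization at 44.1 kHz
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])           # S1: pre-emphasis (0.97 is an assumed coefficient)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop_length)  # S2: windowing, STFT, Mel filter bank, DCT
    return mfcc.T                                        # (num_frames, 40): the 2-D original feature vector sequence
```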
S3: reconstructing the feature information and the tag of the original feature vector sequence, and outputting a reconstructed feature vector sequence after the reconstructed feature processing;
converting the starting time, the ending time and the category of the sound event in the original feature vector sequence into a starting frame, an ending frame and an event label corresponding to the reconstructed feature vector sequence;
When reconstructing the feature information and labels of the original feature vector sequence, the specific duration of one data segment in the original feature vector sequence is 1 s, and the corresponding reconstructed feature vector sequence contains 128 mfcc frames; the original sound event label of [start time, end time] is reconstructed into [start frame number, end frame number].
As shown in fig. 1, in step S3, the specific implementation process of reconstructing the feature information and the tag of the original feature vector sequence is as follows:
Suppose the original audio samples comprise files named a013, a010, a129, b099, b008 and b100, whose durations are 3 min 44 s, 3 min 30 s, 4 min 0 s, 4 min 0 s, 3 min 30 s and 3 min 01 s respectively. These audio sample files of different durations are spliced along the time dimension to construct the total audio data, whose total duration T is:
T = 3min44s + 3min30s + 4min0s + 4min0s + 3min30s + 3min01s
Label information of the audio is extracted from the total audio data; each file segment in the total audio data is annotated as:
[time-start, time-end, category], which represents the [start time, end time, category] of the audio event corresponding to the original audio sample.
Features are extracted from the annotated total audio data with f = 44.1 kHz, nfft = 2048, win_len = 2048, hop_len = 1024, t = 0.0232 s and segment_len = 128;
where f is the sampling frequency, nfft is the length of the fast Fourier transform, win_len is the number of sampling points in each frame, hop_len is the number of sampling points between two adjacent frames, t is the duration of each frame, and segment_len = 128 means that after feature extraction the whole audio of total duration T is divided into segments of 128 frames each;
After feature reconstruction, the total audio data of length T contains a number of different samples, and the labels of the audio segments corresponding to these samples are of the form [frame_start, frame_end, one-hot],
which represents, for each segmented audio segment, its [start frame number, end frame number, class one-hot encoding];
finally, the reconstructed feature vector sequence, represented together with these [frame_start, frame_end, one-hot] labels, is output after the feature reconstruction processing.
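The following sketch illustrates the S3 reconstruction just described: converting [time-start, time-end, category] annotations into [frame_start, frame_end, one-hot] labels using the frame hop; the function name reconstruct_labels, the event list and the class names are hypothetical placeholders, not values taken from the actual annotations.

```python
import numpy as np

def reconstruct_labels(events, num_frames, classes, sr=44100, hop_length=1024):
    """S3 sketch: [time_start, time_end, category] -> [frame_start, frame_end, one-hot]."""
    frame_labels = []
    for t_start, t_end, cat in events:
        f_start = int(round(t_start * sr / hop_length))                   # start frame number
        f_end = min(int(round(t_end * sr / hop_length)), num_frames - 1)  # end frame number
        one_hot = np.zeros(len(classes), dtype=np.float32)
        one_hot[classes.index(cat)] = 1.0                                 # class one-hot encoding
        frame_labels.append((f_start, f_end, one_hot))
    return frame_labels

# Hypothetical usage: class names and event times are placeholders.
classes = ["class_0", "class_1", "class_2", "class_3", "class_4", "class_5"]
events = [(3.2, 7.9, "class_1"), (12.0, 15.5, "class_4")]
labels = reconstruct_labels(events, num_frames=2048, classes=classes)
```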
S4: and constructing a sound event detection model, and performing iterative training on the model to obtain a trained sound event detection model.
The sound event detection model is built on a 2nd-order DenseNet model combined with the characteristics of the gated recurrent unit (GRU) model, and is referred to as the 2-DenseGRUNet model for short. The sound event detection model uses the 2nd-order DenseNet model as the feature extraction network at the front end of the network, and connects GRU units, which have good temporal modeling capability, in series at the end of the network, so that feature information of sound events can be fused efficiently, more effective feature information can be obtained, temporal modeling can be performed effectively, and the average segment error rate over sound segments is reduced.
The sound event detection model includes: an input layer, a 2nd-order DenseNet model and GRU units; all of the GRU units are connected in series after the 2nd-order DenseNet model;
a convolution layer and a pooling layer are arranged in sequence between the input layer and the 2nd-order DenseNet model;
the 2nd-order DenseNet model includes m consecutive 2-DenseBlock structures, where m is a natural number greater than or equal to 1;
a Transition layer structure is arranged after each 2-DenseBlock structure;
the 2nd-order DenseNet model is connected in series with the consecutive GRU units, and is connected to the first GRU unit through a reshape layer; after the GRU units there are, in sequence, a TimeDistributed layer, a fully connected layer and an output layer; the feature input of the GRU unit is a two-dimensional feature vector; the output layer is realized with a Sigmoid function; the Transition Layer includes a convolution layer with a 1×1 convolution kernel and a max pooling layer with pooling size [2, 2]. The number of neurons in the fully connected layer is set to 256 or 128 so as to tune the network parameters and suppress overfitting, and the detection result is finally output after processing by the Sigmoid function.
The structure of the feature layer in 2-DenseBlock is shown in Table 1 below:
table 1: structure of feature layer in 2-DenseBlock
Conv(1×1)
BN(·)
ReLU activation function
Conv(3×3)
BN(·)
ReLU activation function
Concate function
dropout layer
Each 2-DenseBlock structure comprises l feature layers which are connected in sequence;
As shown in Table 1, each feature layer includes 2 convolution layers; within a feature layer, the input data undergoes batch normalization (BN) and ReLU activation processing before being convolved by each convolution layer;
the last convolution layer in each feature layer is merged and cascaded with the next one through a Concatenate operation; a dropout layer is arranged between the first feature layer and the second feature layer in each 2-DenseBlock to apply mild overfitting suppression, which facilitates parameter tuning of the later network model.
DenseBlock is the basic building block of the DenseNet model, and the DenseBlock in the 2nd-order DenseNet is likewise of order 2; that is, in each 2-DenseBlock structure, the connections between feature layers are correlation connections based on a 2nd-order Markov model, and the input of the current feature layer is related to the outputs of the previous 2 feature layers; each Transition layer structure comprises a convolution layer and a pooling layer; the feature layers share weights within the Transition layer structure, which maximizes temporal discrimination and supports detecting the start time and end time of a sound event;
each 2-DenseBlock comprises l feature layers connected in sequence, where l is a natural number greater than or equal to 1; each feature layer comprises a 1×1 convolution layer and a 3×3 convolution layer, and within a feature layer the input data undergoes batch normalization and activation-function processing before entering the convolution layers for convolution; the last convolution layer of each feature layer is merged and cascaded with the next one through concatenation; a dropout layer is added between the first feature layer and the second feature layer in each 2-DenseBlock.
In the 2-DenseBlock network structure, the 1×1 and 3×3 convolution layers in each feature layer form a group of nonlinearly transforming feature layers. As shown in Fig. 3 of the drawings of the specification, the 2-DenseBlock includes 4 feature layers, and the inputs of the nonlinearly transforming feature layers are defined as X_1, X_2, ..., X_l; U1-U8 are the convolution layers within the feature layers, and W2-W9 are the weight coefficient matrices corresponding to those convolution layers;
In 2-DenseBlock, the feature output U_c produced by the network convolution transform at layer l (l ≥ 3) can be defined as:
U_c = W_{3×3} ⊗ f(BN(W_{1×1} ⊗ f(BN([X_l, X_{l-1}, X_{l-2}])) + B)) + B
where [X_l, X_{l-1}, X_{l-2}] indicates that the current layer performs a channel-wise merging and concatenation operation in a 2nd-order correlation manner, taking the feature maps of the previous two layers together with the current input as the input of the current layer; W_{1×1} and W_{3×3} are convolution kernels of size 1×1 and 3×3 respectively; ⊗ denotes convolution; BN(·) denotes batch normalization; f(·) is the ReLU activation function; and B is the bias coefficient;
Each Transition layer structure comprises a convolution layer and a pooling layer; the convolution kernel of the convolution layer is 1×1 and performs feature dimensionality reduction, and the subsequent pooling layer reduces the size of the feature matrix, which in turn reduces the number of parameters of the final fully connected layer. The pooling operation can be expressed as:
y_{k,l} = ( Σ_{i=1..m} Σ_{j=1..n} x(i, j)^p )^{1/p}
where y_{k,l} denotes the output of the pooling layer, taking the maximum value within the pooling region; l is the number of feature layers contained in each 2-DenseBlock structure; k is the number of channels of the feature map; m and n are the sizes of the convolution kernel; x(i, j) is a pixel on the feature map; and p is a pre-specified parameter. When p tends to infinity, y_{k,l} takes the maximum value within the pooling region, which is max pooling.
In the embodiment shown in Fig. 3, when the number of layers l = 4, the output of layer 1 is X_1, and the input to layer 2, propagated forward without using a concatenation layer, is X_2; the input feature map of layer 3 relates only to the output feature maps of layers 2 and 1, i.e. X_3 = f([x_3, x_2, x_1]); the input feature map of layer 4 relates only to the output feature maps of layers 3 and 2, i.e. X_4 = f([x_4, x_3, x_2]).
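The following is a minimal Keras sketch of one 2-DenseBlock with l = 4 feature layers and its Transition layer, assuming the pre-activation ordering (batch normalization and ReLU before each convolution) and the 2nd-order connection described above; the growth rate, bottleneck width and dropout rate are assumptions chosen for illustration, not values specified in the patent.

```python
from tensorflow.keras import layers

def feature_layer(x, growth_rate=16):
    """One feature layer: BN -> ReLU -> Conv1x1 -> BN -> ReLU -> Conv3x3 (pre-activation ordering)."""
    h = layers.BatchNormalization()(x)
    h = layers.Activation("relu")(h)
    h = layers.Conv2D(4 * growth_rate, (1, 1), padding="same")(h)   # 1x1 bottleneck convolution
    h = layers.BatchNormalization()(h)
    h = layers.Activation("relu")(h)
    h = layers.Conv2D(growth_rate, (3, 3), padding="same")(h)       # 3x3 convolution
    return h

def dense_block_2(x, num_layers=4, growth_rate=16):
    """2-DenseBlock: each feature layer's input concatenates the previous two layers' outputs."""
    prev, curr = None, x
    for i in range(num_layers):
        inp = curr if prev is None else layers.Concatenate()([curr, prev])  # 2nd-order (Markov-style) connection
        out = feature_layer(inp, growth_rate)
        if i == 0:
            out = layers.Dropout(0.1)(out)   # dropout between the 1st and 2nd feature layers (rate assumed)
        prev, curr = curr, out
    return curr

def transition_layer(x, filters):
    """Transition layer: 1x1 convolution for channel reduction followed by 2x2 max pooling."""
    h = layers.Conv2D(filters, (1, 1), padding="same")(x)
    return layers.MaxPooling2D(pool_size=(2, 2))(h)
```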
As shown in Fig. 4, the GRU unit is a gated recurrent unit, a gating mechanism used in recurrent neural networks. The gated recurrent unit mechanism contains an update gate z_t and a reset gate r_t. The update gate z_t controls how much information from the previous state h_{t-1} is carried into the current state h_t, and the reset gate r_t controls how much information from the previous state is written into the current candidate set h̃_t; this gives the GRU good performance in temporal modeling. The expressions are:
z_t = σ(W_z · [h_{t-1}, x_t])
r_t = σ(W_r · [h_{t-1}, x_t])
h̃_t = tanh(W · [r_t * h_{t-1}, x_t])
h_t = (1 - z_t) * h_{t-1} + z_t * h̃_t
where σ(·) denotes the fully connected layer with its activation function, tanh(·) denotes the tanh activation function, h̃_t denotes the candidate hidden state, h_t denotes the output, * denotes element-wise multiplication, W_z is the weight coefficient of the update gate, and W_r is the weight coefficient of the reset gate.
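A small NumPy sketch of a single GRU step, written directly from the gate equations above; the weight shapes, the omission of bias terms and the example dimensions are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU step following the gate equations above (bias terms omitted for brevity)."""
    concat = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ concat)                                      # update gate
    r_t = sigmoid(W_r @ concat)                                      # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))     # candidate hidden state
    return (1.0 - z_t) * h_prev + z_t * h_tilde                      # new hidden state / output h_t

# Illustrative (assumed) dimensions: 40-dim input feature, 64-dim hidden state.
rng = np.random.default_rng(0)
x_t, h_prev = rng.standard_normal(40), np.zeros(64)
W_z, W_r, W_h = (rng.standard_normal((64, 104)) * 0.1 for _ in range(3))
h_t = gru_step(x_t, h_prev, W_z, W_r, W_h)
```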
As shown in Fig. 2, the feature vector sequence input into the sound event detection model first passes through one convolution layer operation and one pooling operation, and is then fed into m consecutive 2-DenseBlocks, each followed by a Transition Layer; after processing by 2-DenseNet(m), which comprises the m consecutive 2-DenseBlock structures and Transition layers, the three-dimensional feature vector is converted into a two-dimensional feature vector by a reshape layer, then passed through n gated recurrent unit (GRU) layers for temporal modeling, fed into a TimeDistributed Layer for time-sequence tensor operations, then into the fully connected layer for detection, and finally the detection result is output after processing by the Sigmoid function. The number m of 2-DenseBlocks and the number l of feature layers are chosen according to the actual hardware conditions and the data complexity. In the embodiment shown in Figs. 2 and 3 of the drawings, n is 2, m is 2 and l is 4.
The Sigmoid function f(z) is:
f(z) = 1 / (1 + e^(-z))
S5: after the reconstructed feature vector sequence is processed, it is input into the trained sound event detection model for recognition and detection to obtain the sound event detection result of the audio data to be processed; before the reconstructed feature vector sequence is input into the trained sound event detection model, the feature vector needs to be converted from a 2-dimensional vector into a 3-dimensional vector, where the third dimension is the number of channels in the sound event detection model.
In this technical scheme, the 2nd-order DenseNet extracts the information in the feature map more effectively, and the gated recurrent unit (GRU) model is simple, making it better suited to building larger networks; combining the 2nd-order DenseNet network with 2 GRU layers, as in this patent, is also more efficient from a computational point of view. Based on this technical scheme, not only can the frequency-domain information of the feature map be extracted effectively, but the temporal features of long audio sequences can also be captured effectively, and the classification and regression tasks in detection can be carried out more efficiently.
Table 2 below shows an example of the network structure of the 2-DenseGRUNet model; a sketch of how such a structure might be assembled in Keras follows the table.
Table 2: examples of the 2-DenseGRUNet model
Input:mfcc[128,80,1]
Conv(3×3):[128,80,32]
Pooling(2×2):[64,80,32]
2-Denseblock(1):[32,80,16]
Transition Layer(16,80,8)
2-Denseblock(2):[16,80,8]
Transition Layer(16,80,8)
Reshape:[64,160]
GRU(1):[64,64]
GRU(2):[64,32]
TimeDistributed:[64,6]
Full connection layer (64,6)
Output(Sigmoid):[64,6]
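Below is a hedged Keras sketch assembling the pipeline of Fig. 2 and Table 2 (convolution + pooling front end, two 2-DenseBlocks each followed by a Transition layer, reshape, two GRU layers, a TimeDistributed dense layer and a Sigmoid output), reusing the dense_block_2 and transition_layer helpers sketched earlier; the input shape (128, 40, 1), filter counts and GRU widths are assumptions chosen to be internally consistent, so the intermediate shapes differ from those listed in Table 2.

```python
from tensorflow.keras import layers, models

def build_2_dense_gru_net(input_shape=(128, 40, 1), num_classes=6):
    """Conv -> Pool -> 2 x (2-DenseBlock + Transition) -> Reshape -> 2 x GRU -> TimeDistributed Dense -> Sigmoid.
    Reuses dense_block_2 / transition_layer from the 2-DenseBlock sketch above."""
    inp = layers.Input(shape=input_shape)                    # (frames, mfcc_dim, channels)
    x = layers.Conv2D(32, (3, 3), padding="same", activation="relu")(inp)
    x = layers.MaxPooling2D((2, 2))(x)

    x = dense_block_2(x, num_layers=4, growth_rate=16)       # m = 2 blocks, l = 4 feature layers each
    x = transition_layer(x, filters=16)
    x = dense_block_2(x, num_layers=4, growth_rate=16)
    x = transition_layer(x, filters=8)

    t, f, c = x.shape[1], x.shape[2], x.shape[3]
    x = layers.Reshape((t, f * c))(x)                        # 3-D feature map -> 2-D sequence for the GRUs

    x = layers.GRU(64, return_sequences=True)(x)             # n = 2 gated recurrent unit layers
    x = layers.GRU(32, return_sequences=True)(x)
    out = layers.TimeDistributed(layers.Dense(num_classes, activation="sigmoid"))(x)  # per-step event activity
    return models.Model(inp, out)

model = build_2_dense_gru_net()
model.summary()
```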
The Dcase2017 dataset is used; the dataset has 6 detection categories, and the time labels are the start time and end time of each event. Using the 2-DenseGRUNet model shown in Table 2, m is set to 2: the feature vector sequence input into the detection model is subjected to one convolution operation and one pooling operation in sequence and is then fed into 2 consecutive 2-DenseBlocks; according to the data conditions and the equipment performance, l is set to 4 in each 2-DenseBlock structure, i.e. each 2-DenseBlock comprises 4 feature layers; and n = 2, i.e. the number of gated recurrent unit (GRU) layers is 2.
Time-domain and frequency-domain analysis is performed on the audio frame sequence, and the Mel-frequency cepstral coefficient feature vector sequence is extracted and output; the number of sampled frames for the input audio data in the Dcase2017 dataset is 128, and the feature dimension of the Mel-frequency cepstral coefficients is 40 (40mfcc, i.e. 40-dimensional MFCC features extracted with 40 Mel filter banks), so the output MFCC feature vector sequence is (128, 40).
The 2-dimensional vector is converted into 3-dimensional data through reshape, because the channel number of the Input in the network structure of the 2-DenseNet model is 1; after this conversion, the feature vectors of Dcase2017 are (128, 40, 1). Later in the network, the three-dimensional feature vector is converted into a two-dimensional feature vector (256, 40) and input into the gated recurrent unit (GRU) for time-series modeling.
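A one-line illustration of the 2-D to 3-D conversion described above, the added third dimension being the single input channel expected by the 2-DenseNet front end (shapes follow this embodiment):

```python
import numpy as np

mfcc_2d = np.zeros((128, 40), dtype=np.float32)   # (frames, mfcc dimension) from steps S2/S3
mfcc_3d = np.expand_dims(mfcc_2d, axis=-1)        # (128, 40, 1): third dimension = channel count of the model input
```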
The feature vector is input into the 2-DenseNet model; the input feature map sequence first passes through a convolution layer and is then pooled by a pooling layer, and the resulting three-dimensional data is fed in turn into the 2 consecutive 2-DenseBlocks.
Each 2-DenseBlock contains 4 feature map layers, i.e. 4 2-DenseBlock functions whose input is a feature map sequence. Within the 2-DenseBlock function, batch normalization (BN) is performed first, the activation function is the ReLU function, and the data is then passed to the convolution layer; this procedure is carried out twice within the function, with a 1×1 convolution kernel the first time and a 3×3 convolution kernel the second time. The specific operation of the 2-DenseBlock function (denoted 2-DenseBlock in the formula) therefore follows the feature-layer expression given above.
three-dimensional data processed by two continuous 2-DenseBlock and Transition layers are firstly converted into two-dimensional feature vectors, the two-dimensional feature vectors are input into a gating circulation unit GRU for Time sequence modeling, then enter a Time Distributed Layer for Time sequence tensor operation, weight sharing can be carried out, time sequence modeling can be carried out to the maximum extent, the starting Time and the ending Time of a sound event are detected, and finally the detection result is output after being processed by a Sigmoid function.
The experiments were run under a Windows 10 system with a GTX1060 graphics card, an i7-8750H CPU and 16 GB of memory, using Keras + TensorFlow as the deep learning framework and the Dcase2017 dataset. First, comparison experiments with different features, different feature dimensions, and network structures with and without GRU units were carried out on Dcase2017 to verify the segment error rate and F-score of the 2-DenseGRUNet model; the model was then compared with existing research models to verify its good performance.
Audio data detection experiments were performed on the Dcase2017 dataset by extracting 40- and 128-dimensional Mel cepstral coefficient features and 40-dimensional Gammatone cepstral coefficient features in the 2-DenseNet and 2-DenseGRUNet network models; the specific results are shown in Table 3 below.
Table 3: results of the fusion experiment of the 2-DenseGRUNet model
Model Feature Average segment error rate F-score
2-DenseNet 40mfcc 0.566 0.610
2-DenseNet 128mfcc 0.562 0.613
2-DenseNet 40gfcc 0.586 0.607
2-DenseGRUNet 40mfcc 0.543 0.651
2-DenseGRUNet 128mfcc 0.541 0.648
2-DenseGRUNet 40gfcc 0.563 0.631
According to the experimental results, compared with the model without the gated recurrent unit GRU (i.e. 2-DenseNet), the 2-DenseGRUNet model reduces the average segment error rate by 2.3%, 2.1% and 2.3% under the 40mfcc, 128mfcc and 40gfcc features respectively, and improves the F-score by 4.1%, 3.5% and 2.4% respectively. Under the 2-DenseGRUNet model, using the 40mfcc feature instead of 40gfcc reduces the average segment error rate by 2.0% and improves the F-score by 2.0%; compared with 128mfcc, the 40mfcc feature changes the average segment error rate and F-score by only about 0.1%, while using 40mfcc effectively reduces the model training time and the computational requirements. In summary, the 2-DenseGRUNet model with 40mfcc can use feature information fusion more efficiently to obtain more effective feature information, and can perform temporal modeling effectively. The best average segment error rate was 0.543 and the best F-score was 65.12%.
Further experiments with the 2-DenseGRUNet model were performed on the Dcase2017 dataset, and the test results were compared with the accuracy of existing models from researchers at home and abroad; the comparison results are shown in Table 4.
Table 4: detection results of different models
Comparing the 2-DenseGRUNet model adopted in the technical scheme of the invention with the test results of researchers at home and abroad: relative to the baseline MLP model, the average segment error rate is reduced by 14.7% and the F-score is improved by 8.4%; relative to the LSTM model, the average segment error rate is reduced by 1.9% and the F-score is improved by 3.9%. The average segment error rate of the technical scheme of the invention is thus significantly reduced and the F-score significantly improved.
In summary, when processing audio data, the technical scheme provided by the invention can use feature information fusion more efficiently to obtain more effective feature information, and the model has a lower average segment error rate and a higher F-score, i.e. the sound classification realized by this method is more accurate.

Claims (8)

1. The sound event detection method based on the 2-DenseGRUNet model comprises the following steps:
s1: collecting audio data to be processed, preprocessing an original audio signal of the audio data to be processed, and outputting an audio frame sequence;
the preprocessing operation includes: sampling and quantizing, pre-emphasis processing and windowing;
s2: performing time domain and frequency domain analysis on the audio frame sequence, extracting a Mel frequency cepstrum coefficient, and outputting an original feature vector sequence;
s3: reconstructing the feature information and the tag of the original feature vector sequence, and outputting a reconstructed feature vector sequence after the reconstructed feature processing;
converting the starting time, the ending time and the category of the sound event in the original feature vector sequence into a starting frame, an ending frame and an event label corresponding to the reconstructed feature vector sequence;
s4: constructing a sound event detection model, and performing iterative training on the model to obtain a trained sound event detection model;
s5: after the reconstructed feature vector sequence is processed, inputting the processed reconstructed feature vector sequence into the trained sound event detection model for recognition detection, and obtaining a sound event detection result of the audio data to be processed;
the method is characterized in that:
the sound event detection model includes: an input layer, a 2nd-order DenseNet model and GRU units; all of the GRU units are connected in series after the 2nd-order DenseNet model;
a convolution layer and a pooling layer are arranged in sequence between the input layer and the 2nd-order DenseNet model;
the 2nd-order DenseNet model comprises consecutive 2-DenseBlock structures, a Transition layer structure being arranged after each 2-DenseBlock structure;
in each 2-DenseBlock structure, the connections between feature layers are correlation connections based on a 2nd-order Markov model, and the input of the current feature layer is related to the outputs of the previous 2 feature layers; each Transition layer structure comprises a convolution layer and a pooling layer;
the 2nd-order DenseNet model is connected in series with the GRU units in sequence, and is connected to the first GRU unit through a reshape layer; after the GRU units there are, in sequence, a TimeDistributed layer, a fully connected layer and an output layer;
each 2-DenseBlock comprises feature layers connected in sequence; each feature layer comprises a 1×1 convolution layer and a 3×3 convolution layer, and the input data undergoes batch normalization and activation-function processing before entering the convolution layers for convolution; the last convolution layer of each feature layer is merged and cascaded with the next one through concatenation; a dropout layer is added between the first feature layer and the second feature layer in each 2-DenseBlock.
2. The sound event detection method based on the 2-DenseGRUNet model according to claim 1, wherein: in the Mel-frequency cepstral coefficient feature extraction of step S2, the specific dimension of the extracted Mel-frequency cepstral coefficients is 40 (40mfcc); the original feature vector sequence obtained in step S2 is a 2-dimensional vector, the first dimension being the number of frames of the audio data to be processed after sampling and the second dimension being the Mel-frequency cepstral coefficient dimension.
3. The sound event detection method based on the 2-DenseGRUNet model according to claim 1, wherein: in step S3, when reconstructing the feature information and labels of the original feature vector sequence, the specific duration of one data segment in the original feature vector sequence is 1 s and the corresponding reconstructed feature vector sequence contains 128 mfcc frames; the original sound event label of [start time, end time] is reconstructed into [start frame number, end frame number].
4. The sound event detection method based on the 2-DenseGRUNet model according to claim 1, wherein: in step S5, before the reconstructed feature vector sequence is input into the trained sound event detection model, the feature vector needs to be converted from a 2-dimensional vector into a 3-dimensional vector, where the third dimension is the number of channels in the sound event detection model.
5. The sound event detection method based on the 2-DenseGRUNet model according to claim 1, wherein: the feature input of the GRU unit is a two-dimensional feature vector.
6. The sound event detection method based on the 2-DenseGRUNet model according to claim 1, wherein: the Transition Layer includes: a convolution layer with a 1×1 convolution kernel and a max pooling layer with pooling size [2, 2].
7. The sound event detection method based on the 2-DenseGRUNet model according to claim 1, wherein: the number of neurons of the full connection layer is set to 256 or 128.
8. The sound event detection method based on the 2-DenseGRUNet model according to claim 1, wherein: the output layer is realized based on a Sigmoid function.
CN202111089655.7A 2021-09-16 2021-09-16 Sound event detection method based on 2-DenseGRUNet model Active CN113744758B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111089655.7A CN113744758B (en) 2021-09-16 2021-09-16 Sound event detection method based on 2-DenseGRUNet model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111089655.7A CN113744758B (en) 2021-09-16 2021-09-16 Sound event detection method based on 2-DenseGRUNet model

Publications (2)

Publication Number Publication Date
CN113744758A CN113744758A (en) 2021-12-03
CN113744758B true CN113744758B (en) 2023-12-01

Family

ID=78739499

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111089655.7A Active CN113744758B (en) 2021-09-16 2021-09-16 Sound event detection method based on 2-DenseGRUNet model

Country Status (1)

Country Link
CN (1) CN113744758B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109065075A (en) * 2018-09-26 2018-12-21 广州势必可赢网络科技有限公司 A kind of method of speech processing, device, system and computer readable storage medium
CN109949824A (en) * 2019-01-24 2019-06-28 江南大学 City sound event classification method based on N-DenseNet and higher-dimension mfcc feature
CN110084292A (en) * 2019-04-18 2019-08-02 江南大学 Object detection method based on DenseNet and multi-scale feature fusion
CN112890828A (en) * 2021-01-14 2021-06-04 重庆兆琨智医科技有限公司 Electroencephalogram signal identification method and system for densely connecting gating network
CN113012714A (en) * 2021-02-22 2021-06-22 哈尔滨工程大学 Acoustic event detection method based on pixel attention mechanism capsule network model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11049018B2 (en) * 2017-06-23 2021-06-29 Nvidia Corporation Transforming convolutional neural networks for visual sequence learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109065075A (en) * 2018-09-26 2018-12-21 广州势必可赢网络科技有限公司 A kind of method of speech processing, device, system and computer readable storage medium
CN109949824A (en) * 2019-01-24 2019-06-28 江南大学 City sound event classification method based on N-DenseNet and higher-dimension mfcc feature
CN110084292A (en) * 2019-04-18 2019-08-02 江南大学 Object detection method based on DenseNet and multi-scale feature fusion
CN112890828A (en) * 2021-01-14 2021-06-04 重庆兆琨智医科技有限公司 Electroencephalogram signal identification method and system for densely connecting gating network
CN113012714A (en) * 2021-02-22 2021-06-22 哈尔滨工程大学 Acoustic event detection method based on pixel attention mechanism capsule network model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
N-DenseNet urban sound event classification model; Cao Yi; Journal of Xidian University; Vol. 46, No. 6; pp. 10-15 *

Also Published As

Publication number Publication date
CN113744758A (en) 2021-12-03

Similar Documents

Publication Publication Date Title
AU2019213369B2 (en) Non-local memory network for semi-supervised video object segmentation
CN110390952B (en) City sound event classification method based on dual-feature 2-DenseNet parallel connection
CN111243579B (en) Time domain single-channel multi-speaker voice recognition method and system
CN109448703B (en) Audio scene recognition method and system combining deep neural network and topic model
CN111832440B (en) Face feature extraction model construction method, computer storage medium and equipment
CN113012714B (en) Acoustic event detection method based on pixel attention mechanism capsule network model
Ming et al. 3D-TDC: A 3D temporal dilation convolution framework for video action recognition
CN114694255B (en) Sentence-level lip language recognition method based on channel attention and time convolution network
CN111833906B (en) Sound scene classification method based on multi-path acoustic characteristic data enhancement
CN113129908A (en) End-to-end macaque voiceprint verification method and system based on cycle frame level feature fusion
Zhu et al. Semantic image segmentation with improved position attention and feature fusion
CN115641533A (en) Target object emotion recognition method and device and computer equipment
Zhang et al. Learning audio sequence representations for acoustic event classification
CN114037699B (en) Pathological image classification method, equipment, system and storage medium
CN111666996A (en) High-precision equipment source identification method based on attention mechanism
CN114943937A (en) Pedestrian re-identification method and device, storage medium and electronic equipment
CN114428860A (en) Pre-hospital emergency case text recognition method and device, terminal and storage medium
CN113744758B (en) Sound event detection method based on 2-DenseGRUNet model
CN111488486B (en) Electronic music classification method and system based on multi-sound-source separation
CN112766368A (en) Data classification method, equipment and readable storage medium
CN117095460A (en) Self-supervision group behavior recognition method and system based on long-short time relation predictive coding
CN113488069B (en) Speech high-dimensional characteristic rapid extraction method and device based on generation type countermeasure network
CN112861949B (en) Emotion prediction method and system based on face and sound
CN111382761B (en) CNN-based detector, image detection method and terminal
Shin et al. Performance Analysis of a Chunk-Based Speech Emotion Recognition Model Using RNN.

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant