CN113744758B - Sound event detection method based on 2-DenseGRUNet model - Google Patents
Sound event detection method based on 2-DenseGRUNet model
- Publication number: CN113744758B (application CN202111089655.7A)
- Authority: CN (China)
- Legal status: Active
Classifications
- G10L25/24 — speech or voice analysis, the extracted parameters being the cepstrum
- G06F18/241 — classification techniques relating to the classification model
- G06F18/253 — fusion techniques of extracted features
- G06N3/045 — combinations of networks
- G06N3/048 — activation functions
- G06N3/08 — learning methods
- G10L25/30 — speech or voice analysis using neural networks
- G10L25/48 — speech or voice analysis specially adapted for particular use
- G10L25/51 — speech or voice analysis for comparison or discrimination
Abstract
The sound event detection method based on the 2-DenseGRUNet model builds a sound event detection model from a 2nd-order DenseNet network with a gated recurrent unit (GRU) network appended to it. Compared with conventional convolutional and recurrent neural network models, the sound event detection model in this scheme combines the advantages of 2-DenseNet and the GRU: it fuses feature information more efficiently, obtains more effective feature information, and performs effective time-series modeling. When detecting urban sound events, the model achieves a lower average segment error rate and a higher F-Score, so the sound classification results of the method are more accurate.
Description
Technical Field
The invention relates to the technical field of sound detection, and in particular to a sound event detection method based on a 2-DenseGRUNet model.
Background
Sound carries a large amount of information about life scenes and physical events in a city; automatically extracting this information by intelligently sensing each sound source with deep learning methods has great potential and application prospects for building smart cities. In smart cities, sound event detection is an important basis for recognizing and semantically understanding environmental sound scenes. Research on urban sound event detection is mainly applied to environmental perception, factory equipment monitoring, urban security, autonomous driving, and similar areas. Prior-art urban sound event detection is mainly realized with MLP, CNN, or LSTM network models. However, when these three network models are evaluated with the F-Score, the harmonic mean of Precision and Recall, their scores are low because of their high average segment error rates, which limits their range of practical applications.
Disclosure of Invention
In order to solve the problem of the high average segment error rate in prior-art urban sound event detection, the invention provides a sound event detection method based on a 2-DenseGRUNet model that can extract more effective acoustic information when processing audio data and has better time-series modeling capability, so that the model achieves a lower average segment error rate and higher usability when detecting urban sound events.
The technical scheme of the invention is as follows: the sound event detection method based on the 2-DenseGRUNet model comprises the following steps:
S1: collecting the audio data to be processed, preprocessing the original audio signal of the audio data to be processed, and outputting an audio frame sequence;
the preprocessing operations include: sampling and quantization, pre-emphasis, and windowing;
S2: performing time-domain and frequency-domain analysis on the audio frame sequence, extracting Mel-frequency cepstral coefficients, and outputting an original feature vector sequence;
S3: reconstructing the feature information and labels of the original feature vector sequence, and outputting a reconstructed feature vector sequence after the reconstruction processing;
converting the start time, end time, and category of each sound event in the original feature vector sequence into the start frame, end frame, and event label corresponding to the reconstructed feature vector sequence;
S4: constructing a sound event detection model and training it iteratively to obtain a trained sound event detection model;
S5: after processing the reconstructed feature vector sequence, inputting it into the trained sound event detection model for recognition and detection to obtain the sound event detection result of the audio data to be processed;
the method is characterized in that:
the sound event detection model includes: an input layer, a 2nd-order DenseNet model, and GRU units; all GRU units are connected in series after the 2nd-order DenseNet model;
a convolution layer and a pooling layer are arranged in sequence between the input layer and the 2nd-order DenseNet model;
the 2nd-order DenseNet model comprises consecutive 2-DenseBlock structures, with a Transition layer structure arranged after each 2-DenseBlock structure;
in each 2-DenseBlock structure, feature layers are connected according to the correlation of a 2nd-order Markov model, the input of the current feature layer being related to the outputs of the preceding 2 feature layers; each Transition layer structure comprises a convolution layer and a pooling layer.
It is further characterized by:
the 2nd-order DenseNet model is connected in series with the GRU units in sequence, the 2nd-order DenseNet model being connected to the first GRU unit through a reshape layer; after the GRU units come, in sequence, a TimeDistributed layer, a fully connected layer, and an output layer;
each 2-DenseBlock comprises feature layers connected in sequence; each feature layer comprises a 1×1 convolution layer and a 3×3 convolution layer, the input data undergoing batch normalization and activation-function processing before entering the convolution layers for convolution; the last convolution layer in each feature layer is merged and cascaded with the next through concatenation; a dropout layer is added between the first and second feature layers in each 2-DenseBlock;
in the feature extraction of step S2, the extracted Mel-frequency cepstral coefficients have a specific dimension of 40 (40-dimensional MFCC); the original feature vector sequence obtained in step S2 is 2-dimensional, its first dimension being the number of frames after sampling the audio data to be processed and its second dimension being the MFCC dimension;
in step S3, when reconstructing the feature information and labels of the original feature vector sequence, each 1 s of data in the original feature vector sequence corresponds to 128 MFCC frames in the reconstructed feature vector sequence, and the original sound event label of the form [start time, end time] is reconstructed into [start frame number, end frame number];
in step S5, before the reconstructed feature vector sequence is input into the trained sound event detection model, the feature vectors must be converted from 2-dimensional to 3-dimensional, the added third dimension being the number of channels in the sound event detection model;
the feature input of each GRU unit is a two-dimensional feature vector;
the Transition Layer includes: a convolution layer with a 1×1 kernel and a max pooling layer with pooling size [2,2];
the number of neurons in the fully connected layer is set to 256 or 128;
the output layer is realized with a Sigmoid function.
Drawings
FIG. 1 is a flow chart of feature reconstruction in sound event detection according to the present invention;
FIG. 2 is a network block diagram of the 2-DenseGRUNet model of the present invention;
FIG. 3 is a schematic diagram of a 2-DenseBlock in a 2-DenseNet network according to the present invention;
- FIG. 4 is a schematic diagram of the gated recurrent unit (GRU) according to the present invention.
Detailed Description
As shown in figs. 1 to 4, the sound event detection method based on the 2-DenseGRUNet model of the present invention specifically includes the following steps.
S1: collecting the audio data to be processed, preprocessing the original audio signal of the audio data to be processed, and outputting an audio frame sequence;
the preprocessing operations comprise: sampling and quantization, pre-emphasis, and windowing.
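As an illustration of these preprocessing operations, the sketch below applies pre-emphasis, framing, and windowing in NumPy; the frame and hop sizes (2048/1024 samples) follow the embodiment described later, while the pre-emphasis coefficient 0.97 and the Hamming window are common defaults, not values specified by the patent:

```python
import numpy as np

def preprocess(signal, frame_len=2048, hop_len=1024, alpha=0.97):
    """Pre-emphasis, framing, and Hamming windowing of a raw audio signal."""
    # Pre-emphasis filter: y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Split into overlapping frames of frame_len samples, hop_len apart
    n_frames = 1 + (len(emphasized) - frame_len) // hop_len
    frames = np.stack([emphasized[i * hop_len: i * hop_len + frame_len]
                       for i in range(n_frames)])
    # Window each frame to reduce spectral leakage
    return frames * np.hamming(frame_len)

frames = preprocess(np.random.randn(44100))  # 1 s of audio at 44.1 kHz
print(frames.shape)  # (42, 2048)
```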
S2: performing time-domain and frequency-domain analysis on the audio frame sequence, extracting Mel-frequency cepstral coefficients, and outputting an original feature vector sequence;
the extracted Mel-frequency cepstral coefficients have a specific dimension of 40 (40-dimensional MFCC); the original feature vector sequence obtained in step S2 is 2-dimensional, the first dimension being the number of frames of the audio data to be processed after sampling and the second dimension being the MFCC dimension; in this embodiment the MFCC dimension is 40, so the second dimension is 40.
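For illustration, the standard Hz-to-mel mapping that underlies MFCC extraction, together with the resulting (frames, 40) feature shape, can be sketched as follows; the frame-count formula is a simplifying assumption, since the exact framing depends on the extractor used:

```python
import numpy as np

def hz_to_mel(f_hz):
    """Standard mel-scale mapping used when spacing the mel filters."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mfcc_shape(n_samples, hop_len=1024, n_mfcc=40):
    """Shape of the original feature vector sequence: (frames, n_mfcc)."""
    n_frames = 1 + n_samples // hop_len
    return (n_frames, n_mfcc)

print(hz_to_mel(700.0))    # about 781.2 mel
print(mfcc_shape(130048))  # (128, 40)
```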
S3: reconstructing the feature information and labels of the original feature vector sequence, and outputting a reconstructed feature vector sequence after the reconstruction processing;
converting the start time, end time, and category of each sound event in the original feature vector sequence into the start frame, end frame, and event label corresponding to the reconstructed feature vector sequence;
when reconstructing the feature information and labels of the original feature vector sequence, each 1 s of data in the original feature vector sequence corresponds to 128 MFCC frames in the reconstructed feature vector sequence, and the original sound event label of the form [start time, end time] is reconstructed into [start frame number, end frame number].
As shown in fig. 1, in step S3, the specific implementation process of reconstructing the feature information and the tag of the original feature vector sequence is as follows:
Suppose the original audio samples comprise files named a013, a010, a129, b099, b008, and b100, with durations of 3min44s, 3min30s, 4min0s, 4min0s, 3min30s, and 3min01s respectively. The audio sample files of different lengths are spliced along the time dimension to construct the total audio data, whose total duration T is:

T = 3min44s + 3min30s + 4min0s + 4min0s + 3min30s + 3min01s
Label information is then extracted from the total audio data; each file segment in the total audio data is annotated as [time_start, time_end, category], representing the [start time, end time, category] of the audio event in the corresponding original audio sample.
Features are extracted from the annotated total audio data with f = 44.1 kHz, nfft = 2048, win_len = 2048, hop_len = 1024, a frame duration of 0.0232 s, and segment_len = 128;
where f is the sampling frequency, nfft the length of the fast Fourier transform, win_len the number of sampling points per frame, hop_len the number of sampling points between two frames, and segment_len = 128 means that after feature extraction the whole audio of duration T is divided into segments of 128 frames each;
after feature reconstruction, the total audio data of length T contains a number of samples, and the label of the audio fragment corresponding to each sample is [frame_start, frame_end, one-hot],
representing for each segmented audio fragment its [start frame number, end frame number, one-hot class encoding];
finally, the reconstructed feature vector sequence is output after the reconstruction processing, i.e. it is represented by [frame_start, frame_end, one-hot].
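A minimal sketch of this label reconstruction, assuming frames are indexed by the hop length at the sampling rate of the embodiment (44.1 kHz, hop_len = 1024, i.e. roughly 0.0232 s per frame); the helper name and the truncation-based rounding are illustrative assumptions:

```python
import numpy as np

def reconstruct_label(time_start, time_end, category, n_classes=6,
                      sr=44100, hop_len=1024):
    """Convert a [start time, end time, category] annotation into
    [start frame, end frame, one-hot class encoding]."""
    frame_start = int(time_start * sr / hop_len)  # frame duration ~0.0232 s
    frame_end = int(time_end * sr / hop_len)
    one_hot = np.eye(n_classes, dtype=int)[category]
    return frame_start, frame_end, one_hot

fs, fe, oh = reconstruct_label(1.0, 2.5, category=3)
print(fs, fe, list(oh))  # 43 107 [0, 0, 0, 1, 0, 0]
```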
S4: constructing a sound event detection model and training it iteratively to obtain a trained sound event detection model.
The network model built on a 2nd-order DenseNet model combined with the characteristics of the gated recurrent unit (GRU) model is the sound event detection model, called the 2-DenseGRUNet model for short. The sound event detection model uses the 2nd-order DenseNet model as the feature extraction network at the front end, with GRU units, which have good time-series modeling capability, connected in series at the end of the network, so that the feature information of sound events can be fused efficiently, more effective feature information can be obtained, time-series modeling is performed effectively, and the average segment error rate of sound segments is reduced.
The sound event detection model includes: an input layer, a 2nd-order DenseNet model, and GRU units; all GRU units are connected in series after the 2nd-order DenseNet model;
a convolution layer and a pooling layer are arranged in sequence between the input layer and the 2nd-order DenseNet model;
the 2nd-order DenseNet model includes m consecutive 2-DenseBlock structures, where m is a natural number greater than or equal to 1;
a Transition layer structure is arranged after each 2-DenseBlock structure;
the 2nd-order DenseNet model is connected in series with the consecutive GRU units, the 2nd-order DenseNet model being connected to the first GRU unit through a reshape layer; after the GRU units come, in sequence, a TimeDistributed layer, a fully connected layer, and an output layer; the feature input of each GRU unit is a two-dimensional feature vector; the output layer is realized with a Sigmoid function; the Transition Layer includes a convolution layer with a 1×1 kernel and a max pooling layer with pooling size [2,2]. The number of neurons in the fully connected layer is set to 256 or 128, the network parameters are tuned and over-fitting is suppressed, and the final detection result is output after Sigmoid processing.
The structure of a feature layer in 2-DenseBlock is shown in Table 1 below:

Table 1: structure of a feature layer in 2-DenseBlock
- Conv(1×1)
- BN(·)
- ReLU activation function
- Conv(3×3)
- BN(·)
- ReLU activation function
- Concat operation
- dropout layer
Each 2-DenseBlock structure comprises l feature layers connected in sequence;
as shown in Table 1, each feature layer includes 2 convolution layers; within a feature layer, after the input data are convolved by a convolution layer, batch normalization (BN) and ReLU activation are further applied;
the last convolution layer in each feature layer is merged and cascaded with the next through a Concat operation; a dropout layer between the first and second feature layers in each 2-DenseBlock applies mild over-fitting suppression, which facilitates later tuning of the network model.
DenseBlock is the basic structure of a DenseNet model, and the DenseBlocks in the 2nd-order DenseNet are likewise of second order; that is, in each 2-DenseBlock structure, feature layers are connected according to the correlation of a 2nd-order Markov model, the input of the current feature layer being related to the outputs of the preceding 2 feature layers. Each Transition layer structure comprises a convolution layer and a pooling layer; the feature layers share weights in the Transition layer structure, maximizing temporal discrimination so that the start and end times of a sound event can be detected.
Each 2-DenseBlock comprises l feature layers connected in sequence, where l is a natural number greater than or equal to 1; each feature layer comprises a 1×1 convolution layer and a 3×3 convolution layer, the input data undergoing batch normalization and activation before entering a convolution layer for convolution; the last convolution layer in each feature layer is merged and cascaded with the next through concatenation; a dropout layer is added between the first and second feature layers in each 2-DenseBlock.
In the 2-DenseBlock network architecture, the 1×1 and 3×3 convolution layers in each feature layer form a group of nonlinearly transforming feature layers. As shown in fig. 3, the 2-DenseBlock includes 4 feature layers; the input of each nonlinearly transforming feature layer is denoted X_1, X_2, …, X_l, U1-U8 are the convolution layers in the feature layers, and W2-W9 are the weight-coefficient matrices corresponding to the convolution layers.
In 2-DenseBlock, the feature output U_c of the network convolution transform started by layer 3 can be defined as:

U_c = f(BN(W_{3×3} · f(BN(W_{1×1} · [X_l, X_{l-1}, X_{l-2}])))) + B

where [X_l, X_{l-1}, X_{l-2}] indicates that the current layer performs a channel-merging concatenation in a 2nd-order correlation scheme, using the feature maps of the previous two layers as input to the current layer; W_{1×1} and W_{3×3} are convolution kernels of size 1×1 and 3×3 respectively; BN(·) denotes batch normalization; f(·) is the ReLU activation function; and B is a bias coefficient.
Each Transition layer structure comprises a convolution layer and a pooling layer; the convolution layer, whose kernel is 1×1, performs feature dimensionality reduction and is followed by the pooling layer, which reduces the size of the feature matrix and hence the parameters of the final fully connected layer. The expression is:

X̃_k^l = ( Σ_{i=1}^{m} Σ_{j=1}^{n} X_k(i,j)^p )^{1/p}

where X̃_k^l is the output of the pooling layer, which takes the maximum value over the pooling region; l is the number of feature layers included in each 2-DenseBlock structure; k is the number of channels of the feature map; m and n are the sizes of the convolution kernel; X_k(i,j) corresponds to a pixel on the feature map; and p is a pre-specified parameter: as p tends to infinity, X̃_k^l takes the maximum over the pooling region, which is max pooling.
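The p → ∞ (max pooling) case of the pooling formula, with the [2,2] pooling size used in the Transition layers, can be sketched for a single-channel feature map as:

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling over a (H, W) feature map: each output pixel is the
    maximum of a non-overlapping 2x2 region, i.e. the p -> infinity case
    of Lp pooling."""
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2  # crop odd edges
    x = x[:h, :w]
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.]])
pooled = max_pool_2x2(x)
print(pooled)  # [[4. 8.]]
```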
As in the embodiment shown in fig. 3, when the number of layers l = 4: the output of layer 1 is X_1; the input propagated forward to layer 2 without a concatenation layer is X_2; the input feature map of layer 3 relates only to the output feature maps of layers 2 and 1, i.e. X_3 = f([X_3, X_2, X_1]); and the input feature map of layer 4 relates only to the output feature maps of layers 3 and 2, i.e. X_4 = f([X_4, X_3, X_2]).
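The concatenation pattern of this 2nd-order connectivity can be traced with a simplified sketch: each feature layer here is replaced by a single random linear projection (the real layers apply 1×1 and 3×3 convolutions with BN and ReLU), so only the rule that a layer's input concatenates at most the two most recent feature maps is illustrated:

```python
import numpy as np

def dense_block_2nd_order(x, n_layers=4, growth=16, seed=0):
    """Trace channel counts through a 2-DenseBlock-style structure in which
    each layer's input concatenates only the two most recent feature maps
    (2nd-order Markov connectivity), not all predecessors as in DenseNet."""
    rng = np.random.default_rng(seed)
    feats = [x]  # running list of feature maps
    for _ in range(n_layers):
        inp = np.concatenate(feats[-2:], axis=-1)  # at most 2 predecessors
        out = inp @ rng.standard_normal((inp.shape[-1], growth))
        feats.append(out)
    return feats

feats = dense_block_2nd_order(np.random.randn(8, 32))
print([f.shape[-1] for f in feats])  # [32, 16, 16, 16, 16]
```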
As shown in fig. 4, the GRU unit is a gated recurrent unit, a gating mechanism in recurrent neural networks. The gated recurrent unit has an update gate z_t and a reset gate r_t. The update gate z_t controls how much information from the previous state h_{t-1} is carried into the current state; the reset gate r_t controls how much information from the previous state is written to the current candidate set, giving the GRU good performance in time-series modeling. The expressions are:

z_t = σ(W_z · [h_{t-1}, x_t])
r_t = σ(W_r · [h_{t-1}, x_t])
h̃_t = tanh(W · [r_t ⊙ h_{t-1}, x_t])
h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t

where h̃_t denotes the hidden candidate state, h_t the output, and ⊙ element-wise multiplication; W_z is the weight coefficient of the update gate, W_r the weight coefficient of the reset gate, σ(·) denotes the fully connected layer with sigmoid activation, and tanh(·) the tanh activation function.
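A single GRU step following these gate equations can be sketched in plain NumPy; bias terms are omitted and the weight shapes are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell(x_t, h_prev, Wz, Wr, Wh):
    """One gated-recurrent-unit step:
    update gate  z_t = sigmoid(Wz . [h_prev, x_t])
    reset gate   r_t = sigmoid(Wr . [h_prev, x_t])
    candidate    h~  = tanh(Wh . [r_t * h_prev, x_t])
    output       h_t = (1 - z_t) * h_prev + z_t * h~"""
    concat = np.concatenate([h_prev, x_t])
    z = sigmoid(Wz @ concat)
    r = sigmoid(Wr @ concat)
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]))
    return (1.0 - z) * h_prev + z * h_tilde

rng = np.random.default_rng(1)
H, D = 4, 3                      # hidden size, input size
W = lambda: rng.standard_normal((H, H + D))
h_t = gru_cell(rng.standard_normal(D), np.zeros(H), W(), W(), W())
print(h_t.shape)  # (4,)
```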
As shown in fig. 2, the feature vector sequence input into the sound event detection model undergoes one convolution-layer operation and one pooling operation in sequence, and is then input into m consecutive 2-DenseBlocks, each followed by a Transition Layer; after processing by the 2-DenseNet(m), comprising m consecutive 2-DenseBlock structures and Transition layers, the three-dimensional feature vector is converted into a two-dimensional feature vector by a reshape layer, then n gated recurrent units (GRU) perform time-series modeling, the result is input into a TimeDistributed layer for time-series tensor operations and then into the fully connected layer for detection, and finally the detection result is output after Sigmoid processing. The number m of 2-DenseBlocks and the number l of feature layers are chosen according to the actual hardware conditions and data complexity. In the embodiment shown in figs. 2 and 3, n is 2, m is 2, and l is 4.
The Sigmoid function f(z) is:

f(z) = 1 / (1 + e^{−z})
S5: after the reconstructed feature vector sequence is processed, it is input into the trained sound event detection model for recognition and detection, yielding the sound event detection result of the audio data to be processed; before the reconstructed feature vector sequence is input into the trained sound event detection model, the feature vectors must be converted from 2-dimensional to 3-dimensional, the added third dimension being the number of channels in the sound event detection model.
In this technical scheme, the 2nd-order DenseNet can extract the information in the feature maps more effectively, and the gated recurrent unit (GRU) model is simple, making it better suited for building larger networks; combining the 2nd-order DenseNet network with 2 GRU models is also more computationally efficient. On this basis, the scheme can not only extract the frequency-domain information of the feature maps effectively but also capture the temporal features of long audio sequences, while realizing the classification and regression tasks in detection more efficiently.
An example of the network structure of the 2-DenseGRUNet model is shown in Table 2 below.

Table 2: example of the 2-DenseGRUNet model
- Input: mfcc [128, 80, 1]
- Conv(3×3): [128, 80, 32]
- Pooling(2×2): [64, 80, 32]
- 2-DenseBlock(1): [32, 80, 16]
- Transition Layer: [16, 80, 8]
- 2-DenseBlock(2): [16, 80, 8]
- Transition Layer: [16, 80, 8]
- Reshape: [64, 160]
- GRU(1): [64, 64]
- GRU(2): [64, 32]
- TimeDistributed: [64, 6]
- Fully connected layer: [64, 6]
- Output (Sigmoid): [64, 6]
The Dcase2017 dataset is used; it contains 6 detection categories, and each time label consists of a start time and an end time. Using the 2-DenseGRUNet model shown in Table 2, m is set to 2, i.e. the input feature vector sequence passes through one convolution operation and one pooling operation and is then input into 2 consecutive 2-DenseBlock structures; according to the data conditions and equipment performance, l is set to 4, i.e. each 2-DenseBlock contains 4 feature layers. n=2, i.e. the number of gated recurrent unit (GRU) layers is 2.
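The label reconstruction of step S3 (start/end times to start/end frames) reduces to scaling by the frame rate; a minimal sketch, assuming 128 feature frames per 1 s clip as in this embodiment (the helper name and rounding policy are illustrative assumptions):

```python
def label_to_frames(start_s, end_s, clip_s=1.0, n_frames=128):
    """Map an event's (start time, end time) in seconds to
    (start frame, end frame) indices in the feature sequence."""
    fps = n_frames / clip_s                       # frames per second
    start_f = int(round(start_s * fps))
    end_f = min(int(round(end_s * fps)), n_frames - 1)
    return start_f, end_f

# e.g. an event lasting from 0.25 s to 0.75 s inside a 1 s clip
print(label_to_frames(0.25, 0.75))   # → (32, 96)
```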
Time-domain and frequency-domain analysis is performed on the audio frame sequence, and mel frequency cepstrum coefficient (MFCC) feature vector sequences are extracted. The number of sampled frames per input audio clip in the Dcase2017 dataset is 128, and the MFCC feature dimension is chosen as 40mfcc: 40-dimensional MFCC features are extracted with 40 mel filter banks, yielding output MFCC feature vector sequences of shape (128, 40).
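The front end of this feature extraction (pre-emphasis, framing, Hamming windowing, power spectrum) can be sketched in pure NumPy; the frame length, hop and sample rate below are illustrative assumptions, and the remaining mel filter bank and DCT stages of MFCC extraction are typically delegated to a signal-processing library:

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=125, alpha=0.97):
    """Pre-emphasise, split into overlapping frames, apply a Hamming
    window, and return the per-frame power spectrum."""
    x = np.append(x[0], x[1:] - alpha * x[:-1])      # pre-emphasis filter
    n = 1 + (len(x) - frame_len) // hop              # number of full frames
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n)[:, None]
    frames = x[idx] * np.hamming(frame_len)          # windowed frames
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2  # power spectrum

sr = 16000                        # assumed sample rate
signal = np.random.randn(sr)      # 1 s of dummy audio
power = frame_signal(signal)
print(power.shape)                # → (125, 201)
```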
The 2-dimensional vector is first converted into 3-dimensional data through a reshape, because the number of Input channels in the network structure of the 2-DenseNet model is 1; after this conversion, the Dcase2017 feature vectors have shape (128, 40, 1). Later, the three-dimensional feature vectors produced by the network are converted into two-dimensional feature vectors (256, 40) and input into the gated recurrent unit (GRU) for time-series modeling.
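Both dimensionality conversions mentioned here are plain reshapes that preserve the element count; a NumPy sketch using shapes from this embodiment and Table 2:

```python
import numpy as np

# 2-D MFCC features -> 3-D single-channel tensor (the model's Input)
mfcc = np.zeros((128, 40))
mfcc_3d = mfcc[..., np.newaxis]   # shape (128, 40, 1)

# 3-D feature maps -> 2-D sequence for the GRU; the element count must be
# preserved, e.g. [16, 80, 8] -> [64, 160] as in Table 2 (16*80*8 = 64*160)
fmap = np.zeros((16, 80, 8))
seq = fmap.reshape(64, 160)
print(mfcc_3d.shape, seq.shape)   # → (128, 40, 1) (64, 160)
```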
The feature vectors are input into the 2-DenseNet model: the input feature map sequence first passes through a convolution layer, is then pooled by a pooling layer, and the resulting three-dimensional data are input into 2 consecutive 2-DenseBlock structures in sequence.
Each 2-DenseBlock contains 4 feature map layers, i.e. 4 applications of the 2-DenseBlock feature-layer function, whose input is a feature map sequence. Within the 2-DenseBlock function, batch normalization (BN) is applied first, with ReLU as the activation function, and the result is then passed to a convolution layer; this procedure is performed twice within the function, the first convolution kernel being of size 1×1 and the second of size 3×3. The specific operation of the l-th feature layer (denoted 2-DenseBlock in the formula) is therefore: x_l = 2-DenseBlock([x_(l-1), x_(l-2)]), where [·] denotes concatenation of the outputs of the previous two feature layers.
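The 2nd-order connectivity can be illustrated with a toy NumPy sketch in which each feature layer consumes the concatenation of the previous two layers' outputs along the channel axis. The layers here are random linear maps standing in for the BN–ReLU–convolution pairs, and seeding the history with two copies of the block input is an assumption of the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def feature_layer(x, growth=16):
    """Stand-in for one feature layer: maps its input to `growth` output
    channels (a real layer would be BN -> ReLU -> 1x1 conv -> BN -> ReLU
    -> 3x3 conv)."""
    h, w, c = x.shape
    mix = rng.standard_normal((c, growth))        # plays the role of kernels
    return (x.reshape(h * w, c) @ mix).reshape(h, w, growth)

def two_dense_block(x, n_layers=4, growth=16):
    """2nd-order dense block: the input of layer l is the concatenation
    of the outputs of layers l-1 and l-2."""
    outs = [x, x]                                 # seed the two-step history
    for _ in range(n_layers):
        inp = np.concatenate(outs[-2:], axis=-1)  # 2nd-order Markov link
        outs.append(feature_layer(inp, growth))
    return outs[-1]

y = two_dense_block(rng.standard_normal((8, 8, 3)))
print(y.shape)   # → (8, 8, 16)
```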
The three-dimensional data processed by the two consecutive 2-DenseBlock and Transition layers are first converted into two-dimensional feature vectors and input into the gated recurrent units (GRU) for time-series modeling; they then enter the Time Distributed layer for time-series tensor operations, where weight sharing allows time-series modeling to the greatest extent and the start and end times of a sound event to be detected; finally, the detection result is output after Sigmoid function processing.
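The gate arithmetic of a gated recurrent unit (update gate z, reset gate r, candidate state) can be sketched in NumPy; the weights below are random stand-ins rather than trained parameters, and the dimensions loosely follow Table 2:

```python
import numpy as np

def gru_step(x, h, Wz, Wr, Wh, Uz, Ur, Uh):
    """One GRU time step: returns the new hidden state."""
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sig(x @ Wz + h @ Uz)                   # update gate
    r = sig(x @ Wr + h @ Ur)                   # reset gate
    h_cand = np.tanh(x @ Wh + (r * h) @ Uh)    # candidate hidden state
    return (1 - z) * h + z * h_cand            # interpolate old and new

rng = np.random.default_rng(0)
d_in, d_h, T = 160, 64, 64                     # dims loosely following Table 2
Wz, Wr, Wh = (rng.standard_normal((d_in, d_h)) * 0.1 for _ in range(3))
Uz, Ur, Uh = (rng.standard_normal((d_h, d_h)) * 0.1 for _ in range(3))
h = np.zeros(d_h)
for t in range(T):                             # unroll over the time axis
    h = gru_step(rng.standard_normal(d_in), h, Wz, Wr, Wh, Uz, Ur, Uh)
print(h.shape)   # → (64,)
```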
The experiments were run under a Windows 10 system with a GTX1060 graphics card, an i7-8750H CPU and 16 GB of memory, using Keras+TensorFlow as the deep learning framework and the Dcase2017 dataset. First, comparison experiments with different features, different feature dimensions, and network structures with and without GRU units were performed on Dcase2017 to verify the segment error rate and F-score of the 2-DenseGRUNet model; the model was then compared with existing research models to verify its good performance.
Audio data detection experiments were performed on the Dcase2017 dataset by extracting 40- and 128-dimensional mel frequency cepstrum coefficient (mfcc) features and 40-dimensional gammatone frequency cepstrum coefficient (gfcc) features in the 2-DenseNet and 2-DenseGRUNet network models, with the specific results shown in Table 3 below.
Table 3: results of the fusion experiment of the 2-DenseGRUNet model
Model | Feature | Average segment error rate | F-score |
---|---|---|---|
2-DenseNet | 40mfcc | 0.566 | 0.610 |
2-DenseNet | 128mfcc | 0.562 | 0.613 |
2-DenseNet | 40gfcc | 0.586 | 0.607 |
2-DenseGRUNet | 40mfcc | 0.543 | 0.651 |
2-DenseGRUNet | 128mfcc | 0.541 | 0.648 |
2-DenseGRUNet | 40gfcc | 0.563 | 0.631 |
According to the experimental results, compared with the 2-DenseNet model without the gated recurrent unit (GRU), the 2-DenseGRUNet model reduces the average segment error rate by 2.3%, 2.1% and 2.3% under the 40mfcc, 128mfcc and 40gfcc features respectively, and improves the F-score by 4.1%, 3.5% and 2.4% respectively. Within the 2-DenseGRUNet model, using the 40mfcc feature rather than 40gfcc reduces the average segment error rate by 2.0% and improves the F-score by 2.0%; under 40mfcc relative to 128mfcc, the average segment error rate and F-score vary by only about 0.1%, but using 40mfcc effectively reduces model training time and the computational demands on the computer. In summary, the 2-DenseGRUNet model under 40mfcc can more efficiently exploit feature-information fusion to obtain more effective feature information and can model the time series effectively. The best average segment error rate is 0.543 and the F-score is 65.12%.
Further experiments on the 2-DenseGRUNet model were performed on the Dcase2017 dataset, and the test results were compared with the accuracy of existing models from researchers at home and abroad; the comparison results are shown in Table 4.
Table 4: Detection results of different models
Comparing the 2-DenseGRUNet model adopted in the technical scheme of the invention with the test results of researchers at home and abroad: compared with the baseline MLP model, the average segment error rate is reduced by 14.7% and the F-score is improved by 8.4%; compared with the LSTM model, the average segment error rate is reduced by 1.9% and the F-score is improved by 3.9%. The average segment error rate of the technical scheme of the invention is thus significantly reduced and the F-score significantly improved.
In summary, when processing audio data, the technical scheme provided by the invention can more efficiently exploit feature-information fusion to obtain more effective feature information, and the model has a lower average segment error rate and a higher F-score, i.e. the sound classification realized by this method is more accurate.
Claims (8)
1. A sound event detection method based on a 2-DenseGRUNet model, comprising the following steps:
s1: collecting audio data to be processed, preprocessing an original audio signal of the audio data to be processed, and outputting an audio frame sequence;
the preprocessing operation includes: sampling and quantizing, pre-emphasis processing and windowing;
s2: performing time domain and frequency domain analysis on the audio frame sequence, extracting a Mel frequency cepstrum coefficient, and outputting an original feature vector sequence;
s3: reconstructing the feature information and the tag of the original feature vector sequence, and outputting a reconstructed feature vector sequence after the reconstructed feature processing;
converting the starting time, the ending time and the category of the sound event in the original feature vector sequence into a starting frame, an ending frame and an event label corresponding to the reconstructed feature vector sequence;
s4: constructing a sound event detection model, and performing iterative training on the model to obtain a trained sound event detection model;
s5: after the reconstructed feature vector sequence is processed, inputting the processed reconstructed feature vector sequence into the trained sound event detection model for recognition detection, and obtaining a sound event detection result of the audio data to be processed;
the method is characterized in that:
the sound event detection model comprises: an input layer, a 2nd-order DenseNet model, and GRU units; all the GRU units are connected in series after the 2nd-order DenseNet model;
a convolution layer and a pooling layer are sequentially arranged between the input layer and the 2nd-order DenseNet model;
the 2nd-order DenseNet model comprises consecutive 2-DenseBlock structures, wherein a Transition layer structure is arranged after each 2-DenseBlock structure;
in each 2-DenseBlock structure, feature layers are connected based on the correlation connection of a 2nd-order Markov model, and the input of the current feature layer is related to the outputs of the previous 2 feature layers; each Transition layer structure comprises a convolution layer and a pooling layer;
the 2nd-order DenseNet model is connected in series with the GRU units in sequence, and is connected with the first GRU unit through a reshape layer; a Time Distributed layer, a fully connected layer and an output layer are sequentially arranged after the GRU units;
each 2-DenseBlock comprises feature layers connected in sequence; each feature layer comprises a 1×1 convolution layer and a 3×3 convolution layer, wherein the input data undergoes batch normalization and activation-function processing before entering each convolution layer; the last convolution layer in each feature layer is merged and cascaded with the next feature layer through concatenation; a dropout layer is added between the first and second feature layers in each 2-DenseBlock.
2. The sound event detection method based on the 2-DenseGRUNet model according to claim 1, wherein: in the mel frequency cepstrum coefficient feature extraction of step S2, the mel frequency cepstrum coefficient extracted in a specific dimension is 40mfcc; the original feature vector sequence obtained in step S2 is a 2-dimensional vector, where the first dimension is the number of frames after sampling the audio data to be processed and the second dimension is the dimension of the mel frequency cepstrum coefficient.
3. The sound event detection method based on the 2-DenseGRUNet model according to claim 1, wherein: in step S3, when reconstructing the feature information and labels of the original feature vector sequence, the duration of the data in the original feature vector sequence is 1 s, the corresponding reconstructed feature vector sequence has 128 mfcc frames, and the original sound event label of the form start time to end time is reconstructed into the form start frame number to end frame number.
4. The sound event detection method based on the 2-DenseGRUNet model according to claim 1, wherein: in step S5, before the reconstructed feature vector sequence is input into the trained sound event detection model, the feature vector needs to be converted from a 2-dimensional vector to a 3-dimensional vector, where the third dimension is the number of channels in the sound event detection model.
5. The sound event detection method based on the 2-DenseGRUNet model according to claim 1, wherein: the feature input of the GRU unit is a two-dimensional feature vector.
6. The sound event detection method based on the 2-DenseGRUNet model according to claim 1, wherein: the Transition Layer comprises: one 1×1 convolution layer and one max pooling layer with a pooling size of [2,2].
7. The sound event detection method based on the 2-DenseGRUNet model according to claim 1, wherein: the number of neurons of the full connection layer is set to 256 or 128.
8. The sound event detection method based on the 2-DenseGRUNet model according to claim 1, wherein: the output layer is realized based on a Sigmoid function.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111089655.7A CN113744758B (en) | 2021-09-16 | 2021-09-16 | Sound event detection method based on 2-DenseGRUNet model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113744758A CN113744758A (en) | 2021-12-03 |
CN113744758B true CN113744758B (en) | 2023-12-01 |
Family
ID=78739499
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111089655.7A Active CN113744758B (en) | 2021-09-16 | 2021-09-16 | Sound event detection method based on 2-DenseGRUNet model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113744758B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109065075A (en) * | 2018-09-26 | 2018-12-21 | 广州势必可赢网络科技有限公司 | A kind of method of speech processing, device, system and computer readable storage medium |
CN109949824A (en) * | 2019-01-24 | 2019-06-28 | 江南大学 | City sound event classification method based on N-DenseNet and higher-dimension mfcc feature |
CN110084292A (en) * | 2019-04-18 | 2019-08-02 | 江南大学 | Object detection method based on DenseNet and multi-scale feature fusion |
CN112890828A (en) * | 2021-01-14 | 2021-06-04 | 重庆兆琨智医科技有限公司 | Electroencephalogram signal identification method and system for densely connecting gating network |
CN113012714A (en) * | 2021-02-22 | 2021-06-22 | 哈尔滨工程大学 | Acoustic event detection method based on pixel attention mechanism capsule network model |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11049018B2 (en) * | 2017-06-23 | 2021-06-29 | Nvidia Corporation | Transforming convolutional neural networks for visual sequence learning |
Non-Patent Citations (1)
Title |
---|
N-DenseNet urban sound event classification model; Cao Yi; Journal of Xidian University; Vol. 46, No. 6; 10-15 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
AU2019213369B2 (en) | Non-local memory network for semi-supervised video object segmentation | |
CN110390952B (en) | City sound event classification method based on dual-feature 2-DenseNet parallel connection | |
CN111243579B (en) | Time domain single-channel multi-speaker voice recognition method and system | |
CN109448703B (en) | Audio scene recognition method and system combining deep neural network and topic model | |
CN111832440B (en) | Face feature extraction model construction method, computer storage medium and equipment | |
CN113012714B (en) | Acoustic event detection method based on pixel attention mechanism capsule network model | |
Ming et al. | 3D-TDC: A 3D temporal dilation convolution framework for video action recognition | |
CN114694255B (en) | Sentence-level lip language recognition method based on channel attention and time convolution network | |
CN111833906B (en) | Sound scene classification method based on multi-path acoustic characteristic data enhancement | |
CN113129908A (en) | End-to-end macaque voiceprint verification method and system based on cycle frame level feature fusion | |
Zhu et al. | Semantic image segmentation with improved position attention and feature fusion | |
CN115641533A (en) | Target object emotion recognition method and device and computer equipment | |
Zhang et al. | Learning audio sequence representations for acoustic event classification | |
CN114037699B (en) | Pathological image classification method, equipment, system and storage medium | |
CN111666996A (en) | High-precision equipment source identification method based on attention mechanism | |
CN114943937A (en) | Pedestrian re-identification method and device, storage medium and electronic equipment | |
CN114428860A (en) | Pre-hospital emergency case text recognition method and device, terminal and storage medium | |
CN113744758B (en) | Sound event detection method based on 2-DenseGRUNet model | |
CN111488486B (en) | Electronic music classification method and system based on multi-sound-source separation | |
CN112766368A (en) | Data classification method, equipment and readable storage medium | |
CN117095460A (en) | Self-supervision group behavior recognition method and system based on long-short time relation predictive coding | |
CN113488069B (en) | Speech high-dimensional characteristic rapid extraction method and device based on generation type countermeasure network | |
CN112861949B (en) | Emotion prediction method and system based on face and sound | |
CN111382761B (en) | CNN-based detector, image detection method and terminal | |
Shin et al. | Performance Analysis of a Chunk-Based Speech Emotion Recognition Model Using RNN. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||