CN111599376B - Sound event detection method based on dilated-convolution recurrent neural network - Google Patents

Sound event detection method based on dilated-convolution recurrent neural network

Info

Publication number
CN111599376B
CN111599376B (application CN202010483079.3A)
Authority
CN
China
Prior art keywords
convolution
neural network
dilation
audio
recurrent neural network
Prior art date
Legal status
Active
Application number
CN202010483079.3A
Other languages
Chinese (zh)
Other versions
CN111599376A (en)
Inventor
李艳雄
刘名乐
王武城
江钟杰
陈昊
Current Assignee
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date
Filing date
Publication date
Application filed by South China University of Technology (SCUT)
Priority to CN202010483079.3A
Publication of CN111599376A
Application granted
Publication of CN111599376B
Legal status: Active


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/27 - characterised by the analysis technique
    • G10L25/30 - using neural networks
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L15/08 - Speech classification or search
    • G10L15/16 - using artificial neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a sound event detection method based on a dilated-convolution recurrent neural network, which comprises the following steps: extracting the log-Mel spectral features of each sample; building a dilated-convolution recurrent neural network consisting of a convolutional neural network, a bidirectional long short-term memory (BiLSTM) network and a Sigmoid output layer; training the dilated-convolution recurrent neural network with the log-Mel spectral features extracted from the training samples as input; and recognizing the sound events in the test samples with the trained network to obtain the sound event detection results. The method introduces dilated convolution into the convolutional neural network and combines the convolutional and recurrent networks into a dilated-convolution recurrent neural network. Compared with a conventional convolutional neural network with the same parameter scale, the dilated-convolution recurrent neural network has a larger receptive field, exploits the context information of the audio samples more effectively, and obtains better sound event detection results.

Description

Sound event detection method based on dilated-convolution recurrent neural network
Technical Field
The invention relates to the technical fields of audio signal processing and pattern recognition, and in particular to a sound event detection method based on a dilated-convolution recurrent neural network.
Background
The goal of sound event detection (SED) is to accurately identify the target sound events of various types in an audio recording. Sound event detection can be applied in many areas related to machine listening, such as traffic monitoring, smart meeting rooms, driving assistance, and multimedia analysis. Classifiers for sound event detection include deep models and shallow models. Deep models mainly include convolutional recurrent neural networks, recurrent neural networks, and convolutional neural networks. Shallow models mainly include random regression forests, support vector machines, hidden Markov models, and Gaussian mixture models.
The existing mainstream sound event detection methods based on convolutional neural networks have the following drawback: to enlarge the receptive field and capture longer context in the input audio features, the number of convolutional layers must be increased, which makes the parameter scale of the network very large and easily causes overfitting (reducing the generalization ability of the network).
In the course of making the present invention, the inventors found at least the following: with the same parameter scale, a convolutional recurrent neural network with dilated convolution has a larger receptive field and can capture longer context in the input audio features. To obtain a receptive field of the same size, a dilated-convolution recurrent neural network needs far fewer layers than one using conventional convolution, which effectively avoids the overfitting caused by large-scale network parameters. Therefore, a sound event detection method based on a dilated-convolution recurrent neural network is urgently needed to effectively improve sound event detection performance.
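As a worked illustration of this receptive-field argument (not part of the patent text; the kernel size of 3 is an assumption), the receptive field of a stack of one-dimensional convolutions with kernel size k and dilation rates d_1, ..., d_L is 1 + sum((k - 1) * d_l):

```python
def receptive_field(kernel_size, dilation_rates):
    """Receptive field (in frames) of stacked dilated convolutions.

    Each layer with kernel size k and dilation rate d widens the
    receptive field by (k - 1) * d frames; stride and pooling ignored.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilation_rates)

# Same number of 3-tap layers (same parameter scale), very different context:
print(receptive_field(3, [1, 1, 1, 1, 1]))    # 11 frames (ordinary convolution)
print(receptive_field(3, [2, 4, 8, 16, 32]))  # 125 frames (dilated convolution)
```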
Disclosure of Invention
The invention aims to overcome the above drawback of the prior art and provides a sound event detection method based on a dilated-convolution recurrent neural network, which comprises the following steps: first, extracting log-Mel spectral features: pre-emphasis, framing and windowing are performed on the audio samples, and the log-Mel spectral features of each audio frame are extracted; second, building a dilated-convolution recurrent neural network consisting of a convolutional neural network, a bidirectional long short-term memory (BiLSTM) network and a Sigmoid output layer; third, training the dilated-convolution recurrent neural network with the log-Mel spectral features extracted from the training samples as input; fourth, sound event detection: the sound events in the test samples are recognized with the trained network to obtain the sound event detection results.
The purpose of the invention can be achieved by the following technical scheme:
A sound event detection method based on a dilated-convolution recurrent neural network comprises the following steps:
S1, extracting log-Mel spectral features: pre-emphasis, framing and windowing are performed on the audio samples, and the log-Mel spectrum of each audio frame is then extracted;
S2, building a dilated-convolution recurrent neural network, which consists of a convolutional neural network, a bidirectional long short-term memory (BiLSTM) network and a Sigmoid output layer;
S3, training the dilated-convolution recurrent neural network with the log-Mel spectral features extracted from the training samples as input;
S4, sound event detection: the sound events in the test samples are recognized with the trained dilated-convolution recurrent neural network to obtain the sound event detection results.
Further, the process of extracting the log-Mel spectral features in step S1 is as follows:
S1.1, pre-emphasis: the audio samples are read in and pre-emphasized with a digital filter whose transfer function is $H(z) = 1 - \alpha z^{-1}$, where $\alpha$ is the filter coefficient with $0.9 \le \alpha \le 1$;
S1.2, framing and windowing: the read-in audio sample is divided into frames with a frame length of 0.02 s and a frame shift of 0.01 s, giving the per-frame signal $x'_t(n)$; the window function is a Hamming window $\omega(n)$, and each frame signal $x'_t(n)$ is multiplied by the Hamming window $\omega(n)$ to obtain the windowed t-th frame audio signal $x_t(n)$;
S1.3, extracting the log spectrum: a discrete Fourier transform is applied to the t-th frame audio signal $x_t(n)$ to obtain the linear spectrum $X_t(k)$, which is passed through a Mel filter bank to obtain the Mel spectrum; a logarithm operation finally yields the log spectrum $S_t(m)$;
S1.4, the operation of step S1.3 is applied to every audio frame to obtain the log spectra $S_t(m)$ of all audio frames, which are finally arranged into a feature matrix in frame order, the rows of the feature matrix being the frame order and the columns being the feature dimensions.
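For illustration only (not part of the patent disclosure), a minimal sketch of steps S1.1-S1.4 with the librosa library; the pre-emphasis coefficient 0.97 (within the stated range) and the 64 Mel bands are assumed values:

```python
import numpy as np
import librosa

def log_mel_features(wav_path, alpha=0.97, n_mels=64):
    """Steps S1.1-S1.4: pre-emphasis, 0.02 s Hamming frames with a
    0.01 s shift, DFT -> Mel filter bank -> logarithm."""
    y, sr = librosa.load(wav_path, sr=None)
    y = np.append(y[0], y[1:] - alpha * y[:-1])  # S1.1: H(z) = 1 - alpha * z^-1
    n_fft = int(0.02 * sr)                       # 0.02 s frame length
    hop = int(0.01 * sr)                         # 0.01 s frame shift
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop,
        window="hamming", n_mels=n_mels)         # S1.2 + S1.3
    return np.log(mel + 1e-10).T                 # S1.4: rows = frames, columns = features
```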
Furthermore, the convolutional neural network consists of one dilated convolution module or two or more cascaded dilated convolution modules, where each dilated convolution module comprises a dilated convolution unit, a pooling unit, an excitation unit and a batch normalization unit.
The expression of the dilated convolution unit is:
$$x_i^{(l)} = f\left(k_i * x_i^{(l-1)} + b_i\right)$$
where $x_i^{(l)}$ denotes the feature vector of the i-th audio sample at layer $l$, $*$ denotes the dilated convolution operation, $f(\cdot)$ denotes the activation function, and $k_i$ and $b_i$ denote the convolution kernel parameters and the bias term convolved with the feature vector of the i-th audio sample, respectively.
The pooling unit uses max pooling. The excitation function used in the excitation unit is the rectified linear unit (ReLU), which introduces non-linearity between the layers of the neural network.
The batch normalization unit alleviates the exploding-gradient problem of the network and accelerates its convergence; its computation consists of:
approximate whitening:
$$\hat{x}^{(i)} = \frac{x^{(i)} - E\left(x^{(i)}\right)}{\sqrt{\operatorname{Var}\left(x^{(i)}\right)}}$$
transformation and reconstruction:
$$y^{(i)} = \gamma^{(i)} \hat{x}^{(i)} + \beta^{(i)}$$
where $E(x^{(i)})$ denotes the mean of the feature vector $x^{(i)}$ of the i-th audio sample, $\sqrt{\operatorname{Var}(x^{(i)})}$ denotes its standard deviation, $\hat{x}^{(i)}$ is the result of the approximate whitening, $y^{(i)}$ denotes the reconstructed feature vector, and $\gamma^{(i)}$ and $\beta^{(i)}$ denote adjustable reconstruction parameters.
Further, the bidirectional long short-term memory (BiLSTM) network makes full use of context information and maps the feature representation learned by the convolutional neural network to the sample label space.
Further, the Sigmoid output layer uses the binary cross-entropy loss, whose expression is:
$$E = -\frac{1}{N}\sum_{i=1}^{N}\left[l^{(i)}\log\hat{l}^{(i)} + \left(1 - l^{(i)}\right)\log\left(1 - \hat{l}^{(i)}\right)\right]$$
where $N$ denotes the number of samples, $l^{(i)}$ denotes the true label of the i-th audio sample, and $\hat{l}^{(i)}$ denotes the predicted label of the i-th audio sample.
Further, the specific process of training the dilated-convolution recurrent neural network in step S3 is as follows (a configuration sketch follows this list):
the log-Mel spectral features extracted from the training samples of different audio databases are input to the dilated-convolution recurrent neural network, and the number of dilated convolution modules and the dilation rates are adjusted separately;
when the number of dilated convolution modules is 1, two groups of dilation rates are set: one group is 1, i.e. the dilation rates of all convolution layers in the module are set to 1; the other group is 2, i.e. the dilation rates of all convolution layers in the module are set to 2;
when the number of dilated convolution modules is 2, two groups of dilation rates are set: one group is 1-1, i.e. the dilation rates of all convolution layers are set to 1; the other group is 2-4, i.e. the dilation rates of the convolution layers in the first and second modules are 2 and 4, respectively;
when the number of dilated convolution modules is 3, two groups of dilation rates are set: one group is 1-1-1, i.e. the dilation rates of all convolution layers are set to 1; the other group is 2-4-8, i.e. the dilation rates of the convolution layers in the first, second and third modules are 2, 4 and 8, respectively;
when the number of dilated convolution modules is 4, two groups of dilation rates are set: one group is 1-1-1-1, i.e. the dilation rates of all convolution layers are set to 1; the other group is 2-4-8-16, i.e. the dilation rates of the convolution layers in the first, second, third and fourth modules are 2, 4, 8 and 16, respectively;
when the number of dilated convolution modules is 5, two groups of dilation rates are set: one group is 1-1-1-1-1, i.e. the dilation rates of all convolution layers are set to 1; the other group is 2-4-8-16-32, i.e. the dilation rates of the convolution layers in the first, second, third, fourth and fifth modules are 2, 4, 8, 16 and 32, respectively.
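A hedged sketch of enumerating these ten configurations with the illustrative DilatedCRNN class above (the class and the training call are assumptions, not the patent's reference implementation):

```python
# The ten configurations of step S3: for each module count 1..5, one
# baseline group (all dilation rates 1) and one dilated group (powers of 2).
configs = []
for n in range(1, 6):
    configs.append(tuple([1] * n))                          # e.g. (1, 1, 1)
    configs.append(tuple(2 ** k for k in range(1, n + 1)))  # e.g. (2, 4, 8)

for dilations in configs:
    model = DilatedCRNN(dilations=dilations)
    # ... train on the log-Mel features and compare detection accuracy
```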
Further, the process of sound event detection in step S4 is as follows (see the sketch after this list):
S4.1, the audio features in each test data set are extracted, and each audio frame is recognized with the trained dilated-convolution recurrent neural network;
S4.2, the frame-level recognition results are concatenated in time order to obtain the recognition result of the audio segment, and the sound event detection accuracy is then computed at both the audio-frame level and the audio-segment level.
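For illustration only, a minimal sketch of S4.1-S4.2: the frame-wise output probabilities are thresholded, concatenated in time order, and scored at the frame level and the segment level. The 0.5 threshold and the one-second segments (100 frames at the 0.01 s frame shift above) are assumptions, and micro-averaged F1 is used here as a common SED accuracy measure, not one fixed by the patent:

```python
import numpy as np

def detect_and_score(probs, labels, threshold=0.5, frames_per_segment=100):
    """probs, labels: (frames, events) arrays of network outputs / references."""
    frame_pred = (probs >= threshold).astype(int)  # S4.1: frame-level decisions

    # S4.2a: frame-level micro F1 over the concatenated recording
    tp = np.sum(frame_pred * labels)
    f1_frame = 2 * tp / (frame_pred.sum() + labels.sum() + 1e-10)

    # S4.2b: segment level; an event is active in a segment if active in any frame
    n_seg = probs.shape[0] // frames_per_segment
    seg = lambda a: a[:n_seg * frames_per_segment].reshape(
        n_seg, frames_per_segment, -1).max(axis=1)
    sp, sl = seg(frame_pred), seg(labels)
    f1_seg = 2 * np.sum(sp * sl) / (sp.sum() + sl.sum() + 1e-10)
    return f1_frame, f1_seg
```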
Compared with the prior art, the invention has the following advantages and effects:
the sound event detection method based on the dilated-convolution recurrent neural network achieves higher detection accuracy while capturing context of the same length in the input audio features, reduces the parameter scale of the neural network, avoids overfitting of the neural network, and improves its generalization ability.
Drawings
FIG. 1 is a flowchart of the sound event detection method based on a dilated-convolution recurrent neural network disclosed in an embodiment of the present invention;
FIG. 2 is a structural diagram of the dilated-convolution recurrent neural network disclosed in an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the drawings. The described embodiments are some, but not all, embodiments of the present invention. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort fall within the protection scope of the present invention.
Embodiment
FIG. 1 is a flowchart of an embodiment of the sound event detection method based on a dilated-convolution recurrent neural network; the method comprises the following steps:
S1, extracting log-Mel spectral features: pre-emphasis, framing and windowing are performed on the audio samples, and the log-Mel spectrum of each audio frame is then extracted;
In this embodiment, extracting the log-Mel spectral features in step S1 specifically comprises the following steps:
S1.1, pre-emphasis: the audio samples are read in and pre-emphasized with a digital filter whose transfer function is $H(z) = 1 - \alpha z^{-1}$, where $\alpha$ is the filter coefficient with $0.9 \le \alpha \le 1$;
S1.2, framing and windowing: the read-in audio sample is divided into frames with a frame length of 0.02 s and a frame shift of 0.01 s, giving the per-frame signal $x'_t(n)$; the window function is a Hamming window $\omega(n)$, and each frame signal $x'_t(n)$ is multiplied by the Hamming window $\omega(n)$ to obtain the windowed t-th frame audio signal $x_t(n)$;
S1.3, extracting the log spectrum: a discrete Fourier transform is applied to the t-th frame audio signal $x_t(n)$ to obtain the linear spectrum $X_t(k)$, which is passed through a Mel filter bank to obtain the Mel spectrum; a logarithm operation finally yields the log spectrum $S_t(m)$;
S1.4, the operation of step S1.3 is applied to every audio frame to obtain the log spectra $S_t(m)$ of all audio frames, which are finally arranged into a feature matrix in frame order, the rows of the feature matrix being the frame order and the columns being the feature dimensions.
S2, building the dilated-convolution recurrent neural network, which consists of a convolutional neural network, a bidirectional long short-term memory (BiLSTM) network and a Sigmoid output layer;
the dilated-convolution recurrent neural network comprises the cascaded convolutional neural network, BiLSTM network and Sigmoid output layer, as shown in FIG. 2.
The convolutional neural network consists of one dilated convolution module or two or more cascaded dilated convolution modules, where each dilated convolution module comprises a dilated convolution unit, a pooling unit, an excitation unit and a batch normalization unit;
(1) The expression of the dilated convolution unit is:
$$x_i^{(l)} = f\left(k_i * x_i^{(l-1)} + b_i\right)$$
where $x_i^{(l)}$ denotes the feature vector of the i-th audio sample at layer $l$, $*$ denotes the dilated convolution operation, $f(\cdot)$ denotes the activation function, and $k_i$ and $b_i$ denote the convolution kernel parameters and the bias term convolved with the feature vector of the i-th audio sample, respectively;
(2) Pooling unit and excitation unit:
the pooling unit uses max pooling, and the excitation function used in the excitation unit is the rectified linear unit (ReLU), which introduces non-linearity between the layers of the neural network;
(3) Batch normalization unit:
the batch normalization unit is mainly used to alleviate the exploding-gradient problem of the network and to accelerate its convergence; the main computation consists of:
approximate whitening:
$$\hat{x}^{(i)} = \frac{x^{(i)} - E\left(x^{(i)}\right)}{\sqrt{\operatorname{Var}\left(x^{(i)}\right)}}$$
transformation and reconstruction:
$$y^{(i)} = \gamma^{(i)} \hat{x}^{(i)} + \beta^{(i)}$$
where $E(x^{(i)})$ denotes the mean of the feature vector $x^{(i)}$ of the i-th audio sample, $\sqrt{\operatorname{Var}(x^{(i)})}$ denotes its standard deviation, $\hat{x}^{(i)}$ is the result of the approximate whitening, $y^{(i)}$ denotes the reconstructed feature vector, and $\gamma^{(i)}$ and $\beta^{(i)}$ denote adjustable reconstruction parameters.
The bidirectional long short-term memory (BiLSTM) network makes full use of context information and maps the feature representation learned by the convolutional neural network to the sample label space;
the Sigmoid output layer uses the binary cross-entropy loss, whose expression is:
$$E = -\frac{1}{N}\sum_{i=1}^{N}\left[l^{(i)}\log\hat{l}^{(i)} + \left(1 - l^{(i)}\right)\log\left(1 - \hat{l}^{(i)}\right)\right]$$
where $N$ denotes the number of samples, $l^{(i)}$ denotes the true label of the i-th audio sample, and $\hat{l}^{(i)}$ denotes the predicted label of the i-th audio sample.
S3, training the dilated-convolution recurrent neural network: the dilated-convolution recurrent neural network is trained with the log-Mel spectral features extracted from the training samples as input;
In this embodiment, the specific process of training the dilated-convolution recurrent neural network is as follows (a training-loop sketch follows this list):
the log-Mel spectral features extracted from the training samples of different audio databases are input to the dilated-convolution recurrent neural network, and the number of dilated convolution modules and the dilation rates are adjusted separately.
When the number of dilated convolution modules is 1, two groups of dilation rates are set: one group takes the value 1, i.e. the dilation rates of all convolution layers in the module are set to 1; the other group takes the value 2, i.e. the dilation rates of all convolution layers in the module are set to 2.
When the number of dilated convolution modules is 2, two groups of dilation rates are set: one group takes the value 1-1, i.e. the dilation rates of all convolution layers are set to 1; the other group takes the value 2-4, i.e. the dilation rates of the convolution layers in the first and second modules are 2 and 4, respectively.
When the number of dilated convolution modules is 3, two groups of dilation rates are set: one group takes the value 1-1-1, i.e. the dilation rates of all convolution layers are set to 1; the other group takes the value 2-4-8, i.e. the dilation rates of the convolution layers in the first, second and third modules are 2, 4 and 8, respectively.
When the number of dilated convolution modules is 4, two groups of dilation rates are set: one group takes the value 1-1-1-1, i.e. the dilation rates of all convolution layers are set to 1; the other group takes the value 2-4-8-16, i.e. the dilation rates of the convolution layers in the first, second, third and fourth modules are 2, 4, 8 and 16, respectively.
When the number of dilated convolution modules is 5, two groups of dilation rates are set: one group takes the value 1-1-1-1-1, i.e. the dilation rates of all convolution layers are set to 1; the other group takes the value 2-4-8-16-32, i.e. the dilation rates of the convolution layers in the first, second, third, fourth and fifth modules are 2, 4, 8, 16 and 32, respectively.
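For illustration only, a minimal training loop for one of these configurations, assuming the DilatedCRNN sketch from the disclosure section above and a PyTorch DataLoader (train_loader) yielding (features, labels) batches; the Adam optimizer, the learning rate and the 30 epochs are assumptions:

```python
import torch

model = DilatedCRNN(dilations=(2, 4, 8, 16, 32))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.BCELoss()  # the binary cross-entropy loss given above

for epoch in range(30):
    for feats, labels in train_loader:  # feats: (batch, 1, frames, n_mels)
        optimizer.zero_grad()
        probs = model(feats)            # (batch, frames, n_events)
        loss = criterion(probs, labels) # labels: float tensor, same shape
        loss.backward()
        optimizer.step()
```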
S4, sound event detection: the sound events in the test samples are recognized with the trained dilated-convolution recurrent neural network to obtain the sound event detection results.
In this embodiment, sound event detection specifically comprises the following steps:
S4.1, the audio features in each test data set are extracted, and each audio frame is recognized with the trained dilated-convolution recurrent neural network;
S4.2, the frame-level recognition results are concatenated in time order to obtain the recognition results of the audio segments, and the sound event detection accuracy is then computed at both the audio-frame level and the audio-segment level.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to them; any other change, modification, substitution, combination or simplification that does not depart from the spirit and principle of the present invention shall be regarded as an equivalent replacement and is included in the protection scope of the present invention.

Claims (6)

1. A sound event detection method based on a dilated-convolution recurrent neural network, characterized by comprising the following steps:
S1, extracting log-Mel spectral features: pre-emphasis, framing and windowing are performed on the audio samples, and the log-Mel spectrum of each audio frame is then extracted;
S2, building a dilated-convolution recurrent neural network, which consists of a convolutional neural network, a bidirectional long short-term memory (BiLSTM) network and a Sigmoid output layer;
S3, training the dilated-convolution recurrent neural network with the log-Mel spectral features extracted from the training samples as input, the process being as follows:
the log-Mel spectral features extracted from the training samples of different audio databases are input to the dilated-convolution recurrent neural network, and the number of dilated convolution modules and the dilation rates are adjusted separately;
when the number of dilated convolution modules is 1, two groups of dilation rates are set: one group is 1, i.e. the dilation rates of all convolution layers in the module are set to 1; the other group is 2, i.e. the dilation rates of all convolution layers in the module are set to 2;
when the number of dilated convolution modules is 2, two groups of dilation rates are set: one group is 1-1, i.e. the dilation rates of all convolution layers are set to 1; the other group is 2-4, i.e. the dilation rates of the convolution layers in the first and second modules are 2 and 4, respectively;
when the number of dilated convolution modules is 3, two groups of dilation rates are set: one group is 1-1-1, i.e. the dilation rates of all convolution layers are set to 1; the other group is 2-4-8, i.e. the dilation rates of the convolution layers in the first, second and third modules are 2, 4 and 8, respectively;
when the number of dilated convolution modules is 4, two groups of dilation rates are set: one group is 1-1-1-1, i.e. the dilation rates of all convolution layers are set to 1; the other group is 2-4-8-16, i.e. the dilation rates of the convolution layers in the first, second, third and fourth modules are 2, 4, 8 and 16, respectively;
when the number of dilated convolution modules is 5, two groups of dilation rates are set: one group is 1-1-1-1-1, i.e. the dilation rates of all convolution layers are set to 1; the other group is 2-4-8-16-32, i.e. the dilation rates of the convolution layers in the first, second, third, fourth and fifth modules are 2, 4, 8, 16 and 32, respectively;
S4, sound event detection: the sound events in the test samples are recognized with the trained dilated-convolution recurrent neural network to obtain the sound event detection results.
2. The sound event detection method based on the dilated-convolution recurrent neural network according to claim 1, characterized in that the process of extracting the log-Mel spectral features in step S1 is as follows:
S1.1, pre-emphasis: the audio samples are read in and pre-emphasized with a digital filter whose transfer function is $H(z) = 1 - \alpha z^{-1}$, where $\alpha$ is the filter coefficient with $0.9 \le \alpha \le 1$;
S1.2, framing and windowing: the read-in audio sample is divided into frames with a frame length of 0.02 s and a frame shift of 0.01 s, giving the per-frame signal $x'_t(n)$; the window function is a Hamming window $\omega(n)$, and each frame signal $x'_t(n)$ is multiplied by the Hamming window $\omega(n)$ to obtain the windowed t-th frame audio signal $x_t(n)$;
S1.3, extracting the log spectrum: a discrete Fourier transform is applied to the t-th frame audio signal $x_t(n)$ to obtain the linear spectrum $X_t(k)$, which is passed through a Mel filter bank to obtain the Mel spectrum; a logarithm operation finally yields the log spectrum $S_t(m)$;
S1.4, the operation of step S1.3 is applied to every audio frame to obtain the log spectra $S_t(m)$ of all audio frames, which are finally arranged into a feature matrix in frame order, the rows of the feature matrix being the frame order and the columns being the feature dimensions.
3. The sound event detection method based on the dilated-convolution recurrent neural network according to claim 1, characterized in that the convolutional neural network consists of one or more cascaded dilated convolution modules, each dilated convolution module comprising a dilated convolution unit, a pooling unit, an excitation unit and a batch normalization unit;
the expression of the dilated convolution unit is:
$$x_i^{(l)} = f\left(k_i * x_i^{(l-1)} + b_i\right)$$
where $x_i^{(l)}$ denotes the feature vector of the i-th audio sample at layer $l$, $*$ denotes the dilated convolution operation, $f(\cdot)$ denotes the activation function, and $k_i$ and $b_i$ denote the convolution kernel parameters and the bias term convolved with the feature vector of the i-th audio sample, respectively;
the pooling unit uses max pooling; the excitation function used in the excitation unit is the rectified linear unit (ReLU), which introduces non-linearity between the layers of the neural network;
the batch normalization unit alleviates the exploding-gradient problem of the network and accelerates its convergence, and its computation consists of:
approximate whitening:
$$\hat{x}^{(i)} = \frac{x^{(i)} - E\left(x^{(i)}\right)}{\sqrt{\operatorname{Var}\left(x^{(i)}\right)}}$$
transformation and reconstruction:
$$y^{(i)} = \gamma^{(i)} \hat{x}^{(i)} + \beta^{(i)}$$
where $E(x^{(i)})$ denotes the mean of the feature vector $x^{(i)}$ of the i-th audio sample, $\sqrt{\operatorname{Var}(x^{(i)})}$ denotes its standard deviation, $\hat{x}^{(i)}$ is the result of the approximate whitening, $y^{(i)}$ denotes the reconstructed feature vector, and $\gamma^{(i)}$ and $\beta^{(i)}$ denote adjustable reconstruction parameters.
4. The sound event detection method based on the dilated-convolution recurrent neural network according to claim 1, characterized in that the bidirectional long short-term memory network makes full use of context information and maps the feature representation learned by the convolutional neural network to the sample label space.
5. The sound event detection method based on the dilated-convolution recurrent neural network according to claim 1, characterized in that the Sigmoid output layer uses the binary cross-entropy loss, whose expression is:
$$E = -\frac{1}{N}\sum_{i=1}^{N}\left[l^{(i)}\log\hat{l}^{(i)} + \left(1 - l^{(i)}\right)\log\left(1 - \hat{l}^{(i)}\right)\right]$$
where $N$ denotes the number of samples, $l^{(i)}$ denotes the true label of the i-th audio sample, and $\hat{l}^{(i)}$ denotes the predicted label of the i-th audio sample.
6. The sound event detection method based on the dilated-convolution recurrent neural network according to claim 1, characterized in that the sound event detection in step S4 is as follows:
S4.1, the audio features in each test data set are extracted, and each audio frame is recognized with the trained dilated-convolution recurrent neural network;
S4.2, the frame-level recognition results are concatenated in time order to obtain the recognition results of the audio segments, and the sound event detection accuracy is then computed at both the audio-frame level and the audio-segment level.
CN202010483079.3A 2020-06-01 2020-06-01 Sound event detection method based on dilated-convolution recurrent neural network Active CN111599376B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010483079.3A CN111599376B (en) 2020-06-01 2020-06-01 Sound event detection method based on dilated-convolution recurrent neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010483079.3A CN111599376B (en) 2020-06-01 2020-06-01 Sound event detection method based on dilated-convolution recurrent neural network

Publications (2)

Publication Number Publication Date
CN111599376A CN111599376A (en) 2020-08-28
CN111599376B (en) 2023-02-14

Family

ID=72192486

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010483079.3A Active CN111599376B (en) 2020-06-01 2020-06-01 Sound event detection method based on dilated-convolution recurrent neural network

Country Status (1)

Country Link
CN (1) CN111599376B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112133326A * 2020-09-08 2020-12-25 Southeast University Gunshot data augmentation and detection method based on adversarial neural network
CN112529152A * 2020-12-03 2021-03-19 Open Intelligent Machines (Shanghai) Co., Ltd. System and method for detecting watermelon maturity based on artificial intelligence
CN112951242B * 2021-02-02 2022-10-25 South China University of Technology Short-utterance speaker matching method based on Siamese neural network
CN113658607A * 2021-07-23 2021-11-16 Nanjing University of Science and Technology Environmental sound classification method based on data augmentation and convolutional recurrent neural network
CN113990303B * 2021-10-08 2024-04-12 South China University of Technology Environmental sound recognition method based on multi-resolution dilated depthwise-separable convolutional network


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10930167B2 (en) * 2015-06-16 2021-02-23 Upchurch & Associates Inc. Sound association test

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109065030A * 2018-08-01 2018-12-21 Shanghai University Ambient sound recognition method and system based on convolutional neural networks
CN109767785A * 2019-03-06 2019-05-17 Hebei University of Technology Ambient noise identification and classification method based on convolutional neural networks
CN110223715A * 2019-05-07 2019-09-10 South China University of Technology Method for estimating the in-home activity of elderly people living alone based on sound event detection

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WeiRan Yan, "Audio-based automatic mating success prediction of giant pandas," arXiv, 2019-12-24, pp. 1-4 *

Also Published As

Publication number Publication date
CN111599376A (en) 2020-08-28


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant