CN111599376B - Sound event detection method based on dilated-convolution recurrent neural network - Google Patents

Sound event detection method based on dilated-convolution recurrent neural network

Info

Publication number
CN111599376B
CN111599376B (application CN202010483079.3A)
Authority
CN
China
Prior art keywords
convolution
neural network
dilation
audio
recurrent neural network
Prior art date
Legal status
Active
Application number
CN202010483079.3A
Other languages
Chinese (zh)
Other versions
CN111599376A (en)
Inventor
李艳雄
刘名乐
王武城
江钟杰
陈昊
Current Assignee
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date
Filing date
Publication date
Application filed by South China University of Technology (SCUT)
Priority to CN202010483079.3A
Publication of CN111599376A
Application granted
Publication of CN111599376B
Legal status: Active


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/27 - characterised by the analysis technique
    • G10L25/30 - using neural networks
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L15/08 - Speech classification or search
    • G10L15/16 - using artificial neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a sound event detection method based on a dilated-convolution recurrent neural network, which comprises the following steps: extracting the log-Mel spectral features of each sample; building a dilated-convolution recurrent neural network consisting of a convolutional neural network, a bidirectional long short-term memory (BiLSTM) network and a Sigmoid output layer; training the dilated-convolution recurrent neural network with the log-Mel spectral features extracted from the training samples as input; and recognizing the sound events in the test samples with the trained network to obtain the sound event detection results. The method introduces dilated convolution into the convolutional neural network and combines the convolutional and recurrent networks into a dilated-convolution recurrent neural network. Compared with a conventional convolutional neural network with the same parameter scale, the dilated-convolution recurrent neural network has a larger receptive field, exploits the context information of the audio samples more effectively, and obtains better sound event detection results.

Description

Sound event detection method based on dilated-convolution recurrent neural network
Technical Field
The invention relates to the technical fields of audio signal processing and pattern recognition, and in particular to a sound event detection method based on a dilated-convolution recurrent neural network.
Background
The goal of sound event detection (SED) is to accurately identify the target sound events of various types in an audio recording. Sound event detection can be applied in many areas related to machine listening, such as traffic monitoring, smart meeting rooms, driving assistance, and multimedia analysis. Classifiers for sound event detection include deep models and shallow models. Deep models mainly include convolutional recurrent neural networks, recurrent neural networks, and convolutional neural networks. Shallow models mainly include random regression forests, support vector machines, hidden Markov models, and Gaussian mixture models.
The existing mainstream sound event detection methods based on convolutional neural networks have the following drawback: to enlarge the receptive field and capture longer context in the input audio features, the number of convolutional layers must be increased, which makes the parameter scale of the network very large and easily causes overfitting (reducing the generalization ability of the network).
In the course of making the present invention, the inventors found at least the following: with the same parameter scale, a convolutional recurrent neural network with dilated convolution has a larger receptive field and can capture longer context in the input audio features. To obtain a receptive field of the same size, a dilated-convolution recurrent neural network needs far fewer layers than one using conventional convolution, which effectively avoids the overfitting caused by large-scale network parameters. Therefore, a sound event detection method based on a dilated-convolution recurrent neural network is urgently needed to effectively improve sound event detection performance.
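As a worked illustration of this receptive-field argument (not part of the patent text; the kernel size of 3 is an assumption), the receptive field of a stack of one-dimensional convolutions with kernel size k and dilation rates d_1, ..., d_L is 1 + sum((k - 1) * d_l):

```python
def receptive_field(kernel_size, dilation_rates):
    """Receptive field (in frames) of stacked dilated convolutions.

    Each layer with kernel size k and dilation rate d widens the
    receptive field by (k - 1) * d frames; stride and pooling ignored.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilation_rates)

# Same number of 3-tap layers (same parameter scale), very different context:
print(receptive_field(3, [1, 1, 1, 1, 1]))    # 11 frames (ordinary convolution)
print(receptive_field(3, [2, 4, 8, 16, 32]))  # 125 frames (dilated convolution)
```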
Disclosure of Invention
The invention aims to overcome the above drawback of the prior art and provides a sound event detection method based on a dilated-convolution recurrent neural network, which comprises the following steps: first, extracting log-Mel spectral features: pre-emphasis, framing and windowing are performed on the audio samples, and the log-Mel spectral features of each audio frame are extracted; second, building a dilated-convolution recurrent neural network consisting of a convolutional neural network, a bidirectional long short-term memory (BiLSTM) network and a Sigmoid output layer; third, training the dilated-convolution recurrent neural network with the log-Mel spectral features extracted from the training samples as input; fourth, sound event detection: the sound events in the test samples are recognized with the trained network to obtain the sound event detection results.
The purpose of the invention can be achieved by the following technical scheme:
A sound event detection method based on a dilated-convolution recurrent neural network comprises the following steps:
S1, extracting log-Mel spectral features: pre-emphasis, framing and windowing are performed on the audio samples, and the log-Mel spectrum of each audio frame is then extracted;
S2, building a dilated-convolution recurrent neural network, which consists of a convolutional neural network, a bidirectional long short-term memory (BiLSTM) network and a Sigmoid output layer;
S3, training the dilated-convolution recurrent neural network with the log-Mel spectral features extracted from the training samples as input;
S4, sound event detection: the sound events in the test samples are recognized with the trained dilated-convolution recurrent neural network to obtain the sound event detection results.
Further, the process of extracting the log-Mel spectral features in step S1 is as follows:
S1.1, pre-emphasis: the audio samples are read in and pre-emphasized with a digital filter whose transfer function is $H(z) = 1 - \alpha z^{-1}$, where $\alpha$ is the filter coefficient with $0.9 \le \alpha \le 1$;
S1.2, framing and windowing: the read-in audio sample is divided into frames with a frame length of 0.02 s and a frame shift of 0.01 s, giving the per-frame signal $x'_t(n)$; the window function is a Hamming window $\omega(n)$, and each frame signal $x'_t(n)$ is multiplied by the Hamming window $\omega(n)$ to obtain the windowed t-th frame audio signal $x_t(n)$;
S1.3, extracting the log spectrum: a discrete Fourier transform is applied to the t-th frame audio signal $x_t(n)$ to obtain the linear spectrum $X_t(k)$, which is passed through a Mel filter bank to obtain the Mel spectrum; a logarithm operation finally yields the log spectrum $S_t(m)$;
S1.4, the operation of step S1.3 is applied to every audio frame to obtain the log spectra $S_t(m)$ of all audio frames, which are finally arranged into a feature matrix in frame order, the rows of the feature matrix being the frame order and the columns being the feature dimensions.
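For illustration only (not part of the patent disclosure), a minimal sketch of steps S1.1-S1.4 with the librosa library; the pre-emphasis coefficient 0.97 (within the stated range) and the 64 Mel bands are assumed values:

```python
import numpy as np
import librosa

def log_mel_features(wav_path, alpha=0.97, n_mels=64):
    """Steps S1.1-S1.4: pre-emphasis, 0.02 s Hamming frames with a
    0.01 s shift, DFT -> Mel filter bank -> logarithm."""
    y, sr = librosa.load(wav_path, sr=None)
    y = np.append(y[0], y[1:] - alpha * y[:-1])  # S1.1: H(z) = 1 - alpha * z^-1
    n_fft = int(0.02 * sr)                       # 0.02 s frame length
    hop = int(0.01 * sr)                         # 0.01 s frame shift
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop,
        window="hamming", n_mels=n_mels)         # S1.2 + S1.3
    return np.log(mel + 1e-10).T                 # S1.4: rows = frames, columns = features
```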
Furthermore, the convolutional neural network consists of one dilated convolution module or two or more cascaded dilated convolution modules, where each dilated convolution module comprises a dilated convolution unit, a pooling unit, an excitation unit and a batch normalization unit.
The expression of the dilated convolution unit is:
$$x_i^{(l)} = f\left(k_i * x_i^{(l-1)} + b_i\right)$$
where $x_i^{(l)}$ denotes the feature vector of the i-th audio sample at layer $l$, $*$ denotes the dilated convolution operation, $f(\cdot)$ denotes the activation function, and $k_i$ and $b_i$ denote the convolution kernel parameters and the bias term convolved with the feature vector of the i-th audio sample, respectively.
The pooling unit uses max pooling. The excitation function used in the excitation unit is the rectified linear unit (ReLU), which introduces non-linearity between the layers of the neural network.
The batch normalization unit alleviates the exploding-gradient problem of the network and accelerates its convergence; its computation consists of:
approximate whitening:
$$\hat{x}^{(i)} = \frac{x^{(i)} - E\left(x^{(i)}\right)}{\sqrt{\operatorname{Var}\left(x^{(i)}\right)}}$$
transformation and reconstruction:
$$y^{(i)} = \gamma^{(i)} \hat{x}^{(i)} + \beta^{(i)}$$
where $E(x^{(i)})$ denotes the mean of the feature vector $x^{(i)}$ of the i-th audio sample, $\sqrt{\operatorname{Var}(x^{(i)})}$ denotes its standard deviation, $\hat{x}^{(i)}$ is the result of the approximate whitening, $y^{(i)}$ denotes the reconstructed feature vector, and $\gamma^{(i)}$ and $\beta^{(i)}$ denote adjustable reconstruction parameters.
Further, the bidirectional long short-term memory (BiLSTM) network makes full use of context information and maps the feature representation learned by the convolutional neural network to the sample label space.
Further, the Sigmoid output layer uses the binary cross-entropy loss, whose expression is:
$$E = -\frac{1}{N}\sum_{i=1}^{N}\left[l^{(i)}\log\hat{l}^{(i)} + \left(1 - l^{(i)}\right)\log\left(1 - \hat{l}^{(i)}\right)\right]$$
where $N$ denotes the number of samples, $l^{(i)}$ denotes the true label of the i-th audio sample, and $\hat{l}^{(i)}$ denotes the predicted label of the i-th audio sample.
Further, the specific process of training the dilated-convolution recurrent neural network in step S3 is as follows (a configuration sketch follows this list):
the log-Mel spectral features extracted from the training samples of different audio databases are input to the dilated-convolution recurrent neural network, and the number of dilated convolution modules and the dilation rates are adjusted separately;
when the number of dilated convolution modules is 1, two groups of dilation rates are set: one group is 1, i.e. the dilation rates of all convolution layers in the module are set to 1; the other group is 2, i.e. the dilation rates of all convolution layers in the module are set to 2;
when the number of dilated convolution modules is 2, two groups of dilation rates are set: one group is 1-1, i.e. the dilation rates of all convolution layers are set to 1; the other group is 2-4, i.e. the dilation rates of the convolution layers in the first and second modules are 2 and 4, respectively;
when the number of dilated convolution modules is 3, two groups of dilation rates are set: one group is 1-1-1, i.e. the dilation rates of all convolution layers are set to 1; the other group is 2-4-8, i.e. the dilation rates of the convolution layers in the first, second and third modules are 2, 4 and 8, respectively;
when the number of dilated convolution modules is 4, two groups of dilation rates are set: one group is 1-1-1-1, i.e. the dilation rates of all convolution layers are set to 1; the other group is 2-4-8-16, i.e. the dilation rates of the convolution layers in the first, second, third and fourth modules are 2, 4, 8 and 16, respectively;
when the number of dilated convolution modules is 5, two groups of dilation rates are set: one group is 1-1-1-1-1, i.e. the dilation rates of all convolution layers are set to 1; the other group is 2-4-8-16-32, i.e. the dilation rates of the convolution layers in the first, second, third, fourth and fifth modules are 2, 4, 8, 16 and 32, respectively.
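A hedged sketch of enumerating these ten configurations with the illustrative DilatedCRNN class above (the class and the training call are assumptions, not the patent's reference implementation):

```python
# The ten configurations of step S3: for each module count 1..5, one
# baseline group (all dilation rates 1) and one dilated group (powers of 2).
configs = []
for n in range(1, 6):
    configs.append(tuple([1] * n))                          # e.g. (1, 1, 1)
    configs.append(tuple(2 ** k for k in range(1, n + 1)))  # e.g. (2, 4, 8)

for dilations in configs:
    model = DilatedCRNN(dilations=dilations)
    # ... train on the log-Mel features and compare detection accuracy
```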
Further, the process of sound event detection in step S4 is as follows (see the sketch after this list):
S4.1, the audio features in each test data set are extracted, and each audio frame is recognized with the trained dilated-convolution recurrent neural network;
S4.2, the frame-level recognition results are concatenated in time order to obtain the recognition result of the audio segment, and the sound event detection accuracy is then computed at both the audio-frame level and the audio-segment level.
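For illustration only, a minimal sketch of S4.1-S4.2: the frame-wise output probabilities are thresholded, concatenated in time order, and scored at the frame level and the segment level. The 0.5 threshold and the one-second segments (100 frames at the 0.01 s frame shift above) are assumptions, and micro-averaged F1 is used here as a common SED accuracy measure, not one fixed by the patent:

```python
import numpy as np

def detect_and_score(probs, labels, threshold=0.5, frames_per_segment=100):
    """probs, labels: (frames, events) arrays of network outputs / references."""
    frame_pred = (probs >= threshold).astype(int)  # S4.1: frame-level decisions

    # S4.2a: frame-level micro F1 over the concatenated recording
    tp = np.sum(frame_pred * labels)
    f1_frame = 2 * tp / (frame_pred.sum() + labels.sum() + 1e-10)

    # S4.2b: segment level; an event is active in a segment if active in any frame
    n_seg = probs.shape[0] // frames_per_segment
    seg = lambda a: a[:n_seg * frames_per_segment].reshape(
        n_seg, frames_per_segment, -1).max(axis=1)
    sp, sl = seg(frame_pred), seg(labels)
    f1_seg = 2 * np.sum(sp * sl) / (sp.sum() + sl.sum() + 1e-10)
    return f1_frame, f1_seg
```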
Compared with the prior art, the invention has the following advantages and effects:
the sound event detection method based on the dilated-convolution recurrent neural network achieves higher detection accuracy while capturing context of the same length in the input audio features, reduces the parameter scale of the neural network, avoids overfitting of the neural network, and improves its generalization ability.
Drawings
FIG. 1 is a flowchart of the sound event detection method based on a dilated-convolution recurrent neural network disclosed in an embodiment of the present invention;
FIG. 2 is a structural diagram of the dilated-convolution recurrent neural network disclosed in an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the drawings. The described embodiments are some, but not all, embodiments of the present invention. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort fall within the protection scope of the present invention.
Embodiment
FIG. 1 is a flowchart of an embodiment of the sound event detection method based on a dilated-convolution recurrent neural network; the method comprises the following steps:
S1, extracting log-Mel spectral features: pre-emphasis, framing and windowing are performed on the audio samples, and the log-Mel spectrum of each audio frame is then extracted;
In this embodiment, extracting the log-Mel spectral features in step S1 specifically comprises the following steps:
S1.1, pre-emphasis: the audio samples are read in and pre-emphasized with a digital filter whose transfer function is $H(z) = 1 - \alpha z^{-1}$, where $\alpha$ is the filter coefficient with $0.9 \le \alpha \le 1$;
S1.2, framing and windowing: the read-in audio sample is divided into frames with a frame length of 0.02 s and a frame shift of 0.01 s, giving the per-frame signal $x'_t(n)$; the window function is a Hamming window $\omega(n)$, and each frame signal $x'_t(n)$ is multiplied by the Hamming window $\omega(n)$ to obtain the windowed t-th frame audio signal $x_t(n)$;
S1.3, extracting the log spectrum: a discrete Fourier transform is applied to the t-th frame audio signal $x_t(n)$ to obtain the linear spectrum $X_t(k)$, which is passed through a Mel filter bank to obtain the Mel spectrum; a logarithm operation finally yields the log spectrum $S_t(m)$;
S1.4, the operation of step S1.3 is applied to every audio frame to obtain the log spectra $S_t(m)$ of all audio frames, which are finally arranged into a feature matrix in frame order, the rows of the feature matrix being the frame order and the columns being the feature dimensions.
S2, building the dilated-convolution recurrent neural network, which consists of a convolutional neural network, a bidirectional long short-term memory (BiLSTM) network and a Sigmoid output layer;
the dilated-convolution recurrent neural network comprises the cascaded convolutional neural network, BiLSTM network and Sigmoid output layer, as shown in FIG. 2.
The convolutional neural network consists of one dilated convolution module or two or more cascaded dilated convolution modules, where each dilated convolution module comprises a dilated convolution unit, a pooling unit, an excitation unit and a batch normalization unit;
(1) The expression of the dilated convolution unit is:
$$x_i^{(l)} = f\left(k_i * x_i^{(l-1)} + b_i\right)$$
where $x_i^{(l)}$ denotes the feature vector of the i-th audio sample at layer $l$, $*$ denotes the dilated convolution operation, $f(\cdot)$ denotes the activation function, and $k_i$ and $b_i$ denote the convolution kernel parameters and the bias term convolved with the feature vector of the i-th audio sample, respectively;
(2) Pooling unit and excitation unit:
the pooling unit uses max pooling, and the excitation function used in the excitation unit is the rectified linear unit (ReLU), which introduces non-linearity between the layers of the neural network;
(3) Batch normalization unit:
the batch normalization unit is mainly used to alleviate the exploding-gradient problem of the network and to accelerate its convergence; the main computation consists of:
approximate whitening:
$$\hat{x}^{(i)} = \frac{x^{(i)} - E\left(x^{(i)}\right)}{\sqrt{\operatorname{Var}\left(x^{(i)}\right)}}$$
transformation and reconstruction:
$$y^{(i)} = \gamma^{(i)} \hat{x}^{(i)} + \beta^{(i)}$$
where $E(x^{(i)})$ denotes the mean of the feature vector $x^{(i)}$ of the i-th audio sample, $\sqrt{\operatorname{Var}(x^{(i)})}$ denotes its standard deviation, $\hat{x}^{(i)}$ is the result of the approximate whitening, $y^{(i)}$ denotes the reconstructed feature vector, and $\gamma^{(i)}$ and $\beta^{(i)}$ denote adjustable reconstruction parameters.
The bidirectional long short-term memory (BiLSTM) network makes full use of context information and maps the feature representation learned by the convolutional neural network to the sample label space;
the Sigmoid output layer uses the binary cross-entropy loss, whose expression is:
$$E = -\frac{1}{N}\sum_{i=1}^{N}\left[l^{(i)}\log\hat{l}^{(i)} + \left(1 - l^{(i)}\right)\log\left(1 - \hat{l}^{(i)}\right)\right]$$
where $N$ denotes the number of samples, $l^{(i)}$ denotes the true label of the i-th audio sample, and $\hat{l}^{(i)}$ denotes the predicted label of the i-th audio sample.
S3, training the dilated-convolution recurrent neural network: the dilated-convolution recurrent neural network is trained with the log-Mel spectral features extracted from the training samples as input;
In this embodiment, the specific process of training the dilated-convolution recurrent neural network is as follows (a training-loop sketch follows this list):
the log-Mel spectral features extracted from the training samples of different audio databases are input to the dilated-convolution recurrent neural network, and the number of dilated convolution modules and the dilation rates are adjusted separately.
When the number of dilated convolution modules is 1, two groups of dilation rates are set: one group takes the value 1, i.e. the dilation rates of all convolution layers in the module are set to 1; the other group takes the value 2, i.e. the dilation rates of all convolution layers in the module are set to 2.
When the number of dilated convolution modules is 2, two groups of dilation rates are set: one group takes the value 1-1, i.e. the dilation rates of all convolution layers are set to 1; the other group takes the value 2-4, i.e. the dilation rates of the convolution layers in the first and second modules are 2 and 4, respectively.
When the number of dilated convolution modules is 3, two groups of dilation rates are set: one group takes the value 1-1-1, i.e. the dilation rates of all convolution layers are set to 1; the other group takes the value 2-4-8, i.e. the dilation rates of the convolution layers in the first, second and third modules are 2, 4 and 8, respectively.
When the number of dilated convolution modules is 4, two groups of dilation rates are set: one group takes the value 1-1-1-1, i.e. the dilation rates of all convolution layers are set to 1; the other group takes the value 2-4-8-16, i.e. the dilation rates of the convolution layers in the first, second, third and fourth modules are 2, 4, 8 and 16, respectively.
When the number of dilated convolution modules is 5, two groups of dilation rates are set: one group takes the value 1-1-1-1-1, i.e. the dilation rates of all convolution layers are set to 1; the other group takes the value 2-4-8-16-32, i.e. the dilation rates of the convolution layers in the first, second, third, fourth and fifth modules are 2, 4, 8, 16 and 32, respectively.
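For illustration only, a minimal training loop for one of these configurations, assuming the DilatedCRNN sketch from the disclosure section above and a PyTorch DataLoader (train_loader) yielding (features, labels) batches; the Adam optimizer, the learning rate and the 30 epochs are assumptions:

```python
import torch

model = DilatedCRNN(dilations=(2, 4, 8, 16, 32))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.BCELoss()  # the binary cross-entropy loss given above

for epoch in range(30):
    for feats, labels in train_loader:  # feats: (batch, 1, frames, n_mels)
        optimizer.zero_grad()
        probs = model(feats)            # (batch, frames, n_events)
        loss = criterion(probs, labels) # labels: float tensor, same shape
        loss.backward()
        optimizer.step()
```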
S4, sound event detection: the sound events in the test samples are recognized with the trained dilated-convolution recurrent neural network to obtain the sound event detection results.
In this embodiment, sound event detection specifically comprises the following steps:
S4.1, the audio features in each test data set are extracted, and each audio frame is recognized with the trained dilated-convolution recurrent neural network;
S4.2, the frame-level recognition results are concatenated in time order to obtain the recognition results of the audio segments, and the sound event detection accuracy is then computed at both the audio-frame level and the audio-segment level.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to them; any other change, modification, substitution, combination or simplification that does not depart from the spirit and principle of the present invention shall be regarded as an equivalent replacement and is included in the protection scope of the present invention.

Claims (6)

1. A sound event detection method based on a dilated-convolution recurrent neural network, characterized by comprising the following steps:
S1, extracting log-Mel spectral features: pre-emphasis, framing and windowing are performed on the audio samples, and the log-Mel spectrum of each audio frame is then extracted;
S2, building a dilated-convolution recurrent neural network, which consists of a convolutional neural network, a bidirectional long short-term memory (BiLSTM) network and a Sigmoid output layer;
S3, training the dilated-convolution recurrent neural network with the log-Mel spectral features extracted from the training samples as input, the process being as follows:
the log-Mel spectral features extracted from the training samples of different audio databases are input to the dilated-convolution recurrent neural network, and the number of dilated convolution modules and the dilation rates are adjusted separately;
when the number of dilated convolution modules is 1, two groups of dilation rates are set: one group is 1, i.e. the dilation rates of all convolution layers in the module are set to 1; the other group is 2, i.e. the dilation rates of all convolution layers in the module are set to 2;
when the number of dilated convolution modules is 2, two groups of dilation rates are set: one group is 1-1, i.e. the dilation rates of all convolution layers are set to 1; the other group is 2-4, i.e. the dilation rates of the convolution layers in the first and second modules are 2 and 4, respectively;
when the number of dilated convolution modules is 3, two groups of dilation rates are set: one group is 1-1-1, i.e. the dilation rates of all convolution layers are set to 1; the other group is 2-4-8, i.e. the dilation rates of the convolution layers in the first, second and third modules are 2, 4 and 8, respectively;
when the number of dilated convolution modules is 4, two groups of dilation rates are set: one group is 1-1-1-1, i.e. the dilation rates of all convolution layers are set to 1; the other group is 2-4-8-16, i.e. the dilation rates of the convolution layers in the first, second, third and fourth modules are 2, 4, 8 and 16, respectively;
when the number of dilated convolution modules is 5, two groups of dilation rates are set: one group is 1-1-1-1-1, i.e. the dilation rates of all convolution layers are set to 1; the other group is 2-4-8-16-32, i.e. the dilation rates of the convolution layers in the first, second, third, fourth and fifth modules are 2, 4, 8, 16 and 32, respectively;
S4, sound event detection: the sound events in the test samples are recognized with the trained dilated-convolution recurrent neural network to obtain the sound event detection results.
2. The sound event detection method based on the dilated-convolution recurrent neural network according to claim 1, characterized in that the process of extracting the log-Mel spectral features in step S1 is as follows:
S1.1, pre-emphasis: the audio samples are read in and pre-emphasized with a digital filter whose transfer function is $H(z) = 1 - \alpha z^{-1}$, where $\alpha$ is the filter coefficient with $0.9 \le \alpha \le 1$;
S1.2, framing and windowing: the read-in audio sample is divided into frames with a frame length of 0.02 s and a frame shift of 0.01 s, giving the per-frame signal $x'_t(n)$; the window function is a Hamming window $\omega(n)$, and each frame signal $x'_t(n)$ is multiplied by the Hamming window $\omega(n)$ to obtain the windowed t-th frame audio signal $x_t(n)$;
S1.3, extracting the log spectrum: a discrete Fourier transform is applied to the t-th frame audio signal $x_t(n)$ to obtain the linear spectrum $X_t(k)$, which is passed through a Mel filter bank to obtain the Mel spectrum; a logarithm operation finally yields the log spectrum $S_t(m)$;
S1.4, the operation of step S1.3 is applied to every audio frame to obtain the log spectra $S_t(m)$ of all audio frames, which are finally arranged into a feature matrix in frame order, the rows of the feature matrix being the frame order and the columns being the feature dimensions.
3. The sound event detection method based on the dilated-convolution recurrent neural network according to claim 1, characterized in that the convolutional neural network consists of one or more cascaded dilated convolution modules, each dilated convolution module comprising a dilated convolution unit, a pooling unit, an excitation unit and a batch normalization unit;
the expression of the dilated convolution unit is:
$$x_i^{(l)} = f\left(k_i * x_i^{(l-1)} + b_i\right)$$
where $x_i^{(l)}$ denotes the feature vector of the i-th audio sample at layer $l$, $*$ denotes the dilated convolution operation, $f(\cdot)$ denotes the activation function, and $k_i$ and $b_i$ denote the convolution kernel parameters and the bias term convolved with the feature vector of the i-th audio sample, respectively;
the pooling unit uses max pooling; the excitation function used in the excitation unit is the rectified linear unit (ReLU), which introduces non-linearity between the layers of the neural network;
the batch normalization unit alleviates the exploding-gradient problem of the network and accelerates its convergence, and its computation consists of:
approximate whitening:
$$\hat{x}^{(i)} = \frac{x^{(i)} - E\left(x^{(i)}\right)}{\sqrt{\operatorname{Var}\left(x^{(i)}\right)}}$$
transformation and reconstruction:
$$y^{(i)} = \gamma^{(i)} \hat{x}^{(i)} + \beta^{(i)}$$
where $E(x^{(i)})$ denotes the mean of the feature vector $x^{(i)}$ of the i-th audio sample, $\sqrt{\operatorname{Var}(x^{(i)})}$ denotes its standard deviation, $\hat{x}^{(i)}$ is the result of the approximate whitening, $y^{(i)}$ denotes the reconstructed feature vector, and $\gamma^{(i)}$ and $\beta^{(i)}$ denote adjustable reconstruction parameters.
4. The sound event detection method based on the dilated-convolution recurrent neural network according to claim 1, characterized in that the bidirectional long short-term memory network makes full use of context information and maps the feature representation learned by the convolutional neural network to the sample label space.
5. The sound event detection method based on the dilated-convolution recurrent neural network according to claim 1, characterized in that the Sigmoid output layer uses the binary cross-entropy loss, whose expression is:
$$E = -\frac{1}{N}\sum_{i=1}^{N}\left[l^{(i)}\log\hat{l}^{(i)} + \left(1 - l^{(i)}\right)\log\left(1 - \hat{l}^{(i)}\right)\right]$$
where $N$ denotes the number of samples, $l^{(i)}$ denotes the true label of the i-th audio sample, and $\hat{l}^{(i)}$ denotes the predicted label of the i-th audio sample.
6. The sound event detection method based on the dilated-convolution recurrent neural network according to claim 1, characterized in that the sound event detection in step S4 is as follows:
S4.1, the audio features in each test data set are extracted, and each audio frame is recognized with the trained dilated-convolution recurrent neural network;
S4.2, the frame-level recognition results are concatenated in time order to obtain the recognition results of the audio segments, and the sound event detection accuracy is then computed at both the audio-frame level and the audio-segment level.
CN202010483079.3A 2020-06-01 2020-06-01 Sound event detection method based on dilated-convolution recurrent neural network Active CN111599376B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010483079.3A CN111599376B (en) 2020-06-01 2020-06-01 Sound event detection method based on dilated-convolution recurrent neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010483079.3A CN111599376B (en) 2020-06-01 2020-06-01 Sound event detection method based on dilated-convolution recurrent neural network

Publications (2)

Publication Number Publication Date
CN111599376A CN111599376A (en) 2020-08-28
CN111599376B (en) 2023-02-14

Family

ID=72192486

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010483079.3A Active CN111599376B (en) 2020-06-01 2020-06-01 Sound event detection method based on dilated-convolution recurrent neural network

Country Status (1)

Country Link
CN (1) CN111599376B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112133326A * 2020-09-08 2020-12-25 Southeast University Gunshot data augmentation and detection method based on adversarial neural network
CN112529152A * 2020-12-03 2021-03-19 Open Intelligent Machines (Shanghai) Co., Ltd. System and method for detecting watermelon maturity based on artificial intelligence
CN112951242B * 2021-02-02 2022-10-25 South China University of Technology Short-utterance speaker matching method based on Siamese neural network
CN113658607A * 2021-07-23 2021-11-16 Nanjing University of Science and Technology Environmental sound classification method based on data augmentation and convolutional recurrent neural network
CN113990303B * 2021-10-08 2024-04-12 South China University of Technology Environmental sound recognition method based on multi-resolution dilated depthwise-separable convolutional network


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10930167B2 (en) * 2015-06-16 2021-02-23 Upchurch & Associates Inc. Sound association test

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109065030A * 2018-08-01 2018-12-21 Shanghai University Ambient sound recognition method and system based on convolutional neural networks
CN109767785A * 2019-03-06 2019-05-17 Hebei University of Technology Ambient noise identification and classification method based on convolutional neural networks
CN110223715A * 2019-05-07 2019-09-10 South China University of Technology Method for estimating the in-home activity of elderly people living alone based on sound event detection

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WeiRan Yan, "Audio-based automatic mating success prediction of giant pandas," arXiv, 2019-12-24, pp. 1-4 *

Also Published As

Publication number Publication date
CN111599376A (en) 2020-08-28


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant