CN111754988B - Sound scene classification method based on attention mechanism and dual-path deep residual network - Google Patents

Sound scene classification method based on attention mechanism and dual-path deep residual network

Info

Publication number
CN111754988B
CN111754988B CN202010585359.5A CN202010585359A CN111754988B CN 111754988 B CN111754988 B CN 111754988B CN 202010585359 A CN202010585359 A CN 202010585359A CN 111754988 B CN111754988 B CN 111754988B
Authority
CN
China
Prior art keywords
spectrogram
frequency
feature map
residual error
channel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010585359.5A
Other languages
Chinese (zh)
Other versions
CN111754988A (en
Inventor
唐闺臣
梁瑞宇
谢跃
黄裕磊
王青云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Institute of Technology
Original Assignee
Nanjing Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Institute of Technology filed Critical Nanjing Institute of Technology
Priority to CN202010585359.5A priority Critical patent/CN111754988B/en
Publication of CN111754988A publication Critical patent/CN111754988A/en
Application granted granted Critical
Publication of CN111754988B publication Critical patent/CN111754988B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/45Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a sound scene classification method based on an attention mechanism and a dual-path deep residual network, which comprises the following steps: computing the original speech spectrogram, a horizontal spectrogram and a vertical spectrogram of the original speech signal, and transforming the horizontal and vertical spectrograms to obtain two new time-domain signals; computing the log-Mel spectrogram, first-order difference log-Mel spectrogram and second-order difference log-Mel spectrogram of the original speech signal and of the two new time-domain signals, respectively, and fusing them along the channel dimension to obtain a fused spectrogram; splitting the fused spectrogram evenly along the frequency axis into a high-frequency spectrogram and a low-frequency spectrogram; building a dual-path deep residual network with an attention layer; and inputting the high-frequency and low-frequency spectrograms into the deep residual network, which outputs the sound scene category to which the original speech signal belongs. The invention better captures the time-frequency characteristics of the high-frequency and low-frequency components and the importance of the different channels in the feature map, improving the accuracy and robustness of the sound scene classification system.

Description

Sound scene classification method based on attention mechanism and dual-path deep residual network
Technical Field
The invention belongs to the technical field of sound scene classification, and particularly relates to a sound scene classification method based on an attention mechanism and a dual-path deep residual network.
Background
Sound scene classification trains a computer to correctly assign a sound, according to the information it contains, to the scene from which it originates. Sound scene classification technology is widely applied in Internet-of-Things devices, intelligent hearing aids, autonomous driving and other fields, so in-depth research on sound scene classification is of great significance.
Acoustic scene classification originally belonged to a sub-field of pattern recognition. In the 1990s, Sawhney and Maes first proposed the concept of sound scene classification. They recorded a data set containing five sound scenes (sidewalks, subways, restaurants, parks and streets); Sawhney extracted power spectral density, relative spectrum and filter-bank band features from the recorded audio and then classified them with k-nearest-neighbor and recurrent neural network algorithms, achieving an accuracy of 68%. In the early 2000s the field of machine learning developed rapidly, and more and more researchers tried to classify sound scenes with machine learning methods. Algorithms such as the support vector machine and the decision tree gradually replaced the traditional HMM model and were widely applied to sound scene classification and sound event detection tasks. Meanwhile, ensemble learning methods such as random forests and XGBoost further improved sound scene classification performance. In 2015, Phan et al. converted the sound scene classification problem into a regression problem, built a random forest regression-based model, and reduced the detection error rate by 6% and 10% on the ITC-Irst and UPC-TALP databases, respectively. In 2012, Krizhevsky proposed the AlexNet model and won the ImageNet image classification competition. The great success of AlexNet sparked a wave of deep learning research, and researchers gradually began to introduce deep learning methods into the sound scene classification task.
In addition, many acoustic features can be used for sound scene classification, and how to fuse these features so that they match deep learning models is an important future research direction.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the problems in the prior art, the invention discloses a sound scene classification method based on an attention mechanism and a dual-path deep residual network. The log-Mel spectrogram and its first-order and second-order difference spectrograms are obtained for each of the three transformed signals, fused, and separated into high-frequency and low-frequency parts, which are then input into a dual-path deep residual network model with an attention mechanism. In this way, the feature maps that have an important influence on the classification result can be captured effectively, and the accuracy and robustness of the sound scene classification system are improved.
The technical scheme is as follows: the invention adopts the following technical scheme. A sound scene classification method based on an attention mechanism and a dual-path deep residual network, characterized by comprising the following steps:
step 1, preprocessing an original speech signal, computing an original speech spectrogram, enhancing the horizontal lines and the vertical lines in the original speech spectrogram to obtain a horizontal spectrogram and a vertical spectrogram respectively, and transforming the horizontal spectrogram and the vertical spectrogram to obtain two new time-domain signals, respectively;
step 2, computing the log-Mel spectrogram, first-order difference log-Mel spectrogram and second-order difference log-Mel spectrogram of the original speech signal and of the two new time-domain signals, respectively, and fusing them along the channel dimension to obtain a fused spectrogram;
step 3, splitting the fused spectrogram evenly along the frequency axis into a high-frequency spectrogram and a low-frequency spectrogram;
step 4, building a dual-path deep residual network with an attention layer;
step 5, inputting the high-frequency spectrogram and the low-frequency spectrogram of step 3 into the deep residual network of step 4, and outputting the sound scene category to which the original speech signal belongs.
Preferably, in step 1:
J(X_h, X_p) = κ Σ_{f,t} [X_h(f, t) − X_h(f, t−1)]² + λ Σ_{f,t} [X_p(f, t) − X_p(f−1, t)]², subject to X_h(f, t) + X_p(f, t) = X(f, t)

wherein X_h is the horizontal spectrogram, X_p is the vertical spectrogram, and X is the original speech spectrogram; κ and λ are weight smoothing factors; f and t denote frequency and time, respectively. Minimizing the cost function J by setting ∂J/∂X_h = 0 and ∂J/∂X_p = 0 yields the horizontal spectrogram X_h and the vertical spectrogram X_p.
Preferably, in step 2:

S_a(T, F) = (S_X(T, F), S_H(T, F), S_P(T, F))

wherein S_a denotes the fused spectrogram; S_X denotes the log-Mel spectrogram of the original speech signal together with its first-order and second-order difference log-Mel spectrograms; S_H denotes the log-Mel spectrogram generated from the horizontal spectrogram together with its first-order and second-order difference log-Mel spectrograms; S_P denotes the log-Mel spectrogram generated from the vertical spectrogram together with its first-order and second-order difference log-Mel spectrograms; T and F denote the time axis and the frequency axis, respectively.
Preferably, step 5 comprises the following steps:
step 51, inputting the high-frequency spectrogram and the low-frequency spectrogram into the two paths of the deep residual network, which output a high-frequency feature map and a low-frequency feature map, respectively;
step 52, fusing the high-frequency feature map and the low-frequency feature map along the frequency-axis dimension to obtain a fused feature map, obtaining a multi-channel feature map from the fused feature map, and computing an attention coefficient from the multi-channel feature map;
step 53, applying the attention coefficient to the multi-channel feature map to obtain a weighted feature map;
step 54, expanding the weighted feature map into a one-dimensional feature vector and outputting, from this feature vector, the sound scene category to which the original speech signal belongs.
Preferably, in step 52:
M_P(T, F) = (M_P1(T, F_L), M_P2(T, F_H))

wherein M_P(T, F) denotes the fused feature map; M_P1(T, F_L) and M_P2(T, F_H) denote the low-frequency feature map and the high-frequency feature map, respectively; T denotes the height of the feature maps; F, F_L and F_H denote the widths of the fused feature map, the low-frequency feature map and the high-frequency feature map, respectively.
Preferably, in step 52:
α = σ(W_2 ReLU(W_1 z))

z_k = (1 / (T × F)) Σ_{t=1..T} Σ_{f=1..F} M_k(t, f), k = 1, 2, …, C

wherein α ∈ R^C denotes the attention coefficient vector; W_1 ∈ R^{(C/r)×C} and W_2 ∈ R^{C×(C/r)} denote the weights of the two fully-connected layers; σ denotes the sigmoid activation function; M denotes the multi-channel feature map; z is the vector obtained by global average pooling of M; T and F denote the height and width of the multi-channel feature map, respectively; C denotes the channel dimension of the multi-channel feature map; r denotes the scaling factor.
Preferably, in step 53:
M̃_k(T, F) = α_k · M_k(T, F), k = 1, 2, …, C

wherein M̃_k(T, F) is the k-th channel of the weighted feature map M̃(T, F); M_k(T, F) is the k-th channel of the multi-channel feature map M(T, F); α_k is the k-th value in the attention coefficient vector α; T and F denote the height and width of the feature map, respectively; C denotes the channel dimension.
Preferably, each path of the deep residual network includes a residual block;
the residual block comprises, connected in sequence, a batch normalization layer, a ReLU activation layer, a convolution layer, a batch normalization layer, a ReLU activation layer and a convolution layer, and the two paths output a low-frequency feature map and a high-frequency feature map;
after the low-frequency feature map and the high-frequency feature map are fused, the result is fed into a sequentially connected batch normalization layer, ReLU activation layer and convolution blocks, each convolution block comprising a sequentially connected convolution layer and batch normalization layer, and a multi-channel feature map is output;
the multi-channel feature map is fed into a sequentially connected global average pooling layer and fully-connected layers, and an attention coefficient vector is output;
the attention coefficient vector and the multi-channel feature map are merged and fed into a sequentially connected Flatten layer, fully-connected layer and Softmax layer, and the classification result is output.
Preferably, the preprocessing in step 1 comprises: down-sampling or up-sampling the original speech signal to 48 kHz, followed by pre-emphasis, framing and windowing; during framing, every 2048 sampling points form one frame, with a frame overlap of 50%; during windowing, a Hamming window is used as the window function.
Advantageous effects: the invention has the following beneficial effects:
(1) the method fuses the log-Mel spectrograms and difference spectrograms of the original audio, of the signal with enhanced horizontal spectrogram lines and of the signal with enhanced vertical spectrogram lines, so that the fused spectrogram captures both the static and the dynamic characteristics of the audio and has stronger feature expression capability, effectively improving the accuracy of sound scene classification;
(2) the high-frequency and low-frequency parts of the fused spectrogram are separated and modeled by the two paths of a dual-path deep residual network, reflecting the fact that high and low frequencies in a spectrogram carry different physical meanings; this high/low-frequency separation lets the model better capture the time-frequency characteristics of the high-frequency and low-frequency components, so that similar sound scenes can be distinguished more accurately;
(3) an attention mechanism is introduced into the deep residual network, and attention weighting is applied to the multi-channel fused feature map along the channel dimension, so that the feature maps that contribute positively to the final classification result receive higher attention in the subsequent fully-connected layer, effectively improving the classification performance of the model and the recognition rate of the whole system.
Drawings
FIG. 1 is a general block diagram of the improved sound scene classification method of the present invention;
FIG. 2 is a diagram of the dual-path deep residual network with an attention layer of the present invention;
FIG. 3 is a diagram of the attention network architecture of the present invention;
FIG. 4 compares the classification results of the method of the present invention with those of four other sound scene classification methods.
Detailed Description
The present invention will be further described with reference to the accompanying drawings.
The invention discloses a sound scene classification method based on an attention mechanism and a dual-path deep residual network, as shown in FIG. 1, comprising the following steps:
Step 1, an original speech signal x in an audio sample is preprocessed and then Fourier-transformed to obtain the original speech spectrogram X, and X is decomposed into a horizontal spectrogram X_h and a vertical spectrogram X_p by enhancing the horizontal lines and the vertical lines in X, respectively.
The preprocessing consists of uniformly down-sampling or up-sampling the original speech signals in all audio samples to 48 kHz, followed by pre-emphasis, framing and windowing. During framing, every 2048 sampling points form one frame, with a frame overlap of 50%; during windowing, a Hamming window is used as the window function.
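As an illustration only, the preprocessing and spectrogram computation of step 1 can be sketched as follows in Python (librosa and numpy are assumed; the function name and default parameter values are illustrative, not prescribed by the patent):

```python
# Illustrative sketch of step 1 preprocessing, assuming librosa/numpy and a mono input file.
import numpy as np
import librosa

def preprocess(path, sr=48000, n_fft=2048, preemph=0.97):
    x, _ = librosa.load(path, sr=sr, mono=True)        # down-/up-sample to 48 kHz
    x = np.append(x[0], x[1:] - preemph * x[:-1])       # pre-emphasis
    # framing: 2048-sample frames, 50% overlap (hop = 1024), Hamming window
    X = librosa.stft(x, n_fft=n_fft, hop_length=n_fft // 2, window="hamming")
    return x, np.abs(X)                                  # time signal and magnitude spectrogram
```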
Because the sum of the horizontal spectrogram X_h and the vertical spectrogram X_p equals the energy spectrum of the signal, X_h and X_p can be solved for by constructing the cost function J(X_h, X_p):

J(X_h, X_p) = κ Σ_{f,t} [X_h(f, t) − X_h(f, t−1)]² + λ Σ_{f,t} [X_p(f, t) − X_p(f−1, t)]², subject to X_h(f, t) + X_p(f, t) = X(f, t)

wherein κ and λ are weight smoothing factors, and f and t denote frequency and time, respectively. Minimizing the cost function J by setting ∂J/∂X_h = 0 and ∂J/∂X_p = 0 yields the horizontal spectrogram X_h and the vertical spectrogram X_p.
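The separation itself could be sketched as follows; note that the patent obtains X_h and X_p by minimizing the cost J, whereas this sketch substitutes librosa's median-filtering HPSS, which likewise enhances the horizontal and vertical lines of the spectrogram and is used here only as an illustrative stand-in:

```python
# Sketch of the horizontal/vertical separation of step 1. librosa's median-filtering HPSS
# is used in place of the cost-function minimization described above (an assumption made
# purely for illustration).
import librosa

def split_horizontal_vertical(X_mag):
    # X_mag: magnitude (or power) spectrogram of the original signal
    X_h, X_p = librosa.decompose.hpss(X_mag)   # horizontal and vertical spectrograms
    return X_h, X_p
```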
Step 2, the horizontal spectrogram X_h and the vertical spectrogram X_p are inverse-Fourier-transformed to obtain a new time-domain signal x_h generated from the horizontal spectrogram and a time-domain signal x_p generated from the vertical spectrogram. The log-Mel spectrograms of the time-domain signal x_h, of the time-domain signal x_p and of the original speech signal x are then extracted, and their first-order difference and second-order difference log-Mel spectrograms are further computed, respectively.
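A minimal sketch of this feature extraction, assuming librosa and illustrative Mel parameters (the patent does not fix the number of Mel bands), might look as follows:

```python
# Sketch of step 2: invert the horizontal/vertical spectrograms to time-domain signals,
# then compute the 3-channel (log-Mel, delta, delta-delta) representation per signal.
import numpy as np
import librosa

def to_time_domain(S, hop=1024, window="hamming"):
    # inverse STFT; phase handling is omitted here for brevity
    return librosa.istft(S, hop_length=hop, window=window)

def logmel_with_deltas(x, sr=48000, n_fft=2048, hop=1024, n_mels=128):
    mel = librosa.feature.melspectrogram(y=x, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=n_mels)
    logmel = librosa.power_to_db(mel)             # log-Mel spectrogram
    d1 = librosa.feature.delta(logmel, order=1)   # first-order difference
    d2 = librosa.feature.delta(logmel, order=2)   # second-order difference
    return np.stack([logmel, d1, d2], axis=0)     # shape: (3, n_mels, frames)

# x_h, x_p = to_time_domain(X_h), to_time_domain(X_p)
# S_X, S_H, S_P = logmel_with_deltas(x), logmel_with_deltas(x_h), logmel_with_deltas(x_p)
```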
Step 3, the three groups of log-Mel spectrograms, first-order difference log-Mel spectrograms and second-order difference log-Mel spectrograms obtained in step 2 are fused and then segmented into high-frequency and low-frequency parts.
Specifically, the log-Mel spectrograms of the time-domain signal x_h generated from the horizontal spectrogram, of the time-domain signal x_p generated from the vertical spectrogram and of the original speech signal x, together with their first-order and second-order difference log-Mel spectrograms, are spliced along the channel dimension to form the fused spectrogram S_a(T, F):
S_a(T, F) = (S_X(T, F), S_H(T, F), S_P(T, F))

wherein S_X, S_H and S_P denote the three-channel spectrograms of the original speech signal x, of the time-domain signal x_h generated from the horizontal spectrogram, and of the time-domain signal x_p generated from the vertical spectrogram, respectively; S_a denotes the fused spectrogram; T and F denote the time axis and the frequency axis, respectively.
The fused spectrogram captures both the static and the dynamic characteristics of the original speech signal and therefore has strong feature expression capability.
The fused spectrogram S_a(T, F) is then split evenly along the frequency axis into a low-frequency spectrogram S_a(T, F_L) and a high-frequency spectrogram S_a(T, F_H), where F_L and F_H denote the frequency axes of the low-frequency and high-frequency spectrograms, respectively.
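The channel fusion and the frequency-axis split can be sketched with plain numpy; the variable names are illustrative and the three inputs are assumed to be the (3, n_mels, frames) arrays produced above:

```python
# Sketch of the channel fusion (step 2) and the frequency-axis split (step 3).
import numpy as np

def fuse_and_split(S_X, S_H, S_P):
    S_a = np.concatenate([S_X, S_H, S_P], axis=0)   # fused spectrogram, 9 channels
    n_mels = S_a.shape[1]
    S_low = S_a[:, : n_mels // 2, :]                 # low-frequency half,  S_a(T, F_L)
    S_high = S_a[:, n_mels // 2 :, :]                # high-frequency half, S_a(T, F_H)
    return S_low, S_high
```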
Step 4, a dual-path deep residual network model with an attention layer is built.
Step 5, the high-frequency spectrogram and the low-frequency spectrogram obtained in step 3 are fed into the deep residual network built in step 4 to obtain the final audio scene label.
The dual-path deep residual network models the high-frequency spectrogram and the low-frequency spectrogram separately, and the feature maps obtained on the two paths are fused along the frequency-axis dimension:

M_P(T, F) = (M_P1(T, F_L), M_P2(T, F_H))

wherein M_P1(T, F_L), M_P2(T, F_H) and M_P(T, F) denote the low-frequency feature map output by the low-frequency path P1, the high-frequency feature map output by the high-frequency path P2, and the fused feature map, respectively.
In one embodiment of the present invention, FIG. 2 is a block diagram of the high/low-frequency dual-path deep residual network with an attention layer according to the present invention. Each path of the deep residual network model contains 4 residual blocks, each structured, in sequence, as: batch normalization (BN) layer, ReLU activation layer, convolution layer, BN layer, ReLU activation layer, convolution layer, with 3 × 3 convolution kernels in both convolution layers. After feature extraction through the 4 residual blocks, the feature maps obtained on the two paths are fused along the frequency-axis dimension to give the fused feature map M_P(T, F), which then passes through a BN layer, a ReLU activation layer and two convolution blocks to yield a multi-channel feature map M(T, F) containing 768 channels; each convolution block consists of a convolution layer followed by a BN layer, with 1 × 1 convolution kernels.
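A minimal PyTorch sketch of this trunk is given below. The framework, the stem convolutions and the per-path channel count are assumptions made for illustration; only the layer ordering inside the residual blocks, the frequency-axis fusion and the 768-channel multi-channel feature map follow the description above:

```python
# Sketch of one pre-activation residual block and the dual-path trunk (assumed layout:
# batch, channels, frequency, time). Channel counts other than 768 are illustrative.
import torch
import torch.nn as nn

class PreActResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True), nn.Conv2d(ch, ch, 3, padding=1),
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True), nn.Conv2d(ch, ch, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)          # residual (skip) connection

class DualPathTrunk(nn.Module):
    def __init__(self, in_ch=9, ch=384, out_ch=768):
        super().__init__()
        self.stem_low = nn.Conv2d(in_ch, ch, 3, padding=1)    # illustrative stem convs
        self.stem_high = nn.Conv2d(in_ch, ch, 3, padding=1)
        self.path_low = nn.Sequential(*[PreActResBlock(ch) for _ in range(4)])
        self.path_high = nn.Sequential(*[PreActResBlock(ch) for _ in range(4)])
        self.head = nn.Sequential(
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, out_ch, 1), nn.BatchNorm2d(out_ch),     # convolution block 1
            nn.Conv2d(out_ch, out_ch, 1), nn.BatchNorm2d(out_ch), # convolution block 2
        )

    def forward(self, s_low, s_high):
        m_low = self.path_low(self.stem_low(s_low))
        m_high = self.path_high(self.stem_high(s_high))
        m_fused = torch.cat([m_low, m_high], dim=2)   # fuse along the frequency axis
        return self.head(m_fused)                      # multi-channel feature map M(T, F)
```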
FIG. 3 shows the attention network architecture. The multi-channel feature map M(T, F) is fed into the attention network, where attention weighting is performed over the channel dimension through the following operations in sequence:
(1) Global average pooling is performed on the input multi-channel feature map M(T, F), encoding the whole spatial feature of each channel into one global feature:

z_k = (1 / (T × F)) Σ_{t=1..T} Σ_{f=1..F} M_k(t, f), k = 1, 2, …, C

wherein M denotes the multi-channel feature map, z ∈ R^C is the output vector after global average pooling, and T, F and C denote the height, width and number of channels of the multi-channel feature map, respectively.
(2) The one-dimensional feature vector z of length C is fed into a DNN consisting of two fully-connected layers, and the output is computed as:

α = F_DNN(z, W) = σ(g(z, W)) = σ(W_2 ReLU(W_1 z))

wherein α ∈ R^C is the output of the DNN, i.e. the attention coefficient vector; W_1 ∈ R^{(C/r)×C} and W_2 ∈ R^{C×(C/r)} are the weights of the two fully-connected layers; C denotes the channel dimension; r denotes the scaling factor; σ denotes the sigmoid activation function. To reduce model complexity and improve generalization, the invention adopts a bottleneck structure comprising two fully-connected layers: the first fully-connected layer performs dimension reduction, with the scaling factor r as a hyper-parameter, and is followed by ReLU activation; the second fully-connected layer restores the original dimension.
(3) The obtained attention coefficient vector is applied to each channel of the multi-channel feature map, weighting the channels to obtain the weighted feature map M̃(T, F):

M̃_k(T, F) = α_k · M_k(T, F), k = 1, 2, …, C

wherein M̃_k(T, F) is the k-th channel of the weighted feature map M̃(T, F); M_k(T, F) is the k-th channel of the multi-channel feature map M(T, F); α_k is the k-th value in the attention coefficient vector α; T and F denote the height and width of the feature map, respectively; C denotes the channel dimension.
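The three operations together correspond to a squeeze-and-excitation style channel-attention module; a compact PyTorch sketch, with r = 16 as an illustrative value of the hyper-parameter, is:

```python
# Sketch of the channel-attention layer of FIG. 3: global average pooling, two-layer
# bottleneck DNN with reduction ratio r, sigmoid, then channel-wise reweighting.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels=768, r=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),  # W1: dimension reduction by factor r
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),  # W2: restore the original dimension
            nn.Sigmoid(),                        # attention coefficients alpha in (0, 1)
        )

    def forward(self, m):
        # m: multi-channel feature map of shape (batch, C, T, F)
        z = m.mean(dim=(2, 3))                    # global average pooling -> (batch, C)
        alpha = self.fc(z)                         # attention coefficient vector
        return m * alpha[:, :, None, None]         # weighted map: M~_k = alpha_k * M_k
```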
The weighted feature map is expanded into a one-dimensional feature vector by a Flatten layer, and the output of the model, namely the scene category to which the original speech signal belongs, is finally obtained through a fully-connected layer and a Softmax layer.
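A sketch of this classifier head (PyTorch assumed; nn.LazyLinear is used only so that the flattened vector size need not be written out, and the 10 output classes correspond to the experiments described below):

```python
# Illustrative classifier head: flatten, fully-connected layer, Softmax over scene classes.
import torch.nn as nn

classifier = nn.Sequential(
    nn.Flatten(),        # expand the weighted feature map into a 1-D feature vector
    nn.LazyLinear(10),   # fully-connected layer; 10 sound scene classes
    nn.Softmax(dim=1),
)
```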
FIG. 4 compares the classification results of the improved sound scene classification method of the present invention with those of 4 other sound scene classification methods. Five classification models are compared on the data set: a Gaussian mixture model (GMM), k-nearest neighbor (kNN), a support vector machine (SVM), random forest (RF) and the dual-path deep residual network model proposed by the present invention. The 988-dimensional feature vectors extracted from the audio are used as the input of the GMM, kNN, SVM and RF models. The GMM uses 12 Gaussian components, each with its own covariance matrix; the kNN model uses k = 7 for classification; the SVM uses a penalty coefficient of 1.8, a Gaussian kernel function and the one-vs-one (OvO) classification mode; the RF contains 200 decision trees, which use the Gini index as the optimal feature selection criterion when splitting nodes. The Gini index represents the probability that a sample randomly selected from a sample set is misclassified, i.e. the probability that the sample is selected multiplied by the probability that it is misclassified; like information entropy, it measures the uncertainty of a random variable: the larger the Gini index, the higher the uncertainty in the data; the smaller the Gini index, the lower the uncertainty; a Gini index of 0 indicates that all samples in the data set belong to the same class. The selected original speech data set contains 14400 audio recordings from 10 sound scenes: airport, bus, metro, metro station, park, public square, shopping mall, pedestrian street, traffic street and tram. Experimental results show that the improved sound scene classification method proposed by the invention achieves an average accuracy of 81.6% on the data set, far higher than the other 4 sound scene recognition methods.
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.

Claims (9)

1. A sound scene classification method based on an attention mechanism and a dual-path deep residual network, characterized by comprising the following steps:
step 1, preprocessing an original speech signal, computing an original speech spectrogram, enhancing the horizontal lines and the vertical lines in the original speech spectrogram to obtain a horizontal spectrogram and a vertical spectrogram respectively, and transforming the horizontal spectrogram and the vertical spectrogram to obtain two new time-domain signals, respectively;
step 2, computing the log-Mel spectrogram, first-order difference log-Mel spectrogram and second-order difference log-Mel spectrogram of the original speech signal and of the two new time-domain signals, respectively, and fusing them along the channel dimension to obtain a fused spectrogram;
step 3, splitting the fused spectrogram evenly along the frequency axis into a high-frequency spectrogram and a low-frequency spectrogram;
step 4, building a dual-path deep residual network with an attention layer;
step 5, inputting the high-frequency spectrogram and the low-frequency spectrogram of step 3 into the deep residual network of step 4, and outputting the sound scene category to which the original speech signal belongs.
2. The sound scene classification method based on the attention mechanism and the dual-path deep residual network as claimed in claim 1, wherein in step 1:

J(X_h, X_p) = κ Σ_{f,t} [X_h(f, t) − X_h(f, t−1)]² + λ Σ_{f,t} [X_p(f, t) − X_p(f−1, t)]², subject to X_h(f, t) + X_p(f, t) = X(f, t)

wherein X_h is the horizontal spectrogram, X_p is the vertical spectrogram, and X is the original speech spectrogram; κ and λ are weight smoothing factors; f and t denote frequency and time, respectively; the cost function J is minimized by setting ∂J/∂X_h = 0 and ∂J/∂X_p = 0, yielding the horizontal spectrogram X_h and the vertical spectrogram X_p.
3. The sound scene classification method based on the attention mechanism and the dual-path deep residual network as claimed in claim 1, wherein in step 2:

S_a(T, F) = (S_X(T, F), S_H(T, F), S_P(T, F))

wherein S_a denotes the fused spectrogram; S_X denotes the log-Mel spectrogram of the original speech signal together with its first-order and second-order difference log-Mel spectrograms; S_H denotes the log-Mel spectrogram of the time-domain signal generated from the horizontal spectrogram together with its first-order and second-order difference log-Mel spectrograms; S_P denotes the log-Mel spectrogram of the time-domain signal generated from the vertical spectrogram together with its first-order and second-order difference log-Mel spectrograms; T and F denote the time axis and the frequency axis, respectively.
4. The sound scene classification method based on the attention mechanism and the dual-path deep residual network as claimed in claim 1, wherein step 5 comprises the following steps:
step 51, inputting the high-frequency spectrogram and the low-frequency spectrogram into the two paths of the deep residual network, which output a high-frequency feature map and a low-frequency feature map, respectively;
step 52, fusing the high-frequency feature map and the low-frequency feature map along the frequency-axis dimension to obtain a fused feature map, obtaining a multi-channel feature map from the fused feature map, and computing an attention coefficient from the multi-channel feature map;
step 53, applying the attention coefficient to the multi-channel feature map to obtain a weighted feature map;
step 54, expanding the weighted feature map into a one-dimensional feature vector and outputting, from this feature vector, the sound scene category to which the original speech signal belongs.
5. The sound scene classification method based on the attention mechanism and the dual-path deep residual network as claimed in claim 4, wherein in step 52:

M_P(T, F) = (M_P1(T, F_L), M_P2(T, F_H))

wherein M_P(T, F) denotes the fused feature map; M_P1(T, F_L) and M_P2(T, F_H) denote the low-frequency feature map and the high-frequency feature map, respectively; T denotes the height of the feature maps; F, F_L and F_H denote the widths of the fused feature map, the low-frequency feature map and the high-frequency feature map, respectively.
6. The sound scene classification method based on the attention mechanism and the dual-path deep residual network as claimed in claim 4, wherein in step 52:

α = σ(W_2 ReLU(W_1 z))

z_k = (1 / (T × F)) Σ_{t=1..T} Σ_{f=1..F} M_k(t, f), k = 1, 2, …, C

wherein α ∈ R^C denotes the attention coefficient vector; W_1 ∈ R^{(C/r)×C} and W_2 ∈ R^{C×(C/r)} denote the weights; σ denotes the sigmoid activation function; M denotes the multi-channel feature map; z is the vector obtained by global average pooling of M; T and F denote the height and width of the multi-channel feature map, respectively; C denotes the channel dimension of the multi-channel feature map; r denotes the scaling factor.
7. The sound scene classification method based on the attention mechanism and the dual-path deep residual network as claimed in claim 4, wherein in step 53:

M̃_k(T, F) = α_k · M_k(T, F), k = 1, 2, …, C

wherein M̃_k(T, F) is the k-th channel of the weighted feature map M̃(T, F); M_k(T, F) is the k-th channel of the multi-channel feature map M(T, F); α_k is the k-th value in the attention coefficient vector α; T and F denote the height and width of the feature map, respectively; C denotes the channel dimension.
8. The sound scene classification method based on the attention mechanism and the dual-path deep residual network as claimed in claim 4, wherein each path of the deep residual network comprises a residual block;
the residual block comprises, connected in sequence, a batch normalization layer, a ReLU activation layer, a convolution layer, a batch normalization layer, a ReLU activation layer and a convolution layer, and the two paths output a low-frequency feature map and a high-frequency feature map;
after the low-frequency feature map and the high-frequency feature map are fused, the result is fed into a sequentially connected batch normalization layer, ReLU activation layer and convolution blocks, each convolution block comprising a sequentially connected convolution layer and batch normalization layer, and a multi-channel feature map is output;
the multi-channel feature map is fed into a sequentially connected global average pooling layer and fully-connected layers, and an attention coefficient vector is output;
the attention coefficient vector and the multi-channel feature map are merged and fed into a sequentially connected Flatten layer, fully-connected layer and Softmax layer, and the classification result is output.
9. The sound scene classification method based on the attention mechanism and the dual-path deep residual network as claimed in claim 1, wherein the preprocessing in step 1 comprises: down-sampling or up-sampling the original speech signal to 48 kHz, followed by pre-emphasis, framing and windowing; during framing, every 2048 sampling points form one frame, with a frame overlap of 50%; during windowing, a Hamming window is used as the window function.
CN202010585359.5A 2020-06-23 2020-06-23 Sound scene classification method based on attention mechanism and double-path depth residual error network Active CN111754988B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010585359.5A CN111754988B (en) 2020-06-23 2020-06-23 Sound scene classification method based on attention mechanism and double-path depth residual error network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010585359.5A CN111754988B (en) 2020-06-23 2020-06-23 Sound scene classification method based on attention mechanism and double-path depth residual error network

Publications (2)

Publication Number Publication Date
CN111754988A CN111754988A (en) 2020-10-09
CN111754988B true CN111754988B (en) 2022-08-16

Family

ID=72678467

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010585359.5A Active CN111754988B (en) 2020-06-23 2020-06-23 Sound scene classification method based on attention mechanism and double-path depth residual error network

Country Status (1)

Country Link
CN (1) CN111754988B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112418181B (en) * 2020-12-13 2023-05-02 西北工业大学 Personnel falling water detection method based on convolutional neural network
CN112700794B (en) * 2021-03-23 2021-06-22 北京达佳互联信息技术有限公司 Audio scene classification method and device, electronic equipment and storage medium
CN113361546A (en) * 2021-06-18 2021-09-07 合肥工业大学 Remote sensing image feature extraction method integrating asymmetric convolution and attention mechanism
CN113327626B (en) * 2021-06-23 2023-09-08 深圳市北科瑞声科技股份有限公司 Voice noise reduction method, device, equipment and storage medium
CN113793627B (en) * 2021-08-11 2023-12-29 华南师范大学 Attention-based multi-scale convolution voice emotion recognition method and device
CN113793622B (en) * 2021-09-10 2023-08-29 中国科学院声学研究所 Audio scene recognition method, system and device
CN114863938A (en) * 2022-05-24 2022-08-05 西南石油大学 Bird language identification method and system based on attention residual error and feature fusion
CN115602165B (en) * 2022-09-07 2023-05-05 杭州优航信息技术有限公司 Digital employee intelligent system based on financial system
CN116863957B (en) * 2023-09-05 2023-12-12 硕橙(厦门)科技有限公司 Method, device, equipment and storage medium for identifying operation state of industrial equipment
CN117975994B (en) * 2024-04-01 2024-06-11 华南师范大学 Quality classification method and device for voice data and computer equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20050080647A (en) * 2004-02-10 2005-08-17 Samsung Electronics Co., Ltd. Apparatuses and methods for detecting and discriminating acoustical impact
CN107945811A (en) * 2017-10-23 2018-04-20 北京大学 A kind of production towards bandspreading resists network training method and audio coding, coding/decoding method
CN109859771A (en) * 2019-01-15 2019-06-07 华南理工大学 A kind of sound field scape clustering method of combined optimization deep layer transform characteristics and cluster process
CN109978034A (en) * 2019-03-18 2019-07-05 华南理工大学 A kind of sound scenery identification method based on data enhancing
CN110600054A (en) * 2019-09-06 2019-12-20 南京工程学院 Sound scene classification method based on network model fusion

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20050080647A (en) * 2004-02-10 2005-08-17 Samsung Electronics Co., Ltd. Apparatuses and methods for detecting and discriminating acoustical impact
CN107945811A (en) * 2017-10-23 2018-04-20 北京大学 A kind of production towards bandspreading resists network training method and audio coding, coding/decoding method
CN109859771A (en) * 2019-01-15 2019-06-07 华南理工大学 A kind of sound field scape clustering method of combined optimization deep layer transform characteristics and cluster process
CN109978034A (en) * 2019-03-18 2019-07-05 华南理工大学 A kind of sound scenery identification method based on data enhancing
CN110600054A (en) * 2019-09-06 2019-12-20 南京工程学院 Sound scene classification method based on network model fusion

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
A real-time environmental sound recognition; PILLOS A, ALGHAMIDI K, ALZAMEL N, et al.; http://www.cs.tut.fi/sgn/arg/dcase2016/documents/workshop/Pillos-; 20190220; full text *
Acoustic scene classification with matrix; Bisot V, Serizel R, Essid S, et al.; Proc of 2016 IEEE International; 20161231; full text *
Environmental sound classification with convolutional neural networks; PICZAK K J.; Proceedings of the IEEE 25th International Workshop on Machine Learning for; 20151231; full text *
Audio classification method based on a dual data augmentation strategy; Zhou Xun et al.; Journal of Wuhan University of Science and Technology; 20200415 (No. 02); full text *
Environmental sound classification method based on Mel-frequency cepstral coefficients, deep convolution and Bagging; Wang Tianrui et al.; Journal of Computer Applications; 20191231 (No. 12); full text *
Audio recognition method based on residual networks and random forests; Zhang Xiaolong et al.; Computer Engineering & Science; 20190415 (No. 04); full text *

Also Published As

Publication number Publication date
CN111754988A (en) 2020-10-09

Similar Documents

Publication Publication Date Title
CN111754988B (en) Sound scene classification method based on attention mechanism and double-path depth residual error network
CN110120218B (en) Method for identifying highway large-scale vehicles based on GMM-HMM
CN109559736B (en) Automatic dubbing method for movie actors based on confrontation network
Devi et al. Automatic speaker recognition from speech signals using self organizing feature map and hybrid neural network
CN109346084A (en) Method for distinguishing speek person based on depth storehouse autoencoder network
CN110047501B (en) Many-to-many voice conversion method based on beta-VAE
CN112735435A (en) Voiceprint open set identification method with unknown class internal division capability
WO2024140070A1 (en) Small sample speech separation method based on data generation
Chowdhuri Phononet: multi-stage deep neural networks for raga identification in hindustani classical music
Awais et al. Speaker recognition using mel frequency cepstral coefficient and locality sensitive hashing
CN112562725A (en) Mixed voice emotion classification method based on spectrogram and capsule network
Zheng et al. MSRANet: Learning discriminative embeddings for speaker verification via channel and spatial attention mechanism in alterable scenarios
Nawas et al. Speaker recognition using random forest
CN111243621A (en) Construction method of GRU-SVM deep learning model for synthetic speech detection
Kamaruddin et al. Features extraction for speech emotion
Chon et al. Acoustic scene classification using aggregation of two-scale deep embeddings
Alrehaili et al. Arabic speech dialect classification using deep learning
Chen et al. ACGAN-based data augmentation integrated with long-term scalogram for acoustic scene classification
CN112466333A (en) Acoustic scene classification method and system
CN105006231A (en) Distributed large population speaker recognition method based on fuzzy clustering decision tree
JP5091202B2 (en) Identification method that can identify any language without using samples
CN114970695B (en) Speaker segmentation clustering method based on non-parametric Bayesian model
Sangeetha et al. Analysis of machine learning algorithms for audio event classification using Mel-frequency cepstral coefficients
Chen et al. Long-term scalogram integrated with an iterative data augmentation scheme for acoustic scene classification
Nemati et al. RETRACTED CHAPTER: A Novel Text-Independent Speaker Verification System Using Ant Colony Optimization Algorithm

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant