CN111754988B - Sound scene classification method based on attention mechanism and dual-path deep residual network - Google Patents

Sound scene classification method based on attention mechanism and dual-path deep residual network

Info

Publication number
CN111754988B
CN111754988B CN202010585359.5A CN202010585359A CN111754988B CN 111754988 B CN111754988 B CN 111754988B CN 202010585359 A CN202010585359 A CN 202010585359A CN 111754988 B CN111754988 B CN 111754988B
Authority
CN
China
Prior art keywords
spectrogram
frequency
feature map
residual error
channel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010585359.5A
Other languages
Chinese (zh)
Other versions
CN111754988A (en
Inventor
唐闺臣
梁瑞宇
谢跃
黄裕磊
王青云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Institute of Technology
Original Assignee
Nanjing Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Institute of Technology filed Critical Nanjing Institute of Technology
Priority to CN202010585359.5A priority Critical patent/CN111754988B/en
Publication of CN111754988A publication Critical patent/CN111754988A/en
Application granted granted Critical
Publication of CN111754988B publication Critical patent/CN111754988B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/45Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a sound scene classification method based on an attention mechanism and a dual-path deep residual network, which comprises the following steps: computing the original speech spectrogram, a horizontal spectrogram and a vertical spectrogram of the original speech signal, and transforming the horizontal and vertical spectrograms to obtain two new time-domain signals; computing the log-Mel spectrogram, first-order difference log-Mel spectrogram and second-order difference log-Mel spectrogram of the original speech signal and of the two new time-domain signals, respectively, and fusing them along the channel dimension to obtain a fused spectrogram; splitting the fused spectrogram evenly along the frequency axis into a high-frequency spectrogram and a low-frequency spectrogram; building a dual-path deep residual network with an attention layer; and inputting the high-frequency and low-frequency spectrograms into the deep residual network, which outputs the sound scene category to which the original speech signal belongs. The invention better captures the time-frequency characteristics of the high-frequency and low-frequency components and the importance of the different channels in the feature map, improving the accuracy and robustness of the sound scene classification system.

Description

Sound scene classification method based on attention mechanism and dual-path deep residual network
Technical Field
The invention belongs to the technical field of sound scene classification, and particularly relates to a sound scene classification method based on an attention mechanism and a dual-path deep residual network.
Background
Sound scene classification trains a computer to correctly assign a sound, according to the information it contains, to the scene from which it originates. Sound scene classification technology is widely applied in Internet-of-Things devices, intelligent hearing aids, autonomous driving and other fields, so in-depth research on sound scene classification is of great significance.
Acoustic scene classification originally belonged to a sub-field of pattern recognition. In the 1990s, Sawhney and Maes first proposed the concept of sound scene classification. They recorded a data set containing five sound scenes (sidewalks, subways, restaurants, parks and streets); Sawhney extracted power spectral density, relative spectrum and filter-bank band features from the recorded audio and then classified them with k-nearest-neighbor and recurrent neural network algorithms, achieving an accuracy of 68%. In the early 2000s the field of machine learning developed rapidly, and more and more researchers tried to classify sound scenes with machine learning methods. Algorithms such as the support vector machine and the decision tree gradually replaced the traditional HMM model and were widely applied to sound scene classification and sound event detection tasks. Meanwhile, ensemble learning methods such as random forests and XGBoost further improved sound scene classification performance. In 2015, Phan et al. converted the sound scene classification problem into a regression problem, built a random forest regression-based model, and reduced the detection error rate by 6% and 10% on the ITC-Irst and UPC-TALP databases, respectively. In 2012, Krizhevsky proposed the AlexNet model and won the ImageNet image classification competition. The great success of AlexNet sparked a wave of deep learning research, and researchers gradually began to introduce deep learning methods into the sound scene classification task.
In addition, many acoustic features can be used for sound scene classification, and how to fuse these features so that they match deep learning models is an important future research direction.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the problems in the prior art, the invention discloses a sound scene classification method based on an attention mechanism and a dual-path deep residual network. The log-Mel spectrogram and its first-order and second-order difference spectrograms are obtained for each of the three transformed signals, fused, and separated into high-frequency and low-frequency parts, which are then input into a dual-path deep residual network model with an attention mechanism. In this way, the feature maps that have an important influence on the classification result can be captured effectively, and the accuracy and robustness of the sound scene classification system are improved.
The technical scheme is as follows: the invention adopts the following technical scheme. A sound scene classification method based on an attention mechanism and a dual-path deep residual network, characterized by comprising the following steps:
step 1, preprocessing an original speech signal, computing an original speech spectrogram, enhancing the horizontal lines and the vertical lines in the original speech spectrogram to obtain a horizontal spectrogram and a vertical spectrogram respectively, and transforming the horizontal spectrogram and the vertical spectrogram to obtain two new time-domain signals, respectively;
step 2, computing the log-Mel spectrogram, first-order difference log-Mel spectrogram and second-order difference log-Mel spectrogram of the original speech signal and of the two new time-domain signals, respectively, and fusing them along the channel dimension to obtain a fused spectrogram;
step 3, splitting the fused spectrogram evenly along the frequency axis into a high-frequency spectrogram and a low-frequency spectrogram;
step 4, building a dual-path deep residual network with an attention layer;
step 5, inputting the high-frequency spectrogram and the low-frequency spectrogram of step 3 into the deep residual network of step 4, and outputting the sound scene category to which the original speech signal belongs.
Preferably, in step 1:
J(X_h, X_p) = κ Σ_{f,t} [X_h(f, t) − X_h(f, t−1)]² + λ Σ_{f,t} [X_p(f, t) − X_p(f−1, t)]², subject to X_h(f, t) + X_p(f, t) = X(f, t)

wherein X_h is the horizontal spectrogram, X_p is the vertical spectrogram, and X is the original speech spectrogram; κ and λ are weight smoothing factors; f and t denote frequency and time, respectively. Minimizing the cost function J by setting ∂J/∂X_h = 0 and ∂J/∂X_p = 0 yields the horizontal spectrogram X_h and the vertical spectrogram X_p.
Preferably, in step 2:

S_a(T, F) = (S_X(T, F), S_H(T, F), S_P(T, F))

wherein S_a denotes the fused spectrogram; S_X denotes the log-Mel spectrogram of the original speech signal together with its first-order and second-order difference log-Mel spectrograms; S_H denotes the log-Mel spectrogram generated from the horizontal spectrogram together with its first-order and second-order difference log-Mel spectrograms; S_P denotes the log-Mel spectrogram generated from the vertical spectrogram together with its first-order and second-order difference log-Mel spectrograms; T and F denote the time axis and the frequency axis, respectively.
Preferably, step 5 comprises the following steps:
step 51, inputting the high-frequency spectrogram and the low-frequency spectrogram into the two paths of the deep residual network, which output a high-frequency feature map and a low-frequency feature map, respectively;
step 52, fusing the high-frequency feature map and the low-frequency feature map along the frequency-axis dimension to obtain a fused feature map, obtaining a multi-channel feature map from the fused feature map, and computing an attention coefficient from the multi-channel feature map;
step 53, applying the attention coefficient to the multi-channel feature map to obtain a weighted feature map;
step 54, expanding the weighted feature map into a one-dimensional feature vector and outputting, from this feature vector, the sound scene category to which the original speech signal belongs.
Preferably, in step 52:
M_P(T, F) = (M_P1(T, F_L), M_P2(T, F_H))

wherein M_P(T, F) denotes the fused feature map; M_P1(T, F_L) and M_P2(T, F_H) denote the low-frequency feature map and the high-frequency feature map, respectively; T denotes the height of the feature maps; F, F_L and F_H denote the widths of the fused feature map, the low-frequency feature map and the high-frequency feature map, respectively.
Preferably, in step 52:
α = σ(W_2 ReLU(W_1 z))

z_k = (1 / (T × F)) Σ_{t=1..T} Σ_{f=1..F} M_k(t, f), k = 1, 2, …, C

wherein α ∈ R^C denotes the attention coefficient vector; W_1 ∈ R^{(C/r)×C} and W_2 ∈ R^{C×(C/r)} denote the weights of the two fully-connected layers; σ denotes the sigmoid activation function; M denotes the multi-channel feature map; z is the vector obtained by global average pooling of M; T and F denote the height and width of the multi-channel feature map, respectively; C denotes the channel dimension of the multi-channel feature map; r denotes the scaling factor.
Preferably, in step 53:
M̃_k(T, F) = α_k · M_k(T, F), k = 1, 2, …, C

wherein M̃_k(T, F) is the k-th channel of the weighted feature map M̃(T, F); M_k(T, F) is the k-th channel of the multi-channel feature map M(T, F); α_k is the k-th value in the attention coefficient vector α; T and F denote the height and width of the feature map, respectively; C denotes the channel dimension.
Preferably, each path of the deep residual network includes a residual block;
the residual block comprises, connected in sequence, a batch normalization layer, a ReLU activation layer, a convolution layer, a batch normalization layer, a ReLU activation layer and a convolution layer, and the two paths output a low-frequency feature map and a high-frequency feature map;
after the low-frequency feature map and the high-frequency feature map are fused, the result is fed into a sequentially connected batch normalization layer, ReLU activation layer and convolution blocks, each convolution block comprising a sequentially connected convolution layer and batch normalization layer, and a multi-channel feature map is output;
the multi-channel feature map is fed into a sequentially connected global average pooling layer and fully-connected layers, and an attention coefficient vector is output;
the attention coefficient vector and the multi-channel feature map are merged and fed into a sequentially connected Flatten layer, fully-connected layer and Softmax layer, and the classification result is output.
Preferably, the preprocessing in step 1 comprises: down-sampling or up-sampling the original speech signal to 48 kHz, followed by pre-emphasis, framing and windowing; during framing, every 2048 sampling points form one frame, with a frame overlap of 50%; during windowing, a Hamming window is used as the window function.
Advantageous effects: the invention has the following beneficial effects:
(1) the method fuses the log-Mel spectrograms and difference spectrograms of the original audio, of the signal with enhanced horizontal spectrogram lines and of the signal with enhanced vertical spectrogram lines, so that the fused spectrogram captures both the static and the dynamic characteristics of the audio and has stronger feature expression capability, effectively improving the accuracy of sound scene classification;
(2) the high-frequency and low-frequency parts of the fused spectrogram are separated and modeled by the two paths of a dual-path deep residual network, reflecting the fact that high and low frequencies in a spectrogram carry different physical meanings; this high/low-frequency separation lets the model better capture the time-frequency characteristics of the high-frequency and low-frequency components, so that similar sound scenes can be distinguished more accurately;
(3) an attention mechanism is introduced into the deep residual network, and attention weighting is applied to the multi-channel fused feature map along the channel dimension, so that the feature maps that contribute positively to the final classification result receive higher attention in the subsequent fully-connected layer, effectively improving the classification performance of the model and the recognition rate of the whole system.
Drawings
FIG. 1 is a general block diagram of the improved sound scene classification method of the present invention;
FIG. 2 is a diagram of the dual-path deep residual network with an attention layer of the present invention;
FIG. 3 is a diagram of the attention network architecture of the present invention;
FIG. 4 compares the classification results of the method of the present invention with those of four other sound scene classification methods.
Detailed Description
The present invention will be further described with reference to the accompanying drawings.
The invention discloses a sound scene classification method based on an attention mechanism and a dual-path deep residual network, as shown in FIG. 1, comprising the following steps:
Step 1, an original speech signal x in an audio sample is preprocessed and then Fourier-transformed to obtain the original speech spectrogram X, and X is decomposed into a horizontal spectrogram X_h and a vertical spectrogram X_p by enhancing the horizontal lines and the vertical lines in X, respectively.
The preprocessing consists of uniformly down-sampling or up-sampling the original speech signals in all audio samples to 48 kHz, followed by pre-emphasis, framing and windowing. During framing, every 2048 sampling points form one frame, with a frame overlap of 50%; during windowing, a Hamming window is used as the window function.
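As an illustration only, the preprocessing and spectrogram computation of step 1 can be sketched as follows in Python (librosa and numpy are assumed; the function name and default parameter values are illustrative, not prescribed by the patent):

```python
# Illustrative sketch of step 1 preprocessing, assuming librosa/numpy and a mono input file.
import numpy as np
import librosa

def preprocess(path, sr=48000, n_fft=2048, preemph=0.97):
    x, _ = librosa.load(path, sr=sr, mono=True)        # down-/up-sample to 48 kHz
    x = np.append(x[0], x[1:] - preemph * x[:-1])       # pre-emphasis
    # framing: 2048-sample frames, 50% overlap (hop = 1024), Hamming window
    X = librosa.stft(x, n_fft=n_fft, hop_length=n_fft // 2, window="hamming")
    return x, np.abs(X)                                  # time signal and magnitude spectrogram
```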
Because the sum of the horizontal spectrogram X_h and the vertical spectrogram X_p equals the energy spectrum of the signal, X_h and X_p can be solved for by constructing the cost function J(X_h, X_p):

J(X_h, X_p) = κ Σ_{f,t} [X_h(f, t) − X_h(f, t−1)]² + λ Σ_{f,t} [X_p(f, t) − X_p(f−1, t)]², subject to X_h(f, t) + X_p(f, t) = X(f, t)

wherein κ and λ are weight smoothing factors, and f and t denote frequency and time, respectively. Minimizing the cost function J by setting ∂J/∂X_h = 0 and ∂J/∂X_p = 0 yields the horizontal spectrogram X_h and the vertical spectrogram X_p.
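The separation itself could be sketched as follows; note that the patent obtains X_h and X_p by minimizing the cost J, whereas this sketch substitutes librosa's median-filtering HPSS, which likewise enhances the horizontal and vertical lines of the spectrogram and is used here only as an illustrative stand-in:

```python
# Sketch of the horizontal/vertical separation of step 1. librosa's median-filtering HPSS
# is used in place of the cost-function minimization described above (an assumption made
# purely for illustration).
import librosa

def split_horizontal_vertical(X_mag):
    # X_mag: magnitude (or power) spectrogram of the original signal
    X_h, X_p = librosa.decompose.hpss(X_mag)   # horizontal and vertical spectrograms
    return X_h, X_p
```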
Step 2, the horizontal spectrogram X_h and the vertical spectrogram X_p are inverse-Fourier-transformed to obtain a new time-domain signal x_h generated from the horizontal spectrogram and a time-domain signal x_p generated from the vertical spectrogram. The log-Mel spectrograms of the time-domain signal x_h, of the time-domain signal x_p and of the original speech signal x are then extracted, and their first-order difference and second-order difference log-Mel spectrograms are further computed, respectively.
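A minimal sketch of this feature extraction, assuming librosa and illustrative Mel parameters (the patent does not fix the number of Mel bands), might look as follows:

```python
# Sketch of step 2: invert the horizontal/vertical spectrograms to time-domain signals,
# then compute the 3-channel (log-Mel, delta, delta-delta) representation per signal.
import numpy as np
import librosa

def to_time_domain(S, hop=1024, window="hamming"):
    # inverse STFT; phase handling is omitted here for brevity
    return librosa.istft(S, hop_length=hop, window=window)

def logmel_with_deltas(x, sr=48000, n_fft=2048, hop=1024, n_mels=128):
    mel = librosa.feature.melspectrogram(y=x, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=n_mels)
    logmel = librosa.power_to_db(mel)             # log-Mel spectrogram
    d1 = librosa.feature.delta(logmel, order=1)   # first-order difference
    d2 = librosa.feature.delta(logmel, order=2)   # second-order difference
    return np.stack([logmel, d1, d2], axis=0)     # shape: (3, n_mels, frames)

# x_h, x_p = to_time_domain(X_h), to_time_domain(X_p)
# S_X, S_H, S_P = logmel_with_deltas(x), logmel_with_deltas(x_h), logmel_with_deltas(x_p)
```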
Step 3, the three groups of log-Mel spectrograms, first-order difference log-Mel spectrograms and second-order difference log-Mel spectrograms obtained in step 2 are fused and then segmented into high-frequency and low-frequency parts.
Specifically, the log-Mel spectrograms of the time-domain signal x_h generated from the horizontal spectrogram, of the time-domain signal x_p generated from the vertical spectrogram and of the original speech signal x, together with their first-order and second-order difference log-Mel spectrograms, are spliced along the channel dimension to form the fused spectrogram S_a(T, F):
S_a(T, F) = (S_X(T, F), S_H(T, F), S_P(T, F))

wherein S_X, S_H and S_P denote the three-channel spectrograms of the original speech signal x, of the time-domain signal x_h generated from the horizontal spectrogram, and of the time-domain signal x_p generated from the vertical spectrogram, respectively; S_a denotes the fused spectrogram; T and F denote the time axis and the frequency axis, respectively.
The fused spectrogram captures both the static and the dynamic characteristics of the original speech signal and therefore has strong feature expression capability.
The fused spectrogram S_a(T, F) is then split evenly along the frequency axis into a low-frequency spectrogram S_a(T, F_L) and a high-frequency spectrogram S_a(T, F_H), where F_L and F_H denote the frequency axes of the low-frequency and high-frequency spectrograms, respectively.
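The channel fusion and the frequency-axis split can be sketched with plain numpy; the variable names are illustrative and the three inputs are assumed to be the (3, n_mels, frames) arrays produced above:

```python
# Sketch of the channel fusion (step 2) and the frequency-axis split (step 3).
import numpy as np

def fuse_and_split(S_X, S_H, S_P):
    S_a = np.concatenate([S_X, S_H, S_P], axis=0)   # fused spectrogram, 9 channels
    n_mels = S_a.shape[1]
    S_low = S_a[:, : n_mels // 2, :]                 # low-frequency half,  S_a(T, F_L)
    S_high = S_a[:, n_mels // 2 :, :]                # high-frequency half, S_a(T, F_H)
    return S_low, S_high
```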
Step 4, a dual-path deep residual network model with an attention layer is built.
Step 5, the high-frequency spectrogram and the low-frequency spectrogram obtained in step 3 are fed into the deep residual network built in step 4 to obtain the final audio scene label.
The dual-path deep residual network models the high-frequency spectrogram and the low-frequency spectrogram separately, and the feature maps obtained on the two paths are fused along the frequency-axis dimension:

M_P(T, F) = (M_P1(T, F_L), M_P2(T, F_H))

wherein M_P1(T, F_L), M_P2(T, F_H) and M_P(T, F) denote the low-frequency feature map output by the low-frequency path P1, the high-frequency feature map output by the high-frequency path P2, and the fused feature map, respectively.
In one embodiment of the present invention, FIG. 2 is a block diagram of the high/low-frequency dual-path deep residual network with an attention layer according to the present invention. Each path of the deep residual network model contains 4 residual blocks, each structured, in sequence, as: batch normalization (BN) layer, ReLU activation layer, convolution layer, BN layer, ReLU activation layer, convolution layer, with 3 × 3 convolution kernels in both convolution layers. After feature extraction through the 4 residual blocks, the feature maps obtained on the two paths are fused along the frequency-axis dimension to give the fused feature map M_P(T, F), which then passes through a BN layer, a ReLU activation layer and two convolution blocks to yield a multi-channel feature map M(T, F) containing 768 channels; each convolution block consists of a convolution layer followed by a BN layer, with 1 × 1 convolution kernels.
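A minimal PyTorch sketch of this trunk is given below. The framework, the stem convolutions and the per-path channel count are assumptions made for illustration; only the layer ordering inside the residual blocks, the frequency-axis fusion and the 768-channel multi-channel feature map follow the description above:

```python
# Sketch of one pre-activation residual block and the dual-path trunk (assumed layout:
# batch, channels, frequency, time). Channel counts other than 768 are illustrative.
import torch
import torch.nn as nn

class PreActResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True), nn.Conv2d(ch, ch, 3, padding=1),
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True), nn.Conv2d(ch, ch, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)          # residual (skip) connection

class DualPathTrunk(nn.Module):
    def __init__(self, in_ch=9, ch=384, out_ch=768):
        super().__init__()
        self.stem_low = nn.Conv2d(in_ch, ch, 3, padding=1)    # illustrative stem convs
        self.stem_high = nn.Conv2d(in_ch, ch, 3, padding=1)
        self.path_low = nn.Sequential(*[PreActResBlock(ch) for _ in range(4)])
        self.path_high = nn.Sequential(*[PreActResBlock(ch) for _ in range(4)])
        self.head = nn.Sequential(
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, out_ch, 1), nn.BatchNorm2d(out_ch),     # convolution block 1
            nn.Conv2d(out_ch, out_ch, 1), nn.BatchNorm2d(out_ch), # convolution block 2
        )

    def forward(self, s_low, s_high):
        m_low = self.path_low(self.stem_low(s_low))
        m_high = self.path_high(self.stem_high(s_high))
        m_fused = torch.cat([m_low, m_high], dim=2)   # fuse along the frequency axis
        return self.head(m_fused)                      # multi-channel feature map M(T, F)
```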
FIG. 3 shows the attention network architecture. The multi-channel feature map M(T, F) is fed into the attention network, where attention weighting is performed over the channel dimension through the following operations in sequence:
(1) Global average pooling is performed on the input multi-channel feature map M(T, F), encoding the whole spatial feature of each channel into one global feature:

z_k = (1 / (T × F)) Σ_{t=1..T} Σ_{f=1..F} M_k(t, f), k = 1, 2, …, C

wherein M denotes the multi-channel feature map, z ∈ R^C is the output vector after global average pooling, and T, F and C denote the height, width and number of channels of the multi-channel feature map, respectively.
(2) The one-dimensional feature vector z of length C is fed into a DNN consisting of two fully-connected layers, and the output is computed as:

α = F_DNN(z, W) = σ(g(z, W)) = σ(W_2 ReLU(W_1 z))

wherein α ∈ R^C is the output of the DNN, i.e. the attention coefficient vector; W_1 ∈ R^{(C/r)×C} and W_2 ∈ R^{C×(C/r)} are the weights of the two fully-connected layers; C denotes the channel dimension; r denotes the scaling factor; σ denotes the sigmoid activation function. To reduce model complexity and improve generalization, the invention adopts a bottleneck structure comprising two fully-connected layers: the first fully-connected layer performs dimension reduction, with the scaling factor r as a hyper-parameter, and is followed by ReLU activation; the second fully-connected layer restores the original dimension.
(3) The obtained attention coefficient vector is applied to each channel of the multi-channel feature map, weighting the channels to obtain the weighted feature map M̃(T, F):

M̃_k(T, F) = α_k · M_k(T, F), k = 1, 2, …, C

wherein M̃_k(T, F) is the k-th channel of the weighted feature map M̃(T, F); M_k(T, F) is the k-th channel of the multi-channel feature map M(T, F); α_k is the k-th value in the attention coefficient vector α; T and F denote the height and width of the feature map, respectively; C denotes the channel dimension.
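The three operations together correspond to a squeeze-and-excitation style channel-attention module; a compact PyTorch sketch, with r = 16 as an illustrative value of the hyper-parameter, is:

```python
# Sketch of the channel-attention layer of FIG. 3: global average pooling, two-layer
# bottleneck DNN with reduction ratio r, sigmoid, then channel-wise reweighting.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels=768, r=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),  # W1: dimension reduction by factor r
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),  # W2: restore the original dimension
            nn.Sigmoid(),                        # attention coefficients alpha in (0, 1)
        )

    def forward(self, m):
        # m: multi-channel feature map of shape (batch, C, T, F)
        z = m.mean(dim=(2, 3))                    # global average pooling -> (batch, C)
        alpha = self.fc(z)                         # attention coefficient vector
        return m * alpha[:, :, None, None]         # weighted map: M~_k = alpha_k * M_k
```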
The weighted feature map is expanded into a one-dimensional feature vector by a Flatten layer, and the output of the model, namely the scene category to which the original speech signal belongs, is finally obtained through a fully-connected layer and a Softmax layer.
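A sketch of this classifier head (PyTorch assumed; nn.LazyLinear is used only so that the flattened vector size need not be written out, and the 10 output classes correspond to the experiments described below):

```python
# Illustrative classifier head: flatten, fully-connected layer, Softmax over scene classes.
import torch.nn as nn

classifier = nn.Sequential(
    nn.Flatten(),        # expand the weighted feature map into a 1-D feature vector
    nn.LazyLinear(10),   # fully-connected layer; 10 sound scene classes
    nn.Softmax(dim=1),
)
```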
FIG. 4 compares the classification results of the improved sound scene classification method of the present invention with those of 4 other sound scene classification methods. Five classification models are compared on the data set: a Gaussian mixture model (GMM), k-nearest neighbor (kNN), a support vector machine (SVM), random forest (RF) and the dual-path deep residual network model proposed by the present invention. The 988-dimensional feature vectors extracted from the audio are used as the input of the GMM, kNN, SVM and RF models. The GMM uses 12 Gaussian components, each with its own covariance matrix; the kNN model uses k = 7 for classification; the SVM uses a penalty coefficient of 1.8, a Gaussian kernel function and the one-vs-one (OvO) classification mode; the RF contains 200 decision trees, which use the Gini index as the optimal feature selection criterion when splitting nodes. The Gini index represents the probability that a sample randomly selected from a sample set is misclassified, i.e. the probability that the sample is selected multiplied by the probability that it is misclassified; like information entropy, it measures the uncertainty of a random variable: the larger the Gini index, the higher the uncertainty in the data; the smaller the Gini index, the lower the uncertainty; a Gini index of 0 indicates that all samples in the data set belong to the same class. The selected original speech data set contains 14400 audio recordings from 10 sound scenes: airport, bus, metro, metro station, park, public square, shopping mall, pedestrian street, traffic street and tram. Experimental results show that the improved sound scene classification method proposed by the invention achieves an average accuracy of 81.6% on the data set, far higher than the other 4 sound scene recognition methods.
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.

Claims (9)

1. A sound scene classification method based on an attention mechanism and a dual-path deep residual network, characterized by comprising the following steps:
step 1, preprocessing an original speech signal, computing an original speech spectrogram, enhancing the horizontal lines and the vertical lines in the original speech spectrogram to obtain a horizontal spectrogram and a vertical spectrogram respectively, and transforming the horizontal spectrogram and the vertical spectrogram to obtain two new time-domain signals, respectively;
step 2, computing the log-Mel spectrogram, first-order difference log-Mel spectrogram and second-order difference log-Mel spectrogram of the original speech signal and of the two new time-domain signals, respectively, and fusing them along the channel dimension to obtain a fused spectrogram;
step 3, splitting the fused spectrogram evenly along the frequency axis into a high-frequency spectrogram and a low-frequency spectrogram;
step 4, building a dual-path deep residual network with an attention layer;
step 5, inputting the high-frequency spectrogram and the low-frequency spectrogram of step 3 into the deep residual network of step 4, and outputting the sound scene category to which the original speech signal belongs.
2. The sound scene classification method based on the attention mechanism and the dual-path deep residual network as claimed in claim 1, wherein in step 1:

J(X_h, X_p) = κ Σ_{f,t} [X_h(f, t) − X_h(f, t−1)]² + λ Σ_{f,t} [X_p(f, t) − X_p(f−1, t)]², subject to X_h(f, t) + X_p(f, t) = X(f, t)

wherein X_h is the horizontal spectrogram, X_p is the vertical spectrogram, and X is the original speech spectrogram; κ and λ are weight smoothing factors; f and t denote frequency and time, respectively; the cost function J is minimized by setting ∂J/∂X_h = 0 and ∂J/∂X_p = 0, yielding the horizontal spectrogram X_h and the vertical spectrogram X_p.
3. The sound scene classification method based on the attention mechanism and the dual-path deep residual network as claimed in claim 1, wherein in step 2:

S_a(T, F) = (S_X(T, F), S_H(T, F), S_P(T, F))

wherein S_a denotes the fused spectrogram; S_X denotes the log-Mel spectrogram of the original speech signal together with its first-order and second-order difference log-Mel spectrograms; S_H denotes the log-Mel spectrogram of the time-domain signal generated from the horizontal spectrogram together with its first-order and second-order difference log-Mel spectrograms; S_P denotes the log-Mel spectrogram of the time-domain signal generated from the vertical spectrogram together with its first-order and second-order difference log-Mel spectrograms; T and F denote the time axis and the frequency axis, respectively.
4. The sound scene classification method based on the attention mechanism and the dual-path deep residual network as claimed in claim 1, wherein step 5 comprises the following steps:
step 51, inputting the high-frequency spectrogram and the low-frequency spectrogram into the two paths of the deep residual network, which output a high-frequency feature map and a low-frequency feature map, respectively;
step 52, fusing the high-frequency feature map and the low-frequency feature map along the frequency-axis dimension to obtain a fused feature map, obtaining a multi-channel feature map from the fused feature map, and computing an attention coefficient from the multi-channel feature map;
step 53, applying the attention coefficient to the multi-channel feature map to obtain a weighted feature map;
step 54, expanding the weighted feature map into a one-dimensional feature vector and outputting, from this feature vector, the sound scene category to which the original speech signal belongs.
5. The sound scene classification method based on the attention mechanism and the dual-path deep residual network as claimed in claim 4, wherein in step 52:

M_P(T, F) = (M_P1(T, F_L), M_P2(T, F_H))

wherein M_P(T, F) denotes the fused feature map; M_P1(T, F_L) and M_P2(T, F_H) denote the low-frequency feature map and the high-frequency feature map, respectively; T denotes the height of the feature maps; F, F_L and F_H denote the widths of the fused feature map, the low-frequency feature map and the high-frequency feature map, respectively.
6. The sound scene classification method based on the attention mechanism and the dual-path deep residual network as claimed in claim 4, wherein in step 52:

α = σ(W_2 ReLU(W_1 z))

z_k = (1 / (T × F)) Σ_{t=1..T} Σ_{f=1..F} M_k(t, f), k = 1, 2, …, C

wherein α ∈ R^C denotes the attention coefficient vector; W_1 ∈ R^{(C/r)×C} and W_2 ∈ R^{C×(C/r)} denote the weights; σ denotes the sigmoid activation function; M denotes the multi-channel feature map; z is the vector obtained by global average pooling of M; T and F denote the height and width of the multi-channel feature map, respectively; C denotes the channel dimension of the multi-channel feature map; r denotes the scaling factor.
7. The sound scene classification method based on the attention mechanism and the dual-path deep residual network as claimed in claim 4, wherein in step 53:

M̃_k(T, F) = α_k · M_k(T, F), k = 1, 2, …, C

wherein M̃_k(T, F) is the k-th channel of the weighted feature map M̃(T, F); M_k(T, F) is the k-th channel of the multi-channel feature map M(T, F); α_k is the k-th value in the attention coefficient vector α; T and F denote the height and width of the feature map, respectively; C denotes the channel dimension.
8. The sound scene classification method based on the attention mechanism and the dual-path deep residual network as claimed in claim 4, wherein each path of the deep residual network comprises a residual block;
the residual block comprises, connected in sequence, a batch normalization layer, a ReLU activation layer, a convolution layer, a batch normalization layer, a ReLU activation layer and a convolution layer, and the two paths output a low-frequency feature map and a high-frequency feature map;
after the low-frequency feature map and the high-frequency feature map are fused, the result is fed into a sequentially connected batch normalization layer, ReLU activation layer and convolution blocks, each convolution block comprising a sequentially connected convolution layer and batch normalization layer, and a multi-channel feature map is output;
the multi-channel feature map is fed into a sequentially connected global average pooling layer and fully-connected layers, and an attention coefficient vector is output;
the attention coefficient vector and the multi-channel feature map are merged and fed into a sequentially connected Flatten layer, fully-connected layer and Softmax layer, and the classification result is output.
9. The sound scene classification method based on the attention mechanism and the dual-path deep residual network as claimed in claim 1, wherein the preprocessing in step 1 comprises: down-sampling or up-sampling the original speech signal to 48 kHz, followed by pre-emphasis, framing and windowing; during framing, every 2048 sampling points form one frame, with a frame overlap of 50%; during windowing, a Hamming window is used as the window function.
CN202010585359.5A 2020-06-23 2020-06-23 Sound scene classification method based on attention mechanism and double-path depth residual error network Active CN111754988B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010585359.5A CN111754988B (en) 2020-06-23 2020-06-23 Sound scene classification method based on attention mechanism and double-path depth residual error network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010585359.5A CN111754988B (en) 2020-06-23 2020-06-23 Sound scene classification method based on attention mechanism and double-path depth residual error network

Publications (2)

Publication Number Publication Date
CN111754988A CN111754988A (en) 2020-10-09
CN111754988B true CN111754988B (en) 2022-08-16

Family

ID=72678467

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010585359.5A Active CN111754988B (en) 2020-06-23 2020-06-23 Sound scene classification method based on attention mechanism and double-path depth residual error network

Country Status (1)

Country Link
CN (1) CN111754988B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112418181B (en) * 2020-12-13 2023-05-02 西北工业大学 Personnel falling water detection method based on convolutional neural network
CN112700794B (en) * 2021-03-23 2021-06-22 北京达佳互联信息技术有限公司 Audio scene classification method and device, electronic equipment and storage medium
CN113361546A (en) * 2021-06-18 2021-09-07 合肥工业大学 Remote sensing image feature extraction method integrating asymmetric convolution and attention mechanism
CN113327626B (en) * 2021-06-23 2023-09-08 深圳市北科瑞声科技股份有限公司 Voice noise reduction method, device, equipment and storage medium
CN113793627B (en) * 2021-08-11 2023-12-29 华南师范大学 Attention-based multi-scale convolution voice emotion recognition method and device
CN113793622B (en) * 2021-09-10 2023-08-29 中国科学院声学研究所 Audio scene recognition method, system and device
CN114863938A (en) * 2022-05-24 2022-08-05 西南石油大学 Bird language identification method and system based on attention residual error and feature fusion
CN115602165B (en) * 2022-09-07 2023-05-05 杭州优航信息技术有限公司 Digital employee intelligent system based on financial system
CN116863957B (en) * 2023-09-05 2023-12-12 硕橙(厦门)科技有限公司 Method, device, equipment and storage medium for identifying operation state of industrial equipment
CN117975994B (en) * 2024-04-01 2024-06-11 华南师范大学 Quality classification method and device for voice data and computer equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20050080647A (en) * 2004-02-10 2005-08-17 Samsung Electronics Co., Ltd. Apparatuses and methods for detecting and discriminating acoustical impact
CN107945811A (en) * 2017-10-23 2018-04-20 北京大学 A kind of production towards bandspreading resists network training method and audio coding, coding/decoding method
CN109859771A (en) * 2019-01-15 2019-06-07 华南理工大学 A kind of sound field scape clustering method of combined optimization deep layer transform characteristics and cluster process
CN109978034A (en) * 2019-03-18 2019-07-05 华南理工大学 A kind of sound scenery identification method based on data enhancing
CN110600054A (en) * 2019-09-06 2019-12-20 南京工程学院 Sound scene classification method based on network model fusion

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20050080647A (en) * 2004-02-10 2005-08-17 Samsung Electronics Co., Ltd. Apparatuses and methods for detecting and discriminating acoustical impact
CN107945811A (en) * 2017-10-23 2018-04-20 北京大学 A kind of production towards bandspreading resists network training method and audio coding, coding/decoding method
CN109859771A (en) * 2019-01-15 2019-06-07 华南理工大学 A kind of sound field scape clustering method of combined optimization deep layer transform characteristics and cluster process
CN109978034A (en) * 2019-03-18 2019-07-05 华南理工大学 A kind of sound scenery identification method based on data enhancing
CN110600054A (en) * 2019-09-06 2019-12-20 南京工程学院 Sound scene classification method based on network model fusion

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
A real-time environmental sound recognition; PILLOS A, ALGHAMIDI K, ALZAMEL N, et al.; http://www.cs.tut.fi/sgn/arg/dcase2016/documents/workshop/Pillos-; 20190220; full text *
Acoustic scene classification with matrix; Bisot V, Serizel R, Essid S, et al.; Proc of 2016 IEEE International; 20161231; full text *
Environmental sound classification with convolutional neural networks; PICZAK K J.; Proceedings of the IEEE 25th International Workshop on Machine Learning for; 20151231; full text *
Audio classification method based on a dual data augmentation strategy; Zhou Xun et al.; Journal of Wuhan University of Science and Technology; 20200415 (No. 02); full text *
Environmental sound classification method based on Mel-frequency cepstral coefficients, deep convolution and Bagging; Wang Tianrui et al.; Journal of Computer Applications; 20191231 (No. 12); full text *
Audio recognition method based on residual networks and random forests; Zhang Xiaolong et al.; Computer Engineering & Science; 20190415 (No. 04); full text *

Also Published As

Publication number Publication date
CN111754988A (en) 2020-10-09

Similar Documents

Publication Publication Date Title
CN111754988B (en) Sound scene classification method based on attention mechanism and double-path depth residual error network
CN110120218B (en) Method for identifying highway large-scale vehicles based on GMM-HMM
CN109559736B (en) Automatic dubbing method for movie actors based on confrontation network
Devi et al. Automatic speaker recognition from speech signals using self organizing feature map and hybrid neural network
CN109346084A (en) Method for distinguishing speek person based on depth storehouse autoencoder network
CN110047501B (en) Many-to-many voice conversion method based on beta-VAE
CN112735435A (en) Voiceprint open set identification method with unknown class internal division capability
WO2024140070A1 (en) Small sample speech separation method based on data generation
Chowdhuri Phononet: multi-stage deep neural networks for raga identification in hindustani classical music
Awais et al. Speaker recognition using mel frequency cepstral coefficient and locality sensitive hashing
CN112562725A (en) Mixed voice emotion classification method based on spectrogram and capsule network
Zheng et al. MSRANet: Learning discriminative embeddings for speaker verification via channel and spatial attention mechanism in alterable scenarios
Nawas et al. Speaker recognition using random forest
CN111243621A (en) Construction method of GRU-SVM deep learning model for synthetic speech detection
Kamaruddin et al. Features extraction for speech emotion
Chon et al. Acoustic scene classification using aggregation of two-scale deep embeddings
Alrehaili et al. Arabic speech dialect classification using deep learning
Chen et al. ACGAN-based data augmentation integrated with long-term scalogram for acoustic scene classification
CN112466333A (en) Acoustic scene classification method and system
CN105006231A (en) Distributed large population speaker recognition method based on fuzzy clustering decision tree
JP5091202B2 (en) Identification method that can identify any language without using samples
CN114970695B (en) Speaker segmentation clustering method based on non-parametric Bayesian model
Sangeetha et al. Analysis of machine learning algorithms for audio event classification using Mel-frequency cepstral coefficients
Chen et al. Long-term scalogram integrated with an iterative data augmentation scheme for acoustic scene classification
Nemati et al. RETRACTED CHAPTER: A Novel Text-Independent Speaker Verification System Using Ant Colony Optimization Algorithm

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant