CN113793622B - Audio scene recognition method, system and device - Google Patents

Audio scene recognition method, system and device

Info

Publication number
CN113793622B
CN113793622B (application CN202111064395.8A)
Authority
CN
China
Prior art keywords
audio
identified
wavelet
feature
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111064395.8A
Other languages
Chinese (zh)
Other versions
CN113793622A (en)
Inventor
张鹏远 (Zhang Pengyuan)
王猛 (Wang Meng)
颜永红 (Yan Yonghong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
Original Assignee
Institute of Acoustics CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS
Priority claimed from application CN202111064395.8A
Publication of CN113793622A
Application granted
Publication of CN113793622B
Legal status: Active

Classifications

    • G10L25/18 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique, using neural networks
    • G10L25/45 Speech or voice analysis techniques characterised by the type of analysis window
    • G06F18/24 Pattern recognition; Analysing; Classification techniques
    • G06N3/04 Neural networks; Architecture, e.g. interconnection topology
    • G06N3/084 Neural networks; Learning methods; Backpropagation, e.g. using gradient descent
    • G06F2218/08 Aspects of pattern recognition specially adapted for signal processing; Feature extraction
    • G06F2218/12 Aspects of pattern recognition specially adapted for signal processing; Classification; Matching

Abstract

The invention relates to an audio scene recognition method comprising the following steps: acquiring audio to be identified; extracting wavelet features of the audio to be identified to determine the wavelet features corresponding to the audio to be identified; inputting the wavelet features corresponding to the audio to be identified into a neural network embedded feature extractor with a residual network structure to obtain at least one depth embedded feature sequence; and inputting the wavelet features corresponding to the audio to be identified, together with the at least one depth embedded feature sequence, into a neural network classifier to determine the audio scene corresponding to the audio to be identified. By extracting wavelet features of the audio data to be recognized, the invention adapts to the requirements of time-frequency signal analysis. Meanwhile, the neural network embedded feature extractor with a residual network structure ensures that the extracted depth embedded features give high recognition accuracy when trained on large amounts of data, and also greatly improves recognition performance on short-time audio.

Description

Audio scene recognition method, system and device
Technical Field
The invention relates to the field of audio identification, in particular to an audio scene identification method, an audio scene identification system and an audio scene identification device based on wavelet characteristics and a one-dimensional residual neural network.
Background
Sound is an important channel for conveying information in human life. In daily life, sound can be broadly divided into speech and environmental sound. People typically communicate by speech. Environmental sound, unlike speech, is independent of the current speaker and contains rich information about nature and human activities.
Audio scene recognition is a fundamental task for understanding environmental sound and an important research direction in audio information processing. Its main goal is to identify the scene label of a segment of audio and thereby perceive the surrounding environment. At present, audio scene recognition technology is widely applied in intelligent robots and numerous terminal devices.
Conventional audio scene recognition methods generally employ machine learning techniques such as nearest neighbor algorithms, hidden Markov models and support vector machines. However, such conventional methods do not perform well with large amounts of data and have reached a bottleneck.
In recent years, methods based on deep neural networks have also developed rapidly in audio scene recognition, since a deep neural network can extract deeper audio features and therefore classify better. A common current approach is a fully connected convolutional neural network based on two-dimensional convolution. Such a network works well when judging long audio, for example 10 seconds or more, but its performance degrades significantly when judging short audio, for example around 1 second. Clearly, there is a need for an audio scene recognition scheme that overcomes these problems.
Disclosure of Invention
The invention relates to an audio scene recognition method that extracts wavelet features from the audio data to be recognized and then extracts a depth embedded feature sequence from those wavelet features using a neural network embedded feature extractor with a residual network structure, so that the audio scene corresponding to the audio to be recognized can be determined from the extracted depth embedded feature sequence. The wavelet features adaptively meet the requirements of time-frequency signal analysis, and, combined with the neural network having a residual network structure, ensure that the extracted depth embedded features give high recognition accuracy when trained on large amounts of data, while greatly improving recognition performance on short-time audio.
To achieve the above object, a first aspect of the present invention provides an audio scene recognition method, including: acquiring audio to be identified; extracting wavelet features of the audio to be identified to determine the wavelet features corresponding to the audio to be identified; inputting the wavelet features corresponding to the audio to be identified into a neural network embedded feature extractor with a residual network structure to obtain at least one depth embedded feature sequence; and inputting the wavelet features corresponding to the audio to be identified, together with the at least one depth embedded feature sequence, into a neural network classifier to determine the audio scene corresponding to the audio to be identified. The invention extracts wavelet features of the audio data to be recognized and can thereby adapt to the requirements of time-frequency signal analysis. Meanwhile, the neural network embedded feature extractor with a residual network structure ensures that the extracted depth embedded features give high recognition accuracy when trained on large amounts of data, and also greatly improves recognition performance on short-time audio.
Preferably, the extracting of the wavelet characteristics of the audio to be identified to determine the wavelet characteristics corresponding to the audio to be identified includes: determining a frequency spectrum corresponding to the audio to be identified; and obtaining wavelet characteristics corresponding to the audio to be identified by the frequency spectrum through a plurality of wavelet filters.
Preferably, determining the corresponding spectrum in the audio to be identified comprises: pre-emphasis is carried out on the audio to be identified; carrying out framing windowing on the pre-emphasized audio to be identified, and determining multi-frame pre-emphasized audio to be identified; and performing fast Fourier transform on each frame in the audio to be identified after multi-frame pre-emphasis to determine the frequency spectrum corresponding to each frame.
Preferably, the framing and windowing comprises: dividing the audio into frames of 512 ms with a frame shift of 171 ms; and windowing with a Hamming window as the window function. This framing and windowing scheme can effectively improve the accuracy of audio scene identification.
Preferably, the obtaining the wavelet characteristics corresponding to the audio to be identified by passing the frequency spectrum through a plurality of wavelet filters includes: squaring the spectrum to determine an energy spectrum; and inputting the energy spectrum into a plurality of wavelet filters to obtain wavelet characteristics corresponding to the audio to be identified. The invention obtains wavelet characteristics through the wavelet filter so as to adapt to the requirement of time-frequency signal analysis.
Preferably, squaring the spectrum to determine the energy spectrum includes: squaring the spectrum corresponding to each frame to determine the energy spectrum corresponding to each frame.
Preferably, the wavelet features are wavelet feature spectrograms corresponding to frames in the audio to be identified; or the wavelet characteristics are wavelet characteristic sequences corresponding to the audio to be identified, and the wavelet characteristic sequences comprise wavelet characteristic spectrograms corresponding to each frame.
Preferably, the number of wavelet filters is 290; the wavelet feature spectrogram is a one-dimensional wavelet feature vector containing 290 parameters; and the wavelet feature sequence is a two-dimensional wavelet feature vector of n × 290 parameters, where n is the number of frames of the audio to be identified and n is a positive integer. By using 290 wavelet filters to obtain a wavelet feature spectrogram containing 290 parameters, or a wavelet feature sequence of n × 290 parameters, the invention ensures that more accurate depth embedded features can be extracted later, improving the accuracy of audio scene identification.
Preferably, the neural network embedded feature extractor having a residual network structure includes: and at least one network block, wherein each network block comprises a two-way convolution layer, the two-way convolution layer has two convolution paths, and each network block combines the results of the two convolution paths in the network block to determine a depth embedded feature sequence output by the network block. The invention adopts a double-way convolution mode to ensure that the extracted depth embedded feature sequence is more accurate in the subsequent identification of the audio scene.
Preferably, the number of network blocks is 4; one path of the two-way convolution layer comprises a first convolution layer, a first batch normalization layer and an average pooling layer, and the other path comprises a second convolution layer and a second batch normalization layer.
Preferably, the neural network classifier comprises a feature stitching layer and a full-connection classification layer, wherein the full-connection classification layer comprises at least one full-connection mapping layer and a result output layer; the wavelet characteristics corresponding to the audio to be identified and at least one depth embedded characteristic sequence are input into a neural network classifier together to determine an audio scene corresponding to the audio to be identified, and the method comprises the following steps: inputting wavelet features corresponding to the audio to be identified and at least one depth embedded feature sequence into a feature splicing layer for stretching and splicing to form a one-dimensional depth feature vector; inputting the one-dimensional depth feature vector to at least one fully connected mapping layer to determine audio scene classification features; inputting the classification characteristics of the audio scenes into a result output layer to determine probability values of all the audio scenes; and determining the audio scene corresponding to the audio to be identified according to the probability value of each audio scene.
Preferably, determining an audio scene corresponding to the audio to be identified according to the probability value of each audio scene includes: and taking the audio scene with the maximum probability value as the audio scene corresponding to the audio to be identified.
Preferably, the method further comprises: in the training stage, if the number of fully connected mapping layers is greater than or equal to 2, every fully connected mapping layer except the last one masks a portion of its neurons with a preset probability by random deactivation (dropout). In the training stage, this random deactivation can effectively alleviate overfitting in audio scene recognition.
A second aspect of the present invention provides an audio scene recognition system, the system comprising: a signal processing and feature extractor, a neural network embedded feature extractor with a residual network structure, and a neural network classifier. The signal processing and feature extractor is used to acquire the audio to be identified and to extract wavelet features of the audio to be identified so as to determine the wavelet features corresponding to the audio to be identified. The neural network embedded feature extractor has a residual network structure and is used to obtain at least one depth embedded feature sequence from the wavelet features corresponding to the audio to be identified. The neural network classifier is used to determine the audio scene corresponding to the audio to be identified from the wavelet features corresponding to the audio to be identified and the at least one depth embedded feature sequence. The invention extracts wavelet features of the audio data to be recognized and can thereby adapt to the requirements of time-frequency signal analysis. Meanwhile, the neural network embedded feature extractor with a residual network structure ensures that the extracted depth embedded features give high recognition accuracy when trained on large amounts of data, and greatly improves recognition performance on short-time audio.
Preferably, the signal processing and feature extractor is further configured to: determining a frequency spectrum corresponding to the audio to be identified; and obtaining wavelet characteristics corresponding to the audio to be identified by the frequency spectrum through a plurality of wavelet filters.
Preferably, the signal processing and feature extractor is further configured to: pre-emphasis is carried out on the audio to be identified; carrying out framing windowing on the pre-emphasized audio to be identified, and determining multi-frame pre-emphasized audio to be identified; and performing fast Fourier transform on each frame in the audio to be identified after multi-frame pre-emphasis to determine the frequency spectrum corresponding to each frame.
Preferably, the framing and windowing comprises: dividing the audio into frames of 512 ms with a frame shift of 171 ms; and windowing with a Hamming window as the window function. This framing and windowing scheme can effectively improve the accuracy of audio scene identification.
Preferably, the signal processing and feature extractor is further configured to: squaring the spectrum to determine an energy spectrum; and inputting the energy spectrum into a plurality of wavelet filters to obtain wavelet characteristics corresponding to the audio to be identified. The invention obtains the wavelet characteristic spectrogram through the wavelet filter, thereby being capable of adapting to the requirement of time-frequency signal analysis.
Preferably, the signal processing and feature extractor is further configured to: and squaring the frequency spectrum corresponding to each frame to determine the energy spectrum corresponding to each frame.
Preferably, the wavelet features are wavelet feature spectrograms corresponding to frames in the audio to be identified; or the wavelet characteristics are wavelet characteristic sequences corresponding to the audio to be identified, and the wavelet characteristic sequences comprise wavelet characteristic spectrograms corresponding to each frame.
Preferably, the number of wavelet filters is 290; the wavelet feature spectrogram is a one-dimensional wavelet feature vector containing 290 parameters; and the wavelet feature sequence is a two-dimensional wavelet feature vector of n × 290 parameters, where n is the number of frames of the audio to be identified and n is a positive integer. By using 290 wavelet filters to obtain a wavelet feature spectrogram containing 290 parameters, or a wavelet feature sequence of n × 290 parameters, the invention ensures that more accurate depth embedded features can be extracted later, improving the accuracy of audio scene identification.
Preferably, the neural network embedded feature extractor having a residual network structure includes: and at least one network block, wherein each network block comprises a two-way convolution layer, the two-way convolution layer has two convolution paths, and each network block combines the results of the two convolution paths in the network block to determine a depth embedded feature sequence output by the network block. The invention adopts a double-way convolution mode to ensure that the extracted depth embedded feature sequence is more accurate in the subsequent identification of the audio scene.
Preferably, the number of network blocks is 4; one path of the two-way convolution layer comprises a first convolution layer, a first batch normalization layer and an average pooling layer, and the other path comprises a second convolution layer and a second batch normalization layer.
Preferably, the neural network classifier comprises a feature stitching layer and a full-connection classification layer, wherein the full-connection classification layer comprises at least one full-connection mapping layer and a result output layer; the neural network classifier is also used to: inputting wavelet features corresponding to the audio to be identified and at least one depth embedded feature sequence into a feature splicing layer for stretching and splicing to form a one-dimensional depth feature vector; inputting the one-dimensional depth feature vector to at least one fully connected mapping layer to determine audio scene classification features; inputting the classification characteristics of the audio scenes into a result output layer to determine probability values of all the audio scenes; and determining the audio scene corresponding to the audio to be identified according to the probability value of each audio scene.
Preferably, the neural network classifier is further configured to: and taking the audio scene with the maximum probability value as the audio scene corresponding to the audio to be identified.
Preferably, the neural network classifier is further configured to: in the training stage, if the number of fully connected mapping layers is greater than or equal to 2, mask a portion of the neurons in every fully connected mapping layer except the last one with a preset probability by random deactivation (dropout). In the training stage, this random deactivation can effectively alleviate overfitting in audio scene recognition.
A third aspect of the present invention provides an audio scene recognition apparatus, the apparatus comprising a processor configured to be coupled with a memory and to read and execute the instructions stored in the memory. The processor is pre-loaded with the executable code of the neural network embedded feature extractor with a residual network structure and of the neural network classifier. When running, the processor executes the instructions so as to: acquire the audio to be identified; extract wavelet features of the audio to be identified to determine the wavelet features corresponding to the audio to be identified; input the wavelet features corresponding to the audio to be identified into the neural network embedded feature extractor with a residual network structure to obtain at least one depth embedded feature sequence; and input the wavelet features corresponding to the audio to be identified, together with the at least one depth embedded feature sequence, into the neural network classifier to determine the audio scene corresponding to the audio to be identified. The invention extracts wavelet features of the audio data to be recognized and can thereby adapt to the requirements of time-frequency signal analysis. Meanwhile, the neural network embedded feature extractor with a residual network structure ensures that the extracted depth embedded features give high recognition accuracy when trained on large amounts of data, and greatly improves recognition performance on short-time audio.
Preferably, the processor is further configured to: determining a frequency spectrum corresponding to the audio to be identified; and obtaining wavelet characteristics corresponding to the audio to be identified by the frequency spectrum through a plurality of wavelet filters.
Preferably, the processor is further configured to: pre-emphasis is carried out on the audio to be identified; carrying out framing windowing on the pre-emphasized audio to be identified, and determining multi-frame pre-emphasized audio to be identified; and performing fast Fourier transform on each frame in the audio to be identified after multi-frame pre-emphasis to determine the frequency spectrum corresponding to each frame.
Preferably, the framing and windowing comprises: dividing the audio into frames of 512 ms with a frame shift of 171 ms; and windowing with a Hamming window as the window function. This framing and windowing scheme can effectively improve the accuracy of audio scene identification.
Preferably, the processor is further configured to: squaring the spectrum to determine an energy spectrum; and inputting the energy spectrum into a plurality of wavelet filters to obtain wavelet characteristics corresponding to the audio to be identified. The invention obtains the wavelet characteristic spectrogram through the wavelet filter, thereby being capable of adapting to the requirement of time-frequency signal analysis.
Preferably, the processor is further configured to: and squaring the frequency spectrum corresponding to each frame to determine the energy spectrum corresponding to each frame.
Preferably, the number of wavelet filters is 290; the wavelet feature spectrogram is a one-dimensional wavelet feature vector containing 290 parameters; and the wavelet feature sequence is a two-dimensional wavelet feature vector of n × 290 parameters, where n is the number of frames of the audio to be identified and n is a positive integer. By using 290 wavelet filters to obtain a wavelet feature spectrogram containing 290 parameters, or a wavelet feature sequence of n × 290 parameters, the invention ensures that more accurate depth embedded features can be extracted later, improving the accuracy of audio scene identification.
Preferably, the neural network embedded feature extractor having a residual network structure includes: and at least one network block, wherein each network block comprises a two-way convolution layer, the two-way convolution layer has two convolution paths, and each network block combines the results of the two convolution paths in the network block to determine a depth embedded feature sequence output by the network block. The invention adopts a double-way convolution mode to ensure that the extracted depth embedded feature sequence is more accurate in the subsequent identification of the audio scene.
Preferably, the number of network blocks is 4; one path of the two-way convolution layer comprises a first convolution layer, a first batch normalization layer and an average pooling layer, and the other path comprises a second convolution layer and a second batch normalization layer.
Preferably, the neural network classifier comprises a feature stitching layer and a full-connection classification layer, wherein the full-connection classification layer comprises at least one full-connection mapping layer and a result output layer; the processor is further configured to: stretching and splicing wavelet features corresponding to the audio to be identified and at least one depth embedded feature sequence through a feature splicing layer to form a one-dimensional depth feature vector; determining the classification characteristics of the audio scene by the one-dimensional depth characteristic vector through at least one full-connection mapping layer; the audio scene classification features pass through a result output layer to determine probability values of all audio scenes; and determining the audio scene corresponding to the audio to be identified according to the probability value of each audio scene.
Preferably, the processor is further configured to: and taking the audio scene with the maximum probability value as the audio scene corresponding to the audio to be identified.
Preferably, the processor is further configured to: in the training stage, if the number of fully connected mapping layers is greater than or equal to 2, mask a portion of the neurons in every fully connected mapping layer except the last one with a preset probability by random deactivation (dropout). In the training stage, this random deactivation can effectively alleviate overfitting in audio scene recognition.
The invention realizes an audio scene recognition method, which extracts a depth embedded feature sequence from wavelet features by extracting wavelet features of voice data to be recognized and adopting a neural network embedded feature extractor with a residual network structure, thereby determining an audio scene corresponding to audio to be recognized according to the depth embedded feature sequence. The wavelet features in the invention can adapt to the requirement of time-frequency signal analysis, and the neural network with the residual network structure can ensure that the extracted deep embedding features have higher accuracy in recognition and greatly improve the recognition performance of short-time audio when a large amount of data is trained.
Drawings
Fig. 1 is a schematic diagram of an audio scene recognition system according to an embodiment of the present invention;
fig. 2 is a flowchart of an audio scene recognition method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a neural network embedded feature extractor according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a neural network classifier according to an embodiment of the present invention;
FIG. 5 is a flowchart of another method for identifying audio scenes according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of an audio scene recognition device according to an embodiment of the present invention.
Detailed Description
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
The method and device are mainly applied to audio scene recognition. For example, the current environmental sound is acquired and identified to determine what kind of audio scene the user is currently in, such as a classroom where a lecture is being given, a noisy square, or a road dense with vehicles.
However, existing schemes have design shortcomings, so that training on large-scale data in the training stage does not improve their performance. Some schemes use deep neural networks, such as fully connected neural networks based on two-dimensional convolution, but the performance of such networks degrades significantly on short-time audio, which follows from the design of the two-dimensional convolutional fully connected network itself.
Therefore, the invention provides an audio scene recognition method that uses a neural network embedded feature extractor with a residual network structure to extract depth features, which solves the problem that fully connected neural networks with two-dimensional convolution degrade significantly on short-time audio. Moreover, the neural network embedded feature extractor with a residual network structure is a deep neural network, so it can exploit large-scale training data during training and markedly improve the performance of the trained network. In addition, by extracting wavelet features of the audio data to be recognized, the invention can adapt to the requirements of time-frequency signal analysis during recognition, greatly improving the accuracy of the final audio scene recognition.
In order to more clearly illustrate the solution of the present invention, the following describes the technical solution of the embodiment of the present invention in detail with reference to the accompanying drawings in the embodiment of the present invention.
Fig. 1 is a schematic diagram of an audio scene recognition system according to an embodiment of the present invention.
As shown in fig. 1, the present invention provides an audio scene recognition system 100, the system 100 comprising a signal processing and feature extraction module 101, a neural network embedded feature extractor 102 having a residual network structure, and a neural network classifier 103.
The signal processing and feature extraction module 101 is mainly configured to obtain the audio to be identified, and perform wavelet feature extraction on the audio to be identified to determine wavelet features corresponding to the audio to be identified. The audio to be identified is the audio ready for audio scene identification. The audio signal may be acquired by the signal processing and feature extraction module 101, or may be acquired in advance and input to the signal processing and feature extraction module 101, or may be configured in the signal processing and feature extraction module 101 in advance, and the signal processing and feature extraction module 101 only needs to read the audio signal to be identified stored in advance.
The neural network embedded feature extractor 102 with the residual network structure is mainly used for performing deep feature extraction on the wavelet features extracted by the signal processing and feature extraction module 101 to obtain a deep embedded feature sequence. It will be appreciated that the depth-embedded feature sequence is a feature obtained by extracting a depth feature from the wavelet features, and that the depth-embedded feature sequence can obviously more accurately represent a corresponding audio scene than the wavelet features.
The neural network classifier 103 performs classification recognition mainly according to the deep embedded feature sequence output by the neural network embedded feature extractor 102, so that the recognition result of the audio scene can be output.
The operation of the audio scene recognition system 100 will be described in more detail below.
Fig. 2 is a flowchart of an audio scene recognition method according to an embodiment of the present invention.
As shown in fig. 2, the method for identifying an audio scene provided by the present invention can be applied to the audio scene identification system shown in fig. 1. The method may comprise the steps of:
s201, acquiring audio to be identified.
First, the signal processing and feature extraction module 101 obtains audio to be identified, which is ready for audio scene identification. In some examples, the audio to be identified may be collected in real-time by the signal processing and feature extraction module 101. In other examples, the audio to be identified may be pre-acquired and input to the signal processing and feature extraction module 101. Of course, in still other examples, the audio to be identified may be already configured in advance in the signal processing and feature extraction module 101 to be subjected to audio scene identification, and the present invention is not limited thereto.
S202, determining a frequency spectrum corresponding to the audio to be identified.
After the signal processing and feature extraction module 101 obtains the audio to be identified in S201, a frequency spectrum corresponding to the audio to be identified may be determined. In some examples, a spectrum corresponding to each frame in the audio to be identified may be determined.
In one example, the signal processing and feature extraction module 101 may pre-emphasize the audio to be identified to boost its high-frequency part, and then frame and window the pre-emphasized audio. For example, the frame length may be set to 512 ms with a frame shift of 171 ms. A windowing process is then performed on each frame of data, for example using a Hamming window as the window function. It will be appreciated that the size of each window matches the frame size; for example, the Hamming window size and the frame length are both 512 ms. Of course, the specific frame length and frame shift may be modified according to the actual situation; the values above are merely a preferred setting under which the subsequently extracted features are more effective, improving the accuracy of audio scene recognition.
Thereafter, a fast Fourier transform (FFT) may be performed on each frame of the framed and windowed data to obtain the frequency spectrum corresponding to each frame of the audio to be identified.
Of course, in some examples, if the audio to be identified is not frame-windowed, the audio to be identified may also be directly subjected to FFT to obtain a frequency spectrum corresponding to the audio to be identified.
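As an illustration only, the following Python sketch shows one way S202 could be implemented. The sampling rate and the pre-emphasis coefficient are assumptions added for completeness; this section only specifies the 512 ms frame length, the 171 ms frame shift and the Hamming window.

```python
import numpy as np

def frame_spectra(audio, sr=44100, frame_ms=512, shift_ms=171, preemph=0.97):
    """Sketch of S202: pre-emphasis, framing, Hamming windowing and per-frame FFT.

    sr and preemph are assumed values; the patent text here only fixes the
    frame length (512 ms), the frame shift (171 ms) and the Hamming window.
    """
    # Pre-emphasis boosts the high-frequency part of the audio to be identified.
    emphasized = np.append(audio[0], audio[1:] - preemph * audio[:-1])

    frame_len = int(sr * frame_ms / 1000)
    frame_shift = int(sr * shift_ms / 1000)
    window = np.hamming(frame_len)  # the window size matches the frame size

    spectra = []
    for start in range(0, len(emphasized) - frame_len + 1, frame_shift):
        frame = emphasized[start:start + frame_len] * window
        spectra.append(np.abs(np.fft.rfft(frame)))  # magnitude spectrum of one frame
    return np.stack(spectra)  # shape: (n_frames, n_fft_bins)
```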
S203, passing the frequency spectrum through a plurality of wavelet filters to obtain the wavelet features corresponding to the audio to be identified.
The signal processing and feature extraction module 101 may obtain, for a spectrum corresponding to the audio to be identified, an energy spectrum corresponding to the audio to be identified by squaring. Of course, in some examples, if the signal processing and feature extraction module 101 obtains a spectrum corresponding to each frame in the audio to be identified in S201, the square may be performed on each frame spectrum to obtain an energy spectrum corresponding to each frame.
Then, the energy spectrum corresponding to the audio to be identified can be input into a wavelet filter bank for filtering, and the logarithm of the result taken to obtain the wavelet features corresponding to the audio to be identified. Alternatively, the energy spectrum corresponding to each frame may be input into the wavelet filter bank, and the logarithm of each filtering result taken to obtain the wavelet features corresponding to that frame. It can be understood that, if the signal processing and feature extraction module 101 obtains the spectrum of each frame of the audio to be identified, the per-frame spectra may also be processed by the wavelet filter bank together, with the logarithm taken of each frame's filtering result, to obtain the wavelet features corresponding to the audio to be identified. Obviously, the wavelet features of each frame are contained in the wavelet features of the audio to be identified.
The wavelet features corresponding to a frame may be a wavelet feature spectrogram, and multiple wavelet feature spectrograms may be combined to obtain a wavelet feature sequence. In one example, the wavelet feature spectrograms of all frames of the audio to be identified are combined to obtain the wavelet feature sequence. Thus the wavelet features corresponding to each frame are wavelet feature spectrograms, and the wavelet features corresponding to the audio to be identified may be a wavelet feature sequence.
In one example, the wavelet filter bank may include a plurality of wavelet filters; 290 wavelet filters are preferred. The wavelet feature spectrogram is then a one-dimensional wavelet feature vector containing 290 parameters, and the wavelet feature sequence can be represented as a two-dimensional wavelet feature vector of n × 290 parameters, where n is the number of frames of the audio to be identified and n is a positive integer. With 290 wavelet filters, resource consumption stays moderate while the accuracy of audio scene identification is improved.
It can be appreciated that, since the wavelet feature sequence combines the wavelet feature spectrograms of all frames, it can be transmitted as a whole, saving the time and resources otherwise spent transmitting the data frame by frame.
It should be noted that the above-mentioned number of wavelet filters is only a preferred embodiment, and in other examples, any number of wavelet filters may be selected according to practical situations, and the present invention is not limited herein.
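For illustration, a minimal sketch of S203 under stated assumptions follows. The construction of the 290 wavelet filters is not detailed in this part of the description, so the filter bank is represented abstractly as a matrix whose columns are the filters' frequency responses; that representation, and the small constant added before the logarithm, are assumptions rather than something the patent specifies.

```python
import numpy as np

def wavelet_features(spectra, filter_bank):
    """Sketch of S203: energy spectrum -> wavelet filter bank -> logarithm.

    spectra: (n_frames, n_fft_bins) magnitude spectra from S202.
    filter_bank: assumed (n_fft_bins, 290) matrix, one wavelet filter per column.
    Returns an (n_frames, 290) wavelet feature sequence; each row is the
    one-dimensional wavelet feature spectrogram of a single frame.
    """
    energy = spectra ** 2                # energy spectrum of each frame
    filtered = energy @ filter_bank      # apply the 290 wavelet filters
    return np.log(filtered + 1e-10)      # log-compressed wavelet features
```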
S204, inputting the wavelet characteristics corresponding to the audio to be identified into a neural network embedded feature extractor with a residual network structure to obtain at least one depth embedded feature sequence.
After the signal processing and feature extraction module 101 determines the wavelet features corresponding to the audio to be identified, the wavelet features corresponding to the audio to be identified may be input to the neural network embedded feature extractor 102 with the residual network structure for depth feature extraction, so as to obtain at least one depth embedded feature sequence.
In one example, depth feature extraction may be performed for each frame when depth feature extraction is performed for the neural network embedded feature extractor 102 having a residual network structure. Therefore, if the wavelet features corresponding to the audio to be identified include the wavelet features corresponding to each frame, the wavelet features corresponding to each frame may be respectively input into the neural network embedded feature extractor 102 with the residual network structure for depth feature extraction. That is, the wavelet feature spectrogram corresponding to each frame may be input into the neural network embedded feature extractor 102 having a residual network structure for depth feature extraction.
In one example, the specific structure of the neural network embedded feature extractor 102 may be as shown in fig. 3. Fig. 3 is a schematic structural diagram of a neural network embedded feature extractor according to an embodiment of the present invention. The neural network embedded feature extractor 102 may include a plurality of network blocks, preferably 4, namely network block 1 (301), network block 2 (302), network block 3 (303) and network block 4 (304). The structure of each network block is identical.
Each network block may include a two-way convolution layer which, as the name implies, contains two convolution paths. One path may include a first convolution layer 3011, a first batch normalization layer 3012 and an average pooling layer 3013; the other may include a second convolution layer 3014 and a second batch normalization layer 3015. The outputs of the two paths are then superimposed by an accumulator 3016 to obtain the depth embedded feature sequence of the network block. This two-way convolution performs a depth mapping of the features over a larger receptive field (RF), where the receptive field is the region of the input image from which a pixel of the feature map output by each layer of the convolutional neural network is mapped. In plain terms, one point on the feature map corresponds to a region of the input. Of course, if the input is a one-dimensional vector, the output feature map is also a one-dimensional vector, and the receptive field means that a point in the output one-dimensional feature vector corresponds to a region of the input one-dimensional vector.
It can be appreciated that if the convolution kernels in the first convolution layer 3011 and the second convolution layer 3014 are one-dimensional convolution kernels, the neural network embedded feature extractor 102 may perform depth feature extraction on the wavelet feature spectrogram corresponding to each frame. If the convolution kernels in the first and second convolution layers 3011 and 3014 are two-dimensional convolution kernels, the neural network embedded feature extractor 102 must take a wavelet feature sequence as input when performing depth feature extraction. The reason is that the wavelet feature sequence is a two-dimensional vector, the wavelet feature spectrogram is a one-dimensional vector, and obviously the one-dimensional vector cannot be subjected to two-dimensional convolution.
In one example, taking network block 1 (301) as an example, its input may be a wavelet feature spectrogram output by the signal processing and feature extraction module 101, for example a one-dimensional wavelet feature vector containing 290 parameters. The first convolution layer 3011 may use 1×3 convolution kernels with a stride of 1. In one example, the number of convolution kernels may be set to 4, so that the network block outputs a 4-channel depth embedded feature sequence. The first convolution layer 3011 convolves the input one-dimensional wavelet feature vector containing 290 parameters and then feeds the result to the first batch normalization layer 3012, which normalizes the features mainly to aid gradient propagation. The normalized data are then compressed by the average pooling layer 3013, whose pooling stride may be set to 2 to reduce network complexity while also removing some redundant information.
Preferably, after the first batch normalization layer 3012 outputs the normalized data, a nonlinear mapping may also be applied through a rectified linear unit (ReLU) to introduce nonlinearity and avoid the limited network expressiveness of purely linear functions. The data with the added nonlinearity are then passed to the average pooling layer 3013 for pooling compression.
To extract depth embedded features better, the two-way convolution layer also includes another path: the second convolution layer 3014 may use 1×1 convolution kernels with a stride of 2. To match the output of the first path, the number of convolution kernels in the second convolution layer 3014 is the same as in the first convolution layer 3011, for example 4. It can be understood that the input of the second convolution layer 3014 is the same as that of the first convolution layer 3011, namely the wavelet feature spectrogram output by the signal processing and feature extraction module 101. The convolution result of the second convolution layer 3014 is then normalized by the second batch normalization layer 3015. Network block 1 (301) then combines the results of the two convolution paths, for example by adding them via the accumulator 3016, to output the final depth embedded feature sequence of the block. It can be seen that, by combining the results of the two paths, the two-way convolution layer forms a residual network structure.
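The two-way convolution layer described above can be sketched in PyTorch as follows. This is a reading of the description rather than the patent's reference implementation; in particular, trimming the two paths to a common length for odd-length inputs is an added assumption.

```python
import torch
import torch.nn as nn

class TwoWayConvBlock(nn.Module):
    """Sketch of one network block (e.g. 301): two convolution paths whose
    outputs are summed by an accumulator to form the residual structure."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        # Path 1: 1x3 convolution (stride 1), batch normalization, ReLU, average pooling (stride 2).
        self.path1 = nn.Sequential(
            nn.Conv1d(in_channels, out_channels, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm1d(out_channels),
            nn.ReLU(),
            nn.AvgPool1d(kernel_size=2, stride=2),
        )
        # Path 2: 1x1 convolution with stride 2 plus batch normalization,
        # which dimensionally transforms the input to match path 1.
        self.path2 = nn.Sequential(
            nn.Conv1d(in_channels, out_channels, kernel_size=1, stride=2),
            nn.BatchNorm1d(out_channels),
        )

    def forward(self, x):  # x: (batch, channels, length)
        p1 = self.path1(x)
        p2 = self.path2(x)
        # Trim to a common length (an assumption for odd-length inputs),
        # then sum the two paths (accumulator 3016).
        n = min(p1.size(-1), p2.size(-1))
        return p1[..., :n] + p2[..., :n]
```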
It will be appreciated that the other convolution is primarily used to dimensionally transform the wavelet characteristics of the input to match the characteristics of the one convolution output. And combining the two paths of convolution results can further improve the information quantity of the depth embedded features. And the subsequent audio scene recognition is facilitated.
In some examples, the wavelet feature spectrogram output by the signal processing and feature extraction module 101 may be a two-channel one-dimensional wavelet feature vector containing 290 parameters. The two channels may be, for example, the left and right channels, or channels derived from them, such as their average. It is understood that the data of each channel can be set arbitrarily according to the actual situation, and the number of channels can be larger or smaller; the invention is not limited in this respect. Evidently, if network block 1 (301) takes a two-channel one-dimensional wavelet feature vector containing 290 parameters as input, two-way convolution is performed with the 1×3 and 1×1 convolution kernels, and if the number of convolution kernels is 4, the output of network block 1 (301) is a 4-channel depth embedded feature sequence containing 145 parameters. Each channel of the depth embedded feature sequence can also be regarded as a one-dimensional vector.
In some examples, a plurality of network blocks, for example 4, are included in the neural network embedded feature extractor 102 with a residual network structure. The structure of network block 2 (302) is the same as that of network block 1 (301), except that the number of convolution kernels in network block 2 (302) is twice that in network block 1 (301). For example, the number of convolution kernels in network block 1 (301) is 4 and in network block 2 (302) is 8. When network block 2 (302) takes the output of network block 1 (301) as input, assuming that output is a 4-channel depth embedded feature sequence containing 145 parameters, the output of network block 2 (302) is an 8-channel depth embedded feature sequence containing 72 parameters.
Similarly, the structures of network block 3 (303) and network block 4 (304) are the same as those of network block 1 (301) and network block 2 (302), the only difference being that the number of convolution kernels in network block 3 (303) is twice that in network block 2 (302), and the number in network block 4 (304) is twice that in network block 3 (303). For example, with 8 convolution kernels in network block 2 (302), network block 3 (303) has 16 and network block 4 (304) has 32. When network block 3 (303) takes the output of network block 2 (302) as input, assuming that output is an 8-channel depth embedded feature sequence containing 72 parameters, the output of network block 3 (303) is a 16-channel depth embedded feature sequence containing 36 parameters. And when network block 4 (304) takes the output of network block 3 (303) as input, assuming that output is a 16-channel depth embedded feature sequence containing 36 parameters, the output of network block 4 (304) is a 32-channel depth embedded feature sequence containing 18 parameters.
It will be appreciated that in the above-described deep embedding feature extraction, the wavelet feature spectrogram is preferably used for one-dimensional convolution. Of course, in some examples, the two-dimensional convolution may also be performed using a wavelet feature sequence, or the one-dimensional convolution may be performed using a wavelet feature sequence, which is not limited by the present invention.
In some examples, the output of the neural network embedded feature extractor 102 with a residual network structure may be the output of at least one network block, or the output of at least one network block together with the input wavelet features. In other examples, if there are multiple network blocks, the output may be chosen as the outputs of any one or more network blocks, or those outputs together with the input wavelet features. It will be appreciated that the more outputs the neural network embedded feature extractor 102 provides, the more it benefits the mapping in the neural network classifier 103, but this also increases the number of parameters in some layers of the classifier and its complexity. Therefore, the neural network embedded feature extractor 102 preferably outputs the results of the above 4 network blocks together with the input wavelet features to the neural network classifier 103. This benefits the mapping in the neural network classifier 103 while keeping the number of parameters in its layers moderate and avoiding excessive complexity.
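Reusing the TwoWayConvBlock sketched earlier, the preferred four-block extractor with doubling channel counts and multi-scale outputs might look like the following; the two-channel input is just the example given above and remains an assumption.

```python
class EmbeddedFeatureExtractor(nn.Module):
    """Sketch of extractor 102: four blocks with 4, 8, 16 and 32 convolution
    kernels; returns the input wavelet features plus all four block outputs."""

    def __init__(self, in_channels=2):  # e.g. a two-channel wavelet feature vector
        super().__init__()
        self.blocks = nn.ModuleList([
            TwoWayConvBlock(in_channels, 4),
            TwoWayConvBlock(4, 8),
            TwoWayConvBlock(8, 16),
            TwoWayConvBlock(16, 32),
        ])

    def forward(self, x):  # x: (batch, 2, 290)
        outputs = [x]
        for block in self.blocks:
            x = block(x)
            outputs.append(x)  # (batch, 4, 145), (batch, 8, 72), (batch, 16, 36), (batch, 32, 18)
        return outputs
```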
S205, inputting wavelet features corresponding to the audio to be identified and at least one depth embedded feature sequence into a neural network classifier together to determine an audio scene corresponding to the audio to be identified.
After the neural network embedded feature extractor 102 with the residual network structure determines at least one depth embedded feature sequence in S204, the wavelet features corresponding to the audio to be identified and the at least one depth embedded feature sequence input by the neural network embedded feature extractor may be input into the neural network classifier 103 together for mapping, so as to identify the audio scene corresponding to the audio to be identified.
In an example, the structure of the neural network classifier 103 may be as shown in fig. 4, and fig. 4 is a schematic structural diagram of a neural network classifier according to an embodiment of the present invention. It can be seen that the neural network classifier 103 can include a feature stitching layer 401 and a full connection classification layer 402. Further, the full connection classification layer 402 may include at least one full connection mapping layer and a result output layer 4024. In one example, the number of full connection mapping layers may be preferably 3, namely a first full connection mapping layer 4021, a second full connection mapping layer 4022, and a third full connection mapping layer 4023.
For example, the feature stitching layer 401 first performs feature stitching on all the input data, for example stretching all the depth embedded feature sequences into one-dimensional vectors. It will be appreciated that, if the depth embedded feature sequences output by the neural network embedded feature extractor 102 are one-dimensional vectors, the feature stitching layer 401 stretches and splices the at least one depth embedded feature sequence, which may have multiple channels, together with the one-dimensional wavelet feature vector. For example, if the output of the neural network embedded feature extractor 102 is a two-channel one-dimensional wavelet feature vector containing 290 parameters, a 4-channel depth embedded feature sequence containing 145 parameters, an 8-channel depth embedded feature sequence containing 72 parameters, a 16-channel depth embedded feature sequence containing 36 parameters and a 32-channel depth embedded feature sequence containing 18 parameters, the feature stitching layer 401 may first stretch them into feature sequences of 1 × 580, 1 × 580, 1 × 576, 1 × 576 and 1 × 576, which can be regarded as stretching multiple channels into a single channel. These feature sequences are then spliced to obtain a 1 × 2888 feature sequence, i.e. a depth embedded feature containing 2888 parameters.
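A minimal sketch of this stretching and splicing, assuming the channel counts and lengths listed above (2 × 290 + 4 × 145 + 8 × 72 + 16 × 36 + 32 × 18 = 2888 parameters), could be:

```python
import torch

def stitch_features(outputs):
    """Sketch of the feature stitching layer 401: stretch each multi-channel
    one-dimensional feature into a single vector per sample and splice them."""
    return torch.cat([o.flatten(start_dim=1) for o in outputs], dim=1)  # (batch, 2888)
```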
In another example, if the depth embedded feature sequences output by the neural network embedded feature extractor 102 are two-dimensional vectors, the feature stitching layer 401 may be configured to perform splicing only. For example, the multi-channel two-dimensional vectors are directly spliced along the channel dimension into a two-dimensional vector with more channels, without stretching. When the dimensions of the parameters in the different depth embedded feature sequences differ, the sequences can be truncated or padded according to a preset rule; the invention does not limit this. In some examples, padding may be performed with zeros.
It will be appreciated that the present invention preferably uses one-dimensional vectors for stretching and splicing.
Of course, in some examples, if the feature stitching layer 401 splices the inputs into a two-dimensional vector, a global pooling layer is further required before the fully connected classification layer 402 to pool the multi-channel vector, for example by pooling each channel, so as to obtain a one-dimensional vector containing one pooling result per channel. This one-dimensional vector is then input to the fully connected classification layer 402 for audio scene recognition. In one example, the global pooling layer may employ global average pooling. Obviously, the number of parameters in the vector after global average pooling depends on the number of channels; therefore, in S204 of the present invention it is preferable to use single-frame one-dimensional wavelet feature vectors for depth feature extraction, so that the depth embedded feature spliced by the feature stitching layer 401 contains more parameters.
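The global average pooling described here reduces each channel of a two-dimensional feature to a single value; the short sketch below is one way to express this, with the tensor shape chosen purely for illustration.

```python
import torch

x = torch.randn(32, 12, 18)    # assumed shape: (channels, height, width)
pooled = x.mean(dim=(1, 2))    # global average pooling: one value per channel
print(pooled.shape)            # torch.Size([32]) -> a one-dimensional vector
```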
In one example, the number of fully connected mapping layers may be 3, where the number of neurons in the first fully connected mapping layer 4021 may be 2048, the number of neurons in the second fully connected mapping layer 4022 may be 1024, and the number of neurons in the third fully connected mapping layer 4023 may be 1024. The input to the first fully connected mapping layer 4021 is the one-dimensional vector spliced by the feature stitching layer 401, for example a depth embedded feature containing 2888 parameters. After the depth embedded feature containing 2888 parameters passes through the first fully connected mapping layer 4021, a first depth mapping feature containing 2048 parameters is output, realizing a dimension-reducing mapping of the depth embedded feature. The first depth mapping feature containing 2048 parameters output by the first fully connected mapping layer 4021 is further reduced in dimension by the second fully connected mapping layer 4022 to obtain a second depth mapping feature containing 1024 parameters, so that more useful information is extracted. Then, the second depth mapping feature containing 1024 parameters output by the second fully connected mapping layer 4022 is further mapped by the third fully connected mapping layer 4023 to obtain features more conducive to audio scene classification, namely a third depth mapping feature containing 1024 parameters.
It is understood that the number of neurons in each fully-connected mapping layer can be arbitrarily adjusted according to practical situations, and the invention is not limited.
The number of neurons in the result output layer 4024 may be 10, representing 10 different audio scenes. Of course, this number can be increased or decreased to match the actual number of audio scenes. In one example, the result output layer 4024 employs a normalized exponential function (softmax) for probability prediction: it takes as input the third depth mapping feature containing 1024 parameters output by the third fully connected mapping layer 4023 and computes the probability value of each audio scene. Then, according to a preset rule and the probability value of each audio scene, the audio scene corresponding to the audio to be identified is determined.
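The fully connected classification layer described above (2888 → 2048 → 1024 → 1024 → 10, with a softmax output) can be sketched as follows. The layer sizes come from the example above, while the module name and the use of ReLU activations between mapping layers are assumptions, since the text does not specify an activation function.

```python
import torch
import torch.nn as nn

classifier_head = nn.Sequential(
    nn.Linear(2888, 2048),   # first fully connected mapping layer
    nn.ReLU(),               # activation assumed; not specified in the text
    nn.Linear(2048, 1024),   # second fully connected mapping layer
    nn.ReLU(),
    nn.Linear(1024, 1024),   # third fully connected mapping layer
    nn.ReLU(),
    nn.Linear(1024, 10),     # result output layer: 10 audio scenes
)

stitched = torch.randn(2888)                               # depth embedded feature from the stitching layer
probs = torch.softmax(classifier_head(stitched), dim=-1)   # probability value of each audio scene
```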
In one example, the preset rule may be to take the audio scene with the maximum probability value as the audio scene corresponding to the audio to be identified. Alternatively, preset weights of the different audio scenes can be combined with the corresponding probability values, and the audio scene with the highest combined value taken as the audio scene corresponding to the audio to be identified. It can be understood that any equivalent manner may be adopted to determine the audio scene corresponding to the audio to be identified from the probability values of the audio scenes.
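The two decision rules just mentioned can be sketched as follows; the probability vector and the per-scene weights are hypothetical placeholders.

```python
import torch

probs = torch.softmax(torch.randn(10), dim=-1)    # stand-in for the output-layer probabilities

# Rule 1: take the audio scene with the maximum probability value
scene = int(torch.argmax(probs))

# Rule 2: combine preset per-scene weights with the probability values
weights = torch.ones(10)                          # hypothetical preset weights, one per audio scene
scene_weighted = int(torch.argmax(probs * weights))
```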
In one example, if the wavelet features output in S204 are per-frame wavelet feature spectrograms and at least one depth embedded feature sequence is extracted for each frame, then the audio scene predicted in S205 is also on a per-frame basis. In that case, after determining the audio scene corresponding to each frame of the audio to be identified, the frames can be grouped by predicted audio scene, and the audio scene covering the largest number of frames taken as the audio scene corresponding to the audio to be identified. Alternatively, weights may be set for different frames, the proportion of each audio scene over the whole audio to be identified calculated, and the audio scene with the highest proportion taken as the audio scene corresponding to the audio to be identified. Obviously, any equivalent manner may be adopted to determine the audio scene corresponding to the audio to be identified based on the audio scene corresponding to each frame.
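For the per-frame case, the clip-level decision by counting frames can be sketched as below; the number of frames and the per-frame predictions are random placeholders used only to show the counting step.

```python
import torch

frame_scenes = torch.randint(0, 10, (431,))        # hypothetical per-frame scene predictions
counts = torch.bincount(frame_scenes, minlength=10)
clip_scene = int(torch.argmax(counts))             # the scene covering the most frames wins
```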
The wavelet features in the invention suit the requirements of time-frequency signal analysis, and, when trained on a large amount of data, the neural network with the residual network structure ensures that the extracted depth embedded features yield higher recognition accuracy and greatly improves the recognition performance on short-time audio.
Fig. 5 is a flowchart of another audio scene recognition method according to an embodiment of the present invention.
As shown in fig. 5, the present invention also provides another audio scene recognition method, which is performed before S201, and the method may include the steps of:
S501, training the neural network embedded feature extractor with the residual network structure and the neural network classifier using an audio data training set.
Prior to S201, training of the neural network embedded feature extractor 102 and the neural network classifier 103 having a residual network structure is required. For example, the neural network embedded feature extractor 102 and the neural network classifier 103 described above are trained by a pre-configured training set of audio data.
In one example, when the number of fully connected mapping layers in the neural network classifier 103 is greater than or equal to 2, a random inactivation (dropout) scheme may be used during training to mask some neurons in every fully connected mapping layer except the last one. This helps to mitigate overfitting during training. Meanwhile, gradient back propagation can be performed during training using a cross entropy loss function, so that the parameters in each layer of the neural network embedded feature extractor 102 and the neural network classifier 103 are dynamically adjusted; this updates the neural network embedded feature extractor 102 and the neural network classifier 103 and allows them to converge to a good result.
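By way of a hedged sketch only: the snippet below applies dropout to every fully connected mapping layer except the last one and performs gradient back propagation with a cross entropy loss on one batch. The optimizer (Adam), learning rate, dropout probability, and batch contents are assumptions not taken from the text, and only the classifier head is shown; in the described system the feature extractor 102 would be trained jointly.

```python
import torch
import torch.nn as nn

classifier_head = nn.Sequential(
    nn.Linear(2888, 2048), nn.ReLU(), nn.Dropout(p=0.5),   # dropout on all but the last mapping layer
    nn.Linear(2048, 1024), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(1024, 1024), nn.ReLU(),                      # last mapping layer: no dropout
    nn.Linear(1024, 10),                                   # result output layer
)

criterion = nn.CrossEntropyLoss()                # cross entropy loss on raw logits
optimizer = torch.optim.Adam(classifier_head.parameters(), lr=1e-3)

features = torch.randn(16, 2888)                 # hypothetical batch of stitched depth embedded features
labels = torch.randint(0, 10, (16,))             # audio scene labels for the batch

optimizer.zero_grad()
loss = criterion(classifier_head(features), labels)
loss.backward()                                  # gradient back propagation
optimizer.step()                                 # dynamically adjust the parameters

classifier_head.eval()                           # at inference time, dropout is disabled
```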
After training on the large amount of data in the audio data training set is completed, S201 and the subsequent steps are performed, so that the trained neural network embedded feature extractor 102 and neural network classifier 103 can be used in S204 and S205. It will be appreciated that, in use, the neural network classifier 103 no longer masks neurons in the dropout fashion.
It will be appreciated that each item of data in the audio data training set used for training carries a label, namely the audio scene to which that data corresponds.
Fig. 6 is a schematic diagram of an audio scene recognition device according to an embodiment of the present invention.
As shown in fig. 6, the present invention further provides an audio scene recognition apparatus 600. The apparatus 600 may include: a processor 610, a memory 620, and a bus 630. The processor 610 and the memory 620 in the device 600 may establish a communication connection through a bus 630.
The memory 620 is configured to store instructions that, when invoked by the processor 610, cause the processor 610 to perform any of the methods described above in relation to the audio scene recognition system of fig. 1-5.
Wherein the processor 610 may be a CPU.
The memory 620 may include volatile memory, such as random-access memory (RAM); the memory 620 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory 620 may also include a combination of the above types of memory.
In one example, code corresponding to the trained neural network embedded feature extractor 102 and neural network classifier 103 may be pre-stored in the memory so that it can be invoked and executed by the processor 610 to perform audio scene recognition.
In other examples, during the training phase, code corresponding to the original (untrained) neural network embedded feature extractor 102 and neural network classifier 103, as well as the corresponding audio data training set, may be pre-stored in the memory so that model training can be invoked and executed by the processor 610.
Of course, it should be understood that the apparatus 600 may include many other possible hardware devices, such as I/O interfaces for transmitting data, transmitters, receivers, etc., and the invention is not limited thereto.
Based on wavelet features, the method achieves better resolution in the time-frequency domain, and the residual neural network based on one-dimensional convolution can be trained more efficiently, thereby improving the performance of audio scene recognition.
The scheme of the invention can solve the problem that the performance of the traditional audio scene recognition system reaches the bottleneck on large-scale training data. Compared with the traditional audio scene recognition system, the system can train on a large-scale data set, and breaks through performance bottlenecks. Meanwhile, the invention can also solve the problem that the performance of the audio scene recognition system based on the two-dimensional convolution is obviously reduced when recognizing the scene of short-time audio. Compared with an audio scene recognition system based on two-dimensional convolution, the system can achieve better recognition effect on short-time audio, improve the performance of the audio scene recognition system, and can be rapidly deployed in various fields.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of function in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The foregoing description of the embodiments is provided to illustrate the general principles of the invention and is not intended to limit the scope of the invention to the particular embodiments; any modifications, equivalents, improvements, etc. made within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (9)

1. A method of audio scene recognition, the method comprising:
acquiring audio to be identified;
extracting wavelet features of the audio to be identified to determine the wavelet features corresponding to the audio to be identified;
inputting the wavelet features corresponding to the audio to be identified into a neural network embedded feature extractor with a residual network structure to obtain at least one depth embedded feature sequence;
inputting the wavelet features corresponding to the audio to be identified and the at least one depth embedded feature sequence into a neural network classifier together to determine an audio scene corresponding to the audio to be identified;
the neural network classifier comprises a feature splicing layer and a full-connection classification layer, wherein the full-connection classification layer comprises at least one full-connection mapping layer and a result output layer;
the step of inputting the wavelet features corresponding to the audio to be identified and the at least one depth embedded feature sequence into a neural network classifier together to determine an audio scene corresponding to the audio to be identified, includes:
inputting the wavelet features corresponding to the audio to be identified and the at least one depth embedded feature sequence into the feature splicing layer for stretching and splicing to form a one-dimensional depth feature vector;
inputting the one-dimensional depth feature vector to the at least one fully connected mapping layer to determine audio scene classification features;
inputting the audio scene classification features to the result output layer to determine probability values of all audio scenes;
and determining the audio scene corresponding to the audio to be identified according to the probability value of each audio scene.
2. The method of claim 1, wherein the extracting the wavelet features of the audio to be identified to determine the wavelet features corresponding to the audio to be identified comprises:
determining a frequency spectrum corresponding to the audio to be identified;
passing the frequency spectrum through a plurality of wavelet filters to obtain the wavelet features corresponding to the audio to be identified.
3. The method of claim 2, wherein the determining the spectrum corresponding to the audio to be identified comprises:
pre-emphasizing the audio to be identified;
framing and windowing the pre-emphasized audio to be identified to determine multiple frames of pre-emphasized audio to be identified;
performing a fast Fourier transform on each frame of the multi-frame pre-emphasized audio to be identified to determine the frequency spectrum corresponding to each frame.
4. The method according to claim 2, wherein said passing the spectrum through a plurality of wavelet filters to obtain wavelet features corresponding to the audio to be identified comprises:
squaring the spectrum to determine an energy spectrum;
inputting the energy spectrum into the plurality of wavelet filters to obtain the wavelet features corresponding to the audio to be identified.
5. The method according to any one of claims 1-4, wherein the wavelet features are wavelet feature spectrograms corresponding to frames in the audio to be identified; or the wavelet features are wavelet feature sequences corresponding to the audio to be identified, and the wavelet feature sequences comprise wavelet feature spectrograms corresponding to frames.
6. The method of claim 1, wherein the neural network embedded feature extractor having a residual network structure comprises: at least one network block, wherein each network block comprises a two-way convolution layer having two convolution paths, and each network block combines the results of the two convolution paths in the network block to determine the depth embedded feature sequence output by the network block.
7. The method according to claim 1, wherein the method further comprises:
in the training stage, if the number of full-connection mapping layers is greater than or equal to 2, all full-connection mapping layers except the last full-connection mapping layer adopt a random inactivation mode to mask some neurons with a preset probability.
8. An audio scene recognition system, the system comprising: the device comprises a signal processing and feature extractor, a neural network embedded feature extractor with a residual error network structure and a neural network classifier;
The signal processing and feature extractor is used for acquiring the audio to be identified; extracting wavelet characteristics of the audio to be identified to determine wavelet characteristics corresponding to the audio to be identified;
the neural network embedded feature extractor with the residual network structure is used for obtaining at least one depth embedded feature sequence according to the wavelet features corresponding to the audio to be identified;
the neural network classifier is used for determining an audio scene corresponding to the audio to be identified according to the wavelet characteristics corresponding to the audio to be identified and the at least one depth embedded characteristic sequence; the neural network classifier comprises a feature splicing layer and a full-connection classification layer, wherein the full-connection classification layer comprises at least one full-connection mapping layer and a result output layer;
the feature splicing layer is used for carrying out stretching splicing on the wavelet features corresponding to the audio to be identified and the at least one depth embedded feature sequence so as to form a one-dimensional depth feature vector;
the at least one full-connection mapping layer is used for determining the audio scene classification features according to the one-dimensional depth feature vector;
the result output layer is used for determining probability values of all the audio scenes according to the audio scene classification characteristics; and determining the audio scene corresponding to the audio to be identified according to the probability value of each audio scene.
9. An audio scene recognition device, the device comprising:
a processor for coupling with a memory and reading and executing instructions stored in the memory;
the instructions when executed by the processor cause the processor to perform the method of any of the preceding claims 1-7.
CN202111064395.8A 2021-09-10 2021-09-10 Audio scene recognition method, system and device Active CN113793622B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111064395.8A CN113793622B (en) 2021-09-10 2021-09-10 Audio scene recognition method, system and device

Publications (2)

Publication Number Publication Date
CN113793622A CN113793622A (en) 2021-12-14
CN113793622B true CN113793622B (en) 2023-08-29

Family

ID=79183086

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111064395.8A Active CN113793622B (en) 2021-09-10 2021-09-10 Audio scene recognition method, system and device

Country Status (1)

Country Link
CN (1) CN113793622B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110111773A (en) * 2019-04-01 2019-08-09 华南理工大学 The more New Method for Instrument Recognition of music signal based on convolutional neural networks
CN110796027A (en) * 2019-10-10 2020-02-14 天津大学 Sound scene recognition method based on compact convolution neural network model
CN111402922A (en) * 2020-03-06 2020-07-10 武汉轻工大学 Audio signal classification method, device, equipment and storage medium based on small samples
CN111754988A (en) * 2020-06-23 2020-10-09 南京工程学院 Sound scene classification method based on attention mechanism and double-path depth residual error network
CN112466333A (en) * 2020-11-24 2021-03-09 深圳信息职业技术学院 Acoustic scene classification method and system
KR20210046416A (en) * 2019-10-18 2021-04-28 한국과학기술원 Audio classification method based on neural network for waveform input and analyzing apparatus
CN112750459A (en) * 2020-08-10 2021-05-04 腾讯科技(深圳)有限公司 Audio scene recognition method, device, equipment and computer readable storage medium
CN113257279A (en) * 2021-03-24 2021-08-13 厦门大学 GTCN-based real-time voice emotion recognition method and application device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102605736B1 (en) * 2018-03-15 2023-11-27 한국전자통신연구원 Method and apparatus of sound event detecting robust for frequency change
KR102635469B1 (en) * 2019-03-18 2024-02-13 한국전자통신연구원 Method and apparatus for recognition of sound events based on convolutional neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
An audio scene classification framework with embedded filters and a DCT-based temporal module;Hangting Chen etc;《2019 IEEE International Conference on Acoustics, Speech and Signal Processing》;第835-839页 *

Also Published As

Publication number Publication date
CN113793622A (en) 2021-12-14

Similar Documents

Publication Publication Date Title
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
CN111477250B (en) Audio scene recognition method, training method and device for audio scene recognition model
CN110033756B (en) Language identification method and device, electronic equipment and storage medium
CN108597505B (en) Voice recognition method and device and terminal equipment
CN113643723B (en) Voice emotion recognition method based on attention CNN Bi-GRU fusion visual information
CN109036470B (en) Voice distinguishing method, device, computer equipment and storage medium
CN110796027A (en) Sound scene recognition method based on compact convolution neural network model
CN114338623A (en) Audio processing method, device, equipment, medium and computer program product
CN108806725A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN113793622B (en) Audio scene recognition method, system and device
CN110569908B (en) Speaker counting method and system
CN112420079A (en) Voice endpoint detection method and device, storage medium and electronic equipment
CN117496990A (en) Speech denoising method, device, computer equipment and storage medium
CN115565548A (en) Abnormal sound detection method, abnormal sound detection device, storage medium and electronic equipment
CN112396126B (en) Target detection method and system based on detection trunk and local feature optimization
CN115035887A (en) Voice signal processing method, device, equipment and medium
CN113990347A (en) Signal processing method, computer equipment and storage medium
CN113870896A (en) Motion sound false judgment method and device based on time-frequency graph and convolutional neural network
CN113327616A (en) Voiceprint recognition method and device, electronic equipment and storage medium
WO2020250220A1 (en) Sound analysis for determination of sound sources and sound isolation
CN112750469A (en) Method for detecting music in voice, voice communication optimization method and corresponding device
CN113516992A (en) Audio processing method and device, intelligent equipment and storage medium
CN112201270B (en) Voice noise processing method and device, computer equipment and storage medium
Zhang et al. Filamentary Convolution for Spoken Language Identification: A Brain-Inspired Approach
CN116259330A (en) Voice separation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant