CN111899760B - Audio event detection method and device, electronic equipment and storage medium - Google Patents

Audio event detection method and device, electronic equipment and storage medium

Info

Publication number
CN111899760B
Authority
CN
China
Prior art keywords
audio
sub
band
features
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010693055.0A
Other languages
Chinese (zh)
Other versions
CN111899760A (en)
Inventor
王俊
王晓瑞
李岩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202010693055.0A
Publication of CN111899760A
Application granted
Publication of CN111899760B

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/18 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/12 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being prediction coefficients
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique, using neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/45 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure relates to an audio event detection method and device, an electronic device and a storage medium. The method includes: acquiring audio features corresponding to audio data to be identified; dividing the audio features according to the frequency domain information of the audio features to generate a plurality of sub-band features; performing feature extraction on each of the plurality of sub-band features to obtain a plurality of sub-band target features; and obtaining a category detection result and a time detection result of each audio event in the audio data according to the plurality of sub-band target features. A deep learning network learns the plurality of sub-band features, which differ at the band level, and these band-level differences are applied to the deep-learning-based audio event classification model, so that the classification performance of the model can be improved and the various audio events contained in the audio data to be identified can all be recognized. Audio event detection is therefore more comprehensive and more accurate.

Description

Audio event detection method and device, electronic equipment and storage medium
Technical Field
The disclosure relates to the technical field of voice recognition, and in particular relates to a method and a device for detecting an audio event, electronic equipment and a storage medium.
Background
Sound carries a great deal of information and plays an important role in our daily lives. From the sounds we receive, we can determine where a sound occurs (called the audio scene, such as a subway or a street) and what is happening (called an audio event, such as an alarm or a dog barking). With the rapid development of artificial intelligence, computers can also make decisions about audio scenes and audio events, with accuracy that can even exceed that of humans.
For audio events, detection can be used in fields such as the Internet of Things and mobile navigation devices, and in situations where visual information is ambiguous, for perceptual computing and for providing better responses to users. A piece of audio contains a variety of audio events that often overlap, i.e., multiple audio events may occur simultaneously during the same time period. For example, on a bus, we may hear the sound of the bus engine, the sound of people talking, and the sound of traffic at the same time. In the related art, audio event detection increasingly relies on deep learning methods. For example, the audio event category is obtained by identifying the audio features corresponding to the audio data with a trained convolutional neural network. However, the deep learning methods in the related art generally detect only one audio event when identifying audio features, so that audio event detection is not sufficiently comprehensive or accurate.
Disclosure of Invention
The disclosure provides an audio event detection method and device, an electronic device and a storage medium, so as to at least solve the problem that audio event detection in the related art is not sufficiently comprehensive or accurate. The technical solution of the present disclosure is as follows:
according to a first aspect of an embodiment of the present disclosure, there is provided a method for detecting an audio event, including:
Acquiring audio characteristics corresponding to audio data to be identified;
dividing the audio features according to the frequency domain information of the audio features to generate a plurality of sub-band features;
respectively extracting the characteristics of the plurality of sub-band characteristics to obtain a plurality of sub-band target characteristics;
and obtaining a category detection result and a time detection result of each audio event in the audio data according to the plurality of sub-band target features.
In one embodiment, feature extraction is performed on a plurality of sub-band features to obtain a plurality of sub-band target features, including:
and inputting the plurality of sub-band features into a first neural network to obtain a plurality of sub-band target features, wherein the first neural network comprises a plurality of sub-band networks, and each sub-band network corresponds to one sub-band feature.
In one embodiment, the sub-band network comprises a plurality of sequentially connected local attention blocks; inputting the plurality of sub-band features into a first neural network to obtain a plurality of sub-band target features, including:
Inputting each sub-band feature into a first local attention block in a sub-band network corresponding to each sub-band feature for each sub-band feature to obtain a sub-band intermediate feature;
And sequentially inputting the intermediate sub-band features to the next local attention block until the sub-band target features corresponding to each sub-band feature are output.
In one embodiment, obtaining a category detection result and a time detection result of each audio event in the audio data according to the plurality of sub-band target features includes:
Fusing a plurality of sub-band target features to generate audio fusion features corresponding to the audio data;
inputting the audio fusion characteristics into a second neural network to obtain audio target characteristics corresponding to audio data;
and determining a category detection result and a time detection result of each audio event in the audio data according to the audio target characteristics.
In one embodiment, fusing a plurality of sub-band target features to generate an audio fusion feature corresponding to audio data includes:
Splicing a plurality of sub-band target features according to the frequency domain information;
and convolving and pooling the spliced sub-band target features to obtain the audio fusion feature.
In one embodiment, the second neural network comprises two second sub-neural networks, the two second sub-neural networks comprising different activation functions; inputting the audio fusion characteristic into a second neural network to obtain an audio target characteristic corresponding to the audio data, wherein the audio target characteristic comprises:
inputting the audio fusion characteristics into each second sub-neural network to obtain audio intermediate characteristics output by each second sub-neural network;
And according to the frequency domain information, splicing the audio intermediate characteristics respectively output by each second sub-neural network to obtain the audio target characteristics.
In one embodiment, determining a category detection result and a time detection result of each audio event in the audio data according to the audio target feature includes:
acquiring each frame of audio target characteristics in the audio target characteristics;
Inputting each frame of audio target characteristics into a full-connection layer containing different activation functions respectively, and outputting detection results corresponding to each frame of audio target characteristics;
And determining a category detection result and a time detection result of each audio event in the audio data according to the detection result corresponding to each frame of audio target characteristic.
In one embodiment, the audio feature is divided according to frequency domain information of the audio feature, and a plurality of sub-band features are generated, including:
Acquiring a plurality of frequency ranges which are configured in advance;
The audio features are divided according to a plurality of frequency ranges, and sub-band features corresponding to each frequency range are generated.
In one embodiment, the time detection result includes a start frame number and an end frame number for each audio event; after the category detection result and the time detection result of each audio event in the audio data are obtained, the method further comprises the following steps:
Acquiring the corresponding time length of each frame of audio data;
And generating the starting time and the ending time corresponding to each audio event according to the time length corresponding to each frame of audio data, and the starting frame number and the ending frame number of each audio event.
According to a second aspect of embodiments of the present disclosure, there is provided a detection apparatus for an audio event, including:
The audio feature acquisition module is configured to acquire audio features corresponding to the audio data to be identified;
the sub-band feature generation module is configured to divide the audio features according to the frequency domain information of the audio features to generate a plurality of sub-band features;
the first characteristic generating module is configured to perform characteristic extraction on the plurality of sub-band characteristics respectively to obtain a plurality of sub-band target characteristics;
and a detection result generation module configured to obtain a category detection result and a time detection result of each audio event in the audio data according to the plurality of sub-band target features.
In one embodiment, the first feature generation module is configured to perform inputting the plurality of sub-band features into a first neural network, resulting in a plurality of sub-band target features, the first neural network including a plurality of sub-band networks, each sub-band network corresponding to one sub-band feature.
In one embodiment, the sub-band network comprises a plurality of sequentially connected local attention blocks; a first feature generation module configured to perform:
Inputting each sub-band feature into a first local attention block in a sub-band network corresponding to each sub-band feature for each sub-band feature to obtain a sub-band intermediate feature;
And sequentially inputting the intermediate sub-band features to the next local attention block until the sub-band target features corresponding to each sub-band feature are output.
In one embodiment, the detection result generating module includes:
the feature fusion module is configured to perform fusion of the plurality of sub-band target features and generate audio fusion features corresponding to the audio data;
the second feature generation module is configured to input the audio fusion feature into the second neural network to obtain an audio target feature corresponding to the audio data;
and the time and category determining module is configured to determine a category detection result and a time detection result of each audio event in the audio data according to the audio target characteristics.
In one embodiment, the feature fusion module includes:
A first splicing unit configured to perform splicing of a plurality of sub-band target features according to the frequency domain information;
And the fusion unit is configured to perform convolution and pooling processing on the spliced sub-band target characteristics to obtain audio fusion characteristics.
In one embodiment, the second neural network comprises two second sub-neural networks, the two second sub-neural networks comprising different activation functions; a second feature generation module comprising:
The feature generation unit is configured to input the audio fusion feature to each second sub-neural network to obtain an audio intermediate feature output by each second sub-neural network;
and the second splicing unit is configured to splice the audio intermediate characteristics respectively output by each second sub-neural network according to the frequency domain information to obtain audio target characteristics.
In one embodiment, the time and category determination module includes:
an acquisition unit configured to perform acquisition of each frame of audio target features of the audio target features;
each frame of audio detection result generating unit is configured to perform the steps of respectively inputting each frame of audio target characteristics to a full-connection layer containing different activation functions, and outputting detection results corresponding to each frame of audio target characteristics;
and a time and category determining unit configured to perform determination of a category detection result and a time detection result for each audio event in the audio data based on the detection result corresponding to the audio target feature of each frame.
In one embodiment, the subband feature generating module is configured to perform:
Acquiring a plurality of frequency ranges which are configured in advance;
The audio features are divided according to a plurality of frequency ranges, and sub-band features corresponding to each frequency range are generated.
In one embodiment, the time detection result includes a start frame number and an end frame number for each audio event; the acquisition module is further configured to perform acquisition of a time length corresponding to each frame of audio data;
the apparatus further comprises: and the start-stop time generation module is configured to generate a start time and an end time corresponding to each audio event according to the time length corresponding to each frame of audio data and the start frame number and the end frame number of each audio event.
According to a third aspect of embodiments of the present disclosure, there is provided a computer program product comprising a computer program stored in a readable storage medium, from which at least one processor of a device reads and executes the computer program, causing the device to perform the method of detecting an audio event as described in any of the embodiments of the first aspect.
According to a fourth aspect of embodiments of the present disclosure, there is provided an electronic device, comprising:
A processor;
A memory for storing the processor-executable instructions;
Wherein the processor is configured to execute the instructions to implement the method of detecting an audio event as described in any of the embodiments of the first aspect.
According to a fifth aspect of embodiments of the present disclosure, there is provided a storage medium storing instructions which, when executed by a processor of an electronic device, enable the electronic device to perform the method of detecting an audio event described in any one of the embodiments of the first aspect.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
Audio features corresponding to the audio data to be identified are acquired, and the audio features are divided according to their frequency domain information to generate a plurality of sub-band features. A deep learning network learns the plurality of sub-band features, which differ at the band level, and these band-level differences are applied to the deep-learning-based audio event classification model, so that the classification performance of the model can be improved and the various audio events contained in the audio data to be identified can all be recognized. Audio event detection is therefore more comprehensive and more accurate.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure and do not constitute an undue limitation on the disclosure.
Fig. 1 is an application environment diagram illustrating a method of detecting an audio event according to an exemplary embodiment.
Fig. 2 is an application environment diagram illustrating another method of detecting an audio event according to an exemplary embodiment.
Fig. 3 is a flow chart illustrating a method of detecting an audio event according to an exemplary embodiment.
Fig. 4 is a schematic diagram illustrating a split sub-band feature according to an example embodiment.
Fig. 5 is a schematic diagram illustrating a structure of a first neural network according to an exemplary embodiment.
Fig. 6 is a schematic diagram illustrating a sub-band network structure according to an exemplary embodiment.
Fig. 7 is a schematic diagram of a local attention block structure, according to an exemplary embodiment.
Fig. 8 is a flowchart illustrating a method of determining a detection result according to an exemplary embodiment.
Fig. 9 is a schematic diagram of a deep learning network, according to an example embodiment.
Fig. 10 is a flowchart illustrating a method of determining a detection result according to an exemplary embodiment.
FIG. 11 is a schematic diagram illustrating a time distribution of resulting audio events, according to an example embodiment.
Fig. 12 is a flowchart illustrating a method of detecting an audio event according to an exemplary embodiment.
Fig. 13 is a schematic diagram of a deep learning network, according to an example embodiment.
Fig. 14 is a block diagram illustrating an audio event detection apparatus according to an exemplary embodiment.
Fig. 15 is a block diagram of an audio event detection device according to an exemplary embodiment.
Fig. 16 is an internal structural diagram of an electronic device, which is shown according to an exemplary embodiment.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
The method for detecting the audio event provided by the disclosure can be applied to an application environment as shown in fig. 1. Wherein the audio acquisition device 110 is interconnected with the terminal 120. The audio acquisition device 110 may be a separate device or may be a built-in component in the terminal 120. The terminal 120 is deployed with a trained deep learning network for detecting and obtaining a category detection result and a time detection result of an audio event in the audio data to be identified. Specifically, the terminal 120 acquires audio data to be identified from the audio acquisition device 110; the terminal 120 processes the audio data to be identified to obtain corresponding audio characteristics; dividing the audio features according to the frequency domain information of the audio features to generate a plurality of sub-band features; respectively extracting the characteristics of the plurality of sub-band characteristics to obtain a plurality of sub-band target characteristics; and obtaining a category detection result and a time detection result of each audio event in the audio data according to the target characteristics of the plurality of sub-frequency bands. The audio collecting apparatus 110 may be, but not limited to, various microphones, recording apparatuses, etc., and the terminal 120 may be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable apparatuses.
In another exemplary embodiment, the method for detecting an audio event provided by the present disclosure may also be applied to an application environment as shown in fig. 2. Wherein the terminal 210 and the server 220 interact through a network. The trained deep learning network for audio event classification may be deployed in the terminal 210 or in the server 220. Take the example of deployment in server 220. The user may trigger a detection instruction of the audio event through the terminal 210, so that the server 220 performs detection of the audio event according to the detection instruction. For example, for the short video recommendation field, the server 220 may automatically parse the audio data stream uploaded by the user to obtain the type detection result and the time detection result of the audio event in the audio data stream, so that the video clips containing interesting sounds may be screened and intercepted for recommendation. The terminal 210 may be, but not limited to, various personal computers, notebook computers, smartphones, tablet computers, and portable wearable devices, and the server 220 may be implemented as a stand-alone server or a server cluster composed of a plurality of servers.
Fig. 3 is a flowchart illustrating a method for detecting an audio event according to an exemplary embodiment, and as shown in fig. 3, the method for detecting an audio event is illustrated as being used in the terminal 120 of fig. 1, and includes the following steps.
In step S310, audio features corresponding to the audio data to be identified are acquired.
Specifically, after the audio data to be identified is obtained, feature extraction may be performed on the audio data to obtain audio features corresponding to the audio data. Feature extraction of audio data may be achieved in the following manner. First, each speech signal sample is pre-emphasized by a high pass filter. Because the audio data has short-time stationarity, each audio data can be subjected to framing according to time steps, each time step is called a frame, and the time step corresponding to each frame can take a preset value, for example, any value between 20ms and 30 ms. In order to avoid excessive variation between adjacent frames, an overlap region may be provided between adjacent frames. Each frame is then windowed to increase the continuity of the left and right ends of the frame, e.g. calculated using a window of 25ms, shifted every 10 ms. And then, carrying out Fourier transform on the windowed audio data to obtain a spectrogram, and filtering to enable the spectrogram to be more compact. Finally, spectrum or cepstrum analysis can be used to obtain the audio features corresponding to the audio data.
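For illustration only (this is not part of the claimed method), the following Python sketch shows one possible front end of this kind using the librosa library; the pre-emphasis coefficient, the 25 ms window, the 10 ms shift and the 64 mel bands are assumed example values, since the disclosure does not fix them.

```python
# Illustrative log-mel front end (assumed parameters; the disclosure does not fix them).
import numpy as np
import librosa

def extract_audio_features(path, sr=16000, n_mels=64):
    y, _ = librosa.load(path, sr=sr)
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])        # pre-emphasis (high-pass filtering)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=int(0.025 * sr),                        # 25 ms analysis window
        win_length=int(0.025 * sr),
        hop_length=int(0.010 * sr),                   # 10 ms shift -> overlapping frames
        n_mels=n_mels)
    return librosa.power_to_db(mel).T                 # shape: (num_frames, n_mels)
```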
In step S320, the audio features are divided according to the frequency domain information of the audio features, and a plurality of sub-band features are generated.
Specifically, the audio features obtained by extracting features from the audio data include time domain information and frequency domain information. The frequency domain information corresponds to a frequency axis and is a coordinate system used for describing the characteristics of the signal in the frequency aspect; the time domain information corresponds to a time axis and may be referred to as a frame number. For different audio events, they may be distributed on different frequencies, so that the audio feature may be divided according to the frequency domain information of the audio feature, to obtain sub-band features corresponding to a plurality of different frequency domain information respectively.
In step S330, feature extraction is performed on each of the plurality of subband features to obtain a plurality of subband target features.
In step S340, a category detection result and a time detection result of each audio event in the audio data are obtained according to the plurality of sub-band target features.
Specifically, after the plurality of sub-band features are obtained, the deep learning network may be used to process each sub-band feature separately to obtain the plurality of sub-band target features. Based on the plurality of sub-band target features, different activation functions are used to obtain the category detection result and the time detection result of the audio events, respectively. The deep learning network may be any network that can be used for extracting frequency band features, for example a recurrent neural network, a convolutional neural network, or a combination of the two. The time detection result may be the start frame number and the end frame number of each audio event.
In the above audio event detection method, the audio features are divided according to their frequency domain information to generate a plurality of sub-band features. A deep learning network learns the plurality of sub-band features, which differ at the band level, and these band-level differences are applied to the deep-learning-based audio event classification model, so that the classification performance of the model can be improved and the various audio events contained in the audio data to be identified can all be recognized. Audio event detection is therefore more comprehensive and more accurate.
In an exemplary embodiment, dividing the audio feature according to the frequency domain information of the audio feature to generate a plurality of sub-band features includes: acquiring a plurality of frequency ranges which are configured in advance; the audio features are divided according to a plurality of frequency ranges, and sub-band features corresponding to each frequency range are generated.
Fig. 4 schematically shows a plurality of sub-band features obtained by dividing according to a plurality of frequency ranges. Specifically, a certain overlap region can be set between adjacent frequency ranges, so that when the deep learning network identifies the different sub-band features it can learn some shared knowledge in addition to band-specific knowledge, which keeps the network consistent. To ensure the classification performance (e.g., accuracy and recall) of the deep learning network, different overlap ratios may be set, and the best overlap ratio found through repeated experiments. In this embodiment, by configuring different frequency ranges in advance and setting a certain overlap between the frequency ranges, the classification performance of the deep learning network can be improved, and the accuracy of audio event detection can be improved.
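As an illustrative aid, a minimal sketch of the division step follows, assuming the pre-configured frequency ranges are expressed as index ranges over the frequency bins of the feature; the three overlapping ranges used here are hypothetical examples, not values specified by the disclosure.

```python
# Splitting the audio feature along the frequency axis into overlapping sub-bands.
# The three index ranges below are hypothetical; the disclosure only requires
# pre-configured frequency ranges with some overlap between neighbours.
import numpy as np

def split_subbands(features, ranges=((0, 28), (20, 48), (40, 64))):
    """features: (num_frames, num_freq_bins) -> list of sub-band features."""
    return [features[:, lo:hi] for lo, hi in ranges]

# Example: a 64-bin log-mel feature split into three overlapping sub-bands.
feats = np.random.randn(500, 64)
subbands = split_subbands(feats)
print([sb.shape for sb in subbands])   # [(500, 28), (500, 28), (500, 24)]
```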
In an exemplary embodiment, in step S330, feature extraction is performed on the plurality of subband features to obtain a plurality of subband target features, including: and inputting the plurality of sub-band features into a first neural network to obtain a plurality of sub-band target features, wherein the first neural network comprises a plurality of sub-band networks, and each sub-band network corresponds to one sub-band feature.
Specifically, after the audio data are acquired, features are extracted and divided by frequency to obtain a plurality of sub-band features, which are then input to the deep learning network. A first neural network in the deep learning network performs feature extraction on each sub-band feature to obtain the corresponding sub-band target feature, thereby completing the modeling of the frequency axis (i.e., space). Fig. 5 exemplarily shows a schematic structural diagram of the first neural network in this embodiment. As shown in fig. 5, the first neural network includes a plurality of sub-band networks. Each sub-band network may be a convolutional neural network or a recurrent neural network. The neural network structure of each sub-band network may be the same or different, and each sub-band network processes the sub-band feature corresponding to it. After the plurality of sub-band features are input into the first neural network, each sub-band feature is processed by its corresponding sub-band network to obtain a sub-band target feature.
In this embodiment, the efficiency of audio detection may be improved by adopting a plurality of independent subband networks to process the corresponding subband features respectively.
In an exemplary embodiment, the sub-band network comprises a plurality of sequentially connected local attention blocks; inputting the plurality of sub-band features into a first neural network to obtain a plurality of sub-band target features, including: inputting each sub-band feature into a first local attention block in a sub-band network corresponding to each sub-band feature for each sub-band feature to obtain a sub-band intermediate feature; and sequentially inputting the intermediate sub-band features to the next local attention block until the sub-band target features corresponding to each sub-band feature are output.
Because the structure and the network scale of the neural network affect the classification and detection performance, in order to facilitate the optimization and adjustment of the model, in this embodiment, the network structure of the subband network corresponding to each subband feature may be the same. Fig. 6 schematically shows a schematic structure of a sub-band network. As shown in fig. 6, the sub-band network includes a plurality of local attention blocks connected in sequence. The number of local attention blocks in each sub-band network depends on the actual situation. After the sub-band characteristics are input into the corresponding sub-band network, each local attention block in the sub-band network is sequentially processed until the last local attention block is processed, and the corresponding sub-band target characteristics are output.
Fig. 7 illustrates a schematic diagram of the structure of a local attention block in a sub-band network. In fig. 7, Conv2D represents a two-dimensional convolution layer; BN (Batch Normalization) represents a batch normalization layer; Sigmoid represents a sigmoid (logistic) activation function layer; Linear represents a linear activation function layer; Global Max Pooling represents a global max pooling layer; Pooling 2D represents a pooling layer.
The processing procedure of the local attention block is explained below, taking the first local attention block as an example. The input sub-band features first pass through a two-dimensional convolution layer and a batch normalization layer. The output of the batch normalization layer is divided in half according to the number of feature maps; one half passes through a sigmoid activation function and the other half through a linear activation function, and the outputs of the two activation functions are multiplied element-wise. The result then passes through another two-dimensional convolution layer and batch normalization layer, whose output is again divided in half according to the number of feature maps, with one half passing through a sigmoid activation function and the other half through a linear activation function; the outputs of the two activation functions are again multiplied element-wise. Finally, this output (denoted A) is subjected to global max pooling to obtain a vector, the vector passes through two fully connected layers to obtain a new vector, and the new vector is up-sampled to obtain a feature map (denoted B) with the same dimensions as A. A and B are multiplied element-wise to obtain a new feature map, which is pooled to obtain the sub-band intermediate feature. In this embodiment, in order to preserve the accuracy of the time information of audio event detection, only the frequency axis may be pooled, and no pooling is performed on the time axis.
The sub-band intermediate feature is then input to the second local attention block, and processing continues with reference to the procedure of the first local attention block, until the last local attention block outputs the sub-band target feature.
Since a piece of audio data contains much information, part of the information is useful for audio event detection, and part of the information is useless. In this embodiment, by adopting the sub-band network based on the local attention mechanism, the transmission of the information flow in the sub-band network is controlled, and the important information is transmitted downwards, while the unimportant information is suppressed, so that the time distribution of the audio event can be obtained on the basis of ensuring the classification performance.
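The following sketch is one possible reading of the local attention block described above, written in PyTorch with a (batch, channels, time, frequency) layout; the kernel sizes, channel counts, bottleneck width of the two fully connected layers and the frequency-only pooling factor are assumptions rather than values given by the disclosure.

```python
# A sketch of the local attention block under the assumptions stated above.
import torch
import torch.nn as nn

class GatedConv(nn.Module):
    """Conv2D + BatchNorm whose output is split in half by feature maps: one half
    passes through a sigmoid gate, the other through a linear (identity)
    activation, and the two halves are multiplied element-wise."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, 2 * out_ch, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(2 * out_ch)
        self.out_ch = out_ch

    def forward(self, x):
        x = self.bn(self.conv(x))
        gate, lin = x[:, :self.out_ch], x[:, self.out_ch:]
        return torch.sigmoid(gate) * lin

class LocalAttentionBlock(nn.Module):
    def __init__(self, in_ch, out_ch, freq_pool=2):
        super().__init__()
        self.gc1 = GatedConv(in_ch, out_ch)
        self.gc2 = GatedConv(out_ch, out_ch)
        self.fc1 = nn.Linear(out_ch, out_ch // 2)        # two fully connected layers
        self.fc2 = nn.Linear(out_ch // 2, out_ch)
        self.pool = nn.MaxPool2d(kernel_size=(1, freq_pool))   # pool frequency only

    def forward(self, x):                       # x: (B, C_in, T, F)
        a = self.gc2(self.gc1(x))               # feature map A, (B, C, T, F)
        v = torch.amax(a, dim=(2, 3))           # global max pooling -> (B, C)
        w = self.fc2(torch.relu(self.fc1(v)))   # new vector, (B, C)
        b = w[:, :, None, None].expand_as(a)    # "up-sample" back to the size of A
        return self.pool(a * b)                 # multiply A and B, pool frequency axis
```

A sub-band network is then simply a stack of such blocks, for example nn.Sequential(LocalAttentionBlock(1, 64), LocalAttentionBlock(64, 64)), with one stack per sub-band feature.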
In an exemplary embodiment, as shown in fig. 8, in step S340, a category detection result and a time detection result of each audio event in the audio data are obtained according to the plurality of sub-band target features, including the steps of:
In step S341, a plurality of sub-band target features are fused, and an audio fusion feature corresponding to the audio data is generated.
In step S342, the audio fusion feature is input to the second neural network, so as to obtain an audio target feature corresponding to the audio data.
In step S343, a category detection result and a time detection result of each audio event in the audio data are determined according to the audio target feature.
Fig. 9 exemplarily shows a schematic structural diagram of the deep learning network in this embodiment, where the second neural network may be a convolutional neural network or a recurrent neural network. Specifically, after the audio data are acquired, features are extracted and divided by frequency to obtain a plurality of sub-band features, which are input to the deep learning network. Each sub-band network in the first neural network extracts the features of its corresponding sub-band to obtain the corresponding sub-band target feature, thereby completing the modeling of the frequency axis (i.e., space).
Each sub-band feature may be regarded as a low-level feature, and the sub-band target feature obtained after each sub-band feature has been processed by its sub-band network may be regarded as a high-level feature. High-level features are more discriminative than low-level features. In order to fully utilize the information of the high-level features and preserve the time domain information of the audio events, the plurality of sub-band target features can be spliced in the frequency domain, and the spliced feature is used as the audio fusion feature corresponding to the audio data. The audio fusion feature is input into the second neural network, which models the time axis to obtain the audio target feature. The audio event categories in the audio data, and the time distribution corresponding to each category, are then predicted based on the audio target feature. Different activation functions can be used to pool the audio target feature respectively, so as to obtain the categories of the audio events and the time information corresponding to each category.
In the embodiment, the first neural network is adopted to carry out space modeling, so that the high-level characteristic with distinguishing property is captured; further fusion and time information modeling are carried out based on the advanced features, so that the classification performance of the deep learning network can be improved, and the accuracy of audio event detection can be improved.
In an exemplary embodiment, a manner of fusing the multiple sub-band target features using a feature fusion network is described. In step S341, fusing the multiple sub-band target features to generate an audio fusion feature corresponding to the audio data includes: splicing the plurality of sub-band target features according to the frequency domain information; and convolving and pooling the spliced sub-band target features to obtain the audio fusion feature.
In particular, each sub-band feature may be regarded as a low-level feature, and the sub-band target feature obtained after each sub-band feature is processed by its sub-band network may be regarded as a high-level feature. High-level features are more discriminative than low-level features. In order to fully exploit the information of the high-level features and preserve the time domain information of the audio events, the multiple sub-band target features may be spliced in the frequency domain. After splicing, two layers of convolution and pooling are applied to the spliced features to obtain a new feature, which is used as the audio fusion feature.
In this embodiment, after the sub-band target features are spliced, the spliced features are further convolved and pooled, so that the fusion effect of the features can be improved.
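For illustration, a possible fusion step is sketched below, assuming the sub-band target features are four-dimensional tensors (batch, channels, time, frequency) with matching channel counts; the two convolution-plus-pooling stages pool only the frequency axis so that the frame count is preserved, as described above. The layer sizes are assumptions.

```python
# Concatenate sub-band target features on the frequency axis, then apply two
# convolution + pooling stages (pooling frequency only, preserving the frames).
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(1, 2)),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(1, 2)),
        )

    def forward(self, subband_targets):               # list of (B, C, T, F_i)
        fused = torch.cat(subband_targets, dim=3)     # splice along frequency
        return self.net(fused)                        # audio fusion feature
```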
In an exemplary embodiment, a manner of processing audio fusion features using a second neural network is described. The second neural network comprises two second sub-neural networks, and the two second sub-neural networks comprise different activation functions; in step S342, the audio fusion feature is input to the second neural network, so as to obtain an audio target feature corresponding to the audio data, which includes: inputting the audio fusion characteristics into each second sub-neural network to obtain audio intermediate characteristics output by each second sub-neural network; and according to the frequency domain information, splicing the audio intermediate characteristics respectively output by each second sub-neural network to obtain the audio target characteristics.
In particular, the second neural network and the second sub-neural networks may be recurrent neural networks, such as bidirectional gated recurrent units (BGRU), bidirectional recurrent neural networks (Bi-RNN), or long short-term memory networks (LSTM). The times at which audio events occur within a piece of audio are often continuous, so modeling the time information with a recurrent neural network can improve the prediction accuracy of the time results of the audio events. The different activation functions are used to model the time information of the audio fusion feature respectively, and may be, for example, a sigmoid activation function and a linear activation function; that is, one second sub-neural network employs a sigmoid activation function and the other employs a linear activation function. Two audio intermediate features are obtained through the two second sub-neural networks, and the two audio intermediate features are spliced along the frequency axis to obtain the audio target feature. The category and temporal distribution of the audio events are then predicted based on the audio target feature.
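A minimal sketch of this second neural network under one reading of the description follows: two bidirectional GRUs over the frame sequence, one followed by a sigmoid and one by a linear (identity) activation, with their outputs spliced on the feature axis. The hidden size is an assumption.

```python
# Two bidirectional GRUs with different activations, outputs concatenated.
import torch
import torch.nn as nn

class SecondNetwork(nn.Module):
    def __init__(self, in_dim, hidden=128):
        super().__init__()
        self.gru_sig = nn.GRU(in_dim, hidden, bidirectional=True, batch_first=True)
        self.gru_lin = nn.GRU(in_dim, hidden, bidirectional=True, batch_first=True)

    def forward(self, x):                       # x: (B, T, in_dim)
        s, _ = self.gru_sig(x)                  # (B, T, 2*hidden)
        l, _ = self.gru_lin(x)                  # (B, T, 2*hidden)
        return torch.cat([torch.sigmoid(s), l], dim=-1)   # audio target feature
```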
In an exemplary embodiment, an implementation of predicting category detection results and time detection results of audio events based on audio target features is described. As shown in fig. 10, in step S343, a category detection result and a time detection result of each audio event in the audio data are determined according to the audio target feature, comprising the steps of:
in step S3431, each frame of audio target features among the audio target features is acquired.
In step S3432, each frame of audio target feature is input to the full-connection layer including different activation functions, and a detection result corresponding to each frame of audio target feature is output.
In step S3433, a category detection result and a time detection result of each audio event in the audio data are determined according to the detection result corresponding to the audio object feature of each frame.
Wherein, different activation functions can be used for predicting the category and time of the audio event respectively, and the different activation functions are not limited to any two of sigmoid functions, softplus functions, softmax (normalized exponential functions) and the like. Specifically, since the time axis is not pooled through the first neural network, the feature fusion network and the second neural network, the number of frames of the audio target features output by the second neural network layer is the same as the number of frames of the audio features obtained by feature extraction of the audio data. In this embodiment, the features of each frame in the audio target features output by the second neural network layer pass through two independent full-connection layers respectively, and the number of neurons of the full-connection layers is determined according to the number of categories of audio events. The two full-connection layers adopt different activation functions, and the output of the different activation functions of the audio target characteristics of each frame can be obtained through the two full-connection layers. And respectively predicting the detection results of the category and time of the audio event according to the output of different activation functions.
Illustratively, one of the two fully connected layers may employ a sigmoid function and the other a softmax function. For the temporal prediction of audio events, the sigmoid outputs of all frames may be taken as the output of the time detection, from which the start and stop times of each audio event in the audio data can be derived. For the category prediction of audio events, the sigmoid outputs and softmax outputs of all frames are multiplied element-wise, i.e., the softmax outputs act as weights for a weighted summation of the sigmoid outputs over the time axis, and the result is taken as the numerator; the softmax outputs of all frames are summed over the time axis and taken as the denominator; the numerator is divided by the denominator, and the result is used as the classification result of the audio events. The category of each audio event is determined according to this classification result.
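The weighted pooling head can be sketched as follows, assuming the frame-level audio target features form a (batch, frames, dimension) tensor; the axis over which the softmax is normalized is an assumption, since the description does not state it explicitly.

```python
# Two independent fully connected layers per frame: the sigmoid output gives the
# frame-level (time) detection, and the clip-level class probability is the
# softmax-weighted average of the sigmoid output over the time axis.
import torch
import torch.nn as nn

class WeightedPoolingHead(nn.Module):
    def __init__(self, in_dim, num_classes):
        super().__init__()
        self.fc_sig = nn.Linear(in_dim, num_classes)
        self.fc_att = nn.Linear(in_dim, num_classes)

    def forward(self, x):                                  # x: (B, T, in_dim)
        frame_prob = torch.sigmoid(self.fc_sig(x))         # time detection, (B, T, C)
        att = torch.softmax(self.fc_att(x), dim=-1)        # attention weights, (B, T, C)
        clip_prob = (frame_prob * att).sum(dim=1) / att.sum(dim=1)   # (B, C)
        return frame_prob, clip_prob
```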
In an exemplary embodiment, the time detection result includes a start frame number and an end frame number for each audio event; after the category detection result and the time detection result of each audio event in the audio data are acquired, the method further comprises the following steps: acquiring the corresponding time length of each frame of audio data; and generating the starting time and the ending time corresponding to each audio event according to the time length corresponding to each frame of audio data, and the starting frame number and the ending frame number of each audio event.
Specifically, the audio event categories of the audio data and the time distribution corresponding to each category can be obtained through the deep learning network. Fig. 11 schematically shows the result output by the deep learning network. Referring to fig. 11, the abscissa of the output is the frame number and the ordinate is the category, so a start frame and an end frame are obtained for each audio event. The time length corresponding to each frame of audio data is acquired, for example 10 ms. The start time and end time of each audio event can then be obtained by multiplying the start frame number and the end frame number of each audio event by the time length of each frame.
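A trivial sketch of this conversion, assuming the 10 ms frame length used in the example above:

```python
# Convert detected start/end frame numbers into timestamps (assumed 10 ms frames).
def frames_to_time(start_frame, end_frame, frame_seconds=0.010):
    return start_frame * frame_seconds, end_frame * frame_seconds

print(frames_to_time(120, 385))   # (1.2, 3.85)
```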
In this embodiment, the timestamp of the audio event is obtained according to the result output by the deep learning network, so that the position of the audio event in the audio data can be directly obtained, thereby being convenient for directly performing time positioning.
Fig. 12 is a flowchart illustrating a method of detecting an audio event in particular, according to an exemplary embodiment, including the following steps, as shown in fig. 12.
In step S1201, audio data to be recognized is acquired. The audio data to be identified can be acquired by an audio acquisition device in real time, or can be obtained from the existing video and audio data.
In step S1202, feature extraction is performed on the audio data to be identified, and audio features are obtained. The process of feature extraction may be described with reference to the corresponding embodiment of fig. 3, and is not specifically described herein.
In step S1203, a plurality of frequency ranges configured in advance are acquired, and the audio feature is segmented according to the plurality of frequency ranges, so as to obtain a plurality of sub-band features corresponding to the plurality of frequency ranges, respectively.
In step S1204, a plurality of subband features are input to the deep learning network. Fig. 13 exemplarily shows a schematic structural diagram of the deep learning network. The deep learning network comprises a sub-band convolution neural network layer, a feature fusion network layer, a cyclic neural network layer and a weighted pooling layer, wherein the sub-band convolution neural network layer corresponds to the sub-band features respectively. Each subband convolutional neural network comprises a plurality of local attention blocks connected in sequence.
In step S1205, each of the subband features is input to the subband convolutional neural network corresponding to each of the subband features, and a plurality of subband target features are acquired.
In step S1206, a feature fusion network is used to splice the plurality of sub-band target features according to the frequency domain information, and the spliced sub-band target features are convolved and pooled to obtain the audio fusion feature.
In step S1207, the audio fusion feature is input to the recurrent neural network, so as to obtain an audio target feature corresponding to the audio data.
The recurrent neural network may employ two bidirectional gated recurrent units (BGRU). The audio fusion feature is passed through the two BGRUs; one BGRU employs sigmoid as the activation function and the other employs a linear activation function. The outputs of the two BGRUs are spliced along the frequency axis to obtain the audio target feature.
In step S1208, the two full-connection layers in the weighted pooling layer are adopted to process the audio target features respectively, and the category detection result of each audio event in the audio data, and the time detection results of the start frame number and the end frame number are determined according to the output of the full-connection layers.
In step 1209, a time length corresponding to each frame of audio data is acquired; and generating the starting time and the ending time corresponding to each audio event according to the time length corresponding to each frame of audio data, and the starting frame number and the ending frame number of each audio event.
The training process of the deep learning network is described below. The deep learning network is trained end to end. In order to improve the performance of the deep learning network, videos uploaded to the network can be used as audio data samples for training. Feature extraction is performed on each audio data sample to obtain an audio feature sample. Each audio feature sample is labeled to obtain the category label and time label of the audio events of each audio data sample, and a training sample set is generated. The training sample set is then input to the deep learning network to be trained, which predicts the category and time of the audio events of each audio feature sample. A loss value is calculated from the prediction result and the labeling information using a preset loss function (such as a cross-entropy loss function), and the model parameters are updated with a back-propagation algorithm until a preset stop condition is reached. The preset stop condition may be that a preset number of iterations is reached or that the loss value no longer decreases.
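For illustration, a minimal end-to-end training loop under the assumptions of this section is sketched below; the model, data loader, label format and hyper-parameters are placeholders, and binary cross-entropy is used as the cross-entropy style loss for the multi-label frame and clip outputs.

```python
# Minimal end-to-end training sketch: the model is assumed to return
# frame-level and clip-level probabilities, and labels are float multi-hot tensors.
import torch
import torch.nn as nn

def train(model, loader, epochs=50, lr=1e-3, device="cpu"):
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.BCELoss()                       # cross-entropy style loss
    for epoch in range(epochs):                    # preset number of iterations
        for feats, frame_labels, clip_labels in loader:
            feats = feats.to(device)
            frame_prob, clip_prob = model(feats)   # predictions from the network
            loss = criterion(frame_prob, frame_labels.to(device)) \
                 + criterion(clip_prob, clip_labels.to(device))
            optimizer.zero_grad()
            loss.backward()                        # back-propagation
            optimizer.step()
```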
It should be understood that, although the steps in the flowcharts of fig. 1-13 are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the order of execution of the steps is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in fig. 1-13 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times; these sub-steps or stages are not necessarily performed in sequence, and may be performed in turn or in alternation with at least a portion of the sub-steps or stages of other steps.
Fig. 14 is a block diagram illustrating an audio event detection apparatus 1400 according to an example embodiment. Referring to fig. 14, the apparatus includes an audio feature acquisition module 1401, a subband feature generation module 1402, a first feature generation module 1403, and a detection result generation module 1404.
An audio feature acquisition module 1401 configured to perform acquisition of audio features corresponding to audio data to be identified;
A subband feature generating module 1402 configured to perform division of the audio features according to frequency domain information of the audio features, generating a plurality of subband features;
a first feature generating module 1403 configured to perform feature extraction on the plurality of subband features, respectively, to obtain a plurality of subband target features;
The detection result generation module 1404 is configured to obtain a category detection result and a time detection result of each audio event in the audio data according to the plurality of sub-band target features.
In an exemplary embodiment, the first feature generation module 1403 is configured to perform inputting the plurality of subband features into a first neural network, resulting in a plurality of subband target features, the first neural network including a plurality of subband networks, each subband network corresponding to a subband feature.
In an exemplary embodiment, the sub-band network comprises a plurality of sequentially connected local attention blocks; a first feature generation module configured to perform: inputting each sub-band feature into a first local attention block in a sub-band network corresponding to each sub-band feature for each sub-band feature to obtain a sub-band intermediate feature; and sequentially inputting the intermediate sub-band features to the next local attention block until the sub-band target features corresponding to each sub-band feature are output.
In an exemplary embodiment, the detection result generation module 1404 includes: the feature fusion module is configured to perform fusion of the plurality of sub-band target features and generate audio fusion features corresponding to the audio data; the second feature generation module is configured to input the audio fusion feature into the second neural network to obtain an audio target feature corresponding to the audio data; and the time and category determining module is configured to determine a category detection result and a time detection result of each audio event in the audio data according to the audio target characteristics.
In an exemplary embodiment, the feature fusion module includes: a first splicing unit configured to perform splicing of a plurality of sub-band target features according to the frequency domain information; and the fusion unit is configured to perform convolution and pooling processing on the spliced sub-band target characteristics to obtain audio fusion characteristics.
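This splicing-and-fusion step may be sketched as follows, assuming each sub-band target feature is a tensor whose last axis is the frequency axis; the kernel size, channel count, and choice of average pooling below are illustrative assumptions.

import torch
import torch.nn as nn

def fuse_sub_band_features(sub_band_targets, conv, pool):
    # sub_band_targets: list of tensors shaped (batch, channels, time, freq_i).
    spliced = torch.cat(sub_band_targets, dim=-1)   # splice along the frequency axis
    return pool(conv(spliced))                      # convolution + pooling -> audio fusion feature

# Illustrative fusion layers (sizes are assumptions):
conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)
pool = nn.AvgPool2d(kernel_size=(1, 2))             # pool along frequency only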
In an exemplary embodiment, the second neural network comprises two second sub-neural networks, the two second sub-neural networks comprising different activation functions; a second feature generation module comprising: the feature generation unit is configured to input the audio fusion feature to each second sub-neural network to obtain an audio intermediate feature output by each second sub-neural network; and the second splicing unit is configured to splice the audio intermediate characteristics respectively output by each second sub-neural network according to the frequency domain information to obtain audio target characteristics.
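A minimal sketch of such a second neural network with two second sub-neural networks is shown below; using tanh and ReLU as the two different activation functions, and the layer sizes, are assumptions.

import torch
import torch.nn as nn

class SecondNeuralNetwork(nn.Module):
    # Two parallel second sub-neural networks that differ in their activation
    # function; their outputs are spliced to form the audio target feature.
    def __init__(self, feature_dim, hidden_dim):
        super().__init__()
        self.branch_a = nn.Sequential(nn.Linear(feature_dim, hidden_dim), nn.Tanh())
        self.branch_b = nn.Sequential(nn.Linear(feature_dim, hidden_dim), nn.ReLU())

    def forward(self, audio_fusion_feature):
        # audio_fusion_feature: (batch, time, feature_dim)
        intermediate_a = self.branch_a(audio_fusion_feature)   # audio intermediate feature 1
        intermediate_b = self.branch_b(audio_fusion_feature)   # audio intermediate feature 2
        return torch.cat([intermediate_a, intermediate_b], dim=-1)  # spliced audio target feature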
In an exemplary embodiment, the time and category determination module includes: an acquisition unit configured to perform acquisition of each frame of audio target features in the audio target features; a per-frame detection result generating unit configured to perform respectively inputting each frame of audio target features into full-connection layers containing different activation functions and outputting the detection result corresponding to each frame of audio target features; and a time and category determining unit configured to perform determination of a category detection result and a time detection result of each audio event in the audio data according to the detection result corresponding to each frame of audio target features.
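The per-frame detection step may be sketched as two fully connected heads applied frame by frame; treating one head as a sigmoid class head and the other as a softmax-weighted head, and every layer size, are assumptions rather than details fixed by this embodiment.

import torch
import torch.nn as nn

class FrameWiseHeads(nn.Module):
    # Fully connected layers with different activation functions applied to
    # every frame of the audio target feature.
    def __init__(self, feature_dim, num_classes):
        super().__init__()
        self.class_head = nn.Sequential(nn.Linear(feature_dim, num_classes), nn.Sigmoid())
        self.time_head = nn.Sequential(nn.Linear(feature_dim, num_classes), nn.Softmax(dim=-1))

    def forward(self, audio_target_feature):
        # audio_target_feature: (batch, frames, feature_dim); nn.Linear acts per frame.
        class_scores = self.class_head(audio_target_feature)   # per-frame class detection result
        time_weights = self.time_head(audio_target_feature)    # per-frame weights used for timing
        return class_scores, time_weights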
In an exemplary embodiment, the subband feature generating module is configured to perform: acquiring a plurality of frequency ranges which are configured in advance; the audio features are divided according to a plurality of frequency ranges, and sub-band features corresponding to each frequency range are generated.
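A minimal sketch of this division, assuming the audio feature is a (frames x frequency bins) matrix and the pre-configured frequency ranges are given as bin-index ranges (the concrete ranges below are examples only):

import numpy as np

def split_into_sub_bands(audio_feature, frequency_ranges):
    # audio_feature: array of shape (frames, freq_bins);
    # frequency_ranges: pre-configured list of (low_bin, high_bin) pairs.
    return [audio_feature[:, lo:hi] for lo, hi in frequency_ranges]

# Example: a 128-bin spectral feature split into low / mid / high sub-bands.
audio_feature = np.random.rand(500, 128)
sub_bands = split_into_sub_bands(audio_feature, [(0, 32), (32, 80), (80, 128)])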
In an exemplary embodiment, the time detection result includes a start frame number and an end frame number for each audio event; the acquisition module is further configured to perform acquisition of a time length corresponding to each frame of audio data; the apparatus further comprises: and the start-stop time generation module is configured to generate a start time and an end time corresponding to each audio event according to the time length corresponding to each frame of audio data and the start frame number and the end frame number of each audio event.
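The conversion from frame numbers to start and end times is simple arithmetic; a sketch follows, where the 40 ms frame length is an example value only.

def frames_to_seconds(start_frame, end_frame, frame_length_s=0.04):
    # Convert start/end frame numbers into start/end times using the time
    # length corresponding to each frame of audio data.
    return start_frame * frame_length_s, (end_frame + 1) * frame_length_s

start_time, end_time = frames_to_seconds(start_frame=25, end_frame=75)
# -> the event spans roughly 1.00 s to 3.04 s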
Fig. 15 is a block diagram of a specific audio event detection apparatus 1500, according to an example embodiment. Referring to fig. 15, the apparatus includes an audio feature extraction module 1510, an audio feature segmentation module 1520, and an audio feature detection module 1530. The audio feature extraction module 1510 is configured to perform feature extraction on the audio data to be identified to obtain an audio feature. The audio feature segmentation module 1520 is configured to perform segmentation of the audio feature according to the acquired plurality of frequency ranges, resulting in a plurality of sub-band features respectively corresponding to the plurality of frequency ranges. The audio feature detection module 1530 is configured to perform detection of the category and time of each audio event based on the plurality of sub-band features.
The audio feature detection module 1530 includes a plurality of sub-band feature extraction modules 1531, a feature cascade module 1532, an audio event classification module 1533, and an audio time detection module 1534. The sub-band feature extraction module 1531 is configured to perform feature extraction on a sub-band feature using a sub-band convolutional neural network to obtain a sub-band target feature. The feature cascade module 1532 is configured to perform fusion of the plurality of sub-band target features, generate an audio fusion feature corresponding to the audio data, and obtain an audio target feature corresponding to the audio data from the audio fusion feature by using a recurrent neural network. The audio event classification module 1533 is configured to perform deriving a category result of each audio event from the audio target feature. The audio time detection module 1534 is configured to perform deriving a time result of each audio event from the audio target feature.
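Putting the modules of fig. 15 together, an end-to-end sketch might look as follows; the number of sub-bands, all layer sizes, the use of a GRU as the recurrent neural network, and the number of event classes are assumptions.

import torch
import torch.nn as nn

class AudioEventDetector(nn.Module):
    # Sketch of fig. 15: one convolutional network per sub-band (1531),
    # frequency-axis splicing plus convolution/pooling for fusion and a
    # recurrent network (1532), and frame-wise class (1533) / time (1534) heads.
    def __init__(self, num_sub_bands=3, channels=16, hidden=64, num_classes=10):
        super().__init__()
        self.sub_band_cnns = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(1, channels, 3, padding=1), nn.ReLU())
             for _ in range(num_sub_bands)])
        self.fusion = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((None, 1)))     # keep the time axis, pool frequency to 1
        self.rnn = nn.GRU(channels, hidden, batch_first=True, bidirectional=True)
        self.class_head = nn.Sequential(nn.Linear(2 * hidden, num_classes), nn.Sigmoid())
        self.time_head = nn.Sequential(nn.Linear(2 * hidden, num_classes), nn.Sigmoid())

    def forward(self, sub_band_features):
        # sub_band_features: list of tensors, each shaped (batch, 1, time, freq_i).
        targets = [cnn(x) for cnn, x in zip(self.sub_band_cnns, sub_band_features)]
        spliced = torch.cat(targets, dim=-1)              # splice along frequency
        fused = self.fusion(spliced).squeeze(-1)          # (batch, channels, time)
        audio_target, _ = self.rnn(fused.transpose(1, 2)) # (batch, time, 2 * hidden)
        return self.class_head(audio_target), self.time_head(audio_target)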
The specific manner in which the various modules perform operations in the apparatus of the above embodiments has been described in detail in connection with the embodiments of the method, and will not be described in detail here.
Fig. 16 is a block diagram illustrating an electronic device 1600 for detection of audio events, according to an example embodiment. For example, electronic device 1600 may be a mobile telephone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, exercise device, personal digital assistant, or the like.
Referring to fig. 16, the electronic device 1600 may include one or more of the following components: a processing component 1602, a memory 1604, a power component 1606, a multimedia component 1608, an audio component 1610, an input/output (I/O) interface 1612, a sensor component 1614, and a communication component 1616.
The processing component 1602 generally controls overall operation of the electronic device 1600, such as operations associated with display, telephone call, data communication, camera operation, and recording operations. The processing component 1602 may include one or more processors 1620 to execute instructions to perform all or part of the steps of the methods described above. In addition, the processing component 1602 may include one or more modules that facilitate interactions between the processing component 1602 and other components. For example, the processing component 1602 may include a multimedia module to facilitate interactions between the multimedia component 1608 and the processing component 1602.
The memory 1604 is configured to store various types of data to support operations at the electronic device 1600. Examples of such data include instructions for any application or method operating on the electronic device 1600, contact data, phonebook data, messages, pictures, video, and so forth. The memory 1604 may be implemented by any type of volatile or non-volatile memory device, or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
The power supply component 1606 provides power to the various components of the electronic device 1600. Power supply component 1606 can include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for electronic device 1600.
The multimedia component 1608 includes a screen that provides an output interface between the electronic device 1600 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action, but also the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 1608 includes a front-facing camera and/or a rear-facing camera. When the electronic device 1600 is in an operation mode, such as a capture mode or a video mode, the front-facing camera and/or the rear-facing camera may receive external multimedia data. Each of the front-facing camera and the rear-facing camera may be a fixed optical lens system or have focal length and optical zoom capability.
The audio component 1610 is configured to output and/or input audio signals. For example, the audio component 1610 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 1600 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 1604 or transmitted via the communication component 1616. In some embodiments, the audio component 1610 further includes a speaker for outputting audio signals.
The I/O interface 1612 provides an interface between the processing component 1602 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 1614 includes one or more sensors for providing status assessments of various aspects of the electronic device 1600. For example, the sensor assembly 1614 may detect an on/off state of the electronic device 1600 and a relative positioning of components, such as the display and keypad of the electronic device 1600. The sensor assembly 1614 may also detect a change in position of the electronic device 1600 or a component of the electronic device 1600, the presence or absence of user contact with the electronic device 1600, the orientation or acceleration/deceleration of the electronic device 1600, and a change in temperature of the electronic device 1600. The sensor assembly 1614 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly 1614 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 1614 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 1616 is configured to facilitate communication between the electronic device 1600 and other devices, either wired or wireless. The electronic device 1600 may access a wireless network based on a communication standard, such as WiFi, an operator network (e.g., 2G, 3G, 4G, or 5G), or a combination thereof. In one exemplary embodiment, the communication component 1616 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 1616 also includes a Near Field Communication (NFC) module to facilitate short range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, ultra Wideband (UWB) technology, bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 1600 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for executing the methods described above.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as a memory 1604 that includes instructions executable by the processor 1620 of the electronic device 1600 to perform the above-described methods. For example, the non-transitory computer readable storage medium may be ROM, random Access Memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (18)

1. A method for detecting an audio event, comprising:
Acquiring audio characteristics corresponding to audio data to be identified;
Dividing the audio features according to the frequency domain information of the audio features to generate a plurality of sub-band features;
Respectively extracting the characteristics of the plurality of sub-band characteristics to obtain a plurality of sub-band target characteristics;
Obtaining a category detection result and a time detection result of each audio event in the audio data according to the plurality of sub-band target characteristics;
The step of obtaining a category detection result and a time detection result of each audio event in the audio data according to the plurality of sub-band target features includes: splicing the plurality of sub-band target features on a frequency domain to obtain an audio fusion feature corresponding to the audio data; inputting the audio fusion feature into a second neural network to obtain an audio target feature corresponding to the audio data; and determining a category detection result and a time detection result of each audio event in the audio data according to the audio target feature.
2. The method for detecting an audio event according to claim 1, wherein the feature extraction is performed on the plurality of subband features to obtain a plurality of subband target features, respectively, and the method comprises:
And inputting the plurality of sub-band features into a first neural network to obtain the plurality of sub-band target features, wherein the first neural network comprises a plurality of sub-band networks, and each sub-band network corresponds to one sub-band feature.
3. The method of detecting an audio event according to claim 2, wherein the sub-band network comprises a plurality of sequentially connected local attention blocks; the inputting the plurality of sub-band features to a first neural network to obtain the plurality of sub-band target features includes:
Inputting each sub-band feature into a first local attention block in a sub-band network corresponding to each sub-band feature to obtain a sub-band intermediate feature;
and sequentially inputting the intermediate sub-band features to the next local attention block until the sub-band target features corresponding to each sub-band feature are output.
4. The method for detecting an audio event according to claim 1, wherein the splicing the plurality of sub-band target features in a frequency domain to obtain an audio fusion feature corresponding to the audio data includes:
Splicing the plurality of sub-band target features according to the frequency domain information;
and performing convolution and pooling processing on the spliced sub-band target features to obtain the audio fusion feature.
5. The method of claim 1, wherein the second neural network comprises two second sub-neural networks, the two second sub-neural networks comprising different activation functions; and the inputting the audio fusion feature into a second neural network to obtain an audio target feature corresponding to the audio data comprises:
Inputting the audio fusion feature into each second sub-neural network to obtain an audio intermediate feature output by each second sub-neural network;
And according to the frequency domain information, splicing the audio intermediate features respectively output by each second sub-neural network to obtain the audio target feature.
6. The method for detecting audio events according to claim 1, wherein determining a category detection result and a time detection result of each audio event in the audio data according to the audio target feature comprises:
acquiring each frame of audio target characteristics in the audio target characteristics;
Respectively inputting the audio target characteristics of each frame to a full-connection layer containing different activation functions, and outputting detection results corresponding to the audio target characteristics of each frame;
and determining a category detection result and a time detection result of each audio event in the audio data according to the detection result corresponding to the audio target characteristics of each frame.
7. The method for detecting an audio event according to any one of claims 1 to 6, wherein the dividing the audio feature according to the frequency domain information of the audio feature to generate a plurality of subband features includes:
Acquiring a plurality of frequency ranges which are configured in advance;
Dividing the audio features according to the plurality of frequency ranges to generate sub-band features corresponding to each frequency range.
8. The method according to any one of claims 1 to 6, wherein the time detection result includes a start frame number and an end frame number of each audio event; after the category detection result and the time detection result of each audio event in the audio data are obtained, the method further comprises the following steps:
Acquiring the corresponding time length of each frame of audio data;
And generating the starting time and the ending time corresponding to each audio event according to the time length corresponding to each frame of audio data, and the starting frame number and the ending frame number of each audio event.
9. An audio event detection apparatus, comprising:
The audio feature acquisition module is configured to acquire audio features corresponding to the audio data to be identified;
A sub-band feature generation module configured to perform division of the audio features according to frequency domain information of the audio features, generating a plurality of sub-band features;
The first characteristic generating module is configured to perform characteristic extraction on the plurality of sub-band characteristics respectively to obtain a plurality of sub-band target characteristics;
A detection result generation module configured to perform obtaining a category detection result and a time detection result of each audio event in the audio data according to the plurality of sub-band target features;
Wherein, the detection result generation module comprises: the feature fusion module is configured to splice the plurality of sub-band target features on a frequency domain to obtain audio fusion features corresponding to the audio data; the second feature generation module is configured to input the audio fusion feature into a second neural network to obtain an audio target feature corresponding to the audio data; and a time and category determination module configured to perform determining a category detection result and a time detection result for each audio event in the audio data based on the audio target feature.
10. The apparatus according to claim 9, wherein the first feature generation module is configured to perform inputting the plurality of subband features into a first neural network, to obtain the plurality of subband target features, the first neural network including a plurality of subband networks, each subband network corresponding to one subband feature.
11. The apparatus for detecting an audio event according to claim 10, wherein the sub-band network comprises a plurality of local attention blocks connected in sequence; the first feature generation module is configured to perform:
Inputting each sub-band feature into a first local attention block in a sub-band network corresponding to each sub-band feature to obtain a sub-band intermediate feature;
and sequentially inputting the intermediate sub-band features to the next local attention block until the sub-band target features corresponding to each sub-band feature are output.
12. The apparatus for detecting an audio event according to claim 9, wherein the feature fusion module comprises:
a first splicing unit configured to perform splicing of the plurality of sub-band target features according to the frequency domain information;
And the fusion unit is configured to perform convolution and pooling processing on the spliced sub-band target characteristics to obtain the audio fusion characteristics.
13. The apparatus for detecting an audio event according to claim 9, wherein the second neural network comprises two second sub-neural networks, the two second sub-neural networks containing different activation functions; the second feature generation module includes:
A feature generating unit configured to perform inputting the audio fusion feature to each second sub-neural network, so as to obtain an audio intermediate feature output by each second sub-neural network;
And the second splicing unit is configured to splice the audio intermediate characteristics respectively output by each second sub-neural network according to the frequency domain information to obtain the audio target characteristics.
14. The apparatus for detecting an audio event according to claim 9, wherein the time and class determination module comprises:
an acquisition unit configured to perform acquisition of each frame of audio target features of the audio target features;
each frame of audio detection result generating unit is configured to perform the steps of respectively inputting each frame of audio target characteristics to a full-connection layer containing different activation functions, and outputting detection results corresponding to each frame of audio target characteristics;
And the time and category determining unit is configured to determine a category detection result and a time detection result of each audio event in the audio data according to the detection result corresponding to the audio target feature of each frame.
15. The apparatus according to any one of claims 9 to 14, wherein the subband feature generating module is configured to perform:
Acquiring a plurality of frequency ranges which are configured in advance;
Dividing the audio features according to the plurality of frequency ranges to generate sub-band features corresponding to each frequency range.
16. The apparatus according to any one of claims 9 to 14, wherein the time detection result includes a start frame number and an end frame number of each audio event; the acquisition module is further configured to perform acquisition of a time length corresponding to each frame of audio data;
The apparatus further comprises: and the start-stop time generation module is configured to generate a start time and an end time corresponding to each audio event according to the time length corresponding to each frame of audio data and the start frame number and the end frame number of each audio event.
17. An electronic device, comprising:
A processor;
A memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method of detecting an audio event as claimed in any one of claims 1 to 8.
18. A storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the method of detecting an audio event according to any one of claims 1 to 8.
CN202010693055.0A 2020-07-17 2020-07-17 Audio event detection method and device, electronic equipment and storage medium Active CN111899760B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010693055.0A CN111899760B (en) 2020-07-17 2020-07-17 Audio event detection method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111899760A CN111899760A (en) 2020-11-06
CN111899760B true CN111899760B (en) 2024-05-07

Family

ID=73190184

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010693055.0A Active CN111899760B (en) 2020-07-17 2020-07-17 Audio event detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111899760B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113362854B (en) * 2021-06-03 2022-11-15 哈尔滨工业大学 Sparse self-attention mechanism-based acoustic event detection method, system, storage medium and equipment
CN114596879B (en) * 2022-03-25 2022-12-30 北京远鉴信息技术有限公司 False voice detection method and device, electronic equipment and storage medium
CN115116469B (en) * 2022-05-25 2024-03-15 腾讯科技(深圳)有限公司 Feature representation extraction method, device, equipment, medium and program product
CN114758665B (en) * 2022-06-14 2022-09-02 深圳比特微电子科技有限公司 Audio data enhancement method and device, electronic equipment and storage medium
CN115312074B (en) * 2022-10-10 2022-12-20 江苏米笛声学科技有限公司 Cloud server based on audio processing

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108491817A (en) * 2018-03-30 2018-09-04 国信优易数据有限公司 A kind of event detection model training method, device and event detecting method
CN108648748A (en) * 2018-03-30 2018-10-12 沈阳工业大学 Acoustic events detection method under hospital noise environment
CN109616142A (en) * 2013-03-26 2019-04-12 杜比实验室特许公司 Device and method for audio classification and processing
CN110010156A (en) * 2017-12-07 2019-07-12 英特尔公司 The sound event of modeling based on the sequence to event subdivision detects
CN110223715A (en) * 2019-05-07 2019-09-10 华南理工大学 It is a kind of based on sound event detection old solitary people man in activity estimation method
CN110600059A (en) * 2019-09-05 2019-12-20 Oppo广东移动通信有限公司 Acoustic event detection method and device, electronic equipment and storage medium
CN110718234A (en) * 2019-09-02 2020-01-21 江苏师范大学 Acoustic scene classification method based on semantic segmentation coding and decoding network
CN111312288A (en) * 2020-02-20 2020-06-19 阿基米德(上海)传媒有限公司 Broadcast audio event processing method, system and computer readable storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9805739B2 (en) * 2015-05-15 2017-10-31 Google Inc. Sound event detection
GB2577570A (en) * 2018-09-28 2020-04-01 Cirrus Logic Int Semiconductor Ltd Sound event detection

Also Published As

Publication number Publication date
CN111899760A (en) 2020-11-06

Similar Documents

Publication Publication Date Title
CN111899760B (en) Audio event detection method and device, electronic equipment and storage medium
US20210248718A1 (en) Image processing method and apparatus, electronic device and storage medium
CN110782034A (en) Neural network training method, device and storage medium
CN110598504B (en) Image recognition method and device, electronic equipment and storage medium
CN111968635B (en) Speech recognition method, device and storage medium
CN110751659B (en) Image segmentation method and device, terminal and storage medium
JP2022522551A (en) Image processing methods and devices, electronic devices and storage media
CN110992979B (en) Detection method and device and electronic equipment
CN108831508A (en) Voice activity detection method, device and equipment
CN111553464B (en) Image processing method and device based on super network and intelligent equipment
CN110648656A (en) Voice endpoint detection method and device, electronic equipment and storage medium
CN106203306A (en) The Forecasting Methodology at age, device and terminal
CN110930984A (en) Voice processing method and device and electronic equipment
CN112562675A (en) Voice information processing method, device and storage medium
CN112070235A (en) Abnormity positioning method and device of deep learning framework and storage medium
CN114333804B (en) Audio classification recognition method and device, electronic equipment and storage medium
CN111814538A (en) Target object type identification method and device, electronic equipment and storage medium
CN114446318A (en) Audio data separation method and device, electronic equipment and storage medium
CN111428806B (en) Image tag determining method and device, electronic equipment and storage medium
CN115039169A (en) Voice instruction recognition method, electronic device and non-transitory computer readable storage medium
CN111583958A (en) Audio signal processing method, audio signal processing device, electronic equipment and storage medium
CN111046780A (en) Neural network training and image recognition method, device, equipment and storage medium
CN115953710A (en) Behavior recognition method and device, electronic equipment and storage medium
CN115994266A (en) Resource recommendation method, device, electronic equipment and storage medium
CN115547308A (en) Audio recognition model training method, audio recognition device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant