CN113593603A - Audio category determination method and device, storage medium and electronic device - Google Patents

Audio category determination method and device, storage medium and electronic device

Info

Publication number
CN113593603A
Authority
CN
China
Prior art keywords
target
audio data
data
training
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110853406.4A
Other languages
Chinese (zh)
Inventor
张锦铖
史巍
林聚财
殷俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd filed Critical Zhejiang Dahua Technology Co Ltd
Priority to CN202110853406.4A priority Critical patent/CN113593603A/en
Publication of CN113593603A publication Critical patent/CN113593603A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the invention provide an audio category determination method and device, a storage medium, and an electronic device, wherein the method comprises: determining target features of acquired target audio data; analyzing the target features using a target model to determine a first probability that each frame of audio data included in the target audio data belongs to each sound category, wherein the target model is trained by machine learning using multiple sets of target training data, and each set of data in the multiple sets comprises: features of the audio data and the sound category of each frame of audio, each set of data being data obtained after enhancement processing, and the target model comprising a plurality of convolutional layers; and determining, based on the first probability, a target sound category to which each frame of audio data included in the target audio data belongs. The invention solves the problem in the related art that determining audio categories places high requirements on training audio.

Description

Audio category determination method and device, storage medium and electronic device
Technical Field
Embodiments of the invention relate to the field of communications, and in particular to an audio category determination method and device, a storage medium, and an electronic device.
Background
Because sound, like video, carries abundant information, it can provide omnidirectional monitoring without the blind spots of a camera, and audio event detection is therefore receiving growing attention across industries.
The conventional techniques commonly used for audio event detection are GMM, HMM, and SVM; being simple to implement and light on hardware resources, they have long been a focus of audio detection and recognition research. With the development of AI chips, however, computing power has improved greatly, and deep learning has gradually entered the field of view and become a research hotspot in recent years. Owing to its successful application in computer vision, various deep learning networks have gradually been applied across audio tasks such as speech recognition, audio event detection, audio scene classification, and speaker recognition, with performance far exceeding that of conventional machine learning algorithms. However, deep learning depends on the size of the dataset: if the quality or scale of the dataset is unsatisfactory, deep learning may perform worse than conventional machine learning algorithms.
The related art therefore suffers from the problem that determining audio categories places high requirements on training audio.
In view of this problem in the related art, no effective solution has yet been proposed.
Disclosure of Invention
The embodiments of the invention provide an audio category determination method and device, a storage medium, and an electronic device, so as to at least solve the problem in the related art that determining audio categories places high requirements on training audio.
According to an embodiment of the present invention, there is provided an audio category determination method including: determining target features of acquired target audio data; analyzing the target features using a target model to determine a first probability that each frame of audio data included in the target audio data belongs to each sound category, wherein the target model is trained by machine learning using multiple sets of target training data, each set of which includes features of audio data and the sound category of each frame of audio, each set of data is data obtained after enhancement processing, and the target model includes a plurality of convolutional layers; and determining, based on the first probability, a target sound category to which each frame of audio data included in the target audio data belongs.
According to another embodiment of the present invention, there is provided an audio category determination apparatus including: a first determining module, configured to determine target features of acquired target audio data; an analysis module, configured to analyze the target features using a target model to determine a first probability that each frame of audio data included in the target audio data belongs to each sound category, wherein the target model is trained by machine learning using multiple sets of target training data, each set of which includes features of audio data and the sound category of each frame of audio, each set of data is data obtained after enhancement processing, and the target model includes a plurality of convolutional layers; and a second determining module, configured to determine, based on the first probability, a target sound category to which each frame of audio data included in the target audio data belongs.
According to yet another embodiment of the invention, there is also provided a computer-readable storage medium having a computer program stored therein, wherein the computer program, when executed by a processor, implements the steps of the method as set forth in any of the above.
According to yet another embodiment of the present invention, there is also provided an electronic device, including a memory in which a computer program is stored and a processor configured to execute the computer program to perform the steps in any of the above method embodiments.
In the present application, the target features of the acquired target audio data are determined, the target features are analyzed with the target model to determine the first probability that each frame of audio data included in the target audio data belongs to each sound category, and the target sound category of each frame is determined from the first probability. Since the target model is trained on the features of enhancement-processed audio data together with the sound category of each frame, and the training data has undergone enhancement processing, model training can be achieved without collecting high-quality audio data. This solves the problem in the related art that determining audio categories places high requirements on training audio, and achieves the effects of making training data easy to obtain and improving the accuracy of audio category determination.
Drawings
Fig. 1 is a block diagram of a hardware structure of a mobile terminal according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method of audio class determination according to an embodiment of the present invention;
FIG. 3 is a diagram of a target model structure according to an exemplary embodiment of the present invention;
FIG. 4 is a flow chart of a method for determining audio class according to an embodiment of the present invention;
fig. 5 is a block diagram of the structure of an apparatus for determining an audio class according to an embodiment of the present invention.
Detailed Description
Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings in conjunction with the embodiments.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
In office and home scenarios there are rich sound events, including washing machines, refrigerators, vacuum cleaners, alarms, keyboard sounds, chair-dragging sounds, and so on. If the type of a sound event can be determined, more information can be provided to downstream algorithms; for example, to suppress a particular sound, the sound type is identified first and the suppression algorithm is then applied. However, some sounds are difficult to acquire, and insufficient data richness for particular types causes a sample-imbalance problem. Meanwhile, deep learning classification networks require large amounts of data; if this requirement is not met, overfitting occurs and performance degrades severely. Moreover, many network models are complex and large (for example, an LSTM network cannot be parallelized), and such complex neural networks cannot be applied on low-performance embedded platforms.
In view of the above problems, the following embodiments are proposed.
The method embodiments provided in the embodiments of the present application may be executed in a mobile terminal, a computer terminal, or a similar computing device. Taking the example of the method running on a mobile terminal, fig. 1 is a block diagram of a hardware structure of the mobile terminal of the method for determining an audio class according to the embodiment of the present invention. As shown in fig. 1, the mobile terminal may include one or more (only one shown in fig. 1) processors 102 (the processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA), and a memory 104 for storing data, wherein the mobile terminal may further include a transmission device 106 for communication functions and an input-output device 108. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration, and does not limit the structure of the mobile terminal. For example, the mobile terminal may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
The memory 104 may be used to store a computer program, for example, a software program and a module of an application software, such as a computer program corresponding to the audio category determination method in the embodiment of the present invention, and the processor 102 executes various functional applications and data processing by running the computer program stored in the memory 104, so as to implement the method described above. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the mobile terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the mobile terminal. In one example, the transmission device 106 includes a Network adapter (NIC) that can be connected to other Network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used to communicate with the internet in a wireless manner.
In the present embodiment, a method for determining an audio class is provided, and fig. 2 is a flowchart of a method for determining an audio class according to an embodiment of the present invention, as shown in fig. 2, the flowchart includes the following steps:
step S202, determining the target characteristics of the acquired target audio data;
step S204, analyzing the target features by using a target model to determine a first probability that each frame of audio data included in the target audio data belongs to each sound category, wherein the target model is trained by machine learning using multiple sets of target training data, and each set of data in the multiple sets of target training data includes: the target model comprises characteristics of audio data and sound types of each frame of audio, each group of data is data obtained after enhancement processing, and the target model comprises a plurality of convolutional layers;
step S206, determining a target sound category to which each frame of audio data included in the target audio data belongs based on the first probability.
In the above embodiment, the target features may be obtained through audio feature extraction and normalization, that is, selecting the feature quantity and systematically normalizing channels and amplitude. The feature extraction step may adopt the log mel-band spectrogram, which is widely influential in audio pattern recognition; the Mixup and SpecAugment methods adopted for audio data enhancement operate on this log mel-band spectrogram. The normalization step may adopt min-max normalization to control the dynamic range of the audio. For example, two-channel audio is averaged and converted to a single channel before being input to the system, and applying min-max normalization to the transformed spectrogram controls the dynamic range of the audio so that the network converges faster.
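A minimal sketch of this feature-extraction step follows, assuming the librosa library; the sample rate, frame, and mel-band parameters are illustrative assumptions, not values specified by the patent.

```python
import numpy as np
import librosa

def extract_target_features(path, sr=16000, n_mels=64, n_fft=1024, hop_length=512):
    y, _ = librosa.load(path, sr=sr, mono=False)
    if y.ndim == 2:                        # two-channel audio: average into a single channel
        y = y.mean(axis=0)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)     # log mel-band spectrogram
    # min-max normalization to control the dynamic range of the audio
    return (log_mel - log_mel.min()) / (log_mel.max() - log_mel.min() + 1e-8)
```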
In the above embodiment, the sound categories may be a plurality of categories set in advance, for example crying, laughing, door opening, talking (male, female), device sounds, and the like. The specific sound categories may be set according to the application scenario.
In the above embodiment, after the target sound category is determined, a corresponding operation may be performed according to it. For example, when the target sound category is determined to be an infant crying, a smart speaker may be controlled to play stories, music, and the like; alternatively, when it is determined that an infant is crying in an environment without the sounds of other people, a notification message may be sent to a bound device to inform the family that the infant is crying.
Optionally, the executing body of the above steps may be a background processor or another device with similar processing capability, or a machine integrating at least an audio acquisition device and a data processing device, where the audio acquisition device may include a sound acquisition module such as a microphone, and the data processing device may include a terminal such as a computer or a mobile phone, but is not limited thereto.
In the present application, the target features of the acquired target audio data are determined, the target features are analyzed with the target model to determine the first probability that each frame of audio data included in the target audio data belongs to each sound category, and the target sound category of each frame is determined from the first probability. Since the target model is trained on the features of enhancement-processed audio data together with the sound category of each frame, and the training data has undergone enhancement processing, model training can be achieved without collecting high-quality audio data. This solves the problem in the related art that determining audio categories places high requirements on training audio, and achieves the effects of making training data easy to obtain and improving the accuracy of audio category determination.
In an exemplary embodiment, before analyzing the target features using the target model, the method further comprises: acquiring multiple sets of training audio data; performing enhancement processing on the multiple sets of training audio data to obtain the multiple sets of target training data; and training an initial model using the multiple sets of target training data to obtain the target model. In this embodiment, the audio data used to train the target model may be audio downloaded from the internet, recorded audio, or audio collected by a sound collection device. After the audio data is acquired, enhancement processing may be performed on it, and the target model is obtained by training on the enhanced audio. The target model is a convolutional neural network model that may include a plurality of convolutional layers; stacking convolutional layers keeps the effect optimal while markedly improving the real-time rate and reducing hardware resource consumption, which helps the algorithm land on resource-constrained embedded devices.
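The following is a minimal sketch of this training flow, assuming per-frame class-index labels and a model that outputs per-frame class probabilities (as sketched later in this description); the optimizer, learning rate, and loss function are assumptions, not specified by the patent.

```python
import torch
import torch.nn.functional as F

def train_target_model(model, loader, epochs=10, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for feats, labels in loader:       # enhanced (augmented) target training data
            opt.zero_grad()
            probs = model(feats)           # (batch, classes, frames) per-frame probabilities
            # negative log-likelihood over per-frame labels of shape (batch, frames)
            loss = F.nll_loss(torch.log(probs + 1e-8), labels)
            loss.backward()
            opt.step()
    return model
```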
In an exemplary embodiment, performing enhancement processing on the multiple sets of training audio data to obtain the multiple sets of target training data includes at least one of: performing fusion processing on two consecutive frames of sub-audio data included in each set of training audio data to obtain the multiple sets of target training data; performing deformation processing on the frequency-domain data and/or time-domain data of each set of training audio data to obtain the multiple sets of target training data; and performing pitch-modification processing on each set of training audio data to obtain the multiple sets of target training data. In this embodiment, one or more of the fusion, deformation, and pitch-modification processes may be applied to each set of training audio data; the three enhancement processes may also be applied to a set of audio training data in sequence, and the order of fusion, deformation, and pitch modification is not limited.
In an exemplary embodiment, fusing the two consecutive frames of sub-audio data included in each set of training audio data comprises performing the fusion processing on the audio data by means of Mixup enhancement; deforming the frequency-domain data and/or time-domain data of each set of training audio data comprises performing the deformation processing on the audio data by means of Spec-augmentation enhancement; and pitch-modifying each set of training audio data comprises performing the pitch-modification processing on the audio data by means of Pitch-shift enhancement. In this embodiment, three audio enhancement methods, namely Mixup, Spec-augmentation, and Pitch-shift, can be applied to the sound data enhancement. Mixup fuses the sound data and labels of two successive batches: data x1 and x2, with labels y1 and y2, are selected from the two batches respectively, and the fused data is x' = αx1 + (1−α)x2 with label y' = αy1 + (1−α)y2. This amounts to continuous interpolation of the speech signals and their labels, which increases the modeling of data between classes and improves generalization.
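A minimal sketch of this Mixup fusion follows, implementing the formulas above; drawing α from a Beta distribution is a common convention and an assumption here, not stated in the patent.

```python
import numpy as np

def mixup(x1, y1, x2, y2, beta=0.2):
    alpha = np.random.beta(beta, beta)     # mixing coefficient α in (0, 1)
    x = alpha * x1 + (1.0 - alpha) * x2    # fuse the two spectrograms: x' = αx1 + (1−α)x2
    y = alpha * y1 + (1.0 - alpha) * y2    # fuse the one-hot labels:   y' = αy1 + (1−α)y2
    return x, y
```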
Spec-augmentation: the time-frequency spectrum can be deformed in two directions on the time-frequency plane, which increases the richness of distortions in the data and further strengthens the model's resistance to perturbation. The first is time warping, which increases the network's robustness to deformations in the time direction. The second is masking: a band of consecutive mel channels [f0, f0 + f] is masked off, where f is drawn uniformly from 0 to F and f0 is selected from [0, v − f], with v the number of channels of the mel spectrum; randomly masking part of the log-mel spectrum in this way (frequency masking) makes the network robust to the loss of local detail.
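A minimal sketch of the masking described above follows, applied to a log-mel spectrogram of shape (v mel channels, T frames); the mask-width limits F and T_max are illustrative assumptions, and the time-warping step is omitted.

```python
import numpy as np

def spec_augment(spec, F=8, T_max=20):
    spec = spec.copy()
    v, T = spec.shape
    # frequency masking: mask [f0, f0 + f) with f ~ U(0, F) and f0 drawn from [0, v − f]
    f = np.random.randint(0, F + 1)
    f0 = np.random.randint(0, v - f + 1)
    spec[f0:f0 + f, :] = 0.0
    # analogous masking along the time axis
    t = np.random.randint(0, T_max + 1)
    t0 = np.random.randint(0, T - t + 1)
    spec[:, t0:t0 + t] = 0.0
    return spec
```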
Pitch-shift: obtaining the fundamental frequency of the audio signal and then taking the fundamental frequency f of the original signal0For reference, the amplitude of the disturbance of fundamental frequency is beta-0.1 f0Then the new perturbation range of fundamental frequency is [ f0-β,f0+β]。
In this embodiment, the combined Mixup, Spec-augmentation, and Pitch-shift enhancement method is used as the data enhancement mode for the audio dataset. Data enhancement methods such as Mixup markedly increase the richness of the dataset and effectively mitigate the severe performance degradation caused when a deep neural network overfits in practical use.
In one exemplary embodiment, analyzing the target features using the target model to determine the first probability that each frame of audio data included in the target audio data belongs to each sound category includes: processing the target features in sequence with a first convolutional layer, a batch normalization layer, a first activation layer, a plurality of target layers, and a softmax function included in the target model to determine the first probability, where each target layer includes a second convolutional layer, a second activation layer, and a discard (dropout) layer. Stacked convolutional layers permit lightweight parallel computation and greatly reduce hardware resource consumption. The number of target layers may be 3 (this value is merely illustrative; the number of target layers may also be 2, 4, and so on, which the present invention does not limit); when the number of target layers is 3, the target model structure is shown in fig. 3.
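A minimal PyTorch sketch of this layer ordering follows (first convolutional layer, batch normalization, first activation, three target layers of convolution/activation/dropout, then softmax); the channel counts, use of 2-D convolutions, and mel-axis pooling are assumptions, not taken from fig. 3.

```python
import torch
import torch.nn as nn

class AudioClassifier(nn.Module):
    def __init__(self, n_classes):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),      # first convolutional layer
            nn.BatchNorm2d(32),                              # batch normalization layer
            nn.ReLU(),                                       # first activation layer
        )
        self.target_layers = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(32, 32, kernel_size=3, padding=1), # second convolutional layer
                nn.ReLU(),                                   # second activation layer
                nn.Dropout(0.2),                             # discard (dropout) layer
            ) for _ in range(3)                              # three target layers
        ])
        self.head = nn.Conv2d(32, n_classes, kernel_size=1)

    def forward(self, x):                   # x: (batch, 1, n_mels, frames)
        h = self.target_layers(self.stem(x))
        h = self.head(h).mean(dim=2)        # pool over the mel axis -> (batch, classes, frames)
        return torch.softmax(h, dim=1)      # per-frame class probabilities
```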
In one exemplary embodiment, determining, based on the first probability, the target sound category to which each frame of audio data included in the target audio data belongs includes: filtering the first probability corresponding to each frame of audio data to obtain a target probability for each frame; and determining the target sound category of each frame based on the target probability. In this embodiment, the CNN classifier outputs a class probability for each frame of audio data, and the label with the highest class probability is taken as that frame's output. The output across frames should have a certain temporal continuity, but per-frame classification by itself has none, so the first probability can be filtered to eliminate abrupt, temporally discontinuous event outputs caused by classification errors. After filtering, the target probability is obtained, and the target sound category of each frame is determined from it. This avoids discontinuous recognition results caused by interference from abnormal data such as sudden changes.
In an exemplary embodiment, filtering the first probability corresponding to each frame of audio data to obtain the target probability of each frame includes performing the following operations on the first probability of each frame: determining second probabilities that a first number of consecutive frames preceding and adjacent to the current frame belong to each sound category; determining third probabilities that a second number of consecutive frames following and adjacent to the current frame belong to each sound category; and performing median-filter smoothing on the first, second, and third probabilities to determine the target probability of the current frame. In this embodiment, a median-filter smoothing algorithm may be used, with a window length of (2N + 1) frames, where N may be 5 (this value is merely illustrative; it may also be 3, 4, 6, 7, 10, and so on, which the present disclosure does not limit). The hop size may take the value of one frame. The smoothing formula is: Y(i) = median(y(i − N), y(i − N + 1), …, y(i), …, y(i + N − 1), y(i + N)). Smoothing the final classification result with the median filter eliminates outliers, ensures the detection accuracy for abnormal sounds, and improves the accuracy of the output event results.
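A minimal sketch of this median-filter smoothing follows, using scipy; the window of (2N + 1) frames and hop size of one frame follow the description above.

```python
import numpy as np
from scipy.ndimage import median_filter

def smooth_and_classify(probs, N=5):
    # probs: (frames, classes) array of first probabilities from the CNN classifier
    # median-filter each class track over a (2N + 1)-frame window, hop size one frame
    target_probs = median_filter(probs, size=(2 * N + 1, 1), mode='nearest')
    return target_probs.argmax(axis=1)     # target sound category for each frame
```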
In an exemplary embodiment, each of the plurality of convolutional layers included in the target model is a 3×3 convolutional layer. In this embodiment, the classification network employs a convolutional neural network with small convolution kernels, so that while keeping the effect optimal, the real-time rate is significantly improved and hardware resource consumption is reduced, helping the algorithm land on resource-constrained embedded devices. This addresses the problems that deep learning networks have slow forward-inference speed, consume substantial hardware resources, and require large amounts of data.
The following describes the audio category determination method with reference to a specific embodiment:
Fig. 4 is a flowchart of a method for determining an audio category according to an embodiment of the present invention; as shown in fig. 4, the flow includes a training phase, a testing phase, and a post-processing phase. The training phase comprises enhancement, normalization, feature extraction, and CNN classification of the original waveform data. After training, the original waveform data is (optionally) enhanced, normalized, and feature-extracted, then passed through the classifier model to obtain event probabilities, which are smoothed with a median filter before output.
In this embodiment, a real-time parallel CNN network is adopted, which greatly increases the running speed of the deep neural network on the embedded platform; meanwhile, enriching the data samples through data enhancement improves the classification performance of the neural network.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
In this embodiment, a device for determining an audio category is further provided, where the device is used to implement the foregoing embodiment and the preferred embodiments, and details of the foregoing description are omitted. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
Fig. 5 is a block diagram showing the structure of an apparatus for determining an audio class according to an embodiment of the present invention, as shown in fig. 5, the apparatus including:
a first determining module 52, configured to determine a target feature of the acquired target audio data;
an analysis module 54, configured to analyze the target features using a target model to determine a first probability that each frame of audio data included in the target audio data belongs to each sound category, where the target model is trained by machine learning using multiple sets of target training data, each set of which includes features of audio data and the sound category of each frame of audio, each set of data is data obtained after enhancement processing, and the target model includes a plurality of convolutional layers;
a second determining module 56, configured to determine, based on the first probability, a target sound class to which each frame of audio data included in the target audio data belongs.
In one exemplary embodiment, the apparatus may be configured to obtain a plurality of sets of training audio data prior to analyzing the target features using a target model; performing enhancement processing on the multiple groups of training audio data to obtain multiple groups of target training data; and training an initial model by utilizing the multiple groups of target training data to obtain the target model.
In an exemplary embodiment, the apparatus may perform enhancement processing on the plurality of sets of training audio data to obtain a plurality of sets of target training data by at least one of: performing fusion processing on continuous two frames of sub-audio data included in each group of training audio data in the multiple groups of training audio data to obtain multiple groups of target training data; performing deformation processing on frequency domain data and/or time domain data of each group of training audio data in the plurality of groups of training audio data to obtain a plurality of groups of target training data; and performing tone modification processing on each group of training audio data in the plurality of groups of training audio data to obtain a plurality of groups of target training data.
In an exemplary embodiment, the apparatus may perform the fusion processing on the two consecutive frames of sub-audio data included in each set of training audio data by performing the fusion processing on the audio data by means of Mixup enhancement; may perform the deformation processing on the frequency-domain data and/or time-domain data of each set of training audio data by means of Spec-augmentation enhancement; and may perform the pitch-modification processing on each set of training audio data by means of Pitch-shift enhancement.
In an exemplary embodiment, the analysis module 54 may implement the analyzing the target feature using a target model to determine a first probability that each frame of audio data included in the target audio data belongs to each sound class by: processing the target feature by using a first convolution layer, a batch normalization layer, a first activation layer, a plurality of target layers and a softmax function included in the target model in sequence to determine the first probability, wherein the target layers include a second convolution layer, a second activation layer and a discard layer.
In an exemplary embodiment, the second determining module 56 may determine the target sound category to which each frame of audio data included in the target audio data belongs based on the first probability by: filtering the first probability corresponding to each frame of audio data to obtain a target probability of each frame of audio data; determining the target sound category to which the each frame of audio data belongs based on the target probability.
In an exemplary embodiment, the second determining module 56 may filter the first probability corresponding to each frame of audio data to obtain the target probability of each frame of audio data by: performing the following operations on the first probability of each frame of audio data to obtain the target probability of each frame of audio data: respectively determining a second probability that continuous first number of frames of audio data which are in front of the current frame of audio data and adjacent to the current frame of audio belong to each sound category; determining a third probability that a second number of consecutive frames of audio data following the current frame of audio data and adjacent to the current frame of audio data belong to each sound class, respectively; and performing median filtering smoothing processing on the first probability, the second probability and the third probability to determine the target probability corresponding to the current frame of audio data.
In one exemplary embodiment, each of the plurality of convolutional layers included in the target model is a 3×3 convolutional layer.
It should be noted that, the above modules may be implemented by software or hardware, and for the latter, the following may be implemented, but not limited to: the modules are all positioned in the same processor; alternatively, the modules are respectively located in different processors in any combination.
Embodiments of the present invention also provide a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the method as set forth in any of the above.
In an exemplary embodiment, the computer-readable storage medium may include, but is not limited to: various media capable of storing computer programs, such as a usb disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.
In an exemplary embodiment, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
For specific examples in this embodiment, reference may be made to the examples described in the above embodiments and exemplary embodiments, and details of this embodiment are not repeated herein.
It will be apparent to those skilled in the art that the various modules or steps of the invention described above may be implemented using a general purpose computing device, they may be centralized on a single computing device or distributed across a network of computing devices, and they may be implemented using program code executable by the computing devices, such that they may be stored in a memory device and executed by the computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into various integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.

Claims (11)

1. A method for determining an audio category, comprising:
determining target characteristics of the obtained target audio data;
analyzing the target features by using a target model to determine a first probability that each frame of audio data included in the target audio data belongs to each sound category, wherein the target model is trained by machine learning using a plurality of sets of target training data, and each set of data in the plurality of sets of target training data comprises: features of audio data and the sound category of each frame of audio, each set of data being data obtained after enhancement processing, and the target model comprising a plurality of convolutional layers;
determining a target sound class to which each frame of audio data included in the target audio data belongs based on the first probability.
2. The method of claim 1, wherein prior to analyzing the target feature using a target model, the method further comprises:
acquiring a plurality of groups of training audio data;
performing enhancement processing on the multiple groups of training audio data to obtain multiple groups of target training data;
and training an initial model by utilizing the multiple groups of target training data to obtain the target model.
3. The method of claim 2, wherein performing enhancement processing on the sets of training audio data to obtain the sets of target training data comprises at least one of:
performing fusion processing on continuous two frames of sub-audio data included in each group of training audio data in the multiple groups of training audio data to obtain multiple groups of target training data;
performing deformation processing on frequency domain data and/or time domain data of each group of training audio data in the plurality of groups of training audio data to obtain a plurality of groups of target training data;
and performing tone modification processing on each group of training audio data in the plurality of groups of training audio data to obtain a plurality of groups of target training data.
4. The method of claim 3,
the fusion processing of the continuous two-frame sub-audio data included in each group of training audio data in the plurality of groups of training audio data comprises: performing the fusion processing on the audio data in a Mixup enhancement mode;
the step of performing deformation processing on the frequency domain data and/or the time domain data of each group of training audio data in the plurality of groups of training audio data comprises: performing the deformation processing on the audio data by means of Spec-augmentation enhancement;
the tonal modification processing of each group of training audio data in the plurality of groups of training audio data comprises: and performing the tonal modification processing on the audio data in a Pitch-shift enhancement mode.
5. The method of claim 1, wherein analyzing the target features using a target model to determine a first probability that each frame of audio data included in the target audio data belongs to each sound class comprises:
processing the target feature by using a first convolution layer, a batch normalization layer, a first activation layer, a plurality of target layers and a softmax function included in the target model in sequence to determine the first probability, wherein the target layers include a second convolution layer, a second activation layer and a discard layer.
6. The method of claim 1, wherein determining a target sound class to which each frame of audio data included in the target audio data belongs based on the first probability comprises:
filtering the first probability corresponding to each frame of audio data to obtain a target probability of each frame of audio data;
determining the target sound category to which the each frame of audio data belongs based on the target probability.
7. The method of claim 6, wherein filtering the first probability corresponding to each frame of audio data to obtain the target probability of each frame of audio data comprises:
performing the following operations on the first probability of each frame of audio data to obtain the target probability of each frame of audio data:
respectively determining a second probability that continuous first number of frames of audio data which are in front of the current frame of audio data and adjacent to the current frame of audio belong to each sound category;
determining a third probability that a second number of consecutive frames of audio data following the current frame of audio data and adjacent to the current frame of audio data belong to each sound class, respectively;
and performing median filtering smoothing processing on the first probability, the second probability and the third probability to determine the target probability corresponding to the current frame of audio data.
8. The method of any of claims 1 to 7, wherein each of the plurality of convolutional layers included in the target model is a 3×3 convolutional layer.
9. An apparatus for determining an audio class, comprising:
the first determining module is used for determining the target characteristics of the acquired target audio data;
an analysis module, configured to analyze the target features using a target model to determine a first probability that each frame of audio data included in the target audio data belongs to each sound category, wherein the target model is trained by machine learning using a plurality of sets of target training data, and each set of data in the plurality of sets of target training data comprises: features of audio data and the sound category of each frame of audio, each set of data being data obtained after enhancement processing, and the target model comprising a plurality of convolutional layers;
a second determining module, configured to determine, based on the first probability, a target sound category to which each frame of audio data included in the target audio data belongs.
10. A computer-readable storage medium, in which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.
11. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is arranged to execute the computer program to perform the method of any of claims 1 to 8.
CN202110853406.4A 2021-07-27 2021-07-27 Audio category determination method and device, storage medium and electronic device Pending CN113593603A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110853406.4A CN113593603A (en) 2021-07-27 2021-07-27 Audio category determination method and device, storage medium and electronic device


Publications (1)

Publication Number Publication Date
CN113593603A true CN113593603A (en) 2021-11-02

Family

ID=78250746

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110853406.4A Pending CN113593603A (en) 2021-07-27 2021-07-27 Audio category determination method and device, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN113593603A (en)


Patent Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018107874A1 (en) * 2016-12-16 2018-06-21 广州视源电子科技股份有限公司 Method and apparatus for automatically controlling gain of audio data
CN109147771A (en) * 2017-06-28 2019-01-04 广州视源电子科技股份有限公司 audio segmentation method and system
CN108172224A (en) * 2017-12-19 2018-06-15 浙江大学 The method without vocal command control voice assistant based on the defence of machine learning
CN109065030A (en) * 2018-08-01 2018-12-21 上海大学 Ambient sound recognition methods and system based on convolutional neural networks
WO2020024396A1 (en) * 2018-08-02 2020-02-06 平安科技(深圳)有限公司 Music style recognition method and apparatus, computer device, and storage medium
US20210217439A1 (en) * 2018-09-11 2021-07-15 Sony Corporation Acoustic event recognition device, method, and program
WO2020054409A1 (en) * 2018-09-11 2020-03-19 ソニー株式会社 Acoustic event recognition device, method, and program
CN112639969A (en) * 2018-09-11 2021-04-09 索尼公司 Acoustic event recognition apparatus, method, and program
WO2020238681A1 (en) * 2019-05-31 2020-12-03 京东数字科技控股有限公司 Audio processing method and device, and man-machine interactive system
US20210043186A1 (en) * 2019-08-08 2021-02-11 International Business Machines Corporation Data augmentation by frame insertion for speech data
CN110827804A (en) * 2019-11-14 2020-02-21 福州大学 Sound event labeling method from audio frame sequence to event label sequence
CN110929092A (en) * 2019-11-19 2020-03-27 国网江苏省电力工程咨询有限公司 Multi-event video description method based on dynamic attention mechanism
CN111653290A (en) * 2020-05-29 2020-09-11 北京百度网讯科技有限公司 Audio scene classification model generation method, device, equipment and storage medium
CN111933109A (en) * 2020-07-24 2020-11-13 南京烽火星空通信发展有限公司 Audio monitoring method and system
CN112750462A (en) * 2020-08-07 2021-05-04 腾讯科技(深圳)有限公司 Audio processing method, device and equipment
CN112270933A (en) * 2020-11-12 2021-01-26 北京猿力未来科技有限公司 Audio identification method and device
CN112466298A (en) * 2020-11-24 2021-03-09 网易(杭州)网络有限公司 Voice detection method and device, electronic equipment and storage medium

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115206294A (en) * 2022-09-16 2022-10-18 深圳比特微电子科技有限公司 Training method, sound event detection method, device, equipment and medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination