CN116030800A - Audio classification recognition method, system, computer and readable storage medium - Google Patents


Info

Publication number: CN116030800A
Authority: CN (China)
Prior art keywords: feature, dimension reduction, audio data, spectrum, semantic
Legal status: Pending (assumed status, not a legal conclusion)
Application number: CN202310323314.4A
Other languages: Chinese (zh)
Inventors: 邱晓健, 连峰, 邱正峰, 崔韧, 吴鼎元
Assignee: Nanchang Hang Tian Guang Xin Technology Co ltd
Application filed by Nanchang Hang Tian Guang Xin Technology Co ltd

Classifications

  • Complex Calculations (AREA)

Abstract

The invention provides an audio classification recognition method, system, computer, and readable storage medium. The method comprises: obtaining training audio data and classified audio data; preprocessing the training audio data and the classified audio data; performing semantic segmentation and recombination on the resulting first Mel spectrogram and second Mel spectrogram; performing multidimensional convolution dimension-reduction processing on the first spectrum feature and the second spectrum feature; and inputting the first dimension-reduction feature into a preset CNN neural network recognition model for training, while inputting the second dimension-reduction feature into the trained model for classification recognition, so as to output the classification label corresponding to the classified audio data. According to the invention, semantic segmentation and recombination of the Mel spectrogram improve the accuracy of model classification, and processing the spectrum features by multidimensional convolution dimension reduction improves that accuracy further.

Description

Audio classification recognition method, system, computer and readable storage medium
Technical Field
The invention belongs to the technical field of audio classification, and particularly relates to an audio classification recognition method, an audio classification recognition system, a computer, and a readable storage medium.
Background
Audio refers to sound frequencies perceptible to humans, conventionally defined as 20-20,000 Hz. Sound consists of sound waves produced by the vibration of an object; it is a wave phenomenon that propagates through a medium (air, liquid, or solid) and can be perceived by the auditory organs of humans or animals.
In a general acoustic scene, a period of time within the scene can be regarded as a semantic unit. An acoustic scene is random, in that semantics with different contents and meanings may appear in it, and also continuous, in that the same or similar semantics may recur many times within a short period. In the audio classification process, a classification model therefore cannot make good use of the semantic information of the acoustic scene. In addition, noise in the acoustic scene affects the audio data, so its robustness is weak; as a result, the accuracy of the classification model is low, which affects the final classification result.
Meanwhile, in the prior art, two-dimensional convolution alone is mostly used to process the audio information. However, convolution is translation-invariant, whereas sound is not layered on a spectrogram: a frequency observed on a spectrogram may not belong to a single sound but may instead be the superposition of several sound sources. Processing audio information with two-dimensional convolution alone therefore greatly increases the difficulty of training and the number of model parameters, which in turn affects the accuracy of the classification model on audio classification.
Disclosure of Invention
In order to solve the above technical problems, the invention provides an audio classification recognition method, an audio classification recognition system, a computer, and a readable storage medium, which address the technical problems in the prior art.
In a first aspect, the present invention provides the following technical solution: an audio classification recognition method, where the method includes:
acquiring training audio data and classified audio data, wherein the training audio data comprises training audio and a classified label corresponding to the training audio;
preprocessing the training audio data and the classified audio data to obtain a first mel spectrogram of the training audio data and a second mel spectrogram of the classified audio data respectively;
respectively carrying out semantic segmentation and recombination on the first Mel spectrogram and the second Mel spectrogram to obtain a first frequency spectrum characteristic and a second frequency spectrum characteristic;
respectively carrying out multidimensional convolution dimension reduction processing on the first frequency spectrum feature and the second frequency spectrum feature to obtain a first dimension reduction feature and a second dimension reduction feature, and carrying out normalization processing on the first dimension reduction feature and the second dimension reduction feature;
inputting the normalized first dimension reduction features and the corresponding classification labels into a preset CNN neural network recognition model to train the preset CNN neural network recognition model, and inputting the normalized second dimension reduction features into the trained preset CNN neural network recognition model to carry out classification recognition so as to output the classification labels corresponding to the classified audio data.
Compared with the prior art, the beneficial effects of this application are as follows. The method obtains training audio data and classified audio data, where the training audio data comprises training audio and the corresponding classification labels; the training audio data is used to train the model, and the classified audio data is the audio to be classified. Both are preprocessed to obtain a first Mel spectrogram of the training audio data and a second Mel spectrogram of the classified audio data, so that audio features are extracted via the Mel spectrogram. Semantic segmentation and recombination are then performed on the first and second Mel spectrograms to obtain the first and second spectrum features; this reduces the influence of noise on the audio data, lets the model make better use of the semantic information of the acoustic scene, and enhances the robustness of the data, improving the accuracy of model classification. Next, multidimensional convolution dimension-reduction processing is applied to the first and second spectrum features to obtain the first and second dimension-reduction features, which are then normalized: two-dimensional convolution extracts the high-dimensional features of the audio, and one-dimensional convolution models the channel relations within those high-dimensional features, further improving classification accuracy. Finally, the normalized first dimension-reduction feature and the corresponding classification label are input into a preset CNN neural network recognition model to train it, and the normalized second dimension-reduction feature is input into the trained model for classification recognition. In this way, semantic segmentation and recombination of the Mel spectrogram improve the accuracy of model classification, and multidimensional convolution dimension reduction of the spectrum features improves it further.
Preferably, the step of preprocessing the training audio data and the classified audio data includes:
sequentially performing pre-emphasis, framing, windowing, discrete Fourier transform, and Mel filter bank filtering on the training audio data and the classified audio data.
Preferably, the step of performing semantic segmentation and recombination on the first mel spectrogram and the second mel spectrogram to obtain a first spectral feature and a second spectral feature includes:
sequentially dividing the first Mel spectrogram and the second Mel spectrogram in the time direction to obtain a first spectrum semantic fragment group and a second spectrum semantic fragment group respectively;
respectively acquiring the first segment number in the first spectrum semantic segment group and the second segment number in the second spectrum semantic segment group;
establishing a first sorting group and a second sorting group based on the first fragment number and the second fragment number respectively, wherein the element numbers in the first sorting group and the second sorting group are the same as the first fragment number and the second fragment number respectively;
and carrying out semantic recombination on the first spectrum semantic segment group and the second spectrum semantic segment group according to the first sequencing group and the second sequencing group so as to obtain a first spectrum characteristic and a second spectrum characteristic.
Preferably, the step of semantically recombining the first spectrum semantic segment group and the second spectrum semantic segment group according to the first sorting group and the second sorting group to obtain a first spectrum feature and a second spectrum feature includes:
randomly sorting the elements in the first sorting group and the second sorting group to obtain a first random sorting group and a second random sorting group;
indexing and sorting the semantic segments in the first spectrum semantic segment group according to the sorting relation of the first random sorting group to obtain a first sorting semantic segment group, and indexing and sorting the semantic segments in the second spectrum semantic segment group according to the sorting relation of the second random sorting group to obtain a second sorting semantic segment group;
and performing semantic stitching on the first ordered semantic segment group and the second ordered semantic segment group in a semantic dimension to obtain a first spectrum feature and a second spectrum feature.
Preferably, the step of performing multidimensional convolution dimension reduction processing on the first spectrum feature and the second spectrum feature to obtain a first dimension reduction feature and a second dimension reduction feature includes:
performing two-dimensional convolution feature extraction on the first frequency spectrum feature and the second frequency spectrum feature respectively to obtain a first high-dimensional frequency spectrum feature and a second high-dimensional frequency spectrum feature;
stretching and dimension-reducing the first high-dimensional frequency spectrum characteristic and the second high-dimensional frequency spectrum characteristic in the time dimension and the frequency dimension to obtain a first stretched frequency spectrum characteristic and a second stretched frequency spectrum characteristic;
and determining a first channel feature matrix and a second channel feature matrix based on the first stretching spectrum feature and the second stretching spectrum feature, and obtaining a first dimension reduction feature and a second dimension reduction feature according to the first channel feature matrix and the second channel feature matrix.
Preferably, the step of determining a first channel feature matrix and a second channel feature matrix based on the first stretched spectrum feature and the second stretched spectrum feature, and obtaining a first dimension reduction feature and a second dimension reduction feature according to the first channel feature matrix and the second channel feature matrix includes:
sequentially carrying out one-dimensional convolution, excitation operation and channel compression on the first stretching spectrum characteristic and the second stretching spectrum characteristic to obtain a first excitation characteristic matrix and a second excitation characteristic matrix;
performing feature recombination on the first excitation feature matrix and the second excitation feature matrix to obtain a first channel feature matrix and a second channel feature matrix;
performing element multiplication on the elements in the first stretching spectrum characteristic and the elements in the first channel characteristic matrix to obtain first multiplication elements, performing characteristic connection on the first multiplication elements in a residual mode to obtain first dimension reduction characteristics, performing element multiplication on the elements in the second stretching spectrum characteristic and the elements in the second channel characteristic matrix to obtain a plurality of second multiplication elements, and performing characteristic connection on the plurality of second multiplication elements in a residual mode to obtain second dimension reduction characteristics.
Preferably, the step of normalizing the first dimension-reduction feature and the second dimension-reduction feature includes:
and respectively carrying out normalization processing on the first dimension reduction feature and the second dimension reduction feature through a normalization formula, wherein the normalization formula is as follows:
y′ = (y − min(y)) / (max(y) − min(y))

where y represents the feature value before normalization, min(y) and max(y) represent the minimum and maximum of the feature values before normalization, and y′ represents the normalized feature value.
In a second aspect, the present invention provides an audio classification recognition system, which includes:
the system comprises an acquisition module, a classification module and a classification module, wherein the acquisition module is used for acquiring training audio data and classification audio data, and the training audio data comprises training audio and classification labels corresponding to the training audio;
the processing module is used for preprocessing the training audio data and the classified audio data to respectively obtain a first Mel spectrogram of the training audio data and a second Mel spectrogram of the classified audio data;
the segmentation module is used for carrying out semantic segmentation and recombination on the first Mel spectrogram and the second Mel spectrogram respectively so as to obtain a first frequency spectrum characteristic and a second frequency spectrum characteristic;
the dimension reduction module is used for respectively carrying out multidimensional convolution dimension reduction processing on the first frequency spectrum feature and the second frequency spectrum feature to obtain a first dimension reduction feature and a second dimension reduction feature, and carrying out normalization processing on the first dimension reduction feature and the second dimension reduction feature;
the classification module is used for inputting the normalized first dimension reduction features and the corresponding classification labels into a preset CNN neural network recognition model to train the preset CNN neural network recognition model, and inputting the normalized second dimension reduction features into the trained preset CNN neural network recognition model to carry out classification recognition so as to output the classification labels corresponding to the classified audio data.
In a third aspect, the present invention provides a computer, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the method for identifying audio classification as described above when executing the computer program.
In a fourth aspect, the present invention provides a readable storage medium, where a computer program is stored on the readable storage medium, and the computer program when executed by a processor implements the method for identifying audio classification as described above.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of an audio classification recognition method according to an embodiment of the present invention;
fig. 2 is a detailed flowchart of step S3 in the audio classification recognition method according to the first embodiment of the present invention;
fig. 3 is a detailed flowchart of step S34 in the audio classification recognition method according to the first embodiment of the present invention;
fig. 4 is a detailed flowchart of step S4 in the audio classification recognition method according to the first embodiment of the present invention;
fig. 5 is a detailed flowchart of step S43 in the audio classification recognition method according to the first embodiment of the present invention;
fig. 6 is a block diagram of an audio classification recognition system according to a second embodiment of the present invention;
fig. 7 is a block diagram of a hardware structure of a computer according to another embodiment of the present invention.
Embodiments of the present invention will be further described below with reference to the accompanying drawings.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are exemplary and intended to illustrate embodiments of the invention and should not be construed as limiting the invention.
In the description of the embodiments of the present invention, it should be understood that the terms "length," "width," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like indicate orientations or positional relationships based on the orientation or positional relationships shown in the drawings, merely to facilitate description of the embodiments of the present invention and simplify description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the embodiments of the present invention, the meaning of "plurality" is two or more, unless explicitly defined otherwise.
In the embodiments of the present invention, unless explicitly specified and limited otherwise, the terms "mounted," "connected," "secured" and the like are to be construed broadly and include, for example, either permanently connected, removably connected, or integrally formed; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communicated with the inside of two elements or the interaction relationship of the two elements. The specific meaning of the above terms in the embodiments of the present invention will be understood by those of ordinary skill in the art according to specific circumstances.
Example 1
As shown in fig. 1, in a first embodiment of the present invention, the present invention provides a method for identifying audio classification, the method comprising:
S1, acquiring training audio data and classified audio data, wherein the training audio data comprises training audio and a classified label corresponding to the training audio;
specifically, the training audio data is an audio file after being classified by a manual or classifier, the training audio data includes a classification tag corresponding to the training audio, the classification tag is an acoustic scene tag, for example, different classification tags in a public transportation, a subway, a park scene and the like, and the classification audio data is collected real-time audio data to be classified.
S2, preprocessing the training audio data and the classified audio data to respectively obtain a first Mel spectrogram of the training audio data and a second Mel spectrogram of the classified audio data;
specifically, through preprocessing the training audio data and the classified audio data, a mel spectrogram for reflecting audio characteristics can be obtained;
wherein the step of preprocessing the training audio data and the classified audio data comprises: sequentially performing pre-emphasis, framing, windowing, discrete Fourier transform, and Mel filter bank filtering on the training audio data and the classified audio data. The pre-emphasis step removes the effects of the vocal cords and lips during sound production and compensates for the attenuation of high-frequency information during transmission; the framing step divides the original audio data into a number of frames so that time-domain information is not lost in the Fourier transform; the windowing step smooths the abrupt edge signals introduced by framing; the discrete Fourier transform step applies a Fourier transform to each window to obtain the frequency-domain information of the audio; and the Mel filter bank filtering makes the frequency distribution of the audio better match the perceptual scale of the human ear. Through these preprocessing steps, one-dimensional audio information can be converted into a two-dimensional Mel spectrogram.
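The preprocessing pipeline above can be sketched as follows. This is a minimal numpy illustration, not the patent's implementation: the frame length, hop size, pre-emphasis coefficient, and mel-band count are assumed values, and the mel filter bank is replaced by a stand-in matrix of random non-negative weights rather than true triangular mel filters.

```python
import numpy as np

def preprocess(audio, frame_len=400, hop=160, n_mels=40):
    # Pre-emphasis: boost high frequencies attenuated during transmission
    emphasized = np.append(audio[0], audio[1:] - 0.97 * audio[:-1])
    # Framing: split the signal into overlapping frames
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx]
    # Windowing: a Hann window smooths the abrupt frame edges
    frames = frames * np.hanning(frame_len)
    # Discrete Fourier transform: per-frame power spectrum
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Stand-in "mel filter bank" (random non-negative weights, illustrative only)
    rng = np.random.default_rng(0)
    mel_fb = rng.random((power.shape[1], n_mels))
    return power @ mel_fb  # (n_frames, n_mels) mel-spectrogram-like matrix
```

In practice a library such as librosa would supply the real mel filter bank; the sketch only shows how the five steps chain together.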
S3, respectively carrying out semantic segmentation and recombination on the first Mel spectrogram and the second Mel spectrogram to obtain a first frequency spectrum characteristic and a second frequency spectrum characteristic;
specifically, since acoustic scenes have both continuity and randomness, semantics of different contents appear within one acoustic scene and the same semantics recur many times, which affects the accuracy of classification by the subsequent model; semantic segmentation and recombination of the Mel spectrograms are performed to address this.
As shown in fig. 2, the step S3 includes:
s31, sequentially dividing the first Mel spectrogram and the second Mel spectrogram in the time direction to obtain a first spectrum semantic segment group and a second spectrum semantic segment group respectively;
specifically, the first spectrum semantic segment group and the second spectrum semantic segment group are obtained by sequentially dividing the first Mel spectrogram and the second Mel spectrogram in the time direction; each group comprises a plurality of semantic segments and has three dimensions: channel, time, and frequency.
S32, respectively acquiring the first segment number in the first spectrum semantic segment group and the second segment number in the second spectrum semantic segment group;
s33, respectively establishing a first sorting group and a second sorting group based on the first fragment number and the second fragment number, wherein the element numbers in the first sorting group and the second sorting group are respectively the same as the first fragment number and the second fragment number;
specifically, the number of elements in the first sorting group is the same as the number of the first fragments, the semantic fragments in the first spectrum semantic fragment group are in one-to-one correspondence with the elements in the first sorting group, and similarly, the semantic fragments in the second spectrum semantic fragment group are in one-to-one correspondence with the elements in the second sorting group.
S34, carrying out semantic recombination on the first spectrum semantic segment group and the second spectrum semantic segment group according to the first sorting group and the second sorting group so as to obtain a first spectrum feature and a second spectrum feature;
as shown in fig. 3, the step S34 includes:
s341, randomly sorting elements in the first sorting group and the second sorting group to obtain a first random sorting group and a second random sorting group;
S342, carrying out index sorting on semantic segments in the first spectrum semantic segment group according to the sorting relation of the first random sorting group to obtain a first sorting semantic segment group, and carrying out index sorting on semantic segments in the second spectrum semantic segment group according to the sorting relation of the second random sorting group to obtain a second sorting semantic segment group;
specifically, since the semantic segments in the first spectrum semantic segment group correspond one-to-one with the elements in the first sorting group, and the semantic segments in the second spectrum semantic segment group correspond one-to-one with the elements in the second sorting group, once the order of the elements in the two sorting groups is rearranged, the semantic segments in both segment groups can be rearranged according to these correspondences, yielding the first sorted semantic segment group and the second sorted semantic segment group;
s343, performing semantic stitching on the first ordering semantic segment group and the second ordering semantic segment group in a semantic dimension to obtain a first spectrum feature and a second spectrum feature;
Specifically, the first spectrum feature and the second spectrum feature obtained after semantic stitching are three-dimensional features. By converting the two-dimensional first and second Mel spectrograms into three-dimensional first and second spectrum features, the semantic information of the audio is segmented, reordered, and recombined in the process; this reduces the influence of noise on the audio, makes better use of the semantic information in the acoustic scene, enhances the robustness of the audio data, and improves the accuracy of model classification.
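Steps S31-S343 can be sketched with numpy as follows. This is an illustrative reading of the patent, assuming the mel spectrogram is a 2-D (time, mel) array; the segment count `n_seg` and the random seed are assumed parameters, not values given in the text.

```python
import numpy as np

def segment_and_recombine(mel, n_seg=8, seed=0):
    # S31: divide the spectrogram into equal semantic segments along time
    t = (mel.shape[0] // n_seg) * n_seg     # drop a ragged tail, if any
    segments = np.split(mel[:t], n_seg)     # spectrum semantic segment group
    # S32/S33: one sorting-group element per segment
    order = np.arange(n_seg)
    # S341: randomly reorder the sorting group
    np.random.default_rng(seed).shuffle(order)
    # S342: index the segments by the shuffled order
    reordered = [segments[i] for i in order]
    # S343: stack along a new semantic dimension -> 3-D (segment, time, mel)
    return np.stack(reordered)
```

The output keeps every frame of the (truncated) spectrogram, only reordered, which is why the operation enhances robustness without discarding spectral content.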
S4, respectively carrying out multidimensional convolution dimension reduction processing on the first frequency spectrum feature and the second frequency spectrum feature to obtain a first dimension reduction feature and a second dimension reduction feature, and carrying out normalization processing on the first dimension reduction feature and the second dimension reduction feature;
specifically, since the first spectrum feature and the second spectrum feature obtained in step S3 are three-dimensional features, dimension-reduction processing is needed to reduce the classification and training difficulty of the model while avoiding the loss of feature information during feature processing;
and the step of normalizing the first dimension-reduction feature and the second dimension-reduction feature comprises the following steps:
And respectively carrying out normalization processing on the first dimension reduction feature and the second dimension reduction feature through a normalization formula, wherein the normalization formula is as follows:
y′ = (y − min(y)) / (max(y) − min(y))

where y represents the feature value before normalization, min(y) and max(y) represent the minimum and maximum of the feature values before normalization, and y′ represents the normalized feature value.
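The min-max normalization formula is straightforward to implement; a minimal sketch:

```python
import numpy as np

def min_max_normalize(y):
    # y' = (y - min(y)) / (max(y) - min(y)), mapping features into [0, 1]
    y = np.asarray(y, dtype=float)
    return (y - y.min()) / (y.max() - y.min())
```

For example, the features [2, 4, 6] normalize to [0.0, 0.5, 1.0].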
As shown in fig. 4, the step S4 includes:
s41, respectively extracting two-dimensional convolution characteristics of the first frequency spectrum characteristic and the second frequency spectrum characteristic to obtain a first high-dimensional frequency spectrum characteristic and a second high-dimensional frequency spectrum characteristic;
specifically, the first spectrum feature and the second spectrum feature are subjected to two-dimensional convolution feature extraction, and finally are connected with the original feature in a residual form so as to facilitate the subsequent one-dimensional convolution operation.
S42, stretching and dimension-reducing the first high-dimensional frequency spectrum characteristic and the second high-dimensional frequency spectrum characteristic in the time dimension and the frequency dimension to obtain a first stretched frequency spectrum characteristic and a second stretched frequency spectrum characteristic;
specifically, because the output of the two-dimensional convolution has a higher dimensionality than the one-dimensional convolution expects, a dimension-reduction operation is needed before the one-dimensional convolution is performed; stretching the first and second high-dimensional spectrum features over the time and frequency dimensions reduces their dimensionality while preventing the loss of information in the spectrum features.
S43, determining a first channel feature matrix and a second channel feature matrix based on the first stretching spectrum feature and the second stretching spectrum feature, and obtaining a first dimension reduction feature and a second dimension reduction feature according to the first channel feature matrix and the second channel feature matrix;
as shown in fig. 5, the step S43 includes:
S431, sequentially carrying out one-dimensional convolution, excitation operation and channel compression on the first stretched spectrum feature and the second stretched spectrum feature to obtain a first excitation feature matrix and a second excitation feature matrix;
specifically, in step S431, a sigmoid function is used to perform the excitation operation on the first stretched spectrum feature and the second stretched spectrum feature after the one-dimensional convolution, and a channel compression operation is then performed on the results, so that the channel numbers of the first excitation feature matrix and the second excitation feature matrix are the same as those of the first spectrum feature and the second spectrum feature, respectively.
S432, performing feature recombination on the first excitation feature matrix and the second excitation feature matrix to obtain a first channel feature matrix and a second channel feature matrix;
S433, carrying out element multiplication on the elements in the first stretched spectrum feature and the elements in the first channel feature matrix to obtain a plurality of first multiplication elements, carrying out feature connection on the plurality of first multiplication elements in residual form to obtain the first dimension reduction feature, carrying out element multiplication on the elements in the second stretched spectrum feature and the elements in the second channel feature matrix to obtain a plurality of second multiplication elements, and carrying out feature connection on the plurality of second multiplication elements in residual form to obtain the second dimension reduction feature.
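Steps S431-S433 read like a squeeze-and-excitation style channel attention pipeline: convolve, excite with a sigmoid, compress back to the original channel count, weight the stretched feature element-wise, and close with a residual connection. The sketch below models the one-dimensional convolution and the channel compression as channel-mixing matrices, which is an assumption made for illustration; the patent does not give layer shapes or weights:

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    """Sigmoid excitation, as named in step S431."""
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention_block(stretched: np.ndarray,
                            w_conv: np.ndarray,
                            w_compress: np.ndarray) -> np.ndarray:
    """Sketch of S431-S433 with assumed shapes:
    stretched: (C, L) stretched spectrum feature
    w_conv:    (C_mid, C) stand-in for the 1-D convolution weights
    w_compress:(C, C_mid) channel compression back to C channels"""
    excited = sigmoid(w_conv @ stretched)   # 1-D conv + sigmoid excitation
    channel_feat = w_compress @ excited     # channel compression -> (C, L)
    weighted = stretched * channel_feat     # element-wise multiplication
    return stretched + weighted             # residual feature connection
```

Because the compression restores the original channel count C, the element-wise product and the residual addition are shape-compatible with the stretched feature, matching the requirement stated after step S431.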
S5, inputting the normalized first dimension reduction features and the corresponding classification labels into a preset CNN neural network recognition model to train the preset CNN neural network recognition model, and inputting the normalized second dimension reduction features into the trained preset CNN neural network recognition model to perform classification recognition so as to output the classification labels corresponding to the classified audio data;
specifically, the first dimension reduction features corresponding to the training audio data are used to train the preset CNN neural network recognition model, thereby further improving the accuracy of model recognition; the second dimension reduction features corresponding to the classified audio data are then input into the trained preset CNN neural network recognition model for audio classification recognition, and the classification labels corresponding to the classified audio data are output.
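The train-then-classify flow of step S5 can be illustrated with a deliberately tiny stand-in model. The patent's method uses a preset CNN neural network recognition model; `TinyRecognizer` below (a nearest-centroid classifier over normalized dimension reduction features) is only a placeholder to show how training features with labels fit the model and classification features are then labeled:

```python
import numpy as np

class TinyRecognizer:
    """Placeholder for the 'preset CNN recognition model': stores one
    centroid per classification label and predicts the nearest one.
    Purely illustrative; the real method trains a CNN."""

    def fit(self, feats: np.ndarray, labels: np.ndarray) -> "TinyRecognizer":
        # feats: (N, D) normalized first dimension reduction features
        # labels: (N,) classification labels for the training audio
        self.centroids = {l: feats[labels == l].mean(axis=0)
                          for l in np.unique(labels)}
        return self

    def predict(self, feat: np.ndarray):
        # feat: (D,) normalized second dimension reduction feature
        return min(self.centroids,
                   key=lambda l: np.linalg.norm(feat - self.centroids[l]))
```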
The advantages of this embodiment are as follows: the training audio data comprises training audio and the classification labels corresponding to the training audio and is used for training the model, while the classified audio data is the audio to be classified. The training audio data and the classified audio data are preprocessed to obtain a first mel spectrogram of the training audio data and a second mel spectrogram of the classified audio data respectively, so that the features of the audio are extracted through the mel spectrograms. The first mel spectrogram and the second mel spectrogram are then respectively subjected to semantic segmentation and recombination to obtain a first spectrum feature and a second spectrum feature; performing semantic segmentation and recombination on the mel spectrograms reduces the influence of noise on the audio data, so that the semantic information of the acoustic scene can be better exploited, the robustness of the model is enhanced, and the accuracy of model classification is improved. The first spectrum feature and the second spectrum feature are respectively subjected to multidimensional convolution dimension reduction processing to obtain a first dimension reduction feature and a second dimension reduction feature, and the two dimension reduction features are normalized. Finally, the normalized first dimension reduction features and the corresponding classification labels are input into the preset CNN neural network recognition model for training, and the normalized second dimension reduction features are input into the trained preset CNN neural network recognition model for classification recognition, so that the classification labels corresponding to the classified audio data are output. In summary, the invention performs semantic segmentation and recombination on the mel spectrogram to improve the accuracy of model classification, and processes the spectrum features in a multidimensional convolution dimension reduction mode to further improve the accuracy of model classification.
Embodiment Two
As shown in fig. 6, in a second embodiment of the present invention, there is provided an audio classification recognition system including:
the acquisition module 1 is used for acquiring training audio data and classified audio data, wherein the training audio data comprises training audio and classification labels corresponding to the training audio;
the processing module 2 is used for preprocessing the training audio data and the classified audio data to respectively obtain a first mel spectrogram of the training audio data and a second mel spectrogram of the classified audio data;
the segmentation module 3 is used for performing semantic segmentation and recombination on the first mel spectrogram and the second mel spectrogram respectively so as to obtain a first frequency spectrum characteristic and a second frequency spectrum characteristic;
the dimension reduction module 4 is configured to perform multidimensional convolution dimension reduction processing on the first spectral feature and the second spectral feature respectively, so as to obtain a first dimension reduction feature and a second dimension reduction feature, and perform normalization processing on the first dimension reduction feature and the second dimension reduction feature;
the classification module 5 is configured to input the normalized first dimension reduction feature and the corresponding classification label into a preset CNN neural network recognition model to train the preset CNN neural network recognition model, and input the normalized second dimension reduction feature into the trained preset CNN neural network recognition model to perform classification recognition so as to output the classification label corresponding to the classified audio data;
Wherein the segmentation module 3 comprises:
the segmentation submodule is used for sequentially segmenting the first Mel spectrogram and the second Mel spectrogram in the time direction to respectively obtain a first spectrum semantic segment group and a second spectrum semantic segment group;
the quantity calculation sub-module is used for respectively acquiring the number of first segments in the first spectrum semantic segment group and the number of second segments in the second spectrum semantic segment group;
the establishing sub-module is used for respectively establishing a first sorting group and a second sorting group based on the first segment number and the second segment number, wherein the numbers of elements in the first sorting group and the second sorting group are the same as the first segment number and the second segment number, respectively;
and the recombination submodule is used for carrying out semantic recombination on the first spectrum semantic segment group and the second spectrum semantic segment group according to the first sequencing group and the second sequencing group so as to obtain a first spectrum characteristic and a second spectrum characteristic.
The reorganization submodule comprises:
the first ordering unit is used for randomly ordering the elements in the first ordering group and the second ordering group to obtain a first random ordering group and a second random ordering group;
The second sorting unit is used for carrying out index sorting on the semantic segments in the first spectrum semantic segment group according to the sorting relation of the first random sorting group to obtain a first sorting semantic segment group, and carrying out index sorting on the semantic segments in the second spectrum semantic segment group according to the sorting relation of the second random sorting group to obtain a second sorting semantic segment group;
the splicing unit is used for carrying out semantic splicing on the first ordering semantic segment group and the second ordering semantic segment group in a semantic dimension so as to obtain a first spectrum feature and a second spectrum feature.
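The segmentation module's pipeline above — time-direction segmentation, a sorting group the same length as the segment count, random ordering, index sorting, and semantic splicing — can be sketched as follows. Segment counts, the seeding parameter, and function names are illustrative assumptions:

```python
import numpy as np

def segment_and_reorganize(mel_spec: np.ndarray, num_segments: int, rng=None):
    """Cut a mel spectrogram (freq x time) into segments along the time
    axis, build an index (sorting) group of the same length, shuffle it
    randomly, reorder the segments by the shuffled indices, and splice
    them back together along the time dimension."""
    rng = np.random.default_rng(rng)
    segments = np.array_split(mel_spec, num_segments, axis=1)  # time-direction segmentation
    order = rng.permutation(len(segments))                     # random ordering group
    reordered = [segments[i] for i in order]                   # index sorting
    return np.concatenate(reordered, axis=1), order            # semantic splicing
```

Because the operation only permutes segments, the recombined feature keeps the same shape and the same total energy as the original spectrogram, while breaking up localized noise patterns.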
The dimension reduction module 4 comprises:
the extraction submodule is used for respectively carrying out two-dimensional convolution feature extraction on the first frequency spectrum feature and the second frequency spectrum feature to obtain a first high-dimensional frequency spectrum feature and a second high-dimensional frequency spectrum feature;
the dimension reduction sub-module is used for carrying out stretching dimension reduction on the first high-dimensional frequency spectrum characteristic and the second high-dimensional frequency spectrum characteristic in the time dimension and the frequency dimension so as to obtain a first stretching frequency spectrum characteristic and a second stretching frequency spectrum characteristic;
the matrix determining submodule is used for determining a first channel feature matrix and a second channel feature matrix based on the first stretching frequency spectrum feature and the second stretching frequency spectrum feature, and obtaining a first dimension reduction feature and a second dimension reduction feature according to the first channel feature matrix and the second channel feature matrix.
The matrix determination submodule includes:
the characteristic processing unit is used for carrying out one-dimensional convolution, excitation operation and channel compression on the first stretching spectrum characteristic and the second stretching spectrum characteristic in sequence so as to obtain a first excitation characteristic matrix and a second excitation characteristic matrix;
the characteristic recombination unit is used for carrying out characteristic recombination on the first excitation characteristic matrix and the second excitation characteristic matrix to obtain a first channel characteristic matrix and a second channel characteristic matrix;
the multiplication unit is used for carrying out element multiplication on the elements in the first stretched spectrum feature and the elements in the first channel feature matrix to obtain a plurality of first multiplication elements, carrying out feature connection on the plurality of first multiplication elements in residual form to obtain the first dimension reduction feature, carrying out element multiplication on the elements in the second stretched spectrum feature and the elements in the second channel feature matrix to obtain a plurality of second multiplication elements, and carrying out feature connection on the plurality of second multiplication elements in residual form to obtain the second dimension reduction feature.
In other embodiments of the present invention, a computer is provided, including a memory 102, a processor 101, and a computer program stored in the memory 102 and executable on the processor 101, wherein the processor 101 implements the above audio classification recognition method when executing the computer program.
In particular, the processor 101 may include a Central Processing Unit (CPU), or an application specific integrated circuit (Application Specific Integrated Circuit, abbreviated as ASIC), or may be configured to implement one or more integrated circuits of embodiments of the present application.
Memory 102 may include mass storage for data or instructions. By way of example, and not limitation, memory 102 may comprise a hard disk drive (Hard Disk Drive, abbreviated as HDD), a floppy disk drive, a solid state drive (Solid State Drive, abbreviated as SSD), flash memory, an optical disk, a magneto-optical disk, magnetic tape, or a universal serial bus (Universal Serial Bus, abbreviated as USB) drive, or a combination of two or more of these. Memory 102 may include removable or non-removable (or fixed) media, where appropriate. The memory 102 may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, the memory 102 is a non-volatile memory. In a particular embodiment, the memory 102 includes read-only memory (Read-Only Memory, abbreviated as ROM) and random access memory (Random Access Memory, abbreviated as RAM). Where appropriate, the ROM may be a mask-programmed ROM, a programmable ROM (Programmable Read-Only Memory, abbreviated as PROM), an erasable PROM (Erasable Programmable Read-Only Memory, abbreviated as EPROM), an electrically erasable PROM (Electrically Erasable Programmable Read-Only Memory, abbreviated as EEPROM), an electrically rewritable ROM (Electrically Alterable Read-Only Memory, abbreviated as EAROM), or flash memory (FLASH), or a combination of two or more of these. Where appropriate, the RAM may be static random access memory (Static Random-Access Memory, abbreviated as SRAM) or dynamic random access memory (Dynamic Random Access Memory, abbreviated as DRAM), where the DRAM may be fast page mode dynamic random access memory (Fast Page Mode Dynamic Random Access Memory, abbreviated as FPMDRAM), extended data out dynamic random access memory (Extended Data Out Dynamic Random Access Memory, abbreviated as EDODRAM), synchronous dynamic random access memory (Synchronous Dynamic Random-Access Memory, abbreviated as SDRAM), or the like.
Memory 102 may be used to store or cache various data files that need to be processed and/or communicated, as well as possible computer program instructions for execution by processor 101.
The processor 101 implements the above-described audio classification recognition method by reading and executing computer program instructions stored in the memory 102.
In some of these embodiments, the computer may also include a communication interface 103 and a bus 100. As shown in fig. 7, the processor 101, the memory 102, and the communication interface 103 are connected to each other via the bus 100 and perform communication with each other.
The communication interface 103 is used to implement communication between modules, devices, units, and/or sub-units in the embodiments of the present application. The communication interface 103 may also enable data communication with other components, such as external devices, image/data acquisition devices, databases, external storage, and image/data processing workstations.
Bus 100 includes hardware, software, or both, coupling components of a computer to each other. Bus 100 includes, but is not limited to, at least one of: data Bus (Data Bus), address Bus (Address Bus), control Bus (Control Bus), expansion Bus (Expansion Bus), local Bus (Local Bus). By way of example, and not limitation, bus 100 may include a graphics acceleration interface (Accelerated Graphics Port), abbreviated AGP, or other graphics Bus, an enhanced industry standard architecture (Extended Industry Standard Architecture, abbreviated EISA) Bus, a Front Side Bus (FSB), a HyperTransport (HT) interconnect, an industry standard architecture (Industry Standard Architecture, ISA) Bus, a wireless bandwidth (InfiniBand) interconnect, a Low Pin Count (LPC) Bus, a memory Bus, a micro channel architecture (Micro Channel Architecture, abbreviated MCa) Bus, a peripheral component interconnect (Peripheral Component Interconnect, abbreviated PCI) Bus, a PCI-Express (PCI-X) Bus, a serial advanced technology attachment (Serial Advanced Technology Attachment, abbreviated SATA) Bus, a video electronics standards association local (Video Electronics Standards Association Local Bus, abbreviated VLB) Bus, or other suitable Bus, or a combination of two or more of the foregoing. Bus 100 may include one or more buses, where appropriate. Although embodiments of the present application describe and illustrate a particular bus, the present application contemplates any suitable bus or interconnect.
Based on the above audio classification recognition system, the computer can execute the audio classification recognition method, thereby realizing the classification recognition of audio.
In still other embodiments of the present invention, in combination with the above audio classification recognition method, a readable storage medium is provided, having a computer program stored thereon, which, when executed by a processor, implements the above audio classification recognition method.
Those of skill in the art will appreciate that the logic and/or steps represented in the flow diagrams or otherwise described herein, e.g., an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). In addition, the computer readable medium may even be paper or other suitable medium on which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, the steps may be implemented using any one or a combination of the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGA), field programmable gate arrays (FPGA), and the like.
The technical features of the above-described embodiments may be combined arbitrarily; for brevity of description, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction between the combinations of technical features, they should be considered to fall within the scope of this description.
The above examples merely represent a few embodiments of the present application; although they are described in considerable detail, they are not to be construed as limiting the scope of the invention. It should be noted that those skilled in the art could make various modifications and improvements without departing from the spirit of the present application, all of which fall within the protection scope of the present application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims (10)

1. A method of audio classification recognition, the method comprising:
acquiring training audio data and classified audio data, wherein the training audio data comprises training audio and a classified label corresponding to the training audio;
preprocessing the training audio data and the classified audio data to obtain a first mel spectrogram of the training audio data and a second mel spectrogram of the classified audio data respectively;
Respectively carrying out semantic segmentation and recombination on the first Mel spectrogram and the second Mel spectrogram to obtain a first frequency spectrum characteristic and a second frequency spectrum characteristic;
respectively carrying out multidimensional convolution dimension reduction processing on the first frequency spectrum feature and the second frequency spectrum feature to obtain a first dimension reduction feature and a second dimension reduction feature, and carrying out normalization processing on the first dimension reduction feature and the second dimension reduction feature;
inputting the normalized first dimension reduction features and the corresponding classification labels into a preset CNN neural network recognition model to train the preset CNN neural network recognition model, and inputting the normalized second dimension reduction features into the trained preset CNN neural network recognition model to carry out classification recognition so as to output the classification labels corresponding to the classified audio data.
2. The audio classification recognition method of claim 1, wherein the step of preprocessing the training audio data and the classified audio data comprises:
and sequentially carrying out pre-emphasis, framing, windowing, discrete Fourier transform and mel filter bank filtering on the training audio data and the classified audio data.
3. The method of claim 1, wherein the step of performing semantic segmentation and reconstruction on the first mel-spectrogram and the second mel-spectrogram to obtain a first spectral feature and a second spectral feature comprises:
sequentially dividing the first Mel spectrogram and the second Mel spectrogram in the time direction to obtain a first spectrum semantic fragment group and a second spectrum semantic fragment group respectively;
respectively acquiring the first segment number in the first spectrum semantic segment group and the second segment number in the second spectrum semantic segment group;
establishing a first sorting group and a second sorting group based on the first segment number and the second segment number respectively, wherein the numbers of elements in the first sorting group and the second sorting group are the same as the first segment number and the second segment number, respectively;
and carrying out semantic recombination on the first spectrum semantic segment group and the second spectrum semantic segment group according to the first sequencing group and the second sequencing group so as to obtain a first spectrum characteristic and a second spectrum characteristic.
4. The method of claim 3, wherein the step of semantically reorganizing the first set of spectral semantic segments and the second set of spectral semantic segments according to the first ordered set and the second ordered set to obtain a first spectral feature and a second spectral feature comprises:
Randomly sorting the elements in the first sorting group and the second sorting group to obtain a first random sorting group and a second random sorting group;
indexing and sorting the semantic segments in the first spectrum semantic segment group according to the sorting relation of the first random sorting group to obtain a first sorting semantic segment group, and indexing and sorting the semantic segments in the second spectrum semantic segment group according to the sorting relation of the second random sorting group to obtain a second sorting semantic segment group;
and performing semantic stitching on the first ordered semantic segment group and the second ordered semantic segment group in a semantic dimension to obtain a first spectrum feature and a second spectrum feature.
5. The method of claim 1, wherein the step of performing multidimensional convolution dimension reduction processing on the first spectral feature and the second spectral feature to obtain a first dimension reduction feature and a second dimension reduction feature includes:
performing two-dimensional convolution feature extraction on the first frequency spectrum feature and the second frequency spectrum feature respectively to obtain a first high-dimensional frequency spectrum feature and a second high-dimensional frequency spectrum feature;
Stretching and dimension-reducing the first high-dimensional frequency spectrum characteristic and the second high-dimensional frequency spectrum characteristic in the time dimension and the frequency dimension to obtain a first stretched frequency spectrum characteristic and a second stretched frequency spectrum characteristic;
and determining a first channel feature matrix and a second channel feature matrix based on the first stretching spectrum feature and the second stretching spectrum feature, and obtaining a first dimension reduction feature and a second dimension reduction feature according to the first channel feature matrix and the second channel feature matrix.
6. The method of claim 5, wherein the steps of determining a first channel feature matrix and a second channel feature matrix based on the first stretched spectral feature and the second stretched spectral feature, and obtaining a first dimension reduction feature and a second dimension reduction feature from the first channel feature matrix and the second channel feature matrix comprise:
sequentially carrying out one-dimensional convolution, excitation operation and channel compression on the first stretching spectrum characteristic and the second stretching spectrum characteristic to obtain a first excitation characteristic matrix and a second excitation characteristic matrix;
performing feature recombination on the first excitation feature matrix and the second excitation feature matrix to obtain a first channel feature matrix and a second channel feature matrix;
Performing element multiplication on the elements in the first stretched spectrum feature and the elements in the first channel feature matrix to obtain a plurality of first multiplication elements, performing feature connection on the plurality of first multiplication elements in residual form to obtain the first dimension reduction feature, performing element multiplication on the elements in the second stretched spectrum feature and the elements in the second channel feature matrix to obtain a plurality of second multiplication elements, and performing feature connection on the plurality of second multiplication elements in residual form to obtain the second dimension reduction feature.
7. The method of claim 1, wherein normalizing the first dimension-reduction feature and the second dimension-reduction feature comprises:
and respectively carrying out normalization processing on the first dimension reduction feature and the second dimension reduction feature through a normalization formula, wherein the normalization formula is as follows:
y′ = (y − min(y)) / (max(y) − min(y))
where y represents the feature value before normalization, min(y) represents the minimum of the feature values before normalization, max(y) represents the maximum of the feature values before normalization, and y′ represents the normalized feature value.
8. An audio classification recognition system, the system comprising:
The acquisition module is used for acquiring training audio data and classified audio data, wherein the training audio data comprises training audio and classification labels corresponding to the training audio;
the processing module is used for preprocessing the training audio data and the classified audio data to respectively obtain a first Mel spectrogram of the training audio data and a second Mel spectrogram of the classified audio data;
the segmentation module is used for carrying out semantic segmentation and recombination on the first Mel spectrogram and the second Mel spectrogram respectively so as to obtain a first frequency spectrum characteristic and a second frequency spectrum characteristic;
the dimension reduction module is used for respectively carrying out multidimensional convolution dimension reduction processing on the first frequency spectrum feature and the second frequency spectrum feature to obtain a first dimension reduction feature and a second dimension reduction feature, and carrying out normalization processing on the first dimension reduction feature and the second dimension reduction feature;
the classification module is used for inputting the normalized first dimension reduction features and the corresponding classification labels into a preset CNN neural network recognition model to train the preset CNN neural network recognition model, and inputting the normalized second dimension reduction features into the trained preset CNN neural network recognition model to carry out classification recognition so as to output the classification labels corresponding to the classified audio data.
9. A computer comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the audio classification recognition method of any of claims 1 to 7 when the computer program is executed.
10. A readable storage medium, characterized in that the readable storage medium has stored thereon a computer program which, when executed by a processor, implements the audio classification recognition method according to any one of claims 1 to 7.
CN202310323314.4A 2023-03-30 2023-03-30 Audio classification recognition method, system, computer and readable storage medium Pending CN116030800A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310323314.4A CN116030800A (en) 2023-03-30 2023-03-30 Audio classification recognition method, system, computer and readable storage medium


Publications (1)

Publication Number Publication Date
CN116030800A true CN116030800A (en) 2023-04-28



Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110085218A (en) * 2019-03-26 2019-08-02 天津大学 A kind of audio scene recognition method based on feature pyramid network
CN110718234A (en) * 2019-09-02 2020-01-21 江苏师范大学 Acoustic scene classification method based on semantic segmentation coding and decoding network
CN110782878A (en) * 2019-10-10 2020-02-11 天津大学 Attention mechanism-based multi-scale audio scene recognition method
CN111243619A (en) * 2020-01-06 2020-06-05 平安科技(深圳)有限公司 Training method and device for voice signal segmentation model and computer equipment
CN112700794A (en) * 2021-03-23 2021-04-23 北京达佳互联信息技术有限公司 Audio scene classification method and device, electronic equipment and storage medium
CN113793624A (en) * 2021-06-11 2021-12-14 上海师范大学 Acoustic scene classification method
CN114373476A (en) * 2022-01-11 2022-04-19 江西师范大学 Sound scene classification method based on multi-scale residual attention network
CN114882909A (en) * 2022-04-18 2022-08-09 珠海高凌信息科技股份有限公司 Environmental sound classification analysis method, device and medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHAO Jingqiao, "Research on Acoustic Scene Classification Methods" (in Chinese), China Master's Theses Full-text Database, Information Science and Technology series, pages 1 - 69 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116758938A (en) * 2023-08-21 2023-09-15 硕橙(厦门)科技有限公司 Method, device, equipment and medium for positioning audio region of interest of die cutting machine
CN116758938B (en) * 2023-08-21 2023-11-14 硕橙(厦门)科技有限公司 Method, device, equipment and medium for positioning audio region of interest of die cutting machine
CN116863957A (en) * 2023-09-05 2023-10-10 硕橙(厦门)科技有限公司 Method, device, equipment and storage medium for identifying operation state of industrial equipment
CN116863957B (en) * 2023-09-05 2023-12-12 硕橙(厦门)科技有限公司 Method, device, equipment and storage medium for identifying operation state of industrial equipment

Similar Documents

Publication Publication Date Title
CN116030800A (en) Audio classification recognition method, system, computer and readable storage medium
EP2695160B1 (en) Speech syllable/vowel/phone boundary detection using auditory attention cues
US10380995B2 (en) Method and device for extracting speech features based on artificial intelligence
EP3839942A1 (en) Quality inspection method, apparatus, device and computer storage medium for insurance recording
US7082394B2 (en) Noise-robust feature extraction using multi-layer principal component analysis
CN111696640A (en) Method, device and storage medium for automatically acquiring medical record template
CN108335694B (en) Far-field environment noise processing method, device, equipment and storage medium
WO2021179717A1 (en) Speech recognition front-end processing method and apparatus, and terminal device
CN108172213A (en) Tender asthma audio identification methods, device, equipment and computer-readable medium
CN110755108A (en) Heart sound classification method, system and device based on intelligent stethoscope and readable storage medium
CN108922515A (en) Speech model training method, audio recognition method, device, equipment and medium
CN109147798B (en) Speech recognition method, device, electronic equipment and readable storage medium
CN113610859B (en) Automatic thyroid nodule segmentation method based on ultrasonic image
CN113450822B (en) Voice enhancement method, device, equipment and storage medium
CN109584888A (en) Whistle recognition methods based on machine learning
Chato et al. Wavelet transform to improve accuracy of a prediction model for overall survival time of brain tumor patients based on MRI images
CN115035887A (en) Voice signal processing method, device, equipment and medium
CN116631380A (en) Method and device for waking up audio and video multi-mode keywords
CN116844567A (en) Depth synthesis audio detection method and system based on multi-feature reconstruction fusion
CN114944152A (en) Vehicle whistling sound identification method
CN113688263B (en) Method, computing device, and storage medium for searching for image
CN115565548A (en) Abnormal sound detection method, abnormal sound detection device, storage medium and electronic equipment
CN114202723A (en) Intelligent editing application method, device, equipment and medium through picture recognition
Zhao et al. A model of co-saliency based audio attention
CN114067831A (en) End-to-end sound recording equipment source identification method, identification system and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Qiu Xiaojian; Cui Ren; Qiu Zhengfeng; Lian Feng; Wu Dingyuan

Inventor before: Qiu Xiaojian; Lian Feng; Qiu Zhengfeng; Cui Ren; Wu Dingyuan
