CN113782051B - Broadcast effect classification method and system, electronic equipment and storage medium - Google Patents

Info

Publication number: CN113782051B
Application number: CN202110858717.XA
Authority: CN (China)
Prior art keywords: audio, feature, layer, channel, determining
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN113782051A
Inventors: 王方圆, 王欣盛
Current assignee: Beijing Zhongke Mosi Technology Co ltd
Application filed by Beijing Zhongke Mosi Technology Co ltd
Priority: CN202110858717.XA; application granted; published as CN113782051A and CN113782051B

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 — characterised by the type of extracted parameters
    • G10L 25/24 — the extracted parameters being the cepstrum
    • G10L 25/48 — specially adapted for particular use
    • G10L 25/51 — for comparison or discrimination
    • G10L 25/60 — for measuring the quality of voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a broadcast effect classification method and system, an electronic device and a storage medium, wherein the method comprises the following steps: based on the broadcast audio data to be classified, determining initial audio features according to a Fourier transform; and inputting the initial audio features into a broadcast effect classification model to determine a target broadcast effect category. The method can improve the degree of automation and the accuracy of broadcast effect classification.

Description

Broadcast effect classification method and system, electronic equipment and storage medium
Technical Field
The present invention relates to the field of intelligent audio analysis technologies, and in particular, to a broadcast effect classification method and system, an electronic device, and a storage medium.
Background
Medium- and short-wave broadcasts propagate over long distances and serve as a borderless publicity platform. To assess the landing (reception) effect of medium-short wave broadcasts, the broadcast frequencies need to be accurately classified. Because the broadcast signal is an analog signal and the transmission path is complex, classifying the landing audio effect is difficult.
In the past, medium-short wave broadcast classification has relied heavily on manual monitoring, so classification accuracy depends mainly on the skill and experience of the personnel. Because the channel environment of medium-short wave broadcasting is complex, the signal-to-noise ratio is often low, and long monitoring sessions damage the hearing of the operators on duty; technical means therefore need to be introduced to raise the level of intelligence of medium-short wave broadcast effect classification.
In the prior art, medium-short wave broadcast effect classification requires audio comparison against a reference source signal, which is cumbersome and poorly automated; otherwise the broadcast effect must be classified manually, so that the classification accuracy depends on personal experience and is poor.
Therefore, how to provide a broadcast effect classification method and system that improve the degree of automation and accuracy of broadcast effect classification is a problem to be solved.
Disclosure of Invention
Aiming at the defects in the prior art, the embodiment of the invention provides a broadcast effect classification method and system, electronic equipment and storage medium, and at least solves the technical problems of low automation degree and poor accuracy in the process of classifying broadcast effects.
Provided is a broadcast effect classification method, including:
based on the broadcast audio data to be classified, determining initial audio characteristics according to Fourier transformation;
and inputting the initial audio features into a broadcast effect classification model to determine a target broadcast effect class.
According to the broadcast effect classification method provided by the invention, the method for determining the initial audio characteristics based on the broadcast audio data to be classified according to the Fourier transform specifically comprises the following steps:
Based on the broadcast audio data to be classified, determining an audio frame data set according to a preset framing rule; wherein adjacent frames in the audio frame dataset are continuous and non-overlapping;
and converting time domain information in each frame in the audio frame data set into cepstral frequency information based on the Fourier transform, and determining the initial audio characteristics.
According to the broadcast effect classification method provided by the invention, the broadcast effect classification model comprises the following steps: an audio feature dimension reduction layer, a model attention layer and an effect classification layer;
inputting the initial audio features into a broadcast effect classification model, and determining a target broadcast effect category, wherein the method specifically comprises the following steps:
inputting the initial audio features into the audio feature dimension reduction layer, and determining low-dimensional audio features based on the audio feature dimension reduction layer;
inputting the low-dimensional audio features into the model attention layer, and determining target audio features according to an attention mechanism based on the model attention layer;
and inputting the target audio features into the effect classification layer, and determining the target broadcast effect category based on the effect classification layer.
According to the broadcast effect classification method provided by the invention, the model attention layer comprises the following steps: a temporal attention layer, a channel attention layer, a self attention layer, and a feature fusion layer;
The inputting the low-dimensional audio features into the model attention layer, and determining target audio features according to an attention mechanism based on the model attention layer specifically comprises the following steps:
inputting the low-dimensional audio features into the time attention layer, and determining a first audio feature according to a time attention mechanism based on the time attention layer;
inputting the first audio feature into the channel attention layer, and determining a second audio feature according to a channel attention mechanism based on the channel attention layer;
inputting the second audio feature into the self-attention layer, determining a third audio feature according to a self-attention mechanism based on the self-attention layer;
inputting the first audio feature, the second audio feature and the third audio feature into the feature fusion layer, and determining the target audio feature based on the feature fusion layer.
According to the broadcast effect classification method provided by the invention, the channel attention layer comprises the following steps: a first channel attention layer, a second channel attention layer, and a channel feature fusion layer;
the inputting the first audio feature into the channel attention layer, and determining a second audio feature according to a channel attention mechanism based on the channel attention layer, specifically including:
Inputting the first audio feature into the first channel attention layer, and determining a first channel audio feature according to a channel attention mechanism based on the first channel attention layer;
inputting the first channel audio feature into the second channel attention layer, and determining a second channel audio feature according to a channel attention mechanism based on the second channel attention layer;
inputting the first channel audio feature and the second channel audio feature into the channel feature fusion layer, and determining the second audio feature based on the channel feature fusion layer.
According to the broadcast effect classification method provided by the invention, the time attention layer comprises the following steps: a first feature segmentation layer, a first feature aggregation layer, and a temporal sub-attention layer;
the step of inputting the low-dimensional audio features into the time attention layer, and determining a first audio feature according to a time attention mechanism based on the time attention layer specifically comprises the following steps:
inputting the low-dimensional audio features into the first feature segmentation layer, and determining a first initial feature set according to a first preset feature channel segmentation rule based on the first feature segmentation layer; wherein the first initial feature set comprises a plurality of time sub-features;
Inputting the first initial feature set into the first feature aggregation layer, processing the time sub-features according to a first preset feature processing rule based on the first feature aggregation layer, and aggregating the processed time sub-features to determine a first initial time feature;
inputting the first initial temporal feature into the temporal sub-attention layer, and determining the first audio feature according to a temporal attention mechanism based on the temporal sub-attention layer.
According to the broadcast effect classification method provided by the invention, the channel attention layer comprises the following steps: a second feature segmentation layer, a second feature aggregation layer, and a channel sub-attention layer;
the inputting the first audio feature into the channel attention layer, and determining a second audio feature according to a channel attention mechanism based on the channel attention layer, specifically including:
inputting the first audio features into the second feature segmentation layer, and determining a second initial feature set according to a second preset feature channel segmentation rule based on the second feature segmentation layer; wherein the second initial feature set comprises a plurality of channel sub-features;
inputting the second initial feature set into the second feature aggregation layer, processing the channel sub-features according to a second preset feature processing rule based on the second feature aggregation layer, and aggregating the processed channel sub-features to determine second initial channel features;
Inputting the second initial channel feature into the channel sub-attention layer, and determining the second audio feature according to a channel attention mechanism based on the channel sub-attention layer.
The invention also provides a broadcast effect classification system, which comprises: an audio feature determining unit and a broadcast effect classifying unit;
the audio feature determining unit is used for determining initial audio features according to Fourier transformation based on the broadcast audio data to be classified;
the broadcast effect classification unit is used for inputting the initial audio features into a broadcast effect classification model to determine a target broadcast effect category.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the broadcast effect classification method as described in any one of the above when executing the program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the broadcast effect classification method as described in any of the above.
According to the broadcast effect classification method and system, the electronic device and the storage medium, the broadcast audio data to be classified are acquired, the initial audio features are determined according to a Fourier transform, and the target broadcast effect category is determined using a trained broadcast effect classification model. When the broadcast effect is classified automatically, only the audio collected at the receiving/monitoring end needs to be evaluated; no additional audio from the transmitting end or a reference source needs to be acquired or processed, which effectively simplifies the classification procedure and raises its degree of intelligence. The personal experience of classification personnel is no longer required, the classification standard is unified by the neural network, and the accuracy of effect classification is improved.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a broadcast effect classification method provided by the invention;
fig. 2 is a schematic diagram of a broadcast effect classification model structure provided by the present invention;
FIG. 3 is a schematic diagram of a time attention layer structure according to the present invention;
fig. 4 is a schematic flow chart of a broadcast effect classification method according to the present invention;
FIG. 5 is a schematic flow chart of a test method of a broadcast effect classification model according to the present invention;
FIG. 6 is a graph of a human intervention-free interval divided according to an ideal confidence level provided by the invention;
fig. 7 is a schematic structural diagram of a broadcast effect classification system according to the present invention;
fig. 8 is a schematic diagram of an entity structure of an electronic device according to the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Aiming at the medium-short wave broadcasting effect classification in the prior art, two technical routes mainly exist:
the first method is to calculate the similarity between the received audio and the reference source audio based on audio comparison, if the similarity is high, the broadcasting is judged to be normal in the floor, otherwise, the broadcasting of the frequency point is judged to be seriously interfered, and by means of the information of the reference source audio, the method can obtain good evaluation effect.
The second is that the signal transmitting end embeds watermark into the audio frequency, and carries out watermark extraction at the receiving and testing end, if the watermark extraction is normal, the sending is normal. The method does not depend on the reference source audio frequency, but depends on the robustness of a watermarking algorithm seriously, the channel environment of the medium-short wave broadcasting is complex, signals are easy to be interfered by ionospheric reflection, and even if the medium-short wave broadcasting signals are normally transmitted, embedded watermark information is difficult to detect at a receiving and testing end, so the method still needs to be further broken through.
Therefore, in the two types of methods, the first type of method has the precondition of existence of a reference source signal, however, in practice, the problem of missing caused by the lack or interruption of acquisition of the reference source often occurs for some unpredictable reasons, and the audio comparison cannot be completed; the second method does not need to refer to a source signal, but has the defect that the technical robustness is not broken through.
Before describing the present invention in detail, related concepts related to the present invention will be described first.
In medium-short wave broadcast effect evaluation, the sound quality of the object station's broadcast is compared with that of the normal broadcast of the frequency point, the result is expressed intuitively, and the classification rules are formulated accordingly. The broadcast sound quality S1 of the object station and the sound quality S2 of the normal broadcast of the frequency point are evaluated separately and independently.
The broadcast sound quality scores S1 and S2 range from "5" (highest) to "0" (lowest): "5" indicates that the broadcast sound is loud and clear with no noise, and "0" indicates that the corresponding broadcast sound is absent from the audio. The evaluation result of a piece of broadcast audio consists of the comparison between the object station's broadcast sound quality and the frequency point's broadcast sound quality, written as S1/S2, where S1 and S2 satisfy:
S1 + S2 ≤ 5
The larger the value of S1 + S2, the less the noise. For example, an evaluation result of "0/5" indicates that the normal broadcast sound of the frequency point is clear and noise-free and that no broadcast sound of the object station is present; "2/0" indicates that a weak object-station broadcast sound can be heard, the noise is loud, and no normal frequency-point broadcast sound is present. The complete medium-short wave broadcast effect classification rule comparison table is shown in Table 1.
Table 1 is a comparison table of medium-short wave broadcast effect classification rules, and table 1 lists all classification results in the broadcast effect classification.
Table 1: Medium-short wave broadcast effect classification rule comparison table
Among the results, "0/5", "0/4", "0/3", "1/4", "1/3" and "2/3" belong to the qualified category, characterized in that the normal broadcast sound of the frequency point is relatively clear, the object station's broadcast sound is weak or absent, and the noise is low; "5/0", "4/0", "3/0", "4/1", "3/1" and "3/2" belong to the unqualified category, characterized in that the object station's broadcast sound can be heard, the normal broadcast sound of the frequency point is weak, and the noise is low; "0/0", "0/1", "0/2", "1/0", "1/1", "1/2", "2/0", "2/1" and "2/2" belong to the basically qualified category, characterized by loud noise in which neither the object station's nor the frequency point's broadcast sound is clear enough for the content to be heard. For the qualified, unqualified and basically qualified categories, some evaluation results may be re-assigned according to the actual situation; for example, data with a "2/2" result may in practice need to be classified as qualified.
Fig. 1 is a flowchart of a broadcast effect classification method provided by the present invention, and as shown in fig. 1, an embodiment of the present invention provides a broadcast effect classification method, including:
Step S1, based on broadcast audio data to be classified, determining initial audio characteristics according to Fourier transformation;
and S2, inputting the initial audio features into a broadcast effect classification model, and determining a target broadcast effect type.
In particular, the present invention is described taking the classification of the broadcast audio effect of medium-short waves (radio waves with wavelengths of 200 m to 50 m and frequencies of 1500 kHz to 6000 kHz) as an example. It will be appreciated that the method can be widely applied to audio classification and can further be adapted to other fields; the present invention is not limited in this respect.
In step S1, based on the broadcast audio data to be classified, initial audio features are determined according to feature information corresponding to the audio frames extracted by fourier transform.
It will be appreciated that the data processing of the audio signal according to the fourier transform is actually extracting cepstral domain information of the audio signal. In the actual application process of the invention, when audio data are converted into audio features according to Fourier transform, the preset framing rule and window width of Fourier transform can be adjusted according to actual conditions, and the invention is not limited to the above.
In step S2, the initial audio features are input into a trained broadcast effect classification model, and a target broadcast effect class is determined.
It can be understood that the broadcast effect classification model needs to be trained before use; the training samples, training method and specific model structure used can be adjusted according to actual requirements, which is not limited by the present invention.
According to the broadcast effect classification method provided by the invention, the broadcast audio data to be classified are acquired, the initial audio features are determined according to a Fourier transform, and the target broadcast effect category is determined using a trained broadcast effect classification model. When the broadcast effect is classified automatically, only the audio collected at the receiving/monitoring end needs to be evaluated; no additional audio from the transmitting end or a reference source needs to be acquired or processed, which effectively simplifies the classification procedure and raises its degree of intelligence. The personal experience of classification personnel is no longer required, the classification standard is unified by the neural network, and the accuracy of effect classification is improved.
Optionally, according to the broadcast effect classification method provided by the present invention, the determining the initial audio feature based on the broadcast audio data to be classified according to fourier transform specifically includes:
based on the broadcast audio data to be classified, determining an audio frame data set according to a preset framing rule; wherein adjacent frames in the audio frame dataset are continuous and non-overlapping;
And converting time domain information in each frame in the audio frame data set into cepstral frequency information based on the Fourier transform, and determining the initial audio characteristics.
Specifically, based on the broadcast audio data to be classified, determining the initial audio features according to fourier transform specifically includes:
based on the broadcast audio data to be classified, dividing the audio data into a plurality of continuous and non-overlapping audio frames between adjacent frames according to a preset frame dividing rule, and determining an audio frame data set.
For example: the audio data are framed according to the preset framing rule in a non-overlapping manner with one frame per second, and the audio frame data set is determined.
Based on the fourier transform, the time domain information within each frame in the audio frame dataset is converted to cepstral frequency information, and initial audio features are determined.
For example: the data within each frame of the audio frame data set are converted from time-domain information to cepstral-domain information by a short-time Fourier transform (STFT), with a Fourier-transform window width of 25 milliseconds and a window shift of 10 milliseconds; the cepstral frequency information within the frame is then counted to obtain a histogram feature that serves as the initial audio feature input to the classification model.
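A minimal Python sketch of this feature-extraction step, assuming MFCCs (via the librosa library) as the cepstral representation and a simple histogram as the per-frame statistic; the function name, sampling rate and bin count are illustrative assumptions, not values prescribed by the method.

# Illustrative sketch only; the 1 s non-overlapping frames, 25 ms window and
# 10 ms shift follow the example above, the rest is assumed.
import numpy as np
import librosa

def extract_initial_features(path, sr=16000, n_bins=64):
    y, sr = librosa.load(path, sr=sr, mono=True)
    frame_len = sr                      # 1 second, non-overlapping frames
    n_fft = int(0.025 * sr)             # 25 ms analysis window
    hop = int(0.010 * sr)               # 10 ms window shift
    features = []
    for start in range(0, len(y) - frame_len + 1, frame_len):
        frame = y[start:start + frame_len]
        # cepstral-domain representation of the 1 s frame (MFCCs as a stand-in)
        cep = librosa.feature.mfcc(y=frame, sr=sr, n_mfcc=20,
                                   n_fft=n_fft, hop_length=hop)
        # histogram statistics of the in-frame cepstral values
        hist, _ = np.histogram(cep, bins=n_bins, density=True)
        features.append(hist.astype(np.float32))
    if not features:
        return np.empty((0, n_bins), dtype=np.float32)
    return np.stack(features)           # shape: (num_frames, n_bins)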
It can be understood that the framing rule and the specific method of the fourier transform used affect the number of audio features extracted according to the audio frame and the calculation amount of the model, and in the actual application process of the present invention, the preset framing rule and the specific method of the fourier transform used can be set according to the actual requirements, which is not limited in the present invention.
According to the broadcast effect classification method provided by the invention, through presetting a framing rule and Fourier transformation, the high-dimensional initial audio characteristics of the audio data can be extracted, the characteristic information of each aspect of the audio is fully reflected, the characteristic information is used as the input of a broadcast effect classification model to be identified, the target broadcast effect type is determined, and the classification accuracy can be effectively improved. In addition, the classification process does not need to be manually participated, and the deep learning strong modeling capability greatly improves the basic consistency rate of the prediction category and the manual scoring label.
Optionally, according to the broadcast effect classification method provided by the present invention, the broadcast effect classification model includes: an audio feature dimension reduction layer, a model attention layer and an effect classification layer;
inputting the initial audio features into a broadcast effect classification model, and determining a target broadcast effect category, wherein the method specifically comprises the following steps:
Inputting the initial audio features into the audio feature dimension reduction layer, and determining low-dimensional audio features based on the audio feature dimension reduction layer;
inputting the low-dimensional audio features into the model attention layer, and determining target audio features according to an attention mechanism based on the model attention layer;
and inputting the target audio features into the effect classification layer, and determining the target broadcast effect category based on the effect classification layer.
Specifically, the amplitude spectrum of an audio signal is high-dimensional (it mixes human voice, white noise, various frequencies, etc.) and contains redundant information. When classification is performed directly on such features, the classifier cannot extract an effective low-dimensional, de-redundant representation from them, which increases the subsequent computation cost and degrades the overall classification performance. Moreover, in a high-dimensional feature space the sample density of the original data drops sharply, making accurate estimation of the model's classification parameters (i.e., its decision boundary) more difficult.
In order to solve the problems, the invention reduces the dimension of the initial audio feature and determines the target audio feature through an attention mechanism. By means of the method for gathering the channel and the context correlation characteristics in the audio, the classification accuracy is improved.
The broadcast effect classification model includes: an audio feature dimension reduction layer, a model attention layer and an effect classification layer. Inputting the initial audio characteristics into a broadcast effect classification model, and determining a target broadcast effect type, wherein the method specifically comprises the following steps:
and inputting the initial audio features into an audio feature dimension reduction layer, and reducing the dimension of the initial audio data based on the audio feature dimension reduction layer to complete low-dimension and redundancy-free representation of the audio amplitude spectrum, thereby determining the low-dimension audio features.
For example: fig. 2 is a schematic structural diagram of the broadcast effect classification model provided by the present invention. As shown in fig. 2, features with high discrimination and translational invariance can be extracted from the mixed frequency-domain signal by a convolutional neural network; a convolutional layer with a 1×1 kernel combined with a ReLU activation function applies a nonlinear transformation to the audio features, and the initial audio features are reduced in dimension to obtain 512-dimensional low-dimensional audio feature vectors.
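A minimal PyTorch sketch of the dimension-reduction idea described above (a 1×1 convolution followed by ReLU that yields 512-dimensional features); the input layout (channels × time) and the channel counts are assumptions for illustration.

import torch
import torch.nn as nn

class AudioFeatureReduction(nn.Module):
    def __init__(self, in_channels, out_channels=512):
        super().__init__()
        # 1x1 convolution + ReLU: nonlinear transform and channel reduction
        self.proj = nn.Sequential(
            nn.Conv1d(in_channels, out_channels, kernel_size=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # x: (batch, in_channels, time) -> (batch, 512, time)
        return self.proj(x)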
It can be understood that, when the initial audio data is subjected to dimension reduction to determine the low-dimensional audio data, the specific structure of the used model and the dimension change rule can be adjusted according to the actual requirement, which is not limited by the invention.
The low-dimensional audio features are input into a model attention layer, and based on the model attention layer, the important feature information in the low-dimensional audio features is highlighted according to an attention mechanism, so that the target audio features are determined.
It will be appreciated that various types of attention mechanism exist; the type of attention mechanism used in the actual application of the present invention can be chosen according to the actual situation, and the present invention is not limited in this respect.
And inputting the target audio features into an effect classification layer, classifying the target audio features based on the effect classification layer, and determining the target broadcast effect category corresponding to the broadcast audio data to be classified.
It can be understood that the target broadcast effect category to be output may be the category with the largest predicted probability; alternatively, all categories and their corresponding probabilities may be output, or a preset number of categories may be output (for example, the three categories with the highest probabilities). The specific output form can be set according to actual requirements, which is not limited by the present invention.
It should be noted that, the specific category that the effect classification layer can classify is determined when the broadcast effect classification model is trained, and when the sample set of the broadcast effect classification model is determined, the category of the sample data is determined, and the specific category can be set according to the actual situation, which is not limited in the invention.
Further, it will be appreciated that, depending on the sources of the training data, the effect classification layer may adopt a multi-expert classification model, in which several classifiers are trained in parallel and integrated into a single task in a non-linear manner.
The multi-expert classification model is characterized in that a plurality of different classification models are separately trained according to different data sources, each classification model is called an expert, and a gating model is trained to select and call different expert classification models. The final classification result is a weighted combination of the individual "expert" classification models and the gating model.
In the practical application of the invention, the specific structure of the model can be based on a ResNet model, a Res2Net model, a self-attention network model, a VGGNet model, an AlexNet model, a GoogleNet model, etc.; the effect classification layer can be based on an MoE model, a fully connected model, etc.; and a focal loss (Focal Loss) function, a cross-entropy loss function, etc., can be used to train the model. The specific structure and training method of the broadcast effect classification model can be set according to the actual situation, and the present invention is not limited in this respect.
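A minimal PyTorch sketch of the multi-expert ("MoE"-style) classification head mentioned above, assuming a simple softmax gating network that weights the expert outputs; the layer sizes and gating form are illustrative assumptions.

import torch
import torch.nn as nn

class MoEClassifier(nn.Module):
    def __init__(self, feat_dim=512, num_classes=8, num_experts=3):
        super().__init__()
        # one "expert" classifier per training-data source
        self.experts = nn.ModuleList(
            [nn.Linear(feat_dim, num_classes) for _ in range(num_experts)]
        )
        # gating network that decides how much each expert contributes
        self.gate = nn.Linear(feat_dim, num_experts)

    def forward(self, x):
        # x: (batch, feat_dim) target audio features
        gate_w = torch.softmax(self.gate(x), dim=-1)                   # (batch, E)
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, E, C)
        return (gate_w.unsqueeze(-1) * expert_out).sum(dim=1)          # (batch, C)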
According to the broadcast effect classification method provided by the invention, the broadcast effect classification model is constructed to simulate the scoring and classification behavior that humans adopt when evaluating the broadcast effect, so the audio effect evaluation problem is converted into a concrete audio effect classification problem. Reducing the dimension of the initial audio features and combining them with attention mechanisms improves the accuracy of broadcast effect classification. Based on the strong nonlinear modeling capability of the deep neural network, characteristic portraits of the various audio classes are constructed so as to approximate human scoring behavior, and the machine-predicted labels are finally mapped to the human scoring labels.
In addition, the method constructs a broadcast effect classification model to identify the classification result. Compared with other schemes in the prior art, it has fewer external dependencies, supports high-cohesion, low-coupling system integration and flexible plug-in development, does not prescribe a specific deep neural network model, and allows any new model meeting the interface requirements to be plugged in, realizing flexible application of the method.
Optionally, according to the broadcast effect classification method provided by the present invention, the model attention layer includes: a temporal attention layer, a channel attention layer, a self attention layer, and a feature fusion layer;
The inputting the low-dimensional audio features into the model attention layer, and determining target audio features according to an attention mechanism based on the model attention layer specifically comprises the following steps:
inputting the low-dimensional audio features into the time attention layer, and determining a first audio feature according to a time attention mechanism based on the time attention layer;
inputting the first audio feature into the channel attention layer, and determining a second audio feature according to a channel attention mechanism based on the channel attention layer;
inputting the second audio feature into the self-attention layer, determining a third audio feature according to a self-attention mechanism based on the self-attention layer;
inputting the first audio feature, the second audio feature and the third audio feature into the feature fusion layer, and determining the target audio feature based on the feature fusion layer.
Specifically, the invention performs a multi-dimensional attention mixing of the features through the temporal attention layer, the channel attention layer and the self-attention layer, completing the attention weighting of the features.
As shown in fig. 2, the model attention layer includes: a temporal attention layer, a channel attention layer, a self attention layer, and a feature fusion layer. Inputting the low-dimensional audio features into a model attention layer, and determining target audio features according to an attention mechanism based on the model attention layer, wherein the method specifically comprises the following steps of:
The low-dimensional audio features are input into a time attention layer, different weights are distributed to different audio frames according to a time attention mechanism based on the time attention layer, the weights of key audio frames in the global are increased, and the first audio features are determined.
Inputting the first audio feature into a channel attention layer, assigning weights to different audio feature channels based on the channel attention layer according to a channel attention mechanism, enhancing important feature channels from a complex high-dimensional space, inhibiting useless feature channels, and determining the second audio feature.
Inputting the second audio features into the self-attention layer, calculating the mutual connection between each time frame according to a self-attention mechanism based on the self-attention layer, enhancing the extraction capability of semantic information in the audio, and determining the third audio features.
The method comprises the steps of inputting a first audio feature, a second audio feature and a third audio feature which are respectively obtained from a time attention layer, a channel attention layer and a self attention layer into a feature fusion layer, and combining three different audio features based on the feature fusion layer to determine a target audio feature.
It will be appreciated that the specific structures of the time attention layer, the channel attention layer, the self attention layer and the feature fusion layer in the present invention may be set according to actual requirements, which is not limited in the present invention.
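A minimal PyTorch sketch of how the model attention layer could be composed as described above, with the temporal, channel and self-attention sub-modules passed in as arguments (identity placeholders by default) and a concatenation-plus-1×1-convolution fusion; the fusion form is an assumption.

import torch
import torch.nn as nn

class ModelAttentionLayer(nn.Module):
    def __init__(self, channels=512, temporal_attn=None, channel_attn=None, self_attn=None):
        super().__init__()
        self.temporal_attn = temporal_attn or nn.Identity()
        self.channel_attn = channel_attn or nn.Identity()
        self.self_attn = self_attn or nn.Identity()
        # fuse the three intermediate features back to the original width
        self.fuse = nn.Conv1d(3 * channels, channels, kernel_size=1)

    def forward(self, x):
        # x: (batch, channels, time) low-dimensional audio features
        f1 = self.temporal_attn(x)     # first audio feature
        f2 = self.channel_attn(f1)     # second audio feature
        f3 = self.self_attn(f2)        # third audio feature
        return self.fuse(torch.cat([f1, f2, f3], dim=1))  # target audio feature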
According to the broadcast effect classification method provided by the invention, the multi-dimensional attention mechanism used to process the features can effectively increase the global weight of key audio frames, enhance important channel features, and strengthen the extraction of semantic information from the audio. This effectively improves model performance, strengthens the ability of the target audio features to reflect key audio information, and improves the accuracy of the classification result.
Moreover, because the low-dimensional audio features are processed by the temporal attention mechanism, the channel attention mechanism and the self-attention mechanism in sequence, the audio frames are first screened in time and irrelevant frames are effectively removed, which makes the subsequent channel-attention and self-attention processing more targeted, reduces the required computing resources and increases the speed of classification and recognition.
Optionally, according to the broadcast effect classification method provided by the present invention, the channel attention layer includes: a first channel attention layer, a second channel attention layer, and a channel feature fusion layer;
the inputting the first audio feature into the channel attention layer, and determining a second audio feature according to a channel attention mechanism based on the channel attention layer, specifically including:
Inputting the first audio feature into the first channel attention layer, and determining a first channel audio feature according to a channel attention mechanism based on the first channel attention layer;
inputting the first channel audio feature into the second channel attention layer, and determining a second channel audio feature according to a channel attention mechanism based on the second channel attention layer;
inputting the first channel audio feature and the second channel audio feature into the channel feature fusion layer, and determining the second audio feature based on the channel feature fusion layer.
In particular, the present invention employs a dual channel attention mechanism to enhance the channel characteristics of audio. As shown in fig. 2, the channel attention layer includes: a first channel attention layer, a second channel attention layer, and a channel feature fusion layer. Inputting the first audio feature into a channel attention layer, and determining a second audio feature according to a channel attention mechanism based on the channel attention layer, wherein the method specifically comprises the following steps of:
inputting the first audio feature into a first channel attention layer, based on the first channel attention layer, assigning weights to different audio feature channels according to a channel attention mechanism, enhancing important feature channels from a complex high-dimensional space, inhibiting useless feature channels, and determining the audio features of the first channel.
Inputting the audio features of the first channel into a second channel attention layer, further, on the basis of the second channel attention layer and according to a channel attention mechanism, assigning weights to different audio feature channels on the previous basis, enhancing important feature channels from a complex high-dimensional space, inhibiting useless feature channels, and determining the audio features of the second channel.
The first channel audio feature and the second channel audio feature are input into the channel feature fusion layer, and the two channel audio features are combined based on the channel feature fusion layer to determine the second audio feature.
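A minimal PyTorch sketch of the dual channel-attention structure described above, using a squeeze-and-excitation style block as a stand-in for the channel attention mechanism; the block design and the concatenation-based fusion are assumptions, not the prescribed implementation.

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        # learn a per-channel weight from the time-averaged features
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        # x: (batch, channels, time); re-weight each feature channel
        w = self.fc(x.mean(dim=-1)).unsqueeze(-1)
        return x * w

class DualChannelAttention(nn.Module):
    def __init__(self, channels=512):
        super().__init__()
        self.attn1, self.attn2 = SEBlock(channels), SEBlock(channels)
        self.fuse = nn.Conv1d(2 * channels, channels, kernel_size=1)

    def forward(self, x):
        c1 = self.attn1(x)             # first channel audio feature
        c2 = self.attn2(c1)            # second channel audio feature
        return self.fuse(torch.cat([c1, c2], dim=1))  # second audio feature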
According to the broadcast effect classification method provided by the invention, the two-layer channel attention mechanism further enhances the channel features of the audio, effectively improves the ability of the target audio features to reflect key audio channel information, and improves the accuracy of the classification result. The method and the system can accurately analyze the interference situation between the normal broadcast of the frequency point and other broadcasts in medium-short wave broadcasting, output the recognition result, and realize automatic broadcast effect classification.
Optionally, according to the broadcast effect classification method provided by the present invention, the time attention layer includes: a first feature segmentation layer, a first feature aggregation layer, and a temporal sub-attention layer;
The step of inputting the low-dimensional audio features into the time attention layer, and determining a first audio feature according to a time attention mechanism based on the time attention layer specifically comprises the following steps:
inputting the low-dimensional audio features into the first feature segmentation layer, and determining a first initial feature set according to a first preset feature channel segmentation rule based on the first feature segmentation layer; wherein the first initial feature set comprises a plurality of time sub-features;
inputting the first initial feature set into the first feature aggregation layer, processing the time sub-features according to a first preset feature processing rule based on the first feature aggregation layer, and aggregating the processed time sub-features to determine a first initial time feature;
inputting the first initial temporal feature into the temporal sub-attention layer, and determining the first audio feature according to a temporal attention mechanism based on the temporal sub-attention layer.
Specifically, in order to improve the nonlinear expression capability of the audio features, the time attention layer in the present invention includes: a first feature segmentation layer, a first feature aggregation layer, and a temporal sub-attention layer. Inputting the low-dimensional audio features into a time attention layer, and determining a first audio feature according to a time attention mechanism based on the time attention layer, wherein the method specifically comprises the following steps of:
Inputting the low-dimensional audio features into a first feature segmentation layer, segmenting the low-dimensional audio features into a plurality of time sub-features according to a first preset feature channel segmentation rule based on the first feature segmentation layer, and determining a first initial feature set according to all the time sub-features.
For example: fig. 3 is a schematic diagram of the temporal attention layer structure provided by the present invention. As shown in fig. 3, the low-dimensional audio feature is 512-dimensional; according to the first preset feature channel splitting rule, it is divided equally into 4 parts by channel count, each part being 128-dimensional. The first initial feature set is denoted x_i, where i ∈ {1,2,3,4}.
Inputting the first initial feature set into a first feature aggregation layer, processing time sub-features according to a first preset feature processing rule based on the first feature aggregation layer, and aggregating the processed time sub-features to determine a first initial time feature.
For example: under the first preset feature processing rule, each x_i is mapped by its corresponding 3×3 convolution K_i(·) to obtain y_i, calculated as:
y_i = K_i(x_i), i ∈ {1,2,3,4}
The resulting y_i are then aggregated through a 1×1 convolutional layer to determine the first initial temporal feature.
Inputting the first initial time feature into a time sub-attention layer, distributing different weights to different audio frames according to a time attention mechanism based on the time sub-attention layer, increasing the weight of the key audio frames in the global, and determining the first audio feature.
It should be understood that the above first preset feature channel splitting rule and first preset feature processing rule are merely a specific example used to illustrate the present invention; the rules may also be adjusted according to actual requirements (for example, the splitting manner, the number of splits, the type of convolutional layer, etc.), which is not limited by the present invention.
Further, it can be understood that the network structure of the time attention layer is similar to the Res2Net model structure, and in the practical application process of the present invention, the specific structure of the model can be set according to the practical situation, which is not limited by the present invention.
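A minimal PyTorch sketch of the temporal attention layer described above, assuming 1-D convolutions over the time axis and a softmax over frames as the temporal weighting; the split into 4 groups of 128 channels and the per-group convolutions K_i follow the example, the rest is illustrative.

import torch
import torch.nn as nn

class TemporalAttentionLayer(nn.Module):
    def __init__(self, channels=512, groups=4):
        super().__init__()
        sub = channels // groups       # 128-dimensional sub-features
        # one small convolution K_i per sub-feature group
        self.split_convs = nn.ModuleList(
            [nn.Conv1d(sub, sub, kernel_size=3, padding=1) for _ in range(groups)]
        )
        self.aggregate = nn.Conv1d(channels, channels, kernel_size=1)
        # scores one attention weight per time frame
        self.frame_score = nn.Conv1d(channels, 1, kernel_size=1)

    def forward(self, x):
        # x: (batch, channels, time) low-dimensional audio features
        parts = torch.chunk(x, len(self.split_convs), dim=1)   # x_i, i in {1..4}
        ys = [k(p) for k, p in zip(self.split_convs, parts)]   # y_i = K_i(x_i)
        agg = self.aggregate(torch.cat(ys, dim=1))             # first initial temporal feature
        w = torch.softmax(self.frame_score(agg), dim=-1)       # per-frame weights
        return agg * w                                         # first audio feature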
According to the broadcast effect classification method provided by the invention, the feature is split into a plurality of sub-features by the first preset feature channel splitting rule, the sub-features are processed according to the first preset feature processing rule, and the processed sub-features are aggregated, which can effectively improve the nonlinear expression capability of the audio features and further improve the recognition and classification accuracy of the model.
Optionally, according to the broadcast effect classification method provided by the present invention, the channel attention layer includes: a second feature segmentation layer, a second feature aggregation layer, and a channel sub-attention layer;
The inputting the first audio feature into the channel attention layer, and determining a second audio feature according to a channel attention mechanism based on the channel attention layer, specifically including:
inputting the first audio features into the second feature segmentation layer, and determining a second initial feature set according to a second preset feature channel segmentation rule based on the second feature segmentation layer; wherein the second initial feature set comprises a plurality of channel sub-features;
inputting the second initial feature set into the second feature aggregation layer, processing the channel sub-features according to a second preset feature processing rule based on the second feature aggregation layer, and aggregating the processed channel sub-features to determine second initial channel features;
inputting the second initial channel feature into the channel sub-attention layer, and determining the second audio feature according to a channel attention mechanism based on the channel sub-attention layer.
Specifically, in order to improve the nonlinear expression capability of the audio features, the channel attention layer in the present invention includes: a second feature segmentation layer, a second feature aggregation layer, and a channel sub-attention layer. Inputting the first audio feature into a channel attention layer, and determining a second audio feature according to a channel attention mechanism based on the channel attention layer, wherein the method specifically comprises the following steps of:
Inputting the first audio feature into a second feature segmentation layer, segmenting the first audio feature into a plurality of channel sub-features according to a second preset feature channel segmentation rule based on the second feature segmentation layer, and determining a second initial feature set according to all the channel sub-features.
It can be understood that the second preset feature channel splitting rule setting rule and the first preset feature channel splitting rule setting rule may be the same or different, and the specific setting rule may be adjusted according to the actual situation, which is not limited in the present invention.
Inputting the second initial feature set into a second feature aggregation layer, processing channel sub-features according to a second preset feature processing rule based on the second feature aggregation layer, and aggregating the processed channel sub-features to determine second initial channel features.
It can be understood that the second preset feature processing rule and the first preset feature processing rule may be the same or different, and the specific setting rule may be adjusted according to the actual situation, which is not limited in the present invention.
The second initial channel feature is input into a channel sub-attention layer, based on which a second audio feature is determined according to a channel attention mechanism.
It should be understood that the examples of the second preset feature channel splitting rule and the second preset feature processing rule are the same as those of the first preset feature channel splitting rule and the first preset feature processing rule, and are merely a specific example used to illustrate the present invention; the rules may also be adjusted according to actual requirements (for example, the splitting manner, the number of splits, the type of convolutional layer, etc.), which is not limited by the present invention.
Further, it can be understood that the network structure of the channel attention layer is similar to the Res2Net model structure, and in the practical application process of the present invention, the specific structure of the model can be set according to the practical situation, which is not limited by the present invention.
According to the broadcast effect classification method provided by the invention, the feature is split into a plurality of sub-features by the second preset feature channel splitting rule, the sub-features are processed according to the second preset feature processing rule, and the processed sub-features are aggregated, which can effectively improve the nonlinear expression capability of the audio features and further improve the recognition and classification accuracy of the model.
The present invention will be described in detail with reference to a specific method for classifying broadcast effects according to the present invention.
It can be understood that fig. 4 is a schematic flow chart of the broadcast effect classification method provided by the present invention, and as shown in fig. 4, before classifying the broadcast audio to be classified, training and testing of the broadcast effect classification model is also required.
For example: in preparing the training data, a training category system is prepared based on the classification rules corresponding to Table 1. Table 2 is a comparison table between the broadcast effect classification rules and the training category system; as shown in Table 2, the training data are divided into 8 categories: pure noise, indistinguishable, qualified speech, basically qualified speech, unqualified speech, music, foreign language and percussion.
Pure noise means that the audio contains mostly white noise and that no broadcast sound can be heard at all, or the audio is completely silent. Indistinguishable means that the audio is very noisy; a faint broadcast sound can be heard behind the noise, but it cannot be heard clearly enough to confirm whether it comes from the object station. Basically qualified means that the audio contains a weak sound from an interfering broadcast whose content cannot be made out, and the noise is loud. Qualified means that the audio contains a relatively clear sound of the normal broadcast of the frequency point whose content can be heard, with almost no object-station broadcast sound. Unqualified means that the audio contains a relatively clear object-station broadcast sound whose content can be heard. Music means that the audio data consist of songs and the like; foreign language means that the audio data consist of foreign languages such as English, Japanese and Korean; percussion means that the audio data consist of percussion-style music such as the suona.
Table 2 broadcast effect classification rules and training class system comparison table
Classification result in broadcast effect classification rules | Training category
0/0 | Pure noise
0/1, 1/0, 1/1 | Indistinguishable
5/0, 4/0, 3/0, 4/1, 3/1, 3/2 | Unqualified speech
1/2, 2/1, 2/0, 0/2, 2/2 | Basically qualified speech
0/5, 0/4, 0/3, 1/4, 1/3, 2/3 | Qualified speech
Music | Music
Foreign language | Foreign language
Percussion | Percussion
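A small Python sketch of the Table 2 mapping from S1/S2 evaluation results to training categories (speech categories only; music, foreign language and percussion audio are labelled by content rather than by score); the dictionary form is illustrative.

# Mapping of S1/S2 evaluation results to the training categories of Table 2.
RESULT_TO_CLASS = {
    "0/0": "pure noise",
    **dict.fromkeys(["0/1", "1/0", "1/1"], "indistinguishable"),
    **dict.fromkeys(["5/0", "4/0", "3/0", "4/1", "3/1", "3/2"], "unqualified speech"),
    **dict.fromkeys(["1/2", "2/1", "2/0", "0/2", "2/2"], "basically qualified speech"),
    **dict.fromkeys(["0/5", "0/4", "0/3", "1/4", "1/3", "2/3"], "qualified speech"),
}

def training_class(result):
    """Return the training category for an S1/S2 evaluation result string."""
    return RESULT_TO_CLASS[result]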
Because the quality of the medium short wave broadcast audio collected from different receivers is different, in order to improve the accuracy of the medium short wave broadcast classification collected from different receivers, the training data distribution condition needs to be ensured to be similar to the data distribution condition to be tested, and the data set used in the training stage comprises 85% of general broadcast audio training data and 15% of the frequency point broadcast audio training data.
The general broadcast audio training data comprises representative broadcast audio data collected from the receivers at the various frequency points, while the frequency-point broadcast audio data comprises broadcast audio data collected only from the receiver at the frequency point concerned.
The data set is randomly shuffled and then divided into a training set and a test set at a ratio of 9:1, which are used to train and evaluate the neural network model; part of the frequency-point broadcast audio data is selected as the test set (the shuffle-and-split step is sketched after Table 5). The scale and entry statistics of the general broadcast audio training data are shown in Table 3, those of the frequency-point broadcast audio training data in Table 4, and those of the test set in Table 5.
Table 3 general broadcast audio training data size and entry statistics
Table 4 scale and entry statistics of the audio training data of the frequency bin broadcast
Table 5 test set size and entry statistics
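For illustration, the random shuffle and 9:1 split described above might be implemented along the following lines; the file names, the eight-way label encoding and the fixed random seed are assumptions of the sketch, not details taken from the experiments.

```python
import random

def split_dataset(samples, train_ratio=0.9, seed=42):
    """Randomly shuffle (audio_path, label) pairs and split them 9:1."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    cut = int(len(samples) * train_ratio)
    return samples[:cut], samples[cut:]

# Hypothetical entries: clip file names paired with one of the eight category labels.
data = [("clip_%05d.wav" % i, i % 8) for i in range(1000)]
train_set, test_set = split_dataset(data)
print(len(train_set), len(test_set))  # 900 100
```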
After the audio data set is determined, the audio data is decoded and the data set is expanded through data enhancement. The audio is framed into non-overlapping 1-second frames; within each frame, the time-domain information is converted into cepstral-domain information by a short-time Fourier transform with a window width of 25 milliseconds and a shift of 10 milliseconds, and the in-frame cepstral information is aggregated into histogram features, which serve as the audio features input to the classification model.
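A minimal sketch of this feature extraction step is given below. The sample rate, FFT size, number of histogram bins, and the use of the log-magnitude spectrum as a stand-in for the cepstral-domain statistics are all assumptions; the description itself only fixes the 1-second non-overlapping frames, the 25 ms window and the 10 ms shift.

```python
import numpy as np
import librosa

SR = 16000                     # assumed sample rate of the decoded broadcast audio
FRAME_LEN = SR                 # 1-second frames, no overlap
WIN = int(0.025 * SR)          # 25 ms Fourier window
HOP = int(0.010 * SR)          # 10 ms window shift

def histogram_features(y, n_bins=64):
    """Cut the signal into 1 s frames and summarise each frame's log-magnitude
    spectrum as a histogram, standing in for the cepstral-domain statistics."""
    feats = []
    for start in range(0, len(y) - FRAME_LEN + 1, FRAME_LEN):
        frame = y[start:start + FRAME_LEN]
        spec = np.abs(librosa.stft(frame, n_fft=512, win_length=WIN, hop_length=HOP))
        hist, _ = np.histogram(np.log1p(spec), bins=n_bins, density=True)
        feats.append(hist.astype(np.float32))
    return np.stack(feats)     # shape: (number of 1 s frames, n_bins)

y, _ = librosa.load("broadcast_clip.wav", sr=SR, mono=True)  # hypothetical file
features = histogram_features(y)
```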
It can be appreciated that data enhancement can effectively expand the data set and significantly improve the performance of the neural network model. The enhancement operations include randomly changing the audio rate, randomly changing the audio tempo, randomly compressing the audio data, and the like, transforming the audio to a certain extent without altering the original audio information. The specific methods adopted can be set according to actual conditions, and the invention is not limited in this respect.
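The augmentation operations named above might look roughly as follows; the parameter ranges and the tanh-based dynamic-range compression are illustrative choices, since the text does not specify how the rate, tempo and compression changes are realised.

```python
import numpy as np
import librosa

def random_rate(y, sr, low=0.9, high=1.1, rng=np.random):
    """Change the playback rate (speed and pitch together) by resampling."""
    rate = rng.uniform(low, high)
    return librosa.resample(y, orig_sr=sr, target_sr=int(sr * rate))

def random_tempo(y, low=0.9, high=1.1, rng=np.random):
    """Change the tempo only, keeping pitch, via phase-vocoder time stretching."""
    return librosa.effects.time_stretch(y, rate=rng.uniform(low, high))

def random_compress(y, low=1.0, high=4.0, rng=np.random):
    """Crude dynamic-range compression as a stand-in for 'compressing' the audio."""
    k = rng.uniform(low, high)
    return np.tanh(k * y) / np.tanh(k)
```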
The broadcast effect classification model is trained with the audio features corresponding to the training set; after training, it is tested with the audio features corresponding to the test set, and whether the model has been trained successfully is determined from the accuracy of its broadcast effect classification results.
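A schematic training-and-evaluation loop consistent with this description is sketched below; the optimizer, learning rate, number of epochs and the accuracy target are assumptions, not values taken from the embodiment.

```python
import torch
import torch.nn as nn

def train_and_evaluate(model, train_loader, test_loader,
                       epochs=20, lr=1e-3, target_acc=0.95):
    """Train on the training-set features, then report test-set accuracy;
    training is deemed successful if accuracy reaches the assumed target."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        model.train()
        for feats, labels in train_loader:
            optimizer.zero_grad()
            loss_fn(model(feats), labels).backward()
            optimizer.step()
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for feats, labels in test_loader:
            correct += (model(feats).argmax(dim=1) == labels).sum().item()
            total += labels.numel()
    accuracy = correct / total
    return accuracy, accuracy >= target_acc
```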
Further, in order to verify the accuracy of the medium-short-wave broadcast effect classification, the test results of the trained broadcast effect classification model need to be verified. The unqualified category and the foreign language category in the classification results are regarded as unqualified, and all other categories are regarded as qualified.
As shown in Table 6, differences between sub-classes within the qualified and unqualified groups are ignored. The overall accuracy on the test set reached 97%; the accuracy and recall statistics for each class are shown in Table 7.
TABLE 6 classification of pass and fail audio in classification system
Table 7 statistics of accuracy and recall for each category in the classification model test
Fig. 5 is a schematic flow chart of the testing method of the broadcast effect classification model according to the present invention. As shown in fig. 5, the testing stage includes the following steps:
The audio features corresponding to the test-set audio data are determined through data enhancement and short-time Fourier transform and input into the broadcast effect classification model; the probability distribution over all categories of the classification system in Table 2 is computed as the prediction result, and the confidence is determined.
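Conceptually, this prediction step can be read as a softmax over the eight training categories, with the confidence taken as the probability of the top category; the sketch below assumes exactly that reading.

```python
import numpy as np

CATEGORIES = ["pure noise", "indistinguishable", "unqualified speech",
              "basically qualified speech", "qualified speech",
              "music", "foreign language", "percussion"]  # order per Table 2

def predict(logits):
    """Softmax over the eight categories; confidence = top class probability."""
    z = np.exp(logits - np.max(logits))
    probs = z / z.sum()
    best = int(np.argmax(probs))
    return CATEGORIES[best], float(probs[best]), probs
```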
The model test results include the classification result and the confidence for each utterance. Statistics of the accuracy at different confidence levels (shown in Table 8) show that the confidence and the accuracy of the test results are positively correlated.
Fig. 6 shows the human-intervention-free intervals divided according to the ideal confidence. Through large-scale testing, a confidence interval in which the accuracy approaches 100% is determined for each category according to how well that category is discriminated (shown in Table 9 and Fig. 6) and is used as the ideal confidence threshold; data whose confidence is higher than the ideal confidence can skip manual evaluation, thereby saving workload.
In the test experiment, the data above the ideal confidence threshold after post-processing accounted for 79% of all data with an accuracy of 99.5%, and required no manual evaluation.
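The post-processing described here amounts to comparing each prediction's confidence against a per-category ideal threshold and letting only the results above the threshold skip manual evaluation. The threshold values in the sketch below are placeholders; the actual values are those determined in Table 9 and Fig. 6.

```python
# Placeholder per-category ideal confidence thresholds, not the values of Table 9.
IDEAL_CONFIDENCE = {
    "pure noise": 0.95, "indistinguishable": 0.98, "unqualified speech": 0.97,
    "basically qualified speech": 0.98, "qualified speech": 0.96,
    "music": 0.95, "foreign language": 0.97, "percussion": 0.95,
}

def route(results):
    """Split (category, confidence) pairs into auto-accepted results and
    results that still go to manual evaluation."""
    auto, manual = [], []
    for category, confidence in results:
        bucket = auto if confidence >= IDEAL_CONFIDENCE[category] else manual
        bucket.append((category, confidence))
    return auto, manual
```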
Table 8 statistics of recall and precision for each class at different confidence levels
Table 9 statistics of various test results with ideal confidence level
It can be understood that in an actual evaluation service, the results predicted by this method can be used directly as the final classification results, or model prediction can be combined with manual verification: the test results of post-processed data above the ideal confidence are taken as the final evaluation results, and only the remaining data is manually re-verified, thereby reducing the workload.
The invention aims to use neural network technology to accurately evaluate the medium-short-wave broadcast effect of the audio data collected by front-end receivers and to screen out the audio for which the broadcast interference was unsuccessful, with an overall accuracy of up to 97%. Post-processing of the results brings the accuracy for some categories of audio close to 100%, thereby saving workload.
It should be understood that the above method of determining the training set and test set of the broadcast effect classification model is only a specific example used to illustrate the invention. The specific rules for classifying the broadcast audio samples, the number of samples in each category, the ratio of training set to test set, the model structure and the model training method can all be set according to actual conditions, and the invention is not limited in this respect.
Fig. 7 is a schematic structural diagram of a broadcast effect classification system according to the present invention. As shown in fig. 7, the present invention further provides a broadcast effect classification system, comprising: an audio feature determining unit 710 and a broadcast effect classification unit 720;
the audio feature determining unit 710 is configured to determine an initial audio feature according to fourier transform based on the broadcast audio data to be classified;
the broadcast effect classification unit 720 is configured to input the initial audio feature into a broadcast effect classification model, and determine a target broadcast effect category.
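For orientation, the classification model wrapped by this unit — an audio feature dimension-reduction layer, a model attention layer built from temporal, channel and self-attention sub-layers with feature fusion, and an effect classification layer, as recited in the claims below — might be skeletonised as follows. The feature sizes, the concrete attention implementations and the time pooling are assumptions of this sketch, not the network disclosed in the embodiment.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Weights each time step of a (batch, time, channels) feature map."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Linear(channels, 1)

    def forward(self, x):                              # x: (B, T, C)
        weights = torch.softmax(self.score(x), dim=1)
        return x * weights

class ChannelAttention(nn.Module):
    """Weights each feature channel from its time-averaged statistics."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                              # x: (B, T, C)
        weights = self.mlp(x.mean(dim=1)).unsqueeze(1)
        return x * weights

class BroadcastEffectClassifier(nn.Module):
    """Skeleton: dimension reduction -> temporal / channel / self attention
    -> feature fusion -> effect classification over eight categories."""
    def __init__(self, in_dim=64, hidden=32, num_classes=8):
        super().__init__()
        self.reduce = nn.Linear(in_dim, hidden)          # audio feature dimension reduction layer
        self.temporal = TemporalAttention(hidden)        # temporal attention layer
        self.channel = ChannelAttention(hidden)          # channel attention layer
        self.self_attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.fuse = nn.Linear(3 * hidden, hidden)        # feature fusion layer
        self.classify = nn.Linear(hidden, num_classes)   # effect classification layer

    def forward(self, x):                                # x: (B, T, in_dim) histogram features
        low = torch.relu(self.reduce(x))                 # low-dimensional audio features
        first = self.temporal(low)                       # first audio feature
        second = self.channel(first)                     # second audio feature
        third, _ = self.self_attn(second, second, second)  # third audio feature
        fused = torch.relu(self.fuse(torch.cat([first, second, third], dim=-1)))
        return self.classify(fused.mean(dim=1))          # class logits

logits = BroadcastEffectClassifier()(torch.randn(2, 10, 64))  # two clips, ten 1 s frames each
```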
In particular, the present invention is described taking the classification of the broadcast audio effect of medium short waves (radio waves with wavelengths of 200 m to 50 m and frequencies of 1500 kHz to 6000 kHz) as an example. It will be appreciated that the method can be widely applied to audio classification and can also be adapted to other fields, and the invention is not limited in this respect.
The audio feature determining unit 710 is configured to determine, based on the broadcast audio data to be classified, the initial audio features from the feature information of the audio frames extracted by the Fourier transform.
It will be appreciated that processing the audio signal according to the Fourier transform actually extracts the cepstral-domain information of the audio signal. In practical applications of the invention, when audio data is converted into audio features according to the Fourier transform, the preset framing rule and the Fourier window width can be adjusted according to actual conditions, and the invention is not limited in this respect.
The broadcast effect classification unit 720 is configured to input the initial audio features into the trained broadcast effect classification model and determine the target broadcast effect category.
It can be understood that the broadcast effect classification model needs to be trained before it is used; the training samples, the training method and the specific structure of the model can be adjusted according to actual requirements, and the invention is not limited in this respect.
According to the broadcast effect classification method provided by the invention, the broadcast audio data to be classified is obtained, the initial audio features are determined by Fourier transform, and the target broadcast effect category is determined by the trained broadcast effect classification model. When the broadcast effect is classified automatically, only the audio collected at the receiving and measuring end needs to be evaluated, with no additional processing of transmit-side audio or of a reference source; this effectively reduces the steps of broadcast effect classification and raises its degree of intelligence, removes the dependence on the personal experience of human evaluators, unifies the classification standard through the neural network, and improves the accuracy of effect classification.
It should be noted that, the broadcast effect classification system provided by the present invention is used for executing the broadcast effect classification method, and the specific embodiment and the method embodiment thereof are consistent, and are not repeated herein.
Fig. 8 is a schematic structural diagram of an electronic device provided by the present invention. As shown in fig. 8, the electronic device may include: a processor 810, a communication interface 820, a memory 830 and a communication bus 840, wherein the processor 810, the communication interface 820 and the memory 830 communicate with one another via the communication bus 840. The processor 810 may invoke logic instructions in the memory 830 to perform a broadcast effect classification method comprising: determining initial audio features according to a preset framing rule based on the broadcast audio data to be classified; and inputting the initial audio features into a broadcast effect classification model and determining the target broadcast effect category.
Further, the logic instructions in the memory 830 may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the broadcast effect classification method provided above, the method comprising: determining initial audio features according to a preset framing rule based on the broadcast audio data to be classified; and inputting the initial audio features into a broadcast effect classification model and determining the target broadcast effect category.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the broadcast effect classification method provided above, the method comprising: determining initial audio features according to a preset framing rule based on the broadcast audio data to be classified; and inputting the initial audio features into a broadcast effect classification model and determining the target broadcast effect category.
The apparatus embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus the necessary general hardware platform, or of course by means of hardware. Based on this understanding, the foregoing technical solutions may be embodied, in essence or in the part contributing to the prior art, in the form of a software product, which may be stored in a computer-readable storage medium such as a ROM/RAM, a magnetic disk or an optical disk and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform the methods described in the various embodiments or in parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. A broadcast effect classification method, comprising:
based on the broadcast audio data to be classified, determining initial audio characteristics according to Fourier transformation;
inputting the initial audio features into a broadcast effect classification model to determine a target broadcast effect class;
the broadcast effect classification model includes: an audio feature dimension reduction layer, a model attention layer and an effect classification layer;
inputting the initial audio features into a broadcast effect classification model, and determining a target broadcast effect category, wherein the method specifically comprises the following steps:
inputting the initial audio features into the audio feature dimension reduction layer, and determining low-dimensional audio features based on the audio feature dimension reduction layer;
inputting the low-dimensional audio features into the model attention layer, and determining target audio features according to an attention mechanism based on the model attention layer;
inputting the target audio features into the effect classification layer, and determining the target broadcast effect category based on the effect classification layer;
the model attention layer includes: a temporal attention layer, a channel attention layer, a self attention layer, and a feature fusion layer;
the inputting the low-dimensional audio features into the model attention layer, and determining target audio features according to an attention mechanism based on the model attention layer specifically comprises the following steps:
inputting the low-dimensional audio features into the time attention layer, and determining a first audio feature according to a time attention mechanism based on the time attention layer;
inputting the first audio feature into the channel attention layer, and determining a second audio feature according to a channel attention mechanism based on the channel attention layer;
inputting the second audio feature into the self-attention layer, determining a third audio feature according to a self-attention mechanism based on the self-attention layer;
inputting the first audio feature, the second audio feature and the third audio feature into the feature fusion layer, and determining the target audio feature based on the feature fusion layer.
2. The broadcast effect classification method according to claim 1, wherein the determining the initial audio feature based on the broadcast audio data to be classified according to fourier transform specifically comprises:
based on the broadcast audio data to be classified, determining an audio frame data set according to a preset framing rule; wherein adjacent frames in the audio frame dataset are continuous and non-overlapping;
and converting time domain information in each frame in the audio frame data set into cepstral frequency information based on the Fourier transform, and determining the initial audio characteristics.
3. The broadcast effect classification method according to claim 1, wherein,
the channel attention layer includes: a first channel attention layer, a second channel attention layer, and a channel feature fusion layer;
the inputting the first audio feature into the channel attention layer, and determining a second audio feature according to a channel attention mechanism based on the channel attention layer, specifically including:
inputting the first audio feature into the first channel attention layer, and determining a first channel audio feature according to a channel attention mechanism based on the first channel attention layer;
inputting the first channel audio feature into the second channel attention layer, and determining a second channel audio feature according to a channel attention mechanism based on the second channel attention layer;
inputting the first channel audio feature and the second channel audio feature into the channel feature fusion layer, and determining the second audio feature based on the channel feature fusion layer.
4. The broadcast effect classification method according to claim 1, wherein,
the temporal attention layer includes: a first feature segmentation layer, a first feature aggregation layer, and a temporal sub-attention layer;
The step of inputting the low-dimensional audio features into the time attention layer, and determining a first audio feature according to a time attention mechanism based on the time attention layer specifically comprises the following steps:
inputting the low-dimensional audio features into the first feature segmentation layer, and determining a first initial feature set according to a first preset feature channel segmentation rule based on the first feature segmentation layer; wherein the first initial feature set comprises a plurality of time sub-features;
inputting the first initial feature set into the first feature aggregation layer, processing the time sub-features according to a first preset feature processing rule based on the first feature aggregation layer, and aggregating the processed time sub-features to determine a first initial time feature;
inputting the first initial temporal feature into the temporal sub-attention layer, and determining the first audio feature according to a temporal attention mechanism based on the temporal sub-attention layer.
5. The broadcast effect classification method of claim 1, wherein the channel attention layer comprises: a second feature segmentation layer, a second feature aggregation layer, and a channel sub-attention layer;
The inputting the first audio feature into the channel attention layer, and determining a second audio feature according to a channel attention mechanism based on the channel attention layer, specifically including:
inputting the first audio features into the second feature segmentation layer, and determining a second initial feature set according to a second preset feature channel segmentation rule based on the second feature segmentation layer; wherein the second initial feature set comprises a plurality of channel sub-features;
inputting the second initial feature set into the second feature aggregation layer, processing the channel sub-features according to a second preset feature processing rule based on the second feature aggregation layer, and aggregating the processed channel sub-features to determine second initial channel features;
inputting the second initial channel feature into the channel sub-attention layer, and determining the second audio feature according to a channel attention mechanism based on the channel sub-attention layer.
6. A broadcast effect classification system, comprising: an audio feature determining unit and a broadcast effect classifying unit;
the audio feature determining unit is used for determining initial audio features according to Fourier transformation based on the broadcast audio data to be classified;
The broadcast effect classification unit is used for inputting the initial audio characteristics into a broadcast effect classification model to determine a target broadcast effect category;
the broadcast effect classification model includes: an audio feature dimension reduction layer, a model attention layer and an effect classification layer;
inputting the initial audio features into a broadcast effect classification model, and determining a target broadcast effect category, wherein the method specifically comprises the following steps:
inputting the initial audio features into the audio feature dimension reduction layer, and determining low-dimensional audio features based on the audio feature dimension reduction layer;
inputting the low-dimensional audio features into the model attention layer, and determining target audio features according to an attention mechanism based on the model attention layer;
inputting the target audio features into the effect classification layer, and determining the target broadcast effect category based on the effect classification layer;
the model attention layer includes: a temporal attention layer, a channel attention layer, a self attention layer, and a feature fusion layer;
the inputting the low-dimensional audio features into the model attention layer, and determining target audio features according to an attention mechanism based on the model attention layer specifically comprises the following steps:
inputting the low-dimensional audio features into the time attention layer, and determining a first audio feature according to a time attention mechanism based on the time attention layer;
inputting the first audio feature into the channel attention layer, and determining a second audio feature according to a channel attention mechanism based on the channel attention layer;
inputting the second audio feature into the self-attention layer, determining a third audio feature according to a self-attention mechanism based on the self-attention layer;
inputting the first audio feature, the second audio feature and the third audio feature into the feature fusion layer, and determining the target audio feature based on the feature fusion layer.
7. An electronic device comprising a memory and a processor, said processor and said memory completing communication with each other via a bus; the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the broadcast effect classification method of any of claims 1-5.
8. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor implements the broadcast effect classification method of any of claims 1 to 5.
CN202110858717.XA 2021-07-28 2021-07-28 Broadcast effect classification method and system, electronic equipment and storage medium Active CN113782051B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110858717.XA CN113782051B (en) 2021-07-28 2021-07-28 Broadcast effect classification method and system, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110858717.XA CN113782051B (en) 2021-07-28 2021-07-28 Broadcast effect classification method and system, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113782051A CN113782051A (en) 2021-12-10
CN113782051B true CN113782051B (en) 2024-03-19

Family

ID=78836229

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110858717.XA Active CN113782051B (en) 2021-07-28 2021-07-28 Broadcast effect classification method and system, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113782051B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114694685A (en) * 2022-04-12 2022-07-01 北京小米移动软件有限公司 Voice quality evaluation method, device and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2926335A1 (en) * 2012-11-29 2015-10-07 Sony Computer Entertainment Inc. Combining auditory attention cues with phoneme posterior scores for phone/vowel/syllable boundary detection
CN111312288A (en) * 2020-02-20 2020-06-19 阿基米德(上海)传媒有限公司 Broadcast audio event processing method, system and computer readable storage medium
CN112447189A (en) * 2020-12-01 2021-03-05 平安科技(深圳)有限公司 Voice event detection method and device, electronic equipment and computer storage medium
CN113160796A (en) * 2021-04-28 2021-07-23 北京中科模识科技有限公司 Language identification method, device, equipment and storage medium of broadcast audio

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7676405B2 (en) * 2005-06-01 2010-03-09 Google Inc. System and method for media play forecasting
KR20180080446A (en) * 2017-01-04 2018-07-12 삼성전자주식회사 Voice recognizing method and voice recognizing appratus

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2926335A1 (en) * 2012-11-29 2015-10-07 Sony Computer Entertainment Inc. Combining auditory attention cues with phoneme posterior scores for phone/vowel/syllable boundary detection
CN111312288A (en) * 2020-02-20 2020-06-19 阿基米德(上海)传媒有限公司 Broadcast audio event processing method, system and computer readable storage medium
CN112447189A (en) * 2020-12-01 2021-03-05 平安科技(深圳)有限公司 Voice event detection method and device, electronic equipment and computer storage medium
CN113160796A (en) * 2021-04-28 2021-07-23 北京中科模识科技有限公司 Language identification method, device, equipment and storage medium of broadcast audio

Also Published As

Publication number Publication date
CN113782051A (en) 2021-12-10

Similar Documents

Publication Publication Date Title
CN110782872A (en) Language identification method and device based on deep convolutional recurrent neural network
CN103985381A (en) Voice frequency indexing method based on parameter fusion optimized decision
WO2021159902A1 (en) Age recognition method, apparatus and device, and computer-readable storage medium
CN111986699B (en) Sound event detection method based on full convolution network
CN113488063A (en) Audio separation method based on mixed features and coding and decoding
CN113782051B (en) Broadcast effect classification method and system, electronic equipment and storage medium
CN109903749B (en) Robust voice recognition method based on key point coding and convolutional neural network
CN110580915B (en) Sound source target identification system based on wearable equipment
CN110299133B (en) Method for judging illegal broadcast based on keyword
CN110444225B (en) Sound source target identification method based on feature fusion network
CN116432664A (en) Dialogue intention classification method and system for high-quality data amplification
CN115050350A (en) Label checking method and related device, electronic equipment and storage medium
CN114822557A (en) Method, device, equipment and storage medium for distinguishing different sounds in classroom
CN113936667A (en) Bird song recognition model training method, recognition method and storage medium
Islam et al. Non-intrusive objective evaluation of speech quality in noisy condition
CN114898757A (en) Voiceprint confirmation model training method and device, electronic equipment and storage medium
CN111782860A (en) Audio detection method and device and storage medium
CN111951786A (en) Training method and device of voice recognition model, terminal equipment and medium
Xie et al. Image processing and classification procedure for the analysis of australian frog vocalisations
CN116403597B (en) Automatic data grabbing and state updating method for large-screen billboard
CN116631406B (en) Identity feature extraction method, equipment and storage medium based on acoustic feature generation
CN113257284B (en) Voice activity detection model training method, voice activity detection method and related device
CN118155623B (en) Speech recognition method based on artificial intelligence
Madhu et al. SiamNet: Siamese CNN Based Similarity Model for Adversarially Generated Environmental Sounds
Martin-Morato et al. Performance analysis of audio event classification using deep features under adverse acoustic conditions

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant