CN118155654A - Model training method, audio component missing identification method and device and electronic equipment - Google Patents


Info

Publication number
CN118155654A
CN118155654A (application CN202410575440.3A)
Authority
CN
China
Prior art keywords
audio
training
channel
model
similarity
Prior art date
Legal status
Granted
Application number
CN202410575440.3A
Other languages
Chinese (zh)
Other versions
CN118155654B (en)
Inventor
杨善明
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202410575440.3A
Publication of CN118155654A
Application granted
Publication of CN118155654B
Legal status: Active

Landscapes

  • Electrically Operated Instructional Devices (AREA)

Abstract

The embodiment of the application provides a model training method, an audio component missing identification method and apparatus, an electronic device, and a computer-readable storage medium, and relates to the field of artificial intelligence. The method comprises the following steps: obtaining a plurality of training samples, wherein each training sample comprises left-channel and right-channel audio and labeling information, the type of the audio of each channel is original audio or missing audio, and the labeling information is used to indicate whether the audio types of the two channels in the corresponding training sample are consistent; and performing multiple rounds of iterative training on a contrast learning model until convergence according to the plurality of training samples, to obtain an audio consistency recognition model. The embodiment of the application can strengthen the model's learning and understanding of the correlation between original audio and processed audio, and lays a foundation for accurately judging whether an audio component is missing from the model's subsequent output on whether the audio types are consistent.

Description

Model training method, audio component missing identification method and device and electronic equipment
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a model training method, an audio component missing identification method and apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
Background
With the rapid spread of internet technology and 5G infrastructure, video has stood out among the many information carriers and has become the dominant way for the public to acquire information and interact.
The massive volume of video content uploaded every day brings rich and diverse information to the platform, but it also brings a series of important challenges related to video quality, among which the problem of missing audio components is particularly prominent. Whether caused by technical errors in the video production and transmission process or by the loss of certain audio components in the playback link for various reasons, the problem can seriously affect the user's audio experience.
The related art mainly analyzes and judges based on the basic characteristics of the audio signal. Its core means is to determine inter-channel inconsistency by analyzing the spectral characteristics of the binaural audio in detail, or by directly comparing the original attributes of the left and right channels, such as measuring the difference in their respective volume intensities. However, its accuracy is low in practical applications.
Disclosure of Invention
Embodiments of the present application provide a model training method, an audio component missing identifying method, an apparatus, an electronic device, a computer readable storage medium, and a computer program product, which can solve the above-mentioned problems of the prior art. The technical scheme is as follows:
according to a first aspect of an embodiment of the present application, there is provided a model training method, the method comprising:
Obtaining a plurality of training samples, wherein each training sample comprises left-channel and right-channel audio and labeling information, the type of the audio of each channel of a training sample is original audio or missing audio, the missing audio is the corresponding original audio with one of its audio components missing, and the labeling information is used to indicate whether the audio types of the two channels in the corresponding training sample are consistent;
and carrying out multiple rounds of iterative training on the comparison learning model until convergence according to the training samples to obtain an audio consistency recognition model.
According to a second aspect of the embodiment of the present application, there is provided an audio component loss identification method including:
Inputting audio to be detected of a left channel and a right channel into an audio consistency recognition model, and obtaining a recognition result output by the audio consistency recognition model, wherein the recognition result is used for indicating whether the types of the audio to be detected of the left channel and the right channel are consistent;
If the identification result indicates that the types of the audio to be detected of the left channel and the right channel are inconsistent, determining that an audio component is missing from the audio to be detected of the left channel and the right channel;
the audio consistency recognition model is trained by the model training method provided by the first aspect.
According to a third aspect of an embodiment of the present application, there is provided a model training apparatus, the apparatus comprising:
A sample obtaining module, configured to obtain a plurality of training samples, wherein each training sample comprises left-channel and right-channel audio and labeling information, the type of the audio of each channel of a training sample is original audio or missing audio, the missing audio is the corresponding original audio with one of its audio components missing, and the labeling information is used to indicate whether the audio types of the two channels in the corresponding training sample are consistent;
And an iterative training module, configured to perform multiple rounds of iterative training on the comparison learning model until convergence according to the plurality of training samples, to obtain an audio consistency recognition model.
As an optional implementation manner, the comparison learning model comprises two branch models with different structures, and each branch model is used for extracting features of a training sample to obtain audio features of two channels;
The iterative training module comprises:
The similarity obtaining unit is used for inputting the training samples into a comparison learning model of the round of iteration to obtain first similarity of each training sample; the first similarity of each training sample comprises a first sub-similarity between the audio feature of any channel of the training sample obtained by one branch model and the audio feature of another channel of the training sample obtained by another branch model;
the loss value obtaining unit is used for obtaining a loss function value of the iterative training of the round, the loss function value comprises first loss values of all training samples, and the first loss value of each training sample is obtained according to the first similarity and the labeling information;
and the adjusting unit adjusts model parameters of the contrast learning model according to the loss function value.
As an optional implementation manner, one of the two branch models is further used for caching, during each round of iterative training, the two-channel audio features of at least one negative sample obtained in this iteration as at least one reference audio feature pair, wherein a negative sample is a training sample whose labeling information indicates that the audio types of the two channels in the corresponding training sample are inconsistent;
The similarity obtaining unit is further used for obtaining second similarity of each negative sample, wherein the second similarity of each negative sample comprises second sub-similarity between the audio features of each channel of the negative sample obtained by each branch model and the audio features of the same channel in a preset number of reference audio feature pairs;
the loss function value further comprises a second loss value for each negative sample, the second loss value for each negative sample being obtained from a second similarity of the negative samples.
As an optional implementation manner, the similarity obtaining unit is further configured to obtain a third similarity of each training sample, where the third similarity of each training sample represents a similarity between audio features of two channels of the corresponding training sample obtained by using one branching model;
The loss function value further comprises a third loss value of each training sample, and the third loss value of each training sample is obtained according to the third similarity and the labeling information of the training sample.
As an alternative embodiment, the first loss value for each training sample is obtained by:
If the labeling information of the training samples indicates that the types of the two-channel audios in the corresponding training samples are consistent, taking any one of the two first sub-similarities as the first loss value;
And if the labeling information of the training samples indicates that the types of the two-channel audio in the corresponding training samples are inconsistent, taking the negative value of any one of the two first sub-similarities as the first loss value.
As an alternative embodiment, the second loss value for each negative sample is determined by:
For each reference audio feature pair, obtaining a fourth similarity between the negative sample and the reference audio feature pair according to the sum of all second sub-similarities related to the negative sample and the reference audio feature pair;
A mean value of a fourth similarity of the negative sample with respect to each reference audio feature pair is determined, and a negative value of the mean value is taken as the second similarity of the negative sample.
As an alternative embodiment, the third loss value for each training sample is obtained by:
if the labeling information of the training sample indicates that the types of the two-channel audio in the corresponding training sample are consistent, taking the third similarity of the training sample as the third loss value;
And if the labeling information of the training samples indicates that the types of the two-channel audios in the corresponding training samples are inconsistent, taking a negative value of a third similarity of the training samples as the third loss value.
As an alternative embodiment, each of the two branch models includes a feature extraction module for extracting initial audio features and a feature transformation module for mapping the initial audio features to a high-dimensional feature space;
the number of the feature transformation modules in the two branch models is different.
As an alternative embodiment, the feature extraction module is a VGGish module;
The feature transformation module is a Projector module.
As an alternative embodiment, the sample acquisition module includes:
An initial audio pair obtaining unit, configured to obtain at least one initial audio pair, where audio of both left and right channels of the initial audio pair is original audio;
an audio component determination unit configured to determine, for each initial audio pair, a respective audio component from audio of each channel in the initial audio pair;
the masking unit is used for masking each audio component in the original audio for each channel of each initial audio pair to obtain each missing audio corresponding to the original audio;
The combination unit is used for combining the original audio of the two channels of the initial audio pair and each missing audio of each initial audio pair, and setting corresponding labeling information according to whether the types of the audio of the two combined channels are consistent or not so as to obtain each training sample corresponding to the initial audio pair.
According to a fourth aspect of an embodiment of the present application, there is provided an audio component absence identifying apparatus including:
a reasoning module, used for inputting the audio to be detected of the left channel and the right channel into the audio consistency recognition model and obtaining a recognition result output by the audio consistency recognition model, the recognition result indicating whether the types of the audio to be detected of the left channel and the right channel are consistent;
An identification module, used for determining that an audio component is missing from the audio to be detected of the left channel and the right channel if the recognition result indicates that the types of the audio to be detected of the left channel and the right channel are inconsistent;
the audio consistency recognition model is trained by the model training method provided by the first aspect.
According to a fifth aspect of embodiments of the present application, there is provided an electronic device comprising a memory, a processor and a computer program stored on the memory, the processor executing the computer program to carry out the steps of the above method.
According to a sixth aspect of embodiments of the present application, there is provided a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the above method.
According to a seventh aspect of embodiments of the present application, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the steps of the above method.
The technical scheme provided by the embodiment of the application has the beneficial effects that:
Through the obtained plurality of training samples, the training sample set includes both audio pairs whose types are consistent and audio pairs whose types are inconsistent, as well as pairs containing original audio and pairs containing missing audio, which greatly increases the number of samples used to train the contrast learning model. The contrast learning adopted by the model is mainly computed as a similarity, in the embodiment of the application the similarity between the audio of the left and right channels. The application therefore does not need to analyze the features of the two-channel audio in detail and places no high demands on the computing capability of the model, and it is completely different from a mode that simply judges volume intensity. It can identify, as far as possible, whether the audio has been processed and whether the audio types of the left and right channels are consistent, strengthens the learning and understanding of the correlation between original audio and processed audio, and lays a foundation for subsequently obtaining, from the model's output on whether the audio types are consistent, an accurate judgment of whether an audio component is missing.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings that are required to be used in the description of the embodiments of the present application will be briefly described below.
FIG. 1 is a schematic diagram of an alternative architecture of an audio component loss identification system according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of a model training method according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of obtaining training samples according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a comparative learning model according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a comparative learning model according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a framework of a comparative learning model according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a VGGish structure according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a framework of a comparative learning model according to an embodiment of the present application;
fig. 9 is a flowchart of an audio component missing identifying method according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a model training device according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of an audio component missing identifying device according to an embodiment of the present application;
Fig. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the drawings in the present application. It should be understood that the embodiments described below with reference to the drawings are exemplary descriptions for explaining the technical solutions of the embodiments of the present application, and the technical solutions of the embodiments of the present application are not limited.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and "comprising," when used in this specification, specify the presence of stated features, information, data, steps, operations, elements, and/or components, but do not preclude the presence or addition of other features, information, data, steps, operations, elements, components, and/or groups thereof, all of which may be included in the present specification. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein indicates that at least one of the items defined by the term, e.g., "a and/or B" may be implemented as "a", or as "B", or as "a and B".
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
First, several terms related to the present application are described and explained:
Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique, and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
Key technologies of speech technology (Speech Technology) include automatic speech recognition (ASR), speech synthesis (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, and speech has become one of the most promising modes of human-computer interaction. Large-model technology has brought reform to the development of speech technology: pre-trained models such as WavLM and UniSpeech, which use the Transformer architecture, have strong generalization and universality and can excellently complete speech processing tasks in all directions.
Projector (Projector): in a deep learning model, "projector" generally refers to a component that projects input data into a new representation space. Such projections may be linear or non-linear in order to better characterize the input data. In an embodiment of the present application, a "projector" is used to learn a high-level representation of audio features for execution of subsequent tasks.
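For illustration only, a minimal sketch of such a projection head is given below; the two-layer MLP structure and the 128/256 dimensions are assumptions for this example and are not specified by the embodiment of the application.

```python
import torch.nn as nn

class Projector(nn.Module):
    """Projects an initial audio embedding into a new representation space.
    A hypothetical two-layer MLP head with assumed dimensions."""
    def __init__(self, in_dim: int = 128, out_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.ReLU(inplace=True),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, x):
        return self.net(x)
```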
Contrastive learning: a machine learning technique that learns the general characteristics of the samples in a sample set, even when the samples in the sample set are unlabeled, so as to at least distinguish which samples are similar. By identifying the general features in each sample, contrastive learning can identify which samples are similar without the samples being labeled.
Contrast learning model: a model for implementing the above contrastive learning.
Currently, research on the task of mono (single-channel) recognition in video and audio information is still at a preliminary stage, and there has been little in-depth exploration dedicated to this field. In the field of video and audio information quality control and analysis, two main technical paths have formed for solving the single-channel recognition problem: the traditional strategy and the emerging deep learning method.
The traditional method is mainly based on the basic characteristics of the audio information signals for analysis and judgment, and relies on direct comparison of original attributes of left and right channels. The core means of such methods is to determine the inter-channel inconsistencies by an exhaustive analysis of the spectral characteristics of the binaural audio information, or by measuring the respective volume intensity differences. This type of technique has a certain feasibility and practicality in processing scenes such as those where only a single side channel has valid sound, or where the sound intensities of both side channels are significantly different.
In contrast, deep learning schemes exhibit greater flexibility and intelligence potential. The method can capture and analyze subtle differences of human voice and background noise in left and right channels more precisely by utilizing a complex neural network architecture, so as to accurately judge whether a mono phenomenon exists.
The application provides a model training method, an audio component missing identification method, an apparatus, an electronic device, a computer readable storage medium and a computer program product, and aims to solve the technical problems in the prior art.
The technical solutions of the embodiments of the present application and technical effects produced by the technical solutions of the present application are described below by describing several exemplary embodiments. It should be noted that the following embodiments may be referred to, or combined with each other, and the description will not be repeated for the same terms, similar features, similar implementation steps, and the like in different embodiments.
Referring to fig. 1, fig. 1 is a schematic diagram of an alternative architecture of an audio component deficiency identifying system 100 according to an embodiment of the present application, in order to support an exemplary application, a terminal 400 is connected to a server 200 through a network 300, where the network 300 may be a wide area network or a local area network, or a combination of the two, and a wireless link is used to implement data transmission.
In practical applications, the terminal 400 may be any of various types of user terminals such as a smart phone, a tablet computer, or a notebook computer, and may also be a desktop computer, a game console, a television, or a combination of any two or more of these data processing devices; the server 200 may be a separately configured server supporting various services, a server cluster, a cloud server, or the like. In practical implementation, the audio component missing identification method provided by the embodiment of the application can be implemented by the server or the terminal alone or cooperatively.
In some embodiments, the terminal 400 is configured to perform audio segment extraction on an audio stream to be identified, so as to obtain a plurality of audio segments; splitting each audio fragment into audio to be detected of a left channel and a right channel; inputting the audio to be detected of the left and right channels into an audio consistency recognition model for type consistency recognition, and obtaining a corresponding type consistency recognition result; if the type consistency recognition result indicates that the types of the audio to be detected of the left channel and the right channel are inconsistent, determining that the audio to be detected of the left channel and the right channel has audio component missing.
In other embodiments, the terminal 400 has an audio collection device (e.g., a microphone) mounted thereon, through which audio streams are collected and transmitted to the server 200.
The server 200 is configured to perform audio fragment extraction on an audio stream to be identified to obtain a plurality of audio fragments; splitting each audio fragment into audio to be detected of a left channel and a right channel; inputting the audio to be detected of the left and right channels into an audio consistency recognition model for type consistency recognition, and obtaining a corresponding type consistency recognition result; if the type consistency recognition result indicates that the types of the audio to be detected of the left channel and the right channel are inconsistent, determining that the audio to be detected of the left channel and the right channel has audio component deficiency, and returning the recognition result of the audio component deficiency to the terminal 400, wherein the terminal 400 executes the next processing, such as re-collecting the audio stream or checking whether the audio collection device of the terminal fails or not, based on the recognition result of the audio component deficiency.
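The following is a minimal sketch of this inference flow, assuming a hypothetical consistency_model callable that judges whether the left-channel and right-channel audio types of a segment are consistent; the segment length, sample rate, and helper names are assumptions for illustration.

```python
import numpy as np

SEGMENT_SECONDS = 5      # segment length assumed from the 5-second example described later
SAMPLE_RATE = 16_000     # assumed sample rate

def detect_component_loss(stereo: np.ndarray, consistency_model):
    """stereo: waveform of shape (num_samples, 2).
    consistency_model: hypothetical callable returning True when the left/right
    audio types of a segment are judged consistent.
    Returns one flag per segment: True means an audio component is considered missing."""
    seg_len = SEGMENT_SECONDS * SAMPLE_RATE
    flags = []
    for start in range(0, stereo.shape[0] - seg_len + 1, seg_len):
        segment = stereo[start:start + seg_len]
        left, right = segment[:, 0], segment[:, 1]   # split the segment into the two channels
        consistent = consistency_model(left, right)  # audio consistency recognition
        flags.append(not consistent)
    return flags
```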
The embodiment of the application can automatically identify and point out single-channel missing conditions in videos, which greatly reduces the workload of manual review and improves processing efficiency. More importantly, it helps to significantly optimize the user experience, so that users are willing to spend more time immersed in viewing while enjoying high-quality audio and video content, which further increases users' dwell time and interaction depth on the platform, attracts and retains more loyal users, and strongly promotes the activity and participation of the whole community ecology.
The embodiment of the application provides a model training method, as shown in fig. 2, which comprises the following steps:
S101, obtaining a plurality of training samples, wherein each training sample comprises left-channel audio, right-channel audio, and labeling information.
It should be noted that, in order to train a model capable of identifying single-channel loss, the related art, when performing supervised learning, generally constructs the training sample and its labeling information according to whether the audio of one of the two channels is entirely missing; that is, a training sample may be the audio of the left channel and the audio of the right channel, and correspondingly the labeling information indicates that the audio of a certain channel in the training sample is missing, or whether any audio is missing.
A training sample of the embodiment of the application comprises left-channel and right-channel audio, and the audio type of each channel is either original audio or missing audio in which one audio component of the original audio is absent. The audio component types of the embodiment of the application include human voice and background sound; correspondingly, the missing audio may be audio in which the human voice of the original audio is missing, or audio in which the background sound of the original audio is missing. The labeling information in the embodiment of the application is used to indicate whether the types of the two-channel audio in the corresponding training sample are consistent. Table 1 exemplarily shows the classes of training samples constructed in the embodiment of the application:
TABLE 1
Therefore, the embodiment of the application can greatly increase, on the basis of an original audio pair (that is, a pair whose left and right channels are both original audio), the number of training samples, reduce the difficulty of acquiring training samples, and help improve training efficiency. Moreover, the labeling information in the application only indicates whether the types are consistent, rather than whether an audio component is missing. This fully takes into account that, in practical applications, if the left and right channels keep only the human voice or only the background sound, it generally means that an audio component is missing due to a fault, and if the types of the left and right audio are inconsistent, there is a high probability of a fault. Therefore, a model trained with training samples composed of the left-channel and right-channel audio and such labeling information can be effectively applied to the identification of audio component loss.
The training sample of the embodiment of the application can be obtained by the following modes:
Obtaining at least one initial audio pair, wherein the audio of the left channel and the right channel of the initial audio pair are both original audio;
For each initial audio pair, determining a respective audio component from the audio of each channel in the initial audio pair;
For the original audio of each channel in each initial audio pair, shielding each audio component in the original audio to obtain each missing audio corresponding to the original audio;
For each initial audio pair, combining original audio of two channels of the initial audio pair with each missing audio, and setting corresponding labeling information according to whether the types of the combined audio of the two channels are consistent or not so as to obtain each training sample corresponding to the initial audio pair.
In the embodiment of the application, the audio components of each channel in the initial audio pair are determined, and each audio component of the original audio of each channel is masked to obtain each missing audio corresponding to the original audio, such as missing audio without the human voice and missing audio without the background sound. An audio set is thus obtained for each channel, containing the original audio and each missing audio. Audio is then drawn at random from the audio sets of the two channels and combined, and the corresponding labeling information is set according to whether the types of the combined two-channel audio are consistent, so as to obtain each training sample corresponding to the initial audio pair.
Referring to fig. 3, a schematic flow chart of obtaining a training sample according to an embodiment of the present application is shown, where:
Firstly, the application acquires an audio stream. In some embodiments, a normal online video sample may be extracted and the audio stream extracted from the video sample; of course, a normal audio stream may also be obtained online directly. It should be understood that the audio of the left and right channels of the normal video sample or audio stream is original audio;
Then, the audio stream is segmented by a preset duration; in some embodiments, it may be segmented every 5 seconds. Each audio segment is then split into audio files of the left channel and the right channel;
Then, each audio file can be subjected to source separation using a preset audio separation tool (for example, Spleeter) to obtain the human voice and background sound corresponding to each audio file;
Further, for each pair of two-channel audio files, the human voice or the background sound in at least one of the audio files is randomly masked out, and corresponding labeling information is added, so that a training sample is obtained.
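A minimal sketch of this sample-construction flow is given below; the separate callable stands in for a Spleeter-style two-stem separator, and its exact interface, as well as masking a component by subtracting it from the original waveform, are assumptions for illustration rather than details specified by the embodiment of the application.

```python
import itertools
import numpy as np

def build_training_samples(left: np.ndarray, right: np.ndarray, separate):
    """Builds labelled training samples from one two-channel segment whose
    left/right audio are both original audio.

    `separate` is assumed to return (vocals, background) for a mono waveform.
    """
    def variants(channel):
        vocals, background = separate(channel)
        return {
            "original": channel,
            "vocals_missing": channel - vocals,          # mask out the human voice
            "background_missing": channel - background,  # mask out the background sound
        }

    left_set, right_set = variants(left), variants(right)
    samples = []
    for (l_type, l_audio), (r_type, r_audio) in itertools.product(
            left_set.items(), right_set.items()):
        label = 1 if l_type == r_type else 0  # 1: types consistent, 0: inconsistent
        samples.append((l_audio, r_audio, label))
    return samples
```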
S102, performing multiple rounds of iterative training on the comparison learning model to converge according to the training samples, and obtaining an audio consistency recognition model.
According to the embodiment of the application, the obtained plurality of training samples include audio pairs with consistent types and audio pairs with inconsistent types, as well as audio pairs containing original audio and audio pairs containing missing audio, which greatly increases the number of samples used to train the comparison learning model; the contrast learning adopted by the comparison learning model is mainly computed as a similarity.
On the basis of the above embodiments, as an optional embodiment, the contrast learning model of the embodiment of the application includes two branch models with different model structures, and each branch model performs feature extraction on a training sample to obtain two-channel audio features. It should be understood that, because the model structures of the two branch models are not identical, the two-channel audio features they obtain from the same training sample differ according to their respective structures. By adopting two branch models with different structures, richer audio features can be obtained for the audio of the same channel, which indirectly improves the audio analysis accuracy of the model without requiring a model with extremely strong analysis capability, thereby reducing the cost of model training.
Each round of iterative training of the embodiment of the application comprises the following steps:
Inputting the training samples into a comparison learning model of the round of iteration to obtain a first similarity of each training sample; the first similarity of each training sample comprises a first sub-similarity between the audio characteristics of each channel of the training sample obtained by one branch model and the audio characteristics of different channels of the training sample obtained by another branch model;
Obtaining a loss function value of the iterative training, wherein the loss function value comprises first loss values of all training samples, and the first loss value of each training sample is obtained according to the first similarity and the labeling information;
And adjusting model parameters of the contrast learning model according to the loss function value.
According to the embodiment of the application, by inputting a training sample into the contrast learning model of this iteration, the first similarity of the training sample can be obtained. The first similarity of the training sample comprises the first sub-similarities between the audio features of each channel of the training sample obtained by one branch model and the audio features of the other channel of the training sample obtained by the other branch model. Since the training sample comprises two channels, there are two first sub-similarities: one is the first sub-similarity between the audio feature of the left channel of the training sample obtained by the first branch model and the audio feature of the right channel of the training sample obtained by the second branch model, and the other is the first sub-similarity between the audio feature of the right channel of the training sample obtained by the first branch model and the audio feature of the left channel of the training sample obtained by the second branch model.
The loss function value of the embodiment of the application comprises a first loss value of each training sample, and the first loss value is obtained according to the first similarity and the labeling information.
Referring to fig. 4, which shows a schematic diagram of a contrast learning model of an embodiment of the present application, the contrast learning model includes two branch models with different structures, namely a first branch model and a second branch model. A training sample is input into the two branch models respectively, and the audio features of the two channels obtained by each branch model are obtained. The first sub-similarity between the audio feature of the left channel of the training sample obtained by the first branch model and the audio feature of the right channel of the training sample obtained by the second branch model is calculated, and the first sub-similarity between the audio feature of the right channel of the training sample obtained by the first branch model and the audio feature of the left channel of the training sample obtained by the second branch model is calculated.
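For illustration, the cross-branch, cross-channel first sub-similarities described above might be computed as in the following sketch; cosine similarity is an assumed choice of similarity measure, which the embodiment of the application does not fix.

```python
import torch
import torch.nn.functional as F

def first_sub_similarities(branch1, branch2, left: torch.Tensor, right: torch.Tensor):
    """Cross-branch, cross-channel similarities for one training sample.
    branch1 / branch2 are the two differently structured branch models."""
    f_l1, f_r1 = branch1(left), branch1(right)   # two-channel features from branch model 1
    f_l2, f_r2 = branch2(left), branch2(right)   # two-channel features from branch model 2
    sim_l1_r2 = F.cosine_similarity(f_l1, f_r2, dim=-1)  # left (branch 1) vs right (branch 2)
    sim_r1_l2 = F.cosine_similarity(f_r1, f_l2, dim=-1)  # right (branch 1) vs left (branch 2)
    return sim_l1_r2, sim_r1_l2
```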
On the basis of the above embodiments, as an optional embodiment, one of the two branch models in the embodiment of the present application is further configured to buffer the two-channel audio characteristics of at least one negative sample obtained in this iteration during each iteration training.
The negative sample of the embodiment of the application is a training sample whose labeling information indicates that the audio types of the two channels in the training sample are inconsistent. The second similarity is additionally obtained specifically for negative samples in order to better train the model's ability to identify samples with inconsistent audio types.
According to the embodiment of the application, the two-channel audio features of negative samples are cached during each round of iterative training, so that during each round of iterative training the similarity between the two-channel audio features of a negative sample obtained in this iteration and the two-channel audio features (also called reference audio feature pairs) of negative samples (also called reference samples) obtained in several historical iterations can be calculated, to obtain the second similarity of the negative sample. The second similarity represents the difference between the current iteration's and the historical iterations' understanding of negative samples, and helps to obtain more stable and robust audio features.
It should be noted that, since the audio feature of each channel obtained by each branching model needs to calculate the second sub-similarity with the audio feature of the same channel in each reference audio feature pair, there are 4 kinds of second sub-similarities:
a second sub-similarity between the audio features of the left channel obtained by the first branch model and the audio features of the left channel in a reference audio feature pair;
a second sub-similarity between the audio features of the right channel obtained by the first branch model and the audio features of the right channel in a pair of reference audio features;
a second sub-similarity between the audio features of the left channel obtained by the second branching model and the audio features of the left channel in a reference audio feature pair;
A second sub-similarity between the audio features of the right channel obtained by the second branching model and the audio features of the right channel in a pair of reference audio features;
The second similarity of a negative sample is thus the similarity between the audio features of the negative sample and a plurality of reference audio feature pairs, which reflects the diversity of the samples and is more conducive to obtaining stable audio features.
The loss function value of the embodiment of the application further comprises a second loss value of the negative samples, and the second loss value of each negative sample is obtained according to the second similarity of the negative samples.
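A minimal sketch of such a cache of reference audio feature pairs is shown below; the FIFO behaviour and the capacity value are assumptions for illustration, not details specified by the embodiment of the application.

```python
from collections import deque

class ReferenceFeatureCache:
    """FIFO cache of the two-channel audio features of recent negative samples,
    used as the reference audio feature pairs (the capacity of 64 is an assumption)."""
    def __init__(self, capacity: int = 64):
        self._pairs = deque(maxlen=capacity)

    def push(self, left_feat, right_feat):
        # detach so the cached history does not receive gradients
        self._pairs.append((left_feat.detach(), right_feat.detach()))

    def sample(self, k: int):
        """Returns up to k of the most recently cached reference pairs."""
        return list(self._pairs)[-k:]
```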
Referring to fig. 5, which is an exemplary schematic diagram of another contrast learning model provided by the embodiment of the present application, the contrast learning model includes two branch models with different structures, namely a first branch model and a second branch model. A training sample is input into the two branch models respectively to obtain the audio features of the two channels obtained by each branch model, and the second branch model is further used to cache, during each round of iterative training, the two-channel audio features of at least one negative sample obtained in this iteration. Thus, on the one hand, the similarities between the audio features of different channels obtained by the two branch models are calculated, that is, the first similarity, which includes the first sub-similarity between the audio feature of the left channel of the training sample obtained by the first branch model and the audio feature of the right channel of the training sample obtained by the second branch model, and the first sub-similarity between the audio feature of the right channel of the training sample obtained by the first branch model and the audio feature of the left channel of the training sample obtained by the second branch model. On the other hand, for a negative sample, the second sub-similarities between the audio features of each channel obtained by each branch model and the audio features of the same channel in the cached reference audio feature pairs are calculated to obtain the second similarity of the negative sample. The first loss value of a training sample is obtained according to its first similarity and labeling information, the second loss value of a negative sample is obtained according to its second similarity, and the model parameters of the contrast learning model are adjusted using the first loss value and the second loss value.
On the basis of the foregoing embodiments, as an optional embodiment, inputting the plurality of training samples into the iterative contrast learning model of the present round, further includes: and obtaining a third similarity of each training sample.
The third similarity of a training sample in the embodiment of the present application represents the similarity between the audio features of two channels of the corresponding training sample obtained by a branching model. That is, the third similarity of the training samples obtained in the embodiment of the present application is obtained from the prediction accuracy of one branch model, and accordingly, the loss function further includes the third loss value of each training sample, and the third loss value of each training sample is obtained according to the third similarity and the labeling information of the training sample, so as to promote optimization of a single branch model.
It should be noted that in the embodiment of the present application, one of the two branch models is a first branch model, and the other branch model is a second branch model, so that a third similarity of each training sample can be obtained according to the similarity between the audio features of the two channels of the corresponding training sample obtained by the first branch model, and the second branch model is used for buffering the audio features of the two channels of the at least one negative sample obtained by the present iteration during each round of iterative training, so as to avoid excessive calculation amount of one branch model.
Referring to fig. 6, which is an exemplary schematic diagram of a framework of a further contrast learning model provided by the embodiment of the present application, the contrast learning model includes two branch models with different structures, namely a first branch model and a second branch model. A training sample is input into the two branch models respectively to obtain the audio features of the two channels obtained by each branch model, and the second branch model is further used to cache, during each round of iterative training, the two-channel audio features of at least one negative sample obtained in this iteration. Thus, in the first aspect, the similarities between the audio features of different channels obtained by the two branch models are calculated, that is, the first similarity, which includes the first sub-similarity between the audio feature of the left channel of the training sample obtained by the first branch model and the audio feature of the right channel of the training sample obtained by the second branch model, and the first sub-similarity between the audio feature of the right channel of the training sample obtained by the first branch model and the audio feature of the left channel of the training sample obtained by the second branch model. In the second aspect, for a negative sample, the second sub-similarities between the audio features of each channel obtained by each branch model and the audio features of the same channel in a preset number of reference audio feature pairs are calculated to obtain the second similarity of the negative sample. In the third aspect, the third similarity between the audio features of the two channels of the training sample obtained by the first branch model is calculated. The first loss value of a training sample is obtained according to its first similarity and labeling information, the second loss value of a negative sample is obtained according to its second similarity, the third loss value of a training sample is obtained according to its third similarity and labeling information, and the model parameters of the contrast learning model are adjusted using the first loss value, the second loss value, and the third loss value.
On the basis of the foregoing embodiments, as an optional embodiment, the first similarity includes a first sub-similarity between an audio feature of a left channel obtained by one branching model and an audio feature of a right channel obtained by another branching model, and a first sub-similarity between an audio feature of a right channel obtained by the one branching model and an audio feature of a left channel obtained by the other branching model;
accordingly, the first loss value is obtained by:
weighting the first sub-similarity by a first weight to obtain a first weighted value;
weighting the second sub-similarity by a second weight to obtain a second weighted value;
taking the sum of the first weighted value and the second weighted value as the first loss value;
The first weight is determined according to the labeling information, and the second weight is a difference value between a preset value and the first weight.
Specifically, the first loss value L1 of a training sample according to the embodiment of the present application may be calculated by the following formula:
L1 = label × sim(f_L^(1), f_R^(2)) + (label − 1) × sim(f_R^(1), f_L^(2))
wherein label is the first weight determined according to the labeling information of the training sample, label − 1 is the difference between the preset value and the first weight, namely the second weight, sim(·,·) denotes the similarity between two audio features, f_L^(1) represents the audio feature of the left channel of the training sample obtained by the first branch model, f_R^(1) represents the audio feature of the right channel of the training sample obtained by the first branch model, f_L^(2) represents the audio feature of the left channel of the training sample obtained by the second branch model, and f_R^(2) represents the audio feature of the right channel of the training sample obtained by the second branch model.
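A direct sketch of this first loss value, under the label convention described below (1 for consistent audio types, 0 for inconsistent), is:

```python
def first_loss(sim_l1_r2, sim_r1_l2, label: int):
    """First loss value of one training sample, following the formula above:
    L1 = label * sim(f_L(1), f_R(2)) + (label - 1) * sim(f_R(1), f_L(2))."""
    return label * sim_l1_r2 + (label - 1) * sim_r1_l2
```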
Based on the foregoing embodiments, as an optional embodiment, if the labeling information is used to indicate that the types of the two-channel audio in the corresponding training sample are consistent, the label is 1, and if the labeling information is used to indicate that the types of the two-channel audio in the corresponding training sample are inconsistent, the label is 0, so that:
If the labeling information of the training samples indicates that the types of the two-channel audios in the corresponding training samples are consistent, taking any one of the two first sub-similarities as the first loss value;
And if the labeling information of the training samples indicates that the types of the two-channel audio in the corresponding training samples are inconsistent, taking the negative value of any one of the two first sub-similarities as the first loss value.
According to the embodiment of the application, when the labeling information indicates that the types of the two-channel audios in the corresponding training samples are consistent, any one first sub-similarity is used as a first loss value, and when the types of the two-channel audios are inconsistent, a negative value of any one first sub-similarity is used as the first loss value, so that the first sub-similarity calculated for the positive samples is ensured to be as high as possible, and the first sub-similarity calculated for the negative samples is ensured to be as low as possible.
On the basis of the above embodiments, as an alternative embodiment, the second loss value of each negative sample is determined by:
for each reference sample, determining the similarity between the audio features of each channel of the negative sample obtained by each branch model and the audio features of the same channel of the reference sample, and taking the similarity as the similarity of the negative sample relative to the reference sample;
For each reference audio feature pair, obtaining a fourth similarity between the negative sample and the reference audio feature pair according to the sum of all second sub-similarities related to the negative sample and the reference audio feature pair;
A mean value of a fourth similarity of the negative sample with respect to each reference audio feature pair is determined, and a negative value of the mean value is taken as the second similarity of the negative sample. It can be seen that the embodiment of the application does not need to annotate information when calculating the second similarity, which is reflected by whether the understanding of the negative sample by the current iterative training becomes stable compared with the understanding of the negative sample by the historical iterative training.
Specifically, the second loss value L2 of a negative sample of the embodiment of the present application may be calculated by the following formula:
L2 = −(1/k) Σ_{i=1..k} [ sim(f_L^(1), g_L^(i)) + sim(f_R^(1), g_R^(i)) + sim(f_L^(2), g_L^(i)) + sim(f_R^(2), g_R^(i)) ]
wherein g_L^(i) represents the audio feature of the left channel in reference audio feature pair i, g_R^(i) represents the audio feature of the right channel in reference audio feature pair i, and k is the total number of reference audio feature pairs. It can be seen that the second loss value of one negative sample is obtained by means of summation in the embodiment of the present application.
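For illustration, the second loss value might be computed as in the following sketch, with cosine similarity as an assumed similarity measure and reference_pairs holding the cached reference audio feature pairs:

```python
import torch
import torch.nn.functional as F

def second_loss(f_l1, f_r1, f_l2, f_r2, reference_pairs):
    """Second loss value of one negative sample: the negative mean, over the k
    cached reference audio feature pairs, of the summed four second sub-similarities."""
    totals = []
    for g_l, g_r in reference_pairs:   # each pair: (left-channel feature, right-channel feature)
        totals.append(F.cosine_similarity(f_l1, g_l, dim=-1)
                      + F.cosine_similarity(f_r1, g_r, dim=-1)
                      + F.cosine_similarity(f_l2, g_l, dim=-1)
                      + F.cosine_similarity(f_r2, g_r, dim=-1))
    return -torch.stack(totals).mean()
```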
On the basis of the above embodiments, as an alternative embodiment, the third loss value of each training sample is obtained by:
Weighting a third similarity between audio features of two channels of the training sample obtained by a branch model by using a third weight and a fourth weight to obtain a third weighted value and a fourth weighted value;
taking the sum of the third weighted value and the fourth weighted value as the third loss value;
The third weight is determined according to the labeling information, and the fourth weight is a difference value between a preset value and the third weight.
Specifically, the third loss value L3 of a training sample according to the embodiment of the present application may be calculated by the following formula:
L3 = label × sim(f_L^(t), f_R^(t)) + (label − 1) × sim(f_L^(t), f_R^(t))
wherein f_L^(t) represents the audio feature of the left channel of the training sample obtained by branch model t, f_R^(t) represents the audio feature of the right channel of the training sample obtained by branch model t, and branch model t may be either of the two branch models.
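A corresponding sketch of the third loss value (cosine similarity assumed) is:

```python
import torch.nn.functional as F

def third_loss(f_l_t, f_r_t, label: int):
    """Third loss value of one training sample: the similarity between the two
    channels' features from a single branch model t, weighted by label and
    (label - 1) as in the formula above."""
    sim = F.cosine_similarity(f_l_t, f_r_t, dim=-1)
    return label * sim + (label - 1) * sim

# How the three loss values are aggregated into the overall loss function value is
# not spelled out above; a plain sum, e.g. loss = L1 + L2 + L3, is one simple assumption.
```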
Based on the foregoing embodiments, as an optional embodiment, if the labeling information is used to indicate that the types of the two-channel audio in the corresponding training sample are consistent, the label is 1, and if the labeling information is used to indicate that the types of the two-channel audio in the corresponding training sample are inconsistent, the label is 0, so that:
if the labeling information of the training sample indicates that the types of the two-channel audio in the corresponding training sample are consistent, taking the third similarity of the training sample as the third loss value;
And if the labeling information of the training samples indicates that the types of the two-channel audios in the corresponding training samples are inconsistent, taking a negative value of a third similarity of the training samples as the third loss value.
In the embodiment of the application, when the labeling information indicates that the types of the two-channel audio in the corresponding training sample are consistent, the third similarity is used as the third loss value, and when the types are inconsistent, the negative value of the third similarity is used as the third loss value, so that the third similarity calculated for positive samples is as high as possible and the third similarity calculated for negative samples is as low as possible.
On the basis of the above embodiments, as an alternative embodiment, both branch models include a feature extraction module for extracting initial audio features and a feature transformation module for mapping the initial audio features to a high-dimensional feature space; the number of the feature transformation modules in the two branch models is different.
According to the two branch models provided by the embodiment of the application, the feature transformation module is further added after the feature extraction module, so that the model can construct finer and more discriminant feature representation in a high-dimensional feature space, and the number of the feature transformation modules of the two branch models is different, so that the dimension of the audio features finally obtained by different branch models is different, and the feature values are also different, so that the comparison learning is performed.
On the basis of the above embodiments, as an alternative embodiment, the comparison learning model includes a first branch model and a second branch model;
Wherein the first branch model comprises a VGGish structure and two Projector structures;
the second branch model comprises a VGGish structure and one Projector structure.
The VGGish structure is a deep neural network for audio signal processing that can extract audio features. The VGGish architecture is designed based on the VGG convolutional neural network architecture, with a depth of 16 layers including convolutional layers, pooling layers and fully-connected layers. When processing an audio signal, the input signal first undergoes some preprocessing steps, such as conversion into a spectrogram or a mel-frequency spectrogram. These images are then input into the convolutional layers of VGGish to extract time-frequency information and features.
Referring to fig. 7, a schematic diagram of a VGGish structure according to an embodiment of the present application is shown, where the VGGish structure includes five modules, i.e., a front-end module, a first convolution module, a second convolution module, a third convolution module and a principal component analysis (PCA) module. The front-end module consists of two convolution layers with different step sizes and a max-pooling layer, and converts the audio signal into a mel-frequency spectrogram output. The first convolution module comprises four convolution layers with the same step size and a max-pooling layer, and extracts low-level features in the audio, such as a time-averaged frequency spectrum. The second convolution module comprises four convolution layers with the same step size and a max-pooling layer, and extracts higher-level features, such as a representation of the time-varying response of the input signal. The third convolution module comprises two convolution layers with different step sizes and a max-pooling layer, and expands the network from spatial-domain information to the time domain and the frequency domain. Finally, the PCA module maps the high-dimensional features extracted by the convolution layers to a fixed dimension for classification.
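The following sketch mirrors the module layout of fig. 7 at a coarse level. The channel counts, kernel sizes, strides and the pooling-plus-linear projection used in place of the PCA mapping are all illustrative assumptions; the sketch only shows how the five modules could be chained.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, strides):
    # A stack of 3x3 convolutions (one per entry in `strides`) followed by max pooling.
    layers, ch = [], in_ch
    for s in strides:
        layers += [nn.Conv2d(ch, out_ch, kernel_size=3, stride=s, padding=1), nn.ReLU()]
        ch = out_ch
    layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)

class VGGishBackbone(nn.Module):
    def __init__(self, out_dim: int = 128):
        super().__init__()
        self.front_end = conv_block(1, 64, strides=[2, 1])        # two convs with different strides
        self.block1 = conv_block(64, 128, strides=[1, 1, 1, 1])   # low-level features
        self.block2 = conv_block(128, 256, strides=[1, 1, 1, 1])  # higher-level features
        self.block3 = conv_block(256, 512, strides=[2, 1])        # time/frequency-domain expansion
        self.project = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                     nn.Linear(512, out_dim))     # stands in for the PCA mapping

    def forward(self, mel):
        # mel: (batch, 1, time, frequency) log-mel spectrogram of one channel's audio
        return self.project(self.block3(self.block2(self.block1(self.front_end(mel)))))
```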
Referring to fig. 8, a schematic diagram of a framework of another contrast learning model according to an embodiment of the present application is shown.
The contrast learning model comprises two branch models with different structures: a first branch model comprising a VGGish module and two Projector modules, and a second branch model comprising a VGGish module and one Projector module. A training sample is input into both branch models; each branch model first obtains initial audio features of the two channels through its VGGish module, and its Projector module(s) then perform a feature-space transformation on the initial audio features, so that each branch model yields its own audio features of the two channels. The second branch model additionally caches, in each round of iterative training, the two-channel audio features of at least one negative sample obtained in that round as reference audio feature pairs. On this basis, in the first aspect, a first sub-similarity is calculated between the audio feature of the left channel of a training sample obtained by the first branch model and the audio feature of the right channel of the same training sample obtained by the second branch model, and another first sub-similarity is calculated between the audio feature of the right channel obtained by the first branch model and the audio feature of the left channel obtained by the second branch model; the two first sub-similarities form the first similarity of the training sample. In the second aspect, for each negative sample, second sub-similarities are calculated between the audio feature of each channel of the negative sample obtained by each branch model and the audio feature of the same channel in a preset number of reference audio feature pairs, yielding the second similarity of the negative sample. In the third aspect, a third similarity is calculated between the audio features of the two channels of the training sample obtained by one branch model. In this way, the embodiment of the application obtains the first similarity and the third similarity of every training sample of the round of iteration, together with the second similarity of the negative samples among them. If the labeling information of a training sample indicates that the types of the two-channel audio in the corresponding training sample are consistent, either of the two first sub-similarities is taken as the first loss value; otherwise, the negative value of either first sub-similarity is taken as the first loss value. The second loss value of a negative sample is obtained according to its second similarity. If the labeling information indicates that the types of the two-channel audio are consistent, the third similarity of the training sample is taken as the third loss value; otherwise, the negative value of the third similarity is taken as the third loss value. The model parameters of the contrast learning model are then adjusted by using the first loss values, the second loss values and the third loss values.
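To make the interaction between the two branches and the three loss terms concrete, the following sketch walks through one training iteration. The use of cosine similarity, processing one labeled sample at a time, the equal weighting of the three losses, and the way the reference queue is appended to are simplifying assumptions, not details fixed by fig. 8.

```python
import torch
import torch.nn.functional as F

def cos(a, b):
    return F.cosine_similarity(a, b, dim=-1)

def train_step(branch1, branch2, left, right, label, ref_queue, optimizer):
    # branch1: VGGish + two Projectors; branch2: VGGish + one Projector.
    fl1, fr1 = branch1(left), branch1(right)
    fl2, fr2 = branch2(left), branch2(right)

    # First loss: cross-branch, cross-channel first sub-similarity (either of the two may be used).
    s1 = cos(fl1, fr2)
    loss1 = label * s1 + (label - 1) * s1

    # Third loss: similarity between the two channels' features from one branch.
    s3 = cos(fl2, fr2)
    loss3 = label * s3 + (label - 1) * s3

    loss = loss1.mean() + loss3.mean()

    # Second loss: only for negative samples, against the cached reference audio feature pairs.
    if label == 0 and ref_queue:
        fourth = [cos(fl1, ql) + cos(fr1, qr) + cos(fl2, ql) + cos(fr2, qr)
                  for ql, qr in ref_queue]            # one fourth similarity per reference pair
        loss = loss - torch.stack(fourth).mean()      # second loss: negative mean of the fourth similarities
    if label == 0:
        ref_queue.append((fl2.detach(), fr2.detach()))  # second branch caches this negative sample

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```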
The embodiment of the application also provides an audio component missing identification method, as shown in fig. 9, comprising the following steps:
s201, inputting audio to be detected of a left channel and a right channel into an audio consistency recognition model, and obtaining a recognition result output by the audio consistency recognition model, wherein the recognition result is used for indicating whether the types of the audio to be detected of the left channel and the right channel are consistent;
S202, if the identification result indicates that the types of the audio to be detected of the left channel and the right channel are inconsistent, determining that the audio to be detected of the left channel and the right channel has audio component missing.
The audio consistency recognition model according to the embodiment of the application is trained by the model training method provided in the above embodiments. By inputting the audio to be detected of the left and right channels into the audio consistency recognition model, the recognition result output by the model can be obtained, and this recognition result indicates whether the types of the audio to be detected of the left and right channels are consistent. Further, if the recognition result indicates that the types of the audio to be detected of the left and right channels are inconsistent, it is determined that the audio to be detected of the left and right channels has an audio component missing. In practical applications, a situation in which the left or right channel retains only the human voice or only the background voice generally means that an audio component is missing due to a fault, and such a fault manifests as inconsistent types of the left and right channel audio; therefore, the model trained from training samples comprising the audio of the left and right channels and the labeling information according to the embodiment of the application can be effectively applied to the recognition of audio component missing.
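A hypothetical usage sketch of this identification step is given below. The model's call signature and the boolean output are assumptions made for illustration; the embodiment only specifies the input (the two-channel audio to be detected) and the meaning of the output (whether the types are consistent).

```python
import torch

def has_missing_component(model, left_audio: torch.Tensor, right_audio: torch.Tensor) -> bool:
    # Returns True when an audio component is judged to be missing, i.e. when the
    # audio consistency recognition model reports inconsistent types for the two channels.
    model.eval()
    with torch.no_grad():
        types_consistent = model(left_audio, right_audio)
    return not bool(types_consistent)
```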
An embodiment of the present application provides a model training apparatus, as shown in fig. 10, which may include: a sample acquisition module 1001, and an iterative training module 1002, wherein,
The sample obtaining module 1001 is configured to obtain a plurality of training samples, where each training sample includes audio of a left channel and a right channel, and labeling information, the type of the audio of each channel of one training sample is original audio or missing audio, the missing audio is audio in which one audio component of the corresponding original audio is missing, and the labeling information is used to indicate whether the types of the audio of the two channels in the corresponding training sample are consistent; the audio component is human voice or background voice;
and the iterative training module 1002 is configured to perform multiple rounds of iterative training on the comparison learning model to converge according to the multiple training samples, so as to obtain an audio consistency recognition model.
The device of the embodiment of the present application may execute the model training method provided by the embodiment of the present application, and its implementation principle is similar, and actions executed by each module in the device of each embodiment of the present application correspond to steps in the model training method of each embodiment of the present application, and detailed functional descriptions of each module of the device may be referred to the descriptions in the corresponding methods shown in the foregoing, which are not repeated herein.
As an optional implementation manner, the comparison learning model comprises two branch models with different structures, and each branch model is used for extracting features of a training sample to obtain audio features of two channels;
The iterative training module comprises:
the similarity obtaining unit is used for inputting the training samples into a comparison learning model of the round of iteration to obtain first similarity of each training sample; the first similarity of each training sample comprises a first sub-similarity between the audio characteristics of any sound channel of the training sample obtained by one branch model and the audio characteristics of another sound channel of the training sample obtained by another branch model;
the loss value obtaining unit is used for obtaining a loss function value of the iterative training of the round, the loss function value comprises first loss values of all training samples, and the first loss value of each training sample is obtained according to the first similarity and the labeling information;
and the adjusting unit is used for adjusting model parameters of the contrast learning model according to the loss function value.
As an optional implementation manner, one of the two branch models is further used for buffering two-channel audio features of at least one negative sample obtained by the iteration of the round as at least one reference audio feature pair during each round of iterative training, where the negative sample is a training sample whose labeling information indicates that the types of the audio of the two channels in the corresponding training sample are inconsistent;
The similarity obtaining unit is further used for obtaining second similarity of each negative sample, wherein the second similarity of each negative sample comprises second sub-similarity between the audio features of each channel of the negative sample obtained by each branch model and the audio features of the same channel in a preset number of reference audio feature pairs;
the loss function value further comprises a second loss value for each negative sample, the second loss value for each negative sample being obtained from a second similarity of the negative samples.
As an optional implementation manner, the similarity obtaining unit is further configured to obtain a third similarity of each training sample, where the third similarity of each training sample represents a similarity between audio features of two channels of the corresponding training sample obtained by one branch model;
The loss function value further comprises a third loss value of each training sample, and the third loss value of each training sample is obtained according to the third similarity and the labeling information of the training sample.
As an alternative embodiment, the first loss value for each training sample is obtained by:
If the labeling information of the training samples indicates that the types of the two-channel audios in the corresponding training samples are consistent, taking any one of the two first sub-similarities as the first loss value;
And if the labeling information of the training samples indicates that the types of the two-channel audio in the corresponding training samples are inconsistent, taking the negative value of any one of the two first sub-similarities as the first loss value.
As an alternative embodiment, the second loss value for each negative sample is determined by:
For each reference audio feature pair, obtaining a fourth similarity between the negative sample and the reference audio feature pair according to the sum of all second sub-similarities related to the negative sample and the reference audio feature pair;
A mean value of a fourth similarity of the negative sample with respect to each reference audio feature pair is determined, and a negative value of the mean value is taken as the second similarity of the negative sample.
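A minimal sketch of this computation is shown below, again assuming cosine similarity; `channel_feats` is a hypothetical container, introduced only for illustration, that holds the negative sample's left/right features from each branch model.

```python
import torch
import torch.nn.functional as F

def second_loss(channel_feats, reference_pairs):
    # channel_feats: list of (left_feature, right_feature) tuples, one per branch model.
    # reference_pairs: cached (left_feature, right_feature) reference audio feature pairs.
    cos = lambda a, b: F.cosine_similarity(a, b, dim=-1)
    fourth_similarities = []
    for ref_left, ref_right in reference_pairs:
        # Sum of all second sub-similarities involving this reference pair.
        total = sum(cos(fl, ref_left) + cos(fr, ref_right) for fl, fr in channel_feats)
        fourth_similarities.append(total)
    # Negative mean over the reference pairs gives this negative sample's contribution.
    return -torch.stack(fourth_similarities).mean()
```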
As an alternative embodiment, the third loss value for each training sample is obtained by:
if the labeling information of the training samples indicates that the types of the two-channel audios in the corresponding training samples are consistent, taking the third similarity of the training sample as the third loss value;
And if the labeling information of the training samples indicates that the types of the two-channel audios in the corresponding training samples are inconsistent, taking a negative value of a third similarity of the training samples as the third loss value.
As an alternative embodiment, both branch models include a feature extraction module for extracting initial audio features and a feature transformation module for mapping the initial audio features to a high-dimensional feature space;
the number of the feature transformation modules in the two branch models is different.
As an alternative embodiment, the feature extraction module is VGGish modules;
The feature transformation module is a Projector module.
As an alternative embodiment, the sample acquisition module includes:
An initial audio pair obtaining unit, configured to obtain at least one initial audio pair, where audio of both left and right channels of the initial audio pair is original audio;
an audio component determination unit configured to determine, for each initial audio pair, a respective audio component from audio of each channel in the initial audio pair;
the masking unit is used for masking each audio component in the original audio for each channel of each initial audio pair to obtain each missing audio corresponding to the original audio;
The combination unit is used for combining the original audio of the two channels of the initial audio pair and each missing audio of each initial audio pair, and setting corresponding labeling information according to whether the types of the audio of the two combined channels are consistent or not so as to obtain each training sample corresponding to the initial audio pair.
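The following sketch illustrates how the combination unit's output could be constructed from one initial audio pair. The `separate` function standing in for the vocal/background separation, the waveform subtraction used for masking, and the treatment of each masked variant as its own type are assumptions made for illustration only.

```python
from itertools import product

def build_training_samples(left_original, right_original, separate):
    # separate(audio) -> {"vocals": ..., "background": ...}; any source-separation
    # tool could play this role, none is prescribed here.
    def variants(audio):
        parts = separate(audio)
        return {
            "original": audio,
            "missing_vocals": audio - parts["vocals"],          # human-voice component masked
            "missing_background": audio - parts["background"],  # background component masked
        }
    samples = []
    for (l_type, l_audio), (r_type, r_audio) in product(variants(left_original).items(),
                                                        variants(right_original).items()):
        label = 1 if l_type == r_type else 0   # labeling information: types consistent or not
        samples.append((l_audio, r_audio, label))
    return samples
```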
An embodiment of the present application provides an audio component missing identifying apparatus, as shown in fig. 11, which may include: an inference module 1101, and an identification module 1102, wherein,
The reasoning module 1101 is configured to input audio to be detected of the left and right channels to an audio consistency recognition model, and obtain a recognition result output by the audio consistency recognition model, where the recognition result is used to indicate whether types of the audio to be detected of the left and right channels are consistent;
the identifying module 1102 is configured to determine that the audio components of the audio to be detected of the left and right channels are absent if the identification result indicates that the types of the audio to be detected of the left and right channels are inconsistent;
the audio consistency recognition model is trained by the model training method provided by the embodiments.
The device of the embodiment of the present application may execute the method for identifying missing audio components provided by the embodiment of the present application, and its implementation principle is similar, and actions executed by each module in the device of each embodiment of the present application correspond to steps in the method for identifying missing audio components of each embodiment of the present application, and detailed functional descriptions of each module of the device may be referred to in the corresponding method shown in the foregoing, which is not repeated herein.
The embodiment of the application provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory, wherein the processor executes the computer program to implement the steps of the model training method or the audio component missing identification method. Compared with the related art, the following can be achieved: the obtained plurality of training samples comprise audio pairs whose types are consistent and audio pairs whose types are inconsistent, and at the same time comprise audio pairs containing original audio and audio pairs containing missing audio, so the number of samples used to train the contrast learning model is greatly increased. The contrastive learning adopted by the contrast learning model mainly amounts to computing similarity, which in the embodiment of the application is the similarity between the audio of the left channel and the audio of the right channel; the application therefore does not need a detailed analysis of the features of the two-channel audio and places no high requirement on the computing capability of the model. This is completely different from the approach of simply judging volume intensity: the model can, as far as possible, identify whether the audio has been processed and whether the audio types of the left and right channels are consistent, which strengthens the learning and understanding of the correlation between original audio and processed audio, and lays a foundation for accurately obtaining the judgment result of whether an audio component is missing according to the subsequent recognition result, output by the model, of whether the audio types are consistent.
In an alternative embodiment, there is provided an electronic device, as shown in fig. 12, the electronic device 4000 shown in fig. 12 includes: a processor 4001 and a memory 4003. Wherein the processor 4001 is coupled to the memory 4003, such as via a bus 4002. Optionally, the electronic device 4000 may further comprise a transceiver 4004, the transceiver 4004 may be used for data interaction between the electronic device and other electronic devices, such as transmission of data and/or reception of data, etc. It should be noted that, in practical applications, the transceiver 4004 is not limited to one, and the structure of the electronic device 4000 is not limited to the embodiment of the present application.
The processor 4001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. The processor may implement or perform the various exemplary logic blocks, modules and circuits described in connection with this disclosure. The processor 4001 may also be a combination that implements computing functionality, for example, a combination of one or more microprocessors, a combination of a DSP and a microprocessor, etc.
Bus 4002 may include a path to transfer information between the aforementioned components. Bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 4002 can be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, bus 4002 is shown with only one bold line in the figures, but this does not mean that there is only one bus or one type of bus.
Memory 4003 may be, but is not limited to, a ROM (Read-Only Memory) or other type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read-Only Memory), a CD-ROM (Compact Disc Read-Only Memory) or other optical disk storage, optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media, other magnetic storage devices, or any other medium that can be used to carry or store a computer program and that can be read by a computer.
The memory 4003 is used for storing a computer program for executing an embodiment of the present application, and is controlled to be executed by the processor 4001. The processor 4001 is configured to execute a computer program stored in the memory 4003 to realize the steps shown in the foregoing method embodiment.
Embodiments of the present application provide a computer readable storage medium having a computer program stored thereon, which when executed by a processor, implements the steps of the foregoing method embodiments and corresponding content.
The embodiment of the application also provides a computer program product, which comprises a computer program, wherein the computer program can realize the steps and corresponding contents of the embodiment of the method when being executed by a processor.
The terms "first," "second," "third," "fourth," "1," "2," and the like in the description and in the claims and in the above figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate, such that the embodiments of the application described herein may be implemented in other sequences than those illustrated or otherwise described.
It should be understood that, although various operation steps are indicated by arrows in the flowcharts of the embodiments of the present application, the order in which these steps are implemented is not limited to the order indicated by the arrows. In some implementations of embodiments of the application, the implementation steps in the flowcharts may be performed in other orders as desired, unless explicitly stated herein. Furthermore, some or all of the steps in the flowcharts may include multiple sub-steps or multiple stages based on the actual implementation scenario. Some or all of these sub-steps or phases may be performed at the same time, or each of these sub-steps or phases may be performed at different times, respectively. In the case of different execution time, the execution sequence of the sub-steps or stages can be flexibly configured according to the requirement, which is not limited by the embodiment of the present application.
The foregoing is merely an optional implementation manner of some of the implementation scenarios of the present application, and it should be noted that, for those skilled in the art, other similar implementation manners based on the technical ideas of the present application are adopted without departing from the technical ideas of the scheme of the present application, and the implementation manner is also within the protection scope of the embodiments of the present application.

Claims (15)

1. A method of model training, comprising:
Obtaining a plurality of training samples, wherein each training sample comprises audio of a left channel and a right channel and labeling information, the type of the audio of each channel of one training sample is original audio or missing audio, the missing audio is audio in which one audio component of the corresponding original audio is missing, and the labeling information is used for indicating whether the types of the audio of the two channels in the corresponding training sample are consistent;
and carrying out multiple rounds of iterative training on the comparison learning model until convergence according to the training samples to obtain an audio consistency recognition model.
2. The method of claim 1, wherein the comparative learning model comprises two branch models with different structures, each branch model being used for extracting features from training samples to obtain two-channel audio features;
wherein each round of iterative training comprises:
Inputting the training samples into a comparison learning model of the round of iteration to obtain a first similarity of each training sample; the first similarity of each training sample comprises a first sub-similarity between the audio feature of any channel of the training sample obtained by one branch model and the audio feature of another channel of the training sample obtained by another branch model;
Obtaining a loss function value of the iterative training, wherein the loss function value comprises first loss values of all training samples, and the first loss value of each training sample is obtained according to the first similarity and the labeling information;
And adjusting model parameters of the contrast learning model according to the loss function value.
3. The method according to claim 2, wherein one of the two branch models is further used for buffering, as at least one reference audio feature pair, two-channel audio features of at least one negative sample obtained by the present iteration during each round of iterative training, where the negative sample is a training sample whose labeling information indicates that the types of the audio of the two channels in the corresponding training sample are inconsistent;
The inputting the plurality of training samples into a contrast learning model further comprises:
obtaining second similarity of each negative sample, wherein the second similarity of each negative sample comprises second sub-similarity between the audio features of each channel of the negative sample obtained by each branch model and the audio features of the same channel in a preset number of reference audio feature pairs;
the loss function value further includes a second loss value for each negative sample, the second loss value for each negative sample being obtained from a second similarity of the negative samples.
4. The method of claim 2, wherein inputting the plurality of training samples into the iterative contrast learning model of the present round further comprises: obtaining third similarity of each training sample, wherein the third similarity of each training sample represents similarity between audio features of two channels of the corresponding training sample obtained by a branch model;
The loss function value further comprises a third loss value of each training sample, and the third loss value of each training sample is obtained according to the third similarity and the labeling information of the training sample.
5. The method of claim 2, wherein the first loss value for each training sample is obtained by:
If the labeling information of the training samples indicates that the types of the two-channel audios in the corresponding training samples are consistent, taking any one of the two first sub-similarities as the first loss value;
And if the labeling information of the training samples indicates that the types of the two-channel audio in the corresponding training samples are inconsistent, taking the negative value of any one of the two first sub-similarities as the first loss value.
6. A method according to claim 3, wherein the second loss value for each negative sample is determined by:
For each reference audio feature pair, obtaining a fourth similarity between the negative sample and the reference audio feature pair according to the sum of all second sub-similarities related to the negative sample and the reference audio feature pair;
A mean value of a fourth similarity of the negative sample with respect to each reference audio feature pair is determined, and a negative value of the mean value is taken as the second similarity of the negative sample.
7. The method of claim 4, wherein the third loss value for each training sample is obtained by:
if the labeling information of the training samples indicates that the types of the two-channel audios in the corresponding training samples are consistent, taking the third similarity of the training sample as the third loss value;
And if the labeling information of the training samples indicates that the types of the two-channel audios in the corresponding training samples are inconsistent, taking a negative value of a third similarity of the training samples as the third loss value.
8. The method of any of claims 1-7, wherein both branch models include a feature extraction module for extracting initial audio features and a feature transformation module for mapping the initial audio features to a high-dimensional feature space;
the number of the feature transformation modules in the two branch models is different.
9. The method of claim 1, wherein the obtaining a plurality of training samples comprises:
Obtaining at least one initial audio pair, wherein the audio of the left channel and the right channel of the initial audio pair are both original audio;
For each initial audio pair, determining a respective audio component from the audio of each channel in the initial audio pair;
For the original audio of each channel in each initial audio pair, shielding each audio component in the original audio to obtain each missing audio corresponding to the original audio;
For each initial audio pair, combining original audio of two channels of the initial audio pair with each missing audio, and setting corresponding labeling information according to whether the types of the combined audio of the two channels are consistent or not so as to obtain each training sample corresponding to the initial audio pair.
10. An audio component missing identification method, characterized by comprising:
Inputting audio to be detected of a left channel and a right channel into an audio consistency recognition model, and obtaining a recognition result output by the audio consistency recognition model, wherein the recognition result is used for indicating whether the types of the audio to be detected of the left channel and the right channel are consistent;
If the identification result indicates that the types of the audio to be detected of the left channel and the right channel are inconsistent, determining that the audio to be detected of the left channel and the right channel has audio component missing;
wherein the audio consistency recognition model is trained by the model training method of any one of claims 1-9.
11. A model training device, comprising:
A sample obtaining module, configured to obtain a plurality of training samples, wherein each training sample comprises audio of a left channel and a right channel and labeling information, the type of the audio of each channel of one training sample is original audio or missing audio, the missing audio is audio in which one audio component of the corresponding original audio is missing, and the labeling information is used for indicating whether the types of the audio of the two channels in the corresponding training sample are consistent;
And the iterative training module is used for carrying out multi-round iterative training on the comparison learning model to converge according to the training samples so as to obtain an audio consistency recognition model.
12. An audio component absence identifying apparatus, comprising:
A reasoning module, configured to input the audio to be detected of the left channel and the right channel into an audio consistency recognition model and obtain a recognition result output by the audio consistency recognition model, wherein the recognition result is used for indicating whether the types of the audio to be detected of the left channel and the right channel are consistent;
The identification module is used for determining that the audio components of the audio to be detected of the left channel and the right channel are absent if the identification result indicates that the types of the audio to be detected of the left channel and the right channel are inconsistent;
wherein the audio consistency recognition model is trained by the model training method of any one of claims 1-9.
13. An electronic device comprising a memory, a processor and a computer program stored on the memory, characterized in that the processor executes the computer program to implement the method of any one of claims 1-10.
14. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the method of any of claims 1-10.
15. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the method of any of claims 1-10.
CN202410575440.3A 2024-05-10 2024-05-10 Model training method, audio component missing identification method and device and electronic equipment Active CN118155654B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410575440.3A CN118155654B (en) 2024-05-10 2024-05-10 Model training method, audio component missing identification method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410575440.3A CN118155654B (en) 2024-05-10 2024-05-10 Model training method, audio component missing identification method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN118155654A true CN118155654A (en) 2024-06-07
CN118155654B CN118155654B (en) 2024-07-23

Family

ID=91301626

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410575440.3A Active CN118155654B (en) 2024-05-10 2024-05-10 Model training method, audio component missing identification method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN118155654B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107170465A (en) * 2017-06-29 2017-09-15 数据堂(北京)科技股份有限公司 A kind of audio quality detection method and audio quality detecting system
CN108231091A (en) * 2018-01-24 2018-06-29 广州酷狗计算机科技有限公司 A kind of whether consistent method and apparatus of left and right acoustic channels for detecting audio
US20210176580A1 (en) * 2019-12-09 2021-06-10 Samsung Electronics Co., Ltd. Audio output apparatus and method of controlling thereof
WO2023283823A1 (en) * 2021-07-14 2023-01-19 东莞理工学院 Speech adversarial sample testing method and apparatus, device, and computer-readable storage medium
CN116052718A (en) * 2022-12-27 2023-05-02 科大讯飞股份有限公司 Audio evaluation model training method and device and audio evaluation method and device

Also Published As

Publication number Publication date
CN118155654B (en) 2024-07-23

Similar Documents

Publication Publication Date Title
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
Lin et al. Speech enhancement using multi-stage self-attentive temporal convolutional networks
CN112289338B (en) Signal processing method and device, computer equipment and readable storage medium
CN114242044B (en) Voice quality evaluation method, voice quality evaluation model training method and device
CN109766476B (en) Video content emotion analysis method and device, computer equipment and storage medium
Hoffmann et al. Bass enhancement settings in portable devices based on music genre recognition
CN113241092A (en) Sound source separation method based on double-attention mechanism and multi-stage hybrid convolution network
CN112614504A (en) Single sound channel voice noise reduction method, system, equipment and readable storage medium
CN114613387A (en) Voice separation method and device, electronic equipment and storage medium
CN114283833A (en) Speech enhancement model training method, speech enhancement method, related device and medium
CN111477248B (en) Audio noise detection method and device
Zeng et al. Spatio-temporal representation learning enhanced source cell-phone recognition from speech recordings
Cui et al. Research on audio recognition based on the deep neural network in music teaching
CN118155654B (en) Model training method, audio component missing identification method and device and electronic equipment
WO2023226572A1 (en) Feature representation extraction method and apparatus, device, medium and program product
CN115294947B (en) Audio data processing method, device, electronic equipment and medium
CN111103568A (en) Sound source positioning method, device, medium and equipment
CN113555031B (en) Training method and device of voice enhancement model, and voice enhancement method and device
CN113571063B (en) Speech signal recognition method and device, electronic equipment and storage medium
CN112489678B (en) Scene recognition method and device based on channel characteristics
CN115565548A (en) Abnormal sound detection method, abnormal sound detection device, storage medium and electronic equipment
CN114023350A (en) Sound source separation method based on shallow feature reactivation and multi-stage mixed attention
CN114863939B (en) Panda attribute identification method and system based on sound
CN111312276B (en) Audio signal processing method, device, equipment and medium
CN114446316B (en) Audio separation method, training method, device and equipment of audio separation model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant