CN111460214A

CN111460214A - Classification model training method, audio classification method, device, medium and equipment

Info

Publication number: CN111460214A
Application number: CN202010255326.4A
Authority: CN
Inventors: 王康; 何怡; 许凌
Original assignee: Beijing ByteDance Network Technology Co Ltd
Current assignee: Beijing ByteDance Network Technology Co Ltd
Priority date: 2020-04-02
Filing date: 2020-04-02
Publication date: 2020-07-28
Anticipated expiration: 2040-04-02
Also published as: CN111460214B

Abstract

The present disclosure relates to a classification model training method, an audio classification method, an apparatus, a medium, and a device. The method comprises the following steps: acquiring an initial audio classification model, wherein the initial audio classification model is obtained by training on a plurality of first audios belonging to common languages; acquiring a plurality of second audios belonging to uncommon languages, and determining the language features and the language of each second audio; setting the fully connected layer in the initial audio classification model according to the total number of languages to which the second audios belong, to obtain an intermediate audio classification model; and training the intermediate audio classification model by taking the language features of the second audios as model input data and the languages to which the second audios belong as model output data, to obtain a target audio classification model. In this way, the accuracy of identifying and classifying uncommon languages can be improved, and the problems of poor model performance and low accuracy caused by the small number of samples of uncommon languages are addressed.

Description

Classification model training method, audio classification method, device, medium and equipment

Technical Field

The present disclosure relates to the field of computer technologies, and in particular, to a classification model training method, an audio classification method, an apparatus, a medium, and a device.

Background

In an audio processing scenario, there is sometimes a need to identify which language the speech content of a piece of audio belongs to. This can also be regarded as classifying the content of the audio.

In the related art, model training is generally performed in advance for the target languages to be recognized. Multiple training approaches may be used; after the corresponding models are obtained, their recognition performance is compared in the same recognition scenario, and the best-performing model is selected for recognizing the target languages. The selected model is then used whenever the language of the speech content in an audio needs to be recognized.

The above approach works well when the amount of training data for the target language is large enough, for example when the target language is a common language such as Chinese or English. However, if the amount of training data for the target language is small, for example when the target language is an uncommon language such as Indian or Spanish, the accuracy of the trained model is poor because of the insufficient training data. With the above approach, even the best-performing of the multiple models cannot reach the required recognition accuracy, and the language of the speech content in the audio cannot be recognized accurately.

Disclosure of Invention

This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

In a first aspect, the present disclosure provides an audio classification model training method, including:

acquiring an initial audio classification model, wherein the initial audio classification model is obtained by training on a plurality of first audios belonging to common languages;

acquiring a plurality of second audios belonging to uncommon languages, and determining the language features and the language of each second audio;

setting the fully connected layer in the initial audio classification model according to the total number of languages to which the second audios belong, to obtain an intermediate audio classification model;

and training the intermediate audio classification model by taking the language features of the second audios as model input data and the languages to which the second audios belong as model output data, to obtain a target audio classification model.

In a second aspect, the present disclosure provides an audio classification method, the method comprising:

segmenting audio to be processed to obtain a plurality of audio segments to be processed;

inputting each audio clip to be processed into a target audio classification model respectively to obtain an output result of the target audio classification model, wherein the target audio classification model is obtained by training according to the audio classification model training method of the first aspect of the disclosure, and the output result is used for indicating the probability that the audio clip to be processed input into the target audio classification model corresponds to each language of the languages to which the second audio belongs;

and aiming at each audio clip to be processed, determining the language to which the audio clip to be processed belongs according to the probability that the audio clip to be processed corresponds to each language in the language to which the second audio belongs.

In a third aspect, the present disclosure provides an audio classification model training apparatus, the apparatus comprising:

a first acquisition module, configured to acquire an initial audio classification model, wherein the initial audio classification model is obtained by training on a plurality of first audios belonging to common languages;

a second acquisition module, configured to acquire a plurality of second audios belonging to uncommon languages and to determine the language features and the language of each second audio;

a setting module, configured to set the fully connected layer in the initial audio classification model according to the total number of languages to which the second audios belong, to obtain an intermediate audio classification model;

and a model training module, configured to train the intermediate audio classification model by taking the language features of the second audios as model input data and the languages to which the second audios belong as model output data, to obtain a target audio classification model.

In a fourth aspect, the present disclosure provides an audio classification apparatus, the apparatus comprising:

the segmentation module is used for segmenting the audio to be processed to obtain a plurality of audio segments to be processed;

a classification module, configured to input each of the audio segments to be processed to a target audio classification model, respectively, so as to obtain an output result of the target audio classification model, where the target audio classification model is obtained by training according to the audio classification model training method of the first aspect of the disclosure, and the output result is used to indicate probabilities that the audio segments to be processed input to the target audio classification model correspond to respective languages of the languages to which the second audio belongs;

and the determining module is used for determining the language to which the audio to be processed belongs according to the probability that the audio to be processed corresponds to each language in the language to which the second audio belongs.

In a fifth aspect, the present disclosure provides a non-transitory computer readable storage medium having stored thereon a computer program that, when executed by a processing device, performs the steps of the method of the first aspect of the present disclosure, or that, when executed by a processing device, performs the steps of the method of the second aspect of the present disclosure.

In a sixth aspect, the present disclosure provides an electronic device comprising:

a storage device having a computer program stored thereon;

processing means for executing the computer program in the storage means to carry out the steps of the method of the first aspect of the disclosure or to carry out the steps of the method of the second aspect of the disclosure.

According to the above technical solution, an initial audio classification model is acquired; a plurality of second audios belonging to uncommon languages are acquired, and the language features and the language of each second audio are determined; the fully connected layer in the initial audio classification model is set according to the total number of languages to which the second audios belong, to obtain an intermediate audio classification model; and the intermediate audio classification model is then trained, with the language features of the second audios as model input data and the languages to which the second audios belong as model output data, to obtain the target audio classification model. Because the initial audio classification model is obtained by training on a plurality of first audios belonging to common languages, it already has a basic capability for language classification. Targeted training on the uncommon languages, starting from an initial audio classification model with good language classification capability, can therefore improve the accuracy of identifying and classifying uncommon languages and address the problems of poor model performance and low accuracy caused by the small number of samples of uncommon languages.

Additional features and advantages of the disclosure will be set forth in the detailed description which follows.

Drawings

The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale.

In the drawings:

FIG. 1 is a flow diagram of an audio classification model training method provided in accordance with one embodiment of the present disclosure;

FIG. 2 is a flow diagram of a method of audio classification provided in accordance with an embodiment of the present disclosure;

FIG. 3 is a block diagram of an audio classification model training apparatus provided in accordance with an embodiment of the present disclosure;

FIG. 4 is a block diagram of an audio classification apparatus provided in accordance with an embodiment of the present disclosure;

FIG. 5 is a block diagram of an electronic device provided in accordance with an embodiment of the present disclosure.

Detailed Description

Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.

It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.

The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.

It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.

It is noted that the modifiers "a", "an", and "the" in this disclosure are illustrative rather than limiting, and those skilled in the art will understand them to mean "one or more" unless the context clearly dictates otherwise.

The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.

As described in the Background, in the prior art, to recognize the language of the speech content in an audio, model training is generally performed in advance for the target languages to be recognized. Multiple training approaches are used; after the corresponding models are obtained, their recognition performance is compared in the same recognition scenario, and the best-performing model is selected as the audio classification model for recognizing the target languages. This audio classification model is then used whenever it is necessary to recognize which of the target languages the speech content in an audio belongs to. For example, if the target languages are Chinese, English, and Indian, multiple models for recognizing Chinese, English, and Indian are obtained by training on Chinese, English, and Indian training data in the above manner, and the model with the best recognition performance is selected as the audio classification model for recognizing Chinese, English, and Indian.

The above approach works well when the amount of training data for the target language is large enough, for example when the target language is a common language such as Chinese or English, for which thousands of hours or more of audio are available as training data. If the target language is an uncommon language such as Indian or Spanish, only hundreds of hours, tens of hours, or even less audio is available as training data; for example, only about 150 hours of Indian audio may be available. Because the amount of training data for uncommon languages is small, the accuracy of the trained model is poor, so that even if the best-performing of the multiple models is selected as the audio classification model, its recognition accuracy cannot reach the required standard and the language of the speech content in the audio cannot be recognized accurately. Thus, in the above example, the resulting audio classification model for recognizing Chinese, English, and Indian does not classify languages well, owing to the lack of Indian training data.

In order to solve the above problems in the prior art, the present disclosure provides a classification model training method, an audio classification method, an apparatus, a medium, and a device.

FIG. 1 is a flowchart of an audio classification model training method provided according to an embodiment of the present disclosure. As shown in FIG. 1, the method may include the following steps.

In step 11, an initial audio classification model is obtained.

The initial audio classification model is obtained by training on a plurality of first audios belonging to common languages. Here, a common language is a language for which sufficient training data is available, for example Chinese or English.

It should be noted that, unless otherwise stated, a language referred to in the present disclosure may be the language of a certain country (e.g., Chinese, English, French) or a dialect of a certain region (e.g., Sichuanese, Cantonese).

Prior to step 11, the method provided by the present disclosure may further include the steps of:

acquiring a plurality of first audios belonging to common languages, and determining the language characteristics and the language of each first audio;

and training the neural network model by taking the language features of the first audio as model input data and the language to which the first audio belongs as model output data to obtain an initial audio classification model.

The plurality of first audios belonging to common languages may be acquired from the data sets (each containing audio in one common language) corresponding to a plurality of (i.e., two or more) common languages. For the sake of the training effect, the amount of first audio for each common language may be kept equal. For example, 4000 h (hours) of first audio in Chinese and English may be acquired, of which 2000 h is Chinese and 2000 h is English.
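The balancing described above can be sketched as follows. This is a minimal illustration only: the function names `balance_by_duration` and `total_hours`, the clip identifiers, and the durations are hypothetical and do not come from the disclosure, which does not specify how the per-language amounts are equalized.

```python
def balance_by_duration(clips_by_language, hours_per_language):
    """Pick clips per language without exceeding the same duration budget.

    clips_by_language: dict mapping language -> list of (clip_id, duration_hours).
    Returns a dict mapping language -> list of selected clip ids.
    """
    selected = {}
    for language, clips in clips_by_language.items():
        total = 0.0
        chosen = []
        for clip_id, duration in clips:
            if total + duration > hours_per_language:
                continue  # skip clips that would exceed the budget
            chosen.append(clip_id)
            total += duration
        selected[language] = chosen
    return selected


def total_hours(clips_by_language, selected):
    """Sum the selected duration per language, to check the balance."""
    durations = {lang: dict(clips) for lang, clips in clips_by_language.items()}
    return {lang: sum(durations[lang][c] for c in ids)
            for lang, ids in selected.items()}


# Toy corpus with made-up durations in hours.
corpus = {
    "chinese": [("zh1", 1.0), ("zh2", 0.5), ("zh3", 0.5)],
    "english": [("en1", 0.5), ("en2", 0.5), ("en3", 1.0), ("en4", 2.0)],
}
picked = balance_by_duration(corpus, hours_per_language=2.0)
hours = total_hours(corpus, picked)
```

With real data sets the budget would be on the order of the 2000 h per language mentioned in the example, and clips would typically be shuffled before selection.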

In one possible embodiment, the language features of each first audio may be extracted by a pre-trained feature extraction model. Illustratively, the feature extraction model may be trained based on the AudioSet data set. Specifically, the feature extraction model may be obtained from a pretrain model provided by *** and based on AudioSet. The pretrain model is an audio classification model of the kind used in the aforementioned prior art: it classifies the audio input to it in order to identify which language that audio belongs to. Therefore, after the last layer (i.e., the last fully connected layer) of the pretrain model is removed, the remaining part of the model is able to extract the features that the pretrain model uses to generate its classification result. These features help classify the language of the audio and can thus serve as the language features of the audio, so the remaining part of the model can be used as the feature extraction model for extracting the language features of the first audio. In a possible example, the pretrain model is a CNN (Convolutional Neural Network) model, and the language features of each first audio are the features extracted by the convolutional neural network that help classify that first audio.

After a first audio is input into the feature extraction model, the language features of the first audio output by the feature extraction model are obtained. The language features of each first audio form a feature vector, for example an N-dimensional feature vector, where N may be 128.
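The idea of removing the last fully connected layer and using the rest as a feature extractor can be sketched as below. This is an illustration under stated assumptions: the "pretrained" network here is a randomly initialised stand-in built from dense layers rather than an actual CNN trained on AudioSet, and all names and dimensions other than the 128-dimensional feature size are hypothetical.

```python
import random


def dense(weights, biases):
    """Return a function computing a fully connected layer: y = W x + b."""
    def layer(x):
        return [sum(w * xi for w, xi in zip(row, x)) + b
                for row, b in zip(weights, biases)]
    return layer


def relu(x):
    """Element-wise rectified linear activation."""
    return [max(0.0, v) for v in x]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def run(layers, x):
    """Apply the layers in order."""
    for layer in layers:
        x = layer(x)
    return x


rng = random.Random(0)
# 128 matches N in the text; the other sizes are arbitrary choices for the sketch.
INPUT_DIM, FEATURE_DIM, NUM_CLASSES = 64, 128, 10

# Stand-in for a pretrained classifier whose final layer is fully connected.
pretrained_layers = [
    dense(random_matrix(FEATURE_DIM, INPUT_DIM, rng), [0.0] * FEATURE_DIM),
    relu,
    dense(random_matrix(NUM_CLASSES, FEATURE_DIM, rng), [0.0] * NUM_CLASSES),
]

# Dropping the last fully connected layer leaves the feature extractor.
feature_extractor = pretrained_layers[:-1]

audio_frame = [rng.uniform(-1.0, 1.0) for _ in range(INPUT_DIM)]
features = run(feature_extractor, audio_frame)    # N-dimensional feature vector
class_scores = run(pretrained_layers, audio_frame)  # what the full model would output
```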

As described above, in a data set of a certain language, a plurality of audios belonging to the language are stored, and therefore, the language to which each first audio belongs is known at the time of acquiring the first audio.

In the model training process, a neural network model is trained with the language features of the first audios as model input data and the languages to which the first audios belong as the true outputs of the model, to obtain the initial audio classification model. In each training iteration, the language features of one first audio are used as the model input data and the language to which that first audio belongs is used as the true output. The initial audio classification model may be, for example, an LSTM (Long Short-Term Memory network) model.

The input of the initial audio classification model is the feature vector of an audio (i.e., the N-dimensional feature vector mentioned above), and the output is the probability that the input audio corresponds to each of the languages to which the first audios belong. The output may take the form of an M-dimensional vector, where M is the total number of languages to which the first audios used to train the initial audio classification model belong. The greater the probability value corresponding to a language, the more likely the audio belongs to that language. For example, if the initial audio classification model is trained on first audios in two common languages, Chinese and English, its output is a 2-dimensional vector whose elements represent the probabilities that the input belongs to Chinese and to English, respectively.

It should be noted that the manner of training the neural network model belongs to the prior art, is well known to those skilled in the art, and is not described in detail herein.

Because the initial audio classification model obtained at this point is trained on a plurality of first audios belonging to common languages, it classifies well, and its internal parameters give it a basic capability for language classification. In the approach provided by the present disclosure, the training of the initial audio classification model can be regarded as the first stage of training the finally required model.

In step 12, a plurality of second audios belonging to uncommon languages are acquired, and the language features and the language of each second audio are determined.

Here, an uncommon language is a language for which only a small amount of training data is available, for example Indian. The plurality of second audios belonging to uncommon languages may be acquired from the data sets (each containing audio in one uncommon language) corresponding to a plurality of (i.e., two or more) uncommon languages, and, for the sake of the training effect, the amount of second audio for each uncommon language may be kept comparable. For example, if the languages to which the second audios belong include Indian A and Indian B (two dialects of Indian), 800 h of second audio in Indian A and Indian B may be acquired, of which 400 h is Indian A and 400 h is Indian B. To address the problem of insufficient data for uncommon languages, the data may be expanded by copying and then used for model training, and the copied data may be processed to some degree during expansion. For example, if only 200 h of audio is available for Indian A, that 200 h of audio can be copied and some noise added to form new audio, and the original 200 h together with the new 200 h then serve as the 400 h of audio used for model training.
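The copy-and-add-noise expansion can be sketched as follows. The disclosure does not say what kind of noise is added or how strong it is; zero-mean Gaussian noise with a small standard deviation is an assumption made for this sketch, and the function names and sample values are hypothetical.

```python
import random


def add_noise(samples, noise_std, rng):
    """Return a copy of the waveform with zero-mean Gaussian noise added."""
    return [s + rng.gauss(0.0, noise_std) for s in samples]


def expand_dataset(clips, noise_std=0.005, seed=0):
    """Return the original clips followed by one noisy copy of each clip."""
    rng = random.Random(seed)
    noisy = [add_noise(clip, noise_std, rng) for clip in clips]
    return clips + noisy


# Two tiny made-up waveforms standing in for the available audio.
original = [[0.0, 0.1, -0.1, 0.2], [0.3, -0.2, 0.0, 0.1]]
expanded = expand_dataset(original)  # twice as many clips as before
```

The noisy copies keep the label of the clip they were derived from, so the expanded set can be used for training in the same way as the original one.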

Also, as described above, in a data set of a certain language, a plurality of audios belonging to that language are stored, and therefore, the language to which each second audio belongs is known at the time of acquiring the second audio.

In one possible implementation, the determining the language characteristic of each second audio in step 12 may include the following steps:

and extracting the language features of each second audio through the pre-trained feature extraction model.

The feature extraction model is trained based on the AudioSet data set, and descriptions about the feature extraction model and the language features are given above when describing how to determine the language features of each first audio, and will not be repeated here. The principle of determining the language characteristic of each second audio is the same as that of determining the language characteristic of each first audio, and it is only necessary to change the processing object from the first audio to the second audio according to the description already given above.

In step 13, according to the total number of the languages to which the second audio belongs, the full connection layer in the initial audio classification model is set to obtain the intermediate audio classification model.

The initial audio classification model includes an input layer, an intermediate layer, and an output layer. The input layer passes the input data to the intermediate layer. The intermediate layer performs the model's internal computation on the input data, and its last layer is a fully connected layer. The output of the fully connected layer is M-dimensional data corresponding to M categories, one per language, where M is the total number of languages to which the first audios used to train the initial audio classification model belong. The fully connected layer is connected to the output layer through an activation function, so that an M-dimensional vector, i.e., the M probability values described above, is obtained from the M-dimensional data output by the fully connected layer.

As described above, in the scheme provided by the present disclosure, the training of the initial audio classification model may be regarded as the first stage of training the finally required model, which gives the model good language classification capability. The second stage of training, which targets the uncommon languages, can then begin. For this stage, the categories of the fully connected layer must be changed to suit the new training task, so the fully connected layer in the initial audio classification model is set according to the total number of languages to which the second audios belong, to obtain the intermediate audio classification model.
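As an illustration of how M-dimensional data from the last layer can be turned into M probability values, the sketch below uses a softmax activation. The disclosure only mentions "an activation function" without naming it, so softmax is an assumption here, and the scores are made-up numbers.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]


languages = ["chinese", "english"]  # M = 2, as in the example in the text
scores = [2.0, 0.5]                 # hypothetical M-dimensional layer output
probabilities = softmax(scores)
most_likely = languages[probabilities.index(max(probabilities))]
```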

In one possible embodiment, step 13 may include the steps of:

setting the categories contained in the fully connected layer in the initial audio classification model, so that the number of categories contained in the fully connected layer equals the total number of languages to which the second audios belong, and the categories correspond one to one to those languages.

For example, if the initial audio classification model is trained on Chinese and English data, its fully connected layer has 2 categories, corresponding to Chinese and English respectively. If the second stage is to train specifically on Indian A and Indian B, the total number of categories of the fully connected layer is set to 2 and the categories are made to correspond to Indian A and Indian B respectively, which yields the intermediate audio classification model.

Thus, the fully connected layer of the intermediate audio classification model matches the second stage of training, and further training of the intermediate audio classification model can begin.
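One way to picture this step is shown below. The sketch represents a model as a plain dictionary and replaces only its final layer with a freshly initialised one sized to the second-stage languages. Re-initialising the weights is an assumption of this sketch (the disclosure only says the categories are set), and the dictionary layout and function names are hypothetical.

```python
import random


def new_fc_layer(in_features, out_features, rng):
    """Create a freshly initialised fully connected layer."""
    return {
        "weights": [[rng.uniform(-0.1, 0.1) for _ in range(in_features)]
                    for _ in range(out_features)],
        "biases": [0.0] * out_features,
    }


def make_intermediate_model(initial_model, second_stage_languages, rng):
    """Keep the earlier layers and replace only the final fully connected layer."""
    in_features = len(initial_model["fc"]["weights"][0])
    return {
        # Parameters learned in the first stage are reused as they are.
        "hidden": initial_model["hidden"],
        # One output category per second-stage language.
        "fc": new_fc_layer(in_features, len(second_stage_languages), rng),
        "classes": list(second_stage_languages),
    }


rng = random.Random(0)
initial_model = {
    "hidden": {"note": "stand-in for first-stage parameters"},
    "fc": new_fc_layer(128, 2, rng),
    "classes": ["chinese", "english"],
}
intermediate = make_intermediate_model(initial_model, ["indian_a", "indian_b"], rng)
```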

In step 14, the language features of the second audio are used as model input data, and the language to which the second audio belongs is used as model output data, and the intermediate audio classification model is trained to obtain the target audio classification model.

The input data format of the intermediate audio classification model is the same as that of the initial audio classification model, namely the feature vector of the audio extracted by the feature extraction model, and the training processes of the two models are also similar.

In the model training process, the intermediate audio classification model is trained with the language features of the second audios as model input data and the languages to which the second audios belong as the true outputs of the model, to obtain the target audio classification model. In each training iteration, the language features of one second audio are used as the model input data and the language to which that second audio belongs is used as the true output.

The input of the target audio classification model is a feature vector (e.g., the N-dimensional feature vector mentioned above) of the input audio, and the output is the probability that the input audio corresponds to each of the languages to which the second audios belong. The output may take the form of a K-dimensional vector, where K is the total number of languages to which the second audios used to train the target audio classification model belong. The greater the probability value corresponding to a language, the more likely the audio belongs to that language. For example, if the target audio classification model is trained on second audios in two uncommon languages, Indian A and Indian B, its output is a 2-dimensional vector whose elements represent the probabilities that the input belongs to Indian A and to Indian B, respectively.

In addition, the second stage of training is not limited to the second audios belonging to uncommon languages as described above. Audio in both common and uncommon languages may also be used in the second stage, in the same manner as given above: the language features of the audio belonging to a common language (or to an uncommon language) are used as model input data and the language to which the input audio belongs is used as model output data, so as to train the intermediate audio classification model and obtain the target audio classification model. Considerations regarding the balance and amount of training data are the same as those given above and are not repeated here. For example, the second stage of model training may combine Chinese, English, and Indian, in which case the resulting target audio classification model classifies Chinese, English, and Indian.
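The second-stage training on (feature vector, language) pairs can be sketched with a deliberately simplified model. As assumptions of this sketch, the whole model is reduced to a single softmax layer over toy 4-dimensional features, trained with cross-entropy and plain stochastic gradient descent; the disclosure does not specify the loss or optimiser, and the model it describes (for example an LSTM) has further layers. All names and numbers are hypothetical.

```python
import math
import random


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(v - peak) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(weights, biases, x):
    """Return the K probabilities for one feature vector."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) + b
                    for row, b in zip(weights, biases)])


def train(weights, biases, features, labels, lr=0.5, epochs=50):
    """Update the layer in place using cross-entropy loss and plain SGD."""
    for _ in range(epochs):
        for x, y in zip(features, labels):
            probs = forward(weights, biases, x)
            for k in range(len(weights)):
                # Gradient of the loss with respect to the k-th raw score.
                grad = probs[k] - (1.0 if k == y else 0.0)
                biases[k] -= lr * grad
                for j in range(len(x)):
                    weights[k][j] -= lr * grad * x[j]
    return weights, biases


rng = random.Random(0)
# Toy "language features"; real ones would be N = 128 dimensional.
features = [[1.0, 0.9, 0.0, 0.1], [0.9, 1.0, 0.1, 0.0],
            [0.0, 0.1, 1.0, 0.9], [0.1, 0.0, 0.9, 1.0]]
labels = [0, 0, 1, 1]  # 0 = Indian A, 1 = Indian B (K = 2)
weights = [[rng.uniform(-0.1, 0.1) for _ in range(4)] for _ in range(2)]
biases = [0.0, 0.0]

train(weights, biases, features, labels)
predictions = [max(range(2), key=lambda k, x=x: forward(weights, biases, x)[k])
               for x in features]
```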

FIG. 2 is a flowchart of an audio classification method provided according to an embodiment of the present disclosure. As shown in FIG. 2, the method may include the following steps.

In step 21, the audio to be processed is sliced to obtain a plurality of audio segments to be processed.

The longer the audio, the more computing power is required to process it and the more problems arise. The whole audio to be processed can therefore first be segmented into a plurality of audio segments to be processed, which are then processed individually. This effectively reduces the computational load of audio processing and improves its efficiency and accuracy.

In a possible implementation mode, equal segmentation can be performed on the audio to be processed, so that the obtained multiple audio segments to be processed are consistent in time length and data format, and the subsequent processing is more efficient.
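As an illustration of the equal segmentation described above, the following is a minimal sketch. The function name `split_equal`, its parameters, and the choice to drop a trailing remainder shorter than one segment are assumptions made for this example; the disclosure does not specify how a remainder is handled.

```python
def split_equal(samples, sample_rate, segment_seconds):
    """Split a sequence of audio samples into segments of equal duration.

    Hypothetical helper: a trailing remainder shorter than one segment is
    dropped so that every returned segment has the same length.
    """
    if sample_rate <= 0 or segment_seconds <= 0:
        raise ValueError("sample_rate and segment_seconds must be positive")
    seg_len = int(sample_rate * segment_seconds)
    if seg_len == 0:
        raise ValueError("segment is shorter than one sample")
    n_full = len(samples) // seg_len  # number of complete segments
    return [samples[i * seg_len:(i + 1) * seg_len] for i in range(n_full)]
```

Padding the last segment instead of dropping it would be an equally plausible choice when no audio may be discarded.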

In another possible implementation, the audio to be processed may be analyzed, and the portion of the audio without human voice is used as a segmentation point to segment the audio to be processed, so as to ensure that the obtained contents of the multiple audio segments to be processed have higher relevance, thereby facilitating subsequent language identification.
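The segmentation at portions without human voice can be sketched as follows. The disclosure does not say how such portions are detected; this example assumes, purely for illustration, that a frame whose mean energy is at or below a threshold counts as containing no voice. The names `split_on_silence`, `frame_len`, and `energy_threshold` are hypothetical.

```python
def split_on_silence(samples, frame_len, energy_threshold):
    """Split audio at low-energy frames (a crude stand-in for voice detection).

    Consecutive frames whose mean energy exceeds energy_threshold are grouped
    into one segment; low-energy frames act as segmentation points and are
    not included in any segment.
    """
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    segments, current = [], []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / len(frame)
        if energy > energy_threshold:
            current.extend(frame)      # voiced frame: extend current segment
        elif current:
            segments.append(current)   # silent frame: close current segment
            current = []
    if current:
        segments.append(current)
    return segments
```

A production system would more likely use a dedicated voice activity detector; the energy threshold here only shows where the segmentation points fall.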

In step 22, each audio clip to be processed is input to the target audio classification model to obtain an output result of the target audio classification model.

The target audio classification model is obtained by training according to the audio classification model training method provided by any embodiment of the disclosure. Accordingly, the output result is used for indicating the probability that the audio clip to be processed input to the target audio classification model corresponds to each language in the language to which the second audio belongs.
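The form of such an output result can be sketched as follows. The disclosure does not state how the probabilities are computed; this example assumes, as is common for classifiers, that the last layer produces one raw score per language and that a softmax turns the scores into probabilities. The function names and the language labels are hypothetical.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def language_probabilities(scores, languages):
    """Map each language to its probability (assumed output format)."""
    if len(scores) != len(languages):
        raise ValueError("one score is required per language")
    return dict(zip(languages, softmax(scores)))
```

For instance, `language_probabilities([2.0, 0.0], ["lang_a", "lang_b"])` assigns the larger probability to `lang_a`.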

In step 23, for each audio clip to be processed, the language to which the audio clip to be processed belongs is determined according to the probability that the audio clip to be processed corresponds to each language in the language to which the second audio belongs.

The language to which each audio clip to be processed belongs can be determined according to the probability that the audio clip corresponds to each language among the languages to which the second audio belongs; further, the language to which the whole audio to be processed belongs can be determined according to the language to which each audio clip to be processed belongs.

In one possible implementation, determining the language to which the audio segment to be processed belongs may be implemented as follows:

if exactly one of the probabilities that the audio clip to be processed corresponds to each language among the languages to which the second audio belongs is greater than a preset probability threshold, determining the language corresponding to that probability as the language to which the audio clip to be processed belongs.

The preset probability threshold may be set according to an empirical value. The above describes how the language of a single audio clip to be processed is determined; the language of every audio clip to be processed can be determined in the same way.
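The threshold rule above can be sketched as follows. The function name `language_by_threshold` is hypothetical, and returning `None` when zero or several probabilities exceed the threshold is an assumption of this example; the disclosure does not describe that case.

```python
def language_by_threshold(probs, threshold):
    """Return the language whose probability alone exceeds the threshold.

    probs maps each language to its probability for one audio clip.
    Returns None (an assumed fallback) unless exactly one probability
    is greater than the threshold.
    """
    above = [lang for lang, p in probs.items() if p > threshold]
    return above[0] if len(above) == 1 else None
```

With a threshold of 0.5 at most one probability can exceed it, so the uniqueness check only matters for lower thresholds.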

In another possible embodiment, step 23 may include the steps of:

for each audio clip to be processed, determining the language corresponding to the maximum probability as the language to which the audio clip to be processed belongs according to the maximum probability corresponding to the audio clip to be processed;

and determining the language of the audio to be processed according to the language of each audio clip to be processed.

In this embodiment, the language to which an audio clip to be processed belongs is the language corresponding to the maximum probability output for that audio clip. As described above, the larger the probability that the audio clip to be processed corresponds to a certain language, the more likely the audio clip belongs to that language; the language to which the audio clip belongs can therefore be determined directly from the maximum probability. The above describes how the language of a single audio clip to be processed is determined; the language of every audio clip to be processed can be determined in the same way.
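The maximum-probability rule amounts to an argmax over the model output, as in the following minimal sketch. The function name `language_by_max` and the dictionary representation of the output are assumptions of this example.

```python
def language_by_max(probs):
    """Return the language with the largest probability for one audio clip.

    probs maps each language to its probability. If several languages tie
    for the maximum, the first one in iteration order is returned; the
    disclosure does not specify tie handling.
    """
    if not probs:
        raise ValueError("probs must not be empty")
    return max(probs, key=probs.get)
```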

In a possible embodiment, determining the language to which the audio to be processed belongs according to the language to which each audio clip to be processed belongs may include the following steps:

counting the language of each audio clip to be processed to determine the language with the maximum number;

and determining the language with the largest number as the language to which the audio to be processed belongs.

After an audio to be processed is segmented, a plurality of audio segments to be processed are obtained, and each audio segment to be processed corresponds to one language. Therefore, the larger the proportion of audio segments belonging to a certain language within the audio to be processed, the more likely the audio to be processed belongs to that language. Accordingly, the languages to which the audio segments to be processed belong may be counted to determine the language with the largest count, and that language may be determined as the language to which the audio to be processed belongs. For example, if the audio C to be processed is segmented into 10 audio segments to be processed, of which 8 belong to the Indian language and the remaining 2 belong to the Chinese language, it may be determined that the audio C to be processed belongs to the Indian language.
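The counting step can be sketched with the standard library as follows, using the 8-to-2 example above. The function name `majority_language` is hypothetical, and the behavior on a tie (the language seen first wins) is a property of this sketch rather than something the disclosure specifies.

```python
from collections import Counter


def majority_language(segment_languages):
    """Return the language that the most audio segments belong to."""
    if not segment_languages:
        raise ValueError("at least one segment language is required")
    # most_common(1) yields [(language, count)] for the largest count.
    return Counter(segment_languages).most_common(1)[0][0]


# Audio C: 8 segments in the Indian language, 2 in the Chinese language.
audio_c = ["Indian"] * 8 + ["Chinese"] * 2
print(majority_language(audio_c))  # Indian
```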

According to the above scheme, the audio to be processed is segmented to obtain a plurality of audio segments to be processed, each audio segment to be processed is input into the target audio classification model to obtain an output result of the target audio classification model, and the language to which each audio segment to be processed belongs is determined according to the probability that the audio segment corresponds to each language among the languages to which the second audio belongs. The target audio classification model is obtained by training based on the audio classification model training method provided by any embodiment of the disclosure and has good recognition and classification performance, so the accuracy of determining the language of the audio to be processed can be improved.
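Steps 21 to 23 can be combined into one short sketch. Everything named here is hypothetical: `classify_audio` uses equal segmentation, the maximum-probability rule, and majority counting, and `toy_model` is a stand-in that merely returns probabilities in the assumed format; it is not the target audio classification model of the disclosure.

```python
from collections import Counter


def classify_audio(samples, segment_len, model):
    """Segment the audio, classify each segment, and take a majority vote.

    model is any callable that maps a segment to {language: probability}.
    """
    segments = [samples[i:i + segment_len]
                for i in range(0, len(samples) - segment_len + 1, segment_len)]
    if not segments:
        raise ValueError("audio is shorter than one segment")
    votes = []
    for segment in segments:
        probs = model(segment)                   # step 22: model output
        votes.append(max(probs, key=probs.get))  # step 23: per-segment language
    return Counter(votes).most_common(1)[0][0]   # language of the whole audio


def toy_model(segment):
    """Hypothetical stand-in model: decides by the sign of the mean sample."""
    p = 0.9 if sum(segment) / len(segment) > 0 else 0.1
    return {"lang_a": p, "lang_b": 1 - p}
```

Calling `classify_audio([1, 1, 1, 1, -1, -1, 1, 1], 2, toy_model)` yields three segment votes for `lang_a` and one for `lang_b`.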

Fig. 3 is a block diagram of an audio classification model training apparatus provided according to an embodiment of the present disclosure. As shown in fig. 3, the apparatus 30 may include:

a first obtaining module 31, configured to obtain an initial audio classification model, where the initial audio classification model is obtained based on training of multiple first audios belonging to a common language;

a second obtaining module 32, configured to obtain multiple second audios that belong to an uncommon language, and determine a language feature and a language to which each of the second audios belongs;

a setting module 33, configured to set a fully connected layer in the initial audio classification model according to the total number of languages to which the second audio belongs, so as to obtain an intermediate audio classification model;

and the model training module 34 is configured to train the intermediate audio classification model by using the language features of the second audio as model input data and using the language to which the second audio belongs as model output data, so as to obtain a target audio classification model.

Optionally, the setting module 33 is configured to set the categories included in a fully connected layer in the initial audio classification model, so that the number of categories included in the fully connected layer is the same as the total number of languages to which the second audio belongs, and the categories included in the fully connected layer correspond one to one to the languages to which the second audio belongs.
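The one-to-one correspondence between categories and languages can be illustrated without any deep learning framework. In this sketch the fully connected layer is represented by a plain weight matrix with one row per category; the function name `build_output_layer`, the random initialization, and the sorted category order are all assumptions of the example, since the disclosure only requires that the number of categories equal the number of languages.

```python
import random


def build_output_layer(feature_dim, languages, seed=0):
    """Create an output layer with one category per distinct language.

    Returns (weights, biases, index_to_language), where weights has one row
    per language and feature_dim columns (hypothetical representation).
    """
    rng = random.Random(seed)
    unique = sorted(set(languages))             # distinct languages
    index_to_language = dict(enumerate(unique))  # category index -> language
    weights = [[rng.uniform(-0.01, 0.01) for _ in range(feature_dim)]
               for _ in unique]
    biases = [0.0] * len(unique)
    return weights, biases, index_to_language
```

In a framework-based implementation the same idea would typically be expressed by replacing the last linear layer of the pretrained model with one whose output size equals the number of languages.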

Optionally, the second obtaining module 32 is configured to extract a language feature of each second audio through a pre-trained feature extraction model, where the feature extraction model is obtained by training based on an AudioSet data set.

With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

Fig. 4 is a block diagram of an audio classification apparatus provided according to an embodiment of the present disclosure. As shown in fig. 4, the apparatus 40 may include:

a segmentation module 41, configured to segment the audio to be processed to obtain a plurality of audio segments to be processed;

a classification module 42, configured to input each of the audio segments to be processed into a target audio classification model, so as to obtain an output result of the target audio classification model, where the target audio classification model is obtained by training according to the audio classification model training method according to any embodiment of the present disclosure, and the output result is used to indicate probabilities that the audio segments to be processed input into the target audio classification model correspond to respective languages of the languages to which the second audio belongs;

a determining module 43, configured to determine, for each audio clip to be processed, a language to which the audio clip to be processed belongs according to a probability that the audio clip to be processed corresponds to each language in the languages to which the second audio belongs.

Optionally, the determining module 43 includes:

the first determining submodule is configured to, for each audio clip to be processed, determine the language corresponding to the maximum probability for that audio clip as the language to which the audio clip to be processed belongs;

and the second determining submodule is used for determining the language of the audio to be processed according to the language of each audio clip to be processed.

Optionally, the second determining sub-module includes:

the statistics submodule is configured to count the languages to which the audio clips to be processed belong so as to determine the language with the largest count;

and the third determining submodule is used for determining the language with the maximum number as the language to which the audio to be processed belongs.

Referring now to FIG. 5, a block diagram of an electronic device 600 suitable for use in implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.

As shown in fig. 5, electronic device 600 may include a processing means (e.g., central processing unit, graphics processor, etc.) 601 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the electronic apparatus 600 are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.

In general, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a liquid crystal display (LCD), speaker, vibrator, etc.; storage devices 608 including, for example, magnetic tape, hard disk, etc.; and communication devices 609. The communication devices 609 may allow the electronic device 600 to communicate wirelessly or by wire with other devices to exchange data.

In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 609, or may be installed from the storage means 608, or may be installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of the embodiments of the present disclosure.

It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.

In some embodiments, the electronic devices may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communications network). Examples of communications networks include local area networks ("LAN"), wide area networks ("WAN"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.

The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.

The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire an initial audio classification model, wherein the initial audio classification model is obtained by training based on a plurality of first audios belonging to a common language; acquire a plurality of second audios belonging to an uncommon language, and determine the language features of each second audio and the language to which it belongs; set a fully connected layer in the initial audio classification model according to the total number of languages to which the second audio belongs, to obtain an intermediate audio classification model; and train the intermediate audio classification model by taking the language features of the second audio as model input data and the language to which the second audio belongs as model output data, to obtain a target audio classification model.

Alternatively, the computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: segmenting audio to be processed to obtain a plurality of audio segments to be processed; inputting each audio clip to be processed into a target audio classification model respectively to obtain an output result of the target audio classification model, wherein the target audio classification model is obtained by training according to the audio classification model training method of any embodiment of the disclosure, and the output result is used for indicating the probability that the audio clip to be processed input into the target audio classification model corresponds to each language in the language to which the second audio belongs; and aiming at each audio clip to be processed, determining the language to which the audio clip to be processed belongs according to the probability that the audio clip to be processed corresponds to each language in the language to which the second audio belongs.

Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The modules described in the embodiments of the present disclosure may be implemented by software or hardware. In some cases, the name of a module does not constitute a limitation on the module itself; for example, the first obtaining module may also be described as a "module that obtains an initial audio classification model".

For example, without limitation, exemplary types of hardware logic components that may be used include Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), and so forth.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

According to one or more embodiments of the present disclosure, there is provided an audio classification model training method, including:

According to one or more embodiments of the present disclosure, there is provided an audio classification model training method, where the setting of a fully connected layer in an initial audio classification model according to a total number of languages to which a second audio belongs to obtain an intermediate audio classification model includes:

setting the categories contained in a fully connected layer in the initial audio classification model, so that the number of categories contained in the fully connected layer is the same as the total number of languages to which the second audio belongs, and the categories contained in the fully connected layer correspond one to one to the languages to which the second audio belongs.

According to one or more embodiments of the present disclosure, there is provided an audio classification model training method, wherein the determining the language feature of each second audio includes:

and extracting the language features of each second audio through a pre-trained feature extraction model, wherein the feature extraction model is obtained by training based on an AudioSet data set.

According to one or more embodiments of the present disclosure, there is provided an audio classification method including:

inputting each audio clip to be processed into a target audio classification model respectively to obtain an output result of the target audio classification model, wherein the target audio classification model is obtained by training according to the audio classification model training method of any embodiment of the disclosure, and the output result is used for indicating the probability that the audio clip to be processed input into the target audio classification model corresponds to each language in the language to which the second audio belongs;

According to one or more embodiments of the present disclosure, there is provided an audio classification method, where, for each to-be-processed audio segment, determining a language to which the to-be-processed audio belongs according to a probability that the to-be-processed audio segment corresponds to each language in a language to which the second audio belongs includes:

According to one or more embodiments of the present disclosure, an audio classification method is provided, where the determining, according to a language to which each of the to-be-processed audio clips belongs, a language to which the to-be-processed audio belongs includes:

and determining the language with the maximum number as the language to which the audio to be processed belongs.

According to one or more embodiments of the present disclosure, there is provided an audio classification model training apparatus, including:

According to one or more embodiments of the present disclosure, an audio classification model training apparatus is provided, where the setting module is configured to set the categories included in a fully connected layer in the initial audio classification model, so that the number of categories included in the fully connected layer is the same as the total number of languages to which the second audio belongs, and the categories included in the fully connected layer correspond one to one to the languages to which the second audio belongs.

According to one or more embodiments of the present disclosure, an audio classification model training apparatus is provided, where the second obtaining module is configured to extract a language feature of each second audio through a pre-trained feature extraction model, where the feature extraction model is obtained by training based on an AudioSet data set.

According to one or more embodiments of the present disclosure, there is provided an audio classification apparatus including:

the classification module is configured to input each to-be-processed audio clip to a target audio classification model respectively to obtain an output result of the target audio classification model, where the target audio classification model is obtained by training according to the audio classification model training method according to any embodiment of the present disclosure, and the output result is used to indicate probabilities that the to-be-processed audio clip input to the target audio classification model corresponds to each language of the languages to which the second audio belongs;

According to one or more embodiments of the present disclosure, there is provided an audio classification apparatus, wherein the determining module includes:

According to one or more embodiments of the present disclosure, there is provided an audio classification apparatus, wherein the second determination submodule includes:

According to one or more embodiments of the present disclosure, there is provided a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processing apparatus, implements the steps of the audio classification model training method provided by any of the embodiments of the present disclosure, or which, when executed by a processing apparatus, implements the steps of the audio classification method provided by any of the embodiments of the present disclosure.

According to one or more embodiments of the present disclosure, there is provided an electronic device including:

a storage device having a computer program stored thereon;

processing means for executing the computer program in the storage means to implement the steps of the audio classification model training method provided in any embodiment of the present disclosure, or to implement the steps of the audio classification method provided in any embodiment of the present disclosure.

The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to the particular combination of features described above, but also encompasses other embodiments formed by any combination of the features described above or their equivalents without departing from the spirit of the disclosure. For example, technical solutions formed by replacing the above features with (but not limited to) features having similar functions disclosed in this disclosure are also encompassed.

Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims

1. A method for training an audio classification model, the method comprising:

2. The method according to claim 1, wherein said setting a fully-connected layer in the initial audio classification model according to the total number of languages to which the second audio belongs to obtain an intermediate audio classification model comprises:

3. The method of claim 1, wherein said determining the language features of each of said second audios comprises:

4. A method of audio classification, the method comprising:

inputting each audio clip to be processed into a target audio classification model respectively to obtain an output result of the target audio classification model, wherein the target audio classification model is obtained by training according to the audio classification model training method of any one of claims 1 to 3, and the output result is used for indicating the probability that the audio clip to be processed input into the target audio classification model corresponds to each language of the languages to which the second audio belongs;

5. The method according to claim 4, wherein said determining, for each of the audio clips to be processed, the language to which the audio clip to be processed belongs according to the probability that the audio clip to be processed corresponds to each of the languages to which the second audio belongs comprises:

6. The method according to claim 5, wherein said determining the language to which said audio to be processed belongs according to the language to which each of said audio clips to be processed belongs comprises:

7. An apparatus for training an audio classification model, the apparatus comprising:

8. An apparatus for audio classification, the apparatus comprising:

a classification module, configured to input each of the audio segments to be processed into a target audio classification model, respectively, so as to obtain an output result of the target audio classification model, where the target audio classification model is obtained by training according to the audio classification model training method according to any one of claims 1 to 3, and the output result is used to indicate a probability that the audio segment to be processed input into the target audio classification model corresponds to each language of the languages to which the second audio belongs;

9. A non-transitory computer readable storage medium, on which a computer program is stored, wherein the program, when executed by a processing device, implements the steps of the method of any one of claims 1-3, or wherein the program, when executed by a processing device, implements the steps of the method of any one of claims 4-6.

10. An electronic device, comprising:

a storage device having a computer program stored thereon;

processing means for executing the computer program in the storage means to carry out the steps of the method according to any one of claims 1 to 3 or to carry out the steps of the method according to any one of claims 4 to 6.