CN108664999B - Training method and device of classification model and computer server - Google Patents

Info

Publication number
CN108664999B
CN108664999B (application CN201810412797.4A)
Authority
CN
China
Prior art keywords
classification model
modal
training
mode
loss function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810412797.4A
Other languages
Chinese (zh)
Other versions
CN108664999A (en)
Inventor
王乃岩
樊峻崧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Tusimple Technology Co Ltd
Original Assignee
Beijing Tusimple Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Tusimple Technology Co Ltd
Priority to CN201810412797.4A
Publication of CN108664999A
Application granted
Publication of CN108664999B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217 - Validation; Performance evaluation; Active pattern learning techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a training method and device of a classification model and a computer server, and aims to solve the technical problems of low computational efficiency and a narrow application range when the prior art trains a classification model with semi-supervised learning techniques. The method comprises the following steps: constructing an initial classification model, wherein the initial classification model comprises at least one single-modal classification model with the same classification task, and the modal data training set corresponding to each single-modal classification model comprises labeled training data and unlabeled training data; and training the initial classification model to obtain a target classification model based on a method of aligning the feature code distributions of the labeled training data and the unlabeled training data in the modal data training set of each single-modal classification model. The scheme can improve the efficiency of classification model training and has a wider application range.

Description

Training method and device of classification model and computer server
Technical Field
The invention relates to the field of deep learning, in particular to a training method of a classification model, a training device of the classification model and a computer server.
Background
At present, training a neural network usually requires a large amount of labeled sample data: a large amount of sample data must first be collected, and the collected sample data must then be labeled manually to obtain labeled sample data for training the neural network. Both the collection and the labeling incur high labor and time costs.
To solve this technical problem, neural networks are currently trained with a training data set comprising both labeled training data and unlabeled training data. Since a large amount of labeled training data is no longer needed, the dependence on large amounts of labeled data can be relieved, which addresses the high labeling cost and high time cost of the prior art.
At present, existing semi-supervised learning techniques in deep learning mainly introduce random noise or various random transformations during input and feature construction, while constraining the output of the neural network to be robust and invariant, so as to use unlabeled training data to assist training; for example, Takeru Miyato et al. use adversarial examples, Mehdi Sajjadi et al. use random transformations, and Samuli Laine et al. use random noise to introduce perturbations.
However, existing semi-supervised learning techniques have the following technical defects: in order to obtain supervision information from the unlabeled training data, the same group of training samples needs to be forward-computed multiple times, so the efficiency is low; meanwhile, only single-modal data can be used for learning and training, so the application range is narrow.
Disclosure of Invention
In view of the above problems, the present invention provides a method and an apparatus for training a classification model, and a computer server, so as to solve the technical problems of low computational efficiency and narrow application range of the prior art in training a classification model by a semi-supervised learning technique.
In a first aspect, an embodiment of the invention provides a training method of a classification model, which comprises the following steps:
constructing an initial classification model, wherein the initial classification model comprises at least one single-modal classification model with the same classification task, and the modal data training set corresponding to each single-modal classification model comprises labeled training data and unlabeled training data;
and training the initial classification model by adopting the modal training data set corresponding to each single-modal classification model to obtain a target classification model, based on a method of aligning the feature code distributions of the labeled training data and the unlabeled training data in the modal data training set of each single-modal classification model.
In a second aspect, an embodiment of the present invention provides a training apparatus for a classification model, which includes:
the model building unit is used for building an initial classification model, wherein the initial classification model comprises at least one single-modal classification model with the same classification task, and the modal data training set corresponding to each single-modal classification model comprises labeled training data and unlabeled training data;
and the training unit is used for training the initial classification model by adopting the modal training data set corresponding to each single-modal classification model to obtain the target classification model, based on a method of aligning the feature code distributions of the labeled training data and the unlabeled training data in the modal data training set of each single-modal classification model.
In a third aspect, an embodiment of the present invention provides a computer server, including a memory, and one or more processors communicatively connected to the memory;
the memory has stored therein instructions executable by the one or more processors to cause the one or more processors to implement the aforementioned training method of the classification model.
According to the technical scheme, the initial classification model is trained with the modal training data set corresponding to each single-modal classification model, based on a method of aligning the feature code distributions of the labeled training data and the unlabeled training data in the modal data training set of each single-modal classification model, to obtain the target classification model. That is, with the technical scheme of the invention, adversarial constraint training is performed on the feature encoder by combining the labeled training data and the unlabeled training data, so that the encoder can learn a feature representation that is well consistent between the labeled training data and a large amount of unlabeled training data. This avoids the need in the prior art to perform forward computation on the same group of training samples multiple times, thereby improving the training efficiency of the classification model; in addition, training and learning can be performed on multi-modal data, so the application range is wider.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention.
FIG. 1 is a flowchart of a method for training a classification model according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating an initial classification model according to an embodiment of the present invention;
FIG. 3 is a second exemplary diagram of the initial classification model according to the present invention;
FIG. 4 is a flowchart of training based on the initial classification model shown in FIG. 2/FIG. 3 according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating a target classification model trained based on the initial classification model shown in FIG. 3 according to an embodiment of the present invention;
FIG. 6 is a schematic diagram illustrating that the same object respectively corresponds to multiple modal data representations according to an embodiment of the present invention;
FIG. 7 is a third exemplary diagram of an initial classification model according to an embodiment of the present invention;
FIG. 8 is a flowchart of training based on the initial classification model shown in FIG. 7 according to an embodiment of the present invention;
FIG. 9 is a fourth exemplary diagram illustrating an initial classification model according to the present invention;
FIG. 10 is a diagram illustrating a target classification model trained based on the initial classification model shown in FIG. 9 according to an embodiment of the present invention;
FIG. 11 is a fifth exemplary diagram illustrating the structure of the initial classification model according to the embodiment of the present invention;
FIG. 12 is a sixth exemplary diagram illustrating the structure of the initial classification model according to the embodiment of the present invention;
FIG. 13 is a seventh exemplary diagram illustrating an initial classification model according to an embodiment of the present invention;
FIG. 14 is an eighth schematic structural diagram of an initial classification model according to an embodiment of the present invention;
FIG. 15 is a schematic structural diagram of an apparatus for training a classification model according to an embodiment of the present invention;
fig. 16 is a schematic structural diagram of a computer server according to an embodiment of the present invention.
Detailed Description
In order to make those skilled in the art better understand the technical solution of the present invention, the technical solution in the embodiment of the present invention will be clearly and completely described below with reference to the drawings in the embodiment of the present invention, and it is obvious that the described embodiment is only a part of the embodiment of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
Referring to fig. 1, a flowchart of a training method of a classification model in an embodiment of the present invention is shown, where the method includes:
101, constructing an initial classification model, wherein the initial classification model comprises at least one single-modal classification model with the same classification task, and the modal data training set corresponding to each single-modal classification model comprises labeled training data and unlabeled training data;
102, training the initial classification model by using the modal training data set corresponding to each single-modal classification model based on a method for aligning the feature code distribution of the labeled training data and the unlabeled training data in the modal data training set of each single-modal classification model to obtain a target classification model.
In the embodiment of the invention, each single-modal classification model in the initial classification model classifies modal data of a corresponding type; the types of modal data corresponding to different single-modal classification models are different, but the classification tasks corresponding to the plurality of single-modal classification models are the same. For example, a multi-modal classification model includes three single-modal classification models, denoted model A, model B and model C, where model A is used to classify image data, model B is used to classify text data, and model C is used to classify video data, but the classification tasks of model A, model B and model C are the same; for example, the classification task includes classes such as pedestrians, vehicles and traffic lights, that is, each model identifies pedestrians, vehicles, traffic lights and the like from modal data of its corresponding type.
Based on the method flow shown in fig. 1, the initial classification model in the embodiment of the present invention may have a plurality of structures. Several examples of training initial classification models with different structures to obtain the target classification model are described in detail below. Those skilled in the art may extend other alternatives based on the examples provided in the embodiment of the present invention, but any alternative falls within the scope to be protected by the present application as long as it is based on a method of aligning the feature code distributions of a plurality of single-modal classification models.
Example 1
In example 1, the initial classification model may include only one single-modal classification model, as shown in fig. 2, or may include two or more single-modal classification models, as shown in fig. 3. In either structure, each single-modal classification model includes a feature encoder, and a classifier and a discriminator respectively cascaded to the feature encoder. The discriminator is configured to determine whether a feature code output by the feature encoder is derived from labeled training data or unlabeled training data; the output end of the discriminator is provided with a first loss function for training the discriminator and a second loss function for training the feature encoder, and the first loss function and the second loss function are set adversarially against each other.
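For concreteness, the following is a minimal sketch of the single-modal structure just described, written in PyTorch style; the class name, layer sizes and activation choices are illustrative assumptions and are not taken from the patent:

```python
import torch.nn as nn

class SingleModalClassificationModel(nn.Module):
    """One single-modal classification model: a feature encoder with a
    classifier and a discriminator cascaded to its output (illustrative)."""
    def __init__(self, input_dim: int, feature_dim: int, num_classes: int):
        super().__init__()
        # Feature encoder f_e: maps raw modal data to a feature code.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, feature_dim),
        )
        # Classifier f_c: predicts the class from the feature code.
        self.classifier = nn.Linear(feature_dim, num_classes)
        # Discriminator d: predicts whether the feature code comes from
        # labeled or unlabeled training data.
        self.discriminator = nn.Sequential(
            nn.Linear(feature_dim, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        code = self.encoder(x)
        return self.classifier(code), self.discriminator(code), code
```

After training, the discriminator branch is removed, leaving only the encoder and the classifier, which matches the deletion step described below.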
In this example 1, the step 102 trains the initial classification model by using the modality training data set corresponding to each single-modality classification model to obtain the target classification model, which may be specifically implemented by, but is not limited to, the following manner, which includes steps 102a to 102b, as shown in fig. 4:
102a, performing iterative training on the initial classification model by adopting a modal training data set corresponding to each single-modal classification model;
and 102b, deleting the discriminator in each single-modal classification model from the classification model obtained through training, to obtain the target classification model shown in fig. 5.
In example 1, step 102a may be implemented by, but is not limited to, the following:
performing the following iterative training on the initial classification model multiple times, wherein one iterative training specifically comprises the following steps A1 to A2, wherein:
step A1, aiming at each single-mode classification model, acquiring training data from a mode data training set of the single-mode classification model, inputting the training data into a feature encoder of the single-mode classification model, and adjusting parameters of the feature encoder and a classifier in the single-mode classification model according to a loss function value of the classifier of the single-mode classification model; adjusting parameters of a discriminator and a feature encoder of the single-mode classification model based on the value of the first loss function and the value of the second loss function of the single-mode classification model;
and step A2, carrying out next iterative training based on the initial classification model after parameter adjustment.
Preferably, in step A1, the parameters of the discriminator and the feature encoder of the single-modal classification model are adjusted based on the value of the first loss function and the value of the second loss function of the single-modal classification model, and the adjustment may be implemented by, but is not limited to, either of the following modes (mode B1 and mode B2; a code sketch of mode B1 is given after the two modes):
Mode B1: adjusting the parameters of the discriminator according to the value of the first loss function after the discriminator of the single-modal classification model discriminates the feature code output by the feature encoder; and adjusting the parameters of the feature encoder based on the value of the second loss function after the parameter-adjusted discriminator re-discriminates the feature code output by the feature encoder.
Mode B2: adjusting the parameters of the discriminator according to the value of the first loss function after the discriminator of the single-modal classification model discriminates the feature code output by the feature encoder, and adjusting the parameters of the feature encoder of the single-modal classification model according to the value of the second loss function.
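As a rough illustration of mode B1, one possible training iteration could look as follows (PyTorch-style; the optimizer setup, loss weighting and the SingleModalClassificationModel class from the earlier sketch are assumptions, not the patent's exact procedure):

```python
import torch
import torch.nn.functional as F

def train_step_mode_b1(model, opt_model, opt_disc, x_l, y_l, x_u):
    """One iteration for one single-modal model under mode B1 (illustrative).

    x_l, y_l: a labeled batch and its class labels; x_u: an unlabeled batch.
    opt_model optimizes the encoder and classifier; opt_disc the discriminator.
    """
    # 1) Classifier loss: adjust encoder + classifier parameters.
    logits, _, _ = model(x_l)
    cls_loss = F.cross_entropy(logits, y_l)
    opt_model.zero_grad()
    cls_loss.backward()
    opt_model.step()

    # 2) First loss (discriminator): labeled codes -> 1, unlabeled codes -> 0.
    with torch.no_grad():
        code_l, code_u = model.encoder(x_l), model.encoder(x_u)
    d_l, d_u = model.discriminator(code_l), model.discriminator(code_u)
    disc_loss = (F.binary_cross_entropy(d_l, torch.ones_like(d_l)) +
                 F.binary_cross_entropy(d_u, torch.zeros_like(d_u)))
    opt_disc.zero_grad()
    disc_loss.backward()
    opt_disc.step()

    # 3) Second loss (adversarial, on the encoder): re-discriminate with the
    #    updated discriminator and flip the targets so the encoder pushes the
    #    labeled and unlabeled feature-code distributions together.
    d_l = model.discriminator(model.encoder(x_l))
    d_u = model.discriminator(model.encoder(x_u))
    enc_loss = (F.binary_cross_entropy(d_l, torch.zeros_like(d_l)) +
                F.binary_cross_entropy(d_u, torch.ones_like(d_u)))
    opt_model.zero_grad()
    enc_loss.backward()
    opt_model.step()
    return cls_loss.item(), disc_loss.item(), enc_loss.item()
```

Mode B2 differs only in that step 3 reuses the discriminator outputs from step 2 instead of re-discriminating after the discriminator update.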
In practical applications, the same object may be represented by different modality data, such as images, videos, voices, and texts. As shown in fig. 6, the same room can be expressed by three kinds of modality data: an image, a hand drawing, and a text description. In the embodiment of the invention, when the classification model comprises more than two single-modal classification models, in order to improve the performance of each single-modal classification model, cross-modal training can be performed on the single-modal classification models during training based on a method of aligning the feature code distributions of the plurality of single-modal classification models, that is, adversarial constraint training can be performed on the feature code distributions of the plurality of single-modal classification models. Different single-modal classification models correspond to modal data training sets of different modalities; by aligning the feature code distributions of the different modal data of the plurality of single-modal classification models, the plurality of single-modal classification models can implicitly and jointly utilize the training data of different modalities and share the feature information of the training data of different modalities, so that the plurality of single-modal classification models can be trained cooperatively and the performance of each single-modal classification model is mutually improved by utilizing multi-modal data. This training method can improve the classification accuracy of each single-modal classification model and does not require each training sample to have multi-modal data representations at the same time (that is, the technical scheme of the invention does not require multi-modal data alignment of the training samples), so the training samples are easy to acquire and the application range is wider. For the cross-modal training scheme, the structure of the initial classification model may be set as shown in fig. 7, fig. 9, fig. 11, and fig. 12; the initial classification models shown in fig. 7, fig. 9, fig. 11, and fig. 12 are described in detail in example 2, example 3, example 4, and example 5, respectively.
Example 2
The structure of the initial classification model may be set as shown in fig. 7, where each single-modal classification model includes a feature encoder, and a classifier and a discriminator respectively cascaded with the feature encoder; the discriminator is configured to determine whether a feature code output by the feature encoder is derived from labeled training data or unlabeled training data, the output end of the discriminator is provided with a first loss function for training the discriminator and a second loss function for training the feature encoder, and the first loss function and the second loss function are set adversarially against each other. The feature encoders of the plurality of single-modal classification models are also respectively connected to the same cross-modal discriminator, which is used to discriminate the modality type corresponding to the feature code output by each feature encoder; the output end of the cross-modal discriminator is provided with a third loss function for training the cross-modal discriminator and a fourth loss function for training the feature encoders in the single-modal classification models, and the third loss function and the fourth loss function are set adversarially against each other.
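A minimal sketch of how such a shared cross-modal discriminator could sit on top of several single-modal models is shown below (PyTorch-style; the names and sizes are illustrative assumptions, and it reuses the SingleModalClassificationModel sketch above):

```python
import torch.nn as nn

class CrossModalDiscriminator(nn.Module):
    """Shared discriminator that predicts which modality a feature code came from."""
    def __init__(self, feature_dim: int, num_modalities: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, 64), nn.ReLU(),
            nn.Linear(64, num_modalities),  # logits over modality indices
        )

    def forward(self, code):
        return self.net(code)

# Illustrative wiring: every single-modal encoder maps into the same feature
# dimension so that all feature codes can be fed to one shared discriminator.
# models = [SingleModalClassificationModel(d, feature_dim=128, num_classes=C)
#           for d in input_dims]
# cross_disc = CrossModalDiscriminator(feature_dim=128, num_modalities=len(models))
```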
In the aforementioned flow shown in fig. 1, the initial classification model is trained in step 102 by using the modality training data set corresponding to each single-modality classification model to obtain the target classification model, which may be specifically implemented by, but is not limited to, the following manner, which includes steps 102c to 102d, as shown in fig. 8:
102c, performing iterative training on the initial classification model by adopting a modal training data set corresponding to each single-modal classification model;
and 102d, deleting the discriminators in the single-modal classification models and the cross-modal discriminator from the classification model obtained by training.
Preferably, the step 102c can be implemented by, but not limited to, the following ways:
performing the following iterative training on the initial classification model for multiple times, wherein one iterative training comprises the steps of C1-C3, wherein:
step C1, aiming at each single-mode classification model, acquiring training data from a mode data training set corresponding to the single-mode classification model, inputting the training data into a feature encoder of the single-mode classification model, and adjusting parameters of the feature encoder and a classifier in the single-mode classification model according to the value of a loss function of the classifier of the single-mode classification model; adjusting parameters of a discriminator and a feature encoder of the single-mode classification model based on the value of the first loss function and the value of the second loss function of the single-mode classification model;
step C2, adjusting parameters of the cross-modal discriminator and the feature encoder of each single-modal classification model based on the value of the third loss function and the value of the fourth loss function;
and step C3, performing next iterative training based on the initial classification model after parameter adjustment.
Preferably, in the step C1, parameters of a discriminator and a feature encoder of the single-mode classification model are adjusted based on a value of a first loss function and a value of a second loss function of the single-mode classification model, which may be specifically referred to as a mode B1 or a mode B2 in example 1, and are not described herein again.
Preferably, in step C2, the parameters of the cross-modal discriminator and the feature encoders of the single-modal classification models are adjusted based on the value of the third loss function and the value of the fourth loss function, which may be specifically implemented by, but is not limited to, either of the following modes (mode D1 and mode D2; a code sketch of mode D1 is given after the two modes):
Mode D1: adjusting the parameters of the cross-modal discriminator according to the value of the third loss function after the cross-modal discriminator discriminates the feature codes output by the feature encoders of the single-modal classification models; and adjusting the parameters of the feature encoders of the single-modal classification models based on the value of the fourth loss function obtained after the parameter-adjusted cross-modal discriminator re-discriminates the feature codes output by the feature encoders of the single-modal classification models.
Mode D2: adjusting the parameters of the cross-modal discriminator according to the value of the third loss function after the cross-modal discriminator discriminates the feature codes output by the feature encoders of the single-modal classification models, and adjusting the parameters of the feature encoders of the single-modal classification models according to the value of the fourth loss function.
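As a rough illustration of mode D1, one possible cross-modal update is sketched below (PyTorch-style, reusing the model and discriminator sketches above; the uniform-prediction encoder objective is one common way to oppose a multi-class discriminator and may differ from the patent's exact fourth loss function):

```python
import torch
import torch.nn.functional as F

def cross_modal_step_mode_d1(models, cross_disc, opt_cross_disc, opt_encoders, batches):
    """One cross-modal adversarial step under mode D1 (illustrative).

    models[j] is the single-modal model for modality j and batches[j] a batch
    of modality-j training data; opt_encoders optimizes all feature encoders.
    """
    # 1) Third loss: train the cross-modal discriminator to recognise which
    #    modality each feature code came from (codes are detached here).
    codes = [m.encoder(x).detach() for m, x in zip(models, batches)]
    disc_loss = sum(
        F.cross_entropy(cross_disc(code),
                        torch.full((code.size(0),), j, dtype=torch.long,
                                   device=code.device))
        for j, code in enumerate(codes))
    opt_cross_disc.zero_grad()
    disc_loss.backward()
    opt_cross_disc.step()

    # 2) Fourth loss: re-discriminate with the updated discriminator and push
    #    the encoders so their codes become indistinguishable across modalities
    #    (a uniform-prediction target is used here as the adversarial objective).
    enc_loss = 0.0
    for m, x in zip(models, batches):
        log_probs = F.log_softmax(cross_disc(m.encoder(x)), dim=1)
        enc_loss = enc_loss - log_probs.mean()
    opt_encoders.zero_grad()
    enc_loss.backward()
    opt_encoders.step()
    return disc_loss.item(), enc_loss.item()
```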
Example 3
The structure of the initial classification model may be set as shown in fig. 9, where each single-modal classification model includes a feature encoder, and a classifier and a fifth loss function respectively cascaded with the feature encoder, and the value of the fifth loss function indicates the consistency of the feature code distributions of the labeled training data and the unlabeled training data in the modal data training set corresponding to the single-modal classification model. The feature encoders of the plurality of single-modal classification models are connected to the same cross-modal discriminator, which is used to discriminate the modality type corresponding to the feature code output by each feature encoder; the output end of the cross-modal discriminator is provided with a third loss function for training the cross-modal discriminator and a fourth loss function for training the feature encoders in the single-modal classification models, and the third loss function and the fourth loss function are set adversarially against each other.
In this example 3, the initial classification model is trained in the step 102 by using the modality training data set corresponding to each single-modality classification model to obtain the target classification model, which may be specifically implemented by, but is not limited to, the following manners, including steps 102e to 102 f:
102e, performing iterative training on the initial classification model by adopting a modal training data set corresponding to each single-modal classification model;
102f, deleting the cross-mode discriminators in the classification model obtained by training to obtain a target classification model shown in FIG. 10; or deleting the cross-modal discriminator in the trained classification model and the fifth loss function in each single-modal classification model to obtain the target classification model shown in fig. 5.
Preferably, the step 102e can be implemented by, but not limited to, the following ways:
performing the following iterative training on the initial classification model for multiple times, wherein one iterative training comprises the steps of E1-E3:
step E1, for each single-modal classification model, acquiring training data from the modal data training set corresponding to the single-modal classification model, inputting the training data into the feature encoder of the single-modal classification model, and adjusting the parameters of the feature encoder and the classifier in the single-modal classification model according to the value of the loss function of the classifier of the single-modal classification model; and adjusting the parameters of the feature encoder of the single-modal classification model according to the value of the fifth loss function of the single-modal classification model;
step E2, adjusting parameters of the cross-modal discriminator and the feature encoders of the single-modal classification models based on the value of the third loss function and the value of the fourth loss function;
and E3, carrying out next iterative training based on the initial classification model after parameter adjustment.
The specific implementation of the step E2 can be referred to as the mode D1 or the mode D2 in example 2, and details are not repeated here.
Example 4
In example 4, the initial classification model may have a structure as shown in fig. 11, where each single-mode classification model includes a feature encoder, and a classifier and a fifth loss function respectively cascaded with the feature encoder, and a value of the fifth loss function indicates consistency of feature encoding distributions of labeled training data and unlabeled training data in a modal data training set corresponding to the single-mode classification model; the feature encoders of the plurality of single-mode classification models are all connected to the same sixth loss function, and the value of the sixth loss function represents the consistency of the feature encoding distribution output by the feature encoders of the single-mode classification models.
In this example 4, the initial classification model is trained in step 102 by using the modality training data set corresponding to each single-modality classification model to obtain the target classification model, which may be specifically implemented by, but is not limited to, the following manner, which includes steps 102g to 102h:
102g, performing iterative training on the initial classification model by adopting a modal training data set corresponding to each single-modal classification model;
and 102h, deleting a sixth loss function in the multi-modal classification model obtained through training and a fifth loss function in each single-modal classification model to obtain a target classification model.
Preferably, the step 102g can be realized by, but not limited to, the following ways: performing the following iterative training on the initial classification model for multiple times, wherein one iterative training comprises steps F1-F3, wherein:
step F1, for each single-mode classification model, acquiring training data from a mode data training set corresponding to the single-mode classification model, inputting the training data into a feature encoder of the single-mode classification model, and adjusting parameters of the feature encoder and a classifier in the single-mode classification model according to a loss function value of the classifier of the single-mode classification model; adjusting parameters of a feature encoder of the single-mode classification model according to the value of a fifth loss function of the single-mode classification model;
step F2, adjusting parameters of the feature encoders of the single-mode classification models according to the values of the sixth loss function;
and F3, performing next iterative training based on the initial classification model after parameter adjustment.
Example 5
In example 5, the initial classification model may have a structure as shown in fig. 12, where each single-modal classification model includes a feature encoder, and a classifier and a discriminator respectively cascaded to the feature encoder; the discriminator is configured to discriminate whether the feature code output by the feature encoder cascaded to it is derived from labeled training data or unlabeled training data, and the output end of the discriminator is provided with a first loss function for training the discriminator and a second loss function for training the feature encoder, the first loss function and the second loss function being set adversarially against each other. The feature encoders of the plurality of single-modal classification models are all connected to the same sixth loss function, and the value of the sixth loss function represents the consistency of the feature code distributions output by the feature encoders of the single-modal classification models.
In this example 5, the initial classification model is trained in step 102 by using the modality training data set corresponding to each single-modality classification model to obtain the target classification model, which may be specifically implemented by, but is not limited to, the following manner, where the manner includes steps 102i to 102 j:
102i, performing iterative training on the initial classification model by adopting a modal training data set corresponding to each single-modal classification model;
102j, deleting discriminators in all single-mode classification models in the classification models obtained through training to obtain target classification models; or deleting the sixth loss function in the trained classification model and the discriminators in the single-mode classification models to obtain the target classification model.
Preferably, the step 102i can be implemented by, but not limited to, the following ways:
performing the following iterative training on the initial classification model for multiple times, wherein one iterative training comprises the steps G1-G3:
g1, aiming at each single-mode classification model, acquiring training data from a mode data training set corresponding to the single-mode classification model, inputting the training data into a feature encoder of the single-mode classification model, and adjusting parameters of the feature encoder and a classifier in the single-mode classification model according to the value of a loss function of the classifier of the single-mode classification model; adjusting parameters of a discriminator and a feature encoder of the single-mode classification model based on the value of the first loss function and the value of the second loss function of the single-mode classification model;
g2, adjusting parameters of the feature encoders of the single-mode classification models according to the value of the sixth loss function;
and G3, carrying out next iterative training based on the initial classification model after parameter adjustment.
In the embodiment of the present invention, in the step G1, parameters of a discriminator and a feature encoder of the single-mode classification model are adjusted based on a value of the first loss function and a value of the second loss function of the single-mode classification model, and specific implementation may refer to the mode B1 or the mode B2 in example 1, which is not described herein again.
Example 6
In example 6, the initial classification model may have a structure as shown in fig. 13 or fig. 14, and regardless of the initial classification model shown in fig. 13 or fig. 14, each of the single-mode classification models includes a feature encoder and a fifth loss function in cascade, and a value of the fifth loss function represents consistency of feature coding distributions of the labeled training data and the unlabeled training data in the modal data training set corresponding to the single-mode classification model.
In this example 6, in the step 102, the initial classification model is trained by using the modality training data set corresponding to each single-modality classification model to obtain the target classification model, which may be specifically implemented by, but is not limited to, the following manner, where the manner includes steps 102k to 102 l:
102k, performing iterative training on the initial classification model by adopting a modal training data set corresponding to each single-modal classification model;
and 102l, deleting the fifth loss function in each single-mode classification model in the classification model obtained through training to obtain a target classification model.
In example 6, step 102k may be implemented by, but is not limited to, the following:
performing the following iterative training on the initial classification model for multiple times, wherein one iterative training specifically includes the following steps H1-H2, wherein:
step H1, for each single-modal classification model, acquiring training data from the modal data training set corresponding to the single-modal classification model, inputting the training data into the feature encoder of the single-modal classification model, and adjusting the parameters of the feature encoder and the classifier in the single-modal classification model according to the value of the loss function of the classifier of the single-modal classification model; and adjusting the parameters of the feature encoder of the single-modal classification model according to the value of the fifth loss function of the single-modal classification model;
and step H2, performing next iterative training based on the initial classification model after parameter adjustment.
In the embodiment of the present invention, the initial classification models constructed in the foregoing examples 1 and 6 have simpler structures and faster training speeds, but the single-modal classification models cannot make full use of the complementarity between multi-modal data for collaborative learning. The initial classification models constructed in the remaining examples have more complex structures, but different single-modal classification models correspond to modal data training sets of different modalities; by aligning the feature code distributions of the different modal data of the plurality of single-modal classification models, the plurality of single-modal classification models can implicitly and jointly utilize the training data of different modalities and share the feature information of the training data of different modalities, so that the plurality of single-modal classification models can be trained cooperatively and the performance of each single-modal classification model is mutually improved by utilizing multi-modal data. Each of the different examples can combine the labeled training data and the unlabeled training data to perform adversarial constraint training on the feature encoder, so that the encoder can learn a feature representation that is well consistent between the labeled training data and a large amount of unlabeled training data; this avoids the need in the prior art to perform forward computation on the same group of training samples multiple times, thereby improving the training efficiency of the classification model, and in addition, training and learning can be performed on multi-modal data, so the application range is wider. On this basis, those skilled in the art can select any one of the initial classification models in the foregoing examples according to actual needs.
In the first embodiment of the present invention, in the foregoing examples, the loss functions of the classifiers in the single-modal classification models may be set to be the same. In each single-modal classification model, the feature encoder and the classifier are denoted as f_e and f_c respectively, and the learning parameters of the feature encoder and the classifier are denoted as θ_e and θ_c respectively. In the embodiment of the invention, the encoder and the classifier in each single-modal classification model may adopt a cross-entropy loss function to optimize the parameters θ_e and θ_c on the labeled training data according to the ground-truth labels. Denoting the loss function of the classifier in a single-modal classification model by L_c(X; θ_e, θ_c), the loss function may be set as shown in equation (1):

L_c(X;\theta_e,\theta_c) = -\frac{1}{N_l}\sum_{i=1}^{N_l}\sum_{k=1}^{C} y_i^{(k)} \log\big[f_c(f_e(x_i;\theta_e);\theta_c)\big]^{(k)}    (1)

In equation (1), N_l represents the total number of labeled training data in the modal data training set corresponding to the single-modal classification model, C is the number of classes of the classification task, and y_i^{(k)} represents the class label of training sample x_i: y_i^{(k)} takes the value 1 if x_i belongs to the k-th class, and the value 0 if x_i does not belong to the k-th class.
Preferably, in example 1, example 2 and example 5, with the feature encoder and the classifier in each single-modal classification model denoted as f_e and f_c, and the learning parameters of the feature encoder, the classifier and the discriminator denoted as θ_e, θ_c and φ respectively, the first loss function, denoted L_d(X; φ), may be set as shown in equation (2):

L_d(X;\phi) = -\frac{1}{N_l+N_u}\sum_{i=1}^{N_l+N_u}\Big[z_i\log d\big(f_e(x_i;\theta_e);\phi\big) + (1-z_i)\log\big(1-d(f_e(x_i;\theta_e);\phi)\big)\Big]    (2)

In equation (2), N_l is the total number of labeled data in the modal data training set corresponding to the single-modal classification model, N_u is the total number of unlabeled data in that training set, and z_i is a scalar: z_i takes the value 1 if x_i is labeled data and the value 0 if x_i is unlabeled data.
The second loss function may be denoted by L_e(X; θ_e). Since the second loss function is set adversarially against the first loss function, it may be set as shown in equation (3):

L_e(X;\theta_e) = -\frac{1}{N_l+N_u}\sum_{i=1}^{N_l+N_u}\Big[(1-z_i)\log d\big(f_e(x_i;\theta_e);\phi\big) + z_i\log\big(1-d(f_e(x_i;\theta_e);\phi)\big)\Big]    (3)

In equation (3), N_l is the total number of labeled data in the modal data training set corresponding to the single-modal classification model, N_u is the total number of unlabeled data in that training set, and z_i is a scalar: z_i takes the value 1 if x_i is labeled data and the value 0 if x_i is unlabeled data.
Preferably, in examples 2 and 3, with the feature encoder and the classifier in each single-modal classification model denoted as f_e and f_c, and the learning parameters of the feature encoder, the classifier and the cross-modal discriminator denoted as θ_e, θ_c and φ' respectively, the third loss function, denoted L_{d'}(X; φ'), may be set as shown in equation (4):

L_{d'}(X;\phi') = -\frac{1}{N}\sum_{j=1}^{J}\sum_{i=1}^{N_l^j+N_u^j}\log\Big[d'\big(f_e^j(x_i^j;\theta_e^j);\phi'\big)\Big]^{(j)}    (4)

In equation (4), N is the total number of training samples contained in the modal data training sets of all single-modal classification models, J is the total number of single-modal classification models, N_l^j is the total number of labeled training data contained in the modal data training set corresponding to the j-th single-modal classification model, N_u^j is the total number of unlabeled training data contained in that training set, and f_e^j and θ_e^j denote the feature encoder of the j-th single-modal classification model and its learning parameters, respectively.
The fourth loss function in examples 2 and 3 is set adversarially against the third loss function and may be denoted by L_m(X); it can be set as shown in equation (5):

L_m(X) = -\frac{1}{N}\sum_{j=1}^{J}\sum_{i=1}^{N_l^j+N_u^j}\frac{1}{J}\sum_{k=1}^{J}\log\Big[d'\big(f_e^j(x_i^j;\theta_e^j);\phi'\big)\Big]^{(k)}    (5)

In equation (5), d'^{(k)} denotes the k-th element of the output vector of the cross-modal discriminator, J is the total number of single-modal classification models, N_l^j is the total number of labeled training data contained in the modal data training set corresponding to the j-th single-modal classification model, N_u^j is the total number of unlabeled training data contained in that training set, f_e^j and θ_e^j denote the feature encoder of the j-th single-modal classification model and its learning parameters respectively, and φ' is the learning parameter of the cross-modal discriminator.
Preferably, in examples 3, 4 and 6, with the feature encoder and the classifier in each single-modal classification model denoted as f_e and f_c, and their learning parameters denoted as θ_e and θ_c respectively, the fifth loss function in each single-modal classification model may be denoted by L_{mmd}(X; θ_e) and may be set as shown in equation (6):

L_{mmd}(X;\theta_e) = \frac{1}{N_l^2}\sum_{i=1}^{N_l}\sum_{j=1}^{N_l} k\big(f_e(x_i),f_e(x_j)\big) + \frac{1}{N_u^2}\sum_{i=1}^{N_u}\sum_{j=1}^{N_u} k\big(f_e(y_i),f_e(y_j)\big) - \frac{2}{N_l N_u}\sum_{i=1}^{N_l}\sum_{j=1}^{N_u} k\big(f_e(x_i),f_e(y_j)\big)    (6)

In equation (6), k(·,·) is a kernel function, x denotes labeled training data, y denotes unlabeled training data, N_l is the total number of labeled data in the modal data training set corresponding to the single-modal classification model, and N_u is the total number of unlabeled data in that training set.
Preferably, in examples 4 and 5, the sixth loss function may be denoted by L_{mmd}'(X) and may be set as shown in equation (7):

L_{mmd}'(X) = \sum_{(a,b)}\left[\frac{1}{N_a^2}\sum_{i=1}^{N_a}\sum_{j=1}^{N_a} k\big(f_e^a(x_i),f_e^a(x_j)\big) + \frac{1}{N_b^2}\sum_{i=1}^{N_b}\sum_{j=1}^{N_b} k\big(f_e^b(y_i),f_e^b(y_j)\big) - \frac{2}{N_a N_b}\sum_{i=1}^{N_a}\sum_{j=1}^{N_b} k\big(f_e^a(x_i),f_e^b(y_j)\big)\right]    (7)

In equation (7), the modal data training sets corresponding to the plurality of single-modal classification models are grouped pairwise; for each group (a, b), N_a and N_b denote the numbers of training samples contained in the two modal training data sets in the group, x and y denote training samples belonging to the two different modal training data sets respectively, and k(·,·) is a kernel function.
The foregoing formulas (1) to (7) are only examples; those skilled in the art may also use other formulas to implement the same functions, and the present application is not limited thereto.
Example two
Based on the same concept of the training method of the classification model provided in the first embodiment, a second embodiment of the present invention provides a training apparatus of a classification model, the structure of the apparatus may be as shown in fig. 15, and the apparatus includes a model building unit 1 and a training unit 2, where:
the model building unit 1 is used for building an initial classification model, wherein the initial classification model comprises at least one single-modal classification model with the same classification task, and the modal data training set corresponding to each single-modal classification model comprises labeled training data and unlabeled training data;
and the training unit 2 is used for training the initial classification model by adopting the modal training data set corresponding to each single-modal classification model based on a method for aligning the feature code distribution of the labeled training data and the unlabeled training data in the modal data training set of each single-modal classification model to obtain the target classification model.
In the embodiment of the invention, each single-modal classification model in the initial classification model classifies modal data of a corresponding type; the types of modal data corresponding to different single-modal classification models are different, but the classification tasks corresponding to the plurality of single-modal classification models are the same. For example, a multi-modal classification model includes three single-modal classification models, denoted model A, model B and model C, where model A is used to classify image data, model B is used to classify text data, and model C is used to classify video data, but the classification tasks of model A, model B and model C are the same; for example, the classification task includes classes such as pedestrians, vehicles and traffic lights, that is, each model identifies pedestrians, vehicles, traffic lights and the like from modal data of its corresponding type.
Based on the training apparatus for classification models shown in fig. 15, there may be a plurality of structures of initial classification models in the embodiment of the present invention, and a plurality of examples are described below in detail for training initial classification models with different structures respectively to obtain a target classification model, and those skilled in the art may extend other alternatives based on the examples provided in the embodiment of the present invention, but the alternatives are all within the scope to be protected by the present application as long as the alternatives are based on a method of aligning feature coding distributions of a plurality of single-mode classification models.
Example 1A
Example 1A corresponds to example 1 in the first embodiment, and the structure of the initial modality classification model may be as shown in fig. 2 or fig. 3, for details, refer to example 1 in the first embodiment, and will not be described herein again.
In this example 1A, the training unit 2 shown in fig. 15 specifically includes:
the training subunit is used for performing iterative training on the initial classification model by adopting a modal training data set corresponding to each single-modal classification model and triggering the deletion subunit when the training is finished;
and the deleting subunit is used for deleting the discriminators in the single-mode classification models in the classification models obtained by training of the training subunit.
In this example 1A, the training subunit is specifically configured to:
performing the following iterative training on the initial classification model for multiple times:
acquiring training data from a modal data training set of each single-modal classification model aiming at each single-modal classification model, inputting the training data into a feature encoder of the single-modal classification model, and adjusting parameters of the feature encoder and a classifier in the single-modal classification model according to a loss function value of a classifier of the single-modal classification model; adjusting parameters of a discriminator and a feature encoder of the single-mode classification model based on the value of the first loss function and the value of the second loss function of the single-mode classification model;
and performing the next iterative training based on the initial classification model after the parameters are adjusted.
The training subunit adjusts parameters of a discriminator and a feature encoder of the single-mode classification model based on a value of the first loss function and a value of the second loss function of the single-mode classification model, and specific implementation may refer to the mode B1 or the mode B2 in example 1, which is not described herein again.
Example 2A
Example 2A corresponds to example 2 in the first embodiment, and the structure of the initial modality classification model may be as shown in fig. 7, for details, refer to example 2 in the first embodiment, and details are not repeated here.
In this example 2A, the training unit 2 shown in fig. 15 specifically includes:
the training subunit is used for performing iterative training on the initial classification model by adopting a modal training data set corresponding to each single-modal classification model and triggering the deletion subunit when the training is finished;
and the deleting subunit is used for deleting the discriminators and the cross-modal discriminators in the single-modal classification models in the classification models obtained by training the training subunit.
In example 2A, the training subunit is specifically configured to:
for each single-mode classification model, acquiring training data from a modal data training set corresponding to the single-mode classification model, inputting the training data into a feature encoder of the single-mode classification model, and adjusting parameters of the feature encoder and a classifier in the single-mode classification model according to the value of a loss function of the classifier of the single-mode classification model; adjusting parameters of a discriminator and a feature encoder of the single-mode classification model based on the value of the first loss function and the value of the second loss function of the single-mode classification model;
adjusting parameters of the cross-modal discriminator and the feature encoders of the single-modal classification models based on the value of the third loss function and the value of the fourth loss function;
and performing the next iterative training based on the initial classification model after the parameters are adjusted.
In example 2A, the training subunit adjusts parameters of the cross-modal discriminator and the feature encoders of the single-modal classification models based on a value of the third loss function and a value of the fourth loss function, and specific implementation may refer to a mode D1 or a mode D2 in example 2, which is not described herein again.
In example 2A, the training subunit adjusts parameters of a discriminator and a feature encoder of the single-mode classification model based on a value of a first loss function and a value of a second loss function of the single-mode classification model, and specific implementation may refer to a mode B1 or a mode B2 in example 2, which is not described herein again.
Example 3A
Example 3A corresponds to example 3 in the first embodiment, and the structure of the initial modality classification model may be as shown in fig. 9, for details, refer to example 3 in the first embodiment, and details are not repeated here.
In example 3A, the training unit shown in fig. 15 may specifically include:
the training subunit is used for performing iterative training on the initial classification model by adopting a modal training data set corresponding to each single-modal classification model and triggering the deletion subunit when the training is finished;
the deleting subunit is used for deleting the cross-mode discriminator in the classification model obtained by the training of the training subunit to obtain a target classification model; or deleting the cross-modal discriminator in the classification model obtained by training the training subunit and the fifth loss function in each single-modal classification model to obtain the target classification model.
In example 3A, the training subunit is specifically configured to:
performing the following iterative training on the initial classification model for multiple times:
for each single-modal classification model, acquiring training data from the modal data training set corresponding to the single-modal classification model, inputting the training data into the feature encoder of the single-modal classification model, and adjusting the parameters of the feature encoder and the classifier in the single-modal classification model according to the value of the loss function of the classifier of the single-modal classification model; adjusting the parameters of the feature encoder of the single-modal classification model according to the value of the fifth loss function of the single-modal classification model;
adjusting parameters of the cross-modal discriminator and the feature encoders of the single-modal classification models based on the value of the third loss function and the value of the fourth loss function;
and performing the next iterative training based on the initial classification model after the parameters are adjusted.
In example 3A, the training subunit adjusts parameters of the cross-modal discriminator and the feature encoder of each single-modal classification model based on the value of the third loss function and the value of the fourth loss function; for specific implementation, refer to mode D1 or mode D2 in example 2, which is not described herein again.
Example 4A
Example 4A corresponds to example 4 in the first embodiment, and the structure of the initial classification model may be as shown in fig. 11; for details, refer to example 4 in the first embodiment, which are not repeated here.
In example 4A, the training unit shown in fig. 15 may specifically include:
the training subunit is used for performing iterative training on the initial classification model by adopting a modal training data set corresponding to each single-modal classification model and triggering the deletion subunit when the training is finished;
and the deleting subunit is used for deleting the sixth loss function in the multi-modal classification model obtained by training of the training subunit and the fifth loss function in each single-modal classification model to obtain the target classification model.
In example 4A, the training subunit is specifically configured to:
performing the following iterative training on the initial classification model for multiple times:
for each single-mode classification model, acquiring training data from a modal data training set corresponding to the single-mode classification model, inputting the training data into a feature encoder of the single-mode classification model, and adjusting parameters of the feature encoder and a classifier in the single-mode classification model according to the value of a loss function of the classifier of the single-mode classification model; adjusting parameters of a feature encoder of the single-mode classification model according to the value of a fifth loss function of the single-mode classification model;
adjusting parameters of a feature encoder of each single-mode classification model according to the value of the sixth loss function;
and performing the next iterative training based on the initial classification model after the parameters are adjusted.
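As with the fifth loss, the concrete form of the sixth loss function is defined in example 4 of the first embodiment and is not restated here. The sketch below is only an assumed illustration of a cross-modal alignment term that the feature encoders of all single-mode classification models can minimise jointly, in place of a cross-modal discriminator.

```python
# Assumed stand-in for the "sixth loss": penalise the distance between the mean feature
# codes of every pair of modalities. Any distribution-alignment term (for instance the
# MMD sketched above for the fifth loss) could be substituted; the form is not fixed here.
import torch
from itertools import combinations

def cross_modal_alignment(codes):
    """codes: dict mapping modality name -> feature-code tensor of shape (batch, feat)."""
    return sum((codes[a].mean(dim=0) - codes[b].mean(dim=0)).pow(2).sum()
               for a, b in combinations(codes, 2))

# usage with the (assumed) names from the first sketch, once per iteration:
#   codes = {m: encoders[m](x_m) for m, x_m in modality_batches.items()}
#   opt_enc_cls.zero_grad(); cross_modal_alignment(codes).backward(); opt_enc_cls.step()
```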
Example 5A
Example 5A corresponds to example 5 in the first embodiment, and the structure of the initial classification model may be as shown in fig. 12; for details, refer to example 5 in the first embodiment, which are not repeated here.
In example 5A, the training unit shown in fig. 15 may specifically include:
the training subunit is used for performing iterative training on the initial classification model by adopting a modal training data set corresponding to each single-modal classification model and triggering the deletion subunit when the training is finished;
the deleting subunit is used for deleting the discriminators in the single-mode classification models in the classification models obtained by the training of the training subunit to obtain target classification models; or deleting the sixth loss function in the classification model obtained by training the training subunit and the discriminators in the single-mode classification models to obtain the target classification model.
In example 5A, the training subunit is specifically configured to:
performing the following iterative training on the initial classification model for multiple times:
for each single-mode classification model, acquiring training data from a modal data training set corresponding to the single-mode classification model, inputting the training data into a feature encoder of the single-mode classification model, and adjusting parameters of the feature encoder and a classifier in the single-mode classification model according to the value of a loss function of a classifier of the single-mode classification model; adjusting parameters of a discriminator and a feature encoder of the single-mode classification model based on the value of the first loss function and the value of the second loss function of the single-mode classification model;
adjusting parameters of a feature encoder of each single-mode classification model according to the value of the sixth loss function;
and performing the next iterative training based on the initial classification model after the parameters are adjusted.
In example 5A, for the specific implementation in which the training subunit adjusts parameters of the discriminator and the feature encoder of a single-mode classification model based on the value of the first loss function and the value of the second loss function of that single-mode classification model, refer to mode B1 or mode B2 in example 2, which is not repeated here.
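In terms of the earlier sketches (whose names and losses are assumptions, not the patent's text), an Example 5A iteration can be assembled from pieces already shown: run steps (a) through (c) of the first sketch for each modality, then replace steps (d) and (e) with a single encoder update on cross_modal_alignment(codes) as the assumed sixth loss.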
Example 6A
Example 6A corresponds to example 6 in the first embodiment, and the structure of the initial classification model may be as shown in fig. 13 or fig. 14; for details, refer to example 6 in the first embodiment, which are not repeated here.
In example 6A, the training unit shown in fig. 15 may specifically include:
the training subunit is used for performing iterative training on the initial classification model by adopting a modal training data set corresponding to each single-modal classification model and triggering the deletion subunit when the training is finished;
and the deleting subunit is used for deleting the fifth loss function in each single-mode classification model in the classification model obtained by training of the training subunit to obtain the target classification model.
In example 6A, the training subunit is specifically configured to:
performing the following iterative training on the initial classification model for multiple times:
for each single-mode classification model, acquiring training data from the modal data training set corresponding to the single-mode classification model, inputting the training data into the feature encoder of the single-mode classification model, and adjusting parameters of the feature encoder and the classifier in the single-mode classification model according to the value of the loss function of the classifier of the single-mode classification model; adjusting parameters of the feature encoder of the single-mode classification model according to the value of the fifth loss function of the single-mode classification model;
and performing the next iterative training based on the initial classification model after the parameters are adjusted.
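In code terms, and under the assumptions of the earlier sketches, the work of the deleting subunit described in the examples above amounts to discarding the components used only during training (the per-modality discriminators, the cross-modal discriminator, and the fifth/sixth loss terms) and keeping the feature encoder and classifier of each single-mode classification model as the target classification model. A minimal sketch:

```python
# Minimal sketch of the target classification model left after the deleting subunit has
# removed the training-only components; the names (encoders, classifiers) follow the
# assumptions of the first sketch and are not taken from the patent text.
import torch

def build_target_model(encoders, classifiers):
    """Keep only encoder + classifier per modality; discriminators and auxiliary losses are dropped."""
    def classify(modality, x):
        with torch.no_grad():                              # inference only, no parameter adjustment
            logits = classifiers[modality](encoders[modality](x))
        return logits.argmax(dim=-1)
    return classify

# usage: target = build_target_model(encoders, classifiers)
#        predicted_labels = target("image", torch.randn(4, 512))   # 512 = assumed image feature size
```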
In the embodiment of the present invention, the initial classification models constructed in the foregoing examples 1A and 6A have a simpler structure and a faster training speed, but the single-mode classification models cannot fully exploit the complementarity between the multi-modal data for collaborative learning. The initial classification models constructed in examples 2A to 5A are relatively more complex; however, because different single-mode classification models correspond to modal data training sets of different modes, aligning the feature coding distributions of the different-mode data of the plurality of single-mode classification models allows the single-mode classification models to implicitly share the training data of the different modes and the feature information extracted from them, so that the plurality of single-mode classification models can be trained cooperatively and the multi-modal data can be used to improve the performance of each single-mode classification model.

In each of the examples, the feature encoder can be trained under an adversarial constraint that combines the labeled training data and the unlabeled training data, so that the encoder learns a feature representation that is consistent across the labeled training data and a large amount of unlabeled training data. This avoids the need in the prior art to perform forward calculation on the same group of training samples multiple times, and therefore improves the training efficiency of the classification model; in addition, training and learning can be performed on multi-modal data, so the application range is wider. On this basis, a person skilled in the art can select any one of the initial classification models in the foregoing examples according to actual needs.
EXAMPLE III
A third embodiment of the present invention further provides a computer server, as shown in fig. 16, where the computer server includes a memory and one or more processors communicatively connected to the memory;
the memory stores instructions executable by the one or more processors to cause the one or more processors to implement a method for training a multi-modal classification model according to any one of the preceding embodiments.
In the third embodiment of the present invention, the computer server may be a hardware device such as a PC, a notebook, a tablet computer, an FPGA (Field-Programmable Gate Array), an industrial computer, or a smart phone.
While the principles of the invention have been described in connection with specific embodiments thereof, it should be understood by those skilled in the art that all or any of the steps or elements of the method and apparatus of the invention may be implemented in any computing device (including processors, storage media, and the like) or network of computing devices, in hardware, firmware, software, or any combination thereof. This can be accomplished by those skilled in the art using their basic programming skills after reading the description of the invention.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when executed, the program includes one or a combination of the steps of the method embodiments.
In addition, each functional unit in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the above embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the above-described embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (15)

1. An object classification method using a classification model, comprising:
constructing an initial classification model, wherein the initial classification model comprises at least two single-mode classification models with the same object classification task, different single-mode classification models correspond to different modal data types, the modal data comprise at least one of images, videos, voices and characters and are respectively used for representing different characteristics of the same object, and a modal training data set corresponding to each single-mode classification model comprises labeled training data and unlabeled training data;
based on a method for aligning the feature code distribution of labeled training data and unlabeled training data in the modal training data set of each single-modal classification model, carrying out iterative training on the initial classification model by adopting the modal training data set corresponding to each single-modal classification model to obtain a target classification model, and carrying out object classification according to the target classification model;
each single-mode classification model comprises a feature encoder, a classifier and a discriminator, wherein the classifier and the discriminator are respectively cascaded with the feature encoder, and the discriminator is used for judging whether the feature encoding output by the feature encoder comes from labeled training data or unlabeled training data.
2. The method according to claim 1, wherein the output of the discriminator is provided with a first loss function for training the discriminator and a second loss function for training the feature encoder, the first and second loss functions being arranged in opposition;
training the initial classification model by using a modal training data set corresponding to each single-modal classification model to obtain a target classification model, which specifically comprises the following steps:
performing iterative training on the initial classification model by adopting a modal training data set corresponding to each single modal classification model;
and deleting the discriminators in the single-mode classification models in the classification models obtained by training.
3. The method according to claim 2, wherein iteratively training the initial classification model using the modal training dataset corresponding to each single-modal classification model specifically comprises:
performing the following iterative training on the initial classification model for multiple times:
acquiring training data from a modal training data set of the single-modal classification model for each single-modal classification model, inputting the training data into a feature encoder of the single-modal classification model, and adjusting parameters of the feature encoder and a classifier in the single-modal classification model according to a loss function value of a classifier of the single-modal classification model; adjusting parameters of a discriminator and a feature encoder of the single-mode classification model based on the value of the first loss function and the value of the second loss function of the single-mode classification model;
and performing the next iterative training based on the initial classification model after the parameters are adjusted.
4. The method according to claim 2, wherein the feature encoders of the at least two single-mode classification models are further respectively connected to a same cross-mode discriminator, the cross-mode discriminator is used for discriminating a mode type corresponding to the feature encoding output by the feature encoder of each single-mode classification model, a third loss function for training the cross-mode discriminator and a fourth loss function for training the feature encoder in each single-mode classification model are arranged at an output end of the cross-mode discriminator, and the third loss function and the fourth loss function are arranged in opposition;
the method further comprises the following steps: deleting the cross-mode discriminator in the classification model obtained by training.
5. The method according to claim 4, wherein iteratively training the initial classification model using the modal training dataset corresponding to each single-modal classification model specifically comprises:
performing the following iterative training on the initial classification model for multiple times:
acquiring training data from a modal training data set corresponding to a single-modal classification model for each single-modal classification model, inputting the training data into a feature encoder of the single-modal classification model, and adjusting parameters of the feature encoder and a classifier in the single-modal classification model according to a loss function value of the classifier of the single-modal classification model; adjusting parameters of a discriminator and a feature encoder of the single-mode classification model based on the value of the first loss function and the value of the second loss function of the single-mode classification model;
adjusting parameters of the cross-modal discriminator and the feature encoders of the single-modal classification models based on the value of the third loss function and the value of the fourth loss function;
and performing the next iterative training based on the initial classification model after the parameters are adjusted.
6. The method according to claim 5, wherein adjusting parameters of the cross-modal discriminator and the feature encoder of each single-modal classification model based on a value of the third loss function and a value of the fourth loss function comprises:
adjusting parameters of a cross-modal discriminator according to the value of a third loss function after the cross-modal discriminator discriminates the feature codes output by the feature encoders of the single-modal classification models;
and adjusting the parameters of the feature encoders of the single-mode classification models based on the value of a fourth loss function obtained after the cross-mode discriminator subjected to parameter adjustment performs re-discrimination on the feature codes output by the feature encoders of the single-mode classification models.
7. The method according to claim 3 or 5, wherein adjusting parameters of a discriminator and a feature encoder of the monomodal classification model based on a value of a first loss function and a value of a second loss function of the monomodal classification model comprises:
adjusting parameters of a discriminator according to the value of a first loss function after the discriminator of the single-mode classification model discriminates the feature codes output by the feature encoder;
and adjusting the parameters of the feature encoder based on the value of the second loss function after the discriminator performs re-discrimination on the feature code output by the feature encoder after the parameters are adjusted.
8. An object classification apparatus using a classification model, comprising:
the model building unit is used for building an initial classification model, the initial classification model comprises at least two single-mode classification models with the same object classification task, different single-mode classification models correspond to different modal data types, the modal data comprise at least one of images, videos, voices and characters and are respectively used for representing different characteristics of the same object, and a modal training data set corresponding to each single-mode classification model comprises labeled training data and unlabeled training data;
the training unit is used for carrying out iterative training on the initial classification model by adopting the modal training data set corresponding to each single-modal classification model based on a method for aligning the characteristic coding distribution of the labeled training data and the unlabeled training data in the modal training data set of each single-modal classification model to obtain a target classification model so as to carry out object classification according to the target classification model;
each single-mode classification model comprises a feature encoder, a classifier and a discriminator, wherein the classifier and the discriminator are respectively cascaded with the feature encoder, and the discriminator is used for judging whether the feature encoding output by the feature encoder comes from labeled training data or unlabeled training data.
9. The apparatus of claim 8, wherein the output of the discriminator is provided with a first loss function for training the discriminator and a second loss function for training the feature encoder, the first and second loss functions being oppositional;
the training unit specifically comprises:
the training subunit is used for performing iterative training on the initial classification model by adopting a modal training data set corresponding to each single-modal classification model, and triggering the deletion subunit after the training is finished;
and the deleting subunit is used for deleting the discriminators in the single-mode classification models in the classification model obtained by the training of the training subunit.
10. The apparatus according to claim 9, wherein the training subunit is specifically configured to:
performing the following iterative training on the initial classification model for multiple times:
acquiring training data from a modal training data set of the single-modal classification model for each single-modal classification model, inputting the training data into a feature encoder of the single-modal classification model, and adjusting parameters of the feature encoder and a classifier in the single-modal classification model according to a loss function value of a classifier of the single-modal classification model; adjusting parameters of a discriminator and a feature encoder of the single-mode classification model based on the value of the first loss function and the value of the second loss function of the single-mode classification model;
and performing the next iterative training based on the initial classification model after the parameters are adjusted.
11. The apparatus according to claim 9, wherein the feature encoders of the at least two single-mode classification models are further respectively connected to a same cross-mode discriminator, the cross-mode discriminator is configured to discriminate a mode type corresponding to the feature encoding output by the feature encoder of each single-mode classification model, a third loss function for training the cross-mode discriminator and a fourth loss function for training the feature encoder in each single-mode classification model are provided at an output end of the cross-mode discriminator, and the third loss function and the fourth loss function are arranged in opposition;
the delete subunit is further to: deleting the cross-mode discriminator in the classification model obtained by training.
12. The apparatus according to claim 11, wherein the training subunit is specifically configured to:
performing the following iterative training on the initial classification model for multiple times:
acquiring training data from a modal training data set corresponding to a single-modal classification model for each single-modal classification model, inputting the training data into a feature encoder of the single-modal classification model, and adjusting parameters of the feature encoder and a classifier in the single-modal classification model according to a loss function value of the classifier of the single-modal classification model; adjusting parameters of a discriminator and a feature encoder of the single-mode classification model based on the value of the first loss function and the value of the second loss function of the single-mode classification model;
adjusting parameters of the cross-modal discriminator and the feature encoders of the single-modal classification models based on the value of the third loss function and the value of the fourth loss function;
and performing the next iterative training based on the initial classification model after the parameters are adjusted.
13. The apparatus according to claim 12, wherein the training subunit adjusts parameters of the cross-modal discriminator and the feature encoder of each single-modal classification model based on a value of the third loss function and a value of the fourth loss function, which specifically includes:
adjusting parameters of a cross-modal discriminator according to the value of a third loss function after the cross-modal discriminator discriminates the feature codes output by the feature encoders of the single-modal classification models;
and adjusting the parameters of the feature encoders of the single-mode classification models based on the value of a fourth loss function obtained after the cross-mode discriminator subjected to parameter adjustment performs re-discrimination on the feature codes output by the feature encoders of the single-mode classification models.
14. The apparatus according to claim 10 or 12, wherein the training subunit adjusts parameters of a discriminator and a feature encoder of the single-mode classification model based on values of a first loss function and values of a second loss function of the single-mode classification model, and specifically includes:
adjusting parameters of a discriminator according to the value of a first loss function after the discriminator of the single-mode classification model discriminates the feature codes output by the feature encoder;
and adjusting the parameters of the feature encoder based on the value of the second loss function after the discriminator performs re-discrimination on the feature code output by the feature encoder after the parameters are adjusted.
15. A computer server comprising a memory and one or more processors communicatively coupled to the memory;
the memory has stored therein instructions executable by the one or more processors to cause the one or more processors to implement an object classification method applying a classification model as claimed in any one of claims 1 to 7.
CN201810412797.4A 2018-05-03 2018-05-03 Training method and device of classification model and computer server Active CN108664999B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810412797.4A CN108664999B (en) 2018-05-03 2018-05-03 Training method and device of classification model and computer server

Publications (2)

Publication Number Publication Date
CN108664999A CN108664999A (en) 2018-10-16
CN108664999B true CN108664999B (en) 2021-02-12

Family

ID=63780584

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810412797.4A Active CN108664999B (en) 2018-05-03 2018-05-03 Training method and device of classification model and computer server

Country Status (1)

Country Link
CN (1) CN108664999B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111090753B (en) * 2018-10-24 2020-11-20 马上消费金融股份有限公司 Training method of classification model, classification method, device and computer storage medium
CN109753966A (en) * 2018-12-16 2019-05-14 初速度(苏州)科技有限公司 A kind of Text region training system and method
CN109376556B (en) * 2018-12-17 2020-12-18 华中科技大学 Attack method for EEG brain-computer interface based on convolutional neural network
CN109886342A (en) * 2019-02-26 2019-06-14 视睿(杭州)信息科技有限公司 Model training method and device based on machine learning
CN111930476B (en) * 2019-05-13 2024-02-27 百度(中国)有限公司 Task scheduling method and device and electronic equipment
CN110263865B (en) * 2019-06-24 2021-11-02 北方民族大学 Semi-supervised multi-mode multi-class image translation method
CN110741388B (en) * 2019-08-14 2023-04-14 东莞理工学院 Confrontation sample detection method and device, computing equipment and computer storage medium
CN110472737B (en) * 2019-08-15 2023-11-17 腾讯医疗健康(深圳)有限公司 Training method and device for neural network model and medical image processing system
CN112307860A (en) * 2019-10-10 2021-02-02 北京沃东天骏信息技术有限公司 Image recognition model training method and device and image recognition method and device
CN111667027B (en) * 2020-07-03 2022-11-11 腾讯科技(深圳)有限公司 Multi-modal image segmentation model training method, image processing method and device
CN112115781B (en) * 2020-08-11 2022-08-16 西安交通大学 Unsupervised pedestrian re-identification method based on anti-attack sample and multi-view clustering
CN112016523B (en) * 2020-09-25 2023-08-29 北京百度网讯科技有限公司 Cross-modal face recognition method, device, equipment and storage medium
CN112668671B (en) * 2021-03-15 2021-12-24 北京百度网讯科技有限公司 Method and device for acquiring pre-training model
CN113178189B (en) * 2021-04-27 2023-10-27 科大讯飞股份有限公司 Information classification method and device and information classification model training method and device
CN113343936B (en) * 2021-07-15 2024-07-12 北京达佳互联信息技术有限公司 Training method and training device for video characterization model
CN113449700B (en) * 2021-08-30 2021-11-23 腾讯科技(深圳)有限公司 Training of video classification model, video classification method, device, equipment and medium
CN115600091B (en) * 2022-12-16 2023-03-10 珠海圣美生物诊断技术有限公司 Classification model recommendation method and device based on multi-modal feature fusion

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102999938A (en) * 2011-03-09 2013-03-27 西门子公司 Method and system for model-based fusion of multi-modal volumetric images
CN103166830A (en) * 2011-12-14 2013-06-19 中国电信股份有限公司 Spam email filtering system and method capable of intelligently selecting training samples
CN106951919A (en) * 2017-03-02 2017-07-14 浙江工业大学 A kind of flow monitoring implementation method based on confrontation generation network
CN107392312A (en) * 2017-06-01 2017-11-24 华南理工大学 A kind of dynamic adjustment algorithm based on DCGAN performances
CN107392125A (en) * 2017-07-11 2017-11-24 中国科学院上海高等研究院 Training method/system, computer-readable recording medium and the terminal of model of mind

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101030197A (en) * 2006-02-28 2007-09-05 株式会社东芝 Method and apparatus for bilingual word alignment, method and apparatus for training bilingual word alignment model
CN107729513B (en) * 2017-10-25 2020-12-01 鲁东大学 Discrete supervision cross-modal Hash retrieval method based on semantic alignment
CN107958216A (en) * 2017-11-27 2018-04-24 沈阳航空航天大学 Based on semi-supervised multi-modal deep learning sorting technique

Also Published As

Publication number Publication date
CN108664999A (en) 2018-10-16

Similar Documents

Publication Publication Date Title
CN108664999B (en) Training method and device of classification model and computer server
CN111797893B (en) Neural network training method, image classification system and related equipment
WO2022017245A1 (en) Text recognition network, neural network training method, and related device
CN108875522B (en) Face clustering method, device and system and storage medium
CN111275107A (en) Multi-label scene image classification method and device based on transfer learning
CN111461174B (en) Multi-mode label recommendation model construction method and device based on multi-level attention mechanism
CN112559784A (en) Image classification method and system based on incremental learning
CN105069424A (en) Quick recognition system and method for face
Gu et al. A novel lightweight real-time traffic sign detection integration framework based on YOLOv4
Li et al. FRD-CNN: Object detection based on small-scale convolutional neural networks and feature reuse
CN113704531A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN112069884A (en) Violent video classification method, system and storage medium
Sun et al. Adaptive multi-lane detection based on robust instance segmentation for intelligent vehicles
CN112529068B (en) Multi-view image classification method, system, computer equipment and storage medium
Heo et al. Estimation of pedestrian pose orientation using soft target training based on teacher–student framework
Bezak Building recognition system based on deep learning
CN114581710A (en) Image recognition method, device, equipment, readable storage medium and program product
CN114330588A (en) Picture classification method, picture classification model training method and related device
CN116434033A (en) Cross-modal contrast learning method and system for RGB-D image dense prediction task
CN110175588B (en) Meta learning-based few-sample facial expression recognition method and system
CN112131506B (en) Webpage classification method, terminal equipment and storage medium
CN111783688B (en) Remote sensing image scene classification method based on convolutional neural network
Liu et al. A framework for short video recognition based on motion estimation and feature curves on SPD manifolds
Ullah et al. A review of multi-modal learning from the text-guided visual processing viewpoint
CN111143544B (en) Method and device for extracting bar graph information based on neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
  Effective date of registration: 20200324
  Address after: 101300, No. two, 1 road, Shunyi Park, Zhongguancun science and Technology Park, Beijing, Shunyi District
  Applicant after: BEIJING TUSENZHITU TECHNOLOGY Co.,Ltd.
  Address before: 101300, No. two, 1 road, Shunyi Park, Zhongguancun science and Technology Park, Beijing, Shunyi District
  Applicant before: TuSimple
GR01 Patent grant