CN114187486A - Model training method and related equipment

Model training method and related equipment

Info

Publication number
CN114187486A
Authority
CN
China
Prior art keywords
sample
multimedia
features
feature extraction
extraction model
Prior art date
Legal status
Pending
Application number
CN202111498607.3A
Other languages
Chinese (zh)
Inventor
林和政
吴翔宇
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202111498607.3A
Publication of CN114187486A
Legal status: Pending (current)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2411: Classification techniques based on the proximity to a decision surface, e.g. support vector machines
    • G06F 18/2413: Classification techniques based on distances to training or reference patterns
    • G06F 18/24147: Distances to closest patterns, e.g. nearest neighbour classification
    • G06F 18/2415: Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F 18/25: Fusion techniques
    • G06F 18/253: Fusion techniques of extracted features
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods


Abstract

Embodiments of the disclosure provide a model training method and related equipment. The method comprises: determining a multimedia training sample, and acquiring historical sample features and historical category features of the multimedia training sample, the historical sample features and the historical category features being output by a first feature extraction model and a first classifier, respectively; processing the multimedia training sample through a second feature extraction model to obtain predicted sample features of the multimedia training sample, and calculating a first loss function according to the historical sample features and the predicted sample features; processing the predicted sample features through the first classifier to determine predicted category features of the multimedia training sample, and calculating a second loss function according to the historical category features and the predicted category features; and training the second feature extraction model based on the first loss function and the second loss function. The method can train the second feature extraction model so that the features it extracts are aligned with the historical sample features in the feature space.

Description

Model training method and related equipment
Technical Field
The present disclosure relates to the field of deep learning technologies, and in particular, to a model training method, a model training apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
Background
A work sharing platform can receive multimedia files uploaded by users and push them to users as works in a personalized manner. A feature extraction model deployed on the work sharing platform maps each work into a feature space capable of expressing its content information to obtain the work features of that work, and works are then pushed to users in a personalized manner according to these work features. Because the data of the works on the work sharing platform changes, the feature space expressing the content information of the works changes accordingly, so the work features on the platform need to be updated frequently to ensure that the works remain consistent in the feature space.
In the related art, when work features are updated, the feature extraction model deployed online performs feature extraction again on all works on the work sharing platform to obtain new work features, thereby ensuring that all works remain consistent in the feature space. Therefore, in the related-art feature update process, the amount of data to be processed to guarantee this consistency is the entire set of works on the platform, which has the drawbacks of large resource consumption and low efficiency of updating work features.
Disclosure of Invention
The present disclosure provides a model training method, a model training apparatus, an electronic device, a computer-readable storage medium, and a computer program product, which at least solve the problems in the related art of large resource consumption and low efficiency when updating work features. The technical solutions of the disclosure are as follows:
according to a first aspect of embodiments of the present disclosure, there is provided a model training method, including: determining a multimedia training sample, and acquiring historical sample features and historical category features of the multimedia training sample, where the historical sample features are predicted and output by a first feature extraction model and the historical category features are obtained by processing the historical sample features through a first classifier; processing the multimedia training sample through a second feature extraction model to obtain predicted sample features of the multimedia training sample, and calculating a first loss function according to the historical sample features and the predicted sample features; processing the predicted sample features through the first classifier to determine predicted category features of the multimedia training sample, and calculating a second loss function according to the historical category features and the predicted category features; and training the second feature extraction model based on the first loss function and the second loss function.
In some exemplary embodiments of the present disclosure, the second feature extraction model includes an image feature extraction model and a text feature extraction model; and processing the multimedia training sample through a second feature extraction model to obtain a predicted sample feature of the multimedia training sample, wherein the step comprises the following steps of: processing the multimedia training sample through an image feature extraction model to obtain the image features of the multimedia training sample; processing the multimedia training sample through a text feature extraction model to obtain the text features of the multimedia training sample; and fusing the image characteristics and the text characteristics to obtain the predicted sample characteristics of the multimedia training sample.
In some exemplary embodiments of the present disclosure, the step of processing the multimedia training sample through the image feature extraction model to obtain the image features of the multimedia training sample includes: acquiring a cover image of a multimedia training sample; performing frame extraction on the multimedia training samples to obtain a preset number of sampled frame images; and respectively extracting the image characteristics of the cover image and each sampling frame image through an image characteristic extraction model to be used as the image characteristics of the multimedia training sample.
In some exemplary embodiments of the present disclosure, the step of processing the multimedia training sample through the text feature extraction model to obtain the text features of the multimedia training sample includes: respectively identifying the cover image and each sampling frame image by using a character identification technology to obtain a first identification text; identifying the multimedia training sample by using a content identification technology to obtain a second identification text; obtaining a description text of the multimedia training sample, and splicing the description text, the first identification text and the second identification text to obtain a text to be processed of the multimedia training sample; and extracting the text features of the text to be processed through the text feature extraction model to serve as the text features of the multimedia training sample.
In some exemplary embodiments of the present disclosure, the second feature extraction model includes a multi-head attention layer, and the step of fusing the image features and the text features to obtain the predicted sample features of the multimedia training sample includes: performing fusion processing on the image features and the text features by using the multi-head attention layer to obtain the predicted sample features of the multimedia training sample.
In some exemplary embodiments of the present disclosure, the training of the second feature extraction model according to the first loss function and the second loss function comprises: acquiring constraint parameters, and constructing a constraint condition expression based on the constraint parameters, the first loss function and the second loss function; and training the second feature extraction model according to the constraint conditional expression.
In some exemplary embodiments of the present disclosure, the method further comprises: acquiring a multimedia file to be processed; processing the multimedia file to be processed through the second feature extraction model to obtain multimedia features of the multimedia file to be processed; and processing the multimedia features through a second classifier to determine a recommended label for the multimedia file to be processed.
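A minimal sketch of this inference path (Python, assuming a PyTorch-style interface) is given below; the module names second_extractor and second_classifier and the tag vocabulary are illustrative assumptions, not elements taken from the disclosure.

    import torch

    def recommend_tag(media_input, second_extractor, second_classifier, tag_names):
        """Hypothetical inference path: second feature extraction model -> second classifier.

        media_input: preprocessed tensor(s) for one multimedia file to be processed.
        second_extractor / second_classifier: trained torch.nn.Module instances (assumed).
        tag_names: list of candidate tag strings, index-aligned with the classifier outputs.
        """
        with torch.no_grad():
            features = second_extractor(media_input)                    # multimedia features, shape (1, D)
            probs = torch.softmax(second_classifier(features), dim=-1)  # scores over candidate tags
            probs = probs.squeeze(0)
            best = int(probs.argmax())
        return tag_names[best], float(probs[best])

In this sketch the tag with the highest classifier probability is returned as the recommended label.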
According to a second aspect of the embodiments of the present disclosure, there is provided a model training apparatus including: the acquisition module is configured to determine a multimedia training sample, and acquire historical sample characteristics and historical category characteristics of the multimedia training sample; the historical sample features are output by prediction through a first feature extraction model, and the historical category features are obtained by processing the historical sample features through a first classifier; the computing module is configured to process the multimedia training samples through the second feature extraction model to obtain predicted sample features of the multimedia training samples, and compute a first loss function according to the historical sample features and the predicted sample features; the calculation module is further configured to perform determining a prediction class feature of the multimedia training sample by processing the prediction sample feature with a first classifier, calculating a second loss function according to the history class feature and the prediction class feature; a training module configured to perform training of a second feature extraction model according to the first loss function and the second loss function.
In some exemplary embodiments of the present disclosure, the second feature extraction model includes an image feature extraction model and a text feature extraction model; and the calculation module executes the step of processing the multimedia training sample through the second feature extraction model to obtain the predicted sample feature of the multimedia training sample, and the step comprises the following steps of: processing the multimedia training sample through an image feature extraction model to obtain the image features of the multimedia training sample; processing the multimedia training sample through a text feature extraction model to obtain the text features of the multimedia training sample; and fusing the image characteristics and the text characteristics to obtain the predicted sample characteristics of the multimedia training sample.
In some exemplary embodiments of the present disclosure, the calculating module performs the step of processing the multimedia training sample through the image feature extraction model to obtain the image features of the multimedia training sample, including: acquiring a cover image of a multimedia training sample; performing frame extraction on the multimedia training samples to obtain a preset number of sampled frame images; and respectively extracting the image characteristics of the cover image and each sampling frame image through an image characteristic extraction model to be used as the image characteristics of the multimedia training sample.
In some exemplary embodiments of the present disclosure, the step of processing the multimedia training sample by the text feature extraction model to obtain the text feature of the multimedia training sample by the computation module includes: respectively identifying the cover image and each sampling frame image by using a character identification technology to obtain a first identification text; identifying the multimedia training sample by using a content identification technology to obtain a second identification text; obtaining a description text of the multimedia training sample, and splicing the description text, the first identification text and the second identification text to obtain a text to be processed of the multimedia training sample; and extracting the text features of the text to be processed through the text feature extraction model to serve as the text features of the multimedia training sample.
In some exemplary embodiments of the disclosure, the second feature extraction model includes a multi-head attention layer, and the calculating module performs a step of fusing image features and text features to obtain predicted sample features of the multimedia training sample, including: and carrying out fusion processing on the image features and the text features by using the multi-head attention layer to obtain the predicted sample features of the multimedia training sample.
In some exemplary embodiments of the disclosure, the training module performs the step of training the second feature extraction model according to the first loss function and the second loss function, including: acquiring constraint parameters, and constructing a constraint condition expression based on the constraint parameters, the first loss function and the second loss function; and training the second feature extraction model according to the constraint conditional expression.
In some exemplary embodiments of the present disclosure, the apparatus further comprises a processing module configured to perform: acquiring a multimedia file to be processed; processing the multimedia file to be processed through the second feature extraction model to obtain multimedia features of the multimedia file to be processed; and processing the multimedia features through a second classifier to determine a recommended label for the multimedia file to be processed.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to execute the executable instructions to implement the model training method of any one of the above.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium, wherein instructions, when executed by a processor of an electronic device, enable the electronic device to perform the model training method of any one of the above.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product comprising a computer program/instructions, wherein the computer program/instructions, when executed by a processor, implement the model training method of any one of the above.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
the model training method provided by the embodiments of the present disclosure can acquire the historical sample features and historical category features of a multimedia training sample (i.e., a work on a work sharing platform), where the historical sample features are predicted and output by processing the multimedia training sample through a first feature extraction model and the historical category features are obtained by processing the historical sample features through a first classifier; obtain the predicted sample features of the multimedia training sample through a second feature extraction model, and process the predicted sample features with the first classifier to obtain the predicted category features of the multimedia training sample; then calculate a first loss function based on the historical sample features and the predicted sample features, calculate a second loss function based on the historical category features and the predicted category features, and train the second feature extraction model according to the first loss function and the second loss function. On the one hand, because the historical sample features are extracted by the first feature extraction model and the historical category features are predicted from those historical sample features, after this training the predicted sample features of a multimedia training sample extracted by the second feature extraction model can be aligned in the feature space with the historical sample features of the same sample extracted by the first feature extraction model. On the other hand, compared with the related art, fewer works need to be processed during the work feature update, which saves resource overhead and improves the efficiency of updating work features.
Further, after the multimedia features (i.e., the work features) of a multimedia file to be processed (i.e., a work) are extracted through the second feature extraction model, the model training method provided by the embodiments of the present disclosure may further perform classification processing on the multimedia features to obtain a recommended tag for the multimedia file to be processed, which can be used to push multimedia files to users in a personalized manner.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a schematic diagram of an exemplary system architecture of a model training method shown in accordance with an exemplary embodiment.
FIG. 2 is a flow diagram illustrating a method of model training in accordance with an exemplary embodiment.
FIG. 3 is a flow diagram illustrating a method for deriving predicted sample features in a model training method according to an exemplary embodiment.
FIG. 4 is a flow diagram illustrating how image features are obtained in a model training method according to an exemplary embodiment.
FIG. 5 is a diagram illustrating how a text to be processed is obtained in a model training method according to an exemplary embodiment.
FIG. 6 is a flow diagram illustrating a method for obtaining text features in a model training method in accordance with an exemplary embodiment.
FIG. 7 is a flow chart illustrating a method of model training in accordance with an exemplary embodiment.
FIG. 8 is a diagram illustrating a network architecture for implementing a model training method in accordance with an exemplary embodiment.
FIG. 9 is a diagram illustrating a network architecture for determining recommended tags using a second feature extraction model in accordance with an exemplary embodiment.
FIG. 10 is a block diagram illustrating a model training apparatus in accordance with an exemplary embodiment.
FIG. 11 is a schematic diagram illustrating a structure of an electronic device suitable for use in implementing exemplary embodiments of the present disclosure, according to an exemplary embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
FIG. 1 shows a schematic diagram of an exemplary system architecture to which the model training method of embodiments of the present disclosure may be applied.
As shown in fig. 1, the system architecture may include a server 101, a network 102, and a client 103. Network 102 serves as a medium for providing communication links between clients 103 and server 101. Network 102 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
In some optional embodiments, the server 101 may be a server providing various services, and may be configured to train a second feature extraction model, deploy the trained second feature extraction model on the server 101 or the client 103, and process the to-be-processed multimedia file acquired from the client 103 through the trained second feature extraction model to obtain the multimedia features of the to-be-processed multimedia file.
Further, in some optional embodiments, the process used by the server 101 to train the second feature extraction model may be: the server 101 determines a multimedia training sample, and obtains historical sample characteristics and historical category characteristics of the multimedia training sample; the historical sample features are output by prediction through a first feature extraction model, and the historical category features are obtained by processing the historical sample features through a first classifier; the server 101 processes the multimedia training sample through the second feature extraction model to obtain a predicted sample feature of the multimedia training sample, and calculates a first loss function according to the historical sample feature and the predicted sample feature; the server 101 determines the prediction category characteristics of the multimedia training samples by processing the prediction sample characteristics through the first classifier, and calculates a second loss function according to the history category characteristics and the prediction category characteristics; the server 101 trains a second feature extraction model according to the first loss function and the second loss function.
The server 101 may also determine a recommended label of the multimedia file to be processed according to the multimedia characteristics of the multimedia file to be processed, so as to send the determined recommended label to the client 103. Specifically, the server 101 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a CDN (Content Delivery Network), a big data and artificial intelligence platform, and the like.
In some alternative embodiments, the client 103 may be used to present a recommendation tag. Specifically, the client 103 may include, but is not limited to, a smart phone, a desktop computer, a tablet computer, a notebook computer, a smart speaker, a digital assistant, an AR (Augmented Reality) device, a VR (Virtual Reality) device, a smart wearable device, and other types of electronic devices, or the client 103 may be a personal computer such as a laptop computer, a desktop computer, and the like. Optionally, the operating system running on the electronic device may include, but is not limited to, an android system, an IOS system, linux, windows, and the like.
In addition, it should be noted that fig. 1 shows only one application environment of the model training method provided by the present disclosure. The number of clients, networks and servers in fig. 1 is merely illustrative, and there may be any number of clients, networks and servers, as desired.
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the steps of the model training method in the exemplary embodiment of the present disclosure will be described in more detail below with reference to the drawings and the embodiments.
Fig. 2 is a flowchart illustrating a model training method according to an exemplary embodiment, and an execution subject of the method provided in the embodiment of fig. 2 may be any electronic device, for example, the server 101 in the embodiment of fig. 1, but the disclosure is not limited thereto.
As shown in fig. 2, the model training method provided by the embodiment of the present disclosure may include the following steps:
step S201, determining a multimedia training sample, and acquiring historical sample characteristics and historical category characteristics of the multimedia training sample; wherein, the historical sample features are predicted and output by using the first feature extraction model, and the historical category features are obtained by processing the historical sample features through the first classifier.
In the embodiment of the disclosure, the multimedia training sample may be a short video, a long video, a live video, an audio, an image work set, etc., and the embodiment of the disclosure is not limited; in some practical applications, the multimedia training samples can be obtained from a specified database, such as: the designated database can be a video library on a work sharing platform (such as a video platform), and all historical multimedia files uploaded by a user can be stored in the video library; the specified database may also be a pre-set sample database for model training. The number of multimedia training samples may be adjusted according to actual conditions, and the embodiment of the present disclosure is not limited, for example: the data accounting for 1% or 3% of the total amount can be selected from the designated database to be used as multimedia training samples, or 1000 or 5000 pieces of data can be selected from the designated database to be used as multimedia training samples.
In addition, the feature extraction model (including the first feature extraction model and the second feature extraction model) may be used to map the multimedia training sample to a feature space capable of expressing content information thereof, so as to obtain sample features (including historical sample features and predicted sample features) of the multimedia training sample; the first feature extraction model is a feature extraction model that has been trained and deployed on-line, and may be a feature extraction model that responds to a previous most recent feature update requirement.
In addition, the first classifier can be used to process the sample features of the multimedia training samples to obtain class features (including historical class features and predicted class features) that can represent the classes to which the samples belong; the first classifier may use any classification technique such as a multilayer perceptron (MLP), a support vector machine (SVM), a K-nearest neighbor (KNN) classifier, a Gaussian mixture model (GMM), and the like, and the embodiments of the present disclosure are not limited thereto.
In some practical applications, preset categories may be set before the multimedia training samples are classified by using the first classifier, and then the first classifier processes the sample features to obtain probability vectors of the multimedia training samples belonging to the preset categories, so as to determine the category features of the multimedia training samples according to the probability vectors. In some practical applications, the probability vector may be used as a class feature of the multimedia training sample. Specifically, the number of the preset categories and the content of the specific categories can be determined when the preset categories are set, the number of the preset categories can be a preset number (such as 2, 3, 100 or 1000), and the content of the specific categories can be categories with practical significance (such as music, sports, dancing, gourmet food and/or travel); after the preset categories are set, the sample features can be processed through the first classifier, so that probability vectors with the dimensionality of the multimedia training samples being the preset number are obtained, and each value in the probability vectors can represent the probability that the multimedia training samples belong to each preset category. The following are exemplified:
when the number of preset categories is 3 and the specific category contents of the preset categories are sports, food, and travel respectively, and the probability vector of sample A obtained by the first classifier is [0.2, 0.85, 0.1], the probability vector can be read as "the probability that sample A belongs to the sports category is 0.2, the probability that it belongs to the food category is 0.85, and the probability that it belongs to the travel category is 0.1". The probability vector can be used as the class feature of sample A to characterize the class of sample A.
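The sketch below makes this step concrete; the classifier architecture, feature dimension, and category names are assumptions for illustration and are not fixed by the disclosure.

    import torch
    import torch.nn as nn

    preset_categories = ["sports", "food", "travel"]   # illustrative preset categories

    # A simple MLP standing in for the "first classifier" (assumed architecture).
    first_classifier = nn.Sequential(
        nn.Linear(512, 128),
        nn.ReLU(),
        nn.Linear(128, len(preset_categories)),
    )

    sample_feature = torch.randn(1, 512)               # a sample feature from a feature extraction model
    class_feature = torch.softmax(first_classifier(sample_feature), dim=-1)
    # class_feature is a probability vector over the preset categories; its i-th value is the
    # probability that the sample belongs to preset_categories[i], and it can serve as the class feature.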
Step S203, processing the multimedia training sample through the second feature extraction model to obtain a predicted sample feature of the multimedia training sample, and calculating a first loss function according to the historical sample feature and the predicted sample feature.
The second feature extraction model is a feature extraction model which is not trained and is not deployed on line, the second feature extraction model is a feature extraction model to be trained in the disclosure, and after the second feature extraction model is trained by the method provided by the embodiment of the disclosure, the second feature extraction model can be deployed on line instead of the first feature extraction model. The predicted sample features of the multimedia training samples obtained by using the second feature extraction model in this step may be used to calculate a first loss function in combination with the historical sample features obtained in step S201, so as to be used for training the second feature extraction model. In some practical applications, the untrained second feature extraction model may include extraction parameters to be trained, and the model training method provided by the present disclosure may adjust the extraction parameters, thereby implementing training of the second feature extraction model.
In embodiments of the present disclosure, a first loss function may be calculated from the historical sample features and the predicted sample features. The first loss function may be any one of a 0-1 loss function, an absolute value loss function, a logarithmic loss function, a square loss function, an exponential loss function, a perceptual loss function, a cross-entropy loss function, a mean square error loss function, and the like, and the embodiments of the present application are not limited thereto.
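As one possible instantiation of the options listed above, the following sketch computes the first loss as a mean-squared error between the historical sample features and the predicted sample features; choosing MSE here is an assumption for illustration, since the disclosure allows any of the listed loss functions.

    import torch
    import torch.nn.functional as F

    historical_features = torch.randn(8, 512)   # features previously output by the first feature extraction model (batch of 8)
    predicted_features = torch.randn(8, 512)    # features output by the second feature extraction model for the same samples

    loss_1 = F.mse_loss(predicted_features, historical_features)  # first loss: aligns the two feature spaces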
Step S205, processing the predicted sample features through the first classifier to determine the predicted class features of the multimedia training sample, and calculating a second loss function according to the historical class features and the predicted class features.
The predicted category features of the multimedia training samples obtained by using the first classifier in the embodiment of the present disclosure may be used to calculate a second loss function in combination with the historical category features obtained in step S201, so as to be used for training a second feature extraction model. The second loss function may be any one of a 0-1 loss function, an absolute value loss function, a logarithmic loss function, a square loss function, an exponential loss function, a perceptual loss function, a cross entropy loss function, a mean square error loss function, and the like, and the embodiment of the present application is not limited.
Step S207, train the second feature extraction model according to the first loss function and the second loss function.
In the embodiments of the present disclosure, the second feature extraction model may be trained according to the first loss function calculated in step S203 and the second loss function calculated in step S205. In some practical applications, the second feature extraction model may include extraction parameters; when the second feature extraction model is trained according to the first loss function and the second loss function, these extraction parameters are adjusted to obtain target extraction parameters, and the second feature extraction model with the target extraction parameters is determined as the trained second feature extraction model.
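A compact training-loop sketch under stated assumptions follows: the first loss is taken as MSE over features, the second loss as a KL-divergence term between historical and predicted category distributions, the two losses are simply summed, and only the second feature extraction model's parameters are updated while the first classifier is kept fixed. The loss types, weighting, optimizer, and data format are illustrative choices, not requirements of the disclosure.

    import torch
    import torch.nn.functional as F

    def train_second_extractor(second_extractor, first_classifier, loader, epochs=1, lr=1e-4):
        """loader is assumed to yield (sample_input, historical_features, historical_class_features) per batch."""
        optimizer = torch.optim.Adam(second_extractor.parameters(), lr=lr)
        first_classifier.eval()                              # the first classifier is not updated in this sketch
        for _ in range(epochs):
            for sample_input, hist_feat, hist_cls in loader:
                pred_feat = second_extractor(sample_input)   # predicted sample features
                log_pred_cls = F.log_softmax(first_classifier(pred_feat), dim=-1)  # predicted category distribution (log)

                loss_1 = F.mse_loss(pred_feat, hist_feat)                          # feature-space alignment
                loss_2 = F.kl_div(log_pred_cls, hist_cls, reduction="batchmean")   # category-space alignment

                loss = loss_1 + loss_2
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        return second_extractor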
In the model training method provided by an embodiment of the present disclosure, a historical sample feature and a historical category feature of a multimedia training sample may be obtained first, the historical sample feature is obtained by processing a prediction output of the multimedia training sample through a first feature extraction model, the historical category feature is obtained by processing the historical sample feature through a first classifier, the predicted sample feature of the multimedia training sample is obtained through a second feature extraction model, and the predicted sample feature is processed by using the first classifier to obtain the predicted category feature of the multimedia training sample; and then calculating a first loss function based on the obtained historical sample characteristics and the prediction sample characteristics, calculating a second loss function based on the obtained historical category characteristics and the prediction category characteristics, and training the second characteristic extraction model according to the first loss function and the second loss function. On one hand, since the historical sample features are extracted by the first feature extraction model, and the historical category features are output based on the historical sample feature prediction obtained by the first feature extraction model, after the training, the predicted sample features of the multimedia training sample extracted by the second feature extraction model can be aligned with the historical sample features of the multimedia training sample extracted by the first feature extraction model on the feature space. On the other hand, compared with the prior art, the method and the device have the advantages that the amount of the works needing to be processed in the work feature updating process is less, so that the resource overhead is saved, and the updating efficiency of the work features is improved.
The method provided by the embodiment of the disclosure is explained by taking a multimedia training sample as a video format as an example:
for historical videos existing on a video platform (namely a work sharing platform), selecting partial historical videos as multimedia training samples for training a second feature extraction model, processing the historical videos through a first feature extraction model to obtain historical video features of the historical videos as historical sample features, and processing the historical video features by using a first classifier to obtain corresponding historical category features; after the historical video is selected as the multimedia training sample of the second feature extraction model, the historical sample features and the historical category features of the historical video can be directly obtained. Then, obtaining a prediction sample characteristic of the historical video selected as a multimedia training sample through a second characteristic extraction model, and processing the prediction sample characteristic by utilizing the first classifier to obtain a prediction category characteristic of the historical video; and then calculating a first loss function based on the obtained historical sample characteristics and the prediction sample characteristics, calculating a second loss function based on the obtained historical category characteristics and the prediction category characteristics, and training the second characteristic extraction model according to the first loss function and the second loss function.
Further, in some practical applications, after the second feature extraction model is trained, the second feature extraction model may be deployed on a video platform for extracting video features of a video to be processed, and the video features of the video to be processed may be aligned in a feature space with historical video features already stored on the video platform. The video to be processed may be a video newly uploaded by the user.
In some practical applications, when the video features of the video on the video platform need to be updated, the trained second feature extraction model can be obtained by using the model training method provided by the present disclosure, and then the features of the video to be processed (such as a new video uploaded to the video platform by a user) are extracted by using the trained second feature extraction model, so that the video features of the video to be processed, which are aligned with the features of the historical sample in the feature space, can be obtained. Therefore, in the process of training a second feature extraction model which can meet the requirement that the features of the videos on the video platform are kept consistent, only part of the historical videos are selected as multimedia training samples to be processed; in the related art, all videos on the video platform need to be processed in order to keep the characteristics of the videos on the video platform consistent. Compared with the prior art, the method and the device have the advantages that the amount of videos needing to be processed in the video feature updating process is less, and the effects of saving resource overhead and improving video feature updating efficiency are achieved.
In some embodiments, the second feature extraction model may include an image feature extraction model and a text feature extraction model; and, the step of processing the multimedia training sample by the second feature extraction model to obtain a predicted sample feature of the multimedia training sample may include: processing the multimedia training sample through an image feature extraction model to obtain the image features of the multimedia training sample; processing the multimedia training sample through a text feature extraction model to obtain the text features of the multimedia training sample; and fusing the image characteristics and the text characteristics to obtain the predicted sample characteristics of the multimedia training sample.
In the embodiments of the present disclosure, the image feature extraction model may be a residual neural network (ResNet), a Vision Transformer (ViT), or the like, and the embodiments of the present disclosure are not limited; the text feature extraction model may be a Word2vec (word to vector) model, a Global Vectors for Word Representation (GloVe) model, a Bidirectional Encoder Representations from Transformers (BERT) model, or the like, and the embodiments of the present disclosure are not limited thereto. In some implementations, the second feature extraction model may include: an image feature extraction model with image extraction parameters and a text feature extraction model with text extraction parameters; the model training method provided by the disclosure can adjust the image extraction parameters in the image feature extraction model and the text extraction parameters in the text feature extraction model, thereby realizing the training of the second feature extraction model.
The present exemplary embodiment will be described in more detail below with reference to fig. 3 and the embodiment.
Fig. 3 is a flowchart illustrating a method for obtaining predicted sample features in a model training method according to an exemplary embodiment, and as shown in fig. 3, step S203 in the embodiment of fig. 2 may further include the following steps:
step S301, processing the multimedia training sample through the image feature extraction model to obtain the image features of the multimedia training sample.
In the embodiments of the present disclosure, image feature vectors with a preset number of dimensions can be obtained for the multimedia training samples through the image feature extraction model, and the preset number of dimensions can be adjusted and set according to actual conditions, for example 128, 256, 1024, etc.; the embodiments of the disclosure are not limited. In some practical applications, the setting of the preset number of dimensions may depend on the accuracy required by the downstream service (e.g., a recommendation service): the higher the required accuracy, the larger the preset number of dimensions may be set.
Step S303, processing the multimedia training sample through the text feature extraction model to obtain the text features of the multimedia training sample.
In the embodiment of the disclosure, the text feature vectors of the preset dimensionality number of the multimedia training sample can be obtained through the text feature extraction model, and the length of the text feature vectors can be consistent with the length of the image feature vectors obtained in step S301.
Step S305, fusing the image features and the text features to obtain the predicted sample features of the multimedia training sample.
In the embodiment of the disclosure, after the image features and the text features are fused, multi-modal feature expression can be performed on the multimedia training sample, so that the predicted sample features are more accurately, comprehensively and fully expressed.
Therefore, by implementing the method shown in fig. 3, the image features and the text features can be extracted from the multimedia training sample, so as to obtain the multi-modal features of the multimedia training sample, and accurately, comprehensively and fully express the semantics of the multimedia training sample.
In some embodiments, the step of processing the multimedia training sample through the image feature extraction model to obtain the image features of the multimedia training sample may include: acquiring a cover image of a multimedia training sample; performing frame extraction on the multimedia training samples to obtain a preset number of sampled frame images; and respectively extracting the image characteristics of the cover image and each sampling frame image through an image characteristic extraction model to be used as the image characteristics of the multimedia training sample.
In the embodiments of the present disclosure, the cover image may be obtained from a designated system, or may be obtained from the upload information of the multimedia training sample, where the upload information may be content edited autonomously by the user when uploading a multimedia work, for example: the user can set a cover image for the work being uploaded. When the multimedia training sample is a video, frame extraction processing may be performed on it, for example: randomly extracting a preset number of sampled frame images, or uniformly extracting the preset number of sampled frame images at equal intervals; when the multimedia training sample is a set of image works, image selection processing can be performed on the set, for example: randomly selecting a preset number of image works from the set, or uniformly selecting a preset number of image works at equal intervals from the set. The preset number can be adjusted according to actual conditions, such as 3, 4, 5, etc., and the disclosed embodiments are not limited.
Fig. 4 is a flowchart illustrating an image feature obtaining method in a model training method according to an exemplary embodiment, and as shown in fig. 4, step S301 in the embodiment of fig. 3 may further include the following steps:
step S401, cover images of multimedia training samples are obtained.
Step S403, performing frame extraction on the multimedia training sample to obtain a preset number of sampled frame images.
For example, if the multimedia training sample is a historical video and the preset number is 4, frames may be extracted from the historical video as follows: the start frame, the frame at 1/3 of the total duration, the frame at 2/3 of the total duration, and the last frame of the historical video are extracted.
In step S405, the image features of the cover image and each sample frame image are respectively extracted by the image feature extraction model as the image features of the multimedia training sample.
For example, feature extraction may be performed on one cover image obtained in step S401 and 4 sample frame images obtained in step S403, respectively, to obtain 5 image features, and the 5 image features may be used together as the image features of the multimedia training sample.
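The sketch below reproduces this example under stated assumptions: frames are decoded with OpenCV at the start, 1/3, 2/3, and end positions, and a torchvision ResNet with its classification head removed stands in for the image feature extraction model; neither the decoding library nor the backbone is prescribed by the disclosure.

    import cv2
    import torch
    from torchvision import models, transforms

    def sample_frames(video_path, positions=(0.0, 1/3, 2/3, 1.0)):
        """Extract frames at relative positions of the video (assumed OpenCV-based decoding)."""
        cap = cv2.VideoCapture(video_path)
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        frames = []
        for p in positions:
            cap.set(cv2.CAP_PROP_POS_FRAMES, min(int(p * (total - 1)), total - 1))
            ok, frame = cap.read()
            if ok:
                frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        cap.release()
        return frames

    preprocess = transforms.Compose([
        transforms.ToPILImage(),
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])

    backbone = models.resnet50(weights=None)   # stand-in image feature extraction model
    backbone.fc = torch.nn.Identity()          # emit 2048-d features instead of class logits
    backbone.eval()

    def image_features(cover_image, video_path):
        """Cover image + sampled frames -> one image feature per image (5 features in this example)."""
        images = [cover_image] + sample_frames(video_path)    # cover_image assumed to be an RGB uint8 array
        batch = torch.stack([preprocess(img) for img in images])
        with torch.no_grad():
            return backbone(batch)             # shape (5, 2048)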
It can be seen that by implementing the schematic diagram shown in fig. 4, multi-way image extraction can be performed on the multimedia training sample. The multimedia training samples are converted into the images in various modes, so that a sufficient data base can be provided for image feature extraction, and the comprehensively expressed image features of the multimedia training samples are obtained.
In the embodiment of the disclosure, (preset number +1) image features may be extracted from the multimedia training samples to perform sufficient image feature expression on the multimedia training samples.
In some embodiments, the step of processing the multimedia training sample through the text feature extraction model to obtain the text features of the multimedia training sample may include: respectively identifying the cover image and each sampling frame image by using a character identification technology to obtain a first identification text; identifying the multimedia training sample by using a content identification technology to obtain a second identification text; obtaining a description text of the multimedia training sample, and splicing the description text, the first identification text and the second identification text to obtain a text to be processed of the multimedia training sample; and extracting the text features of the text to be processed through the text feature extraction model to serve as the text features of the multimedia training sample.
In the embodiment of the present disclosure, the description text may be obtained from a designated system, or obtained from the upload information of the multimedia training sample, and in the upload information uploaded along with the work, the user may edit the text description information autonomously, for example: the user can set sentences related to the content of the work for the work; the word description information can be used as description text for extracting text features of the work. The Character Recognition technology may be a method for recognizing characters such as OCR (Optical Character Recognition), and the Content Recognition technology may be a method for recognizing Content such as ACR (Automatic Content Recognition), and the embodiments of the present disclosure are not limited thereto.
The present exemplary embodiment will be described in more detail below with reference to fig. 5, 6 and the embodiment.
Fig. 5 is a schematic diagram illustrating a model training method to obtain a text to be processed according to an exemplary embodiment, and as shown in fig. 5, the method may include the following steps:
firstly, acquiring a cover image and a description text of a multimedia training sample;
then, frame extraction processing is carried out on the multimedia training samples to obtain sampling frame images 1-4 (the number is only used for illustration, and the disclosure is not limited thereto), then the cover image and the sampling frame images 1-4 are respectively identified by using a character identification technology to obtain identification results of 5 text formats, and then splicing is carried out to obtain a first identification text;
then, identifying the whole multimedia training sample by using a content identification technology to obtain an identification result of a text format as a second identification text;
and finally, splicing the description text, the first identification text and the second identification text to obtain the text to be processed.
Therefore, by implementing the schematic diagram shown in fig. 5, multi-way character extraction can be performed on the multimedia training sample, and the multimedia training sample is converted into a text in multiple ways, so that a sufficient data basis can be provided for subsequent text feature extraction.
Fig. 6 is a flowchart illustrating a method for training a model to obtain text features according to an exemplary embodiment, where, as shown in fig. 6, step S303 in the embodiment of fig. 3 may further include:
step S601, respectively identifying a cover image and each sampling frame image by using a character identification technology to obtain a first identification text;
step S603, identifying the multimedia training sample by using a content identification technology to obtain a second identification text;
step S605, splicing the description text, the first identification text and the second identification text to obtain a text to be processed of the multimedia training sample;
and step S607, extracting the text features of the text to be processed through the text feature extraction model to serve as the text features of the multimedia training sample.
Therefore, by implementing the method shown in fig. 6, the multimedia training sample can be fully extracted to obtain the text to be processed, and then the text to be processed is processed by using the text feature extraction model to obtain the text features of the multimedia training sample, which are comprehensively expressed.
In some embodiments, the second feature extraction model may include a multi-head attention layer, and the step of fusing the image features and the text features to obtain the predicted sample features of the multimedia training sample may include: performing fusion processing on the image features and the text features by using the multi-head attention layer to obtain the predicted sample features of the multimedia training sample.
In the embodiment of the present disclosure, a Multi-Head Attention Layer may fuse the image features and the text features obtained in the foregoing steps into a fused expression of the multi-modal features. For example, if 5 512-dimensional image features and one 512-dimensional text feature are obtained, a 512-dimensional predicted sample feature can be obtained through the fusion processing of the multi-head attention layer.
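A minimal sketch of such a fusion in Python (PyTorch) follows. Using the text feature as the attention query over all six modality features is one possible design assumed here for illustration; it is not the only way a multi-head attention layer could perform the fusion.

import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

image_feats = torch.randn(1, 5, 512)   # features of the cover image + 4 sampled frames
text_feat = torch.randn(1, 1, 512)     # one text feature

tokens = torch.cat([image_feats, text_feat], dim=1)         # (1, 6, 512)
fused, _ = attn(query=text_feat, key=tokens, value=tokens)  # (1, 1, 512)
predicted_sample_feature = fused.squeeze(1)                 # (1, 512)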
In some embodiments, the step of training the second feature extraction model according to the first loss function and the second loss function may include: acquiring constraint parameters, and constructing a constraint condition expression based on the constraint parameters, the first loss function and the second loss function; and training the second feature extraction model according to the constraint condition expression.
In the embodiment of the present disclosure, the constraint parameters and the manner in which they are used may be set based on the actual situation. For example, the constraint parameters may be a first weight value corresponding to the first loss function and a second weight value corresponding to the second loss function; in this case, a weighted sum is calculated from the constraint parameters, the first loss function and the second loss function, and the second feature extraction model is trained with minimizing the weighted sum as the training target. The constraint parameter may also be an iteration number threshold used as a condition for stopping the iterative computation: iterative computation is performed with minimizing the first loss function and minimizing the second loss function as targets, and it stops either when the targets are achieved before the threshold is reached, or when the iteration number threshold is reached.
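A minimal sketch of the weighted-sum form of the constraint condition expression follows; the weight values 1.0 and 0.5 are illustrative only and are not fixed by the disclosure.

import torch

def constrained_loss(first_loss: torch.Tensor, second_loss: torch.Tensor,
                     w1: float = 1.0, w2: float = 0.5) -> torch.Tensor:
    # Training minimizes this weighted sum of the two loss functions.
    return w1 * first_loss + w2 * second_loss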
In some practical applications, the first classifier may also include classification parameters, and the model training method of the present disclosure may also train the first classifier, specifically by adjusting the classification parameters in the first classifier. For example, the first classifier may be trained according to the second loss function: the classification parameters are adjusted to obtain target classification parameters of the first classifier, and the first classifier having the target classification parameters is determined as the trained first classifier. The second loss function here is the one calculated from the historical category features and the predicted category features in the previous step.
Fig. 7 is a flowchart illustrating a model training method according to an exemplary embodiment, and as shown in fig. 7, the method provided by the embodiment of the present disclosure may include:
step S701, determining a multimedia training sample, obtaining historical sample characteristics of the multimedia training sample predicted and output by using a first characteristic extraction model, and obtaining historical category characteristics obtained by processing the historical sample characteristics through a first classifier.
Step S703, acquiring a cover image of the multimedia training sample; performing frame extraction on the multimedia training samples to obtain a preset number of sampled frame images; and respectively extracting the image characteristics of the cover image and each sampling frame image through an image characteristic extraction model to be used as the image characteristics of the multimedia training sample.
Step S705, obtaining a description text of the multimedia training sample; respectively identifying the cover image and each sampling frame image by using a character identification technology to obtain a first identification text; identifying the multimedia training sample by using a content identification technology to obtain a second identification text; splicing the description text, the first identification text and the second identification text to obtain a text to be processed of the multimedia training sample; and extracting the text features of the text to be processed through the text feature extraction model to serve as the text features of the multimedia training sample.
And step S707, fusing the image features and the text features by using the multi-head attention layer in the second feature extraction model to obtain the predicted sample features of the multimedia training sample.
Step S709, calculate a first loss function according to the historical sample feature and the predicted sample feature.
Step S711, process the predicted sample features through the first classifier to determine the predicted category features of the multimedia training sample, and calculate a second loss function according to the historical category features and the predicted category features.
Step S713, train the second feature extraction model according to the first loss function and the second loss function.
The same steps in the embodiment shown in fig. 7 as those in the embodiments shown in fig. 2, fig. 3, fig. 4, or fig. 6 may refer to the text descriptions of the embodiments shown in fig. 2, fig. 3, fig. 4, or fig. 6, and the disclosure is not repeated herein.
It can be seen that, by implementing the method shown in fig. 7, the predicted sample feature and the predicted category feature can be obtained according to the second feature extraction model, and then the first loss function and the second loss function can be constructed by combining the obtained historical sample feature and the obtained historical category feature, and the second feature extraction model is trained according to the first loss function and the second loss function, so that the sample feature extracted by the second feature extraction model can be aligned with the sample feature extracted by the first feature extraction model in the feature space.
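A minimal sketch of one such training step in Python (PyTorch) follows, under the assumptions that the first loss is a mean-squared distance between historical and predicted sample features, that the second loss is a cross-entropy between the predicted category scores and the historical categories, and that the two losses are combined by a weighted sum; the disclosure does not fix these concrete forms, so all of them are illustrative choices.

import torch
import torch.nn.functional as F

def training_step(second_model, first_classifier, optimizer,
                  sample, hist_sample_feat, hist_class_feat,
                  w1: float = 1.0, w2: float = 0.5) -> float:
    # Predicted sample features from the second feature extraction model.
    pred_feat = second_model(sample)                       # (B, 512)
    # First loss: align predicted features with historical sample features.
    first_loss = F.mse_loss(pred_feat, hist_sample_feat)
    # Predicted category features from the first classifier.
    pred_class = first_classifier(pred_feat)               # (B, num_classes)
    # Second loss: align predicted categories with the historical categories.
    second_loss = F.cross_entropy(pred_class, hist_class_feat.argmax(dim=1))
    loss = w1 * first_loss + w2 * second_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()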
FIG. 8 is a diagram illustrating a network architecture for implementing a model training method in accordance with an exemplary embodiment. Taking the training of a second feature extraction model 800 capable of extracting video features as an example, as shown in fig. 8, the second feature extraction model 800 is the feature extraction model to be trained and may include an image encoder 806, a deep language coding model 808 and a fusion layer 809. The training may specifically include the following steps:
the server/terminal can obtain a cover image 801 of the selected historical video, and uniformly extract sampling frame images of a preset number (the preset number is 4 in the embodiment) at equal intervals from the historical video to obtain sampling frame images 802-805; then, an image encoder 806 is used for extracting features of the cover image 801 and the 4 sampling frame images to obtain 5 image features; specifically, 5 512-dimensional image features can be obtained;
the server/terminal can also process the historical video to obtain a text 807 to be processed of the historical video; then, feature extraction is performed on the text to be processed using a deep language coding model (BERT) 808 to obtain 1 text feature; specifically, 1 512-dimensional text feature can be obtained; the method for obtaining the text to be processed may refer to fig. 5 and the related embodiments, which are not described herein again;
then, the server/terminal can perform fusion processing on the 5 image features and the 1 text feature through a fusion layer (Multi-Head Attention Layer) 809 to obtain a 512-dimensional predicted sample feature 810 of the video;
furthermore, stored historical sample features 811 and historical category features 812 may be obtained, on one hand, a first loss function 813 may be calculated according to the historical sample features 811 and predicted sample features 810, on the other hand, the predicted sample features 810 may be processed by a first classifier 814 to obtain predicted category features 815, and then a second loss function 816 may be calculated according to the historical category features 812 and the predicted category features 815; the first classifier 814 may be a classification model such as a Multilayer Perceptron (MLP);
finally, the second feature extraction model 800 may be trained according to the first loss function 813 and the second loss function 816.
As can be seen, by implementing the network architecture shown in fig. 8, the predicted sample features and the predicted category features can be obtained according to the second feature extraction model, and then the first loss function and the second loss function can be constructed by combining the obtained historical sample features and the obtained historical category features, and the second feature extraction model is trained according to the first loss function and the second loss function, so that the sample features extracted by the second feature extraction model can be aligned with the historical sample features in the feature space.
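A minimal sketch of the fig. 8 architecture in Python (PyTorch) follows. The concrete image backbone (a ResNet-50 with its classification head replaced), the bert-base-chinese language encoder, the 512-dimensional projections and the choice of the text feature as the fusion query are illustrative assumptions rather than choices fixed by the disclosure.

import torch
import torch.nn as nn
from torchvision.models import resnet50
from transformers import BertModel

class SecondFeatureExtractionModel(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        backbone = resnet50(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, dim)  # image encoder head
        self.image_encoder = backbone
        self.text_encoder = BertModel.from_pretrained("bert-base-chinese")
        self.text_proj = nn.Linear(self.text_encoder.config.hidden_size, dim)
        self.fusion = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, images: torch.Tensor, text_inputs: dict) -> torch.Tensor:
        # images: (B, 5, 3, H, W) -- the cover image plus four sampled frames.
        b, n, c, h, w = images.shape
        img_feats = self.image_encoder(images.reshape(b * n, c, h, w)).reshape(b, n, -1)
        txt_feat = self.text_proj(self.text_encoder(**text_inputs).pooler_output)
        txt_feat = txt_feat.unsqueeze(1)                      # (B, 1, dim)
        tokens = torch.cat([img_feats, txt_feat], dim=1)      # (B, 6, dim)
        fused, _ = self.fusion(txt_feat, tokens, tokens)      # (B, 1, dim)
        return fused.squeeze(1)                               # (B, dim) predicted sample feature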
In some embodiments, the method provided by the embodiments of the present disclosure may further include: acquiring a multimedia file to be processed; processing the multimedia file to be processed through the second feature extraction model to obtain the multimedia features of the multimedia file to be processed; and processing the multimedia features through a second classifier, and determining a recommended label of the multimedia file to be processed.
In the embodiment of the present disclosure, the multimedia file to be processed may be a short video to be processed, a long video to be processed, a live video to be processed, an audio to be processed, an image work set to be processed, or the like. The second classifier can be used for determining a recommended label of the multimedia file to be processed according to the multimedia features of the multimedia file to be processed; the second classifier may be a trained classifier, or a classifier obtained from another system that can be directly deployed and used. After the training of the second feature extraction model is completed, the second feature extraction model can be deployed on line for use, for example: a newly uploaded video of a user can be used as the multimedia file to be processed, the deployed second feature extraction model performs multimedia feature extraction on the multimedia file to be processed, and the second classifier then obtains the recommended label of the multimedia file to be processed according to its multimedia features. For example, the recommended label may be a classification of the multimedia file used for a recommendation scenario, and the specific category content may be set based on the actual situation, for example: football, basketball, baking, Chinese meal, etc. In some practical applications, after the recommended label of the multimedia file to be processed is obtained, the recommended label can be used for calling other services, and can also be displayed on a client for the user. The second classifier may use any classification technology, such as a multilayer perceptron (MLP), a Support Vector Machine (SVM), a K-nearest neighbor (KNN) classifier, a Gaussian Mixture Model (GMM), and the like, and the embodiments of the present disclosure are not limited thereto.
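A minimal sketch of this deployment in Python (PyTorch) follows. The label names and the MLP-style second classifier are illustrative assumptions; second_model stands for any trained second feature extraction model that maps an already-preprocessed sample to a 512-dimensional multimedia feature.

import torch
import torch.nn as nn

labels = ["football", "basketball", "baking", "chinese_meal"]  # example categories only
second_classifier = nn.Sequential(                             # an MLP-style classifier
    nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, len(labels))
)

def recommend_label(second_model, preprocessed_sample) -> str:
    second_model.eval()
    with torch.no_grad():
        multimedia_feature = second_model(preprocessed_sample)  # (1, 512)
        scores = second_classifier(multimedia_feature)          # (1, len(labels))
    return labels[scores.argmax(dim=1).item()]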
FIG. 9 is a diagram illustrating a network architecture for determining a recommended label using the second feature extraction model in accordance with an exemplary embodiment. Taking the multimedia file to be processed being a video to be processed as an example, fig. 9 shows the process of determining the recommended label of the video to be processed. As shown in fig. 9, the second feature extraction model 900 is a feature extraction model that has been trained and deployed on line and may be used to replace the previous first feature extraction model; the second feature extraction model 900 may include an image encoder, a deep language coding model and a fusion layer. The process may specifically include the following steps:
the server/terminal can obtain a cover image 901 of a video to be processed, and uniformly extract a preset number (in this embodiment, the preset number is 4) of sampling frame images at equal intervals from the video to be processed to obtain sampling frame images 902-905; then, an image encoder 906 is used for carrying out feature extraction on the cover image 901 and the 4 sampling frame images to obtain 5 image features;
the server/terminal can also process the video to be processed to obtain a text 907 to be processed of the video to be processed; then, a deep language coding model 908 is used for extracting the features of the text to be processed to obtain 1 text feature;
then, the server/terminal may perform fusion processing on the 5 image features and the 1 text feature through a fusion layer 909 to obtain a 512-dimensional video feature 910 of the video to be processed;
finally, the video features 910 are input into a second classifier 911, so that the second classifier 911 determines a recommended label 912 of the video to be processed according to the video features 910.
The same steps in the embodiment shown in fig. 9 as in the embodiment shown in fig. 8 can be referred to the text description of the embodiment shown in fig. 8, and the details of the present disclosure are not repeated herein.
FIG. 10 is a block diagram illustrating a model training apparatus in accordance with an exemplary embodiment. Referring to fig. 10, a model training apparatus 1000 provided in an embodiment of the present disclosure may include: an obtaining module 1010 configured to determine a multimedia training sample, obtain a history sample characteristic and a history category characteristic of the multimedia training sample; the historical sample features are output by prediction through a first feature extraction model, and the historical category features are obtained by processing the historical sample features through a first classifier; a calculating module 1020 configured to perform processing on the multimedia training sample through the second feature extraction model to obtain a predicted sample feature of the multimedia training sample, and calculate a first loss function according to the historical sample feature and the predicted sample feature; the calculation module 1020 is further configured to perform determining a prediction class feature of the multimedia training sample by processing the prediction sample feature with a first classifier, calculating a second loss function according to the history class feature and the prediction class feature; a training module 1030 configured to perform training the second feature extraction model according to the first loss function and the second loss function.
In some embodiments, the second feature extraction model comprises an image feature extraction model and a text feature extraction model; and the calculating module 1020 executes the step of processing the multimedia training sample by the second feature extraction model to obtain the predicted sample feature of the multimedia training sample, including: processing the multimedia training sample through an image feature extraction model to obtain the image features of the multimedia training sample; processing the multimedia training sample through a text feature extraction model to obtain the text features of the multimedia training sample; and fusing the image characteristics and the text characteristics to obtain the predicted sample characteristics of the multimedia training sample.
In some embodiments, the calculating module 1020 performs the step of processing the multimedia training sample through the image feature extraction model to obtain the image features of the multimedia training sample, including: acquiring a cover image of a multimedia training sample; performing frame extraction on the multimedia training samples to obtain a preset number of sampled frame images; and respectively extracting the image characteristics of the cover image and each sampling frame image through an image characteristic extraction model to be used as the image characteristics of the multimedia training sample.
In some embodiments, the calculating module 1020 performs the step of processing the multimedia training samples through the text feature extraction model to obtain the text features of the multimedia training samples, including: respectively identifying the cover image and each sampling frame image by using a character identification technology to obtain a first identification text; identifying the multimedia training sample by using a content identification technology to obtain a second identification text; obtaining a description text of the multimedia training sample, and splicing the description text, the first identification text and the second identification text to obtain a text to be processed of the multimedia training sample; and extracting the text features of the text to be processed through the text feature extraction model to serve as the text features of the multimedia training sample.
In some embodiments, the second feature extraction model includes a multi-head attention layer, and the calculating module 1020 performs the step of fusing the image features and the text features to obtain the predicted sample features of the multimedia training sample, including: performing fusion processing on the image features and the text features by using the multi-head attention layer to obtain the predicted sample features of the multimedia training sample.
In some embodiments, training module 1030 performs the step of training the second feature extraction model according to the first loss function and the second loss function, including: acquiring constraint parameters, and constructing a constraint condition expression based on the constraint parameters, the first loss function and the second loss function; and training the second feature extraction model according to the constraint conditional expression.
In some embodiments, the model training apparatus 1000 may further include a processing module 1040 configured to perform: acquiring a multimedia file to be processed; processing the multimedia file to be processed through the second feature extraction model to obtain the multimedia features of the multimedia file to be processed; and processing the multimedia features through a second classifier, and determining a recommended label of the multimedia file to be processed.
It can be seen that, with the implementation of the apparatus shown in fig. 10, the historical sample features and the historical category features of the multimedia training samples may be obtained first, and the predicted sample features and the predicted category features of the multimedia training samples may be obtained through the second feature extraction model; and calculating a loss function based on the historical sample characteristics, the historical category characteristics, the prediction sample characteristics and the prediction category characteristics, and training a second characteristic extraction model according to the loss function. Since the historical sample features and the historical category features are obtained according to the first feature extraction model, after the training, the sample features extracted by the second feature extraction model can be aligned with the sample features extracted by the first feature extraction model in the feature space.
In some practical applications, when the video features of the videos on a video platform need to be updated, the trained second feature extraction model can be obtained using the model training method provided by the present disclosure, and the features of a video to be processed (such as a new video uploaded to the video platform by a user) are then extracted with the trained second feature extraction model, so that video features of the video to be processed that are aligned with the historical sample features in the feature space can be obtained. Compared with the related art, fewer videos need to be processed during the video feature update, which saves resource overhead and improves the efficiency of updating video features.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
An electronic device 1100 according to such an embodiment of the disclosure is described below with reference to fig. 11. The electronic device 1100 shown in fig. 11 is only an example and should not bring any limitations to the function and scope of use of the embodiments of the present disclosure.
As shown in fig. 11, electronic device 1100 is embodied in the form of a general purpose computing device. The components of the electronic device 1100 may include, but are not limited to: at least one processing unit 1110, at least one memory unit 1120, a bus 1130 connecting different system components (including the memory unit 1120 and the processing unit 1110), and a display unit 1140.
Where the memory unit stores program code, the program code may be executed by the processing unit 1110 to cause the processing unit 1110 to perform the steps according to various exemplary embodiments of the present disclosure as described in the above-mentioned "exemplary methods" section of this specification. For example, the processing unit 1110 may execute step S201 shown in fig. 2, determine a multimedia training sample, and obtain a history sample characteristic and a history category characteristic of the multimedia training sample; the historical sample features are output by prediction through a first feature extraction model, and the historical category features are obtained by processing the historical sample features through a first classifier; step S203, processing the multimedia training sample through a second feature extraction model to obtain a prediction sample feature of the multimedia training sample, and calculating a first loss function according to the historical sample feature and the prediction sample feature; step S205, processing the prediction sample characteristics through a first classifier to determine the prediction category characteristics of the multimedia training sample, and calculating a second loss function according to the history category characteristics and the prediction category characteristics; step S207, train the second feature extraction model according to the first loss function and the second loss function.
As another example, the electronic device may implement the various steps shown in FIG. 2.
The storage unit 1120 may include readable media in the form of volatile storage units, such as a random access memory unit (RAM) 1121 and/or a cache memory unit 1122, and may further include a read-only memory unit (ROM) 1123.
The storage unit 1120 may also include a program/utility 1124 having a set (at least one) of program modules 1125, such program modules 1125 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 1130 may be representative of one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 1100 may also communicate with one or more external devices 1170 (e.g., keyboard, pointing device, bluetooth device, etc.), one or more devices that enable a user to interact with the electronic device 1100, and/or any devices (e.g., router, modem, etc.) that enable the electronic device 1100 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 1150. Also, the electronic device 1100 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the internet) via the network adapter 1160. As shown, the network adapter 1160 communicates with the other modules of the electronic device 1100 over the bus 1130. It should be appreciated that although not shown, other hardware and/or software modules may be used in conjunction with the electronic device 1100, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
In an exemplary embodiment, a computer-readable storage medium comprising instructions, such as a memory comprising instructions, executable by a processor of an apparatus to perform the above-described method is also provided. Alternatively, the computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, there is also provided a computer program product comprising computer programs/instructions which, when executed by a processor, implement the model training method in the above embodiments.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method of model training, comprising:
determining a multimedia training sample, and acquiring historical sample characteristics and historical category characteristics of the multimedia training sample; wherein the historical sample features are predicted and output by using a first feature extraction model, and the historical category features are obtained by processing the historical sample features through a first classifier;
processing the multimedia training sample through a second feature extraction model to obtain a predicted sample feature of the multimedia training sample, and calculating a first loss function according to the historical sample feature and the predicted sample feature;
processing the prediction sample characteristics through the first classifier to determine prediction category characteristics of the multimedia training samples, and calculating a second loss function according to the history category characteristics and the prediction category characteristics;
training the second feature extraction model according to the first loss function and the second loss function.
2. The model training method of claim 1, wherein the second feature extraction model comprises an image feature extraction model and a text feature extraction model; and
the step of processing the multimedia training sample through the second feature extraction model to obtain the predicted sample feature of the multimedia training sample comprises:
processing the multimedia training sample through the image feature extraction model to obtain the image features of the multimedia training sample;
processing the multimedia training sample through the text feature extraction model to obtain the text feature of the multimedia training sample;
and fusing the image features and the text features to obtain the predicted sample features of the multimedia training sample.
3. The model training method of claim 2, wherein the step of processing the multimedia training sample by the image feature extraction model to obtain the image features of the multimedia training sample comprises:
acquiring a cover image of the multimedia training sample;
performing frame extraction on the multimedia training samples to obtain a preset number of sampling frame images;
and respectively extracting the image characteristics of the cover image and each sampling frame image through the image characteristic extraction model to be used as the image characteristics of the multimedia training sample.
4. The model training method of claim 3, wherein the step of processing the multimedia training sample by the text feature extraction model to obtain the text features of the multimedia training sample comprises:
respectively identifying the cover image and each sampling frame image by using a character identification technology to obtain a first identification text;
identifying the multimedia training sample by using a content identification technology to obtain a second identification text;
obtaining a description text of the multimedia training sample, and splicing the description text, the first identification text and the second identification text to obtain a text to be processed of the multimedia training sample;
and extracting the text features of the text to be processed through the text feature extraction model to serve as the text features of the multimedia training sample.
5. The model training method of claim 1, wherein the step of training the second feature extraction model according to the first loss function and the second loss function comprises:
acquiring a constraint parameter, and constructing a constraint condition expression based on the constraint parameter, the first loss function and the second loss function;
and training the second feature extraction model according to the constraint condition expression.
6. The model training method according to any one of claims 1 to 5, further comprising:
acquiring a multimedia file to be processed;
processing the multimedia file to be processed through the second feature extraction model to obtain the multimedia features of the multimedia file to be processed;
and processing the multimedia features through a second classifier, and determining a recommended label of the multimedia file to be processed.
7. A model training apparatus, comprising:
the acquisition module is configured to determine a multimedia training sample, and acquire historical sample characteristics and historical category characteristics of the multimedia training sample; wherein the historical sample features are predicted and output by using a first feature extraction model, and the historical category features are obtained by processing the historical sample features through a first classifier;
the computing module is configured to process the multimedia training samples through a second feature extraction model to obtain predicted sample features of the multimedia training samples, and compute a first loss function according to the historical sample features and the predicted sample features;
the calculation module is further configured to perform processing the predicted sample features by the first classifier to determine predicted class features of the multimedia training samples, calculate a second loss function from the historical class features and the predicted class features;
a training module configured to perform training the second feature extraction model according to the first loss function and the second loss function.
8. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the executable instructions to implement the model training method of any one of claims 1 to 6.
9. A computer-readable storage medium whose instructions, when executed by a processor of an electronic device, enable the electronic device to perform the model training method of any of claims 1-6.
10. A computer program product comprising computer programs/instructions, characterized in that the computer programs/instructions, when executed by a processor, implement the model training method of any of claims 1 to 6.
CN202111498607.3A 2021-12-09 2021-12-09 Model training method and related equipment Pending CN114187486A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111498607.3A CN114187486A (en) 2021-12-09 2021-12-09 Model training method and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111498607.3A CN114187486A (en) 2021-12-09 2021-12-09 Model training method and related equipment

Publications (1)

Publication Number Publication Date
CN114187486A true CN114187486A (en) 2022-03-15

Family

ID=80604008

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111498607.3A Pending CN114187486A (en) 2021-12-09 2021-12-09 Model training method and related equipment

Country Status (1)

Country Link
CN (1) CN114187486A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115147669A (en) * 2022-06-24 2022-10-04 北京百度网讯科技有限公司 Image processing method, training method and equipment based on visual converter model
WO2024078299A1 (en) * 2022-10-11 2024-04-18 腾讯科技(深圳)有限公司 Feature extraction model processing method and apparatus, feature extraction method and apparatus, and computer device

Similar Documents

Publication Publication Date Title
CN112685565B (en) Text classification method based on multi-mode information fusion and related equipment thereof
CN107832299B (en) Title rewriting processing method and device based on artificial intelligence and readable medium
WO2022022152A1 (en) Video clip positioning method and apparatus, and computer device and storage medium
CN111753060A (en) Information retrieval method, device, equipment and computer readable storage medium
CN113836333A (en) Training method of image-text matching model, method and device for realizing image-text retrieval
CN112015859A (en) Text knowledge hierarchy extraction method and device, computer equipment and readable medium
JP2023529939A (en) Multimodal POI feature extraction method and apparatus
CN110795532A (en) Voice information processing method and device, intelligent terminal and storage medium
CN110659657B (en) Method and device for training model
CN114187486A (en) Model training method and related equipment
CN112188311B (en) Method and apparatus for determining video material of news
CN111625715B (en) Information extraction method and device, electronic equipment and storage medium
CN110633475A (en) Natural language understanding method, device and system based on computer scene and storage medium
CN115115914B (en) Information identification method, apparatus and computer readable storage medium
CN116127060A (en) Text classification method and system based on prompt words
CN114817478A (en) Text-based question and answer method and device, computer equipment and storage medium
CN113723077B (en) Sentence vector generation method and device based on bidirectional characterization model and computer equipment
CN113128526B (en) Image recognition method and device, electronic equipment and computer-readable storage medium
CN111368066B (en) Method, apparatus and computer readable storage medium for obtaining dialogue abstract
CN112052375B (en) Public opinion acquisition and word viscosity model training method and device, server and medium
CN111814496B (en) Text processing method, device, equipment and storage medium
CN110807097A (en) Method and device for analyzing data
CN113705293A (en) Image scene recognition method, device, equipment and readable storage medium
CN112633004A (en) Text punctuation deletion method and device, electronic equipment and storage medium
CN116956117A (en) Method, device, equipment, storage medium and program product for identifying label

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination