CN112214639B - Video screening method, video screening device and terminal equipment - Google Patents


Info

Publication number: CN112214639B
Application number: CN202011178088.8A
Authority: CN (China)
Prior art keywords: video, classification model, videos, training, classification
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN112214639A
Inventor: 尹康
Current Assignee: Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee: Guangdong Oppo Mobile Telecommunications Corp Ltd
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202011178088.8A
Publication of CN112214639A
Application granted
Publication of CN112214639B


Classifications

    • G06F16/735: Information retrieval of video data; Querying; Filtering based on additional data, e.g. user or group profiles
    • G06F18/214: Pattern recognition; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/22: Pattern recognition; Matching criteria, e.g. proximity measures
    • G06F18/24: Pattern recognition; Classification techniques
    • G06V20/40: Image or video recognition or understanding; Scenes; Scene-specific elements in video content


Abstract

The application provides a video screening method, which comprises the following steps: training a first classification model based on a video training set to obtain a trained first classification model; for each video among a plurality of basic videos and a plurality of extended videos, inputting the video into the trained first classification model to obtain a feature vector extracted by the trained first classification model for the video; and screening target videos out of the video training set according to the feature vectors corresponding to the respective videos in the video training set. By the method, the quality of a video data set can be improved.

Description

Video screening method, video screening device and terminal equipment
Technical Field
The present application relates to the field of video processing technologies, and in particular, to a video screening method, a video screening apparatus, a terminal device, and a computer readable storage medium.
Background
At present, various machine learning models are widely applied to scenes such as classification, detection and the like of images and videos. In a practical application scenario, for a given task (e.g., video classification), a developer needs to collect a sufficient number of training data sets and train a specified machine learning model (e.g., video classification model) through the training data sets, so that the specified machine learning model obtains better performance for the given task. It can be seen that the quality of the training dataset is one of the key factors that determine the actual performance of the machine learning model.
In application scenarios such as video classification, a video dataset is much harder to collect than an image dataset because of its large data volume and high labeling cost. As a result, the quality of current video datasets is often poor, which limits the performance of video classification models trained on them in specific applications such as model training through a video training set.
Disclosure of Invention
The embodiment of the application provides a video screening method, a video screening device, terminal equipment and a computer readable storage medium, which can improve the quality of a video data set.
In a first aspect, an embodiment of the present application provides a video screening method, including:
training a first classification model based on a video training set to obtain a trained first classification model, wherein the video training set includes a plurality of basic videos and extended videos corresponding to the respective basic videos, and each extended video is obtained from its corresponding basic video;
for each video among the plurality of basic videos and the plurality of extended videos, inputting the video into the trained first classification model to obtain a feature vector extracted by the trained first classification model for the video;
and screening target videos out of the video training set according to the feature vectors corresponding to the respective videos in the video training set.
In a second aspect, an embodiment of the present application provides a video screening apparatus, including:
a first training module, used for training a first classification model based on a video training set to obtain a trained first classification model, wherein the video training set includes a plurality of basic videos and extended videos corresponding to the respective basic videos, and each extended video is obtained from its corresponding basic video;
a feature extraction module, used for inputting, for each video among the plurality of basic videos and the plurality of extended videos, the video into the trained first classification model to obtain the feature vector extracted by the trained first classification model for the video;
and a screening module, used for screening target videos out of the video training set according to the feature vectors corresponding to the respective videos in the video training set.
In a third aspect, an embodiment of the present application provides a terminal device, including a memory, a processor, a display, and a computer program stored in the memory and capable of running on the processor, where the processor implements the video screening method according to the first aspect when executing the computer program.
In a fourth aspect, an embodiment of the present application provides a computer readable storage medium storing a computer program which, when executed by a processor, implements the video screening method according to the first aspect.
In a fifth aspect, an embodiment of the present application provides a computer program product, which when run on a terminal device, causes the terminal device to perform the video screening method of the first aspect.
Compared with the prior art, the embodiment of the application has the beneficial effects that: in the embodiment of the application, the first classification model can be trained based on the video training set to obtain the trained first classification model, wherein the video training set includes a plurality of basic videos and the extended videos corresponding to the respective basic videos. Because each extended video is obtained from its corresponding basic video, each extended video has a certain similarity to that basic video; training the first classification model on the basic videos and their corresponding extended videos therefore enables the trained first classification model to better recognize similar videos and to extract similar feature vectors from them, which ensures the accuracy of the feature vectors the trained first classification model extracts from input videos during subsequent processing. Then, for each video among the plurality of basic videos and the plurality of extended videos, the video is input into the trained first classification model to obtain the feature vector extracted by the trained first classification model for that video. The feature vector of each video can thus be extracted by the trained first classification model, and target videos can be screened out of the video training set according to the feature vectors corresponding to the respective videos, so that each video in the video training set can be data-cleaned based on the feature vectors as required to obtain target videos that meet expectations. Because the target videos are obtained by data-cleaning the videos in the video training set based on the feature vectors, the resulting target videos better meet expectations, yielding a higher-quality video data set.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments or the description of the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a video screening method according to an embodiment of the present application;
FIG. 2 is a flowchart of step S101 according to an embodiment of the present application;
FIG. 3 is a schematic diagram of training the first classification model and the third classification model according to an embodiment of the application;
fig. 4 is a schematic structural diagram of a video screening apparatus according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, techniques, etc., in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It should be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in the present description and the appended claims, the term "if" may be interpreted as "when..once" or "in response to a determination" or "in response to detection" depending on the context. Similarly, the phrase "if a determination" or "if a [ described condition or event ] is detected" may be interpreted in the context of meaning "upon determination" or "in response to determination" or "upon detection of a [ described condition or event ]" or "in response to detection of a [ described condition or event ]".
Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in the specification are not necessarily all referring to the same embodiment, but mean "one or more but not all embodiments" unless expressly specified otherwise. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
The video screening method provided by the embodiments of the application can be applied to terminal devices such as servers, desktop computers, mobile phones, tablet computers, wearable devices, vehicle-mounted devices, augmented reality (AR)/virtual reality (VR) devices, notebook computers, ultra-mobile personal computers (UMPC), netbooks, and personal digital assistants (PDA); the embodiments of the application do not limit the specific type of terminal device.
In practice, in video classification application scenarios, a video dataset is much harder to collect than an image dataset because of its large data volume and high labeling cost. The quality of current video datasets is therefore often poor, and the number of actually effective videos in a video training set is small, which increases risks such as overfitting and limits the generalization performance of the video classification model.
Existing methods for obtaining a higher-quality video training set often rely on conventional computer vision (CV) feature descriptors, such as Scale-Invariant Feature Transform (SIFT) descriptors or Histogram of Oriented Gradients (HOG) descriptors, to measure the similarity between videos in the training set and screen out higher-quality videos for training. However, these feature descriptors have limited representational power and often cannot characterize videos effectively, so higher-quality videos cannot be screened accurately, which greatly affects the performance of the trained video classification model and limits other applications of the video data set.
According to the embodiment of the application, the first classification model can be trained on a video training set that includes a plurality of basic videos and the extended videos corresponding to the respective basic videos, which ensures the accuracy of the feature vectors the trained first classification model extracts from input videos. Then, for each of the basic videos and extended videos, the video is input into the trained first classification model to obtain the feature vector extracted by the trained first classification model for that video, and target videos that better meet expectations are screened out of the video training set according to these feature vectors.
Specifically, fig. 1 shows a flowchart of a video screening method provided by an embodiment of the present application, where the video screening method may be applied to a terminal device.
As shown in fig. 1, the video screening method may include:
Step S101, training a first classification model based on a video training set to obtain a trained first classification model, wherein the video training set comprises a plurality of basic videos and expansion videos corresponding to the basic videos respectively, and each expansion video is obtained according to the corresponding basic video.
In the embodiment of the present application, the first classification model may be a model capable of classifying video. The first classification model may be a machine learning model such as a convolutional neural network (Convolutional Neural Networks, CNN) model, for example. The structure of the first classification model is not limited herein.
In some embodiments, each of the base videos may respectively correspond to a preset tag. At this time, the base video may be a video corresponding to a preset tag in the video training set. For example, the preset tag may include information such as a content identifier and a video number. The acquisition mode of the preset label corresponding to the basic video can be various. For example, the preset label may be obtained by manual labeling, or may be obtained by an algorithm such as keyword extraction or other information extraction. The preset label can be used for evaluating a training result through a loss function and the like when the first classification model is trained, namely evaluating the classification precision of the first classification model, so as to judge whether training is completed.
The extended video may be obtained in advance from a corresponding base video. The number of the expansion videos corresponding to each basic video can be different or the same. For example, in some examples, each base video may correspond to 10 extension videos, respectively. The label corresponding to the expanded video can be a preset label of the corresponding basic video, and can be obtained by extracting the preset label of the corresponding basic video. For example, if the extended video includes only a part of the content in the base video, the part associated with the part of the content in the preset tag may be used as the tag of the extended video.
According to the embodiment of the application, the first classification model is trained according to the plurality of basic videos and the expansion videos corresponding to the basic videos, so that the trained first classification model can better identify similar videos and can extract similar feature vectors from the similar videos, and the accuracy of the first classification model in feature extraction of the videos is improved in the training process.
In some embodiments, each video in the video training set meets a preset format condition, so that the formats of each video in the video training set are kept uniform, and the corresponding classification model is convenient to read and process.
For example, the preset labels of the basic videos in the video training set may be fixed-dimension label vectors, and the basic videos in the video training set may have the same file type, the same number of video frames, the same video duration, the same video frame size, and/or the same value range for the pixel points in the video frames.
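As a rough illustration only, the following Python sketch normalizes a decoded video (a list of frames) to one fixed format. The frame count of 32, the 224x224 frame size, the [0, 1] pixel range, and the use of NumPy and OpenCV are assumptions chosen for the example, not values or tools required by the application.

    import numpy as np
    import cv2  # assumed available for frame resizing

    TARGET_FRAMES = 32          # assumed fixed number of video frames
    TARGET_SIZE = (224, 224)    # assumed fixed frame size (width, height)

    def normalize_video(frames):
        """Adjust a list of H x W x 3 uint8 frames to a uniform format:
        fixed frame count, fixed frame size, pixel values scaled to [0, 1]."""
        # Uniformly sample (or repeat) frames to reach the target frame count.
        idx = np.linspace(0, len(frames) - 1, TARGET_FRAMES).astype(int)
        frames = [frames[i] for i in idx]
        # Resize every frame to the target size and scale pixel values.
        frames = [cv2.resize(f, TARGET_SIZE).astype(np.float32) / 255.0 for f in frames]
        return np.stack(frames)  # shape: (TARGET_FRAMES, H, W, 3)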
The extended video may be obtained in multiple ways, for example, the extended video may be obtained by sampling a corresponding base video; and/or, the extended video may be obtained by extracting an image of a designated image area in each video frame of the corresponding base video, and/or the extended video may be obtained by adding designated noise to each video frame of the corresponding base video.
In some examples, for each base video, an extended video of the base video may be generated in accordance with the following embodiments alone or in combination.
In some embodiments, before training the first classification model based on the video training set, the method includes:
for each basic video, sampling the basic video at a preset sampling rate to obtain some or all of the extended videos corresponding to the basic video, wherein if more than two extended videos are obtained by sampling, the initial sampling frames in the basic video corresponding to the respective sampled extended videos are different.
In the embodiment of the application, the initial sampling frame corresponding to each extended video can be determined according to the scene requirement. For example, for the base video a, the extended video A1 of the base video a may be obtained by sampling the 0 th, 5 th, 10 th, 15 th, … th frames of the base video a, and the extended video A2 of the base video a may be obtained by sampling the 3 rd, 8 th, 13 th, 18 th, … th frames of the base video a.
Optionally, so that each video in the video training set meets the preset format condition and the formats of the videos in the video training set remain uniform, after the basic video is sampled to obtain a sampled extended video, the basic video and/or the sampled extended video can be adjusted so that both meet the preset format condition. For example, the preset format condition may be that the file type is a specified type, the number of video frames is a specified number, the video duration is a specified duration, the video frame size is a specified size, and/or the value range of the pixel points in the video frames is a specified range. The basic video meeting the preset format condition and the sampled extended video meeting the preset format condition can then be used as at least part of the basic videos and at least part of the extended videos in the video training set.
In some embodiments, before training the first classification model based on the video training set, the method includes:
for each basic video, performing image extraction on a designated image region in each video frame of the basic video to obtain some or all of the extended videos corresponding to the basic video, wherein if more than two extended videos are obtained by image extraction, the designated image regions in the basic video corresponding to the respective extracted extended videos are different.
The specified image areas corresponding to the respective expanded videos obtained by image extraction in the basic video may be different in size or may be different in area position in the corresponding basic video. For example, for the base video B, the extended video B1 of the base video B may be obtained by extracting an h×w image sub-region in the upper left corner of each frame of video frame in the base video B, and the extended video B2 of the base video B may be obtained by extracting an h×w image sub-region in the lower right corner of each frame of video frame in the base video B.
Optionally, so that each video in the video training set meets the preset format condition and the formats of the videos in the video training set remain uniform, after image extraction is performed on the basic video to obtain an extracted extended video, the basic video and/or the extracted extended video can be adjusted so that both meet the preset format condition. For example, the preset format condition may be that the file type is a specified type, the number of video frames is a specified number, the video duration is a specified duration, the video frame size is a specified size, and/or the value range of the pixel points in the video frames is a specified range. The basic video meeting the preset format condition and the extracted extended video meeting the preset format condition can then be used as at least part of the basic videos and at least part of the extended videos in the video training set.
In some embodiments, before training the first classification model based on the video training set, the method includes:
for each basic video, adding specified noise to each video frame of the basic video to obtain some or all of the extended videos corresponding to the basic video, wherein if more than two extended videos are obtained by adding specified noise, the specified noise corresponding to each of these extended videos is different.
In the embodiment of the present application, the specified noise may be gaussian distributed noise or uniformly distributed noise, for example. The specific noise corresponding to each expanded video obtained by adding the specific noise can be different in distribution mode or different in noise size.
It should be noted that the above embodiments for acquiring the extended videos corresponding to the basic videos in the video training set may be used separately or in combination. For example, for a certain basic video, its extended videos may include one extended video obtained by sampling, two extended videos obtained by image extraction, and one extended video obtained by adding specified noise. In another scenario, for another basic video, an extended video may be obtained by sampling and then performing image extraction. The extended videos corresponding to the basic videos can therefore be acquired in a variety of ways, as illustrated in the sketch below.
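The three extension methods described above can be sketched with plain array operations. The following is a minimal illustration that assumes a video is represented as a NumPy array of shape (frames, height, width, channels); the sampling step, crop size, and noise level are illustrative assumptions rather than values specified by the application.

    import numpy as np

    def extend_by_sampling(video, start_frame=0, step=5):
        # Sample every `step`-th frame, beginning at a chosen initial frame;
        # different start_frame values yield different extended videos.
        return video[start_frame::step]

    def extend_by_cropping(video, top=0, left=0, h=112, w=112):
        # Extract the same h x w image region from every frame;
        # different (top, left) positions yield different extended videos.
        return video[:, top:top + h, left:left + w, :]

    def extend_by_noise(video, sigma=0.01, rng=None):
        # Add Gaussian noise to every frame; a different sigma (or uniformly
        # distributed noise instead) yields a different extended video.
        rng = np.random.default_rng() if rng is None else rng
        noisy = video.astype(np.float32) + rng.normal(0.0, sigma, size=video.shape)
        return np.clip(noisy, 0.0, 1.0)

    # Example: build several extended videos from one basic video.
    base = np.random.rand(60, 224, 224, 3).astype(np.float32)  # stand-in basic video
    extended = [
        extend_by_sampling(base, start_frame=0, step=5),
        extend_by_sampling(base, start_frame=3, step=5),
        extend_by_cropping(base, top=0, left=0),          # upper-left region
        extend_by_cropping(base, top=112, left=112),      # lower-right region
        extend_by_noise(base, sigma=0.01),
    ]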
In some embodiments, the step S101 may specifically include:
Step S201, randomly acquiring a first video and a second video from the video training set in each iterative training;
Step S202, inputting the first video into the first classification model, inputting the second video into a third classification model, and obtaining a first training result obtained by the first classification model for the first video and a second training result obtained by the third classification model for the second video, wherein the structure of the third classification model is the same as that of the first classification model;
Step S203, based on the first training result and the second training result, obtaining a current loss value according to a preset loss function, and judging whether the current loss value meets a preset condition;
Step S204, if the current loss value meets a preset condition, the first classification model is used as a trained first classification model;
Step S205, if the current loss value does not meet the preset condition, updating the first classification model according to the first training result and the second training result, and executing the next iteration training according to the updated first classification model.
In the embodiment of the present application, the first classification model and the third classification model may form a twin structure, that is, a structure having two identical branches. The structure is then iteratively updated according to a preset loss function until the loss value of the preset loss function for the twin structure obtained after the iterative updating meets a preset condition, at which point the training process ends.
Through the twin structure formed by the first classification model and the third classification model, training results corresponding to the first classification model and the third classification model can be mutually verified in the training process, and parameters of the first classification model and the third classification model can be updated in subsequent training iterations.
In addition, the first video and the second video corresponding to each iterative training are randomly acquired from the video training set, so that the first video and the second video may be similar videos, for example, the first video is a base video, and the second video may be an extended video corresponding to the base video. Or the first video and the second video may have a large difference from each other. At this time, when training is iterated each time, whether the first classification model and the third classification model can better identify similar features in the input video or not can be judged according to the first training result and the second training result, and different features in the input video are distinguished, so that the first classification model and the third classification model which are trained can better identify similar videos, and similar feature vectors can be extracted from the similar videos, and in the subsequent processing process, the accuracy of extracting the feature vectors from the input video by the first classification model which is trained is guaranteed.
It should be noted that, after training the first classification model and the third classification model is completed, parameters in the obtained trained first classification model and the trained third classification model may be identical, so in practical application, any branch in the twin structure may be used as the first classification model, and the other branch may be used as the third classification model.
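As a minimal sketch of such a twin structure, the following assumes PyTorch and a toy 3D-convolutional branch; the layer sizes, feature dimension, and number of classes are illustrative assumptions and do not reflect the actual structure of the first classification model.

    import torch
    import torch.nn as nn

    class SimpleVideoClassifier(nn.Module):
        """Toy video classification branch: 3D conv backbone plus classifier."""
        def __init__(self, num_classes=10, feat_dim=128):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool3d(1), nn.Flatten(),
                nn.Linear(16, feat_dim), nn.ReLU(),
            )
            self.classifier = nn.Linear(feat_dim, num_classes)

        def forward(self, x):               # x: (batch, 3, frames, H, W)
            feat = self.backbone(x)         # feature vector from the designated middle layer
            logits = self.classifier(feat)  # classification result
            return feat, logits

    # Twin structure: two branches with the same architecture.
    first_model = SimpleVideoClassifier()
    third_model = SimpleVideoClassifier()

Each branch returns both the feature vector of the designated middle layer and the classification result, which is what the loss functions described next consume.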
In some embodiments, the preset loss function includes a first classification loss function for the first classification model, a second classification loss function for the third classification model, and a similarity loss function between the first classification model and the third classification model;
the first training result comprises a first feature vector of a designated middle layer in the first classification model, which is output by aiming at the first video, and further comprises a first classification result of the first classification model on the first video;
the second training result comprises a second feature vector of a designated middle layer in the third classification model, which is output by aiming at the second video, and further comprises a second classification result of the third classification model on the second video;
The step of obtaining a current loss value according to a preset loss function based on the first training result and the second training result and judging whether the current loss value meets a preset condition comprises the following steps:
Calculating a first loss value according to the first feature vector, the second feature vector and the similarity loss function;
calculating a second loss value according to the first classification result and the first classification loss function;
calculating a third loss value according to the second classification result and the second classification loss function;
Calculating the current loss value according to the first loss value, the second loss value and the third loss value;
and determining whether the current loss value meets a preset condition.
Fig. 3 illustrates the training of the first classification model and the third classification model.
The first classification model can output a first feature vector by specifying a middle layer, the third classification model can output a second feature vector by specifying a middle layer, and then a first loss value can be calculated according to the first feature vector, the second feature vector and the similarity loss function. At this time, the first loss value may be used to indicate a similarity loss between the first classification model and the third classification model.
In addition, the first classification model may output a first classification result for the first video, and the third classification model may output a second classification result for the second video, so that a second loss value may be calculated according to the first classification result and the first classification loss function, and a third loss value may be calculated according to the second classification result and the second classification loss function. At this time, the second loss value may be used to indicate a classification loss of the first classification model, and the third loss value may be used to indicate a classification loss of the third classification model.
Thus, in combination with the first, second and third loss values, the current loss value may be calculated. Specifically, the current loss value may be calculated by weighting and summing according to weights corresponding to the first loss value, the second loss value and the third loss value, so as to more comprehensively evaluate whether the first classification model and the third classification model are trained.
Illustratively, the similarity loss function L_REG may be:
L_REG = max(0, α - δ(y1 = y2) · D(f1 - f2))
where f1 is the first feature vector, f2 is the second feature vector, D(·) is a selected distance function, δ(·) is taken to be 1 if and only if the first video and the second video are associated with the same basic video and 0 otherwise, and α is a first preset weight.
The first classification loss function L_CE1 and the second classification loss function L_CE2 may be cross-entropy loss functions.
The preset loss function Loss may be:
Loss = L_CE1 + L_CE2 + β · L_REG
where β is a second preset weight.
Of course, other arrangements of the preset loss function are possible, and the above description is only an exemplary illustration of the preset loss function, and is not limiting.
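Under the assumptions of the PyTorch sketch above, the preset loss function given by these formulas might be computed roughly as follows; treating D(·) as the Euclidean distance and the particular values of α and β are assumptions made for illustration.

    import torch
    import torch.nn.functional as F

    def preset_loss(feat1, logits1, label1, feat2, logits2, label2,
                    same_base, alpha=1.0, beta=0.5):
        """same_base: 1.0 if the first and second videos are associated with
        the same basic video (delta = 1), otherwise 0.0 (delta = 0)."""
        # Similarity loss L_REG = max(0, alpha - delta * D(f1 - f2))
        d = torch.norm(feat1 - feat2, dim=-1)            # assumed distance function D
        l_reg = torch.clamp(alpha - same_base * d, min=0.0).mean()
        # Cross-entropy classification losses L_CE1 and L_CE2
        l_ce1 = F.cross_entropy(logits1, label1)
        l_ce2 = F.cross_entropy(logits2, label2)
        # Weighted combination: Loss = L_CE1 + L_CE2 + beta * L_REG
        return l_ce1 + l_ce2 + beta * l_reg

In each training iteration of steps S201 to S205, this value would serve as the current loss value that is compared against the preset condition.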
Step S102, inputting the video into the trained first classification model for each video in the basic videos and the extended videos, and obtaining feature vectors obtained by the trained first classification model for the video.
It can be appreciated that in the embodiment of the present application, each video in the video training set includes a base video and an extended video.
In the embodiment of the application, the feature vector for the input video can be obtained through the trained first classification model. Wherein, the number of the feature vectors can be one or more than two. For example, the feature vector may include a class probability vector output by the trained first classification model for the video; the feature vector may further include feature extraction vectors output by one or more intermediate layers in the trained first classification model for the video, for example, feature extraction vectors output by a layer preceding a classifier (typically the last fully connected layer) in the trained first classification model.
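Continuing the earlier PyTorch sketch (and reusing its SimpleVideoClassifier branch as a stand-in for the trained first classification model), extracting both kinds of feature vector for one video might look as follows; the tensor shape is an assumption for illustration.

    import torch

    trained_first_model = SimpleVideoClassifier()        # placeholder for the trained model
    video_tensor = torch.rand(1, 3, 32, 224, 224)        # (batch, channels, frames, H, W)

    with torch.no_grad():
        feat, logits = trained_first_model(video_tensor)
        # Feature extraction vector from the layer before the classifier.
        intermediate_vector = feat.squeeze(0)
        # Class probability vector output for the video.
        class_prob_vector = torch.softmax(logits, dim=-1).squeeze(0)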
In the embodiment of the application, the image characteristics of the video can be represented through the characteristic vector corresponding to the video, so that each video in the video training set can be further screened according to the characteristic vector.
Step S103, screening out target videos from the video training set according to the feature vectors corresponding to the videos in the video training set.
In the embodiment of the application, a plurality of modes for screening the target video from the video training set can be adopted. The target video may be screened from the video training set according to information such as entropy of each feature vector, accumulated playing times corresponding to each video in an actual application scene, video duration corresponding to each video, fuzzy frame duty ratio corresponding to each video, and number of key frames corresponding to each video.
In the embodiment of the application, the feature vectors of each video can be respectively extracted through the first classification model after training, so that the target video is screened out from the video training set according to the feature vectors respectively corresponding to each video in the video training set, and accordingly, the data of each video in the video training set can be cleaned based on the feature vectors according to requirements, and the target video meeting expectations is obtained.
By screening the video training set, some videos in states such as high repetition rate, less information quantity, unclear characteristics and the like can be deleted from the video training set, so that target videos more meeting training requirements can be obtained, and the negative influence of too many videos with poor quality on corresponding video classification models in the training process can be avoided.
In some embodiments, the feature vectors include a third feature vector and a fourth feature vector;
Inputting the video into the trained first classification model for each video of the plurality of base videos and the plurality of extended videos, and obtaining feature vectors obtained by the trained first classification model for the video, wherein the feature vectors comprise:
Inputting the video into the trained first classification model for each of a plurality of the base videos and a plurality of the expanded videos, obtaining a third feature vector of a designated middle layer of the trained first classification model for the video output, and/or obtaining a fourth feature vector of a last layer of the trained first classification model for the video output;
the screening the target video from the video training set according to the feature vectors corresponding to the videos in the video training set, including:
And screening out target videos from the video training set according to the third feature vector and/or the fourth feature vector respectively corresponding to each video in the video training set.
In the embodiment of the present application, the specified intermediate layer may be capable of outputting a third feature vector extracted for the video. The specified middle layer may be determined according to a structure of the first classification model. In some examples, the designated middle layer is a layer prior to a classifier (typically a last fully connected layer) in the trained first classification model. The third feature vector may include feature information extracted from the video, and the fourth feature vector may be a category probability vector.
After the third feature vector and/or the fourth feature vector corresponding to each video are obtained, the image features of the video frames contained in each video can be respectively judged according to the corresponding third feature vector and/or fourth feature vector, and the information such as similarity among the videos is judged, so that the videos in the video training set can be screened according to requirements.
The method for screening the target video from the video training set can be various. For example, clustering may be performed according to each third feature vector to obtain a clustering result, and then a target video may be screened from the clustering result according to each fourth feature vector; or the target video can be screened from the video training set according to the information such as the accumulated playing times of each video in the actual application scene, the video time length of each video, the fuzzy frame duty ratio of each video, the key frame number of each video and the like.
In some embodiments, the screening the target video from the video training set according to the third feature vector and/or the fourth feature vector corresponding to each video in the video training set includes:
Clustering the third feature vectors to obtain a clustering result, wherein the clustering result comprises at least two clusters, and each cluster comprises at least one third feature vector;
And for each cluster, screening out target videos corresponding to the clusters from videos corresponding to the third feature vectors of the clusters.
In the embodiment of the present application, the third feature vectors may be clustered in a plurality of ways; for example, each third feature vector may be clustered by at least one of a K-MEANS clustering algorithm, a density-based clustering algorithm such as DBSCAN (Density-Based Spatial Clustering of Applications with Noise), or an expectation-maximization clustering algorithm using a Gaussian Mixture Model (GMM). In some examples, when clustering the third feature vectors, the cosine distance may be selected as the distance metric between the third feature vectors, which keeps the distances normalized for easier comparison and reduces the performance degradation that manually specified hyperparameters may cause, and the clustering algorithm may be a density-based algorithm such as DBSCAN.
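A rough sketch of such clustering, assuming scikit-learn and the third feature vectors stacked into one matrix; the eps and min_samples values are illustrative assumptions, not parameters given by the application.

    import numpy as np
    from sklearn.cluster import DBSCAN

    # third_vectors: one row per video in the training set (stand-in data here).
    third_vectors = np.random.rand(200, 128).astype(np.float32)

    # Density-based clustering with a cosine distance metric.
    clustering = DBSCAN(eps=0.2, min_samples=5, metric="cosine").fit(third_vectors)
    labels = clustering.labels_          # cluster index per video, -1 marks noise

    clusters = {}
    for idx, lab in enumerate(labels):
        if lab == -1:
            continue                      # skip videos not assigned to any cluster
        clusters.setdefault(lab, []).append(idx)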
After the clustering result is obtained, for each cluster, the target video corresponding to the cluster can be screened from the videos corresponding to the cluster according to the information such as the video duration, the fuzzy frame duty ratio, the key frame number and the like respectively corresponding to the videos respectively corresponding to the third feature vectors of the cluster, so that the target video meeting the preset requirement is obtained.
In some embodiments, for each cluster, selecting a target video corresponding to the cluster from videos corresponding to respective third feature vectors of the cluster includes:
For each cluster, calculating entropy of fourth feature vectors corresponding to the third feature vectors of the cluster respectively;
Determining a target fourth feature vector corresponding to the cluster according to the entropy of each fourth feature vector;
and taking the video corresponding to the target fourth feature vector as the target video corresponding to the cluster.
The entropy of each fourth feature vector can be calculated according to the following formula:
H_B = -Σ_i b_i · log(b_i)
where B is the fourth feature vector, b_i ∈ B, and H_B is the entropy of the fourth feature vector.
After obtaining the entropy of each fourth feature vector, the L fourth feature vectors with the maximum entropy corresponding to the cluster can be used as target fourth feature vectors corresponding to the cluster, so that target videos with better quality can be screened from the cluster, and other similar videos with possibly smaller information content in the video training set can be deleted.
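The entropy-based selection within a cluster could be sketched as follows, assuming each fourth feature vector is a class probability vector; the value of L (the number of videos kept per cluster) is an illustrative assumption.

    import numpy as np

    def entropy(prob_vector, eps=1e-12):
        # H_B = -sum_i b_i * log(b_i), computed over a class probability vector.
        p = np.clip(prob_vector, eps, 1.0)
        return float(-np.sum(p * np.log(p)))

    def select_targets(cluster_indices, fourth_vectors, L=3):
        """Keep the L videos of a cluster whose fourth feature vectors have
        the largest entropy; the rest are treated as redundant similar videos."""
        scored = sorted(cluster_indices,
                        key=lambda i: entropy(fourth_vectors[i]),
                        reverse=True)
        return scored[:L]

    # Example with stand-in data: 10 videos in one cluster, 5 classes.
    fourth_vectors = np.random.dirichlet(np.ones(5), size=10)
    targets = select_targets(list(range(10)), fourth_vectors, L=3)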
In the embodiment of the application, the first classification model can be trained based on the video training set to obtain the trained first classification model, wherein the video training set includes a plurality of basic videos and the extended videos corresponding to the respective basic videos. Because each extended video is obtained from its corresponding basic video, each extended video has a certain similarity to that basic video; training the first classification model on the basic videos and their corresponding extended videos therefore enables the trained first classification model to better recognize similar videos and to extract similar feature vectors from them, which ensures the accuracy of the feature vectors the trained first classification model extracts from input videos during subsequent processing. Then, for each video among the plurality of basic videos and the plurality of extended videos, the video is input into the trained first classification model to obtain the feature vector extracted by the trained first classification model for that video. The feature vector of each video can thus be extracted by the trained first classification model, and target videos can be screened out of the video training set according to the feature vectors corresponding to the respective videos, so that each video in the video training set can be data-cleaned based on the feature vectors as required to obtain target videos that meet expectations. Because the target videos are obtained by data-cleaning the videos in the video training set based on the feature vectors, the resulting target videos better meet expectations, yielding a higher-quality video data set.
In some embodiments, after screening out the target video, further comprising:
And training a second classification model based on the target video to obtain a trained second classification model, wherein the structure of the second classification model is the same as that of the first classification model.
In the embodiment of the present application, the structure of the second classification model is the same as the structure of the first classification model. Because the quality of the target videos obtained by screening is often better, training the second classification model based on the target videos usually yields a trained second classification model with better generalization and other performance, and problems such as overfitting are less likely to occur, so the accuracy of video classification can be improved.
After training the second classification model based on the target videos to obtain the trained second classification model, the method may further include:
acquiring a video to be predicted;
if the format of the video to be predicted does not meet the preset format condition, carrying out format adjustment on the video to be predicted so that the format of the video to be predicted after the format adjustment meets the preset format condition;
inputting the video to be predicted after the format adjustment into the trained second classification model, and obtaining a class vector of the trained second classification model output aiming at the video to be predicted after the format adjustment;
and determining the category of the video to be predicted according to the category vector.
The class vector P may be, for example, P = {p1, p2, …, pn}. The category of the video to be predicted may be determined from the class vector as follows:
traverse P with a preset threshold T; for each pi > T, determine that the video to be predicted belongs to the category corresponding to pi.
Alternatively, the category of the video to be predicted may be determined from the class vector as follows:
select the K largest elements from the class vector P, and determine that the video to be predicted belongs to the categories corresponding to those K elements.
It is understood that the video to be predicted may belong to one category, or may also belong to more than two categories.
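The two ways of reading the class vector described above might be sketched as follows; the threshold T and the value of K are illustrative assumptions.

    import numpy as np

    def categories_by_threshold(class_vector, T=0.5):
        # Every category whose probability exceeds the preset threshold T.
        return [i for i, p in enumerate(class_vector) if p > T]

    def categories_by_top_k(class_vector, K=2):
        # The K categories with the largest probabilities.
        return list(np.argsort(class_vector)[::-1][:K])

    # Example: a class vector P = {p1, p2, ..., pn} for one video to be predicted.
    P = np.array([0.05, 0.62, 0.08, 0.55, 0.10])
    print(categories_by_threshold(P))   # categories whose probability exceeds T
    print(categories_by_top_k(P))       # the two most probable categories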
In the embodiment of the application, the video training set includes a plurality of basic videos; each video in the video training set can be data-cleaned based on the feature vectors as required to obtain target videos that meet expectations, and a second classification model is then trained based on the target videos to obtain a trained second classification model, where the structure of the second classification model is the same as that of the first classification model.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an execution order; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
Corresponding to the video screening method described in the above embodiments, fig. 4 shows a block diagram of a video screening apparatus according to an embodiment of the present application, and for convenience of explanation, only the portions related to the embodiment of the present application are shown.
Referring to fig. 4, the video screening apparatus 4 includes:
The first training module 401 is configured to train the first classification model based on a video training set to obtain a trained first classification model, where the video training set includes a plurality of basic videos and extended videos corresponding to the basic videos, and each extended video is obtained according to the corresponding basic video;
A feature extraction module 402, configured to input, for each of a plurality of the base videos and a plurality of the extended videos, the videos into the trained first classification model, and obtain feature vectors obtained by the trained first classification model for the videos;
And the screening module 403 is configured to screen out a target video from the video training set according to the feature vectors corresponding to the videos in the video training set.
Optionally, the video screening apparatus 4 further includes:
And the second training module is used for training the second classification model based on the target video to obtain a trained second classification model, wherein the structure of the second classification model is the same as that of the first classification model.
Optionally, the video screening apparatus 4 further includes:
The sampling module is used for sampling the basic videos at a preset sampling rate for each basic video to obtain partial or all expanded videos corresponding to the basic videos, wherein if the number of the expanded videos obtained by sampling is more than two, the initial sampling frames corresponding to the expanded videos obtained by sampling in the basic videos are different.
Optionally, the video screening apparatus 4 further includes:
The image extraction module is used for carrying out image extraction on the appointed image areas in each video frame of the basic video aiming at each basic video to obtain partial or all expanded videos corresponding to the basic video, wherein if the number of the expanded videos obtained by image extraction is more than two, the appointed image areas respectively corresponding to each expanded video obtained by image extraction in the basic video are different.
Optionally, the video screening apparatus 4 further includes:
the noise adding module is used for adding specified noise to each video frame in the basic video aiming at each basic video to obtain partial or all expanded videos corresponding to the basic video, wherein if the number of the expanded videos obtained by adding the specified noise is more than two, the specified noise corresponding to each expanded video obtained by adding the specified noise is different.
Optionally, the first training module 401 specifically includes:
The first acquisition unit is used for randomly acquiring a first video and a second video from the video training set in each iterative training;
The first processing unit is used for inputting the first video into the first classification model, inputting the second video into a third classification model, and obtaining a first training result obtained by the first classification model for the first video and a second training result obtained by the third classification model for the second video, wherein the structure of the third classification model is the same as that of the first classification model;
The second processing unit is used for obtaining a current loss value according to a preset loss function based on the first training result and the second training result and judging whether the current loss value meets a preset condition or not;
the third processing unit is used for taking the first classification model as a first classification model after training is completed if the current loss value accords with a preset condition;
And the fourth processing unit is used for updating the first classification model according to the first training result and the second training result if the current loss value does not meet the preset condition, and executing the next iteration training according to the updated first classification model.
Optionally, the preset loss function includes a first classification loss function with respect to the first classification model, a second classification loss function with respect to the third classification model, and a similarity loss function between the first classification model and the third classification model;
the first training result comprises a first feature vector of a designated middle layer in the first classification model, which is output by aiming at the first video, and further comprises a first classification result of the first classification model on the first video;
the second training result comprises a second feature vector of a designated middle layer in the third classification model, which is output by aiming at the second video, and further comprises a second classification result of the third classification model on the second video;
the second processing unit includes:
A first calculating subunit, configured to calculate a first loss value according to the first feature vector, the second feature vector, and the similarity loss function;
a second calculating subunit, configured to calculate a second loss value according to the first classification result and the first classification loss function;
A third calculation subunit, configured to calculate a third loss value according to the second classification result and the second classification loss function;
A fourth calculating subunit, configured to calculate the current loss value according to the first loss value, the second loss value, and the third loss value;
and the determining subunit is used for determining whether the current loss value meets a preset condition.
Optionally, the feature vector includes a third feature vector and a fourth feature vector;
the feature extraction module 402 is specifically configured to:
Inputting the video into the trained first classification model for each of a plurality of the base videos and a plurality of the expanded videos, obtaining a third feature vector of a designated middle layer of the trained first classification model for the video output, and/or obtaining a fourth feature vector of a last layer of the trained first classification model for the video output;
The screening module 403 is specifically configured to:
And screening out target videos from the video training set according to the third feature vector and/or the fourth feature vector respectively corresponding to each video in the video training set.
Optionally, the screening module 403 specifically includes:
the clustering unit is used for clustering the third feature vectors to obtain a clustering result, wherein the clustering result comprises at least two clusters, and each cluster comprises at least one third feature vector;
The screening unit is used for screening target videos corresponding to each cluster from videos corresponding to the third feature vectors of the cluster.
In the embodiment of the application, the first classification model can be trained based on the video training set to obtain the trained first classification model, wherein the video training set includes a plurality of basic videos and the extended videos corresponding to the respective basic videos. Because each extended video is obtained from its corresponding basic video, each extended video has a certain similarity to that basic video; training the first classification model on the basic videos and their corresponding extended videos therefore enables the trained first classification model to better recognize similar videos and to extract similar feature vectors from them, which ensures the accuracy of the feature vectors the trained first classification model extracts from input videos during subsequent processing. Then, for each video among the plurality of basic videos and the plurality of extended videos, the video is input into the trained first classification model to obtain the feature vector extracted by the trained first classification model for that video. The feature vector of each video can thus be extracted by the trained first classification model, and target videos can be screened out of the video training set according to the feature vectors corresponding to the respective videos, so that each video in the video training set can be data-cleaned based on the feature vectors as required to obtain target videos that meet expectations. Because the target videos are obtained by data-cleaning the videos in the video training set based on the feature vectors, the resulting target videos better meet expectations, yielding a higher-quality video data set.
It should be noted that, because the content of information interaction and execution process between the above devices/units is based on the same concept as the method embodiment of the present application, specific functions and technical effects thereof may be referred to in the method embodiment section, and will not be described herein.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, the specific names of the functional units and modules are only for distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
Fig. 5 is a schematic structural diagram of a terminal device according to an embodiment of the present application. As shown in fig. 5, the terminal device 5 of this embodiment includes: at least one processor 50 (only one is shown in fig. 5), a memory 51, and a computer program 52 stored in the memory 51 and executable on the at least one processor 50, the processor 50 implementing the steps in any of the various video screening method embodiments described above when executing the computer program 52.
The terminal device 5 may be a server, a mobile phone, a wearable device, an Augmented Reality (AR)/Virtual Reality (VR) device, a desktop computer, a notebook computer, a palm computer, or other computing devices. The terminal device may include, but is not limited to, a processor 50 and a memory 51. It will be appreciated by those skilled in the art that fig. 5 is merely an example of the terminal device 5 and does not limit the terminal device 5, which may include more or fewer components than shown, or combine certain components, or use different components; for example, it may further include input devices, output devices, network access devices, and the like. The input device may include a keyboard, a touch pad, a fingerprint collection sensor (for collecting fingerprint information of a user and direction information of the fingerprint), a microphone, a camera, and the like, and the output device may include a display, a speaker, and the like.
The processor 50 may be a central processing unit (Central Processing Unit, CPU), and the processor 50 may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field-programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 51 may in some embodiments be an internal storage unit of the terminal device 5, such as a hard disk or a memory of the terminal device 5. The memory 51 may also be an external storage device of the terminal device 5 in other embodiments, for example, a plug-in hard disk, a smart media card (Smart Media Card, SMC), a Secure Digital (SD) card, a flash memory card (Flash Card) or the like, which is provided on the terminal device 5. Further, the memory 51 may include both the internal storage unit and the external storage device of the terminal device 5. The memory 51 is used for storing an operating system, an application program, a boot loader (Boot Loader), data, other programs, and the like, such as program codes of the computer programs. The above-described memory 51 may also be used to temporarily store data that has been output or is to be output.
In addition, although not shown, the terminal device 5 may further include a network connection module, such as a Bluetooth module, a Wi-Fi module, a cellular network module, and so on, which will not be described herein.
In the embodiment of the present application, when the processor 50 executes the computer program 52 to implement the steps in any of the embodiments of the video screening method, the first classification model may be trained based on a video training set to obtain a trained first classification model, where the video training set includes a plurality of basic videos and the extended videos respectively corresponding to the basic videos. Because each extended video is obtained from its corresponding basic video, each extended video has a certain similarity to that basic video; training the first classification model on the plurality of basic videos and the extended videos corresponding to each basic video therefore enables the trained first classification model to better recognize similar videos and to extract similar feature vectors from them, which ensures the accuracy of the feature vectors that the trained first classification model extracts from input videos during subsequent processing. Then, for each video among the plurality of basic videos and the plurality of extended videos, the video is input into the trained first classification model to obtain the feature vector produced by the trained first classification model for the video. Since a feature vector is thus extracted for each video, target videos can be screened out from the video training set according to the feature vectors respectively corresponding to the videos in the video training set; that is, each video in the video training set can be data-cleaned on the basis of its feature vector according to requirements, so that target videos meeting expectations are obtained and a video data set of higher quality results.
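For concreteness, the sketch below compresses one training iteration of the kind summarised above and in claim 1: a first video passes through the first classification model, a second video passes through a structurally identical third classification model, and the current loss value combines the two classification losses with a similarity loss on designated intermediate-layer features. The ResNet-18 backbone, the use of `layer3` as the designated middle layer, the cosine-based similarity loss, and the equal loss weights are assumptions; updating only the first classification model follows the claim wording, while the handling of the third classification model within the step is likewise an assumption.

```python
import copy
import torch
import torch.nn.functional as F
import torchvision

first_model = torchvision.models.resnet18(num_classes=10)
third_model = copy.deepcopy(first_model)  # same structure as the first classification model

# Per the described method, the first classification model is the one being updated.
optimizer = torch.optim.SGD(first_model.parameters(), lr=0.01)

feats = {}
first_model.layer3.register_forward_hook(lambda m, i, o: feats.update(first=o.flatten(1)))
third_model.layer3.register_forward_hook(lambda m, i, o: feats.update(second=o.flatten(1)))

def train_step(first_video, second_video, first_label, second_label):
    first_logits = first_model(first_video)
    second_logits = third_model(second_video)
    # First loss value: similarity loss between the two designated intermediate-layer features.
    first_loss = 1.0 - F.cosine_similarity(feats["first"], feats["second"], dim=-1).mean()
    # Second and third loss values: classification losses of the two models.
    second_loss = F.cross_entropy(first_logits, first_label)
    third_loss = F.cross_entropy(second_logits, second_label)
    current = first_loss + second_loss + third_loss  # current loss value (equal weights assumed)
    optimizer.zero_grad()
    current.backward()
    optimizer.step()
    return current.item()

# One illustrative iteration on random stand-in data (batches of single frames).
loss_value = train_step(torch.rand(4, 3, 224, 224), torch.rand(4, 3, 224, 224),
                        torch.randint(0, 10, (4,)), torch.randint(0, 10, (4,)))
```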
The embodiments of the present application also provide a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the various method embodiments described above.
Embodiments of the present application provide a computer program product enabling a terminal device to carry out the steps of the method embodiments described above when the computer program product is run on the terminal device.
The integrated units described above, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the present application may implement all or part of the flow of the methods of the above embodiments by instructing related hardware through a computer program, where the computer program may be stored in a computer readable storage medium and, when executed by a processor, may implement the steps of each of the method embodiments described above. The computer program comprises computer program code, and the computer program code may be in a source code form, an object code form, an executable file, some intermediate form, or the like. The computer readable medium may include at least: any entity or device capable of carrying the computer program code to a photographing device/terminal apparatus, a recording medium, a computer memory, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, for example, a USB flash drive, a removable hard disk, a magnetic disk, or an optical disk. In some jurisdictions, computer readable media may not be electrical carrier signals and telecommunications signals in accordance with legislation and patent practice.
In the foregoing embodiments, each embodiment is described with its own emphasis; for parts that are not described or illustrated in detail in a particular embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/network device and method may be implemented in other manners. For example, the apparatus/network device embodiments described above are merely illustrative, e.g., the division of modules or elements described above is merely a logical functional division, and there may be additional divisions in actual implementation, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or units, which may be in electrical, mechanical or other forms.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.

Claims (10)

1. A method of video screening comprising:
Training the first classification model based on a video training set to obtain a trained first classification model, wherein the video training set comprises a plurality of basic videos and expansion videos respectively corresponding to the basic videos, and each expansion video is obtained according to the corresponding basic video;
For each video of the plurality of basic videos and the plurality of expanded videos, inputting the video into the trained first classification model to obtain a feature vector obtained by the trained first classification model for the video;
Screening out target videos from the video training set according to the feature vectors respectively corresponding to the videos in the video training set;
training the first classification model based on the video training set to obtain a trained first classification model, wherein the training comprises the following steps:
Randomly acquiring a first video and a second video from the video training set in each iterative training;
Inputting the first video into the first classification model, inputting the second video into a third classification model, and acquiring a first training result obtained by the first classification model for the first video and a second training result obtained by the third classification model for the second video, wherein the structure of the third classification model is the same as that of the first classification model;
based on the first training result and the second training result, obtaining a current loss value according to a preset loss function, and judging whether the current loss value meets a preset condition or not;
If the current loss value meets a preset condition, the first classification model is used as a trained first classification model;
if the current loss value does not meet the preset condition, updating the first classification model according to the first training result and the second training result, and executing the next iteration training according to the updated first classification model;
The preset loss function includes a first classification loss function with respect to the first classification model, a second classification loss function with respect to the third classification model, and a similarity loss function between the first classification model and the third classification model;
the first training result comprises a first feature vector output by a designated middle layer in the first classification model for the first video, and further comprises a first classification result of the first classification model on the first video;
the second training result comprises a second feature vector output by a designated middle layer in the third classification model for the second video, and further comprises a second classification result of the third classification model on the second video;
The step of obtaining a current loss value according to a preset loss function based on the first training result and the second training result and judging whether the current loss value meets a preset condition comprises the following steps:
Calculating a first loss value according to the first feature vector, the second feature vector and the similarity loss function;
calculating a second loss value according to the first classification result and the first classification loss function;
calculating a third loss value according to the second classification result and the second classification loss function;
Calculating the current loss value according to the first loss value, the second loss value and the third loss value;
and determining whether the current loss value meets a preset condition.
2. The video screening method of claim 1, comprising, prior to training the first classification model based on the video training set:
And aiming at each basic video, sampling the basic video at a preset sampling rate to obtain partial or all expanded videos corresponding to the basic video, wherein if the number of the expanded videos obtained by sampling is more than two, the initial sampling frames respectively corresponding to the expanded videos obtained by sampling in the basic video are different.
3. The video screening method of claim 1, comprising, prior to training the first classification model based on the video training set:
And aiming at each basic video, carrying out image extraction on the appointed image area in each video frame of the basic video to obtain partial or all extended videos corresponding to the basic video, wherein if the number of the extended videos obtained by image extraction is more than two, the appointed image areas respectively corresponding to each extended video obtained by image extraction in the basic video are different.
4. The video screening method of claim 1, comprising, prior to training the first classification model based on the video training set:
And adding appointed noise to each video frame in the basic video aiming at each basic video to obtain partial or all expanded videos corresponding to the basic video, wherein if the number of the expanded videos obtained by adding the appointed noise is more than two, the appointed noise respectively corresponding to each expanded video obtained by adding the appointed noise is different.
5. The video screening method according to claim 1, further comprising, after screening out the target video:
And training a second classification model based on the target video to obtain a trained second classification model, wherein the structure of the second classification model is the same as that of the first classification model.
6. The video filtering method according to any one of claims 1 to 5, wherein the feature vectors include a third feature vector and a fourth feature vector;
Inputting the video into the trained first classification model for each video of the plurality of base videos and the plurality of extended videos, and obtaining feature vectors obtained by the trained first classification model for the video, wherein the feature vectors comprise:
For each video of a plurality of the base videos and a plurality of the expanded videos, inputting the video into the trained first classification model, and obtaining a third feature vector output by a designated middle layer of the trained first classification model for the video, and/or a fourth feature vector output by a last layer of the trained first classification model for the video;
the screening the target video from the video training set according to the feature vectors corresponding to the videos in the video training set, including:
And screening out target videos from the video training set according to the third feature vector and/or the fourth feature vector respectively corresponding to each video in the video training set.
7. The video screening method according to claim 6, wherein the screening the target video from the video training set according to the third feature vector and/or the fourth feature vector corresponding to each video in the video training set, includes:
Clustering the third feature vectors to obtain a clustering result, wherein the clustering result comprises at least two clusters, and each cluster comprises at least one third feature vector;
And for each cluster, screening out target videos corresponding to the clusters from videos corresponding to the third feature vectors of the clusters.
8. A video screening apparatus, comprising:
the first training module is used for training the first classification model based on a video training set to obtain a trained first classification model, wherein the video training set comprises a plurality of basic videos and expansion videos corresponding to the basic videos respectively, and each expansion video is obtained according to the corresponding basic video;
The feature extraction module is used for inputting each video of the plurality of basic videos and the plurality of expanded videos into the trained first classification model to obtain the feature vector obtained by the trained first classification model for the video;
The screening module is used for screening out target videos from the video training set according to the feature vectors corresponding to the videos in the video training set respectively;
training the first classification model based on the video training set to obtain a trained first classification model, wherein the training comprises the following steps:
Randomly acquiring a first video and a second video from the video training set in each iterative training;
Inputting the first video into the first classification model, inputting the second video into a third classification model, and acquiring a first training result obtained by the first classification model for the first video and a second training result obtained by the third classification model for the second video, wherein the structure of the third classification model is the same as that of the first classification model;
based on the first training result and the second training result, obtaining a current loss value according to a preset loss function, and judging whether the current loss value meets a preset condition or not;
If the current loss value meets a preset condition, the first classification model is used as a trained first classification model;
if the current loss value does not meet the preset condition, updating the first classification model according to the first training result and the second training result, and executing the next iteration training according to the updated first classification model;
The preset loss function includes a first classification loss function with respect to the first classification model, a second classification loss function with respect to the third classification model, and a similarity loss function between the first classification model and the third classification model;
the first training result comprises a first feature vector output by a designated middle layer in the first classification model for the first video, and further comprises a first classification result of the first classification model on the first video;
the second training result comprises a second feature vector output by a designated middle layer in the third classification model for the second video, and further comprises a second classification result of the third classification model on the second video;
The step of obtaining a current loss value according to a preset loss function based on the first training result and the second training result and judging whether the current loss value meets a preset condition comprises the following steps:
Calculating a first loss value according to the first feature vector, the second feature vector and the similarity loss function;
calculating a second loss value according to the first classification result and the first classification loss function;
calculating a third loss value according to the second classification result and the second classification loss function;
Calculating the current loss value according to the first loss value, the second loss value and the third loss value;
and determining whether the current loss value meets a preset condition.
9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the video screening method according to any of claims 1 to 7 when executing the computer program.
10. A computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the video screening method of any one of claims 1 to 7.
CN202011178088.8A 2020-10-29 2020-10-29 Video screening method, video screening device and terminal equipment Active CN112214639B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011178088.8A CN112214639B (en) 2020-10-29 2020-10-29 Video screening method, video screening device and terminal equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011178088.8A CN112214639B (en) 2020-10-29 2020-10-29 Video screening method, video screening device and terminal equipment

Publications (2)

Publication Number Publication Date
CN112214639A CN112214639A (en) 2021-01-12
CN112214639B true CN112214639B (en) 2024-06-18

Family

ID=74057379

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011178088.8A Active CN112214639B (en) 2020-10-29 2020-10-29 Video screening method, video screening device and terminal equipment

Country Status (1)

Country Link
CN (1) CN112214639B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113672252A (en) * 2021-07-23 2021-11-19 浙江大华技术股份有限公司 Model upgrading method, video monitoring system, electronic equipment and readable storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110990631A (en) * 2019-12-16 2020-04-10 腾讯科技(深圳)有限公司 Video screening method and device, electronic equipment and storage medium
CN111353580A (en) * 2020-02-03 2020-06-30 中国人民解放军国防科技大学 Training method of target detection network, electronic device and storage medium
CN111612093A (en) * 2020-05-29 2020-09-01 Oppo广东移动通信有限公司 Video classification method, video classification device, electronic equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10902616B2 (en) * 2018-08-13 2021-01-26 Nvidia Corporation Scene embedding for visual navigation
CN110781934A (en) * 2019-10-15 2020-02-11 深圳市商汤科技有限公司 Supervised learning and label prediction method and device, electronic equipment and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110990631A (en) * 2019-12-16 2020-04-10 腾讯科技(深圳)有限公司 Video screening method and device, electronic equipment and storage medium
CN111353580A (en) * 2020-02-03 2020-06-30 中国人民解放军国防科技大学 Training method of target detection network, electronic device and storage medium
CN111612093A (en) * 2020-05-29 2020-09-01 Oppo广东移动通信有限公司 Video classification method, video classification device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112214639A (en) 2021-01-12

Similar Documents

Publication Publication Date Title
US20190080177A1 (en) Video detection method, server and storage medium
US8358837B2 (en) Apparatus and methods for detecting adult videos
CN112561080B (en) Sample screening method, sample screening device and terminal equipment
CN110688524B (en) Video retrieval method and device, electronic equipment and storage medium
CN108985190B (en) Target identification method and device, electronic equipment and storage medium
CN111915015B (en) Abnormal value detection method and device, terminal equipment and storage medium
CN114169381A (en) Image annotation method and device, terminal equipment and storage medium
CN112614110B (en) Method and device for evaluating image quality and terminal equipment
CN113963303A (en) Image processing method, video recognition method, device, equipment and storage medium
CN111062440B (en) Sample selection method, device, equipment and storage medium
CN110909817B (en) Distributed clustering method and system, processor, electronic device and storage medium
CN113902944A (en) Model training and scene recognition method, device, equipment and medium
CN111310743A (en) Face recognition method and device, electronic equipment and readable storage medium
CN111428590A (en) Video clustering segmentation method and system
CN112214639B (en) Video screening method, video screening device and terminal equipment
CN113920382A (en) Cross-domain image classification method based on class consistency structured learning and related device
CN116071569A (en) Image selection method, computer equipment and storage device
CN115546554A (en) Sensitive image identification method, device, equipment and computer readable storage medium
CN111125391A (en) Database updating method and device, electronic equipment and computer storage medium
CN115393755A (en) Visual target tracking method, device, equipment and storage medium
CN114445916A (en) Living body detection method, terminal device and storage medium
CN113743533A (en) Picture clustering method and device and storage medium
CN112100427A (en) Video processing method and device, electronic equipment and storage medium
CN113392859A (en) Method and device for determining type of city functional area
CN113128278A (en) Image identification method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant