CN112214639A - Video screening method, video screening device and terminal equipment - Google Patents

Video screening method, video screening device and terminal equipment

Info

Publication number
CN112214639A
CN112214639A
Authority
CN
China
Prior art keywords
video
classification model
videos
training
basic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011178088.8A
Other languages
Chinese (zh)
Other versions
CN112214639B (en)
Inventor
尹康
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd filed Critical Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202011178088.8A priority Critical patent/CN112214639B/en
Publication of CN112214639A publication Critical patent/CN112214639A/en
Application granted granted Critical
Publication of CN112214639B publication Critical patent/CN112214639B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/73 Querying
    • G06F 16/735 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a video screening method, which comprises the following steps: training a first classification model based on a video training set to obtain a trained first classification model; for each video among the plurality of basic videos and the plurality of extended videos, inputting the video into the trained first classification model and obtaining the feature vector produced by the trained first classification model for that video; and screening out target videos from the video training set according to the feature vectors corresponding to the videos in the video training set. By the method, the quality of a video data set can be improved.

Description

Video screening method, video screening device and terminal equipment
Technical Field
The present application belongs to the field of video processing technologies, and in particular, to a video screening method, a video screening apparatus, a terminal device, and a computer-readable storage medium.
Background
At present, various machine learning models are widely applied to scenes such as classification and detection of images and videos. In a practical application scenario, for a given task (e.g., video classification), a developer needs to collect a sufficient number of training data sets and train a specified machine learning model (e.g., video classification model) through the training data sets, so that the specified machine learning model obtains a better performance for the given task. It can be seen that the quality of the training data set is one of the key factors determining the actual performance of the machine learning model.
In application scenarios such as video classification, a video data set is much harder to collect than an image data set because of its large data volume and high labeling cost. As a result, the quality of current video data sets is often poor, which limits the performance of the trained video classification model in specific applications such as training the model on a video training set.
Disclosure of Invention
The embodiment of the application provides a video screening method, a video screening device, a terminal device and a computer readable storage medium, which can improve the quality of a video data set.
In a first aspect, an embodiment of the present application provides a video screening method, including:
training a first classification model based on a video training set to obtain a trained first classification model, wherein the video training set comprises a plurality of basic videos and extended videos corresponding to the basic videos respectively, and each extended video is obtained according to the corresponding basic video;
inputting, for each video among the plurality of basic videos and the plurality of extended videos, the video into the trained first classification model, and obtaining the feature vector produced by the trained first classification model for that video;
and screening out target videos from the video training set according to the feature vectors corresponding to the videos in the video training set.
In a second aspect, an embodiment of the present application provides a video screening apparatus, including:
a first training module, configured to train a first classification model based on a video training set to obtain the trained first classification model, wherein the video training set comprises a plurality of basic videos and extended videos corresponding to the basic videos respectively, and each extended video is obtained according to the corresponding basic video;
a feature extraction module, configured to, for each video among the plurality of basic videos and the plurality of extended videos, input the video into the trained first classification model to obtain the feature vector produced by the trained first classification model for that video;
and a screening module, configured to screen out target videos from the video training set according to the feature vectors corresponding to the videos in the video training set.
In a third aspect, an embodiment of the present application provides a terminal device, which includes a memory, a processor, a display, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the video screening method according to the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where a computer program is stored, and the computer program, when executed by a processor, implements the video screening method according to the first aspect.
In a fifth aspect, an embodiment of the present application provides a computer program product, which, when running on a terminal device, causes the terminal device to execute the video screening method described in the first aspect.
Compared with the prior art, the embodiment of the application has the advantages that: in the embodiment of the application, the first classification model can be trained based on a video training set to obtain the trained first classification model, wherein the video training set comprises a plurality of basic videos and expansion videos corresponding to the basic videos respectively. Each extended video is obtained according to a corresponding basic video, namely, each extended video and the corresponding basic video have certain similarity, so that the first classification model is trained according to a plurality of basic videos and the extended videos corresponding to the basic videos respectively, the trained first classification model can better identify similar videos, similar feature vectors can be extracted from the similar videos, and the accuracy of extracting the feature vectors from the input videos through the trained first classification model is ensured in the subsequent processing process; then, aiming at each video in the plurality of basic videos and the plurality of extended videos, inputting the video into the trained first classification model to obtain a feature vector obtained by the trained first classification model aiming at the video; at this time, the feature vectors of the videos can be respectively extracted through the trained first classification model, so that the target videos are screened out from the video training set according to the feature vectors corresponding to the videos in the video training set, and therefore data cleaning can be performed on the videos in the video training set on the basis of the feature vectors according to requirements, and the target videos meeting expectations are obtained. At this time, the target video is obtained by performing data cleaning on each video in the video training set based on the feature vector, so that the obtained target video is more desirable, and a video data set with higher quality is obtained.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.
Fig. 1 is a schematic flowchart of a video screening method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of step S101 according to an embodiment of the present application;
FIG. 3 is a schematic diagram of training the first classification model and the third classification model according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a video screening apparatus according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when," "upon," "in response to determining," or "in response to detecting." Similarly, the phrase "if it is determined" or "if a [described condition or event] is detected" may be interpreted contextually to mean "upon determining," "in response to determining," "upon detecting [the described condition or event]," or "in response to detecting [the described condition or event]."
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.
The video screening method provided by the embodiment of the application can be applied to terminal devices such as a server, a desktop computer, a mobile phone, a tablet computer, a wearable device, a vehicle-mounted device, an Augmented Reality (AR)/Virtual Reality (VR) device, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, and a Personal Digital Assistant (PDA), and the embodiment of the application does not limit the specific type of the terminal device.
In the practical application process, in the application scene of video classification, compared with an image data set, the video data set has higher collection difficulty due to large data volume and high labeling cost, so that the quality of the current video data set is often poor, and the number of practical effective videos in a video training set is small, so that risks such as overfitting can be increased, and the generalization performance of a video classification model is limited.
The existing method for obtaining a video training set with higher quality usually identifies similarity between videos of the video training set based on a traditional Computer Vision (CV) feature descriptor, such as a Scale-invariant feature transform (SIFT), a Histogram of Oriented Gradients (HOG), and other feature descriptors, so as to screen and obtain a video with higher quality from the video training set for training. However, the feature descriptors have low feature characterization capability and often cannot effectively identify videos, so that videos with high quality cannot be accurately screened, the performance of a trained video classification model is greatly influenced, and other applications of the video data set are limited.
According to the embodiment of the application, the first classification model can be trained with a video training set that comprises a plurality of basic videos and the extended videos corresponding to each basic video, which ensures the accuracy of the feature vectors that the trained first classification model extracts from input videos; then, for each video among the plurality of basic videos and the plurality of extended videos, the video is input into the trained first classification model to obtain the feature vector produced for that video, and the target videos that meet expectations are screened out from the video training set according to these feature vectors.
Specifically, fig. 1 shows a flowchart of a video screening method provided in an embodiment of the present application, where the video screening method can be applied to a terminal device.
As shown in fig. 1, the video screening method may include:
step S101, training a first classification model based on a video training set to obtain a trained first classification model, wherein the video training set comprises a plurality of basic videos and extended videos corresponding to the basic videos respectively, and each extended video is obtained according to the corresponding basic video.
In this embodiment, the first classification model may be a model capable of video classification. For example, the first classification model may be a machine learning model such as a convolutional neural network (CNN) model. The structure of the first classification model is not limited herein.
In some embodiments, each of the base videos may correspond to a preset tag. At this time, the basic video may be a video in the video training set corresponding to a preset tag. Illustratively, the preset tag may include information such as content identification and video number. The preset tag corresponding to the basic video can be obtained in various ways. For example, the preset labels may be obtained by manual labeling, or may be obtained by an algorithm such as keyword extraction or other information extraction. The preset label may be used to evaluate a training result through a loss function and the like when the first classification model is trained, that is, to evaluate the classification accuracy of the first classification model, so as to determine whether training is completed.
The extension video can be obtained in advance according to the corresponding basic video. The number of the extension videos corresponding to each basic video can be different or the same. For example, in some examples, each base video may correspond to 10 extension videos, respectively. The label corresponding to the extended video may be a preset label of the corresponding basic video, and may be obtained by extracting the preset label of the corresponding basic video. For example, if the extension video only includes a part of the content in the base video, a part of the preset tag associated with the part of the content may be used as a tag of the extension video.
In the embodiment of the application, training the first classification model with the plurality of basic videos and the extended videos corresponding to each basic video enables the trained first classification model to better identify similar videos and to extract similar feature vectors from them, thereby improving the accuracy of the feature extraction that the first classification model performs on videos during training.
In some embodiments, each video in the video training set satisfies a preset format condition, so that the format of each video in the video training set is kept uniform, and the corresponding classification model is conveniently read and processed.
For example, the format of the preset label of each base video in the video training set may be a label vector with a fixed dimension, and the file types of each base video in the video training set are the same, the number of frames of video frames is the same, the video duration is the same, the size of video frames is the same, and/or the value ranges of pixels in video frames are the same, and so on.
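As an illustration of such a preset format condition, the following sketch (an assumption, not part of the application) normalizes a video file to a fixed frame count, a fixed frame size, and a fixed pixel value range using OpenCV; the concrete values of 32 frames and 224 x 224 pixels are only examples.

```python
# A minimal sketch (assumed helper, not the patent's procedure) that normalizes a video
# to a preset format: fixed number of frames, fixed frame size, pixel values in [0, 1].
import cv2
import numpy as np

def normalize_video(path, num_frames=32, frame_size=(224, 224)):
    cap = cv2.VideoCapture(path)
    frames = []
    ok, frame = cap.read()
    while ok:
        frames.append(cv2.resize(frame, frame_size))   # enforce the designated frame size
        ok, frame = cap.read()
    cap.release()
    if not frames:
        raise ValueError(f"no frames decoded from {path}")
    # Uniformly pick num_frames indices so every video has the same designated frame count.
    idx = np.linspace(0, len(frames) - 1, num_frames).astype(int)
    clip = np.stack([frames[i] for i in idx]).astype(np.float32) / 255.0
    return clip  # shape: (num_frames, H, W, 3), values in the designated range [0, 1]
```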
The extended video may be obtained in a variety of manners, for example, the extended video may be obtained by sampling a corresponding base video; and/or the extension video can be obtained by performing image extraction on a specified image area in each video frame of the corresponding base video, and/or the extension video can be obtained by adding specified noise to each video frame of the corresponding base video.
In some examples, for each base video, the extension videos of the base video may be generated according to the following embodiments, used alone or in combination.
In some embodiments, before training the first classification model based on the video training set, the method further comprises:
and for each basic video, sampling the basic video at a preset sampling rate to obtain partial or all extended videos corresponding to the basic video, wherein if the number of the extended videos obtained by sampling is more than two, the initial sampling frames respectively corresponding to the extended videos obtained by sampling in the basic video are different.
In the embodiment of the application, the initial sampling frame corresponding to each extension video can be determined according to the scene requirement. For example, for a base video A, the extension video A1 of the base video A may be obtained by sampling the 0th, 5th, 10th, 15th, ... video frames of the base video A, and the extension video A2 of the base video A may be obtained by sampling the 3rd, 8th, 13th, 18th, ... video frames of the base video A.
Optionally, in order to enable each video in the video training set to meet a preset format condition, so that formats of each video in the video training set are kept uniform, after the basic video is sampled and an extended video obtained by sampling is obtained, the basic video and/or the extended video obtained by sampling may be adjusted to obtain the basic video meeting the preset format condition, and the extended video obtained by sampling and meeting the preset format condition is obtained; for example, the preset format condition may be that the file type is a designated type, the number of video frames is a designated frame number, the video duration is a designated duration, the size of a video frame is a designated size, and/or the value range of a pixel point in a video frame is a designated range. Then, the basic video meeting the preset format condition and the extended video obtained by sampling meeting the preset format condition can be used as at least part of the basic video and at least part of the extended video in the video training set.
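For the sampling-based embodiment above, a minimal sketch (illustrative only; the function names and the sampling step are assumptions) of producing extension videos from a decoded base video with different initial sampling frames could look as follows:

```python
# A minimal sketch of sampling-based extension videos, mirroring the example of
# frames 0, 5, 10, 15, ... and 3, 8, 13, 18, ... for the same base video.
import numpy as np

def sample_extension(base_frames, start, step=5):
    """base_frames: array of shape (T, H, W, C); keep every `step`-th frame from `start`."""
    return base_frames[start::step]

def sampling_extensions(base_frames, starts=(0, 3), step=5):
    # Different initial sampling frames yield different, but similar, extension videos.
    return [sample_extension(base_frames, s, step) for s in starts]
```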
In some embodiments, before training the first classification model based on the video training set, the method further comprises:
and for each basic video, performing image extraction on specified image areas in each video frame of the basic video to obtain partial or all extended videos corresponding to the basic video, wherein if the number of the extended videos obtained by the image extraction is more than two, the specified image areas corresponding to the extended videos obtained by the image extraction in the basic video are different.
In an exemplary embodiment, the designated image areas corresponding to the extension videos obtained by image extraction may differ in size, or may differ in position within the corresponding base video. For example, for a base video B, the extension video B1 of the base video B may be obtained by extracting the h x w image sub-region at the upper-left corner of each video frame in the base video B, and the extension video B2 of the base video B may be obtained by extracting the h x w image sub-region at the lower-right corner of each video frame in the base video B.
Optionally, in order to enable each video in the video training set to meet the preset format condition, so that the formats of the videos in the video training set are kept uniform, after the base video is subjected to image extraction and the extended video obtained by image extraction is obtained, the base video and/or the extended video obtained by image extraction may be adjusted to obtain a base video meeting the preset format condition and an extended video, obtained by image extraction, meeting the preset format condition; for example, the preset format condition may be that the file type is a designated type, the number of video frames is a designated frame number, the video duration is a designated duration, the size of a video frame is a designated size, and/or the value range of a pixel point in a video frame is a designated range. Then, the base video meeting the preset format condition and the extended video, obtained by image extraction, meeting the preset format condition can be used as at least part of the base videos and at least part of the extended videos in the video training set.
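For the image-extraction embodiment, a minimal sketch (again an assumption rather than the application's exact procedure) of cropping a designated h x w image region from every video frame, following the example of extension videos B1 and B2:

```python
# A minimal sketch of region-based extension videos: the same designated region is
# extracted from every frame; upper-left and lower-right regions follow the B1/B2 example.
import numpy as np

def crop_extension(base_frames, top, left, h, w):
    """base_frames: (T, H, W, C); crop the same h x w region from every frame."""
    return base_frames[:, top:top + h, left:left + w, :]

def region_extensions(base_frames, h, w):
    T, H, W, C = base_frames.shape
    b1 = crop_extension(base_frames, 0, 0, h, w)          # upper-left h x w region
    b2 = crop_extension(base_frames, H - h, W - w, h, w)  # lower-right h x w region
    return [b1, b2]
```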
In some embodiments, before training the first classification model based on the video training set, the method further comprises:
and adding specified noise to each video frame in the basic video aiming at each basic video to obtain partial or all extended videos corresponding to the basic video, wherein if the number of the extended videos obtained by adding the specified noise is more than two, the specified noise corresponding to each extended video obtained by adding the specified noise is different.
In the embodiment of the present application, the specified noise may be, for example, Gaussian-distributed noise or uniformly distributed noise. The specified noises corresponding to the respective extension videos obtained by adding specified noise may differ in their distribution or in their magnitude.
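A minimal sketch of the noise-based embodiment, adding Gaussian-distributed noise to one extension video and uniformly distributed noise to another; the noise scales are assumed values:

```python
# A minimal sketch of noise-based extension videos: two extension videos of the same base
# video, one with Gaussian-distributed noise and one with uniformly distributed noise.
import numpy as np

def noise_extensions(base_frames, gaussian_std=0.02, uniform_range=0.05, seed=0):
    """base_frames: float array in [0, 1], shape (T, H, W, C)."""
    rng = np.random.default_rng(seed)
    gauss = base_frames + rng.normal(0.0, gaussian_std, size=base_frames.shape)
    unif = base_frames + rng.uniform(-uniform_range, uniform_range, size=base_frames.shape)
    return [np.clip(gauss, 0.0, 1.0), np.clip(unif, 0.0, 1.0)]
```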
It should be noted that the above embodiments for acquiring the extension videos corresponding to the base videos in the video training set may be implemented separately or in combination. For example, for a certain base video, the extension videos of the base video may include 1 extension video obtained by sampling, 2 extension videos obtained by image extraction, and 1 extension video obtained by adding specified noise. Or, in another scenario, for another base video, the extension video may be obtained by sampling followed by image extraction. Therefore, the extension videos corresponding to the base videos can be acquired in various manners.
In some embodiments, the step S101 may specifically include:
step S201, in each iterative training, randomly acquiring a first video and a second video from the video training set;
step S202, inputting the first video into the first classification model, inputting the second video into a third classification model, and obtaining a first training result of the first classification model for the first video and a second training result of the third classification model for the second video, wherein the structure of the third classification model is the same as that of the first classification model;
step S203, based on the first training result and the second training result, obtaining a current loss value according to a preset loss function, and judging whether the current loss value meets a preset condition;
step S204, if the current loss value meets a preset condition, taking the first classification model as a trained first classification model;
step S205, if the current loss value does not meet a preset condition, updating the first classification model according to the first training result and the second training result, and executing the next iterative training according to the updated first classification model.
In this embodiment of the application, the first classification model and the third classification model may form a twin structure, that is, a structure having two identical branches, and then, the structure is iteratively updated according to a preset loss function until a loss value of the preset loss function corresponding to the twin structure obtained after the iterative update meets a preset condition, and the training process is ended.
Through the twin structure formed by the first classification model and the third classification model, the training results corresponding to the first classification model and the third classification model can be mutually verified in the training process, and the training results are used for updating the parameters of the first classification model and the third classification model in the subsequent training iteration.
In addition, the first video and the second video corresponding to each iterative training are randomly acquired from the video training set, so that the first video and the second video may be similar videos to each other, for example, the first video is a base video, and the second video may be an extension video corresponding to the base video. Alternatively, the first video and the second video may be significantly different from each other. At this time, during each training iteration, whether the first classification model and the third classification model can better identify similar features in the input video or not can be judged according to the first training result and the second training result, and different features in the input video can be distinguished, so that the first classification model and the third classification model after training can better identify similar videos, and similar feature vectors can be extracted from the similar videos, so that the accuracy of extracting the feature vectors from the input video through the first classification model after training can be ensured in the subsequent processing process.
It should be noted that after the training of the first classification model and the third classification model is completed, the parameters of the trained first classification model and the trained third classification model may be identical; thus, in practical applications, either branch of the twin structure may be used as the first classification model and the other branch as the third classification model.
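A minimal sketch of the iterative twin-structure training of steps S201 to S205 is given below. It assumes that the model's forward pass returns both the feature vector of the specified middle layer and the classification logits, that the training set is a list of (video tensor, class label, base video id) triples, and that preset_loss is the preset loss function discussed next; the optimizer, learning rate, and stopping threshold are likewise assumptions.

```python
# A minimal sketch (assumptions throughout) of the twin-structure iterative training.
import copy
import random
import torch

def train_twin(first_model, train_set, preset_loss, loss_threshold=0.1, max_iters=10000, lr=1e-3):
    third_model = copy.deepcopy(first_model)     # same structure as the first classification model
    params = list(first_model.parameters()) + list(third_model.parameters())
    optimizer = torch.optim.SGD(params, lr=lr)
    for _ in range(max_iters):
        v1, label1, base1 = random.choice(train_set)   # randomly acquire a first video (S201)
        v2, label2, base2 = random.choice(train_set)   # randomly acquire a second video (S201)
        out1 = first_model(v1.unsqueeze(0))            # first training result: (features, logits) (S202)
        out2 = third_model(v2.unsqueeze(0))            # second training result: (features, logits) (S202)
        loss = preset_loss(out1, out2, label1, label2, same_base=(base1 == base2))  # (S203)
        if loss.item() < loss_threshold:               # preset condition met: training done (S204)
            return first_model
        optimizer.zero_grad()
        loss.backward()                                # update, then run the next iteration (S205)
        optimizer.step()
    return first_model
```

Either branch of the twin structure could be returned here, since both branches share the same structure and are updated jointly.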
In some embodiments, the preset loss functions include a first classification loss function with respect to the first classification model, a second classification loss function with respect to the third classification model, and a similarity loss function between the first classification model and the third classification model;
the first training result comprises a first feature vector output by a specified middle layer of the first classification model for the first video, and further comprises a first classification result of the first classification model for the first video;
the second training result comprises a second feature vector output by a specified middle layer in the third classification model for the second video and also comprises a second classification result of the third classification model for the second video;
the obtaining a current loss value according to a preset loss function based on the first training result and the second training result, and judging whether the current loss value meets a preset condition, including:
calculating a first loss value according to the first feature vector, the second feature vector and the similarity loss function;
calculating a second loss value according to the first classification result and the first classification loss function;
calculating a third loss value according to the second classification result and the second classification loss function;
calculating the current loss value according to the first loss value, the second loss value and the third loss value;
and determining whether the current loss value meets a preset condition.
Fig. 3 is a schematic diagram illustrating the training of the first classification model and the third classification model.
Wherein the first classification model may output a first feature vector by specifying an intermediate layer, the third classification model may output a second feature vector by specifying an intermediate layer, and then, a first loss value may be calculated from the first feature vector, the second feature vector, and the similarity loss function. At this time, the first loss value may be used to indicate a loss of similarity between the first classification model and the third classification model.
Furthermore, the first classification model may output a first classification result for the first video, and the third classification model may output a second classification result for the second video, so that a second loss value may be calculated from the first classification result and the first classification loss function, and a third loss value may be calculated from the second classification result and the second classification loss function. At this time, the second loss value may be used to indicate a classification loss of the first classification model, and the third loss value may be used to indicate a classification loss of the third classification model.
Thus, in combination with the first loss value, the second loss value and the third loss value, the current loss value may be calculated. Specifically, the current loss value may be calculated by performing weighted summation and the like according to weights respectively corresponding to the first loss value, the second loss value and the third loss value, so that whether the training of the first classification model and the third classification model is completed may be more comprehensively evaluated.
Illustratively, the similarity loss function L_REG can be as follows:
L_REG = max(0, α - δ(y1 = y2) · D(f1 - f2))
where f1 is the first feature vector, f2 is the second feature vector, D(·) is a selected distance function, δ(·) takes the value 1 if and only if the first video and the second video are associated with the same base video and 0 otherwise, and α is a first predetermined weight.
The first classification loss function L_CE1 and the second classification loss function L_CE2 may each be a cross-entropy loss function.
The preset loss function Loss may be:
Loss = L_CE1 + L_CE2 + β · L_REG
where β is a second predetermined weight.
Of course, the preset loss function may have other configurations, and the above description is only an exemplary illustration of the preset loss function, and is not limited thereto.
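Under the same assumptions as the training sketch above, the preset loss function combining the two cross-entropy classification losses with the similarity loss L_REG might be written as follows; the choice of the Euclidean norm for D(·) and the values of α and β are assumptions:

```python
# A minimal sketch of the preset loss: Loss = L_CE1 + L_CE2 + beta * L_REG, with
# L_REG = max(0, alpha - delta * D(f1 - f2)). The distance function and weights are assumed.
import torch
import torch.nn.functional as F

def preset_loss(out1, out2, label1, label2, same_base, alpha=1.0, beta=0.1):
    f1, logits1 = out1                 # first feature vector and first classification result
    f2, logits2 = out2                 # second feature vector and second classification result
    delta = 1.0 if same_base else 0.0  # 1 iff both videos come from the same base video
    dist = torch.norm(f1 - f2)         # D(f1 - f2): here the Euclidean norm
    l_reg = torch.clamp(alpha - delta * dist, min=0.0)
    l_ce1 = F.cross_entropy(logits1, torch.tensor([label1]))
    l_ce2 = F.cross_entropy(logits2, torch.tensor([label2]))
    return l_ce1 + l_ce2 + beta * l_reg
```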
Step S102, for each video among the plurality of basic videos and the plurality of extended videos, inputting the video into the trained first classification model and obtaining the feature vector produced by the trained first classification model for that video.
It can be understood that, in the embodiment of the present application, each video in the video training set includes a base video and an extension video.
In the embodiment of the present application, the feature vector for the input video may be obtained through the trained first classification model. The number of the feature vectors may be one or more than two. For example, the feature vector may include a class probability vector output by the trained first classification model for the video; in addition, the feature vector may further include feature extraction vectors output by one or more middle layers in the trained first classification model for the video, for example, feature extraction vectors output by a layer before a classifier (typically, the last fully-connected layer) in the trained first classification model may be included.
In the embodiment of the application, the image features of the video can be represented through the feature vector corresponding to the video, so that each video in the video training set can be further screened according to the feature vector.
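A minimal sketch of extracting, for every base video and extension video, both the feature extraction vector of the layer before the classifier and the class probability vector (referred to below as the third and fourth feature vectors); it reuses the assumption that the trained model returns the intermediate features together with the logits:

```python
# A minimal sketch of per-video feature extraction with the trained first classification model.
import torch
import torch.nn.functional as F

@torch.no_grad()
def extract_feature_vectors(trained_model, videos):
    trained_model.eval()
    third, fourth = [], []
    for v in videos:                                  # base videos and extension videos alike
        feat, logits = trained_model(v.unsqueeze(0))
        third.append(feat.squeeze(0))                 # penultimate-layer feature vector
        fourth.append(F.softmax(logits, dim=1).squeeze(0))  # class probability vector
    return torch.stack(third), torch.stack(fourth)
```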
Step S103, screening out target videos from the video training set according to the feature vectors corresponding to the videos in the video training set.
In the embodiment of the application, various modes of screening the target videos from the video training set can be provided. Illustratively, the target video may be screened out from the video training set according to information such as entropy of each feature vector, cumulative playing times corresponding to each video in an actual application scene, video duration corresponding to each video, ratio of blurred frames corresponding to each video, and number of key frames corresponding to each video.
In the embodiment of the application, the feature vector of each video can be extracted through the trained first classification model, and the target videos are then screened out from the video training set according to the feature vectors corresponding to the videos in the video training set; in this way, data cleaning can be performed on the videos in the video training set based on the feature vectors as required, so that target videos that meet expectations are obtained.
By screening the video training set, some videos with states such as too high repetition rate, less information amount, unclear features and the like can be deleted from the video training set, so that a target video which better meets the training requirement is obtained, and the negative influence of too many videos with poor quality on a corresponding video classification model in the training process is avoided.
In some embodiments, the feature vector comprises a third feature vector and a fourth feature vector;
the step of inputting the video into the trained first classification model for each of the plurality of base videos and the plurality of extended videos to obtain a feature vector obtained by the trained first classification model for the video includes:
inputting the video into the trained first classification model for each of the plurality of base videos and the plurality of extension videos, obtaining a third feature vector output by a designated middle layer of the trained first classification model for the video, and/or obtaining a fourth feature vector output by the last layer of the trained first classification model for the video;
the screening out the target video from the video training set according to the feature vector corresponding to each video in the video training set respectively comprises:
and screening out the target video from the video training set according to the third feature vector and/or the fourth feature vector corresponding to each video in the video training set.
In this embodiment, the specified middle layer may be capable of outputting a third feature vector extracted for the video. The specified middle layer may be determined according to a structure of the first classification model. In some examples, the designated middle layer is a layer preceding the classifier (typically the last fully-connected layer) in the trained first classification model. The third feature vector may include feature information extracted from the video, and the fourth feature vector may be a category probability vector.
After the third feature vector and/or the fourth feature vector respectively corresponding to each video is obtained, the image features of the video frames included in each video can be respectively distinguished according to the corresponding third feature vector and/or the fourth feature vector, and information such as similarity between the videos can be judged, so that the videos in the video training set can be screened according to requirements.
And the method for screening out the target video from the video training set can be various. For example, clustering may be performed according to each third feature vector to obtain a clustering result, and then a target video is screened from the clustering result according to each fourth feature vector; or, the target video may be screened out from the video training set according to information such as the accumulated playing times corresponding to each video in the actual application scene, the video duration corresponding to each video, the ratio of the blurred frames corresponding to each video, and the number of the key frames corresponding to each video.
In some embodiments, the screening out a target video from the video training set according to the third feature vector and/or the fourth feature vector corresponding to each video in the video training set includes:
clustering each third feature vector to obtain a clustering result, wherein the clustering result comprises at least two clusters, and each cluster comprises at least one third feature vector;
and screening out a target video corresponding to each cluster from videos corresponding to each third feature vector of the cluster.
In the embodiment of the present application, there may be a plurality of ways to cluster the third feature vectors; for example, the third feature vectors may be clustered by at least one of a K-means clustering algorithm, a density-based clustering algorithm such as DBSCAN (Density-Based Spatial Clustering of Applications with Noise), an expectation-maximization clustering algorithm using a Gaussian Mixture Model (GMM), and the like. In some examples, when clustering the third feature vectors, in order to facilitate a normalized comparison of the distances between the third feature vectors and to reduce the performance degradation that may be caused by manually specified hyper-parameters, the distance metric between the third feature vectors may be chosen as the cosine distance, and the clustering algorithm may employ a density-based clustering algorithm such as DBSCAN.
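A minimal sketch of this clustering step using scikit-learn's DBSCAN with the cosine distance; the eps and min_samples values are assumptions that would need tuning for a real video training set:

```python
# A minimal sketch of clustering the third feature vectors with DBSCAN and cosine distance.
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_third_vectors(third_vectors, eps=0.3, min_samples=5):
    """third_vectors: array of shape (num_videos, feature_dim)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine").fit_predict(third_vectors)
    clusters = {}
    for idx, lab in enumerate(labels):
        if lab == -1:          # DBSCAN marks noise points with -1; they form no cluster here
            continue
        clusters.setdefault(lab, []).append(idx)
    return clusters            # cluster label -> indices of videos in that cluster
```

A density-based algorithm such as DBSCAN also avoids fixing the number of clusters in advance, which fits the motivation of reducing manually specified hyper-parameters.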
After the clustering result is obtained in the embodiment of the application, for each cluster, the target video corresponding to the cluster can be screened from the videos corresponding to the cluster according to the information such as the video duration, the fuzzy frame ratio, the number of key frames and the like corresponding to the videos corresponding to the third feature vectors of the cluster, so that the target video meeting the preset requirement is obtained.
In some embodiments, for each cluster, screening out a target video corresponding to the cluster from videos corresponding to respective third feature vectors of the cluster, includes:
calculating, for each cluster, the entropy of the fourth feature vector corresponding to each third feature vector of the cluster;
determining a target fourth feature vector corresponding to the cluster according to the entropy of each fourth feature vector;
and taking the video corresponding to the target fourth feature vector as the target video corresponding to the cluster.
Wherein the entropy of each fourth feature vector may be calculated according to the following formula:
H_B = - Σ_{bi ∈ B} bi · log(bi)
where B is the fourth feature vector, bi ∈ B are its elements, and H_B is the entropy of the fourth feature vector.
After the entropy of each fourth feature vector is obtained, the L fourth feature vectors with the largest entropy corresponding to the cluster may be used as the target fourth feature vectors corresponding to the cluster, so that a target video with better quality may be screened from the cluster, and other similar videos with possibly smaller information amount in the video training set may be deleted.
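Building on the clustering sketch above, the following illustrative code computes the entropy of each fourth feature vector and keeps the L largest-entropy videos of every cluster as that cluster's target videos; the value of L and the small epsilon guarding log(0) are assumptions:

```python
# A minimal sketch of entropy-based screening of each cluster.
import numpy as np

def entropy(prob_vector, eps=1e-12):
    p = np.asarray(prob_vector, dtype=np.float64)
    return float(-np.sum(p * np.log(p + eps)))     # H_B = -sum_i b_i log(b_i)

def select_targets(clusters, fourth_vectors, L=3):
    targets = []
    for indices in clusters.values():
        ranked = sorted(indices, key=lambda i: entropy(fourth_vectors[i]), reverse=True)
        targets.extend(ranked[:L])                 # the L largest-entropy videos of this cluster
    return targets                                 # indices of the screened target videos
```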
In the embodiment of the application, the first classification model can be trained based on a video training set to obtain the trained first classification model, wherein the video training set comprises a plurality of basic videos and expansion videos corresponding to the basic videos respectively. Each extended video is obtained according to a corresponding basic video, namely, each extended video and the corresponding basic video have certain similarity, so that the first classification model is trained according to a plurality of basic videos and the extended videos corresponding to the basic videos respectively, the trained first classification model can better identify similar videos, similar feature vectors can be extracted from the similar videos, and the accuracy of extracting the feature vectors from the input videos through the trained first classification model is ensured in the subsequent processing process; then, aiming at each video in the plurality of basic videos and the plurality of extended videos, inputting the video into the trained first classification model to obtain a feature vector obtained by the trained first classification model aiming at the video; at this time, the feature vectors of the videos can be respectively extracted through the trained first classification model, so that the target videos are screened out from the video training set according to the feature vectors corresponding to the videos in the video training set, and therefore data cleaning can be performed on the videos in the video training set on the basis of the feature vectors according to requirements, and the target videos meeting expectations are obtained. At this time, the target video is obtained by performing data cleaning on each video in the video training set based on the feature vector, so that the obtained target video is more desirable, and a video data set with higher quality is obtained.
In some embodiments, after the screening out the target video, the method further includes:
and training a second classification model based on the target video to obtain a trained second classification model, wherein the structure of the second classification model is the same as that of the first classification model.
In the embodiment of the present application, the structure of the second classification model is the same as that of the first classification model. And because the quality of the target video obtained by screening is often better, the second classification model is trained based on the target video, the obtained trained second classification model has better performances such as generalization performance and the like, and the situations such as overfitting and the like can not occur, so that the accuracy of video classification can be improved.
Training a second classification model based on the target video, and after obtaining the trained second classification model, further comprising:
acquiring a video to be predicted;
if the format of the video to be predicted does not meet the preset format condition, carrying out format adjustment on the video to be predicted so that the format of the video to be predicted after format adjustment meets the preset format condition;
inputting the format-adjusted video to be predicted into the trained second classification model to obtain a class vector output by the trained second classification model aiming at the format-adjusted video to be predicted;
and determining the category of the video to be predicted according to the category vector.
For example, the category vector P may be P = {p1, p2, …, pn}. According to the category vector, the manner of determining the category of the video to be predicted may be:
and traversing P by using a preset threshold T, and if pi > T exists, determining that the video to be predicted belongs to the category corresponding to pi.
Or, according to the category vector, the manner of determining the category of the video to be predicted may be:
and screening the maximum front K elements from the category vector P, and determining that the video to be predicted belongs to the category corresponding to the front K elements.
It is understood that the video to be predicted may belong to one category, or may belong to more than two categories.
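A minimal sketch of both category-determination manners described above; the threshold T and the value of K are assumptions:

```python
# A minimal sketch of turning the class vector P = {p1, ..., pn} into one or more categories.
import numpy as np

def categories_by_threshold(P, T=0.5):
    # The video belongs to every category i whose probability pi exceeds the preset threshold T.
    return [i for i, pi in enumerate(P) if pi > T]

def categories_by_top_k(P, K=1):
    # The video belongs to the categories of the K largest elements of P.
    return list(np.argsort(P)[::-1][:K])
```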
In the embodiment of the application, the video training set comprises a plurality of basic videos; data cleaning can be performed on each video in the video training set based on the feature vectors as required to obtain target videos that meet expectations, and the second classification model is then trained based on the target videos to obtain the trained second classification model, wherein the structure of the second classification model is the same as that of the first classification model. Because the target videos are obtained by performing data cleaning on the videos in the video training set based on the feature vectors, the target videos better meet expectations, and therefore the performance of the trained second classification model can be improved.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Fig. 4 shows a block diagram of a video screening apparatus provided in the embodiment of the present application, which corresponds to the video screening method described in the foregoing embodiment, and only shows portions related to the embodiment of the present application for convenience of description.
Referring to fig. 4, the video screening apparatus 4 includes:
the first training module 401 is configured to train a first classification model based on a video training set to obtain a trained first classification model, where the video training set includes a plurality of base videos and extension videos corresponding to the base videos, and each extension video is obtained according to a corresponding base video;
a feature extraction module 402, configured to, for each of the multiple base videos and the multiple extended videos, input the video into the trained first classification model, and obtain a feature vector, obtained by the trained first classification model, for the video;
the screening module 403 is configured to screen out a target video from the video training set according to the feature vector corresponding to each video in the video training set.
Optionally, the video screening apparatus 4 further includes:
and the second training module is used for training a second classification model based on the target video to obtain the trained second classification model, wherein the structure of the second classification model is the same as that of the first classification model.
Optionally, the video screening apparatus 4 further includes:
the sampling module is used for sampling the basic video at a preset sampling rate aiming at each basic video to obtain partial or all extended videos corresponding to the basic video, wherein if the number of the extended videos obtained by sampling is more than two, the initial sampling frames respectively corresponding to the extended videos obtained by sampling in the basic video are different.
Optionally, the video screening apparatus 4 further includes:
the image extraction module is used for extracting images of specified image areas in each video frame of the basic video aiming at each basic video to obtain partial or all extended videos corresponding to the basic video, wherein if the number of the extended videos obtained by image extraction is more than two, the specified image areas corresponding to the extended videos obtained by image extraction in the basic video are different.
Optionally, the video screening apparatus 4 further includes:
the noise adding module is used for adding specified noise to each video frame in the basic video aiming at each basic video to obtain partial or all extended videos corresponding to the basic video, wherein if the number of the extended videos obtained by adding the specified noise is more than two, the specified noise corresponding to each extended video obtained by adding the specified noise is different.
Optionally, the first training module 401 specifically includes:
the first acquisition unit is used for randomly acquiring a first video and a second video from the video training set in each iterative training;
the first processing unit is used for inputting the first video into the first classification model, inputting the second video into a third classification model, and acquiring a first training result of the first classification model for the first video and a second training result of the third classification model for the second video, wherein the structure of the third classification model is the same as that of the first classification model;
the second processing unit is used for obtaining a current loss value according to a preset loss function based on the first training result and the second training result and judging whether the current loss value meets a preset condition or not;
the third processing unit is used for taking the first classification model as a trained first classification model if the current loss value meets a preset condition;
and the fourth processing unit is used for updating the first classification model according to the first training result and the second training result if the current loss value does not meet the preset condition, and executing the next iterative training according to the updated first classification model.
Optionally, the preset loss function includes a first classification loss function regarding the first classification model, a second classification loss function regarding the third classification model, and a similarity loss function between the first classification model and the third classification model;
the first training result comprises a first feature vector output by a specified middle layer of the first classification model for the first video, and further comprises a first classification result of the first classification model for the first video;
the second training result comprises a second feature vector output by a specified middle layer in the third classification model for the second video and also comprises a second classification result of the third classification model for the second video;
the second processing unit includes:
a first calculating subunit, configured to calculate a first loss value according to the first feature vector, the second feature vector, and the similarity loss function;
a second calculating subunit, configured to calculate a second loss value according to the first classification result and the first classification loss function;
a third calculating subunit, configured to calculate a third loss value according to the second classification result and the second classification loss function;
a fourth calculating subunit, configured to calculate the current loss value according to the first loss value, the second loss value, and the third loss value;
and the determining subunit is used for determining whether the current loss value meets a preset condition.
Optionally, the feature vector includes a third feature vector and a fourth feature vector;
the feature extraction module 402 is specifically configured to:
inputting the video into the trained first classification model for each of the plurality of base videos and the plurality of extension videos, obtaining a third feature vector output by a designated middle layer of the trained first classification model for the video, and/or obtaining a fourth feature vector output by the last layer of the trained first classification model for the video;
the screening module 403 is specifically configured to:
and screening out the target video from the video training set according to the third feature vector and/or the fourth feature vector corresponding to each video in the video training set.
Optionally, the screening module 403 specifically includes:
the clustering unit is used for clustering the third feature vectors to obtain a clustering result, wherein the clustering result comprises at least two clusters, and each cluster comprises at least one third feature vector;
and the screening unit is used for screening out the target video corresponding to each cluster from the videos corresponding to the third characteristic vectors of the cluster.
In the embodiment of the application, the first classification model can be trained based on a video training set to obtain the trained first classification model, wherein the video training set comprises a plurality of basic videos and expansion videos corresponding to the basic videos respectively. Each extended video is obtained according to a corresponding basic video, namely, each extended video and the corresponding basic video have certain similarity, so that the first classification model is trained according to a plurality of basic videos and the extended videos corresponding to the basic videos respectively, the trained first classification model can better identify similar videos, similar feature vectors can be extracted from the similar videos, and the accuracy of extracting the feature vectors from the input videos through the trained first classification model is ensured in the subsequent processing process; then, aiming at each video in the plurality of basic videos and the plurality of extended videos, inputting the video into the trained first classification model to obtain a feature vector obtained by the trained first classification model aiming at the video; at this time, the feature vectors of the videos can be respectively extracted through the trained first classification model, so that the target videos are screened out from the video training set according to the feature vectors corresponding to the videos in the video training set, and therefore data cleaning can be performed on the videos in the video training set on the basis of the feature vectors according to requirements, and the target videos meeting expectations are obtained. At this time, the target video is obtained by performing data cleaning on each video in the video training set based on the feature vector, so that the obtained target video is more desirable, and a video data set with higher quality is obtained.
It should be noted that the information interaction and execution processes between the above-mentioned devices/units are based on the same concept as the method embodiments of the present application; for their specific functions and technical effects, reference may be made to the method embodiment section, and details are not repeated here.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above division of the functional units and modules is illustrated; in practical applications, the above functions may be allocated to different functional units and modules as needed, that is, the internal structure of the apparatus may be divided into different functional units or modules to implement all or part of the functions described above. The functional units and modules in the embodiments may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit, and the integrated unit may be implemented in the form of hardware or in the form of a software functional unit. In addition, the specific names of the functional units and modules are only for convenience of distinguishing them from each other and are not intended to limit the protection scope of the present application. For the specific working processes of the units and modules in the above system, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
Fig. 5 is a schematic structural diagram of a terminal device according to an embodiment of the present application. As shown in fig. 5, the terminal device 5 of this embodiment includes: at least one processor 50 (only one is shown in fig. 5), a memory 51, and a computer program 52 stored in the memory 51 and executable on the at least one processor 50, wherein the processor 50 implements the steps of any of the video screening method embodiments when executing the computer program 52.
The terminal device 5 may be a server, a mobile phone, a wearable device, an Augmented Reality (AR)/Virtual Reality (VR) device, a desktop computer, a notebook computer, a palmtop computer, or another computing device. The terminal device may include, but is not limited to, the processor 50 and the memory 51. Those skilled in the art will appreciate that fig. 5 is merely an example of the terminal device 5 and does not constitute a limitation of the terminal device 5, which may include more or fewer components than shown, or combine some components, or use different components; for example, it may also include input devices, output devices, network access devices, and the like. The input devices may include a keyboard, a touch pad, a fingerprint sensor (for collecting fingerprint information of a user and direction information of the fingerprint), a microphone, a camera, and the like, and the output devices may include a display, a speaker, and the like.
The Processor 50 may be a Central Processing Unit (CPU), and the processor 50 may also be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 51 may be an internal storage unit of the terminal device 5, such as a hard disk or an internal memory of the terminal device 5. In other embodiments, the memory 51 may also be an external storage device of the terminal device 5, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, or a Flash memory Card (Flash Card) equipped on the terminal device 5. Further, the memory 51 may include both an internal storage unit and an external storage device of the terminal device 5. The memory 51 is used to store an operating system, an application program, a Boot Loader, data, and other programs, such as the program code of the above computer program. The memory 51 may also be used to temporarily store data that has been output or is to be output.
In addition, although not shown, the terminal device 5 may further include a network connection module, such as a Bluetooth module, a Wi-Fi module, a cellular network module, and the like, which is not described in detail here.
In this embodiment, when the processor 50 executes the computer program 52 to implement the steps in any of the video screening method embodiments, the first classification model may be trained based on a video training set to obtain a trained first classification model, where the video training set includes a plurality of basic videos and extended videos respectively corresponding to the basic videos. Because each extended video is obtained from its corresponding basic video, each extended video has a certain similarity to that basic video; training the first classification model on the plurality of basic videos and their corresponding extended videos therefore enables the trained first classification model to better identify similar videos and extract similar feature vectors from them, which ensures the accuracy of the feature vectors extracted from input videos by the trained first classification model in subsequent processing. Then, for each video among the plurality of basic videos and the plurality of extended videos, the video is input into the trained first classification model to obtain the feature vector computed by the trained first classification model for that video. In this way, the feature vector of each video can be extracted by the trained first classification model, and the target videos can be screened out from the video training set according to the feature vectors respectively corresponding to the videos in the video training set, so that data cleaning can be performed on the videos in the video training set on the basis of the feature vectors according to requirements. Because the target videos are obtained by performing feature-vector-based data cleaning on the videos in the video training set, the obtained target videos better meet expectations, and a video data set with higher quality is obtained.
The embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps in the above method embodiments.
The embodiments of the present application further provide a computer program product which, when run on a terminal device, enables the terminal device to implement the steps in the above method embodiments.
The integrated unit, if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the processes in the methods of the above embodiments may be implemented by a computer program; the computer program may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the above method embodiments may be implemented. The computer program includes computer program code, which may be in a source code form, an object code form, an executable file, or some intermediate form. The computer-readable medium may include at least: any entity or device capable of carrying the computer program code to the photographing apparatus/terminal device, a recording medium, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, such as a USB flash disk, a removable hard disk, a magnetic disk, or an optical disk. In certain jurisdictions, in accordance with legislation and patent practice, the computer-readable medium may not be an electrical carrier signal or a telecommunications signal.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/network device and method may be implemented in other ways. For example, the above-described apparatus/network device embodiments are merely illustrative, and for example, the division of the above modules or units is only one logical function division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (12)

1. A method of video screening, comprising:
training a first classification model based on a video training set to obtain a trained first classification model, wherein the video training set comprises a plurality of basic videos and extended videos corresponding to the basic videos respectively, and each extended video is obtained according to the corresponding basic video;
for each of the plurality of basic videos and the plurality of extended videos, inputting the video into the trained first classification model, and obtaining a feature vector obtained by the trained first classification model for the video;
and screening out the target video from the video training set according to the feature vector corresponding to each video in the video training set.
2. The video screening method of claim 1, wherein before training the first classification model based on the video training set, the method comprises:
for each basic video, sampling the basic video at a preset sampling rate to obtain some or all of the extended videos corresponding to the basic video, wherein, if two or more extended videos are obtained by the sampling, the initial sampling frames in the basic video respectively corresponding to the extended videos obtained by the sampling are different.
3. The video screening method of claim 1, wherein before training the first classification model based on the video training set, the method comprises:
for each basic video, performing image extraction on a specified image area in each video frame of the basic video to obtain some or all of the extended videos corresponding to the basic video, wherein, if two or more extended videos are obtained by the image extraction, the specified image areas in the basic video respectively corresponding to the extended videos obtained by the image extraction are different.
4. The video screening method of claim 1, wherein before training the first classification model based on the video training set, the method comprises:
for each basic video, adding specified noise to each video frame of the basic video to obtain some or all of the extended videos corresponding to the basic video, wherein, if two or more extended videos are obtained by adding the specified noise, the specified noise respectively corresponding to the extended videos obtained by adding the specified noise is different.
5. The video screening method of claim 1, wherein the training the first classification model based on the video training set to obtain the trained first classification model comprises:
in each iteration training, randomly acquiring a first video and a second video from the video training set;
inputting the first video into the first classification model, inputting the second video into a third classification model, and obtaining a first training result of the first classification model for the first video and a second training result of the third classification model for the second video, wherein the structure of the third classification model is the same as that of the first classification model;
obtaining a current loss value according to a preset loss function based on the first training result and the second training result, and judging whether the current loss value meets a preset condition;
if the current loss value meets a preset condition, taking the first classification model as a trained first classification model;
and if the current loss value does not meet the preset condition, updating the first classification model according to the first training result and the second training result, and executing next iterative training according to the updated first classification model.
6. The video screening method of claim 5, wherein the preset loss functions include a first classification loss function with respect to the first classification model, a second classification loss function with respect to the third classification model, and a similarity loss function between the first classification model and the third classification model;
the first training result comprises a first feature vector output by a specified middle layer in the first classification model for the first video, and further comprises a first classification result of the first classification model for the first video;
the second training result comprises a second feature vector output by a specified middle layer in the third classification model for the second video, and further comprises a second classification result of the third classification model for the second video;
the obtaining a current loss value according to a preset loss function based on the first training result and the second training result, and judging whether the current loss value meets a preset condition, including:
calculating a first loss value according to the first feature vector, the second feature vector and the similarity loss function;
calculating a second loss value according to the first classification result and the first classification loss function;
calculating a third loss value according to the second classification result and the second classification loss function;
calculating the current loss value according to the first loss value, the second loss value and the third loss value;
and determining whether the current loss value meets a preset condition.
7. The video screening method of claim 1, further comprising, after screening out the target video:
and training a second classification model based on the target video to obtain a trained second classification model, wherein the structure of the second classification model is the same as that of the first classification model.
8. The video screening method of any one of claims 1 to 7, wherein the feature vectors include a third feature vector and a fourth feature vector;
the step of inputting, for each of the plurality of basic videos and the plurality of extended videos, the video into the trained first classification model to obtain a feature vector obtained by the trained first classification model for the video includes:
for each of the plurality of basic videos and the plurality of extended videos, inputting the video into the trained first classification model, and obtaining a third feature vector output by a designated middle layer of the trained first classification model for the video, and/or a fourth feature vector output by a last layer of the trained first classification model for the video;
the screening out the target video from the video training set according to the feature vector corresponding to each video in the video training set respectively comprises:
and screening out the target video from the video training set according to the third feature vector and/or the fourth feature vector corresponding to each video in the video training set.
9. The method for screening videos according to claim 8, wherein the screening a target video from the video training set according to the third feature vector and/or the fourth feature vector corresponding to each video in the video training set comprises:
clustering each third feature vector to obtain a clustering result, wherein the clustering result comprises at least two clusters, and each cluster comprises at least one third feature vector;
and screening out a target video corresponding to each cluster from videos corresponding to each third feature vector of the cluster.
10. A video screening apparatus, comprising:
a first training module, used for training a first classification model based on a video training set to obtain a trained first classification model, wherein the video training set comprises a plurality of basic videos and extended videos corresponding to the basic videos respectively, and each extended video is obtained according to the corresponding basic video;
a feature extraction module, used for inputting, for each of the plurality of basic videos and the plurality of extended videos, the video into the trained first classification model to obtain a feature vector obtained by the trained first classification model for the video;
and a screening module, used for screening out the target video from the video training set according to the feature vector corresponding to each video in the video training set.
11. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the video screening method according to any one of claims 1 to 9 when executing the computer program.
12. A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, carries out a video screening method according to any one of claims 1 to 9.
CN202011178088.8A 2020-10-29 2020-10-29 Video screening method, video screening device and terminal equipment Active CN112214639B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011178088.8A CN112214639B (en) 2020-10-29 2020-10-29 Video screening method, video screening device and terminal equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011178088.8A CN112214639B (en) 2020-10-29 2020-10-29 Video screening method, video screening device and terminal equipment

Publications (2)

Publication Number Publication Date
CN112214639A true CN112214639A (en) 2021-01-12
CN112214639B CN112214639B (en) 2024-06-18

Family

ID=74057379

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011178088.8A Active CN112214639B (en) 2020-10-29 2020-10-29 Video screening method, video screening device and terminal equipment

Country Status (1)

Country Link
CN (1) CN112214639B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113672252A (en) * 2021-07-23 2021-11-19 浙江大华技术股份有限公司 Model upgrading method, video monitoring system, electronic equipment and readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110781934A (en) * 2019-10-15 2020-02-11 深圳市商汤科技有限公司 Supervised learning and label prediction method and device, electronic equipment and storage medium
US20200051252A1 (en) * 2018-08-13 2020-02-13 Nvidia Corporation Scene embedding for visual navigation
CN110990631A (en) * 2019-12-16 2020-04-10 腾讯科技(深圳)有限公司 Video screening method and device, electronic equipment and storage medium
CN111353580A (en) * 2020-02-03 2020-06-30 中国人民解放军国防科技大学 Training method of target detection network, electronic device and storage medium
CN111612093A (en) * 2020-05-29 2020-09-01 Oppo广东移动通信有限公司 Video classification method, video classification device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112214639B (en) 2024-06-18

Similar Documents

Publication Publication Date Title
CN109151501B (en) Video key frame extraction method and device, terminal equipment and storage medium
CN112561080B (en) Sample screening method, sample screening device and terminal equipment
US20120027263A1 (en) Hand gesture detection
CN109117773B (en) Image feature point detection method, terminal device and storage medium
CN111915015B (en) Abnormal value detection method and device, terminal equipment and storage medium
CN111209970A (en) Video classification method and device, storage medium and server
CN114169381A (en) Image annotation method and device, terminal equipment and storage medium
CN112614110B (en) Method and device for evaluating image quality and terminal equipment
CN111598124A (en) Image processing device, image processing apparatus, processor, electronic apparatus, and storage medium
CN111062440B (en) Sample selection method, device, equipment and storage medium
CN113902944A (en) Model training and scene recognition method, device, equipment and medium
CN113297420A (en) Video image processing method and device, storage medium and electronic equipment
CN110135428B (en) Image segmentation processing method and device
WO2019100348A1 (en) Image retrieval method and device, and image library generation method and device
CN111428590A (en) Video clustering segmentation method and system
CN113920382A (en) Cross-domain image classification method based on class consistency structured learning and related device
CN112214639B (en) Video screening method, video screening device and terminal equipment
CN116071569A (en) Image selection method, computer equipment and storage device
CN115546554A (en) Sensitive image identification method, device, equipment and computer readable storage medium
CN112686129B (en) Face recognition system and method
CN115393755A (en) Visual target tracking method, device, equipment and storage medium
CN116246298A (en) Space occupation people counting method, terminal equipment and storage medium
CN114445916A (en) Living body detection method, terminal device and storage medium
CN113705643A (en) Target detection method and device and electronic equipment
CN113947154A (en) Target detection method, system, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant