CN116229330A - Method, system, electronic equipment and storage medium for determining video effective frames - Google Patents

Method, system, electronic equipment and storage medium for determining video effective frames

Info

Publication number
CN116229330A
Authority
CN
China
Prior art keywords
video
frames
video frame
clustered
screened
Prior art date
Legal status
Pending
Application number
CN202310293915.5A
Other languages
Chinese (zh)
Inventor
林燕丹 (Lin Yandan)
张雷 (Zhang Lei)
Current Assignee
Fudan University
Original Assignee
Fudan University
Priority date
Filing date
Publication date
Application filed by Fudan University
Priority to CN202310293915.5A priority Critical patent/CN116229330A/en
Publication of CN116229330A publication Critical patent/CN116229330A/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762: Arrangements using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • G06V10/763: Non-hierarchical techniques, e.g. based on statistics of modelling distributions
    • G06V10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774: Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V10/82: Arrangements using pattern recognition or machine learning using neural networks
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The invention discloses a method, a system, electronic equipment and a storage medium for determining video effective frames, and relates to the technical field of video frame selection. The method comprises the following steps: acquiring all video frames of a target video, the target video being a video containing valid frames to be determined; sequentially carrying out feature extraction and feature dimension reduction on all video frames to obtain a dimension reduction feature matrix of the target video; clustering all video frames based on a Calinski-Harabasz index, a clustering algorithm, a preset cluster number range and the dimension reduction feature matrix to obtain a clustered video frame set to be screened, wherein the clustered video frame set to be screened comprises a plurality of groups of video frame sets to be screened; and determining all effective frames of the target video based on the clustered video frame set to be screened. The invention improves the sensitivity and the universality of video frame selection.

Description

Method, system, electronic equipment and storage medium for determining video effective frames
Technical Field
The present invention relates to the field of video frame selection technologies, and in particular, to a method, a system, an electronic device, and a storage medium for determining a video valid frame.
Background
The method proposed by Moccia et al. in 2018 uses criterion operators (criteria functions) to extract features of endoscope video frames. The effect of this method depends on whether the observer's reading of the data is accurate, and the criterion operators are chosen manually; each operator needs one pass over the whole data set, which is time-consuming. In addition, the chosen criterion operators also affect final indices such as valid-frame sensitivity. Taking the opposite direction from Moccia et al., Patrini et al. and Galdran et al. both proposed in 2019 to apply transfer learning from advanced deep learning techniques, letting a neural network learn image features broadly to realize end-to-end selection of valid frames. But back on clinical data, such an end-to-end model cannot be packaged and used as it is: on the one hand, the labels may not be adapted; on the other hand, if the types of video frames change, the model needs to be retrained and is difficult to reuse. In summary, existing video frame selection methods suffer from low sensitivity and poor universality.
Disclosure of Invention
The invention aims to provide a method, a system, electronic equipment and a storage medium for determining effective frames of video, which improve the sensitivity and universality of video frame selection.
In order to achieve the above object, the present invention provides the following solutions:
a method of determining a valid frame of a video, the method comprising:
acquiring all video frames of a target video; the target video is a video containing effective frames to be determined;
feature extraction and feature dimension reduction are sequentially carried out on all the video frames, so that a dimension reduction feature matrix of the target video is obtained;
clustering all the video frames based on a Calinski-Harabasz index, a clustering algorithm, a preset cluster number range and the dimension reduction feature matrix to obtain a clustered video frame set to be screened; the clustered video frame sets to be screened comprise a plurality of groups of video frame sets to be screened;
and determining all valid frames of the target video based on the clustered video frame set to be screened.
Optionally, feature extraction and feature dimension reduction are sequentially performed on all the video frames to obtain a dimension reduction feature matrix of the target video, which specifically includes:
extracting features of each video frame to obtain an initial feature matrix of the target video;
and performing feature dimension reduction on the initial feature matrix to obtain the dimension reduction feature matrix.
Optionally, clustering all the video frames based on a Calinski-Harabasz index, a clustering algorithm, a preset cluster number range and the dimension reduction feature matrix to obtain a clustered video frame set to be screened specifically comprises the following steps:
taking each preset cluster number in the preset cluster number range as a cluster number, and clustering all the video frames by using the clustering algorithm and the dimension reduction feature matrix to obtain a clustered video frame set corresponding to each preset cluster number;
respectively calculating the Calinski-Harabasz index of the clustered video frame set corresponding to each preset cluster number;
and determining the clustered video frame set with the maximum Calinski-Harabasz index as the clustered video frame set to be screened.
Optionally, determining all valid frames of the target video based on the clustered video frame set to be screened specifically includes:
judging whether a preset number of video frames in the current video frame set to be screened are valid frames or not;
if yes, all video frames in the current video frame set to be screened are determined to be effective frames.
Optionally, the clustering algorithm is an aggregation clustering algorithm, a K-means clustering algorithm, a spectral clustering algorithm or a density-based noisy application spatial clustering algorithm.
Optionally, when the target video includes N video frames and the preset cluster number is K, the calculation formula of the Calinski-Harabasz index is:

$$\mathrm{CH}=\frac{\mathrm{BGSS}/(K-1)}{\mathrm{WGSS}/(N-K)}$$

$$\mathrm{BGSS}=\sum_{k=1}^{K} n_k\,\lVert C_k-C\rVert^{2}$$

$$\mathrm{WGSS}=\sum_{k=1}^{K}\mathrm{WGSS}_k=\sum_{k=1}^{K}\sum_{i=1}^{n_k}\lVert X_{ik}-C_k\rVert^{2}$$

wherein CH is the Calinski-Harabasz index; BGSS is the inter-cluster (between-group) dispersion; WGSS is the intra-cluster (within-group) dispersion; k is the serial number of a clustered video frame set; n_k is the number of video frames in the k-th clustered video frame set; C_k is the centroid of the k-th clustered video frame set; C is the centroid of all the clustered video frame sets; WGSS_k is the sum of squared distances from all video frames in the k-th clustered video frame set to its centroid; i is the serial number of a video frame in the k-th clustered video frame set; X_ik is the i-th video frame in the k-th clustered video frame set.
A system for determining a valid frame of a video, the system comprising:
the target video acquisition module is used for acquiring all video frames of the target video; the target video is a video containing effective frames to be determined;
the feature processing module is used for sequentially carrying out feature extraction and feature dimension reduction on all the video frames to obtain a dimension reduction feature matrix of the target video;
the clustering module is used for clustering all the video frames based on a Calinski-Harabasz index, a clustering algorithm, a preset cluster number range and the dimension reduction feature matrix to obtain a clustered video frame set to be screened; the clustered video frame sets to be screened comprise a plurality of groups of video frame sets to be screened;
and the effective frame determining module is used for determining all effective frames of the target video based on the clustered video frame set to be screened.
An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of determining a video active frame as described above.
A storage medium having stored thereon a computer program which, when executed by a processor, implements a method of determining a video active frame as described above.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the invention discloses a method, a system, electronic equipment and a storage medium for determining effective frames of a video, wherein the method comprises the steps of firstly, sequentially carrying out feature extraction and feature dimension reduction on all video frames of a target video to obtain a dimension reduction feature matrix of the target video; clustering all video frames based on a kaline index, a clustering algorithm, a preset cluster number range and a dimension reduction feature matrix to obtain a clustered video frame set to be screened; the clustered video frame sets to be screened comprise a plurality of groups of video frame sets to be screened; and determining all effective frames of the target video based on the clustered video frame set to be screened. Compared with the method for extracting the characteristics of the endoscope video frames by using the criterion operator, the method reduces manpower and improves the sensitivity of video frame selection; compared with the method for realizing end-to-end selection of effective frames, when the initial video frames are changed, retraining is not needed, and the universality of video frame selection is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart illustrating a method for determining an effective frame of a video according to embodiment 1 of the present invention;
FIG. 2 is a flow chart of feature extraction for each video frame using a vanilla neural network;
fig. 3 is a schematic diagram of an automatic label distribution algorithm.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention aims to provide a method, a system, electronic equipment and a storage medium for determining video effective frames, aiming at improving the sensitivity and universality of video frame selection.
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
Example 1
Fig. 1 is a flowchart illustrating a method for determining an effective frame of a video according to embodiment 1 of the present invention. As shown in fig. 1, the method for determining a video valid frame in this embodiment includes:
step 101: acquiring all video frames of a target video; the target video is a video containing valid frames to be determined.
Specifically, the target video includes, but is not limited to, endoscopic video.
Step 102: and carrying out feature extraction and feature dimension reduction on all the video frames in sequence to obtain a dimension reduction feature matrix of the target video.
Step 103: clustering all video frames based on a Calinski-Harabasz index, a clustering algorithm, a preset cluster number range and the dimension reduction feature matrix to obtain a clustered video frame set to be screened; the clustered video frame sets to be screened comprise a plurality of groups of video frame sets to be screened.
Step 104: and determining all effective frames of the target video based on the clustered video frame set to be screened.
As an alternative embodiment, step 102 specifically includes:
step 1021: and extracting the characteristics of each video frame to obtain an initial characteristic matrix of the target video.
Specifically, before step 1021, the method further includes: each video frame is preprocessed.
Preprocessing is scaling: each original video frame is scaled to an RGB image of size 224×224.
Step 1021 specifically includes:
and extracting the characteristics of each video frame by using the vanilla neural network to obtain the characteristic vector of each video frame, thereby obtaining the initial characteristic matrix of the target video.
As shown in fig. 2, the vanilla neural network is a deep convolutional neural network whose input is a 224×224 RGB image. The depth of the image is first brought to 64 channels by 2 3×3 convolution kernels (3×3 conv, 64) followed by a max pooling layer (Max Pooling); it is then stretched to 128 channels via 2 3×3 convolution kernels (3×3 conv, 128) and another max pooling layer, and further to 256 channels via 3 3×3 convolution kernels (3×3 conv, 256); the fourth max pooling layer reduces the width and height of the image without changing its depth; passing once more through 3 3×3 convolution kernels (3×3 conv, 512) and a max pooling layer yields a 3-dimensional tensor of size 14×14×512. The tensor is flattened (flatten) and some features are randomly discarded, so the feature dimension becomes 4096; the last fully-connected layer (Dense) does not change the feature size, which remains 4096. A feature vector of length 4096 is thus extracted for each video frame; when the target video includes N video frames, an (N, 4096) matrix is obtained, i.e. the initial feature matrix of the target video.
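For illustration only, the following is a minimal sketch of this per-frame feature extraction step, assuming PyTorch/torchvision and using pretrained VGG-16 as a stand-in for the vanilla network (the patent does not name a published architecture, and torchvision's VGG-16 flattens a 7×7×512 tensor rather than the 14×14×512 described above; every parameter below is an assumption):

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Scale every frame to a 224x224 RGB image, as in the preprocessing step.
preprocess = T.Compose([T.ToPILImage(), T.Resize((224, 224)), T.ToTensor()])

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
vgg.eval()
# Drop the final classification layer so each frame yields the
# 4096-dimensional feature vector described in the text.
feature_head = torch.nn.Sequential(*list(vgg.classifier.children())[:-1])

@torch.no_grad()
def extract_features(frames):
    """frames: list of HxWx3 uint8 arrays -> (N, 4096) initial feature matrix."""
    batch = torch.stack([preprocess(f) for f in frames])
    x = vgg.features(batch)        # convolution + max pooling stages
    x = vgg.avgpool(x)             # -> (N, 512, 7, 7)
    x = torch.flatten(x, 1)        # -> (N, 25088)
    return feature_head(x)         # -> (N, 4096)
```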
Step 1022: and performing feature dimension reduction on the initial feature matrix to obtain a dimension reduction feature matrix.
Step 1022 specifically includes: and performing feature dimension reduction on the initial feature matrix by using UMAP algorithm to obtain a dimension reduction feature matrix.
UMAP is a dimensionality-reduction algorithm based on manifold learning; it is fast to compute, and the mapping from the high-dimensional space to the low-dimensional manifold space preserves global structure well. The mapping from the high-dimensional space to the low-dimensional space is optimized through the following loss function C_UMAP. The similarity of two points x_i and x_j in the high-dimensional space (rows of the initial feature matrix) is approximated by a Gaussian distribution v_{j|i}; likewise, the similarity of two points y_i and y_j in the low-dimensional manifold space is approximated by a Student-t distribution w_ij. Finally, stochastic gradient descent is applied to the loss function so that the relation between x_i and x_j is mapped onto that between y_i and y_j:

$$v_{j\mid i}=\exp\!\left(-\frac{\lVert x_i-x_j\rVert^{2}}{2\sigma_i^{2}}\right)$$

$$w_{ij}=\left(1+\lVert y_i-y_j\rVert^{2}\right)^{-1}$$

$$C_{\mathrm{UMAP}}=\sum_{i\neq j}\left[v_{ij}\ln\frac{v_{ij}}{w_{ij}}+\left(1-v_{ij}\right)\ln\frac{1-v_{ij}}{1-w_{ij}}\right]$$

wherein σ_i is the variance of the Gaussian distribution v_{j|i}.
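As a minimal sketch, this dimension-reduction step could be carried out with the umap-learn package; the patent specifies only that UMAP is used, so the parameter values below (n_neighbors, and n_components=2 to match the (N, 2) matrix used later) are illustrative assumptions:

```python
import numpy as np
import umap  # pip install umap-learn

def reduce_features(feature_matrix: np.ndarray) -> np.ndarray:
    """(N, 4096) initial feature matrix -> (N, 2) dimension reduction feature matrix."""
    reducer = umap.UMAP(n_neighbors=15, n_components=2,
                        metric="euclidean", random_state=42)
    return reducer.fit_transform(feature_matrix)
```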
As an optional embodiment, step 103 specifically includes:
and clustering all video frames by using each preset cluster number in a preset cluster number range as a cluster number and using a clustering algorithm and a dimension reduction feature matrix to obtain a clustered video frame set corresponding to each preset cluster number.
And respectively calculating the Calinski-Harabasz index of the clustered video frame set corresponding to each preset cluster number.
And determining the clustered video frame set with the maximum Calinski-Harabasz index as the clustered video frame set to be screened.
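A minimal sketch of these three steps, assuming scikit-learn (whose calinski_harabasz_score computes the same index) and K-means as the clustering algorithm, one of the four listed options; the candidate cluster-number range is an illustrative assumption:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

def cluster_by_ch(reduced: np.ndarray, k_range=range(2, 11)):
    """Cluster once per candidate K, score with the CH index, keep the best."""
    best_k, best_score, best_labels = None, -np.inf, None
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(reduced)
        score = calinski_harabasz_score(reduced, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels
```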
As an optional implementation, step 104 specifically includes:
and judging whether the preset number of video frames in the current video frame set to be screened are valid frames or not.
If yes, all video frames in the current video frame set to be screened are determined to be effective frames.
Detecting valid frames under this determination condition improves the sensitivity of the clustering algorithm's classification result. Sensitivity reflects the detection rate, i.e. the ability to pick out target frames from all frames, and is also one of the evaluation indices.
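A minimal sketch of this screening rule follows; the per-frame check is_valid() is a hypothetical hook standing in for however a deployment judges the sampled frames (the patent does not specify it), and the preset-number threshold is an assumed value:

```python
import numpy as np

def select_valid_frames(labels, frame_ids, is_valid, threshold=3):
    """If the first `threshold` frames of a cluster are all valid,
    every frame in that cluster is determined to be a valid frame."""
    valid = []
    for k in np.unique(labels):
        cluster_ids = [fid for fid, lab in zip(frame_ids, labels) if lab == k]
        if all(is_valid(fid) for fid in cluster_ids[:threshold]):
            valid.extend(cluster_ids)  # promote the whole cluster
    return valid
```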
If 4 video frame sets are finally obtained after clustering, i.e. K=4 and k=0, 1, 2, 3, the evaluation includes:
1. Automatic label assignment algorithm (as shown in fig. 3)
Through steps 101-103, the target video has been reduced to an (N, 2) matrix, and the clustering algorithm yields a clustering result (where the number of clusters is determined to be K by the Calinski-Harabasz index). The clustering result can be represented as K sets of digit numbers ('0', '1', '2', '3'), where each set contains several video frames. The task of the automatic label assignment algorithm is to convert the K numeric cluster sets X_C, whose numbers carry no true meaning, into a set of tags X_G of practical significance ('I' indicates valid, 'B' indicates blur, 'S' indicates light reflection, 'U' indicates low exposure). The flow of the algorithm is described as follows.
(1) Input: clustering result set X_C and real tag set X_G. The real tags are annotated manually by experts, who judge from human experience which category (label) a video frame belongs to; the clustering result labels are output by the algorithm according to the characteristics of the data (generally 0, 1, 2, …), and these numbers have no practical meaning by themselves, although each may correspond to one of the manually assigned labels.
(2) Additional conditions: K=4; the cluster-set number o ∈ {0, 1, 2, 3}; the tag p ∈ {I, B, S, U}.
(3) Traverse the image ids under each number of the cluster set (e.g. starting from 0);
(4) on the basis of (3), traverse the image ids under each tag of the label set (e.g. starting from I);
(5) count the image ids common to the traversal results of (3) and (4):

$$n_{op} = \lvert X_C^{(o)} \cap X_G^{(p)} \rvert$$

(6) after one traversal pass is finished, the currently traversed cluster set (e.g. the set of image ids clustered as '0') has been intersected in turn with the four sets of the real labels I, B, S, U; find the tag whose image ids overlap the current cluster set the most:

$$f(o) = \arg\max_{p}\, n_{op}$$

(7) this completes one mapping relation f: o → p;
(8) repeat until all the numbers in step (3) have been traversed.
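A minimal sketch of the automatic label assignment algorithm, with cluster sets and ground-truth label sets represented as Python dicts of image-id sets (the data layout is an assumption; the mapping rule is the argmax-overlap rule described above):

```python
def assign_labels(cluster_sets, label_sets):
    """cluster_sets: {0: {ids}, ...}; label_sets: {'I': {ids}, ...} -> {o: p}."""
    mapping = {}
    for o, ids in cluster_sets.items():
        # f(o) = argmax_p |X_C^(o) ∩ X_G^(p)|
        mapping[o] = max(label_sets, key=lambda p: len(ids & label_sets[p]))
    return mapping

# Example: cluster 0 shares most image ids with blur label 'B', so f(0) = 'B'.
demo = assign_labels({0: {1, 2, 3}, 1: {4, 5}},
                     {'I': {4, 5}, 'B': {1, 2}, 'S': {3}, 'U': set()})
# demo == {0: 'B', 1: 'I'}
```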
2. Evaluation indices
The results of the clustering algorithm are scored with the following indices.
2.1 Precision

$$\mathrm{Precision}=\frac{TP}{TP+FP}$$

2.2 Sensitivity

$$\mathrm{Sensitivity}=\frac{TP}{TP+FN}$$

2.3 F1 score

$$F_1=\frac{2\cdot\mathrm{Precision}\cdot\mathrm{Sensitivity}}{\mathrm{Precision}+\mathrm{Sensitivity}}$$

TP stands for true positive, a sample that is positive and predicted positive; FP stands for false positive, a sample that is not positive but mistaken for positive; FN stands for false negative, a sample that is positive but predicted negative. Sensitivity measures the model's ability to detect positive samples; precision measures what proportion of the samples predicted positive are truly positive. The F1 score provides a single score that balances precision and sensitivity.
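For completeness, the three indices as plain functions of the confusion-matrix counts:

```python
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)

def sensitivity(tp: int, fn: int) -> float:
    return tp / (tp + fn)

def f1_score(tp: int, fp: int, fn: int) -> float:
    p, s = precision(tp, fp), sensitivity(tp, fn)
    return 2 * p * s / (p + s)

# Example: 40 valid frames found, 10 false alarms, 10 missed -> F1 = 0.8.
assert abs(f1_score(40, 10, 10) - 0.8) < 1e-9
```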
3. Loss function in the hyper-parameter search algorithm
One dimension-reduction algorithm (UMAP) and 4 different clustering algorithms are used. UMAP has configurable parameters; besides preset quantities such as the optimal cluster number, the 4 clustering algorithms also have configurable parameters, and the UMAP parameters and the clustering parameters are mutually independent. UMAP has 4 configurable parameters and the 4 clustering algorithms have {3, 3, 4, 1} respectively, so the searchable parameter spaces formed by these parameters have sizes {12, 12, 16, 4}. Taking K-means clustering as an example, with 12 parameter combinations in the set formed together with UMAP, the combination that makes the final result (an evaluation index such as average sensitivity) best needs to be searched out among the 12 possibilities.
The main purpose of the algorithm is to complement and balance the performance gap between the clustering algorithm and a supervised classification algorithm; the optimal classification performance is obtained by jointly tuning the hyper-parameters of the feature dimension reduction (UMAP) and of the clustering algorithm, and the search range is the parameter space formed by the two parts of the algorithm (feature dimension reduction and feature clustering). According to the process of the clustering algorithm, two cases are distinguished.
In this embodiment, K-means clustering (K-means), agglomerative clustering (Agglomerative Clustering) and spectral clustering (Spectral Clustering) fall into the same case, while the density-based algorithm (DBSCAN) falls into the other case.
Wherein, for the first case (K-means, Agglomerative Clustering and Spectral Clustering):
(1) Input: the sensitivity and accuracy of each algorithm result, and the parameters of the algorithms (UMAP plus one of the three clustering algorithms).
(2) Additional conditions: K=4; weights α = 100 and β = 0.01.
(3) A penalty p is calculated as 10% of the absolute difference between sensitivity and accuracy.
(4) Construct the loss function: the deviation of sensitivity from 1 is weighted and the penalty value p is added, constrained by the parameter space Θ:

$$L_1 = \alpha\,(1 - \mathrm{Sensitivity}) + p, \qquad \theta \in \Theta$$
For the second case (DBSCAN):
(5) Count the number of outliers (label '-1') in the algorithm's result, denoted N_out; count the number of non-outlier clusters, denoted N_in.
(6) Calculate the variance between the number of samples in each non-outlier cluster and the mean over all clusters, denoted σ.
(7) Weight the absolute difference between the number of non-outlier labels (labels other than '-1') in (5) and the value K, and weight the variance in (6).
(8) The loss function consists of the outlier count and the weighted terms in (7), constrained by the parameter space Θ:

$$L_2 = N_{\mathrm{out}} + \alpha\,\lvert N_{\mathrm{in}} - K \rvert + \beta\,\sigma, \qquad \theta \in \Theta$$
(9) The loss functions in (4) and (8) are optimized with Bayesian optimization until a specified number of search steps is reached.
(10) The parameter space returns the optimal parameters.
Algorithm advantage: the task of optimizing the loss function is to shrink the gap to the optimal target toward 0, and the search phase is very time- and resource-consuming, so Bayesian optimization search greatly reduces the cost of searching in an oversized parameter space. It has been reported that the sampling efficiency of Bayesian optimization search is 100 times that of random search. In this embodiment, in a preliminary trial (the task of finding the best average sensitivity), Bayesian optimization search (BO) took 262 s for a 200-step evaluation and 840 s for a 500-step evaluation, faster than random search (RS); meanwhile, the average sensitivity difference between BO and RS was less than 0.3%.
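A minimal sketch of the search loop for the first case, assuming scikit-optimize's gp_minimize as the Bayesian optimizer; the search space (one UMAP parameter plus one K-means parameter), the evaluate() hook returning sensitivity and accuracy against the assigned labels, and the exact weighting inside the loss are all illustrative assumptions rather than the patent's fixed choices:

```python
import umap
from sklearn.cluster import KMeans
from skopt import gp_minimize
from skopt.space import Integer

ALPHA = 100.0  # weight from the additional conditions above

def make_loss(features, evaluate):
    """evaluate(labels) -> (sensitivity, accuracy); an assumed hook."""
    def loss(params):
        n_neighbors, n_init = int(params[0]), int(params[1])
        reduced = umap.UMAP(n_neighbors=n_neighbors,
                            n_components=2).fit_transform(features)
        labels = KMeans(n_clusters=4, n_init=n_init,
                        random_state=0).fit_predict(reduced)
        sens, acc = evaluate(labels)
        p = 0.1 * abs(sens - acc)        # penalty, step (3)
        return ALPHA * (1.0 - sens) + p  # loss L1, step (4) (assumed form)
    return loss

space = [Integer(5, 50, name="n_neighbors"), Integer(1, 20, name="n_init")]
# result = gp_minimize(make_loss(X, evaluate), space, n_calls=200)
# result.x then holds the optimal parameters returned by the search.
```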
As an alternative embodiment, the clustering algorithm is an aggregate clustering algorithm, a K-means clustering algorithm, a spectral clustering algorithm, or a density-based noisy applied spatial clustering algorithm.
As an optional implementation manner, when the target video includes N video frames and the preset cluster number is K, the calculation formula of the Calinski-Harabasz index is:

$$\mathrm{CH}=\frac{\mathrm{BGSS}/(K-1)}{\mathrm{WGSS}/(N-K)}$$

$$\mathrm{BGSS}=\sum_{k=1}^{K} n_k\,\lVert C_k-C\rVert^{2}$$

$$\mathrm{WGSS}=\sum_{k=1}^{K}\mathrm{WGSS}_k=\sum_{k=1}^{K}\sum_{i=1}^{n_k}\lVert X_{ik}-C_k\rVert^{2}$$

wherein CH is the Calinski-Harabasz index; BGSS is the inter-cluster (between-group) dispersion; WGSS is the intra-cluster (within-group) dispersion; k is the serial number of a clustered video frame set; n_k is the number of video frames in the k-th clustered video frame set; C_k is the centroid of the k-th clustered video frame set; C is the centroid of all the clustered video frame sets; WGSS_k is the sum of squared distances from all video frames in the k-th clustered video frame set to its centroid; i is the serial number of a video frame in the k-th clustered video frame set; X_ik is the i-th video frame in the k-th clustered video frame set.
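A direct numpy implementation of these formulas, as a sketch for verification (scikit-learn's calinski_harabasz_score computes the same quantity):

```python
import numpy as np

def calinski_harabasz(X: np.ndarray, labels: np.ndarray) -> float:
    """X: (N, d) reduced features; labels: (N,) cluster assignments."""
    N, K = len(X), len(np.unique(labels))
    C = X.mean(axis=0)                           # centroid of all frames
    bgss = wgss = 0.0
    for k in np.unique(labels):
        Xk = X[labels == k]
        Ck = Xk.mean(axis=0)                     # centroid C_k of cluster k
        bgss += len(Xk) * np.sum((Ck - C) ** 2)  # between-cluster term
        wgss += np.sum((Xk - Ck) ** 2)           # within-cluster term WGSS_k
    return (bgss / (K - 1)) / (wgss / (N - K))
```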
Example 2
In order to implement the method for determining a video valid frame in embodiment 1, this embodiment provides a system for determining a video valid frame, including:
the target video acquisition module is used for acquiring all video frames of the target video; the target video is a video containing valid frames to be determined.
And the feature processing module is used for sequentially carrying out feature extraction and feature dimension reduction on all the video frames to obtain a dimension reduction feature matrix of the target video.
The clustering module is used for clustering all video frames based on a Calinski-Harabasz index, a clustering algorithm, a preset cluster number range and a dimension reduction feature matrix to obtain a clustered video frame set to be screened; the clustered video frame sets to be screened comprise a plurality of groups of video frame sets to be screened.
And the effective frame determining module is used for determining all effective frames of the target video based on the clustered video frame set to be screened.
Example 3
An electronic device, comprising:
one or more processors.
A storage device having one or more programs stored thereon.
The one or more programs, when executed by the one or more processors, cause the one or more processors to implement a method of determining a video active frame as in embodiment 1.
Example 4
A storage medium having stored thereon a computer program which, when executed by a processor, implements a method of determining a video valid frame as in embodiment 1.
In the present specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, and identical and similar parts between the embodiments are all enough to refer to each other. For the system disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section.
The principles and embodiments of the present invention have been described herein with reference to specific examples, the description of which is intended only to assist in understanding the methods of the present invention and the core ideas thereof; also, it is within the scope of the present invention to be modified by those of ordinary skill in the art in light of the present teachings. In view of the foregoing, this description should not be construed as limiting the invention.

Claims (9)

1. A method for determining a valid frame of a video, the method comprising:
acquiring all video frames of a target video; the target video is a video containing effective frames to be determined;
feature extraction and feature dimension reduction are sequentially carried out on all the video frames, so that a dimension reduction feature matrix of the target video is obtained;
clustering all the video frames based on a Calinski-Harabasz index, a clustering algorithm, a preset cluster number range and the dimension reduction feature matrix to obtain a clustered video frame set to be screened; the clustered video frame sets to be screened comprise a plurality of groups of video frame sets to be screened;
and determining all valid frames of the target video based on the clustered video frame set to be screened.
2. The method for determining a valid video frame according to claim 1, wherein feature extraction and feature dimension reduction are sequentially performed on all the video frames to obtain a dimension reduction feature matrix of the target video, and the method specifically comprises:
extracting features of each video frame to obtain an initial feature matrix of the target video;
and performing feature dimension reduction on the initial feature matrix to obtain the dimension reduction feature matrix.
3. The method for determining video valid frames according to claim 1, wherein clustering all the video frames based on a Calinski-Harabasz index, a clustering algorithm, a preset cluster number range and the dimension-reduction feature matrix to obtain a clustered video frame set to be screened comprises the following steps:
taking each preset cluster number in the preset cluster number range as a cluster number, and clustering all the video frames by using the clustering algorithm and the dimension reduction feature matrix to obtain a clustered video frame set corresponding to each preset cluster number;
respectively calculating the Calinski-Harabasz index of the clustered video frame set corresponding to each preset cluster number;
and determining the clustered video frame set with the maximum Calinski-Harabasz index as the clustered video frame set to be screened.
4. The method for determining video valid frames according to claim 1, wherein determining all valid frames of the target video based on the clustered video frame set to be filtered specifically comprises:
judging whether a preset number of video frames in the current video frame set to be screened are valid frames or not;
if yes, all video frames in the current video frame set to be screened are determined to be effective frames.
5. The method of claim 1, wherein the clustering algorithm is an aggregate clustering algorithm, a K-means clustering algorithm, a spectral clustering algorithm, or a density-based noisy applied spatial clustering algorithm.
6. The method for determining a valid frame of a video according to claim 1, wherein when the target video includes N video frames and the preset cluster number is K, the calculation formula of the Calinski-Harabasz index is:

$$\mathrm{CH}=\frac{\mathrm{BGSS}/(K-1)}{\mathrm{WGSS}/(N-K)}$$

$$\mathrm{BGSS}=\sum_{k=1}^{K} n_k\,\lVert C_k-C\rVert^{2}$$

$$\mathrm{WGSS}=\sum_{k=1}^{K}\mathrm{WGSS}_k=\sum_{k=1}^{K}\sum_{i=1}^{n_k}\lVert X_{ik}-C_k\rVert^{2}$$

wherein CH is the Calinski-Harabasz index; BGSS is the inter-cluster (between-group) dispersion; WGSS is the intra-cluster (within-group) dispersion; k is the serial number of a clustered video frame set; n_k is the number of video frames in the k-th clustered video frame set; C_k is the centroid of the k-th clustered video frame set; C is the centroid of all the clustered video frame sets; WGSS_k is the sum of squared distances from all video frames in the k-th clustered video frame set to its centroid; i is the serial number of a video frame in the k-th clustered video frame set; X_ik is the i-th video frame in the k-th clustered video frame set.
7. A system for determining a valid frame of a video, the system comprising:
the target video acquisition module is used for acquiring all video frames of the target video; the target video is a video containing effective frames to be determined;
the feature processing module is used for sequentially carrying out feature extraction and feature dimension reduction on all the video frames to obtain a dimension reduction feature matrix of the target video;
the clustering module is used for clustering all the video frames based on a Calinski-Harabasz index, a clustering algorithm, a preset cluster number range and the dimension reduction feature matrix to obtain a clustered video frame set to be screened; the clustered video frame sets to be screened comprise a plurality of groups of video frame sets to be screened;
and the effective frame determining module is used for determining all effective frames of the target video based on the clustered video frame set to be screened.
8. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of determining video active frames of any of claims 1 to 6.
9. A storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the method of determining a video active frame according to any one of claims 1 to 6.
CN202310293915.5A 2023-03-23 2023-03-23 Method, system, electronic equipment and storage medium for determining video effective frames Pending CN116229330A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310293915.5A CN116229330A (en) 2023-03-23 2023-03-23 Method, system, electronic equipment and storage medium for determining video effective frames

Publications (1)

Publication Number Publication Date
CN116229330A 2023-06-06

Family

ID=86589216

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310293915.5A Pending CN116229330A (en) 2023-03-23 2023-03-23 Method, system, electronic equipment and storage medium for determining video effective frames

Country Status (1)

Country Link
CN (1) CN116229330A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117911956A (en) * 2024-03-19 2024-04-19 洋县阿拉丁生物工程有限责任公司 Dynamic monitoring method and system for processing environment of food processing equipment
CN117911956B (en) * 2024-03-19 2024-05-31 洋县阿拉丁生物工程有限责任公司 Dynamic monitoring method and system for processing environment of food processing equipment

Similar Documents

Publication Publication Date Title
CN112308158B (en) Multi-source field self-adaptive model and method based on partial feature alignment
CN106228185B (en) A kind of general image classifying and identifying system neural network based and method
CN106803247B (en) Microangioma image identification method based on multistage screening convolutional neural network
WO2021238455A1 (en) Data processing method and device, and computer-readable storage medium
CN109671102B (en) Comprehensive target tracking method based on depth feature fusion convolutional neural network
CN106295124B (en) The method of a variety of image detecting technique comprehensive analysis gene subgraph likelihood probability amounts
CN109299664B (en) Reordering method for pedestrian re-identification
CN110751027B (en) Pedestrian re-identification method based on deep multi-instance learning
CN110097060A (en) A kind of opener recognition methods towards trunk image
CN106682681A (en) Recognition algorithm automatic improvement method based on relevance feedback
CN111179216A (en) Crop disease identification method based on image processing and convolutional neural network
CN104699781B (en) SAR image search method based on double-deck anchor figure hash
CN114694178A (en) Method and system for monitoring safety helmet in power operation based on fast-RCNN algorithm
WO2015146113A1 (en) Identification dictionary learning system, identification dictionary learning method, and recording medium
CN110580510A (en) clustering result evaluation method and system
CN116229330A (en) Method, system, electronic equipment and storage medium for determining video effective frames
CN116310466A (en) Small sample image classification method based on local irrelevant area screening graph neural network
CN111444816A (en) Multi-scale dense pedestrian detection method based on fast RCNN
CN112818148B (en) Visual retrieval sequencing optimization method and device, electronic equipment and storage medium
CN116664585B (en) Scalp health condition detection method and related device based on deep learning
CN105844299B (en) A kind of image classification method based on bag of words
CN117132910A (en) Vehicle detection method and device for unmanned aerial vehicle and storage medium
CN109376619A (en) A kind of cell detection method
CN115937910A (en) Palm print image identification method based on small sample measurement network
CN111723737B (en) Target detection method based on multi-scale matching strategy deep feature learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination