CN113591588A - Video content key frame extraction method based on bidirectional space-time slice clustering

Info

Publication number
CN113591588A
Authority
CN
China
Prior art keywords
clustering
slice
video
image
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110750343.XA
Other languages
Chinese (zh)
Inventor
冯子亮
单强达
韩震博
窦芙蓉
何旭东
张欣
唐玄霜
朱鑫
冉旭松
李昊岳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN202110750343.XA priority Critical patent/CN113591588A/en
Publication of CN113591588A publication Critical patent/CN113591588A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a video content key frame extraction method based on bidirectional spatio-temporal slice clustering. The method uses bidirectional, multi-row spatio-temporal slices, increasing the slice thickness so that the extracted slices express more of the video's key information; the number of cluster centers is set adaptively, so the number of key frames need not be fixed in advance; the clustering algorithm incorporates the temporal attribute of the slice images into its distance calculation, improving the accuracy of key frame identification; and the clustering result accounts for the temporal continuity of the slices, reducing information redundancy. Together these measures improve the extraction of video key information.

Description

Video content key frame extraction method based on bidirectional space-time slice clustering
Technical Field
The invention relates to the technical field of computer vision, in particular to a video content key frame extraction method based on bidirectional space-time slice clustering.
Background
Digital video has become an important medium of information dissemination over networks. With the rapid growth in the number of videos online, quickly and effectively locating a desired clip among large collections of videos has become a major concern; this is the video content retrieval problem.
A video consists of continuously changing image frames; the frames that effectively represent the main content of the video are called video content key frames. Key frame extraction is an important means of addressing video content retrieval and plays an important role in video similarity analysis, video content summarization, and related tasks.
Spatio-temporal slicing abstracts video content in both the temporal and spatial dimensions. Specifically, the video is unfolded along time into a sequence of image frames; each frame is sliced in the spatial dimension by extracting one row or one column of the image; the slices are then assembled into a video spatio-temporal slice image that serves as a summary image of the video. Processing this summary image yields information such as the video's content key frames.
Conventional spatio-temporal slicing typically slices in a single direction only, which can miss key information in the video; and because the temporal continuity of the slices is not considered when processing the slice image, the extracted content key frames are insufficiently accurate and suffer from redundancy.
To address these problems, the invention provides a video content key frame extraction method based on bidirectional spatio-temporal slice clustering: bidirectional multi-row spatio-temporal slices improve the extraction of video key information; the adaptive clustering algorithm incorporates the temporal attribute of the slice images into its distance calculation, improving the accuracy of key frame identification; and the clustering result accounts for the temporal continuity of the slices, reducing information redundancy. Together these improve the extraction of video key information.
Disclosure of Invention
The video content key frame extraction method based on the bidirectional space-time slice clustering comprises the following steps.
Step 1, unfold the video along time into a sequence of image frames, slice each frame along the horizontal and vertical directions through the image center point, and extract a transverse slice image and a longitudinal slice image.
The image center point is the central pixel of the image. Centered on the row containing this point, a strip of height equal to the slice width is taken along the horizontal direction as the transverse (horizontal) slice image; centered on the column containing this point, a strip of width equal to the slice width is taken along the vertical direction as the longitudinal (vertical) slice image. The transverse and longitudinal slice images are hereinafter referred to simply as slices or slice images.
The slice width is the pixel width of a slice and can be set in advance; multi-row or multi-column spatio-temporal slices carry more information and improve the extraction of the video's key information.
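As an illustrative sketch only (not the claimed method itself), step 1 might be implemented with NumPy as follows; the function name and the assumption that frames are height × width arrays are ours:

    import numpy as np

    def extract_slices(frame, slice_width=1):
        # frame: H x W (grayscale) or H x W x C image array (an assumption)
        h, w = frame.shape[:2]
        cy, cx = h // 2, w // 2          # row and column of the image center point
        half = slice_width // 2
        # rows centered on the center row -> transverse (horizontal) slice
        transverse = frame[cy - half: cy - half + slice_width, :]
        # columns centered on the center column -> longitudinal (vertical) slice
        longitudinal = frame[:, cx - half: cx - half + slice_width]
        return transverse, longitudinal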
Step 2, splice the transverse slice images and the longitudinal slice images separately along the time direction to form a transverse video slice image and a longitudinal video slice image.
The transverse slice images are stitched top to bottom in the frame order of the unfolded video, forming a transverse video slice image whose time direction runs from top to bottom; the longitudinal slice images are stitched left to right in frame order, forming a longitudinal video slice image whose time direction runs from left to right. The transverse and longitudinal video slice images are hereinafter referred to simply as video slice images.
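Continuing the sketch above, the splicing of step 2 amounts to stacking the per-frame slices along the time axis:

    def build_video_slice_images(frames, slice_width=1):
        pairs = [extract_slices(f, slice_width) for f in frames]
        # transverse slices stitched top to bottom: time runs downward
        transverse_video = np.concatenate([t for t, _ in pairs], axis=0)
        # longitudinal slices stitched left to right: time runs rightward
        longitudinal_video = np.concatenate([l for _, l in pairs], axis=1)
        return transverse_video, longitudinal_video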
Step 3, using an adaptive K-means clustering algorithm with each slice image as the basic clustering unit, cluster the transverse and the longitudinal video slice images separately along the time direction.
The adaptive K-means clustering algorithm comprises the following steps:
Step 3.1, preset the number of cluster centers; this number is adjusted adaptively later;
Step 3.2, according to the set number of cluster centers, distribute the slices of the video slice image uniformly among the cluster centers along the time direction;
Step 3.3, according to the clustering result, recompute and update the cluster center of each class;
Step 3.4, recompute, along the time direction, the distances between the slice images lying between two adjacent cluster centers and those centers, and adjust the boundary between the two adjacent classes accordingly;
Step 3.5, repeat steps 3.3 and 3.4 following the K-means iteration to obtain the clustering result and the key frame candidate frames;
Step 3.6, merge classes containing too few frames and readjust the boundaries and cluster centers;
if a class in the clustering result spans fewer consecutive frames than the specified minimum-frame threshold, the class is removed and merged: it is assigned to the preceding or the following class according to the boundary-adjustment method of step 3.4,
and the cluster centers are updated according to step 3.3. (A code skeleton of steps 3.2 to 3.5 is sketched below.)
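For a fixed number of cluster centers k, the steps above can be summarized by the following skeleton; this is a sketch under our own naming that relies on the update_center and adjust_boundary helpers sketched further below, treats each class as a contiguous frame range, and omits the small-class merge of step 3.6:

    def kmeans_1d(num_frames, p, k, max_iter=20):
        # boundaries[i] is the last frame number of class i
        size = num_frames // k
        boundaries = [(i + 1) * size - 1 for i in range(k - 1)] + [num_frames - 1]  # step 3.2
        centers = []
        for _ in range(max_iter):                                  # step 3.5: iterate
            starts = [0] + [b + 1 for b in boundaries[:-1]]
            # step 3.3: recompute each class center by the trial method
            centers = [update_center(range(s, b + 1), p)
                       for s, b in zip(starts, boundaries)]
            # step 3.4: re-draw the boundary between each pair of adjacent classes
            for i in range(k - 1):
                boundaries[i], _ = adjust_boundary(centers[i], centers[i + 1], p)
        return boundaries, centers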
The cluster center is the center of a class; it is one slice image along the time direction and can be represented by its video frame number.
The distance between a slice image and a cluster center is computed as the distance between two slice images, expressed as the product of the Euclidean distance between the slice images and their temporal distance (the inter-frame distance) along the time direction. The value is small when the two slice images are similar, and decreases as the inter-frame distance decreases.
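Under this definition the distance can be sketched as follows (p maps a frame number to its slice image; the helper name is ours):

    def slice_distance(i, j, p):
        # product of the inter-frame (time) distance and the Euclidean distance
        d = np.linalg.norm(p[i].astype(np.float64) - p[j].astype(np.float64))
        return abs(i - j) * d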
The cluster center of each class is recomputed by a trial method: each slice in the class is tried as the cluster center in turn, the distances from the remaining slices of the class to that candidate are computed and accumulated, and the candidate with the smallest accumulated distance becomes the class's cluster center.
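A sketch of this trial method, using the slice_distance helper above:

    def update_center(members, p):
        # try every slice of the class as center; keep the candidate with the
        # smallest accumulated distance to the remaining slices of the class
        def accumulated(c):
            return sum(slice_distance(c, m, p) for m in members if m != c)
        return min(members, key=accumulated)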
The adjustment of the boundary between two adjacent classes (the cluster boundary) exploits the fact that clustering proceeds along the time direction. For example, when clustering a longitudinal video slice image, a slice to the left can only belong to the left cluster center and a slice to the right to the right cluster center, so the clustering result is a clean boundary between the two classes, with no interleaving.
Taking the clustering of a longitudinal video slice image as an example, the slices are arranged from left to right in time. The boundary is adjusted as follows: first, search from the left and the right cluster centers toward each other, take the first slice to the right of the left center and the first slice to the left of the right center, and compute and accumulate each slice's distance to its own center; then, on the side with the smaller accumulated distance, take the next slice in the same inward order, compute its distance to that side's center, and add it to that side's accumulation; repeat until every slice between the two centers has been searched, which yields the new boundary between the two classes.
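This opposite-direction search can be sketched as follows (assuming at least two slices lie strictly between the two centers; edge cases are omitted):

    def adjust_boundary(left_center, right_center, p):
        # slices strictly between the centers are claimed from both ends; the
        # side with the smaller accumulated distance claims the next slice
        lo, hi = left_center + 1, right_center - 1
        acc_left = slice_distance(lo, left_center, p)
        acc_right = slice_distance(hi, right_center, p)
        while hi - lo > 1:
            if acc_left <= acc_right:
                lo += 1
                acc_left += slice_distance(lo, left_center, p)
            else:
                hi -= 1
                acc_right += slice_distance(hi, right_center, p)
        return lo, hi   # last frame of left class, first frame of right class

With the worked example in the detailed description below (first distances 5 and 10, then 2), this reproduces the left side advancing from frame 11 to frames 12 and 13.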
The clustering result consists of the boundary and cluster center of each class along the time direction once clustering has finished; the frame corresponding to each cluster-center slice image is then a candidate for a video content key frame, or key frame candidate frame.
The adaptive adjustment of the number of cluster centers proceeds as follows:
as the number of cluster centers increases, the mean intra-class distance of each class decreases; the adaptive number of cluster centers is the one at which, as the number varies over a certain range, the mean intra-class distance is closest to the overall mean;
vary the number of cluster centers over a certain range and cluster for each value by the method above; compute the mean intra-class distance for each case, and the average of these means over all cases;
then take the case whose mean intra-class distance is just below that average, and use its number of cluster centers as the adaptive number of cluster centers.
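A sketch of this selection rule; cluster_for_k is a hypothetical helper that runs the clustering above for a fixed k and returns the resulting mean intra-class distance:

    def adaptive_k(p, num_frames, k_range=range(3, 26)):
        means = {k: cluster_for_k(p, num_frames, k) for k in k_range}
        overall = sum(means.values()) / len(means)      # average of all means
        # "just below the average": the largest mean still smaller than it
        below = {k: m for k, m in means.items() if m < overall}
        return max(below, key=below.get)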
Step 4, merge the two key frame candidate sequences obtained by the clustering of step 3 to obtain the final video content key frames.
First, sort all key frame candidates by time and remove duplicate frames.
Second, for any pair of frames whose interval is smaller than the set minimum frame interval, keep either one of the two, yielding the final video content key frame sequence.
This operation respects the temporal continuity of the spatio-temporal slices: a small inter-frame distance between two key frame candidates indicates that the frames are similar and largely redundant, so one of them should be removed.
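A sketch of step 4 (which frame of a too-close pair is kept is a free choice; this sketch keeps the earlier one):

    def merge_key_frames(candidates_a, candidates_b, min_gap):
        merged = sorted(set(candidates_a) | set(candidates_b))  # sort and de-duplicate
        key_frames = []
        for f in merged:
            # drop a frame whose interval to the last kept frame is below min_gap
            if not key_frames or f - key_frames[-1] >= min_gap:
                key_frames.append(f)
        return key_frames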
Compared with the prior art, the invention has the following advantages: bidirectional multi-row spatio-temporal slices allow the extracted slices to express more of the video's key information; the adaptive choice of cluster centers removes the need to fix the number of key frames in advance; the temporal continuity of the spatio-temporal slices is fully considered during clustering, improving the accuracy of key frame identification and reducing information redundancy. The method is also easy to understand, simple to compute, and robust, and achieves a good video content key frame extraction effect.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention.
FIG. 2 is a schematic diagram of the transverse and longitudinal video slice images in the invention.
FIG. 3 is a flow chart of adaptive K-means clustering according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings; evidently, the described embodiments are only some, not all, embodiments of the invention.
A video content key frame extraction method based on bidirectional spatiotemporal slice clustering is shown in figure 1 and comprises the following steps.
Step 1, extracting a transverse slice image and a longitudinal slice image from a video.
Unfold the video along time into a sequence of image frames, and slice each frame along the horizontal and vertical directions through the image center point.
Centered on the row containing the image center point, take a strip of height equal to the slice width along the horizontal direction as the transverse (horizontal) slice image.
Centered on the column containing the image center point, take a strip of width equal to the slice width along the vertical direction as the longitudinal (vertical) slice image.
In this example, let the image length be X, the image width be Y, and the video length be L frames, as shown in FIG. 2(1).
The image center point is the central pixel of the image. Pixel indices start from 0; integer-dividing the image length and width by 2 gives the row and column indices of the center point.
In this example, the row and column coordinates of the image center point are ([X/2], [Y/2]), where the square brackets denote rounding down.
The slice width is a configurable parameter, typically set to 1 to 3; a slice width greater than 1 extracts key information more effectively.
In this example, the slice width is set to 1.
Step 2, splice the transverse slice images and the longitudinal slice images separately to form a transverse video slice image and a longitudinal video slice image.
In this example, the longitudinal slice image has size 1 × Y, as shown at the top of FIG. 2(2); stitching the slices left to right in the frame order of the unfolded video forms a longitudinal video slice image of size L × Y, with time running left to right, as shown in FIG. 2(2).
The transverse slice image has size X × 1, as shown on the left of FIG. 2(3); stitching the slices top to bottom in frame order forms a transverse video slice image of size X × L, with time running top to bottom, as shown in FIG. 2(3).
Step 3, perform adaptive K-means clustering on the longitudinal video slice image with each slice image as the unit; the aim is to find the class boundaries and class centers over the slices.
The flow of step 3 is shown in fig. 3.
Step 3.1, set the initial number of cluster centers; here it is set to 4.
In this example, the video length is L = 80 frames, numbered 0 to 79.
Step 3.2, uniformly divide the 80 slices of the longitudinal video slice image into four classes according to the 4 cluster centers.
In this example, the frame ranges of the four classes after the uniform division are: [0-19], [20-39], [40-59], and [60-79].
Step 3.3, recompute and update the cluster center of each class according to the clustering result.
The distance between the two slice images i and j is calculated by the formula:
dis(i,j)=abs(i-j)*d(p(i),p(j));
where i and j are frame numbers of two slice images, abs () is an absolute value, and d (p (i), p (j)) is a euclidean distance between the slice image p (i) and the slice image p (j).
The cluster center of each class is recomputed by the trial method: each slice image of the class is tried as the cluster center, the distances from the remaining slices of the class to that candidate are computed and accumulated, and the candidate with the smallest accumulated distance becomes the updated cluster center.
In this example, the frame numbers of the updated cluster centers are: 10, 35, 65, and 75.
Step 3.4, adjust the boundary between two adjacent classes according to the new cluster centers.
Recompute, along the time direction, the distances of the slice images lying between two adjacent cluster centers from those centers, and adjust the boundary between the two classes accordingly.
In this example, the centers of the first and second classes are initially frames 10 and 35, and the boundary between them lies at frame 19 (first class) / frame 20 (second class). All frames from 11 to 34 are searched from both ends toward the middle: from frame 11 rightward on the left side and from frame 34 leftward on the right side, until the two searches meet.
First, the distance between frame 11 on the left and the first-class center, frame 10, is computed; assume it is 5, so the left accumulated distance is 5.
Then the distance between frame 34 on the right and the second-class center, frame 35, is computed; assume it is 10, so the right accumulated distance is 10.
Since the left accumulated distance 5 is less than the right accumulated distance 10, frame 12 is taken next in order on the left.
The distance of frame 12 from the first-class center, frame 10, is computed; assume it is 2, so the left accumulated distance becomes 5 + 2 = 7.
Since the left accumulated distance 7 is still less than the right accumulated distance 10, frame 13 is taken next in order on the left.
The new boundary between the first and second classes is finally obtained; assume it lies at frame 23 (first class) / frame 24 (second class).
The boundaries between the second and third classes and between the third and fourth classes are computed in the same way.
Step 3.5, repeat steps 3.3 and 3.4 following the K-means iteration to obtain the clustering result and the key frame candidate frames.
In this example, the final frame ranges of the four classes are: [0-25], [26-28], [29-47], and [48-79].
Step 3.6, merge classes containing too few frames and readjust the boundaries.
In this example, the minimum-frame threshold is set to 4; since the second class contains 3 frames, fewer than the threshold of 4, it must be merged.
Adjusting the boundary between the neighboring classes according to step 3.4 gives the final frame ranges of the three classes: [0-27], [28-47], and [48-79]; the cluster centers updated according to step 3.3 are frames 8, 37, and 73.
Step 3.7, adaptively adjust the number of cluster centers.
In this example, the number of cluster centers ranges over [3, 25]; the mean intra-class distances of the 23 clustering results are computed and fall between 80 and 30, with an overall average of 56.
The mean intra-class distance for k = 5 is 55.6, just below 56, so the adaptive number of cluster centers is k = 5.
In this example, the frame numbers of the cluster centers obtained in step 3 are: 8, 21, 36, 62, and 73; these are the key frame candidates derived from the longitudinal video slice image.
Step 4, apply the method of step 3 to perform adaptive K-means clustering on the transverse video slice image.
In this example, assume the transverse video slice image clustering yields k = 4 cluster centers, at frames 8, 20, 60, and 76.
Step 5, merge the two key frame candidate sequences obtained by the clustering of steps 3 and 4 to obtain the final video content key frame sequence.
After sorting and de-duplication, the key frame candidates are frames 8, 20, 21, 36, 60, 62, and 73.
With a minimum frame interval of 3, frames 20 and 21 and frames 60 and 62 form redundant pairs; the de-redundancy step keeps frame 20 and frame 60 respectively.
The final video content key frames are frames 8, 20, 36, 60, and 73.
Finally, it should be noted that the above embodiments only illustrate, and do not limit, the technical solution of the invention. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the described technical solutions may still be modified, some or all technical features may be equivalently replaced, and the order of the steps may be changed, without departing from the spirit and scope of the corresponding technical solutions. The values of the various thresholds and ranges of the invention may vary with the specific situation.

Claims (5)

1. The video content key frame extraction method based on the bidirectional space-time slice clustering is characterized by comprising the following steps of:
step 1, expanding a video into a multi-frame image, and extracting a transverse slice image and a longitudinal slice image;
step 2, splicing the transverse and longitudinal slice images to form transverse and longitudinal video slice images;
step 3, clustering the horizontal and vertical video slice images respectively by using a self-adaptive K-means clustering algorithm;
step 4, merging the key frame candidate frame sequences obtained by clustering to obtain the final video content key frames.
2. The method of claim 1, wherein step 1 comprises:
carrying out horizontal slicing operation along the center of the image, and extracting a transverse slice image; performing vertical slicing operation along the center of the image to extract a longitudinal slice image; the slice width may be greater than 1.
3. The method of claim 1, wherein the step 2 comprises:
splicing the transverse slice images up and down according to the frame sequence to form a transverse video slice image from top to bottom;
and splicing the longitudinal slice images left and right according to the frame sequence to form a longitudinal video slice image from left to right.
4. The method of claim 1, wherein step 3 comprises:
step 3.1, presetting the number of cluster centers, which is subsequently adjusted adaptively;
step 3.2, uniformly distributing the slices of the video slice image among the cluster centers according to the set number of cluster centers;
step 3.3, recomputing and updating the cluster center of each class according to the clustering result;
step 3.4, recomputing the distances between the slice images lying between adjacent cluster centers and those centers, and adjusting the boundary accordingly;
step 3.5, repeating steps 3.3 and 3.4 following the K-means method to obtain the clustering result and the key frame candidate frames;
step 3.6, merging classes containing too few frames and readjusting the boundaries and cluster centers;
wherein the distance between a slice image and a cluster center is computed as the distance between two slice images, expressed as the product of the Euclidean distance between the slice images and their inter-frame distance along the time direction.
5. The method of claim 4, wherein the number of cluster centers is adaptively adjusted, and comprises:
varying the number of cluster centers over a certain range and clustering for each value by the aforesaid method;
computing the mean intra-class distance for each case, and the average of these means over all cases;
and taking the number of cluster centers in the case whose mean intra-class distance is just below that average as the adaptive number of cluster centers.
CN202110750343.XA 2021-07-02 2021-07-02 Video content key frame extraction method based on bidirectional space-time slice clustering Pending CN113591588A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110750343.XA CN113591588A (en) 2021-07-02 2021-07-02 Video content key frame extraction method based on bidirectional space-time slice clustering


Publications (1)

Publication Number Publication Date
CN113591588A true CN113591588A (en) 2021-11-02

Family

ID=78245763

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110750343.XA Pending CN113591588A (en) 2021-07-02 2021-07-02 Video content key frame extraction method based on bidirectional space-time slice clustering

Country Status (1)

Country Link
CN (1) CN113591588A (en)


Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080232687A1 (en) * 2007-03-22 2008-09-25 Christian Petersohn Method and device for selection of key-frames for retrieving picture contents, and method and device for temporal segmentation of a sequence of successive video pictures or a shot
US20100284670A1 (en) * 2008-06-30 2010-11-11 Tencent Technology (Shenzhen) Company Ltd. Method, system, and apparatus for extracting video abstract
US9280593B1 (en) * 2013-07-24 2016-03-08 Amazon Technologies, Inc. Centroid detection for clustering
CN103426176A (en) * 2013-08-27 2013-12-04 重庆邮电大学 Video shot detection method based on histogram improvement and clustering algorithm
CN106331723A (en) * 2016-08-18 2017-01-11 上海交通大学 Video frame rate up-conversion method and system based on motion region segmentation
CN106709948A (en) * 2016-12-21 2017-05-24 浙江大学 Quick binocular stereo matching method based on superpixel segmentation
CN107220585A (en) * 2017-03-31 2017-09-29 南京邮电大学 A kind of video key frame extracting method based on multiple features fusion clustering shots
CN110543889A (en) * 2019-07-18 2019-12-06 广州供电局有限公司 power load hierarchical clustering method and device, computer equipment and storage medium
CN110516853A (en) * 2019-08-07 2019-11-29 中南民族大学 A kind of exit time prediction technique based on the improved AdaBoost algorithm of lack sampling
CN111382797A (en) * 2020-03-09 2020-07-07 西北工业大学 Clustering analysis method based on sample density and self-adaptive adjustment clustering center
CN112215287A (en) * 2020-10-13 2021-01-12 中国光大银行股份有限公司 Distance-based multi-section clustering method and device, storage medium and electronic device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
JIATONG LI et al.: "Detecting shot boundary with sparse coding for video summarization" *
孙军亮: "Research on key technologies of content-based video detection" (基于内容的视频检测关键技术研究) *
江颉 et al.: "Adaptive AP clustering algorithm and its application in intrusion detection" (自适应AP聚类算法及其在入侵检测中的应用) *
韩震博: "Key frame extraction algorithm based on bidirectional adaptive spatio-temporal slicing" (基于双向自适应时空切片的关键帧提取算法) *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20211102