CN110826491A - Video key frame detection method based on cascading manual features and depth features

Video key frame detection method based on cascading manual features and depth features

Info

Publication number
CN110826491A
Authority
CN
China
Prior art keywords
video
features
frames
input
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911079839.8A
Other languages
Chinese (zh)
Inventor
毋立芳
赵宽
简萌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201911079839.8A priority Critical patent/CN110826491A/en
Publication of CN110826491A publication Critical patent/CN110826491A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/50 Extraction of image or video features by performing operations within image blocks; by using histograms, e.g. histogram of oriented gradients [HoG]; by summing image-intensity values; Projection analysis
    • G06V10/507 Summing image-intensity values; Histogram projection analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/56 Extraction of image or video features relating to colour
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

A video key frame detection method that cascades manual (handcrafted) features and depth features is provided for the field of sports video content analysis. Approaches based purely on depth features are inefficient because their network structures are complex, and broadcast video also contains many other shot types, such as midfield breaks and gradual-transition shots, so that the detection result contains a large number of irrelevant frames. To address these problems, a video key frame detection method combining manual features and depth features is proposed. First, shot boundary detection is performed based on color histogram features to obtain the last frame of each shot. A similarity-clustering method based on histogram similarity is then used to obtain candidate key frames. Finally, the candidate key frames are classified by a deep neural network to obtain the real key frames. Comparison experiments on curling game videos and basketball game videos show that, compared with traditional methods such as background subtraction and optical flow, the method extracts key frames quickly and reliably.

Description

Video key frame detection method based on cascading manual features and depth features
Technical Field
The invention is mainly applied in the field of sports video content analysis, and particularly relates to digital image processing technologies such as image feature extraction, shot segmentation and neural network classification. The method can quickly and effectively detect the key frames in a sports video, enabling further analysis of the video content.
Background
Video key frame extraction automatically extracts the required frames from a video containing thousands of frames. Doing this manually consumes a great deal of time and effort, whereas a key frame detection algorithm can detect key frames quickly and accurately from inter-frame characteristics and the properties of the video itself. Common key frame detection methods include background subtraction, optical flow, and others. Background subtraction detects moving objects by modeling the background and comparing frames against it; it is fast, but strongly affected by the background, and drastic background changes make background modeling difficult and degrade moving-object detection. The optical flow method detects moving targets reliably: it treats each pixel as a motion vector and extracts the optical flow to track the motion of each point, from which the motion state is judged and key frames are detected. Color histogram features summarize the color composition of a frame, and a distance metric between the histograms of two adjacent frames measures their similarity; this is widely used in video key frame detection and shot segmentation. Quyang et al. proposed a method combining HOG and HSV features that achieves a high recall but a low precision, and the HOG feature is computationally heavy and time-consuming.
At present, deep learning is widely applied in the field of image processing, and on typical problems such as image classification it far exceeds traditional handcrafted features. Wu et al. proposed a method combining an FCN and a CNN to detect the key postures in weightlifting videos and achieved good results. However, because the method cascades two deep networks, its processing speed is very low and real-time processing is impossible. It is therefore reasonable to combine manual features with depth features. Yangfei et al. used SIFT features to detect key points and obtain a number of candidate local images, extracted a small number of regions from each image as candidates, and extracted depth features of the candidate regions with AlexNet, achieving good results on the railway videos used in that work; however, because SIFT features are used and every frame must be sent to the neural network for classification, the method cannot run in real time.
Disclosure of Invention
To solve these problems, the invention provides a video key frame detection method that cascades manual features and depth features. First, shot boundary detection is performed on the game video based on color histogram features, and the last frame of each shot is obtained. Candidate key frames are then derived based on histogram similarity. Finally, the candidate key frames are classified by a deep neural network to obtain the real key frames. A curling game video usually lasts two to three hours and contains hundreds of thousands of frames; shot segmentation with the simple and effective color histogram yields about 700 candidate key frames from roughly 200,000 video frames, and the screening method proposed herein filters out a large number of negative samples. The deep-neural-network classification model can then extract image features accurately, ensuring the accuracy of key frame extraction.
The method comprises the following specific steps:
1. Selection of video
A video of a complete game, such as a basketball game video, a curling game video, etc., is selected as an input to the present invention.
2. Shot segmentation
This step performs shot segmentation based on color histogram features: the color histogram feature of each frame of the video is extracted, the distance between the feature vectors of two adjacent frames is compared, a shot change is judged to occur when the distance exceeds a set threshold, and the last frame of each shot is stored.
The color histogram feature records the probability of each color value appearing in the three color channels R, G and B. Each color channel initially gives a 256-dimensional vector; to reduce computational complexity and increase detection speed, the invention quantizes the feature on each color channel into a 16-dimensional vector. For example, the blue channel originally yields a 256-dimensional color vector; every 16 adjacent color bins are summed into one dimension, so the blue channel finally becomes a 16-dimensional vector. The quantization formula is:
Nq(i) = ∑_{j=16i}^{16i+15} No(j)
In the formula, Nq(i) denotes the value of the i-th dimension of the feature after quantization, and No(j) denotes the value of the j-th dimension of the feature before quantization, where 0 ≤ i ≤ 15 and 0 ≤ j ≤ 255.
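As a concrete illustration, the quantization can be sketched in Python with OpenCV as below. This is a minimal sketch under the assumption that frames are BGR images read with cv2; the function name and the final normalization are illustrative choices rather than details specified in the patent.

```python
import cv2
import numpy as np

def quantized_color_histogram(frame, bins=16):
    """Return a 3*bins feature vector: one 16-bin histogram per color channel."""
    features = []
    for channel in range(3):                       # B, G, R channels of an OpenCV frame
        # 256-bin histogram of this channel ...
        hist = cv2.calcHist([frame], [channel], None, [256], [0, 256]).ravel()
        # ... summed over groups of 16 adjacent bins: Nq(i) = No(16i) + ... + No(16i+15)
        features.append(hist.reshape(bins, -1).sum(axis=1))
    feature = np.concatenate(features)
    return feature / feature.sum()                 # normalize so the bins act like probabilities
```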
Common ways to measure the distance between vectors include the Euclidean distance and the cosine distance. The Euclidean distance is sensitive to changes between adjacent frames and has lower computational complexity than the cosine distance: with the thresholds tuned so that shot segmentation detects a similar number of frames, the Euclidean distance takes only 75% of the time of the cosine distance. The invention therefore adopts the Euclidean distance as the inter-frame distance metric. The formula is as follows, where the vectors x and y are the feature vectors of the two frames.
d = ( ∑i (x[i] − y[i])² )^0.5
Gradual-transition frames. Videos contain many gradual transitions such as dissolves, fade-ins/fade-outs and wipes: several consecutive frames still belong to the current shot while the picture of the next shot gradually appears, and the color-feature distance between adjacent frames rises accordingly during the transition. Because the Euclidean distance is sensitive to inter-frame changes, successive gradual-transition frames would all be retained, although only the first of them is actually required. By comparing the lengths of the shots, it is found that 96.2% are longer than 100 frames, 0.05% are shorter than 50 frames, and 3.75% are between 50 and 100 frames. Therefore, once the last frame of a shot is found, the following 50 frames are skipped directly, which both avoids gradual-transition frames and increases speed.
In this step a threshold is set to divide the video into shots. Experiments show that with the Euclidean distance as the distance metric a threshold of 0.2 works well: when the Euclidean distance between the feature vectors of two adjacent frames exceeds 0.2, a shot change is declared. A sketch of this pass is given below.
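The whole shot-segmentation pass (Euclidean distance between adjacent-frame features, cut threshold 0.2, 50-frame skip after a detected cut) can be sketched as follows, reusing the quantized_color_histogram helper from the previous sketch; variable names and the exact way the 50-frame skip is applied are illustrative assumptions.

```python
import cv2
import numpy as np

def detect_shot_last_frames(video_path, threshold=0.2, skip_after_cut=50):
    """Return the last frame of each detected shot (illustrative sketch)."""
    cap = cv2.VideoCapture(video_path)
    last_frames, prev_feature, prev_frame, skip = [], None, None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if skip > 0:                                        # jump over possible gradual-transition frames
            skip -= 1
            continue
        feature = quantized_color_histogram(frame)
        if prev_feature is not None:
            dist = np.linalg.norm(feature - prev_feature)   # Euclidean distance between adjacent frames
            if dist > threshold:                            # shot boundary detected
                last_frames.append(prev_frame)              # keep the last frame of the previous shot
                skip = skip_after_cut
                prev_feature, prev_frame = None, None
                continue
        prev_feature, prev_frame = feature, frame
    cap.release()
    return last_frames
```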
3. Candidate key frame screening
The last frames of the shots stored by shot segmentation still contain a large number of negative samples, and for shots such as advertisements abrupt changes can occur at short intervals, which greatly increases the subsequent processing load. In a curling game, most key frames are very similar to one another; to filter out part of the negative samples, a similarity-clustering method is proposed: using the color histogram features already extracted in the shot segmentation step, the distance threshold is reduced and k pictures satisfying this initial threshold are randomly selected from the shot-segmentation results of step 2; then, in order not to screen out true key frames by mistake, the threshold is increased, the feature distances between the remaining pictures and the k selected pictures are traversed, and every picture that satisfies the relaxed threshold is kept. In this way part of the negative samples are filtered out. Among the video frames extracted by shot segmentation, only the key frames have concentrated features, while the features of the other frames are scattered over the whole feature space, so conventional clustering has difficulty selecting the key frames. With the proposed method, a k value of 4 works well for the large number of frames produced by shot segmentation. This step reuses the features extracted earlier, so no repeated extraction is needed and time is saved. The candidate key frame algorithm is shown in the following table.
(The candidate key frame screening algorithm is given as a table image in the original publication and is not reproduced here; an illustrative sketch follows.)
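For illustration, the screening could be implemented along the following lines. This sketch assumes that the k seed pictures must be mutually within the initial threshold and that a picture is kept when it lies within the second (relaxed) threshold of any seed, with k = 4 and thresholds 0.1/0.2 as reported later in the text; function names and the fallback behaviour are assumptions.

```python
import itertools
import random
import numpy as np

def screen_candidates(features, k=4, init_thresh=0.1, second_thresh=0.2, max_tries=1000):
    """features: histogram feature vectors of the frames kept by shot segmentation."""
    n = len(features)
    dist = lambda a, b: np.linalg.norm(features[a] - features[b])
    # 1. Randomly search for k mutually similar seed pictures (all pairwise distances < init_thresh).
    seeds = None
    for _ in range(max_tries):
        sample = random.sample(range(n), k)
        if all(dist(a, b) < init_thresh for a, b in itertools.combinations(sample, 2)):
            seeds = sample
            break
    if seeds is None:
        return list(range(n))          # fallback: keep everything if no seed set is found
    # 2. Keep every picture within the relaxed threshold of at least one seed.
    return [i for i in range(n)
            if i in seeds or any(dist(i, s) < second_thresh for s in seeds)]
```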
4. Neural network based picture classification
In this step a neural network is used to extract the key frames, since the picture data obtained by shot segmentation still contain a large number of irrelevant pictures besides the key frames, i.e. negative samples.
The candidate key frame screening of step 3 filters out a large number of negative samples while keeping the real video key frames, but some negative samples remain among the retained video frames and cannot be filtered out with manual features. Therefore, in this step the video frames are classified with a deep neural network, which offers higher classification accuracy.
The invention designs a neural network comprising 4 convolutional layers, 4 pooling layers and 3 fully connected layers. To prevent overfitting, a regularization loss is added and dropout layers are used after the first two fully connected layers: network nodes are dropped randomly during training and all nodes are kept during testing. Trained on different videos, the network effectively separates video key frames from other negative samples at test time. With the preprocessing of steps 1-3, the number of pictures fed into the network is greatly reduced, which shortens the network processing time (especially when running on a CPU) without harming accuracy. In earlier inventions and papers, either features are extracted (manually or otherwise) from every frame of the video before being sent to a neural network for classification, or every frame is fed directly into the network, which is particularly inefficient without a GPU. The invention sends only about 1000 pictures from a whole video of 200,000 to 300,000 frames into the neural network, greatly reducing the neural-network processing time and improving the final accuracy.
The invention has the following advantages:
By cascading manual features and depth features for key frame extraction, the method fully exploits the speed of manual feature extraction and the classification accuracy of depth features: video processing reaches 120 FPS with 97% accuracy, giving good results in both speed and accuracy. The new clustering-like candidate frame screening algorithm of step 3 filters out a large number of negative samples and reuses the previously extracted features, so its extra feature-extraction time is essentially negligible. Overall, the method detects video key frames well and outperforms other algorithms in both detection quality and real-time performance.
Drawings
FIG. 1 is a flowchart of a video key frame detection algorithm
FIG. 2 is a diagram of the color histogram feature quantization
Detailed Description
The invention provides a method for detecting video key frames by cascading manual features and depth features. The method comprises the following concrete implementation steps:
1. Data set construction
The training set used herein consists of 10 games from major curling events of the last 5 years (world championships and other major competitions). The main roles of the training set are to tune and find suitable threshold parameters for the shot segmentation part, and to provide the positive and negative samples used to train the neural network that filters the key frames.
Each picture in the training set has a size of 1280x720. Key frames serve as positive samples, and a number of non-key frames randomly selected from the videos serve as negative samples; the training set contains 3776 positive samples and 3000 negative samples.
The quantity and richness of the training set are an important factor in how well the network model trains, but the amount of training data here is insufficient. Data augmentation is adopted to address this issue: it increases the amount of training data, improves the generalization ability of the model and effectively reduces the error rate. The augmentation strategies adopted are flipping, contrast adjustment, hue adjustment and saturation adjustment, as sketched below.
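A minimal sketch of such an augmentation pipeline, using torchvision, is given below; the jitter ranges and the flip probability are illustrative assumptions rather than values taken from the patent.

```python
from torchvision import transforms

# Flip, contrast, hue and saturation adjustment, followed by resizing to the
# 100x100 network input size used later in the text.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(contrast=0.2, saturation=0.2, hue=0.05),
    transforms.Resize((100, 100)),
    transforms.ToTensor(),
])
```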
2. Shot segmentation
A key frame of the video is defined as the last frame of a particular shot, so the last frame of each shot is first obtained by comparing color histogram features under a distance metric.
First, the different shots in the video sequence are segmented and the last frame of each shot is extracted. Shot segmentation mainly uses the color histogram feature: the Euclidean distance between the color histogram features of two adjacent frames is compared with a threshold of 0.2. When the distance between the two frames is less than 0.2, they are judged to belong to the same shot; when it is greater than 0.2, they are judged to belong to different shots, and the last frame of the previous shot is stored as the input of the subsequent steps.
The color histogram records the probabilities of the color values in the three channels R, G and B. Each color channel initially gives a 256-dimensional vector; to reduce computational complexity and increase detection speed, the invention quantizes the feature on each channel into a 16-dimensional vector, summing every 16 adjacent color bins into one dimension (for example, the blue channel's original 256-dimensional vector becomes a 16-dimensional vector). Distances between vectors can be measured with the Euclidean distance, the cosine distance and so on. The Euclidean distance is sensitive to changes between adjacent frames and has lower computational complexity than the cosine distance; with the thresholds tuned so that shot segmentation detects a similar number of frames, it needs only 75% of the time of the cosine distance, so the invention adopts the Euclidean distance as the inter-frame distance metric. After the distance between each pair of adjacent frames is obtained, it is compared with the set threshold to decide whether the two frames belong to the same shot and whether the last frame of the previous shot needs to be stored.
3. Candidate key frame screening
The main operation of candidate key frame screening is as follows: using the color histogram features already extracted in the shot segmentation step, reduce the distance threshold and randomly select k pictures satisfying this initial threshold from the results of step 2; then, in order not to screen out true key frames by mistake, increase the threshold, traverse the feature distances between the remaining pictures and the k selected pictures, and keep every picture that satisfies the relaxed threshold. In this way part of the negative samples are filtered out.
In the experiments, if k is chosen too small it is easy for the k selected pictures to all be negative samples; lowering the initial threshold reduces this risk, but experimentally the effect is still not good. If k is too large, the computation time grows; experiments show that a large k improves the final result somewhat but greatly increases the time consumed. We therefore choose values that balance precision and speed, and finally set k to 4, the initial threshold to 0.1 and the second threshold to 0.2. The features used in this step are those already computed for shot segmentation, so they do not need to be extracted again and the time required by this step is greatly reduced.
4. Neural network based picture classification
In this step a neural network classifies the pictures, since the picture data obtained by shot segmentation still contain a large number of irrelevant pictures besides the key frames, i.e. negative samples. The invention designs a neural network comprising 4 convolutional layers, 4 pooling layers and 3 fully connected layers; to prevent overfitting, a regularization loss is added and dropout layers are used after the first two fully connected layers, dropping network nodes randomly during training and keeping all nodes during testing.
Network input. The pictures in the data set are 1280x720; they are uniformly resized to 100x100 as the input of the whole network, i.e. the width and height are reduced by factors of 12.8 and 7.2 respectively. If the input size differs too much from the original image, some details are lost and the accuracy drops. Compared with a 100x100 input, a 256x256 input increases training time by 210% while improving accuracy by only 0.6 percentage points, and it easily causes memory overflow. We therefore finally use 100x100 pictures as the input of the whole network.
Convolutional and pooling layers. The last pooling layer outputs a feature map of size 6x6x128. The first two convolutional layers use 5x5 convolution kernels, which effectively enlarge the receptive field, although an overly large kernel increases the amount of computation and slows down training. Studies have shown that a series of smaller kernels can replace one larger kernel; for example, two stacked 3x3 convolutional layers act similarly to one 5x5 convolutional layer while using fewer parameters (3x3x2 / 5x5 = 72%). The last two convolutional layers therefore use 3x3 kernels, with the number of channels increased to 128. In this network the feature map keeps its spatial size through the convolutional layers, and downsizing is performed only by the pooling layers.
Fully connected layers. The network contains 3 fully connected layers of sizes 1024, 512 and 2; the final output is the pair of probability values for key frame versus non key frame. The parameter quantity of the fully connected layers is (6x6x128)x1024x512x2 ≈ 4.83x10^9, while that of the convolutional layers is 4704, so the fully connected layers account for far more parameters than the convolutional layers.
Convolutional layer 1: convolution kernel 5x5, 32 channels; input 100x100x3, output 100x100x32
Pooling layer 1: input 100x100x32, output 50x50x32
Convolutional layer 2: convolution kernel 5x5, 64 channels; input 50x50x32, output 50x50x64
Pooling layer 2: input 50x50x64, output 25x25x64
Convolutional layer 3: convolution kernel 3x3, 128 channels; input 25x25x64, output 25x25x128
Pooling layer 3: input 25x25x128, output 12x12x128
Convolutional layer 4: convolution kernel 3x3, 128 channels; input 12x12x128, output 12x12x128
Pooling layer 4: input 12x12x128, output 6x6x128
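For illustration, the listed architecture can be sketched in PyTorch as follows; the layer sizes follow the listing above, while the activations, 'same' padding and dropout rate are assumptions not stated in the patent.

```python
import torch.nn as nn

class KeyFrameClassifier(nn.Module):
    """4 conv + 4 pooling + 3 fully connected layers, 100x100x3 input, 2 outputs."""
    def __init__(self, dropout=0.5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, padding=2), nn.ReLU(),    # 100x100x32
            nn.MaxPool2d(2),                                          # 50x50x32
            nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.ReLU(),   # 50x50x64
            nn.MaxPool2d(2),                                          # 25x25x64
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),  # 25x25x128
            nn.MaxPool2d(2),                                          # 12x12x128
            nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(), # 12x12x128
            nn.MaxPool2d(2),                                          # 6x6x128
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(6 * 6 * 128, 1024), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(1024, 512), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(512, 2),                 # key frame vs. non key frame
        )

    def forward(self, x):                      # x: (batch, 3, 100, 100)
        return self.classifier(self.features(x))
```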
The quantity and richness of the training set are an important factor in how well the network model trains, but the amount of training data here is insufficient. The invention addresses this problem with data augmentation, which increases the amount of training data, improves the generalization ability of the model and effectively reduces the error rate. The augmentation strategies adopted are flipping, contrast adjustment, hue adjustment and saturation adjustment.
For a complex data set, AlexNet, VGG and ResNet were each tested; the classification effect improves as the network depth increases, so VGG or ResNet can be chosen as the classification network when the data set is complex. When testing on a CPU only, the lighter network designed in step 4 of the invention is recommended, which avoids excessively slow runtimes or memory overflow.
The experimental results are as follows:
Candidate key frames are extracted based on histogram similarity, directly reusing the color histogram features computed during shot segmentation. Table 1 compares the time and results of candidate frame extraction: for the three videos, the candidate key frames amount to 60.9%, 59.9% and 78.3% of the frames produced by shot segmentation, while the time consumed is only 1.7%, 0.87% and 0.86% of the total time. This stage therefore filters out a large number of negative samples at very small cost and relieves the subsequent classification task. In the experiments we observe that this stage relies on the key frames of the whole video being very similar to one another while the other frames differ greatly; if k is chosen too small or the initial threshold is set too large, the wrong k pictures are easily selected and true key frames are filtered out. After a number of experiments we chose k = 4 and an initial threshold of 0.1.
Table 1 candidate key frame extraction
(Note: N/N is the ratio of the number of candidate key frames to the total number of frames extracted by shot segmentation; T/T is the ratio of the candidate-extraction time to the total time.)
As shown in Table 2, after shot segmentation, candidate key frame detection and neural network classification, the average accuracy on the test-set videos reaches 97.2%, with an average detection speed of 122 frames per second.
TABLE 2 test set curling video key frame test results
The invention compares the performance of the proposed method with traditional methods and with a key frame detection method using HOG and HSV, as shown in Table 3; the test uses 209278 video frames. The background subtraction method is faster than the proposed method, but the proposed method surpasses both background subtraction and optical flow in accuracy, reaching 97%. The HOG + HSV method achieves recall and precision both above 90%, but the HOG feature requires image gradients, the computation is heavy, and its FPS is only 10, far below the proposed method. The block and global color histogram method reaches a recall of 97% but a low precision because it uses only color features; the proposed method additionally exploits deep image features through the neural network, so its precision is high.
Table 3 comparison of the method herein with the conventional method

Claims (4)

1. A video key frame detection method for cascading manual features and depth features is characterized by comprising the following steps:
(1) shot segmentation based on color histogram features: the color histogram feature of each frame of the video is extracted and the distance between the feature vectors of two adjacent frames is compared; when the distance is greater than a set threshold, a shot change is judged to occur, and the last frame of each shot is stored;
(2) candidate key frame screening: after shot segmentation using the color histogram features, the candidate key frame screening of this step removes a large number of negative samples and reduces the input of the subsequent step;
(3) the screened pictures are further finely classified by a neural network to obtain the final video key frame result.
2. The method according to claim 1, wherein in step (1), the specific method of shot segmentation based on color histogram features is as follows:
extracting, as the color histogram feature, the probabilities of the color values appearing in the three color channels R, G and B; each color channel initially gives a 256-dimensional vector, and the feature of each color channel is quantized into a 16-dimensional vector;
the Euclidean distance is adopted as the inter-frame distance metric; when the distance is greater than the set threshold of 0.2, a shot change is judged to occur, and the last frame of each shot is stored.
3. The method according to claim 1, wherein in step (2), using the color histogram features extracted in the shot segmentation step, the distance threshold is reduced to 0.1 and 4 pictures satisfying this initial threshold are randomly selected from the segmentation results; the threshold is then increased to 0.2, the feature distances between the remaining pictures and the 4 selected pictures are traversed, and a picture is kept when it satisfies the threshold; in this way part of the negative samples are filtered out.
4. The method according to claim 1, wherein in step (3), a neural network comprising 4 convolutional layers, 4 pooling layers and 3 fully connected layers is designed; to prevent overfitting, a regularization loss is added, dropout layers are used after the first two fully connected layers, network nodes are dropped randomly during training, and all nodes are kept during testing;
the convolutional layer 1: convolution kernel 5x 32, input 100x 3, output 100x 32
A pooling layer 1: input 100x 32, output 50 x 32
And (3) convolutional layer 2: convolution kernel 5x 64, input 50 x 32, output 50 x 64
And (3) a pooling layer 2: input 50 x 64, output 25 x 64
And (3) convolutional layer: convolution kernel 3x 128, input 25 x 64, output 25 x128
A pooling layer 3: input 25 x128, output 12x 128
And (3) convolutional layer: convolution kernel 3x 128, input 12x 128, output 12x 128
A pooling layer 3: input 12x 128, output 6x128
and finally, the key frames of the video are obtained through classification by the neural network.
CN201911079839.8A 2019-11-07 2019-11-07 Video key frame detection method based on cascading manual features and depth features Pending CN110826491A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911079839.8A CN110826491A (en) 2019-11-07 2019-11-07 Video key frame detection method based on cascading manual features and depth features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911079839.8A CN110826491A (en) 2019-11-07 2019-11-07 Video key frame detection method based on cascading manual features and depth features

Publications (1)

Publication Number Publication Date
CN110826491A true CN110826491A (en) 2020-02-21

Family

ID=69553066

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911079839.8A Pending CN110826491A (en) 2019-11-07 2019-11-07 Video key frame detection method based on cascading manual features and depth features

Country Status (1)

Country Link
CN (1) CN110826491A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111709301A (en) * 2020-05-21 2020-09-25 哈尔滨工业大学 Method for estimating motion state of curling ball
CN111723713A (en) * 2020-06-09 2020-09-29 上海合合信息科技股份有限公司 Video key frame extraction method and system based on optical flow method
CN111738117A (en) * 2020-06-12 2020-10-02 鞍钢集团矿业有限公司 Method for detecting video key frame of electric bucket tooth based on deep learning
CN113033308A (en) * 2021-02-24 2021-06-25 北京工业大学 Team sports video game lens extraction method based on color features
CN113095295A (en) * 2021-05-08 2021-07-09 广东工业大学 Fall detection method based on improved key frame extraction
CN113221674A (en) * 2021-04-25 2021-08-06 广东电网有限责任公司东莞供电局 Video stream key frame extraction system and method based on rough set reduction and SIFT
CN113239869A (en) * 2021-05-31 2021-08-10 西安电子科技大学 Two-stage behavior identification method and system based on key frame sequence and behavior information
CN113627342A (en) * 2021-08-11 2021-11-09 人民中科(济南)智能技术有限公司 Method, system, device and storage medium for video depth feature extraction optimization

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070214418A1 (en) * 2006-03-10 2007-09-13 National Cheng Kung University Video summarization system and the method thereof
CN106446015A (en) * 2016-08-29 2017-02-22 北京工业大学 Video content access prediction and recommendation method based on user behavior preference
CN106709511A (en) * 2016-12-08 2017-05-24 华中师范大学 Urban rail transit panoramic monitoring video fault detection method based on depth learning
CN107590442A (en) * 2017-08-22 2018-01-16 华中科技大学 A kind of video semanteme Scene Segmentation based on convolutional neural networks
CN109831664A (en) * 2019-01-15 2019-05-31 天津大学 Fast Compression three-dimensional video quality evaluation method based on deep learning
CN110210379A (en) * 2019-05-30 2019-09-06 北京工业大学 A kind of lens boundary detection method of combination critical movements feature and color characteristic

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070214418A1 (en) * 2006-03-10 2007-09-13 National Cheng Kung University Video summarization system and the method thereof
CN106446015A (en) * 2016-08-29 2017-02-22 北京工业大学 Video content access prediction and recommendation method based on user behavior preference
CN106709511A (en) * 2016-12-08 2017-05-24 华中师范大学 Urban rail transit panoramic monitoring video fault detection method based on depth learning
CN107590442A (en) * 2017-08-22 2018-01-16 华中科技大学 A kind of video semanteme Scene Segmentation based on convolutional neural networks
CN109831664A (en) * 2019-01-15 2019-05-31 天津大学 Fast Compression three-dimensional video quality evaluation method based on deep learning
CN110210379A (en) * 2019-05-30 2019-09-06 北京工业大学 A kind of lens boundary detection method of combination critical movements feature and color characteristic

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
MINGHUI ZHAO等: "Key Frame Extraction of Assembly Process Based on Deep Learning", 《2018 IEEE 8TH ANNUAL INTERNATIONAL CONFERENCE ON CYBER TECHNOLOGY IN AUTOMATION, CONTROL, AND INTELLIGENT SYSTEMS (CYBER)》 *
YIZE CUI等: "Scene Detection of News Video Using CNN Features", 《2017 10TH INTERNATIONAL CONGRESS ON IMAGE AND SIGNAL PROCESSING, BIOMEDICAL ENGINEERING AND INFORMATICS (CISP-BMEI)》 *
王洋: "Research on sensitive video content recognition based on convolutional neural networks" (基于卷积神经网络的视频敏感内容识别研究), 《广播电视网络》 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111709301B (en) * 2020-05-21 2023-04-28 哈尔滨工业大学 Curling ball motion state estimation method
CN111709301A (en) * 2020-05-21 2020-09-25 哈尔滨工业大学 Method for estimating motion state of curling ball
CN111723713B (en) * 2020-06-09 2022-10-28 上海合合信息科技股份有限公司 Video key frame extraction method and system based on optical flow method
CN111723713A (en) * 2020-06-09 2020-09-29 上海合合信息科技股份有限公司 Video key frame extraction method and system based on optical flow method
CN111738117A (en) * 2020-06-12 2020-10-02 鞍钢集团矿业有限公司 Method for detecting video key frame of electric bucket tooth based on deep learning
CN111738117B (en) * 2020-06-12 2023-12-19 鞍钢集团矿业有限公司 Deep learning-based detection method for electric bucket tooth video key frame
CN113033308A (en) * 2021-02-24 2021-06-25 北京工业大学 Team sports video game lens extraction method based on color features
CN113221674B (en) * 2021-04-25 2023-01-24 广东电网有限责任公司东莞供电局 Video stream key frame extraction system and method based on rough set reduction and SIFT
CN113221674A (en) * 2021-04-25 2021-08-06 广东电网有限责任公司东莞供电局 Video stream key frame extraction system and method based on rough set reduction and SIFT
CN113095295B (en) * 2021-05-08 2023-08-18 广东工业大学 Fall detection method based on improved key frame extraction
CN113095295A (en) * 2021-05-08 2021-07-09 广东工业大学 Fall detection method based on improved key frame extraction
CN113239869A (en) * 2021-05-31 2021-08-10 西安电子科技大学 Two-stage behavior identification method and system based on key frame sequence and behavior information
CN113239869B (en) * 2021-05-31 2023-08-11 西安电子科技大学 Two-stage behavior recognition method and system based on key frame sequence and behavior information
CN113627342A (en) * 2021-08-11 2021-11-09 人民中科(济南)智能技术有限公司 Method, system, device and storage medium for video depth feature extraction optimization
CN113627342B (en) * 2021-08-11 2024-04-12 人民中科(济南)智能技术有限公司 Method, system, equipment and storage medium for video depth feature extraction optimization

Similar Documents

Publication Publication Date Title
CN110826491A (en) Video key frame detection method based on cascading manual features and depth features
Tian et al. Padnet: Pan-density crowd counting
KR102449841B1 (en) Method and apparatus for detecting target
Deng et al. Image aesthetic assessment: An experimental survey
US7949188B2 (en) Image processing apparatus, image processing method, and program
CN106778854B (en) Behavior identification method based on trajectory and convolutional neural network feature extraction
CN106682108B (en) Video retrieval method based on multi-mode convolutional neural network
Tran et al. Two-stream flow-guided convolutional attention networks for action recognition
US8326042B2 (en) Video shot change detection based on color features, object features, and reliable motion information
CN106446015A (en) Video content access prediction and recommendation method based on user behavior preference
CN110569773B (en) Double-flow network behavior identification method based on space-time significance behavior attention
CN108921130A (en) Video key frame extracting method based on salient region
WO2019007020A1 (en) Method and device for generating video summary
Javed et al. A decision tree framework for shot classification of field sports videos
CN111340019A (en) Grain bin pest detection method based on Faster R-CNN
CN115393788B (en) Multi-scale monitoring pedestrian re-identification method based on global information attention enhancement
Ulges et al. Content-based video tagging for online video portals
CN107341456B (en) Weather sunny and cloudy classification method based on single outdoor color image
Liu et al. Effective feature extraction for play detection in american football video
CN115661618A (en) Training method of image quality evaluation model, image quality evaluation method and device
Nandyal et al. Bird swarm optimization-based stacked autoencoder deep learning for umpire detection and classification
CN112800968A (en) Method for identifying identity of pig in drinking area based on feature histogram fusion of HOG blocks
Çakar et al. Thumbnail Selection with Convolutional Neural Network Based on Emotion Detection
Niu et al. Semantic video shot segmentation based on color ratio feature and SVM
Premaratne et al. A Novel Hybrid Adaptive Filter to Improve Video Keyframe Clustering to Support Event Resolution in Cricket Videos

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination