CN110460840B - Shot boundary detection method based on three-dimensional dense network - Google Patents

Shot boundary detection method based on three-dimensional dense network

Info

Publication number
CN110460840B
Authority
CN
China
Prior art keywords: boundary detection, layer, dimensional, output, convolution
Legal status
Active
Application number
CN201910900958.9A
Other languages
Chinese (zh)
Other versions
CN110460840A (en)
Inventor
赵晓丽
张翔
张嘉祺
方志军
李国平
商习武
王国中
Current Assignee
Shanghai University of Engineering Science
Original Assignee
Shanghai University of Engineering Science
Application filed by Shanghai University of Engineering Science
Priority to CN201910900958.9A
Publication of CN110460840A
Application granted
Publication of CN110460840B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 17/00: Diagnosis, testing or measuring for television systems or their details
    • H04N 17/002: Diagnosis, testing or measuring for television systems or their details, for television cameras

Abstract

The invention discloses a shot boundary detection method based on a three-dimensional dense network, comprising the following steps: dividing the video into frame segments, randomly assigning labels, and inputting the labeled segments into a three-dimensional dense network to complete classification. The three-dimensional dense network comprises, connected in sequence, a three-dimensional convolution layer, a max pooling layer, four shot boundary detection blocks and a linear layer; the three-dimensional convolution layer is the input layer and the linear layer is the output layer. Each shot boundary detection block comprises several groups of repeating units connected end to end; a repeating unit comprises a bottleneck layer as its input and a dense block with three-dimensional convolution as its output, and the output of each group of repeating units serves as the input of the next group. A transition layer is connected after each shot boundary detection block; each transition layer comprises a Batch Normalization layer, a ReLU, a convolution and an average pooling layer. The invention uses three-dimensional convolution to combine the spatio-temporal features of the video and adopts a dense network for feature reuse, which not only improves detection accuracy but also reduces computational complexity.

Description

Shot boundary detection method based on three-dimensional dense network
Technical Field
The invention belongs to the technical field of video content analysis, relates to a shot boundary detection technology used in video analysis and retrieval, and particularly relates to a shot boundary detection method based on a three-dimensional dense convolutional network (3D DenseNet).
Background
The rapid development of computer and multimedia technologies has produced vast amounts of video data, and how to find the required information within this data has become a hot research issue. The first step of video retrieval is feature extraction, which in turn requires first segmenting the video into shots; shot boundary detection is therefore an important means of video segmentation. Shot transitions fall into two types: gradual transitions and cuts. A gradual transition means that adjacent shots change into one another gradually, extending over a dozen to dozens of frames; a cut means that the next shot appears immediately after the previous shot ends. Shot boundary detection is widely applied in related industries such as digital television, traffic monitoring, electronic police (automated traffic enforcement), bank surveillance, business information management and national security. Commercial applications can bring great economic benefits, and national security applications help maintain social stability and development.
Common shot boundary detection methods include histogram methods, threshold methods, mutual information methods, support vector machine methods, deep learning methods and the like. Those skilled in the art have studied these methods extensively. "Fast Video Shot Boundary Detection Based on SVD and Pattern Matching" (International Workshop on Systems, IEEE, 2007) proposes extracting the HSV color histogram of each video frame as a feature and describing it with singular value decomposition; the computational complexity is low and the detection speed improves, but the detection precision is not ideal. "Information theory-based shot cut/fade detection and video summarization" (IEEE Transactions on Circuits and Systems for Video Technology, 2005, 16(1): 82-91) describes inter-frame similarity using mutual information and joint entropy, and finds shots by comparing the similarity of adjacent frames against a global threshold, but does not account for local content changes affecting accuracy. "Shot Boundary Detection by a Hierarchical Supervised Approach" (International Workshop on Systems, IEEE, 2007) uses a support vector machine as a classifier to distinguish shot boundaries from non-boundaries, with unsatisfactory results. "Learning Spatiotemporal Features with 3D Convolutional Networks" (International Conference on Computer Vision (ICCV), 2015, 4489-4497) shows that 3D convolutional networks are better suited to learning on large-scale video data sets and are easy to train and use. "Large-scale, Fast and Accurate Shot Boundary Detection through Spatio-temporal Convolutional Neural Networks" (arXiv preprint arXiv:1705.03281, 2017) feeds fixed-length segments into a C3D network and classifies them as gradual, cut or unchanged; the method verifies the effectiveness of ConvNets on this task, but cannot locate the shot boundary when handling gradual transitions of different scales. "Ridiculously Fast Shot Boundary Detection with Fully Convolutional Neural Networks" (arXiv preprint arXiv:1705.08214, 2017) uses a fully convolutional network that takes the entire video sequence as input and assigns a positive label to frames within a transition, thereby detecting shot boundaries, but likewise does not solve localization at different scales. "Fast Video Shot Transition Localization with Deep Structured Models" (arXiv preprint arXiv:1808.04234, 2018) builds a detection framework of three parts (initial filtering, cut detection and gradual-transition detection) using a cascade of a C3D ConvNet and a ResNet-18 network; it reaches real-time speed but does not solve problems such as the redundancy introduced by deepening the network hierarchy.
Deep convolutional neural networks can better understand the high-level semantic information of images and yield good results when applied to video shot boundary detection. Existing feature extraction networks mainly use 2D convolution and are generally designed for images; when analyzing video they ignore temporal information, causing a loss of inter-frame information. Extracting features with 3D convolution in deeper network models gives better detection results, but the deepening of the network brings problems such as a large amount of computation and low efficiency.
Therefore, the development of the shot boundary detection method with small calculation amount, high efficiency and good detection effect is of great practical significance.
Disclosure of Invention
The invention aims to overcome the defects of poor detection performance, heavy computation and low efficiency in the prior art, and provides a shot boundary detection method with a small amount of computation, high efficiency and a good detection effect.
In order to achieve the purpose, the invention provides the following technical scheme:
the shot boundary detection method based on the three-dimensional dense network comprises the following steps:
(1) dividing the video into frame segments, and then randomly assigning labels;
(2) inputting the labeled frame segments into a three-dimensional dense network for training, and outputting the classified frame segments;
the three-dimensional Dense network (3D DenseNet) comprises a three-dimensional convolution layer (Conv3D), a maximum pooling layer (Max Pooling), four lens boundary detection blocks (SBD Block) and a Linear layer (Linear, outputting 3 types of characteristics), wherein the three-dimensional convolution layer is an input layer, the Linear layer is an output layer, the lens boundary detection Block (SBD Block) comprises a plurality of groups of repeating units which are connected end to end, each repeating unit comprises a Bottleneck layer (Bottleneck) as an input and a Dense Block (Dense Block) which is subjected to three-dimensional convolution as an output, the output of the last group of repeating units is used as the input of the next group of repeating units, the original 2D convolution of the DenseBlock is replaced by 3D convolution, the three-dimensional convolution is used for combining the space-time characteristics of video to improve the detection accuracy, a Transition layer (Transition) is connected behind each lens boundary detection Block, and the Transition layer (Transition) comprises a Batchmaking normalization (normalization, a Linear rectification function (REactivating), an average function 351) of a Linear function (1 × 1), and an Av 2 convolution layer (2 × 2).
Conventional feature extraction networks use 2D convolution, which is typically applied to images; when analyzing video, 2D convolution ignores temporal information, resulting in a loss of inter-frame information. The method extracts features with 3D convolution, an extension of 2D convolution that adds a time dimension, so the temporal and spatial information of the video can be extracted directly and motion information captured. A 2D convolution on a single channel takes an input image of size (1, height, width) and a convolution kernel of size (1, k_h, k_w); the kernel slides over the spatial dimensions of the input, and each convolution of the values in the (k_h, k_w) window yields one value of the output map. For multiple channels, with an input of size (3, height, width) and a kernel of size (3, k_h, k_w), the kernel again slides over the spatial dimensions, convolving all values in the (k_h, k_w) window across the 3 channels to produce one output value. 3D convolution likewise comes in single-channel and multi-channel forms. The single-channel case differs from 2D convolution in that the input has size (1, time, height, width), adding temporal information; the kernel gains a k_t dimension, so it slides over both the spatial and the temporal dimensions of the input video. Multi-channel 3D convolution similarly adds time to the multi-channel 2D input, convolving all values in the (k_t, k_h, k_w) window across the 3 channels to produce one output value. Because it takes the temporal as well as the spatial information of the video into account, 3D convolution is better suited than 2D convolution to feature extraction from video; through 3D convolution and 3D pooling it models spatio-temporal features better and is more suitable for classification tasks. However, as a convolutional neural network deepens, the problems of vanishing gradients and model degradation become more serious.
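For readers prototyping this distinction, the shape difference between 2D and 3D convolution can be seen directly in a few lines of PyTorch (a minimal sketch, not part of the patent; the layer width of 64 is illustrative, and the clip size matches the 16-frame, 128 × 128 segments used later):

```python
import torch
import torch.nn as nn

# 2D convolution: input is (batch, channels, height, width).
x2d = torch.randn(1, 3, 128, 128)      # one RGB image
conv2d = nn.Conv2d(3, 64, kernel_size=3, padding=1)
print(conv2d(x2d).shape)               # torch.Size([1, 64, 128, 128])

# 3D convolution: the input gains a time dimension,
# (batch, channels, time, height, width), so the kernel also
# slides across frames and can capture motion between them.
x3d = torch.randn(1, 3, 16, 128, 128)  # a 16-frame RGB clip
conv3d = nn.Conv3d(3, 64, kernel_size=3, padding=1)
print(conv3d(x3d).shape)               # torch.Size([1, 64, 16, 128, 128])
```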
In the invention, DenseNet exploits its characteristic dense connections to gain depth while preserving gradient propagation, which greatly alleviates the problems above. In a traditional convolutional neural network, when a later layer needs the output features of an earlier layer, those features must be re-extracted through convolution; with the dense connections of DenseNet no re-extraction is needed, and earlier features are taken directly by later layers, greatly reducing the number of parameters and the amount of computation.
The DenseNet of the invention is mainly composed of two parts, the Dense Block and the Transition layer. Within a Dense Block, the feature maps of all layers have the same size and can be concatenated along the channel dimension. Unlike other networks, every layer in a Dense Block outputs l feature maps after convolution, i.e. it uses l convolution kernels, where l is the growth rate of the DenseNet. The Transition layer connects two adjacent Dense Blocks and reduces the size of the feature maps, compressing the model and preventing overfitting. It carries feature maps coming from different layers, and its particular design (including the parameters of the average pooling layer and convolution) realizes feature reuse and improves efficiency. In addition, the SBD Block can extract features from the video directly, reducing the loss of video features.
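As a concrete reference, the following PyTorch sketch gives one possible reading of the repeating unit (a bottleneck followed by a 3D-convolution dense layer) and the transition layer. The 4× bottleneck width follows the common DenseNet-B convention and, like the 3 × 3 × 3 kernel size, is an assumption rather than something the patent specifies:

```python
import torch
import torch.nn as nn

class DenseLayer3D(nn.Module):
    """One repeating unit: a bottleneck (BN -> ReLU -> 1x1x1 conv)
    followed by BN -> ReLU -> 3x3x3 conv producing `growth_rate`
    new feature maps."""
    def __init__(self, in_channels, growth_rate=32):
        super().__init__()
        self.layers = nn.Sequential(
            nn.BatchNorm3d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv3d(in_channels, 4 * growth_rate, kernel_size=1, bias=False),
            nn.BatchNorm3d(4 * growth_rate),
            nn.ReLU(inplace=True),
            nn.Conv3d(4 * growth_rate, growth_rate, kernel_size=3,
                      padding=1, bias=False),
        )

    def forward(self, x):
        # Dense connection: concatenate input and new features so that
        # later layers can reuse earlier feature maps without re-extraction.
        return torch.cat([x, self.layers(x)], dim=1)

class Transition3D(nn.Module):
    """BN -> ReLU -> 1x1x1 conv (channel compression by theta)
    -> 2x2x2 average pooling."""
    def __init__(self, in_channels, theta=0.5):
        super().__init__()
        self.layers = nn.Sequential(
            nn.BatchNorm3d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv3d(in_channels, int(in_channels * theta),
                      kernel_size=1, bias=False),
            nn.AvgPool3d(kernel_size=2, stride=2),
        )

    def forward(self, x):
        return self.layers(x)
```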
The invention combines the spatio-temporal features of the video through three-dimensional convolution and adopts a dense network for feature reuse, which reduces the parameter count of the network; it can therefore detect shot boundaries accurately while also reducing computational complexity.
As a preferred technical scheme:
the method for detecting the shot boundary based on the three-dimensional Dense network comprises the convolution of Batch Normalization (BN), RELU (activation function) and 1 × 1 ×, wherein the convolution of 1 × 1 × 1 is used for reducing the number of features and improving the calculation efficiency, as the number of layers is increased, the input of a Dense Block is very much, so that the calculation amount is reduced by adopting Bottleneck, and the Bottlenk can be matched with Transition to compress a model to the maximum extent.
According to the shot boundary detection method based on the three-dimensional dense network, the frame segment labels are of three types: gradual, cut and unchanged.
In the shot boundary detection method based on the three-dimensional dense network, the classified frame segments are further processed to obtain the final three classes of segments, as follows:
(i) merging the frame segments that carry the same label among the classified segments;
(ii) performing a secondary detection on the segments labeled gradual: compute a color histogram for the first and the last frame of each segment and measure the Bhattacharyya distance between the two histograms; if the distance is small enough, the segment is determined to be an unchanged segment. The color histograms and the Bhattacharyya distance are computed with OpenCV, whose processing is fast (a sketch follows below);
(iii) merging the frame segments processed in step (ii) and outputting the final three classes of frame segments.
In the shot boundary detection method based on the three-dimensional dense network, the distance being small enough means that the distance is less than or equal to 0.5.
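A sketch of this secondary check using OpenCV follows. The HSV histogram bin counts are illustrative assumptions; the Bhattacharyya comparison and the 0.5 threshold come from the text:

```python
import cv2

def is_actually_unchanged(first_frame, last_frame, threshold=0.5):
    """first_frame / last_frame are BGR images (numpy arrays)."""
    hists = []
    for frame in (first_frame, last_frame):
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        # 2D hue/saturation histogram; the bin counts are illustrative.
        h = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
        cv2.normalize(h, h, 0, 1, cv2.NORM_MINMAX)
        hists.append(h)
    dist = cv2.compareHist(hists[0], hists[1], cv2.HISTCMP_BHATTACHARYYA)
    # A small distance means the two frames are similar, so the segment
    # is relabeled as unchanged (no real transition).
    return dist <= threshold
```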
The shot boundary detection method based on the three-dimensional dense network comprises the following model parameters:
the size of the input feature map for the three-dimensional convolutional layer is 8 × 3 × 16 × 128 × 128 (batch × channels × frames × height × width);
the size of the feature map of the output of the three-dimensional convolutional layer is 8 × 64 × 16 × 64 × 64;
the size of the feature map for the output of the max pooling layer is 8 × 64 × 8 × 32 × 32;
the size of the feature map of the output of the first shot boundary detection block is 8 × 32 × 8 × 32 × 32;
the size of the feature map of the output of the second shot boundary detection block is 8 × 32 × 4 × 16 × 16;
the size of the feature map of the output of the third shot boundary detection block is 8 × 32 × 2 × 8 × 8;
the size of the feature map of the output of the fourth shot boundary detection block is 8 × 32 × 1 × 4 × 4;
the size of the feature map of the output of the linear layer is 1 × 3.
The parameters are shown in Table 1 below, where BRA is an abbreviation for Batch Normalization, ReLU and AvgPooling:
TABLE 1
Layer        Kernel   Feature map       Followed by
Input        -        8×3×16×128×128    -
Conv3D       7×7×7    8×64×16×64×64     BN, ReLU
MaxPooling   3×3×3    8×64×8×32×32      BN, ReLU
SBD Block 1  -        8×32×8×32×32      BRA
SBD Block 2  -        8×32×4×16×16      BRA
SBD Block 3  -        8×32×2×8×8        BRA
SBD Block 4  -        8×32×1×4×4        BRA
Linear       -        1×3               -
The parameters of the invention are not limited to these; only one possible configuration is listed here, and those skilled in the art can set the above parameters according to actual needs.
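When reproducing Table 1, the stem sizes can be sanity-checked with dummy data, as in the sketch below; the strides and padding are inferred from the listed feature-map sizes and are therefore assumptions:

```python
import torch
import torch.nn as nn

stem = nn.Sequential(
    # Conv3D row: 7x7x7 kernel; a stride of (1, 2, 2) reproduces
    # 8x3x16x128x128 -> 8x64x16x64x64.
    nn.Conv3d(3, 64, kernel_size=7, stride=(1, 2, 2), padding=3),
    # MaxPooling row: 3x3x3 kernel; stride 2 reproduces
    # 8x64x16x64x64 -> 8x64x8x32x32.
    nn.MaxPool3d(kernel_size=3, stride=2, padding=1),
)

x = torch.randn(8, 3, 16, 128, 128)
print(stem(x).shape)  # torch.Size([8, 64, 8, 32, 32])
```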
In the shot boundary detection method based on the three-dimensional dense network, dividing the video into frame segments means dividing it into segments each 16 frames long with an overlap of 8 frames; a sketch of this segmentation follows. The scope of the invention is not limited thereto, and those skilled in the art can choose a suitable segment length and overlap according to actual requirements.
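The segmentation itself reduces to a sliding window whose stride is the segment length minus the overlap (a minimal sketch; `frames` is any indexable frame sequence):

```python
def split_into_segments(frames, length=16, overlap=8):
    """Split a frame sequence into overlapping fixed-length segments."""
    stride = length - overlap
    return [frames[start:start + length]
            for start in range(0, len(frames) - length + 1, stride)]

# e.g. 64 frames -> segments starting at frames 0, 8, 16, ..., 48
print(len(split_into_segments(list(range(64)))))  # 7
```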
In the shot boundary detection method based on the three-dimensional dense network, the transition layer is arranged in the order: Batch Normalization → ReLU → convolution → average pooling layer.
In the shot boundary detection method based on the three-dimensional dense network, the bottleneck layer is arranged in the order: Batch Normalization → ReLU → convolution.
Advantageous effects:
The shot boundary detection method based on the three-dimensional dense network combines the spatio-temporal features of the video through three-dimensional convolution and uses a dense network for feature reuse, reducing the parameter count of the network. It can detect shot boundaries accurately (high detection accuracy) while also reducing computational complexity (low demands on computing power and equipment), and has great application prospects.
Drawings
FIG. 1 is a network architecture diagram of a three-dimensional dense network of the present invention;
FIG. 2 is a diagram of a shot boundary detection Block (SBD Block) according to the present invention;
FIG. 3 is a flowchart illustrating shot boundary detection according to the present invention;
FIG. 4 is a histogram comparison of the average accuracy of example 1 and comparative examples 1 to 2;
FIG. 5 is a line-graph comparison of the network parameter counts of example 1 and comparative examples 1 to 2;
FIG. 6 is a graph comparing training speeds of example 1 and comparative examples 1-2.
Detailed Description
The following further describes the embodiments of the present invention with reference to the attached drawings.
Example 1
A shot boundary detection method based on a three-dimensional dense network comprises, as shown in FIG. 3, the following steps:
(1) dividing a video into frame segments each 16 frames long with an overlap of 8, and randomly assigning labels; the frame segment labels fall into three classes: gradual, cut and unchanged;
(2) inputting the labeled frame segments into the three-dimensional dense network, and outputting the classified frame segments;
(3) merging the frame segments that carry the same label among the classified segments;
(4) performing a secondary detection on the segments labeled gradual: detecting the color histograms of the first and the last frame of each segment and measuring the Bhattacharyya distance between them; if the distance is less than or equal to 0.5, the segment is determined to be an unchanged segment;
(5) merging the frame segments processed in step (4) and outputting the final three classes of frame segments;
the three-dimensional dense network, shown in FIG. 1, comprises a three-dimensional convolution layer, a max pooling layer, four shot boundary detection blocks and a linear layer connected in sequence; the three-dimensional convolution layer is the input layer and the linear layer the output layer. The shot boundary detection block, shown in FIG. 2, comprises several groups of repeating units connected end to end; a repeating unit comprises a bottleneck layer and, as its output, a dense block with three-dimensional convolution, the output of each group of repeating units serving as the input of the next group. The bottleneck layer comprises sequentially connected Batch Normalization, ReLU and a 1 × 1 × 1 convolution; a transition layer is connected after each shot boundary detection block and comprises sequentially connected Batch Normalization, ReLU, a 1 × 1 convolution and a 2 × 2 average pooling layer. In this example the cross-entropy loss is selected as the loss function, so that the distribution of the predicted data is closer to the distribution of the real data; the compression coefficient θ is set to 0.5, the growth rate to 32, and the learning rate to 0.001 (a training sketch follows after the parameter list);
wherein, the model parameters of the three-dimensional dense network are as follows:
the size of the input feature map for the three-dimensional convolutional layer is 8 × 3 × 16 × 128 × 128;
the size of the feature map of the output of the three-dimensional convolutional layer is 8 × 64 × 16 × 64 × 64;
the size of the feature map for the output of the max pooling layer is 8 × 64 × 8 × 32 × 32;
the size of the feature map of the output of the first shot boundary detection block is 8 × 32 × 8 × 32 × 32;
the size of the feature map of the output of the second shot boundary detection block is 8 × 32 × 4 × 16 × 16;
the size of the feature map of the output of the third shot boundary detection block is 8 × 32 × 2 × 8 × 8;
the size of the feature map of the output of the fourth shot boundary detection block is 8 × 32 × 1 × 4 × 4;
the size of the feature map of the output of the linear layer is 1 × 3.
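A sketch of this training configuration follows; the cross-entropy loss, the batch of 8 segments and the learning rate 0.001 come from this example, while the choice of SGD as optimizer and the stand-in model are assumptions:

```python
import torch
import torch.nn as nn

# Stand-in for the 3D DenseNet described above (the real model outputs
# one score per class: gradual, cut, unchanged).
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 16 * 128 * 128, 3))
criterion = nn.CrossEntropyLoss()   # cross-entropy loss from this example
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

clips = torch.randn(8, 3, 16, 128, 128)   # one batch of labeled segments
labels = torch.randint(0, 3, (8,))        # class indices (coding arbitrary)

optimizer.zero_grad()
loss = criterion(model(clips), labels)
loss.backward()
optimizer.step()
```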
The test environment of the invention was as follows: a GTX 1080Ti graphics card, 16 GB of memory, a Linux operating system and Python as the programming language. Three data sets were selected for testing: the UCF101_SBD, TRECVID and ClipShots data sets.
To measure the detection effect objectively, the detection precision P (precision), the recall rate R (recall) and the comprehensive evaluation index F1 are computed. Precision is the proportion of correctly detected shots among all detected shots; recall is the proportion of correctly detected shots among the total number of shots; and F1 evaluates the combined performance of precision and recall:
P = TP / (TP + FP)
R = TP / (TP + FN)
F1 = 2 × P × R / (P + R)
where TP denotes true positives (correctly detected boundaries), FP denotes false positives (detections that are not real boundaries), and FN denotes false negatives (real boundaries that were missed).
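The three indices translate directly into code (the counts in the example call are illustrative):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute P, R and F1 from true/false positive and false negative counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

print(precision_recall_f1(tp=90, fp=10, fn=20))
# (0.9, 0.8181..., 0.8571...)
```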
Comparative example 1
A shot boundary detection method using the C3D ConvNet approach described in "Large-scale, Fast and Accurate Shot Boundary Detection through Spatio-temporal Convolutional Neural Networks", tested on the same three data sets as Example 1.
Comparative example 2
A shot boundary detection method using the C3D + ResNet approach described in "Fast Video Shot Transition Localization with Deep Structured Models", tested on the same three data sets as Example 1.
The test results of example 1 and comparative examples 1 to 2 are shown in tables 2 to 4:
TABLE 2
Experimental results on UCF101_ SBD dataset
[Table 2 appears only as an image in the original document.]
TABLE 3
Experimental results on TRECVID data set
[Table 3 appears only as an image in the original document.]
TABLE 4
Experimental results on ClipShots data set
[Table 4 appears only as an image in the original document.]
As can be seen from Tables 2 and 3, on the UCF101_SBD and TRECVID data sets the shot boundary detection method based on the three-dimensional dense network achieves higher detection accuracy than both the C3D ConvNet method (Comparative Example 1) and the C3D + ResNet method (Comparative Example 2). The experimental results in Table 4 show that even when the amount of detection data is huge, the accuracy of the method remains higher than that of the other two methods, indicating that the method also performs well on large data sets. FIG. 4 compares histograms of the average accuracy of the three methods on the three data sets; in FIGS. 4-6, C3D ConvNet is Comparative Example 1, C3D + ResNet is Comparative Example 2 and Ours is Example 1. It can be clearly seen that the average accuracy of the method on both cuts and gradual transitions is superior to the two prior-art methods.
The numbers of network parameters of the three methods of Example 1 and Comparative Examples 1-2 are shown in Table 5, and FIG. 5 compares the parameter counts converted to millions (M). The number of parameters used by the shot boundary detection method based on the three-dimensional dense network is far smaller than that of the other two methods, which greatly reduces the video memory occupied during computation; FIG. 6 shows that, for the same number of training epochs, the training time of our method is the shortest.
TABLE 5
Number of network parameters
Methods Parameters
Comparative example 1 219687875
Comparative example 2 33205443
Example 1 4234771
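For reference, a parameter count like those in Table 5 can be obtained in PyTorch by summing element counts over a model's parameters (a generic sketch, not tied to the patent's exact implementation):

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Total number of weights, as reported in Table 5."""
    return sum(p.numel() for p in model.parameters())

# e.g. a single 7x7x7 Conv3d layer with 3 input and 64 output channels:
print(count_parameters(nn.Conv3d(3, 64, kernel_size=7)))  # 65920
```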
Although specific embodiments of the present invention have been described above, it will be appreciated by those skilled in the art that these embodiments are merely illustrative and various changes or modifications may be made without departing from the principles and spirit of the invention.

Claims (7)

1. The shot boundary detection method based on the three-dimensional dense network is characterized by comprising the following steps of:
(1) dividing a video into frame segments and then randomly assigning labels, wherein the frame segment labels are classified as gradual, cut and unchanged;
(2) inputting the labeled frame segments into a three-dimensional dense network, and outputting the classified frame segments;
(3) merging the frame segments that carry the same label among the classified segments;
(4) performing a secondary detection on the segments labeled gradual: detecting the color histograms of the first and the last frame of each segment and measuring the Bhattacharyya distance between them; if the distance is small enough, the segment is determined to be an unchanged segment;
(5) merging the frame segments processed in step (4) and outputting the final three classes of frame segments;
wherein the three-dimensional dense network comprises a three-dimensional convolution layer, a max pooling layer, four shot boundary detection blocks and a linear layer connected in sequence, the three-dimensional convolution layer being the input layer and the linear layer the output layer; each shot boundary detection block comprises several groups of repeating units connected end to end, a repeating unit comprising a bottleneck layer as input and a dense block with three-dimensional convolution as output, the output of each group of repeating units serving as the input of the next group; a transition layer is connected after each shot boundary detection block, the transition layer comprising Batch Normalization, a linear rectification function, a 1 × 1 convolution and a 2 × 2 average pooling layer.
2. The shot boundary detection method based on the three-dimensional dense network of claim 1, wherein the bottleneck layer comprises Batch Normalization, a linear rectification function and a 1 × 1 × 1 convolution.
3. The shot boundary detection method based on the three-dimensional dense network of claim 1, wherein the distance being small enough means that the distance is less than or equal to 0.5.
4. The shot boundary detection method based on the three-dimensional dense network as claimed in claim 1, wherein the model parameters of the three-dimensional dense network are:
the size of the input feature map for the three-dimensional convolutional layer is 8 × 3 × 16 × 128 × 128;
the size of the feature map of the output of the three-dimensional convolutional layer is 8 × 64 × 16 × 64 × 64;
the size of the feature map for the output of the max pooling layer is 8 × 64 × 8 × 32 × 32;
the size of the feature map of the output of the first shot boundary detection block is 8 × 32 × 8 × 32 × 32;
the size of the feature map of the output of the second shot boundary detection block is 8 × 32 × 4 × 16 × 16;
the size of the feature map of the output of the third shot boundary detection block is 8 × 32 × 2 × 8 × 8;
the size of the feature map of the output of the fourth shot boundary detection block is 8 × 32 × 1 × 4 × 4;
the size of the feature map of the output of the linear layer is 1 × 3.
5. The shot boundary detection method based on the three-dimensional dense network of claim 1, wherein dividing the video into frame segments means dividing the video into frame segments each 16 frames long with an overlap of 8.
6. The shot boundary detection method based on the three-dimensional dense network as claimed in claim 1, wherein the transition layers are arranged in the following order: batch Normalization → linear rectification function → convolution → average pooling layer.
7. The shot boundary detection method based on the three-dimensional dense network as claimed in claim 2, wherein the arrangement order of the bottleneck layers is as follows: batch Normalization → linear rectification function → convolution.
CN201910900958.9A 2019-09-23 2019-09-23 Shot boundary detection method based on three-dimensional dense network Active CN110460840B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910900958.9A CN110460840B (en) 2019-09-23 2019-09-23 Shot boundary detection method based on three-dimensional dense network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910900958.9A CN110460840B (en) 2019-09-23 2019-09-23 Shot boundary detection method based on three-dimensional dense network

Publications (2)

Publication Number Publication Date
CN110460840A CN110460840A (en) 2019-11-15
CN110460840B true CN110460840B (en) 2020-06-26

Family

ID=68492588

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910900958.9A Active CN110460840B (en) 2019-09-23 2019-09-23 Shot boundary detection method based on three-dimensional dense network

Country Status (1)

Country Link
CN (1) CN110460840B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110942105B (en) * 2019-12-13 2022-09-16 东华大学 Mixed pooling method based on maximum pooling and average pooling

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003077198A1 (en) * 2002-03-13 2003-09-18 Vision Optic Co., Ltd. System and method for registering spectacles image
CN106327513A (en) * 2016-08-15 2017-01-11 上海交通大学 Lens boundary detection method based on convolution neural network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102800095B (en) * 2012-07-17 2014-10-01 南京来坞信息科技有限公司 Lens boundary detection method
CN102982553A (en) * 2012-12-21 2013-03-20 天津工业大学 Shot boundary detecting method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003077198A1 (en) * 2002-03-13 2003-09-18 Vision Optic Co., Ltd. System and method for registering spectacles image
CN106327513A (en) * 2016-08-15 2017-01-11 上海交通大学 Lens boundary detection method based on convolution neural network

Also Published As

Publication number Publication date
CN110460840A (en) 2019-11-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant