CN110460840A - Shot boundary detection method based on three-dimensional dense network - Google Patents
Shot boundary detection method based on three-dimensional dense network
- Publication number
- CN110460840A CN110460840A CN201910900958.9A CN201910900958A CN110460840A CN 110460840 A CN110460840 A CN 110460840A CN 201910900958 A CN201910900958 A CN 201910900958A CN 110460840 A CN110460840 A CN 110460840A
- Authority
- CN
- China
- Prior art keywords
- layer
- output
- dimensional
- dense network
- convolution
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N17/00—Diagnosis, testing or measuring for television systems or their details
- H04N17/002—Diagnosis, testing or measuring for television systems or their details for television cameras
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Biomedical Technology (AREA)
- General Health & Medical Sciences (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Molecular Biology (AREA)
- Artificial Intelligence (AREA)
- Biophysics (AREA)
- Signal Processing (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a shot boundary detection method based on a three-dimensional dense network. The steps are as follows: the video is divided into frame segments that are randomly assigned labels, and the segments are then fed into the three-dimensional dense network for classification. The three-dimensional dense network comprises, connected in sequence, a three-dimensional convolution layer, a max-pooling layer, four shot boundary detection blocks, and a linear layer; the three-dimensional convolution layer is the input layer and the linear layer is the output layer. Each shot boundary detection block contains several repeating units connected end to end; a repeating unit consists of a bottleneck layer as its input and a dense block built from three-dimensional convolutions as its output, and the output of one repeating unit serves as the input of the next. A transition layer follows each shot boundary detection block and comprises Batch Normalization, ReLU, a convolution, and an average-pooling layer. The invention uses three-dimensional convolution to combine the spatio-temporal features of the video and uses the dense network for feature reuse, which not only improves detection accuracy but also reduces computational complexity.
Description
Technical field
The invention belongs to the field of video content analysis, and relates to a shot boundary detection technique usable in video analysis and retrieval, in particular to a shot boundary detection method based on a three-dimensional dense convolutional network (3D DenseNet).
Background art
The rapid development of computer and multimedia technology has produced massive amounts of video data, and how to find the desired information in this mass of video has become a hot research topic for video retrieval technology. The first step of video retrieval is feature extraction, and feature extraction first requires the video to be segmented into shots; shot boundary detection is an important means of video segmentation. Shot transitions generally fall into two kinds: gradual transitions and cuts. A gradual transition changes gradually between adjacent shots and lasts a dozen to several dozen frames; a cut is abrupt, with the next shot immediately following the previous one. Shot boundary detection is now widely applied in industries such as digital television, traffic monitoring, electronic policing, bank surveillance, business information management, and national security. Commercial applications can bring great economic benefit, and applications in national security help safeguard social stability and development.
Common shot boundary detection methods include histogram methods, threshold methods, mutual-information methods, support vector machines, and deep learning, and those skilled in the art have done much research on each. "Fast Video Shot Boundary Detection Based on SVD and Pattern Matching" (International Workshop on Systems, IEEE, 2007) extracts the HSV color histogram of each video frame as a feature and describes the histograms with singular value decomposition; its computational complexity is low and detection is fast, but detection accuracy is unsatisfactory. "Information theory-based shot cut/fade detection and video summarization" (IEEE Transactions on Circuits and Systems for Video Technology, 2006, 16(1): 82-91) uses mutual information and joint entropy to describe inter-frame continuity and finds shots by comparing the similarity of adjacent frames against a global threshold; because it ignores local content variation, its accuracy suffers. "Shot Boundary Detection by a Hierarchical Supervised Approach" (International Workshop on Systems, IEEE, 2007) uses a support vector machine as a classifier to separate shot boundaries from non-boundaries, with unsatisfactory results. "Learning Spatiotemporal Features with 3D Convolutional Networks" (International Conference on Computer Vision (ICCV), 2015, 4489-4497) shows that 3D convolutional networks are well suited to learning on large-scale video datasets and are easy to train and use. "Large-scale, Fast and Accurate Shot Boundary Detection through Spatio-temporal Convolutional Neural Networks" (arXiv preprint arXiv:1705.03281, 2017) feeds fixed-length segments into a C3D network and classifies them into three classes (gradual, cut, and no transition); this demonstrates the effectiveness of ConvNets on the task, but it cannot localize shot boundaries when handling gradual transitions of different scales. "Ridiculously Fast Shot Boundary Detection with Fully Convolutional Neural Networks" (arXiv preprint arXiv:1705.08214, 2017) uses a fully convolutional network that takes an entire video sequence as input and assigns positive labels to transition frames, thereby detecting shot boundaries, but it does not solve the localization problem across scales. "Fast Video Shot Transition Localization with Deep Structured Models" (arXiv preprint arXiv:1808.04234, 2018) builds a detection framework of three parts (an initial filter, cut detection, and gradual-transition detection) and cascades a C3D ConvNet with a ResNet-18 network, improving speed toward real time, but the redundancy that appears as the network deepens remains unsolved.
Deep convolutional neural networks understand the high-level semantic information of images better, and applying them to video shot boundary detection can yield good results. Current feature extraction networks mainly use 2D convolution, which is designed for images; when analyzing video it ignores temporal information and loses inter-frame cues. As network models deepen, using 3D convolution for feature extraction gives better detection, but deeper networks bring heavy computation and reduced efficiency.
Therefore, developing a shot boundary detection method with a small computational load, high efficiency, and good detection performance is of great practical significance.
Summary of the invention
The object of the invention is to overcome the defects of the prior art, namely poor detection performance, heavy computation, and low efficiency, and to provide a shot boundary detection method with a small computational load, high efficiency, and good detection performance.
To achieve the above object, the invention provides the following technical scheme:
A shot boundary detection method based on a three-dimensional dense network, with steps as follows:
(1) the video is divided into frame segments, which are randomly assigned labels;
(2) the labeled frame segments are fed into the three-dimensional dense network for training, and the classified frame segments are output.
The three-dimensional dense network (3D DenseNet) comprises, connected in sequence, a three-dimensional convolution layer (Conv3D), a max-pooling layer (MaxPooling), four shot boundary detection blocks (SBD Block), and a linear layer (Linear, which outputs 3 class features); the three-dimensional convolution layer is the input layer and the linear layer is the output layer. Each shot boundary detection block (SBD Block) contains several repeating units connected end to end; a repeating unit consists of a bottleneck layer (Bottleneck) as its input and a dense block (Dense Block) built from three-dimensional convolutions as its output, and the output of one repeating unit serves as the input of the next. Replacing the original 2D convolutions of the Dense Block with 3D convolutions lets the network combine the spatio-temporal features of the video and improves detection accuracy. A transition layer (Transition) follows each shot boundary detection block and comprises Batch Normalization (BN), ReLU (the activation function), a 1 × 1 convolution, and a 2 × 2 average-pooling layer (AvgPooling).
Traditional feature extraction networks use 2D convolution, commonly applied to images, which ignores temporal information when analyzing video and loses inter-frame cues. The invention adopts 3D convolution, an extension of 2D convolution, for feature extraction: a time dimension is added so that the temporal and spatial information of the video can be extracted directly and the motion information of the video captured. For single-channel 2D convolution, the input image has 1 channel, the input size is (1, height, width), and the kernel size is (1, k_h, k_w); the kernel slides over the spatial dimensions of the input image, and at each position the values in the (k_h, k_w) window are convolved with the kernel to produce one value of the output image. For the multi-channel case, assume the input image has 3 channels, the input size is (3, height, width), and the kernel size is (3, k_h, k_w); the kernel slides over the spatial dimensions of the input image, and at each position all values in the (k_h, k_w) windows of the 3 channels are convolved to produce one output value. 3D convolution likewise comes in single-channel and multi-channel forms. The single-channel case differs from 2D convolution in that the input size is (1, time, height, width), with added temporal information; the kernel also gains a k_t dimension, so it slides over both the spatial and temporal dimensions of the input video. Multi-channel 3D convolution similarly adds temporal information to the multi-channel 2D input: at each position all values in the (k_t, k_h, k_w) windows of the 3 channels are convolved to produce one output value. Because it considers both the temporal and spatial information of the video, 3D convolution is better suited than 2D convolution to video feature extraction; 3D convolution and 3D pooling model spatio-temporal features better and suit classification tasks. However, as a convolutional neural network deepens, vanishing gradients and model degradation become more serious.
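As an illustrative sketch only (not the patent's implementation), the single-channel 3D sliding-window operation described above can be written in NumPy; the "valid" output size and the element-wise multiply-accumulate (cross-correlation, as deep-learning frameworks compute it) are assumptions of this sketch:

```python
import numpy as np

def conv3d_single_channel(video, kernel):
    # video: (T, H, W); kernel: (k_t, k_h, k_w).
    # The kernel slides over the temporal AND both spatial dimensions,
    # producing one output value per window position ("valid" padding).
    T, H, W = video.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                window = video[t:t + kt, i:i + kh, j:j + kw]
                out[t, i, j] = np.sum(window * kernel)
    return out
```

A 2D convolution is the special case k_t = T = 1, which is exactly why it cannot see motion across frames.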
The DenseNet used in the invention exploits its distinctive dense connectivity: while gaining depth it guarantees gradient propagation, greatly alleviating these problems. In a traditional convolutional neural network, when a later layer needs the output features of an earlier layer, those features must be extracted again by convolution; with the dense connections in DenseNet no re-extraction is needed and the features can be passed directly to subsequent layers, which greatly reduces the number of parameters and the amount of computation.
The DenseNet of the invention consists mainly of two parts, the Dense Block and the Transition layer. Within a Dense Block the feature maps of every layer have the same size and can therefore be concatenated along the channel dimension. Unlike other networks, every layer in every Dense Block outputs l feature maps after its convolution, i.e. the resulting feature maps have l channels, meaning l convolution kernels are used; l is the growth rate of the DenseNet. The Transition layer connects two adjacent Dense Blocks and reduces the size of the feature maps, compressing the model and preventing over-fitting. It aggregates feature maps from different layers, and its particular design in the invention (including the average-pooling layer and the convolution parameters) realizes feature reuse and improves efficiency. In addition, the SBD Block can extract features from the video directly, reducing the loss of video features.
The invention uses three-dimensional convolution to combine the spatio-temporal features of the video and uses the dense network for feature reuse, which reduces the parameter count of the network; it not only detects shot boundaries accurately but also lowers computational complexity.
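The feature-reuse idea above can be sketched in NumPy as channel-wise concatenation. This is a toy illustration, not the patent's network: `toy_layer` is a hypothetical stand-in for the real BN → ReLU → 3D-convolution layer, and only the channel bookkeeping matches the description:

```python
import numpy as np

def dense_block(x, num_layers, growth_rate, layer_fn):
    # Every layer sees the concatenation (along the channel axis) of the
    # block input and all earlier layers' outputs -- features are reused,
    # not re-extracted -- and contributes `growth_rate` new feature maps.
    features = [x]
    for _ in range(num_layers):
        inp = np.concatenate(features, axis=0)   # channel axis
        new = layer_fn(inp, growth_rate)          # (growth_rate, H, W)
        features.append(new)
    return np.concatenate(features, axis=0)

def toy_layer(inp, growth_rate):
    # hypothetical stand-in for BN -> ReLU -> conv: average the input
    # channels and replicate the result into `growth_rate` maps
    return np.repeat(inp.mean(axis=0, keepdims=True), growth_rate, axis=0)
```

With a block input of c0 channels, the output has c0 + num_layers × growth_rate channels, which is why the Transition layer is needed to compress the model between blocks.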
As a preferred technical scheme:
In the shot boundary detection method based on a three-dimensional dense network described above, the bottleneck layer comprises Batch Normalization (BN), ReLU (the activation function), and a 1 × 1 × 1 convolution; the 1 × 1 × 1 convolution reduces the number of features and improves computational efficiency. As the number of layers grows, the input to a Dense Block can become very large, so the Bottleneck is used to reduce computation, and together with the Transition layer it compresses the model as far as possible. Although the dense connections of DenseNet make full use of all information from earlier layers, the complexity of the computing system would grow greatly if the kernel size were not controlled. The Bottleneck's 1 × 1 × 1 convolution therefore linearly combines the features of different channels to achieve dimensionality reduction, and the classified frame segments are output at the last layer.
In the method described above, the frame segment labels fall into three classes, namely gradual transition, cut, and no transition.
In the method described above, the classified frame segments must be further processed to obtain the final three classes of frame segments, as follows:
(i) merge the frame segments with the same label among the classified segments;
(ii) perform a secondary detection on the frame segments with a given label: compute the color histogram of each segment's first frame and last frame, and measure the Bhattacharyya distance between the histograms; if the distance is sufficiently small, the segment is deemed a no-transition segment (OpenCV can be used to compute the color histograms and the Bhattacharyya distance, and its processing is fast);
(iii) merge the frame segments processed in step (ii) and output the final three classes of frame segments.
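Step (i) and step (iii) are a run-length merge of consecutive same-label segments. A minimal sketch (segment boundaries and label names are illustrative assumptions, not the patent's data format):

```python
def merge_segments(segments):
    # segments: list of (start_frame, end_frame, label), sorted by start.
    # Consecutive or overlapping segments with the same label are fused
    # into one segment spanning their union.
    merged = []
    for start, end, label in segments:
        if merged and merged[-1][2] == label and start <= merged[-1][1] + 1:
            prev_start, prev_end, _ = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), label)
        else:
            merged.append((start, end, label))
    return merged
```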
In the method described above, "sufficiently small" means the distance is less than or equal to 0.5.
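The histogram comparison can be sketched in NumPy as follows. This is an illustrative assumption-laden sketch: the 16-bin-per-channel histogram is a hypothetical choice, and the distance uses the textbook Bhattacharyya formula for normalized histograms (OpenCV's `cv2.compareHist` with `HISTCMP_BHATTACHARYYA`, mentioned in the text, computes a related quantity):

```python
import numpy as np

def color_histogram(frame, bins=16):
    # frame: (H, W, 3) uint8 image; per-channel histograms are normalized
    # and concatenated into one feature vector summing to 1.
    hists = [np.histogram(frame[..., c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    h = np.concatenate(hists).astype(float)
    return h / h.sum()

def bhattacharyya_distance(h1, h2):
    # sqrt(1 - Bhattacharyya coefficient); 0 for identical histograms,
    # 1 for non-overlapping ones. Distance <= 0.5 -> "no transition".
    bc = np.sum(np.sqrt(h1 * h2))
    return np.sqrt(max(0.0, 1.0 - bc))
```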
In the method described above, the model parameters of the three-dimensional dense network are:
the input feature map of the three-dimensional convolution layer has size 8 × 3 × 16 × 128 × 128;
the output feature map of the three-dimensional convolution layer has size 8 × 64 × 16 × 64 × 64;
the output feature map of the max-pooling layer has size 8 × 64 × 8 × 32 × 32;
the output feature map of the first shot boundary detection block has size 8 × 32 × 8 × 32 × 32;
the output feature map of the second shot boundary detection block has size 8 × 32 × 4 × 16 × 16;
the output feature map of the third shot boundary detection block has size 8 × 32 × 2 × 8 × 8;
the output feature map of the fourth shot boundary detection block has size 8 × 32 × 1 × 4 × 4;
the output feature map of the linear layer has size 1 × 3.
The parameters are listed in Table 1 below, where BRA is the abbreviation of Batch Normalization, ReLU, and AvgPooling:
Table 1
Layer | Kernel | Feature map | Followed by |
---|---|---|---|
Input | - | 8×3×16×128×128 | - |
Conv3D | 7×7×7 | 8×64×16×64×64 | BN, ReLU |
MaxPooling | 3×3×3 | 8×64×8×32×32 | BN, ReLU |
SBD Block 1 | - | 8×32×8×32×32 | BRA |
SBD Block 2 | - | 8×32×4×16×16 | BRA |
SBD Block 3 | - | 8×32×2×8×8 | BRA |
SBD Block 4 | - | 8×32×1×4×4 | BRA |
Linear | - | 1×3 | - |
The parameters of the invention are not limited to the above; only one feasible set is enumerated here, and those skilled in the art can set the parameters according to actual needs.
In the method described above, dividing the video into frame segments means splitting the video into segments of 16 frames each, with adjacent segments overlapping by 8 frames. The scope of protection of the invention is not limited to this; those skilled in the art can choose a suitable segment size and degree of overlap according to actual needs.
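The segmentation scheme above (16-frame windows, 8 frames shared between neighbors, i.e. a stride of 8) can be sketched as:

```python
def split_into_segments(num_frames, seg_len=16, overlap=8):
    # Adjacent segments share `overlap` frames, so the window advances
    # by seg_len - overlap frames each step. Returns (start, end) pairs
    # with `end` exclusive; a trailing remainder shorter than seg_len
    # is dropped in this sketch.
    stride = seg_len - overlap
    segments = []
    start = 0
    while start + seg_len <= num_frames:
        segments.append((start, start + seg_len))
        start += stride
    return segments
```

The overlap ensures that a transition falling near a segment border is still fully contained in the neighboring window.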
In the method described above, the order of the transition layer is as follows: Batch Normalization → ReLU → convolution → average-pooling layer.
In the method described above, the order of the bottleneck layer is as follows: Batch Normalization → ReLU → convolution.
Beneficial effects:
The shot boundary detection method based on a three-dimensional dense network of the invention uses three-dimensional convolution to combine the spatio-temporal features of the video and uses the dense network for feature reuse, reducing the parameter count of the network. It not only detects shot boundaries accurately (high detection accuracy) but also lowers computational complexity (small computing-power requirements, low equipment requirements), and has great application prospects.
Brief description of the drawings
Fig. 1 is the network structure of the three-dimensional dense network of the invention;
Fig. 2 is the schematic diagram of the shot boundary detection block (SBD Block) of the invention;
Fig. 3 is the flow chart of shot boundary detection according to the invention;
Fig. 4 is the bar chart comparing the mean accuracy of Embodiment 1 and Comparative Examples 1-2;
Fig. 5 is the line chart comparing the network parameter counts of Embodiment 1 and Comparative Examples 1-2;
Fig. 6 is the comparison chart of the training speed of Embodiment 1 and Comparative Examples 1-2.
Detailed description of the embodiments
A specific embodiment of the invention is further elaborated below with reference to the accompanying drawings.
Embodiment 1
A shot boundary detection method based on a three-dimensional dense network, with steps as shown in Fig. 3:
(1) the video is divided into segments of 16 frames each, with adjacent segments overlapping by 8 frames, and labels are randomly assigned; the labels fall into three classes, namely gradual transition, cut, and no transition;
(2) the labeled frame segments are fed into the three-dimensional dense network, and the classified frame segments are output;
(3) the frame segments with the same label among the classified segments are merged;
(4) a secondary detection is performed on the frame segments with a given label: the color histogram of each segment's first frame and last frame is computed, the Bhattacharyya distance between the histograms is measured, and a segment whose distance is less than or equal to 0.5 is deemed a no-transition segment;
(5) the frame segments processed in step (4) are merged and the final three classes of frame segments are output.
The three-dimensional dense network, shown in Fig. 1, comprises, connected in sequence, a three-dimensional convolution layer, a max-pooling layer, four shot boundary detection blocks, and a linear layer; the three-dimensional convolution layer is the input layer and the linear layer is the output layer. The shot boundary detection block, shown in Fig. 2, contains several repeating units connected end to end; a repeating unit consists of a bottleneck layer as its input and a dense block built from three-dimensional convolutions as its output, and the output of one repeating unit serves as the input of the next. The bottleneck layer comprises, connected in sequence, Batch Normalization, ReLU, and a 1 × 1 × 1 convolution. A transition layer follows each shot boundary detection block and comprises, connected in sequence, Batch Normalization, ReLU, a 1 × 1 convolution, and a 2 × 2 average-pooling layer. This example uses the cross-entropy loss function as the loss, making the predicted distribution closer to the distribution of the real data. The compression coefficient θ is set to 0.5, the growth rate to 32, and the learning rate to 0.001.
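The interplay of the growth rate and the compression coefficient θ is simple channel arithmetic, sketched below; the layer counts in the test are hypothetical examples, not the patent's configuration:

```python
def dense_block_channels(c_in, num_layers, growth_rate=32):
    # inside a dense block, every layer appends `growth_rate` feature
    # maps to the running concatenation
    return c_in + num_layers * growth_rate

def transition_channels(c, theta=0.5):
    # the transition layer's 1x1 convolution compresses the channel
    # count by the compression coefficient theta
    return int(c * theta)
```

With θ = 0.5 each transition halves the channel count, which (together with its average pooling) keeps the model compact as blocks are stacked.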
The model parameters of the three-dimensional dense network are:
the input feature map of the three-dimensional convolution layer has size 8 × 3 × 16 × 128 × 128;
the output feature map of the three-dimensional convolution layer has size 8 × 64 × 16 × 64 × 64;
the output feature map of the max-pooling layer has size 8 × 64 × 8 × 32 × 32;
the output feature map of the first shot boundary detection block has size 8 × 32 × 8 × 32 × 32;
the output feature map of the second shot boundary detection block has size 8 × 32 × 4 × 16 × 16;
the output feature map of the third shot boundary detection block has size 8 × 32 × 2 × 8 × 8;
the output feature map of the fourth shot boundary detection block has size 8 × 32 × 1 × 4 × 4;
the output feature map of the linear layer has size 1 × 3.
The test environment of the invention is: graphics card GTX 1080Ti, 16 GB of memory, Linux operating system, programmed in Python. Three datasets (the UCF101_SBD dataset, the TRECVID dataset, and the ClipShots dataset) were selected for testing.
To measure the detection performance objectively, the detection precision P (Precision), the recall R (Recall), and the comprehensive evaluation index F1 were computed. Precision is the ratio of correctly detected shots to all detected shots, recall is the ratio of correctly detected shots to the total number of shots, and F1 is a joint evaluation of precision and recall: P = TP/(TP+FP), R = TP/(TP+FN), F1 = 2PR/(P+R), where TP denotes the correctly predicted samples, FP the incorrectly predicted samples, and FN the correct samples that were not predicted.
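The three metrics can be computed directly from the TP/FP/FN counts described above (the counts in the test are made-up illustrative values):

```python
def precision_recall_f1(tp, fp, fn):
    # precision: fraction of detected boundaries that are correct;
    # recall: fraction of true boundaries that were detected;
    # F1: harmonic mean of precision and recall.
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    return p, r, f1
```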
Comparative Example 1
A shot boundary detection method using the C3D ConvNet method described in "Large-scale, Fast and Accurate Shot Boundary Detection through Spatio-temporal Convolutional Neural Networks", tested on the same three test sets as Embodiment 1.
Comparative Example 2
A shot boundary detection method using the C3D+ResNet method described in "Fast Video Shot Transition Localization with Deep Structured Models", tested on the same three test sets as Embodiment 1.
The test results of Embodiment 1 and Comparative Examples 1-2 are compared in Tables 2-4:
Table 2
Experimental results on the UCF101_SBD dataset
Table 3
Experimental results on the TRECVID dataset
Table 4
Experimental results on the ClipShots dataset
Tables 2 and 3 show that, in tests on the UCF101_SBD and TRECVID datasets, the detection accuracy of the shot boundary detection method based on the three-dimensional dense network is higher than that of the C3D ConvNet method (Comparative Example 1) and the C3D+ResNet method (Comparative Example 2). The experimental results in Table 4 show that even with much larger detection data the detection accuracy of the method of the invention remains higher than that of the other two methods, indicating that it also performs well on large datasets. Fig. 4 compares the mean accuracy of the three methods on the three datasets; in Figs. 4-6, C3D ConvNet is Comparative Example 1, C3D+ResNet is Comparative Example 2, and Ours is Embodiment 1. It is also clear that the mean accuracy of the method of the invention on both cuts and gradual transitions is better than that of the two prior-art methods.
The network parameter counts of the three methods of Embodiment 1 and Comparative Examples 1-2 are shown in Table 5, and Fig. 5 compares the parameter counts converted to units of M. The parameter count of the shot boundary detection method based on the three-dimensional dense network is far smaller than that of the other two methods, which greatly saves the video memory occupied during computation. As shown in Fig. 6, our method has the shortest training time over the same number of training epochs, indicating that the shot boundary detection method based on the three-dimensional dense network has lower computational complexity and outperforms the other algorithms; it can significantly reduce equipment requirements while shortening processing time and raising working efficiency, and has great application prospects.
Table 5
The number of network parameters

Methods | Parameters |
---|---|
Comparative Example 1 | 219687875 |
Comparative Example 2 | 33205443 |
Embodiment 1 | 4234771 |
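Taking the counts from Table 5, the gap can be quantified; the 4-bytes-per-parameter (float32) storage estimate is an assumption of this sketch, not a figure from the patent:

```python
# parameter counts as reported in Table 5
params = {
    "C3D ConvNet (Comparative Example 1)": 219687875,
    "C3D+ResNet (Comparative Example 2)": 33205443,
    "3D DenseNet (Embodiment 1)": 4234771,
}

def ratio_to_ours(name, ours="3D DenseNet (Embodiment 1)"):
    # how many times more parameters a baseline uses than our network
    return params[name] / params[ours]

def size_mb(name, bytes_per_param=4):
    # rough float32 storage footprint in megabytes (assumed precision)
    return params[name] * bytes_per_param / 1e6
```

By this arithmetic the C3D ConvNet baseline carries roughly 50 times the parameters of Embodiment 1, consistent with the memory savings claimed above.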
Although specific embodiments of the present invention have been described above, it should be appreciated by those skilled in the art that these are merely illustrative, and that a variety of changes or modifications can be made to these embodiments without departing from the principle and substance of the invention.
Claims (9)
1. A shot boundary detection method based on a three-dimensional dense network, characterized in that the steps are as follows:
(1) the video is divided into frame segments, which are randomly assigned labels;
(2) the labeled frame segments are fed into the three-dimensional dense network, and the classified frame segments are output;
the three-dimensional dense network comprises, connected in sequence, a three-dimensional convolution layer, a max-pooling layer, four shot boundary detection blocks, and a linear layer; the three-dimensional convolution layer is the input layer and the linear layer is the output layer; each shot boundary detection block contains several repeating units connected end to end; a repeating unit consists of a bottleneck layer as its input and a dense block built from three-dimensional convolutions as its output, and the output of one repeating unit serves as the input of the next; a transition layer follows each shot boundary detection block, and the transition layer comprises Batch Normalization, ReLU, a 1 × 1 convolution, and a 2 × 2 average-pooling layer.
2. The shot boundary detection method based on a three-dimensional dense network according to claim 1, characterized in that the bottleneck layer comprises Batch Normalization, ReLU, and a 1 × 1 × 1 convolution.
3. The shot boundary detection method based on a three-dimensional dense network according to claim 1, characterized in that the frame segment labels fall into three classes, namely gradual transition, cut, and no transition.
4. The shot boundary detection method based on a three-dimensional dense network according to claim 3, characterized in that the classified frame segments must be further processed to obtain the final three classes of frame segments, as follows:
(i) merge the frame segments with the same label among the classified segments;
(ii) perform a secondary detection on the frame segments with a given label: compute the color histogram of each segment's first frame and last frame, measure the Bhattacharyya distance between the histograms, and deem a segment a no-transition segment if the distance is sufficiently small;
(iii) merge the frame segments processed in step (ii) and output the final three classes of frame segments.
5. The shot boundary detection method based on a three-dimensional dense network according to claim 4, characterized in that "sufficiently small" means the distance is less than or equal to 0.5.
6. The shot boundary detection method based on a three-dimensional dense network according to claim 1, characterized in that the model parameters of the three-dimensional dense network are:
the input feature map of the three-dimensional convolution layer has size 8 × 3 × 16 × 128 × 128;
the output feature map of the three-dimensional convolution layer has size 8 × 64 × 16 × 64 × 64;
the output feature map of the max-pooling layer has size 8 × 64 × 8 × 32 × 32;
the output feature map of the first shot boundary detection block has size 8 × 32 × 8 × 32 × 32;
the output feature map of the second shot boundary detection block has size 8 × 32 × 4 × 16 × 16;
the output feature map of the third shot boundary detection block has size 8 × 32 × 2 × 8 × 8;
the output feature map of the fourth shot boundary detection block has size 8 × 32 × 1 × 4 × 4;
the output feature map of the linear layer has size 1 × 3.
7. The shot boundary detection method based on a three-dimensional dense network according to claim 1, characterized in that dividing the video into frame segments means splitting the video into segments of 16 frames each, with adjacent segments overlapping by 8 frames.
8. The shot boundary detection method based on a three-dimensional dense network according to claim 1, characterized in that the transition layer is arranged in the following order: Batch Normalization → ReLU → convolution → average pooling layer.
9. The shot boundary detection method based on a three-dimensional dense network according to claim 2, characterized in that the bottleneck layer is arranged in the following order: Batch Normalization → ReLU → convolution.
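The orderings in claims 8 and 9 follow the pre-activation pattern common in dense networks: normalize and activate before convolving, with the transition layer adding average pooling at the end. The toy 1-D functions below are stand-ins for the real 3-D operations, included only to make the two orderings concrete.

```python
def batch_norm(xs):
    # Stand-in normalization: center the values at zero mean.
    m = sum(xs) / len(xs)
    return [x - m for x in xs]

def relu(xs):
    return [max(x, 0.0) for x in xs]

def conv(xs):
    # Stand-in 1-D "convolution": average of each adjacent pair.
    return [(a + b) / 2 for a, b in zip(xs, xs[1:])]

def avg_pool(xs):
    # Average pooling, window 2, stride 2: halves the length.
    return [(xs[i] + xs[i + 1]) / 2 for i in range(0, len(xs) - 1, 2)]

def transition_layer(xs):   # claim 8: BN -> ReLU -> conv -> average pool
    return avg_pool(conv(relu(batch_norm(xs))))

def bottleneck_layer(xs):   # claim 9: BN -> ReLU -> conv
    return conv(relu(batch_norm(xs)))

print(transition_layer([1, 2, 3, 4, 5, 6, 7, 8]))  # [0.0, 0.125, 1.5]
```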
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910900958.9A CN110460840B (en) | 2019-09-23 | 2019-09-23 | Shot boundary detection method based on three-dimensional dense network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110460840A true CN110460840A (en) | 2019-11-15 |
CN110460840B CN110460840B (en) | 2020-06-26 |
Family
ID=68492588
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910900958.9A Active CN110460840B (en) | 2019-09-23 | 2019-09-23 | Shot boundary detection method based on three-dimensional dense network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110460840B (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2003077198A1 (en) * | 2002-03-13 | 2003-09-18 | Vision Optic Co., Ltd. | System and method for registering spectacles image |
JP2003271946A (en) * | 2002-03-13 | 2003-09-26 | Vision Megane:Kk | System and method for registering spectacle image |
CN102800095A (en) * | 2012-07-17 | 2012-11-28 | 南京特雷多信息科技有限公司 | Lens boundary detection method |
CN102982553A (en) * | 2012-12-21 | 2013-03-20 | 天津工业大学 | Shot boundary detecting method |
CN106327513A (en) * | 2016-08-15 | 2017-01-11 | 上海交通大学 | Lens boundary detection method based on convolution neural network |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110942105A (en) * | 2019-12-13 | 2020-03-31 | 东华大学 | Mixed pooling method based on maximum pooling and average pooling |
CN110942105B (en) * | 2019-12-13 | 2022-09-16 | 东华大学 | Mixed pooling method based on maximum pooling and average pooling |
CN114037874A (en) * | 2021-11-12 | 2022-02-11 | 中国科学院深圳先进技术研究院 | Three-dimensional image classification network and method and image processing equipment |
CN114037874B (en) * | 2021-11-12 | 2024-07-02 | 中国科学院深圳先进技术研究院 | Three-dimensional image classification network device, method and image processing equipment |
Also Published As
Publication number | Publication date |
---|---|
CN110460840B (en) | 2020-06-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103347167B (en) | A kind of monitor video content based on segmentation describes method | |
CN104866616B (en) | Monitor video Target Searching Method | |
CN104504377B (en) | A kind of passenger on public transport degree of crowding identifying system and method | |
CN104952073B (en) | Scene Incision method based on deep learning | |
CN101833664A (en) | Video image character detecting method based on sparse expression | |
CN103336957A (en) | Network coderivative video detection method based on spatial-temporal characteristics | |
CN105701466A (en) | Rapid all angle face tracking method | |
CN107688830B (en) | Generation method of vision information correlation layer for case serial-parallel | |
CN111027377A (en) | Double-flow neural network time sequence action positioning method | |
CN110460840A (en) | Lens boundary detection method based on three-dimensional dense network | |
CN103853794A (en) | Pedestrian retrieval method based on part association | |
CN108268875A (en) | A kind of image meaning automatic marking method and device based on data smoothing | |
CN112801037A (en) | Face tampering detection method based on continuous inter-frame difference | |
CN106022310B (en) | Human body behavior identification method based on HTG-HOG and STG characteristics | |
CN104537392A (en) | Object detection method based on distinguishing semantic component learning | |
CN110490170A (en) | A kind of face candidate frame extracting method | |
CN106127251A (en) | A kind of computer vision methods for describing face characteristic change | |
Wang et al. | A deep learning-based method for vehicle license plate recognition in natural scene | |
CN110363164A (en) | Unified method based on LSTM time consistency video analysis | |
CN114926764A (en) | Method and system for detecting remnants in industrial scene | |
CN110162654A (en) | It is a kind of that image retrieval algorithm is surveyed based on fusion feature and showing for search result optimization | |
Prabakaran et al. | Key frame extraction analysis based on optimized convolution neural network (ocnn) using intensity feature selection (ifs) | |
Li et al. | Pedestrian detection method based on multi-scale fusion inception-SSD model | |
CN107122714A (en) | A kind of real-time pedestrian detection method based on edge constraint | |
CN110689520A (en) | Magnetic core product defect detection system and method based on AI |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||