CN111461043B - Video saliency detection method based on deep network - Google Patents
Video saliency detection method based on deep network
- Publication number
- CN111461043B (application CN202010266351.2A)
- Authority
- CN
- China
- Prior art keywords
- video frame
- final
- obtaining
- video
- saliency
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06N3/045—Combinations of networks
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06T5/70—Denoising; Smoothing
- G06T7/13—Edge detection
- G06T2207/10016—Video; Image sequence
- G06T2207/10024—Color image
- G06T2207/20081—Training; Learning
- G06T2207/20084—Artificial neural networks [ANN]
- G06T2207/20221—Image fusion; Image merging
- G06V2201/07—Target detection
Abstract
The invention relates to a video saliency detection method based on a deep network, in the field of image data processing. A ResNet50 deep network is first used to extract spatial features, after which temporal and edge information are extracted to jointly obtain a saliency prediction map, completing deep-network-based video saliency detection. The method extracts an initial spatial feature map S from the video frame I′; obtains a five-scale spatial feature map S_final; obtains a feature map F; obtains a coarse spatio-temporal saliency map Y_ST and a salient-object edge map E_t; obtains the final saliency prediction map Y_final; and calculates the loss for the input video frame I, completing the deep-network-based video saliency detection. The method overcomes the defects of prior-art video saliency detection, namely incomplete salient-object detection and inaccurate detection when foreground and background colors are similar.
Description
Technical Field
The technical scheme of the invention relates to the field of image data processing, and in particular to a video saliency detection method based on a deep network.
Background
Video saliency detection aims to extract the regions of greatest interest to the human eye from successive video frames. Specifically, it uses a computer to simulate the visual attention mechanism of the human eye to extract the regions of interest from video frames, and is one of the key technologies in the field of computer vision.
Most conventional video saliency detection methods are based on low-level handcrafted features (e.g. color, texture). These are typically heuristic methods with two drawbacks: slow speed (due to time-consuming optical flow computation) and low prediction accuracy (due to the limited representational power of low-level features). In recent years, deep neural networks have been applied to video saliency detection; deep learning methods compute the saliency value of an image from high-level semantic features extracted by a convolutional neural network. However, a deep convolutional network can lose position and detail information of the target, introduce misleading information when detecting the salient target, and leave the detected target incomplete.
In 2016, Liu et al., in the article "Saliency detection for unconstrained videos using superpixel-level graph and spatiotemporal propagation", proposed the SGSP algorithm for video saliency detection using a superpixel-level graph model and spatiotemporal propagation. It first extracts superpixel-level motion and color histograms and a global motion histogram to construct a graph. Next, motion saliency is computed iteratively via shortest paths on the graph using a graph-model-based background prior. Saliency is then propagated forward and backward in time and locally and globally in space, and the two results are finally fused to form the final saliency map. The algorithm is computationally heavy, and the resulting saliency map still suffers from incomplete detection of salient targets. Deep learning models aim to obtain richer depth features with convolutional neural networks and thereby more accurate detection results. In 2017, Wang et al. proposed a video saliency detection method based on a fully convolutional network, the first use of a deep-learning fully convolutional network in the video saliency detection field; but because temporal information between frames is not considered, the edges of the obtained saliency maps are not fine enough and edge noise is large. CN106372636A discloses a video saliency detection method based on HOG_TOP, which computes HOG_TOP features on the three orthogonal planes XY, XT and YT of the original video, computes a spatial-domain saliency map on the XY plane and temporal-domain saliency maps on the XT and YT planes, and finally obtains the final saliency map by adaptive fusion.
CN109784183A discloses a video salient-object detection method based on a cascade convolutional network and optical flow, which uses a cascade network structure to perform pixel-level saliency prediction on the current frame at three scales (high, medium and low). The cascade network is trained with the MSRA10K image dataset, with the saliency label map as training supervision and a cross-entropy loss function. After training, the trained cascade network performs static saliency prediction on each frame of the video, and the Lucas-Kanade algorithm extracts an optical flow field. A dynamic optimization network is then built from a three-layer convolutional structure, and the static detection result and optical-flow detection result of each frame are concatenated as the input of the optimization network. This method is time-consuming, and the optical flow extracted by the Lucas-Kanade algorithm is inaccurate and lacks robustness in complex scenes. CN109118469A discloses a prediction method for video saliency, which quantizes the image to obtain a sparse-matrix response, obtains a decomposition matrix under local coordinate constraints, computes a saliency map for each frame in the video and performs quality prediction. The method loses some detail information of the salient target, so the prediction suffers from incomplete detection of the salient target. CN105913456B discloses a video saliency detection method by region segmentation, which first uses nonlinear clustering to obtain superpixel blocks for static features, then uses an optical flow method to obtain dynamic features, and finally predicts the saliency map with a linear regression model after fusing the two features.
CN109034001A discloses a cross-modal video saliency detection method based on spatio-temporal cues, which constructs a saliency map from weights of an initial saliency map, visible light and thermal infrared; a suitable weight value is difficult to find, resulting in poor robustness. CN108241854A discloses a depth video saliency detection method based on motion and memory information, which extracts local and global information from the human-eye attention map of the current frame and inputs them, as prior information together with the original image, into a deep network model to predict the final saliency map. CN110598537A discloses a video saliency detection method based on a deep convolutional network, which uses the current frame of a video and its corresponding optical flow image as the input of a feature extraction network to predict the final saliency map; this method needs to compute the optical flow of the current frame in advance, which is computationally expensive.
In summary, the prior art of video salient-object detection still has the problems that salient-object detection is incomplete, and that detection is inaccurate when foreground and background colors are similar.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: to provide a video saliency detection method based on a deep network, in which a ResNet50 deep network is first used to extract spatial features, then temporal and edge information are extracted to jointly obtain a saliency prediction map, completing deep-network-based video saliency detection and overcoming the prior-art defects of incomplete salient-object detection and inaccurate detection when foreground and background colors are similar.
The technical scheme adopted by the invention to solve this technical problem is a video saliency detection method based on a deep network: a ResNet50 deep network is first used to extract spatial features, then temporal and edge information are extracted to jointly obtain a saliency prediction map, completing the deep-network-based video saliency detection. The specific steps are as follows:
firstly, inputting a video frame I, and preprocessing:
Inputting video frames I and unifying their size to a width and height of 473 × 473 pixels; the mean of the corresponding channel is subtracted from each pixel value in the video frame I, where the mean of the R channel of each video frame I is 104.00698793, the mean of the G channel is 116.66876762, and the mean of the B channel is 122.67891434, so that the shape of the video frame I before input to the ResNet50 deep network is 473 × 473 × 3; the preprocessed video frame is denoted I′, as shown in the following formula (1):
I′=Resize(I-Mean(R,G,B)) (1),
In formula (1), Mean(R, G, B) is the mean of the red, green and blue color channels, and Resize(·) is a function that adjusts the size of the video frame;
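The preprocessing of formula (1), per-channel mean subtraction followed by resizing to 473 × 473, can be sketched in plain NumPy. This is an illustrative reconstruction, not the patent's code: the nearest-neighbour resize stands in for whatever interpolation the authors used, and the channel means are the values stated above.

```python
import numpy as np

# Channel means stated in the patent (R, G, B order).
MEAN_RGB = np.array([104.00698793, 116.66876762, 122.67891434], dtype=np.float64)

def nearest_resize(img: np.ndarray, size: int = 473) -> np.ndarray:
    """Nearest-neighbour resize of an H x W x 3 image to size x size."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size   # source row for each output row
    cols = np.arange(size) * w // size   # source column for each output column
    return img[rows[:, None], cols, :]

def preprocess(frame: np.ndarray) -> np.ndarray:
    """I' = Resize(I - Mean(R, G, B)), formula (1): subtract per-channel mean, resize."""
    return nearest_resize(frame.astype(np.float64) - MEAN_RGB)

frame = np.random.randint(0, 256, (360, 640, 3)).astype(np.float64)
out = preprocess(frame)
# out has shape (473, 473, 3) and each channel is roughly zero-mean
```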
secondly, extracting an initial spatial feature map S of the video frame I':
sending the video frame I' after the first step of preprocessing into a ResNet50 deep network to extract an initial spatial feature map S, wherein the formula (2) is as follows:
S=ResNet50(I′) (2),
In formula (2), ResNet50(·) denotes the ResNet50 deep network,
the ResNet50 deep network comprises a convolution layer, a pooling layer, a nonlinear activation function Relu layer and residual connection;
Thirdly, obtaining the five-scale spatial feature map S_final:
The initial spatial feature map S of the video frame I′ extracted in the second step is fed into four different dilated convolutions in the ResNet50 deep network, with dilation rates of 2, 4, 8 and 16, to obtain results T_k at four scales. These results are then concatenated with the initial spatial feature map S output by the ResNet50 deep network, finally obtaining the five-scale spatial feature map S_final,
Fourthly, obtaining a characteristic diagram F:
obtaining the space characteristic diagram S of five scales by the third step final The feature map F having a shape of 60 × 60 × 32 is obtained by a convolution operation with a convolution kernel of 3 × 3 × 32, as shown in the following equation (3),
F=BN(Relu(Conv(S final ))) (3),
in formula (3), conv (-) is a convolution operation, relu (-) is a nonlinear activation function, and BN (-) is a normalization operation;
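Formula (3) applies, in order, a convolution, a Relu, and a batch normalization. A minimal NumPy sketch of that pipeline illustrates the shapes involved; 64 input channels stand in for the 4096 of S_final, and the weights are random rather than learned.

```python
import numpy as np

def conv3x3(x, w):
    """'Same' 3x3 convolution: x is H x W x C_in, w is 3 x 3 x C_in x C_out."""
    h, wd, cin = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((h, wd, w.shape[-1]))
    for i in range(3):
        for j in range(3):
            # accumulate the contribution of kernel tap (i, j)
            out += xp[i:i + h, j:j + wd, :] @ w[i, j]
    return out

def relu(x):
    return np.maximum(x, 0.0)

def batch_norm(x, eps=1e-5):
    """Per-channel normalisation over the spatial dims (no learned scale/shift)."""
    mu = x.mean(axis=(0, 1), keepdims=True)
    var = x.var(axis=(0, 1), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

# F = BN(Relu(Conv(S_final))), formula (3): 60 x 60 x C -> 60 x 60 x 32
rng = np.random.default_rng(0)
s_final = rng.standard_normal((60, 60, 64))
w = rng.standard_normal((3, 3, 64, 32)) * 0.01
f = batch_norm(relu(conv3x3(s_final, w)))
# f has shape (60, 60, 32), matching the patent's feature map F
```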
Fifthly, obtaining the coarse spatio-temporal saliency map Y_ST and the salient-object edge map E_t:
The feature map F obtained in the fourth step is input simultaneously to the spatio-temporal branch and the edge detection branch, yielding a spatio-temporal feature map F_ST and the salient-object edge map E_t. The specific operations are as follows.
The feature map F obtained in the fourth step is input to the ConvLSTM of the spatio-temporal branch to obtain the spatio-temporal feature map F_ST, as shown in the following formula (4),
F_ST = ConvLSTM(F, H_{t-1}) (4),
In formula (4), ConvLSTM(·) is a ConvLSTM operation and H_{t-1} is the state of the ConvLSTM unit at the previous time step;
The obtained spatio-temporal feature map F_ST is then fed into a convolution layer with a 1 × 1 kernel to obtain the coarse spatio-temporal saliency map Y_ST, as follows:
Y_ST = Conv(F_ST) (5),
In formula (5), Conv(·) is a convolution operation;
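The spatio-temporal branch of formulas (4) and (5) relies on ConvLSTM, whose gates are convolutions rather than matrix products, so the hidden state carries temporal context while keeping spatial layout. The following is a NumPy sketch of a single ConvLSTM step with random weights; it is illustrative only, not the patent's implementation, and the gate layout is the standard i/f/o/g convention.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv3x3(x, w):
    """'Same' 3x3 convolution, x: H x W x C_in, w: 3 x 3 x C_in x C_out."""
    h, wd = x.shape[:2]
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((h, wd, w.shape[-1]))
    for i in range(3):
        for j in range(3):
            out += xp[i:i + h, j:j + wd, :] @ w[i, j]
    return out

def convlstm_step(x, h_prev, c_prev, w):
    """One ConvLSTM step: gates come from a 3x3 convolution over [x, h_prev]."""
    z = conv3x3(np.concatenate([x, h_prev], axis=-1), w)  # 4 * hidden channels
    hc = h_prev.shape[-1]
    i = sigmoid(z[..., :hc])           # input gate
    f = sigmoid(z[..., hc:2 * hc])     # forget gate
    o = sigmoid(z[..., 2 * hc:3 * hc]) # output gate
    g = np.tanh(z[..., 3 * hc:])       # candidate cell state
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(1)
hc, xc = 8, 16
w = rng.standard_normal((3, 3, xc + hc, 4 * hc)) * 0.05
h = c = np.zeros((60, 60, hc))
for _ in range(3):                      # a short clip of 3 frames
    x = rng.standard_normal((60, 60, xc))
    h, c = convlstm_step(x, h, c, w)    # F_ST = ConvLSTM(F, H_{t-1}), formula (4)
# h has shape (60, 60, 8) and accumulates temporal context across frames
```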
The feature map F obtained in the fourth step is input to the edge detection branch to obtain the salient-object edge map E_t. The specific operations are as follows.
The static features {X_t}, t = 1, …, T, of a T-frame input video are obtained through the ResNet50 deep network and the dilated convolutions, where X_t is the feature of the t-th video frame. Given X_t, the edge detection branch outputs the edge map E_t ∈ [0,1]^{W×H}, where W and H are the width and height of the predicted edge image. E_t is computed from the hidden state Ĥ_t of the edge detection network, which takes the previous video frames into account, as shown in formulas (6) and (7),
H_t = ConvLSTM(X_t, H_{t-1}) (6),
Ĥ_t = ConvLSTM(H_t, Ĥ_{t-1}) (7),
In formulas (6) and (7), Ĥ_t ∈ R^{W×H×M} is the 3D tensor hidden state, M is the number of channels, E′_t is the unweighted edge map, H_t is the current ConvLSTM unit state, H_{t-1} is the state of the ConvLSTM unit at the previous time step, and X_1 is the first video frame.
By embedding one ConvLSTM within another, the unweighted edge map E′_t is obtained from the hidden state Ĥ_t of the edge detection network, as shown in the following formula (8),
E′_t = Conv(Ĥ_t) (8),
The hidden state Ĥ_t of the edge detection network is then used for weighting to obtain the salient-object edge map E_t, as shown in the following formula (9),
E_t = σ(W_e ∗ Ĥ_t) ⊗ E′_t (9),
In formula (9), W_e is a 1 × 1 convolution kernel used to map the hidden state Ĥ_t of the edge detection network to a weight matrix, and the sigmoid function σ normalizes the matrix to [0, 1];
Thus, the coarse spatio-temporal saliency map Y_ST and the salient-object edge map E_t are obtained;
Sixthly, obtaining a final significance prediction result picture Y final :
The rough space-time saliency map Y obtained in the fifth step is used ST And edge profile E of salient objects t Fusing to obtain a final significance prediction result graph Y final As shown in the following equation (10),
in the formula (10), the first and second groups of the chemical reaction are shown in the formula,for matrix multiplication, σ is a sigmoid function, resize (·) is a function for adjusting the video frame size,
restoring the obtained video frame to 473 × 473 the size of the original input video frame;
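Assuming the fusion of formula (10) is an elementwise weighting of Y_ST by E_t followed by a sigmoid and an upsampling back to 473 × 473 (the exact operator is partly lost in the source text, which names only ⊗, σ and Resize), a NumPy sketch would be:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nearest_resize(img, size=473):
    """Nearest-neighbour upsample of a 2-D map to size x size."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows[:, None], cols]

def fuse(y_st, e_t):
    """Y_final = Resize(sigma(Y_ST (x) E_t)): weight the coarse map by the edge map
    elementwise, squash to (0, 1), then restore the 473 x 473 input resolution."""
    return nearest_resize(sigmoid(y_st * e_t))

rng = np.random.default_rng(2)
y_st = rng.standard_normal((60, 60))    # coarse spatio-temporal saliency
e_t = rng.uniform(0.0, 1.0, (60, 60))   # edge map, already in [0, 1]
y_final = fuse(y_st, e_t)
# y_final has shape (473, 473) with values strictly inside (0, 1)
```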
Seventh step, calculating the loss for the input video frame I:
A saliency map is computed for the input video frame I through the first to sixth steps. To measure the difference between the final saliency prediction map Y_final and the ground truth, a binary cross-entropy loss function L is adopted during training, as shown in the following formula (11),
L = −(1/N²) Σ_{i=1}^{N} Σ_{j=1}^{N} [G(i,j) log M(i,j) + (1 − G(i,j)) log(1 − M(i,j))] (11),
In formula (11), G(i,j) ∈ [0,1] is the ground-truth value of pixel (i,j), M(i,j) ∈ [0,1] is the predicted value of pixel (i,j), and N = 473 is taken.
The network is trained by continually reducing the value of L, and the stochastic gradient descent method is adopted to optimize the binary cross-entropy loss function L;
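The binary cross-entropy loss of formula (11) averages the pixelwise cross-entropy between the ground truth G and the prediction M over the N × N map. A direct NumPy implementation follows; the epsilon clip avoiding log(0) is an implementation detail not stated in the patent.

```python
import numpy as np

def bce_loss(g, m, eps=1e-7):
    """Binary cross-entropy of formula (11), averaged over all N x N pixels.
    g: ground truth in [0, 1]; m: predicted saliency in [0, 1]."""
    m = np.clip(m, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(g * np.log(m) + (1.0 - g) * np.log(1.0 - m))

rng = np.random.default_rng(3)
g = (rng.uniform(size=(473, 473)) > 0.5).astype(np.float64)
good = bce_loss(g, np.clip(g, 0.05, 0.95))  # near-perfect prediction
bad = bce_loss(g, np.full_like(g, 0.5))     # uninformative prediction
# good < bad: the loss shrinks as M approaches G, which is what SGD exploits
```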
This completes the deep-network-based video saliency detection.
In the video saliency detection method based on the deep network, the specific operations for obtaining the five-scale spatial feature map S_final are as follows:
The dilated convolution kernels in the ResNet50 deep network are denoted C_k ∈ R^{c×c×C}, k = 1, …, K, where K is the number of dilated convolution layers, c × c is the kernel width times height, C is the number of channels, and r_k is the dilation rate of the convolution, whose stride is set to 1. From these parameters four output feature maps T_k are derived, where W and H are their width and height respectively, as shown in the following formula (12),
T_k = C_k ⊛_{r_k} S, k = 1, …, K (12),
In formula (12), C_k is the k-th dilated convolution kernel, K is the number of dilated convolutions, ⊛_{r_k} is the dilated convolution operation, and S is the initial spatial feature map.
The shape of the initial spatial feature map S obtained from the ResNet50 deep network is 60 × 60 × 2048; K = 4 and k ranges over [1, 2, 3, 4]; the dilation rate r_k takes the four values r_k = 2, 4, 8, 16; and the dilated convolution kernels C_k all have shape 3 × 3 × 512, finally yielding feature maps T_k at four different scales, which are then concatenated in turn, as shown in the following formula (13),
S_final = [S, T_1, T_2, …, T_K] (13),
In formula (13), S_final is the final multi-scale spatial feature map, S is the initial spatial feature map extracted by the ResNet50 deep network, and T_K is the feature map obtained after dilated convolution; the five-scale spatial feature map S_final has shape 60 × 60 × 4096.
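Formulas (12) and (13), four dilated 3 × 3 convolutions at rates 2, 4, 8, 16 whose outputs are concatenated with the input, can be sketched in NumPy as follows. Channel counts are scaled down (32 in, 16 out per branch) so the demo stays light; the patent uses a 2048-channel input and 512-channel kernels.

```python
import numpy as np

def dilated_conv(x, w, rate):
    """'Same' dilated 3x3 convolution: the kernel taps are spaced `rate` pixels
    apart, enlarging the receptive field without extra parameters."""
    h, wd = x.shape[:2]
    p = rate  # padding needed for 'same' output with a dilated 3x3 kernel
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    out = np.zeros((h, wd, w.shape[-1]))
    for i in range(3):
        for j in range(3):
            out += xp[i * rate:i * rate + h, j * rate:j * rate + wd, :] @ w[i, j]
    return out

rng = np.random.default_rng(4)
s = rng.standard_normal((60, 60, 32))              # stands in for the 60x60x2048 map S
branches = []
for rate in (2, 4, 8, 16):                         # the four r_k of formula (12)
    w = rng.standard_normal((3, 3, 32, 16)) * 0.05
    branches.append(dilated_conv(s, w, rate))      # T_k = C_k (*)_{r_k} S
s_final = np.concatenate([s] + branches, axis=-1)  # formula (13): [S, T_1, ..., T_4]
# s_final has shape (60, 60, 32 + 4 * 16), mirroring 2048 + 4 * 512 = 4096
```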
The invention has the beneficial effects that: compared with the prior art, the invention has the prominent substantive characteristics and remarkable progress as follows:
(1) Compared with CN106372636A, the method of the invention adopts a deep-learning approach: ResNet50 and dilated convolution are first used to extract multi-scale spatial features, then ConvLSTM extracts temporal information, and finally the two are integrated into spatio-temporal information. The prominent substantive feature and remarkable progress of the invention is that extracting temporal information with ConvLSTM requires no optical flow computation, so salient-object detection is both more accurate and faster than methods that compute optical flow.
(2) Compared with CN109784183A, the method of the invention adopts a residual-network connection scheme, in which multiple convolution layers are connected through residual blocks.
(3) Compared with CN109118469A, the method of the invention has the prominent substantive features and obvious progress that fussy sparse matrix extraction is not needed, advanced features are extracted from video frames by adopting a deep neural network, each pixel point is predicted, the detection result is more accurate, and the robustness is better.
(4) Compared with CN105913456B, the method of the invention has the prominent substantive characteristics and remarkable progress that the method directly adopts an end-to-end neural network method without linear iteration and k-means clustering with larger calculation amount, and can obtain a prediction result more quickly after training is finished.
(5) Compared with CN109034001A, the method of the invention adopts a deep-network-based edge detection branch to extract the edges of the salient object in the original image, which guides the subsequent generation of a complete saliency map. The prominent substantive feature and remarkable progress of the invention is that the salient objects in the resulting saliency map are more complete.
(6) Compared with CN108241854A, although both are deep-learning methods, the method of the invention adopts dilated convolution to extract four feature maps at different scales, so the extracted features are more comprehensive. The prominent substantive feature and remarkable progress of the invention is that the edges of the salient objects in the final saliency map are smoother.
(7) Compared with CN110598537A, the method of the invention has the prominent substantive features and the remarkable progress that the ConvLSTM is used for simulating the optical flow information between frames, and the extracted optical flow information is more accurate than that calculated by the traditional method.
(8) Compared with Video Salient Object Detection via Fully Convolutional Networks, the method of the invention has the prominent substantive feature and remarkable progress of exploiting the temporal information between frames, so the obtained prediction result map is more accurate.
(9) The invention provides a video saliency detection model based on a deep network. Unlike traditional edge detection algorithms, it can accurately detect the contour of the salient object in each frame of a video sequence to guide the prediction of the saliency map.
(10) The method of the invention uses the deep salient-object edge detection branch to generate a salient-object contour map and fuses it with the spatio-temporal saliency map of each frame in the video, so that the contours are smoother and the salient target in each frame of the video sequence can be predicted more accurately.
Drawings
The invention is further illustrated with reference to the following figures and examples.
Fig. 1 is a schematic block diagram of a process of the video saliency detection method based on a deep network according to the present invention.
FIG. 2 is the saliency prediction result map Y_final for a video frame I whose salient targets are a cat and a box, in an embodiment of the invention.
Detailed Description
The embodiment shown in Fig. 1 illustrates the process of the deep-network-based video saliency detection method:
Inputting the video frame I and preprocessing → extracting the initial spatial feature map S of the video frame I′ → obtaining the five-scale spatial feature map S_final → obtaining the feature map F → obtaining the coarse spatio-temporal saliency map Y_ST and the salient-object edge map E_t → obtaining the final saliency prediction map Y_final → calculating the loss for the input video frame I → completing the deep-network-based video saliency detection.
Example 1
The method for detecting video saliency based on the deep network comprises the following specific steps:
firstly, inputting a video frame I, and preprocessing:
Inputting video frames I whose salient targets are a cat and a box, and unifying their size to a width and height of 473 × 473 pixels; the mean of the corresponding channel is subtracted from each pixel value in the video frame I, where the mean of the R channel of each video frame I is 104.00698793, the mean of the G channel is 116.66876762, and the mean of the B channel is 122.67891434, so that the shape of the video frame I before input to the ResNet50 deep network is 473 × 473 × 3; the preprocessed video frame is denoted I′, as shown in the following formula (1):
I′=Resize(I-Mean(R,G,B)) (1),
In formula (1), Mean(R, G, B) is the mean of the red, green and blue color channels, and Resize(·) is a function that adjusts the size of the video frame;
secondly, extracting an initial spatial feature map S of the video frame I':
sending the video frame I' after the first step of preprocessing into a ResNet50 deep network to extract an initial spatial feature map S, wherein the formula (2) is as follows:
S=ResNet50(I′) (2),
In formula (2), ResNet50(·) denotes the ResNet50 deep network,
the ResNet50 deep network comprises a convolution layer, a pooling layer, a nonlinear activation function Relu layer and residual connection;
the third stepObtaining a space characteristic map S with five scales final :
Respectively sending the initial spatial feature map S of the video frame I' extracted in the second step into four different expansion convolutions with expansion rates of 2,4,8 and 16 in a ResNet50 deep network to obtain results T with four scales with expansion rates of 2,4,8 and 16 respectively k Then the result is connected with the initial space characteristic diagram S of the output result of the ResNet50 deep network in series to finally obtain a space characteristic diagram S with five scales final ,
Obtaining a five-scale spatial feature map S final The specific operation is as follows:
the dilated convolution kernel in the ResNet50 deep network is represented asWherein K is the number of expanded convolution layers, cxc is the multiplication of width and height, C is the channel number, and>for expanding the parameters of the convolution, whose step size is set to 1, four output characteristic maps are derived on the basis of these parameters>Wherein W and H are the width and height, respectively, as shown in the following equation (3),
in the formula (3), C k Is an expanded convolution kernel with the value of K, the number of the expanded convolutions is K,for the dilation convolution operation, S is the initial spatial feature map,
the shape of the initial spatial feature map S obtained after the ResNet50 deep network is 60 × 60 × 2048, K = 4, the value range of k is [1, 2, 3, 4], the dilation rate r_k takes the four values r_k = {2, 4, 8, 16}, and the dilated convolution kernels C_k all have a shape of 3 × 3 × 512, thereby finally obtaining feature maps T_k of four different scales; they are then concatenated in turn, as shown in the following formula (4),
S final =[S,T 1 ,T 2 ,…,T K ] (4),
in formula (4), S_final is the final multi-scale spatial feature map, S is the initial spatial feature map extracted by the ResNet50 deep network, T_k are the feature maps obtained after the dilated convolutions, and the five-scale spatial feature map S_final has a shape of 60 × 60 × 4096;
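The third step (formulas (3) and (4)) can be sketched as follows. This is a toy NumPy illustration with assumed small shapes and random kernels standing in for the 60 × 60 × 2048 map and the 3 × 3 × 512 kernels; it shows the dilated-convolution-plus-concatenation pattern, not the trained network:

```python
import numpy as np

def dilated_conv(x, w, rate):
    """'Same' dilated 3x3 conv: the kernel taps are spaced `rate` apart (formula (3))."""
    H, W, _ = x.shape
    xp = np.pad(x, ((rate, rate), (rate, rate), (0, 0)))
    out = np.zeros((H, W, w.shape[-1]))
    for i in range(H):
        for j in range(W):
            # 3x3 grid of taps at offsets 0, rate, 2*rate
            patch = xp[i:i + 2 * rate + 1:rate, j:j + 2 * rate + 1:rate]
            out[i, j] = np.tensordot(patch, w, axes=3)
    return out

rng = np.random.default_rng(1)
S = rng.standard_normal((12, 12, 16))                 # stand-in for the 60x60x2048 map S
kernels = [rng.standard_normal((3, 3, 16, 4)) * 0.05 for _ in range(4)]
T = [dilated_conv(S, k, r) for k, r in zip(kernels, (2, 4, 8, 16))]
S_final = np.concatenate([S] + T, axis=-1)            # formula (4): [S, T1, ..., T4]
assert S_final.shape == (12, 12, 16 + 4 * 4)          # channels add up, spatial size kept
```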
fourthly, obtaining a characteristic diagram F:
subjecting the five-scale spatial feature map S_final obtained in the third step to a convolution operation with a convolution kernel of 3 × 3 × 32 to obtain a feature map F with a shape of 60 × 60 × 32, as shown in the following formula (5),
F=BN(Relu(Conv(S final ))) (5),
in formula (5), Conv(·) is a convolution operation, Relu(·) is a nonlinear activation function, and BN(·) is a normalization operation;
fifthly, obtaining a coarse spatio-temporal saliency map Y_ST and the edge contour map E_t of the salient objects:
Inputting the feature map F obtained in the fourth step into the spatio-temporal branch and the edge detection branch simultaneously to obtain a spatio-temporal feature map F_ST and the edge contour map E_t of the salient objects; the specific operation is as follows,
inputting the feature map F obtained in the fourth step into the ConvLSTM of the spatio-temporal branch to obtain a spatio-temporal feature map F_ST, as shown in the following formula (6),
F_ST = ConvLSTM(F, H_{t-1}) (6),
in formula (6), ConvLSTM(·) is a ConvLSTM operation and H_{t-1} is the state of the ConvLSTM cell at the previous time;
then sending the obtained spatio-temporal feature map F_ST into a convolution layer with a kernel size of 1 × 1 to obtain a coarse spatio-temporal saliency map Y_ST, as shown in the following formula (7):
Y_ST = Conv(F_ST) (7),
in formula (7), Conv(·) is a convolution operation;
inputting the feature map F obtained in the fourth step into the edge detection branch to obtain the edge contour map E_t of the salient objects; the specific operation is as follows,
the edge detection branch comprises a two-layer ConvLSTM, a strong recurrent model used to capture temporal information, delineate the contour edges of salient objects from the temporal information, and distinguish salient objects from non-salient objects in an image; more specifically, the static features {X_1, …, X_T} of an input video of T frames are obtained through the ResNet50 deep network and the dilated convolutions, where X_t is the video frame of the t-th frame; given X_t, the edge detection branch outputs the edge contour map E_t ∈ [0,1]^{W×H}, where W and H are the width and height, respectively, of the predicted edge image, computed from the edge detection network while taking the previous video frame into account, as shown in formulas (8) and (9),
H_t = ConvLSTM(X_t, H_{t-1}) (8),
in formula (8) and formula (9), H_t ∈ R^{W×H×M} is the 3D-tensor hidden state, M is the number of channels, E_t′ is the unweighted edge contour map, H_t is the current state of the ConvLSTM cell, H_{t-1} is the state of the ConvLSTM cell at the previous time, and X_1 is the first video frame,
by embedding one ConvLSTM within another, the key component for obtaining the edge contour map E_t is the edge detection network, as shown in the following formula (10),
then weighting with the edge detection network to obtain the edge contour map E_t of the salient objects, as shown in the following formula (11),
in formula (11), a 1 × 1 convolution kernel is used to map the edge detection network to a weight matrix, and the sigmoid function σ normalizes the matrix to [0, 1];
Thus, the coarse spatio-temporal saliency map Y_ST and the edge contour map E_t of the salient objects are obtained;
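The weighting of formula (11) can be sketched as follows. All tensors here are hypothetical random stand-ins (the patent does not disclose the edge network's feature shapes); the sketch only shows the pattern of a 1 × 1 convolution producing a sigmoid-normalized weight matrix that modulates the unweighted edge map E_t′:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
H = W = 10
f_edge = rng.standard_normal((H, W, 8))       # hypothetical edge-network features
w = rng.standard_normal((8, 1)) * 0.1         # the 1x1 convolution kernel
E_unweighted = sigmoid(rng.standard_normal((H, W, 1)))  # E_t' from the inner ConvLSTM
weight = sigmoid(f_edge @ w)                  # weight matrix normalized to [0, 1]
E_t = weight * E_unweighted                   # weighted edge contour map, formula (11)
assert 0 <= E_t.min() and E_t.max() <= 1      # E_t stays in [0, 1]^{W x H}
```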
Sixthly, obtaining the final saliency prediction result map Y_final:
Fusing the coarse spatio-temporal saliency map Y_ST obtained in the fifth step with the edge contour map E_t of the salient objects to obtain the final saliency prediction result map Y_final, as shown in the following formula (12),
Y_final = Resize(σ(Y_ST) ⊗ E_t) (12),
in formula (12), ⊗ is matrix multiplication, σ is the sigmoid function, and Resize(·) is a function for adjusting the video frame size,
restoring the obtained video frame to 473 × 473 of the original input video frame;
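The fusion step can be sketched as follows. This is a hedged illustration: the exact composition of σ, the multiplication, and Resize(·) is not fully spelled out in the text, so the sketch assumes an element-wise product of the sigmoid-activated coarse map with the edge map, followed by a nearest-neighbour resize back to 473 × 473:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def resize_nearest(img, size):
    """Nearest-neighbour stand-in for Resize(.)."""
    h, w = img.shape[:2]
    return img[np.arange(size) * h // size][:, np.arange(size) * w // size]

rng = np.random.default_rng(4)
Y_ST = rng.standard_normal((60, 60))            # coarse spatio-temporal saliency map
E_t = sigmoid(rng.standard_normal((60, 60)))    # edge contour map in [0, 1]
Y_final = resize_nearest(sigmoid(Y_ST) * E_t, 473)  # assumed fusion + resize
assert Y_final.shape == (473, 473)              # restored to the input frame size
```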
FIG. 2 is the final saliency prediction result map Y_final of the video frame I of the present embodiment; there are two salient targets, a cat and a box.
Seventh step, calculating the loss for input video frame I:
calculating a saliency map for the input video frame I through the first step to the sixth step; to measure the difference between the final saliency prediction result map Y_final obtained in the sixth step and the ground-truth, a binary cross entropy loss function ℒ is adopted during training, as shown in the following formula (13),
ℒ = −Σ_{i=1}^{N} Σ_{j=1}^{N} [G(i, j) log M(i, j) + (1 − G(i, j)) log(1 − M(i, j))] (13),
in the formula (13), G (i, j) ∈ [0,1] is the true value of the pixel (i, j), M (i, j) ∈ [0,1] is the predicted value of the pixel (i, j), N =473 is selected,
the network is trained by continuously reducing the value of the loss ℒ, and a stochastic gradient descent method is adopted to optimize the binary cross entropy loss function ℒ;
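The seventh step can be sketched with a toy training loop. This is an illustration only: the real method trains the whole deep network, whereas here stochastic gradient descent is applied to a single 8 × 8 logit map (a stand-in for the prediction) against a random ground-truth map, using the standard binary cross entropy of formula (13):

```python
import numpy as np

def bce_loss(G, M, eps=1e-7):
    """Binary cross entropy summed over an N x N map, formula (13)."""
    M = np.clip(M, eps, 1 - eps)
    return -np.sum(G * np.log(M) + (1 - G) * np.log(1 - M))

rng = np.random.default_rng(5)
G = (rng.random((8, 8)) > 0.5).astype(float)   # toy ground-truth map
logits = rng.standard_normal((8, 8))           # stand-in for the network output
lr = 0.5
for _ in range(100):                           # gradient-descent steps
    M = 1.0 / (1.0 + np.exp(-logits))          # sigmoid prediction
    logits -= lr * (M - G)                     # dL/dlogits for sigmoid + BCE
loss_after = bce_loss(G, 1.0 / (1.0 + np.exp(-logits)))
# training reduced the loss below the uninformative all-0.5 baseline
assert loss_after < bce_loss(G, np.full((8, 8), 0.5))
```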
And completing the video significance detection based on the deep network.
In the above embodiments, the ResNet50 deep network, convLSTM, ground-truth, and stochastic gradient descent methods are all known in the art.
Claims (2)
1. The video saliency detection method based on the deep network is characterized by comprising the following steps: firstly, a ResNet50 deep network is used for extracting spatial features, then time and edge information are extracted to jointly obtain a significance prediction result graph, and video significance detection based on the deep network is completed, and the method comprises the following specific steps:
firstly, inputting a video frame I, and preprocessing:
inputting video frame I, unifying the sizes of the video frames to be 473 × 473 pixels in width and height, and subtracting the average value of the corresponding channel from each pixel value in video frame I, wherein the average value of the R channel in each video frame I is 104.00698793, the average value of the G channel in each video frame I is 116.66876762, and the average value of the B channel in each video frame I is 122.67891434, so that the shape of video frame I before being input to the ResNet50 depth network is 473 × 473 × 3, and the video frame after being preprocessed in this way is denoted as I', as shown in the following formula (1):
I′=Resize(I-Mean(R,G,B)) (1),
in formula (1), Mean(R, G, B) is the average of the three color channels red, green and blue, and Resize(·) is a function for adjusting the size of the video frame I';
secondly, extracting an initial spatial feature map S of the video frame I':
sending the video frame I' after the first step of preprocessing into a ResNet50 deep network to extract an initial spatial feature map S, wherein the formula (2) is as follows:
S=ResNet50(I′) (2),
in formula (2), ResNet50(·) is a ResNet50 deep network,
the ResNet50 deep network comprises a convolution layer, a pooling layer, a nonlinear activation function Relu layer and residual connection;
thirdly, obtaining a five-scale spatial feature map S_final:
Respectively sending the initial spatial feature map S of the video frame I' extracted in the second step into four different dilated convolutions with dilation rates of 2, 4, 8 and 16 in the ResNet50 deep network to obtain results T_k of four scales with dilation rates of 2, 4, 8 and 16 respectively, and then concatenating these results with the initial spatial feature map S output by the ResNet50 deep network to finally obtain a five-scale spatial feature map S_final;
Fourthly, obtaining a characteristic diagram F:
subjecting the five-scale spatial feature map S_final obtained in the third step to a convolution operation with a convolution kernel of 3 × 3 × 32 to obtain a feature map F with a shape of 60 × 60 × 32, as shown in the following formula (3),
F=BN(Relu(Conv(S final ))) (3),
in formula (3), Conv(·) is a convolution operation, Relu(·) is a nonlinear activation function, and BN(·) is a normalization operation;
fifthly, obtaining a coarse spatio-temporal saliency map Y_ST and the edge contour map E_t of the salient objects:
Inputting the feature map F obtained in the fourth step into the spatio-temporal branch and the edge detection branch simultaneously to obtain a spatio-temporal feature map F_ST and the edge contour map E_t of the salient objects; the specific operation is as follows,
inputting the feature map F obtained in the fourth step into the ConvLSTM of the spatio-temporal branch to obtain a spatio-temporal feature map F_ST, as shown in the following formula (4),
F_ST = ConvLSTM(F, H_{t-1}) (4),
in formula (4), ConvLSTM(·) is a ConvLSTM operation and H_{t-1} is the state of the ConvLSTM cell at the previous time;
then sending the obtained spatio-temporal feature map F_ST into a convolution layer with a kernel size of 1 × 1 to obtain a coarse spatio-temporal saliency map Y_ST, as shown in the following formula (5):
Y_ST = Conv(F_ST) (5),
in formula (5), Conv(·) is a convolution operation;
inputting the feature map F obtained in the fourth step into the edge detection branch to obtain the edge contour map E_t of the salient objects; the specific operation is as follows,
obtaining the static features {X_1, …, X_T} of an input video of T frames through the ResNet50 deep network and the dilated convolutions, where X_t is the video frame of the t-th frame; given X_t, the edge detection branch outputs the edge contour map E_t ∈ [0,1]^{W×H}, where W and H are the width and height, respectively, of the predicted edge image, computed from the edge detection network while taking the previous video frame into account, as shown in formulas (6) and (7),
H_t = ConvLSTM(X_t, H_{t-1}) (6),
in formula (6) and formula (7), H_t ∈ R^{W×H×M} is the 3D-tensor hidden state, M is the number of channels, E_t′ is the unweighted edge contour map, H_t is the current state of the ConvLSTM cell, H_{t-1} is the state of the ConvLSTM cell at the previous time, and X_1 is the first video frame,
by embedding one ConvLSTM within another, the key component for obtaining the edge contour map E_t is the edge detection network, as shown in the following formula (8),
then weighting with the edge detection network to obtain the edge contour map E_t of the salient objects, as shown in the following formula (9),
in formula (9), a 1 × 1 convolution kernel is used to map the edge detection network to a weight matrix, and the sigmoid function σ normalizes the matrix to [0, 1];
Thus, the coarse spatio-temporal saliency map Y_ST and the edge contour map E_t of the salient objects are obtained;
Sixthly, obtaining the final saliency prediction result map Y_final:
Fusing the coarse spatio-temporal saliency map Y_ST obtained in the fifth step with the edge contour map E_t of the salient objects to obtain the final saliency prediction result map Y_final, as shown in the following formula (10),
Y_final = Resize(σ(Y_ST) ⊗ E_t) (10),
in formula (10), ⊗ is matrix multiplication, σ is the sigmoid function, and Resize(·) is a function for adjusting the video frame size,
restoring the obtained video frame to 473 × 473 of the original input video frame;
seventh step, calculating the loss for input video frame I:
calculating a saliency map for the input video frame I through the first step to the sixth step; to measure the difference between the final saliency prediction result map Y_final obtained in the sixth step and the ground-truth, a binary cross entropy loss function ℒ is adopted during training, as shown in the following formula (11),
ℒ = −Σ_{i=1}^{N} Σ_{j=1}^{N} [G(i, j) log M(i, j) + (1 − G(i, j)) log(1 − M(i, j))] (11),
in the formula (11), G (i, j) is the true value of the pixel (i, j), M (i, j) is the predicted value of the pixel (i, j), N =473 is selected,
the network is trained by continuously reducing the value of the loss ℒ, and a stochastic gradient descent method is adopted to optimize the binary cross entropy loss function ℒ;
And completing the video significance detection based on the deep network.
2. The method for detecting video saliency based on the deep network as claimed in claim 1, wherein said obtaining of the five-scale spatial feature map S_final comprises the following specific operation:
the dilated convolution kernels in the ResNet50 deep network are represented as C_k ∈ R^{c×c×C}, where K is the number of dilated convolution layers, c×c is the kernel width times height, C is the number of channels, and C_k are the parameters of the dilated convolution, whose stride is set to 1; on the basis of these parameters, four output feature maps T_k ∈ R^{W×H×C} are derived, where W and H are the width and height, respectively, as shown in the following formula (12),
T_k = C_k ⊛ S, k = 1, …, K (12),
in formula (12), C_k is the k-th dilated convolution kernel, K is the number of dilated convolutions, ⊛ is the dilated convolution operation, and S is the initial spatial feature map,
the shape of the initial spatial feature map S obtained after the ResNet50 deep network is 60 × 60 × 2048, K = 4, the value range of k is [1, 2, 3, 4], the dilation rate r_k takes the four values r_k = {2, 4, 8, 16}, and the dilated convolution kernels C_k all have a shape of 3 × 3 × 512, thereby finally obtaining feature maps T_k of four different scales; they are then concatenated in turn, as shown in the following formula (13),
S final =[S,T 1 ,T 2 ,…,T K ] (13),
in formula (13), S_final is the final multi-scale spatial feature map, S is the initial spatial feature map extracted by the ResNet50 deep network, T_k are the feature maps obtained after the dilated convolutions, and the five-scale spatial feature map S_final has a shape of 60 × 60 × 4096.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010266351.2A CN111461043B (en) | 2020-04-07 | 2020-04-07 | Video significance detection method based on deep network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010266351.2A CN111461043B (en) | 2020-04-07 | 2020-04-07 | Video significance detection method based on deep network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111461043A CN111461043A (en) | 2020-07-28 |
CN111461043B true CN111461043B (en) | 2023-04-18 |
Family
ID=71685906
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010266351.2A Active CN111461043B (en) | 2020-04-07 | 2020-04-07 | Video significance detection method based on deep network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111461043B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111931732B (en) * | 2020-09-24 | 2022-07-15 | 苏州科达科技股份有限公司 | Method, system, device and storage medium for detecting salient object of compressed video |
CN112861733B (en) * | 2021-02-08 | 2022-09-02 | 电子科技大学 | Night traffic video significance detection method based on space-time double coding |
CN112950477B (en) * | 2021-03-15 | 2023-08-22 | 河南大学 | Dual-path processing-based high-resolution salient target detection method |
CN117152670A (en) * | 2023-10-31 | 2023-12-01 | 江西拓世智能科技股份有限公司 | Behavior recognition method and system based on artificial intelligence |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108256562A (en) * | 2018-01-09 | 2018-07-06 | 深圳大学 | Well-marked target detection method and system based on Weakly supervised space-time cascade neural network |
CN109448015A (en) * | 2018-10-30 | 2019-03-08 | 河北工业大学 | Image based on notable figure fusion cooperates with dividing method |
CN110929736A (en) * | 2019-11-12 | 2020-03-27 | 浙江科技学院 | Multi-feature cascade RGB-D significance target detection method |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106157307B (en) * | 2016-06-27 | 2018-09-11 | 浙江工商大学 | A kind of monocular image depth estimation method based on multiple dimensioned CNN and continuous CRF |
-
2020
- 2020-04-07 CN CN202010266351.2A patent/CN111461043B/en active Active
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108256562A (en) * | 2018-01-09 | 2018-07-06 | 深圳大学 | Well-marked target detection method and system based on Weakly supervised space-time cascade neural network |
CN109448015A (en) * | 2018-10-30 | 2019-03-08 | 河北工业大学 | Image based on notable figure fusion cooperates with dividing method |
CN110929736A (en) * | 2019-11-12 | 2020-03-27 | 浙江科技学院 | Multi-feature cascade RGB-D significance target detection method |
Non-Patent Citations (2)
Title |
---|
Guo, Y. C., et al. Video Object Extraction Based on Spatiotemporal Consistency Saliency Detection. IEEE Access, 2018, vol. 6, pp. 35171-35181. *
Shi Shuo. Research on Image Local Invariant Features and Their Applications. China Doctoral Dissertations Full-text Database, Information Science and Technology, 2015, I138-45. *
Also Published As
Publication number | Publication date |
---|---|
CN111461043A (en) | 2020-07-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111461043B (en) | Video significance detection method based on deep network | |
Kim et al. | Deep monocular depth estimation via integration of global and local predictions | |
US10839543B2 (en) | Systems and methods for depth estimation using convolutional spatial propagation networks | |
CN109584248B (en) | Infrared target instance segmentation method based on feature fusion and dense connection network | |
CN109886066B (en) | Rapid target detection method based on multi-scale and multi-layer feature fusion | |
US11100401B2 (en) | Predicting depth from image data using a statistical model | |
US11361456B2 (en) | Systems and methods for depth estimation via affinity learned with convolutional spatial propagation networks | |
Lin et al. | Depth estimation from monocular images and sparse radar data | |
Cao et al. | Exploiting depth from single monocular images for object detection and semantic segmentation | |
Zhang et al. | Deep hierarchical guidance and regularization learning for end-to-end depth estimation | |
CN107680106A (en) | A kind of conspicuousness object detection method based on Faster R CNN | |
CN111612807A (en) | Small target image segmentation method based on scale and edge information | |
CN110096961B (en) | Indoor scene semantic annotation method at super-pixel level | |
CN107506792B (en) | Semi-supervised salient object detection method | |
CN113344932B (en) | Semi-supervised single-target video segmentation method | |
CN110827312B (en) | Learning method based on cooperative visual attention neural network | |
CN104766065B (en) | Robustness foreground detection method based on various visual angles study | |
WO2016165064A1 (en) | Robust foreground detection method based on multi-view learning | |
CN113095371B (en) | Feature point matching method and system for three-dimensional reconstruction | |
Wang et al. | A feature-supervised generative adversarial network for environmental monitoring during hazy days | |
CN108388901B (en) | Collaborative significant target detection method based on space-semantic channel | |
Chen et al. | Pgnet: Panoptic parsing guided deep stereo matching | |
CN106327513B (en) | Shot boundary detection method based on convolutional neural network | |
Li et al. | Spatiotemporal road scene reconstruction using superpixel-based Markov random field | |
Tseng et al. | Semi-supervised image depth prediction with deep learning and binocular algorithms |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |