CN111461043B - Video saliency detection method based on a deep network - Google Patents

Video saliency detection method based on a deep network Download PDF

Info

Publication number
CN111461043B
Authority
CN
China
Prior art keywords
video frame
final
obtaining
video
saliency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010266351.2A
Other languages
Chinese (zh)
Other versions
CN111461043A (en)
Inventor
于明
夏斌红
刘依
郭迎春
郝小可
朱叶
师硕
于洋
阎刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hebei University of Technology
Original Assignee
Hebei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hebei University of Technology filed Critical Hebei University of Technology
Priority to CN202010266351.2A priority Critical patent/CN111461043B/en
Publication of CN111461043A publication Critical patent/CN111461043A/en
Application granted granted Critical
Publication of CN111461043B publication Critical patent/CN111461043B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/70Denoising; Smoothing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/13Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a video saliency detection method based on a deep network, in the field of image data processing. A ResNet50 deep network is first used to extract spatial features, and temporal and edge information are then extracted to jointly obtain a saliency prediction map, completing deep-network-based video saliency detection: the initial spatial feature map S of the video frame I is extracted; the five-scale spatial feature map S_final is obtained; the feature map F is obtained; the coarse spatio-temporal saliency map Y_ST and the edge contour map E_t of the salient object are obtained; the final saliency prediction map Y_final is obtained; and the loss for the input video frame I is calculated, completing the deep-network-based video saliency detection. The method overcomes the shortcomings of prior-art video saliency detection, namely incomplete detection of salient objects and inaccurate detection when the foreground and background colors are similar.

Description

Video saliency detection method based on a deep network
Technical Field
The technical solution of the invention relates to the field of image data processing, and in particular to a video saliency detection method based on a deep network.
Background
Video saliency detection aims to extract the regions of greatest interest to the human eye from successive video frames. Specifically, it uses a computer to simulate the human visual attention mechanism and extract the regions of interest to the human eye from video frames, and it is one of the key technologies in the field of computer vision.
Most conventional video saliency detection methods are based on low-level handcrafted features (e.g., color and texture). These are typically heuristic methods and suffer from slow speed (due to time-consuming optical flow computation) and low prediction accuracy (due to the limited representational power of low-level features). In recent years, deep neural networks have been applied to video saliency detection. Deep learning methods compute the saliency value of an image from high-level semantic features extracted by a convolutional neural network; however, a deep convolutional network can lose the position and detail information of the target, which introduces misleading information when detecting salient objects and leads to incomplete detection.
In 2016, Liu et al., in the article "Saliency detection for unconstrained videos using superpixel-level graph and spatiotemporal propagation", proposed the SGSP algorithm for video saliency detection using a superpixel-level graph model and spatio-temporal propagation. It first extracts superpixel-level motion and color histograms together with a global motion histogram to construct a graph; motion saliency is then computed iteratively via shortest paths on the graph using a graph-based background prior; the result is propagated forward and backward in time and locally and globally in space, and the two results are finally fused to form the final saliency map. The algorithm is computationally expensive, and the obtained saliency maps still suffer from incomplete detection of salient objects. Deep learning models aim to obtain richer depth features with convolutional neural networks and thereby more accurate detection results. Wang et al. proposed a video saliency detection method based on a fully convolutional network in 2017, the first use of a deep-learning fully convolutional network in the video saliency detection field; however, because temporal information between frames is not considered, the edges of the obtained saliency maps are not fine enough and are noisy. CN106372636A discloses a video saliency detection method based on HOG_TOP, which computes HOG_TOP features of the original video on the three orthogonal planes XY, XT and YT, obtains a spatial-domain saliency map on the XY plane and temporal-domain saliency maps on the XT and YT planes, and finally obtains the final saliency map through adaptive fusion. CN109784183A discloses a video salient-object detection method based on a cascaded convolutional network and optical flow, which uses a cascaded network structure to perform pixel-level saliency prediction on the current frame at three scales (high, medium and low). The cascaded network is trained with the MSRA10K image data set, using saliency label maps as supervision and a cross-entropy loss function. After training, the trained cascaded network performs static saliency prediction on each frame of the video, and the optical flow field is extracted with the Lucas-Kanade algorithm; a dynamic optimization network is then built from a three-layer convolutional structure, and the static detection result and the optical-flow detection result of each frame are concatenated as the input of the optimization network. This method is time-consuming, and the optical flow extracted with the Lucas-Kanade algorithm is inaccurate and lacks robustness in complex scenes. CN109118469A discloses a prediction method for video saliency, which quantizes the image to obtain a sparse matrix response, obtains a decomposition matrix under local coordinate constraints, computes a saliency map for each frame of the video and performs quality prediction. This method loses some detail information of the salient object, so the prediction results suffer from incomplete detection of salient objects.
CN105913456B discloses a video saliency detection method based on region segmentation, which first uses nonlinear clustering to obtain superpixel blocks for extracting static features, then uses an optical flow method to obtain dynamic features, and finally fuses the two kinds of features and predicts the saliency map with a linear regression model. CN109034001A discloses a cross-modal video saliency detection method based on spatio-temporal cues, which constructs the saliency map from weights of an initial saliency map, visible light and thermal infrared; suitable weight values are difficult to find, resulting in poor robustness. CN108241854A discloses a deep video saliency detection method based on motion and memory information, which extracts local and global information from the human-eye attention view of the current frame and then feeds them, as prior information together with the original image, into a deep network model to predict the final saliency map. CN110598537A discloses a video saliency detection method based on a deep convolutional network, which uses the current frame of the video and its corresponding optical flow image as the input of a feature extraction network to predict the final saliency map; this method needs to compute the optical flow information of the current frame in advance, which is computationally expensive.
In summary, the prior art of video salient-object detection still suffers from incomplete detection of salient objects and inaccurate detection when the foreground and background colors are similar.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a video saliency detection method based on a deep network. The method first uses a ResNet50 deep network to extract spatial features, then extracts temporal and edge information to jointly obtain a saliency prediction map, completing deep-network-based video saliency detection, and overcomes the shortcomings of prior-art video saliency detection, namely incomplete detection of salient objects and inaccurate detection when the foreground and background colors are similar.
The technical solution adopted by the invention to solve this technical problem is as follows: a video saliency detection method based on a deep network, in which a ResNet50 deep network is first used to extract spatial features, and temporal and edge information are then extracted to jointly obtain a saliency prediction map, completing deep-network-based video saliency detection. The specific steps are as follows:
Firstly, inputting a video frame I and preprocessing:
A video frame I is input and resized to a uniform width and height of 473 × 473 pixels, and the mean of the corresponding channel is subtracted from each pixel value of the video frame I, where the mean of the R channel of each video frame I is 104.00698793, the mean of the G channel is 116.66876762, and the mean of the B channel is 122.67891434, so that the shape of the video frame input to the ResNet50 deep network is 473 × 473 × 3. The video frame preprocessed in this way is denoted I', as shown in equation (1):
I′=Resize(I-Mean(R,G,B)) (1),
In equation (1), Mean(R,G,B) is the mean of the three color channels red, green and blue, and Resize(·) is the function that adjusts the size of the video frame I';
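A minimal sketch of this preprocessing step, assuming NumPy and OpenCV for the resize (the patent does not name a library), with the channel means taken from the text:

```python
import numpy as np
import cv2  # any resize routine would do; cv2 is only an assumption

# Per-channel means as stated in the first step (R, G, B order as given in the text).
MEAN_RGB = np.array([104.00698793, 116.66876762, 122.67891434], dtype=np.float32)

def preprocess_frame(frame_rgb: np.ndarray) -> np.ndarray:
    """Resize a video frame I to 473 x 473 and subtract the per-channel mean,
    producing the preprocessed frame I' of shape 473 x 473 x 3 (equation (1))."""
    resized = cv2.resize(frame_rgb.astype(np.float32), (473, 473))
    return resized - MEAN_RGB  # broadcasts over the 473 x 473 spatial grid
```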
Secondly, extracting the initial spatial feature map S of the video frame I':
The video frame I' preprocessed in the first step is fed into the ResNet50 deep network to extract the initial spatial feature map S, as shown in equation (2):
S=ResNet50(I′) (2),
In equation (2), ResNet50(·) is the ResNet50 deep network;
the ResNet50 deep network comprises convolution layers, pooling layers, nonlinear activation function Relu layers and residual connections;
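A sketch of this backbone step using torchvision's ResNet-50 is shown below; the framework choice and the reduced output stride are assumptions, chosen so that the output matches the 60 × 60 × 2048 shape of S stated later in the text:

```python
import torch
import torchvision

# ResNet-50 backbone with the last two stages dilated instead of strided
# (output stride 8). This is an assumption: a plain ResNet-50 would give a
# 15 x 15 map for a 473 x 473 input, while the text states S is 60 x 60 x 2048.
backbone = torchvision.models.resnet50(weights=None,
                                       replace_stride_with_dilation=[False, True, True])
# Drop global pooling and the classification head so the output stays spatial.
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2])

I_prime = torch.randn(1, 3, 473, 473)   # preprocessed frame I'
S = feature_extractor(I_prime)          # initial spatial feature map S
print(S.shape)                          # torch.Size([1, 2048, 60, 60])
```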
Thirdly, obtaining the five-scale spatial feature map S_final:
The initial spatial feature map S of the video frame I' extracted in the second step is fed into four different dilated convolutions with dilation rates of 2, 4, 8 and 16 in the ResNet50 deep network to obtain results T_k at four scales corresponding to the dilation rates 2, 4, 8 and 16; these are then concatenated with the initial spatial feature map S output by the ResNet50 deep network, finally yielding the five-scale spatial feature map S_final;
Fourthly, obtaining a characteristic diagram F:
obtaining the space characteristic diagram S of five scales by the third step final The feature map F having a shape of 60 × 60 × 32 is obtained by a convolution operation with a convolution kernel of 3 × 3 × 32, as shown in the following equation (3),
F=BN(Relu(Conv(S final ))) (3),
in formula (3), conv (-) is a convolution operation, relu (-) is a nonlinear activation function, and BN (-) is a normalization operation;
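A minimal sketch of equation (3), assuming S_final has the 60 × 60 × 4096 shape stated later in the text and that BN(·) is batch normalization:

```python
import torch
import torch.nn as nn

# Equation (3): 3x3 convolution with 32 output channels, then ReLU, then
# batch normalization, applied to the five-scale feature map S_final.
fuse = nn.Sequential(
    nn.Conv2d(4096, 32, kernel_size=3, padding=1),  # Conv(S_final)
    nn.ReLU(inplace=True),                          # Relu(.)
    nn.BatchNorm2d(32),                             # BN(.)
)

S_final = torch.randn(1, 4096, 60, 60)
F = fuse(S_final)      # feature map F, shape (1, 32, 60, 60)
```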
Fifthly, obtaining the coarse spatio-temporal saliency map Y_ST and the edge contour map E_t of the salient object:
The feature map F obtained in the fourth step is input simultaneously into the spatio-temporal branch and the edge detection branch to obtain the spatio-temporal feature map F_ST and the edge contour map E_t of the salient object. The specific operations are as follows.
The feature map F obtained in the fourth step is input into the ConvLSTM of the spatio-temporal branch to obtain the spatio-temporal feature map F_ST, as shown in equation (4),
F_ST=ConvLSTM(F,H_{t-1}) (4),
In equation (4), ConvLSTM(·) is the ConvLSTM operation and H_{t-1} is the state of the ConvLSTM unit at the previous time step;
The resulting spatio-temporal feature map F_ST is then fed into a convolution layer with a 1 × 1 kernel to obtain the coarse spatio-temporal saliency map Y_ST, as follows:
Y_ST=Conv(F_ST) (5),
In equation (5), Conv(·) is a convolution operation;
The feature map F obtained in the fourth step is input into the edge detection branch to obtain the edge contour map E_t of the salient object, as follows.
Through the ResNet50 deep network and the dilated convolutions, the static features {X_t, t = 1, 2, …, T} of an input video of T frames are obtained, where X_t corresponds to the video frame of the t-th frame. Given X_t, the edge detection branch outputs the edge contour map E_t ∈ [0,1]^(W×H), where W and H are the width and height of the predicted edge image; the output of the edge detection network is computed taking the previous video frame into account, as shown in equations (6) and (7),
H_t=ConvLSTM(X_t,H_{t-1}) (6),
(equation (7), which defines the unweighted edge map E_t', is given only as a formula image in the original),
In equations (6) and (7), H_t ∈ R^(W×H×M) is the 3D tensor hidden state, M is the number of channels, E_t' is the unweighted edge contour map, H_t is the current ConvLSTM cell state, H_{t-1} is the ConvLSTM cell state at the previous time step, and X_1 is the video frame of the first frame;
By embedding one ConvLSTM within another ConvLSTM, the key component for obtaining the edge contour map E_t is the edge detection network, defined in equation (8) (given only as a formula image in the original);
The output of the edge detection network is then used for weighting to obtain the edge contour map E_t of the salient object, as shown in equation (9) (given only as a formula image in the original); in equation (9), a 1 × 1 convolution kernel maps the output of the edge detection network to a weight matrix, and the sigmoid function σ normalizes the matrix to [0, 1];
Thus, the coarse spatio-temporal saliency map Y_ST and the edge contour map E_t of the salient object are obtained;
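A minimal sketch of the spatio-temporal branch (equations (4) and (5)), assuming a standard single-layer ConvLSTM cell with 32 hidden channels and a 3 × 3 gate kernel; the patent does not fix these hyper-parameters:

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Standard ConvLSTM cell: one convolution produces the four gates."""
    def __init__(self, in_ch: int, hid_ch: int, k: int = 3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        c = f * c + i * g
        h = o * torch.tanh(c)
        return h, c

cell = ConvLSTMCell(in_ch=32, hid_ch=32)
readout = nn.Conv2d(32, 1, kernel_size=1)       # 1 x 1 convolution of equation (5)

F = torch.randn(1, 32, 60, 60)                  # feature map F of the current frame
h = c = torch.zeros(1, 32, 60, 60)              # H_{t-1}: state from the previous frame
h, c = cell(F, (h, c))                          # F_ST = ConvLSTM(F, H_{t-1})   (eq. 4)
Y_ST = readout(h)                               # coarse spatio-temporal map Y_ST (eq. 5)
```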
Sixthly, obtaining the final saliency prediction map Y_final:
The coarse spatio-temporal saliency map Y_ST obtained in the fifth step and the edge contour map E_t of the salient object are fused to obtain the final saliency prediction map Y_final, as shown in equation (10) (given only as a formula image in the original);
In equation (10), the product is a matrix multiplication, σ is the sigmoid function, and Resize(·) is the function that adjusts the video frame size, restoring the obtained map to the 473 × 473 size of the original input video frame;
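One plausible reading of this fusion step is sketched below; since equation (10) appears only as a formula image, the exact composition (here: sigmoid of the coarse map, element-wise product with the edge map, bilinear resize to 473 × 473) is an assumption consistent with the surrounding text:

```python
import torch
import torch.nn.functional as nnf

def fuse(Y_ST: torch.Tensor, E_t: torch.Tensor) -> torch.Tensor:
    """Assumed form of equation (10): sigmoid, element-wise ("matrix")
    multiplication with the edge contour map, and resize back to 473 x 473."""
    fused = torch.sigmoid(Y_ST) * E_t
    return nnf.interpolate(fused, size=(473, 473), mode="bilinear", align_corners=False)

Y_ST = torch.randn(1, 1, 60, 60)   # coarse spatio-temporal saliency map
E_t = torch.rand(1, 1, 60, 60)     # edge contour map of the salient object
Y_final = fuse(Y_ST, E_t)          # final prediction, shape (1, 1, 473, 473)
```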
Seventhly, calculating the loss for the input video frame I:
A saliency map is computed for the input video frame I through the first to the sixth step. To measure the difference between the final saliency prediction map Y_final and the ground truth, a binary cross-entropy loss function is used during training, as shown in equation (11),
L = -(1/(N×N)) Σ_{i=1}^{N} Σ_{j=1}^{N} [ G(i,j)·log M(i,j) + (1-G(i,j))·log(1-M(i,j)) ] (11),
In equation (11), G(i,j) ∈ [0,1] is the true value of pixel (i,j), M(i,j) ∈ [0,1] is the predicted value of pixel (i,j), and N = 473;
The network is trained by continuously reducing the value of this binary cross-entropy loss function, which is optimized with a stochastic gradient descent method, thereby completing the deep-network-based video saliency detection.
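A sketch of this training step; the stand-in model, learning rate and momentum are assumptions, while the pixel-wise binary cross-entropy and stochastic gradient descent follow the text:

```python
import torch

model = torch.nn.Conv2d(3, 1, kernel_size=1)      # stand-in for the full network of steps 2-6
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
bce = torch.nn.BCELoss()                          # mean over the 473 x 473 pixels

I_prime = torch.randn(1, 3, 473, 473)             # preprocessed input frame I'
G = torch.rand(1, 1, 473, 473)                    # ground-truth saliency map G(i, j)

Y_final = torch.sigmoid(model(I_prime))           # predicted values M(i, j) in [0, 1]
loss = bce(Y_final, G)                            # binary cross-entropy of equation (11)
loss.backward()
optimizer.step()
```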
In the deep-network-based video saliency detection method, the specific operation of obtaining the five-scale spatial feature map S_final is as follows:
The dilated convolution kernels in the ResNet50 deep network are denoted C_k ∈ R^(c×c×C), where K is the number of dilated convolution layers, c × c is the kernel width times height, and C is the number of channels; r_k is the dilation rate of the dilated convolution, whose stride is set to 1. Based on these parameters, four output feature maps T_k ∈ R^(W×H×C) are obtained, where W and H are the width and height, respectively, as shown in equation (12),
T_k = C_k ⊛ S, k = 1, 2, …, K (12),
In equation (12), C_k is the k-th dilated convolution kernel, K is the number of dilated convolutions, ⊛ is the dilated convolution operation, and S is the initial spatial feature map;
The initial spatial feature map S obtained from the ResNet50 deep network has a shape of 60 × 60 × 2048; there are four dilated convolutions, with k taking values in [1, 2, 3, 4], the dilation rate r_k takes the four values r_k = 2, 4, 8, 16, and the dilated convolution kernels C_k all have a shape of 3 × 3 × 512, so that feature maps T_k at four different scales are finally obtained. These are then concatenated in sequence, as shown in equation (13),
S_final=[S,T_1,T_2,…,T_K] (13),
In equation (13), S_final is the final multi-scale spatial feature map, S is the initial spatial feature map extracted by the ResNet50 deep network, and T_K is the feature map obtained after dilated convolution; the five-scale spatial feature map S_final has a shape of 60 × 60 × 4096.
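A minimal sketch of this multi-scale extraction (equations (12) and (13)): four 3 × 3 dilated convolutions with 512 output channels and dilation rates 2, 4, 8 and 16 applied to S, concatenated with S itself (2048 + 4 × 512 = 4096 channels):

```python
import torch
import torch.nn as nn

# One dilated convolution per rate; padding = dilation keeps the 60 x 60 size.
dilated = nn.ModuleList(
    [nn.Conv2d(2048, 512, kernel_size=3, padding=r, dilation=r) for r in (2, 4, 8, 16)]
)

S = torch.randn(1, 2048, 60, 60)          # initial spatial feature map S
T = [conv(S) for conv in dilated]         # T_k, k = 1..4   (equation (12))
S_final = torch.cat([S, *T], dim=1)       # equation (13); shape (1, 4096, 60, 60)
```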
The beneficial effects of the invention are as follows: compared with the prior art, the prominent substantive features and notable advances of the invention are:
(1) Compared with CN106372636A, the method of the invention is based on deep learning: it first uses ResNet50 and dilated convolutions to extract multi-scale spatial features, then uses ConvLSTM to extract temporal information, and finally integrates them into spatio-temporal information. The prominent substantive feature and notable advance of the invention is that, by using ConvLSTM to extract temporal information instead of computing optical flow, salient-object detection is more accurate and faster than optical-flow-based methods.
(2) Compared with CN109784183A, the method of the invention adopts the connection pattern of a residual network, with multiple convolution layers connected through residual blocks.
(3) Compared with CN109118469A, the prominent substantive feature and notable advance of the method of the invention is that no cumbersome sparse-matrix extraction is needed; a deep neural network extracts high-level features from the video frames and predicts every pixel, so the detection results are more accurate and more robust.
(4) Compared with CN105913456B, the prominent substantive feature and notable advance of the method of the invention is that it directly adopts an end-to-end neural network without linear iteration or the computationally expensive k-means clustering, so that after training a prediction result can be obtained more quickly.
(5) Compared with CN109034001A, the method of the invention uses a deep-network-based edge detection branch to extract the edges of the salient object in the original image, which then guides the generation of a complete saliency map. The prominent substantive feature and notable advance of the invention is that the salient objects in the resulting saliency map are more complete.
(6) Compared with CN108241854A, although both are deep learning methods, the method of the invention uses dilated convolutions to extract feature maps at four different scales, so the extracted features are more comprehensive; the prominent substantive feature and notable advance of the invention is therefore that the edges of the salient objects in the final saliency map are smoother.
(7) Compared with CN110598537A, the prominent substantive feature and notable advance of the method of the invention is that ConvLSTM is used to simulate the optical-flow information between frames, and the information so extracted is more accurate than optical flow computed by traditional methods.
(8) Compared with "Video Salient Object Detection via Fully Convolutional Networks", the prominent substantive feature and notable advance of the method of the invention is that temporal information between frames is exploited, so the obtained prediction maps are more accurate.
(9) The invention provides a video saliency detection model based on a deep network. Unlike traditional edge detection algorithms, it can accurately detect the contour of the salient object in every frame of a video sequence to guide the prediction of the saliency map.
(10) The method uses the deep salient-object edge detection branch to generate a salient-object contour map and fuses it with the spatio-temporal saliency map of each frame in the video, so that the contour is smoother and the salient object in each frame of the video sequence can be predicted more accurately.
Drawings
The invention is further illustrated with reference to the following figures and examples.
Fig. 1 is a schematic block diagram of a process of the video saliency detection method based on a deep network according to the present invention.
FIG. 2 is the final saliency prediction map Y_final, in an embodiment of the invention, of a video frame I whose salient objects are a cat and a box.
Detailed Description
The embodiment shown in Fig. 1 shows that the video saliency detection method based on the deep network proceeds as follows:
inputting a video frame I and preprocessing → extracting the initial spatial feature map S of the video frame I' → obtaining the five-scale spatial feature map S_final → obtaining the feature map F → obtaining the coarse spatio-temporal saliency map Y_ST and the edge contour map E_t of the salient object → obtaining the final saliency prediction map Y_final → calculating the loss for the input video frame I → completing the deep-network-based video saliency detection.
Example 1
The method for detecting video saliency based on the deep network comprises the following specific steps:
Firstly, inputting a video frame I and preprocessing:
A video frame I, whose salient objects are a cat and a box, is input and resized to a uniform width and height of 473 × 473 pixels, and the mean of the corresponding channel is subtracted from each pixel value of the video frame I, where the mean of the R channel of each video frame I is 104.00698793, the mean of the G channel is 116.66876762, and the mean of the B channel is 122.67891434, so that the shape of the video frame input to the ResNet50 deep network is 473 × 473 × 3. The video frame preprocessed in this way is denoted I', as shown in equation (1):
I′=Resize(I-Mean(R,G,B)) (1),
In equation (1), Mean(R,G,B) is the mean of the three color channels red, green and blue, and Resize(·) is the function that adjusts the size of the video frame I';
Secondly, extracting the initial spatial feature map S of the video frame I':
The video frame I' preprocessed in the first step is fed into the ResNet50 deep network to extract the initial spatial feature map S, as shown in equation (2):
S=ResNet50(I′) (2),
In equation (2), ResNet50(·) is the ResNet50 deep network,
the ResNet50 deep network comprises a convolution layer, a pooling layer, a nonlinear activation function Relu layer and residual connection;
the third stepObtaining a space characteristic map S with five scales final
Respectively sending the initial spatial feature map S of the video frame I' extracted in the second step into four different expansion convolutions with expansion rates of 2,4,8 and 16 in a ResNet50 deep network to obtain results T with four scales with expansion rates of 2,4,8 and 16 respectively k Then the result is connected with the initial space characteristic diagram S of the output result of the ResNet50 deep network in series to finally obtain a space characteristic diagram S with five scales final
The specific operation of obtaining the five-scale spatial feature map S_final is as follows:
The dilated convolution kernels in the ResNet50 deep network are denoted C_k ∈ R^(c×c×C), where K is the number of dilated convolution layers, c × c is the kernel width times height, and C is the number of channels; r_k is the dilation rate of the dilated convolution, whose stride is set to 1. Based on these parameters, four output feature maps T_k ∈ R^(W×H×C) are obtained, where W and H are the width and height, respectively, as shown in equation (3),
T_k = C_k ⊛ S, k = 1, 2, …, K (3),
In equation (3), C_k is the k-th dilated convolution kernel, K is the number of dilated convolutions, ⊛ is the dilated convolution operation, and S is the initial spatial feature map;
The initial spatial feature map S obtained from the ResNet50 deep network has a shape of 60 × 60 × 2048; there are four dilated convolutions, with k taking values in [1, 2, 3, 4], the dilation rate r_k takes the four values r_k = 2, 4, 8, 16, and the dilated convolution kernels C_k all have a shape of 3 × 3 × 512, so that feature maps T_k at four different scales are finally obtained. These are then concatenated in sequence, as shown in equation (4),
S_final=[S,T_1,T_2,…,T_K] (4),
In equation (4), S_final is the final multi-scale spatial feature map, S is the initial spatial feature map extracted by the ResNet50 deep network, and T_K is the feature map obtained after dilated convolution; the five-scale spatial feature map S_final has a shape of 60 × 60 × 4096;
Fourthly, obtaining the feature map F:
The five-scale spatial feature map S_final obtained in the third step is passed through a convolution operation with a 3 × 3 × 32 kernel to obtain the feature map F of shape 60 × 60 × 32, as shown in equation (5),
F=BN(Relu(Conv(S_final))) (5),
In equation (5), Conv(·) is a convolution operation, Relu(·) is the nonlinear activation function, and BN(·) is a normalization operation;
Fifthly, obtaining the coarse spatio-temporal saliency map Y_ST and the edge contour map E_t of the salient object:
The feature map F obtained in the fourth step is input simultaneously into the spatio-temporal branch and the edge detection branch to obtain the spatio-temporal feature map F_ST and the edge contour map E_t of the salient object. The specific operations are as follows.
The feature map F obtained in the fourth step is input into the ConvLSTM of the spatio-temporal branch to obtain the spatio-temporal feature map F_ST, as shown in equation (6),
F_ST=ConvLSTM(F,H_{t-1}) (6),
In equation (6), ConvLSTM(·) is the ConvLSTM operation and H_{t-1} is the state of the ConvLSTM unit at the previous time step;
The resulting spatio-temporal feature map F_ST is then fed into a convolution layer with a 1 × 1 kernel to obtain the coarse spatio-temporal saliency map Y_ST, as follows:
Y_ST=Conv(F_ST) (7),
In equation (7), Conv(·) is a convolution operation;
The feature map F obtained in the fourth step is input into the edge detection branch to obtain the edge contour map E_t of the salient object, as follows.
The edge detection branch comprises a two-layer ConvLSTM. ConvLSTM is a strong recurrent model used to capture temporal-sequence information; it delineates the contour edges of the salient object according to the temporal information and distinguishes salient from non-salient objects in the image. More specifically, through the ResNet50 deep network and the dilated convolutions, the static features {X_t, t = 1, 2, …, T} of an input video of T frames are obtained, where X_t corresponds to the video frame of the t-th frame. Given X_t, the edge detection branch outputs the edge contour map E_t ∈ [0,1]^(W×H), where W and H are the width and height of the predicted edge image; the output of the edge detection network is computed taking the previous video frame into account, as shown in equations (8) and (9),
H_t=ConvLSTM(X_t,H_{t-1}) (8),
(equation (9), which defines the unweighted edge map E_t', is given only as a formula image in the original),
In equations (8) and (9), H_t ∈ R^(W×H×M) is the 3D tensor hidden state, M is the number of channels, E_t' is the unweighted edge contour map, H_t is the current ConvLSTM cell state, H_{t-1} is the ConvLSTM cell state at the previous time step, and X_1 is the video frame of the first frame;
By embedding one ConvLSTM within another ConvLSTM, the key component for obtaining the edge contour map E_t is the edge detection network, defined in equation (10) (given only as a formula image in the original);
The output of the edge detection network is then used for weighting to obtain the edge contour map E_t of the salient object, as shown in equation (11) (given only as a formula image in the original); in equation (11), a 1 × 1 convolution kernel maps the output of the edge detection network to a weight matrix, and the sigmoid function σ normalizes the matrix to [0, 1];
Thus, the coarse spatio-temporal saliency map Y_ST and the edge contour map E_t of the salient object are obtained;
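A minimal sketch of the weighting at the end of this edge detection branch as described above; the 32-channel width of the edge-network output and the element-wise product are assumptions, since equations (9)-(11) appear only as formula images in the original:

```python
import torch
import torch.nn as nn

weight_conv = nn.Conv2d(32, 1, kernel_size=1)    # the 1 x 1 convolution kernel

edge_net_out = torch.randn(1, 32, 60, 60)        # output of the nested-ConvLSTM edge network
E_unweighted = torch.randn(1, 1, 60, 60)         # unweighted edge map E_t'

W = torch.sigmoid(weight_conv(edge_net_out))     # weight matrix normalized to [0, 1]
E_t = W * E_unweighted                           # weighted edge contour map E_t
```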
Sixthly, obtaining the final saliency prediction map Y_final:
The coarse spatio-temporal saliency map Y_ST obtained in the fifth step and the edge contour map E_t of the salient object are fused to obtain the final saliency prediction map Y_final, as shown in equation (12) (given only as a formula image in the original);
In equation (12), the product is a matrix multiplication, σ is the sigmoid function, and Resize(·) is the function that adjusts the video frame size, restoring the obtained map to the 473 × 473 size of the original input video frame;
FIG. 2 is the final saliency prediction map Y_final of the video frame I of this embodiment, which contains two salient objects, a cat and a box.
Seventhly, calculating the loss for the input video frame I:
A saliency map is computed for the input video frame I through the first to the sixth step. To measure the difference between the final saliency prediction map Y_final obtained in the sixth step and the ground truth, a binary cross-entropy loss function is used during training, as shown in equation (13),
L = -(1/(N×N)) Σ_{i=1}^{N} Σ_{j=1}^{N} [ G(i,j)·log M(i,j) + (1-G(i,j))·log(1-M(i,j)) ] (13),
In equation (13), G(i,j) ∈ [0,1] is the true value of pixel (i,j), M(i,j) ∈ [0,1] is the predicted value of pixel (i,j), and N = 473;
The network is trained by continuously reducing the value of this binary cross-entropy loss function, which is optimized with a stochastic gradient descent method, thereby completing the deep-network-based video saliency detection.
In the above embodiment, the ResNet50 deep network, ConvLSTM, ground truth and the stochastic gradient descent method are all known in the art.

Claims (2)

1. A video saliency detection method based on a deep network, characterized in that: a ResNet50 deep network is first used to extract spatial features, and temporal and edge information are then extracted to jointly obtain a saliency prediction map, completing deep-network-based video saliency detection, with the following specific steps:
firstly, inputting a video frame I and preprocessing:
a video frame I is input and resized to a uniform width and height of 473 × 473 pixels, and the mean of the corresponding channel is subtracted from each pixel value of the video frame I, where the mean of the R channel of each video frame I is 104.00698793, the mean of the G channel is 116.66876762, and the mean of the B channel is 122.67891434, so that the shape of the video frame input to the ResNet50 deep network is 473 × 473 × 3, and the video frame preprocessed in this way is denoted I', as shown in equation (1):
I′=Resize(I-Mean(R,G,B)) (1),
in equation (1), Mean(R,G,B) is the mean of the three color channels red, green and blue, and Resize(·) is the function that adjusts the size of the video frame I';
secondly, extracting the initial spatial feature map S of the video frame I':
the video frame I' preprocessed in the first step is fed into the ResNet50 deep network to extract the initial spatial feature map S, as shown in equation (2):
S=ResNet50(I′) (2),
in equation (2), ResNet50(·) is the ResNet50 deep network,
the ResNet50 deep network comprises a convolution layer, a pooling layer, a nonlinear activation function Relu layer and residual connection;
thirdly, obtaining the five-scale spatial feature map S_final:
the initial spatial feature map S of the video frame I' extracted in the second step is fed into four different dilated convolutions with dilation rates of 2, 4, 8 and 16 in the ResNet50 deep network to obtain results T_k at four scales corresponding to the dilation rates 2, 4, 8 and 16, which are then concatenated with the initial spatial feature map S output by the ResNet50 deep network, finally yielding the five-scale spatial feature map S_final;
Fourthly, obtaining a characteristic diagram F:
obtaining the five-scale space characteristic diagram S obtained in the third step final The feature map F having a shape of 60 × 60 × 32 is obtained by a convolution operation with a convolution kernel of 3 × 3 × 32, as shown in the following equation (3),
F=BN(Relu(Conv(S final ))) (3),
in formula (3), conv (-) is a convolution operation, relu (-) is a nonlinear activation function, and BN (-) is a normalization operation;
fifthly, obtaining the coarse spatio-temporal saliency map Y_ST and the edge contour map E_t of the salient object:
the feature map F obtained in the fourth step is input simultaneously into the spatio-temporal branch and the edge detection branch to obtain the spatio-temporal feature map F_ST and the edge contour map E_t of the salient object, the specific operations being as follows:
the feature map F obtained in the fourth step is input into the ConvLSTM of the spatio-temporal branch to obtain the spatio-temporal feature map F_ST, as shown in equation (4),
F_ST=ConvLSTM(F,H_{t-1}) (4),
in equation (4), ConvLSTM(·) is the ConvLSTM operation and H_{t-1} is the state of the ConvLSTM unit at the previous time step;
the resulting spatio-temporal feature map F_ST is then fed into a convolution layer with a 1 × 1 kernel to obtain the coarse spatio-temporal saliency map Y_ST, as follows:
Y_ST=Conv(F_ST) (5),
in equation (5), Conv(·) is a convolution operation;
the feature map F obtained in the fourth step is input into the edge detection branch to obtain the edge contour map E_t of the salient object, as follows:
through the ResNet50 deep network and the dilated convolutions, the static features {X_t, t = 1, 2, …, T} of an input video of T frames are obtained, where X_t corresponds to the video frame of the t-th frame; given X_t, the edge detection branch outputs the edge contour map E_t ∈ [0,1]^(W×H), where W and H are the width and height of the predicted edge image; the output of the edge detection network is computed taking the previous video frame into account, as shown in equations (6) and (7),
H_t=ConvLSTM(X_t,H_{t-1}) (6),
(equation (7), which defines the unweighted edge map E_t', is given only as a formula image in the original),
in equations (6) and (7), H_t ∈ R^(W×H×M) is the 3D tensor hidden state, M is the number of channels, E_t' is the unweighted edge contour map, H_t is the current ConvLSTM cell state, H_{t-1} is the ConvLSTM cell state at the previous time step, and X_1 is the video frame of the first frame;
by embedding one ConvLSTM within another ConvLSTM, the key component for obtaining the edge contour map E_t is the edge detection network, defined in equation (8) (given only as a formula image in the original);
the output of the edge detection network is then used for weighting to obtain the edge contour map E_t of the salient object, as shown in equation (9) (given only as a formula image in the original); in equation (9), a 1 × 1 convolution kernel maps the output of the edge detection network to a weight matrix, and the sigmoid function σ normalizes the matrix to [0, 1];
thus, the coarse spatio-temporal saliency map Y_ST and the edge contour map E_t of the salient object are obtained;
Sixthly, obtaining a final significance prediction result graph Y final
The rough space-time saliency map Y obtained in the fifth step is used ST And edge profile E of salient objects t Fusing to obtain a final significance prediction result graph Y final As shown in the following equation (10),
Figure FDA00024413994100000211
in the formula (10), the first and second groups,
Figure FDA00024413994100000212
for matrix multiplication, σ is a sigmoid function, resize (·) is a function for adjusting the video frame size,
restoring the obtained video frame to 473 × 473 of the original input video frame;
seventhly, calculating the loss for the input video frame I:
a saliency map is computed for the input video frame I through the first to the sixth step, and to measure the difference between the final saliency prediction map Y_final and the ground truth, a binary cross-entropy loss function is used during training, as shown in equation (11),
L = -(1/(N×N)) Σ_{i=1}^{N} Σ_{j=1}^{N} [ G(i,j)·log M(i,j) + (1-G(i,j))·log(1-M(i,j)) ] (11),
in equation (11), G(i,j) is the true value of pixel (i,j), M(i,j) is the predicted value of pixel (i,j), and N = 473;
the network is trained by continuously reducing the value of this binary cross-entropy loss function, which is optimized with a stochastic gradient descent method, thereby completing the deep-network-based video saliency detection.
2. The video saliency detection method based on the deep network according to claim 1, characterized in that the specific operation of obtaining the five-scale spatial feature map S_final is as follows:
the dilated convolution kernels in the ResNet50 deep network are denoted C_k ∈ R^(c×c×C), where K is the number of dilated convolution layers, c × c is the kernel width times height, and C is the number of channels; r_k is the dilation rate of the dilated convolution, whose stride is set to 1; based on these parameters, four output feature maps T_k ∈ R^(W×H×C) are obtained, where W and H are the width and height, respectively, as shown in equation (12),
T_k = C_k ⊛ S, k = 1, 2, …, K (12),
in equation (12), C_k is the k-th dilated convolution kernel, K is the number of dilated convolutions, ⊛ is the dilated convolution operation, and S is the initial spatial feature map;
the initial spatial feature map S obtained from the ResNet50 deep network has a shape of 60 × 60 × 2048; there are four dilated convolutions, with k taking values in [1, 2, 3, 4], the dilation rate r_k takes the four values r_k = {2, 4, 8, 16}, and the dilated convolution kernels C_k all have a shape of 3 × 3 × 512, so that feature maps T_k at four different scales are finally obtained, which are then concatenated in sequence, as shown in equation (13),
S_final=[S,T_1,T_2,…,T_K] (13),
in equation (13), S_final is the final multi-scale spatial feature map, S is the initial spatial feature map extracted by the ResNet50 deep network, and T_K is the feature map obtained after dilated convolution; the five-scale spatial feature map S_final has a shape of 60 × 60 × 4096.
CN202010266351.2A 2020-04-07 2020-04-07 Video significance detection method based on deep network Active CN111461043B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010266351.2A CN111461043B (en) 2020-04-07 2020-04-07 Video significance detection method based on deep network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010266351.2A CN111461043B (en) 2020-04-07 2020-04-07 Video significance detection method based on deep network

Publications (2)

Publication Number Publication Date
CN111461043A CN111461043A (en) 2020-07-28
CN111461043B true CN111461043B (en) 2023-04-18

Family

ID=71685906

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010266351.2A Active CN111461043B (en) 2020-04-07 2020-04-07 Video significance detection method based on deep network

Country Status (1)

Country Link
CN (1) CN111461043B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111931732B (en) * 2020-09-24 2022-07-15 苏州科达科技股份有限公司 Method, system, device and storage medium for detecting salient object of compressed video
CN112861733B (en) * 2021-02-08 2022-09-02 电子科技大学 Night traffic video significance detection method based on space-time double coding
CN112950477B (en) * 2021-03-15 2023-08-22 河南大学 Dual-path processing-based high-resolution salient target detection method
CN117152670A (en) * 2023-10-31 2023-12-01 江西拓世智能科技股份有限公司 Behavior recognition method and system based on artificial intelligence

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108256562A (en) * 2018-01-09 2018-07-06 深圳大学 Well-marked target detection method and system based on Weakly supervised space-time cascade neural network
CN109448015A (en) * 2018-10-30 2019-03-08 河北工业大学 Image based on notable figure fusion cooperates with dividing method
CN110929736A (en) * 2019-11-12 2020-03-27 浙江科技学院 Multi-feature cascade RGB-D significance target detection method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106157307B (en) * 2016-06-27 2018-09-11 浙江工商大学 A kind of monocular image depth estimation method based on multiple dimensioned CNN and continuous CRF

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108256562A (en) * 2018-01-09 2018-07-06 深圳大学 Well-marked target detection method and system based on Weakly supervised space-time cascade neural network
CN109448015A (en) * 2018-10-30 2019-03-08 河北工业大学 Image based on notable figure fusion cooperates with dividing method
CN110929736A (en) * 2019-11-12 2020-03-27 浙江科技学院 Multi-feature cascade RGB-D significance target detection method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Guo, YC, et al. Video Object Extraction Based on Spatiotemporal Consistency Saliency Detection. IEEE Access, 2018, vol. 6, 35171-35181. *
Shi Shuo. Research on Local Invariant Features of Images and Their Applications. China Doctoral Dissertations Full-text Database, Information Science and Technology, 2015, I138-45. *

Also Published As

Publication number Publication date
CN111461043A (en) 2020-07-28

Similar Documents

Publication Publication Date Title
CN111461043B (en) Video significance detection method based on deep network
Kim et al. Deep monocular depth estimation via integration of global and local predictions
US10839543B2 (en) Systems and methods for depth estimation using convolutional spatial propagation networks
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
CN109886066B (en) Rapid target detection method based on multi-scale and multi-layer feature fusion
US11100401B2 (en) Predicting depth from image data using a statistical model
US11361456B2 (en) Systems and methods for depth estimation via affinity learned with convolutional spatial propagation networks
Lin et al. Depth estimation from monocular images and sparse radar data
Cao et al. Exploiting depth from single monocular images for object detection and semantic segmentation
Zhang et al. Deep hierarchical guidance and regularization learning for end-to-end depth estimation
CN107680106A (en) A kind of conspicuousness object detection method based on Faster R CNN
CN111612807A (en) Small target image segmentation method based on scale and edge information
CN110096961B (en) Indoor scene semantic annotation method at super-pixel level
CN107506792B (en) Semi-supervised salient object detection method
CN113344932B (en) Semi-supervised single-target video segmentation method
CN110827312B (en) Learning method based on cooperative visual attention neural network
CN104766065B (en) Robustness foreground detection method based on various visual angles study
WO2016165064A1 (en) Robust foreground detection method based on multi-view learning
CN113095371B (en) Feature point matching method and system for three-dimensional reconstruction
Wang et al. A feature-supervised generative adversarial network for environmental monitoring during hazy days
CN108388901B (en) Collaborative significant target detection method based on space-semantic channel
Chen et al. Pgnet: Panoptic parsing guided deep stereo matching
CN106327513B (en) Shot boundary detection method based on convolutional neural network
Li et al. Spatiotemporal road scene reconstruction using superpixel-based Markov random field
Tseng et al. Semi-supervised image depth prediction with deep learning and binocular algorithms

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant