CN111461043B - Video saliency detection method based on deep network - Google Patents
Video saliency detection method based on deep network
- Publication number
- CN111461043B (application CN202010266351.2A)
- Authority
- CN
- China
- Prior art keywords
- video frame
- final
- obtaining
- video
- saliency
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06N3/045—Combinations of networks
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06T5/70—Denoising; Smoothing
- G06T7/13—Edge detection
- G06T2207/10016—Video; Image sequence
- G06T2207/10024—Color image
- G06T2207/20081—Training; Learning
- G06T2207/20084—Artificial neural networks [ANN]
- G06T2207/20221—Image fusion; Image merging
- G06V2201/07—Target detection
Abstract
The invention relates to a video saliency detection method based on a deep network, in the field of image data processing. A ResNet50 deep network is first used to extract spatial features, after which temporal and edge information are extracted to jointly obtain a saliency prediction map, completing deep-network-based video saliency detection. The method extracts an initial spatial feature map S from the video frame I′; obtains a five-scale spatial feature map S_final; obtains a feature map F; obtains a coarse spatio-temporal saliency map Y_ST and a salient-object edge map E_t; obtains the final saliency prediction map Y_final; and calculates the loss for the input video frame I, completing the deep-network-based video saliency detection. The method overcomes the defects of prior-art video saliency detection, namely incomplete salient-object detection and inaccurate detection when foreground and background colors are similar.
Description
Technical Field
The technical scheme of the invention relates to the field of image data processing, and in particular to a video saliency detection method based on a deep network.
Background
Video saliency detection aims to extract the regions of greatest interest to the human eye from successive video frames. Specifically, it uses a computer to simulate the visual attention mechanism of the human eye to extract the regions of interest from video frames, and is one of the key technologies in the field of computer vision.
Most conventional video saliency detection methods are based on low-level handcrafted features (e.g. color, texture). These are typically heuristic methods with two drawbacks: slow speed (due to time-consuming optical flow computation) and low prediction accuracy (due to the limited representational power of low-level features). In recent years, deep neural networks have been applied to video saliency detection; deep learning methods compute the saliency value of an image from high-level semantic features extracted by a convolutional neural network. However, a deep convolutional network can lose position and detail information of the target, introduce misleading information when detecting the salient target, and leave the detected target incomplete.
In 2016, Liu et al., in the article "Saliency detection for unconstrained videos using superpixel-level graph and spatiotemporal propagation", proposed the SGSP algorithm for video saliency detection using a superpixel-level graph model and spatiotemporal propagation. It first extracts superpixel-level motion and color histograms and a global motion histogram to construct a graph. Next, motion saliency is computed iteratively via shortest paths on the graph using a graph-model-based background prior. Saliency is then propagated forward and backward in time and locally and globally in space, and the two results are finally fused to form the final saliency map. The algorithm is computationally heavy, and the resulting saliency map still suffers from incomplete detection of salient targets. Deep learning models aim to obtain richer depth features with convolutional neural networks and thereby more accurate detection results. In 2017, Wang et al. proposed a video saliency detection method based on a fully convolutional network, the first use of a deep-learning fully convolutional network in the video saliency detection field; but because temporal information between frames is not considered, the edges of the obtained saliency maps are not fine enough and edge noise is large. CN106372636A discloses a video saliency detection method based on HOG_TOP, which computes HOG_TOP features on the three orthogonal planes XY, XT and YT of the original video, computes a spatial-domain saliency map on the XY plane and temporal-domain saliency maps on the XT and YT planes, and finally obtains the final saliency map by adaptive fusion.
CN109784183A discloses a video salient-object detection method based on a cascade convolutional network and optical flow, which uses a cascade network structure to perform pixel-level saliency prediction on the current frame at three scales (high, medium and low). The cascade network is trained with the MSRA10K image dataset, with the saliency label map as training supervision and a cross-entropy loss function. After training, the trained cascade network performs static saliency prediction on each frame of the video, and the Lucas-Kanade algorithm extracts an optical flow field. A dynamic optimization network is then built from a three-layer convolutional structure, and the static detection result and optical-flow detection result of each frame are concatenated as the input of the optimization network. This method is time-consuming, and the optical flow extracted by the Lucas-Kanade algorithm is inaccurate and lacks robustness in complex scenes. CN109118469A discloses a prediction method for video saliency, which quantizes the image to obtain a sparse-matrix response, obtains a decomposition matrix under local coordinate constraints, computes a saliency map for each frame in the video and performs quality prediction. The method loses some detail information of the salient target, so the prediction suffers from incomplete detection of the salient target. CN105913456B discloses a video saliency detection method by region segmentation, which first uses nonlinear clustering to obtain superpixel blocks for static features, then uses an optical flow method to obtain dynamic features, and finally predicts the saliency map with a linear regression model after fusing the two features.
CN109034001A discloses a cross-modal video saliency detection method based on spatio-temporal cues, which constructs a saliency map from weights of an initial saliency map, visible light and thermal infrared; a suitable weight value is difficult to find, resulting in poor robustness. CN108241854A discloses a depth video saliency detection method based on motion and memory information, which extracts local and global information from the human-eye attention map of the current frame and inputs them, as prior information together with the original image, into a deep network model to predict the final saliency map. CN110598537A discloses a video saliency detection method based on a deep convolutional network, which uses the current frame of a video and its corresponding optical flow image as the input of a feature extraction network to predict the final saliency map; this method needs to compute the optical flow of the current frame in advance, which is computationally expensive.
In summary, the prior art of video salient-object detection still has the problems that salient-object detection is incomplete, and that detection is inaccurate when foreground and background colors are similar.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: to provide a video saliency detection method based on a deep network, in which a ResNet50 deep network is first used to extract spatial features, then temporal and edge information are extracted to jointly obtain a saliency prediction map, completing deep-network-based video saliency detection and overcoming the prior-art defects of incomplete salient-object detection and inaccurate detection when foreground and background colors are similar.
The technical scheme adopted by the invention to solve this technical problem is a video saliency detection method based on a deep network: a ResNet50 deep network is first used to extract spatial features, then temporal and edge information are extracted to jointly obtain a saliency prediction map, completing the deep-network-based video saliency detection. The specific steps are as follows:
firstly, inputting a video frame I, and preprocessing:
Inputting video frames I and unifying their size to a width and height of 473 × 473 pixels; the mean of the corresponding channel is subtracted from each pixel value in the video frame I, where the mean of the R channel of each video frame I is 104.00698793, the mean of the G channel is 116.66876762, and the mean of the B channel is 122.67891434, so that the shape of the video frame I before input to the ResNet50 deep network is 473 × 473 × 3; the preprocessed video frame is denoted I′, as shown in the following formula (1):
I′=Resize(I-Mean(R,G,B)) (1),
In formula (1), Mean(R, G, B) is the mean of the red, green and blue color channels, and Resize(·) is a function that adjusts the size of the video frame;
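The preprocessing of formula (1), per-channel mean subtraction followed by resizing to 473 × 473, can be sketched in plain NumPy. This is an illustrative reconstruction, not the patent's code: the nearest-neighbour resize stands in for whatever interpolation the authors used, and the channel means are the values stated above.

```python
import numpy as np

# Channel means stated in the patent (R, G, B order).
MEAN_RGB = np.array([104.00698793, 116.66876762, 122.67891434], dtype=np.float64)

def nearest_resize(img: np.ndarray, size: int = 473) -> np.ndarray:
    """Nearest-neighbour resize of an H x W x 3 image to size x size."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size   # source row for each output row
    cols = np.arange(size) * w // size   # source column for each output column
    return img[rows[:, None], cols, :]

def preprocess(frame: np.ndarray) -> np.ndarray:
    """I' = Resize(I - Mean(R, G, B)), formula (1): subtract per-channel mean, resize."""
    return nearest_resize(frame.astype(np.float64) - MEAN_RGB)

frame = np.random.randint(0, 256, (360, 640, 3)).astype(np.float64)
out = preprocess(frame)
# out has shape (473, 473, 3) and each channel is roughly zero-mean
```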
secondly, extracting an initial spatial feature map S of the video frame I':
sending the video frame I' after the first step of preprocessing into a ResNet50 deep network to extract an initial spatial feature map S, wherein the formula (2) is as follows:
S=ResNet50(I′) (2),
In formula (2), ResNet50(·) denotes the ResNet50 deep network,
the ResNet50 deep network comprises a convolution layer, a pooling layer, a nonlinear activation function Relu layer and residual connection;
Thirdly, obtaining the five-scale spatial feature map S_final:
The initial spatial feature map S of the video frame I′ extracted in the second step is fed into four different dilated convolutions in the ResNet50 deep network, with dilation rates of 2, 4, 8 and 16, to obtain results T_k at four scales. These results are then concatenated with the initial spatial feature map S output by the ResNet50 deep network, finally obtaining the five-scale spatial feature map S_final,
Fourthly, obtaining a characteristic diagram F:
obtaining the space characteristic diagram S of five scales by the third step final The feature map F having a shape of 60 × 60 × 32 is obtained by a convolution operation with a convolution kernel of 3 × 3 × 32, as shown in the following equation (3),
F=BN(Relu(Conv(S final ))) (3),
in formula (3), conv (-) is a convolution operation, relu (-) is a nonlinear activation function, and BN (-) is a normalization operation;
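Formula (3) applies, in order, a convolution, a Relu, and a batch normalization. A minimal NumPy sketch of that pipeline illustrates the shapes involved; 64 input channels stand in for the 4096 of S_final, and the weights are random rather than learned.

```python
import numpy as np

def conv3x3(x, w):
    """'Same' 3x3 convolution: x is H x W x C_in, w is 3 x 3 x C_in x C_out."""
    h, wd, cin = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((h, wd, w.shape[-1]))
    for i in range(3):
        for j in range(3):
            # accumulate the contribution of kernel tap (i, j)
            out += xp[i:i + h, j:j + wd, :] @ w[i, j]
    return out

def relu(x):
    return np.maximum(x, 0.0)

def batch_norm(x, eps=1e-5):
    """Per-channel normalisation over the spatial dims (no learned scale/shift)."""
    mu = x.mean(axis=(0, 1), keepdims=True)
    var = x.var(axis=(0, 1), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

# F = BN(Relu(Conv(S_final))), formula (3): 60 x 60 x C -> 60 x 60 x 32
rng = np.random.default_rng(0)
s_final = rng.standard_normal((60, 60, 64))
w = rng.standard_normal((3, 3, 64, 32)) * 0.01
f = batch_norm(relu(conv3x3(s_final, w)))
# f has shape (60, 60, 32), matching the patent's feature map F
```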
Fifthly, obtaining the coarse spatio-temporal saliency map Y_ST and the salient-object edge map E_t:
The feature map F obtained in the fourth step is input simultaneously to the spatio-temporal branch and the edge detection branch, yielding a spatio-temporal feature map F_ST and the salient-object edge map E_t. The specific operations are as follows.
The feature map F obtained in the fourth step is input to the ConvLSTM of the spatio-temporal branch to obtain the spatio-temporal feature map F_ST, as shown in the following formula (4),
F_ST = ConvLSTM(F, H_{t-1}) (4),
In formula (4), ConvLSTM(·) is a ConvLSTM operation and H_{t-1} is the state of the ConvLSTM unit at the previous time step;
The obtained spatio-temporal feature map F_ST is then fed into a convolution layer with a 1 × 1 kernel to obtain the coarse spatio-temporal saliency map Y_ST, as follows:
Y_ST = Conv(F_ST) (5),
In formula (5), Conv(·) is a convolution operation;
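The spatio-temporal branch of formulas (4) and (5) relies on ConvLSTM, whose gates are convolutions rather than matrix products, so the hidden state carries temporal context while keeping spatial layout. The following is a NumPy sketch of a single ConvLSTM step with random weights; it is illustrative only, not the patent's implementation, and the gate layout is the standard i/f/o/g convention.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv3x3(x, w):
    """'Same' 3x3 convolution, x: H x W x C_in, w: 3 x 3 x C_in x C_out."""
    h, wd = x.shape[:2]
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((h, wd, w.shape[-1]))
    for i in range(3):
        for j in range(3):
            out += xp[i:i + h, j:j + wd, :] @ w[i, j]
    return out

def convlstm_step(x, h_prev, c_prev, w):
    """One ConvLSTM step: gates come from a 3x3 convolution over [x, h_prev]."""
    z = conv3x3(np.concatenate([x, h_prev], axis=-1), w)  # 4 * hidden channels
    hc = h_prev.shape[-1]
    i = sigmoid(z[..., :hc])           # input gate
    f = sigmoid(z[..., hc:2 * hc])     # forget gate
    o = sigmoid(z[..., 2 * hc:3 * hc]) # output gate
    g = np.tanh(z[..., 3 * hc:])       # candidate cell state
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(1)
hc, xc = 8, 16
w = rng.standard_normal((3, 3, xc + hc, 4 * hc)) * 0.05
h = c = np.zeros((60, 60, hc))
for _ in range(3):                      # a short clip of 3 frames
    x = rng.standard_normal((60, 60, xc))
    h, c = convlstm_step(x, h, c, w)    # F_ST = ConvLSTM(F, H_{t-1}), formula (4)
# h has shape (60, 60, 8) and accumulates temporal context across frames
```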
The feature map F obtained in the fourth step is input to the edge detection branch to obtain the salient-object edge map E_t. The specific operations are as follows.
The static features {X_t}, t = 1, …, T, of a T-frame input video are obtained through the ResNet50 deep network and the dilated convolutions, where X_t is the feature of the t-th video frame. Given X_t, the edge detection branch outputs the edge map E_t ∈ [0,1]^{W×H}, where W and H are the width and height of the predicted edge image. E_t is computed from the hidden state Ĥ_t of the edge detection network, which takes the previous video frames into account, as shown in formulas (6) and (7),
H_t = ConvLSTM(X_t, H_{t-1}) (6),
Ĥ_t = ConvLSTM(H_t, Ĥ_{t-1}) (7),
In formulas (6) and (7), Ĥ_t ∈ R^{W×H×M} is the 3D tensor hidden state, M is the number of channels, E′_t is the unweighted edge map, H_t is the current ConvLSTM unit state, H_{t-1} is the state of the ConvLSTM unit at the previous time step, and X_1 is the first video frame.
By embedding one ConvLSTM within another, the unweighted edge map E′_t is obtained from the hidden state Ĥ_t of the edge detection network, as shown in the following formula (8),
E′_t = Conv(Ĥ_t) (8),
The hidden state Ĥ_t of the edge detection network is then used for weighting to obtain the salient-object edge map E_t, as shown in the following formula (9),
E_t = σ(W_e ∗ Ĥ_t) ⊗ E′_t (9),
In formula (9), W_e is a 1 × 1 convolution kernel used to map the hidden state Ĥ_t of the edge detection network to a weight matrix, and the sigmoid function σ normalizes the matrix to [0, 1];
Thus, the coarse spatio-temporal saliency map Y_ST and the salient-object edge map E_t are obtained;
Sixthly, obtaining a final significance prediction result picture Y final :
The rough space-time saliency map Y obtained in the fifth step is used ST And edge profile E of salient objects t Fusing to obtain a final significance prediction result graph Y final As shown in the following equation (10),
in the formula (10), the first and second groups of the chemical reaction are shown in the formula,for matrix multiplication, σ is a sigmoid function, resize (·) is a function for adjusting the video frame size,
restoring the obtained video frame to 473 × 473 the size of the original input video frame;
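Assuming the fusion of formula (10) is an elementwise weighting of Y_ST by E_t followed by a sigmoid and an upsampling back to 473 × 473 (the exact operator is partly lost in the source text, which names only ⊗, σ and Resize), a NumPy sketch would be:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nearest_resize(img, size=473):
    """Nearest-neighbour upsample of a 2-D map to size x size."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows[:, None], cols]

def fuse(y_st, e_t):
    """Y_final = Resize(sigma(Y_ST (x) E_t)): weight the coarse map by the edge map
    elementwise, squash to (0, 1), then restore the 473 x 473 input resolution."""
    return nearest_resize(sigmoid(y_st * e_t))

rng = np.random.default_rng(2)
y_st = rng.standard_normal((60, 60))    # coarse spatio-temporal saliency
e_t = rng.uniform(0.0, 1.0, (60, 60))   # edge map, already in [0, 1]
y_final = fuse(y_st, e_t)
# y_final has shape (473, 473) with values strictly inside (0, 1)
```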
Seventh step, calculating the loss for the input video frame I:
A saliency map is computed for the input video frame I through the first to sixth steps. To measure the difference between the final saliency prediction map Y_final and the ground truth, a binary cross-entropy loss function L is adopted during training, as shown in the following formula (11),
L = −(1/N²) Σ_{i=1}^{N} Σ_{j=1}^{N} [G(i,j) log M(i,j) + (1 − G(i,j)) log(1 − M(i,j))] (11),
In formula (11), G(i,j) ∈ [0,1] is the ground-truth value of pixel (i,j), M(i,j) ∈ [0,1] is the predicted value of pixel (i,j), and N = 473 is taken.
The network is trained by continually reducing the value of L, and the stochastic gradient descent method is adopted to optimize the binary cross-entropy loss function L;
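The binary cross-entropy loss of formula (11) averages the pixelwise cross-entropy between the ground truth G and the prediction M over the N × N map. A direct NumPy implementation follows; the epsilon clip avoiding log(0) is an implementation detail not stated in the patent.

```python
import numpy as np

def bce_loss(g, m, eps=1e-7):
    """Binary cross-entropy of formula (11), averaged over all N x N pixels.
    g: ground truth in [0, 1]; m: predicted saliency in [0, 1]."""
    m = np.clip(m, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(g * np.log(m) + (1.0 - g) * np.log(1.0 - m))

rng = np.random.default_rng(3)
g = (rng.uniform(size=(473, 473)) > 0.5).astype(np.float64)
good = bce_loss(g, np.clip(g, 0.05, 0.95))  # near-perfect prediction
bad = bce_loss(g, np.full_like(g, 0.5))     # uninformative prediction
# good < bad: the loss shrinks as M approaches G, which is what SGD exploits
```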
This completes the deep-network-based video saliency detection.
In the video saliency detection method based on the deep network, the specific operations for obtaining the five-scale spatial feature map S_final are as follows:
The dilated convolution kernels in the ResNet50 deep network are denoted C_k ∈ R^{c×c×C}, k = 1, …, K, where K is the number of dilated convolution layers, c × c is the kernel width times height, C is the number of channels, and r_k is the dilation rate of the convolution, whose stride is set to 1. From these parameters four output feature maps T_k are derived, where W and H are their width and height respectively, as shown in the following formula (12),
T_k = C_k ⊛_{r_k} S, k = 1, …, K (12),
In formula (12), C_k is the k-th dilated convolution kernel, K is the number of dilated convolutions, ⊛_{r_k} is the dilated convolution operation, and S is the initial spatial feature map.
The shape of the initial spatial feature map S obtained from the ResNet50 deep network is 60 × 60 × 2048; K = 4 and k ranges over [1, 2, 3, 4]; the dilation rate r_k takes the four values r_k = 2, 4, 8, 16; and the dilated convolution kernels C_k all have shape 3 × 3 × 512, finally yielding feature maps T_k at four different scales, which are then concatenated in turn, as shown in the following formula (13),
S_final = [S, T_1, T_2, …, T_K] (13),
In formula (13), S_final is the final multi-scale spatial feature map, S is the initial spatial feature map extracted by the ResNet50 deep network, and T_K is the feature map obtained after dilated convolution; the five-scale spatial feature map S_final has shape 60 × 60 × 4096.
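Formulas (12) and (13), four dilated 3 × 3 convolutions at rates 2, 4, 8, 16 whose outputs are concatenated with the input, can be sketched in NumPy as follows. Channel counts are scaled down (32 in, 16 out per branch) so the demo stays light; the patent uses a 2048-channel input and 512-channel kernels.

```python
import numpy as np

def dilated_conv(x, w, rate):
    """'Same' dilated 3x3 convolution: the kernel taps are spaced `rate` pixels
    apart, enlarging the receptive field without extra parameters."""
    h, wd = x.shape[:2]
    p = rate  # padding needed for 'same' output with a dilated 3x3 kernel
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    out = np.zeros((h, wd, w.shape[-1]))
    for i in range(3):
        for j in range(3):
            out += xp[i * rate:i * rate + h, j * rate:j * rate + wd, :] @ w[i, j]
    return out

rng = np.random.default_rng(4)
s = rng.standard_normal((60, 60, 32))              # stands in for the 60x60x2048 map S
branches = []
for rate in (2, 4, 8, 16):                         # the four r_k of formula (12)
    w = rng.standard_normal((3, 3, 32, 16)) * 0.05
    branches.append(dilated_conv(s, w, rate))      # T_k = C_k (*)_{r_k} S
s_final = np.concatenate([s] + branches, axis=-1)  # formula (13): [S, T_1, ..., T_4]
# s_final has shape (60, 60, 32 + 4 * 16), mirroring 2048 + 4 * 512 = 4096
```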
The invention has the beneficial effects that: compared with the prior art, the invention has the prominent substantive characteristics and remarkable progress as follows:
(1) Compared with CN106372636A, the method of the invention adopts a deep-learning approach: ResNet50 and dilated convolution are first used to extract multi-scale spatial features, then ConvLSTM extracts temporal information, and finally the two are integrated into spatio-temporal information. The prominent substantive feature and remarkable progress of the invention is that extracting temporal information with ConvLSTM requires no optical flow computation, so salient-object detection is both more accurate and faster than methods that compute optical flow.
(2) Compared with CN109784183A, the method of the invention adopts a residual-network connection scheme, in which multiple convolution layers are connected through residual blocks.
(3) Compared with CN109118469A, the method of the invention has the prominent substantive features and obvious progress that fussy sparse matrix extraction is not needed, advanced features are extracted from video frames by adopting a deep neural network, each pixel point is predicted, the detection result is more accurate, and the robustness is better.
(4) Compared with CN105913456B, the method of the invention has the prominent substantive characteristics and remarkable progress that the method directly adopts an end-to-end neural network method without linear iteration and k-means clustering with larger calculation amount, and can obtain a prediction result more quickly after training is finished.
(5) Compared with CN109034001A, the method of the invention adopts a deep-network-based edge detection branch to extract the edges of the salient object in the original image, which guides the subsequent generation of a complete saliency map. The prominent substantive feature and remarkable progress of the invention is that the salient objects in the resulting saliency map are more complete.
(6) Compared with CN108241854A, although both are deep-learning methods, the method of the invention adopts dilated convolution to extract four feature maps at different scales, so the extracted features are more comprehensive. The prominent substantive feature and remarkable progress of the invention is that the edges of the salient objects in the final saliency map are smoother.
(7) Compared with CN110598537A, the method of the invention has the prominent substantive features and the remarkable progress that the ConvLSTM is used for simulating the optical flow information between frames, and the extracted optical flow information is more accurate than that calculated by the traditional method.
(8) Compared with Video Salient Object Detection via Fully Convolutional Networks, the method of the invention has the prominent substantive feature and remarkable progress of exploiting the temporal information between frames, so the obtained prediction result map is more accurate.
(9) The invention provides a video saliency detection model based on a deep network. Unlike traditional edge detection algorithms, it can accurately detect the contour of the salient object in each frame of a video sequence to guide the prediction of the saliency map.
(10) The method of the invention uses the deep salient-object edge detection branch to generate a salient-object contour map and fuses it with the spatio-temporal saliency map of each frame in the video, so that the contours are smoother and the salient target in each frame of the video sequence can be predicted more accurately.
Drawings
The invention is further illustrated with reference to the following figures and examples.
Fig. 1 is a schematic block diagram of a process of the video saliency detection method based on a deep network according to the present invention.
FIG. 2 is the saliency prediction result map Y_final for a video frame I whose salient targets are a cat and a box, in an embodiment of the invention.
Detailed Description
The embodiment shown in Fig. 1 illustrates the process of the deep-network-based video saliency detection method:
Inputting the video frame I and preprocessing → extracting the initial spatial feature map S of the video frame I′ → obtaining the five-scale spatial feature map S_final → obtaining the feature map F → obtaining the coarse spatio-temporal saliency map Y_ST and the salient-object edge map E_t → obtaining the final saliency prediction map Y_final → calculating the loss for the input video frame I → completing the deep-network-based video saliency detection.
Example 1
The method for detecting video saliency based on the deep network comprises the following specific steps:
firstly, inputting a video frame I, and preprocessing:
Inputting video frames I whose salient targets are a cat and a box, and unifying their size to a width and height of 473 × 473 pixels; the mean of the corresponding channel is subtracted from each pixel value in the video frame I, where the mean of the R channel of each video frame I is 104.00698793, the mean of the G channel is 116.66876762, and the mean of the B channel is 122.67891434, so that the shape of the video frame I before input to the ResNet50 deep network is 473 × 473 × 3; the preprocessed video frame is denoted I′, as shown in the following formula (1):
I′=Resize(I-Mean(R,G,B)) (1),
In formula (1), Mean(R, G, B) is the mean of the red, green and blue color channels, and Resize(·) is a function that adjusts the size of the video frame;
secondly, extracting an initial spatial feature map S of the video frame I':
sending the video frame I' after the first step of preprocessing into a ResNet50 deep network to extract an initial spatial feature map S, wherein the formula (2) is as follows:
S=ResNet50(I′) (2),
In formula (2), ResNet50(·) denotes the ResNet50 deep network,
the ResNet50 deep network comprises a convolution layer, a pooling layer, a nonlinear activation function Relu layer and residual connection;
the third stepObtaining a space characteristic map S with five scales final :
Respectively sending the initial spatial feature map S of the video frame I' extracted in the second step into four different expansion convolutions with expansion rates of 2,4,8 and 16 in a ResNet50 deep network to obtain results T with four scales with expansion rates of 2,4,8 and 16 respectively k Then the result is connected with the initial space characteristic diagram S of the output result of the ResNet50 deep network in series to finally obtain a space characteristic diagram S with five scales final ,
Obtaining a five-scale spatial feature map S final The specific operation is as follows:
the dilated convolution kernel in the ResNet50 deep network is represented asWherein K is the number of expanded convolution layers, cxc is the multiplication of width and height, C is the channel number, and>for expanding the parameters of the convolution, whose step size is set to 1, four output characteristic maps are derived on the basis of these parameters>Wherein W and H are the width and height, respectively, as shown in the following equation (3),
in the formula (3), C k Is an expanded convolution kernel with the value of K, the number of the expanded convolutions is K,for the dilation convolution operation, S is the initial spatial feature map,
the shape of the initial spatial feature map S obtained after the ResNet50 deep network is 60 × 60 × 2048, K = 4, the value range of k is [1, 2, 3, 4], the dilation rate r_k takes the four values r_k = {2, 4, 8, 16}, and the dilated convolution kernels C_k all have a shape of 3 × 3 × 512, thereby finally obtaining feature maps T_k of four different scales; they are then concatenated in turn, as shown in the following formula (4),
S final =[S,T 1 ,T 2 ,…,T K ] (4),
in formula (4), S_final is the final multi-scale spatial feature map, S is the initial spatial feature map extracted by the ResNet50 deep network, T_k are the feature maps obtained after the dilated convolutions, and the five-scale spatial feature map S_final has a shape of 60 × 60 × 4096;
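The third step (formulas (3) and (4)) can be sketched as follows. This is a toy NumPy illustration with assumed small shapes and random kernels standing in for the 60 × 60 × 2048 map and the 3 × 3 × 512 kernels; it shows the dilated-convolution-plus-concatenation pattern, not the trained network:

```python
import numpy as np

def dilated_conv(x, w, rate):
    """'Same' dilated 3x3 conv: the kernel taps are spaced `rate` apart (formula (3))."""
    H, W, _ = x.shape
    xp = np.pad(x, ((rate, rate), (rate, rate), (0, 0)))
    out = np.zeros((H, W, w.shape[-1]))
    for i in range(H):
        for j in range(W):
            # 3x3 grid of taps at offsets 0, rate, 2*rate
            patch = xp[i:i + 2 * rate + 1:rate, j:j + 2 * rate + 1:rate]
            out[i, j] = np.tensordot(patch, w, axes=3)
    return out

rng = np.random.default_rng(1)
S = rng.standard_normal((12, 12, 16))                 # stand-in for the 60x60x2048 map S
kernels = [rng.standard_normal((3, 3, 16, 4)) * 0.05 for _ in range(4)]
T = [dilated_conv(S, k, r) for k, r in zip(kernels, (2, 4, 8, 16))]
S_final = np.concatenate([S] + T, axis=-1)            # formula (4): [S, T1, ..., T4]
assert S_final.shape == (12, 12, 16 + 4 * 4)          # channels add up, spatial size kept
```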
fourthly, obtaining a characteristic diagram F:
subjecting the five-scale spatial feature map S_final obtained in the third step to a convolution operation with a convolution kernel of 3 × 3 × 32 to obtain a feature map F with a shape of 60 × 60 × 32, as shown in the following formula (5),
F=BN(Relu(Conv(S final ))) (5),
in formula (5), Conv(·) is a convolution operation, Relu(·) is a nonlinear activation function, and BN(·) is a normalization operation;
fifthly, obtaining a coarse spatio-temporal saliency map Y_ST and the edge contour map E_t of the salient objects:
Inputting the feature map F obtained in the fourth step into the spatio-temporal branch and the edge detection branch simultaneously to obtain a spatio-temporal feature map F_ST and the edge contour map E_t of the salient objects; the specific operation is as follows,
inputting the feature map F obtained in the fourth step into the ConvLSTM of the spatio-temporal branch to obtain a spatio-temporal feature map F_ST, as shown in the following formula (6),
F_ST = ConvLSTM(F, H_{t-1}) (6),
in formula (6), ConvLSTM(·) is a ConvLSTM operation and H_{t-1} is the state of the ConvLSTM cell at the previous time;
then sending the obtained spatio-temporal feature map F_ST into a convolution layer with a kernel size of 1 × 1 to obtain a coarse spatio-temporal saliency map Y_ST, as shown in the following formula (7):
Y_ST = Conv(F_ST) (7),
in formula (7), Conv(·) is a convolution operation;
inputting the feature map F obtained in the fourth step into the edge detection branch to obtain the edge contour map E_t of the salient objects; the specific operation is as follows,
the edge detection branch comprises a two-layer ConvLSTM, a strong recurrent model used to capture temporal information, delineate the contour edges of salient objects from the temporal information, and distinguish salient objects from non-salient objects in an image; more specifically, the static features {X_1, …, X_T} of an input video of T frames are obtained through the ResNet50 deep network and the dilated convolutions, where X_t is the video frame of the t-th frame; given X_t, the edge detection branch outputs the edge contour map E_t ∈ [0,1]^{W×H}, where W and H are the width and height, respectively, of the predicted edge image, computed from the edge detection network while taking the previous video frame into account, as shown in formulas (8) and (9),
H_t = ConvLSTM(X_t, H_{t-1}) (8),
in formula (8) and formula (9), H_t ∈ R^{W×H×M} is the 3D-tensor hidden state, M is the number of channels, E_t′ is the unweighted edge contour map, H_t is the current state of the ConvLSTM cell, H_{t-1} is the state of the ConvLSTM cell at the previous time, and X_1 is the first video frame,
by embedding one ConvLSTM within another, the key component for obtaining the edge contour map E_t is the edge detection network, as shown in the following formula (10),
then weighting with the edge detection network to obtain the edge contour map E_t of the salient objects, as shown in the following formula (11),
in formula (11), a 1 × 1 convolution kernel is used to map the edge detection network to a weight matrix, and the sigmoid function σ normalizes the matrix to [0, 1];
Thus, the coarse spatio-temporal saliency map Y_ST and the edge contour map E_t of the salient objects are obtained;
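The weighting of formula (11) can be sketched as follows. All tensors here are hypothetical random stand-ins (the patent does not disclose the edge network's feature shapes); the sketch only shows the pattern of a 1 × 1 convolution producing a sigmoid-normalized weight matrix that modulates the unweighted edge map E_t′:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
H = W = 10
f_edge = rng.standard_normal((H, W, 8))       # hypothetical edge-network features
w = rng.standard_normal((8, 1)) * 0.1         # the 1x1 convolution kernel
E_unweighted = sigmoid(rng.standard_normal((H, W, 1)))  # E_t' from the inner ConvLSTM
weight = sigmoid(f_edge @ w)                  # weight matrix normalized to [0, 1]
E_t = weight * E_unweighted                   # weighted edge contour map, formula (11)
assert 0 <= E_t.min() and E_t.max() <= 1      # E_t stays in [0, 1]^{W x H}
```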
Sixthly, obtaining the final saliency prediction result map Y_final:
Fusing the coarse spatio-temporal saliency map Y_ST obtained in the fifth step with the edge contour map E_t of the salient objects to obtain the final saliency prediction result map Y_final, as shown in the following formula (12),
Y_final = Resize(σ(Y_ST) ⊗ E_t) (12),
in formula (12), ⊗ is matrix multiplication, σ is the sigmoid function, and Resize(·) is a function for adjusting the video frame size,
restoring the obtained video frame to 473 × 473 of the original input video frame;
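The fusion step can be sketched as follows. This is a hedged illustration: the exact composition of σ, the multiplication, and Resize(·) is not fully spelled out in the text, so the sketch assumes an element-wise product of the sigmoid-activated coarse map with the edge map, followed by a nearest-neighbour resize back to 473 × 473:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def resize_nearest(img, size):
    """Nearest-neighbour stand-in for Resize(.)."""
    h, w = img.shape[:2]
    return img[np.arange(size) * h // size][:, np.arange(size) * w // size]

rng = np.random.default_rng(4)
Y_ST = rng.standard_normal((60, 60))            # coarse spatio-temporal saliency map
E_t = sigmoid(rng.standard_normal((60, 60)))    # edge contour map in [0, 1]
Y_final = resize_nearest(sigmoid(Y_ST) * E_t, 473)  # assumed fusion + resize
assert Y_final.shape == (473, 473)              # restored to the input frame size
```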
FIG. 2 is the final saliency prediction result map Y_final of the video frame I of the present embodiment; there are two salient targets, a cat and a box.
Seventh step, calculating the loss for input video frame I:
calculating a saliency map for the input video frame I through the first step to the sixth step; to measure the difference between the final saliency prediction result map Y_final obtained in the sixth step and the ground-truth, a binary cross entropy loss function ℒ is adopted during training, as shown in the following formula (13),
ℒ = −Σ_{i=1}^{N} Σ_{j=1}^{N} [G(i, j) log M(i, j) + (1 − G(i, j)) log(1 − M(i, j))] (13),
in the formula (13), G (i, j) ∈ [0,1] is the true value of the pixel (i, j), M (i, j) ∈ [0,1] is the predicted value of the pixel (i, j), N =473 is selected,
the network is trained by continuously reducing the value of the loss ℒ, and a stochastic gradient descent method is adopted to optimize the binary cross entropy loss function ℒ;
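The seventh step can be sketched with a toy training loop. This is an illustration only: the real method trains the whole deep network, whereas here stochastic gradient descent is applied to a single 8 × 8 logit map (a stand-in for the prediction) against a random ground-truth map, using the standard binary cross entropy of formula (13):

```python
import numpy as np

def bce_loss(G, M, eps=1e-7):
    """Binary cross entropy summed over an N x N map, formula (13)."""
    M = np.clip(M, eps, 1 - eps)
    return -np.sum(G * np.log(M) + (1 - G) * np.log(1 - M))

rng = np.random.default_rng(5)
G = (rng.random((8, 8)) > 0.5).astype(float)   # toy ground-truth map
logits = rng.standard_normal((8, 8))           # stand-in for the network output
lr = 0.5
for _ in range(100):                           # gradient-descent steps
    M = 1.0 / (1.0 + np.exp(-logits))          # sigmoid prediction
    logits -= lr * (M - G)                     # dL/dlogits for sigmoid + BCE
loss_after = bce_loss(G, 1.0 / (1.0 + np.exp(-logits)))
# training reduced the loss below the uninformative all-0.5 baseline
assert loss_after < bce_loss(G, np.full((8, 8), 0.5))
```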
And completing the video significance detection based on the deep network.
In the above embodiments, the ResNet50 deep network, convLSTM, ground-truth, and stochastic gradient descent methods are all known in the art.
Claims (2)
1. The video saliency detection method based on the deep network is characterized by comprising the following steps: firstly, a ResNet50 deep network is used for extracting spatial features, then time and edge information are extracted to jointly obtain a significance prediction result graph, and video significance detection based on the deep network is completed, and the method comprises the following specific steps:
firstly, inputting a video frame I, and preprocessing:
inputting video frame I, unifying the sizes of the video frames to be 473 × 473 pixels in width and height, and subtracting the average value of the corresponding channel from each pixel value in video frame I, wherein the average value of the R channel in each video frame I is 104.00698793, the average value of the G channel in each video frame I is 116.66876762, and the average value of the B channel in each video frame I is 122.67891434, so that the shape of video frame I before being input to the ResNet50 depth network is 473 × 473 × 3, and the video frame after being preprocessed in this way is denoted as I', as shown in the following formula (1):
I′=Resize(I-Mean(R,G,B)) (1),
in formula (1), Mean(R, G, B) is the average of the three color channels red, green and blue, and Resize(·) is a function for adjusting the size of the video frame I';
secondly, extracting an initial spatial feature map S of the video frame I':
sending the video frame I' after the first step of preprocessing into a ResNet50 deep network to extract an initial spatial feature map S, wherein the formula (2) is as follows:
S=ResNet50(I′) (2),
in formula (2), ResNet50(·) is a ResNet50 deep network,
the ResNet50 deep network comprises a convolution layer, a pooling layer, a nonlinear activation function Relu layer and residual connection;
thirdly, obtaining a five-scale spatial feature map S_final:
Respectively sending the initial spatial feature map S of the video frame I' extracted in the second step into four different dilated convolutions with dilation rates of 2, 4, 8 and 16 in the ResNet50 deep network to obtain results T_k of four scales with dilation rates of 2, 4, 8 and 16 respectively, and then concatenating these results with the initial spatial feature map S output by the ResNet50 deep network to finally obtain a five-scale spatial feature map S_final;
Fourthly, obtaining a characteristic diagram F:
subjecting the five-scale spatial feature map S_final obtained in the third step to a convolution operation with a convolution kernel of 3 × 3 × 32 to obtain a feature map F with a shape of 60 × 60 × 32, as shown in the following formula (3),
F=BN(Relu(Conv(S final ))) (3),
in formula (3), Conv(·) is a convolution operation, Relu(·) is a nonlinear activation function, and BN(·) is a normalization operation;
fifthly, obtaining a coarse spatio-temporal saliency map Y_ST and the edge contour map E_t of the salient objects:
Inputting the feature map F obtained in the fourth step into the spatio-temporal branch and the edge detection branch simultaneously to obtain a spatio-temporal feature map F_ST and the edge contour map E_t of the salient objects; the specific operation is as follows,
inputting the feature map F obtained in the fourth step into the ConvLSTM of the spatio-temporal branch to obtain a spatio-temporal feature map F_ST, as shown in the following formula (4),
F_ST = ConvLSTM(F, H_{t-1}) (4),
in formula (4), ConvLSTM(·) is a ConvLSTM operation and H_{t-1} is the state of the ConvLSTM cell at the previous time;
then sending the obtained spatio-temporal feature map F_ST into a convolution layer with a kernel size of 1 × 1 to obtain a coarse spatio-temporal saliency map Y_ST, as shown in the following formula (5):
Y_ST = Conv(F_ST) (5),
in formula (5), Conv(·) is a convolution operation;
inputting the feature map F obtained in the fourth step into the edge detection branch to obtain the edge contour map E_t of the salient objects; the specific operation is as follows,
obtaining the static features {X_1, …, X_T} of an input video of T frames through the ResNet50 deep network and the dilated convolutions, where X_t is the video frame of the t-th frame; given X_t, the edge detection branch outputs the edge contour map E_t ∈ [0,1]^{W×H}, where W and H are the width and height, respectively, of the predicted edge image, computed from the edge detection network while taking the previous video frame into account, as shown in formulas (6) and (7),
H_t = ConvLSTM(X_t, H_{t-1}) (6),
in formula (6) and formula (7), H_t ∈ R^{W×H×M} is the 3D-tensor hidden state, M is the number of channels, E_t′ is the unweighted edge contour map, H_t is the current state of the ConvLSTM cell, H_{t-1} is the state of the ConvLSTM cell at the previous time, and X_1 is the first video frame,
by embedding one ConvLSTM within another, the key component for obtaining the edge contour map E_t is the edge detection network, as shown in the following formula (8),
then weighting with the edge detection network to obtain the edge contour map E_t of the salient objects, as shown in the following formula (9),
in formula (9), a 1 × 1 convolution kernel is used to map the edge detection network to a weight matrix, and the sigmoid function σ normalizes the matrix to [0, 1];
Thus, the coarse spatio-temporal saliency map Y_ST and the edge contour map E_t of the salient objects are obtained;
Sixthly, obtaining the final saliency prediction result map Y_final:
Fusing the coarse spatio-temporal saliency map Y_ST obtained in the fifth step with the edge contour map E_t of the salient objects to obtain the final saliency prediction result map Y_final, as shown in the following formula (10),
Y_final = Resize(σ(Y_ST) ⊗ E_t) (10),
in formula (10), ⊗ is matrix multiplication, σ is the sigmoid function, and Resize(·) is a function for adjusting the video frame size,
restoring the obtained video frame to 473 × 473 of the original input video frame;
seventh step, calculating the loss for input video frame I:
calculating a saliency map for the input video frame I through the first step to the sixth step; to measure the difference between the final saliency prediction result map Y_final obtained in the sixth step and the ground-truth, a binary cross entropy loss function ℒ is adopted during training, as shown in the following formula (11),
ℒ = −Σ_{i=1}^{N} Σ_{j=1}^{N} [G(i, j) log M(i, j) + (1 − G(i, j)) log(1 − M(i, j))] (11),
in the formula (11), G (i, j) is the true value of the pixel (i, j), M (i, j) is the predicted value of the pixel (i, j), N =473 is selected,
the network is trained by continuously reducing the value of the loss ℒ, and a stochastic gradient descent method is adopted to optimize the binary cross entropy loss function ℒ;
And completing the video significance detection based on the deep network.
2. The method for detecting video saliency based on the deep network as claimed in claim 1, wherein said obtaining of the five-scale spatial feature map S_final comprises the following specific operation:
the dilated convolution kernels in the ResNet50 deep network are represented as C_k ∈ R^{c×c×C}, where K is the number of dilated convolution layers, c×c is the kernel width times height, C is the number of channels, and C_k are the parameters of the dilated convolution, whose stride is set to 1; on the basis of these parameters, four output feature maps T_k ∈ R^{W×H×C} are derived, where W and H are the width and height, respectively, as shown in the following formula (12),
T_k = C_k ⊛ S, k = 1, …, K (12),
in formula (12), C_k is the k-th dilated convolution kernel, K is the number of dilated convolutions, ⊛ is the dilated convolution operation, and S is the initial spatial feature map,
the shape of the initial spatial feature map S obtained after the ResNet50 deep network is 60 × 60 × 2048, K = 4, the value range of k is [1, 2, 3, 4], the dilation rate r_k takes the four values r_k = {2, 4, 8, 16}, and the dilated convolution kernels C_k all have a shape of 3 × 3 × 512, thereby finally obtaining feature maps T_k of four different scales; they are then concatenated in turn, as shown in the following formula (13),
S final =[S,T 1 ,T 2 ,…,T K ] (13),
in formula (13), S_final is the final multi-scale spatial feature map, S is the initial spatial feature map extracted by the ResNet50 deep network, T_k are the feature maps obtained after the dilated convolutions, and the five-scale spatial feature map S_final has a shape of 60 × 60 × 4096.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010266351.2A CN111461043B (en) | 2020-04-07 | 2020-04-07 | Video significance detection method based on deep network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010266351.2A CN111461043B (en) | 2020-04-07 | 2020-04-07 | Video significance detection method based on deep network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111461043A CN111461043A (en) | 2020-07-28 |
CN111461043B true CN111461043B (en) | 2023-04-18 |
Family
ID=71685906
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010266351.2A Active CN111461043B (en) | 2020-04-07 | 2020-04-07 | Video significance detection method based on deep network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111461043B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111931732B (en) * | 2020-09-24 | 2022-07-15 | 苏州科达科技股份有限公司 | Method, system, device and storage medium for detecting salient object of compressed video |
CN112861733B (en) * | 2021-02-08 | 2022-09-02 | 电子科技大学 | Night traffic video significance detection method based on space-time double coding |
CN112950477B (en) * | 2021-03-15 | 2023-08-22 | 河南大学 | Dual-path processing-based high-resolution salient target detection method |
CN117152670A (en) * | 2023-10-31 | 2023-12-01 | 江西拓世智能科技股份有限公司 | Behavior recognition method and system based on artificial intelligence |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108256562A (en) * | 2018-01-09 | 2018-07-06 | 深圳大学 | Well-marked target detection method and system based on Weakly supervised space-time cascade neural network |
CN109448015A (en) * | 2018-10-30 | 2019-03-08 | 河北工业大学 | Image based on notable figure fusion cooperates with dividing method |
CN110929736A (en) * | 2019-11-12 | 2020-03-27 | 浙江科技学院 | Multi-feature cascade RGB-D significance target detection method |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106157307B (en) * | 2016-06-27 | 2018-09-11 | 浙江工商大学 | A kind of monocular image depth estimation method based on multiple dimensioned CNN and continuous CRF |
-
2020
- 2020-04-07 CN CN202010266351.2A patent/CN111461043B/en active Active
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108256562A (en) * | 2018-01-09 | 2018-07-06 | 深圳大学 | Well-marked target detection method and system based on Weakly supervised space-time cascade neural network |
CN109448015A (en) * | 2018-10-30 | 2019-03-08 | 河北工业大学 | Image based on notable figure fusion cooperates with dividing method |
CN110929736A (en) * | 2019-11-12 | 2020-03-27 | 浙江科技学院 | Multi-feature cascade RGB-D significance target detection method |
Non-Patent Citations (2)
Title |
---|
Guo, Y. C., et al. Video Object Extraction Based on Spatiotemporal Consistency Saliency Detection. IEEE Access, 2018, vol. 6, pp. 35171-35181. *
Shi Shuo. Research on Image Local Invariant Features and Their Applications. China Doctoral Dissertations Full-text Database, Information Science and Technology, 2015, I138-45. *
Also Published As
Publication number | Publication date |
---|---|
CN111461043A (en) | 2020-07-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111461043B (en) | Video significance detection method based on deep network | |
Kim et al. | Deep monocular depth estimation via integration of global and local predictions | |
US10839543B2 (en) | Systems and methods for depth estimation using convolutional spatial propagation networks | |
CN109584248B (en) | Infrared target instance segmentation method based on feature fusion and dense connection network | |
CN109886066B (en) | Rapid target detection method based on multi-scale and multi-layer feature fusion | |
US11100401B2 (en) | Predicting depth from image data using a statistical model | |
US11361456B2 (en) | Systems and methods for depth estimation via affinity learned with convolutional spatial propagation networks | |
Lin et al. | Depth estimation from monocular images and sparse radar data | |
Cao et al. | Exploiting depth from single monocular images for object detection and semantic segmentation | |
Zhang et al. | Deep hierarchical guidance and regularization learning for end-to-end depth estimation | |
CN107680106A (en) | A kind of conspicuousness object detection method based on Faster R CNN | |
CN111612807A (en) | Small target image segmentation method based on scale and edge information | |
CN110096961B (en) | Indoor scene semantic annotation method at super-pixel level | |
CN107506792B (en) | Semi-supervised salient object detection method | |
CN113344932B (en) | Semi-supervised single-target video segmentation method | |
CN110827312B (en) | Learning method based on cooperative visual attention neural network | |
CN104766065B (en) | Robustness foreground detection method based on various visual angles study | |
WO2016165064A1 (en) | Robust foreground detection method based on multi-view learning | |
CN113095371B (en) | Feature point matching method and system for three-dimensional reconstruction | |
Wang et al. | A feature-supervised generative adversarial network for environmental monitoring during hazy days | |
CN108388901B (en) | Collaborative significant target detection method based on space-semantic channel | |
Chen et al. | Pgnet: Panoptic parsing guided deep stereo matching | |
CN106327513B (en) | Shot boundary detection method based on convolutional neural network | |
Li et al. | Spatiotemporal road scene reconstruction using superpixel-based Markov random field | |
Tseng et al. | Semi-supervised image depth prediction with deep learning and binocular algorithms |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |