CN113066074A - Visual saliency prediction method based on binocular parallax offset fusion - Google Patents

Visual saliency prediction method based on binocular parallax offset fusion

Info

Publication number
CN113066074A
Authority
CN
China
Prior art keywords
layer
module
convolution
input
block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110385471.9A
Other languages
Chinese (zh)
Inventor
周武杰
马佳宝
雷景生
强芳芳
钱小鸿
甘兴利
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lover Health Science and Technology Development Co Ltd
Original Assignee
Zhejiang Lover Health Science and Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lover Health Science and Technology Development Co Ltd filed Critical Zhejiang Lover Health Science and Technology Development Co Ltd
Priority to CN202110385471.9A priority Critical patent/CN113066074A/en
Publication of CN113066074A publication Critical patent/CN113066074A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/0002 - Inspection of images, e.g. flaw detection
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G06N 3/08 - Learning methods
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 - Special algorithmic details
    • G06T 2207/20081 - Training; Learning
    • G06T 2207/20084 - Artificial neural networks [ANN]
    • G06T 2207/20212 - Image combination
    • G06T 2207/20221 - Image fusion; Image merging
    • G06T 2207/20228 - Disparity calculation for image-based rendering

Abstract

The invention discloses a visual saliency prediction method based on binocular parallax offset fusion, and relates to the field of deep learning. In the training stage, a convolutional neural network is constructed, comprising a feature extraction layer and an upsampling layer. The feature extraction layer comprises two twin networks whose framework adopts the ResNet34 architecture, each being a convolutional feature extraction part consisting of 5 convolution blocks; the upsampling layer comprises 4 parts: a GCM module, a CAM module, a feature cascade transfer module and an SPSM module. The NCTU binocular image data set is input into the convolutional neural network for training to obtain a single-channel salient-target prediction map; loss function values between the prediction maps of the training-set images and the real saliency label maps are then calculated to obtain the optimal weight vector and bias term of the convolutional neural network training model. The method improves the efficiency and accuracy of salient-target prediction.

Description

Visual saliency prediction method based on binocular parallax offset fusion
Technical Field
The invention relates to the field of deep learning, in particular to a binocular parallax offset fusion-based visual saliency prediction method.
Background
With the mass of data brought by the development of the Internet, rapidly acquiring key information from massive image and video data has become a key problem in the field of computer vision. Visual saliency detection has important application value here, for example in object identification, 3D display, visual comfort evaluation and 3D visual quality measurement. When faced with a natural scene, the human visual system can quickly search for and locate objects of interest, focusing on the most prominent areas of the image while ignoring the others. This visual attention mechanism is of great significance for processing visual image information in daily life.
Visual saliency prediction through deep learning can predict the saliency region directly, end to end, at the pixel level: only the images and labels of the training set need to be input into the model framework for training to obtain the weights and the model; prediction is then carried out on the test set to verify model quality, the best prediction model is obtained through continuous tuning, and finally this model is used to predict real-world pictures and obtain their visual saliency prediction results. The core of the deep-learning-based prediction method here is a binocular parallax offset fusion visual saliency prediction framework constructed with a convolutional neural network, whose multilayer structure and ability to learn features automatically allow it to learn features at multiple levels. Visual attention modeling mainly comprises two types: bottom-up and top-down. Bottom-up refers to visual attention elicited by the essential features of an image, driven by underlying perceptual data such as a set of image features like color, brightness and orientation; according to the bottom-level image data, different areas exhibit strong feature differences, and the saliency of an image area can be calculated by determining the difference between the target area and its surrounding pixels. The top-down strategy is a task-driven attention saliency mechanism: visual attention is driven by task experience, and the salient target region of the current image is predicted based on knowledge. For example, when looking for a friend wearing a black hat in an area, one will first notice the prominent feature of the black hat.
Most existing visual saliency prediction methods adopt deep learning, using models that combine convolutional layers, batch normalization layers and pooling layers; better architectures, and thus better models, are obtained through different combinations of these layers.
Disclosure of Invention
In view of the above, the invention provides a binocular parallax offset fusion-based visual saliency prediction method, which has a good prediction effect and is rapid in prediction.
In order to achieve the purpose, the invention adopts the following technical scheme:
a visual saliency prediction method based on binocular parallax offset fusion comprises the following steps:
selecting a plurality of binocular views of natural scenes and movie scenes to form an image data training set;
constructing a convolutional neural network framework, wherein the neural network framework enables high-level semantic information and low-level detail information to be combined with each other;
training the convolutional neural network framework: inputting the binocular views into the convolutional neural network framework, which outputs a grayscale map; the loss function of the convolutional neural network framework adopts root mean square error, CC-Loss and KL divergence loss (KLDivLoss);
training multiple times to obtain a convolutional neural network prediction training model.
Preferably, the specific connection relationship of the neural network framework is as follows:
the left view of the input layer is input into the 1st, 2nd, 3rd, 4th and 5th convolution blocks in sequence; the 1st convolution block is input to the 2nd SPSM module, the 2nd convolution block is input to the 1st SPSM module, the 3rd convolution block is input to the 3rd GCM module, the 4th convolution block is input to the 2nd GCM module, and the 5th convolution block is input to the 1st GCM module; the right view of the input layer is sequentially connected to the 6th, 7th, 8th, 9th and 10th convolution blocks; the 6th convolution block is input to the 2nd SPSM module, the 7th convolution block is input to the 1st SPSM module, the 8th convolution block is input to the 3rd CAM module, the 9th convolution block is input to the 2nd CAM module, and the 10th convolution block is input to the 1st CAM module; the 1st GCM module is input to the 1st feature cascade transfer module, the 2nd GCM module to the 2nd feature cascade transfer module, and the 3rd GCM module to the 3rd feature cascade transfer module; the 1st feature cascade transfer module outputs to the 1st CAM module, the 2nd feature cascade transfer module and the 3rd feature cascade transfer module; the 2nd feature cascade transfer module outputs to the 2nd CAM module, and the 3rd feature cascade transfer module outputs to the 3rd CAM module; the 1st CAM module outputs to the 2nd feature cascade transfer module, the 2nd CAM module outputs to the 3rd feature cascade transfer module, and the 3rd CAM module outputs to the 1st SPSM module and the 1st high-level convolution block; the 1st SPSM module outputs to the 2nd SPSM module, and the 2nd and 3rd high-level convolution blocks are connected in sequence after the 1st high-level convolution block; the outputs of the 2nd SPSM module and the 3rd high-level convolution block reach the output layer via a concatenation (concat) layer.
Preferably, the specific input-output relationship of the SPSM module is as follows:
the left-viewpoint features and the right-viewpoint features are respectively input into a parallax fusion layer; the parallax fusion layer outputs to a high-level feature fusion layer; the high-level features of the preceding layer are also output to the high-level feature fusion layer; the high-level feature fusion layer outputs to a convolution block, and the output of the convolution block is the output of the SPSM module.
Preferably, the specific input-output relationship of the GCM module is as follows:
the convolution feature maps are respectively input into a 1st convolution layer, a 1st dilated (hole) convolution layer, a 2nd dilated convolution layer and a 3rd dilated convolution layer and then into a 2nd convolution layer, and the output of the 2nd convolution layer is input into a splicing (concatenation) layer.
Preferably, the specific input-output relationship of the characteristic cascade transfer module is as follows:
the 1st GCM module is input into the 1st convolution layer, and the 1st convolution layer outputs respectively to the 1st pixel-wise dot-product layer, the 3rd pixel-wise dot-product layer, the 1st feature splicing layer and the 1st CAM module; the 1st CAM module outputs to the 2nd pixel-wise dot-product layer; the 2nd GCM module is input to the 1st pixel-wise dot-product layer, the 1st pixel-wise dot-product layer is input to the 1st feature splicing layer, the 1st feature splicing layer is input to the 2nd pixel-wise dot-product layer, the 2nd pixel-wise dot-product layer is input to the 2nd convolution layer, and the 2nd convolution layer outputs to the 2nd CAM module; the 1st pixel-wise dot-product layer outputs to the 3rd pixel-wise dot-product layer, the 2nd convolution layer outputs to the 2nd feature splicing layer, and the 2nd CAM module outputs to the 4th pixel-wise dot-product layer; the 3rd GCM module outputs to the 3rd pixel-wise dot-product layer, the 3rd pixel-wise dot-product layer outputs to the 2nd feature splicing layer, the 2nd feature splicing layer outputs to the 4th pixel-wise dot-product layer, the 4th pixel-wise dot-product layer outputs to the 3rd convolution layer, and the 3rd convolution layer is input to the 3rd CAM module.
Compared with the prior art, the visual saliency prediction method based on binocular parallax offset fusion has the following beneficial effects:
1. The invention constructs a convolutional neural network architecture; a picture data set sampled from the real world is input into the convolutional neural network for training to obtain a prediction model. The picture to be predicted is then input into the network, and a prediction map of its visually salient region is obtained. By combining high-level semantic information with low-level detail information throughout the network architecture, the method effectively improves the accuracy of salient-region prediction.
2. The method constructs a coding layer and a decoding layer with a convolutional neural network: the coding layer extracts the high-level semantic features and low-level detail features of the image, and in the decoding layer the high-level semantic features are transferred upward step by step and supplemented with information from the low-level detail features. This addresses the loss of detail features when image features are extracted by the stage-by-stage coding layer structure, and the extracted high-level features can locate the region of the salient target more accurately.
3. The method adopts a feature step-by-step upward transfer module in the upward transfer process of the decoding layer, fully utilizing the high-level features to gradually locate the position of the salient target and transfer it layer by layer toward the front layers; a sub-pixel shift module is adopted to fully exploit the mutual fusion of high-level and low-level features, ensuring feature utilization and predicting the visually salient region of the image with maximum accuracy.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a diagram of the model structure according to the present invention;
FIG. 2 is a schematic diagram of an SPSM module of the present invention;
FIG. 3 is a schematic diagram of a GCM module according to the present invention;
FIG. 4 is a schematic diagram of a feature cascade transfer module of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention discloses a binocular parallax offset fusion-based visual saliency prediction method, the overall implementation block diagram of which is shown in fig. 1 and comprises a model training stage and a model testing stage;
the specific steps of the model training process are as follows:
step 1_ 1: and selecting Q binocular images of natural scenes and movie scenes, namely images with a left view point and a right view point, to form a training image data set. And the qth graph in the training set is denoted as { I }q(I, j) }, training set and { I }q(i, j) } corresponding true visual saliency prediction maps
Figure BDA0003014630770000061
The images in the natural scene and the movie scene are both RGB three-channel binocular color images, Q is a positive integer, Q is more than or equal to 200, Q is 332, Q is a positive integer, Q is more than or equal to 1 and less than or equal to Q, I is more than or equal to 1 and less than or equal to W, j is more than or equal to 1 and less than or equal to H, W represents the width of W which is 480, and H represents { I { (I) } is a positive integer, Q isq(I, j) } e.g. take W640, H480, Iq(I, j) represents { IqThe pixel value of the pixel point with the coordinate position (i, j) in (i, j),
Figure BDA0003014630770000062
to represent
Figure BDA0003014630770000063
The middle coordinate position is the pixel value of the pixel point of (i, j);
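For illustration, the loading and normalization of such a training set could be sketched as follows in PyTorch (the framework used in the experiments reported below). The folder names left/, right/ and gt/, the 256 x 256 resize and the ImageNet normalization are assumptions made for this sketch; the text itself only specifies RGB binocular pairs with their saliency label maps.

import os
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class BinocularSaliencyDataset(Dataset):
    # Loads (left view, right view, saliency label) triplets for training.
    # The folder names "left", "right" and "gt" are placeholders for however
    # the binocular data set is actually organised on disk.
    def __init__(self, root, size=(256, 256)):
        self.root = root
        self.names = sorted(os.listdir(os.path.join(root, "left")))
        self.rgb_tf = transforms.Compose([
            transforms.Resize(size),
            transforms.ToTensor(),                              # HxWxC in [0, 255] -> CxHxW in [0, 1]
            transforms.Normalize(mean=[0.485, 0.456, 0.406],    # ImageNet statistics, matching
                                 std=[0.229, 0.224, 0.225]),    # the ResNet34 backbone below
        ])
        self.gt_tf = transforms.Compose([transforms.Resize(size), transforms.ToTensor()])

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        name = self.names[idx]
        left = Image.open(os.path.join(self.root, "left", name)).convert("RGB")
        right = Image.open(os.path.join(self.root, "right", name)).convert("RGB")
        gt = Image.open(os.path.join(self.root, "gt", name)).convert("L")  # single-channel label
        return self.rgb_tf(left), self.rgb_tf(right), self.gt_tf(gt)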
step 1_ 2: convolutional neural network architecture: the convolutional neural network architecture of the present invention is mainly composed of two parts, namely a feature extraction part (coding layer) and an upsampling part (decoding layer).
In the feature extraction part, because the adopted data set is a binocular vision data set and each sample is divided into a left viewpoint and a right viewpoint, the feature extraction part comprises two coding layers with the same architecture, a left-viewpoint feature coding layer and a right-viewpoint feature coding layer, which extract visual features from the left-view and right-view pictures; each coding layer comprises 5 convolution blocks. That is, the feature extraction section includes left-viewpoint feature extraction and right-viewpoint feature extraction, each comprising a 1st, 2nd, 3rd, 4th and 5th convolution block. Here, the outputs of the first two convolution blocks are defined as shallow features, and the outputs of the last three convolution blocks are defined as high-level features. The outputs of the 3rd, 4th and 5th convolution blocks of the left viewpoint correspond respectively to the global context modules GCM3, GCM2 and GCM1; the outputs of the 3rd, 4th and 5th convolution blocks of the right viewpoint correspond respectively to the attention fusion modules CAM3, CAM2 and CAM1.
The upsampling part includes five components: a global context module (comprising three GCM units, GCM1, GCM2 and GCM3), a feature cascade transfer module (FCM), a channel attention fusion module (comprising three CAM units, CAM1, CAM2 and CAM3), a sub-pixel shift module (SPSM) and a high-level feature convolution component. The high-level feature convolution consists of three convolution blocks, each containing a 3 x 3 convolution kernel with stride 1 and padding 1, a batch normalization layer and an activation function (ReLU).
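A minimal sketch of such a high-level feature convolution block in PyTorch, directly following the 3 x 3 / stride 1 / padding 1 specification above; the channel counts are left as parameters since they differ per stage.

import torch.nn as nn

def conv_block(in_ch, out_ch):
    # One high-level feature convolution block as described above:
    # 3x3 convolution (stride 1, padding 1) -> batch normalization -> ReLU.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )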
First, the left and right viewpoints of each picture in the binocular data set (picture width W = 256 and height H = 256, with R-channel, G-channel and B-channel components) are fed into feature convolution blocks of the same architecture to extract visual feature information. The feature extraction part adopts the ResNet34 network specification and comprises 5 convolution blocks. The first convolution block comprises a first convolution layer (Conv), a first activation layer (Act) and a first max pooling layer (Max Pool). The convolution layer uses a kernel size (kernel_size) of 7, stride 2 and edge padding 3; the convolved feature map is then normalized by a batch normalization layer, passed through the nonlinear transformation of the activation function (rectified linear unit, ReLU), and finally output by the max pooling layer as the feature map of the first convolution block. The feature map of this layer is denoted F1^R for the right viewpoint and F1^L for the left viewpoint. The first convolution block outputs 64 feature maps, which form the left-viewpoint feature map set P1^L and the right-viewpoint feature map set P1^R; owing to the stride-2 convolution and the max pooling, each feature map has width W/4 and height H/4.
The second convolution block consists of a second convolution layer (Conv), a second activation layer (Act) and a second max pooling layer (Pool). The convolution block of the left view takes as input the left-viewpoint feature map set P1^L output by the first left-view convolution block, and the convolution block of the right view takes as input the right-viewpoint feature map set P1^R output by the first right-view convolution block. The parameters of the convolution layer are: kernel size 3 x 3, stride 1, edge padding 1. The number of convolution kernels is 64, i.e. the number of feature maps output by the convolution layer. The convolution output is normalized by a batch normalization layer, passed through a nonlinear activation function (ReLU), and finally output by max pooling. Starting from the second block, the convolution feature maps are processed with a residual structure: let the input of the convolution block be X and its output be Y; the final feature map output of the block is Y' = X + Y. The purpose is that, during convolution, the original information is combined with the convolved information, so that more abstract features can be extracted while preserving the original information to the maximum extent. The feature map of this layer is denoted F2^R for the right viewpoint and F2^L for the left viewpoint. The second convolution block outputs 64 feature maps, which form the left-viewpoint feature map set P2^L and the right-viewpoint feature map set P2^R; each feature map in P2 has half the width and half the height of the feature maps in P1.
The third convolution block consists of a third convolution layer (Conv), a third activation layer (Act) and a third max pooling layer (Pool). The convolution block of the left view takes as input the left-viewpoint feature map set P2^L output by the second left-view convolution block, and the convolution block of the right view takes as input the right-viewpoint feature map set P2^R output by the second right-view convolution block. The parameters of the convolution layer are: kernel size 3 x 3, stride 1, edge padding 1. The number of convolution kernels is 128, i.e. the number of feature maps output by the convolution layer. After normalization by a batch normalization layer, the output passes through a nonlinear activation function (ReLU) and is finally output by the max pooling layer (Max Pooling). The feature map of this layer is denoted F3^R for the right viewpoint and F3^L for the left viewpoint. The third convolution block outputs 128 feature maps, which form the left-viewpoint feature map set P3^L and the right-viewpoint feature map set P3^R; each feature map in P3 has half the width and half the height of the feature maps in P2.
The fourth convolution block consists of a fourth convolution layer (Conv), a fourth activation layer (Act) and a fourth max pooling layer (Pool). The convolution block of the left view takes as input the left-viewpoint feature map set P3^L output by the third left-view convolution block, and the convolution block of the right view takes as input the right-viewpoint feature map set P3^R output by the third right-view convolution block. The parameters of the convolution layer are: kernel size 3 x 3, stride 1, edge padding 1. The number of convolution kernels is 256, i.e. the number of feature maps output by the convolution layer. After normalization by a batch normalization layer, the output passes through a nonlinear activation function (ReLU) and is finally output by the max pooling layer (Max Pooling). The feature map of this layer is denoted F4^R for the right viewpoint and F4^L for the left viewpoint. The fourth convolution block outputs 256 feature maps, which form the left-viewpoint feature map set P4^L and the right-viewpoint feature map set P4^R; each feature map in P4 has half the width and half the height of the feature maps in P3.
The fifth convolution block consists of a fifth convolution layer (Conv), a fifth activation layer (Act) and a fifth max pooling layer (Pool). The convolution block of the left view takes as input the left-viewpoint feature map set P4^L output by the fourth left-view convolution block, and the convolution block of the right view takes as input the right-viewpoint feature map set P4^R output by the fourth right-view convolution block. The parameters of the convolution layer are: kernel size 3 x 3, stride 1, edge padding 1. The number of convolution kernels is 512, i.e. the number of feature maps output by the convolution layer. After normalization by a batch normalization layer, the output passes through a nonlinear activation function (ReLU) and is finally output by the max pooling layer (Max Pooling). The feature map of this layer is denoted F5^R for the right viewpoint and F5^L for the left viewpoint. The fifth convolution block outputs 512 feature maps, which form the left-viewpoint feature map set P5^L and the right-viewpoint feature map set P5^R; each feature map in P5 has half the width and half the height of the feature maps in P4.
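The twin ResNet34 encoders described above might be sketched with torchvision as follows. Whether the left and right branches share weights and whether ImageNet pre-training is used are not stated in the text, so the weight-independent branches and the pre-trained initialization are assumptions; the stage split follows the standard torchvision ResNet34, which may differ in detail (strided convolutions rather than per-block pooling) from the block-by-block description above.

import torch.nn as nn
from torchvision import models

class TwinResNet34Encoder(nn.Module):
    # Two weight-independent ResNet34 feature extractors (left / right view),
    # each returning the outputs of the five stages described above.
    def __init__(self, pretrained=True):
        super().__init__()
        self.left = self._make_backbone(pretrained)
        self.right = self._make_backbone(pretrained)

    @staticmethod
    def _make_backbone(pretrained):
        weights = models.ResNet34_Weights.IMAGENET1K_V1 if pretrained else None
        net = models.resnet34(weights=weights)
        return nn.ModuleList([
            nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool),  # block 1, 64 channels
            net.layer1,   # block 2, 64 channels
            net.layer2,   # block 3, 128 channels
            net.layer3,   # block 4, 256 channels
            net.layer4,   # block 5, 512 channels
        ])

    @staticmethod
    def _run(stages, x):
        feats = []
        for stage in stages:
            x = stage(x)
            feats.append(x)
        return feats  # [F1, F2, F3, F4, F5]

    def forward(self, left_img, right_img):
        return self._run(self.left, left_img), self._run(self.right, right_img)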
For the feature extraction part of the coding layer, after the feature maps of the convolution blocks of the left- and right-view data sets are extracted, further processing is required. For the left viewpoint, the feature maps output by the third to fifth convolution blocks are passed through a Global Context Module (GCM), which has 4 branches in total; each branch extracts features under a different receptive field by using dilated (hole) convolutions of different sizes (dilation rates 1, 3, 5 and 7), so as to locate salient objects of different sizes. Each branch contains a basic convolution block: a convolution layer with a 3 x 3 kernel, a batch normalization layer and an activation layer (ReLU).
In this embodiment, the network architecture includes 3 GCM modules in total which, as shown in fig. 3, follow the third, fourth and fifth convolution blocks of the feature extraction part respectively: the GCM module corresponding to the third convolution block is GCM3, the one corresponding to the fourth convolution block is GCM2, and the one corresponding to the fifth convolution block is GCM1. The input and output sizes of the feature maps of each GCM module are unchanged.
The input of each GCM module is the output Fi^L of one of the last three convolution blocks of the left viewpoint, where i ∈ {3, 4, 5}. The features obtained with dilated convolutions of rates 1, 3, 5 and 7 are denoted F1^d, F3^d, F5^d and F7^d respectively. These four feature maps are spliced by concatenation and then passed through a convolution layer consisting of a convolution layer with a 3 x 3 kernel, a batch normalization layer and an activation layer with a nonlinear activation function (ReLU), giving the feature map G. G is then spliced by concatenation with the input feature Fi^L of the GCM module (i ∈ {3, 4, 5}) and passed through a convolution layer of the same structure; the final output of the module is denoted Fi^GCM, where i ∈ {3, 4, 5}.
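A sketch of the GCM along these lines is given below. The per-branch channel counts and the exact fusion convolutions are not fully specified in the text, so keeping the channel count constant throughout is an assumption.

import torch
import torch.nn as nn

class GCM(nn.Module):
    # Global Context Module sketch: four parallel 3x3 branches with dilation
    # rates 1, 3, 5 and 7, concatenated and fused by a 3x3 convolution, then
    # concatenated again with the input feature and fused once more, so that
    # the output keeps the input's spatial size and channel count.
    def __init__(self, channels):
        super().__init__()
        def unit(in_ch, dilation=1):
            return nn.Sequential(
                nn.Conv2d(in_ch, channels, 3, padding=dilation, dilation=dilation, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
        self.branches = nn.ModuleList([unit(channels, d) for d in (1, 3, 5, 7)])
        self.fuse_branches = unit(4 * channels)   # fuse the concatenated branches into G
        self.fuse_input = unit(2 * channels)      # fuse G with the module input

    def forward(self, x):
        g = torch.cat([b(x) for b in self.branches], dim=1)  # splice the four dilated branches
        g = self.fuse_branches(g)
        return self.fuse_input(torch.cat([g, x], dim=1))     # re-attach the input feature Fi^L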
In the feature extraction section, the right-viewpoint feature maps are also further processed after extraction. For the right view, the feature maps output by the third to fifth convolution blocks are passed through an attention fusion module (CAM). The input of this module includes the right-viewpoint feature Fi^R, where i ∈ {3, 4, 5}, and the feature map Fi^FCM output by the feature cascade transfer module, where i ∈ {3, 4, 5}. The right-viewpoint feature map Fi^R and the feature map Fi^FCM output by the feature cascade module are spliced by concatenation, and the weight of each pixel is then computed through an activation layer (softmax); the per-pixel weights are combined with the right-viewpoint feature Fi^R by a pixel-level dot-product operation to obtain the fused feature map Fsoftmax. Next, hierarchical channel attention (LayerNorm) is computed: Fsoftmax is passed through an activation layer (softmax) to obtain the weight of each layer (channel), and these layer weights are combined with the right-viewpoint feature Fi^R by a pixel dot product to obtain the feature map FLayerNorm. This feature map is then combined with Fi^FCM by a dot-product operation to obtain the feature map Ffusion, which is finally output through a nonlinear activation function (Sigmoid) and recorded as Fi^CAM, where i ∈ {1, 2, 3}. This feature map represents the features from the two viewing angles of the left and right viewpoints, and the feature interaction makes full use of the features obtained from the different views at both the pixel level and the layer level.
Similarly, the network architecture of the invention comprises 3 CAM modules in total, following the third, fourth and fifth convolution blocks of the feature extraction part: the CAM module corresponding to the third convolution block is CAM3, the one corresponding to the fourth convolution block is CAM2, and the one corresponding to the fifth convolution block is CAM1. The input and output sizes of each module's feature maps are unchanged.
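The exact form of the softmax and LayerNorm steps is only partially recoverable from the text; the sketch below is one plausible reading, with the 1 x 1 reduction convolution added as an assumption to bring the concatenated feature back to the original channel count.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CAM(nn.Module):
    # Attention fusion sketch: pixel-level weights are computed from the
    # concatenation of the right-view feature and the FCM feature, then
    # layer (channel) level weights are computed; both are applied to the
    # right-view feature before the final fusion with the FCM feature and a
    # Sigmoid activation.
    def __init__(self, channels):
        super().__init__()
        # assumed 1x1 reduction bringing the concatenated 2*C feature back to C
        self.reduce = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, f_right, f_fcm):
        x = self.reduce(torch.cat([f_right, f_fcm], dim=1))
        b, c, h, w = x.shape
        # pixel-level attention: softmax over all spatial positions
        pixel_w = F.softmax(x.view(b, c, -1), dim=-1).view(b, c, h, w)
        f_softmax = pixel_w * f_right
        # layer-level attention: softmax over the channel axis
        layer_w = F.softmax(f_softmax.mean(dim=(2, 3)), dim=1).view(b, c, 1, 1)
        f_layernorm = layer_w * f_right
        # fuse with the FCM feature and squash to (0, 1)
        return torch.sigmoid(f_layernorm * f_fcm)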
In the upsampling section, as shown in fig. 4, the input of the feature cascade transfer module (FCM) comes from the left-view features output by the GCMs and the right-view features output by the CAMs. The FCM module is divided into three stages, corresponding respectively to the 5th, 4th and 3rd convolution block portions of the feature extraction part. The first stage is the initial high-level feature processing: the GCM feature map of this stage is processed by convolution to obtain the feature F3, which is output to the CAM3 module. In the second stage, the GCM feature map of this stage and F3 undergo a dot-multiplication operation and are spliced with F3; the result is then dot-multiplied with the feature map output by the CAM module, so that the features processed by the CAM module are fully fused and the positioning information of the object is obtained; a convolution operation then yields the feature map F2, which is output to the CAM2 module. In the third stage, the GCM feature map of this stage is dot-multiplied with F2 upsampled by a factor of 2 and F3 upsampled by a factor of 4, and the result is spliced with the upsampled F2; it is then dot-multiplied with the feature output by the CAM module, and finally, after passing through the convolution layer, the feature map of this stage is obtained and fed into the corresponding CAM module. The positioning information of the object is learned during the upward transfer, and the salient region is supplemented step by step in combination with the features of the preceding layer.
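Because FIG. 4 is only partially recoverable from the text, the sketch below condenses the cascade into a single reusable stage: upsample the deeper cascade feature, dot-multiply it with the current GCM feature, splice, fuse by convolution and, where available, apply the CAM feedback. The 1 x 1 projection and the placement of the CAM multiplication after the fusion convolution are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FCMStage(nn.Module):
    # One simplified stage of the feature cascade transfer: the deeper cascade
    # feature is projected and upsampled to the current resolution, dot-multiplied
    # with the current GCM feature, spliced with it, fused by a 3x3 convolution,
    # and optionally re-weighted by the CAM feedback of the previous stage.
    def __init__(self, cur_ch, high_ch, out_ch):
        super().__init__()
        self.proj = nn.Conv2d(high_ch, cur_ch, kernel_size=1)   # assumed channel alignment
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * cur_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, gcm_feat, higher_feat, cam_feedback=None):
        up = F.interpolate(self.proj(higher_feat), size=gcm_feat.shape[2:],
                           mode="bilinear", align_corners=False)
        x = torch.cat([gcm_feat * up, gcm_feat], dim=1)   # pixel-wise product, then splice
        out = self.fuse(x)
        if cam_feedback is not None:                      # positioning cue fed back from the CAM
            out = out * cam_feedback                      # assumes cam_feedback has out_ch channels
        return out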
The feature cascade transfer module (FCM) thus outputs into the CAM modules, and via the CAM module of the final cascade stage the cascaded feature is output through an activation function (Sigmoid) as Fi^CAM. The output of this CAM module is divided into two parts: the first part passes through three convolution layers in series, and the second part passes through a sub-pixel shift module (SPSM).
As shown in FIG. 2, the input of the sub-pixel shift module (SPSM) is the left-viewpoint feature map set Pi^L and the right-viewpoint feature map set Pi^R, where i ∈ {1, 2}. Each sub-pixel shift module receives the two inputs and adds the corresponding pixels of the left- and right-viewpoint feature maps, so as to resolve the difference caused by the different positions of the same object in the left and right viewpoints. The result is then dot-multiplied at the pixel level with the high-level feature coming from the preceding decoder stage to obtain the positioning information of the salient object, and the object boundary is supplemented in combination with the offset feature map. The feature input of the next layer is then obtained through the output of a convolution layer. The network framework comprises two sub-pixel shift modules (SPSM), corresponding respectively to the features of the 1st and 2nd convolution blocks of the feature extraction part; the shallow features are used for detail supplementation, further making full use of the features of all levels.
After the outputs of the sub-pixel shift module (SPSM) and of the series convolution layers, the two feature maps are spliced at the channel level and then output as a single-channel visual saliency prediction map through a convolution layer, a batch normalization layer and an activation function (Sigmoid).
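A sketch of the SPSM under this description; the 1 x 1 channel alignment and the bilinear upsampling of the high-level feature are assumptions needed to make the shapes match.

import torch.nn as nn
import torch.nn.functional as F

class SPSM(nn.Module):
    # Sub-pixel shift module sketch: the left- and right-view shallow features
    # are added pixel by pixel (parallax offset fusion), multiplied with the
    # high-level feature coming from the decoder to inject positioning
    # information, and passed through one convolution block.
    def __init__(self, shallow_ch, high_ch, out_ch):
        super().__init__()
        self.align = nn.Conv2d(high_ch, shallow_ch, kernel_size=1)  # assumed channel alignment
        self.conv = nn.Sequential(
            nn.Conv2d(shallow_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, left_feat, right_feat, high_feat):
        fused = left_feat + right_feat                      # parallax offset fusion
        high = F.interpolate(self.align(high_feat), size=fused.shape[2:],
                             mode="bilinear", align_corners=False)
        return self.conv(fused * high)                      # boundary supplement + convolution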
Step 1_ 3: model input and output: the input of the model is a binocular data set, i.e. two RGB three-channel color images for the left and right viewpoints, and the output is a single-channel grayscale image in which each pixel value lies between 0 and 255.
Step 1_ 4: model loss function. The loss function of the model adopts three parts: root mean square error, CC-Loss and KLDivLoss. The root mean square error evaluates the difference between each pixel of the label map and the prediction map; CC-Loss (channel correlation loss) can constrain the relationship between classes and channels and maintain separability within and between classes; the KL divergence (Kullback-Leibler divergence), also called relative entropy, measures the degree of difference between two probability distributions. These three loss terms are used together to compute the loss of the constructed convolutional neural network.
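A sketch of the combined loss is given below. The text does not give the weighting of the three terms, so they are summed with equal weight; and since the exact formulation of CC-Loss is not reproduced in the text, a Pearson correlation-coefficient term, the common choice in saliency prediction, is used here as an approximation.

import torch
import torch.nn.functional as F

def saliency_loss(pred, target, eps=1e-7):
    # Combined loss sketch: RMSE + (1 - Pearson CC) + KL divergence,
    # summed with equal weight because no weighting is given in the text.
    rmse = torch.sqrt(F.mse_loss(pred, target) + eps)

    # correlation-coefficient term: 1 - Pearson correlation per sample
    p = pred.flatten(1) - pred.flatten(1).mean(dim=1, keepdim=True)
    t = target.flatten(1) - target.flatten(1).mean(dim=1, keepdim=True)
    cc = (p * t).sum(dim=1) / (p.norm(dim=1) * t.norm(dim=1) + eps)
    cc_loss = (1.0 - cc).mean()

    # KL divergence between the two maps treated as spatial probability distributions
    p_dist = pred.flatten(1) / (pred.flatten(1).sum(dim=1, keepdim=True) + eps)
    t_dist = target.flatten(1) / (target.flatten(1).sum(dim=1, keepdim=True) + eps)
    kld = F.kl_div((p_dist + eps).log(), t_dist, reduction="batchmean")

    return rmse + cc_loss + kld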
Step 1_ 5: training process and optimal parameter selection: following the model framework and calculation process of step 1_2, the output is obtained from the input of step 1_3 and the model loss is calculated with the loss function of step 1_4. This process is repeated V times to obtain the convolutional neural network prediction training model and Q loss function values; the minimum of these loss values is found, and the weight matrix and bias matrix corresponding to that loss value are taken as the optimal weights and biases of the convolutional neural network model, denoted Wbest and bbest, where V > 1 (in this example V = 300).
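The training and best-weight selection of this step might be sketched as follows, reusing the saliency_loss sketch above; V = 300 comes from the text, while the Adam optimizer, learning rate and batch size are assumptions.

import torch
from torch.utils.data import DataLoader

def train(model, dataset, epochs=300, lr=1e-4, batch_size=4, device="cuda"):
    # Training loop sketch: V = 300 epochs as in the text; optimizer, learning
    # rate and batch size are assumptions. Keeps the state dict of the epoch
    # with the smallest accumulated loss as (W_best, b_best).
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.to(device)

    best_loss, best_state = float("inf"), None
    for epoch in range(epochs):
        model.train()
        epoch_loss = 0.0
        for left, right, gt in loader:
            left, right, gt = left.to(device), right.to(device), gt.to(device)
            pred = model(left, right)            # single-channel saliency prediction map
            loss = saliency_loss(pred, gt)       # combined loss sketched in step 1_4
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        if epoch_loss < best_loss:
            best_loss = epoch_loss
            best_state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}
    return best_state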
The specific steps of the model test process are as follows:
Step 2_ 1: the test set contains 95 binocular images in total. Let {I'(i', j')} denote a test image to be predicted, where 1 ≤ i' ≤ W' and 1 ≤ j' ≤ H', W' denotes the width of {I'(i', j')}, H' denotes its height, and I'(i', j') denotes the pixel value of the pixel at coordinate (i', j') in {I'(i', j')}.
Step 2_ 2: the R, G and B channels of the left and right viewpoints of each binocular picture of the test set are input into the convolutional neural network, and prediction is performed with the optimal parameters Wbest and bbest to obtain the single-channel visual saliency prediction map of each picture.
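A sketch of this prediction step, assuming the model ends with a Sigmoid so that outputs lie in [0, 1] and only need rescaling to the 0-255 grayscale range.

import torch

@torch.no_grad()
def predict(model, best_state, left, right, device="cuda"):
    # Inference sketch: load the selected weights (W_best, b_best) and produce
    # a single-channel saliency map rescaled to the 0-255 grayscale range.
    model.load_state_dict(best_state)
    model.to(device).eval()
    pred = model(left.to(device), right.to(device))    # in [0, 1] after the Sigmoid
    return (pred * 255.0).clamp(0, 255).byte().cpu()   # grayscale prediction map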
To further verify the feasibility and effectiveness of the method of the invention, experiments were performed.
A convolutional neural network architecture is built with the PyTorch deep learning library, based on the Python language. Training is performed using the NCTU data set.
TABLE 1 evaluation results on test sets using the method of the invention
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (5)

1. A visual saliency prediction method based on binocular parallax offset fusion is characterized by comprising the following steps:
selecting a plurality of binocular views of natural scenes and movie scenes to form an image data training set;
constructing a convolutional neural network framework, wherein the neural network framework enables high-level semantic information and low-level detail information to be combined with each other;
training the convolutional neural network framework;
training multiple times to obtain a convolutional neural network prediction training model.
2. The binocular disparity migration fusion-based visual saliency prediction method according to claim 1, wherein the neural network framework has the following specific connection relations:
the left view of the input layer is input into the 1st, 2nd, 3rd, 4th and 5th convolution blocks in sequence; the 1st convolution block is input to the 2nd SPSM module, the 2nd convolution block is input to the 1st SPSM module, the 3rd convolution block is input to the 3rd GCM module, the 4th convolution block is input to the 2nd GCM module, and the 5th convolution block is input to the 1st GCM module; the right view of the input layer is sequentially connected to the 6th, 7th, 8th, 9th and 10th convolution blocks; the 6th convolution block is input to the 2nd SPSM module, the 7th convolution block is input to the 1st SPSM module, the 8th convolution block is input to the 3rd CAM module, the 9th convolution block is input to the 2nd CAM module, and the 10th convolution block is input to the 1st CAM module; the 1st GCM module is input to the 1st feature cascade transfer module, the 2nd GCM module to the 2nd feature cascade transfer module, and the 3rd GCM module to the 3rd feature cascade transfer module; the 1st feature cascade transfer module outputs to the 1st CAM module, the 2nd feature cascade transfer module and the 3rd feature cascade transfer module; the 2nd feature cascade transfer module outputs to the 2nd CAM module, and the 3rd feature cascade transfer module outputs to the 3rd CAM module; the 1st CAM module outputs to the 2nd feature cascade transfer module, the 2nd CAM module outputs to the 3rd feature cascade transfer module, and the 3rd CAM module outputs to the 1st SPSM module and the 1st high-level convolution block; the 1st SPSM module outputs to the 2nd SPSM module, and the 2nd and 3rd high-level convolution blocks are connected in sequence after the 1st high-level convolution block; the outputs of the 2nd SPSM module and the 3rd high-level convolution block reach the output layer via a concatenation (concat) layer.
3. The method for predicting visual saliency based on binocular disparity shift fusion according to claim 2, wherein the specific input-output relationship of the SPSM module is as follows:
the left-viewpoint features and the right-viewpoint features are respectively input into a parallax fusion layer; the parallax fusion layer outputs to a high-level feature fusion layer; the high-level features of the preceding layer are also output to the high-level feature fusion layer; the high-level feature fusion layer outputs to a convolution block, and the output of the convolution block is the output of the SPSM module.
4. The binocular disparity migration fusion-based visual saliency prediction method of claim 2, wherein the specific input-output relationship of the GCM module is as follows:
the convolution feature maps are respectively input into a 1st convolution layer, a 1st dilated (hole) convolution layer, a 2nd dilated convolution layer and a 3rd dilated convolution layer and then into a 2nd convolution layer, and the output of the 2nd convolution layer is input into a splicing (concatenation) layer.
5. The binocular disparity migration fusion-based visual saliency prediction method of claim 2, wherein the specific input-output relationship of the feature cascade transfer module is as follows:
the 1st GCM module is input into the 1st convolution layer, and the 1st convolution layer outputs respectively to the 1st pixel-wise dot-product layer, the 3rd pixel-wise dot-product layer, the 1st feature splicing layer and the 1st CAM module; the 1st CAM module outputs to the 2nd pixel-wise dot-product layer; the 2nd GCM module is input to the 1st pixel-wise dot-product layer, the 1st pixel-wise dot-product layer is input to the 1st feature splicing layer, the 1st feature splicing layer is input to the 2nd pixel-wise dot-product layer, the 2nd pixel-wise dot-product layer is input to the 2nd convolution layer, and the 2nd convolution layer outputs to the 2nd CAM module; the 1st pixel-wise dot-product layer outputs to the 3rd pixel-wise dot-product layer, the 2nd convolution layer outputs to the 2nd feature splicing layer, and the 2nd CAM module outputs to the 4th pixel-wise dot-product layer; the 3rd GCM module outputs to the 3rd pixel-wise dot-product layer, the 3rd pixel-wise dot-product layer outputs to the 2nd feature splicing layer, the 2nd feature splicing layer outputs to the 4th pixel-wise dot-product layer, the 4th pixel-wise dot-product layer outputs to the 3rd convolution layer, and the 3rd convolution layer is input to the 3rd CAM module.
CN202110385471.9A 2021-04-10 2021-04-10 Visual saliency prediction method based on binocular parallax offset fusion Pending CN113066074A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110385471.9A CN113066074A (en) 2021-04-10 2021-04-10 Visual saliency prediction method based on binocular parallax offset fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110385471.9A CN113066074A (en) 2021-04-10 2021-04-10 Visual saliency prediction method based on binocular parallax offset fusion

Publications (1)

Publication Number Publication Date
CN113066074A true CN113066074A (en) 2021-07-02

Family

ID=76566592

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110385471.9A Pending CN113066074A (en) 2021-04-10 2021-04-10 Visual saliency prediction method based on binocular parallax offset fusion

Country Status (1)

Country Link
CN (1) CN113066074A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113538379A (en) * 2021-07-16 2021-10-22 河南科技学院 Double-stream coding fusion significance detection method based on RGB and gray level image
CN113538379B (en) * 2021-07-16 2022-11-22 河南科技学院 Double-stream coding fusion significance detection method based on RGB and gray level images
CN113409319A (en) * 2021-08-17 2021-09-17 点内(上海)生物科技有限公司 Rib fracture detection model training system, method, detection system and detection method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination