CN113066074A - Visual saliency prediction method based on binocular parallax offset fusion - Google Patents

Visual saliency prediction method based on binocular parallax offset fusion

Info

Publication number
CN113066074A
Authority
CN
China
Prior art keywords
layer
module
convolution
input
block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110385471.9A
Other languages
Chinese (zh)
Inventor
周武杰
马佳宝
雷景生
强芳芳
钱小鸿
甘兴利
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lover Health Science and Technology Development Co Ltd
Original Assignee
Zhejiang Lover Health Science and Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lover Health Science and Technology Development Co Ltd filed Critical Zhejiang Lover Health Science and Technology Development Co Ltd
Priority to CN202110385471.9A priority Critical patent/CN113066074A/en
Publication of CN113066074A publication Critical patent/CN113066074A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/0002 - Inspection of images, e.g. flaw detection
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G06N 3/08 - Learning methods
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 - Special algorithmic details
    • G06T 2207/20081 - Training; Learning
    • G06T 2207/20084 - Artificial neural networks [ANN]
    • G06T 2207/20212 - Image combination
    • G06T 2207/20221 - Image fusion; Image merging
    • G06T 2207/20228 - Disparity calculation for image-based rendering

Abstract

The invention discloses a visual saliency prediction method based on binocular parallax offset fusion, and relates to the field of deep learning. In the training stage, a convolutional neural network is constructed, comprising a feature extraction layer and an upsampling layer. The feature extraction layer comprises two twin networks whose framework adopts the ResNet34 architecture, each being a convolutional feature extraction part consisting of 5 convolution blocks; the upsampling layer comprises 4 parts: a GCM module, a CAM module, a feature cascade transfer module and an SPSM module. The NCTU binocular image data set is input into the convolutional neural network for training to obtain a single-channel salient-target prediction map; loss function values between the prediction maps of the training-set images and the real saliency label maps are then calculated to obtain the optimal weight vector and bias term of the convolutional neural network training model. The method improves the efficiency and accuracy of salient-target prediction.

Description

Visual saliency prediction method based on binocular parallax offset fusion
Technical Field
The invention relates to the field of deep learning, in particular to a binocular parallax offset fusion-based visual saliency prediction method.
Background
With the mass of data brought by the development of the Internet, rapidly acquiring key information from massive image and video data has become a key problem in the field of computer vision. Visual saliency detection has important application value here, for example in object identification, 3D display, visual comfort evaluation and 3D visual quality measurement. When faced with a natural scene, the human visual system can quickly search for and locate objects of interest, focusing on the most prominent areas of the image while ignoring the others. This visual attention mechanism is of great significance for processing visual image information in daily life.
Visual saliency prediction through deep learning can predict the saliency region directly, end to end, at the pixel level: only the images and labels of the training set need to be input into the model framework for training to obtain the weights and the model; prediction is then carried out on the test set to verify model quality, the best prediction model is obtained through continuous tuning, and finally this model is used to predict real-world pictures and obtain their visual saliency prediction results. The core of the deep-learning-based prediction method here is a binocular parallax offset fusion visual saliency prediction framework constructed with a convolutional neural network, whose multilayer structure and ability to learn features automatically allow it to learn features at multiple levels. Visual attention modeling mainly comprises two types: bottom-up and top-down. Bottom-up refers to visual attention elicited by the essential features of an image, driven by underlying perceptual data such as a set of image features like color, brightness and orientation; according to the bottom-level image data, different areas exhibit strong feature differences, and the saliency of an image area can be calculated by determining the difference between the target area and its surrounding pixels. The top-down strategy is a task-driven attention saliency mechanism: visual attention is driven by task experience, and the salient target region of the current image is predicted based on knowledge. For example, when looking for a friend wearing a black hat in an area, one will first notice the prominent feature of the black hat.
Most existing visual saliency prediction methods adopt deep learning, using models that combine convolutional layers, batch normalization layers and pooling layers; better architectures, and thus better models, are obtained through different combinations of these layers.
Disclosure of Invention
In view of the above, the invention provides a binocular parallax offset fusion-based visual saliency prediction method, which has a good prediction effect and is rapid in prediction.
In order to achieve the purpose, the invention adopts the following technical scheme:
a visual saliency prediction method based on binocular parallax offset fusion comprises the following steps:
selecting a plurality of binocular views of natural scenes and movie scenes to form an image data training set;
constructing a convolutional neural network framework, wherein the neural network framework enables high-level semantic information and low-level detail information to be combined with each other;
training the convolutional neural network framework: inputting the binocular views into the convolutional neural network framework, which outputs a grayscale map; the loss function of the convolutional neural network framework adopts root mean square error, CC-Loss and KL divergence loss (KLDivLoss);
training multiple times to obtain a convolutional neural network prediction training model.
Preferably, the specific connection relationship of the neural network framework is as follows:
the left view of the input layer is input into the 1st, 2nd, 3rd, 4th and 5th convolution blocks in sequence; the 1st convolution block is input to the 2nd SPSM module, the 2nd convolution block is input to the 1st SPSM module, the 3rd convolution block is input to the 3rd GCM module, the 4th convolution block is input to the 2nd GCM module, and the 5th convolution block is input to the 1st GCM module; the right view of the input layer is sequentially connected to the 6th, 7th, 8th, 9th and 10th convolution blocks; the 6th convolution block is input to the 2nd SPSM module, the 7th convolution block is input to the 1st SPSM module, the 8th convolution block is input to the 3rd CAM module, the 9th convolution block is input to the 2nd CAM module, and the 10th convolution block is input to the 1st CAM module; the 1st GCM module is input to the 1st feature cascade transfer module, the 2nd GCM module to the 2nd feature cascade transfer module, and the 3rd GCM module to the 3rd feature cascade transfer module; the 1st feature cascade transfer module outputs to the 1st CAM module, the 2nd feature cascade transfer module and the 3rd feature cascade transfer module; the 2nd feature cascade transfer module outputs to the 2nd CAM module, and the 3rd feature cascade transfer module outputs to the 3rd CAM module; the 1st CAM module outputs to the 2nd feature cascade transfer module, the 2nd CAM module outputs to the 3rd feature cascade transfer module, and the 3rd CAM module outputs to the 1st SPSM module and the 1st high-level convolution block; the 1st SPSM module outputs to the 2nd SPSM module, and the 2nd and 3rd high-level convolution blocks are connected in sequence after the 1st high-level convolution block; the outputs of the 2nd SPSM module and the 3rd high-level convolution block reach the output layer via a concatenation (concat) layer.
Preferably, the specific input-output relationship of the SPSM module is as follows:
the left-viewpoint features and the right-viewpoint features are respectively input into a parallax fusion layer; the parallax fusion layer outputs to a high-level feature fusion layer; the high-level features of the preceding layer are also output to the high-level feature fusion layer; the high-level feature fusion layer outputs to a convolution block, and the output of the convolution block is the output of the SPSM module.
Preferably, the specific input-output relationship of the GCM module is as follows:
the convolution feature maps are respectively input into a 1st convolution layer, a 1st dilated (hole) convolution layer, a 2nd dilated convolution layer and a 3rd dilated convolution layer and then into a 2nd convolution layer, and the output of the 2nd convolution layer is input into a splicing (concatenation) layer.
Preferably, the specific input-output relationship of the characteristic cascade transfer module is as follows:
the 1st GCM module is input into the 1st convolution layer, and the 1st convolution layer outputs respectively to the 1st pixel-wise dot-product layer, the 3rd pixel-wise dot-product layer, the 1st feature splicing layer and the 1st CAM module; the 1st CAM module outputs to the 2nd pixel-wise dot-product layer; the 2nd GCM module is input to the 1st pixel-wise dot-product layer, the 1st pixel-wise dot-product layer is input to the 1st feature splicing layer, the 1st feature splicing layer is input to the 2nd pixel-wise dot-product layer, the 2nd pixel-wise dot-product layer is input to the 2nd convolution layer, and the 2nd convolution layer outputs to the 2nd CAM module; the 1st pixel-wise dot-product layer outputs to the 3rd pixel-wise dot-product layer, the 2nd convolution layer outputs to the 2nd feature splicing layer, and the 2nd CAM module outputs to the 4th pixel-wise dot-product layer; the 3rd GCM module outputs to the 3rd pixel-wise dot-product layer, the 3rd pixel-wise dot-product layer outputs to the 2nd feature splicing layer, the 2nd feature splicing layer outputs to the 4th pixel-wise dot-product layer, the 4th pixel-wise dot-product layer outputs to the 3rd convolution layer, and the 3rd convolution layer is input to the 3rd CAM module.
Compared with the prior art, the visual saliency prediction method based on binocular parallax offset fusion has the following beneficial effects:
1. The invention constructs a convolutional neural network architecture; a picture data set sampled from the real world is input into the convolutional neural network for training to obtain a prediction model. The picture to be predicted is then input into the network, and a prediction map of its visually salient region is obtained. By combining high-level semantic information with low-level detail information throughout the network architecture, the method effectively improves the accuracy of salient-region prediction.
2. The method constructs a coding layer and a decoding layer with a convolutional neural network: the coding layer extracts the high-level semantic features and low-level detail features of the image, and in the decoding layer the high-level semantic features are transferred upward step by step and supplemented with information from the low-level detail features. This addresses the loss of detail features when image features are extracted by the stage-by-stage coding layer structure, and the extracted high-level features can locate the region of the salient target more accurately.
3. The method adopts a feature step-by-step upward transfer module in the upward transfer process of the decoding layer, fully utilizing the high-level features to gradually locate the position of the salient target and transfer it layer by layer toward the front layers; a sub-pixel shift module is adopted to fully exploit the mutual fusion of high-level and low-level features, ensuring feature utilization and predicting the visually salient region of the image with maximum accuracy.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a diagram of the model structure according to the present invention;
FIG. 2 is a schematic diagram of an SPSM module of the present invention;
FIG. 3 is a schematic diagram of a GCM module according to the present invention;
FIG. 4 is a schematic diagram of a feature cascade transfer module of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention discloses a binocular parallax offset fusion-based visual saliency prediction method, the overall implementation block diagram of which is shown in fig. 1 and comprises a model training stage and a model testing stage;
the specific steps of the model training process are as follows:
step 1_ 1: and selecting Q binocular images of natural scenes and movie scenes, namely images with a left view point and a right view point, to form a training image data set. And the qth graph in the training set is denoted as { I }q(I, j) }, training set and { I }q(i, j) } corresponding true visual saliency prediction maps
Figure BDA0003014630770000061
The images in the natural scene and the movie scene are both RGB three-channel binocular color images, Q is a positive integer, Q is more than or equal to 200, Q is 332, Q is a positive integer, Q is more than or equal to 1 and less than or equal to Q, I is more than or equal to 1 and less than or equal to W, j is more than or equal to 1 and less than or equal to H, W represents the width of W which is 480, and H represents { I { (I) } is a positive integer, Q isq(I, j) } e.g. take W640, H480, Iq(I, j) represents { IqThe pixel value of the pixel point with the coordinate position (i, j) in (i, j),
Figure BDA0003014630770000062
to represent
Figure BDA0003014630770000063
The middle coordinate position is the pixel value of the pixel point of (i, j);
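For illustration, the loading and normalization of such a training set could be sketched as follows in PyTorch (the framework used in the experiments reported below). The folder names left/, right/ and gt/, the 256 x 256 resize and the ImageNet normalization are assumptions made for this sketch; the text itself only specifies RGB binocular pairs with their saliency label maps.

import os
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class BinocularSaliencyDataset(Dataset):
    # Loads (left view, right view, saliency label) triplets for training.
    # The folder names "left", "right" and "gt" are placeholders for however
    # the binocular data set is actually organised on disk.
    def __init__(self, root, size=(256, 256)):
        self.root = root
        self.names = sorted(os.listdir(os.path.join(root, "left")))
        self.rgb_tf = transforms.Compose([
            transforms.Resize(size),
            transforms.ToTensor(),                              # HxWxC in [0, 255] -> CxHxW in [0, 1]
            transforms.Normalize(mean=[0.485, 0.456, 0.406],    # ImageNet statistics, matching
                                 std=[0.229, 0.224, 0.225]),    # the ResNet34 backbone below
        ])
        self.gt_tf = transforms.Compose([transforms.Resize(size), transforms.ToTensor()])

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        name = self.names[idx]
        left = Image.open(os.path.join(self.root, "left", name)).convert("RGB")
        right = Image.open(os.path.join(self.root, "right", name)).convert("RGB")
        gt = Image.open(os.path.join(self.root, "gt", name)).convert("L")  # single-channel label
        return self.rgb_tf(left), self.rgb_tf(right), self.gt_tf(gt)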
step 1_ 2: convolutional neural network architecture: the convolutional neural network architecture of the present invention is mainly composed of two parts, namely a feature extraction part (coding layer) and an upsampling part (decoding layer).
In the feature extraction part, because the adopted data set is a binocular vision data set and each sample is divided into a left viewpoint and a right viewpoint, the feature extraction part comprises two coding layers with the same architecture, a left-viewpoint feature coding layer and a right-viewpoint feature coding layer, which extract visual features from the left-view and right-view pictures; each coding layer comprises 5 convolution blocks. That is, the feature extraction section includes left-viewpoint feature extraction and right-viewpoint feature extraction, each comprising a 1st, 2nd, 3rd, 4th and 5th convolution block. Here, the outputs of the first two convolution blocks are defined as shallow features, and the outputs of the last three convolution blocks are defined as high-level features. The outputs of the 3rd, 4th and 5th convolution blocks of the left viewpoint correspond respectively to the global context modules GCM3, GCM2 and GCM1; the outputs of the 3rd, 4th and 5th convolution blocks of the right viewpoint correspond respectively to the attention fusion modules CAM3, CAM2 and CAM1.
The upsampling part includes five components: a global context module (comprising three GCM units, GCM1, GCM2 and GCM3), a feature cascade transfer module (FCM), a channel attention fusion module (comprising three CAM units, CAM1, CAM2 and CAM3), a sub-pixel shift module (SPSM) and a high-level feature convolution component. The high-level feature convolution consists of three convolution blocks, each containing a 3 x 3 convolution kernel with stride 1 and padding 1, a batch normalization layer and an activation function (ReLU).
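A minimal sketch of such a high-level feature convolution block in PyTorch, directly following the 3 x 3 / stride 1 / padding 1 specification above; the channel counts are left as parameters since they differ per stage.

import torch.nn as nn

def conv_block(in_ch, out_ch):
    # One high-level feature convolution block as described above:
    # 3x3 convolution (stride 1, padding 1) -> batch normalization -> ReLU.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )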
First, the left and right viewpoints of each picture in the binocular data set (picture width W = 256 and height H = 256, with R-channel, G-channel and B-channel components) are fed into feature convolution blocks of the same architecture to extract visual feature information. The feature extraction part adopts the ResNet34 network specification and comprises 5 convolution blocks. The first convolution block comprises a first convolution layer (Conv), a first activation layer (Act) and a first max pooling layer (Max Pool). The convolution layer uses a kernel size (kernel_size) of 7, stride 2 and edge padding 3; the convolved feature map is then normalized by a batch normalization layer, passed through the nonlinear transformation of the activation function (rectified linear unit, ReLU), and finally output by the max pooling layer as the feature map of the first convolution block. The feature map of this layer is denoted F1^R for the right viewpoint and F1^L for the left viewpoint. The first convolution block outputs 64 feature maps, which form the left-viewpoint feature map set P1^L and the right-viewpoint feature map set P1^R; owing to the stride-2 convolution and the max pooling, each feature map has width W/4 and height H/4.
The second convolution block consists of a second convolution layer (Conv), a second activation layer (Act) and a second max pooling layer (Pool). The convolution block of the left view takes as input the left-viewpoint feature map set P1^L output by the first left-view convolution block, and the convolution block of the right view takes as input the right-viewpoint feature map set P1^R output by the first right-view convolution block. The parameters of the convolution layer are: kernel size 3 x 3, stride 1, edge padding 1. The number of convolution kernels is 64, i.e. the number of feature maps output by the convolution layer. The convolution output is normalized by a batch normalization layer, passed through a nonlinear activation function (ReLU), and finally output by max pooling. Starting from the second block, the convolution feature maps are processed with a residual structure: let the input of the convolution block be X and its output be Y; the final feature map output of the block is Y' = X + Y. The purpose is that, during convolution, the original information is combined with the convolved information, so that more abstract features can be extracted while preserving the original information to the maximum extent. The feature map of this layer is denoted F2^R for the right viewpoint and F2^L for the left viewpoint. The second convolution block outputs 64 feature maps, which form the left-viewpoint feature map set P2^L and the right-viewpoint feature map set P2^R; each feature map in P2 has half the width and half the height of the feature maps in P1.
The third convolution block consists of a third convolution layer (Conv), a third activation layer (Act) and a third max pooling layer (Pool). The convolution block of the left view takes as input the left-viewpoint feature map set P2^L output by the second left-view convolution block, and the convolution block of the right view takes as input the right-viewpoint feature map set P2^R output by the second right-view convolution block. The parameters of the convolution layer are: kernel size 3 x 3, stride 1, edge padding 1. The number of convolution kernels is 128, i.e. the number of feature maps output by the convolution layer. After normalization by a batch normalization layer, the output passes through a nonlinear activation function (ReLU) and is finally output by the max pooling layer (Max Pooling). The feature map of this layer is denoted F3^R for the right viewpoint and F3^L for the left viewpoint. The third convolution block outputs 128 feature maps, which form the left-viewpoint feature map set P3^L and the right-viewpoint feature map set P3^R; each feature map in P3 has half the width and half the height of the feature maps in P2.
The fourth convolution block consists of a fourth convolution layer (Conv), a fourth activation layer (Act) and a fourth max pooling layer (Pool). The convolution block of the left view takes as input the left-viewpoint feature map set P3^L output by the third left-view convolution block, and the convolution block of the right view takes as input the right-viewpoint feature map set P3^R output by the third right-view convolution block. The parameters of the convolution layer are: kernel size 3 x 3, stride 1, edge padding 1. The number of convolution kernels is 256, i.e. the number of feature maps output by the convolution layer. After normalization by a batch normalization layer, the output passes through a nonlinear activation function (ReLU) and is finally output by the max pooling layer (Max Pooling). The feature map of this layer is denoted F4^R for the right viewpoint and F4^L for the left viewpoint. The fourth convolution block outputs 256 feature maps, which form the left-viewpoint feature map set P4^L and the right-viewpoint feature map set P4^R; each feature map in P4 has half the width and half the height of the feature maps in P3.
The fifth convolution block consists of a fifth convolution layer (Conv), a fifth activation layer (Act) and a fifth max pooling layer (Pool). The convolution block of the left view takes as input the left-viewpoint feature map set P4^L output by the fourth left-view convolution block, and the convolution block of the right view takes as input the right-viewpoint feature map set P4^R output by the fourth right-view convolution block. The parameters of the convolution layer are: kernel size 3 x 3, stride 1, edge padding 1. The number of convolution kernels is 512, i.e. the number of feature maps output by the convolution layer. After normalization by a batch normalization layer, the output passes through a nonlinear activation function (ReLU) and is finally output by the max pooling layer (Max Pooling). The feature map of this layer is denoted F5^R for the right viewpoint and F5^L for the left viewpoint. The fifth convolution block outputs 512 feature maps, which form the left-viewpoint feature map set P5^L and the right-viewpoint feature map set P5^R; each feature map in P5 has half the width and half the height of the feature maps in P4.
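The twin ResNet34 encoders described above might be sketched with torchvision as follows. Whether the left and right branches share weights and whether ImageNet pre-training is used are not stated in the text, so the weight-independent branches and the pre-trained initialization are assumptions; the stage split follows the standard torchvision ResNet34, which may differ in detail (strided convolutions rather than per-block pooling) from the block-by-block description above.

import torch.nn as nn
from torchvision import models

class TwinResNet34Encoder(nn.Module):
    # Two weight-independent ResNet34 feature extractors (left / right view),
    # each returning the outputs of the five stages described above.
    def __init__(self, pretrained=True):
        super().__init__()
        self.left = self._make_backbone(pretrained)
        self.right = self._make_backbone(pretrained)

    @staticmethod
    def _make_backbone(pretrained):
        weights = models.ResNet34_Weights.IMAGENET1K_V1 if pretrained else None
        net = models.resnet34(weights=weights)
        return nn.ModuleList([
            nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool),  # block 1, 64 channels
            net.layer1,   # block 2, 64 channels
            net.layer2,   # block 3, 128 channels
            net.layer3,   # block 4, 256 channels
            net.layer4,   # block 5, 512 channels
        ])

    @staticmethod
    def _run(stages, x):
        feats = []
        for stage in stages:
            x = stage(x)
            feats.append(x)
        return feats  # [F1, F2, F3, F4, F5]

    def forward(self, left_img, right_img):
        return self._run(self.left, left_img), self._run(self.right, right_img)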
For the feature extraction part of the coding layer, after the feature maps of the convolution blocks of the left- and right-view data sets are extracted, further processing is required. For the left viewpoint, the feature maps output by the third to fifth convolution blocks are passed through a Global Context Module (GCM), which has 4 branches in total; each branch extracts features under a different receptive field by using dilated (hole) convolutions of different sizes (dilation rates 1, 3, 5 and 7), so as to locate salient objects of different sizes. Each branch contains a basic convolution block: a convolution layer with a 3 x 3 kernel, a batch normalization layer and an activation layer (ReLU).
In this embodiment, the network architecture includes 3 GCM modules in total which, as shown in fig. 3, follow the third, fourth and fifth convolution blocks of the feature extraction part respectively: the GCM module corresponding to the third convolution block is GCM3, the one corresponding to the fourth convolution block is GCM2, and the one corresponding to the fifth convolution block is GCM1. The input and output sizes of the feature maps of each GCM module are unchanged.
The input of each GCM module is the output Fi^L of one of the last three convolution blocks of the left viewpoint, where i ∈ {3, 4, 5}. The features obtained with dilated convolutions of rates 1, 3, 5 and 7 are denoted F1^d, F3^d, F5^d and F7^d respectively. These four feature maps are spliced by concatenation and then passed through a convolution layer consisting of a convolution layer with a 3 x 3 kernel, a batch normalization layer and an activation layer with a nonlinear activation function (ReLU), giving the feature map G. G is then spliced by concatenation with the input feature Fi^L of the GCM module (i ∈ {3, 4, 5}) and passed through a convolution layer of the same structure; the final output of the module is denoted Fi^GCM, where i ∈ {3, 4, 5}.
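A sketch of the GCM along these lines is given below. The per-branch channel counts and the exact fusion convolutions are not fully specified in the text, so keeping the channel count constant throughout is an assumption.

import torch
import torch.nn as nn

class GCM(nn.Module):
    # Global Context Module sketch: four parallel 3x3 branches with dilation
    # rates 1, 3, 5 and 7, concatenated and fused by a 3x3 convolution, then
    # concatenated again with the input feature and fused once more, so that
    # the output keeps the input's spatial size and channel count.
    def __init__(self, channels):
        super().__init__()
        def unit(in_ch, dilation=1):
            return nn.Sequential(
                nn.Conv2d(in_ch, channels, 3, padding=dilation, dilation=dilation, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
        self.branches = nn.ModuleList([unit(channels, d) for d in (1, 3, 5, 7)])
        self.fuse_branches = unit(4 * channels)   # fuse the concatenated branches into G
        self.fuse_input = unit(2 * channels)      # fuse G with the module input

    def forward(self, x):
        g = torch.cat([b(x) for b in self.branches], dim=1)  # splice the four dilated branches
        g = self.fuse_branches(g)
        return self.fuse_input(torch.cat([g, x], dim=1))     # re-attach the input feature Fi^L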
In the feature extraction section, the right-viewpoint feature maps are also further processed after extraction. For the right view, the feature maps output by the third to fifth convolution blocks are passed through an attention fusion module (CAM). The input of this module includes the right-viewpoint feature Fi^R, where i ∈ {3, 4, 5}, and the feature map Fi^FCM output by the feature cascade transfer module, where i ∈ {3, 4, 5}. The right-viewpoint feature map Fi^R and the feature map Fi^FCM output by the feature cascade module are spliced by concatenation, and the weight of each pixel is then computed through an activation layer (softmax); the per-pixel weights are combined with the right-viewpoint feature Fi^R by a pixel-level dot-product operation to obtain the fused feature map Fsoftmax. Next, hierarchical channel attention (LayerNorm) is computed: Fsoftmax is passed through an activation layer (softmax) to obtain the weight of each layer (channel), and these layer weights are combined with the right-viewpoint feature Fi^R by a pixel dot product to obtain the feature map FLayerNorm. This feature map is then combined with Fi^FCM by a dot-product operation to obtain the feature map Ffusion, which is finally output through a nonlinear activation function (Sigmoid) and recorded as Fi^CAM, where i ∈ {1, 2, 3}. This feature map represents the features from the two viewing angles of the left and right viewpoints, and the feature interaction makes full use of the features obtained from the different views at both the pixel level and the layer level.
Similarly, the network architecture of the invention comprises 3 CAM modules in total, following the third, fourth and fifth convolution blocks of the feature extraction part: the CAM module corresponding to the third convolution block is CAM3, the one corresponding to the fourth convolution block is CAM2, and the one corresponding to the fifth convolution block is CAM1. The input and output sizes of each module's feature maps are unchanged.
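The exact form of the softmax and LayerNorm steps is only partially recoverable from the text; the sketch below is one plausible reading, with the 1 x 1 reduction convolution added as an assumption to bring the concatenated feature back to the original channel count.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CAM(nn.Module):
    # Attention fusion sketch: pixel-level weights are computed from the
    # concatenation of the right-view feature and the FCM feature, then
    # layer (channel) level weights are computed; both are applied to the
    # right-view feature before the final fusion with the FCM feature and a
    # Sigmoid activation.
    def __init__(self, channels):
        super().__init__()
        # assumed 1x1 reduction bringing the concatenated 2*C feature back to C
        self.reduce = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, f_right, f_fcm):
        x = self.reduce(torch.cat([f_right, f_fcm], dim=1))
        b, c, h, w = x.shape
        # pixel-level attention: softmax over all spatial positions
        pixel_w = F.softmax(x.view(b, c, -1), dim=-1).view(b, c, h, w)
        f_softmax = pixel_w * f_right
        # layer-level attention: softmax over the channel axis
        layer_w = F.softmax(f_softmax.mean(dim=(2, 3)), dim=1).view(b, c, 1, 1)
        f_layernorm = layer_w * f_right
        # fuse with the FCM feature and squash to (0, 1)
        return torch.sigmoid(f_layernorm * f_fcm)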
In the upsampling section, as shown in fig. 4, the input of the feature cascade transfer module (FCM) comes from the left-view features output by the GCMs and the right-view features output by the CAMs. The FCM module is divided into three stages, corresponding respectively to the 5th, 4th and 3rd convolution block portions of the feature extraction part. The first stage is the initial high-level feature processing: the GCM feature map of this stage is processed by convolution to obtain the feature F3, which is output to the CAM3 module. In the second stage, the GCM feature map of this stage and F3 undergo a dot-multiplication operation and are spliced with F3; the result is then dot-multiplied with the feature map output by the CAM module, so that the features processed by the CAM module are fully fused and the positioning information of the object is obtained; a convolution operation then yields the feature map F2, which is output to the CAM2 module. In the third stage, the GCM feature map of this stage is dot-multiplied with F2 upsampled by a factor of 2 and F3 upsampled by a factor of 4, and the result is spliced with the upsampled F2; it is then dot-multiplied with the feature output by the CAM module, and finally, after passing through the convolution layer, the feature map of this stage is obtained and fed into the corresponding CAM module. The positioning information of the object is learned during the upward transfer, and the salient region is supplemented step by step in combination with the features of the preceding layer.
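Because FIG. 4 is only partially recoverable from the text, the sketch below condenses the cascade into a single reusable stage: upsample the deeper cascade feature, dot-multiply it with the current GCM feature, splice, fuse by convolution and, where available, apply the CAM feedback. The 1 x 1 projection and the placement of the CAM multiplication after the fusion convolution are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FCMStage(nn.Module):
    # One simplified stage of the feature cascade transfer: the deeper cascade
    # feature is projected and upsampled to the current resolution, dot-multiplied
    # with the current GCM feature, spliced with it, fused by a 3x3 convolution,
    # and optionally re-weighted by the CAM feedback of the previous stage.
    def __init__(self, cur_ch, high_ch, out_ch):
        super().__init__()
        self.proj = nn.Conv2d(high_ch, cur_ch, kernel_size=1)   # assumed channel alignment
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * cur_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, gcm_feat, higher_feat, cam_feedback=None):
        up = F.interpolate(self.proj(higher_feat), size=gcm_feat.shape[2:],
                           mode="bilinear", align_corners=False)
        x = torch.cat([gcm_feat * up, gcm_feat], dim=1)   # pixel-wise product, then splice
        out = self.fuse(x)
        if cam_feedback is not None:                      # positioning cue fed back from the CAM
            out = out * cam_feedback                      # assumes cam_feedback has out_ch channels
        return out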
The feature cascade transfer module (FCM) thus outputs into the CAM modules, and via the CAM module of the final cascade stage the cascaded feature is output through an activation function (Sigmoid) as Fi^CAM. The output of this CAM module is divided into two parts: the first part passes through three convolution layers in series, and the second part passes through a sub-pixel shift module (SPSM).
As shown in FIG. 2, the input of the sub-pixel shift module (SPSM) is the left-viewpoint feature map set Pi^L and the right-viewpoint feature map set Pi^R, where i ∈ {1, 2}. Each sub-pixel shift module receives the two inputs and adds the corresponding pixels of the left- and right-viewpoint feature maps, so as to resolve the difference caused by the different positions of the same object in the left and right viewpoints. The result is then dot-multiplied at the pixel level with the high-level feature coming from the preceding decoder stage to obtain the positioning information of the salient object, and the object boundary is supplemented in combination with the offset feature map. The feature input of the next layer is then obtained through the output of a convolution layer. The network framework comprises two sub-pixel shift modules (SPSM), corresponding respectively to the features of the 1st and 2nd convolution blocks of the feature extraction part; the shallow features are used for detail supplementation, further making full use of the features of all levels.
After the outputs of the sub-pixel shift module (SPSM) and of the series convolution layers, the two feature maps are spliced at the channel level and then output as a single-channel visual saliency prediction map through a convolution layer, a batch normalization layer and an activation function (Sigmoid).
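A sketch of the SPSM under this description; the 1 x 1 channel alignment and the bilinear upsampling of the high-level feature are assumptions needed to make the shapes match.

import torch.nn as nn
import torch.nn.functional as F

class SPSM(nn.Module):
    # Sub-pixel shift module sketch: the left- and right-view shallow features
    # are added pixel by pixel (parallax offset fusion), multiplied with the
    # high-level feature coming from the decoder to inject positioning
    # information, and passed through one convolution block.
    def __init__(self, shallow_ch, high_ch, out_ch):
        super().__init__()
        self.align = nn.Conv2d(high_ch, shallow_ch, kernel_size=1)  # assumed channel alignment
        self.conv = nn.Sequential(
            nn.Conv2d(shallow_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, left_feat, right_feat, high_feat):
        fused = left_feat + right_feat                      # parallax offset fusion
        high = F.interpolate(self.align(high_feat), size=fused.shape[2:],
                             mode="bilinear", align_corners=False)
        return self.conv(fused * high)                      # boundary supplement + convolution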
Step 1_ 3: model input and output: the input of the model is a binocular data set, i.e. two RGB three-channel color images for the left and right viewpoints, and the output is a single-channel grayscale image in which each pixel value lies between 0 and 255.
Step 1_ 4: model loss function. The loss function of the model adopts three parts: root mean square error, CC-Loss and KLDivLoss. The root mean square error evaluates the difference between each pixel of the label map and the prediction map; CC-Loss (channel correlation loss) can constrain the relationship between classes and channels and maintain separability within and between classes; the KL divergence (Kullback-Leibler divergence), also called relative entropy, measures the degree of difference between two probability distributions. These three loss terms are used together to compute the loss of the constructed convolutional neural network.
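A sketch of the combined loss is given below. The text does not give the weighting of the three terms, so they are summed with equal weight; and since the exact formulation of CC-Loss is not reproduced in the text, a Pearson correlation-coefficient term, the common choice in saliency prediction, is used here as an approximation.

import torch
import torch.nn.functional as F

def saliency_loss(pred, target, eps=1e-7):
    # Combined loss sketch: RMSE + (1 - Pearson CC) + KL divergence,
    # summed with equal weight because no weighting is given in the text.
    rmse = torch.sqrt(F.mse_loss(pred, target) + eps)

    # correlation-coefficient term: 1 - Pearson correlation per sample
    p = pred.flatten(1) - pred.flatten(1).mean(dim=1, keepdim=True)
    t = target.flatten(1) - target.flatten(1).mean(dim=1, keepdim=True)
    cc = (p * t).sum(dim=1) / (p.norm(dim=1) * t.norm(dim=1) + eps)
    cc_loss = (1.0 - cc).mean()

    # KL divergence between the two maps treated as spatial probability distributions
    p_dist = pred.flatten(1) / (pred.flatten(1).sum(dim=1, keepdim=True) + eps)
    t_dist = target.flatten(1) / (target.flatten(1).sum(dim=1, keepdim=True) + eps)
    kld = F.kl_div((p_dist + eps).log(), t_dist, reduction="batchmean")

    return rmse + cc_loss + kld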
Step 1_ 5: training process and optimal parameter selection: following the model framework and calculation process of step 1_2, the output is obtained from the input of step 1_3 and the model loss is calculated with the loss function of step 1_4. This process is repeated V times to obtain the convolutional neural network prediction training model and Q loss function values; the minimum of these loss values is found, and the weight matrix and bias matrix corresponding to that loss value are taken as the optimal weights and biases of the convolutional neural network model, denoted Wbest and bbest, where V > 1 (in this example V = 300).
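The training and best-weight selection of this step might be sketched as follows, reusing the saliency_loss sketch above; V = 300 comes from the text, while the Adam optimizer, learning rate and batch size are assumptions.

import torch
from torch.utils.data import DataLoader

def train(model, dataset, epochs=300, lr=1e-4, batch_size=4, device="cuda"):
    # Training loop sketch: V = 300 epochs as in the text; optimizer, learning
    # rate and batch size are assumptions. Keeps the state dict of the epoch
    # with the smallest accumulated loss as (W_best, b_best).
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.to(device)

    best_loss, best_state = float("inf"), None
    for epoch in range(epochs):
        model.train()
        epoch_loss = 0.0
        for left, right, gt in loader:
            left, right, gt = left.to(device), right.to(device), gt.to(device)
            pred = model(left, right)            # single-channel saliency prediction map
            loss = saliency_loss(pred, gt)       # combined loss sketched in step 1_4
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        if epoch_loss < best_loss:
            best_loss = epoch_loss
            best_state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}
    return best_state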
The specific steps of the model test process are as follows:
Step 2_ 1: the test set contains 95 binocular images in total. Let {I'(i', j')} denote a test image to be predicted, where 1 ≤ i' ≤ W' and 1 ≤ j' ≤ H', W' denotes the width of {I'(i', j')}, H' denotes its height, and I'(i', j') denotes the pixel value of the pixel at coordinate (i', j') in {I'(i', j')}.
Step 2_ 2: the R, G and B channels of the left and right viewpoints of each binocular picture of the test set are input into the convolutional neural network, and prediction is performed with the optimal parameters Wbest and bbest to obtain the single-channel visual saliency prediction map of each picture.
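A sketch of this prediction step, assuming the model ends with a Sigmoid so that outputs lie in [0, 1] and only need rescaling to the 0-255 grayscale range.

import torch

@torch.no_grad()
def predict(model, best_state, left, right, device="cuda"):
    # Inference sketch: load the selected weights (W_best, b_best) and produce
    # a single-channel saliency map rescaled to the 0-255 grayscale range.
    model.load_state_dict(best_state)
    model.to(device).eval()
    pred = model(left.to(device), right.to(device))    # in [0, 1] after the Sigmoid
    return (pred * 255.0).clamp(0, 255).byte().cpu()   # grayscale prediction map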
To further verify the feasibility and effectiveness of the method of the invention, experiments were performed.
A convolutional neural network architecture is built with the PyTorch deep learning library, based on the Python language. Training is performed using the NCTU data set.
TABLE 1 evaluation results on test sets using the method of the invention
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (5)

1. A visual saliency prediction method based on binocular parallax offset fusion is characterized by comprising the following steps:
selecting a plurality of binocular views of natural scenes and movie scenes to form an image data training set;
constructing a convolutional neural network framework, wherein the neural network framework enables high-level semantic information and low-level detail information to be combined with each other;
training the convolutional neural network framework;
training multiple times to obtain a convolutional neural network prediction training model.
2. The binocular disparity migration fusion-based visual saliency prediction method according to claim 1, wherein the neural network framework has the following specific connection relations:
the left view of the input layer is input into the 1st, 2nd, 3rd, 4th and 5th convolution blocks in sequence; the 1st convolution block is input to the 2nd SPSM module, the 2nd convolution block is input to the 1st SPSM module, the 3rd convolution block is input to the 3rd GCM module, the 4th convolution block is input to the 2nd GCM module, and the 5th convolution block is input to the 1st GCM module; the right view of the input layer is sequentially connected to the 6th, 7th, 8th, 9th and 10th convolution blocks; the 6th convolution block is input to the 2nd SPSM module, the 7th convolution block is input to the 1st SPSM module, the 8th convolution block is input to the 3rd CAM module, the 9th convolution block is input to the 2nd CAM module, and the 10th convolution block is input to the 1st CAM module; the 1st GCM module is input to the 1st feature cascade transfer module, the 2nd GCM module to the 2nd feature cascade transfer module, and the 3rd GCM module to the 3rd feature cascade transfer module; the 1st feature cascade transfer module outputs to the 1st CAM module, the 2nd feature cascade transfer module and the 3rd feature cascade transfer module; the 2nd feature cascade transfer module outputs to the 2nd CAM module, and the 3rd feature cascade transfer module outputs to the 3rd CAM module; the 1st CAM module outputs to the 2nd feature cascade transfer module, the 2nd CAM module outputs to the 3rd feature cascade transfer module, and the 3rd CAM module outputs to the 1st SPSM module and the 1st high-level convolution block; the 1st SPSM module outputs to the 2nd SPSM module, and the 2nd and 3rd high-level convolution blocks are connected in sequence after the 1st high-level convolution block; the outputs of the 2nd SPSM module and the 3rd high-level convolution block reach the output layer via a concatenation (concat) layer.
3. The method for predicting visual saliency based on binocular disparity shift fusion according to claim 2, wherein the specific input-output relationship of the SPSM module is as follows:
the left-viewpoint features and the right-viewpoint features are respectively input into a parallax fusion layer; the parallax fusion layer outputs to a high-level feature fusion layer; the high-level features of the preceding layer are also output to the high-level feature fusion layer; the high-level feature fusion layer outputs to a convolution block, and the output of the convolution block is the output of the SPSM module.
4. The binocular disparity migration fusion-based visual saliency prediction method of claim 2, wherein the specific input-output relationship of the GCM module is as follows:
the convolution feature maps are respectively input into a 1st convolution layer, a 1st dilated (hole) convolution layer, a 2nd dilated convolution layer and a 3rd dilated convolution layer and then into a 2nd convolution layer, and the output of the 2nd convolution layer is input into a splicing (concatenation) layer.
5. The binocular disparity migration fusion-based visual saliency prediction method of claim 2, wherein the specific input-output relationship of the feature cascade transfer module is as follows:
the 1st GCM module is input into the 1st convolution layer, and the 1st convolution layer outputs respectively to the 1st pixel-wise dot-product layer, the 3rd pixel-wise dot-product layer, the 1st feature splicing layer and the 1st CAM module; the 1st CAM module outputs to the 2nd pixel-wise dot-product layer; the 2nd GCM module is input to the 1st pixel-wise dot-product layer, the 1st pixel-wise dot-product layer is input to the 1st feature splicing layer, the 1st feature splicing layer is input to the 2nd pixel-wise dot-product layer, the 2nd pixel-wise dot-product layer is input to the 2nd convolution layer, and the 2nd convolution layer outputs to the 2nd CAM module; the 1st pixel-wise dot-product layer outputs to the 3rd pixel-wise dot-product layer, the 2nd convolution layer outputs to the 2nd feature splicing layer, and the 2nd CAM module outputs to the 4th pixel-wise dot-product layer; the 3rd GCM module outputs to the 3rd pixel-wise dot-product layer, the 3rd pixel-wise dot-product layer outputs to the 2nd feature splicing layer, the 2nd feature splicing layer outputs to the 4th pixel-wise dot-product layer, the 4th pixel-wise dot-product layer outputs to the 3rd convolution layer, and the 3rd convolution layer is input to the 3rd CAM module.
CN202110385471.9A 2021-04-10 2021-04-10 Visual saliency prediction method based on binocular parallax offset fusion Pending CN113066074A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110385471.9A CN113066074A (en) 2021-04-10 2021-04-10 Visual saliency prediction method based on binocular parallax offset fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110385471.9A CN113066074A (en) 2021-04-10 2021-04-10 Visual saliency prediction method based on binocular parallax offset fusion

Publications (1)

Publication Number Publication Date
CN113066074A true CN113066074A (en) 2021-07-02

Family

ID=76566592

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110385471.9A Pending CN113066074A (en) 2021-04-10 2021-04-10 Visual saliency prediction method based on binocular parallax offset fusion

Country Status (1)

Country Link
CN (1) CN113066074A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113538379A (en) * 2021-07-16 2021-10-22 河南科技学院 Double-stream coding fusion significance detection method based on RGB and gray level image
CN113538379B (en) * 2021-07-16 2022-11-22 河南科技学院 Double-stream coding fusion significance detection method based on RGB and gray level images
CN113409319A (en) * 2021-08-17 2021-09-17 点内(上海)生物科技有限公司 Rib fracture detection model training system, method, detection system and detection method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination