CN111275076B - Image saliency detection method based on feature selection and feature fusion - Google Patents


Info

Publication number
CN111275076B
CN111275076B
Authority
CN
China
Prior art keywords
feature
conv
features
pyramid set
channel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010030505.8A
Other languages
Chinese (zh)
Other versions
CN111275076A (en)
Inventor
袁夏
居思刚
赵春霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Science and Technology filed Critical Nanjing University of Science and Technology
Priority to CN202010030505.8A priority Critical patent/CN111275076B/en
Publication of CN111275076A publication Critical patent/CN111275076A/en
Application granted granted Critical
Publication of CN111275076B publication Critical patent/CN111275076B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211Selection of the most significant subset of features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image saliency detection method based on feature selection and feature fusion, comprising the following steps: extracting features from the input image and adding them to a feature pyramid set; performing feature selection on the feature pyramid set to obtain a new feature pyramid set; fusing the features in the new feature pyramid set in a bottom-up manner to obtain a mixed feature pyramid set; and training a saliency prediction network model with the features in the mixed feature pyramid set, then performing saliency detection on the image to be detected with the trained model. The invention uses an attention model to select image features, enhancing the features related to the image target and making them more effective, and uses a bottom-up feature fusion structure to effectively fuse the detail features of the lower layers with the semantic features of the higher layers, greatly improving the representational capability of the features; the detection accuracy is higher than that of common saliency detection networks.

Description

Image saliency detection method based on feature selection and feature fusion
Technical Field
The invention belongs to the field of image saliency detection, and particularly relates to an image saliency detection method based on feature selection and feature fusion.
Background
Saliency in an image refers to the objects or regions that draw a viewer's attention; the result of saliency detection in an image or a video is usually the salient object it contains. In neuroscience, saliency detection is described as an attention mechanism that concentrates processing on the most noticeable part of the observed scene, allowing the representation of objects in an image to be handled automatically. Saliency detection can improve the efficiency of downstream algorithms such as object detection and image segmentation.
The most effective saliency detection methods at present are based on fully convolutional neural networks. A fully convolutional network stacks multiple convolutional and pooling layers, gradually enlarging the receptive field and producing high-level semantic information that is crucial for saliency detection; however, the pooling layers also shrink the feature maps and degrade the boundaries of salient objects. Some networks preserve the boundaries of salient objects with hand-crafted features, extracting them to compute super-pixel saliency values or to partition the image into regions. When generating the saliency map, hand-crafted features and the high-level features of a convolutional neural network are complementary, but these methods extract them separately, and such separately extracted complementary features are difficult to fuse effectively. Furthermore, hand-crafted feature extraction is very time consuming.
Beyond hand-crafted features, some studies have found that features from different layers of a network are also complementary and have integrated multi-scale features for saliency detection. More specifically, deep features often contain global context-aware information suitable for correctly locating salient regions, while shallow features contain the spatial structural details suitable for locating boundaries. These methods fuse features at different scales but do not account for their different contributions to saliency, which limits detection performance. To overcome these problems, the prior art has proposed introducing attention models and gate functions into the saliency detection network, but this approach ignores the distinct characteristics of high-level and low-level features, which can affect the extraction of effective features and thus reduce the accuracy of saliency detection.
Disclosure of Invention
The object of the invention is to provide an image saliency detection method based on feature selection and feature fusion that can better characterize image features and predict saliency.
The technical solution for realizing the purpose of the invention is as follows: an image saliency detection method based on feature selection and feature fusion, the method comprising the steps of:
step 1, extracting features from the input image and adding all of them to a feature pyramid set;
step 2, performing feature selection on the feature pyramid set to obtain a new feature pyramid set;
step 3, fusing the features in the new feature pyramid set in a bottom-up manner to obtain a mixed feature pyramid set;
and step 4, training a saliency prediction network model with the features in the mixed feature pyramid set, and performing saliency detection on the image to be detected with the trained saliency prediction network model.
Further, the feature extraction on the input image in step 1 is specifically performed with the convolutional neural network ResNeXt, and the specific process includes:
Assume the five convolution blocks of the convolutional neural network ResNeXt are conv_1, conv_2, conv_3, conv_4 and conv_5.
Step 1-1, input the image into the five convolution blocks in sequence and iterate forward, where the iteration formula is:
f_{i+1} = conv_j(f_i, W_j), j ∈ [1,5], i ∈ [-1,3]
where f_{-1} (the case i = -1) is the image to be detected; for i = -1, 0, 1, 2, 3, f_{i+1} denotes the output of the convolution block conv_1, conv_2, conv_3, conv_4, conv_5 respectively, and W_j denotes the parameters of the convolution block conv_j;
Step 1-2, add the feature map output by each convolution block to the output set, forming the feature pyramid set {f_0, f_1, f_2, f_3, f_4}.
Further, the feature selection on the feature pyramid set in step 2 is specifically performed with a spatial attention mechanism and a channel attention mechanism, and the specific process includes:
Step 2-1, perform feature selection on the bottom-level feature map f_0 in the feature pyramid set using spatial attention, obtaining a new bottom-level feature map f'_0;
Step 2-2, perform feature selection on the mid-level feature map f_2 in the feature pyramid set using channel attention, obtaining a new mid-level feature map f'_2;
from the above, the new feature pyramid set {f'_0, f_1, f'_2, f_3, f_4} is obtained.
Further, the feature selection in step 2-1 on the bottom-level feature map f_0 in the feature pyramid set using spatial attention, obtaining the new bottom-level feature map f'_0, specifically comprises:
Define the bottom-level feature map f_0 as f_l ∈ R^(w×h×c), where w, h and c denote the width, height and number of channels of the feature map; construct a spatial attention module containing two convolution sub-blocks, denoted conv_11 and conv_22.
Step 2-1-1, pass f_l through conv_11 and conv_22 in turn, with the sub-blocks outputting the feature maps C_1 and C_2 respectively:
C_1 = conv_11(f_l, W_11)
C_2 = conv_22(f_l, W_22)
where W_11 and W_22 are the parameters of the sub-blocks conv_11 and conv_22;
Step 2-1-2, add the outputs C_1 and C_2 of the sub-blocks conv_11 and conv_22 element by element, and map the sum to [0,1] with a sigmoid function to obtain the spatial attention weight SA; the specific formula is:
SA = σ(C_1 + C_2)
where σ denotes the sigmoid function;
Step 2-1-3, perform feature selection on the bottom-level feature map f_0 with the spatial attention weight SA to obtain the new bottom-level feature map f'_0; the formula used is:
f'_0 = SA ⊗ f_l
where ⊗ denotes element-wise multiplication.
Further, the sub-blocks conv_11 and conv_22 each contain two convolutional layers: one layer with 32 convolution kernels of size 3x3, the other with 64 convolution kernels of size 3x3.
Further, the feature selection in step 2-2 on the mid-level feature map f_2 in the feature pyramid set using channel attention, obtaining the new mid-level feature map f'_2, specifically comprises:
Define the mid-level feature map f_2 as f_m ∈ R^(w×h×C).
Step 2-2-1, expand f_m into a set:
f_m = {f^m_1, f^m_2, ..., f^m_C}
where f^m_i is the i-th channel slice feature of f_m, f^m_i ∈ R^(w×h), i = 1, 2, ..., C, and C is the number of channels of the feature map f_m;
Step 2-2-2, perform global pooling on each channel slice feature f^m_i to obtain a channel-level vector v_m ∈ R^(C×1);
Step 2-2-3, learn a channel-level attention vector from the channel-level vector with two consecutive fully connected layers and a nonlinear activation layer, and map it to [0,1] with a sigmoid function to obtain the channel attention weight CA; the formula is:
CA = F(v_m, W) = σ(fc_2(δ(fc_1(v_m, W_1)), W_2))
where W_1 and W_2 are the parameters of the fully connected layers fc_1 and fc_2 respectively, δ is a nonlinear activation function, and σ is the sigmoid function;
Step 2-2-4, redistribute the channel weights of the mid-level feature map f_2 with the channel attention weight CA to obtain the new mid-level feature map f'_2; the formula used is:
f'_2 = CA ⊗ f_m
where ⊗ denotes channel-wise multiplication, i.e. each channel slice f^m_i is scaled by the i-th component of CA.
further, step 3, performing feature fusion on the features in the new feature pyramid set in a bottom-up manner to obtain a fused feature pyramid set, specifically including:
step 3-1, removing new bottom layer characteristic diagram
Figure BDA0002364138390000044
Sampling some other characteristic graph as new bottom characteristic graph
Figure BDA0002364138390000045
Then upsampled feature map and
Figure BDA0002364138390000046
or mixed feature cascade to obtain cascade feature f cat The formula used is:
Figure BDA0002364138390000047
in the formula (f) i ×) represents the pair feature f i Up-sampling, [ c ]]Representing channel cascade operation, j = -1,
Figure BDA0002364138390000048
to represent
Figure BDA0002364138390000049
j =0,1,2,
Figure BDA00023641383900000410
representing cascade characteristics f cat Hybrid features learned through the three convolutional layers;
step 3-2, the cascade characteristic f cat Through three layers of convolution layers, learning of feature fusion is carried out to obtain mixed features
Figure BDA00023641383900000411
The formula used is:
Figure BDA00023641383900000412
step 3-3, repeating step 3-1 and step 3-2 in a bottom-up mode, and enabling the features f in the new feature pyramid set to be new 1 ,f 2 ,f 3 ,f 4 I.e. f 1 ,
Figure BDA00023641383900000413
f 3 ,f 4 Fusing layer by layer to obtainObtaining a set of mixed feature pyramids
Figure BDA00023641383900000414
Further, the saliency prediction network model in step 4 comprises three convolutional layers; a batch normalization layer and an activation layer follow each of the first two convolutional layers, and the last convolutional layer outputs a single-channel saliency map with the same resolution as the original input image.
Further, the training in step 4 of the saliency prediction network model with the features in the mixed feature pyramid set includes:
Step 4-1, perform saliency prediction on the features in the mixed feature pyramid set in turn with the saliency prediction network model;
Step 4-2, compute the loss over all prediction results to obtain the gradient, and iteratively update the parameters of the saliency prediction network model with this gradient through the back-propagation algorithm;
repeat step 4-1 to step 4-2 until the number of iterations exceeds a preset threshold, at which point the training of the saliency prediction network model is complete.
Compared with the prior art, the invention has the following notable advantages: 1) an attention model is used to select image features, enhancing the features related to the image target and making the features more effective; 2) a bottom-up feature fusion structure effectively fuses the detail features of the lower layers with the semantic features of the higher layers, greatly improving the representational capability of the features, so that the detection accuracy is higher than that of common saliency detection networks.
The present invention is described in further detail below with reference to the attached drawing figures.
Drawings
Fig. 1 is a flowchart of the image saliency detection method based on feature selection and feature fusion according to the present invention.
FIG. 2 is a diagram illustrating feature selection performed on a feature map by a spatial attention module according to the present invention.
FIG. 3 is a diagram illustrating feature selection performed on a feature map by a channel attention module according to the present invention.
FIG. 4 is a schematic diagram of bottom-up feature fusion for a feature pyramid in accordance with the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In one embodiment, in conjunction with fig. 1, the present invention provides an image saliency detection method based on feature selection and feature fusion, the method including the following steps:
step 1, extracting features from the input image and adding all of them to a feature pyramid set;
step 2, performing feature selection on the feature pyramid set to obtain a new feature pyramid set;
step 3, fusing the features in the new feature pyramid set in a bottom-up manner to obtain a mixed feature pyramid set;
and step 4, training the saliency prediction network model with the features in the mixed feature pyramid set, and performing saliency detection on the image to be detected with the trained saliency prediction network model.
Further, in one embodiment, the feature extraction on the input image in step 1 is specifically performed with the convolutional neural network ResNeXt, and the specific process includes:
Assume the five convolution blocks of the convolutional neural network ResNeXt are conv_1, conv_2, conv_3, conv_4 and conv_5; the higher feature layers carry rich semantic information, while the lower feature layers carry rich low-level information such as texture.
Step 1-1, input the image into the five convolution blocks in sequence and iterate forward, where the iteration formula is:
f_{i+1} = conv_j(f_i, W_j), j ∈ [1,5], i ∈ [-1,3]
where f_{-1} (the case i = -1) is the image to be detected; for i = -1, 0, 1, 2, 3, f_{i+1} denotes the output of the convolution block conv_1, conv_2, conv_3, conv_4, conv_5 respectively, and W_j denotes the parameters of the convolution block conv_j;
Step 1-2, add the feature map output by each convolution block to the output set, forming the feature pyramid set {f_0, f_1, f_2, f_3, f_4}.
Preferably, as a specific example, conv_1 is a single convolutional layer with a 7x7 convolution kernel, and conv_2, conv_3, conv_4, conv_5 contain 3, 4, 6 and 3 blocks respectively, where each block is the structure commonly used in the ResNet series, namely three serially stacked convolutional layers with kernel sizes of 1x1, 3x3 and 1x1.
Illustratively, as a specific example, assume an input image I of size 3×300×300, i.e. an RGB three-channel picture whose height and width are both 300 pixels. The feature pyramid set obtained through the process of step 1 is {f_0, f_1, f_2, f_3, f_4}, where from one level to the next the spatial resolution of the feature maps decreases and the number of channels grows.
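The construction of this feature pyramid can be illustrated with the following PyTorch sketch. It is only an illustrative reading of step 1 under stated assumptions: the ResNeXt-50 (32x4d) variant from torchvision is used as a stand-in for the backbone, and the placement of the max-pooling layer and the omission of pretrained weights are not fixed by the text above.

```python
import torch
import torchvision

class PyramidExtractor(torch.nn.Module):
    """Collect the five stage outputs of a ResNeXt backbone as {f0..f4} (step 1)."""
    def __init__(self):
        super().__init__()
        net = torchvision.models.resnext50_32x4d(weights=None)
        self.conv1 = torch.nn.Sequential(net.conv1, net.bn1, net.relu)  # conv_1 (7x7)
        self.maxpool = net.maxpool
        self.conv2 = net.layer1   # conv_2 (3 blocks)
        self.conv3 = net.layer2   # conv_3 (4 blocks)
        self.conv4 = net.layer3   # conv_4 (6 blocks)
        self.conv5 = net.layer4   # conv_5 (3 blocks)

    def forward(self, x):
        f0 = self.conv1(x)                  # high-resolution, low-level detail
        f1 = self.conv2(self.maxpool(f0))
        f2 = self.conv3(f1)
        f3 = self.conv4(f2)
        f4 = self.conv5(f3)
        return [f0, f1, f2, f3, f4]         # feature pyramid set

if __name__ == "__main__":
    feats = PyramidExtractor()(torch.randn(1, 3, 300, 300))
    print([tuple(f.shape) for f in feats])  # channel count grows, spatial size shrinks
```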
Further, in one embodiment, the feature selection on the feature pyramid set in step 2 is specifically performed with a spatial attention mechanism and a channel attention mechanism, and the specific process includes:
Step 2-1, perform feature selection on the bottom-level feature map f_0 in the feature pyramid set using spatial attention, obtaining a new bottom-level feature map f'_0;
Step 2-2, perform feature selection on the mid-level feature map f_2 in the feature pyramid set using channel attention, obtaining a new mid-level feature map f'_2;
from the above, the new feature pyramid set {f'_0, f_1, f'_2, f_3, f_4} is obtained.
Further, in one embodiment, in conjunction with FIG. 2, the feature selection in step 2-1 on the bottom-level feature map f_0 in the feature pyramid set using spatial attention, obtaining the new bottom-level feature map f'_0, specifically comprises:
Define the bottom-level feature map f_0 as f_l ∈ R^(w×h×c), where w, h and c denote the width, height and number of channels of the feature map; for the example above, f_l is the conv_1 output of the 300×300 input image. Construct a spatial attention module containing two convolution sub-blocks, denoted conv_11 and conv_22.
Step 2-1-1, pass f_l through conv_11 and conv_22 in turn, with the sub-blocks outputting the feature maps C_1 and C_2 respectively:
C_1 = conv_11(f_l, W_11)
C_2 = conv_22(f_l, W_22)
where W_11 and W_22 are the parameters of the sub-blocks conv_11 and conv_22;
as a specific example, for the example above, f_l is put through conv_11 and conv_22 in sequence, and the output feature maps C_1 and C_2 have the same spatial size as f_l.
Step 2-1-2, add the outputs C_1 and C_2 of the sub-blocks conv_11 and conv_22 element by element, and map the sum to [0,1] with a sigmoid function to obtain the spatial attention weight SA; the specific formula is:
SA = σ(C_1 + C_2)
where σ denotes the sigmoid function; for the example above, SA has the same shape as C_1 and C_2.
Step 2-1-3, perform feature selection on the bottom-level feature map f_0 with the spatial attention weight SA to obtain the new bottom-level feature map f'_0; the formula used is:
f'_0 = SA ⊗ f_l
where ⊗ denotes element-wise multiplication.
further, in one embodiment, the sub-volume block conv 11 、conv 22 Each including two convolutional layers, one of which has 32 convolutional kernels and 3x3 convolutional kernels, and the other has 64 convolutional kernels and 3x3 convolutional kernels.
Further, in one embodiment, in conjunction with FIG. 3, the feature selection in step 2-2 on the mid-level feature map f_2 in the feature pyramid set using channel attention, obtaining the new mid-level feature map f'_2, specifically comprises:
Define the mid-level feature map f_2 as f_m ∈ R^(w×h×C).
Step 2-2-1, expand f_m into a set:
f_m = {f^m_1, f^m_2, ..., f^m_C}
where f^m_i is the i-th channel slice feature of f_m, f^m_i ∈ R^(w×h), i = 1, 2, ..., C, and C is the number of channels of the feature map f_m;
as a specific example, for the example above, f_m has 512 channels and is expanded into the set f_m = {f^m_1, f^m_2, ..., f^m_512}.
Step 2-2-2, perform global pooling on each channel slice feature f^m_i to obtain a channel-level vector v_m; as a specific example, for the example above, v_m is a 512×1-dimensional channel-level vector.
Step 2-2-3, learn a channel-level attention vector from the channel-level vector with two consecutive fully connected layers and a nonlinear activation layer, and map it to [0,1] with a sigmoid function to obtain the channel attention weight CA; the formula is:
CA = F(v_m, W) = σ(fc_2(δ(fc_1(v_m, W_1)), W_2))
where W_1 and W_2 are the parameters of the fully connected layers fc_1 and fc_2 respectively, δ is a nonlinear activation function, and σ is the sigmoid function; for the example above, CA is likewise a 512-dimensional vector.
Step 2-2-4, redistribute the channel weights of the mid-level feature map f_2 with the channel attention weight CA to obtain the new mid-level feature map f'_2; the formula used is:
f'_2 = CA ⊗ f_m
where ⊗ denotes channel-wise multiplication, i.e. each channel slice f^m_i is scaled by the i-th component of CA.
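The channel attention selection can be sketched in PyTorch as follows, for the 512-channel mid-level feature map of the example. The reduction ratio between the two fully connected layers, ReLU as the nonlinear activation δ, and global average pooling as the global pooling are assumptions, since the text above does not specify them.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention selection of the mid-level feature map (step 2-2)."""
    def __init__(self, channels=512, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                  # global pooling of each channel slice
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, f_m):
        b, c, _, _ = f_m.shape
        v_m = self.pool(f_m).view(b, c)                      # channel-level vector v_m
        ca = torch.sigmoid(self.fc2(torch.relu(self.fc1(v_m))))  # CA = sigma(fc2(delta(fc1(v_m))))
        return f_m * ca.view(b, c, 1, 1)                     # f'_2: channel weights redistributed

# illustrative shapes only: the 512-channel mid-level map of the example
f2 = torch.randn(1, 512, 38, 38)
f2_new = ChannelAttention(channels=512)(f2)
```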
further, in one embodiment, with reference to fig. 4, in step 3, feature fusion is performed on features in the new feature pyramid set in a bottom-up manner, so as to obtain a fused feature pyramid set, which specifically includes:
step 3-1, removing new bottom layer characteristic diagram
Figure BDA0002364138390000091
Sampling some other characteristic graph as new bottom characteristic graph
Figure BDA0002364138390000092
Then the up-sampled feature map and
Figure BDA0002364138390000093
or mixed feature cascade to obtain cascade feature f cat The formula used is:
Figure BDA0002364138390000094
in the formula (f) i ×) represents the pair feature f i Up-sampling, [ c ]]Indicating channel cascade operation, j = -1,
Figure BDA0002364138390000095
to represent
Figure BDA0002364138390000096
j =0,1,2,
Figure BDA0002364138390000097
representing cascade characteristics f cat Hybrid features learned through the three convolutional layers;
step 3-2, cascading characteristic f cat Through three layers of convolution layers, learning of feature fusion is carried out to obtain mixed features
Figure BDA0002364138390000098
The formula used is:
Figure BDA0002364138390000099
step 3-3, repeating step 3-1 and step 3-2 in a bottom-up mode, and enabling the features f in the new feature pyramid set to be new 1 ,f 2 ,f 3 ,f 4 I.e. f 1 ,
Figure BDA00023641383900000910
f 3 ,f 4 Fusing layer by layer to obtain a mixed characteristic pyramid set
Figure BDA00023641383900000911
Illustratively, in one embodiment, the convolution kernel sizes of the three convolutional layers in step 3-2 are 3x3 and 1x1 in sequence.
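A sketch of one bottom-up fusion step and of the layer-by-layer loop is given below. The channel widths, a 3x3/3x3/1x1 kernel pattern for the three convolutional layers, bilinear interpolation as the upsampling, and the exact pairing of pyramid levels are assumptions made for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuseBlock(nn.Module):
    """One bottom-up fusion step (steps 3-1 and 3-2)."""
    def __init__(self, in_channels, out_channels=64):
        super().__init__()
        self.convs = nn.Sequential(   # three conv layers that learn the mixed feature
            nn.Conv2d(in_channels, out_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 1),
        )

    def forward(self, f_mix, f_higher):
        up = F.interpolate(f_higher, size=f_mix.shape[-2:], mode="bilinear",
                           align_corners=False)      # up(f_i): match the running feature's size
        f_cat = torch.cat([up, f_mix], dim=1)         # [up(f_i), f_mix]_c, channel concatenation
        return self.convs(f_cat)                      # new mixed feature

def bottom_up_fusion(pyramid, blocks):
    """pyramid = [f0', f1, f2', f3, f4]; returns the mixed feature pyramid set."""
    f_mix, mixed = pyramid[0], []
    for f_higher, block in zip(pyramid[1:], blocks):  # fuse f1, f2', f3, f4 layer by layer
        f_mix = block(f_mix, f_higher)
        mixed.append(f_mix)
    return mixed

# illustrative channel counts for a ResNeXt-50 pyramid with 64-channel mixed features
blocks = nn.ModuleList([FuseBlock(64 + 256), FuseBlock(64 + 512),
                        FuseBlock(64 + 1024), FuseBlock(64 + 2048)])
```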
Further, in one embodiment, the saliency prediction network model in step 4 comprises three convolutional layers; a batch normalization layer and an activation layer follow each of the first two convolutional layers, and the last convolutional layer outputs a single-channel saliency map with the same resolution as the original input image.
In one embodiment, the convolution kernel sizes of the three convolutional layers of the saliency prediction network model are 3x3 and 1x1 in sequence.
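The prediction head can be sketched as follows; the hidden channel width, the 3x3/3x3/1x1 kernel pattern, and bilinear interpolation to bring the single-channel map back to the input resolution are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SaliencyHead(nn.Module):
    """Three-layer saliency prediction head (step 4)."""
    def __init__(self, in_channels=64, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, hidden, 3, padding=1), nn.BatchNorm2d(hidden), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.BatchNorm2d(hidden), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, 1),                   # single-channel saliency logits
        )

    def forward(self, f_mix, out_size):
        s = self.body(f_mix)
        # resize the single-channel map to the original input resolution
        return F.interpolate(s, size=out_size, mode="bilinear", align_corners=False)

# illustrative call: a 64-channel mixed feature from the 300x300 example
saliency = SaliencyHead()(torch.randn(1, 64, 150, 150), out_size=(300, 300))
```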
Further, in one embodiment, the saliency prediction network model is trained in step 4 with the features in the mixed feature pyramid set, and the specific process includes:
Step 4-1, perform saliency prediction on the features in the mixed feature pyramid set in turn with the saliency prediction network model;
Step 4-2, compute the loss over all prediction results to obtain the gradient, and iteratively update the parameters of the saliency prediction network model with this gradient through the back-propagation algorithm;
repeat step 4-1 to step 4-2 until the number of iterations exceeds a preset threshold, at which point the training of the saliency prediction network model is complete.
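The training procedure of step 4 can be sketched as below. The binary cross-entropy loss, the Adam optimizer, summing the losses of all mixed-feature predictions, and the names extractor / attention / fusion / heads for modules built as in the earlier sketches are all assumptions; the text above only specifies loss computation, back-propagation, and a preset iteration threshold.

```python
import torch
import torch.nn as nn

def train(extractor, attention, fusion, heads, loader, max_iters=10000, lr=1e-4):
    """Deep-supervised training sketch over the mixed feature pyramid (steps 4-1 and 4-2)."""
    modules = nn.ModuleList([extractor, attention, fusion, heads])
    optimizer = torch.optim.Adam(modules.parameters(), lr=lr)
    bce = nn.BCEWithLogitsLoss()
    it = 0
    while it < max_iters:                              # preset iteration threshold
        for image, mask in loader:                     # mask: ground-truth saliency map
            pyramid = extractor(image)                 # step 1: feature pyramid set
            pyramid = attention(pyramid)               # step 2: feature selection
            mixed = fusion(pyramid)                    # step 3: mixed feature pyramid set
            # step 4-1: predict saliency from every mixed feature; step 4-2: loss and update
            loss = sum(bce(head(f, mask.shape[-2:]), mask)
                       for head, f in zip(heads, mixed))
            optimizer.zero_grad()
            loss.backward()                            # back-propagation
            optimizer.step()
            it += 1
            if it >= max_iters:
                break
```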
The invention uses an attention model to select image features, enhancing the features related to the image target and making the features more effective, and uses a bottom-up feature fusion structure to effectively fuse the detail features of the lower layers with the semantic features of the higher layers, greatly improving the representational capability of the features; the detection accuracy is higher than that of common saliency detection networks.

Claims (6)

1. An image saliency detection method based on feature selection and feature fusion, characterized by comprising the following steps:
step 1, extracting features from the input image and adding all of them to a feature pyramid set; the feature extraction on the input image is specifically performed with the convolutional neural network ResNeXt, and the specific process includes:
assume the five convolution blocks of the convolutional neural network ResNeXt are conv_1, conv_2, conv_3, conv_4 and conv_5;
step 1-1, input the image into the five convolution blocks in sequence and iterate forward, where the iteration formula is:
f_{i+1} = conv_j(f_i, W_j), j ∈ [1,5], i ∈ [-1,3]
where f_{-1} (the case i = -1) is the image to be detected; for i = -1, 0, 1, 2, 3, f_{i+1} denotes the output of the convolution block conv_1, conv_2, conv_3, conv_4, conv_5 respectively, and W_j denotes the parameters of the convolution block conv_j;
step 1-2, add the feature map output by each convolution block to the output set, forming the feature pyramid set {f_0, f_1, f_2, f_3, f_4};
step 2, performing feature selection on the feature pyramid set to obtain a new feature pyramid set; the feature selection on the feature pyramid set is specifically performed with a spatial attention mechanism and a channel attention mechanism, and the specific process includes:
step 2-1, perform feature selection on the bottom-level feature map f_0 in the feature pyramid set using spatial attention, obtaining a new bottom-level feature map f'_0;
step 2-2, perform feature selection on the mid-level feature map f_2 in the feature pyramid set using channel attention, obtaining a new mid-level feature map f'_2;
from the above, the new feature pyramid set {f'_0, f_1, f'_2, f_3, f_4} is obtained;
Step 3, performing feature fusion on the features in the new feature pyramid set in a bottom-up manner to obtain a mixed feature pyramid set; the feature fusion is performed on the features in the new feature pyramid set in a bottom-up manner to obtain a fused feature pyramid set, and the method specifically includes:
step 3-1, removing new bottom layer characteristic diagram
Figure FDA0003796149800000014
Sampling some other characteristic graph as new bottom characteristic graph
Figure FDA0003796149800000015
Then upsampled feature map and
Figure FDA0003796149800000021
or mixed feature cascade to obtain cascade feature f cat The formula used is:
Figure FDA0003796149800000022
in the formula (f) i ×) represents the pair feature f i Up-sampling, [ c ]]Indicating channel cascade operation, j = -1,
Figure FDA0003796149800000023
represent
Figure FDA0003796149800000024
j =0,1,2,
Figure FDA0003796149800000025
representing cascade characteristics f cat The hybrid features learned through the three convolutional layers;
step 3-2, the cascade characteristic f cat Through three layers of convolution layers, learning of feature fusion is carried out to obtain mixed features
Figure FDA0003796149800000026
The formula used is:
Figure FDA0003796149800000027
step 3-3, repeating step 3-1 and step 3-2 in a bottom-up mode, and enabling the features f in the new feature pyramid set to be in the same shape 1 ,f 2 ,f 3 ,f 4 I.e. f 1 ,
Figure FDA0003796149800000028
f 3 ,f 4 Fusing layer by layer to obtain a pyramid set with mixed features
Figure FDA0003796149800000029
Step 4, training a significance prediction network model by using the features in the mixed feature pyramid set, and performing significance detection on an image to be detected by using the trained significance prediction network model; the significance prediction network model comprises three convolutional layers, wherein a batch regularization layer and an activation layer are added behind the first two convolutional layers, and the last convolutional layer outputs a significance map which is a single channel and has the same resolution as the original input image.
2. The image saliency detection method based on feature selection and feature fusion according to claim 1, characterized in that the feature selection in step 2-1 on the bottom-level feature map f_0 in the feature pyramid set using spatial attention, obtaining the new bottom-level feature map f'_0, specifically comprises:
define the bottom-level feature map f_0 as f_l ∈ R^(w×h×c), where w, h and c denote the width, height and number of channels of the feature map; construct a spatial attention module containing two convolution sub-blocks, denoted conv_11 and conv_22;
step 2-1-1, pass f_l through conv_11 and conv_22 in turn, with the sub-blocks outputting the feature maps C_1 and C_2 respectively:
C_1 = conv_11(f_l, W_11)
C_2 = conv_22(f_l, W_22)
where W_11 and W_22 are the parameters of the sub-blocks conv_11 and conv_22;
step 2-1-2, add the outputs C_1 and C_2 of the sub-blocks conv_11 and conv_22 element by element, and map the sum to [0,1] with a sigmoid function to obtain the spatial attention weight SA; the specific formula is:
SA = σ(C_1 + C_2)
where σ denotes the sigmoid function;
step 2-1-3, perform feature selection on the bottom-level feature map f_0 with the spatial attention weight SA to obtain the new bottom-level feature map f'_0; the formula used is:
f'_0 = SA ⊗ f_l
where ⊗ denotes element-wise multiplication.
3. The image saliency detection method based on feature selection and feature fusion according to claim 2, characterized in that the sub-blocks conv_11 and conv_22 each contain two convolutional layers, where one layer has 32 convolution kernels of size 3x3 and the other layer has 64 convolution kernels of size 3x3.
4. The image saliency detection method based on feature selection and feature fusion according to claim 1, characterized in that the feature selection in step 2-2 on the mid-level feature map f_2 in the feature pyramid set using channel attention, obtaining the new mid-level feature map f'_2, specifically comprises:
define the mid-level feature map f_2 as f_m ∈ R^(w×h×C);
step 2-2-1, expand f_m into a set:
f_m = {f^m_1, f^m_2, ..., f^m_C}
where f^m_i is the i-th channel slice feature of f_m, f^m_i ∈ R^(w×h), and C is the number of channels of the feature map f_m;
step 2-2-2, perform global pooling on each channel slice feature f^m_i to obtain a channel-level vector v_m ∈ R^(C×1);
step 2-2-3, learn a channel-level attention vector from the channel-level vector with two consecutive fully connected layers and a nonlinear activation layer, and map it to [0,1] with a sigmoid function to obtain the channel attention weight CA; the formula is:
CA = F(v_m, W) = σ(fc_2(δ(fc_1(v_m, W_1)), W_2))
where W_1 and W_2 are the parameters of the fully connected layers fc_1 and fc_2 respectively, δ is a nonlinear activation function, and σ is the sigmoid function;
step 2-2-4, redistribute the channel weights of the mid-level feature map f_2 with the channel attention weight CA to obtain the new mid-level feature map f'_2; the formula used is:
f'_2 = CA ⊗ f_m
where ⊗ denotes channel-wise multiplication.
5. the image significance detection method based on feature selection and feature fusion as claimed in claim 1, wherein the convolution kernel size of the three layers of convolution layers in step 3-2 is 3x3,1x1 in sequence.
6. The image saliency detection method based on feature selection and feature fusion according to claim 1, characterized in that in step 4 the saliency prediction network model is trained with the features in the mixed feature pyramid set, and the specific process includes:
step 4-1, perform saliency prediction on the features in the mixed feature pyramid set in turn with the saliency prediction network model;
step 4-2, compute the loss over all prediction results to obtain the gradient, and iteratively update the parameters of the saliency prediction network model with this gradient through the back-propagation algorithm;
repeat step 4-1 to step 4-2 until the number of iterations exceeds a preset threshold, at which point the training of the saliency prediction network model is complete.
CN202010030505.8A 2020-01-13 2020-01-13 Image saliency detection method based on feature selection and feature fusion Active CN111275076B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010030505.8A CN111275076B (en) 2020-01-13 2020-01-13 Image saliency detection method based on feature selection and feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010030505.8A CN111275076B (en) 2020-01-13 2020-01-13 Image saliency detection method based on feature selection and feature fusion

Publications (2)

Publication Number Publication Date
CN111275076A CN111275076A (en) 2020-06-12
CN111275076B true CN111275076B (en) 2022-10-21

Family

ID=70997061

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010030505.8A Active CN111275076B (en) 2020-01-13 2020-01-13 Image saliency detection method based on feature selection and feature fusion

Country Status (1)

Country Link
CN (1) CN111275076B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111931793B (en) * 2020-08-17 2024-04-12 湖南城市学院 Method and system for extracting saliency target
CN112927209B (en) * 2021-03-05 2022-02-11 重庆邮电大学 CNN-based significance detection system and method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109165660A (en) * 2018-06-20 2019-01-08 扬州大学 A kind of obvious object detection method based on convolutional neural networks
CN110619638A (en) * 2019-08-22 2019-12-27 浙江科技学院 Multi-mode fusion significance detection method based on convolution block attention module

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI709107B (en) * 2018-05-21 2020-11-01 國立清華大學 Image feature extraction method and saliency prediction method including the same

Also Published As

Publication number Publication date
CN111275076A (en) 2020-06-12


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant