CN117726809A - Small sample semantic segmentation method based on information interaction enhancement - Google Patents

Small sample semantic segmentation method based on information interaction enhancement

Info

Publication number
CN117726809A
CN117726809A (application CN202311575329.6A)
Authority
CN
China
Prior art keywords
query
feature
features
support
level
Prior art date
Legal status
Pending
Application number
CN202311575329.6A
Other languages
Chinese (zh)
Inventor
王诗言
谢博
张驰
徐慧玲
王译苹
Current Assignee
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202311575329.6A priority Critical patent/CN117726809A/en
Publication of CN117726809A publication Critical patent/CN117726809A/en
Pending legal-status Critical Current


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of computer vision, and particularly relates to a small sample semantic segmentation method based on information interaction enhancement. The method comprises the following steps: acquiring a support image, a corresponding truth mask and a query image to be segmented; processing the support image and the query image with a ResNet to obtain support mid-level and high-level features and query mid-level and high-level features; processing the truth mask of the support image, the support mid-level feature, the support high-level feature, the query mid-level feature and the query high-level feature with a query prior generation module, a guide attention module and a spatial information interaction attention module to obtain four output features; inputting the four output features of the three modules into a multi-scale fusion network to obtain a refined query feature; and inputting the refined query feature into a decoder to obtain a predicted segmentation result for the query image. The method makes the model more robust and enables it to segment the target object better.

Description

Small sample semantic segmentation method based on information interaction enhancement
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a small sample semantic segmentation method based on information interaction enhancement.
Background
Thanks to the rapid development of a series of supervised convolutional neural network architectures, deep learning has made significant advances in semantic segmentation. However, most models are trained in a fully supervised manner: their performance depends on the quality and quantity of labeled data, and they generalize poorly when segmenting unseen classes. For dense prediction tasks in particular, collecting and annotating data requires a significant amount of time and cost. Against this background, researchers have introduced few-shot learning techniques into semantic segmentation, giving rise to small sample semantic segmentation methods that effectively alleviate these problems.
Small sample semantic segmentation aims to segment a target object in a query image by using a few support images containing the same category together with their masks. How to effectively extract and use the information provided by the support images to guide segmentation of the query image is therefore the key factor affecting few-shot semantic segmentation performance. The most popular approach is the prototype-based structure: a representative prototype is extracted by mask average pooling over the target region of the support image, and the prototype is matched against the query image features to guide segmentation of the target object. Methods such as PFENet, PANet and ASGNet have proven this approach effective. However, the generated support prototype is single and lacks spatial information; when the appearance or shape of the target object differs greatly between the support and query images, directly using the prototype can cause semantic ambiguity and thus inaccurate segmentation. Although some methods, such as PMMs and PAMs, generate multiple prototypes, they remain suboptimal in pixel-level feature matching between support prototypes and query features, and they do not fully exploit the semantic information contained in the support image and the query image to interact, compensate for intra-class differences, and increase the generalization capability of the model.
Therefore, it is necessary to provide a small sample semantic segmentation method that makes full use of the semantic information of the support image and the query image, enhances the information interaction between the two, and improves model robustness.
Disclosure of Invention
Aiming at the defects existing in the prior art, the invention provides a small sample semantic segmentation method based on information interaction enhancement, which comprises the following steps:
s1: acquiring a support image, a corresponding true value mask and a query image to be segmented;
s2: processing the support image and the query image by adopting ResNet as a backbone feature extraction network to obtain support medium-level features, support high-level features, query medium-level features and query high-level features;
s3: adopting a query priori generating module to process the true value mask, the support middle-level features, the support high-level features and the query high-level features of the support image to obtain a query priori mask and a support prototype feature vector;
s4: processing the true value mask of the support image and the support intermediate level feature by adopting a guide attention module to obtain an attention feature map;
s5: processing the truth mask, the support middle-level feature and the query middle-level feature of the support image by adopting a spatial information interaction attention module to obtain a final query feature;
s6: inputting the query prior mask, the support prototype feature vector, the attention feature map and the final query feature into a multi-scale fusion network for processing to obtain refined query features;
s7: and inputting the refined query features into a decoder for processing to obtain a prediction segmentation result of the query image.
Preferably, step S3 specifically includes:
carrying out a Hadamard product between the support high-level feature and the truth mask of the support image, and then carrying out pixel-level cosine similarity calculation between the masked support high-level feature and the query high-level feature to obtain a query prior mask;
and carrying out mask average pooling on the support mid-level feature with the truth mask of the support image to obtain the support prototype feature vector.
Further, the formula for obtaining the support prototype feature vector is:
V_S = MAP(F_S^m ⊙ ζ(M_S))
wherein V_S represents the support prototype feature vector, ζ(·) represents a function that resizes M_S to match the size of F_S^m, M_S represents the truth mask corresponding to the support image, F_S^m represents the support mid-level feature, ⊙ represents the Hadamard product, and MAP(·) represents mask average pooling.
Preferably, step S4 specifically includes:
carrying out mask average pooling on the support mid-level feature with the truth mask of the support image to obtain a support prototype feature vector;
performing convolution operation and feature activation on the support prototype feature vector to obtain an attention vector;
and carrying out Hadamard product on the support intermediate-level features and the attention vector to obtain an attention feature map.
Preferably, step S5 specifically includes:
carrying out Hadamard product on the truth mask of the support image and the support intermediate-level feature to obtain the support feature;
performing linear mapping on the support features to obtain support linear features; performing linear mapping on the query mid-level features to obtain a first query linear feature and a second query linear feature;
performing matrix multiplication calculation on the supported linear features and the first query linear features to obtain an affinity matrix;
performing matrix multiplication calculation on the affinity matrix and the second query linear feature, and performing residual connection with the query mid-level feature to obtain a first query feature;
respectively carrying out average pooling and maximum pooling operation on the first query feature in the channel dimension, and splicing the two pooling features to obtain a spliced feature;
performing convolution operation and feature activation on the spliced features to obtain a space attention diagram;
carrying out Hadamard product on the space attention diagram and the query mid-level feature to obtain a second query feature;
fusing the first query feature and the second query feature to obtain a final query feature.
Further, the formula for fusing the first query feature and the second query feature is:
F_q^final = f_1×1([F_q^1 ‖ F_q^2])
wherein F_q^final represents the final query feature, F_q^1 represents the first query feature, F_q^2 represents the second query feature, [· ‖ ·] represents a splicing operation, and f_1×1(·) represents a convolution operation with a convolution kernel size of 1×1.
Preferably, step S6 specifically includes:
downsampling the final query feature and the attention feature map to obtain 4 new query features with different scales and 4 new attention feature maps with different scales;
expanding the support prototype feature vectors to be the same as the shape of the 4 new query features with different scales to obtain 4 new support prototype feature vectors with different scales;
adjusting the query prior mask to the same spatial size as the 4 new attention feature maps of different scales to obtain 4 new query prior masks of different scales;
splicing the new query features, new attention feature maps, new support prototype feature vectors and new query prior masks in the channel dimension, and convolving the spliced results to obtain 4 rough query features of different scales;
performing feature interaction processing on the 4 rough query features with different scales to obtain 4 initial refined query features;
and 4 kinds of initial refined query features are spliced, feature fusion and dimension reduction are carried out on the spliced results through convolution, and the final refined query features are obtained.
Preferably, the decoder comprises a 3 x 3 convolution with residual connection and a classifier.
The beneficial effects of the invention are as follows: the small sample semantic segmentation method based on information interaction enhancement provided by the invention can segment the target object with only a small amount of labeled data, reducing the time and cost of data collection and annotation; through the efficient spatial information interaction attention network, semantic information interaction between the support image and the query image is enhanced and the features of the query image are further enriched, while the target-class pixels to be segmented become more salient, reducing the negative effects brought by intra-class differences; the mask of the support image is used to guide the support features, so that the network is biased toward learning class-related information, which effectively alleviates the problem that new-class information is suppressed by base-class information during segmentation and improves the generalization of the model; and through interaction of feature information at different scales, the obtained query image features are richer and finer, effectively improving the accuracy of the model.
Drawings
FIG. 1 is a schematic diagram of an overall network architecture for small sample semantic segmentation in the present invention;
FIG. 2 is a schematic diagram of a network structure of a directing attention module according to the present invention;
FIG. 3 is a schematic diagram of a network structure of a spatial information interaction attention module according to the present invention;
FIG. 4 is a schematic diagram of a multi-scale feature fusion network according to the present invention;
fig. 5 is a schematic diagram of a network module structure for feature interaction at two scales in the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention provides a small sample semantic segmentation method based on information interaction enhancement, which comprises the following steps:
S1: acquiring a support image I_S, a corresponding truth mask M_S, and a query image I_q to be segmented.
S2: and processing the support image and the query image by adopting the ResNet as a backbone feature extraction network to obtain the support mid-level features, the support high-level features, the query mid-level features and the query high-level features.
Preferably, a ResNet-50 pre-training model is used as the backbone feature extraction network; the overall network structure is shown in figure 1. The model concatenates the features output by block_2 and block_3 in the channel dimension and then performs feature fusion and dimension reduction through a 1×1 convolution to obtain the mid-level features of the support image and the query image; the features output by block_4 are taken as the high-level features of the support image and the query image. The process is expressed as:
F_S^m = f_1×1([B_2(I_S) ‖ B_3(I_S)]), F_S^h = B_4(I_S)
F_q^m = f_1×1([B_2(I_q) ‖ B_3(I_q)]), F_q^h = B_4(I_q)
wherein I_S and I_q are the support image and the query image respectively, B_i(·) represents the feature output by the i-th block, f_1×1(·) represents a convolution operation with a convolution kernel size of 1×1, F_S^m, F_S^h, F_q^m and F_q^h represent the support mid-level, support high-level, query mid-level and query high-level features respectively, and [· ‖ ·] represents a splicing (concatenation) operation.
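For illustration, the feature extraction step can be written as a short PyTorch sketch. This is a minimal sketch under assumptions: torchvision's ResNet-50 is used, its layer2/layer3/layer4 stand in for block_2/block_3/block_4, the fusion convolution is given the hypothetical name mid_conv, and block_3 is resized before concatenation; none of these names or channel widths are prescribed by the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class FeatureExtractor(nn.Module):
    """Backbone producing mid-level and high-level features (sketch, assumptions noted above)."""
    def __init__(self):
        super().__init__()
        backbone = resnet50(weights=None)  # in practice a pre-trained model would be loaded
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool)
        self.layer1, self.layer2 = backbone.layer1, backbone.layer2
        self.layer3, self.layer4 = backbone.layer3, backbone.layer4
        # 1x1 convolution fusing block_2 (512 ch) and block_3 (1024 ch) outputs into a mid-level feature
        self.mid_conv = nn.Conv2d(512 + 1024, 256, kernel_size=1)

    def forward(self, x):
        x = self.stem(x)
        b2 = self.layer2(self.layer1(x))   # block_2 output
        b3 = self.layer3(b2)               # block_3 output
        b4 = self.layer4(b3)               # block_4 output -> high-level feature
        # resize block_3 to block_2's spatial size before channel-wise concatenation
        b3_up = F.interpolate(b3, size=b2.shape[-2:], mode='bilinear', align_corners=False)
        mid = self.mid_conv(torch.cat([b2, b3_up], dim=1))   # mid-level feature
        return mid, b4

# usage (shapes assumed): mid/high-level features for the support and query images
# extractor = FeatureExtractor()
# f_s_mid, f_s_high = extractor(support_img)
# f_q_mid, f_q_high = extractor(query_img)
```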
S3: and processing the truth value mask, the support middle-level features, the support high-level features and the query high-level features of the support image by adopting a query prior generation module to obtain a query prior mask and a support prototype feature vector.
In order to better activate target class pixels in the query image and obtain richer query image features, the invention constructs a query prior generation module.
The process by which the query prior generation module handles the truth mask of the support image, the support mid-level feature, the support high-level feature and the query high-level feature comprises the following steps:
Performing a Hadamard product between the support high-level feature and the truth mask of the support image, and then performing pixel-level cosine similarity calculation between the masked support high-level feature and the query high-level feature to obtain a query prior mask M_q. Because the support image features retain only features relevant to the target-class pixels after the Hadamard product with the mask, the higher the similarity value a query pixel obtains, the greater the probability that the pixel is a target pixel.
In order to better activate target-class pixels in the query image, a support prototype feature vector V_S is also extracted through mask average pooling. Specifically, mask average pooling is performed on the support mid-level feature with the truth mask of the support image:
V_S = MAP(F_S^m ⊙ ζ(M_S)), V_S ∈ R^(C×1×1)
wherein V_S represents the support prototype feature vector, ζ(·) represents a function that resizes M_S to match the size of F_S^m, M_S represents the truth mask corresponding to the support image, F_S^m represents the support mid-level feature, ⊙ represents the Hadamard product, MAP(·) represents the mask average pooling operation, and C represents the number of feature channels.
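The following PyTorch sketch illustrates the two outputs of this module under stated assumptions: reducing the pixel-level cosine similarities to one prior value per query pixel by taking the maximum over support foreground pixels is an assumed choice (the patent only states "pixel-level cosine similarity"), and all function names are hypothetical.

```python
import torch
import torch.nn.functional as F

def mask_average_pool(feat, mask):
    """MAP(.): average the feature over mask==1 pixels, giving a (B, C, 1, 1) prototype."""
    mask = F.interpolate(mask, size=feat.shape[-2:], mode='bilinear', align_corners=False)
    num = (feat * mask).sum(dim=(2, 3), keepdim=True)
    den = mask.sum(dim=(2, 3), keepdim=True).clamp(min=1e-6)
    return num / den

def query_prior_generation(f_s_high, f_q_high, f_s_mid, mask_s):
    # Hadamard product of the support high-level feature with the (resized) truth mask
    m = F.interpolate(mask_s, size=f_s_high.shape[-2:], mode='bilinear', align_corners=False)
    f_s_fg = f_s_high * m

    # pixel-level cosine similarity between query pixels and support foreground pixels;
    # the per-pixel maximum is an assumed reduction to a single prior value
    b, c, h, w = f_q_high.shape
    q = F.normalize(f_q_high.flatten(2), dim=1)        # (B, C, HW_q)
    s = F.normalize(f_s_fg.flatten(2), dim=1)          # (B, C, HW_s)
    sim = torch.bmm(q.transpose(1, 2), s)              # (B, HW_q, HW_s)
    prior = sim.max(dim=2).values.view(b, 1, h, w)     # query prior mask M_q

    # support prototype feature vector V_S from the mid-level feature
    v_s = mask_average_pool(f_s_mid, mask_s)           # (B, C, 1, 1)
    return prior, v_s
```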
S4: and processing the true value mask of the supporting image and the supporting intermediate-level features by adopting the directing attention module to obtain an attention feature map.
The invention builds a directing attention module, the aim of which is to use the support image and its corresponding mask to guide the model to better learn and understand category-related information, so as to alleviate the problem that, because there is little data for new categories, new-category information is suppressed by base-category information during segmentation. As shown in fig. 2, the process by which the directing attention module handles the truth mask of the support image and the support mid-level feature includes:
The directing attention module has two inputs, the support mid-level feature F_S^m and the binary truth mask M_S corresponding to the support image. Mask average pooling is performed on the support mid-level feature with the truth mask to extract a support prototype feature vector related only to the target category:
V_S = MAP(F_S^m ⊙ ζ(M_S))
A series of convolution operations and feature activation are then performed on the support prototype feature vector to obtain an attention vector V_A:
V_A = σ(Conv(V_S))
wherein σ denotes the sigmoid activation function and Conv(·) represents a series of convolution operations.
The attention vector is used to suppress category-irrelevant features in the support feature, yielding an attention feature map G_S that carries a category-related information representation. Specifically, the Hadamard product is performed between the support mid-level feature and the attention vector:
G_S = F_S^m ⊙ V_A, G_S ∈ R^(C×H×W)
where H represents the height of the feature map and W represents the width of the feature map.
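As a minimal sketch of this module, assuming the unspecified "series of convolution operations" is two 1×1 convolutions with a ReLU in between (the class name and channel width are likewise assumptions):

```python
import torch.nn as nn
import torch.nn.functional as F

class DirectingAttention(nn.Module):
    """Directing attention module (sketch): prototype -> conv + sigmoid -> channel reweighting."""
    def __init__(self, channels=256):
        super().__init__()
        # "a series of convolution operations" is modelled here as two 1x1 convolutions (assumption)
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, f_s_mid, mask_s):
        mask = F.interpolate(mask_s, size=f_s_mid.shape[-2:], mode='bilinear', align_corners=False)
        # mask average pooling -> support prototype V_S of shape (B, C, 1, 1)
        v_s = (f_s_mid * mask).sum(dim=(2, 3), keepdim=True) / mask.sum(dim=(2, 3), keepdim=True).clamp(min=1e-6)
        v_a = self.sigmoid(self.conv(v_s))   # attention vector V_A
        return f_s_mid * v_a                 # attention feature map G_S (Hadamard product)
```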
S5: and processing the truth mask, the support middle-level feature and the query middle-level feature of the support image by adopting the spatial information interaction attention module to obtain the final query feature.
The invention constructs a spatial information interaction attention module. Its aim is to highlight target-region features in the query feature and to make full use of the spatial and detail information provided by the support image by performing pixel-level information interaction between the support image and the query image, thereby obtaining more pixel-similarity discrimination information between the two to act on the query image features. As shown in fig. 3, the process by which the spatial information interaction attention module handles the truth mask of the support image, the support mid-level feature and the query mid-level feature includes:
The spatial information interaction attention module has three inputs: the support mid-level feature F_S^m, the truth mask M_S and the query mid-level feature F_q^m. The module is divided into two parts.
A first part: M_S is applied to the support image feature to obtain a support feature F_S^1 that retains only information about the target object, which benefits more accurate attention to the target object in subsequent steps. Specifically, the Hadamard product is performed between the truth mask of the support image and the support mid-level feature:
F_S^1 = F_S^m ⊙ ζ(M_S)
Linear mapping is performed on the support feature to obtain the support linear feature, denoted here θ(F_S^1); linear mapping is performed on the query mid-level feature to obtain a first query linear feature φ(F_q^m) and a second query linear feature g(F_q^m).
Matrix multiplication is performed on the support linear feature and the first query linear feature to obtain an affinity matrix A between the support features and the query features:
A = θ(F_S^1) ⊗ φ(F_q^m)
wherein ⊗ represents matrix multiplication (after reshaping the features into matrices).
Because F_S^1 retains only foreground features of the target class, pixels belonging to target-class features receive greater weight in A when computing the similarity between support-feature pixels and query-feature pixels.
A is then applied to the query feature and a residual connection is added, giving a query feature F_q^1 that contains richer feature information and in which the target-class pixels are more salient. Specifically, matrix multiplication is performed between the affinity matrix and the second query linear feature, followed by a residual connection with the query mid-level feature, to obtain the first query feature F_q^1:
F_q^1 = A ⊗ g(F_q^m) + F_q^m
A second part: to better capture the spatial information of the query feature, unlike the previous branch, average pooling and max pooling are performed on the first query feature F_q^1 in the channel dimension to obtain the pooled features:
F_avg = AvgPool(F_q^1), F_max = MaxPool(F_q^1)
wherein AvgPool(·) and MaxPool(·) represent the average pooling and max pooling operations respectively, and F_avg, F_max ∈ R^(1×H×W).
The obtained F_avg and F_max are spliced in the channel dimension, feature fusion is performed on the spliced feature through a 7×7 convolution, and finally feature activation is applied to obtain the spatial attention map SA:
SA = σ(f_7×7([F_avg ‖ F_max]))
wherein f_7×7(·) represents a convolution operation with a convolution kernel size of 7×7.
Applying SA to the query feature yields a new query feature (the second query feature) F_q^2. Specifically, the Hadamard product is performed between the spatial attention map and the query mid-level feature:
F_q^2 = SA ⊙ F_q^m
The second query feature and the first query feature obtained in the first part are fused to obtain the final query feature:
F_q^final = f_1×1([F_q^1 ‖ F_q^2])
In the final query feature, the target-class pixels are more salient and the feature information is richer.
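The module can be sketched in PyTorch as follows. The 1×1 convolutions used for the linear mappings θ, φ and g, the softmax normalisation of the affinity matrix, and the assumption that support and query features share the same spatial size are choices made for this sketch; the patent does not specify them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialInteractionAttention(nn.Module):
    """Spatial information interaction attention module (sketch, assumptions noted above)."""
    def __init__(self, channels=256):
        super().__init__()
        self.theta = nn.Conv2d(channels, channels, 1)   # support linear mapping theta
        self.phi = nn.Conv2d(channels, channels, 1)     # first query linear mapping phi
        self.g = nn.Conv2d(channels, channels, 1)       # second query linear mapping g
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)   # 7x7 conv for spatial attention
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)    # 1x1 fusion of the two branches

    def forward(self, f_s_mid, mask_s, f_q_mid):
        b, c, h, w = f_q_mid.shape
        mask = F.interpolate(mask_s, size=f_s_mid.shape[-2:], mode='bilinear', align_corners=False)
        f_s_fg = f_s_mid * mask                          # F_S^1: keep only target-object support features

        # part 1: affinity between query and (foreground) support pixels; support and query
        # features are assumed to share the same spatial size here
        s = self.theta(f_s_fg).flatten(2)                # (B, C, HW)
        q1 = self.phi(f_q_mid).flatten(2)                # (B, C, HW)
        q2 = self.g(f_q_mid).flatten(2)                  # (B, C, HW)
        affinity = torch.bmm(q1.transpose(1, 2), s)      # (B, HW, HW)
        affinity = F.softmax(affinity, dim=-1)           # normalisation is an assumption
        agg = torch.bmm(affinity, q2.transpose(1, 2))    # apply affinity to the second query linear feature
        f_q1 = agg.transpose(1, 2).reshape(b, c, h, w) + f_q_mid   # first query feature (residual)

        # part 2: spatial attention from channel-wise average and max pooling of F_q^1
        avg = f_q1.mean(dim=1, keepdim=True)
        mx = f_q1.max(dim=1, keepdim=True).values
        sa = torch.sigmoid(self.spatial_conv(torch.cat([avg, mx], dim=1)))   # spatial attention map SA
        f_q2 = f_q_mid * sa                                                  # second query feature

        # fuse the two branches with a 1x1 convolution -> final query feature
        return self.fuse(torch.cat([f_q1, f_q2], dim=1))
```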
S6: and inputting the query prior mask, the support prototype feature vector, the attention feature map and the final query feature into a multi-scale fusion network for processing to obtain the refined query feature.
The invention constructs a multi-scale fusion network whose aim is to fuse F_q^final, V_S, G_S and the query prior mask M_q so as to better activate target-class pixels in the query image, enrich the query features and achieve more accurate segmentation of the target object; that is, by effectively fusing multi-scale contextual information about the target object, it addresses the inaccurate segmentation caused by large scale differences between the target object in the support image and in the query image.
The overall structure of the fusion network is shown in fig. 4. It has four inputs, namely the final query feature, the support prototype feature vector, the attention feature map with the category-information representation, and the query prior mask; the output of the network is a refined query feature with rich information. The process by which the multi-scale fusion network handles the query prior mask, the support prototype feature vector, the attention feature map and the final query feature includes:
for final query featuresAnd a attention profile G S Downsampling to obtain 4 different scalesNew query feature F of degree i =[F 1 ,F 2 ,F 3 ,F 4 ]And 4 different scales of new attention profile G i =[G 1 ,G 2 ,G 3 ,G 4 ]The method comprises the steps of carrying out a first treatment on the surface of the The 4 scales are [60×60,30× 30,15 ×15,8×8 ]]。
The support prototype feature vector V_S is expanded to the same shape as the 4 new query features of different scales, giving 4 new support prototype feature vectors V_i = [V_1, V_2, V_3, V_4] of different scales.
The query prior mask M_q is resized to the same spatial size as the 4 new attention feature maps of different scales, giving 4 new query prior masks M_i = [M_1, M_2, M_3, M_4] of different scales.
The new query feature F_i, the new attention feature map G_i, the new support prototype feature vector V_i and the new query prior mask M_i are spliced in the channel dimension. Simple feature fusion is performed on the spliced result through a 1×1 convolution, reducing the number of channels, to obtain 4 coarse query features of different scales F_q^i = [F_q^1, F_q^2, F_q^3, F_q^4].
Feature interaction processing is then performed on the 4 coarse query features of different scales to obtain 4 initial refined query features. Specifically, in the process of fusing the multi-scale contexts there are four branches in the horizontal direction, and fine information is transferred from the finer features to the low-resolution coarse features in a top-down fusion manner. The module R(·,·) in the figure realizes the feature interaction between two different scales: another, higher branch serves as the input auxiliary feature F_aux, which enriches the feature F_i of the current branch to obtain the refined query feature F_R:
F_R = R(F_aux, F_i)
As shown in fig. 5, the auxiliary feature, which carries more detail, is first resized to match the shape of the current feature and spliced with it in the channel dimension; a 1×1 convolution is used to extract useful information from the auxiliary feature F_aux, which is then connected residually with the current feature F_i. Next, to reduce the loss of channel information of the query image, the residually connected feature is further refined through a channel attention mechanism. Finally, the refined query feature F_R is obtained through two 3×3 convolutions and a residual connection.
From top to bottom, the features of these 4 scales, F_q^i = [F_q^1, F_q^2, F_q^3, F_q^4], undergo 7 feature interaction operations in total; the first (topmost) branch has no auxiliary feature to interact with, and all the others follow the definition of R(·,·). This yields 4 initial refined query features F_R^i = [F_R^1, F_R^2, F_R^3, F_R^4] containing multi-scale features and contextual information.
The spatial sizes of F_R^1, F_R^2, F_R^3 and F_R^4 are adjusted to H×W and the 4 initial refined query features are spliced; feature fusion and dimension reduction are performed on the spliced result through a 1×1 convolution to obtain the final refined query feature F_final.
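The per-branch interaction unit R(F_aux, F_i) can be sketched as follows; the squeeze-and-excitation-style channel attention block and the exact layer widths are assumptions, since the patent only states that a channel attention mechanism is applied.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureInteraction(nn.Module):
    """R(F_aux, F_i): enrich the current-scale feature with a finer auxiliary feature (sketch)."""
    def __init__(self, channels=256, reduction=16):
        super().__init__()
        self.extract = nn.Conv2d(2 * channels, channels, kernel_size=1)   # 1x1 conv over the concatenation
        # channel attention (squeeze-and-excitation style block, an assumption)
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.refine = nn.Sequential(                                       # two 3x3 convolutions
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, f_aux, f_i):
        # adjust the auxiliary feature to the current branch's spatial size
        f_aux = F.interpolate(f_aux, size=f_i.shape[-2:], mode='bilinear', align_corners=False)
        x = self.extract(torch.cat([f_aux, f_i], dim=1)) + f_i   # extract useful information + residual
        x = x * self.channel_att(x)                              # channel attention reweighting
        return x + self.refine(x)                                # two 3x3 convs + residual -> F_R
```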
S7: and inputting the refined query features into a decoder for processing to obtain a prediction segmentation result of the query image.
The obtained refined query feature F_final is passed through a decoder, which comprises a 3×3 convolution with a residual connection and a classifier, to obtain the final predicted segmentation result.
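A minimal sketch of such a decoder; the two-class output (background vs. target) and the bilinear upsampling back to the input resolution are assumptions made for the sketch.

```python
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """3x3 convolution with a residual connection followed by a classifier (sketch)."""
    def __init__(self, channels=256, num_classes=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.classifier = nn.Conv2d(channels, num_classes, kernel_size=1)

    def forward(self, f_refined, out_size):
        x = f_refined + self.conv(f_refined)    # 3x3 convolution with residual connection
        logits = self.classifier(x)
        # upsample to the query image size before comparing with the ground truth (assumption)
        return F.interpolate(logits, size=out_size, mode='bilinear', align_corners=False)

# usage (shapes assumed): pred = Decoder()(f_final, out_size=query_img.shape[-2:]).argmax(dim=1)
```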
Through the operation, the invention realizes semantic segmentation of the query image, fully utilizes the semantic information of the support image and the query image, enhances the information interaction of the support image and the query image, and enhances the robustness of the model.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program to instruct related hardware, the program may be stored in a computer readable storage medium, and the storage medium may include: ROM, RAM, magnetic or optical disks, etc.
While the foregoing has described embodiments, objectives and advantages of the present invention in detail, it should be understood that the foregoing embodiments are merely exemplary and do not limit the invention; any modifications, equivalent substitutions, improvements and the like made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims (8)

1. The small sample semantic segmentation method based on information interaction enhancement is characterized by comprising the following steps of:
s1: acquiring a support image, a corresponding true value mask and a query image to be segmented;
s2: processing the support image and the query image by adopting ResNet as a backbone feature extraction network to obtain support medium-level features, support high-level features, query medium-level features and query high-level features;
s3: adopting a query priori generating module to process the true value mask, the support middle-level features, the support high-level features and the query high-level features of the support image to obtain a query priori mask and a support prototype feature vector;
s4: processing the true value mask of the support image and the support intermediate level feature by adopting a guide attention module to obtain an attention feature map;
s5: processing the truth mask, the support middle-level feature and the query middle-level feature of the support image by adopting a spatial information interaction attention module to obtain a final query feature;
s6: inputting the query prior mask, the support prototype feature vector, the attention feature map and the final query feature into a multi-scale fusion network for processing to obtain refined query features;
s7: and inputting the refined query features into a decoder for processing to obtain a prediction segmentation result of the query image.
2. The small sample semantic segmentation method based on information interaction enhancement according to claim 1, wherein step S3 specifically comprises:
carrying out a Hadamard product between the support high-level feature and the truth mask of the support image, and then carrying out pixel-level cosine similarity calculation between the masked support high-level feature and the query high-level feature to obtain a query prior mask;
and carrying out mask average pooling on the support mid-level feature with the truth mask of the support image to obtain the support prototype feature vector.
3. The small sample semantic segmentation method based on information interaction enhancement according to claim 2, wherein the formula for obtaining the support prototype feature vector is:
V_S = MAP(F_S^m ⊙ ζ(M_S))
wherein V_S represents the support prototype feature vector, ζ(·) represents a function that resizes M_S to match the size of F_S^m, M_S represents the truth mask corresponding to the support image, F_S^m represents the support mid-level feature, ⊙ represents the Hadamard product, and MAP(·) represents mask average pooling.
4. The small sample semantic segmentation method based on information interaction enhancement according to claim 1, wherein step S4 specifically comprises:
carrying out mask average pooling on the support mid-level feature with the truth mask of the support image to obtain a support prototype feature vector;
performing convolution operation and feature activation on the support prototype feature vector to obtain an attention vector;
and carrying out Hadamard product on the support intermediate-level features and the attention vector to obtain an attention feature map.
5. The small sample semantic segmentation method based on information interaction enhancement according to claim 1, wherein step S5 specifically comprises:
carrying out Hadamard product on the truth mask of the support image and the support intermediate-level feature to obtain the support feature;
performing linear mapping on the support features to obtain support linear features; performing linear mapping on the query mid-level features to obtain a first query linear feature and a second query linear feature;
performing matrix multiplication calculation on the supported linear features and the first query linear features to obtain an affinity matrix;
performing matrix multiplication calculation on the affinity matrix and the second query linear feature, and performing residual connection with the query mid-level feature to obtain a first query feature;
respectively carrying out average pooling and maximum pooling operation on the first query feature in the channel dimension, and splicing the two pooling features to obtain a spliced feature;
performing convolution operation and feature activation on the spliced features to obtain a space attention diagram;
carrying out Hadamard product on the space attention diagram and the query mid-level feature to obtain a second query feature;
and fusing the first query feature and the second query feature to obtain a final query feature.
6. The small sample semantic segmentation method based on information interaction enhancement according to claim 5, wherein the formula for fusing the first query feature and the second query feature is:
F_q^final = f_1×1([F_q^1 ‖ F_q^2])
wherein F_q^final represents the final query feature, F_q^1 represents the first query feature, F_q^2 represents the second query feature, [· ‖ ·] represents a splicing operation, and f_1×1(·) represents a convolution operation with a convolution kernel size of 1×1.
7. The small sample semantic segmentation method based on information interaction enhancement according to claim 1, wherein step S6 specifically comprises:
downsampling the final query feature and the attention feature map to obtain 4 new query features with different scales and 4 new attention feature maps with different scales;
expanding the support prototype feature vectors to be the same as the shape of the 4 new query features with different scales to obtain 4 new support prototype feature vectors with different scales;
adjusting the query prior mask to the same spatial size as the 4 new attention feature maps of different scales to obtain 4 new query prior masks of different scales;
splicing the new query features, new attention feature maps, new support prototype feature vectors and new query prior masks in the channel dimension, and convolving the spliced results to obtain 4 rough query features of different scales;
performing feature interaction processing on the 4 rough query features with different scales to obtain 4 initial refined query features;
and 4 kinds of initial refined query features are spliced, feature fusion and dimension reduction are carried out on the spliced results through convolution, and the final refined query features are obtained.
8. The small sample semantic segmentation method based on information interaction enhancement according to claim 1, wherein the decoder comprises a 3 x 3 convolution with residual connection and a classifier.
CN202311575329.6A 2023-11-23 2023-11-23 Small sample semantic segmentation method based on information interaction enhancement Pending CN117726809A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311575329.6A CN117726809A (en) 2023-11-23 2023-11-23 Small sample semantic segmentation method based on information interaction enhancement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311575329.6A CN117726809A (en) 2023-11-23 2023-11-23 Small sample semantic segmentation method based on information interaction enhancement

Publications (1)

Publication Number Publication Date
CN117726809A true CN117726809A (en) 2024-03-19

Family

ID=90198868

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311575329.6A Pending CN117726809A (en) 2023-11-23 2023-11-23 Small sample semantic segmentation method based on information interaction enhancement

Country Status (1)

Country Link
CN (1) CN117726809A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118096800A (en) * 2024-04-29 2024-05-28 合肥市正茂科技有限公司 Training method, device, equipment and medium for small sample semantic segmentation model



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination