CN111860517A - Semantic segmentation method under small sample based on decentralized attention network


Info

Publication number
CN111860517A
CN111860517A (application CN202010601796.1A)
Authority
CN
China
Prior art keywords
image
layer
segmented
network
semantic segmentation
Prior art date
Legal status
Granted
Application number
CN202010601796.1A
Other languages
Chinese (zh)
Other versions
CN111860517B (en)
Inventor
张磊
李欣
甄先通
常峰贵
简治平
左利云
胥亮
李镇昌
Current Assignee
Shandong Gaitech Robotics Technology Co ltd
Guangdong University of Petrochemical Technology
Original Assignee
Shandong Gaitech Robotics Technology Co ltd
Guangdong University of Petrochemical Technology
Priority date
Filing date
Publication date
Application filed by Shandong Gaitech Robotics Technology Co ltd and Guangdong University of Petrochemical Technology
Priority to CN202010601796.1A
Publication of CN111860517A
Application granted
Publication of CN111860517B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Analysis (AREA)
  • Pure & Applied Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Algebra (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a semantic segmentation method under small samples based on a decentralized attention network, belonging to the technical field of semantic segmentation. A decentralized attention network mechanism is provided for the small-sample semantic segmentation task: it can activate more pixels belonging to the object foreground and establish a more stable association between the support image and the image to be segmented, so that generalization remains good even when the support image and the image to be segmented agree poorly in shape and similar attributes. At the same time, multi-scale attention information fusion is applied to the segmentation task: the semantic information obtained from multiple layers of the deep network is passed through the decentralized attention mechanism, fused using upsampling and residual networks, and semantic segmentation is performed on the fused result. This increases robustness to changes in object scale and gives the system better performance.

Description

Semantic segmentation method under small sample based on decentralized attention network
Technical Field
The invention relates to the technical field of semantic segmentation, and in particular to a semantic segmentation method under small samples based on a decentralized attention network.
Background
Deep learning has been widely applied to semantic segmentation in computer vision, but in practical applications the performance of a learned deep model suffers when little labeled support data is available. Prototype-based methods are currently popular. A prototype is a representation of a class of objects; in the deep learning framework, it is an output generated by a deep neural network from the support image and the labeled information of its corresponding object. In other words, a prototype is an associative mapping between the input support image and the object class. Semantic segmentation under small samples is essentially based on the prototype method, and the prototype representation takes multiple forms. One approach pools the features of the support image into a prototype and uses the prototype together with the features of the image to be segmented to generate the segmentation map. Another extracts a prototype representation from the support image by masked mean pooling and predicts the segmentation map by computing the cosine distance between the prototype and the image to be segmented. However, all of these prototypes are fixed and thus lack generalization. Alternatively, a graph attention mechanism can establish pixel-to-pixel connections between the support image and the image to be segmented, but due to bias in pixel competition only a small part of the foreground object in the support image is used to build the mapping, which greatly limits the transfer of information from the support image to the image to be segmented.
Semantic segmentation in computer vision refers to separating objects of interest in an image from the background. The current common approach extracts a global description from the support image (the labeled image) as a prototype to help complete the segmentation task on the image to be segmented. This approach, however, struggles in the small-sample setting, where a simple global vector prototype may be biased and lack generalization ability. An alternative establishes a connection between the support image and the image to be segmented through an attention mechanism, but bias in pixel selection means that only a small part of the foreground object in the support image tends to be used to build the mapping, which impairs the transfer of information from the support image to the image to be segmented.
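For concreteness, the masked-mean-pooling prototype approach criticized above can be sketched as follows. This is a minimal PyTorch sketch under assumed tensor shapes and with illustrative names, not the implementation of any particular prior method:

```python
import torch
import torch.nn.functional as F

def masked_average_prototype(feat_s: torch.Tensor, mask_s: torch.Tensor) -> torch.Tensor:
    """Pool support features over the labeled foreground into one prototype.

    feat_s: (C, H, W) backbone features of the support image
    mask_s: (H, W) binary foreground mask of the support object
    """
    mask = mask_s.float().unsqueeze(0)            # (1, H, W)
    fg_sum = (feat_s * mask).sum(dim=(1, 2))      # sum features over foreground
    return fg_sum / mask.sum().clamp(min=1.0)     # (C,) prototype vector

def cosine_segmentation_map(feat_q: torch.Tensor, prototype: torch.Tensor) -> torch.Tensor:
    """Score each query pixel by cosine similarity to the fixed prototype.

    feat_q: (C, H, W) backbone features of the image to be segmented
    """
    q = F.normalize(feat_q, dim=0)                # unit norm per pixel
    p = F.normalize(prototype, dim=0)             # unit norm prototype
    return torch.einsum('chw,c->hw', q, p)        # (H, W) similarity map
```

Because the prototype is a single fixed vector per class, every query pixel is compared against the same global description; this rigidity is what the dispersed attention mechanism below is meant to remove.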
Disclosure of Invention
1. Technical problem to be solved
Aiming at the problems in the prior art, the invention provides a semantic segmentation method under small samples based on a decentralized attention network. A decentralized attention network mechanism is proposed for the small-sample semantic segmentation task: it activates more pixels belonging to the object foreground and establishes a more stable association between the support image and the image to be segmented, so that generalization remains good when the two images agree poorly in shape and similar attributes. At the same time, multi-scale attention information fusion is applied to the task: semantic information obtained from multiple layers of the deep network passes through the decentralized attention mechanism, is fused using upsampling and residual networks, and semantic segmentation is performed on the fused result, improving robustness to changes in the scale of the foreground object and giving the system better performance.
2. Technical scheme
In order to solve the above problems, the present invention adopts the following technical solutions.
A semantic segmentation method under small samples based on a decentralized attention network involves: a training data set, a deep neural network, framework parameters φ, support images, and target-label mask images for the support images. The training data set contains, for each image, a mask image with segmentation labels; the deep neural network adopts the resnet101 structure with parameters trained on ImageNet; the framework parameters φ are the parameters of the convolutional layers that produce k and v and of the convolutional layers in the decoder. The semantic segmentation learning process comprises the following steps:
S1. For each task in the training data set, randomly extract one image and its label as the image to be segmented, and take the remaining images as the support image set;
S2. Randomly initialize the parameters of the convolutional layers for k and v and of the convolutional layers in the decoder, i.e., φ;
S3. Use the resnet101 network to generate the three-level feature representation {f_q^l, f_s^l}, l = 1, 2, 3, for the image to be segmented and the support image, where the outputs of blocks 1 and 2 of resnet101 form level 1, the output of block 3 forms level 2, and the output of block 4 forms level 3;
S4. For each l from 1 to 3, perform the following operations (a code sketch of this per-layer computation follows the step list):
S4.1. For f_q^l and f_s^l, use two convolutional layers (φ parameters) to generate the corresponding key-value pairs {k_q^l, v_q^l} and {k_s^l, v_s^l}, respectively;
S4.2. Compute the matrix A, whose elements are the inner products A_{i,j} = <k_q^l(i), k_s^l(j)> between query and support key vectors;
S4.3. Average the matrix A over its rows to generate A_s;
S4.4. Sort the pixels of A_s in descending order to obtain the rank position e_j corresponding to each pixel j;
S4.5. Adjust the weight of each pixel j in A_s as a function of its rank e_j [equation omitted in source], where H and W are the height and width of the image;
S4.6. Reconstruct the matrix A as Â, whose elements are given by the adjusted weights [equation omitted in source];
S4.7. Normalize Â through a softmax layer: Ā_{i,j} = exp(Â_{i,j}) / Σ_j' exp(Â_{i,j'});
S4.8. Generate the dispersed attention map for the i-th position as f_a^l(i) = v_q^l(i) || Σ_j Ā_{i,j} v_s^l(j), where || denotes the concatenation operation;
S4.9. Repeat from S4.1 until l = 3;
S5. Bilinearly upsample the dispersed attention feature f_a^3, pass it through a residual module, and concatenate it with the attention feature f_a^2; bilinearly upsample the concatenated result, pass it through a residual module, and concatenate it with the attention feature f_a^1; a dense representation is then obtained through a convolutional layer (φ parameters);
S6. Pass the dense representation through a softmax layer to obtain the final segmentation result, foreground or background, for each pixel;
S7. Compare the result with the ground-truth segmentation mask image, compute the cross entropy, and compute its gradient with respect to φ;
S8. Update φ;
S9. Loop back to S1 until convergence.
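A minimal PyTorch sketch of one level of the dispersed attention computation (steps S4.1-S4.8) follows. The published text does not reproduce the equation images, so the rank-based reweighting of S4.5-S4.6 is written here as a simple linear-in-rank factor e_j / (H·W) folded back into A; that exact form, like the module and variable names, is an assumption for illustration only:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DispersedGraphAttention(nn.Module):
    """One level of the dispersed graph attention computation (S4.1-S4.8).

    Assumptions: query and support feature maps share one spatial size, and
    the S4.5 reweighting is a linear-in-rank factor e_j / (H*W); the patent
    text does not reproduce the exact equations.
    """

    def __init__(self, in_ch: int, key_ch: int = 64, val_ch: int = 64):
        super().__init__()
        self.to_k = nn.Conv2d(in_ch, key_ch, 1)  # phi: key projection
        self.to_v = nn.Conv2d(in_ch, val_ch, 1)  # phi: value projection

    def forward(self, f_q: torch.Tensor, f_s: torch.Tensor) -> torch.Tensor:
        B, _, H, W = f_q.shape
        k_q = self.to_k(f_q).flatten(2)  # (B, Ck, HW) query keys
        v_q = self.to_v(f_q).flatten(2)  # (B, Cv, HW) query values
        k_s = self.to_k(f_s).flatten(2)  # (B, Ck, HW) support keys
        v_s = self.to_v(f_s).flatten(2)  # (B, Cv, HW) support values

        # S4.2: inner products between every query pixel i and support pixel j
        A = torch.einsum('bci,bcj->bij', k_q, k_s)       # (B, HW, HW)

        # S4.3: row-wise average -> one relevance score per support pixel
        A_s = A.mean(dim=1)                              # (B, HW)

        # S4.4: descending-order rank e_j of each support pixel (0 = largest)
        rank = A_s.argsort(dim=1, descending=True).argsort(dim=1)

        # S4.5-S4.6: damp the dominant support pixels (assumed form) and
        # fold the adjusted weights back into the full attention matrix
        damp = (rank.float() + 1.0) / (H * W)
        A_hat = A * (A_s * damp).unsqueeze(1)            # (B, HW, HW)

        # S4.7: softmax over support pixels for each query position
        A_bar = F.softmax(A_hat, dim=2)

        # S4.8: aggregate support values, concatenate with query values
        agg = torch.einsum('bij,bcj->bci', A_bar, v_s)   # (B, Cv, HW)
        f_a = torch.cat([v_q, agg], dim=1)               # (B, 2*Cv, HW)
        return f_a.view(B, -1, H, W)
```

Damping the highest-ranked support pixels in this way spreads the attention over more of the foreground instead of letting a few dominant pixels carry the whole mapping, which is the stated purpose of the mechanism.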
Further, under the one-shot condition the support set contains a single support image, and the semantic segmentation process comprises the following steps:
S1. Use the resnet101 network to generate the three-level feature representation {f_q^l, f_s^l}, l = 1, 2, 3, for the image to be segmented and the support image, where the outputs of blocks 1 and 2 of resnet101 form level 1, the output of block 3 forms level 2, and the output of block 4 forms level 3 (a backbone sketch follows these steps);
S2. For each l from 1 to 3, perform the following operations:
S2.1. For f_q^l and f_s^l, use two convolutional layers to generate the corresponding key-value pairs {k_q^l, v_q^l} and {k_s^l, v_s^l}, respectively;
S2.2. Compute the matrix A, whose elements are the inner products A_{i,j} = <k_q^l(i), k_s^l(j)>;
S2.3. Average the matrix A over its rows to generate A_s;
S2.4. Sort the pixels of A_s in descending order to obtain the rank position e_j corresponding to each pixel j;
S2.5. Adjust the weight of each pixel j in A_s as a function of its rank e_j [equation omitted in source], where H and W are the height and width of the image;
S2.6. Reconstruct the matrix A as Â, whose elements are given by the adjusted weights [equation omitted in source];
S2.7. Normalize through a softmax layer: Ā_{i,j} = exp(Â_{i,j}) / Σ_j' exp(Â_{i,j'});
S2.8. Generate the dispersed attention map for the i-th position as f_a^l(i) = v_q^l(i) || Σ_j Ā_{i,j} v_s^l(j), where || denotes the concatenation operation;
S2.9. Repeat from S2.1 until l = 3;
S3. Bilinearly upsample the dispersed attention feature f_a^3, pass it through a residual module, and concatenate it with the attention feature f_a^2; bilinearly upsample the concatenated result, pass it through a residual module, and concatenate it with the attention feature f_a^1; a dense representation is then obtained after passing through a convolutional layer;
S4. Pass the dense representation through the softmax layer to obtain the final segmentation result, namely foreground or background, for each pixel.
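The three-level feature extraction used in step S1 can be sketched with torchvision's resnet101. The mapping of the "blocks" to torchvision's stages layer1-layer4 follows the text's description and is otherwise an assumption:

```python
import torch.nn as nn
from torchvision.models import resnet101

class ThreeLevelBackbone(nn.Module):
    """Extract the three feature levels of step S1: level 1 from residual
    blocks 1-2, level 2 from block 3, level 3 from block 4 of an
    ImageNet-pretrained resnet101 (stage grouping assumed to map to
    torchvision's layer1-layer4)."""

    def __init__(self):
        super().__init__()
        net = resnet101(weights='IMAGENET1K_V1')
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.blocks12 = nn.Sequential(net.layer1, net.layer2)
        self.block3 = net.layer3
        self.block4 = net.layer4

    def forward(self, x):
        x = self.stem(x)
        f1 = self.blocks12(x)  # level-1 features
        f2 = self.block3(f1)   # level-2 features
        f3 = self.block4(f2)   # level-3 features
        return f1, f2, f3
```

The three levels have different spatial resolutions and channel widths, which is why the fusion stage upsamples and projects them before concatenation.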
Further, the framework comprises two branch structures. The lower branch takes the image to be segmented through a multi-layer convolutional neural network, whose different convolutional layers output the feature representations f_q^l covering information at different semantic levels. The upper branch takes the support image and its corresponding mask image (i.e., the label) and generates, through the different layers of a convolutional neural network with the same parameters as the lower-branch network, the feature representations f_s^l covering information at different semantic levels.
Further, f_s^l is dot-multiplied with the labeling information, and the result, together with the representation of the image to be segmented at the corresponding layer, is fed to the DGA dispersed graph attention mechanism and the RFU enhanced fusion unit; the attention features generated by the DGA link at the different layers are fused in the RFU link to produce the final segmentation label for the image to be segmented.
Further, the inputs of the DGA dispersed graph attention mechanism are the representations f_s and f_q, containing different levels of semantic information, obtained from the support image and the image to be segmented, and its output is the dispersed attention representation f_a. f_s and f_q each pass through two convolutional layers that map them into a space of key representations k and value representations v, where k is used to measure the distance between the image to be segmented and the support image, and v stores the detail information extracted from the feature maps. A is obtained from k_s and k_q by inner products, computed as A_{i,j} = <k_q(i), k_s(j)>.
further, A isi,jExpressing the relevance between the pixel i in the image to be segmented and the pixel j in the support image, and averaging the A matrix according to rows to obtain the A matrixs
Further, the RFU enhanced fusion unit includes an upsampling mechanism that employs bilinear upsampling; the attention features of adjacent layers are connected in series through a residual module, a residual module generates the dense representation through a convolutional layer, and the final output of the RFU enhanced fusion unit passes through a convolutional layer and a softmax layer, which decides independently for each pixel whether it is foreground or background.
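The RFU just described can be sketched as follows in PyTorch. The internal composition of the residual module, the channel bookkeeping (the three attention features are assumed to share one channel width), and all names are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualModule(nn.Module):
    """Small residual block used inside the RFU (composition assumed)."""

    def __init__(self, ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return F.relu(x + self.body(x))

class RFU(nn.Module):
    """Fuse the dispersed attention features f_a^3, f_a^2, f_a^1 from deep
    to shallow, then classify each pixel as foreground or background."""

    def __init__(self, ch: int):
        super().__init__()
        self.res3 = ResidualModule(ch)
        self.res2 = ResidualModule(2 * ch)
        self.head = nn.Conv2d(3 * ch, 2, 1)  # dense representation -> 2 classes

    def forward(self, fa1, fa2, fa3):
        x = F.interpolate(self.res3(fa3), size=fa2.shape[-2:],
                          mode='bilinear', align_corners=False)
        x = torch.cat([x, fa2], dim=1)       # series connection with level 2
        x = F.interpolate(self.res2(x), size=fa1.shape[-2:],
                          mode='bilinear', align_corners=False)
        x = torch.cat([x, fa1], dim=1)       # series connection with level 1
        return F.softmax(self.head(x), dim=1)  # per-pixel fg/bg probabilities
```

Fusing from the deepest level outward lets the coarse semantic evidence guide the finer levels while the bilinear upsampling restores spatial resolution step by step.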
3. Advantageous effects
Compared with the prior art, the invention has the advantages that:
(1) The invention provides a decentralized attention network mechanism for the small-sample semantic segmentation task. It activates more pixels belonging to the object foreground and establishes a more stable association between the support image and the image to be segmented, giving better generalization when the two images agree poorly in shape and similar attributes. At the same time, multi-scale attention information fusion is applied to the task: semantic information obtained from multiple layers of the deep network passes through the dispersed attention mechanism, is fused using upsampling and residual networks, and semantic segmentation is performed on the fused result, which increases robustness to changes in the scale of the foreground object and gives the system better performance.
Drawings
FIG. 1 is a schematic view of a one-shot situation according to the present invention;
FIG. 2 is a schematic illustration of the DGA dispersed graph attention mechanism of the present invention;
FIG. 3 shows examples of experimental results of the method of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described below clearly and completely with reference to the accompanying drawings. The described embodiments are only a part of the embodiments of the present invention, not all of them; all other embodiments obtained by those skilled in the art without inventive work fall within the scope of the present invention.
In the description of the present invention, it should be noted that the terms "upper", "lower", "inner", "outer", "top/bottom", and the like indicate orientations or positional relationships based on those shown in the drawings; they are used only for convenience and simplicity of description and do not indicate or imply that the referred device or element must have a specific orientation or be constructed and operated in a specific orientation, and therefore should not be construed as limiting the present invention. Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance.
In the description of the present invention, it should also be noted that, unless otherwise expressly specified and limited, the terms "mounted", "disposed", "sleeved/connected", "connected", and the like are to be understood broadly; for example, "connected" may be a fixed, detachable, or integral connection, a mechanical or electrical connection, a direct connection, an indirect connection through an intermediate medium, or a communication between two elements.
Example 1:
Referring to FIGS. 1-3, a semantic segmentation method under small samples based on a decentralized attention network involves: a training data set, a deep neural network, framework parameters φ, support images, and target-label mask images for the support images. The training data set contains, for each image, a mask image with segmentation labels; the deep neural network adopts the resnet101 structure with parameters trained on ImageNet; the framework parameters φ are the parameters of the convolutional layers that produce k and v and of the convolutional layers in the decoder. The semantic segmentation learning process comprises the following steps:
S1. For each task in the training data set, randomly extract one image and its label as the image to be segmented, and take the remaining images as the support image set;
S2. Randomly initialize the parameters of the convolutional layers for k and v and of the convolutional layers in the decoder, i.e., φ;
S3. Use the resnet101 network to generate the three-level feature representation {f_q^l, f_s^l}, l = 1, 2, 3, for the image to be segmented and the support image, where the outputs of blocks 1 and 2 of resnet101 form level 1, the output of block 3 forms level 2, and the output of block 4 forms level 3;
S4. For each l from 1 to 3, perform the following operations:
S4.1. For f_q^l and f_s^l, use two convolutional layers (φ parameters) to generate the corresponding key-value pairs {k_q^l, v_q^l} and {k_s^l, v_s^l}, respectively;
S4.2. Compute the matrix A, whose elements are the inner products A_{i,j} = <k_q^l(i), k_s^l(j)>;
S4.3. Average the matrix A over its rows to generate A_s;
S4.4. Sort the pixels of A_s in descending order to obtain the rank position e_j corresponding to each pixel j;
S4.5. Adjust the weight of each pixel j in A_s as a function of its rank e_j [equation omitted in source], where H and W are the height and width of the image;
S4.6. Reconstruct the matrix A as Â, whose elements are given by the adjusted weights [equation omitted in source];
S4.7. Normalize Â through a softmax layer: Ā_{i,j} = exp(Â_{i,j}) / Σ_j' exp(Â_{i,j'});
S4.8. Generate the dispersed attention map for the i-th position as f_a^l(i) = v_q^l(i) || Σ_j Ā_{i,j} v_s^l(j), where || denotes the concatenation operation;
S4.9. Repeat from S4.1 until l = 3;
S5. Bilinearly upsample the dispersed attention feature f_a^3, pass it through a residual module, and concatenate it with the attention feature f_a^2; bilinearly upsample the concatenated result, pass it through a residual module, and concatenate it with the attention feature f_a^1; a dense representation is then obtained through a convolutional layer (φ parameters);
S6. Pass the dense representation through a softmax layer to obtain the final segmentation result, foreground or background, for each pixel;
S7. Compare the result with the ground-truth segmentation mask image, compute the cross entropy, and compute its gradient with respect to φ;
S8. Update φ;
S9. Loop back to S1 until convergence (a sketch of this training loop follows).
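Steps S1-S9 amount to an episodic training loop, sketched below under the assumption of a wrapper model that composes the backbone, DGA, and RFU pieces and exposes only the φ parameters (the k/v convolutions and the decoder) as trainable; the optimizer choice, hyperparameters, and names are illustrative:

```python
import random
import torch
import torch.nn.functional as F

def train(model, tasks, lr=2.5e-4, max_iters=30000):
    """Episodic training loop for steps S1-S9 (a sketch).

    tasks: a list of episodes, each a list of (image, mask) pairs for one
    object class; model(query, supports) is assumed to return per-pixel
    class logits, with only the phi parameters marked trainable.
    """
    phi = [p for p in model.parameters() if p.requires_grad]  # S2 done by modules
    opt = torch.optim.SGD(phi, lr=lr, momentum=0.9)

    for _ in range(max_iters):                    # S9: loop until convergence
        episode = random.choice(tasks)
        i = random.randrange(len(episode))        # S1: draw the query image
        query_img, query_mask = episode[i]
        supports = episode[:i] + episode[i + 1:]  # remaining images support

        logits = model(query_img, supports)       # S3-S5: forward pass
        # S6-S7: cross_entropy applies the softmax of S6 internally
        loss = F.cross_entropy(logits, query_mask)

        opt.zero_grad()
        loss.backward()                           # S7: gradient w.r.t. phi
        opt.step()                                # S8: update phi
```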
Referring to fig. 1-3, in a one-shot situation, the support image is 1 image, and its semantics
The segmentation process comprises the following steps:
S1. Use the resnet101 network to generate the three-level feature representation {f_q^l, f_s^l}, l = 1, 2, 3, for the image to be segmented and the support image, where the outputs of blocks 1 and 2 of resnet101 form level 1, the output of block 3 forms level 2, and the output of block 4 forms level 3;
S2. For each l from 1 to 3, perform the following operations:
S2.1. For f_q^l and f_s^l, use two convolutional layers to generate the corresponding key-value pairs {k_q^l, v_q^l} and {k_s^l, v_s^l}, respectively;
S2.2. Compute the matrix A, whose elements are the inner products A_{i,j} = <k_q^l(i), k_s^l(j)>;
S2.3. Average the matrix A over its rows to generate A_s;
S2.4. Sort the pixels of A_s in descending order to obtain the rank position e_j corresponding to each pixel j;
S2.5. Adjust the weight of each pixel j in A_s as a function of its rank e_j [equation omitted in source], where H and W are the height and width of the image;
S2.6. Reconstruct the matrix A as Â, whose elements are given by the adjusted weights [equation omitted in source];
S2.7. Normalize through a softmax layer: Ā_{i,j} = exp(Â_{i,j}) / Σ_j' exp(Â_{i,j'});
S2.8. Generate the dispersed attention map for the i-th position as f_a^l(i) = v_q^l(i) || Σ_j Ā_{i,j} v_s^l(j), where || denotes the concatenation operation;
S2.9. Repeat from S2.1 until l = 3;
S3. Bilinearly upsample the dispersed attention feature f_a^3, pass it through a residual module, and concatenate it with the attention feature f_a^2; bilinearly upsample the concatenated result, pass it through a residual module, and concatenate it with the attention feature f_a^1; a dense representation is then obtained after passing through a convolutional layer;
S4. Pass the dense representation through the softmax layer to obtain the final segmentation result, namely foreground or background, for each pixel.
Referring to FIGS. 1-3, the framework comprises two branch structures. The lower branch takes the image to be segmented through a multi-layer convolutional neural network, whose different convolutional layers output the feature representations f_q^l covering information at different semantic levels. The upper branch takes the support image and its corresponding mask image (i.e., the label) and generates, through the different layers of a convolutional neural network with the same parameters as the lower-branch network, the feature representations f_s^l covering information at different semantic levels. Further, f_s^l is dot-multiplied with the labeling information, and the result, together with the representation of the image to be segmented at the corresponding layer, is fed to the DGA dispersed graph attention mechanism and the RFU enhanced fusion unit; the attention features generated by the DGA link at the different layers are fused in the RFU link to produce the final segmentation label for the image to be segmented.
Referring to FIGS. 1-3, the inputs of the DGA dispersed graph attention mechanism are the representations f_s and f_q, containing different levels of semantic information, obtained from the support image and the image to be segmented, and its output is the dispersed attention representation f_a. f_s and f_q each pass through two convolutional layers that map them into a space of key representations k and value representations v, where k is used to measure the distance between the image to be segmented and the support image, and v stores the detail information extracted from the feature maps. A is obtained from k_s and k_q by inner products, computed as A_{i,j} = <k_q(i), k_s(j)>. A_{i,j} expresses the relevance between pixel i in the image to be segmented and pixel j in the support image, and averaging the matrix A over its rows yields A_s. The RFU enhanced fusion unit includes an upsampling mechanism that employs bilinear upsampling; the attention features of adjacent layers are connected in series through a residual module, a residual module generates the dense representation through a convolutional layer, and the final output of the RFU enhanced fusion unit passes through a convolutional layer and a softmax layer, which decides independently for each pixel whether it is foreground or background.
The invention provides a decentralized attention network mechanism for the small-sample semantic segmentation task. It activates more pixels belonging to the object foreground and establishes a more stable association between the support image and the image to be segmented, giving better generalization when the two images agree poorly in shape and similar attributes. At the same time, multi-scale attention information fusion is applied to the task: semantic information obtained from multiple layers of the deep network passes through the dispersed attention mechanism, is fused using upsampling and residual networks, and semantic segmentation is performed on the fused result, which increases robustness to changes in the scale of the foreground object and gives the system better performance.
The foregoing is only a preferred embodiment of the present invention, and the scope of the invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art within the technical scope disclosed by the present invention shall be covered by the scope of protection of the present invention.

Claims (7)

1. A semantic segmentation method under small samples based on a decentralized attention network, involving: a training data set, a deep neural network, framework parameters φ, support images, and target-label mask images for the support images, wherein the training data set contains, for each image, a mask image with segmentation labels, the deep neural network adopts the resnet101 structure with parameters trained on ImageNet, and the framework parameters φ are the parameters of the convolutional layers that produce k and v and of the convolutional layers in the decoder, characterized in that the semantic segmentation learning process comprises the following steps:
S1. For each task in the training data set, randomly extract one image and its label as the image to be segmented, and take the remaining images as the support image set;
S2. Randomly initialize the parameters of the convolutional layers for k and v and of the convolutional layers in the decoder, i.e., φ;
S3. Use the resnet101 network to generate the three-level feature representation {f_q^l, f_s^l}, l = 1, 2, 3, for the image to be segmented and the support image, where the outputs of blocks 1 and 2 of resnet101 form level 1, the output of block 3 forms level 2, and the output of block 4 forms level 3;
S4. For each l from 1 to 3, perform the following operations:
S4.1. For f_q^l and f_s^l, use two convolutional layers (φ parameters) to generate the corresponding key-value pairs {k_q^l, v_q^l} and {k_s^l, v_s^l}, respectively;
S4.2. Compute the matrix A, whose elements are the inner products A_{i,j} = <k_q^l(i), k_s^l(j)>;
S4.3. Average the matrix A over its rows to generate A_s;
S4.4. Sort the pixels of A_s in descending order to obtain the rank position e_j corresponding to each pixel j;
S4.5. Adjust the weight of each pixel j in A_s as a function of its rank e_j [equation omitted in source], where H and W are the height and width of the image;
S4.6. Reconstruct the matrix A as Â, whose elements are given by the adjusted weights [equation omitted in source];
S4.7. Normalize Â through a softmax layer: Ā_{i,j} = exp(Â_{i,j}) / Σ_j' exp(Â_{i,j'});
S4.8. Generate the dispersed attention map for the i-th position as f_a^l(i) = v_q^l(i) || Σ_j Ā_{i,j} v_s^l(j), where || denotes the concatenation operation;
S4.9. Repeat from S4.1 until l = 3;
S5. Bilinearly upsample the dispersed attention feature f_a^3, pass it through a residual module, and concatenate it with the attention feature f_a^2; bilinearly upsample the concatenated result, pass it through a residual module, and concatenate it with the attention feature f_a^1; a dense representation is then obtained through a convolutional layer (φ parameters);
S6. Pass the dense representation through a softmax layer to obtain the final segmentation result, foreground or background, for each pixel;
S7. Compare the result with the ground-truth segmentation mask image, compute the cross entropy, and compute its gradient with respect to φ;
S8. Update φ;
S9. Loop back to S1 until convergence.
2. The semantic segmentation method under small samples based on a decentralized attention network according to claim 1, characterized in that: under the one-shot condition the support set contains a single support image, and the semantic segmentation process comprises the following steps:
S1. Use the resnet101 network to generate the three-level feature representation {f_q^l, f_s^l}, l = 1, 2, 3, for the image to be segmented and the support image, where the outputs of blocks 1 and 2 of resnet101 form level 1, the output of block 3 forms level 2, and the output of block 4 forms level 3;
S2. For each l from 1 to 3, perform the following operations:
S2.1. For f_q^l and f_s^l, use two convolutional layers to generate the corresponding key-value pairs {k_q^l, v_q^l} and {k_s^l, v_s^l}, respectively;
S2.2. Compute the matrix A, whose elements are the inner products A_{i,j} = <k_q^l(i), k_s^l(j)>;
S2.3. Average the matrix A over its rows to generate A_s;
S2.4. Sort the pixels of A_s in descending order to obtain the rank position e_j corresponding to each pixel j;
S2.5. Adjust the weight of each pixel j in A_s as a function of its rank e_j [equation omitted in source], where H and W are the height and width of the image;
S2.6. Reconstruct the matrix A as Â, whose elements are given by the adjusted weights [equation omitted in source];
S2.7. Normalize through a softmax layer: Ā_{i,j} = exp(Â_{i,j}) / Σ_j' exp(Â_{i,j'});
S2.8. Generate the dispersed attention map for the i-th position as f_a^l(i) = v_q^l(i) || Σ_j Ā_{i,j} v_s^l(j), where || denotes the concatenation operation;
S2.9. Repeat from S2.1 until l = 3;
S3. Bilinearly upsample the dispersed attention feature f_a^3, pass it through a residual module, and concatenate it with the attention feature f_a^2; bilinearly upsample the concatenated result, pass it through a residual module, and concatenate it with the attention feature f_a^1; a dense representation is then obtained after passing through a convolutional layer;
S4. Pass the dense representation through the softmax layer to obtain the final segmentation result, namely foreground or background, for each pixel.
3. The semantic segmentation method under small samples based on a decentralized attention network according to claim 1, characterized in that: the framework comprises two branch structures; the lower branch takes the image to be segmented through a multi-layer convolutional neural network, whose different convolutional layers output the feature representations f_q^l covering information at different semantic levels, and the upper branch takes the support image and its corresponding mask image (i.e., the label) and generates, through the different layers of a convolutional neural network with the same parameters as the lower-branch network, the feature representations f_s^l covering information at different semantic levels.
4. The semantic segmentation method under small samples based on a decentralized attention network according to claim 1, characterized in that: f_s^l is dot-multiplied with the labeling information, and the result, together with the representation of the image to be segmented at the corresponding layer, is fed to the DGA dispersed graph attention mechanism and the RFU enhanced fusion unit; the attention features generated by the DGA link at the different layers are fused in the RFU link to produce the final segmentation label for the image to be segmented.
5. The semantic segmentation method under small samples based on a decentralized attention network according to claim 4, characterized in that: the inputs of the DGA dispersed graph attention mechanism are the representations f_s and f_q, containing different levels of semantic information, obtained from the support image and the image to be segmented, and its output is the dispersed attention representation f_a; f_s and f_q each pass through two convolutional layers that map them into a space of key representations k and value representations v, where k is used to measure the distance between the image to be segmented and the support image and v stores the detail information extracted from the feature maps, and A is obtained from k_s and k_q by inner products, computed as A_{i,j} = <k_q(i), k_s(j)>.
6. The semantic segmentation method under small samples based on a decentralized attention network according to claim 5, characterized in that: A_{i,j} expresses the relevance between pixel i in the image to be segmented and pixel j in the support image, and averaging the matrix A over its rows yields A_s.
7. The semantic segmentation method under small samples based on a decentralized attention network according to claim 1, characterized in that: the RFU enhanced fusion unit includes an upsampling mechanism that employs bilinear upsampling; the attention features of adjacent layers are connected in series through a residual module, a residual module generates the dense representation through a convolutional layer, and the final output of the RFU enhanced fusion unit passes through a convolutional layer and a softmax layer, which decides independently for each pixel whether it is foreground or background.
CN202010601796.1A 2020-06-28 2020-06-28 Semantic segmentation method under small sample based on distraction network Active CN111860517B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010601796.1A CN111860517B (en) 2020-06-28 2020-06-28 Semantic segmentation method under small sample based on distraction network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010601796.1A CN111860517B (en) 2020-06-28 2020-06-28 Semantic segmentation method under small sample based on distraction network

Publications (2)

Publication Number Publication Date
CN111860517A 2020-10-30
CN111860517B CN111860517B (en) 2023-07-25

Family

ID=72988651

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010601796.1A Active CN111860517B (en) 2020-06-28 2020-06-28 Semantic segmentation method under small sample based on distraction network

Country Status (1)

Country Link
CN (1) CN111860517B (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110675330A (en) * 2019-08-12 2020-01-10 广东石油化工学院 Image rain removing method of encoding-decoding network based on channel level attention mechanism
CN110705340A (en) * 2019-08-12 2020-01-17 广东石油化工学院 Crowd counting method based on attention neural network field
CN110796105A (en) * 2019-11-04 2020-02-14 中国矿业大学 Remote sensing image semantic segmentation method based on multi-modal data fusion
CN111127493A (en) * 2019-11-12 2020-05-08 中国矿业大学 Remote sensing image semantic segmentation method based on attention multi-scale feature fusion
CN110929696A (en) * 2019-12-16 2020-03-27 中国矿业大学 Remote sensing image semantic segmentation method based on multi-mode attention and self-adaptive fusion
CN111210432A (en) * 2020-01-12 2020-05-29 湘潭大学 Image semantic segmentation method based on multi-scale and multi-level attention mechanism
CN111275688A (en) * 2020-01-19 2020-06-12 合肥工业大学 Small target detection method based on context feature fusion screening of attention mechanism

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
XIANTONG ZHEN ET AL.: "One-Shot Learning for Semantic Segmentation", ARXIV *
HE CHAO ET AL.: "Multi-scale feature fusion for workpiece object semantic segmentation", Journal of Image and Graphics *
CAI YU ET AL.: "Real-time semantic segmentation algorithm based on feature fusion", Laser & Optoelectronics Progress *
GAO DAN ET AL.: "A-PSPNet: a PSPNet image semantic segmentation model fused with an attention mechanism", Journal of China Academy of Electronics and Information Technology *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112419320A (en) * 2021-01-22 2021-02-26 湖南师范大学 Cross-modal heart segmentation method based on SAM and multi-layer UDA
CN112419320B (en) * 2021-01-22 2021-04-27 湖南师范大学 Cross-modal heart segmentation method based on SAM and multi-layer UDA
CN112819073A (en) * 2021-02-01 2021-05-18 上海明略人工智能(集团)有限公司 Classification network training method, image classification device and electronic equipment

Also Published As

Publication number Publication date
CN111860517B (en) 2023-07-25

Similar Documents

Publication Publication Date Title
CN108460338B (en) Human body posture estimation method and apparatus, electronic device, storage medium, and program
CN109345575B (en) Image registration method and device based on deep learning
Park et al. Few-shot font generation with localized style representations and factorization
CN106547880B (en) Multi-dimensional geographic scene identification method fusing geographic area knowledge
CN110222140A (en) A kind of cross-module state search method based on confrontation study and asymmetric Hash
CN107180430A (en) A kind of deep learning network establishing method and system suitable for semantic segmentation
CN108304357A (en) A kind of Chinese word library automatic generation method based on font manifold
CN111860517A (en) Semantic segmentation method under small sample based on decentralized attention network
CN114049381A (en) Twin cross target tracking method fusing multilayer semantic information
CN111985532B (en) Scene-level context-aware emotion recognition deep network method
Zhai et al. Deep texton-coherence network for camouflaged object detection
CN116229056A (en) Semantic segmentation method, device and equipment based on double-branch feature fusion
Halvardsson et al. Interpretation of swedish sign language using convolutional neural networks and transfer learning
CN112131969A (en) Remote sensing image change detection method based on full convolution neural network
CN113592894A (en) Image segmentation method based on bounding box and co-occurrence feature prediction
CN116258990A (en) Cross-modal affinity-based small sample reference video target segmentation method
Yang et al. Application of multitask joint sparse representation algorithm in Chinese painting image classification
CN116681960A (en) Intelligent mesoscale vortex identification method and system based on K8s
Guo et al. Decoupling semantic and edge representations for building footprint extraction from remote sensing images
CN106462773A (en) Pattern recognition system and method using GABOR functions
Yu et al. Coupling dual graph convolution network and residual network for local climate zone mapping
CN117370578A (en) Method for supplementing food safety knowledge graph based on multi-mode information
CN117634556A (en) Training method and device for semantic segmentation neural network based on water surface data
Wang et al. Self-attention deep saliency network for fabric defect detection
Wu et al. Retentive Compensation and Personality Filtering for Few-Shot Remote Sensing Object Detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant