CN113516670B - Feedback attention-enhanced non-mode image segmentation method and device - Google Patents


Info

Publication number
CN113516670B
Authority
CN
China
Prior art keywords
feature map
feature
convolution
output
image segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110732029.9A
Other languages
Chinese (zh)
Other versions
CN113516670A (en)
Inventor
刘华平
董俊杰
续欣莹
谢珺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202110732029.9A
Publication of CN113516670A
Application granted
Publication of CN113516670B


Classifications

    • G06T7/11: Region-based segmentation (under G06T7/00 Image analysis; G06T7/10 Segmentation; Edge detection)
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415: Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N3/045: Combinations of networks
    • G06N3/047: Probabilistic or stochastic networks
    • G06N3/048: Activation functions
    • G06N3/08: Learning methods
    • G06T2207/10004: Still image; Photographic image
    • G06T2207/10024: Color image
    • G06T2207/20081: Training; Learning
    • G06T2207/20084: Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides a feedback attention-enhanced non-modal image segmentation method and device, belonging to the technical field of image recognition. The method comprises: acquiring an RGB image of a real environment; and performing multi-scale prediction on the RGB image with a preset non-modal image segmentation network, which outputs the shape, boundary and semantic category of both the visible and the occluded parts of each identified object in the image, thereby producing the non-modal image segmentation result of the input image. The method and device overcome the incomplete object perception that arises in existing image recognition when objects occlude one another, avoid recognition and segmentation errors caused by mutual occlusion as far as possible, and strengthen complete perception and understanding of the surrounding environment.

Description

Feedback attention-enhanced non-mode image segmentation method and device
Technical Field
The disclosure relates to the technical field of image recognition, and in particular relates to a feedback attention-enhanced non-mode image segmentation method and device.
Background
With the development of deep learning in computer vision, common visual recognition tasks such as image classification, object detection and image segmentation have advanced rapidly, and deep learning techniques are now widely applied in security monitoring, image retrieval, medical diagnosis, autonomous driving, human-computer interaction and other fields.
Image segmentation is an important image understanding method that aims to find all objects present in an image and classify them at the pixel level. Current image segmentation algorithms have achieved great success in both accuracy and speed. Nevertheless, image segmentation alone is still insufficient for a comprehensive understanding of a complex surrounding environment: when objects occlude one another, segmentation techniques can only identify and segment the visible pixels and cannot make reasonable predictions about the shape, boundary and semantics of the occluded parts.
The human visual system, however, can perceive the complete physical structure of an object, an ability known as amodal (non-modal) perception: even under partial or severe occlusion, humans can accurately infer the complete shape, boundary and semantic concept of an object. Non-modal image segmentation extends the image segmentation task by imitating this ability, predicting both the visible and the occluded part of each object rather than only the visible pixels, and thereby helping a robot perceive the full extent of an object of interest.
As is well known, the human brain is hierarchically organized and performs not only a feed-forward process from shallow to deep layers but also a feedback process from deep to shallow layers, through which information is continuously learned and transferred. At the same time, the human visual system selectively enhances or suppresses the activation of neurons through attention mechanisms, improving its ability to process information. Existing non-modal image segmentation methods generally train a convolutional neural network to predict the complete shape and semantics of objects in an input image; however, a convolutional neural network is a feed-forward network that only learns from shallow to deep layers, and when predicting the shape and semantics of the invisible part of an object it often under-fits because the feature representation is insufficient and spatial detail information is lacking. Moreover, with the advent of the artificial intelligence era, a robot that can reason about object completeness as the human visual system does has great application value, which makes research on non-modal image segmentation highly important.
Disclosure of Invention
The present disclosure aims to overcome the shortcomings of existing methods and provides a feedback attention-enhanced non-modal image segmentation method and apparatus. The method helps a robot infer the complete shape, boundary and semantics of an object of interest, not just its visible part, and improves the robot's perception of object completeness.
An embodiment of a first aspect of the present disclosure provides a feedback attention-enhancing non-mode image segmentation method, including:
Acquiring an RGB image of a real environment; and performing multi-scale prediction on the RGB image with a preset non-modal image segmentation network, which outputs the shapes, boundaries and semantic categories of both the visible and the occluded parts of each identified object in the image, so as to obtain the non-modal image segmentation result of the input image.
In one embodiment of the disclosure, before the multi-scale prediction of the RGB image according to the preset non-modal image segmentation network, the method further includes: training the non-modal image segmentation network; wherein training the non-modal image segmentation network comprises: constructing a non-modal image segmentation network with a bilateral network structure; acquiring a training set comprising RGB images and non-modal segmentation masks; and inputting the training set into the non-modal image segmentation network and training it by optimizing the network parameters with stochastic gradient descent, the training being complete when the number of iterations reaches a preset upper limit.
In one embodiment of the present disclosure, the non-modal image segmentation network is configured to include a backbone network branch with feedback attention enhancement and a spatial detail preserving network branch.
In one embodiment of the present disclosure, the backbone network branch includes five convolution blocks, C1, C2, C3, C4, C5, respectively, connected in sequence, and three attention-optimized upsampling ARB modules and feedback connections; five convolution blocks C1-C5 adopt ResNet networks;
The spatial detail preserving network branch comprises four convolution blocks and three multi-scale attention feature MSAF modules; wherein the first convolution block is denoted as B2, the second convolution block is denoted as B3, the third convolution block is denoted as B4, the fourth convolution block is denoted as B5, each convolution block comprises two 3 x 3 convolution layers, the input of the B2 convolution block is connected to the output of the C1 convolution block, the first MSAF module is located between the B3 convolution block and the B4 convolution block, the second MSAF module is located between the B4 convolution block and the B5 convolution block, and the third MSAF module is located after the B5 convolution block;
the main network branches and the detail reservation network branches are connected through four feature fusion modules FFM.
In one embodiment of the present disclosure, the ARB module workflow is as follows:
(1) After any feature map X epsilon R C×H×W is input into the ARB module, wherein C represents the channel number of the feature map, H represents the height of the feature map, W represents the width of the feature map, and the global feature of the feature map is output by using a global average pooling function;
(2) The output of the step (1) is activated by a fully connected neural network and Sigmoid, and the channel attention vector of the feature map input in the step (1) is calculated;
(3) Multiplying the channel attention vector obtained in the step (2) by the input feature map X, and outputting a corresponding channel weighted feature map Y;
(4) Calculating a spatial attention feature map in the horizontal direction and the vertical direction by using pooling kernels with the sizes of (1, W) and (H, 1) and Softmax functions on the feature map Y obtained in the step (3), and obtaining a corresponding spatial attention feature map in the horizontal direction with the size of R C×H×1 and a corresponding spatial attention map in the vertical direction R C×1×W respectively;
(5) Performing channel dimension splicing operation on the horizontal direction space attention feature map and the vertical direction space attention feature map output in the step (4) to obtain a spliced feature map, wherein the size of the spliced feature map is R C×1×(H+W), and performing convolution operation and Relu nonlinear activation function aggregation on the spliced feature map to obtain features in two directions;
(6) The dimension of the feature map obtained in the step (5) is transformed into two feature maps of R C×H×1 and R C×1×W again, and then the output values of the two feature maps are calculated by using a Sigmoid function respectively;
(7) And (3) performing matrix multiplication operation on the feature map Y and the output value obtained in the step (6) to obtain a feature map Z epsilon R C×H×W finally output by the ARB module.
In one embodiment of the present disclosure, the MSAF module workflow is as follows:
(1) After inputting any feature map F epsilon R C×H×W into a MSAF module, extracting features with different sizes in the input image by adopting four convolution kernels with different scales, wherein the number of channels of the output feature map with each convolution kernel scale is halved to obtain four output feature maps, wherein C represents the number of channels of the feature map, H represents the height of the feature map, and W represents the width of the feature map;
(2) Performing splicing operation on the four output feature graphs obtained in the step (1) in a channel dimension to generate a spliced feature graph, wherein the size of the spliced feature graph is R 2C×H×W;
(3) Performing convolution dimension reduction operation on the spliced feature images in the step (2) by using two 1×1 convolution kernels to obtain corresponding feature images, wherein the corresponding feature images are respectively marked as U and V, and at the moment, U, V epsilon R C′×H×W and C' represent the number of channels after dimension reduction;
(4) Transforming the feature images U and V in the step (3) into feature images with dimensions R C′×K, respectively, where k=h×w, and transposing the transformed feature images U to obtain updated feature images U e R K×C′;
(5) Matrix multiplication is performed between the feature maps U and V, and a Softmax function is then applied to obtain a spatial attention map N ∈ R^{K×K}:

N_ji = exp(U_i · V_j) / Σ_{i=1}^{K} exp(U_i · V_j)

where U_i represents the pixel value of the i-th position in the feature map U, and V_j represents the pixel value of the j-th position in the feature map V; N_ji denotes the pixel value of the j-th row and i-th column in the spatial attention map N, and exp() denotes an exponential function based on the natural constant e;
(6) Converting the feature map F into a feature map M by using another 1X 1 convolution kernel, wherein M epsilon R C′×H×W, and then converting the dimension of the feature map M into M epsilon R C′×K;
(7) Performing a matrix multiplication between the feature map M and the spatial attention map N and transforming the result into a feature map D εR C′×H×W;
(8) And (3) transforming the feature map D obtained in the step (7) into D epsilon R C×H×W by using 1X 1 convolution, and summing with the feature map F pixel by pixel through jump connection to obtain a feature map E finally output by a MSAF module.
In one embodiment of the present disclosure, the FFM module workflow is as follows:
(1) The two feature maps input to the FFM module are denoted P and Q, with P, Q ∈ R^{C×H×W}, where the feature map P is the output feature map of the spatial detail preserving network branch, the feature map Q is the output feature map of the feedback attention-enhanced backbone network branch, C denotes the number of channels, H the height and W the width of the feature maps; the feature maps P and Q are added pixel by pixel to obtain a summed feature map;
(2) The feature map output in step (1) is convolved successively with a 3×3 convolution kernel and a 1×1 convolution kernel and then fed into a Sigmoid nonlinear activation function, outputting the corresponding feature map;
(3) In parallel, the feature map output in step (1) is convolved with another 3×3 convolution kernel, outputting the corresponding feature map;
(4) A pixel-by-pixel multiplication is performed between the feature map output in step (2) and the feature map output in step (3), the resulting product feature map being denoted G ∈ R^{C×H×W};
(5) The feature map G output in step (4) is aggregated with the input feature maps P and Q to obtain the final output feature map of the FFM module, where μ and λ are trainable parameters and the aggregation uses a pixel-by-pixel multiplication operation.
To achieve the above object, a second aspect of the present disclosure provides a feedback attention-enhancing non-mode image segmentation apparatus, including: the acquisition module is used for acquiring RGB images of the real environment; the prediction module is used for carrying out multi-scale prediction on the RGB image according to a preset non-mode image segmentation network, wherein the preset non-mode image segmentation network outputs the shape, boundary and semantic category of the visible part and the blocked part of the identified object in the image so as to obtain a non-mode image segmentation result of the input image.
To achieve the above object, an embodiment of a third aspect of the present disclosure proposes an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being configured to perform a feedback attention enhanced non-mode image segmentation method as described above.
To achieve the above object, a fourth aspect of the present disclosure provides a computer-readable storage medium storing computer instructions for causing the computer to execute the above-described feedback attention-enhancing non-mode image segmentation method.
The feedback attention-enhanced non-mode image segmentation method and device provided by the present disclosure are characterized and beneficial in that:
To improve a robot's perception of target completeness, the present disclosure combines a feedback process with the learning characteristics of attention mechanisms and provides a feedback attention-enhanced non-modal image segmentation method and device. Specifically, after the convolutional neural network extracts a first round of features from the input image, a feedback connection is introduced to relearn and extract a second round of image features, because features that pass through the convolutional layers twice are more abstract than those extracted in a single pass; meanwhile, a spatial detail preserving network branch with multi-scale attention feature modules is designed to capture rich spatial detail information. Finally, the deep and shallow features from the backbone branch and the spatial detail preserving branch are effectively combined by the feature fusion modules, realizing the complex non-modal image segmentation task. Applied to a robot system, this lets the robot capture the whole of a target, not just its visible part, and better perceive the complete shape, boundary and semantics of an object.
The feedback attention-enhanced non-modal image segmentation method and device can help a robot perceive, by inference, the whole of an object rather than only its visible part, making the robot robust to occlusion and avoiding errors caused by mutual occlusion during recognition and localization. The method and device can be applied in service robots, security monitoring, autonomous driving, human-computer interaction and other fields.
Drawings
FIG. 1 is an overall flowchart of a feedback attention-enhancing non-mode image segmentation method provided by an embodiment of the present disclosure;
FIG. 2 is a block diagram of a non-modal image segmentation network in an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of an attention optimization module ARB in an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a multi-scale attention feature module MSAF in an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a feature fusion module FFM in an embodiment of the present disclosure;
fig. 6 is a diagram of a non-mode image segmentation effect in an embodiment of the present disclosure.
Detailed Description
The present disclosure provides a feedback attention-enhanced non-mode image segmentation method and apparatus, and the following description is further detailed with reference to the accompanying drawings and specific embodiments.
An embodiment of the disclosure provides a feedback attention-enhanced non-modal image segmentation method that introduces feedback connections into a deep convolutional neural network for relearning and extracting higher-level semantic features, and designs a shallow convolutional spatial detail preserving network branch to capture rich spatial detail information, thereby improving non-modal image segmentation performance. Applied to a robot vision system, it helps the robot capture information about the whole of a target, not only about its visible part.
The embodiment of the disclosure provides a feedback attention-enhanced non-mode image segmentation method, which is divided into a training stage and a using stage, and the overall flow of the disclosed embodiment is shown in fig. 1, and comprises the following steps:
(1) The training stage comprises the following specific steps:
(1-1) Constructing a non-modal image segmentation network N_A with a bilateral network structure, comprising a feedback attention-enhanced backbone network branch (which extracts semantic information), a shallow convolutional spatial detail preserving network branch (which extracts detail information), and four feature fusion modules FFM (Feature Fusion Module) connecting the backbone branch and the detail preserving branch.
FIG. 2 is a block diagram of a non-modal image segmentation network according to an embodiment of the present disclosure, wherein the backbone network branch includes five convolution blocks, C1, C2, C3, C4, C5, connected in sequence, and three attention-optimized upsampling modules ARB (Attention Refinement Block, ARB) and feedback connections; five convolution blocks C1-C5 adopt ResNet networks, and the input of the convolution block C1 is an RGB image.
The spatial detail preserving network branch includes four convolution blocks and three Multi-scale attention feature modules MSAF, where the branch is marked as B2 for the first convolution block, B3 for the second convolution block, B4 for the third convolution block, and B5 for the fourth convolution block, where each convolution block contains two 3 x 3 convolution layers, the input of the B2 convolution block is connected to the output of the C1 convolution block, the first Multi-scale attention feature module MSAF (Multi-scale Attention Feature, MSAF) is located between the B3 convolution block and the B4 convolution block, the second MSAF module is located between the B4 convolution block and the B5 convolution block, and the third MSAF module is located after the B5 convolution block.
Finally, through a feature fusion module FFM, semantic information contained in each feature image output by a main network branch and detail information contained in each feature image output by a space detail retention network branch are aggregated in a soft selection mode to obtain a final output feature image of a non-mode image segmentation network with four different sizes of 1/4,1/8,1/16 and 1/32 of an original input RGB image, and multi-scale prediction is executed.
Further, the attention refinement module ARB (Attention Refinement Block) works according to the principle shown in fig. 3, and specifically comprises the following steps:
(1-1-1) After any feature map X ∈ R^{C×H×W} is input into the ARB module, where C denotes the number of channels, H the height and W the width of the feature map, a global average pooling function is used to output the global feature of the feature map;
(1-1-2) The output of step (1-1-1) is passed through a fully connected neural network and a Sigmoid activation function to calculate the channel attention vector of the feature map input in step (1-1-1);
(1-1-3) The channel attention vector obtained in step (1-1-2) is multiplied by the input feature map X to output the corresponding channel-weighted feature map Y;
(1-1-4) On the feature map Y obtained in step (1-1-3), spatial attention feature maps in the horizontal and vertical directions are calculated using pooling kernels of size (1, W) and (H, 1) together with Softmax functions, yielding a horizontal spatial attention feature map of size R^{C×H×1} and a vertical spatial attention map of size R^{C×1×W};
(1-1-5) The two feature maps output in step (1-1-4) are spliced along the channel dimension to obtain a spliced feature map of size R^{C×1×(H+W)}, and a convolution operation followed by a ReLU nonlinear activation function aggregates the features of the two directions;
(1-1-6) The feature map obtained in step (1-1-5) is transformed back into two feature maps of sizes R^{C×H×1} and R^{C×1×W}, and the output values of the two feature maps are then calculated with a Sigmoid function respectively;
(1-1-7) A matrix multiplication is performed between the feature map Y and the output values obtained in step (1-1-6) to obtain the feature map Z ∈ R^{C×H×W} finally output by the ARB module; this is a new feature map obtained by refining the original input feature map with the attention mechanism, and its size is consistent with that of the original input feature map;
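As an illustration of steps (1-1-1) to (1-1-7), a minimal PyTorch sketch of the ARB module follows. It is a sketch rather than the patented implementation: the batch dimension, the channel-reduction ratio of the fully connected layers, and the use of a 1×1 kernel for the aggregation convolution in step (1-1-5) are assumptions not fixed by the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ARB(nn.Module):
    """Attention Refinement Block: channel attention followed by
    horizontal/vertical spatial attention, steps (1-1-1) to (1-1-7)."""
    def __init__(self, channels, reduction=16):          # reduction ratio is an assumption
        super().__init__()
        # (1-1-2) fully connected layers + Sigmoid producing the channel attention vector
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid())
        # (1-1-5) convolution aggregating the two directions (1x1 kernel assumed)
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):                                  # x: (B, C, H, W)
        b, c, h, w = x.shape
        # (1-1-1) global average pooling -> one global feature per channel
        g = F.adaptive_avg_pool2d(x, 1).view(b, c)
        # (1-1-2)-(1-1-3) channel attention vector, applied to the input
        y = x * self.fc(g).view(b, c, 1, 1)
        # (1-1-4) directional pooling: (1, W) kernel -> (B,C,H,1); (H, 1) kernel -> (B,C,1,W)
        a_h = F.softmax(F.avg_pool2d(y, kernel_size=(1, w)), dim=2)
        a_w = F.softmax(F.avg_pool2d(y, kernel_size=(h, 1)), dim=3)
        # (1-1-5) splice along the spatial axis -> (B,C,1,H+W), then conv + ReLU
        s = F.relu(self.conv(torch.cat([a_h.permute(0, 1, 3, 2), a_w], dim=3)))
        # (1-1-6) split back into the two directions and apply Sigmoid
        s_h, s_w = torch.split(s, [h, w], dim=3)
        s_h = torch.sigmoid(s_h).permute(0, 1, 3, 2)       # (B, C, H, 1)
        s_w = torch.sigmoid(s_w)                           # (B, C, 1, W)
        # (1-1-7) weight Y by both attention maps -> Z with the input size
        return y * s_h * s_w
```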
Further, the multi-scale attention feature module MSAF (Multi-scale Attention Feature) works according to the principle shown in fig. 4, and specifically comprises the following steps:
(1-2-1) After any feature map F ∈ R^{C×H×W} is input into the MSAF module, convolution kernels of four different scales, 1×1, 3×3, 5×5 and 7×7, are used to extract features of different sizes from the input, the number of output channels at each kernel scale being halved, giving four output feature maps; here C denotes the number of channels, H the height and W the width of the feature map;
(1-2-2) The four output feature maps obtained in step (1-2-1) are spliced along the channel dimension to generate a spliced feature map of size R^{2C×H×W}, aggregating the multi-scale information of the feature maps at every scale;
(1-2-3) Two 1×1 convolution kernels are applied to the spliced feature map of step (1-2-2) for dimension reduction, giving the corresponding feature maps denoted U and V, with U, V ∈ R^{C′×H×W}, where C′ denotes the number of channels after dimension reduction; in this embodiment C′ = 128;
(1-2-4) The feature maps U and V of step (1-2-3) are reshaped into feature maps of size R^{C′×K}, where K = H×W, and the reshaped feature map U is transposed to obtain the updated feature map U ∈ R^{K×C′};
(1-2-5) A matrix multiplication is performed between U and V, and a Softmax function is then applied to obtain the spatial attention map N ∈ R^{K×K}:

N_ji = exp(U_i · V_j) / Σ_{i=1}^{K} exp(U_i · V_j)

where U_i denotes the pixel value of the i-th position in the feature map U, V_j denotes the pixel value of the j-th position in the feature map V, N_ji denotes the value in the j-th row and i-th column of the spatial attention map N and represents the degree of correlation between the two positions, and exp() denotes the exponential function with base e;
(1-2-6) The original input feature map F is converted into a feature map M ∈ R^{C′×H×W} by another 1×1 convolution kernel, after which the feature map M is reshaped into M ∈ R^{C′×K};
(1-2-7) A matrix multiplication is performed between the feature map M and the spatial attention map N, and the result is reshaped into a feature map D ∈ R^{C′×H×W};
(1-2-8) The feature map D obtained in (1-2-7) is transformed into D ∈ R^{C×H×W} by a 1×1 convolution and summed pixel by pixel with the feature map F through a skip connection to obtain the feature map E finally output by the MSAF module, which encodes the degree of correlation between different pixels of the original input feature map F;
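A minimal PyTorch sketch of steps (1-2-1) to (1-2-8) follows, with C′ = 128 as in this embodiment. The batch dimension, the softmax normalization axis, and the exact index convention of the multiplication with N (written here in the common non-local-attention style) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSAF(nn.Module):
    """Multi-Scale Attention Feature module: multi-scale convolutions followed by
    a non-local style spatial attention, steps (1-2-1) to (1-2-8)."""
    def __init__(self, channels, reduced=128):             # C' = 128 as in the embodiment
        super().__init__()
        # (1-2-1) four parallel convolutions, kernel sizes 1/3/5/7, C/2 channels each
        self.branches = nn.ModuleList(
            [nn.Conv2d(channels, channels // 2, k, padding=k // 2) for k in (1, 3, 5, 7)])
        # (1-2-3) two 1x1 convolutions reducing 2C -> C' for U and V
        self.conv_u = nn.Conv2d(2 * channels, reduced, 1)
        self.conv_v = nn.Conv2d(2 * channels, reduced, 1)
        # (1-2-6) another 1x1 convolution producing M from the input F
        self.conv_m = nn.Conv2d(channels, reduced, 1)
        # (1-2-8) 1x1 convolution restoring C' -> C before the residual sum
        self.conv_out = nn.Conv2d(reduced, channels, 1)

    def forward(self, f):                                   # f: (B, C, H, W)
        b, c, h, w = f.shape
        k = h * w
        # (1-2-1)-(1-2-2) multi-scale features spliced to 2C channels
        cat = torch.cat([branch(f) for branch in self.branches], dim=1)
        # (1-2-3)-(1-2-4) U: (B, K, C'), V: (B, C', K)
        u = self.conv_u(cat).view(b, -1, k).permute(0, 2, 1)
        v = self.conv_v(cat).view(b, -1, k)
        # (1-2-5) spatial attention map over the K x K position pairs
        n = F.softmax(torch.bmm(u, v), dim=1)
        # (1-2-6)-(1-2-7) M: (B, C', K); D reshaped back to (B, C', H, W)
        m = self.conv_m(f).view(b, -1, k)
        d = torch.bmm(m, n).view(b, -1, h, w)
        # (1-2-8) restore the channel count and add the skip connection with F
        return f + self.conv_out(d)
```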
Further, the feature fusion module FFM (Feature Fusion Module) works according to the principle shown in fig. 5, and specifically comprises the following steps:
(1-3-1) The two feature maps input to the FFM module are denoted P and Q, with P, Q ∈ R^{C×H×W}, where the feature map P is the output feature map of the spatial detail preserving network branch, the feature map Q is the output feature map of the feedback attention-enhanced backbone network branch, C denotes the number of channels, H the height and W the width of the feature maps; the feature maps P and Q are added pixel by pixel to obtain a summed feature map;
(1-3-2) The feature map output in step (1-3-1) is convolved successively with a 3×3 convolution kernel and a 1×1 convolution kernel and then fed into a Sigmoid nonlinear activation function, outputting the corresponding feature map;
(1-3-3) In parallel, the feature map output in step (1-3-1) is convolved with another 3×3 convolution kernel, outputting the corresponding feature map;
(1-3-4) A pixel-by-pixel multiplication is performed between the feature map output in step (1-3-2) and the feature map output in step (1-3-3), the resulting product feature map being denoted G ∈ R^{C×H×W};
(1-3-5) The feature map G output in step (1-3-4) is aggregated with the input feature maps P and Q to obtain the final output feature map of the FFM module, where μ and λ are trainable parameters and the aggregation uses a pixel-by-pixel multiplication operation.
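A minimal PyTorch sketch of steps (1-3-1) to (1-3-5) follows. Because the closing aggregation expression is not reproduced in the text, the last line assumes the form G + μ·P + λ·Q; that form and the zero initialization of μ and λ are assumptions.

```python
import torch
import torch.nn as nn

class FFM(nn.Module):
    """Feature Fusion Module fusing a detail-branch map P with a backbone map Q,
    steps (1-3-1) to (1-3-5). The final aggregation is an assumed form."""
    def __init__(self, channels):
        super().__init__()
        # (1-3-2) 3x3 conv -> 1x1 conv -> Sigmoid gate
        self.gate = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 1),
            nn.Sigmoid())
        # (1-3-3) parallel 3x3 convolution
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        # (1-3-5) trainable scalars mu and lambda (zero init assumed)
        self.mu = nn.Parameter(torch.zeros(1))
        self.lam = nn.Parameter(torch.zeros(1))

    def forward(self, p, q):                    # p: detail branch, q: backbone branch
        s = p + q                               # (1-3-1) pixel-wise sum
        g = self.gate(s) * self.conv(s)         # (1-3-2)-(1-3-4) gated feature map G
        # (1-3-5) assumed aggregation of G with weighted skip connections of P and Q
        return g + self.mu * p + self.lam * q
```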
(1-2) All trainable parameters in the N_A network are uniformly denoted θ_R (randomly initialized at the start and iteratively updated throughout training);
(1-3) Acquiring a training set. This embodiment uses the COCO-Amodal dataset to train the N_A network; COCO-Amodal is a dataset for non-modal image segmentation built by adding annotations of the occluded, invisible portions on top of the COCO dataset. The dataset consists of 5073 images, of which 2500, 1323 and 1250 are used for training, validation and testing, respectively. In this dataset, the non-modal segmentation mask is defined as the union of the mask of an object's visible pixels and the mask of its occluded pixels. COCO-Amodal annotations fall into two categories: "Things" and "Stuff", where "Things" are objects of interest such as a person, a car or a refrigerator, and "Stuff" generally refers to background such as grass or walls.
(1-4) Resizing the training set image. Transforming the size of each image in the training set of the COCO-Amodal dataset into height H=1024 and width W=800 by a bilinear interpolation method, and marking the adjusted training set as COCOA;
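A one-line PyTorch sketch of the bilinear resizing in step (1-4) is shown below; `image` stands for a hypothetical C×H×W image tensor.

```python
import torch.nn.functional as F

# Step (1-4): bilinearly resize each training image to height 1024, width 800.
resized = F.interpolate(image.unsqueeze(0), size=(1024, 800),
                        mode='bilinear', align_corners=False).squeeze(0)
```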
(1-5) Defining the loss function L_loss of the non-modal image segmentation network N_A, expressed as:
L_loss = L_cls + L_box + L_a_mask
where L_cls denotes the object classification error, L_box denotes the bounding-box regression error, and L_a_mask denotes the binary cross-entropy loss over the object's full amodal mask.
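A hedged PyTorch sketch of assembling this loss is shown below. The head outputs (`cls_logits`, `box_deltas`, `amodal_logits`) and their targets are hypothetical tensors, and the use of cross-entropy for L_cls and smooth-L1 for L_box is an assumption; only the binary cross-entropy for L_a_mask is stated above.

```python
import torch.nn.functional as F

# L_loss = L_cls + L_box + L_a_mask, per step (1-5)
l_cls = F.cross_entropy(cls_logits, cls_targets)                       # assumed classification loss
l_box = F.smooth_l1_loss(box_deltas, box_targets)                      # assumed box regression loss
l_a_mask = F.binary_cross_entropy_with_logits(amodal_logits,           # BCE over the full amodal mask
                                              amodal_mask_targets)
l_loss = l_cls + l_box + l_a_mask
```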
(1-6) Randomly selecting an RGB image from a training set COCOA, and setting a neural network training Batch size batch_size=1 in the embodiment;
(1-7) The RGB image selected in step (1-6) is input into the convolution block C1, which has a 7×7 convolution kernel, a stride of 2 and 64 output channels, giving a feature map whose resolution is 1/2 of the original input image;
(1-8) A maximum pooling operation with a 3×3 pooling kernel and a stride of 2 is applied to the output feature map of step (1-7), and the max-pooled feature map is then processed in turn by the four convolution blocks C2, C3, C4 and C5, where each convolution layer in every convolution block is immediately followed by a batch normalization layer (BN) and a nonlinear ReLU activation function; the sizes of the feature maps output by C1 to C5 are successively 1/2, 1/4, 1/8, 1/16 and 1/32 of the original input RGB image, and their channel numbers are successively 64, 256, 512, 1024 and 2048;
(1-9) The number of channels of the feature maps output by the convolution blocks C2, C3, C4 and C5 is converted to 256 by 1×1 convolutions, and through skip connections, bilinear interpolation and the corresponding attention refinement modules ARB, four feature maps of different sizes corresponding to the input RGB image are obtained for the first-round extraction of the backbone branch, denoted P_2, P_3, P_4, P_5 (1/4, 1/8, 1/16 and 1/32 of the original input RGB image, respectively) and abbreviated P_i, i = 2, …, 5. Specifically, the output feature map of the C5 convolution block is first reduced by a 1×1 convolution to obtain the feature map P_5; P_5 is fed into the first ARB module and the resolution of its output is doubled by bilinear interpolation, giving the enlarged feature map corresponding to P_5; the output feature map of the C4 convolution block is then reduced by a 1×1 convolution and summed pixel by pixel with the enlarged feature map corresponding to P_5 to obtain the feature map P_4; the remaining feature maps P_3 and P_2 are obtained in the same way as P_4 through the second and third ARB modules, respectively;
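Step (1-9) can be sketched as the following top-down pathway, continuing the PyTorch sketches above; `c2` to `c5` (backbone outputs), `lat2` to `lat5` (1×1 convolutions to 256 channels) and `arb1` to `arb3` (ARB instances) are hypothetical names.

```python
import torch.nn.functional as F

# Top-down pathway of step (1-9): lateral 1x1 convs, ARB refinement, bilinear upsampling.
p5 = lat5(c5)
p4 = lat4(c4) + F.interpolate(arb1(p5), scale_factor=2, mode='bilinear', align_corners=False)
p3 = lat3(c3) + F.interpolate(arb2(p4), scale_factor=2, mode='bilinear', align_corners=False)
p2 = lat2(c2) + F.interpolate(arb3(p3), scale_factor=2, mode='bilinear', align_corners=False)
```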
(1-10) The output feature map P_2 of step (1-9) is connected back to C2 through the feedback connection: the feature map output by the C2 convolution block in step (1-8) is summed pixel by pixel with the feature map P_2, the summed result is taken as the new input feature map of the C3 convolution block, the convolution operations from C3 to C5 are performed again, and step (1-9) is repeated to obtain the image feature maps extracted in the second round, denoted P′_2, P′_3, P′_4, P′_5 (1/4, 1/8, 1/16 and 1/32 of the original input RGB image, respectively) and abbreviated P′_i, i = 2, …, 5;
(1-11) The output feature maps O_i of the feedback attention-enhanced backbone network are calculated as:
O_i = P_i + β_i · P′_i
where β_i is one of the parameters θ_R, initialized to 0 and progressively learning to assign more weight during training, i = 2, …, 5;
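A short sketch of this feedback fusion follows; `first_round` and `second_round` are hypothetical lists holding P_i and P′_i for i = 2, …, 5.

```python
import torch
import torch.nn as nn

# One learnable beta_i per scale, initialised to 0 as in step (1-11).
beta = nn.Parameter(torch.zeros(4))
# O_i = P_i + beta_i * P'_i for i = 2..5
outputs = [p + b * p2 for p, p2, b in zip(first_round, second_round, beta)]
```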
(1-12) The output feature map of the convolution block C1 in step (1-7) is passed in turn through each module of the spatial detail preserving network branch. The four convolution blocks B2, B3, B4 and B5 of this branch each contain two 3×3 convolution layers, the number of channels of the feature maps output by B2 to B5 is uniformly set to 256, every convolution layer in each convolution block is followed by a batch normalization layer (BN) and a nonlinear ReLU activation function, and each convolution block also contains a pooling layer used to reduce the size of its output feature map. After the output feature map of the B2 convolution block passes through one convolution block and one MSAF module of this branch, the convolution block outputs a feature map whose resolution is 1/2 of that of its input, and the MSAF module then outputs a final feature map encoding the degree of correlation between different pixels of the feature map fed into it. The spatial detail preserving branch takes the feature map output by the B2 convolution block and the three final output feature maps of different sizes produced by the first to third MSAF modules as its outputs, denoted R_2, R_3, R_4, R_5 and abbreviated R_i, i = 2, …, 5; the sizes of these four feature maps are 1/4, 1/8, 1/16 and 1/32 of the RGB image originally input to C1, respectively;
the resolution and number of channels of the feature map R_i are consistent with those of the feature map O_i;
(1-13) The output feature maps O_i of step (1-11) and the output feature maps R_i of step (1-12) are aggregated by the corresponding FFM modules in a soft-selection manner: O_2 and R_2 are aggregated in the first FFM module, O_3 and R_3 in the second, O_4 and R_4 in the third, and O_5 and R_5 in the fourth, yielding four output feature maps of different sizes, 1/4, 1/8, 1/16 and 1/32 of the original input RGB image, on which multi-scale prediction is performed;
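Continuing the sketch, step (1-13) amounts to one FFM per scale; `ffm2` to `ffm5`, `backbone_outs` (the O_i) and `detail_outs` (the R_i) are hypothetical names.

```python
# Step (1-13): fuse detail-branch and backbone outputs scale by scale (detail map first).
ffms = [ffm2, ffm3, ffm4, ffm5]
fused = [ffm(r, o) for ffm, r, o in zip(ffms, detail_outs, backbone_outs)]
# The fused maps have 1/4, 1/8, 1/16 and 1/32 of the input resolution and feed multi-scale prediction.
```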
(1-14) Calculating the gradient ∂L_loss/∂θ_R of the loss function L_loss with respect to the parameters θ_R;
(1-15) Updating the parameters of the network by stochastic gradient descent, θ_R ← θ_R − α · ∂L_loss/∂θ_R, where α is the learning rate of the network, typically set to 0.0001;
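Steps (1-14) and (1-15) correspond to a standard SGD step, sketched below; `model` (the N_A network) and `l_loss` (from the loss sketch above) are hypothetical names.

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)   # alpha = 0.0001

optimizer.zero_grad()
l_loss.backward()    # step (1-14): gradient of L_loss w.r.t. theta_R
optimizer.step()     # step (1-15): theta_R <- theta_R - alpha * gradient
```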
(1-16) Steps (1-6) to (1-15) are repeated and the parameters θ_R of the network are continuously updated; one pass over all images of the training set is counted as one iteration, and training continues until the preset number of iterations is reached. In this embodiment the number of iterations is set to epochs = 18, after which the trained non-modal image segmentation network is obtained;
(2) A use stage; the method comprises the following specific steps:
(2-1) transplanting the non-mode image segmentation network trained in the step (1) to a robot system (no special requirement is imposed on the model of the robot); placing the robot in any location in a selected real environment (e.g., kitchen, living room, etc.);
(2-2) The robot captures images with its camera and feeds the acquired images into the trained non-modal image segmentation network; the network outputs the shapes, boundaries and semantic categories of both the visible and the occluded parts of every identified object (such as a refrigerator) in the image, i.e. the non-modal image segmentation result of the input image. Fig. 6 shows a non-modal image segmentation result in an embodiment of the present disclosure: for the "refrigerator" in the image, the robot recognizes not only the visible pixels of the refrigerator but also the shape of its occluded part, thereby perceiving its entire extent.
In order to achieve the above embodiments, a second aspect of the present disclosure provides a feedback attention-enhancing non-mode image segmentation apparatus, including: the acquisition module is used for acquiring RGB images of the real environment; the prediction module is used for carrying out multi-scale prediction on the RGB image according to a preset non-mode image segmentation network, wherein the preset non-mode image segmentation network outputs the shape, boundary and semantic category of the visible part and the blocked part of the identified object in the image so as to obtain a non-mode image segmentation result of the input image.
In order to achieve the above embodiments, an embodiment of a third aspect of the present disclosure proposes an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being configured to perform a feedback attention enhanced non-mode image segmentation method as described above.
In order to achieve the above-described embodiments, a fourth aspect of the present disclosure proposes a computer-readable storage medium having stored thereon a computer program for execution by a processor for performing a feedback attention-enhancing non-mode image segmentation method of the above-described embodiments.
It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform a feedback attention enhanced non-mode image segmentation method of the above embodiments.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including an object oriented programming language such as Java, smalltalk, C ++ and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present application, the meaning of "plurality" means at least two, for example, two, three, etc., unless specifically defined otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and further implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., a ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium may even be paper or other suitable medium upon which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.
It is to be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, may be implemented using any one or combination of the following techniques, as is well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application specific integrated circuits having suitable combinational logic gates, programmable Gate Arrays (PGAs), field Programmable Gate Arrays (FPGAs), and the like.
Those of ordinary skill in the art will appreciate that all or part of the steps carried out in the method of the above-described embodiments may be implemented by a program to instruct related hardware, and the program may be stored in a computer readable storage medium, where the program when executed includes one or a combination of the steps of the method embodiments.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a computer readable storage medium if implemented as software functional modules and sold or used as a stand-alone product.
The above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, or the like. While embodiments of the present application have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the application, and that variations, modifications, alternatives and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the application.

Claims (8)

1. A method of feedback attention-enhancing non-modal image segmentation, comprising:
Acquiring an RGB image of a real environment;
performing multi-scale prediction on the RGB image according to a preset non-mode image segmentation network, wherein the preset non-mode image segmentation network outputs the shape, boundary and semantic category of the visible part and the blocked part of the identified object in the image so as to obtain a non-mode image segmentation result of the input image;
Wherein the non-mode image segmentation network comprises a main network branch with feedback attention enhancement and a space detail preservation network branch;
The main network branch comprises five convolution blocks which are respectively C1, C2, C3, C4 and C5, and three attention-optimized up-sampling ARB modules and feedback connection which are connected in sequence; five convolution blocks C1-C5 adopt ResNet networks;
The spatial detail preserving network branch comprises four convolution blocks and three multi-scale attention feature MSAF modules; wherein the first convolution block is denoted as B2, the second convolution block is denoted as B3, the third convolution block is denoted as B4, the fourth convolution block is denoted as B5, each convolution block comprises two 3 x3 convolution layers, the input of the B2 convolution block is connected with the output of the C1 convolution block, the first MSAF module is located between the B3 convolution block and the B4 convolution block, the second MSAF module is located between the B4 convolution block and the B5 convolution block, and the third MSAF module is located after the B5 convolution block;
the main network branches and the detail reservation network branches are connected through four feature fusion modules FFM.
2. The method of claim 1, further comprising, prior to said multi-scale prediction of said RGB image according to a predetermined non-mode image segmentation network:
Training the non-modal image segmentation network;
wherein said training said non-modal image segmentation network comprises:
Constructing a non-mode image segmentation network with a bilateral network structure;
acquiring a training set, wherein the training set comprises an RGB image and a non-mode segmentation mask;
inputting the training set into the non-mode image segmentation network, and training the non-mode image segmentation network by optimizing parameters in the network through a random gradient descent method, wherein when the iteration number reaches the upper limit, the non-mode image segmentation network is trained.
3. The method of claim 1, wherein the ARB module workflow is as follows:
(1) After any feature map X epsilon R C×H×W is input into the ARB module, wherein C represents the channel number of the feature map, H represents the height of the feature map, W represents the width of the feature map, and the global feature of the feature map is output by using a global average pooling function;
(2) The output of the step (1) is activated by a fully connected neural network and Sigmoid, and the channel attention vector of the feature map input in the step (1) is calculated;
(3) Multiplying the channel attention vector obtained in the step (2) by the input feature map X, and outputting a corresponding channel weighted feature map Y;
(4) Calculating spatial attention feature maps in the horizontal and vertical directions for the feature map Y obtained in the step (3) by using pooling kernels of sizes (1, W) and (H, 1) together with Softmax functions, so as to obtain a horizontal-direction spatial attention feature map of size R^{C×H×1} and a vertical-direction spatial attention feature map of size R^{C×1×W}, respectively;
(5) Performing a splicing operation on the horizontal-direction and vertical-direction spatial attention feature maps output in the step (4) to obtain a spliced feature map of size R^{C×1×(H+W)}, and aggregating the features in the two directions by applying a convolution operation and a ReLU nonlinear activation function to the spliced feature map;
(6) Transforming the feature map obtained in the step (5) back into two feature maps of sizes R^{C×H×1} and R^{C×1×W}, and then calculating the output values of the two feature maps by using a Sigmoid function respectively;
(7) Performing a matrix multiplication operation on the feature map Y obtained in the step (3) and the output values obtained in the step (6) to obtain the feature map Z ∈ R^{C×H×W} finally output by the ARB module.
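
The ARB workflow of claim 3 maps naturally onto a small PyTorch module; the following sketch is illustrative only — the reduction ratio of the fully connected layers, the use of average pooling for the (1, W) and (H, 1) kernels, and the 1×1 aggregation convolution are assumptions the claim does not fix.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ARB(nn.Module):
    """Sketch of the attention-optimized up-sampling (ARB) workflow of claim 3."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        # steps (1)-(2): global average pooling, fully connected layers, Sigmoid
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )
        # step (5): convolution that aggregates the two pooled directions (1x1 assumed)
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):                                             # x: (B, C, H, W)
        b, c, h, w = x.shape
        # (1)-(3): channel attention vector and channel-weighted map Y
        ca = self.fc(F.adaptive_avg_pool2d(x, 1).view(b, c)).view(b, c, 1, 1)
        y = x * ca
        # (4): directional attention with (1, W) and (H, 1) pooling plus Softmax
        ah = F.softmax(F.avg_pool2d(y, kernel_size=(1, w)), dim=2)    # (B, C, H, 1)
        av = F.softmax(F.avg_pool2d(y, kernel_size=(h, 1)), dim=3)    # (B, C, 1, W)
        # (5): splice to (B, C, 1, H+W), then convolution + ReLU
        cat = F.relu(self.conv(torch.cat([ah.permute(0, 1, 3, 2), av], dim=3)))
        # (6): split back into the two directions and apply Sigmoid
        sh, sv = torch.split(cat, [h, w], dim=3)
        sh = torch.sigmoid(sh.permute(0, 1, 3, 2))                    # (B, C, H, 1)
        sv = torch.sigmoid(sv)                                        # (B, C, 1, W)
        # (7): re-weight Y in both directions to obtain the output Z
        return y * sh * sv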
4. The method of claim 1, wherein the MSAF module workflow is as follows:
(1) After any feature map F ∈ R^{C×H×W} is input into the MSAF module, where C represents the number of channels of the feature map, H represents the height of the feature map and W represents the width of the feature map, features of different sizes in the input image are extracted by four convolution kernels of different scales, the number of channels of the output feature map at each convolution kernel scale being halved, so as to obtain four output feature maps;
(2) Performing a splicing operation on the four output feature maps obtained in the step (1) in the channel dimension to generate a spliced feature map of size R^{2C×H×W};
(3) Performing a convolution dimension-reduction operation on the spliced feature map in the step (2) by using two 1×1 convolution kernels to obtain corresponding feature maps, denoted U and V respectively, where U, V ∈ R^{C′×H×W} and C′ represents the number of channels after dimension reduction;
(4) Transforming the feature maps U and V in the step (3) into feature maps of size R^{C′×K} respectively, where K = H×W, and transposing the transformed feature map U to obtain an updated feature map U ∈ R^{K×C′};
(5) Performing a matrix multiplication between the feature maps U and V, and then applying a Softmax function to obtain a spatial attention map N ∈ R^{K×K}:
N_{ji} = exp(U_i · V_j) / Σ_{i=1}^{K} exp(U_i · V_j)
wherein U_i represents the pixel value at the i-th position in the feature map U, V_j represents the pixel value at the j-th position in the feature map V, N_{ji} denotes the pixel value in the j-th row and i-th column of the spatial attention map N, and exp() denotes the exponential function with the natural constant e as its base;
(6) Converting the feature map F into a feature map M ∈ R^{C′×H×W} by using another 1×1 convolution kernel, and then transforming the dimension of the feature map M so that M ∈ R^{C′×K};
(7) Performing a matrix multiplication between the feature map M and the spatial attention map N, and transforming the result into a feature map D ∈ R^{C′×H×W};
(8) Transforming the feature map D obtained in the step (7) into D ∈ R^{C×H×W} by using a 1×1 convolution, and summing it pixel by pixel with the feature map F through a skip connection to obtain the feature map E finally output by the MSAF module.
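
The MSAF workflow of claim 4 can likewise be sketched as a PyTorch module; the four kernel sizes (1, 3, 5, 7), the reduced channel count C′ = C/4 and the Softmax normalization axis are illustrative assumptions the claim leaves open.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MSAF(nn.Module):
    """Sketch of the multi-scale attention feature (MSAF) workflow of claim 4."""
    def __init__(self, channels):
        super().__init__()
        half, reduced = channels // 2, channels // 4
        # step (1): four convolutions at different scales, each halving the channel count
        self.scales = nn.ModuleList(
            [nn.Conv2d(channels, half, k, padding=k // 2) for k in (1, 3, 5, 7)]
        )
        # step (3): two 1x1 dimension-reduction convolutions producing U and V
        self.to_u = nn.Conv2d(2 * channels, reduced, 1)
        self.to_v = nn.Conv2d(2 * channels, reduced, 1)
        # step (6): 1x1 convolution producing M from the original input F
        self.to_m = nn.Conv2d(channels, reduced, 1)
        # step (8): 1x1 convolution restoring the original channel count
        self.restore = nn.Conv2d(reduced, channels, 1)

    def forward(self, f):                                    # f: (B, C, H, W)
        b, c, h, w = f.shape
        k = h * w
        # (1)-(2): multi-scale extraction and channel-wise splicing, size (B, 2C, H, W)
        s = torch.cat([conv(f) for conv in self.scales], dim=1)
        # (3)-(4): reduce to C' channels, flatten to K = H*W, transpose U
        u = self.to_u(s).view(b, -1, k).permute(0, 2, 1)     # (B, K, C')
        v = self.to_v(s).view(b, -1, k)                      # (B, C', K)
        # (5): spatial attention map N = Softmax(U V), size (B, K, K)
        n = F.softmax(torch.bmm(u, v), dim=-1)
        # (6)-(7): project F to M, attend, and reshape back to (B, C', H, W)
        m = self.to_m(f).view(b, -1, k)
        d = torch.bmm(m, n).view(b, -1, h, w)
        # (8): restore the channel count and add the skip connection to F
        return self.restore(d) + f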
5. The method of claim 1, wherein the FFM module workflow is as follows:
(1) Denoting the two feature maps input into the FFM module as P and Q, with P, Q ∈ R^{C×H×W}, where the feature map P represents an output feature map of the spatial detail preservation network branch, the feature map Q represents an output feature map of the feedback attention-enhanced main network branch, C represents the number of feature map channels, H represents the feature map height and W represents the feature map width, and performing a pixel-by-pixel addition operation on the feature maps P and Q to obtain an added feature map;
(2) Passing the feature map output in the step (1) sequentially through a convolution operation with a 3×3 convolution kernel and a 1×1 convolution kernel, then inputting the result into a Sigmoid nonlinear activation function and outputting a corresponding feature map;
(3) In parallel, performing a convolution operation on the feature map output in the step (1) through another 3×3 convolution kernel, and outputting a corresponding feature map;
(4) Performing a pixel-by-pixel multiplication operation on the feature map output in the step (2) and the feature map output in the step (3) to obtain a multiplied feature map, denoted G ∈ R^{C×H×W};
(5) Respectively performing feature aggregation of the feature map G output in the step (4) with the input feature maps P and Q to obtain the final output feature map A ∈ R^{C×H×W} of the FFM module, expressed as:
A = μ·(G ⊗ P) + λ·(G ⊗ Q)
wherein μ and λ represent trainable parameters and ⊗ represents a pixel-by-pixel multiplication operation.
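
A compact sketch of the FFM workflow of claim 5 follows; steps (1)-(4) follow the claim directly, while the final aggregation in step (5), A = μ·(G ⊗ P) + λ·(G ⊗ Q), and the initialization of μ and λ are assumptions.

import torch
import torch.nn as nn

class FFM(nn.Module):
    """Sketch of the feature fusion module (FFM) workflow of claim 5."""
    def __init__(self, channels):
        super().__init__()
        # step (2): 3x3 convolution -> 1x1 convolution -> Sigmoid gate
        self.gate = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 1),
            nn.Sigmoid(),
        )
        # step (3): parallel 3x3 convolution on the same summed feature map
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        # step (5): trainable aggregation weights mu and lambda (initialization assumed)
        self.mu = nn.Parameter(torch.ones(1))
        self.lam = nn.Parameter(torch.ones(1))

    def forward(self, p, q):                 # p: detail branch, q: main branch
        s = p + q                            # (1) pixel-by-pixel addition
        g = self.gate(s) * self.conv(s)      # (2)-(4) gated feature map G
        # (5) assumed aggregation of G with the two inputs
        return self.mu * (g * p) + self.lam * (g * q)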
6. A feedback attention-enhanced non-mode image segmentation apparatus, comprising:
the acquisition module is used for acquiring RGB images of the real environment;
the prediction module is used for performing multi-scale prediction on the RGB image according to a preset non-mode image segmentation network, wherein the preset non-mode image segmentation network outputs the shape, boundary and semantic category of both the visible part and the occluded part of each identified object in the image, so as to obtain a non-mode image segmentation result of the input image;
wherein the non-mode image segmentation network comprises a feedback attention-enhanced main network branch and a spatial detail preservation network branch;
the main network branch comprises five convolution blocks, denoted C1, C2, C3, C4 and C5, three sequentially connected attention-optimized up-sampling ARB modules, and a feedback connection; the five convolution blocks C1-C5 adopt a ResNet network;
the spatial detail preservation network branch comprises four convolution blocks and three multi-scale attention feature MSAF modules; wherein the first convolution block is denoted as B2, the second convolution block is denoted as B3, the third convolution block is denoted as B4, and the fourth convolution block is denoted as B5; each convolution block comprises two 3×3 convolution layers; the input of the B2 convolution block is connected with the output of the C1 convolution block, the first MSAF module is located between the B3 convolution block and the B4 convolution block, the second MSAF module is located between the B4 convolution block and the B5 convolution block, and the third MSAF module is located after the B5 convolution block;
the main network branch and the spatial detail preservation network branch are connected through four feature fusion modules FFM.
7. An electronic device, comprising:
At least one processor; and a memory communicatively coupled to the at least one processor;
wherein the memory stores instructions executable by the at least one processor, the instructions being configured to perform the method of any one of claims 1-5.
8. A computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-5.
CN202110732029.9A 2021-06-29 2021-06-29 Feedback attention-enhanced non-mode image segmentation method and device Active CN113516670B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110732029.9A CN113516670B (en) 2021-06-29 2021-06-29 Feedback attention-enhanced non-mode image segmentation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110732029.9A CN113516670B (en) 2021-06-29 2021-06-29 Feedback attention-enhanced non-mode image segmentation method and device

Publications (2)

Publication Number Publication Date
CN113516670A CN113516670A (en) 2021-10-19
CN113516670B true CN113516670B (en) 2024-06-25

Family

ID=78066461

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110732029.9A Active CN113516670B (en) 2021-06-29 2021-06-29 Feedback attention-enhanced non-mode image segmentation method and device

Country Status (1)

Country Link
CN (1) CN113516670B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114612987A (en) * 2022-03-17 2022-06-10 深圳集智数字科技有限公司 Expression recognition method and device

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107784654B (en) * 2016-08-26 2020-09-25 杭州海康威视数字技术股份有限公司 Image segmentation method and device and full convolution network system
US20190258925A1 (en) * 2018-02-20 2019-08-22 Adobe Inc. Performing attribute-aware based tasks via an attention-controlled neural network
CN109741331B (en) * 2018-12-24 2021-10-26 北京航空航天大学 Image foreground object segmentation method
US11461998B2 (en) * 2019-09-25 2022-10-04 Samsung Electronics Co., Ltd. System and method for boundary aware semantic segmentation
CN111127493A (en) * 2019-11-12 2020-05-08 中国矿业大学 Remote sensing image semantic segmentation method based on attention multi-scale feature fusion
CN111462126B (en) * 2020-04-08 2022-10-11 武汉大学 Semantic image segmentation method and system based on edge enhancement
CN112949565B (en) * 2021-03-25 2022-06-03 重庆邮电大学 Single-sample partially-shielded face recognition method and system based on attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Non-mode instance segmentation based on feedback attention mechanism and context fusion; Dong Junjie et al.; CAAI Transactions on Intelligent Systems (智能系统学报); 2021-05-07; Vol. 16, No. 4; pp. 801-810 *

Also Published As

Publication number Publication date
CN113516670A (en) 2021-10-19

Similar Documents

Publication Publication Date Title
US11853894B2 (en) Meta-learning for multi-task learning for neural networks
US20220215227A1 (en) Neural Architecture Search Method, Image Processing Method And Apparatus, And Storage Medium
US9111375B2 (en) Evaluation of three-dimensional scenes using two-dimensional representations
CN110956126B (en) Small target detection method combined with super-resolution reconstruction
CN112232355B (en) Image segmentation network processing method, image segmentation device and computer equipment
AU2021354030B2 (en) Processing images using self-attention based neural networks
US20220157046A1 (en) Image Classification Method And Apparatus
CN111882031A (en) Neural network distillation method and device
US11568212B2 (en) Techniques for understanding how trained neural networks operate
CN113191489B (en) Training method of binary neural network model, image processing method and device
KR20190126857A (en) Detect and Represent Objects in Images
CN108334878A (en) Video images detection method and apparatus
CN112560639B (en) Face key point number conversion method, system, electronic equipment and storage medium
CN113627163A (en) Attention model, feature extraction method and related device
CN114359289A (en) Image processing method and related device
CN114925320A (en) Data processing method and related device
CN114359631A (en) Target classification and positioning method based on coding-decoding weak supervision network model
CN116844032A (en) Target detection and identification method, device, equipment and medium in marine environment
CN113158970B (en) Action identification method and system based on fast and slow dual-flow graph convolutional neural network
CN113516670B (en) Feedback attention-enhanced non-mode image segmentation method and device
CN115601551A (en) Object identification method and device, storage medium and electronic equipment
Ankireddy Assistive diagnostic tool for brain tumor detection using computer vision
EP4250179A1 (en) Method and device with inference-based differential consideration
Hasan Real-Time Classification of Traffic Signs with Deep Learning
CN112990215B (en) Image denoising method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant