CN113298814A - Indoor scene image processing method based on progressive guidance fusion complementary network - Google Patents

Indoor scene image processing method based on progressive guidance fusion complementary network

Info

Publication number: CN113298814A
Application number: CN202110557921.8A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: module, convolution, output, input, layer
Inventors: 周武杰, 杨恩泉, 叶宁, 雷景生, 万健, 甘兴利, 钱小鸿, 许彩娥, 强芳芳
Applicant/Assignee: Zhejiang Lover Health Science and Technology Development Co Ltd
Legal status: Withdrawn

Classifications

    • G06T7/10: Image analysis; Segmentation; Edge detection
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/253: Fusion techniques of extracted features
    • G06N3/045: Neural networks; Combinations of networks
    • G06N3/047: Probabilistic or stochastic networks
    • G06N3/048: Activation functions
    • G06N3/08: Learning methods
    • G06T9/002: Image coding using neural networks
    • G06T2207/10024: Color image
    • G06T2207/20081: Training; Learning
    • G06T2207/20084: Artificial neural networks [ANN]
    • G06T2207/20221: Image fusion; Image merging


Abstract

The invention discloses an indoor scene image processing method based on a progressive guidance fusion complementary network. The method comprises a training stage and a testing stage. 1_1, selecting original indoor scene images, the corresponding depth images and the real semantic understanding images to form a training set; 1_2, constructing a convolutional neural network; 1_3, performing data enhancement on the training set to obtain initial input image pairs, and inputting them into the convolutional neural network for processing to obtain the corresponding semantic understanding prediction images; 1_4, calculating a loss function value between each semantic understanding prediction image and the corresponding real semantic understanding image; 1_5, repeatedly executing 1_3 and 1_4 to obtain a convolutional neural network classification training model. In the testing stage, the indoor scene image to be semantically understood and its corresponding depth image are input into the convolutional neural network classification training model to obtain the corresponding predicted semantic understanding image. The invention can effectively reduce the influence of noisy depth measurements, better complement information between the two modalities, and complete feature extraction.

Description

Indoor scene image processing method based on progressive guidance fusion complementary network
Technical Field
The invention relates to a deep learning image processing method, and in particular to an indoor scene image processing method based on a progressive guidance fusion complementary network.
Background
With the development of artificial intelligence and computer vision, semantic understanding has found more and more applications in society, and clear visual scene understanding is one of the most remarkable capabilities of the human brain. To model this capability, semantic understanding aims to assign a class label to each pixel of an image according to the semantics of the image. This problem is one of the most challenging tasks in computer vision and has attracted a great deal of attention in the computer vision community.
Currently, the most common semantic understanding methods include support vector machines, random forests, and other algorithms. These algorithms focus primarily on binary tasks for detecting and identifying specific objects, such as indoor floors, tables and beds. Such traditional machine learning methods usually rely on hand-crafted, high-complexity features, whereas deep learning performs semantic understanding of traffic scenes simply and conveniently; more importantly, Convolutional Neural Networks (CNNs) have recently made breakthroughs in various classification tasks (such as semantic segmentation). CNNs have been shown to be powerful visual models that can produce hierarchies of features. The key to the success of this model is mainly its general modeling capability for complex visual scenes.
A deep learning semantic understanding method directly performs end-to-end, pixel-level semantic understanding; prediction on a test set only requires inputting the training images into a model framework and training it to obtain the weights and the model. The power of the convolutional neural network lies in its multi-layer structure, which can automatically learn features at multiple levels. At present, deep learning semantic understanding methods fall into two types. The first is an encoding-decoding architecture: in the encoding process, position information is gradually reduced and abstract features are extracted through pooling layers, while the decoding process gradually recovers the position information, and there is typically a direct connection between decoding and encoding. The second framework is dilated (atrous) convolution, which expands the receptive field without pooling: a smaller dilation rate gives a smaller receptive field and learns specific features of local parts, while a larger dilation rate gives a larger receptive field and learns more abstract features, which are more robust to the size, position and orientation of an object. Of course, there are also multi-scale prediction, feature mixing and other schemes. In addition, the depth image contains rich spatial structure information; it complements RGB information on many visual tasks. Supplementing the appearance information (i.e., RGB) with depth can improve semantic understanding performance, because the depth channel carries information complementary to the RGB channels and encodes the structural information of the scene. The depth channel can easily be captured with a low-cost RGB-D sensor. In general, objects can be identified based on their color and texture properties.
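For the second framework, a minimal PyTorch sketch (illustrative only, not part of the invention) shows how the dilation rate of a 3 × 3 convolution enlarges the receptive field while the output size stays unchanged:

```python
# Illustrative only: two 3x3 convolutions whose dilation rates give different
# receptive fields (3x3 for dilation=1, 5x5 for dilation=2) at the same output size.
import torch
import torch.nn as nn

x = torch.randn(1, 64, 60, 80)                                      # batch, channels, H, W
conv_d1 = nn.Conv2d(64, 64, kernel_size=3, padding=1, dilation=1)   # smaller receptive field
conv_d2 = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)   # larger receptive field
print(conv_d1(x).shape, conv_d2(x).shape)                           # both torch.Size([1, 64, 60, 80])
```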
Most existing semantic understanding methods adopt deep learning and build large models from stacks of convolutional and pooling layers. However, the feature maps obtained by simply using pooling and convolution operations are monotonous and not representative, so the feature information extracted from the image is reduced. In addition, the quality of the depth map is low, and RGB and depth data exhibit different characteristics. How to effectively identify the differences between the two types of information, unify them into an effective semantic understanding representation, and decode and restore the information with the highest quality remains an open problem; as a result, the restored information is coarse and the segmentation accuracy is low.
Disclosure of Invention
The technical problem to be solved by the invention is to provide an indoor scene image processing method based on a progressive guidance fusion complementary network, in which the encoding fully utilizes the RGB appearance detail information and the spatial structure information of the depth map, the decoding is performed with high quality, and the segmentation is efficient and accurate.
In order to solve the technical problems, the invention adopts the following technical scheme:
the method comprises a training stage and a testing stage;
the specific steps of the training phase process are as follows:
step 1_1: selecting Q original indoor scene images, the depth image corresponding to each original indoor scene image and the corresponding real semantic understanding images, and forming a training set; converting each real semantic understanding image into a plurality of one-hot encoded images by using a one-hot encoding method;
step 1_2: constructing a convolutional neural network.
Step 1_ 3: respectively performing data enhancement on each original indoor scene image and the corresponding depth image in the training set to obtain an original indoor scene image and a corresponding depth image after data enhancement, and taking the original indoor scene image and the corresponding depth image as an initial input image pair, and inputting the initial input image into a convolutional neural network for processing to obtain a plurality of pairs of semantic understanding prediction images corresponding to each original indoor scene image in the training set;
step 1_4: calculating a loss function value between the set formed by the plurality of semantic understanding prediction images corresponding to each original indoor scene image and the set formed by the plurality of one-hot encoded images obtained from the corresponding real semantic understanding image;
step 1_5: repeatedly executing step 1_3 and step 1_4 for V times, wherein V is greater than 1; after training is finished, a convolutional neural network classification training model is obtained, and Q multiplied by V loss function values are obtained in total; the minimum loss function value is then found among the Q multiplied by V loss function values, and the weight vector and the bias term corresponding to the minimum loss function value are correspondingly taken as the optimal weight vector and the optimal bias term of the convolutional neural network classification training model;
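A hedged PyTorch sketch of the training schedule in steps 1_3 to 1_5 is given below; the model, data loader, loss criterion and optimizer are illustrative placeholders, not part of the patented method.

```python
# Sketch only: repeat steps 1_3 and 1_4 V times and keep the weights/bias terms
# that correspond to the minimum loss function value (step 1_5).
import copy

def train(model, train_loader, criterion, optimizer, V):
    best_loss, best_state = float("inf"), None
    for _ in range(V):                                  # V > 1 passes over the training set
        for rgb, depth, label in train_loader:          # data-enhanced initial input pairs
            pred = model(rgb, depth)                    # semantic understanding prediction
            loss = criterion(pred, label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if loss.item() < best_loss:                 # track the minimum loss value
                best_loss = loss.item()
                best_state = copy.deepcopy(model.state_dict())
    return best_state                                   # optimal weight vector and bias terms
```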
the test stage process comprises the following specific steps:
step 2_1: recording the indoor scene image to be semantically understood as {I(i,j)}, wherein i and j respectively represent the horizontal and vertical coordinates of a pixel point, and I(i,j) represents the pixel value of the pixel point with coordinate position (i,j) in the indoor scene image to be semantically understood;
step 2_2: inputting the indoor scene image to be semantically understood and the corresponding depth image into the convolutional neural network classification training model provided with the optimal weight vector and the optimal bias term to obtain the predicted semantic understanding image corresponding to the indoor scene image to be semantically understood, which is recorded as {I_pre(i,j)}, wherein I_pre(i,j) represents the pixel value of the pixel point with coordinate position (i,j) in the predicted semantic understanding image.
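A matching hedged sketch of the test stage (step 2_2), again with illustrative names:

```python
# Sketch only: load the optimal weights into the classification training model
# and predict a semantic understanding image for one RGB/depth pair.
import torch

def predict(model, best_state, rgb, depth):
    model.load_state_dict(best_state)
    model.eval()
    with torch.no_grad():
        logits = model(rgb, depth)          # e.g. shape (1, num_classes, 480, 640)
        return logits.argmax(dim=1)         # predicted class index for every pixel
```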
The constructed convolutional neural network comprises an encoding stage (feature extraction) and a decoding stage (image restoration), wherein the encoding stage and the decoding stage are sequentially connected;
the encoding stage comprises an RGB image, a depth image, five depth map enhancement modules (DE), ten convolution blocks and five progressive complementary fusion modules (PCF); the RGB image and the depth image are used as the inputs of the neural network;
the RGB image passes sequentially through a first convolution block, a second convolution block, a third convolution block and a fourth convolution block and is then connected with a fifth convolution block; the output of the first convolution block is respectively input to the first input ends of the first progressive complementary fusion module and the first depth map enhancement module, the output of the second convolution block is respectively input to the first input ends of the second progressive complementary fusion module and the second depth map enhancement module, the output of the third convolution block is respectively input to the first input ends of the third progressive complementary fusion module and the third depth map enhancement module, the output of the fourth convolution block is respectively input to the first input ends of the fourth progressive complementary fusion module and the fourth depth map enhancement module, and the output of the fifth convolution block is respectively input to the first input ends of the fifth progressive complementary fusion module and the fifth depth map enhancement module;
the depth image passes sequentially through a sixth convolution block, a seventh convolution block, an eighth convolution block and a ninth convolution block and is then connected with a tenth convolution block; the output of the sixth convolution block is input to the second input end of the first depth map enhancement module, the output of the seventh convolution block is input to the second input end of the second depth map enhancement module, the output of the eighth convolution block is input to the second input end of the third depth map enhancement module, the output of the ninth convolution block is input to the second input end of the fourth depth map enhancement module, and the output of the tenth convolution block is input to the second input end of the fifth depth map enhancement module;
the output of the first depth map enhancement module is input to the second input end of the first progressive complementary fusion module, the output of the second depth map enhancement module is input to the second input end of the second progressive complementary fusion module, the output of the third depth map enhancement module is input to the second input end of the third progressive complementary fusion module, the output of the fourth depth map enhancement module is input to the second input end of the fourth progressive complementary fusion module, and the output of the fifth depth map enhancement module is input to the second input end of the fifth progressive complementary fusion module;
the output of the fifth progressive complementary fusion module is subjected to twice upsampling operation and then input to a third input end of a fourth progressive complementary fusion module, the output of the fourth progressive complementary fusion module is subjected to twice upsampling operation and then input to a third input end of the third progressive complementary fusion module, the output of the third progressive complementary fusion module is subjected to twice upsampling operation and then input to a third input end of a second progressive complementary fusion module, and the output of the second progressive complementary fusion module is subjected to twice upsampling operation and then input to a third input end of the first progressive complementary fusion module;
the output of the first progressive complementary fusion module, the output of the second progressive complementary fusion module, the output of the third progressive complementary fusion module, the output of the fourth progressive complementary fusion module and the output of the fifth progressive complementary fusion module are respectively input to a decoding stage;
the decoding stage comprises four multi-level residual error modules (MLR); the fourth multi-level residual error module is connected with the first multi-level residual error module after sequentially passing through the third multi-level residual error module and the second multi-level residual error module; the output of the fourth multi-level residual error module is input to the first input end of the third multi-level residual error module, the output of the third multi-level residual error module is input to the first input end of the second multi-level residual error module, and the output of the second multi-level residual error module is input to the first input end of the first multi-level residual error module; the output of the fifth progressive complementary fusion module sequentially passes through a first transposed convolutional layer and an eighth normalization layer, and the output of the eighth normalization layer is input to the first input end of the fourth multi-level residual error module after a two-times upsampling operation; the output of the fourth progressive complementary fusion module is input to the second input end of the fourth multi-level residual error module, the output of the third progressive complementary fusion module is input to the second input end of the third multi-level residual error module, the output of the second progressive complementary fusion module is input to the second input end of the second multi-level residual error module, and the output of the first progressive complementary fusion module is input to the second input end of the first multi-level residual error module; the output of the first multi-level residual error module is used as the output of the neural network.
The fifth progressive complementary fusion module has a structure specifically as follows:
the fifth progressive complementary fusion module comprises five convolution modules, a first self-adaptive pooling layer and two Softmax layers; the first input end of the fifth progressive complementary fusion module is sequentially connected with the first self-adaptive pooling layer and the first convolution module, and the output of the first convolution module multiplied by the input of the first input end of the fifth progressive complementary fusion module gives a first convolution output; the input of the second input end of the fifth progressive complementary fusion module and the first convolution output are subjected to a connection operation and input into the second convolution module; the output of the second convolution module and the first convolution output are added and input into a third convolution module, the output of the third convolution module is input into a first Softmax layer, and the output of the first Softmax layer multiplied by the input of the first input end of the fifth progressive complementary fusion module is used as a first Softmax characteristic output; the output of the second convolution module and the input of the second input end of the fifth progressive complementary fusion module are added and input to a fourth convolution module, the output of the fourth convolution module is input to a second Softmax layer, and the output of the second Softmax layer multiplied by the input of the second input end of the fifth progressive complementary fusion module is used as a second Softmax characteristic output; the first Softmax characteristic output and the second Softmax characteristic output are added and input into a fifth convolution module, and the output of the fifth convolution module is used as the output of the fifth progressive complementary fusion module;
the first progressive complementary fusion module, the second progressive complementary fusion module, the third progressive complementary fusion module and the fourth progressive complementary fusion module have the same structure, obtained by adding the following operations on the basis of the fifth progressive complementary fusion module: a second self-adaptive pooling layer and a full-connection module are added; the third input end of each progressive complementary fusion module is sequentially connected with the second self-adaptive pooling layer and the full-connection module, the output of the full-connection module multiplied by the output of the second convolution module gives a second convolution output, and the second convolution output is respectively added with the first convolution output and the input of the second input end of each progressive complementary fusion module; the other operations are the same as those of the fifth progressive complementary fusion module.
The full-connection module comprises a first full-connection layer, a second full-connection layer, a sixth activation layer and a seventh activation layer, the output of the second self-adaptive pooling layer is input into the first full-connection layer, the first full-connection layer is sequentially connected with the sixth activation layer and the second full-connection layer, the output of the second full-connection layer is input into the seventh activation layer, and the output of the seventh activation layer is used as the output of the full-connection module.
The multi-level residual error modules are all specifically as follows:
the input of the first input end and the input of the second input end of the multi-level residual error module are added and then respectively input to a sixth convolution module, a seventh convolution module, an eighth convolution module and a ninth convolution module; the outputs of the sixth convolution module, the seventh convolution module and the eighth convolution module are subjected to a connection operation and input to a tenth convolution module, the output of the tenth convolution module and the output of the ninth convolution module are added to obtain a third convolution output, the third convolution output sequentially passes through the residual error sub-module and is input to a second residual error unit, and the output of the second residual error unit is used as the output of the multi-level residual error module;
the residual sub-module mainly comprises a plurality of first residual units, wherein the residual sub-modules in the first multi-level residual module and the second multi-level residual module comprise two first residual units, the residual sub-module in the third multi-level residual module comprises three first residual units, and the residual sub-module in the fourth multi-level residual module comprises five first residual units.
The first residual error unit comprises an eleventh convolution module, a twelfth convolution module and a thirteenth convolution module, the third convolution output is respectively input to the eleventh convolution module and the twelfth convolution module, the output of the eleventh convolution module is input to the thirteenth convolution module, and the output of the thirteenth convolution module and the output of the twelfth convolution module are added to obtain a fourth convolution output;
the second residual error unit comprises a fourteenth convolution module, a first transposed convolution module and a second transposed convolution module, wherein the fourth convolution output of the first residual error unit connected with the second residual error unit is respectively input to the fourteenth convolution module and the first transposed convolution module, the output of the fourteenth convolution module is input to the second transposed convolution module, and the output of the second transposed convolution module and the output of the first transposed convolution module are added to serve as the output of the second residual error unit;
the first transposed convolution module and the second transposed convolution module are identical in structure and each comprises a second transposed convolution layer and a seventh normalization layer; the input of the transposed convolution module is sequentially input into the second transposed convolution layer and the seventh normalization layer, and the output of the seventh normalization layer is used as the output of the transposed convolution module.
The five depth map enhancement modules have the same structure, which is specifically as follows:
each depth map enhancement module comprises two convolution modules, wherein the input of the first input end of the depth map enhancement module and the input of the second input end of the depth map enhancement module are subjected to a connection operation and then input into a fifteenth convolution module, the output of the fifteenth convolution module and the input of the first input end of the depth map enhancement module are added and then input into a sixteenth convolution module, and the output of the sixteenth convolution module and the input of the second input end of the depth map enhancement module are added and then serve as the output of the depth map enhancement module.
The first convolution block and the sixth convolution block have the same structure and are mainly formed by sequentially connecting a first convolution layer, a first normalization layer and a first activation layer; the second convolution block and the seventh convolution block have the same structure and are mainly formed by sequentially connecting three residual error units; the third convolution block and the eighth convolution block have the same structure and are mainly formed by sequentially connecting four residual error units; the fourth convolution block and the ninth convolution block have the same structure and are mainly formed by sequentially connecting six residual error units; the fifth convolution block and the tenth convolution block have the same structure and are formed by sequentially connecting three residual error units.
The three residual error units of the second convolution block have the same structure, and each residual error unit comprises a second convolution layer, a second normalization layer, a second activation layer, a third convolution layer, a third normalization layer and a third activation layer;
the input of the residual error unit is input into a second convolution layer, the second convolution layer is connected with a third normalization layer after sequentially passing through a second normalization layer, a second activation layer and a third convolution layer, the output of the third normalization layer is added with the input of the residual error unit and then input into a third activation layer, and the output of the third activation layer is used as the output of the residual error unit;
the first residual error unit of the third convolution block, the first residual error unit of the fourth convolution block and the first residual error unit of the fifth convolution block are the same, with the following operations added on the basis of the residual error unit of the second convolution block: a fourth convolution layer and a fourth normalization layer are added, the input of the residual error unit is respectively input into the second convolution layer and the fourth convolution layer, the output of the fourth convolution layer is input into the fourth normalization layer, and the output of the fourth normalization layer and the output of the third normalization layer are added and then input into the third activation layer; the other operations are the same as those of the residual error unit of the second convolution block; the other three residual error units of the third convolution block, the other five residual error units of the fourth convolution block and the other two residual error units of the fifth convolution block have the same structure as the residual error units of the second convolution block.
The first convolution module is mainly formed by connecting a fifth convolution layer and a fourth active layer, the input of the convolution module is input into the fifth convolution layer, and the output of the fourth active layer is used as the output of the convolution module;
the third convolution module and the fourth convolution module have the same structure and are mainly formed by connecting a sixth convolution layer and a fifth normalization layer, the input of the convolution module is input into the sixth convolution layer, and the output of the fifth normalization layer is used as the output of the convolution module;
the second convolution module, the fifth convolution module and the sixteenth convolution module are identical in structure and mainly formed by sequentially connecting a seventh convolution layer, a sixth normalization layer and a fifth active layer, the input of the convolution module is input into the seventh convolution layer, and the output of the fifth active layer is used as the output of the convolution module.
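A hedged PyTorch sketch of these three convolution-module variants follows; the specific activation functions are illustrative where the text leaves them open.

```python
# Sketch only: the three convolution-module patterns used throughout the network.
import torch.nn as nn

def conv_act(cin, cout, k=1, s=1, p=0):
    # convolution followed by an activation layer (e.g. the first convolution module)
    return nn.Sequential(nn.Conv2d(cin, cout, k, s, p), nn.Sigmoid())

def conv_norm(cin, cout, k=1, s=1, p=0):
    # convolution followed by a normalization layer (e.g. the third/fourth convolution modules)
    return nn.Sequential(nn.Conv2d(cin, cout, k, s, p), nn.BatchNorm2d(cout))

def conv_norm_act(cin, cout, k=3, s=1, p=1):
    # convolution, normalization and activation (e.g. the second/fifth/sixteenth convolution modules)
    return nn.Sequential(nn.Conv2d(cin, cout, k, s, p), nn.BatchNorm2d(cout),
                         nn.ReLU(inplace=True))
```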
Practice shows that depth information is of great benefit to semantic understanding performance and can provide geometric correspondence and spatial structure information for the RGB representation. Most existing work simply assumes that the depth measurements are accurate and well aligned with the RGB pixels, ignores the problem of low-quality depth maps, and simply models the problem as multimodal feature fusion to obtain a better feature representation for more accurate segmentation. However, this may not yield satisfactory results, because the actual depth data are usually noisy and, as the network goes deeper, the accuracy may decrease without an efficient fusion.
The invention provides an effective progressive cross-modal complementary guidance encoder, which not only effectively recalibrates the RGB feature response and improves the quality of the depth map, but also extracts accurate depth information over several stages and alternately aggregates the two recalibrated representations. The main branch consists of two sub-networks that extract RGB and depth features respectively; the key point is the proposed progressive context fusion operation, which uses high-level fused information to refine the feature information of the current layer, so that the multi-modal information of that layer is fused more efficiently. At the same time, multi-level residual decoding is introduced to help propagate and fuse information between the two modalities with maximum fidelity in the restoration and final output, preserving their specificity over long-range propagation. In addition, the proposed encoder and decoder are plug-and-play and transfer well, improving the convenience of RGB-D semantic understanding. The proposed model consistently outperforms the state of the art on challenging indoor datasets.
The invention has the beneficial effects that:
1) the invention proposes a novel two-way cross-modal solution for RGB-D semantic understanding that can effectively reduce the impact of noisy depth measurements and can incorporate sufficient complementary information to form discriminative representations. The encoder part consists of two network branches that extract features from the RGB and depth images simultaneously and fuse the depth features into the RGB feature maps while the network is running. The branch fusion uses high-level features to refine low-level features so that the two better complement each other's information, completing the feature extraction.
2) The invention also adopts a residual progressive technique, which ensures minimal distortion and completes high-quality decoding while repeatedly utilizing high- and low-level information, reducing feature loss to the greatest extent and ensuring better restoration of the image.
3) Experiments show that the method can successfully fuse RGB and depth information for semantic understanding of cluttered indoor scenes. Furthermore, the invention achieves state-of-the-art performance on challenging segmentation datasets.
Drawings
FIG. 1 is a block diagram of an overall implementation of the method of the present invention;
FIG. 2 is a block diagram of an implementation of a PCF module in the encoding stage;
FIG. 3 is a block diagram of an implementation of an MLR module at the decoding stage;
FIG. 4 is a block diagram of an implementation of the DE module at the encoding stage;
FIG. 5a is a test set source picture; FIG. 5b is a test set prediction picture;
FIG. 6a is a test set source picture; FIG. 6b is a test set prediction picture;
FIG. 7a is a test set source picture; fig. 7b is a test set prediction picture.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings and embodiments.
The general implementation block diagram of the invention is shown in fig. 1, which includes two processes, a training phase and a testing phase;
the specific steps of the training phase process are as follows:
step 1_ 1: selecting 795 original indoor scene RGB images, depth images and corresponding real semantic understanding images, forming a training set, recording an original image set as { J (i, J) }, recording the real semantic understanding images corresponding to the original images as { Jtrue (i, J) }, processing the real semantic understanding images corresponding to each original indoor scene image in the training set into 41 independent thermal coding images by adopting an existing independent thermal coding method (one-hot), and recording a set formed by 41 independent thermal coding images processed by the { Jtrue (i, J) }asJtrue. The height of the original image is 480, the width is 640, i is more than or equal to 1 and less than or equal to 640, J is more than or equal to 1 and less than or equal to 480, J (i, J) represents the pixel value of the pixel point with the coordinate position (i, J) in { J (i, J) }, and Jtrue (i, J) represents the pixel value of the pixel point with the coordinate position (i, J) in { Jtrue (i, J) }.
Step 1_ 2: and constructing a convolutional neural network.
The constructed convolutional neural network mainly comprises two parts: an encoding stage (feature extraction) and a decoding stage (image restoration), the encoding stage and the decoding stage being sequentially connected.
The encoding stage comprises an RGB image, a depth image, five depth map enhancement modules (DE), ten convolution blocks and five progressive complementary fusion modules (PCF); the RGB image and the depth image are used as the inputs of the neural network;
In the feature extraction part, i.e. the encoding stage, since the data set comprises an RGB map and a depth map, each input sample is divided into two branches, RGB and depth; the main branches are the same ResNet34 used for feature extraction. Overall, the encoding stage includes main branches each with five convolution blocks, the depth map enhancement modules (DE) and the progressive complementary fusion modules (PCF). The RGB image passes through its five convolution blocks in sequence; the fourth and fifth convolution blocks of this main branch are defined as high-level features, the third convolution block as middle-level features, and the first and second convolution blocks as low-level features. The depth branch likewise contains five convolution blocks; its ninth and tenth convolution blocks are high-level features, the eighth convolution block is a middle-level feature, and the sixth and seventh convolution blocks are low-level features.
For the encoding stage of the model, the input has two branches, which respectively receive an original color image with three RGB channel components and an original single-channel depth image, whose sizes can respectively be taken as 3 × 480 × 640 and 1 × 480 × 640. The encoding stage comprises the RGB image, the depth image, five depth map enhancement modules, ten convolution blocks and five progressive complementary fusion modules. Features are extracted by identical ResNet34 main branches. The RGB image passes sequentially through the first convolution block, the second convolution block, the third convolution block and the fourth convolution block and is then connected with the fifth convolution block; the output of the first convolution block is respectively input to the first input ends of the first progressive complementary fusion module and the first depth map enhancement module, the output of the second convolution block is respectively input to the first input ends of the second progressive complementary fusion module and the second depth map enhancement module, the output of the third convolution block is respectively input to the first input ends of the third progressive complementary fusion module and the third depth map enhancement module, the output of the fourth convolution block is respectively input to the first input ends of the fourth progressive complementary fusion module and the fourth depth map enhancement module, and the output of the fifth convolution block is respectively input to the first input ends of the fifth progressive complementary fusion module and the fifth depth map enhancement module;
the first Convolution block and the sixth Convolution block have the same structure, and both are mainly composed of a first Convolution layer (Convolution, Conv), a first normalization layer (BatchNorm) and a first Activation layer (Act) which are sequentially arranged. The convolution layer of the first layer adopts convolution kernel _ size of 7 and step size of stride2, the edge padding (padding) is 3, the number of convolution kernels is 64, and no matter whether the input end receives a three-channel rgb image or a single-channel depth image, the output of the first convolution block is 64 feature images, and the set formed by the 64 feature images is marked as Cr 1And Cd 1Representing rgb and depth branches, respectively, then normalizing them by a first normalization layer, then by an activation function (Relu), and finally outputting C as a first volume blockr 1And Cd 164 x 240 x 320.
Before the second convolution block, a max pooling layer (MaxPooling) with a pooling size of 2 is applied, so the input to the second convolution block has size 64 × 120 × 160.
The second convolution block and the seventh convolution block have the same structure and are mainly formed by sequentially connecting three residual units. The main branch of each residual unit consists of a second convolution layer (kernel size 3, stride 1), a second normalization layer, a second activation layer, a third convolution layer whose parameters are the same as those of the second convolution layer, and a third normalization layer; the number of convolution kernels is 64. For the shortcut branch, since the stride is 1 and the numbers of input and output channels are the same, no operation is applied. Finally, the main branch and the shortcut branch are added and passed through a third activation layer (ReLU activation function); the final outputs are denoted C_r^2 and C_d^2, with size 64 × 120 × 160.
The third convolution block and the eighth convolution block have the same structure and are mainly formed by sequentially connecting four residual units. For the first residual unit, the following operations are added on the basis of the residual unit of the second convolution block: a fourth convolution layer and a fourth normalization layer are added; the input of the residual unit is respectively input to the second convolution layer and the fourth convolution layer, the output of the fourth convolution layer is input to the fourth normalization layer, and the output of the fourth normalization layer and the output of the third normalization layer are added and then input to the third activation layer; the other operations are the same as those of the residual unit of the second convolution block. That is, the main branch sequentially comprises the second convolution layer (kernel size 3, stride 2), the second normalization layer, the second activation layer, the third convolution layer (kernel size 3, stride 1) and the third normalization layer, while the shortcut branch sequentially comprises the fourth convolution layer (kernel size 1, stride 2) and the fourth normalization layer; the number of output channels is 128.
The other residual units have the same structure as the residual units of the second convolution block, except that the number of convolution kernels of the second convolution layer is 128; their shortcut branches still perform no operation and simply pass the input data through. Finally, the last operation of each residual unit adds the main branch and the shortcut branch and passes the result through the third activation layer (ReLU activation function). The final outputs are denoted C_r^3 and C_d^3, with size 128 × 60 × 80.
For the fourth, fifth, ninth and tenth convolution blocks, the operations and corresponding parameters are substantially the same as those of the third convolution block. The differences are that the fourth convolution block and the ninth convolution block contain six residual units, with 256 final output channels; their final outputs are denoted C_r^4 and C_d^4, with size 256 × 30 × 40. The fifth convolution block and the tenth convolution block contain three residual units, with 512 final output channels; their final outputs are denoted C_r^5 and C_d^5, with size 512 × 15 × 20.
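These channel counts and sizes match a standard ResNet34 split into five stages. A hedged sketch using torchvision's resnet34 as a stand-in backbone follows (an assumption for illustration; the depth branch's first convolution is replaced to accept a single channel):

```python
# Sketch only: two ResNet34-style branches producing C_r^1..C_r^5 and C_d^1..C_d^5.
import torch
import torch.nn as nn
from torchvision.models import resnet34

def make_branch(in_channels):
    net = resnet34()
    if in_channels != 3:                         # depth branch: single-channel input
        net.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2, padding=3, bias=False)
    block1 = nn.Sequential(net.conv1, net.bn1, net.relu)      # 64 x 240 x 320
    block2 = nn.Sequential(net.maxpool, net.layer1)           # 64 x 120 x 160
    return nn.ModuleList([block1, block2, net.layer2, net.layer3, net.layer4])

rgb_branch, depth_branch = make_branch(3), make_branch(1)
rgb, depth = torch.randn(1, 3, 480, 640), torch.randn(1, 1, 480, 640)
feats_r, feats_d = [], []
for blk_r, blk_d in zip(rgb_branch, depth_branch):
    rgb, depth = blk_r(rgb), blk_d(depth)
    feats_r.append(rgb)                          # C_r^i for i = 1..5
    feats_d.append(depth)                        # C_d^i for i = 1..5
print([tuple(f.shape[1:]) for f in feats_r])     # (64,240,320) ... (512,15,20)
```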
The depth image is connected with the tenth convolution block after sequentially passing through the sixth convolution block, the seventh convolution block, the eighth convolution block and the ninth convolution block; the output of the sixth rolling block is input to the second input end of the first depth map lifting module, the output of the seventh rolling block is input to the second input end of the second depth map lifting module, the output of the eighth rolling block is input to the second input end of the third depth map lifting module, the output of the ninth rolling block is input to the second input end of the fourth depth map lifting module, and the output of the tenth rolling block is input to the second input end of the fifth depth map lifting module;
the output of the first depth map enhancement module is input to the second input end of the first progressive complementary fusion module, the output of the second depth map enhancement module is input to the second input end of the second progressive complementary fusion module, the output of the third depth map enhancement module is input to the second input end of the third progressive complementary fusion module, the output of the fourth depth map enhancement module is input to the second input end of the fourth progressive complementary fusion module, and the output of the fifth depth map enhancement module is input to the second input end of the fifth progressive complementary fusion module;
As shown in fig. 4, the five depth map enhancement modules are identical in structure, and each depth map enhancement module operates specifically as follows: first, the incoming RGB feature map and depth feature map are concatenated (Concat); the RGB and depth inputs are the outputs of the convolution blocks of the corresponding layer, C_r^i and C_d^i (i = 1,2,3,4,5). Taking i = 1 as an example, the number of channels obtained at this point is the sum of the two input channel counts, 64 + 64 = 128. The result then passes through the seventh convolution layer (kernel size 3 × 3, stride 1, edge padding 1), the sixth normalization layer and the fifth activation layer (ReLU activation function) to obtain an output O_1, which is added to the original RGB input to obtain O^, i.e. O^ = O_1 + C_r^1. O^ then passes through the sixth convolution layer (kernel size 1, edge padding 0, stride 1) and the fifth normalization layer to obtain an output O_2, which is added to the original depth input to obtain the final output D_1, i.e. D_1 = O_2 + C_d^1. The feature map size after the DE module is the same as that of the input depth feature, e.g. D_1 is still 64 × 240 × 320.
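A hedged re-implementation sketch of one DE module following the data flow just described (the grouping of layers is illustrative where the text is open to interpretation):

```python
# Sketch only: depth map enhancement (DE), D_i = refine(fuse(cat(rgb, depth)) + rgb) + depth.
import torch
import torch.nn as nn

class DepthEnhance(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # concat(rgb, depth) -> 3x3 convolution + BatchNorm + ReLU
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        # 1x1 convolution + BatchNorm
        self.refine = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1, stride=1, padding=0),
            nn.BatchNorm2d(channels))

    def forward(self, rgb_feat, depth_feat):
        o1 = self.fuse(torch.cat([rgb_feat, depth_feat], dim=1))
        o_hat = o1 + rgb_feat                    # O^ = O_1 + C_r^i
        o2 = self.refine(o_hat)
        return o2 + depth_feat                   # D_i = O_2 + C_d^i

de1 = DepthEnhance(64)
d1 = de1(torch.randn(1, 64, 240, 320), torch.randn(1, 64, 240, 320))
print(d1.shape)                                  # torch.Size([1, 64, 240, 320])
```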
In a specific implementation, the fusion operation of the encoding stage is progressively complementary to the fusion module, mainly for fusing rgb and enhanced depth, i.e. C, of the inputr iAnd Di(i ═ 1,2,3,4,5), while using feature information of higher layers to complement and assist in the fusion of their previous layers. An up-sampling operation twice as long as before each higher layer is transmitted to the next lower layer ensures the same height and width as the lower layer to keep the characteristic informationMaximum match of message delivery. In addition, the fifth progressive complementary fusion module is slightly different from the other four progressive complementary fusion modules. Since the fifth block is the highest layer, there is no feature information of a higher layer as a guide.
The fifth progressive complementary fusion module comprises five convolution modules, a first adaptive pooling layer and two Softmax layers. First, the RGB features output by the fifth convolution block pass through the first adaptive pooling layer so that the height and width become 1 × 1, then through the fifth convolution layer (kernel size 1, stride 1, edge padding 0) and the fourth activation layer (Sigmoid activation function); the resulting output is element-wise multiplied with the original RGB input, and the product is denoted I. I and the original input depth features are concatenated (Concat); the result sequentially passes through the seventh convolution layer (kernel size 1, stride 1, padding 0, with the number of output channels restored to the number of original input channels), the sixth normalization layer and the fifth activation layer (Sigmoid activation function), and the output is denoted Y. Y is separately added to I and D_5 to obtain RGB_new and Depth_new, i.e. RGB_new = Y + I and Depth_new = Y + D_5; each passes through the sixth convolution layer (kernel size 1, stride 1, padding 0), the fifth normalization layer and the first/second Softmax function to obtain S_rgb and S_depth respectively, which are then element-wise multiplied with the original inputs C_r^5 and D_5 to obtain two feature maps. The two feature maps are added and passed through the seventh convolution layer (kernel size 3, stride 1, edge padding 1), the sixth normalization layer and the fifth activation layer to obtain the final output, denoted C_out^5, whose size is 512 × 15 × 20.
As shown in fig. 2, the basic operation flow of the first progressive complementary fusion module to the fourth progressive complementary fusion module is the same as the fifth one, and is described by taking PCF fusion of the fourth volume block as an example to determine whether there is guidance of high-level information. The rgb output by the fourth convolution block passes through a first self-adaptive pooling layer, the height and the width of the rgb are changed into 1 x 1, the rgb passes through a fifth convolution layer, the size of a convolution kernel is 1, the step length is 1, the edge filling is 0, and then the rgb is transmitted into a fourth laserAnd the active layer (Sigmoid activation function) performs element multiplication on the obtained output and the original rgb input, records the output as I, is characterized in that the output and the original input depth perform Concat connection operation, the result obtained by Concat sequentially passes through a seventh convolution layer, the size of a convolution kernel is 1, the step length is 1, the filling is 0, the number of output channels is changed into the number of original input channels, and records the output as Y through a sixth normalization layer and a fifth active layer (Sigmoid activation function). The characteristic information C _ out transmitted by the higher layer at this time5After passing through a layer of self-adaptive pooling layer, the height and width are 1 × 1, the size of the self-adaptive pooling layer is changed into b × 512 by using view operation, b is the batch size, 512 is the number of self channels, then the self-adaptive pooling layer passes through a first full connection layer, the number of input and output neurons is 512, 256//16, a sixth activation layer (Relu activation function layer), a second full connection layer, the number of input and output neurons is 256//16, 256, a seventh activation layer (Sigmoid activation function), and the size of the view b × 256 × 1 is obtained and output is recorded as f1。f1The multiplication of the elements with the output Y results in an output denoted as weight, which is denoted as f1Y, weight are respectively related to I and D4Add to obtain RGBnewAnd DepthnewI.e. RGBnewWeight + I and Depthnew=weight+D4Respectively obtaining S through a sixth convolution layer, the convolution kernel size is 1, the step length is 1, the filling is 0, a fifth normalization layer and the first/second Softmax functionrgbAnd SdepthThen respectively with the original input Cr 4And D4Multiplying corresponding elements to obtain two characteristic graphs, adding the two characteristic graphs, passing through a seventh convolution layer, obtaining a final output recorded as C _ out by the convolution kernel with the size of 3, the step length of 1, the edge filling of 1, a sixth normalization layer and a Relu activation layer4. The output size was 256 x 30 x 40. The PCF operation for 1,2,3 volume blocks is the same as the fourth. It is noted that the output size of the progressive complementary blending is the same as the output size of the corresponding volume block, and plug and play is possible.
The output of the fifth progressive complementary fusion module is subjected to twice upsampling operation and then input to a third input end of a fourth progressive complementary fusion module, the output of the fourth progressive complementary fusion module is subjected to twice upsampling operation and then input to a third input end of the third progressive complementary fusion module, the output of the third progressive complementary fusion module is subjected to twice upsampling operation and then input to a third input end of a second progressive complementary fusion module, and the output of the second progressive complementary fusion module is subjected to twice upsampling operation and then input to a third input end of the first progressive complementary fusion module;
the output of the first progressive complementary fusion module, the output of the second progressive complementary fusion module, the output of the third progressive complementary fusion module, the output of the fourth progressive complementary fusion module and the output of the fifth progressive complementary fusion module are respectively input to a decoding stage;
In the decoding stage, a multi-level residual module is designed that uses multi-scale dilated convolution and applies the residual idea to the combination of feature maps from different layers during decoding, thereby completing high-quality decoding and restoration.
The decoding stage comprises four multi-level residual error modules, a fourth multi-level residual error module sequentially passes through a third multi-level residual error module and a second multi-level residual error module and then is connected with a first multi-level residual error module, the output of the fourth multi-level residual error module is input to the first input end of the third multi-level residual error module, the output of the third multi-level residual error module is input to the first input end of the second multi-level residual error module, the output of the second multi-level residual error module is input to the first input end of the first multi-level residual error module, the output of the fifth progressive complementary fusion module is input to the first inversion convolutional layer, the output of the first inversion convolutional layer is input to the first input end of the fourth multi-level residual error module after twice of upsampling operation, the output of the fourth progressive complementary fusion module is input to the second input end of the fourth multi-level residual error module, and the output of the third progressive complementary fusion module is input to the second input end of the third multi-level residual error module, the output of the second progressive complementary fusion module is input to the second input end of the second multilevel residual error module, the output of the first progressive complementary fusion module is input to the second input end of the first multilevel residual error module, and the output of the first multilevel residual error module is used as the output of the neural network.
In the embodiment shown in FIG. 3, in the multi-level residual block in the decoding stage, the output of MLR is recorded as S _ outi(i ═ 1,2,3,4), the first multi-level residual block includes five convolution blocks, two first residual units, and one second residual unit, as described in detail below with the first MLR as an example. First, the output of the first PCF module is added to the output of the second MLR, and the output is denoted as SC _ out, which can be denoted as S _ out2+C_out1. Then SC _ out passes through a sixth convolution module and a ninth convolution module, the first layer SC _ out passes through one convolution layer, the convolution kernel size is 3, the step length is 1, the edge filling is 1, the expansion rate (deviation) is 1, the number of the convolution kernels is 64, the normalization is performed by one layer, the LeakyReLu is one layer, and the output is recorded as E1The second layer SC _ out passes through a convolution layer, the convolution kernel size is 3, the step length is 1, the edge filling is 4, the expansion rate is 4, the number of the convolution kernels is 64, the normalization is performed by one layer, the LeakyReLu layer is formed, and the output is recorded as E2The third layer SC _ out is convolved by one layer, the size of a convolution kernel is 3, the step length is 1, the edge filling is 8, the expansion rate is 8, the number of the convolution kernels is 64, the layer normalization is performed by one layer, the layer LeakyReLu is formed, and the output is recorded as E3The fourth level is convoluted by one layer, the size of a convolution kernel is 1, the step length is 1, the edge filling is 0, the number of the convolution kernels is still 64, the normalization is performed by one layer, a ReLu activation function is formed by one layer, and the output is recorded as E4At this time, E1, E2, and E3 are connected by Concat, and the output is first passed through a convolution layer, normalized by one layer, and ReLu activated to match the number of channels, then added to E4, and finally passed through a ReLu output layer, which is denoted as E, and may be denoted as E ═ ReLu (Conv (Concat (E1, E2, E3)) + E4). The size of E is then 64 x 240 x 320. Followed by 3 residual units. For the first two residual error units, the residual error units belong to common residual error units, each residual error unit is formed by sequentially passing a main branch through a seventh convolutional layer, a convolutional kernel is 3, the step length is 1, the edge padding is 1, the number of the convolutional kernels is 64, a sixth normalization layer, a fifth activation layer (ReLu activation function), a seventh convolutional layer, the size of the convolutional kernel is 3, the step length is 1, the edge padding is 1, the number of the convolutional kernels is 64, a sixth normalization layer is used for recording the output of the sixth normalization layer as E ^ and a shortcut branch passes through a layer of the seventh convolutional layerThe output after the seventh convolution layer, the sixth normalization layer and the fifth active layer is recorded as EsAdding the main branch and the shortcut branch, activating by ReLu to obtain an output which is expressed as R ═ ReLu (E)^+Es). 
For the last residual unit, however, R on the main branch sequentially passes through one seventh convolution layer with kernel size 3, stride 1, edge padding 1 and 64 convolution kernels, one sixth normalization layer and a fifth activation layer (ReLU activation function); since the required stride is 2, it then passes through a second transposed convolution layer (ConvTranspose2d) with kernel size 3, stride 2, edge padding 1, output padding 1 and 64 convolution kernels, followed by one seventh normalization layer. On the shortcut branch, R passes through one second transposed convolution layer with kernel size 3, stride 2, edge padding 1 and 64 convolution kernels, and one seventh normalization layer. The two branches are added to give the output of the first MLR module, which is recorded as S_out_1 and has size 64 × 480 × 640.
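A runnable PyTorch sketch of the first multi-level residual module as described above; batch normalization, the 1 × 1 kernel of the fusion convolution and the output padding of the shortcut transposed convolution are assumptions where the text leaves the choice open:

```python
import torch
import torch.nn as nn

class PlainResidualUnit(nn.Module):
    # Ordinary residual unit: conv-norm-ReLU-conv-norm on the main branch,
    # conv-norm-ReLU on the shortcut branch, added and passed through ReLU.
    def __init__(self, ch=64):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv2d(ch, ch, 3, stride=1, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, stride=1, padding=1), nn.BatchNorm2d(ch))
        self.shortcut = nn.Sequential(
            nn.Conv2d(ch, ch, 3, stride=1, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.main(x) + self.shortcut(x))  # R = ReLU(E^ + E_s)

class UpsamplingResidualUnit(nn.Module):
    # Last residual unit: transposed convolutions with stride 2 double the spatial size.
    def __init__(self, ch=64):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv2d(ch, ch, 3, stride=1, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(ch, ch, 3, stride=2, padding=1, output_padding=1), nn.BatchNorm2d(ch))
        self.shortcut = nn.Sequential(
            nn.ConvTranspose2d(ch, ch, 3, stride=2, padding=1, output_padding=1),  # output_padding assumed for size matching
            nn.BatchNorm2d(ch))

    def forward(self, x):
        return self.main(x) + self.shortcut(x)

class MultiLevelResidual(nn.Module):
    # First MLR: three dilated 3x3 branches (dilation 1, 4, 8), a 1x1 branch,
    # a fusion convolution, then plain residual units and one upsampling unit.
    def __init__(self, ch=64, n_plain_units=2):
        super().__init__()
        def dilated_branch(d):
            return nn.Sequential(
                nn.Conv2d(ch, ch, 3, stride=1, padding=d, dilation=d),
                nn.BatchNorm2d(ch), nn.LeakyReLU(inplace=True))
        self.b1, self.b2, self.b3 = dilated_branch(1), dilated_branch(4), dilated_branch(8)
        self.b4 = nn.Sequential(nn.Conv2d(ch, ch, 1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True))
        self.fuse = nn.Sequential(nn.Conv2d(3 * ch, ch, 1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True))
        self.units = nn.Sequential(*[PlainResidualUnit(ch) for _ in range(n_plain_units)],
                                   UpsamplingResidualUnit(ch))

    def forward(self, s_prev, c_pcf):
        sc = s_prev + c_pcf                                           # SC_out = S_out_{i+1} + C_out_i
        e = torch.cat([self.b1(sc), self.b2(sc), self.b3(sc)], dim=1)
        e = torch.relu(self.fuse(e) + self.b4(sc))                    # E = ReLU(Conv(Concat(E1, E2, E3)) + E4)
        return self.units(e)                                          # S_out_i, spatial size doubled
```

Under these assumptions, MultiLevelResidual(ch=64, n_plain_units=2) maps a 64 × 240 × 320 input pair to a 64 × 480 × 640 output; the deeper MLR modules reuse the same structure with more plain residual units.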
For the second, third and fourth multi-level residual modules, the operation is the same as for the first multi-level residual module. The inputs to the second multi-level residual module are the third MLR output S_out_3 and the output C_out_2 of the second PCF module; the inputs to the third multi-level residual module are the fourth MLR output S_out_4 and the output C_out_3 of the third PCF module; the inputs to the fourth multi-level residual module, however, are the output C_out_5 of the fifth PCF module, passed through one transposed convolution layer and an upsampling operation for size matching, and the output C_out_4 of the fourth PCF module. In addition, the numbers of residual units in the second, third and fourth multi-level residual modules are 3, 4 and 6 in sequence, and their output sizes are 64 × 240 × 320, 64 × 120 × 160 and 128 × 60 × 80 respectively.
The output of the first MLR module then passes through a convolution layer with kernel size 1, stride 1, edge padding 0 and a number of convolution kernels equal to the number of categories to be classified, giving the final output.
Step 1_3: Each original RGB color image and Depth image in the training set is data-enhanced by brightness and contrast adjustment, flipping, etc. and then used as the initial input image pair, with a batch size of 6. The images are input into the convolutional neural network for training, and 41 semantic understanding prediction maps at the original size are obtained for each original indoor scene image in the training set; the set of these semantic understanding prediction maps is denoted J_pre1. In addition, for auxiliary training, the network obtains corresponding sets of 41 semantic understanding prediction maps from the outputs of the 2nd, 3rd and 4th MLRs through a 1 × 1 convolution layer; these sets are denoted J_pre2, J_pre3 and J_pre4, with sizes 41 × 240 × 320, 41 × 120 × 160 and 41 × 60 × 80 in sequence.
Step 1_4: For the actual loss calculation, the usual convention of ignoring the background is followed, so only 40 prediction maps participate in the calculation. The loss function values are therefore computed between the four sets of 40 semantic understanding prediction maps corresponding to each original indoor scene image in the training set and the set of 40 one-hot coded images obtained from the corresponding real semantic understanding image. Before the one-hot coding, the label is first resized by nearest-neighbour interpolation (interpolate) so that it matches the size of the corresponding prediction map, i.e. the sizes of J_pre2, J_pre3 and J_pre4. The loss function value between J_prei and J_true is denoted Loss(J_prei, J_true), where i = 1, 2, 3, 4; Loss(J_prei, J_true) is obtained using cross entropy (CrossEntropyLoss), and the final loss value is the average of the loss function values of the four prediction map sets.
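A short sketch of this multi-scale loss, assuming the background is skipped through the ignore_index argument of CrossEntropyLoss (the text only states that 40 of the 41 classes participate) and that the label map stores integer class indices:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def multi_scale_loss(preds, label, ignore_index=0):
    """preds: main prediction plus the three auxiliary predictions taken from the
    2nd-4th MLR outputs through 1x1 convolutions, each of shape (N, 41, h, w);
    label: ground-truth class map of shape (N, H, W) at full resolution."""
    criterion = nn.CrossEntropyLoss(ignore_index=ignore_index)  # background index assumed to be 0
    losses = []
    for p in preds:
        # Resize the label with nearest-neighbour interpolation to match this prediction.
        lbl = F.interpolate(label[:, None].float(), size=p.shape[-2:], mode='nearest')
        losses.append(criterion(p, lbl[:, 0].long()))
    return torch.stack(losses).mean()  # final loss = average of the four loss values
```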
Step 1_5: Steps 1_3 and 1_4 are repeatedly executed V times, where V > 1, until the convergence of the neural network saturates, i.e. the training loss value fluctuates without decreasing further and the validation loss has almost reached its minimum. The convolutional neural network classification training model is obtained at this point, and the resulting network weight vector and bias are taken as the optimal weight vector and optimal bias term of the convolutional neural network classification training model; in this example, V is chosen to be 200.
The test stage process comprises the following specific steps:
Step 2_1: 654 pairs of original RGB color images and corresponding Depth maps are selected to form the test set, denoted {I^(i, j)}, with height and width 480 × 640, where 1 ≤ i ≤ 640 and 1 ≤ j ≤ 480, and I^(i, j) represents the pixel value of the pixel point with coordinate position (i, j) in {I^(i, j)}. Unlike training, the test set does not use flipping or brightness and contrast changes for data enhancement.
Step 2_2: The selected RGB and Depth test images are input in pairs, with batch size 1, into the convolutional neural network classification training model, and prediction is performed using the optimal weight vector and optimal bias to obtain the corresponding predicted semantic understanding image, denoted {I^pre(i, j)}, where I^pre(i, j) represents the pixel value of the pixel point with coordinate position (i, j), and the results are saved correspondingly.
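A minimal inference sketch for this step, assuming the trained model takes the RGB tensor and the Depth tensor as two inputs and returns class logits at the full 480 × 640 resolution:

```python
import torch

@torch.no_grad()
def predict(model, rgb, depth):
    """Batch size 1, no test-time augmentation; the predicted semantic
    understanding image is the arg-max over the class dimension."""
    model.eval()
    logits = model(rgb.unsqueeze(0), depth.unsqueeze(0))  # (1, num_classes, 480, 640)
    return logits.argmax(dim=1).squeeze(0)                # (480, 640) map of class indices
```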
To further verify the feasibility and effectiveness of the method of the invention, experiments were performed.
The neural network architecture is implemented with the Python-based deep learning library PyTorch. The data outside the training set in the indoor scene image database NYUv2 (654 indoor scene images) is used as the test set to analyse the segmentation performance of the indoor scene images predicted by the method. The segmentation performance of the predicted semantic understanding images is evaluated with 3 objective parameters commonly used for assessing semantic understanding methods as evaluation indexes, namely Class Accuracy, Mean Pixel Accuracy, and the ratio of the intersection to the union of the segmented image and the label image (Mean Intersection over Union, MIoU).
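These three indexes can be computed from a confusion matrix accumulated over the whole test set; a sketch follows, in which the usual conventions are assumed for exactly how Class Accuracy and Mean Pixel Accuracy are defined, since the text does not give formulas:

```python
import numpy as np

def segmentation_metrics(conf_mat):
    """conf_mat: (C, C) confusion matrix, rows = ground truth, columns = prediction."""
    tp = np.diag(conf_mat).astype(np.float64)
    gt = conf_mat.sum(axis=1)
    pred = conf_mat.sum(axis=0)
    class_acc = (tp / np.maximum(gt, 1)).mean()            # mean of per-class accuracies
    pixel_acc = tp.sum() / np.maximum(conf_mat.sum(), 1)   # overall pixel accuracy
    miou = (tp / np.maximum(gt + pred - tp, 1)).mean()     # mean intersection over union
    return class_acc, pixel_acc, miou
```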
The method is used to predict each indoor scene image in the NYUv2 test set, obtaining the predicted semantic understanding image corresponding to each indoor scene image; the Class Accuracy, Mean Pixel Accuracy and MIoU reflecting the semantic understanding effect of the method are listed in Table 1. As can be seen from the data listed in Table 1, the segmentation results of the indoor scene images obtained by the method of the present invention are good, which indicates that it is feasible and effective to obtain the predicted semantic understanding image corresponding to an indoor scene image with the method of the present invention.
TABLE 1 evaluation results on test sets using the method of the invention
FIG. 5a shows the 1st original indoor scene image of the same scene; FIG. 5b shows the predicted semantic understanding image obtained by predicting the original indoor scene image shown in FIG. 5a with the method of the present invention; FIG. 6a shows the 2nd original indoor scene image of the same scene; FIG. 6b shows the predicted semantic understanding image obtained by predicting the original indoor scene image shown in FIG. 6a with the method of the present invention; FIG. 7a shows the 3rd original indoor scene image of the same scene; FIG. 7b shows the predicted semantic understanding image obtained by predicting the original indoor scene image shown in FIG. 7a with the method of the present invention. Comparing FIG. 5a with FIG. 5b, FIG. 6a with FIG. 6b, and FIG. 7a with FIG. 7b, it can be seen that the segmentation precision of the predicted semantic understanding images obtained with the method of the present invention is high.

Claims (9)

1. An indoor scene image processing method based on a progressive guidance fusion complementary network is characterized in that: the method comprises a training stage and a testing stage;
the specific steps of the training phase process are as follows:
step 1_1: selecting Q original indoor scene images, the depth image corresponding to each original indoor scene image and the real semantic understanding images, and forming a training set; converting the real semantic understanding image into a plurality of one-hot coded images by using a one-hot coding method;
step 1_2: constructing a convolutional neural network;
step 1_3: respectively performing data enhancement on each original indoor scene image and the corresponding depth image in the training set, taking the data-enhanced original indoor scene image and the corresponding depth image as an initial input image pair, and inputting the initial input image pair into the convolutional neural network for processing to obtain a plurality of semantic understanding prediction images corresponding to each original indoor scene image in the training set;
step 1_4: calculating loss function values between the sets formed by the plurality of semantic understanding prediction images corresponding to each original indoor scene image and the set formed by the plurality of one-hot coded images obtained from the corresponding real semantic understanding image;
step 1_5: repeatedly executing the step 1_3 and the step 1_4 for V times, after which the training is completed and a convolutional neural network classification training model is obtained, Q multiplied by V loss function values being obtained in total; then finding the minimum loss function value among the Q multiplied by V loss function values, and taking the weight vector and the bias item corresponding to the minimum loss function value as the optimal weight vector and the optimal bias term of the convolutional neural network classification training model;
the test stage process comprises the following specific steps:
step 2_1: recording an indoor scene image to be semantically understood as {I^(i, j)}, wherein i and j respectively represent the horizontal and vertical coordinates of the pixel point with coordinate position (i, j), and I^(i, j) represents the pixel value of the pixel point with coordinate position (i, j) in the indoor scene image to be semantically understood;
step 2_2: inputting the indoor scene image to be semantically understood and the corresponding depth image into the convolutional neural network classification training model provided with the optimal weight vector and the optimal bias term to obtain the predicted semantic understanding image corresponding to the indoor scene image to be semantically understood, which is recorded as {I^pre(i, j)}, wherein I^pre(i, j) represents the pixel value of the pixel point with coordinate position (i, j) in the predicted semantic understanding image.
2. The indoor scene image processing method based on the progressive guidance fusion complementary network of claim 1, wherein: the constructed convolutional neural network comprises a coding stage and a decoding stage, and the coding stage and the decoding stage are sequentially connected;
the encoding stage comprises an RGB image, a depth image, five depth map lifting modules, ten convolution blocks and five progressive complementary fusion modules;
the RGB image is connected with a fifth convolution block after sequentially passing through a first convolution block, a second convolution block, a third convolution block and a fourth convolution block; the output of the first convolution block is respectively input to the first input ends of the first progressive complementary fusion module and the first depth map lifting module, the output of the second convolution block is respectively input to the first input ends of the second progressive complementary fusion module and the second depth map lifting module, the output of the third convolution block is respectively input to the first input ends of the third progressive complementary fusion module and the third depth map lifting module, the output of the fourth convolution block is respectively input to the first input ends of the fourth progressive complementary fusion module and the fourth depth map lifting module, and the output of the fifth convolution block is respectively input to the first input ends of the fifth progressive complementary fusion module and the fifth depth map lifting module;
the depth image is connected with the tenth convolution block after sequentially passing through the sixth convolution block, the seventh convolution block, the eighth convolution block and the ninth convolution block; the output of the sixth convolution block is input to the second input end of the first depth map lifting module, the output of the seventh convolution block is input to the second input end of the second depth map lifting module, the output of the eighth convolution block is input to the second input end of the third depth map lifting module, the output of the ninth convolution block is input to the second input end of the fourth depth map lifting module, and the output of the tenth convolution block is input to the second input end of the fifth depth map lifting module;
the output of the first depth map lifting module is input to the second input end of the first progressive complementary fusion module, the output of the second depth map lifting module is input to the second input end of the second progressive complementary fusion module, the output of the third depth map lifting module is input to the second input end of the third progressive complementary fusion module, the output of the fourth depth map lifting module is input to the second input end of the fourth progressive complementary fusion module, and the output of the fifth depth map lifting module is input to the second input end of the fifth progressive complementary fusion module;
the output of the fifth progressive complementary fusion module is subjected to a two-fold upsampling operation and then input to the third input end of the fourth progressive complementary fusion module, the output of the fourth progressive complementary fusion module is subjected to a two-fold upsampling operation and then input to the third input end of the third progressive complementary fusion module, the output of the third progressive complementary fusion module is subjected to a two-fold upsampling operation and then input to the third input end of the second progressive complementary fusion module, and the output of the second progressive complementary fusion module is subjected to a two-fold upsampling operation and then input to the third input end of the first progressive complementary fusion module;
the output of the first progressive complementary fusion module, the output of the second progressive complementary fusion module, the output of the third progressive complementary fusion module, the output of the fourth progressive complementary fusion module and the output of the fifth progressive complementary fusion module are respectively input to the decoding stage;
the decoding stage comprises four multi-level residual modules; the fourth multi-level residual module is connected with the first multi-level residual module after sequentially passing through the third multi-level residual module and the second multi-level residual module; the output of the fourth multi-level residual module is input to the first input end of the third multi-level residual module, the output of the third multi-level residual module is input to the first input end of the second multi-level residual module, and the output of the second multi-level residual module is input to the first input end of the first multi-level residual module; the output of the fifth progressive complementary fusion module is input to a third transposed convolution module, and the output of the third transposed convolution module is input to the first input end of the fourth multi-level residual module after a two-fold upsampling operation; the output of the fourth progressive complementary fusion module is input to the second input end of the fourth multi-level residual module, the output of the third progressive complementary fusion module is input to the second input end of the third multi-level residual module, the output of the second progressive complementary fusion module is input to the second input end of the second multi-level residual module, and the output of the first progressive complementary fusion module is input to the second input end of the first multi-level residual module; the output of the first multi-level residual module is used as the output of the neural network; the third transposed convolution module comprises a first transposed convolution layer and an eighth normalization layer, the input of the third transposed convolution module is sequentially input to the first transposed convolution layer and the eighth normalization layer, and the output of the eighth normalization layer is used as the output of the third transposed convolution module.
3. The indoor scene image processing method based on the progressive guidance fusion complementary network as claimed in claim 2, wherein: the fifth progressive complementary fusion module has a structure specifically as follows:
it comprises five convolution modules, a first self-adaptive pooling layer and two Softmax layers; the first input end of the fifth progressive complementary fusion module is sequentially connected with the first self-adaptive pooling layer and the first convolution module, and the output of the first convolution module multiplied by the input of the first input end of the fifth progressive complementary fusion module gives a first convolution output; the input of the second input end of the fifth progressive complementary fusion module and the first convolution output are connected and then input into the second convolution module; the output of the second convolution module and the output of the first convolution module are added and then input into a third convolution module, the output of the third convolution module is input into a first Softmax layer, and the output of the first Softmax layer multiplied by the input of the first input end of the fifth progressive complementary fusion module is used as a first Softmax characteristic output; the output of the second convolution module and the input of the second input end of the fifth progressive complementary fusion module are added and then input to a fourth convolution module, the output of the fourth convolution module is input to a second Softmax layer, and the output of the second Softmax layer multiplied by the input of the second input end of the fifth progressive complementary fusion module is used as a second Softmax characteristic output; the first Softmax characteristic output and the second Softmax characteristic output are added and then input into a fifth convolution module, and the output of the fifth convolution module is used as the output of the fifth progressive complementary fusion module;
the first progressive complementary fusion module, the second progressive complementary fusion module, the third progressive complementary fusion module and the fourth progressive complementary fusion module have the same structure, with the following operations added on the basis of the fifth progressive complementary fusion module, specifically: a second self-adaptive pooling layer and a full-connection module are added, the third input end of each progressive complementary fusion module is sequentially connected with the second self-adaptive pooling layer and the full-connection module, the output of the full-connection module multiplied by the output of the second convolution module forms a second convolution output, the second convolution output is respectively added with the first convolution output and the input of the second input end of each progressive complementary fusion module, and the other operations are the same as those of the fifth progressive complementary fusion module.
4. The indoor scene image processing method based on the progressive guidance fusion complementary network of claim 3, wherein: the full-connection module comprises a first full-connection layer, a second full-connection layer, a sixth activation layer and a seventh activation layer, the output of the second self-adaptive pooling layer is input into the first full-connection layer, the first full-connection layer is sequentially connected with the sixth activation layer and the second full-connection layer, the output of the second full-connection layer is input into the seventh activation layer, and the output of the seventh activation layer is used as the output of the full-connection module.
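A PyTorch sketch covering claims 3 and 4 together; the 1 × 1 and 3 × 3 kernels, batch normalization, the channel-wise Softmax, the ReLU/Sigmoid activations of the full-connection module, and the exact tensors entering the first addition and the first Softmax product are assumptions where the claims leave the choice open:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PCF(nn.Module):
    """Progressive complementary fusion module: `rgb` feeds the first input end,
    `depth` (from a depth map lifting module) the second, and `guide` (the upsampled
    deeper PCF output) the optional third input end of the 1st-4th PCF modules."""
    def __init__(self, ch, with_guidance=False):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(ch, ch, 1), nn.ReLU(inplace=True))            # first convolution module
        self.conv2 = nn.Sequential(nn.Conv2d(2 * ch, ch, 3, padding=1),
                                   nn.BatchNorm2d(ch), nn.ReLU(inplace=True))               # second convolution module
        self.conv3 = nn.Sequential(nn.Conv2d(ch, ch, 1), nn.BatchNorm2d(ch))                # third convolution module
        self.conv4 = nn.Sequential(nn.Conv2d(ch, ch, 1), nn.BatchNorm2d(ch))                # fourth convolution module
        self.conv5 = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1),
                                   nn.BatchNorm2d(ch), nn.ReLU(inplace=True))                # fifth convolution module
        self.fc = (nn.Sequential(nn.Linear(ch, ch // 4), nn.ReLU(inplace=True),              # full-connection module
                                 nn.Linear(ch // 4, ch), nn.Sigmoid())
                   if with_guidance else None)

    def forward(self, rgb, depth, guide=None):
        w = self.conv1(F.adaptive_avg_pool2d(rgb, 1))              # first self-adaptive pooling + first conv module
        f1 = rgb * w                                               # first convolution output
        f2 = self.conv2(torch.cat([depth, f1], dim=1))             # connection then second convolution module
        if self.fc is not None and guide is not None:              # guide assumed to have `ch` channels
            g = self.fc(F.adaptive_avg_pool2d(guide, 1).flatten(1))
            f2 = f2 * g[:, :, None, None]                          # second convolution output (gated)
        a1 = torch.softmax(self.conv3(f2 + f1), dim=1) * rgb       # first Softmax characteristic output (reading assumed)
        a2 = torch.softmax(self.conv4(f2 + depth), dim=1) * depth  # second Softmax characteristic output
        return self.conv5(a1 + a2)                                 # output of the PCF module
```

Under these assumptions, the fifth PCF module corresponds to with_guidance=False, while the first to fourth modules set with_guidance=True and receive the two-fold upsampled output of the next deeper PCF module as guide.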
5. The indoor scene image processing method based on the progressive guidance fusion complementary network as claimed in claim 2, wherein: the multi-level residual modules are all specifically as follows:
the output of the sixth convolution module, the output of the seventh convolution module and the output of the eighth convolution module are connected and then input to a tenth convolution module, the output of the tenth convolution module and the output of the ninth convolution module are added to obtain a third convolution output, the third convolution output sequentially passes through the residual sub-module and is input to a second residual unit, and the output of the second residual unit is used as the output of the multi-level residual module;
the residual sub-module mainly comprises a plurality of first residual units, wherein the residual sub-modules in the first multi-level residual module and the second multi-level residual module comprise two first residual units, the residual sub-module in the third multi-level residual module comprises three first residual units, and the residual sub-module in the fourth multi-level residual module comprises five first residual units.
6. The indoor scene image processing method based on the progressive guidance fusion complementary network as claimed in claim 5, wherein:
the first residual unit comprises an eleventh convolution module, a twelfth convolution module and a thirteenth convolution module, the third convolution output is respectively input to the eleventh convolution module and the twelfth convolution module, the output of the eleventh convolution module is input to the thirteenth convolution module, and the output of the thirteenth convolution module and the output of the twelfth convolution module are added to obtain a fourth convolution output;
the second residual unit comprises a fourteenth convolution module, a first transposed convolution module and a second transposed convolution module, wherein the fourth convolution output of the first residual unit connected with the second residual unit is respectively input to the fourteenth convolution module and the first transposed convolution module, the output of the fourteenth convolution module is input to the second transposed convolution module, and the output of the second transposed convolution module and the output of the first transposed convolution module are added to obtain the output of the second residual unit;
the first transposed convolution module and the second transposed convolution module are identical in structure and each comprise a second transposed convolution layer and a seventh normalization layer, the input of the transposed convolution module is sequentially input into the second transposed convolution layer and the seventh normalization layer, and the output of the seventh normalization layer is used as the output of the transposed convolution module.
7. The indoor scene image processing method based on the progressive guidance fusion complementary network as claimed in claim 2, wherein:
the five depth map lifting modules have the same structure, which is specifically as follows:
each depth map lifting module comprises two convolution modules, wherein the input of the first input end of the depth map lifting module and the input of the second input end of the depth map lifting module are connected and then input into a fifteenth convolution module, the output of the fifteenth convolution module and the input of the first input end of the depth map lifting module are added and then input into a sixteenth convolution module, and the output of the sixteenth convolution module and the input of the second input end of the depth map lifting module are added and then serve as the output of the depth map lifting module.
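A sketch of this depth map lifting module; the 3 × 3 kernels and batch normalization are assumptions, while the layer layout (the fifteenth convolution module ending with an activation, the sixteenth with a normalization only) follows claim 9:

```python
import torch
import torch.nn as nn

class DepthMapLifting(nn.Module):
    """Depth map lifting module: an RGB feature (first input end) guides the depth
    feature (second input end) through two convolution modules and residual additions."""
    def __init__(self, ch):
        super().__init__()
        self.conv15 = nn.Sequential(nn.Conv2d(2 * ch, ch, 3, padding=1),
                                    nn.BatchNorm2d(ch), nn.ReLU(inplace=True))  # fifteenth convolution module
        self.conv16 = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1),
                                    nn.BatchNorm2d(ch))                          # sixteenth convolution module

    def forward(self, rgb_feat, depth_feat):
        x = self.conv15(torch.cat([rgb_feat, depth_feat], dim=1))  # connect the two input ends
        x = self.conv16(x + rgb_feat)                              # add the first input end
        return x + depth_feat                                      # add the second input end -> lifted depth feature
```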
8. The indoor scene image processing method based on the progressive guidance fusion complementary network as claimed in claim 2, wherein: the first convolution block and the sixth convolution block have the same structure and are mainly formed by sequentially connecting a first convolution layer, a first normalization layer and a first activation layer; the second convolution block and the seventh convolution block have the same structure and are mainly formed by sequentially connecting three residual units; the third convolution block and the eighth convolution block have the same structure and are mainly formed by sequentially connecting four residual units; the fourth convolution block and the ninth convolution block have the same structure and are mainly formed by sequentially connecting six residual units; the fifth convolution block and the tenth convolution block have the same structure and are formed by sequentially connecting three residual units;
the three residual units of the second convolution block have the same structure, and each residual unit comprises a second convolution layer, a second normalization layer, a second activation layer, a third convolution layer, a third normalization layer and a third activation layer;
the input of the residual unit is input into the second convolution layer, the second convolution layer is connected with the third normalization layer after sequentially passing through the second normalization layer, the second activation layer and the third convolution layer, the output of the third normalization layer is added with the input of the residual unit and then input into the third activation layer, and the output of the third activation layer is used as the output of the residual unit;
the first residual unit of the third convolution block, the first residual unit of the fourth convolution block and the first residual unit of the fifth convolution block are the same, with the following operations added on the basis of the residual unit of the second convolution block: a fourth convolution layer and a fourth normalization layer are added, the input of the residual unit is respectively input into the second convolution layer and the fourth convolution layer, the output of the fourth convolution layer is input into the fourth normalization layer, the output of the fourth normalization layer and the output of the third normalization layer are added and then input into the third activation layer, and the other operations are the same as those of the residual unit of the second convolution block; the other three residual units of the third convolution block, the other five residual units of the fourth convolution block and the other residual units of the fifth convolution block have the same structure as the residual units of the second convolution block.
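The convolution blocks of claim 8 follow the familiar two-convolution residual pattern; a sketch is given below, with the ResNet-34-style strides and channel widths implied by the feature sizes in the embodiment taken as assumptions:

```python
import torch.nn as nn

class EncoderResidualUnit(nn.Module):
    """Residual unit of claim 8: two 3x3 convolutions with an identity shortcut;
    the first unit of the deeper blocks uses a projection shortcut (fourth convolution
    layer + fourth normalization layer) when the resolution or channel count changes."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(out_ch))
        self.shortcut = (nn.Identity() if stride == 1 and in_ch == out_ch else
                         nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                                       nn.BatchNorm2d(out_ch)))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.main(x) + self.shortcut(x))

def conv_block(in_ch, out_ch, n_units, stride=1):
    # Second to fifth (and seventh to tenth) convolution blocks: stacked residual units.
    units = [EncoderResidualUnit(in_ch, out_ch, stride)]
    units += [EncoderResidualUnit(out_ch, out_ch) for _ in range(n_units - 1)]
    return nn.Sequential(*units)
```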
9. The indoor scene image processing method based on the progressive guidance fusion complementary network as claimed in claim 2, wherein: the first convolution module is mainly formed by connecting a fifth convolution layer and a fourth activation layer, the input of the convolution module is input into the fifth convolution layer, and the output of the fourth activation layer is used as the output of the convolution module;
the third convolution module, the fourth convolution module and the sixteenth convolution module have the same structure and are mainly formed by connecting a sixth convolution layer and a fifth normalization layer, the input of the convolution module is input into the sixth convolution layer, and the output of the fifth normalization layer is used as the output of the convolution module;
the second convolution module, the fifth convolution module and the fifteenth convolution module have the same structure and are mainly formed by sequentially connecting a seventh convolution layer, a sixth normalization layer and a fifth activation layer, the input of the convolution module is input into the seventh convolution layer, and the output of the fifth activation layer is used as the output of the convolution module.
CN202110557921.8A 2021-05-21 2021-05-21 Indoor scene image processing method based on progressive guidance fusion complementary network Withdrawn CN113298814A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110557921.8A CN113298814A (en) 2021-05-21 2021-05-21 Indoor scene image processing method based on progressive guidance fusion complementary network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110557921.8A CN113298814A (en) 2021-05-21 2021-05-21 Indoor scene image processing method based on progressive guidance fusion complementary network

Publications (1)

Publication Number Publication Date
CN113298814A true CN113298814A (en) 2021-08-24

Family

ID=77323535

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110557921.8A Withdrawn CN113298814A (en) 2021-05-21 2021-05-21 Indoor scene image processing method based on progressive guidance fusion complementary network

Country Status (1)

Country Link
CN (1) CN113298814A (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170032222A1 (en) * 2015-07-30 2017-02-02 Xerox Corporation Cross-trained convolutional neural networks using multimodal images
US20200265597A1 (en) * 2018-03-14 2020-08-20 Dalian University Of Technology Method for estimating high-quality depth maps based on depth prediction and enhancement subnetworks
CN110728682A (en) * 2019-09-09 2020-01-24 浙江科技学院 Semantic segmentation method based on residual pyramid pooling neural network
CN110782462A (en) * 2019-10-30 2020-02-11 浙江科技学院 Semantic segmentation method based on double-flow feature fusion
CN110889416A (en) * 2019-12-13 2020-03-17 南开大学 Salient object detection method based on cascade improved network
CN112232358A (en) * 2020-09-07 2021-01-15 浙江科技学院 Cross-modal enhancement and loss function improvement significance detection method
CN112767294A (en) * 2021-01-14 2021-05-07 Oppo广东移动通信有限公司 Depth image enhancement method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
吴建国等: "融合显著深度特征的RGB-D图像显著目标检测", 《电子与信息学报》, 30 September 2017 (2017-09-30), pages 2148 - 2154 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117593517A (en) * 2024-01-19 2024-02-23 南京信息工程大学 Camouflage target detection method based on complementary perception cross-view fusion network
CN117593517B (en) * 2024-01-19 2024-04-16 南京信息工程大学 Camouflage target detection method based on complementary perception cross-view fusion network


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20210824)