CN113920317A

CN113920317A - Semantic segmentation method based on visible light image and low-resolution depth image

Info

Publication number: CN113920317A
Application number: CN202111369121.XA
Authority: CN
Inventors: 袁媛; 苏月皎; 姜志宇
Original assignee: Northwestern Polytechnical University
Current assignee: Northwestern Polytechnical University
Priority date: 2021-11-15
Filing date: 2021-11-15
Publication date: 2022-01-11
Anticipated expiration: 2041-11-15
Also published as: CN113920317B

Abstract

The invention provides a semantic segmentation method based on a visible light image and a low-resolution depth image. Based on multi-task learning, the low-resolution depth image is processed by utilizing the high-resolution visible light image by utilizing the super-resolution module network to obtain the high-resolution depth image, and then the semantic segmentation result is obtained by utilizing the semantic segmentation module network, so that the high-resolution visible light information which can be obtained in the actual situation is utilized, the resolution and the quality of the semantic segmentation are ensured, the problem of resolution alignment is solved, and the application of the semantic segmentation in the real life is expanded.

Description

Semantic segmentation method based on visible light image and low-resolution depth image

Technical Field

The invention belongs to the technical field of computer vision and image processing, and particularly relates to a semantic segmentation method based on a visible light image and a low-resolution depth image.

Background

Visible light-Depth (RGB-Depth, abbreviated to RGB-D) semantic segmentation is the segmentation of scene objects and regions using the visual appearance information of the scene and the scene distance information acquired by a Depth sensor. The RGB-D semantic segmentation has wide application as an underlying core technology, such as indoor navigation, automatic driving, machine vision and the like.

With the development of the deep learning method, the RGB-D semantic segmentation technology is effectively improved and widely applied to actual visual tasks. S. gupta et al in the document "s.gupta, r.girshick, p.arbel ez, and j.malik.learning Rich sources from RGB-D Images for Object Detection and segmentation. in European Conference on Computer Vision,2014, pp.345-360" propose a Two-stream (Two-stream) encoder-decoder network containing Two encoder branches for extracting visible light information and depth information Features, respectively, to predict the segmentation results. Most of the subsequent RGB-D semantic segmentation methods adopt a double-current network as the basis of the RGB-D semantic segmentation network to extract, fuse and upsample multi-modal features. The document "X.Chen, K.Y.Lin, J.Wang, W.Wu, C.Qian, H.Li, and G.Zeng.Bi-directional Cross-modulation Feature Propagation with Separation-and-Aggregation Gate for RGB-D sensing segmentation. in European Conference connection Computer Vision, 2020" by Chen et al proposes a two-way Cross-modal Feature Propagation based on a Gate, improving the way in which features of both modalities are fused.

Although the above methods work well, they tend to rely on a large number of visible and depth image pairs of the same resolution as training data. However, in reality, due to the depth sensor principle and hardware limitations, the resolution of the depth image is often low, the visible light camera is developed rapidly, and the acquired visible light image has high resolution. Based on this, the current research is mostly to realize the matching with the depth image resolution by acquiring the visible light image with lower resolution. On one hand, the visible light image information with high resolution is not fully utilized, and on the other hand, the generalization capability of the RGB-D semantic segmentation model and the capability of solving practical problems in real life are limited.

Disclosure of Invention

In order to overcome the defect that the existing RGB-D semantic segmentation method cannot process the non-aligned resolution RGB-D image pair, the invention provides a semantic segmentation method based on a visible light image and a low-resolution depth image. Based on multi-task learning, a super-resolution module network is used for processing a high-resolution visible light image and a low-resolution depth image to obtain a high-resolution depth image, and a semantic segmentation module network is used for obtaining a semantic segmentation result, so that high-resolution visible light information which can be obtained in an actual situation is utilized, the resolution and quality of semantic segmentation are guaranteed, the problem of resolution alignment is solved, and the application of semantic segmentation in real life is expanded.

A semantic segmentation method based on a visible light image and a low-resolution depth image is characterized by comprising the following steps:

step 1: the method comprises the steps of conducting down-sampling on a depth image in an input RGB-D data set to obtain a depth image with low resolution; training a super-resolution module network by taking a visible light image and a low-resolution depth image in an RGB-D data set as input data and taking the depth image as supervision information to obtain a trained network;

the RGB-D data set is a public visible light and depth image data set;

the super-resolution module network comprises two parallel encoders, a fusion module and a decoder, wherein the encoders perform feature extraction on input images to obtain corresponding feature images; the fusion module performs addition fusion on the feature images output by the two encoders to obtain a fused feature image; the decoder performs up-sampling processing on the fused image to obtain a depth image with the same resolution as the input visible light image;

step 2: the visible light image in the RGB-D data set and the depth image obtained by the super-resolution module are used as input data, the middle two-layer parameters of the encoder of the trained super-resolution module network are used as supervision information of the middle two-layer parameters of the encoder of the semantic segmentation module network, and the semantic segmentation module network is trained to obtain a trained network;

the semantic segmentation module network comprises two parallel encoders, a fusion module and a decoder, wherein the encoders perform feature extraction on input images to obtain corresponding feature images; the fusion module performs addition fusion on the feature images output by the two encoders to obtain a fused feature image; the decoder performs upsampling processing on the fused image to obtain an image which is a semantic segmentation result image;

and step 3: and inputting the RGB-D data obtained by real acquisition into a trained super-resolution module network, and outputting the RGB-D data which is a semantic segmentation result image through the trained semantic segmentation module network.

2. The method of claim 1, wherein the semantic segmentation is based on a visible light image and a low resolution depth image, and wherein: loss function L of the super-resolution module network_SRComprises the following steps:

where N represents the number of high and low resolution depth image pairs, i represents the index of the high and low depth image pairs,

representing the original depth image in the RGB-D data set,

representing low-resolution depth images obtained by down-sampling, I_rgbRepresenting a visible light image, W, in an RGB-D data set_srRepresenting super-resolution module network parameters, f_sr(. -) represents super resolution module network processing.

Specifically, the loss function L of the super-resolution module network_SRComprises the following steps:

representing the original depth image in the RGB-D data set,

Specifically, the encoders in the super-resolution module network and the semantic segmentation module network are both 4-layer convolutional networks, and the loss function L of the semantic segmentation module network_segComprises the following steps:

wherein L (x, class) represents the weighted cross entropy between the predicted segmentation result and the real label, x represents the predicted semantic segmentation result, class represents the category,

represents the parameters of the i-th layer of the encoder in the super-resolution module network,

representing parameters of an i-th layer of an encoder in a semantic segmentation module network;

wherein, the calculation formula of L (x, class) is as follows:

wherein weight [ class ] represents the weight of class, the value of the weight is equal to the proportion of the number of class pixels in the data set to the total number of pixels, x [ class ] represents the class channel of the output characteristic diagram, j represents the position of the pixel, and x [ j ] represents the probability that the pixel j is predicted to be the class.

The invention has the beneficial effects that: due to the introduction of the super-resolution subtask, the difference between the depth resolution and the visible light resolution which can be obtained in real life can be made up, the high-resolution visible light information which can be obtained in real life is fully utilized, and the method has more practical and industrial values; because the relativity between the super-resolution subtasks and the semantic segmentation subtasks is utilized, the semantic segmentation subtask architecture is assisted to be optimized, and the accuracy of the semantic segmentation network can be improved. The invention is suitable for vehicle-mounted auxiliary systems, automatic driving systems, indoor autonomous navigation systems and the like, and has better practical value.

Drawings

FIG. 1 is a flow chart of the semantic segmentation method based on visible light images and low resolution depth images of the present invention;

FIG. 2 is a comparison graph of depth image results for a super-resolution module network of the present invention;

in the figure, (a) -an original depth image, (b) -a super-resolution module network output depth image, (c) -an actual high-resolution depth image and (d) -a high-resolution visible light image;

FIG. 3 is a comparison image of the segmentation results of the present invention;

in the figure, (a) -input depth image, (b) -input visible light image, (c) -segmentation result image of the invention and (d) -real segmentation image.

Detailed Description

The present invention will be further described with reference to the following drawings and examples, which include, but are not limited to, the following examples.

As shown in fig. 1, the present invention provides a semantic segmentation method based on a visible light image and a low resolution depth image, which is implemented as follows:

the resolution of the visible light sensor in real life is higher than that of the depth sensor, and in order to make the method applicable to practice, the processing object of the invention is a high-resolution visible light image I_rgbAnd corresponding low resolution depth image

However, the resolution of the two modal images in the existing large RGB-D data set is consistent, so in order to train the super-resolution subtask, the depth image in the RGB-D data set is firstly down-sampled to obtain the depth image with relatively lower resolution

1. Super resolution subtask

Visible light image I_rgbAnd low resolution depth image

And performing super-resolution module network training as input. The super-resolution module network comprises two parallel encoders, a fusion module and a decoder, wherein the two branched encoders are used for extracting characteristics of a high-resolution visible light image and a low-resolution depth image, and the extracted characteristics are fused and transmitted to the decoder for up-sampling. With high resolution depth information originally in the data set

Supervising the generation of the high resolution depth image as a supervision signal to train the super resolution module network:

representing the original depth image in the ith pair of images in the RGB-D dataset,

a low resolution depth image representing the image of the ith pair obtained by down-sampling,

representing the visible image in the ith pair of images in an RGB-D dataset, W_srRepresenting super-resolution module network parameters, f_sr(. -) represents super resolution module network processing.

2. Semantic segmentation subtasks

The super-resolution subtask may obtain an aligned resolution RGB-D image pair from a non-aligned resolution RGB-D image pair. Inputting the predicted aligned RGB-D image pair into a semantic segmentation module network, and extracting the characteristics of the visible light image and the depth image predicted by the super-resolution module network by using two encoders comprising K-layer convolution, wherein K is the number of network layers and can be adjusted according to actual conditions. And fusing the extracted features, transmitting the fused features into a decoder, and performing up-sampling to finally obtain a segmentation result.

Aiming at the training of a semantic segmentation module network and the optimization solution of a loss function thereof, the invention adopts a two-part combined mode, wherein one part is from the cross entropy loss between a predicted segmentation result and a real label in a data set, and the other part is from the coupling constraint between a super-resolution module network and the semantic segmentation module network, namely, after the training of the super-resolution module network is finished, the parameters of two middle layers of encoders of the super-resolution sub-module network are taken out and are used as the supervision information of two middle layers of parameters of the semantic segmentation module network encoder to assist the training of two middle layers of the semantic segmentation network encoder. Namely, the following objective function is used for carrying out optimization solution on the network parameters of the semantic segmentation module:

wherein weight [ class ] represents the weight of class, the value of the weight is equal to the proportion of the number of class pixels in the data set to the total number of pixels, x [ class ] represents the class channel of the output characteristic diagram, j represents the position of the pixel, and x [ j ] represents the probability that the pixel j is predicted to be the class, and the value range is 0-1.

3. Model application

After the model parameters are optimized and learned, the highest-resolution visible light image which can be collected by the visible light sensor in a real scene and the highest-resolution depth image which can be collected by the depth sensor are input into the super-resolution module network, and the depth image which is aligned to the resolution of the visible light image upwards can be obtained. And then, inputting the depth image obtained by prediction and the visible light image collected by the visible light sensor into a semantic segmentation module network for predicting the segmentation result.

In order to verify the effectiveness of the method, a simulation experiment is carried out by using Pythrch on an operating system with a central processing unit of Intel (R) core (TM) i7-6800K CPU @3.40GHz and a memory 60G, Linux. The data used in the experiment was the published SUN RGB-D dataset. The SUN RGB-D dataset is the current largest RGB-D semantically segmented dataset containing 40 classes of 10335 annotated RGB-D images, of which 5285 pairs were used for training and 5050 pairs were used for testing. It is captured by four different sensors, Kinect V1, Kinect V2, xution and RealSense.

To demonstrate the effectiveness of the algorithm, RedNet, ACNet, PAP, FSFNet were chosen as comparison methods, respectively. Wherein RedNet is proposed in the document J.Jiang, L.Zheng, F.Luo, and Z.Zhang.RedNet for index RGB-D magnetic segmentation. Eprint Arxiv,2018. "; PAP is set forth in the literature "Z.Zhang, Z.Cui, C.Xu, Y.Yan, N.Sebe, J.Yang.Pattern-affinity amplification error Depth, Surface Normal and selection. in IEEE Conference on Computer Vision and Pattern Recognition,2019, pp.4106-4115"; FSFNet is set forth in the document "Y.Su, Y.Yuan, Z.Jiang.deep feature selection-and-fusion for RGB-D magnetic segmentation. in International Conference on Multimedia & Expo, 2021."; ACNet is proposed in documents X.Hu, K.Yang, L.Fei, and K.Wang, "ACNET: Attention Based networks to explicit comparative feeds for RGBD continuous," in Proc.IEEE International Conference on Image Processing,2019, pp.1440-1444. And respectively calculating two indexes of an average cross-over ratio (mIoU) and a Pixel precision (Pixel Acc), evaluating the RGB-D semantic segmentation quality, and obtaining a better segmentation effect when the index value is larger. The calculation results are shown in table 1, and it can be seen that both indexes of the present invention perform better than other methods.

TABLE 1

Fig. 2 shows an input low-resolution depth image, a super-resolution depth image obtained by network prediction using the super-resolution module of the present invention, and an actual high-resolution depth image and a visible light image, and it can be seen that the super-resolution subtask of the present invention can super-divide non-aligned resolution RGB-D image data into aligned resolution RGB-D data, thereby laying a foundation for a subsequent semantic segmentation subtask.

Fig. 3 shows a semantic segmentation result image and a real segmentation image obtained by the method of the present invention when a low resolution depth image and a visible light image are input. It is seen that the invention can achieve better semantic segmentation performance even when the resolution of the input depth image is lower than that of the visible light image.

Claims

1. A semantic segmentation method based on a visible light image and a low-resolution depth image is characterized by comprising the following steps:

the RGB-D data set is a public visible light and depth image data set;

representing the original depth image in the RGB-D data set,

3. A method of semantic segmentation based on visible light images and low resolution depth images according to claim 1 or 2, characterized in that: the encoders in the super-resolution module network and the semantic segmentation module network are both 4-layer convolutional networks, and the loss function L of the semantic segmentation module network_segComprises the following steps:

wherein, the calculation formula of L (x, class) is as follows: