WO2018076212A1 - De-convolutional neural network-based scene semantic segmentation method - Google Patents

De-convolutional neural network-based scene semantic segmentation method Download PDF

Info

Publication number
WO2018076212A1
WO2018076212A1 (PCT/CN2016/103425)
Authority
WO
WIPO (PCT)
Prior art keywords
layer
neural network
picture
scene
semantic segmentation
Prior art date
Application number
PCT/CN2016/103425
Other languages
French (fr)
Chinese (zh)
Inventor
黄凯奇
赵鑫
程衍华
Original Assignee
Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
Priority to PCT/CN2016/103425 priority Critical patent/WO2018076212A1/en
Publication of WO2018076212A1 publication Critical patent/WO2018076212A1/en

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis

Definitions

  • The invention relates to the fields of pattern recognition, machine learning, and computer vision, and in particular to a scene semantic segmentation method based on a deconvolutional neural network.
  • Scene semantic segmentation uses a computer to analyze images intelligently and then determine the object category to which each pixel in the image belongs, such as floor, wall, person, or chair.
  • Traditional scene semantic segmentation algorithms generally rely only on RGB (red, green, and blue) images for segmentation, so they are easily affected by lighting changes, object color changes, and background noise; they are not robust in practical use, and their accuracy hardly meets user needs.
  • In view of the above problems in the prior art, the present invention proposes a scene semantic segmentation method based on a deconvolutional neural network to improve the accuracy of scene semantic segmentation.
  • The scene semantic segmentation method based on a deconvolutional neural network of the present invention comprises the following steps:
  • Step S1: extracting a dense feature expression for the scene picture with a fully convolutional neural network;
  • Step S2: using a locality-sensitive deconvolutional neural network and the local affinity matrix of the picture, upsampling and optimizing the dense feature expression obtained in step S1 to obtain a score map of the picture, thereby achieving fine scene semantic segmentation.
  • The local affinity matrix is obtained by extracting SIFT (scale-invariant feature transform) features, SPIN (spin image) features, and gradient features of the picture, and then applying the ucm-gPb (contour detection and hierarchical image segmentation) algorithm.
  • The locality-sensitive deconvolutional neural network is formed by stacking three modules several times; the three modules are a locality-sensitive unpooling layer, a deconvolution layer, and a locality-sensitive average pooling layer.
  • The number of stacking repetitions is 2 or 3.
  • The output of the locality-sensitive unpooling layer is obtained by the formula Y_{i,j} = A_{i,j} · x,
  • where x represents the feature vector of a pixel in the feature map and A = {A_{i,j}} is an s×s local affinity matrix centered on x that indicates whether the surrounding pixels are similar to the center pixel,
  • (i, j) and (o, o) represent an arbitrary position and the center position in the affinity matrix, respectively,
  • and Y = {Y_{i,j}} is the feature map output by the unpooling operation.
  • The scene picture includes an RGB picture and a depth picture.
  • The method further includes step S3: optimally fusing the RGB score map and the depth score map through a gated fusion layer, thereby achieving finer scene semantic segmentation.
  • The gated fusion layer includes a concatenation layer, a convolution layer, and a normalization layer.
  • The convolution layer is implemented by the function C = W ⊗ [P_rgb, P_depth], where P_rgb ∈ ℝ^{c×h×w} is the score map predicted from the RGB data, P_depth ∈ ℝ^{c×h×w} is the score map predicted from the depth data, W ∈ ℝ^{c×2c×1×1} is the filter bank learned by the gated fusion layer,
  • and C ∈ ℝ^{c×h×w} is the contribution coefficient matrix output by the convolution.
  • The normalization layer is implemented by a sigmoid function (an S-shaped function, also referred to as an S-shaped growth curve).
  • The locality-sensitive deconvolutional neural network uses local low-level information to strengthen the sensitivity of the fully convolutional neural network to local edges, thereby obtaining higher-precision scene segmentation; it can effectively overcome the inherent defect of fully convolutional networks,
  • namely that very large amounts of context information are aggregated for scene segmentation, which blurs object edges.
  • By designing the gated fusion layer, the different roles played by the RGB and depth modalities for different objects in different scenes can be learned automatically and effectively during semantic segmentation.
  • This dynamically adaptive contribution coefficient is superior to the undifferentiated treatment used by traditional algorithms and can further improve scene segmentation accuracy.
  • Figure 1 is a flow chart of one embodiment of the method of the present invention.
  • Figure 2 is a schematic diagram of the fully convolutional neural network used for dense feature extraction in the present invention.
  • Figure 3a is a schematic diagram of a locality-sensitive deconvolutional neural network according to an embodiment of the present invention.
  • Figure 3b is a schematic diagram of a locality-sensitive unpooling layer and a locality-sensitive average pooling layer in accordance with one embodiment of the present invention.
  • Figure 4 shows the gated fusion layer in accordance with an embodiment of the present invention.
  • A scene semantic segmentation method based on a deconvolutional neural network includes the following steps:
  • Step S1: extracting a low-resolution dense feature expression for the scene picture with a fully convolutional neural network;
  • Step S2: using a locality-sensitive deconvolutional neural network and the local affinity matrix of the picture, upsampling and optimizing the dense feature expression obtained in step S1 to obtain a score map of the picture, thereby achieving fine scene semantic segmentation.
  • The present invention employs a fully convolutional neural network to efficiently extract dense features of a picture, which may be an RGB picture and/or a depth picture.
  • Through repeated convolution, downsampling, and max pooling, the fully convolutional neural network aggregates rich context information
  • to describe each pixel in the picture, yielding an RGB feature map S1 and/or a depth feature map S1.
  • However, the fully convolutional neural network yields a low-resolution feature map with very blurred edges.
  • The present invention therefore embeds low-level, pixel-level information into the deconvolutional neural network to guide the training of the network.
  • The locality-sensitive deconvolutional neural network is used to perform upsampling learning and object edge optimization, yielding an RGB score map S2 and/or a depth score map S2 and thus achieving finer scene semantic segmentation.
  • In step S2, the similarity relationship between each pixel in the picture and its neighboring pixels is first calculated, and a binarized local affinity matrix is obtained.
  • SIFT, SPIN, and gradient features of the RGB and depth pictures can be extracted, and the local affinity matrix is obtained with the ucm-gPb algorithm.
  • The network structure can include three modules: a locality-sensitive unpooling layer, a deconvolution layer, and a locality-sensitive average pooling layer.
  • The inputs of the locality-sensitive unpooling layer are the feature map response of the previous layer and the local affinity matrix; the output is a feature map response at twice the resolution.
  • The main function of this network layer is to learn to recover the richer detail of the original picture and to obtain segmentations with sharper object edges.
  • The output of the locality-sensitive unpooling layer can be obtained by the formula Y_{i,j} = A_{i,j} · x,
  • where x represents the feature vector of a pixel in the feature map and A = {A_{i,j}} is an s×s binarized local affinity matrix centered on x.
  • The input of the deconvolution layer is the output of the preceding unpooling layer, and the output is a feature map response at the same resolution.
  • This network layer is mainly used to smooth the feature map: the unpooling layer tends to produce many broken object edges, and the deconvolution process learns to stitch these breaks together.
  • Deconvolution is the inverse of convolution, mapping each activation value to multiple output activations; the response map becomes relatively smoother after deconvolution.
  • The inputs of the locality-sensitive average pooling layer are the output of the preceding deconvolution layer and the local affinity matrix, and the output is a feature map response at the same resolution.
  • This network layer is mainly used to obtain a more robust feature expression for each pixel while maintaining sensitivity to object edges.
  • The invention stacks the locality-sensitive unpooling layer, the deconvolution layer, and the locality-sensitive average pooling layer several times, gradually upsampling and refining the details of the scene segmentation to obtain a finer and more accurate segmentation result.
  • The number of stacking repetitions is 2 or 3; the more repetitions, the finer and more accurate the segmentation, but the larger the amount of computation.
  • RGB color information and depth information describe different modalities of the objects in a scene.
  • RGB images can describe the appearance, color, and texture features of an object,
  • while depth data provide the spatial geometry, shape, and size of an object. Effectively fusing these two complementary sources of information can improve the accuracy of scene semantic segmentation.
  • Existing methods basically treat the data of the two modalities as equivalent and cannot distinguish their different contributions when identifying different objects in different scenes.
  • The RGB score map and the depth score map obtained in the above steps S1 and S2 are optimally fused through gated fusion to obtain a fused score map, thereby achieving finer scene semantic segmentation, as shown in Figure 4.
  • The gated fusion layer can effectively measure the importance of RGB (appearance) and depth (shape) information for identifying different objects in different scenes.
  • The gated fusion layer of the present invention is mainly composed of a concatenation layer, a convolution layer, and a normalization layer; it can automatically learn the weights of the two modalities and thus better fuse
  • their complementary information for scene semantic segmentation.
  • The features obtained from the RGB and depth networks are first concatenated through the concatenation layer.
  • Next comes the convolution operation: the convolution layer learns the weight matrix of the RGB and depth information.
  • The convolution process can be implemented as C = W ⊗ [P_rgb, P_depth], where:
  • P_rgb ∈ ℝ^{c×h×w} (a feature map of c channels, each of height h and width w) is the score map predicted from the RGB data,
  • P_depth ∈ ℝ^{c×h×w} (parameters as defined above) is the score map predicted from the depth data,
  • W ∈ ℝ^{c×2c×1×1} (c filters, each a 2c×1×1 three-dimensional matrix) is the filter bank learned by the gated fusion layer,
  • and C ∈ ℝ^{c×h×w} is the contribution coefficient matrix output by the convolution.
  • ⊙ denotes element-wise matrix multiplication. With C_rgb = C and C_depth = 1 − C, the contribution coefficients are applied as P̂_rgb = C_rgb ⊙ P_rgb and P̂_depth = C_depth ⊙ P_depth, and the RGB and depth scores are added to give the final fusion score P_fuse = P̂_rgb + P̂_depth. Based on the final score map, the semantic segmentation result can be obtained.
  • The new locality-sensitive deconvolutional neural network proposed by the present invention can be used for RGB-D indoor scene semantic segmentation.
  • The invention adapts well to the lighting changes, background clutter, many small objects, and occlusions of indoor scenes, and can exploit the complementarity of RGB and depth more effectively to obtain a more robust, more accurate scene semantic segmentation result with better-preserved object edges.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

Disclosed is a deconvolutional neural network-based scene semantic segmentation method. The method comprises the following steps: S1, extracting a dense feature expression for a scene picture using a fully convolutional neural network; and S2, performing upsampling learning and object edge optimization on the dense feature expression obtained in step S1 with a locality-sensitive deconvolutional neural network and the local affinity matrix of the picture, so as to obtain a score map of the picture and thereby achieve refined scene semantic segmentation. Through the locality-sensitive deconvolutional neural network, local low-level information is used to strengthen the sensitivity of the fully convolutional neural network to local edges, so that higher-precision scene segmentation is obtained.

Description

Scene Semantic Segmentation Method Based on a Deconvolutional Neural Network

Technical Field

The invention relates to the fields of pattern recognition, machine learning, and computer vision, and in particular to a scene semantic segmentation method based on a deconvolutional neural network.

Background

With the rapid advance of computing power, fields such as computer vision, artificial intelligence, and machine perception have also developed rapidly. Scene semantic segmentation, one of the basic problems in computer vision, has likewise made great progress. Scene semantic segmentation uses a computer to analyze an image intelligently and then determine the object category to which each pixel in the image belongs, such as floor, wall, person, or chair. Traditional scene semantic segmentation algorithms generally rely only on RGB (red, green, and blue) pictures for segmentation, so they are easily disturbed by lighting changes, object color changes, and cluttered backgrounds; they are not robust in practical use, and their accuracy hardly meets user needs.

The development of depth sensing technology, such as Microsoft's Kinect, makes it possible to capture high-precision depth pictures, which compensates well for the above shortcomings of traditional RGB pictures and makes robust, high-precision object recognition possible. In computer vision and robotics there is a large body of research on how to use RGB and depth information effectively to improve the accuracy of scene segmentation. These algorithms basically use state-of-the-art fully convolutional neural networks for scene segmentation, but every unit of a fully convolutional network has a very large receptive field, which easily makes the edges of the segmented objects very coarse. Moreover, the simplest superposition strategy is usually adopted when fusing RGB and depth information, without considering that the data of the two modalities play very different roles when distinguishing different objects in different scenes, so many objects are misclassified during semantic segmentation.
Summary of the Invention

In view of the above problems in the prior art, the present invention proposes a scene semantic segmentation method based on a deconvolutional neural network to improve the accuracy of scene semantic segmentation.

The scene semantic segmentation method based on a deconvolutional neural network of the present invention comprises the following steps:

Step S1: extracting a dense feature expression for the scene picture with a fully convolutional neural network;

Step S2: using a locality-sensitive deconvolutional neural network and the local affinity matrix of the picture, upsampling and optimizing the dense feature expression obtained in step S1 to obtain a score map of the picture, thereby achieving fine scene semantic segmentation.

Further, the local affinity matrix is obtained by extracting SIFT (scale-invariant feature transform) features, SPIN (spin image) features, and gradient features of the picture, and then applying the ucm-gPb (contour detection and hierarchical image segmentation) algorithm.

Further, the locality-sensitive deconvolutional neural network is formed by stacking three modules several times; the three modules are a locality-sensitive unpooling layer, a deconvolution layer, and a locality-sensitive average pooling layer.

Further, the number of stacking repetitions is 2 or 3.
Further, the output of the locality-sensitive unpooling layer is obtained by the following formula:

Y_{i,j} = A_{i,j} · x

where x represents the feature vector of a pixel in the feature map, A = {A_{i,j}} is an s×s local affinity matrix centered on x that indicates whether the surrounding pixels are similar to the center pixel, (i, j) and (o, o) represent an arbitrary position and the center position in the affinity matrix, respectively, and Y = {Y_{i,j}} is the feature map output by the unpooling operation.

Further, the locality-sensitive average pooling layer is implemented by the following formula:

y = ( Σ_{(i,j)} A_{i,j} X_{i,j} ) / ( Σ_{(i,j)} A_{i,j} )

where y is the output feature vector, A = {A_{i,j}} is an s×s local affinity matrix centered on y, A_{i,j} indicates whether a surrounding pixel is similar to the center pixel, (i, j) and (o, o) represent an arbitrary position and the center position in the affinity matrix, respectively, and X = {X_{i,j}} is the input feature map.
Further, in step S1 the scene picture includes an RGB picture and a depth picture, and the method further includes step S3: optimally fusing the obtained RGB score map and depth score map through a gated fusion layer, thereby achieving finer scene semantic segmentation.

Further, the gated fusion layer includes a concatenation layer, a convolution layer, and a normalization layer.

Further, the convolution layer is implemented by the following function:

C = W ⊗ [P_rgb, P_depth]

where P_rgb ∈ ℝ^{c×h×w} is the score map predicted from the RGB data, P_depth ∈ ℝ^{c×h×w} is the score map predicted from the depth data, [P_rgb, P_depth] denotes their concatenation along the channel dimension, W ∈ ℝ^{c×2c×1×1} is the filter bank learned by the gated fusion layer, and C ∈ ℝ^{c×h×w} is the contribution coefficient matrix output by the convolution.

Further, the normalization layer is implemented by a sigmoid function (an S-shaped function, also referred to as an S-shaped growth curve).

In the present invention, the locality-sensitive deconvolutional neural network uses local low-level information to strengthen the sensitivity of the fully convolutional neural network to local edges, thereby obtaining higher-precision scene segmentation. This effectively overcomes the inherent defect of fully convolutional networks, namely that very large amounts of context information are aggregated for scene segmentation, which blurs object edges.

Further, by designing the gated fusion layer, the different roles played by the RGB and depth modalities for different objects in different scenes can be learned automatically and effectively during semantic segmentation. This dynamically adaptive contribution coefficient is superior to the undifferentiated treatment used by traditional algorithms and can further improve scene segmentation accuracy.
Brief Description of the Drawings

Figure 1 is a flow chart of one embodiment of the method of the present invention;

Figure 2 is a schematic diagram of the fully convolutional neural network used for dense feature extraction in the present invention;

Figure 3a is a schematic diagram of a locality-sensitive deconvolutional neural network according to an embodiment of the present invention;

Figure 3b is a schematic diagram of the locality-sensitive unpooling layer and the locality-sensitive average pooling layer according to an embodiment of the present invention;

Figure 4 shows the gated fusion layer according to an embodiment of the present invention.

Detailed Description

Preferred embodiments of the present invention are described below with reference to the accompanying drawings. Those skilled in the art should understand that these embodiments are only used to explain the technical principles of the present invention and are not intended to limit its scope of protection.
As shown in Figure 1, a scene semantic segmentation method based on a deconvolutional neural network according to one embodiment of the present invention includes the following steps:

Step S1: extracting a low-resolution dense feature expression for the scene picture with a fully convolutional neural network;

Step S2: using a locality-sensitive deconvolutional neural network and the local affinity matrix of the picture, upsampling and optimizing the dense feature expression obtained in step S1 to obtain a score map of the picture, thereby achieving fine scene semantic segmentation.

Scene semantic segmentation is a typical dense prediction problem: the semantic category of every pixel in the picture must be predicted, so a robust feature expression must be extracted for every pixel. The present invention uses a fully convolutional neural network to efficiently extract dense features of the picture, which may be an RGB picture and/or a depth picture. As shown in Figure 2, through repeated convolution, downsampling, and max pooling, the fully convolutional network aggregates rich context information to describe every pixel of the picture, yielding an RGB feature map S1 and/or a depth feature map S1. However, because of the repeated downsampling and max pooling, the fully convolutional network produces a low-resolution feature map in which object edges are very blurred.
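For illustration only, the sketch below shows one way such a fully convolutional feature extractor could be put together; the use of PyTorch, the three-stage layout, and the channel widths are assumptions made for the example rather than details fixed by the present method.

```python
# Minimal sketch (PyTorch, assumed library) of the step-S1 idea: repeated
# convolution + max pooling aggregates context and yields a dense but
# low-resolution feature map. Channel widths and depth are illustrative.
import torch
import torch.nn as nn

class ToyFCNBackbone(nn.Module):
    def __init__(self, in_channels=3, widths=(64, 128, 256)):
        super().__init__()
        layers, c_prev = [], in_channels
        for c in widths:
            layers += [
                nn.Conv2d(c_prev, c, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=2, stride=2),  # halves the resolution
            ]
            c_prev = c
        self.features = nn.Sequential(*layers)

    def forward(self, x):
        return self.features(x)  # dense feature map at 1/8 resolution here

rgb = torch.randn(1, 3, 480, 640)      # an RGB scene picture
feat = ToyFCNBackbone()(rgb)           # e.g. 1 x 256 x 60 x 80
print(feat.shape)
```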
To this end, the present invention embeds low-level, pixel-level information into the deconvolutional neural network to guide the training of the network. The locality-sensitive deconvolutional neural network performs upsampling learning and object edge optimization on the dense feature expression, yielding an RGB score map S2 and/or a depth score map S2 and thus achieving finer scene semantic segmentation.

Specifically, in step S2 the similarity relationship between each pixel in the picture and its neighboring pixels is first calculated, and a binarized local affinity matrix is obtained. In the present invention, SIFT, SPIN, and gradient features of the RGB and depth pictures can be extracted, and the local affinity matrix is obtained with the ucm-gPb algorithm. The local affinity matrix, together with the obtained RGB feature map S1 and/or depth feature map S1, is then fed into the locality-sensitive deconvolutional neural network, which performs upsampling learning and object edge optimization on the dense feature expression to obtain finer scene semantic segmentation.
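As an illustration of how the binarized local affinity matrix can be formed once a low-level over-segmentation of the picture is available, the NumPy sketch below builds, for every pixel, an s×s window marking which neighbors fall in the same region as that pixel; the `regions` label map standing in for the ucm-gPb output and the window size are assumptions of the example.

```python
# Sketch (NumPy) of a binarized local affinity matrix. `regions` is assumed to
# be a region label map derived from ucm-gPb over SIFT/SPIN/gradient features;
# A[p, q, i, j] = 1 iff neighbor (i, j) lies in the same region as pixel (p, q).
import numpy as np

def local_affinity(regions: np.ndarray, s: int = 5) -> np.ndarray:
    h, w = regions.shape
    r = s // 2
    padded = np.pad(regions, r, mode="edge")
    A = np.zeros((h, w, s, s), dtype=np.uint8)
    for p in range(h):
        for q in range(w):
            window = padded[p:p + s, q:q + s]
            A[p, q] = (window == regions[p, q]).astype(np.uint8)
    return A

regions = np.array([[0, 0, 1],
                    [0, 1, 1],
                    [2, 2, 1]])
A = local_affinity(regions, s=3)
print(A[1, 1])   # 3x3 window: 1 where the neighbor shares region label 1
```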
The purpose of the locality-sensitive deconvolutional neural network is to upsample and refine the coarse feature map produced by the fully convolutional network so as to obtain more accurate scene segmentation. As shown in Figure 3a, the network structure can include three modules: a locality-sensitive unpooling layer, a deconvolution layer, and a locality-sensitive average pooling layer.

As shown in the upper part of Figure 3b, the inputs of the locality-sensitive unpooling layer are the feature map response of the previous layer and the local affinity matrix; its output is a feature map response at twice the resolution. The main function of this network layer is to learn to recover the richer detail of the original picture and to produce segmentations with sharper object edges.

In the present invention, the output of the locality-sensitive unpooling layer can be obtained by the following formula:

Y_{i,j} = A_{i,j} · x

where x represents the feature vector of a pixel in the feature map, A = {A_{i,j}} is an s×s binarized local affinity matrix centered on x that indicates whether the surrounding pixels are similar to the center pixel, (i, j) and (o, o) represent an arbitrary position and the center position in the affinity matrix, respectively, and Y = {Y_{i,j}} is the feature map output by the unpooling operation. Through this unpooling operation, a segmentation map with better resolution and more detail can be obtained.
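A minimal NumPy sketch of this unpooling rule for a single coarse pixel is given below; the 3×3 window size and the particular affinity pattern are illustrative assumptions, and the code simply applies the formula Y_{i,j} = A_{i,j} · x.

```python
# Sketch (NumPy) of locality-sensitive unpooling for one coarse pixel: the
# feature vector x is copied only to the positions the binarized affinity
# window marks as belonging to the same region as the center; others stay zero.
import numpy as np

def locality_sensitive_unpool(x: np.ndarray, A: np.ndarray) -> np.ndarray:
    """x: (c,) feature vector of one coarse pixel; A: (s, s) binary affinity."""
    s = A.shape[0]
    Y = np.zeros((s, s, x.shape[0]), dtype=x.dtype)
    for i in range(s):
        for j in range(s):
            if A[i, j]:          # similar to the center pixel -> copy x
                Y[i, j] = x
    return Y

x = np.array([0.7, -1.2, 3.0])                  # one coarse feature vector
A = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 0, 0]])                       # 3x3 affinity, center (1, 1)
print(locality_sensitive_unpool(x, A)[..., 0])  # x copied only inside region
```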
The input of the deconvolution layer is the output of the preceding unpooling layer, and its output is a feature map response at the same resolution. This network layer is mainly used to smooth the feature map: the unpooling layer tends to produce many broken object edges, and the deconvolution process learns to stitch these breaks together. Deconvolution is the inverse of convolution, mapping each activation value to multiple output activations, so the response map becomes relatively smoother after deconvolution.
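Sketched below, under the assumption that the network is implemented in PyTorch, is a deconvolution (transposed convolution) layer of this kind: with stride 1 and matching padding it keeps the spatial resolution unchanged, and its weights can be trained to smooth the broken edges left by unpooling; the channel count and 5×5 kernel are illustrative.

```python
# Sketch (PyTorch, assumed library) of a same-resolution deconvolution layer
# placed after unpooling. Kernel size and channel count are assumptions.
import torch
import torch.nn as nn

deconv = nn.ConvTranspose2d(in_channels=256, out_channels=256,
                            kernel_size=5, stride=1, padding=2)
unpooled = torch.randn(1, 256, 120, 160)   # output of the unpooling layer
smoothed = deconv(unpooled)                # same spatial size: 1 x 256 x 120 x 160
print(smoothed.shape)
```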
As shown in the lower part of Figure 3b, the inputs of the locality-sensitive average pooling layer are the output of the preceding deconvolution layer and the local affinity matrix; its output is a feature map response at the same resolution. This network layer is mainly used to obtain a more robust feature expression for each pixel while maintaining sensitivity to object edges.

In the present invention, the output of the locality-sensitive average pooling layer can be obtained by the following formula:

y = ( Σ_{(i,j)} A_{i,j} X_{i,j} ) / ( Σ_{(i,j)} A_{i,j} )

where y is the output feature vector, A = {A_{i,j}} is an s×s binarized local affinity matrix centered on y, A_{i,j} indicates whether a surrounding pixel is similar to the center pixel, (i, j) and (o, o) represent an arbitrary position and the center position in the affinity matrix, respectively, and X = {X_{i,j}} is the input feature map of this operation. After locality-sensitive average pooling, a very robust feature expression is obtained while sensitivity to object edges is preserved.
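The following NumPy sketch applies this locality-sensitive average pooling formula to one pixel's s×s feature window; the window size, channel count, and the fallback used when no neighbor is marked similar are assumptions of the example.

```python
# Sketch (NumPy) of locality-sensitive average pooling for one pixel:
# y = sum_ij(A_ij * X_ij) / sum_ij(A_ij), so only neighbors in the same region
# as the center contribute, which keeps the result edge-sensitive.
import numpy as np

def locality_sensitive_avg_pool(X: np.ndarray, A: np.ndarray) -> np.ndarray:
    """X: (s, s, c) input feature window; A: (s, s) binary affinity window."""
    weights = A.astype(X.dtype)
    total = weights.sum()
    if total == 0:                       # degenerate window: fall back to center
        return X[X.shape[0] // 2, X.shape[1] // 2]
    return (weights[..., None] * X).sum(axis=(0, 1)) / total

X = np.random.randn(3, 3, 256)           # 3x3 feature window around one pixel
A = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 0, 0]])
y = locality_sensitive_avg_pool(X, A)     # robust feature for the center pixel
print(y.shape)
```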
The present invention stacks the locality-sensitive unpooling layer, the deconvolution layer, and the locality-sensitive average pooling layer several times, gradually upsampling and refining the details of the scene segmentation to obtain a finer and more accurate segmentation result. Preferably, the number of stacking repetitions is 2 or 3. The more repetitions, the finer and more accurate the segmentation, but the larger the amount of computation.

RGB color information and depth information describe different modalities of the objects in a scene. For example, RGB pictures can describe the appearance, color, and texture features of an object, while depth data provide the spatial geometry, shape, and size of the object. Effectively fusing these two complementary sources of information can improve the accuracy of scene semantic segmentation. Existing methods basically treat the data of the two modalities as equivalent and cannot distinguish their different contributions when identifying different objects in different scenes. Based on this, a preferred embodiment of the present invention proposes to optimally fuse the RGB score map and the depth score map obtained in steps S1 and S2 above through gated fusion, obtaining a fused score map and thereby achieving finer scene semantic segmentation, as shown in Figure 4. The gated fusion layer can effectively measure how important the RGB (appearance) and depth (shape) information is for identifying different objects in different scenes.

Preferably, the gated fusion layer of the present invention is mainly composed of a concatenation layer, a convolution layer, and a normalization layer. It can automatically learn the weights of the two modalities and thus better exploit their complementary information for scene semantic segmentation.
First, the concatenation layer concatenates the score maps obtained from the RGB and depth networks. Next comes the convolution operation: the convolution layer learns the weight matrix of the RGB and depth information. The convolution process can be implemented as follows:

C = W ⊗ [P_rgb, P_depth]

where P_rgb ∈ ℝ^{c×h×w} (a feature map of c channels, each of height h and width w) is the score map predicted from the RGB data, P_depth ∈ ℝ^{c×h×w} (parameters as defined above) is the score map predicted from the depth data, [P_rgb, P_depth] denotes their concatenation along the channel dimension, W ∈ ℝ^{c×2c×1×1} (c filters, each a 2c×1×1 three-dimensional matrix) is the filter bank learned by the gated fusion layer, and C ∈ ℝ^{c×h×w} is the contribution coefficient matrix output by the convolution. Finally comes the normalization step: preferably, each C_{k,i,j} is normalized to the interval [0, 1] by a sigmoid operation. Writing C_rgb = C and C_depth = 1 − C, the contribution coefficient matrices are applied to the original score outputs:

P̂_rgb = C_rgb ⊙ P_rgb,  P̂_depth = C_depth ⊙ P_depth

where ⊙ denotes element-wise (pointwise) matrix multiplication. The RGB and depth scores are then added to form the final fusion score:

P_fuse = P̂_rgb + P̂_depth

Based on this final score map, the semantic segmentation result can be obtained.
In the normalization step, the L1 norm can be used instead of the sigmoid function; the L1 norm sets x1 = x1 / (x1 + x2 + ... + xn), which guarantees that the probabilities sum to 1. The tanh function (hyperbolic tangent) can also be used. The sigmoid is preferred because it is simpler to implement in a neural network, optimizes better, and converges faster.
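To make the flow of the gated fusion layer concrete, the sketch below (assuming a PyTorch implementation, with a 40-class score map chosen only as an example) concatenates the RGB and depth score maps, learns the 1×1 filter bank W, squashes the result to [0, 1] with a sigmoid to obtain the contribution coefficients C, and blends the two score maps with C and 1 − C.

```python
# Sketch (PyTorch, assumed library) of the gated fusion layer:
# concatenation -> 1x1 convolution (W) -> sigmoid -> weighted blend.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        # W in R^{c x 2c x 1 x 1} from the text; using a bias is an assumption.
        self.gate = nn.Conv2d(2 * num_classes, num_classes, kernel_size=1)

    def forward(self, p_rgb: torch.Tensor, p_depth: torch.Tensor) -> torch.Tensor:
        concat = torch.cat([p_rgb, p_depth], dim=1)   # concatenation layer
        C = torch.sigmoid(self.gate(concat))          # contribution coefficients
        return C * p_rgb + (1.0 - C) * p_depth        # fused score map P_fuse

p_rgb = torch.randn(1, 40, 480, 640)    # e.g. 40-class RGB score map
p_depth = torch.randn(1, 40, 480, 640)  # depth score map of the same shape
fused = GatedFusion(num_classes=40)(p_rgb, p_depth)
print(fused.shape)                      # 1 x 40 x 480 x 640
```

Coupling the depth coefficient as 1 − C keeps the two contribution coefficients for every class and position summing to one, matching the C_rgb = C, C_depth = 1 − C convention described above.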
The new locality-sensitive deconvolutional neural network proposed by the present invention can be used for RGB-D indoor scene semantic segmentation. The invention adapts well to the lighting changes, background clutter, many small objects, and occlusions of indoor scenes, and can exploit the complementarity of RGB and depth more effectively, obtaining a scene semantic segmentation result that is more robust, more accurate, and better preserves object edges.

The technical solutions of the present invention have thus been described with reference to the preferred embodiments shown in the drawings. However, those skilled in the art will readily understand that the scope of protection of the present invention is obviously not limited to these specific embodiments. Without departing from the principles of the present invention, those skilled in the art can make equivalent changes or substitutions to the related technical features, and the technical solutions after such changes or substitutions will fall within the scope of protection of the present invention.

Claims (10)

  1. A scene semantic segmentation method based on a deconvolutional neural network, characterized in that the method comprises the following steps:
    Step S1: extracting a dense feature expression for a scene picture with a fully convolutional neural network;
    Step S2: using a locality-sensitive deconvolutional neural network and the local affinity matrix of the picture, upsampling and optimizing the dense feature expression obtained in step S1 to obtain a score map of the picture, thereby achieving fine scene semantic segmentation.
  2. The method according to claim 1, characterized in that the local affinity matrix is obtained by extracting SIFT features, SPIN features, and gradient features of the picture and then applying the ucm-gPb algorithm.
  3. The method according to claim 1, characterized in that the locality-sensitive deconvolutional neural network is formed by stacking three modules several times, the three modules being a locality-sensitive unpooling layer, a deconvolution layer, and a locality-sensitive average pooling layer.
  4. The method according to claim 3, characterized in that the number of stacking repetitions is 2 or 3.
  5. The method according to claim 3, characterized in that the output of the locality-sensitive unpooling layer is obtained by the following formula:
    Y_{i,j} = A_{i,j} · x
    where x represents the feature vector of a pixel in the feature map, A = {A_{i,j}} is an s×s local affinity matrix centered on x that indicates whether the surrounding pixels are similar to the center pixel, (i, j) and (o, o) represent an arbitrary position and the center position in the affinity matrix, respectively, and Y = {Y_{i,j}} is the feature map output by the unpooling operation.
  6. The method according to claim 3, characterized in that the locality-sensitive average pooling layer is implemented by the following formula:
    y = ( Σ_{(i,j)} A_{i,j} X_{i,j} ) / ( Σ_{(i,j)} A_{i,j} )
    where y is the output feature vector, A = {A_{i,j}} is an s×s local affinity matrix centered on y, A_{i,j} indicates whether a surrounding pixel is similar to the center pixel, (i, j) and (o, o) represent an arbitrary position and the center position in the affinity matrix, respectively, and X = {X_{i,j}} is the input feature map.
  7. The method according to any one of claims 1-6, characterized in that in step S1 the scene picture includes an RGB picture and a depth picture, and the method further includes step S3: optimally fusing the obtained RGB score map and depth score map through a gated fusion layer, thereby achieving finer scene semantic segmentation.
  8. The method according to claim 7, characterized in that the gated fusion layer includes a concatenation layer, a convolution layer, and a normalization layer.
  9. The method according to claim 8, characterized in that the convolution layer is implemented by the following function:
    C = W ⊗ [P_rgb, P_depth]
    where P_rgb ∈ ℝ^{c×h×w} is the score map predicted from the RGB data, P_depth ∈ ℝ^{c×h×w} is the score map predicted from the depth data, W ∈ ℝ^{c×2c×1×1} is the filter bank learned by the gated fusion layer, and C ∈ ℝ^{c×h×w} is the contribution coefficient matrix output by the convolution.
  10. The method according to claim 8, characterized in that the normalization layer is implemented by a sigmoid function.
PCT/CN2016/103425 2016-10-26 2016-10-26 De-convolutional neural network-based scene semantic segmentation method WO2018076212A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/103425 WO2018076212A1 (en) 2016-10-26 2016-10-26 De-convolutional neural network-based scene semantic segmentation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/103425 WO2018076212A1 (en) 2016-10-26 2016-10-26 De-convolutional neural network-based scene semantic segmentation method

Publications (1)

Publication Number Publication Date
WO2018076212A1 true WO2018076212A1 (en) 2018-05-03

Family

ID=62023002

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/103425 WO2018076212A1 (en) 2016-10-26 2016-10-26 De-convolutional neural network-based scene semantic segmentation method

Country Status (1)

Country Link
WO (1) WO2018076212A1 (en)

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109522966A (en) * 2018-11-28 2019-03-26 中山大学 A kind of object detection method based on intensive connection convolutional neural networks
CN109543502A (en) * 2018-09-27 2019-03-29 天津大学 A kind of semantic segmentation method based on the multiple dimensioned neural network of depth
CN109785435A (en) * 2019-01-03 2019-05-21 东易日盛家居装饰集团股份有限公司 A kind of wall method for reconstructing and device
CN109902755A (en) * 2019-03-05 2019-06-18 南京航空航天大学 A kind of multi-layer information sharing and correcting method for XCT slice
CN110427953A (en) * 2019-06-21 2019-11-08 中南大学 Robot is allowed to carry out the implementation method of vision place identification in changing environment based on convolutional neural networks and sequences match
CN110458939A (en) * 2019-07-24 2019-11-15 大连理工大学 The indoor scene modeling method generated based on visual angle
CN110826702A (en) * 2019-11-18 2020-02-21 方玉明 Abnormal event detection method for multitask deep network
CN110874841A (en) * 2018-09-04 2020-03-10 斯特拉德视觉公司 Object detection method and device with reference to edge image
CN110929613A (en) * 2019-11-14 2020-03-27 上海眼控科技股份有限公司 Image screening algorithm for intelligent traffic violation audit
CN111192271A (en) * 2018-11-14 2020-05-22 银河水滴科技(北京)有限公司 Image segmentation method and device
CN111242027A (en) * 2020-01-13 2020-06-05 北京工业大学 Unsupervised learning scene feature rapid extraction method fusing semantic information
CN111259901A (en) * 2020-01-13 2020-06-09 镇江优瞳智能科技有限公司 Efficient method for improving semantic segmentation precision by using spatial information
CN111311611A (en) * 2020-02-17 2020-06-19 清华大学深圳国际研究生院 Real-time three-dimensional large-scene multi-object instance segmentation method
US10692244B2 (en) 2017-10-06 2020-06-23 Nvidia Corporation Learning based camera pose estimation from images of an environment
CN111488880A (en) * 2019-01-25 2020-08-04 斯特拉德视觉公司 Method and apparatus for improving segmentation performance for detecting events using edge loss
CN111563507A (en) * 2020-04-14 2020-08-21 浙江科技学院 Indoor scene semantic segmentation method based on convolutional neural network
CN111723810A (en) * 2020-05-11 2020-09-29 北京航空航天大学 Interpretability method of scene recognition task model
CN111931689A (en) * 2020-08-26 2020-11-13 北京建筑大学 Method for extracting video satellite data identification features on line
CN112085747A (en) * 2020-09-08 2020-12-15 中国科学院计算技术研究所厦门数据智能研究院 Image segmentation method based on local relation guidance
CN112164078A (en) * 2020-09-25 2021-01-01 上海海事大学 RGB-D multi-scale semantic segmentation method based on encoder-decoder
CN112381948A (en) * 2020-11-03 2021-02-19 上海交通大学烟台信息技术研究院 Semantic-based laser stripe center line extraction and fitting method
CN113239891A (en) * 2021-06-09 2021-08-10 上海海事大学 Scene classification system and method based on deep learning
CN113658200A (en) * 2021-07-29 2021-11-16 东北大学 Edge perception image semantic segmentation method based on self-adaptive feature fusion
CN114332473A (en) * 2021-09-29 2022-04-12 腾讯科技(深圳)有限公司 Object detection method, object detection device, computer equipment, storage medium and program product
CN115496975A (en) * 2022-08-29 2022-12-20 锋睿领创(珠海)科技有限公司 Auxiliary weighted data fusion method, device, equipment and storage medium
CN115546271A (en) * 2022-09-29 2022-12-30 锋睿领创(珠海)科技有限公司 Visual analysis method, device, equipment and medium based on depth joint characterization
CN115953666A (en) * 2023-03-15 2023-04-11 国网湖北省电力有限公司经济技术研究院 Transformer substation field progress identification method based on improved Mask-RCNN
CN115995002A (en) * 2023-03-24 2023-04-21 南京信息工程大学 Network construction method and urban scene real-time semantic segmentation method
CN116051830A (en) * 2022-12-20 2023-05-02 中国科学院空天信息创新研究院 Cross-modal data fusion-oriented contrast semantic segmentation method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105389798A (en) * 2015-10-19 2016-03-09 西安电子科技大学 Synthetic aperture radar (SAR) image segmentation method based on deconvolution network and mapping inference network
CN105427313A (en) * 2015-11-23 2016-03-23 西安电子科技大学 Deconvolutional network and adaptive inference network based SAR image segmentation method
CN105488809A (en) * 2016-01-14 2016-04-13 电子科技大学 Indoor scene meaning segmentation method based on RGBD descriptor
CN105608692A (en) * 2015-12-17 2016-05-25 西安电子科技大学 PolSAR image segmentation method based on deconvolution network and sparse classification

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105389798A (en) * 2015-10-19 2016-03-09 西安电子科技大学 Synthetic aperture radar (SAR) image segmentation method based on deconvolution network and mapping inference network
CN105427313A (en) * 2015-11-23 2016-03-23 西安电子科技大学 Deconvolutional network and adaptive inference network based SAR image segmentation method
CN105608692A (en) * 2015-12-17 2016-05-25 西安电子科技大学 PolSAR image segmentation method based on deconvolution network and sparse classification
CN105488809A (en) * 2016-01-14 2016-04-13 电子科技大学 Indoor scene meaning segmentation method based on RGBD descriptor

Cited By (49)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10964061B2 (en) 2017-10-06 2021-03-30 Nvidia Corporation Learning-based camera pose estimation from images of an environment
US10692244B2 (en) 2017-10-06 2020-06-23 Nvidia Corporation Learning based camera pose estimation from images of an environment
CN110874841B (en) * 2018-09-04 2023-08-29 斯特拉德视觉公司 Object detection method and device with reference to edge image
CN110874841A (en) * 2018-09-04 2020-03-10 斯特拉德视觉公司 Object detection method and device with reference to edge image
CN109543502A (en) * 2018-09-27 2019-03-29 天津大学 A kind of semantic segmentation method based on the multiple dimensioned neural network of depth
CN109543502B (en) * 2018-09-27 2023-06-06 天津大学 Semantic segmentation method based on deep multi-scale neural network
CN111192271A (en) * 2018-11-14 2020-05-22 银河水滴科技(北京)有限公司 Image segmentation method and device
CN111192271B (en) * 2018-11-14 2023-08-22 银河水滴科技(北京)有限公司 Image segmentation method and device
CN109522966B (en) * 2018-11-28 2022-09-27 中山大学 Target detection method based on dense connection convolutional neural network
CN109522966A (en) * 2018-11-28 2019-03-26 中山大学 A kind of object detection method based on intensive connection convolutional neural networks
CN109785435A (en) * 2019-01-03 2019-05-21 东易日盛家居装饰集团股份有限公司 A kind of wall method for reconstructing and device
CN111488880A (en) * 2019-01-25 2020-08-04 斯特拉德视觉公司 Method and apparatus for improving segmentation performance for detecting events using edge loss
CN111488880B (en) * 2019-01-25 2023-04-18 斯特拉德视觉公司 Method and apparatus for improving segmentation performance for detecting events using edge loss
CN109902755A (en) * 2019-03-05 2019-06-18 南京航空航天大学 A kind of multi-layer information sharing and correcting method for XCT slice
CN110427953A (en) * 2019-06-21 2019-11-08 中南大学 Robot is allowed to carry out the implementation method of vision place identification in changing environment based on convolutional neural networks and sequences match
CN110427953B (en) * 2019-06-21 2022-11-29 中南大学 Implementation method for enabling robot to perform visual place recognition in variable environment based on convolutional neural network and sequence matching
CN110458939A (en) * 2019-07-24 2019-11-15 大连理工大学 The indoor scene modeling method generated based on visual angle
CN110458939B (en) * 2019-07-24 2022-11-18 大连理工大学 Indoor scene modeling method based on visual angle generation
CN110929613A (en) * 2019-11-14 2020-03-27 上海眼控科技股份有限公司 Image screening algorithm for intelligent traffic violation audit
CN110826702A (en) * 2019-11-18 2020-02-21 方玉明 Abnormal event detection method for multitask deep network
CN111259901A (en) * 2020-01-13 2020-06-09 镇江优瞳智能科技有限公司 Efficient method for improving semantic segmentation precision by using spatial information
CN111242027A (en) * 2020-01-13 2020-06-05 北京工业大学 Unsupervised learning scene feature rapid extraction method fusing semantic information
CN111242027B (en) * 2020-01-13 2023-04-14 北京工业大学 Unsupervised learning scene feature rapid extraction method fusing semantic information
CN111311611B (en) * 2020-02-17 2023-04-18 清华大学深圳国际研究生院 Real-time three-dimensional large-scene multi-object instance segmentation method
CN111311611A (en) * 2020-02-17 2020-06-19 清华大学深圳国际研究生院 Real-time three-dimensional large-scene multi-object instance segmentation method
CN111563507B (en) * 2020-04-14 2024-01-12 浙江科技学院 Indoor scene semantic segmentation method based on convolutional neural network
CN111563507A (en) * 2020-04-14 2020-08-21 浙江科技学院 Indoor scene semantic segmentation method based on convolutional neural network
CN111723810B (en) * 2020-05-11 2022-09-16 北京航空航天大学 Interpretability method of scene recognition task model
CN111723810A (en) * 2020-05-11 2020-09-29 北京航空航天大学 Interpretability method of scene recognition task model
CN111931689B (en) * 2020-08-26 2021-04-23 北京建筑大学 Method for extracting video satellite data identification features on line
CN111931689A (en) * 2020-08-26 2020-11-13 北京建筑大学 Method for extracting video satellite data identification features on line
CN112085747A (en) * 2020-09-08 2020-12-15 中国科学院计算技术研究所厦门数据智能研究院 Image segmentation method based on local relation guidance
CN112085747B (en) * 2020-09-08 2023-07-21 中科(厦门)数据智能研究院 Image segmentation method based on local relation guidance
CN112164078B (en) * 2020-09-25 2024-03-15 上海海事大学 RGB-D multi-scale semantic segmentation method based on encoder-decoder
CN112164078A (en) * 2020-09-25 2021-01-01 上海海事大学 RGB-D multi-scale semantic segmentation method based on encoder-decoder
CN112381948B (en) * 2020-11-03 2022-11-29 上海交通大学烟台信息技术研究院 Semantic-based laser stripe center line extraction and fitting method
CN112381948A (en) * 2020-11-03 2021-02-19 上海交通大学烟台信息技术研究院 Semantic-based laser stripe center line extraction and fitting method
CN113239891A (en) * 2021-06-09 2021-08-10 上海海事大学 Scene classification system and method based on deep learning
CN113658200A (en) * 2021-07-29 2021-11-16 东北大学 Edge perception image semantic segmentation method based on self-adaptive feature fusion
CN113658200B (en) * 2021-07-29 2024-01-02 东北大学 Edge perception image semantic segmentation method based on self-adaptive feature fusion
CN114332473A (en) * 2021-09-29 2022-04-12 腾讯科技(深圳)有限公司 Object detection method, object detection device, computer equipment, storage medium and program product
CN115496975B (en) * 2022-08-29 2023-08-18 锋睿领创(珠海)科技有限公司 Auxiliary weighted data fusion method, device, equipment and storage medium
CN115496975A (en) * 2022-08-29 2022-12-20 锋睿领创(珠海)科技有限公司 Auxiliary weighted data fusion method, device, equipment and storage medium
CN115546271B (en) * 2022-09-29 2023-08-22 锋睿领创(珠海)科技有限公司 Visual analysis method, device, equipment and medium based on depth joint characterization
CN115546271A (en) * 2022-09-29 2022-12-30 锋睿领创(珠海)科技有限公司 Visual analysis method, device, equipment and medium based on depth joint characterization
CN116051830B (en) * 2022-12-20 2023-06-20 中国科学院空天信息创新研究院 Cross-modal data fusion-oriented contrast semantic segmentation method
CN116051830A (en) * 2022-12-20 2023-05-02 中国科学院空天信息创新研究院 Cross-modal data fusion-oriented contrast semantic segmentation method
CN115953666A (en) * 2023-03-15 2023-04-11 国网湖北省电力有限公司经济技术研究院 Transformer substation field progress identification method based on improved Mask-RCNN
CN115995002A (en) * 2023-03-24 2023-04-21 南京信息工程大学 Network construction method and urban scene real-time semantic segmentation method

Similar Documents

Publication Publication Date Title
WO2018076212A1 (en) De-convolutional neural network-based scene semantic segmentation method
CN107066916B (en) Scene semantic segmentation method based on deconvolution neural network
WO2020108358A1 (en) Image inpainting method and apparatus, computer device, and storage medium
CN107578418B (en) Indoor scene contour detection method fusing color and depth information
CN106529447B (en) Method for identifying face of thumbnail
CN109583340B (en) Video target detection method based on deep learning
CN108875935B (en) Natural image target material visual characteristic mapping method based on generation countermeasure network
CN109410168B (en) Modeling method of convolutional neural network for determining sub-tile classes in an image
WO2018000752A1 (en) Monocular image depth estimation method based on multi-scale cnn and continuous crf
CN108564549B (en) Image defogging method based on multi-scale dense connection network
CN111476710B (en) Video face changing method and system based on mobile platform
TW200834459A (en) Video object segmentation method applied for rainy situations
CN111046868B (en) Target significance detection method based on matrix low-rank sparse decomposition
CN110705634B (en) Heel model identification method and device and storage medium
Huang et al. Automatic building change image quality assessment in high resolution remote sensing based on deep learning
CN112580661A (en) Multi-scale edge detection method under deep supervision
Feng et al. Low-light image enhancement algorithm based on an atmospheric physical model
Liu et al. Progressive complex illumination image appearance transfer based on CNN
CN111401209B (en) Action recognition method based on deep learning
Zhao et al. Color channel fusion network for low-light image enhancement
CN115953330B (en) Texture optimization method, device, equipment and storage medium for virtual scene image
CN116993760A (en) Gesture segmentation method, system, device and medium based on graph convolution and attention mechanism
Yuan et al. Explore double-opponency and skin color for saliency detection
CN108765384B (en) Significance detection method for joint manifold sequencing and improved convex hull
CN110910497A (en) Method and system for realizing augmented reality map

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16920114

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16920114

Country of ref document: EP

Kind code of ref document: A1