CN115131559A - Road scene semantic segmentation method based on multi-scale feature self-adaptive fusion - Google Patents

Road scene semantic segmentation method based on multi-scale feature self-adaptive fusion

Info

Publication number
CN115131559A
CN115131559A (application CN202210661080.XA)
Authority
CN
China
Prior art keywords
scale
network
feature
segmentation
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210661080.XA
Other languages
Chinese (zh)
Inventor
Zhang Ke
Yan Hua
Su Yu
Wang Jingyu
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202210661080.XA priority Critical patent/CN115131559A/en
Publication of CN115131559A publication Critical patent/CN115131559A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/20: Image preprocessing
    • G06V 10/26: Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/40: Extraction of image or video features
    • G06V 10/42: Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/771: Feature selection, e.g. selecting representative features from a multi-dimensional feature space
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a road scene semantic segmentation method based on multi-scale feature adaptive fusion. The method makes full use of the inherent multi-level structure of a convolutional network: multi-scale inference is performed on the feature maps output by each sub-network, and an attention mechanism adaptively fuses the prediction results at each scale. Because the multi-scale inference exploits characteristics the convolutional network already possesses, the method avoids reducing computation through operations such as convolution factorization, and thus achieves high-accuracy real-time inference on a lightweight backbone network. Since the low-level features of the backbone contain detail information and have higher spatial resolution, inference on the low-level features improves segmentation accuracy for small-sized objects.

Description

Road scene semantic segmentation method based on multi-scale feature self-adaptive fusion
Technical Field
The invention belongs to the field of image processing and image semantic segmentation, and relates to a road scene semantic segmentation method based on multi-scale feature adaptive fusion.
Background
Image semantic segmentation is one of the basic tasks in computer vision: it classifies every pixel in an image according to its semantic and spatial information, i.e. it is a per-pixel (point-to-point) classification. Semantic segmentation is a key step toward scene understanding and an important technical means by which machines perceive their environment, and it is widely applied in autonomous driving, indoor navigation, emotion recognition, pose detection and other fields. This patent provides a semantic segmentation technique applicable to road scenes; semantic segmentation of road scenes is an important link in autonomous driving. At the same time, an autonomous driving system must respond promptly to changes in the external environment, so the requirements on the real-time performance of the algorithm are high.
The document "Real-time segmentation algorithm based on attention mechanism and efficient deconvolution, Computer Applications, 2022" discloses a road scene segmentation algorithm based on an attention mechanism and deconvolution. Its convolutional layers are built from one-dimensional non-bottleneck structures (Non-bottleneck-1D) to reduce the computation of the convolution operation; pooling operations and an attention module are then combined to capture global context information and improve the segmentation of larger objects. The method described in that document uses factorized convolutions to reduce the computation of the convolutional layers, which is effective for improving speed but significantly reduces segmentation accuracy; moreover, it focuses on acquiring global context information and neglects the segmentation of small-scale objects in road scenes.
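For context, the one-dimensional non-bottleneck structure cited above factorizes a 3x3 convolution into a 3x1 convolution followed by a 1x3 convolution inside a residual block. A minimal PyTorch sketch of the idea is shown below; the channel count and the placement of normalisation and activation are illustrative assumptions, not details taken from the cited paper.

```python
import torch.nn as nn

class NonBottleneck1D(nn.Module):
    """Factorized residual block: a 3x3 convolution is replaced by a
    3x1 convolution followed by a 1x3 convolution to cut the multiply-adds."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.conv3x1 = nn.Conv2d(channels, channels, (3, 1), padding=(1, 0))
        self.conv1x3 = nn.Conv2d(channels, channels, (1, 3), padding=(0, 1))
        self.bn = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.conv3x1(x))
        out = self.bn(self.conv1x3(out))
        return self.relu(out + x)  # residual connection preserves the input
```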
Disclosure of Invention
Technical problem to be solved
To overcome the shortcomings of the prior art, the invention provides a road scene semantic segmentation method based on multi-scale feature adaptive fusion, addressing the problem that existing real-time semantic segmentation algorithms, although fast, suffer from low accuracy and poor segmentation of small-sized objects.
Technical scheme
A road scene semantic segmentation method based on multi-scale feature adaptive fusion is characterized by comprising the following steps:
Step 1: extract the bottom-level multi-scale features of the input image with a ResNet-18 neural network to obtain feature maps at 4 scales from shallow to deep, S1, S2, S3, S4;
Step 2: build a backbone network based on bidirectional branch information fusion on top of the ResNet-18 neural network; fuse the feature maps S1, S2, S3, S4 along a bottom-up left branch and a top-down right branch to obtain the branch feature maps D1, D2, D3, D4 and U1, U2, U3, U4; finally, fuse the feature maps of the same scale in the two branches to obtain the outputs F1, F2, F3, F4 of the backbone network based on bidirectional branch information fusion;
Step 3: use a progressive upsampling module based on optical flow estimation to upsample the outputs F1, F2, F3, F4 of step 2 step by step;
compute the optical-flow grids between adjacent feature maps among the 4 scales, namely between F1 and F2, F2 and F3, and F3 and F4, by an optical flow method; then use the optical-flow grids in the progressive upsampling module to upsample F2, F3, F4 to the same size as F1, obtaining the aligned feature maps P1, P2, P3, P4;
Step 4: feed the per-scale feature maps P1, P2, P3, P4 into the segmentation module to obtain the per-scale segmentation results Scale1, Scale2, Scale3, Scale4; at the same time, feed P1, P2, P3, P4 into the attention module to obtain the weights Weight1, Weight2, Weight3, Weight4 of the per-scale segmentation results; finally, linearly fuse the per-scale segmentation results with the weights to obtain the final segmentation result.
The feature maps S1, S2, S3, S4 in step 1 are 1/4, 1/8, 1/16 and 1/32 of the original image size, respectively.
Advantageous effects
The invention provides a road scene semantic segmentation method based on multi-scale feature adaptive fusion that makes full use of the inherent multi-level structure of a convolutional network: multi-scale inference is performed on the feature maps output by each sub-network, and an attention mechanism adaptively fuses the prediction results at each scale. Because the multi-scale inference exploits characteristics the convolutional network already possesses, the method avoids reducing computation through operations such as convolution factorization, and thus achieves high-accuracy real-time inference on a lightweight backbone network; and since the low-level features of the backbone contain detail information and have higher spatial resolution, inference on the low-level features improves segmentation accuracy for small-sized objects.
The invention has the beneficial effects that:
(1) Steps 1-2 make full use of the inherent multi-scale characteristics of the convolutional network: multi-scale features are obtained from the sub-networks of different stages and fused for subsequent processing, which compensates for the limited learning capacity of a shallow, lightweight backbone network and thus enables fast and accurate segmentation;
(2) Steps 3-4 perform multi-scale inference on the fused multi-scale features. Low-level features carry detail and texture information and have higher spatial resolution, so they suit the segmentation of small-scale objects; high-level features carry semantic information and have a larger receptive field, so they suit the segmentation of large-scale objects. After the predictions at all scales are adaptively fused, the final result achieves high segmentation accuracy for objects of every scale.
Drawings
Fig. 1 is a diagram showing an overall configuration of a network.
Fig. 2 is a method flow diagram.
Fig. 3 shows test results on the Cityscapes dataset.
Detailed Description
The invention will now be further described with reference to the following examples and drawings:
the technical scheme adopted by the invention for solving the technical problems is as follows: a road scene semantic segmentation method based on multi-scale feature self-adaptive fusion is characterized by comprising the following steps:
and step 1, extracting bottom-layer multi-scale features by using a Resnet-18 neural network. The Resnet-18 network rapidly reduces the size of an input image through pooling operation at an initial network layer, and builds an integral network through fewer network layers, so that the network is faster and more efficient than other networks, and is widely used for building a real-time semantic segmentation network. The invention also builds a network on the basis of Resnet-18, and simplifies the network by removing all network layers (including a pooling layer and a full-link layer) after the last residual module, wherein the simplified ResNet-18 comprises 1 convolution layer of 7x7, 1 maximum pooling layer and 8 residual modules. According to the difference of the feature diagram size output by each module, 8 residual modules are divided into 4 sub-networks (stages 1-4 in fig. 1 (a)). The feature map is reduced in size 1/2 for each pass through a sub-network. Thus, a characteristic diagram S under 4 scales from shallow to deep can be obtained 1 、S 2 、S 3 、S 4 (size 1/4, 1/8, 1/16, and 1/32 of the original image, respectively).
Step 2: build a backbone network based on bidirectional branch information fusion on top of the simplified ResNet-18 constructed in step 1 (Fig. 1(a)). The feature maps S1, S2, S3, S4 are fused along a bottom-up branch (left branch in Fig. 1(a)) and a top-down branch (right branch in Fig. 1(a)) to obtain the branch feature maps D1, D2, D3, D4 and U1, U2, U3, U4; finally, the feature maps of the same scale in the two branches are fused to obtain the outputs F1, F2, F3, F4 of the backbone network based on bidirectional branch information fusion.
Step 3: use the progressive upsampling module based on optical flow estimation to upsample the outputs F1, F2, F3, F4 of step 2 step by step. The optical-flow grids between adjacent feature maps among the 4 scales (i.e. between F1 and F2, F2 and F3, and F3 and F4) are computed by an optical flow method. The progressive upsampling module then uses the optical-flow grids to upsample F2, F3, F4 to the same size as F1, giving P2, P3, P4 (F1 needs no upsampling, so P1 is obtained directly).
Step 4: obtain the segmentation result at each scale with the segmentation module and adaptively fuse the per-scale segmentation results with an attention mechanism. On the basis of the per-scale feature maps P1, P2, P3, P4, the segmentation module produces the per-scale segmentation results Scale1, Scale2, Scale3, Scale4, and the attention module then produces the weights Weight1, Weight2, Weight3, Weight4 of the per-scale segmentation results. Finally, the per-scale segmentation results are linearly fused with the weights to obtain the final segmentation result.
The specific embodiment is as follows:
the following describes a specific embodiment of the present invention with reference to an example of identifying a semantically segmented data set citrespaces under city streets, but the technical content of the present invention is not limited to the scope described above.
The road scene semantic segmentation method based on multi-scale feature adaptive fusion provided by the invention comprises the following steps: Step 1: obtain feature maps at different scales with the simplified ResNet-18 neural network; Step 2: build a backbone network based on bidirectional branch information fusion and fuse the multi-scale feature maps efficiently; Step 3: upsample the feature maps of different scales with the progressive upsampling module based on optical flow estimation; Step 4: build an attention-based adaptive multi-scale fusion module and adaptively fuse the segmentation results obtained from the feature maps of different scales; Step 5: train the constructed neural network model with the dataset.
Step 1: obtain feature maps at different scales with the simplified ResNet-18 neural network.
The ResNet-18 network model rapidly reduces the size of the input image through a pooling operation in its initial layers and has relatively few layers, which makes it well suited to building a real-time semantic segmentation model. ResNet-18 is simplified by removing all network layers after the last residual module; the simplified ResNet-18 contains one 7x7 convolutional layer, one max pooling layer and 8 residual modules. According to the size of the feature map output by each module, the 8 residual modules can be divided into 4 sub-networks (stages 1-4 in Fig. 1(a)). Picture samples from the Cityscapes dataset are then fed into the ResNet-18 network for feature extraction. The feature map size is halved each time it passes through a sub-network, giving feature maps at 4 scales from shallow to deep: S1, S2, S3, S4.
Step 2: fuse the feature maps of the 4 scales with the backbone network based on bidirectional branch information fusion.
(1) Feature fusion with the bottom-up fusion branch (left branch in Fig. 1(a)). Because the lower-level features are larger, the lower-level feature map has to be downsampled before being fused with the higher-level features. At each fusion step, the fused feature map D_{n-1} from the previous scale is downsampled by bilinear interpolation, added to the feature map S_n of the current scale, and fused by a 3x3 convolution to obtain the fused feature map D_n, which therefore combines the features of the current and the preceding scale. Applying this recursion in turn yields the fused feature maps D1, D2, D3, D4 at all scales:
D_1 = conv_1x1(S_1);  D_n = conv_3x3(conv_1x1(S_n) + down(D_{n-1})), n = 2, 3, 4
where conv_3x3 denotes a convolution with a 3x3 kernel, used for feature fusion; conv_1x1 denotes a convolution with a 1x1 kernel, used to adjust the depth (channel) dimension of the feature maps; and down denotes bilinear-interpolation downsampling.
(2) Feature fusion with the top-down fusion branch (right branch in Fig. 1(a)). Because the higher-level features are smaller, the higher-level feature map has to be upsampled before being fused with the lower-level feature map; the other operations are the same as in the bottom-up fusion branch. Applying the recursion in turn yields the fused feature maps U4, U3, U2, U1:
U_4 = conv_1x1(S_4);  U_n = conv_3x3(conv_1x1(S_n) + up(U_{n+1})), n = 3, 2, 1
where up denotes a bilinear-interpolation upsampling operation.
(3) The feature maps of the same scale in the two branches are fused to obtain the outputs F1, F2, F3, F4 of the backbone network based on bidirectional branch information fusion:
F_n = conv_3x3(D_n + U_n), n = 1, 2, 3, 4
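The bidirectional fusion of (1)-(3) could be sketched in PyTorch as follows. The common channel width dim and the exact placement of the 1x1 depth-adjustment convolutions are assumptions made so that feature maps of different stages can be added; the patent text only states that conv_1x1 adjusts the depth dimension.

```python
import torch.nn as nn
import torch.nn.functional as F

class BidirectionalFusion(nn.Module):
    """Fuses S1..S4 through a bottom-up branch (D) and a top-down branch (U),
    then merges the two branches scale by scale into F1..F4."""
    def __init__(self, in_channels=(64, 128, 256, 512), dim: int = 128):
        super().__init__()
        # 1x1 convolutions adjust the depth dimension of each S_n to a common width
        self.lateral = nn.ModuleList([nn.Conv2d(c, dim, 1) for c in in_channels])
        # 3x3 convolutions fuse features inside each branch and across the branches
        self.fuse_d = nn.ModuleList([nn.Conv2d(dim, dim, 3, padding=1) for _ in range(3)])
        self.fuse_u = nn.ModuleList([nn.Conv2d(dim, dim, 3, padding=1) for _ in range(3)])
        self.fuse_f = nn.ModuleList([nn.Conv2d(dim, dim, 3, padding=1) for _ in range(4)])

    def forward(self, feats):                       # feats = (S1, S2, S3, S4)
        s = [lat(f) for lat, f in zip(self.lateral, feats)]

        d = [s[0]]                                  # bottom-up: downsample D_{n-1}, add S_n
        for n in range(1, 4):
            prev = F.interpolate(d[-1], size=s[n].shape[-2:],
                                 mode="bilinear", align_corners=False)
            d.append(self.fuse_d[n - 1](s[n] + prev))

        u = [s[3]]                                  # top-down: upsample U_{n+1}, add S_n
        for n in range(2, -1, -1):
            prev = F.interpolate(u[0], size=s[n].shape[-2:],
                                 mode="bilinear", align_corners=False)
            u.insert(0, self.fuse_u[n](s[n] + prev))

        # merge the two branches at each scale: F_n = conv_3x3(D_n + U_n)
        return [self.fuse_f[n](d[n] + u[n]) for n in range(4)]
```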
Step 3: use the progressive upsampling module based on optical flow estimation to upsample the outputs F2, F3, F4 of step 2 to the same size as F1, so that the spatial positions of the feature maps at the different scales are aligned.
(1) Compute the spatially warped grids between adjacent feature maps among the 4 scales (i.e. between F1 and F2, F2 and F3, and F3 and F4) with an optical flow method.
First, the optical flow field Δ_{n-1} is computed from the feature maps of two adjacent scales:
Δ_{n-1} = conv_3x3(concat(F_{n-1}, up(F_n)))
where concat denotes concatenation of the feature maps along the depth dimension, and Δ_{n-1} is an optical flow field with the same spatial size as F_{n-1} and a depth of 2.
Then, let g_{n-1} denote a position point in the spatial grid Ω_{n-1} of F_{n-1}; the offset ω_n in the spatially warped grid Ω_n of F_n relative to F_{n-1} can then be expressed as:
ω_n = g_{n-1} + Δ_{n-1}(g_{n-1}), n = 2, 3, 4
where n = 2, 3, 4 corresponds respectively to the spatially warped grids between F1 and F2, F2 and F3, and F3 and F4.
(2) Upsample step by step with the spatially warped grids: F2, F3, F4 are upsampled to the same size as F1, giving the upsampled feature maps P2, P3, P4 (F1 needs no upsampling, so P1 is obtained directly). The recursive process can be expressed as:
P_1 = F_1;  P_n = up_{ω_2}(up_{ω_3}( ... up_{ω_n}(F_n))), n = 2, 3, 4
where up_{ω_l} denotes bilinear-interpolation upsampling of a feature map F using the offsets ω_l of the warped grid Ω_l. This process can be expressed as:
up_{ω_l}(F) = Σ_{i ∈ N(ω_l)} w_i · F(i)
where N(ω_l) denotes the 4 neighbours (upper-left, lower-left, upper-right, lower-right) of the warped position in the spatially warped grid Ω_n, and w_i denotes the bilinear kernel weight computed from the offset ω_l in Ω_n.
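A possible PyTorch sketch of a single flow-based upsampling step is shown below; the 3x3 flow-prediction convolution, the channel width and the pixel-to-normalised-coordinate conversion are illustrative assumptions, and torch.nn.functional.grid_sample supplies the 4-neighbour bilinear interpolation described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlowUpsample(nn.Module):
    """Warps a coarse feature map onto the grid of the next finer map using a
    learned 2-channel optical-flow field."""
    def __init__(self, dim: int = 128):
        super().__init__()
        # predicts the flow field Δ from the fine map and the upsampled coarse map
        self.flow = nn.Conv2d(2 * dim, 2, kernel_size=3, padding=1)

    def forward(self, fine, coarse):
        h, w = fine.shape[-2:]
        coarse_up = F.interpolate(coarse, size=(h, w), mode="bilinear", align_corners=False)
        delta = self.flow(torch.cat([fine, coarse_up], dim=1))       # flow field Δ, depth 2

        # regular grid g of the fine map, normalised to [-1, 1] for grid_sample
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h, device=fine.device),
                                torch.linspace(-1, 1, w, device=fine.device),
                                indexing="ij")
        grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(fine.size(0), -1, -1, -1)

        # offset the grid by the predicted flow (flow in pixels, rescaled to [-1, 1])
        norm = torch.tensor([w, h], dtype=fine.dtype, device=fine.device) / 2.0
        warped_grid = grid + delta.permute(0, 2, 3, 1) / norm
        # bilinear sampling over the 4 neighbours of each warped position
        return F.grid_sample(coarse, warped_grid, mode="bilinear", align_corners=False)
```

Applied step by step (F4 warped onto the grid of F3, then onto that of F2, then F1, and likewise for F3 and F2), such a module yields the aligned feature maps P2, P3, P4, while P1 is taken directly as F1.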
Step 4: on the basis of the per-scale feature maps P1, P2, P3, P4, obtain the segmentation result at each scale with the segmentation module and adaptively fuse the per-scale segmentation results with an attention mechanism.
(1) Obtain the per-scale segmentation results Scale1, Scale2, Scale3, Scale4 with the segmentation module, and then obtain the weights Weight1, Weight2, Weight3, Weight4 of the per-scale segmentation results with the attention module; both the segmentation module and the attention module are built from a 3x3 convolution:
Scale_n = conv_3x3(P_n), n = 1, 2, 3, 4
Weight_n = conv_3x3(P_n), n = 1, 2, 3, 4
(2) Linearly fuse the segmentation results at all scales with the weights to obtain the final segmentation result:
Output = Σ_{n=1..4} (Weight_n ⊙ Scale_n)
where Output denotes the final segmentation result and ⊙ denotes the element-wise (Hadamard) product.
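A compact PyTorch sketch of the per-scale segmentation and attention heads with the weighted fusion is given below; the number of classes (19, as in Cityscapes), the single-channel attention output per scale and the softmax normalisation of the weights across scales are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class AdaptiveMultiScaleHead(nn.Module):
    """3x3 segmentation and attention heads per scale, followed by a
    weighted linear fusion of the four per-scale predictions."""
    def __init__(self, dim: int = 128, num_classes: int = 19):
        super().__init__()
        self.seg = nn.ModuleList([nn.Conv2d(dim, num_classes, 3, padding=1) for _ in range(4)])
        self.att = nn.ModuleList([nn.Conv2d(dim, 1, 3, padding=1) for _ in range(4)])

    def forward(self, aligned):                     # aligned = (P1, P2, P3, P4)
        scales = [head(p) for head, p in zip(self.seg, aligned)]              # Scale_n
        weights = torch.cat([head(p) for head, p in zip(self.att, aligned)], dim=1)
        weights = torch.softmax(weights, dim=1)                               # Weight_n
        # Output = sum over n of Weight_n ⊙ Scale_n (element-wise product)
        return sum(w.unsqueeze(1) * s for w, s in zip(weights.unbind(dim=1), scales))
```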
Step 5: train the constructed neural network model with the dataset.
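A minimal training-loop sketch for this step might look as follows; the optimiser, learning rate, loss function and ignore index are illustrative assumptions rather than settings stated in the patent.

```python
import torch
import torch.nn.functional as F

def train(model, loader, epochs: int = 200, lr: float = 1e-2, device: str = "cuda"):
    """Trains the segmentation network with per-pixel cross-entropy on
    Cityscapes-style labels (255 marks unlabeled pixels)."""
    model = model.to(device).train()
    optimiser = torch.optim.SGD(model.parameters(), lr=lr,
                                momentum=0.9, weight_decay=5e-4)
    criterion = torch.nn.CrossEntropyLoss(ignore_index=255)

    for epoch in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            logits = model(images)                  # fused per-pixel class scores
            # upsample the prediction to label resolution before computing the loss
            logits = F.interpolate(logits, size=labels.shape[-2:],
                                   mode="bilinear", align_corners=False)
            loss = criterion(logits, labels)
            optimiser.zero_grad()
            loss.backward()
            optimiser.step()
```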

Claims (2)

1. A road scene semantic segmentation method based on multi-scale feature adaptive fusion is characterized by comprising the following steps:
Step 1: extract the bottom-level multi-scale features of the input image with a ResNet-18 neural network to obtain feature maps at 4 scales from shallow to deep, S1, S2, S3, S4;
Step 2: build a backbone network based on bidirectional branch information fusion on top of the ResNet-18 neural network; fuse the feature maps S1, S2, S3, S4 along a bottom-up left branch and a top-down right branch to obtain the branch feature maps D1, D2, D3, D4 and U1, U2, U3, U4; finally, fuse the feature maps of the same scale in the two branches to obtain the outputs F1, F2, F3, F4 of the backbone network based on bidirectional branch information fusion;
Step 3: use a progressive upsampling module based on optical flow estimation to upsample the outputs F1, F2, F3, F4 of step 2 step by step;
compute the optical-flow grids between adjacent feature maps among the 4 scales, namely between F1 and F2, F2 and F3, and F3 and F4, by an optical flow method; then use the optical-flow grids in the progressive upsampling module to upsample F2, F3, F4 to the same size as F1, obtaining the aligned feature maps P1, P2, P3, P4;
Step 4: feed the per-scale feature maps P1, P2, P3, P4 into the segmentation module to obtain the per-scale segmentation results Scale1, Scale2, Scale3, Scale4; at the same time, feed P1, P2, P3, P4 into the attention module to obtain the weights Weight1, Weight2, Weight3, Weight4 of the per-scale segmentation results; finally, linearly fuse the per-scale segmentation results with the weights to obtain the final segmentation result.
2. The road scene semantic segmentation method based on multi-scale feature adaptive fusion according to claim 1, characterized in that the feature maps S1, S2, S3, S4 in step 1 are 1/4, 1/8, 1/16 and 1/32 of the original image size, respectively.
CN202210661080.XA 2022-06-12 2022-06-12 Road scene semantic segmentation method based on multi-scale feature self-adaptive fusion Pending CN115131559A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210661080.XA CN115131559A (en) 2022-06-12 2022-06-12 Road scene semantic segmentation method based on multi-scale feature self-adaptive fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210661080.XA CN115131559A (en) 2022-06-12 2022-06-12 Road scene semantic segmentation method based on multi-scale feature self-adaptive fusion

Publications (1)

Publication Number Publication Date
CN115131559A true CN115131559A (en) 2022-09-30

Family

ID=83377278

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210661080.XA Pending CN115131559A (en) 2022-06-12 2022-06-12 Road scene semantic segmentation method based on multi-scale feature self-adaptive fusion

Country Status (1)

Country Link
CN (1) CN115131559A (en)

Similar Documents

Publication Publication Date Title
CN110287849B (en) Lightweight depth network image target detection method suitable for raspberry pi
CN110232394B (en) Multi-scale image semantic segmentation method
CN110728200B (en) Real-time pedestrian detection method and system based on deep learning
CN113033570B (en) Image semantic segmentation method for improving void convolution and multilevel characteristic information fusion
CN114255238A (en) Three-dimensional point cloud scene segmentation method and system fusing image features
CN113657388A (en) Image semantic segmentation method fusing image super-resolution reconstruction
CN114048822A (en) Attention mechanism feature fusion segmentation method for image
CN115082675B (en) Transparent object image segmentation method and system
CN113888547A (en) Non-supervision domain self-adaptive remote sensing road semantic segmentation method based on GAN network
CN113554032B (en) Remote sensing image segmentation method based on multi-path parallel network of high perception
CN114724155A (en) Scene text detection method, system and equipment based on deep convolutional neural network
CN116797787B (en) Remote sensing image semantic segmentation method based on cross-modal fusion and graph neural network
CN115631344B (en) Target detection method based on feature self-adaptive aggregation
CN116612288B (en) Multi-scale lightweight real-time semantic segmentation method and system
CN114019467A (en) Radar signal identification and positioning method based on MobileNet model transfer learning
CN115424017B (en) Building inner and outer contour segmentation method, device and storage medium
CN115775316A (en) Image semantic segmentation method based on multi-scale attention mechanism
CN113313176A (en) Point cloud analysis method based on dynamic graph convolution neural network
CN115482518A (en) Extensible multitask visual perception method for traffic scene
CN116863194A (en) Foot ulcer image classification method, system, equipment and medium
CN115049945A (en) Method and device for extracting lodging area of wheat based on unmanned aerial vehicle image
CN114445620A (en) Target segmentation method for improving Mask R-CNN
CN113436198A (en) Remote sensing image semantic segmentation method for collaborative image super-resolution reconstruction
CN113160291A (en) Change detection method based on image registration
CN115131559A (en) Road scene semantic segmentation method based on multi-scale feature self-adaptive fusion

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Zhang Ke

Inventor after: Zhang Yanhua

Inventor after: Su Yu

Inventor after: Wang Jingyu

Inventor before: Zhang Ke

Inventor before: Yan Hua

Inventor before: Su Yu

Inventor before: Wang Jingyu

CB03 Change of inventor or designer information