CN114638836A - Urban street view segmentation method based on highly effective drive and multi-level feature fusion - Google Patents

Urban street view segmentation method based on highly effective drive and multi-level feature fusion

Info

Publication number
CN114638836A
Authority
CN
China
Prior art keywords
network
feature
layer
features
fusion
Prior art date
Legal status
Granted
Application number
CN202210148745.7A
Other languages
Chinese (zh)
Other versions
CN114638836B (en)
Inventor
熊炜
赵迪
孙鹏
陈奕博
田紫欣
强观臣
万相奎
李利荣
宋海娜
Current Assignee
Hubei University of Technology
Original Assignee
Hubei University of Technology
Priority date
Filing date
Publication date
Application filed by Hubei University of Technology
Priority to CN202210148745.7A
Publication of CN114638836A
Application granted
Publication of CN114638836B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10004 Still image; Photographic image

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a city street view segmentation method based on height-driven efficient attention and multi-level feature fusion, which combines the positional prior of city street view images and adds a HEAM to the improved network, thereby enhancing the network's ability to extract features of targets at different height positions. The HEAM is embedded in the feature extraction network and in the ASPP structure, so that the extraction of deep features and multi-scale features is improved when the network performs deep convolution and multi-layer dilated convolution through the ASPP. The HEAM extracts height context information representing each horizontal partition of the image and predicts the features or categories of each horizontal partition from this information. Shallow features in the network have high resolution and contain more position and detail information, whereas deep features have low resolution and weak perception of detail but stronger semantic information. Segmentation accuracy is therefore improved by fusing shallow and deep features so that the features carry a more complete information representation.

Description

Urban street view segmentation method based on highly effective driving and multi-level feature fusion
Technical Field
The invention belongs to the technical field of digital image processing and computer vision, relates to a city street view segmentation method, and particularly relates to a city street view semantic segmentation method based on height-driven efficient attention and multi-level feature fusion.
Background
With the rapid development of computer hardware, image semantic segmentation algorithms are widely applied to computer-vision tasks in automatic driving, and many segmentation algorithms have been used for city street view segmentation with remarkable results. For the city street view segmentation task, many target objects exist in a street scene, and the various small targets are difficult to segment. High-precision segmentation of multiple target objects in city street views has therefore become a key research focus.
At present, semantic segmentation methods are mainly based on fully convolutional networks and image context knowledge, and can be roughly divided into three approaches: (1) adopting skip connections, multi-scale feature fusion, or dilated (atrous) convolution in a CNN to enlarge the receptive field; (2) introducing a conditional random field for post-processing of the segmentation after the CNN; (3) feeding the image into a recurrent neural network (RNN) as a sequence, using its memory capacity to improve segmentation performance.
An attention mechanism lets the network focus on the important characteristics of a target and ignore irrelevant information, thereby establishing the dependency relationship between pixels. For a two-dimensional image, besides its spatial size the other dimension is the number of channels; a channel attention mechanism judges the importance of each channel of the feature map and enhances or suppresses the corresponding channels for different tasks, so as to attend to local information of interest. However, these methods have poor robustness and high model complexity, and, most importantly, they do not make good use of the spatial position information of objects in city street scenes.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a city street view semantic segmentation method based on height-driven efficient attention and multi-level feature fusion, which significantly improves the city street view segmentation result without introducing excessive parameters and achieves high-precision classification and segmentation of multiple targets.
The technical scheme adopted by the invention is as follows: a city street view segmentation method based on height-driven efficient attention and multi-level feature fusion, in which a city street view image is input into a city street view segmentation network to obtain the segmented city street view;
the city street view segmentation network comprises a ResNet50 feature extraction network, a multi-level feature fusion module MFFM, a height-driven efficient attention module HEAM, and an atrous spatial pyramid pooling (ASPP) module;
the HEAM is embedded in both the ResNet50 feature extraction network and the ASPP module to improve the extraction of features along the height direction; a shallow feature map X_l ∈ R^(C_l×H_l×W_l) of the ResNet50 feature extraction network is the input of the HEAM and a deep feature map X_h ∈ R^(C_h×H_h×W_h) is its output, where C_x is the number of channels and H_x and W_x are the height and width of the feature map, with x = l, h; when the HEAM is embedded in the ASPP module, the output of the ResNet50 feature extraction network serves as the shallow feature map X_l and the output of the ASPP module serves as the deep feature map X_h, thereby realizing the attention operation of the HEAM;
the input city street view image first enters the ResNet50 feature extraction network for deep feature extraction, and the receptive field is then enlarged through the ASPP module, completing the encoding; during decoding, the multi-level feature fusion module MFFM fuses deep and shallow features of the network, and the fused features are skip-connected to the decoding end, reducing the data information loss caused by the upsampling operation.
Compared with the prior art, the invention has the following remarkable advantages:
(1) The invention provides a height-driven efficient attention module HEAM, which effectively exploits the positional prior of city street view images and embeds it into the feature extraction network, thereby enhancing the network's ability to extract features of targets at different height positions.
(2) The invention provides a multi-level feature fusion module MFFM, which deeply fuses shallow and deep features so that the features carry a more complete information representation, improving segmentation accuracy.
(3) On the test set of the CamVid dataset, the method reaches an MIoU of 68.2%, which is competitive with current SOTA segmentation results.
Drawings
FIG. 1 is a diagram of the structure of the city street view segmentation network according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the improved FPN multi-scale fusion according to an embodiment of the present invention;
FIG. 3 is a diagram of the ECA network architecture according to an embodiment of the present invention;
FIG. 4 is a structural diagram of the height-driven efficient attention module HEAM according to an embodiment of the present invention.
Detailed Description
In order to facilitate the understanding and implementation of the present invention for those of ordinary skill in the art, the present invention is further described in detail with reference to the accompanying drawings and examples, it is to be understood that the embodiments described herein are merely illustrative and explanatory of the present invention and are not restrictive thereof.
The invention provides a city street view segmentation method based on height-driven efficient attention and multi-level feature fusion, in which a city street view image is input into a city street view segmentation network to obtain the segmented city street view.
referring to fig. 1, the city street view segmentation network provided in this embodiment includes a ResNet50 feature extraction network, a multi-level feature fusion network MFFM, a highly efficient attention-driving network HEAM, and a void space pyramid pooling network ASPP;
the attention network HEAM is driven highly effectively and is respectively embedded into a ResNet50 feature extraction network and a void space pyramid pooling network ASPP network so as to improve the effective extraction of the features of the network in the height direction; shallow feature map of ResNet50 feature extraction network
Figure BDA0003509740810000031
For high-driving effective attention network HEAM input, the output is a deep characteristic diagram
Figure BDA0003509740810000032
Wherein C isxIs the number of channels, HxAnd WxThe height and width dimensions of the feature map, x ═ l, h, respectively; as shown in FIG. 1, HEAM is embedded into ASPP structure, and the output of feature extraction network is taken as shallow feature map X in FIG. 4lThe output of the ASPP structure is taken as the deep profile X in FIG. 4hThus, attention operation of the HEAM is realized.
The city street view image input by the embodiment firstly enters a ResNet50 feature extraction network to complete the deep extraction of features, and further expands the receptive field through a cavity space pyramid pooling network ASPP, so that the coding processing is completed; the multi-level feature fusion network MFFM fuses deep and shallow features of the network in a decoding process, and then the fusion features are connected to a decoding end in a jumping mode, so that data information loss caused by an up-sampling operation is reduced.
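To make this data flow concrete, a minimal PyTorch-style sketch of the encoder stages is given below. This is only an illustration: the layer names follow torchvision's ResNet50, while the ASPP, HEAM and decoder are indicated only in comments, since the patent does not give their exact layer configurations.

```python
# Minimal sketch of the encoder data flow (illustrative assumptions only:
# module internals of ASPP/HEAM/decoder are placeholders, not taken from the patent).
import torch
from torchvision.models import resnet50

backbone = resnet50(weights=None)  # pre-trained weights would be loaded in practice

def backbone_stages(x):
    """Return the four ResNet50 stage outputs used by MFFM and the encoder."""
    x = backbone.maxpool(backbone.relu(backbone.bn1(backbone.conv1(x))))
    c1 = backbone.layer1(x)   # shallow: high resolution, rich detail
    c2 = backbone.layer2(c1)
    c3 = backbone.layer3(c2)
    c4 = backbone.layer4(c3)  # deep: low resolution, strong semantics
    return c1, c2, c3, c4

img = torch.randn(1, 3, 720, 720)      # preprocessed street-view image
c1, c2, c3, c4 = backbone_stages(img)
# Encoding: c4 would pass through ASPP (with HEAM embedded: X_l = c4,
# X_h = ASPP output). Decoding: two 2x upsamplings, each followed by a
# skip connection from an MFFM fusion branch (sketched below).
print([t.shape for t in (c1, c2, c3, c4)])
```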
In this embodiment, a city street view image is selected as the input data, and the input image is preprocessed with methods such as random erasing, image flipping and smoothing. The image is resized to 720 × 720 before being fed into the city street view segmentation network shown in fig. 1. In the multi-level feature fusion module MFFM of this embodiment, the ResNet50 backbone is divided into 4 layers, and the features of each layer are extracted from the backbone as feature fusion branches. In the first branch, the third-layer and fourth-layer features are concatenated along the channel dimension and then fused with the second-layer features; because the feature maps differ in size, FPN multi-scale feature fusion is required, and the result is finally skip-concatenated with the features of the first 2× upsampling in the decoding block. In the second branch, after the concatenation of the first branch and a 2× upsampling operation, the first-layer and second-layer features are extracted from the backbone for the improved FPN fusion operation, and another branch is led out and skip-connected to the features of the second 2× upsampling. The improved network refines the single 4× upsampling of the original network into two 2× upsampling operations and skip-connects multi-stage deep features from the backbone after each upsampling, which effectively reduces the information loss during image recovery in the decoding block.
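A hypothetical sketch of the two MFFM fusion branches follows, where c1..c4 denote the four backbone stage outputs from the sketch above. The fusion operator is left here as a simple resize-and-concatenate placeholder; the improved FPN fusion with ECA that the embodiment actually uses is sketched after the ECA description below.

```python
# Hypothetical wiring of the two MFFM branches; spatial alignment by bilinear
# interpolation is an assumption made for illustration.
import torch
import torch.nn.functional as F

def simple_fuse(shallow, deep):
    """Placeholder fusion: resize the deep map to the shallow map and concatenate."""
    deep = F.interpolate(deep, size=shallow.shape[-2:], mode="bilinear",
                         align_corners=False)
    return torch.cat([shallow, deep], dim=1)

def mffm_branches(c1, c2, c3, c4):
    # Branch 1: channel-concatenate layer3/layer4 features, fuse with layer2,
    # then skip-connect to the first 2x upsampling in the decoder.
    deep34 = simple_fuse(c3, c4)
    branch1 = simple_fuse(c2, deep34)
    # Branch 2: fuse layer1 and layer2 features, skip-connect to the
    # second 2x upsampling in the decoder.
    branch2 = simple_fuse(c1, c2)
    return branch1, branch2
```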
Shallow features correspond to the local information of the image, and their abundant local detail distinguishes simple targets; deep features correspond to the global information of the image, and global cues such as color, texture and shape distinguish finer, more complex targets. A feature pyramid network can fuse the multi-level features of the backbone and realize the segmentation of multiple targets of different sizes.
In this embodiment, the FPN fusion operation applied to the first-layer and second-layer features extracted from the backbone is shown in fig. 2, where there are three feature maps of different levels whose size and resolution decrease from bottom to top; through the FPN, two feature maps of different levels can be fused into a new feature map with the same size as the shallow feature. Taking the first and second layers of the multi-level feature fusion module MFFM as an example, the second layer, serving as the deep feature, first undergoes a 1 × 1 convolution to reduce its dimensionality and is then upsampled by bilinear interpolation so that its size is expanded to that of the first-layer feature; the first layer, serving as the shallow feature, undergoes channel dimensionality reduction; the two features then pass through an ECA network to attend to important feature information, and finally the feature maps are fused by channel concatenation. The improved feature pyramid module obtains richer semantic and spatial information, and the added ECA network improves the feature extraction capability of the network, effectively enhancing the prediction accuracy of the network.
The ECA network improves on SENet, which reduces the dimensionality of the feature map; research has shown that this dimensionality reduction is detrimental to channel attention. As shown in fig. 3, without reducing dimensionality, ECA first performs channel-wise global average pooling, then captures the cross-channel interaction between each channel and its k neighboring channels through a one-dimensional convolution operation, and finally generates the channel weights through a nonlinear Sigmoid function. The channel weight in ECA can be calculated by equation (1):
ω=σ(Wy), (1)
where ω represents the channel weight of the entire ECA, σ is the Sigmoid function, W is a C × C channel weight matrix, and y is the input feature matrix. ω can in turn be represented as:
ω = σ(C1D_k(y)), (2)
where C1D denotes a one-dimensional convolution and k, the number of adjacent channels, is a preset value.
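As an illustration, the ECA weighting of equations (1) and (2) and the improved FPN fusion step described above might be sketched as follows. The kernel size k, the reduced channel count c_out and the reuse of the same ECA block on both inputs are assumptions, not specifics from the patent.

```python
# Sketch of ECA (eq. (1)-(2)) and of the improved FPN fusion that uses it.
# k, c_out and channel counts are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ECA(nn.Module):
    """Efficient Channel Attention: GAP, 1-D conv over k neighbouring channels, Sigmoid."""
    def __init__(self, k: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                      # x: (B, C, H, W)
        y = x.mean(dim=(2, 3))                 # channel-wise global average pooling
        w = self.conv(y.unsqueeze(1))          # 1-D conv across channels: C1D_k(y)
        w = torch.sigmoid(w).squeeze(1)        # omega = sigma(C1D_k(y))
        return x * w[:, :, None, None]         # reweight channels

class FPNFuse(nn.Module):
    """Improved FPN fusion: 1x1 convs to reduce dims, bilinear upsample the deep map,
    ECA attention on both inputs, then channel concatenation."""
    def __init__(self, c_shallow: int, c_deep: int, c_out: int):
        super().__init__()
        self.reduce_deep = nn.Conv2d(c_deep, c_out, kernel_size=1)
        self.reduce_shallow = nn.Conv2d(c_shallow, c_out, kernel_size=1)
        self.eca = ECA(k=3)

    def forward(self, shallow, deep):
        deep = self.reduce_deep(deep)
        deep = F.interpolate(deep, size=shallow.shape[-2:],
                             mode="bilinear", align_corners=False)
        shallow = self.reduce_shallow(shallow)
        return torch.cat([self.eca(shallow), self.eca(deep)], dim=1)
```

For example, fusing the first ResNet50 stage (256 channels) with the second (512 channels) could be written as `FPNFuse(256, 512, 64)(c1, c2)`, though the reduced channel count 64 is only an assumption.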
Referring to fig. 4, the height-driven efficient attention module HEAM provided by this embodiment compresses the width dimension to obtain per-channel weights along the height, generating channel-wise scaling factors from the height context information. The shallow feature map X_l passes through the HANet operation to generate a two-dimensional attention weight map A whose width dimension has been reduced; A is multiplied element-wise with the deep feature map X_h to obtain a brand-new three-dimensional feature map X̃_h ∈ R^(C_h×H_h×W_h) that depends on position along the height direction. In parallel, the deep feature map X_h is processed by the ECA network to generate X̂_h ∈ R^(C_h×H_h×W_h), and the two feature maps generated in parallel are added element-wise to produce the final output, realizing the height-driven enhancement of the features.
In this embodiment, the HEAM generates, via HANet, a channel attention map A ∈ R^(C_h×H_h×1) consisting of per-channel height scaling factors; X̃_h is obtained by multiplying the attention weight map A element-wise with the deep feature map X_h; X̂_h is generated from the deep feature map through the ECA network; finally X̃_h and X̂_h are fused to produce X_o, as shown in equations (3), (4) and (5):

X̃_h = F_HANet(X_l) ⊙ X_h, (3)

X̂_h = F_ECA(X_h), (4)

X_o = X̃_h ⊕ X̂_h, (5)

where ⊙ denotes element-wise multiplication and ⊕ denotes element-wise addition.
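A minimal sketch of this fusion is given below, assuming the attention map A has already been computed by F_HANet (a sketch of which follows the five steps below) and reusing the ECA module from the earlier sketch.

```python
# Minimal sketch of eq. (3)-(5); `a` is the height attention map produced by
# F_HANet from the shallow map X_l, `eca` is an ECA module as sketched above.
def heam_fuse(a, x_h, eca):
    """a: (B, C_h, H_h, 1); x_h: deep feature map (B, C_h, H_h, W_h)."""
    x_tilde = a * x_h        # eq. (3): element-wise product, broadcast along width
    x_hat = eca(x_h)         # eq. (4): ECA applied to the deep feature map
    return x_tilde + x_hat   # eq. (5): element-wise addition -> X_o
```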
F_HANet specifically comprises the following five steps (a), (b), (c), (d) and (e):
(a) Width pooling: the feature map X_l ∈ R^(C_l×H_l×W_l) is first compressed along the width dimension to generate a feature map Z ∈ R^(C_l×H_l×1), which captures the height context information of each row, as shown in equation (6):

Z = G_pool(X_l), (6)

(b) and (d) Interpolation: because the content of city street view images differs greatly along the height direction, not all row information of the matrix Z needs to be considered, so Z is downsampled by interpolation to generate a feature map Ẑ ∈ R^(C_l×Ĥ×1), where Ĥ is a hyperparameter set in the invention; step (d) then upsamples again to restore the dimension to C_l×H_l×1.
(c) Height-driven attention map computation: the feature map Ẑ is used as the input of convolution operations to generate the attention map, which accounts for the relationship between adjacent rows better than fully connected layers would. The attention map A obtained from N convolutional layers can be expressed by equation (7):

A = σ(G_up(F^N_conv(δ(F^(N-1)_conv(… δ(F^1_conv(Ẑ)) …))))), (7)

where σ denotes the Sigmoid function, δ denotes the ReLU activation function, F^i_conv denotes the i-th one-dimensional convolutional layer, and G_up denotes the upsampling operation. In the present invention the hyperparameter N is set to 3, i.e. three convolutional layers are used: the first convolution compresses the channels by a factor of r, giving Q_1 ∈ R^((C_l/r)×Ĥ×1); the second convolution stretches the channels by a factor of 2, giving Q_2 ∈ R^((2C_l/r)×Ĥ×1); the last convolution restores the channels to C_h, giving Q_3 ∈ R^(C_h×Ĥ×1).
(e) Position coding: since a person has prior knowledge of where objects appear while observing during driving, the invention is inspired to add sinusoidal position codes to the intermediate-layer feature map Q_i. The position codes are defined by equations (8) and (9):

PE(p, 2i) = sin(p / 100^(2i/C)), (8)

PE(p, 2i+1) = cos(p / 100^(2i/C)), (9)

where p represents the position factor in the vertical direction of the whole map and i indexes the vertical positions. The new feature map Q̂_i is produced by equation (10):

Q̂_i = Q_i ⊕ PE. (10)
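A compact sketch of F_HANet covering steps (a) to (e) is given below. The pooled height Ĥ (hat_h), the reduction ratio r, the 1-D kernel sizes, and the exact layer to which the position code is added are assumptions, since the patent leaves them as hyperparameters or does not state them.

```python
# Sketch of F_HANet (steps (a)-(e)); hat_h, r, kernel sizes and where PE is
# added are assumptions made for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

def sinusoidal_pe(channels: int, length: int) -> torch.Tensor:
    """Eq. (8)-(9): PE(p,2i)=sin(p/100^(2i/C)), PE(p,2i+1)=cos(p/100^(2i/C))."""
    pe = torch.zeros(channels, length)
    pos = torch.arange(length, dtype=torch.float32)            # vertical position p
    for i in range(0, channels, 2):
        div = 100.0 ** (i / channels)
        pe[i] = torch.sin(pos / div)
        if i + 1 < channels:
            pe[i + 1] = torch.cos(pos / div)
    return pe                                                   # (C, length)

class HANetAttention(nn.Module):
    def __init__(self, c_l: int, c_h: int, h_out: int, hat_h: int = 16, r: int = 8):
        super().__init__()
        self.hat_h, self.h_out = hat_h, h_out
        self.conv1 = nn.Conv1d(c_l, c_l // r, kernel_size=3, padding=1)           # compress by r
        self.conv2 = nn.Conv1d(c_l // r, 2 * c_l // r, kernel_size=3, padding=1)  # stretch x2
        self.conv3 = nn.Conv1d(2 * c_l // r, c_h, kernel_size=3, padding=1)       # restore to C_h
        self.register_buffer("pe", sinusoidal_pe(c_l // r, hat_h))

    def forward(self, x_l):                                     # x_l: (B, C_l, H_l, W_l)
        z = x_l.mean(dim=3)                                     # (a) width pooling -> (B, C_l, H_l)
        z = F.interpolate(z, size=self.hat_h, mode="linear",
                          align_corners=False)                  # (b) downsample rows
        q1 = F.relu(self.conv1(z)) + self.pe                    # (e) position code added to Q_1
        q2 = F.relu(self.conv2(q1))
        a = self.conv3(q2)                                      # (c) last 1-D convolution
        a = F.interpolate(a, size=self.h_out, mode="linear",
                          align_corners=False)                  # (d) G_up: restore height
        return torch.sigmoid(a).unsqueeze(-1)                   # A: (B, C_h, H_h, 1)
```

For instance, with a shallow map of shape (B, 512, 90, 90) and a deep map of 2048 channels and height 45, `HANetAttention(c_l=512, c_h=2048, h_out=45)(x_l)` would return a (B, 2048, 45, 1) attention map usable by `heam_fuse` above.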
in the embodiment, a ResNet50 feature extraction network is pre-trained on an Imagenet classification data set, and then a pre-training model is used for carrying out migration training on a network model; calculating a gradient by using a random gradient (SGD), wherein the initial learning rate lr is 1e-2, the momentum is 0.9, the weight attenuation degree is 5e-4, and the learning rate attenuation adopts a poly strategy; when training the city street view data set CamVid, the input size resize is 720 × 720, the batch size is 4, the number of training iterations is 14000 (300epoch), and the loss function is the cross entropy loss function.
When the trained model is tested according to this embodiment, multi-class segmentation maps of the image are output, in which different classes of semantic information are marked with different colors, assisting an automatic driving system in distinguishing city street view targets.
The method combines the positional prior of city street view images and adds a height-driven efficient attention module (HEAM) to the improved network, thereby enhancing the network's ability to extract features of targets at different height positions. The HEAM is embedded in the ResNet50 feature extraction network and in the atrous spatial pyramid pooling (ASPP) structure, so that the extraction of deep features and multi-scale features is improved when the network performs deep convolution and multi-layer dilated convolution through the ASPP. The HEAM extracts height context information representing each horizontal partition and predicts the features or categories of each horizontal partition from this information. For a multi-target semantic segmentation task, different target objects have different sizes, and segmentation with features from a single layer makes it difficult to segment multiple targets accurately. Shallow features in the network have high resolution and contain more position and detail information, whereas deep features have low resolution and weak perception of detail but stronger semantic information. Segmentation accuracy is therefore improved by fusing shallow and deep features so that the features carry a more complete information representation.
It should be understood that the above description of the preferred embodiments is given for clarity and not for any purpose of limitation, and that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (7)

1. A city street view segmentation method based on height-driven efficient attention and multi-level feature fusion, characterized by comprising the following steps: inputting a city street view image into a city street view segmentation network to obtain a city street view whose segmentation performance meets preset conditions;
the city street view segmentation network comprises a ResNet50 feature extraction network, a multi-level feature fusion module MFFM, a height-driven efficient attention module HEAM, and an atrous spatial pyramid pooling (ASPP) module;
the HEAM is embedded in both the ResNet50 feature extraction network and the ASPP module; a shallow feature map X_l ∈ R^(C_l×H_l×W_l) of the ResNet50 feature extraction network is the input of the HEAM and a deep feature map X_h ∈ R^(C_h×H_h×W_h) is its output, where C_x is the number of channels and H_x and W_x are the height and width of the feature map, with x = l, h; when the HEAM is embedded in the ASPP module, the output of the ResNet50 feature extraction network serves as the shallow feature map X_l and the output of the ASPP module serves as the deep feature map X_h, thereby realizing the attention operation of the HEAM;
the input city street view image first enters the ResNet50 feature extraction network for deep feature extraction, and the receptive field is then enlarged through the ASPP module, completing the encoding; during decoding, the multi-level feature fusion module MFFM fuses deep and shallow features of the network, and the fused features are skip-connected to the decoding end, reducing the data information loss caused by the upsampling operation.
2. The city street view segmentation method based on height-driven efficient attention and multi-level feature fusion according to claim 1, characterized in that: the multi-level feature fusion module MFFM divides the ResNet50 feature extraction network into 4 layers and extracts the features of each layer from the ResNet50 feature extraction network as feature fusion branches; in the first branch, the third-layer and fourth-layer features are concatenated along the channel dimension, then fused with the second-layer features by FPN multi-scale feature fusion, and finally skip-concatenated with the features of the first 2× upsampling in the decoding block; in the second branch, after the concatenation of the first branch and a 2× upsampling operation, the first-layer and second-layer features are extracted from the ResNet50 feature extraction network for the FPN fusion operation, and another branch is led out and skip-connected to the features of the second 2× upsampling.
3. The city street view segmentation method based on height-driven efficient attention and multi-level feature fusion according to claim 2, characterized in that: in the second branch, the first-layer and second-layer features are extracted from the ResNet50 feature extraction network for the improved FPN fusion operation, in which the second layer, serving as the deep feature, first has its dimensionality reduced by a 1 × 1 convolution and is then upsampled by bilinear interpolation so that its size is expanded to that of the first-layer feature; the first layer, serving as the shallow feature, undergoes channel dimensionality reduction; the two features then pass through an ECA network to attend to important feature information, and finally the feature maps are fused by channel concatenation.
4. The city street view segmentation method based on height-driven efficient attention and multi-level feature fusion according to claim 3, characterized in that: without reducing dimensionality, the ECA network first performs channel-wise global average pooling, then captures the cross-channel interaction between each channel and its k neighboring channels through a one-dimensional convolution operation, and finally generates the channel weights through a nonlinear Sigmoid function; in the ECA, the channel weight is ω = σ(Wy), where ω represents the channel weight of the entire ECA, σ is the Sigmoid function, W is a C × C channel weight matrix, and y is the input feature matrix; ω can also be expressed as ω = σ(C1D_k(y)), where C1D denotes a one-dimensional convolution, k is the number of adjacent channels, and k is a preset value.
5. The city street view segmentation method based on height-driven efficient attention and multi-level feature fusion according to claim 1, characterized in that: in the height-driven efficient attention module HEAM, the shallow feature map X_l passes through the HANet operation to generate a two-dimensional attention weight map A whose width dimension has been reduced; the attention weight map A is multiplied element-wise with the deep feature map X_h to obtain a brand-new three-dimensional feature map X̃_h ∈ R^(C_h×H_h×W_h) that depends on position along the height direction; the deep feature map X_h is further processed by the ECA network to generate X̂_h ∈ R^(C_h×H_h×W_h); the two feature maps generated in parallel are added element-wise to produce the final output, thereby realizing the height-driven enhancement of the features.
6. The city street view segmentation method based on height-driven efficient attention and multi-level feature fusion according to claim 5, characterized in that: the height-driven efficient attention module HEAM generates, via HANet, a channel attention map A ∈ R^(C_h×H_h×1) consisting of per-channel height scaling factors; X̃_h is obtained by multiplying the attention weight map A element-wise with the deep feature map X_h; X̂_h is generated from the deep feature map X_h through the ECA network; finally X̃_h and X̂_h are fused to produce X_o, as shown in equations (3), (4) and (5):

X̃_h = F_HANet(X_l) ⊙ X_h, (3)

X̂_h = F_ECA(X_h), (4)

X_o = X̃_h ⊕ X̂_h, (5)

where ⊙ denotes element-wise multiplication and ⊕ denotes element-wise addition;
F_HANet comprises five steps (a), (b), (c), (d) and (e), in which step (a) is width pooling: the feature map X_l ∈ R^(C_l×H_l×W_l) is first compressed along the width dimension to generate a feature map Z ∈ R^(C_l×H_l×1), capturing the height context information of each row, as shown in equation (6):

Z = G_pool(X_l), (6)

steps (b) and (d) are interpolation: Z is downsampled by interpolation to generate a feature map Ẑ ∈ R^(C_l×Ĥ×1), and step (d) then restores the dimension to C_l×H_l×1 by upsampling;
step (c) is the height-driven attention map computation: the feature map Ẑ is used as the input of convolution operations to generate the attention map; the attention map A obtained from N convolutional layers is expressed by equation (7):

A = σ(G_up(F^N_conv(δ(F^(N-1)_conv(… δ(F^1_conv(Ẑ)) …))))), (7)

where σ denotes the Sigmoid function, δ denotes the ReLU activation function, F^i_conv denotes the i-th one-dimensional convolutional layer, and G_up denotes the upsampling operation; the hyperparameter N is 3, i.e. three convolutional layers are used: the first convolution compresses the channels by a factor of r, giving Q_1 ∈ R^((C_l/r)×Ĥ×1); the second convolution stretches the channels by a factor of 2, giving Q_2 ∈ R^((2C_l/r)×Ĥ×1); the last convolution restores the channels to C_h, giving Q_3 ∈ R^(C_h×Ĥ×1);
step (e) is position coding: sinusoidal position codes are added to the intermediate-layer feature map Q_i, and the position codes are defined by equations (8) and (9):

PE(p, 2i) = sin(p / 100^(2i/C)), (8)

PE(p, 2i+1) = cos(p / 100^(2i/C)), (9)

where p represents the position factor in the vertical direction of the whole map and i indexes the vertical positions; the new feature map Q̂_i is produced by equation (10):

Q̂_i = Q_i ⊕ PE. (10)
7. The city street view segmentation method based on height-driven efficient attention and multi-level feature fusion according to any one of claims 1 to 4, characterized in that: a trained city street view segmentation network is obtained after training;
the ResNet50 feature extraction network is first pre-trained on the ImageNet classification dataset, and the pre-trained model is then used for transfer training of the network model; gradients are computed with stochastic gradient descent, with an initial learning rate lr = 1e-2, momentum 0.9 and weight decay 5e-4, and the learning rate decays according to a poly policy; when training on the city street view dataset CamVid, the input is resized to 720 × 720, the batch size is 4, the number of training iterations is 14000, and the loss function is a cross-entropy loss function.
CN202210148745.7A 2022-02-18 2022-02-18 Urban street view segmentation method based on highly effective driving and multi-level feature fusion Active CN114638836B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210148745.7A CN114638836B (en) 2022-02-18 2022-02-18 Urban street view segmentation method based on highly effective driving and multi-level feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210148745.7A CN114638836B (en) 2022-02-18 2022-02-18 Urban street view segmentation method based on highly effective driving and multi-level feature fusion

Publications (2)

Publication Number Publication Date
CN114638836A true CN114638836A (en) 2022-06-17
CN114638836B CN114638836B (en) 2024-04-30

Family

ID=81945671

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210148745.7A Active CN114638836B (en) 2022-02-18 2022-02-18 Urban street view segmentation method based on highly effective driving and multi-level feature fusion

Country Status (1)

Country Link
CN (1) CN114638836B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160157828A1 (en) * 2014-06-05 2016-06-09 Chikayoshi Sumi Beamforming method, measurement and imaging instruments, and communication instruments
CN110188817A (en) * 2019-05-28 2019-08-30 厦门大学 A kind of real-time high-performance street view image semantic segmentation method based on deep learning
CN111563909A (en) * 2020-05-10 2020-08-21 中国人民解放军91550部队 Semantic segmentation method for complex street view image
CN112651423A (en) * 2020-11-30 2021-04-13 深圳先进技术研究院 Intelligent vision system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HE K, ZHANG X, REN S, et al., "Deep residual learning for image recognition", 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 30 June 2016 (2016-06-30) *
PENG Yufeng; TAN Jianping; CHEN Hui; QUAN Lingyun: "Research and application of an image segmentation method for visual detection of the lateral offset of the moving crossbeam of a 300 MN hydraulic press", 仪表技术与传感器 (Instrument Technique and Sensor), no. 05, 15 May 2010 (2010-05-15) *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112446384A (en) * 2019-12-06 2021-03-05 黑芝麻智能科技(上海)有限公司 Fast instance partitioning
CN112446384B (en) * 2019-12-06 2024-05-31 黑芝麻智能科技(上海)有限公司 Fast instance partitioning
CN115035299A (en) * 2022-06-20 2022-09-09 河南大学 Improved city street view image segmentation method based on deep learning
CN115690704A (en) * 2022-09-27 2023-02-03 淮阴工学院 LG-CenterNet model-based complex road scene target detection method and device
CN115690704B (en) * 2022-09-27 2023-08-22 淮阴工学院 LG-CenterNet model-based complex road scene target detection method and device
CN116188584A (en) * 2023-04-23 2023-05-30 成都睿瞳科技有限责任公司 Method and system for identifying object polishing position based on image
CN116188584B (en) * 2023-04-23 2023-06-30 成都睿瞳科技有限责任公司 Method and system for identifying object polishing position based on image

Also Published As

Publication number Publication date
CN114638836B (en) 2024-04-30

Similar Documents

Publication Publication Date Title
CN110111366B (en) End-to-end optical flow estimation method based on multistage loss
CN111582316B (en) RGB-D significance target detection method
CN112347859B (en) Method for detecting significance target of optical remote sensing image
CN114638836B (en) Urban street view segmentation method based on highly effective driving and multi-level feature fusion
CN110363716B (en) High-quality reconstruction method for generating confrontation network composite degraded image based on conditions
CN111612807B (en) Small target image segmentation method based on scale and edge information
CN112396607B (en) Deformable convolution fusion enhanced street view image semantic segmentation method
CN111539887B (en) Channel attention mechanism and layered learning neural network image defogging method based on mixed convolution
CN113469094A (en) Multi-mode remote sensing data depth fusion-based earth surface coverage classification method
CN111931787A (en) RGBD significance detection method based on feature polymerization
CN110929736A (en) Multi-feature cascade RGB-D significance target detection method
CN112329780B (en) Depth image semantic segmentation method based on deep learning
CN112270366B (en) Micro target detection method based on self-adaptive multi-feature fusion
CN113076957A (en) RGB-D image saliency target detection method based on cross-modal feature fusion
CN114549574A (en) Interactive video matting system based on mask propagation network
CN113538243B (en) Super-resolution image reconstruction method based on multi-parallax attention module combination
CN112991350A (en) RGB-T image semantic segmentation method based on modal difference reduction
CN114359293A (en) Three-dimensional MRI brain tumor segmentation method based on deep learning
CN115272437A (en) Image depth estimation method and device based on global and local features
CN116778165A (en) Remote sensing image disaster detection method based on multi-scale self-adaptive semantic segmentation
CN116402851A (en) Infrared dim target tracking method under complex background
CN116485867A (en) Structured scene depth estimation method for automatic driving
CN112288690A (en) Satellite image dense matching method fusing multi-scale and multi-level features
CN112149526A (en) Lane line detection method and system based on long-distance information fusion
CN115063704A (en) Unmanned aerial vehicle monitoring target classification method based on three-dimensional feature fusion semantic segmentation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant