CN113436210A - Road image segmentation method fusing context progressive sampling - Google Patents

Road image segmentation method fusing context progressive sampling

Info

Publication number
CN113436210A
CN113436210A CN202110706637.2A CN202110706637A
Authority
CN
China
Prior art keywords
layer
output
feature map
pooling
deep
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110706637.2A
Other languages
Chinese (zh)
Other versions
CN113436210B (en)
Inventor
陆彦钊
刘惠义
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hohai University HHU
Original Assignee
Hohai University HHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hohai University HHU filed Critical Hohai University HHU
Priority to CN202110706637.2A priority Critical patent/CN113436210B/en
Publication of CN113436210A publication Critical patent/CN113436210A/en
Application granted granted Critical
Publication of CN113436210B publication Critical patent/CN113436210B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/12 Edge-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/13 Edge detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20212 Image combination
    • G06T 2207/20221 Image fusion; Image merging
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30248 Vehicle exterior or interior
    • G06T 2207/30252 Vehicle exterior; Vicinity of vehicle
    • G06T 2207/30256 Lane; Road marking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a road image segmentation method fusing context with progressive sampling, which comprises the following steps: preprocessing a set of acquired road images to obtain segmented pictures; inputting each segmented picture into a constructed Xception model to extract a deep feature map and shallow feature maps; inputting the shallow feature maps into a constructed CBAM attention model to amplify the features of small targets, and feeding the output into a constructed HRNet module for fusion; inputting the deep feature map into a constructed ASPP pyramid module for pooling; and fusing the deep or shallow feature maps of the same resolution in the fusion result and the pooling result, then upsampling step by step by a factor of 2 until the original image size is restored. The method improves segmentation accuracy and makes the segmentation more precise in its details.

Description

Road image segmentation method fusing context progressive sampling
Technical Field
The invention relates to a road image segmentation method fusing context progressive sampling, and belongs to the technical field of image segmentation.
Background
Semantic image segmentation is a key problem in computing and an important direction of computer vision research. Early image segmentation in computer vision generally relied on cues such as edges and gradients and lacked pixel-level understanding, so segmentation accuracy was low and such methods could not be applied in fields such as intelligent driving. In recent years, with deepening research on convolutional neural networks, computers' pixel-level understanding has become increasingly strong and semantic segmentation networks have matured, giving the technology broad application prospects in autonomous driving, human-computer interaction, virtual reality and other fields.
Early semantic segmentation generally used threshold-, edge- and region-based methods. Although these methods are convenient and easy to understand, they lose a great deal of spatial information and segment poorly. To address these problems, Jonathan Long et al. proposed the Fully Convolutional Network (FCN) based on the CNN convolutional neural network. The network removes the last fully connected layers of the CNN, upsamples the final CNN feature map by deconvolution, and enlarges the result to the original image size to achieve pixel-level classification. The work of Jonathan Long et al. was a major breakthrough in semantic image segmentation. However, because the FCN reduces the original image 32-fold before enlarging it, its pooling causes information loss, and no probabilistic model between labels is applied. Chen et al. proposed the DeepLab V1 method, which uses dilated (atrous) convolution to enlarge the receptive field and reduce the number of pooling layers, avoiding the loss of detail caused by excessive pooling. The addition of a CRF (conditional random field) further refines edges and improves the segmentation of complex boundaries such as trees and bicycles. Building on DeepLab V1, Liang-Chieh Chen et al. proposed DeepLab V2. Compared with the DeepLab V1 network, the VGG16 backbone is replaced with ResNet and an ASPP (Atrous Spatial Pyramid Pooling) module is added. ASPP applies dilated convolution layers with several sampling rates in parallel and fuses global and local features to improve segmentation. The subsequent DeepLab V3+ introduced an encoder-decoder structure, fusing the backbone output with shallow features and gradually reconstructing spatial information to better capture object detail, while adopting depthwise separable convolutions to reduce computation. DeepLab V3+ captures context information well, but its edge segmentation accuracy for small-scale objects is still not high.
In order to solve the problems, the application provides a road image segmentation method fusing context progressive sampling.
Disclosure of Invention
The invention aims to overcome the deficiencies of the prior art and provide a road image segmentation method fusing context with progressive sampling, which identifies small target objects on the road more accurately and markedly improves the segmentation of image detail.
In order to achieve the purpose, the invention is realized by adopting the following technical scheme:
a road image segmentation method fusing context progressive sampling comprises the following steps:
preprocessing the acquired multiple road images to obtain segmentation pictures;
inputting the segmented picture into a constructed Xception model to extract a deep layer feature map and a shallow layer feature map;
inputting the shallow feature map into the constructed CBAM attention model to amplify the features of the small targets, and inputting the output result into the constructed HRNet module for fusion;
inputting the deep characteristic map into a constructed ASPP pyramid module for pooling;
and fusing the deep or shallow feature maps of the same resolution in the fusion result and the pooling result, and upsampling step by step by a factor of 2 to enlarge them back to the original image size.
Preferably, the preprocessing the acquired multiple road images to obtain the segmentation picture includes:
cutting the road image into 1024 x 1024 pixel pictures, and uniformly storing the pictures into a jpg format;
performing semantic annotation on each picture to obtain a segmented picture;
the semantic annotation content comprises a background, an automobile, a person, the sky, a road, a grassland, a wall, a building and a pedestrian road.
Preferably, the construction of the Xception model includes:
building a block1 intermediate feature layer, consisting of a 32-channel 3 × 3 convolution layer, a ReLU activation layer, a 64-channel 3 × 3 convolution layer and a ReLU activation layer;
building a block2 intermediate feature layer, consisting of 2 128-channel 3 × 3 depthwise separable convolution layers, a ReLU activation layer and a max pooling layer;
building a block3 intermediate feature layer, consisting of 2 256-channel 3 × 3 depthwise separable convolution layers, a ReLU activation layer and a max pooling layer;
building a block4 intermediate feature layer, consisting of 2 728-channel 3 × 3 depthwise separable convolution layers, a ReLU activation layer and a max pooling layer;
building the block5 to block13 intermediate feature layers, each consisting of 3 728-channel 3 × 3 depthwise separable convolution layers and 3 ReLU activation layers;
wherein the output of the block1 intermediate feature layer is additionally fed into a 1 × 1 convolution layer and the result is added to the output of the block2 intermediate feature layer; the output of the block2 intermediate feature layer is additionally fed into a 1 × 1 convolution layer and the result is added to the output of the block3 intermediate feature layer; and the output of the block3 intermediate feature layer is additionally fed into a 1 × 1 convolution layer and the result is added to the output of the block4 intermediate feature layer.
Preferably, the step of inputting the segmented picture into the constructed Xception model to extract the deep feature map and the shallow feature map includes: the Xception model extracts deep feature maps in the segmented pictures in the block13 intermediate feature layer, and the Xception model extracts shallow feature maps in the segmented pictures in the block2, block3 and block4 intermediate feature layers.
Preferably, the inputting the shallow feature map into the constructed CBAM attention model to amplify the features of the small targets therein, and inputting the output result into the constructed HRNet module for fusion includes:
inputting the shallow feature maps extracted from the block2, block3 and block4 intermediate feature layers into the constructed CBAM attention model to amplify the features of small targets, and outputting out1, out2 and out3;
cross-fusing out1, out2 and out3 by upsampling and downsampling to obtain feature maps at the 3 corresponding resolutions, namely hrout1, hrout2 and hrout3;
wherein a small target is an object whose area in the segmented picture is smaller than 10 × 10 pixels;
the size of hrout2 is 1/2 that of hrout1, and the size of hrout3 is 1/2 that of hrout2.
Preferably, the CBAM attention model is constructed by: constructing a channel attention mechanism and a space attention mechanism;
the channel attention mechanism comprises:
respectively performing maximum pooling and average pooling on the input feature map on channel dimensions to extract the maximum weight and the average weight on each channel;
respectively sending the maximum weight and the average weight to two full-connection layers for classification;
adding the classification results and activating by using a sigmoid function to obtain an importance weight matrix of each channel;
multiplying the importance weight matrix of each channel with the input feature map to obtain the output of amplified channel features;
the maximum pooling takes the maximum of the pixel values in each channel, the average pooling takes the average of the pixel values in each channel, and the sigmoid activation function maps the weights into the range (0, 1), pushing larger input values toward 1 and smaller input values toward 0;
the spatial attention mechanism comprises:
performing primary maximum pooling and primary average pooling on the output of the amplified channel characteristics on the spatial dimension to extract the maximum weight and the average weight of each pixel point;
carrying out convolution operation on the maximum weight and the average weight through a 3-by-3 convolution layer, activating by using a sigmoid function, and outputting an importance weight matrix of each pixel point;
and multiplying the importance weight matrix of each pixel point by the output of the amplified channel characteristic to obtain the output of the amplified pixel point characteristic.
Preferably, the ASPP pyramid module comprises 3 × 3 convolution layers with dilation rates of 6, 12 and 18, respectively, and an average pooling layer with stride 1; the step of inputting the deep feature map into the constructed ASPP pyramid module for pooling comprises:
sending the deep feature map extracted from the block13 intermediate feature layer into a 3 × 3 convolution layer with dilation rate 6 and then through two 1 × 1 convolution layers with stride 1 to output a result;
sending the deep feature map extracted from the block13 intermediate feature layer into a 3 × 3 convolution layer with dilation rate 12 and then through two 1 × 1 convolution layers with stride 1 to output a result;
sending the deep feature map extracted from the block13 intermediate feature layer into a 3 × 3 convolution layer with dilation rate 18 and then through two 1 × 1 convolution layers with stride 1 to output a result;
sending the deep feature map extracted from the block13 intermediate feature layer into an average pooling layer with stride 1 to output a result;
and combining the output results to obtain the final pooled output of the ASPP pyramid module.
Preferably, fusing the deep or shallow feature maps of the same resolution in the fusion result and the pooling result, and upsampling step by step by a factor of 2 to enlarge them back to the original image size, comprises:
convolving the pooled output of the ASPP pyramid module once, upsampling by 2× once, and combining with hrout3;
convolving the combined result once, upsampling by 2× once, and combining with hrout2;
convolving the combined result once, upsampling by 2× once, and combining with hrout1;
and convolving the combined result twice, upsampling by 2× once, and activating with a softmax function to obtain the final output.
Compared with the prior art, the invention has the following beneficial effects:
the invention provides a road image segmentation method fusing context and step-by-step sampling, which fuses different levels of features by utilizing a HRNet mode, adds a CBAM attention mechanism in front of an HRNet module, enhances a beneficial feature channel, weakens a useless feature channel, and finally samples the output of an ASPP pyramid module and the fused different levels of features step by step. The experimental results show that: the method for integrating context and up-sampling step by step is more accurate in identifying small target objects on the road and has obvious improvement on image detail segmentation; the invention can help the automobile to identify the type, position and size of the road object, and can effectively pre-judge the remote small target pedestrian in advance because the identification of the small target object is more accurate, thereby having a large exertion space in the intelligent driving direction.
Drawings
Fig. 1 is a flowchart of a road image segmentation method fusing context progressive sampling according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
The first embodiment is as follows:
the embodiment provides a road image segmentation method, which comprises the following steps:
step 1, preprocessing a plurality of acquired road images to obtain segmented pictures;
cutting the road image into 1024 x 1024 pixel pictures, and uniformly storing the pictures into a jpg format;
performing semantic annotation on each picture to obtain a segmented picture;
the semantic annotation content comprises a background, an automobile, a person, the sky, a road, a grassland, a wall, a building and a pedestrian road.
Step 2, inputting the segmented picture into the constructed Xception model to extract a deep feature map and shallow feature maps;
the Xception model extracts deep feature maps in the segmented pictures in the block13 intermediate feature layer, and the Xception model extracts shallow feature maps in the segmented pictures in the block2, block3 and block4 intermediate feature layers.
Step 3, inputting the shallow feature map into the constructed CBAM attention model to amplify the features of the small targets, and inputting the output result into the constructed HRNet module for fusion;
inputting the shallow feature maps extracted from the block2, block3 and block4 intermediate feature layers into the constructed CBAM attention model to amplify the features of small targets, and outputting out1, out2 and out3;
cross-fusing out1, out2 and out3 by upsampling and downsampling to obtain feature maps at the 3 corresponding resolutions, namely hrout1, hrout2 and hrout3;
wherein a small target is an object whose area in the segmented picture is smaller than 10 × 10 pixels;
the size of hrout2 is 1/2 that of hrout1, and the size of hrout3 is 1/2 that of hrout2.
Step 4, inputting the deep feature map into the constructed ASPP pyramid module for pooling;
sending the deep feature map extracted from the block13 intermediate feature layer into a 3 × 3 convolution layer with dilation rate 6 and then through two 1 × 1 convolution layers with stride 1 to output a result;
sending the deep feature map extracted from the block13 intermediate feature layer into a 3 × 3 convolution layer with dilation rate 12 and then through two 1 × 1 convolution layers with stride 1 to output a result;
sending the deep feature map extracted from the block13 intermediate feature layer into a 3 × 3 convolution layer with dilation rate 18 and then through two 1 × 1 convolution layers with stride 1 to output a result;
sending the deep feature map extracted from the block13 intermediate feature layer into an average pooling layer with stride 1 to output a result;
and combining the output results to obtain the final pooled output of the ASPP pyramid module.
Step 5, fusing the deep or shallow feature maps of the same resolution in the fusion result and the pooling result, and upsampling step by step by a factor of 2 to enlarge them back to the original image size.
Convolve the pooled output of the ASPP pyramid module once, upsample by 2× once, and combine with hrout3;
convolve the combined result once, upsample by 2× once, and combine with hrout2;
convolve the combined result once, upsample by 2× once, and combine with hrout1;
convolve the combined result twice, upsample by 2× once, and activate with a softmax function to obtain the final output.
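As an illustration of this step-by-step upsampling, a minimal tf.keras sketch is given below. It is not the patented implementation: the intermediate channel widths, the bilinear interpolation and the function name decoder are assumptions; only the overall pattern (convolve, upsample by 2, merge with the next hrout, finish with softmax) follows the description above.

    import tensorflow as tf
    from tensorflow.keras import layers

    def decoder(aspp_out, hrout1, hrout2, hrout3, num_classes=9):
        # aspp_out is 1/16 of the input size; hrout3/hrout2/hrout1 are 1/8, 1/4 and 1/2
        x = layers.Conv2D(80, 1, padding="same", activation="relu")(aspp_out)
        x = layers.UpSampling2D(2, interpolation="bilinear")(x)
        for skip in (hrout3, hrout2, hrout1):
            x = layers.Concatenate()([x, skip])                  # merge at equal resolution
            x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)
            x = layers.UpSampling2D(2, interpolation="bilinear")(x)
        x = layers.Conv2D(num_classes, 1, padding="same")(x)     # 9 semantic classes
        return layers.Activation("softmax")(x)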
The Xception model is a network structure proposed by *** corporation for image classification, and in this embodiment, the construction of the Xception model includes:
building a block1 intermediate feature layer, consisting of a 32-channel 3 × 3 convolution layer, a ReLU activation layer, a 64-channel 3 × 3 convolution layer and a ReLU activation layer;
building a block2 intermediate feature layer, consisting of 2 128-channel 3 × 3 depthwise separable convolution layers, a ReLU activation layer and a max pooling layer;
building a block3 intermediate feature layer, consisting of 2 256-channel 3 × 3 depthwise separable convolution layers, a ReLU activation layer and a max pooling layer;
building a block4 intermediate feature layer, consisting of 2 728-channel 3 × 3 depthwise separable convolution layers, a ReLU activation layer and a max pooling layer;
building the block5 to block13 intermediate feature layers, each consisting of 3 728-channel 3 × 3 depthwise separable convolution layers and 3 ReLU activation layers;
wherein the output of the block1 intermediate feature layer is additionally fed into a 1 × 1 convolution layer and the result is added to the output of the block2 intermediate feature layer; the output of the block2 intermediate feature layer is additionally fed into a 1 × 1 convolution layer and the result is added to the output of the block3 intermediate feature layer; and the output of the block3 intermediate feature layer is additionally fed into a 1 × 1 convolution layer and the result is added to the output of the block4 intermediate feature layer.
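By way of illustration only, the following is a minimal tf.keras sketch of an Xception-style backbone of the kind described above. It is not the authors' implementation: the helper names (entry_block, middle_block, build_backbone), the stride placement, and the use of the block outputs (rather than the pre-pooling features) as the shallow feature maps are assumptions made for the sketch.

    import tensorflow as tf
    from tensorflow.keras import layers

    def entry_block(x, filters):
        # block2-block4 style: two separable 3 x 3 convs, ReLU, max pooling,
        # plus a 1 x 1-projected residual added to the block output
        residual = layers.Conv2D(filters, 1, strides=2, padding="same")(x)
        y = layers.SeparableConv2D(filters, 3, padding="same")(x)
        y = layers.ReLU()(y)
        y = layers.SeparableConv2D(filters, 3, padding="same")(y)
        y = layers.MaxPooling2D(3, strides=2, padding="same")(y)
        return layers.Add()([y, residual])

    def middle_block(x, filters=728):
        # block5-block13 style: three separable 3 x 3 convs with ReLU, residual add
        y = x
        for _ in range(3):
            y = layers.ReLU()(y)
            y = layers.SeparableConv2D(filters, 3, padding="same")(y)
        return layers.Add()([y, x])

    def build_backbone(input_shape=(1024, 1024, 3)):
        inp = layers.Input(shape=input_shape)
        # block1: plain 3 x 3 convolutions with 32 and 64 channels
        x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(inp)
        x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
        shallow2 = x = entry_block(x, 128)   # block2 (shallow feature map)
        shallow3 = x = entry_block(x, 256)   # block3 (shallow feature map)
        shallow4 = x = entry_block(x, 728)   # block4 (shallow feature map)
        for _ in range(9):                   # block5 to block13
            x = middle_block(x, 728)
        deep = x                             # block13 (deep feature map)
        return tf.keras.Model(inp, [shallow2, shallow3, shallow4, deep])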
The Convolutional Block Attention Module (CBAM) is an attention model combining spatial and channel attention; by using the spatial and channel information among pixels it can improve the recognition of small target objects. In this embodiment, constructing the CBAM attention model comprises: constructing a channel attention mechanism and a spatial attention mechanism;
the channel attention mechanism comprises:
respectively performing maximum pooling and average pooling on the input feature map on channel dimensions to extract the maximum weight and the average weight on each channel;
respectively sending the maximum weight and the average weight to two full-connection layers for classification;
adding the classification results and activating by using a sigmoid function to obtain an importance weight matrix of each channel;
multiplying the importance weight matrix of each channel with the input feature map to obtain the output of amplified channel features;
the maximum pooling takes the maximum of the pixel values in each channel, the average pooling takes the average of the pixel values in each channel, and the sigmoid activation function maps the weights into the range (0, 1), pushing larger input values toward 1 and smaller input values toward 0;
the spatial attention mechanism comprises:
performing primary maximum pooling and primary average pooling on the output of the amplified channel characteristics on the spatial dimension to extract the maximum weight and the average weight of each pixel point;
carrying out convolution operation on the maximum weight and the average weight through a 3-by-3 convolution layer, activating by using a sigmoid function, and outputting an importance weight matrix of each pixel point;
and multiplying the importance weight matrix of each pixel point by the output of the amplified channel characteristic to obtain the output of the amplified pixel point characteristic.
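The channel and spatial attention described above can be sketched in tf.keras as follows. This is a minimal sketch rather than the patented implementation: the reduction ratio of 8, the shared two-layer perceptron used as the two fully connected layers, and the helper names (channel_attention, spatial_attention, cbam) are assumptions.

    import tensorflow as tf
    from tensorflow.keras import layers

    def channel_attention(x, reduction=8):
        channels = x.shape[-1]
        fc1 = layers.Dense(channels // reduction, activation="relu")   # first fully connected layer
        fc2 = layers.Dense(channels)                                   # second fully connected layer
        avg = layers.GlobalAveragePooling2D()(x)        # average weight of each channel
        mx = layers.GlobalMaxPooling2D()(x)             # maximum weight of each channel
        w = tf.nn.sigmoid(fc2(fc1(avg)) + fc2(fc1(mx))) # importance weight per channel
        w = layers.Reshape((1, 1, channels))(w)
        return x * w                                    # output with channel features amplified

    def spatial_attention(x):
        avg = tf.reduce_mean(x, axis=-1, keepdims=True) # average weight of each pixel
        mx = tf.reduce_max(x, axis=-1, keepdims=True)   # maximum weight of each pixel
        w = layers.Conv2D(1, 3, padding="same",
                          activation="sigmoid")(tf.concat([avg, mx], axis=-1))
        return x * w                                    # output with pixel features amplified

    def cbam(x):
        # channel attention first, then spatial attention, as described above
        return spatial_attention(channel_attention(x))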
The ASPP pyramid module enlarges the receptive field using dilated (hole) convolutions with different dilation rates, avoiding the loss of resolution that conventional methods accept in order to obtain a larger receptive field. In this embodiment, the ASPP pyramid module comprises 3 × 3 convolution layers with dilation rates of 6, 12 and 18, respectively, and an average pooling layer with stride 1.
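For illustration, a minimal tf.keras sketch of such an ASPP module is given below. The 256-channel width and the concatenation to 1024 channels follow the embodiment; the pooling window of the stride-1 average pooling branch and the single 3 × 3 projection after concatenation are assumptions of the sketch.

    import tensorflow as tf
    from tensorflow.keras import layers

    def aspp(x, filters=256):
        branches = []
        for rate in (6, 12, 18):                        # parallel dilated 3 x 3 convolutions
            branches.append(layers.Conv2D(filters, 3, dilation_rate=rate,
                                          padding="same", activation="relu")(x))
        pool = layers.AveragePooling2D(pool_size=3, strides=1, padding="same")(x)
        pool = layers.Conv2D(filters, 3, padding="same", activation="relu")(pool)
        branches.append(pool)                           # stride-1 average pooling branch
        merged = layers.Concatenate()(branches)         # 4 x 256 = 1024 channels
        return layers.Conv2D(filters, 3, padding="same", activation="relu")(merged)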
In this embodiment, constructing the HRNet module includes:
combining out1, the 2× upsampled out2 and the 4× upsampled out3 into out11; combining the 2× downsampled out1, out2 and the 2× upsampled out3 into out22; combining the 4× downsampled out1, the 2× downsampled out2 and out3 into out33;
combining out11, the 2× upsampled out22 and the 4× upsampled out33 into out111; combining the 2× downsampled out11, out22 and the 2× upsampled out33 into out222; and combining the 4× downsampled out11, the 2× downsampled out22 and out33 into out333 (these are hrout1, hrout2 and hrout3, respectively).
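For illustration, a minimal tf.keras sketch of this two-round cross-resolution fusion is given below. The use of average pooling for downsampling, nearest-neighbour upsampling, and a single 3 × 3 convolution after each merge (with the 128/256/512 channel widths of the embodiment) are assumptions of the sketch, not requirements of the method.

    import tensorflow as tf
    from tensorflow.keras import layers

    def _fuse(o1, o2, o3, c1=128, c2=256, c3=512):
        # one round of cross fusion between three resolutions (o1 largest, o3 smallest)
        f1 = layers.Concatenate()([o1,
                                   layers.UpSampling2D(2)(o2),
                                   layers.UpSampling2D(4)(o3)])
        f2 = layers.Concatenate()([layers.AveragePooling2D(2)(o1),
                                   o2,
                                   layers.UpSampling2D(2)(o3)])
        f3 = layers.Concatenate()([layers.AveragePooling2D(4)(o1),
                                   layers.AveragePooling2D(2)(o2),
                                   o3])
        conv = lambda t, c: layers.Conv2D(c, 3, padding="same", activation="relu")(t)
        return conv(f1, c1), conv(f2, c2), conv(f3, c3)

    def hrnet_fusion(out1, out2, out3):
        out11, out22, out33 = _fuse(out1, out2, out3)   # first round of cross fusion
        return _fuse(out11, out22, out33)               # hrout1, hrout2, hrout3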
Taking German city street scenes as an example, the data set contains 9 broad categories: background, car, person, sky, road, grass, wall, building and pedestrian road. The data set has 1300 road street-view pictures from 10 German cities; 1000 samples are used for training and 300 samples for testing. Each picture is 2048 × 1024 pixels. Training is performed on a Tesla P100 GPU with 16 GB of video memory, using mini-batch stochastic gradient descent with the Adam optimizer; the learning rate is 0.001 for the first 500 epochs and is reduced to 0.0001 for the last 200 epochs. The loss function is the cross-entropy loss (categorical_crossentropy).
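The training configuration above can be expressed, as a rough sketch only, with the tf.keras API. Here `model`, `x_train` and `y_train` (per-pixel one-hot labels) are assumed to exist, and the batch size of 2 is an assumption rather than a value taken from the patent.

    import tensorflow as tf

    def lr_schedule(epoch, lr):
        # 0.001 for the first 500 epochs, 0.0001 for the last 200 epochs
        return 1e-3 if epoch < 500 else 1e-4

    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                  loss="categorical_crossentropy",       # cross-entropy loss
                  metrics=["accuracy"])
    model.fit(x_train, y_train,                          # y_train: one-hot labels per pixel
              batch_size=2,
              epochs=700,
              callbacks=[tf.keras.callbacks.LearningRateScheduler(lr_schedule)])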
The training data set is put into step (1) for image preprocessing, as follows:
1.1 Cut the 1000 training pictures into 2000 pictures of 1024 × 1024 pixels.
1.2 Convert the 2000 cropped pictures from step 1.1 into 3-channel array format, giving 2000 matrices of size 1024 × 1024 × 3.
1.3 Merge the 2000 three-dimensional matrices from step 1.2 into one four-dimensional matrix of size 2000 × 1024 × 1024 × 3.
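A minimal sketch of steps 1.1 to 1.3 is shown below; it assumes the source pictures are 2048 × 1024 jpg files read with Pillow, and the directory layout and function name are illustrative.

    import numpy as np
    from pathlib import Path
    from PIL import Image

    def load_crops(image_dir):
        crops = []
        for path in sorted(Path(image_dir).glob("*.jpg")):
            img = np.asarray(Image.open(path).convert("RGB"))   # H x W x 3 array
            h, w, _ = img.shape                                  # expected 1024 x 2048
            for x0 in range(0, w, 1024):                         # two 1024-wide crops
                crops.append(img[:1024, x0:x0 + 1024, :])
        return np.stack(crops)                                   # N x 1024 x 1024 x 3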
The result of step 1.3 is put into step (2), where features are extracted with the Xception network, as follows:
2.1 Pad the height and width of the four-dimensional matrix with 2 rows and columns of zeros on each side to form a 2000 × 1028 × 1028 × 3 matrix, and put it into block1 of the Xception network to produce a 2000 × 512 × 512 × 64 matrix.
2.2 Put the output of step 2.1 into block2 to obtain a shallow feature map matrix of size 2000 × 256 × 256 × 128.
2.3 Put the output of step 2.2 into block3 to obtain a shallow feature map matrix of size 2000 × 128 × 128 × 256.
2.4 Put the output of step 2.3 into block4 to obtain a shallow feature map matrix of size 2000 × 64 × 64 × 728.
2.5 Put the output of step 2.4 sequentially through block5, block6, block7, block8, block9, block10, block11, block12 and block13 to obtain a deep feature map matrix of size 2000 × 64 × 64 × 728.
The outputs of steps 2.2, 2.3 and 2.4 are each put into step (3), where the small-target features are amplified with the CBAM attention mechanism, specifically as follows:
3.1 The feature matrix before block2 pooling in step 2.2 passes through the channel attention mechanism and then the spatial attention mechanism of the CBAM module, giving an output matrix with the small-target features amplified, of size 2000 × 512 × 512 × 128.
3.2 The feature matrix before block3 pooling in step 2.3 passes through the channel attention mechanism and then the spatial attention mechanism of the CBAM module, giving an output matrix with the small-target features amplified, of size 2000 × 256 × 256 × 256.
3.3 The feature matrix before block4 pooling in step 2.4 passes through the channel attention mechanism and then the spatial attention mechanism of the CBAM module, giving an output matrix with the small-target features amplified, of size 2000 × 128 × 128 × 512.
The deep feature map of block13 obtained in step 2.5 is put into step (4), where a larger receptive field is obtained with the ASPP pyramid module, specifically as follows:
4.1 Put the output feature matrix of block13 into a 3 × 3 convolution layer with 256 channels and dilation rate 6, obtaining a feature matrix of size 2000 × 64 × 64 × 256.
4.2 Put the output feature matrix of block13 into a 3 × 3 convolution layer with 256 channels and dilation rate 12, obtaining a feature matrix of size 2000 × 64 × 64 × 256.
4.3 Put the output feature matrix of block13 into a 3 × 3 convolution layer with 256 channels and dilation rate 18, obtaining a feature matrix of size 2000 × 64 × 64 × 256.
4.4 Put the output feature matrix of block13 into a pooling layer with stride 1 and then through a 3 × 3 convolution layer with 256 channels, obtaining a feature matrix of size 2000 × 64 × 64 × 256.
4.5 Combine the outputs of steps 4.1, 4.2, 4.3 and 4.4 to obtain a feature matrix of size 2000 × 64 × 64 × 1024, and pass it through a 3 × 3 convolution layer with 256 channels to obtain a feature matrix of size 2000 × 64 × 64 × 256.
The shallow feature matrices from step (3) are put into step (5) and cross-fused across resolutions by the HRNet module, specifically as follows:
5.1 Upsample the output of step 3.2 by 2× to 2000 × 512 × 512 × 256 and the output of step 3.3 by 4× to 2000 × 512 × 512 × 512; combine these two results with the result of step 3.1 to obtain a 2000 × 512 × 512 × 896 matrix, and pass it through two 3 × 3 convolution layers with 128 channels to obtain a feature matrix of size 2000 × 512 × 512 × 128.
5.2 Downsample the output of step 3.1 by 2× to 2000 × 256 × 256 × 128 and upsample the output of step 3.3 by 2× to 2000 × 256 × 256 × 512; combine these two results with the result of step 3.2 to obtain a 2000 × 256 × 256 × 896 matrix, and pass it through two 3 × 3 convolution layers with 256 channels to obtain a feature matrix of size 2000 × 256 × 256 × 256.
5.3 Downsample the output of step 3.1 by 4× to 2000 × 128 × 128 × 128 and the output of step 3.2 by 2× to 2000 × 128 × 128 × 256; combine these two results with the result of step 3.3 to obtain a 2000 × 128 × 128 × 896 matrix, and pass it through two 3 × 3 convolution layers with 512 channels to obtain a feature matrix of size 2000 × 128 × 128 × 512.
5.4 Upsample the output of step 5.2 by 2× to 2000 × 512 × 512 × 256 and the output of step 5.3 by 4× to 2000 × 512 × 512 × 512; combine these two results with the result of step 5.1 to obtain a 2000 × 512 × 512 × 896 matrix, and pass it through two 3 × 3 convolution layers with 128 channels to obtain a feature matrix of size 2000 × 512 × 512 × 128.
5.5 Downsample the output of step 5.1 by 2× to 2000 × 256 × 256 × 128 and upsample the output of step 5.3 by 2× to 2000 × 256 × 256 × 512; combine these two results with the result of step 5.2 to obtain a 2000 × 256 × 256 × 896 matrix, and pass it through two 3 × 3 convolution layers with 256 channels to obtain a feature matrix of size 2000 × 256 × 256 × 256.
5.6 Downsample the output of step 5.1 by 4× to 2000 × 128 × 128 × 128 and the output of step 5.2 by 2× to 2000 × 128 × 128 × 256; combine these two results with the result of step 5.3 to obtain a 2000 × 128 × 128 × 896 matrix, and pass it through two 3 × 3 convolution layers with 512 channels to obtain a feature matrix of size 2000 × 128 × 128 × 512.
The outputs of steps 4.5, 5.4, 5.5 and 5.6 are sent to step (6), where the feature map is progressively upsampled and enlarged, specifically as follows:
6.1 Put the output of step 5.6 into a 1 × 1 convolution layer with 80 channels to obtain a matrix of size 2000 × 128 × 128 × 80.
6.2 Upsample the output of step 4.5 by 2× and combine it with the output of step 6.1 to obtain a matrix of size 2000 × 128 × 128 × 336; then pass it through a 3 × 3 convolution layer with 256 channels and upsample by 2× to obtain a matrix of size 2000 × 256 × 256 × 256.
6.3 Put the output of step 5.5 into a 1 × 1 convolution layer with 80 channels to obtain a matrix of size 2000 × 256 × 256 × 80.
6.4 Combine the output of step 6.3 with the output of step 6.2 to obtain a matrix of size 2000 × 256 × 256 × 336; then pass it through a 3 × 3 convolution layer with 256 channels and upsample by 2× to obtain a matrix of size 2000 × 512 × 512 × 256.
6.5 Put the output of step 5.4 into a 1 × 1 convolution layer with 80 channels to obtain a matrix of size 2000 × 512 × 512 × 80.
6.6 Combine the output of step 6.5 with the output of step 6.4 to obtain a matrix of size 2000 × 512 × 512 × 336; then pass it through a 3 × 3 convolution layer with 256 channels and upsample by 2× to obtain a matrix of size 2000 × 1024 × 1024 × 256.
6.7 Put the output matrix of step 6.6 into a 9-channel 1 × 1 convolution layer and activate it with a softmax function, obtaining a matrix of size 2000 × 1024 × 1024 × 9.
6.8 Compare the difference between the output matrix and the annotated picture matrix and continuously optimize the network parameters by gradient descent, using the cross-entropy loss function; the final network is obtained after 700 training epochs.
In step (7) the segmented pictures are output, specifically as follows:
7.1 Testing uses 300 test pictures of 2048 × 1024 pixels; each picture is cropped into 2 pictures of 1024 × 1024 pixels.
7.2 The cropped pictures from step 7.1 are fed into the network in one pass to obtain a matrix of size 600 × 1024 × 1024 × 9; taking the argmax over the last dimension (one-hot decoding) reduces it to a matrix of size 600 × 1024 × 1024, i.e. 600 pictures of 1024 × 1024 pixels, in which each pixel carries a label from 0 to 8 representing the 9 classes background, car, person, sky, road, grass, wall, building and pedestrian road.
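Step 7.2 amounts to taking the argmax over the class dimension of the network output; a minimal sketch is shown below, where `model` and `test_crops` are assumed to exist and the class-index order follows the annotation scheme above.

    import numpy as np

    probs = model.predict(test_crops)         # N x 1024 x 1024 x 9 class probabilities
    label_maps = np.argmax(probs, axis=-1)    # N x 1024 x 1024 label map, values 0 to 8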
TABLE 1 Comparison of the method of the invention with other methods

Method      DeepLab V1    DeepLab V2    DeepLab V3+    Method of the invention
Accuracy    79.5%         83.32%        88.48%         90.02%
As can be seen from Table 1, the method of the invention achieves higher accuracy in road image segmentation than the existing mainstream segmentation networks. In particular, its ability to recognize small targets is stronger and its segmentation of object edges is more accurate. The method uses deep learning to discover and classify the distinguishing features of different objects, and can be widely applied in directions such as road recognition and road scene segmentation.
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims (8)

1. A road image segmentation method, comprising:
preprocessing the acquired multiple road images to obtain segmentation pictures;
inputting the segmented picture into a constructed Xception model to extract a deep layer feature map and a shallow layer feature map;
inputting the shallow feature map into the constructed CBAM attention model to amplify the features of the small targets, and inputting the output result into the constructed HRNet module for fusion;
inputting the deep characteristic map into a constructed ASPP pyramid module for pooling;
and fusing the deep feature map or the shallow feature map of the same resolution in the fusion result and the pooling result, and upsampling step by step by a factor of 2 to enlarge it back to the original image size.
2. The road image segmentation method according to claim 1, wherein the preprocessing the acquired road images to obtain segmented pictures comprises:
cutting the road image into 1024 x 1024 pixel pictures, and uniformly storing the pictures into a jpg format;
performing semantic annotation on each picture to obtain a segmented picture;
the semantic annotation content comprises a background, an automobile, a person, the sky, a road, a grassland, a wall, a building and a pedestrian road.
3. The road image segmentation method according to claim 1, wherein the construction of the Xception model comprises:
building a block1 intermediate feature layer, consisting of a 32-channel 3 × 3 convolution layer, a ReLU activation layer, a 64-channel 3 × 3 convolution layer and a ReLU activation layer;
building a block2 intermediate feature layer, consisting of 2 128-channel 3 × 3 depthwise separable convolution layers, a ReLU activation layer and a max pooling layer;
building a block3 intermediate feature layer, consisting of 2 256-channel 3 × 3 depthwise separable convolution layers, a ReLU activation layer and a max pooling layer;
building a block4 intermediate feature layer, consisting of 2 728-channel 3 × 3 depthwise separable convolution layers, a ReLU activation layer and a max pooling layer;
building the block5 to block13 intermediate feature layers, each consisting of 3 728-channel 3 × 3 depthwise separable convolution layers and 3 ReLU activation layers;
wherein the output of the block1 intermediate feature layer is additionally fed into a 1 × 1 convolution layer and the result is added to the output of the block2 intermediate feature layer; the output of the block2 intermediate feature layer is additionally fed into a 1 × 1 convolution layer and the result is added to the output of the block3 intermediate feature layer; and the output of the block3 intermediate feature layer is additionally fed into a 1 × 1 convolution layer and the result is added to the output of the block4 intermediate feature layer.
4. The road image segmentation method of claim 3, wherein the step of inputting the segmented picture into the constructed Xception model to extract the deep feature map and the shallow feature maps comprises: the Xception model extracts the deep feature map of the segmented picture in the block13 intermediate feature layer, and extracts the shallow feature maps of the segmented picture in the block2, block3 and block4 intermediate feature layers.
5. The road image segmentation method according to claim 4, wherein the inputting the shallow feature map into the constructed CBAM attention model to amplify the features of the small targets, and inputting the output result into the constructed HRNet module for fusion comprises:
inputting the shallow feature maps extracted from the block2, block3 and block4 intermediate feature layers into the constructed CBAM attention model to amplify the features of small targets, and outputting out1, out2 and out3;
cross-fusing out1, out2 and out3 by upsampling and downsampling to obtain feature maps at the 3 corresponding resolutions, namely hrout1, hrout2 and hrout3;
wherein a small target is an object whose area in the segmented picture is smaller than 10 × 10 pixels;
the size of hrout2 is 1/2 that of hrout1, and the size of hrout3 is 1/2 that of hrout2.
6. The road image segmentation method according to claim 1, wherein the constructing of the CBAM attention model comprises: constructing a channel attention mechanism and a space attention mechanism;
the channel attention mechanism comprises:
respectively performing maximum pooling and average pooling on the input feature map on channel dimensions to extract the maximum weight and the average weight on each channel;
respectively sending the maximum weight and the average weight to two full-connection layers for classification;
adding the classification results and activating by using a sigmoid function to obtain an importance weight matrix of each channel;
multiplying the importance weight matrix of each channel with the input feature map to obtain the output of amplified channel features;
the maximum pooling takes the maximum of the pixel values in each channel, the average pooling takes the average of the pixel values in each channel, and the sigmoid activation function maps the weights into the range (0, 1), pushing larger input values toward 1 and smaller input values toward 0;
the spatial attention mechanism comprises:
performing primary maximum pooling and primary average pooling on the output of the amplified channel characteristics on the spatial dimension to extract the maximum weight and the average weight of each pixel point;
carrying out convolution operation on the maximum weight and the average weight through a 3-by-3 convolution layer, activating by using a sigmoid function, and outputting an importance weight matrix of each pixel point;
and multiplying the importance weight matrix of each pixel point by the output of the amplified channel characteristic to obtain the output of the amplified pixel point characteristic.
7. The road image segmentation method according to claim 4, wherein the ASPP pyramid module comprises 3 × 3 convolution layers with dilation rates of 6, 12 and 18, respectively, and an average pooling layer with stride 1; the step of inputting the deep feature map into the constructed ASPP pyramid module for pooling comprises:
sending the deep feature map extracted from the block13 intermediate feature layer into a 3 × 3 convolution layer with dilation rate 6 and then through two 1 × 1 convolution layers with stride 1 to output a result;
sending the deep feature map extracted from the block13 intermediate feature layer into a 3 × 3 convolution layer with dilation rate 12 and then through two 1 × 1 convolution layers with stride 1 to output a result;
sending the deep feature map extracted from the block13 intermediate feature layer into a 3 × 3 convolution layer with dilation rate 18 and then through two 1 × 1 convolution layers with stride 1 to output a result;
sending the deep feature map extracted from the block13 intermediate feature layer into an average pooling layer with stride 1 to output a result;
and combining the output results to obtain the final pooled output of the ASPP pyramid module.
8. The road image segmentation method of claim 7, wherein fusing the deep feature map or the shallow feature map of the same resolution in the fusion result and the pooling result, and upsampling step by step by a factor of 2 to enlarge it back to the original image size, comprises:
convolving the pooled output of the ASPP pyramid module once, upsampling by 2× once, and combining with hrout3;
convolving the combined result once, upsampling by 2× once, and combining with hrout2;
convolving the combined result once, upsampling by 2× once, and combining with hrout1;
and convolving the combined result twice, upsampling by 2× once, and activating with a softmax function to obtain the final output.
CN202110706637.2A 2021-06-24 2021-06-24 Road image segmentation method fusing context progressive sampling Active CN113436210B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110706637.2A CN113436210B (en) 2021-06-24 2021-06-24 Road image segmentation method fusing context progressive sampling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110706637.2A CN113436210B (en) 2021-06-24 2021-06-24 Road image segmentation method fusing context progressive sampling

Publications (2)

Publication Number Publication Date
CN113436210A true CN113436210A (en) 2021-09-24
CN113436210B CN113436210B (en) 2022-10-11

Family

ID=77754090

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110706637.2A Active CN113436210B (en) 2021-06-24 2021-06-24 Road image segmentation method fusing context progressive sampling

Country Status (1)

Country Link
CN (1) CN113436210B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114842333A (en) * 2022-04-14 2022-08-02 湖南盛鼎科技发展有限责任公司 Remote sensing image building extraction method, computer equipment and storage medium
CN116935226A (en) * 2023-08-01 2023-10-24 西安电子科技大学 HRNet-based improved remote sensing image road extraction method, system, equipment and medium
CN117789153A (en) * 2024-02-26 2024-03-29 浙江驿公里智能科技有限公司 Automobile oil tank outer cover positioning system and method based on computer vision

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112163449A (en) * 2020-08-21 2021-01-01 同济大学 Lightweight multi-branch feature cross-layer fusion image semantic segmentation method
CN112418027A (en) * 2020-11-11 2021-02-26 青岛科技大学 Remote sensing image road extraction method for improving U-Net network

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112163449A (en) * 2020-08-21 2021-01-01 同济大学 Lightweight multi-branch feature cross-layer fusion image semantic segmentation method
CN112418027A (en) * 2020-11-11 2021-02-26 青岛科技大学 Remote sensing image road extraction method for improving U-Net network

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114842333A (en) * 2022-04-14 2022-08-02 湖南盛鼎科技发展有限责任公司 Remote sensing image building extraction method, computer equipment and storage medium
CN114842333B (en) * 2022-04-14 2022-10-28 湖南盛鼎科技发展有限责任公司 Remote sensing image building extraction method, computer equipment and storage medium
CN116935226A (en) * 2023-08-01 2023-10-24 西安电子科技大学 HRNet-based improved remote sensing image road extraction method, system, equipment and medium
CN117789153A (en) * 2024-02-26 2024-03-29 浙江驿公里智能科技有限公司 Automobile oil tank outer cover positioning system and method based on computer vision
CN117789153B (en) * 2024-02-26 2024-05-03 浙江驿公里智能科技有限公司 Automobile oil tank outer cover positioning system and method based on computer vision

Also Published As

Publication number Publication date
CN113436210B (en) 2022-10-11

Similar Documents

Publication Publication Date Title
CN112541503B (en) Real-time semantic segmentation method based on context attention mechanism and information fusion
CN113362223B (en) Image super-resolution reconstruction method based on attention mechanism and two-channel network
CN113436210B (en) Road image segmentation method fusing context progressive sampling
CN111915592B (en) Remote sensing image cloud detection method based on deep learning
CN112163449B (en) Lightweight multi-branch feature cross-layer fusion image semantic segmentation method
CN111461083A (en) Rapid vehicle detection method based on deep learning
CN110956094A (en) RGB-D multi-mode fusion personnel detection method based on asymmetric double-current network
CN111126379A (en) Target detection method and device
CN112396607A (en) Streetscape image semantic segmentation method for deformable convolution fusion enhancement
CN111160205B (en) Method for uniformly detecting multiple embedded types of targets in traffic scene end-to-end
CN109344818B (en) Light field significant target detection method based on deep convolutional network
CN111640116B (en) Aerial photography graph building segmentation method and device based on deep convolutional residual error network
CN111353544A (en) Improved Mixed Pooling-Yolov 3-based target detection method
CN113554032B (en) Remote sensing image segmentation method based on multi-path parallel network of high perception
CN113313180B (en) Remote sensing image semantic segmentation method based on deep confrontation learning
CN114022408A (en) Remote sensing image cloud detection method based on multi-scale convolution neural network
CN112819000A (en) Streetscape image semantic segmentation system, streetscape image semantic segmentation method, electronic equipment and computer readable medium
CN112560701B (en) Face image extraction method and device and computer storage medium
CN114187520B (en) Building extraction model construction and application method
CN116343043B (en) Remote sensing image change detection method with multi-scale feature fusion function
CN112149526B (en) Lane line detection method and system based on long-distance information fusion
CN115984574B (en) Image information extraction model and method based on cyclic transducer and application thereof
CN113139489A (en) Crowd counting method and system based on background extraction and multi-scale fusion network
CN113139551A (en) Improved semantic segmentation method based on deep Labv3+
CN111583265A (en) Method for realizing phishing behavior detection processing based on codec structure and corresponding semantic segmentation network system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant