CN113658189B - Cross-scale feature fusion real-time semantic segmentation method and system - Google Patents

Cross-scale feature fusion real-time semantic segmentation method and system

Info

Publication number
CN113658189B
CN113658189B (application number CN202111021027.5A)
Authority
CN
China
Prior art keywords
module
output
convolution
stage
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111021027.5A
Other languages
Chinese (zh)
Other versions
CN113658189A (en)
Inventor
许庭兵
魏振忠
罗启峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University
Priority to CN202111021027.5A
Publication of CN113658189A
Application granted
Publication of CN113658189B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a cross-scale feature fusion real-time semantic segmentation method and system. The method comprises the following steps: training a semantic segmentation network model with a training data set to obtain a trained semantic segmentation network model; and inputting an image data set to be segmented into the trained semantic segmentation network model to obtain a semantic segmentation map. By processing the images to be segmented with a semantic segmentation network model whose architecture comprises a backbone network and a feature fusion network, the invention improves both the segmentation accuracy and the inference speed.

Description

Cross-scale feature fusion real-time semantic segmentation method and system
Technical Field
The invention relates to the field of semantic segmentation, in particular to a cross-scale feature fusion real-time semantic segmentation method and system.
Background
Semantic segmentation is a pixel-wise image classification technique that is widely applied in virtual reality, autonomous driving and robotics. In recent years, semantic segmentation models based on deep learning have developed rapidly and segmentation accuracy has improved greatly. However, many application scenarios require real-time semantic segmentation, i.e. at least 30 frames per second (FPS), and many deep models cannot meet this requirement. Although several network models oriented to real-time semantic segmentation have been proposed, they have not yet achieved a good balance between segmentation accuracy and speed. Real-time semantic segmentation therefore remains a challenging problem.
High-precision semantic segmentation models obtain good segmentation results by constructing deep networks with huge numbers of parameters, but such models have high computational complexity and low processing speed, so they cannot meet the requirements of practical applications. In recent years, research on lightweight, real-time semantic segmentation models has been increasing, and the dual-path structure and the lightweight encoder-decoder (codec) structure are the two main model structures. In the dual-path structure, one path progressively downsamples to aggregate contextual semantic information, while the other path always maintains high resolution to preserve spatial detail. Although this structure achieves a reasonable balance between segmentation accuracy and speed, the high-resolution feature path incurs high time and space complexity, which limits further improvement of real-time performance. In the lightweight codec structure, a downsampling path extracts deep semantic information and a symmetric upsampling path propagates the deep semantic features back to the shallow layers; this unidirectional feature transfer cannot effectively fuse detail information with semantic information.
In summary, existing semantic segmentation models suffer either from low segmentation accuracy or from low inference speed.
Disclosure of Invention
The invention aims to provide a cross-scale feature fusion real-time semantic segmentation method and system that improve both the segmentation accuracy and the inference speed.
In order to achieve the purpose, the invention provides the following scheme:
a cross-scale feature fusion real-time semantic segmentation method comprises the following steps:
training a semantic segmentation network model with a training data set to obtain a trained semantic segmentation network model; the training data set is the Cityscapes data set; the semantic segmentation network model comprises: a backbone network and a feature fusion network; the backbone network comprises: a convolution layer, a residual module, a maximum pooling layer and a down-sampling module designed based on the residual module; the feature fusion network comprises 3 fusion paths;
and inputting the image data set to be segmented into the trained semantic segmentation network model to obtain a semantic segmentation map.
Preferably, after the semantic segmentation network model is trained with the training data set to obtain the trained semantic segmentation network model, the method further comprises:
testing the trained semantic segmentation network model with a test set; the test set is the Cityscapes data set.
Preferably, the training of the semantic segmentation network model by using the training data set to obtain the trained semantic segmentation network model specifically includes:
initializing network parameters of the semantic segmentation network model to obtain an initialized network model;
processing the images in the training data set, inputting the processed images into the initialized network model, and iterating for a first preset number of iterations to obtain a first training network model; processing the images in the training data set comprises compressing the images and enlarging the training batch;
inputting the images in the training data set into the first training network model and iterating for a second preset number of iterations to obtain a second training network model; the second training network model is the trained semantic segmentation network model.
Preferably, stage 1 of the backbone network contains 2 standard convolutions; the input of the standard convolutions is the image to be segmented;
the 2nd stage, the 3rd stage and the 4th stage of the backbone network all comprise convolution modules with the same structure; the convolution module is formed by 1 down-sampling module followed by 2 cascaded residual modules; the input of the down-sampling module in the 2nd stage is the output of the standard convolutions in the 1st stage; the input of the down-sampling module in the 3rd stage is the output of the second residual module in the 2nd stage; the input of the down-sampling module in the 4th stage is the output of the second residual module in the 3rd stage; the input of the first residual module in each of the 2nd, 3rd and 4th stages is the output of the corresponding down-sampling module; and the output of the first residual module in each of the 2nd, 3rd and 4th stages is the input of the corresponding second residual module;
the 5th stage and the 6th stage of the backbone network are maximum pooling layers; the input of the maximum pooling layer in the 5th stage is the output of the residual module in the 4th stage; and the input of the maximum pooling layer in the 6th stage is the output of the maximum pooling layer in the 5th stage;
the 3 fusion paths in the feature fusion network each comprise a first convolution module, a second convolution module, a third convolution module and a fourth convolution module; the inputs of the first convolution module in the 1st fusion path are the output of the maximum pooling layer in the 5th stage and the output of the maximum pooling layer in the 6th stage; the inputs of the second convolution module in the 1st fusion path are the output of the first convolution module in the 1st fusion path and the output of the second residual module in the 4th stage; the inputs of the third convolution module in the 1st fusion path are the output of the second convolution module in the 1st fusion path and the output of the second residual module in the 3rd stage; the inputs of the fourth convolution module in the 1st fusion path are the output of the third convolution module in the 1st fusion path and the output of the second residual module in the 2nd stage; the inputs of the first convolution module in the 2nd fusion path are the output of the fourth convolution module in the 1st fusion path and the output of the third convolution module in the 1st fusion path; the inputs of the second convolution module in the 2nd fusion path are the output of the first convolution module in the 2nd fusion path and the output of the second convolution module in the 1st fusion path; the inputs of the third convolution module in the 2nd fusion path are the output of the second convolution module in the 2nd fusion path and the output of the first convolution module in the 1st fusion path; the inputs of the fourth convolution module in the 2nd fusion path are the output of the third convolution module in the 2nd fusion path and the output of the maximum pooling layer in the 6th stage; the inputs of the first convolution module of the 3rd fusion path are the output of the fourth convolution module in the 2nd fusion path, the output of the third convolution module in the 2nd fusion path and the output of the maximum pooling layer in the 5th stage; the inputs of the second convolution module of the 3rd fusion path are the output of the first convolution module of the 3rd fusion path, the output of the second convolution module in the 2nd fusion path and the output of the second residual module in the 4th stage; the inputs of the third convolution module of the 3rd fusion path are the output of the second convolution module of the 3rd fusion path, the output of the first convolution module in the 2nd fusion path and the output of the second residual module in the 3rd stage; and the inputs of the fourth convolution module of the 3rd fusion path are the output of the third convolution module of the 3rd fusion path and the output of the fourth convolution module in the 1st fusion path;
the input of a segmentation head in the segmentation network is the output of the fourth convolution module of the 3rd fusion path; the segmentation head performs 8-fold upsampling on the feature map output by the fourth convolution module of the 3rd fusion path and then takes a maximum value through a softmax function to obtain the semantic segmentation map.
Preferably, the 2 cascaded residual modules are both lightweight residual modules;
the lightweight residual module is provided, in the transfer direction of the feature map, with the following in sequence: a first 1x1 convolution, a first 3x3 channel-by-channel convolution, a second 1x1 convolution, a second 3x3 channel-by-channel convolution, and a third 1x1 convolution; the first 1x1 convolution, the first 3x3 channel-by-channel convolution, the second 1x1 convolution, the second 3x3 channel-by-channel convolution and the third 1x1 convolution are each followed by a batch normalization operation;
an SE module is arranged after the first channel-by-channel convolution; a residual addition operation is arranged after the batch normalization operation of the third 1x1 convolution; and ReLU activation functions are arranged after the residual addition operation and after the batch normalization operation of the second 1x1 convolution.
Preferably, dilated (hole) convolutions with rates of 2, 4 and 8 are added to the lightweight residual modules.
Preferably, the size of the kernel of the maximum pooling layer is 3, and the step size of the maximum pooling layer is 2.
Preferably, the step size of each of the 2 standard convolutions is 2.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
according to the cross-scale feature fusion real-time semantic segmentation method, the semantic segmentation network model with the network architecture comprising the backbone network and the feature fusion network is arranged to process the image to be segmented in the image data set to be segmented, so that the segmentation precision can be improved, and the reasoning speed can be improved.
Corresponding to the provided cross-scale feature fusion real-time semantic segmentation method, the invention also provides the following implementation system:
a cross-scale feature fusion real-time semantic segmentation system, comprising:
the training module is used for training a semantic segmentation network model with a training data set to obtain a trained semantic segmentation network model; the training data set is the Cityscapes data set; the semantic segmentation network model comprises: a backbone network and a feature fusion network; the backbone network comprises: a convolution layer, a residual module, a maximum pooling layer and a down-sampling module designed based on the residual module; the feature fusion network comprises 3 fusion paths;
and the semantic segmentation module is used for inputting the image data set to be segmented into the trained semantic segmentation network model to obtain a semantic segmentation map.
The technical effects achieved by the cross-scale feature fusion real-time semantic segmentation system provided by the invention are the same as those achieved by the cross-scale feature fusion real-time semantic segmentation method provided by the invention, so they are not repeated here.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required by the embodiments are briefly described below. Obviously, the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained from them by those skilled in the art without inventive effort.
FIG. 1 is a flow chart of a cross-scale feature fusion real-time semantic segmentation method provided by the present invention;
FIG. 2 is a schematic structural diagram of a semantic segmentation network model according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a lightweight residual module according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a down-sampling module according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a feature fusion network according to an embodiment of the present invention;
FIG. 6 is a graph comparing the segmentation results provided by the embodiments of the present invention;
FIG. 7 is a schematic structural diagram of a convolution module adopted in the feature fusion network according to the embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a cross-scale feature fusion real-time semantic segmentation system provided by the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
The invention aims to provide a cross-scale feature fusion real-time semantic segmentation method and system that improve both the segmentation accuracy and the inference speed.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
As shown in fig. 1, the cross-scale feature fusion real-time semantic segmentation method provided by the present invention includes:
step 100: and training the semantic segmentation network model by adopting a training data set to obtain the trained semantic segmentation network model. The training data set is a public data set cityscape data set, for example, 5000 finely labeled street scene images of street scenes from 50 different cities can be selected from the data set, 5000 images are divided into 2975 images of the training set, 500 images of the test set and 1525 images of the verification set. The semantic segmentation network model comprises the following steps: a backbone network and a feature fusion network. The backbone network includes: convolutional layers (e.g., standard convolution), residual modules, max-pooling layers, and downsampling modules designed based on the residual modules (as shown in fig. 4). The feature fusion network includes 3 fusion paths, as shown in fig. 5, the 1 st path fuses features from the 6 th to the 2 nd stages of the backbone network from top to bottom. And the 2 nd path fuses the characteristics of the output of each convolution module of the first path from bottom to top. The 3 rd path fuses the output characteristics of the convolution modules of the second path from top to bottom, wherein the specific structure of the convolution modules is shown in fig. 7. While adding a lateral hop connection from the features of the backbone network input to path 3. When the features are fused, firstly, input features of all scales are reformed into the same size, then the features are spliced into a tensor, and the spliced features are output to the next layer after 1 × 1 convolution dimensionality reduction, 3 × 3 channel-by-channel convolution and 1 × 1 convolution.
In the specific implementation process, the specific implementation manner of the step 100 may be:
and initializing the network parameters of the semantic segmentation network model to obtain an initialized network model. The initialization here is random initialization.
After the images in the training data set are processed, they are input into the initialized network model and iterated for a first preset number of iterations (for example, 150K iterations) to obtain a first training network model. Processing the images in the training data set includes compressing the images and enlarging the batch, e.g. downscaling the training images by a factor of 2 and using a larger batch size.
The images in the training data set are then input into the first training network model and iterated for a second preset number of iterations (for example, 100K iterations) to obtain a second training network model. The second training network model is the trained semantic segmentation network model.
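A minimal sketch of this two-stage schedule follows, assuming a generic segmentation model that returns per-pixel class logits; the optimizer, learning rate, loss function and data loaders are illustrative assumptions, and only the "downscaled images with a larger batch first, original size afterwards" split follows the text.

```python
import torch
import torch.nn as nn

def train_two_stage(model, half_res_loader, full_res_loader,
                    first_iters=150_000, second_iters=100_000, device='cuda'):
    """Stage 1: 2x-downscaled images with a larger batch; stage 2: original size."""
    model.to(device).train()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                                momentum=0.9, weight_decay=1e-4)
    criterion = nn.CrossEntropyLoss(ignore_index=255)

    def run(loader, num_iters):
        it = iter(loader)
        for _ in range(num_iters):
            try:
                images, labels = next(it)
            except StopIteration:          # restart the loader when exhausted
                it = iter(loader)
                images, labels = next(it)
            logits = model(images.to(device))        # per-pixel class logits
            loss = criterion(logits, labels.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    run(half_res_loader, first_iters)    # first preset number of iterations
    run(full_res_loader, second_iters)   # second preset number of iterations
    return model
```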
Step 101: input the image data set to be segmented into the trained semantic segmentation network model to obtain the semantic segmentation map.
In the present invention, the specific structure of the employed semantic segmentation network model is shown in fig. 2, wherein the 1st stage of the backbone network includes 2 standard convolutions. The input of the standard convolutions is the image to be segmented.
The 2nd, 3rd and 4th stages of the backbone network all comprise convolution modules of the same structure. The convolution module is 1 down-sampling module followed by 2 cascaded residual modules. The input of the down-sampling module in stage 2 is the output of the standard convolutions in stage 1. The input of the down-sampling module in stage 3 is the output of the second residual module in stage 2. The input of the down-sampling module in stage 4 is the output of the second residual module in stage 3. The input of the first residual module in the 2nd, 3rd and 4th stages is the output of the corresponding down-sampling module. The output of the first residual module in the 2nd, 3rd and 4th stages is the input of the corresponding second residual module.
Both stages 5 and 6 of the backbone network are maximum pooling layers. The input of the maximum pooling layer in stage 5 is the output of the residual module in stage 4. The input of the maximum pooling layer in stage 6 is the output of the maximum pooling layer in stage 5.
The 3 fusion paths in the feature fusion network each comprise a first convolution module, a second convolution module, a third convolution module and a fourth convolution module. The inputs of the first convolution module in the 1st fusion path are the output of the maximum pooling layer in the 5th stage and the output of the maximum pooling layer in the 6th stage. The inputs of the second convolution module in the 1st fusion path are the output of the first convolution module in the 1st fusion path and the output of the second residual module in the 4th stage. The inputs of the third convolution module in the 1st fusion path are the output of the second convolution module in the 1st fusion path and the output of the second residual module in the 3rd stage. The inputs of the fourth convolution module in the 1st fusion path are the output of the third convolution module in the 1st fusion path and the output of the second residual module in the 2nd stage. The inputs of the first convolution module in the 2nd fusion path are the output of the fourth convolution module in the 1st fusion path and the output of the third convolution module in the 1st fusion path. The inputs of the second convolution module in the 2nd fusion path are the output of the first convolution module in the 2nd fusion path and the output of the second convolution module in the 1st fusion path. The inputs of the third convolution module in the 2nd fusion path are the output of the second convolution module in the 2nd fusion path and the output of the first convolution module in the 1st fusion path. The inputs of the fourth convolution module in the 2nd fusion path are the output of the third convolution module in the 2nd fusion path and the output of the maximum pooling layer in the 6th stage. The inputs of the first convolution module of the 3rd fusion path are the output of the fourth convolution module in the 2nd fusion path, the output of the third convolution module in the 2nd fusion path and the output of the maximum pooling layer in the 5th stage. The inputs of the second convolution module of the 3rd fusion path are the output of the first convolution module of the 3rd fusion path, the output of the second convolution module in the 2nd fusion path and the output of the second residual module in the 4th stage. The inputs of the third convolution module of the 3rd fusion path are the output of the second convolution module of the 3rd fusion path, the output of the first convolution module in the 2nd fusion path and the output of the second residual module in the 3rd stage. The inputs of the fourth convolution module of the 3rd fusion path are the output of the third convolution module of the 3rd fusion path and the output of the fourth convolution module in the 1st fusion path.
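The wiring of the three fusion paths described above can be sketched as follows, reusing the FusionConvModule class sketched earlier; the common channel width C and the spatial scale at which each fusion is performed are assumptions, while the connection pattern follows the description.

```python
import torch.nn as nn

class CrossScaleFusion(nn.Module):
    """Three fusion paths over backbone outputs s2..s6 (all assumed to have C channels)."""
    def __init__(self, C=128):
        super().__init__()
        # path 1: top-down, from the 6th stage back to the 2nd stage
        self.p1 = nn.ModuleList(FusionConvModule(2 * C, C) for _ in range(4))
        # path 2: bottom-up over the path-1 outputs
        self.p2 = nn.ModuleList(FusionConvModule(2 * C, C) for _ in range(4))
        # path 3: top-down over the path-2 outputs, with lateral skips from the backbone
        self.p3 = nn.ModuleList([FusionConvModule(3 * C, C), FusionConvModule(3 * C, C),
                                 FusionConvModule(3 * C, C), FusionConvModule(2 * C, C)])

    def forward(self, s2, s3, s4, s5, s6):
        sz = lambda t: t.shape[2:]
        # path 1 (coarse -> fine)
        a1 = self.p1[0]([s6, s5], sz(s5))
        a2 = self.p1[1]([a1, s4], sz(s4))
        a3 = self.p1[2]([a2, s3], sz(s3))
        a4 = self.p1[3]([a3, s2], sz(s2))
        # path 2 (fine -> coarse)
        b1 = self.p2[0]([a4, a3], sz(s3))
        b2 = self.p2[1]([b1, a2], sz(s4))
        b3 = self.p2[2]([b2, a1], sz(s5))
        b4 = self.p2[3]([b3, s6], sz(s6))
        # path 3 (coarse -> fine, with skip connections from the backbone)
        c1 = self.p3[0]([b4, b3, s5], sz(s5))
        c2 = self.p3[1]([c1, b2, s4], sz(s4))
        c3 = self.p3[2]([c2, b1, s3], sz(s3))
        c4 = self.p3[3]([c3, a4], sz(s2))
        return c4   # fused feature map at 1/8 resolution, fed to the segmentation head
```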
The input of the segmentation head in the segmentation network is the output of the fourth convolution module of the 3rd fusion path. The segmentation head performs 8-fold upsampling on the feature map output by the fourth convolution module of the 3rd fusion path and then obtains the semantic segmentation map by taking the maximum value through a softmax function.
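A possible form of this segmentation head is sketched below; the 1 × 1 classifier convolution and the 19 Cityscapes classes are assumptions, while the 8-fold upsampling and the softmax/argmax step follow the description.

```python
import torch.nn as nn
import torch.nn.functional as F

class SegHead(nn.Module):
    """Classifier conv -> 8x bilinear upsampling; the class map is the per-pixel
    argmax of the softmax over the upsampled logits."""
    def __init__(self, in_channels=128, num_classes=19):
        super().__init__()
        self.classifier = nn.Conv2d(in_channels, num_classes, 1)

    def forward(self, fused):
        logits = self.classifier(fused)
        return F.interpolate(logits, scale_factor=8, mode='bilinear',
                             align_corners=False)

# Usage: per-pixel label map from the upsampled logits.
# pred = SegHead()(fused_features).softmax(dim=1).argmax(dim=1)
```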
In order to further improve the accuracy and speed of image semantic segmentation, the invention designs the 2 cascaded residual modules as lightweight residual modules based on depthwise separable convolution. As shown in fig. 3, the lightweight residual module is provided in sequence with: a first 1x1 convolution, a first 3x3 channel-by-channel convolution, a second 1x1 convolution, a second 3x3 channel-by-channel convolution and a third 1x1 convolution, each followed by a batch normalization operation. Based on this lightweight residual structure, the down-sampling module is formed by adding a parallel branch consisting of a 1x1 convolution, a channel-by-channel convolution with a stride of 2 and a 1x1 convolution.
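The following PyTorch sketch illustrates one possible realization of the lightweight residual module and the down-sampling module; the channel expansion ratio, the SE reduction ratio and the exact activation placement are assumptions based on this description (batch normalization after every convolution, an optional SE module after the first channel-by-channel convolution, and ReLU after the second 1x1 convolution and after the residual addition).

```python
import torch.nn as nn

def conv_bn(in_ch, out_ch, k, stride=1, groups=1, dilation=1):
    pad = dilation * (k - 1) // 2
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, stride, pad, dilation=dilation,
                  groups=groups, bias=False),
        nn.BatchNorm2d(out_ch))

class SEBlock(nn.Module):
    def __init__(self, ch, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.fc(x)   # channel-wise recalibration

class LightResidualBlock(nn.Module):
    """1x1 -> 3x3 depthwise (+SE) -> 1x1 -> 3x3 depthwise -> 1x1, with a shortcut."""
    def __init__(self, ch, expand=2, dilation=1, use_se=True):
        super().__init__()
        mid = ch * expand
        self.pw1 = conv_bn(ch, mid, 1)
        self.dw1 = conv_bn(mid, mid, 3, groups=mid, dilation=dilation)
        self.se = SEBlock(mid) if use_se else nn.Identity()
        self.pw2 = conv_bn(mid, mid, 1)
        self.dw2 = conv_bn(mid, mid, 3, groups=mid, dilation=dilation)
        self.pw3 = conv_bn(mid, ch, 1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        y = self.se(self.dw1(self.pw1(x)))
        y = self.act(self.pw2(y))
        y = self.pw3(self.dw2(y))
        return self.act(x + y)          # residual addition, then ReLU

class DownsampleBlock(nn.Module):
    """Lightweight residual body with a stride-2 first depthwise convolution,
    plus a parallel 1x1 -> stride-2 depthwise -> 1x1 shortcut branch."""
    def __init__(self, in_ch, out_ch, expand=2):
        super().__init__()
        mid = in_ch * expand
        self.body = nn.Sequential(
            conv_bn(in_ch, mid, 1),
            conv_bn(mid, mid, 3, stride=2, groups=mid),
            conv_bn(mid, mid, 1), nn.ReLU(inplace=True),
            conv_bn(mid, mid, 3, groups=mid),
            conv_bn(mid, out_ch, 1))
        self.branch = nn.Sequential(
            conv_bn(in_ch, in_ch, 1),
            conv_bn(in_ch, in_ch, 3, stride=2, groups=in_ch),
            conv_bn(in_ch, out_ch, 1))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.body(x) + self.branch(x))
```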
After the backbone network is constructed from the designed lightweight residual modules and down-sampling modules, the input image is first processed with 2 conventional 3 × 3 convolutions with a stride of 2; conventional convolution performs better here because the input image has a large resolution and a small number of channels. The subsequent 3 stages (stage 2, stage 3 and stage 4) are identical in structure, i.e. each comprises 1 down-sampling module and 2 lightweight residual modules. Dilated convolutions with rates of 2, 4 and 8 are added to the 2nd lightweight residual module of these stages respectively, and an SE module is added after its 1st channel-by-channel convolution. The last 2 layers of the backbone network are maximum pooling layers with a kernel size of 3 and a stride of 2. The size of the final feature map is 1/128 of the input image. The specific parameters of each structure in the backbone network are shown in Table 1.
TABLE 1 (parameters of each structure in the backbone network; the table contents are provided as an image in the original publication and are not reproduced here)
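Since Table 1 is only available as an image, the following sketch assembles the six backbone stages with assumed channel widths, reusing the LightResidualBlock and DownsampleBlock classes sketched above; the stage resolutions (1/4 to 1/128) and the placement of the dilated, SE-equipped block follow the description.

```python
import torch.nn as nn

def std_conv(in_ch, out_ch):
    # conventional 3x3 convolution with a stride of 2
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, 2, 1, bias=False),
                         nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

def stage(in_ch, out_ch, dilation):
    # 1 down-sampling module followed by 2 lightweight residual modules;
    # the 2nd residual module carries the dilated convolutions and the SE module.
    return nn.Sequential(
        DownsampleBlock(in_ch, out_ch),
        LightResidualBlock(out_ch, use_se=False),
        LightResidualBlock(out_ch, dilation=dilation, use_se=True))

class CSFBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(std_conv(3, 32), std_conv(32, 32))   # 1/4
        self.stage2 = stage(32, 64, dilation=2)                          # 1/8
        self.stage3 = stage(64, 128, dilation=4)                         # 1/16
        self.stage4 = stage(128, 256, dilation=8)                        # 1/32
        self.pool5 = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)    # 1/64
        self.pool6 = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)    # 1/128

    def forward(self, x):
        s1 = self.stage1(x)
        s2 = self.stage2(s1)
        s3 = self.stage3(s2)
        s4 = self.stage4(s3)
        s5 = self.pool5(s4)
        s6 = self.pool6(s5)
        return s2, s3, s4, s5, s6    # outputs fed to the cross-scale fusion network
```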
Based on the specific structure of the semantic segmentation network model provided above, in the process of testing the trained semantic segmentation network model, the Cityscapes data set is selected as the test set, and the specific test process is as follows:
the image in the test data set is input into a semantic segmentation network model, and a backbone network is divided into 6 stages to process the input image. The stage 1 of the backbone network comprises 2 standard convolutions, the step size is 2, and the downsampling operation is carried out while the input image is processed, so that the size of the characteristic image is reduced. Stages 2 to 4 have the same convolution module, i.e. 1 down-sampling module followed by two lightweight residual modules. Each downsampling module compresses the feature map size output by the previous module and enlarges the number of feature channels by a factor of 2. The lightweight residual error module is formed by adopting deep separable convolution, and can give consideration to the processing speed and precision of the network. The last 2 stages of the backbone network are the maximum pooling layers, the maximum pooling layer kernel size is 3, the step size is 2, the size of the final feature map is 1/128 of the input image, the image has a large enough receptive field, and the local maximum response can be obtained. Then the inputs of the 2 nd to 6 th stages of the backbone convolutional network are respectively sent to a cross-scale feature fusion module. The cross-scale feature fusion module outputs fusion features after passing through three paths from top to bottom, from bottom to top and from top to bottom, up-sampling is carried out for 8 times through a final segmentation head, and a corresponding label category information is obtained by taking a maximum value through a softmax function to obtain a semantic segmentation graph.
The following describes a specific implementation process of the cross-scale feature fusion real-time semantic segmentation method provided by the invention, taking the public Cityscapes data set as the evaluation data set as an example.
Step 1: training on large-size images is divided into two steps. In the first step, the network parameters are initialized randomly, the training images are downscaled by a factor of 2, the batch size is enlarged, and 150K iterations are run to obtain an intermediate training result for the convolutional network. In the second step, training continues in small batches with the original image size for 100K iterations to obtain the final convolutional network training result.
Step 2: a 3-channel color image is input into the semantic segmentation network model, and the backbone network processes the input image in 6 stages. Stage 1 of the backbone network comprises 2 standard convolutions with a stride of 2, which down-sample the input image while processing it and reduce the size of the feature map. Stages 2 to 4 have the same convolution module, i.e. 1 down-sampling module followed by two lightweight residual modules. Each down-sampling module compresses the size of the feature map output by the previous module and doubles the number of feature channels. The lightweight residual modules are built from depthwise separable convolutions, which balances the processing speed and accuracy of the network. The last 2 stages of the backbone network are maximum pooling layers with a kernel size of 3 and a stride of 2; the size of the final feature map is 1/128 of the input image, the receptive field over the input image is large enough, and local maximum responses can be obtained. The outputs of the 2nd to 6th stages of the backbone network are then sent to the cross-scale feature fusion module. The cross-scale feature fusion module outputs fused features after the three top-down, bottom-up and top-down paths, the final segmentation head performs 8-fold upsampling, and the maximum value is taken through the softmax function to obtain the corresponding label category information and the semantic segmentation map.
In this example, the airplane data set mentioned above and the public Cityscapes data set are respectively used as evaluation data sets. All experimental results were obtained on a single NVIDIA GeForce RTX 2080 Ti GPU.
The most common real-time semantic segmentation metrics are used in this example: segmentation accuracy is measured by the mean intersection over union (mIoU), and inference speed is measured by the number of image frames processed per second (FPS); the number of parameters and the amount of computation are also used as comparison metrics. For a real-time semantic segmentation model the inference speed must be at least 30 FPS; on this basis, the higher the segmentation accuracy, the faster the inference speed, and the smaller the number of parameters and the amount of computation, the better the overall performance of the model.
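For reference, the two metrics can be computed as in the following sketch; the ignore label of 255 and the warm-up/timing procedure are common conventions rather than details taken from the patent, and pred/label are assumed to be NumPy arrays of class indices.

```python
import time
import numpy as np
import torch

def confusion_matrix(pred, label, num_classes, ignore_index=255):
    # Accumulate a num_classes x num_classes confusion matrix for one image.
    mask = label != ignore_index
    idx = num_classes * label[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def mean_iou(conf):
    inter = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    iou = inter / np.maximum(union, 1)
    return iou[union > 0].mean()          # average over classes that appear

def measure_fps(model, input_size=(1, 3, 1024, 2048), runs=100, device='cuda'):
    model.to(device).eval()
    x = torch.randn(*input_size, device=device)
    with torch.no_grad():
        for _ in range(10):               # warm-up iterations
            model(x)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(runs):
            model(x)
        torch.cuda.synchronize()
    return runs / (time.time() - start)
```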
To demonstrate that the cross-scale feature fusion real-time semantic segmentation method provided by the invention offers excellent accuracy and speed, it is compared with several recently published real-time semantic segmentation methods on the Cityscapes data set.
Table 2 shows the test results of the invention and other real-time semantic segmentation methods on the Cityscapes data set. It can be seen that the segmentation accuracy of the invention is the best while the requirement of at least 30 frames per second for real-time semantic segmentation is met, reaching 46.5 FPS. Although the models SFNet and CABiNet run faster than the invention, their segmentation accuracy, number of parameters and amount of computation are inferior; and although other models have fewer parameters, less computation and faster inference, their segmentation accuracy cannot match that of the invention. The invention therefore achieves the best balance between speed and accuracy.
TABLE 2 Comparison results on the Cityscapes data set (the table contents are provided as an image in the original publication and are not reproduced here)
Comparison of the lightweight residual module:
The designed lightweight residual module is compared with the inverted residual module of MobileNetV2 on the semantic segmentation network structure designed by the invention: the lightweight residual module and the inverted residual module are used respectively to construct the backbone network of the model, the same number of training rounds is run under the same hyper-parameter configuration, and the segmentation accuracy (mIoU), amount of computation (GFLOPs), number of parameters and inference speed of the two models are compared in Table 3.
TABLE 3 (comparison of the lightweight residual module and the MobileNetV2 inverted residual module; the table contents are provided as an image in the original publication)
Effectiveness of the cross-scale feature fusion module:
the cross-scale feature fusion module is mainly characterized by having jump connection from input to the 3 rd path from top to bottom. For this structure, compared to the structure with no jump connection, only jump connection to item 2, and jump connection structure containing to items 2 and 3, as shown in table 4, the designed cross-scale feature fusion module has the best performance when only jump connection to item 3 (adopted).
TABLE 4 (ablation of the skip connections in the cross-scale feature fusion module; the table contents are provided as an image in the original publication)
The segmentation results obtained with the cross-scale feature fusion real-time semantic segmentation method provided by the invention are shown in fig. 6, where the 1st column of fig. 6 shows the input images, the 2nd column the outputs of the cross-scale feature fusion network, the 3rd column the outputs of the network without cross-scale connections, and the 4th column the annotated (ground-truth) segmentation images.
In summary, the technical scheme provided by the invention adopts a lightweight real-time semantic segmentation convolutional neural network model, a lightweight residual module composed of depthwise separable convolution layers and linear bottleneck layers, and a cross-scale feature fusion module. It achieves real-time, high-accuracy semantic segmentation on large-size images of urban street scenes and large airplanes.
In addition, corresponding to the cross-scale feature fusion real-time semantic segmentation method provided above, the invention also provides a cross-scale feature fusion real-time semantic segmentation system. As shown in fig. 8, the system comprises a training module 1 and a semantic segmentation module 2.
The training module 1 is used to train the semantic segmentation network model with a training data set to obtain a trained semantic segmentation network model. The training data set is the Cityscapes data set. The semantic segmentation network model comprises a backbone network and a feature fusion network. The backbone network comprises a convolution layer, a residual module, a maximum pooling layer and a down-sampling module designed based on the residual module. The feature fusion network comprises 3 fusion paths.
The semantic segmentation module 2 is used for inputting the image data set to be segmented into the trained semantic segmentation network model to obtain a semantic segmentation map.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (8)

1. A cross-scale feature fusion real-time semantic segmentation method is characterized by comprising the following steps:
training a semantic segmentation network model with a training data set to obtain a trained semantic segmentation network model; the training data set is the Cityscapes data set; the semantic segmentation network model comprises: a backbone network and a feature fusion network; the backbone network comprises: a convolution layer, a residual module, a maximum pooling layer and a down-sampling module designed based on the residual module; the feature fusion network comprises 3 fusion paths;
inputting an image data set to be segmented into a trained semantic segmentation network model to obtain a semantic segmentation map;
wherein stage 1 of the backbone network comprises 2 standard convolutions; the input of the standard convolution is an image to be segmented;
the 2nd stage, the 3rd stage and the 4th stage of the backbone network all comprise convolution modules with the same structure; the convolution module is formed by 1 down-sampling module followed by 2 cascaded residual modules; the input of the down-sampling module in the 2nd stage is the output of the standard convolutions in the 1st stage; the input of the down-sampling module in the 3rd stage is the output of the second residual module in the 2nd stage; the input of the down-sampling module in the 4th stage is the output of the second residual module in the 3rd stage; the input of the first residual module in each of the 2nd, 3rd and 4th stages is the output of the corresponding down-sampling module; and the output of the first residual module in each of the 2nd, 3rd and 4th stages is the input of the corresponding second residual module;
the 5th stage and the 6th stage of the backbone network are maximum pooling layers; the input of the maximum pooling layer in the 5th stage is the output of the residual module in the 4th stage; and the input of the maximum pooling layer in the 6th stage is the output of the maximum pooling layer in the 5th stage;
the 3 fusion paths in the feature fusion network each comprise a first convolution module, a second convolution module, a third convolution module and a fourth convolution module; the inputs of the first convolution module in the 1st fusion path are the output of the maximum pooling layer in the 5th stage and the output of the maximum pooling layer in the 6th stage; the inputs of the second convolution module in the 1st fusion path are the output of the first convolution module in the 1st fusion path and the output of the second residual module in the 4th stage; the inputs of the third convolution module in the 1st fusion path are the output of the second convolution module in the 1st fusion path and the output of the second residual module in the 3rd stage; the inputs of the fourth convolution module in the 1st fusion path are the output of the third convolution module in the 1st fusion path and the output of the second residual module in the 2nd stage; the inputs of the first convolution module in the 2nd fusion path are the output of the fourth convolution module in the 1st fusion path and the output of the third convolution module in the 1st fusion path; the inputs of the second convolution module in the 2nd fusion path are the output of the first convolution module in the 2nd fusion path and the output of the second convolution module in the 1st fusion path; the inputs of the third convolution module in the 2nd fusion path are the output of the second convolution module in the 2nd fusion path and the output of the first convolution module in the 1st fusion path; the inputs of the fourth convolution module in the 2nd fusion path are the output of the third convolution module in the 2nd fusion path and the output of the maximum pooling layer in the 6th stage; the inputs of the first convolution module of the 3rd fusion path are the output of the fourth convolution module in the 2nd fusion path, the output of the third convolution module in the 2nd fusion path and the output of the maximum pooling layer in the 5th stage; the inputs of the second convolution module of the 3rd fusion path are the output of the first convolution module of the 3rd fusion path, the output of the second convolution module in the 2nd fusion path and the output of the second residual module in the 4th stage; the inputs of the third convolution module of the 3rd fusion path are the output of the second convolution module of the 3rd fusion path, the output of the first convolution module in the 2nd fusion path and the output of the second residual module in the 3rd stage; and the inputs of the fourth convolution module of the 3rd fusion path are the output of the third convolution module of the 3rd fusion path and the output of the fourth convolution module in the 1st fusion path;
the input of a segmentation head in the segmentation network is the output of the fourth convolution module of the 3rd fusion path; the segmentation head performs 8-fold upsampling on the feature map output by the fourth convolution module of the 3rd fusion path and then takes a maximum value through a softmax function to obtain the semantic segmentation map.
2. The cross-scale feature fusion real-time semantic segmentation method according to claim 1, wherein after the semantic segmentation network model is trained with the training data set to obtain the trained semantic segmentation network model, the method further comprises:
testing the trained semantic segmentation network model with a test set; the test set is the Cityscapes data set.
3. The cross-scale feature fusion real-time semantic segmentation method according to claim 1, wherein training the semantic segmentation network model with a training data set to obtain a trained semantic segmentation network model specifically comprises:
initializing network parameters of the semantic segmentation network model to obtain an initialized network model;
processing the images in the training data set, inputting the processed images into the initialized network model, and iterating for a first preset number of iterations to obtain a first training network model; processing the images in the training data set comprises compressing the images and enlarging the training batch;
inputting the images in the training data set into the first training network model and iterating for a second preset number of iterations to obtain a second training network model; the second training network model is the trained semantic segmentation network model.
4. The cross-scale feature fusion real-time semantic segmentation method according to claim 1, wherein the 2 cascaded residual modules are both lightweight residual modules;
the lightweight residual module is provided, in the transfer direction of the feature map, with the following in sequence: a first 1x1 convolution, a first 3x3 channel-by-channel convolution, a second 1x1 convolution, a second 3x3 channel-by-channel convolution, and a third 1x1 convolution; the first 1x1 convolution, the first 3x3 channel-by-channel convolution, the second 1x1 convolution, the second 3x3 channel-by-channel convolution and the third 1x1 convolution are each followed by a batch normalization operation;
an SE module is arranged after the first 3x3 channel-by-channel convolution; a residual addition operation is arranged after the batch normalization operation of the second 1x1 convolution; and ReLU activation functions are arranged after the residual addition operation and after the batch normalization operation of the second 1x1 convolution.
5. The cross-scale feature fusion real-time semantic segmentation method according to claim 4, wherein dilated (hole) convolutions with rates of 2, 4 and 8 are added to the lightweight residual modules.
6. The method according to claim 1, wherein the size of the kernel of the maximum pooling layer is 3, and the step size of the maximum pooling layer is 2.
7. The method according to claim 1, wherein the step size of each of the 2 standard convolutions is 2.
8. A cross-scale feature fusion real-time semantic segmentation system, comprising:
the training module is used for training a semantic segmentation network model with a training data set to obtain a trained semantic segmentation network model; the training data set is the Cityscapes data set; the semantic segmentation network model comprises: a backbone network and a feature fusion network; the backbone network comprises: a convolution layer, a residual module, a maximum pooling layer and a down-sampling module designed based on the residual module; the feature fusion network comprises 3 fusion paths;
the semantic segmentation module is used for inputting the image data set to be segmented into the trained semantic segmentation network model to obtain a semantic segmentation map;
wherein stage 1 of the backbone network comprises 2 standard convolutions; the input of the standard convolution is an image to be segmented;
the 2nd stage, the 3rd stage and the 4th stage of the backbone network all comprise convolution modules with the same structure; the convolution module is formed by 1 down-sampling module followed by 2 cascaded residual modules; the input of the down-sampling module in the 2nd stage is the output of the standard convolutions in the 1st stage; the input of the down-sampling module in the 3rd stage is the output of the second residual module in the 2nd stage; the input of the down-sampling module in the 4th stage is the output of the second residual module in the 3rd stage; the input of the first residual module in each of the 2nd, 3rd and 4th stages is the output of the corresponding down-sampling module; and the output of the first residual module in each of the 2nd, 3rd and 4th stages is the input of the corresponding second residual module;
the 5th stage and the 6th stage of the backbone network are maximum pooling layers; the input of the maximum pooling layer in the 5th stage is the output of the residual module in the 4th stage; and the input of the maximum pooling layer in the 6th stage is the output of the maximum pooling layer in the 5th stage;
the 3 fusion paths in the feature fusion network each comprise a first convolution module, a second convolution module, a third convolution module and a fourth convolution module; the inputs of the first convolution module in the 1st fusion path are the output of the maximum pooling layer in the 5th stage and the output of the maximum pooling layer in the 6th stage; the inputs of the second convolution module in the 1st fusion path are the output of the first convolution module in the 1st fusion path and the output of the second residual module in the 4th stage; the inputs of the third convolution module in the 1st fusion path are the output of the second convolution module in the 1st fusion path and the output of the second residual module in the 3rd stage; the inputs of the fourth convolution module in the 1st fusion path are the output of the third convolution module in the 1st fusion path and the output of the second residual module in the 2nd stage; the inputs of the first convolution module in the 2nd fusion path are the output of the fourth convolution module in the 1st fusion path and the output of the third convolution module in the 1st fusion path; the inputs of the second convolution module in the 2nd fusion path are the output of the first convolution module in the 2nd fusion path and the output of the second convolution module in the 1st fusion path; the inputs of the third convolution module in the 2nd fusion path are the output of the second convolution module in the 2nd fusion path and the output of the first convolution module in the 1st fusion path; the inputs of the fourth convolution module in the 2nd fusion path are the output of the third convolution module in the 2nd fusion path and the output of the maximum pooling layer in the 6th stage; the inputs of the first convolution module of the 3rd fusion path are the output of the fourth convolution module in the 2nd fusion path, the output of the third convolution module in the 2nd fusion path and the output of the maximum pooling layer in the 5th stage; the inputs of the second convolution module of the 3rd fusion path are the output of the first convolution module of the 3rd fusion path, the output of the second convolution module in the 2nd fusion path and the output of the second residual module in the 4th stage; the inputs of the third convolution module of the 3rd fusion path are the output of the second convolution module of the 3rd fusion path, the output of the first convolution module in the 2nd fusion path and the output of the second residual module in the 3rd stage; and the inputs of the fourth convolution module of the 3rd fusion path are the output of the third convolution module of the 3rd fusion path and the output of the fourth convolution module in the 1st fusion path;
the input of a segmentation head in the segmentation network is the output of the fourth convolution module of the 3rd fusion path; the segmentation head performs 8-fold upsampling on the feature map output by the fourth convolution module of the 3rd fusion path and then takes a maximum value through a softmax function to obtain the semantic segmentation map.
CN202111021027.5A 2021-09-01 2021-09-01 Cross-scale feature fusion real-time semantic segmentation method and system Active CN113658189B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111021027.5A CN113658189B (en) 2021-09-01 2021-09-01 Cross-scale feature fusion real-time semantic segmentation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111021027.5A CN113658189B (en) 2021-09-01 2021-09-01 Cross-scale feature fusion real-time semantic segmentation method and system

Publications (2)

Publication Number Publication Date
CN113658189A CN113658189A (en) 2021-11-16
CN113658189B true CN113658189B (en) 2022-03-11

Family

ID=78481649

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111021027.5A Active CN113658189B (en) 2021-09-01 2021-09-01 Cross-scale feature fusion real-time semantic segmentation method and system

Country Status (1)

Country Link
CN (1) CN113658189B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114120154B (en) * 2021-11-23 2022-10-28 宁波大学 Automatic detection method for breakage of glass curtain wall of high-rise building
CN114612456B (en) * 2022-03-21 2023-01-10 北京科技大学 Billet automatic semantic segmentation recognition method based on deep learning
CN114943835B (en) * 2022-04-20 2024-03-12 西北工业大学 Real-time semantic segmentation method for yellow river ice unmanned aerial vehicle aerial image


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111062395A (en) * 2019-11-27 2020-04-24 北京理工大学 Real-time video semantic segmentation method
CN111080648A (en) * 2019-12-02 2020-04-28 南京理工大学 Real-time image semantic segmentation algorithm based on residual learning
CN111666948A (en) * 2020-05-27 2020-09-15 厦门大学 Real-time high-performance semantic segmentation method and device based on multi-path aggregation
CN112381097A (en) * 2020-11-16 2021-02-19 西南石油大学 Scene semantic segmentation method based on deep learning
CN113256649A (en) * 2021-05-11 2021-08-13 国网安徽省电力有限公司经济技术研究院 Remote sensing image station selection and line selection semantic segmentation method based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Multi-Path Dilated Residual Network for Nuclei Segmentation and Detection; Eric Ke Wang et al.; Cells; 2019-05-23; pp. 1-19 *

Also Published As

Publication number Publication date
CN113658189A (en) 2021-11-16

Similar Documents

Publication Publication Date Title
CN113658189B (en) Cross-scale feature fusion real-time semantic segmentation method and system
CN110782462B (en) Semantic segmentation method based on double-flow feature fusion
CN107704866B (en) Multitask scene semantic understanding model based on novel neural network and application thereof
CN111259983B (en) Image semantic segmentation method based on deep learning and storage medium
CN111598183B (en) Multi-feature fusion image description method
CN110569851B (en) Real-time semantic segmentation method for gated multi-layer fusion
CN113870422B (en) Point cloud reconstruction method, device, equipment and medium
CN116721334B (en) Training method, device, equipment and storage medium of image generation model
CN114119975A (en) Language-guided cross-modal instance segmentation method
CN111899169B (en) Method for segmenting network of face image based on semantic segmentation
CN110782458B (en) Object image 3D semantic prediction segmentation method of asymmetric coding network
CN113066089A (en) Real-time image semantic segmentation network based on attention guide mechanism
KR102128789B1 (en) Method and apparatus for providing efficient dilated convolution technique for deep convolutional neural network
CN116703947A (en) Image semantic segmentation method based on attention mechanism and knowledge distillation
CN111160378A (en) Depth estimation system based on single image multitask enhancement
CN112418235A (en) Point cloud semantic segmentation method based on expansion nearest neighbor feature enhancement
CN116485860A (en) Monocular depth prediction algorithm based on multi-scale progressive interaction and aggregation cross attention features
Yu et al. A review of single image super-resolution reconstruction based on deep learning
CN116993987A (en) Image semantic segmentation method and system based on lightweight neural network model
WO2020093210A1 (en) Scene segmentation method and system based on contenxtual information guidance
CN113255675B (en) Image semantic segmentation network structure and method based on expanded convolution and residual path
CN111553921B (en) Real-time semantic segmentation method based on channel information sharing residual error module
CN112488115B (en) Semantic segmentation method based on two-stream architecture
CN113139463A (en) Method, apparatus, device, medium and program product for training a model
Gan et al. Image super-resolution reconstruction based on deep residual network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant