CN113658189B - Cross-scale feature fusion real-time semantic segmentation method and system - Google Patents

Cross-scale feature fusion real-time semantic segmentation method and system

Info

Publication number
CN113658189B
CN113658189B (application number CN202111021027.5A)
Authority
CN
China
Prior art keywords
module
output
convolution
stage
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111021027.5A
Other languages
Chinese (zh)
Other versions
CN113658189A (en)
Inventor
许庭兵
魏振忠
罗启峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University
Priority to CN202111021027.5A
Publication of CN113658189A
Application granted
Publication of CN113658189B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a cross-scale feature fusion real-time semantic segmentation method and system. The method comprises the following steps: training a semantic segmentation network model with a training data set to obtain a trained semantic segmentation network model; and inputting an image data set to be segmented into the trained semantic segmentation network model to obtain a semantic segmentation map. By processing the images to be segmented with a semantic segmentation network model whose architecture comprises a backbone network and a feature fusion network, the invention improves both the segmentation accuracy and the inference speed.

Description

Cross-scale feature fusion real-time semantic segmentation method and system
Technical Field
The invention relates to the field of semantic segmentation, in particular to a cross-scale feature fusion real-time semantic segmentation method and system.
Background
Semantic segmentation is a pixel-wise image classification technique that is widely applied in virtual reality, autonomous driving and robotics. In recent years, semantic segmentation models based on deep learning have developed rapidly and segmentation accuracy has improved greatly. However, many application scenarios require real-time semantic segmentation, i.e. at least 30 frames per second (FPS), and many deep models cannot meet this requirement. Although several network models oriented to real-time semantic segmentation have been proposed, they have not yet achieved a good balance between segmentation accuracy and speed. Real-time semantic segmentation therefore remains a challenging problem.
High-precision semantic segmentation models obtain good segmentation results by constructing deep networks with huge numbers of parameters, but such models have high computational complexity and low processing speed, so they cannot meet the requirements of practical applications. In recent years, research on lightweight, real-time semantic segmentation models has been increasing, and the dual-path structure and the lightweight encoder-decoder (codec) structure are the two main model structures. In the dual-path structure, one path progressively downsamples to aggregate contextual semantic information, while the other path always maintains high resolution to preserve spatial detail. Although this structure achieves a reasonable balance between segmentation accuracy and speed, the high-resolution feature path incurs high time and space complexity, which limits further improvement of real-time performance. In the lightweight codec structure, a downsampling path extracts deep semantic information and a symmetric upsampling path propagates the deep semantic features back to the shallow layers; this unidirectional feature transfer cannot effectively fuse detail information with semantic information.
In summary, existing semantic segmentation models suffer either from low segmentation accuracy or from low inference speed.
Disclosure of Invention
The invention aims to provide a cross-scale feature fusion real-time semantic segmentation method and system that improve both the segmentation accuracy and the inference speed.
In order to achieve the purpose, the invention provides the following scheme:
a cross-scale feature fusion real-time semantic segmentation method comprises the following steps:
training a semantic segmentation network model with a training data set to obtain a trained semantic segmentation network model; the training data set is the Cityscapes data set; the semantic segmentation network model comprises: a backbone network and a feature fusion network; the backbone network comprises: a convolution layer, a residual module, a maximum pooling layer and a down-sampling module designed based on the residual module; the feature fusion network comprises 3 fusion paths;
and inputting the image data set to be segmented into the trained semantic segmentation network model to obtain a semantic segmentation map.
Preferably, after the semantic segmentation network model is trained with the training data set to obtain the trained semantic segmentation network model, the method further comprises:
testing the trained semantic segmentation network model with a test set; the test set is the Cityscapes data set.
Preferably, the training of the semantic segmentation network model by using the training data set to obtain the trained semantic segmentation network model specifically includes:
initializing network parameters of the semantic segmentation network model to obtain an initialized network model;
processing the images in the training data set, inputting the processed images into the initialized network model, and iterating for a first preset number of iterations to obtain a first training network model; processing the images in the training data set comprises compressing the images and enlarging the training batch;
inputting the images in the training data set into the first training network model and iterating for a second preset number of iterations to obtain a second training network model; the second training network model is the trained semantic segmentation network model.
Preferably, stage 1 of the backbone network contains 2 standard convolutions; the input of the standard convolutions is the image to be segmented;
the 2nd stage, the 3rd stage and the 4th stage of the backbone network all comprise convolution modules with the same structure; the convolution module is formed by 1 down-sampling module followed by 2 cascaded residual modules; the input of the down-sampling module in the 2nd stage is the output of the standard convolutions in the 1st stage; the input of the down-sampling module in the 3rd stage is the output of the second residual module in the 2nd stage; the input of the down-sampling module in the 4th stage is the output of the second residual module in the 3rd stage; the input of the first residual module in each of the 2nd, 3rd and 4th stages is the output of the corresponding down-sampling module; and the output of the first residual module in each of the 2nd, 3rd and 4th stages is the input of the corresponding second residual module;
the 5th stage and the 6th stage of the backbone network are maximum pooling layers; the input of the maximum pooling layer in the 5th stage is the output of the residual module in the 4th stage; and the input of the maximum pooling layer in the 6th stage is the output of the maximum pooling layer in the 5th stage;
the 3 fusion paths in the feature fusion network each comprise a first convolution module, a second convolution module, a third convolution module and a fourth convolution module; the inputs of the first convolution module in the 1st fusion path are the output of the maximum pooling layer in the 5th stage and the output of the maximum pooling layer in the 6th stage; the inputs of the second convolution module in the 1st fusion path are the output of the first convolution module in the 1st fusion path and the output of the second residual module in the 4th stage; the inputs of the third convolution module in the 1st fusion path are the output of the second convolution module in the 1st fusion path and the output of the second residual module in the 3rd stage; the inputs of the fourth convolution module in the 1st fusion path are the output of the third convolution module in the 1st fusion path and the output of the second residual module in the 2nd stage; the inputs of the first convolution module in the 2nd fusion path are the output of the fourth convolution module in the 1st fusion path and the output of the third convolution module in the 1st fusion path; the inputs of the second convolution module in the 2nd fusion path are the output of the first convolution module in the 2nd fusion path and the output of the second convolution module in the 1st fusion path; the inputs of the third convolution module in the 2nd fusion path are the output of the second convolution module in the 2nd fusion path and the output of the first convolution module in the 1st fusion path; the inputs of the fourth convolution module in the 2nd fusion path are the output of the third convolution module in the 2nd fusion path and the output of the maximum pooling layer in the 6th stage; the inputs of the first convolution module of the 3rd fusion path are the output of the fourth convolution module in the 2nd fusion path, the output of the third convolution module in the 2nd fusion path and the output of the maximum pooling layer in the 5th stage; the inputs of the second convolution module of the 3rd fusion path are the output of the first convolution module of the 3rd fusion path, the output of the second convolution module in the 2nd fusion path and the output of the second residual module in the 4th stage; the inputs of the third convolution module of the 3rd fusion path are the output of the second convolution module of the 3rd fusion path, the output of the first convolution module in the 2nd fusion path and the output of the second residual module in the 3rd stage; and the inputs of the fourth convolution module of the 3rd fusion path are the output of the third convolution module of the 3rd fusion path and the output of the fourth convolution module in the 1st fusion path;
the input of a segmentation head in the segmentation network is the output of the fourth convolution module of the 3rd fusion path; the segmentation head performs 8-fold upsampling on the feature map output by the fourth convolution module of the 3rd fusion path and then takes a maximum value through a softmax function to obtain the semantic segmentation map.
Preferably, the 2 cascaded residual modules are both lightweight residual modules;
the lightweight residual module is provided, in the transfer direction of the feature map, with the following in sequence: a first 1x1 convolution, a first 3x3 channel-by-channel convolution, a second 1x1 convolution, a second 3x3 channel-by-channel convolution, and a third 1x1 convolution; the first 1x1 convolution, the first 3x3 channel-by-channel convolution, the second 1x1 convolution, the second 3x3 channel-by-channel convolution and the third 1x1 convolution are each followed by a batch normalization operation;
an SE module is arranged after the first channel-by-channel convolution; a residual addition operation is arranged after the batch normalization operation of the third 1x1 convolution; and ReLU activation functions are arranged after the residual addition operation and after the batch normalization operation of the second 1x1 convolution.
Preferably, dilated (hole) convolutions with rates of 2, 4 and 8 are added to the lightweight residual modules.
Preferably, the size of the kernel of the maximum pooling layer is 3, and the step size of the maximum pooling layer is 2.
Preferably, the step size of each of the 2 standard convolutions is 2.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
according to the cross-scale feature fusion real-time semantic segmentation method, the semantic segmentation network model with the network architecture comprising the backbone network and the feature fusion network is arranged to process the image to be segmented in the image data set to be segmented, so that the segmentation precision can be improved, and the reasoning speed can be improved.
Corresponding to the provided cross-scale feature fusion real-time semantic segmentation method, the invention also provides the following implementation system:
a cross-scale feature fusion real-time semantic segmentation system, comprising:
the training module is used for training a semantic segmentation network model with a training data set to obtain a trained semantic segmentation network model; the training data set is the Cityscapes data set; the semantic segmentation network model comprises: a backbone network and a feature fusion network; the backbone network comprises: a convolution layer, a residual module, a maximum pooling layer and a down-sampling module designed based on the residual module; the feature fusion network comprises 3 fusion paths;
and the semantic segmentation module is used for inputting the image data set to be segmented into the trained semantic segmentation network model to obtain a semantic segmentation map.
The technical effects achieved by the cross-scale feature fusion real-time semantic segmentation system provided by the invention are the same as those achieved by the cross-scale feature fusion real-time semantic segmentation method provided by the invention, so they are not repeated here.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required by the embodiments are briefly described below. Obviously, the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained from them by those skilled in the art without inventive effort.
FIG. 1 is a flow chart of a cross-scale feature fusion real-time semantic segmentation method provided by the present invention;
FIG. 2 is a schematic structural diagram of a semantic segmentation network model according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a lightweight residual module according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a down-sampling module according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a feature fusion network according to an embodiment of the present invention;
FIG. 6 is a graph comparing the segmentation results provided by the embodiments of the present invention;
FIG. 7 is a schematic structural diagram of a convolution module adopted in the feature fusion network according to the embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a cross-scale feature fusion real-time semantic segmentation system provided by the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
The invention aims to provide a cross-scale feature fusion real-time semantic segmentation method and system that improve both the segmentation accuracy and the inference speed.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
As shown in fig. 1, the cross-scale feature fusion real-time semantic segmentation method provided by the present invention includes:
step 100: and training the semantic segmentation network model by adopting a training data set to obtain the trained semantic segmentation network model. The training data set is a public data set cityscape data set, for example, 5000 finely labeled street scene images of street scenes from 50 different cities can be selected from the data set, 5000 images are divided into 2975 images of the training set, 500 images of the test set and 1525 images of the verification set. The semantic segmentation network model comprises the following steps: a backbone network and a feature fusion network. The backbone network includes: convolutional layers (e.g., standard convolution), residual modules, max-pooling layers, and downsampling modules designed based on the residual modules (as shown in fig. 4). The feature fusion network includes 3 fusion paths, as shown in fig. 5, the 1 st path fuses features from the 6 th to the 2 nd stages of the backbone network from top to bottom. And the 2 nd path fuses the characteristics of the output of each convolution module of the first path from bottom to top. The 3 rd path fuses the output characteristics of the convolution modules of the second path from top to bottom, wherein the specific structure of the convolution modules is shown in fig. 7. While adding a lateral hop connection from the features of the backbone network input to path 3. When the features are fused, firstly, input features of all scales are reformed into the same size, then the features are spliced into a tensor, and the spliced features are output to the next layer after 1 × 1 convolution dimensionality reduction, 3 × 3 channel-by-channel convolution and 1 × 1 convolution.
In the specific implementation process, the specific implementation manner of the step 100 may be:
and initializing the network parameters of the semantic segmentation network model to obtain an initialized network model. The initialization here is random initialization.
After the images in the training data set are processed, they are input into the initialized network model and iterated for a first preset number of iterations (for example, 150K iterations) to obtain a first training network model. Processing the images in the training data set includes compressing the images and enlarging the batch, e.g. downscaling the training images by a factor of 2 and using a larger batch size.
The images in the training data set are then input into the first training network model and iterated for a second preset number of iterations (for example, 100K iterations) to obtain a second training network model. The second training network model is the trained semantic segmentation network model.
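A minimal sketch of this two-stage schedule follows, assuming a generic segmentation model that returns per-pixel class logits; the optimizer, learning rate, loss function and data loaders are illustrative assumptions, and only the "downscaled images with a larger batch first, original size afterwards" split follows the text.

```python
import torch
import torch.nn as nn

def train_two_stage(model, half_res_loader, full_res_loader,
                    first_iters=150_000, second_iters=100_000, device='cuda'):
    """Stage 1: 2x-downscaled images with a larger batch; stage 2: original size."""
    model.to(device).train()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                                momentum=0.9, weight_decay=1e-4)
    criterion = nn.CrossEntropyLoss(ignore_index=255)

    def run(loader, num_iters):
        it = iter(loader)
        for _ in range(num_iters):
            try:
                images, labels = next(it)
            except StopIteration:          # restart the loader when exhausted
                it = iter(loader)
                images, labels = next(it)
            logits = model(images.to(device))        # per-pixel class logits
            loss = criterion(logits, labels.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    run(half_res_loader, first_iters)    # first preset number of iterations
    run(full_res_loader, second_iters)   # second preset number of iterations
    return model
```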
Step 101: input the image data set to be segmented into the trained semantic segmentation network model to obtain the semantic segmentation map.
In the present invention, the specific structure of the employed semantic segmentation network model is shown in fig. 2, wherein the 1st stage of the backbone network includes 2 standard convolutions. The input of the standard convolutions is the image to be segmented.
The 2nd, 3rd and 4th stages of the backbone network all comprise convolution modules of the same structure. The convolution module is 1 down-sampling module followed by 2 cascaded residual modules. The input of the down-sampling module in stage 2 is the output of the standard convolutions in stage 1. The input of the down-sampling module in stage 3 is the output of the second residual module in stage 2. The input of the down-sampling module in stage 4 is the output of the second residual module in stage 3. The input of the first residual module in the 2nd, 3rd and 4th stages is the output of the corresponding down-sampling module. The output of the first residual module in the 2nd, 3rd and 4th stages is the input of the corresponding second residual module.
Both stages 5 and 6 of the backbone network are maximum pooling layers. The input of the maximum pooling layer in stage 5 is the output of the residual module in stage 4. The input of the maximum pooling layer in stage 6 is the output of the maximum pooling layer in stage 5.
The 3 fusion paths in the feature fusion network each comprise a first convolution module, a second convolution module, a third convolution module and a fourth convolution module. The inputs of the first convolution module in the 1st fusion path are the output of the maximum pooling layer in the 5th stage and the output of the maximum pooling layer in the 6th stage. The inputs of the second convolution module in the 1st fusion path are the output of the first convolution module in the 1st fusion path and the output of the second residual module in the 4th stage. The inputs of the third convolution module in the 1st fusion path are the output of the second convolution module in the 1st fusion path and the output of the second residual module in the 3rd stage. The inputs of the fourth convolution module in the 1st fusion path are the output of the third convolution module in the 1st fusion path and the output of the second residual module in the 2nd stage. The inputs of the first convolution module in the 2nd fusion path are the output of the fourth convolution module in the 1st fusion path and the output of the third convolution module in the 1st fusion path. The inputs of the second convolution module in the 2nd fusion path are the output of the first convolution module in the 2nd fusion path and the output of the second convolution module in the 1st fusion path. The inputs of the third convolution module in the 2nd fusion path are the output of the second convolution module in the 2nd fusion path and the output of the first convolution module in the 1st fusion path. The inputs of the fourth convolution module in the 2nd fusion path are the output of the third convolution module in the 2nd fusion path and the output of the maximum pooling layer in the 6th stage. The inputs of the first convolution module of the 3rd fusion path are the output of the fourth convolution module in the 2nd fusion path, the output of the third convolution module in the 2nd fusion path and the output of the maximum pooling layer in the 5th stage. The inputs of the second convolution module of the 3rd fusion path are the output of the first convolution module of the 3rd fusion path, the output of the second convolution module in the 2nd fusion path and the output of the second residual module in the 4th stage. The inputs of the third convolution module of the 3rd fusion path are the output of the second convolution module of the 3rd fusion path, the output of the first convolution module in the 2nd fusion path and the output of the second residual module in the 3rd stage. The inputs of the fourth convolution module of the 3rd fusion path are the output of the third convolution module of the 3rd fusion path and the output of the fourth convolution module in the 1st fusion path.
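The wiring of the three fusion paths described above can be sketched as follows, reusing the FusionConvModule class sketched earlier; the common channel width C and the spatial scale at which each fusion is performed are assumptions, while the connection pattern follows the description.

```python
import torch.nn as nn

class CrossScaleFusion(nn.Module):
    """Three fusion paths over backbone outputs s2..s6 (all assumed to have C channels)."""
    def __init__(self, C=128):
        super().__init__()
        # path 1: top-down, from the 6th stage back to the 2nd stage
        self.p1 = nn.ModuleList(FusionConvModule(2 * C, C) for _ in range(4))
        # path 2: bottom-up over the path-1 outputs
        self.p2 = nn.ModuleList(FusionConvModule(2 * C, C) for _ in range(4))
        # path 3: top-down over the path-2 outputs, with lateral skips from the backbone
        self.p3 = nn.ModuleList([FusionConvModule(3 * C, C), FusionConvModule(3 * C, C),
                                 FusionConvModule(3 * C, C), FusionConvModule(2 * C, C)])

    def forward(self, s2, s3, s4, s5, s6):
        sz = lambda t: t.shape[2:]
        # path 1 (coarse -> fine)
        a1 = self.p1[0]([s6, s5], sz(s5))
        a2 = self.p1[1]([a1, s4], sz(s4))
        a3 = self.p1[2]([a2, s3], sz(s3))
        a4 = self.p1[3]([a3, s2], sz(s2))
        # path 2 (fine -> coarse)
        b1 = self.p2[0]([a4, a3], sz(s3))
        b2 = self.p2[1]([b1, a2], sz(s4))
        b3 = self.p2[2]([b2, a1], sz(s5))
        b4 = self.p2[3]([b3, s6], sz(s6))
        # path 3 (coarse -> fine, with skip connections from the backbone)
        c1 = self.p3[0]([b4, b3, s5], sz(s5))
        c2 = self.p3[1]([c1, b2, s4], sz(s4))
        c3 = self.p3[2]([c2, b1, s3], sz(s3))
        c4 = self.p3[3]([c3, a4], sz(s2))
        return c4   # fused feature map at 1/8 resolution, fed to the segmentation head
```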
The input of the segmentation head in the segmentation network is the output of the fourth convolution module of the 3rd fusion path. The segmentation head performs 8-fold upsampling on the feature map output by the fourth convolution module of the 3rd fusion path and then obtains the semantic segmentation map by taking the maximum value through a softmax function.
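A possible form of this segmentation head is sketched below; the 1 × 1 classifier convolution and the 19 Cityscapes classes are assumptions, while the 8-fold upsampling and the softmax/argmax step follow the description.

```python
import torch.nn as nn
import torch.nn.functional as F

class SegHead(nn.Module):
    """Classifier conv -> 8x bilinear upsampling; the class map is the per-pixel
    argmax of the softmax over the upsampled logits."""
    def __init__(self, in_channels=128, num_classes=19):
        super().__init__()
        self.classifier = nn.Conv2d(in_channels, num_classes, 1)

    def forward(self, fused):
        logits = self.classifier(fused)
        return F.interpolate(logits, scale_factor=8, mode='bilinear',
                             align_corners=False)

# Usage: per-pixel label map from the upsampled logits.
# pred = SegHead()(fused_features).softmax(dim=1).argmax(dim=1)
```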
In order to further improve the accuracy and speed of image semantic segmentation, the invention designs the 2 cascaded residual modules as lightweight residual modules based on depthwise separable convolution. As shown in fig. 3, the lightweight residual module is provided in sequence with: a first 1x1 convolution, a first 3x3 channel-by-channel convolution, a second 1x1 convolution, a second 3x3 channel-by-channel convolution and a third 1x1 convolution, each followed by a batch normalization operation. Based on this lightweight residual structure, the down-sampling module is formed by adding a parallel branch consisting of a 1x1 convolution, a channel-by-channel convolution with a stride of 2 and a 1x1 convolution.
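The following PyTorch sketch illustrates one possible realization of the lightweight residual module and the down-sampling module; the channel expansion ratio, the SE reduction ratio and the exact activation placement are assumptions based on this description (batch normalization after every convolution, an optional SE module after the first channel-by-channel convolution, and ReLU after the second 1x1 convolution and after the residual addition).

```python
import torch.nn as nn

def conv_bn(in_ch, out_ch, k, stride=1, groups=1, dilation=1):
    pad = dilation * (k - 1) // 2
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, stride, pad, dilation=dilation,
                  groups=groups, bias=False),
        nn.BatchNorm2d(out_ch))

class SEBlock(nn.Module):
    def __init__(self, ch, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.fc(x)   # channel-wise recalibration

class LightResidualBlock(nn.Module):
    """1x1 -> 3x3 depthwise (+SE) -> 1x1 -> 3x3 depthwise -> 1x1, with a shortcut."""
    def __init__(self, ch, expand=2, dilation=1, use_se=True):
        super().__init__()
        mid = ch * expand
        self.pw1 = conv_bn(ch, mid, 1)
        self.dw1 = conv_bn(mid, mid, 3, groups=mid, dilation=dilation)
        self.se = SEBlock(mid) if use_se else nn.Identity()
        self.pw2 = conv_bn(mid, mid, 1)
        self.dw2 = conv_bn(mid, mid, 3, groups=mid, dilation=dilation)
        self.pw3 = conv_bn(mid, ch, 1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        y = self.se(self.dw1(self.pw1(x)))
        y = self.act(self.pw2(y))
        y = self.pw3(self.dw2(y))
        return self.act(x + y)          # residual addition, then ReLU

class DownsampleBlock(nn.Module):
    """Lightweight residual body with a stride-2 first depthwise convolution,
    plus a parallel 1x1 -> stride-2 depthwise -> 1x1 shortcut branch."""
    def __init__(self, in_ch, out_ch, expand=2):
        super().__init__()
        mid = in_ch * expand
        self.body = nn.Sequential(
            conv_bn(in_ch, mid, 1),
            conv_bn(mid, mid, 3, stride=2, groups=mid),
            conv_bn(mid, mid, 1), nn.ReLU(inplace=True),
            conv_bn(mid, mid, 3, groups=mid),
            conv_bn(mid, out_ch, 1))
        self.branch = nn.Sequential(
            conv_bn(in_ch, in_ch, 1),
            conv_bn(in_ch, in_ch, 3, stride=2, groups=in_ch),
            conv_bn(in_ch, out_ch, 1))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.body(x) + self.branch(x))
```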
After the backbone network is constructed from the designed lightweight residual modules and down-sampling modules, the input image is first processed with 2 conventional 3 × 3 convolutions with a stride of 2; conventional convolution performs better here because the input image has a large resolution and a small number of channels. The subsequent 3 stages (stage 2, stage 3 and stage 4) are identical in structure, i.e. each comprises 1 down-sampling module and 2 lightweight residual modules. Dilated convolutions with rates of 2, 4 and 8 are added to the 2nd lightweight residual module of these stages respectively, and an SE module is added after its 1st channel-by-channel convolution. The last 2 layers of the backbone network are maximum pooling layers with a kernel size of 3 and a stride of 2. The size of the final feature map is 1/128 of the input image. The specific parameters of each structure in the backbone network are shown in Table 1.
TABLE 1 (parameters of each structure in the backbone network; the table contents are provided as an image in the original publication and are not reproduced here)
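Since Table 1 is only available as an image, the following sketch assembles the six backbone stages with assumed channel widths, reusing the LightResidualBlock and DownsampleBlock classes sketched above; the stage resolutions (1/4 to 1/128) and the placement of the dilated, SE-equipped block follow the description.

```python
import torch.nn as nn

def std_conv(in_ch, out_ch):
    # conventional 3x3 convolution with a stride of 2
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, 2, 1, bias=False),
                         nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

def stage(in_ch, out_ch, dilation):
    # 1 down-sampling module followed by 2 lightweight residual modules;
    # the 2nd residual module carries the dilated convolutions and the SE module.
    return nn.Sequential(
        DownsampleBlock(in_ch, out_ch),
        LightResidualBlock(out_ch, use_se=False),
        LightResidualBlock(out_ch, dilation=dilation, use_se=True))

class CSFBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(std_conv(3, 32), std_conv(32, 32))   # 1/4
        self.stage2 = stage(32, 64, dilation=2)                          # 1/8
        self.stage3 = stage(64, 128, dilation=4)                         # 1/16
        self.stage4 = stage(128, 256, dilation=8)                        # 1/32
        self.pool5 = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)    # 1/64
        self.pool6 = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)    # 1/128

    def forward(self, x):
        s1 = self.stage1(x)
        s2 = self.stage2(s1)
        s3 = self.stage3(s2)
        s4 = self.stage4(s3)
        s5 = self.pool5(s4)
        s6 = self.pool6(s5)
        return s2, s3, s4, s5, s6    # outputs fed to the cross-scale fusion network
```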
Based on the specific structure of the semantic segmentation network model provided above, in the process of testing the trained semantic segmentation network model, the Cityscapes data set is selected as the test set, and the specific test process is as follows:
the image in the test data set is input into a semantic segmentation network model, and a backbone network is divided into 6 stages to process the input image. The stage 1 of the backbone network comprises 2 standard convolutions, the step size is 2, and the downsampling operation is carried out while the input image is processed, so that the size of the characteristic image is reduced. Stages 2 to 4 have the same convolution module, i.e. 1 down-sampling module followed by two lightweight residual modules. Each downsampling module compresses the feature map size output by the previous module and enlarges the number of feature channels by a factor of 2. The lightweight residual error module is formed by adopting deep separable convolution, and can give consideration to the processing speed and precision of the network. The last 2 stages of the backbone network are the maximum pooling layers, the maximum pooling layer kernel size is 3, the step size is 2, the size of the final feature map is 1/128 of the input image, the image has a large enough receptive field, and the local maximum response can be obtained. Then the inputs of the 2 nd to 6 th stages of the backbone convolutional network are respectively sent to a cross-scale feature fusion module. The cross-scale feature fusion module outputs fusion features after passing through three paths from top to bottom, from bottom to top and from top to bottom, up-sampling is carried out for 8 times through a final segmentation head, and a corresponding label category information is obtained by taking a maximum value through a softmax function to obtain a semantic segmentation graph.
The following describes a specific implementation process of the cross-scale feature fusion real-time semantic segmentation method provided by the invention, taking the public Cityscapes data set as the evaluation data set as an example.
Step 1: training on large-size images is divided into two steps. In the first step, the network parameters are initialized randomly, the training images are downscaled by a factor of 2, the batch size is enlarged, and 150K iterations are run to obtain an intermediate training result for the convolutional network. In the second step, training continues in small batches with the original image size for 100K iterations to obtain the final convolutional network training result.
Step 2: a 3-channel color image is input into the semantic segmentation network model, and the backbone network processes the input image in 6 stages. Stage 1 of the backbone network comprises 2 standard convolutions with a stride of 2, which down-sample the input image while processing it and reduce the size of the feature map. Stages 2 to 4 have the same convolution module, i.e. 1 down-sampling module followed by two lightweight residual modules. Each down-sampling module compresses the size of the feature map output by the previous module and doubles the number of feature channels. The lightweight residual modules are built from depthwise separable convolutions, which balances the processing speed and accuracy of the network. The last 2 stages of the backbone network are maximum pooling layers with a kernel size of 3 and a stride of 2; the size of the final feature map is 1/128 of the input image, the receptive field over the input image is large enough, and local maximum responses can be obtained. The outputs of the 2nd to 6th stages of the backbone network are then sent to the cross-scale feature fusion module. The cross-scale feature fusion module outputs fused features after the three top-down, bottom-up and top-down paths, the final segmentation head performs 8-fold upsampling, and the maximum value is taken through the softmax function to obtain the corresponding label category information and the semantic segmentation map.
In this example, the airplane data set mentioned above and the public Cityscapes data set are respectively used as evaluation data sets. All experimental results were obtained on a single NVIDIA GeForce RTX 2080 Ti GPU.
The most common real-time semantic segmentation metrics are used in this example: segmentation accuracy is measured by the mean intersection over union (mIoU), and inference speed is measured by the number of image frames processed per second (FPS); the number of parameters and the amount of computation are also used as comparison metrics. For a real-time semantic segmentation model the inference speed must be at least 30 FPS; on this basis, the higher the segmentation accuracy, the faster the inference speed, and the smaller the number of parameters and the amount of computation, the better the overall performance of the model.
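For reference, the two metrics can be computed as in the following sketch; the ignore label of 255 and the warm-up/timing procedure are common conventions rather than details taken from the patent, and pred/label are assumed to be NumPy arrays of class indices.

```python
import time
import numpy as np
import torch

def confusion_matrix(pred, label, num_classes, ignore_index=255):
    # Accumulate a num_classes x num_classes confusion matrix for one image.
    mask = label != ignore_index
    idx = num_classes * label[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def mean_iou(conf):
    inter = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    iou = inter / np.maximum(union, 1)
    return iou[union > 0].mean()          # average over classes that appear

def measure_fps(model, input_size=(1, 3, 1024, 2048), runs=100, device='cuda'):
    model.to(device).eval()
    x = torch.randn(*input_size, device=device)
    with torch.no_grad():
        for _ in range(10):               # warm-up iterations
            model(x)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(runs):
            model(x)
        torch.cuda.synchronize()
    return runs / (time.time() - start)
```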
To demonstrate that the cross-scale feature fusion real-time semantic segmentation method provided by the invention offers excellent accuracy and speed, it is compared with several recently published real-time semantic segmentation methods on the Cityscapes data set.
Table 2 shows the test results of the invention and other real-time semantic segmentation methods on the Cityscapes data set. It can be seen that the segmentation accuracy of the invention is the best while the requirement of at least 30 frames per second for real-time semantic segmentation is met, reaching 46.5 FPS. Although the models SFNet and CABiNet run faster than the invention, their segmentation accuracy, number of parameters and amount of computation are inferior; and although other models have fewer parameters, less computation and faster inference, their segmentation accuracy cannot match that of the invention. The invention therefore achieves the best balance between speed and accuracy.
TABLE 2 Comparison results on the Cityscapes data set (the table contents are provided as an image in the original publication and are not reproduced here)
Comparison of the lightweight residual module:
The designed lightweight residual module is compared with the inverted residual module of MobileNetV2 on the semantic segmentation network structure designed by the invention: the lightweight residual module and the inverted residual module are used respectively to construct the backbone network of the model, the same number of training rounds is run under the same hyper-parameter configuration, and the segmentation accuracy (mIoU), amount of computation (GFLOPs), number of parameters and inference speed of the two models are compared in Table 3.
TABLE 3 (comparison of the lightweight residual module and the MobileNetV2 inverted residual module; the table contents are provided as an image in the original publication)
Effectiveness of the cross-scale feature fusion module:
the cross-scale feature fusion module is mainly characterized by having jump connection from input to the 3 rd path from top to bottom. For this structure, compared to the structure with no jump connection, only jump connection to item 2, and jump connection structure containing to items 2 and 3, as shown in table 4, the designed cross-scale feature fusion module has the best performance when only jump connection to item 3 (adopted).
TABLE 4 (ablation of the skip connections in the cross-scale feature fusion module; the table contents are provided as an image in the original publication)
The segmentation results obtained with the cross-scale feature fusion real-time semantic segmentation method provided by the invention are shown in fig. 6, where the 1st column of fig. 6 shows the input images, the 2nd column the outputs of the cross-scale feature fusion network, the 3rd column the outputs of the network without cross-scale connections, and the 4th column the annotated (ground-truth) segmentation images.
In summary, the technical scheme provided by the invention adopts a lightweight real-time semantic segmentation convolutional neural network model, a lightweight residual module composed of depthwise separable convolution layers and linear bottleneck layers, and a cross-scale feature fusion module. It achieves real-time, high-accuracy semantic segmentation on large-size images of urban street scenes and large airplanes.
In addition, corresponding to the cross-scale feature fusion real-time semantic segmentation method provided above, the invention also provides a cross-scale feature fusion real-time semantic segmentation system. As shown in fig. 8, the system comprises a training module 1 and a semantic segmentation module 2.
The training module 1 is used to train the semantic segmentation network model with a training data set to obtain a trained semantic segmentation network model. The training data set is the Cityscapes data set. The semantic segmentation network model comprises a backbone network and a feature fusion network. The backbone network comprises a convolution layer, a residual module, a maximum pooling layer and a down-sampling module designed based on the residual module. The feature fusion network comprises 3 fusion paths.
The semantic segmentation module 2 is used for inputting the image data set to be segmented into the trained semantic segmentation network model to obtain a semantic segmentation map.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (8)

1. A cross-scale feature fusion real-time semantic segmentation method is characterized by comprising the following steps:
training a semantic segmentation network model with a training data set to obtain a trained semantic segmentation network model; the training data set is the Cityscapes data set; the semantic segmentation network model comprises: a backbone network and a feature fusion network; the backbone network comprises: a convolution layer, a residual module, a maximum pooling layer and a down-sampling module designed based on the residual module; the feature fusion network comprises 3 fusion paths;
inputting an image data set to be segmented into a trained semantic segmentation network model to obtain a semantic segmentation map;
wherein stage 1 of the backbone network comprises 2 standard convolutions; the input of the standard convolution is an image to be segmented;
the 2nd stage, the 3rd stage and the 4th stage of the backbone network all comprise convolution modules with the same structure; the convolution module is formed by 1 down-sampling module followed by 2 cascaded residual modules; the input of the down-sampling module in the 2nd stage is the output of the standard convolutions in the 1st stage; the input of the down-sampling module in the 3rd stage is the output of the second residual module in the 2nd stage; the input of the down-sampling module in the 4th stage is the output of the second residual module in the 3rd stage; the input of the first residual module in each of the 2nd, 3rd and 4th stages is the output of the corresponding down-sampling module; and the output of the first residual module in each of the 2nd, 3rd and 4th stages is the input of the corresponding second residual module;
the 5th stage and the 6th stage of the backbone network are maximum pooling layers; the input of the maximum pooling layer in the 5th stage is the output of the residual module in the 4th stage; and the input of the maximum pooling layer in the 6th stage is the output of the maximum pooling layer in the 5th stage;
the 3 fusion paths in the feature fusion network each comprise a first convolution module, a second convolution module, a third convolution module and a fourth convolution module; the inputs of the first convolution module in the 1st fusion path are the output of the maximum pooling layer in the 5th stage and the output of the maximum pooling layer in the 6th stage; the inputs of the second convolution module in the 1st fusion path are the output of the first convolution module in the 1st fusion path and the output of the second residual module in the 4th stage; the inputs of the third convolution module in the 1st fusion path are the output of the second convolution module in the 1st fusion path and the output of the second residual module in the 3rd stage; the inputs of the fourth convolution module in the 1st fusion path are the output of the third convolution module in the 1st fusion path and the output of the second residual module in the 2nd stage; the inputs of the first convolution module in the 2nd fusion path are the output of the fourth convolution module in the 1st fusion path and the output of the third convolution module in the 1st fusion path; the inputs of the second convolution module in the 2nd fusion path are the output of the first convolution module in the 2nd fusion path and the output of the second convolution module in the 1st fusion path; the inputs of the third convolution module in the 2nd fusion path are the output of the second convolution module in the 2nd fusion path and the output of the first convolution module in the 1st fusion path; the inputs of the fourth convolution module in the 2nd fusion path are the output of the third convolution module in the 2nd fusion path and the output of the maximum pooling layer in the 6th stage; the inputs of the first convolution module of the 3rd fusion path are the output of the fourth convolution module in the 2nd fusion path, the output of the third convolution module in the 2nd fusion path and the output of the maximum pooling layer in the 5th stage; the inputs of the second convolution module of the 3rd fusion path are the output of the first convolution module of the 3rd fusion path, the output of the second convolution module in the 2nd fusion path and the output of the second residual module in the 4th stage; the inputs of the third convolution module of the 3rd fusion path are the output of the second convolution module of the 3rd fusion path, the output of the first convolution module in the 2nd fusion path and the output of the second residual module in the 3rd stage; and the inputs of the fourth convolution module of the 3rd fusion path are the output of the third convolution module of the 3rd fusion path and the output of the fourth convolution module in the 1st fusion path;
the input of a segmentation head in the segmentation network is the output of the fourth convolution module of the 3rd fusion path; the segmentation head performs 8-fold upsampling on the feature map output by the fourth convolution module of the 3rd fusion path and then takes a maximum value through a softmax function to obtain the semantic segmentation map.
2. The cross-scale feature fusion real-time semantic segmentation method according to claim 1, wherein after the semantic segmentation network model is trained with the training data set to obtain the trained semantic segmentation network model, the method further comprises:
testing the trained semantic segmentation network model with a test set; the test set is the Cityscapes data set.
3. The cross-scale feature fusion real-time semantic segmentation method according to claim 1, wherein training the semantic segmentation network model with a training data set to obtain a trained semantic segmentation network model specifically comprises:
initializing network parameters of the semantic segmentation network model to obtain an initialized network model;
processing the images in the training data set, inputting the processed images into the initialized network model, and iterating for a first preset number of iterations to obtain a first training network model; processing the images in the training data set comprises compressing the images and enlarging the training batch;
inputting the images in the training data set into the first training network model and iterating for a second preset number of iterations to obtain a second training network model; the second training network model is the trained semantic segmentation network model.
4. The cross-scale feature fusion real-time semantic segmentation method according to claim 1, wherein the 2 cascaded residual modules are both lightweight residual modules;
the lightweight residual module is provided, in the transfer direction of the feature map, with the following in sequence: a first 1x1 convolution, a first 3x3 channel-by-channel convolution, a second 1x1 convolution, a second 3x3 channel-by-channel convolution, and a third 1x1 convolution; the first 1x1 convolution, the first 3x3 channel-by-channel convolution, the second 1x1 convolution, the second 3x3 channel-by-channel convolution and the third 1x1 convolution are each followed by a batch normalization operation;
an SE module is arranged after the first 3x3 channel-by-channel convolution; a residual addition operation is arranged after the batch normalization operation of the second 1x1 convolution; and ReLU activation functions are arranged after the residual addition operation and after the batch normalization operation of the second 1x1 convolution.
5. The cross-scale feature fusion real-time semantic segmentation method according to claim 4, wherein dilated (hole) convolutions with rates of 2, 4 and 8 are added to the lightweight residual modules.
6. The method according to claim 1, wherein the size of the kernel of the maximum pooling layer is 3, and the step size of the maximum pooling layer is 2.
7. The method according to claim 1, wherein the step size of each of the 2 standard convolutions is 2.
8. A cross-scale feature fusion real-time semantic segmentation system, comprising:
the training module is used for training a semantic segmentation network model with a training data set to obtain a trained semantic segmentation network model; the training data set is the Cityscapes data set; the semantic segmentation network model comprises: a backbone network and a feature fusion network; the backbone network comprises: a convolution layer, a residual module, a maximum pooling layer and a down-sampling module designed based on the residual module; the feature fusion network comprises 3 fusion paths;
the semantic segmentation module is used for inputting the image data set to be segmented into the trained semantic segmentation network model to obtain a semantic segmentation map;
wherein stage 1 of the backbone network comprises 2 standard convolutions; the input of the standard convolution is an image to be segmented;
the 2nd stage, the 3rd stage and the 4th stage of the backbone network all comprise convolution modules with the same structure; the convolution module is formed by 1 down-sampling module followed by 2 cascaded residual modules; the input of the down-sampling module in the 2nd stage is the output of the standard convolutions in the 1st stage; the input of the down-sampling module in the 3rd stage is the output of the second residual module in the 2nd stage; the input of the down-sampling module in the 4th stage is the output of the second residual module in the 3rd stage; the input of the first residual module in each of the 2nd, 3rd and 4th stages is the output of the corresponding down-sampling module; and the output of the first residual module in each of the 2nd, 3rd and 4th stages is the input of the corresponding second residual module;
the 5th stage and the 6th stage of the backbone network are maximum pooling layers; the input of the maximum pooling layer in the 5th stage is the output of the residual module in the 4th stage; and the input of the maximum pooling layer in the 6th stage is the output of the maximum pooling layer in the 5th stage;
the 3 fusion paths in the feature fusion network each comprise a first convolution module, a second convolution module, a third convolution module and a fourth convolution module; the inputs of the first convolution module in the 1st fusion path are the output of the maximum pooling layer in the 5th stage and the output of the maximum pooling layer in the 6th stage; the inputs of the second convolution module in the 1st fusion path are the output of the first convolution module in the 1st fusion path and the output of the second residual module in the 4th stage; the inputs of the third convolution module in the 1st fusion path are the output of the second convolution module in the 1st fusion path and the output of the second residual module in the 3rd stage; the inputs of the fourth convolution module in the 1st fusion path are the output of the third convolution module in the 1st fusion path and the output of the second residual module in the 2nd stage; the inputs of the first convolution module in the 2nd fusion path are the output of the fourth convolution module in the 1st fusion path and the output of the third convolution module in the 1st fusion path; the inputs of the second convolution module in the 2nd fusion path are the output of the first convolution module in the 2nd fusion path and the output of the second convolution module in the 1st fusion path; the inputs of the third convolution module in the 2nd fusion path are the output of the second convolution module in the 2nd fusion path and the output of the first convolution module in the 1st fusion path; the inputs of the fourth convolution module in the 2nd fusion path are the output of the third convolution module in the 2nd fusion path and the output of the maximum pooling layer in the 6th stage; the inputs of the first convolution module of the 3rd fusion path are the output of the fourth convolution module in the 2nd fusion path, the output of the third convolution module in the 2nd fusion path and the output of the maximum pooling layer in the 5th stage; the inputs of the second convolution module of the 3rd fusion path are the output of the first convolution module of the 3rd fusion path, the output of the second convolution module in the 2nd fusion path and the output of the second residual module in the 4th stage; the inputs of the third convolution module of the 3rd fusion path are the output of the second convolution module of the 3rd fusion path, the output of the first convolution module in the 2nd fusion path and the output of the second residual module in the 3rd stage; and the inputs of the fourth convolution module of the 3rd fusion path are the output of the third convolution module of the 3rd fusion path and the output of the fourth convolution module in the 1st fusion path;
the input of a segmentation head in the segmentation network is the output of the fourth convolution module of the 3rd fusion path; the segmentation head performs 8-fold upsampling on the feature map output by the fourth convolution module of the 3rd fusion path and then takes a maximum value through a softmax function to obtain the semantic segmentation map.
CN202111021027.5A 2021-09-01 2021-09-01 Cross-scale feature fusion real-time semantic segmentation method and system Active CN113658189B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111021027.5A CN113658189B (en) 2021-09-01 2021-09-01 Cross-scale feature fusion real-time semantic segmentation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111021027.5A CN113658189B (en) 2021-09-01 2021-09-01 Cross-scale feature fusion real-time semantic segmentation method and system

Publications (2)

Publication Number Publication Date
CN113658189A CN113658189A (en) 2021-11-16
CN113658189B true CN113658189B (en) 2022-03-11

Family

ID=78481649

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111021027.5A Active CN113658189B (en) 2021-09-01 2021-09-01 Cross-scale feature fusion real-time semantic segmentation method and system

Country Status (1)

Country Link
CN (1) CN113658189B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114120154B (en) * 2021-11-23 2022-10-28 宁波大学 Automatic detection method for breakage of glass curtain wall of high-rise building
CN114612456B (en) * 2022-03-21 2023-01-10 北京科技大学 Billet automatic semantic segmentation recognition method based on deep learning
CN114943835B (en) * 2022-04-20 2024-03-12 西北工业大学 Real-time semantic segmentation method for yellow river ice unmanned aerial vehicle aerial image


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111062395A (en) * 2019-11-27 2020-04-24 北京理工大学 Real-time video semantic segmentation method
CN111080648A (en) * 2019-12-02 2020-04-28 南京理工大学 Real-time image semantic segmentation algorithm based on residual learning
CN111666948A (en) * 2020-05-27 2020-09-15 厦门大学 Real-time high-performance semantic segmentation method and device based on multi-path aggregation
CN112381097A (en) * 2020-11-16 2021-02-19 西南石油大学 Scene semantic segmentation method based on deep learning
CN113256649A (en) * 2021-05-11 2021-08-13 国网安徽省电力有限公司经济技术研究院 Remote sensing image station selection and line selection semantic segmentation method based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Multi-Path Dilated Residual Network for Nuclei Segmentation and Detection; Eric Ke Wang et al.; Cells; 2019-05-23; pp. 1-19 *

Also Published As

Publication number Publication date
CN113658189A (en) 2021-11-16

Similar Documents

Publication Publication Date Title
CN113658189B (en) Cross-scale feature fusion real-time semantic segmentation method and system
CN110782462B (en) Semantic segmentation method based on double-flow feature fusion
CN107704866B (en) Multitask scene semantic understanding model based on novel neural network and application thereof
CN111259983B (en) Image semantic segmentation method based on deep learning and storage medium
CN111598183B (en) Multi-feature fusion image description method
CN110569851B (en) Real-time semantic segmentation method for gated multi-layer fusion
CN113870422B (en) Point cloud reconstruction method, device, equipment and medium
CN116721334B (en) Training method, device, equipment and storage medium of image generation model
CN114119975A (en) Language-guided cross-modal instance segmentation method
CN111899169B (en) Method for segmenting network of face image based on semantic segmentation
CN110782458B (en) Object image 3D semantic prediction segmentation method of asymmetric coding network
CN113066089A (en) Real-time image semantic segmentation network based on attention guide mechanism
KR102128789B1 (en) Method and apparatus for providing efficient dilated convolution technique for deep convolutional neural network
CN116703947A (en) Image semantic segmentation method based on attention mechanism and knowledge distillation
CN111160378A (en) Depth estimation system based on single image multitask enhancement
CN112418235A (en) Point cloud semantic segmentation method based on expansion nearest neighbor feature enhancement
CN116485860A (en) Monocular depth prediction algorithm based on multi-scale progressive interaction and aggregation cross attention features
Yu et al. A review of single image super-resolution reconstruction based on deep learning
CN116993987A (en) Image semantic segmentation method and system based on lightweight neural network model
WO2020093210A1 (en) Scene segmentation method and system based on contenxtual information guidance
CN113255675B (en) Image semantic segmentation network structure and method based on expanded convolution and residual path
CN111553921B (en) Real-time semantic segmentation method based on channel information sharing residual error module
CN112488115B (en) Semantic segmentation method based on two-stream architecture
CN113139463A (en) Method, apparatus, device, medium and program product for training a model
Gan et al. Image super-resolution reconstruction based on deep residual network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant