CN114022858A - Semantic segmentation method, system, electronic device and medium for automatic driving - Google Patents


Info

Publication number
CN114022858A
Authority
CN
China
Prior art keywords
model
target
semantic segmentation
module
acquiring
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111361700.XA
Other languages
Chinese (zh)
Inventor
韩先锋 (Han Xianfeng)
程辉先 (Cheng Huixian)
肖国强 (Xiao Guoqiang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest University
Original Assignee
Southwest University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest University filed Critical Southwest University
Publication of CN114022858A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical fields of deep learning and automatic driving, and provides a semantic segmentation method, system, electronic device and medium for automatic driving. The method comprises the following steps: acquiring three-dimensional point cloud data, mapping the three-dimensional point cloud data into a two-dimensional depth map comprising a plurality of channel data, and forming a sample data set from the plurality of channel data; constructing a semantic segmentation network initial model comprising a first model and a second model, and training the initial model with the sample data set to obtain a target model, wherein the second model comprises an encoder and a decoder; and acquiring target three-dimensional point cloud data, mapping it into a target two-dimensional depth map, and inputting the target two-dimensional depth map into the target model to obtain a target semantic segmentation result. The method solves the problem of low semantic segmentation performance on three-dimensional point clouds.

Description

Semantic segmentation method, system, electronic device and medium for automatic driving
Technical Field
The invention relates to the technical field of deep learning and automatic driving, in particular to a semantic segmentation method, a semantic segmentation system, electronic equipment and a semantic segmentation medium for automatic driving.
Background
With the development of technology, artificial intelligence can be applied in ever more fields and is playing an increasingly important role. Semantic segmentation is one of the important applications in the field of artificial intelligence and is widely used in automatic driving, video understanding, face recognition systems, intelligent hardware and the like. In the field of automatic driving, accurate semantic segmentation of the traffic environment enables accurate, robust, reliable and real-time perception and understanding of that environment.
Currently, automatic driving systems use many types of sensors with complementary characteristics, such as cameras and radars. Because radar provides distance measurements and high-precision three-dimensional geometric information, it plays a crucial role in semantic scene perception, particularly in the key task of three-dimensional point cloud semantic segmentation. However, the three-dimensional point cloud data generated by radar is generally sparse, irregular and unordered, which results in low semantic segmentation performance on the three-dimensional point cloud.
Disclosure of Invention
The invention provides a semantic segmentation method, a semantic segmentation system, electronic equipment and a semantic segmentation medium for automatic driving, and aims to solve the problem of low semantic segmentation performance of three-dimensional point cloud in the prior art.
The invention provides a semantic segmentation method for automatic driving, which comprises the following steps:
acquiring three-dimensional point cloud data, mapping the three-dimensional point cloud data into a two-dimensional depth map comprising a plurality of channel data, and forming a sample data set according to the plurality of channel data;
constructing a semantic segmentation network initial model comprising a first model and a second model, training the semantic segmentation network initial model by adopting the sample data set, and acquiring a target model, wherein the second model comprises an encoder and a decoder;
and acquiring target three-dimensional point cloud data, mapping the target three-dimensional point cloud data into a target two-dimensional depth map, inputting the target two-dimensional depth map into the target model, and acquiring a target semantic segmentation result.
Optionally, the training of the semantic segmentation network initial model by using the sample data set to obtain a target model includes:
inputting the sample data set into the feature extraction model, respectively extracting features of each channel data in the sample data set, and obtaining the channel features of each channel, wherein the feature extraction model at least comprises two first components and one second component, the first components comprise a first convolution module and a residual error module, and the second components comprise a second convolution module and a residual error module.
Optionally, the training of the semantic segmentation network initial model with the sample data set to obtain a target model further includes:
and inputting the channel features of each channel into the feature fusion model, which performs feature fusion processing to obtain fused features, wherein the feature fusion model is a spatial channel attention module.
Optionally, the training of the semantic segmentation network initial model with the sample data set to obtain a target model includes:
and acquiring an output result of the first model, inputting the output result into the encoder to perform first convolution processing, and acquiring a first processing result, wherein the encoder comprises a third convolution module and a residual error module.
Optionally, the decoder includes a converter module, and the training of the semantic segmentation network initial model with the sample data set to obtain a target model further includes:
acquiring an output result of the encoder, and performing first up-sampling processing on the output result to acquire a second processing result;
and after the second processing result is input into the converter module, performing second convolution processing on the second processing result by adopting a non-square sliding window to obtain a target semantic segmentation result, wherein the converter module is constructed based on a multi-head attention submodule and a multilayer perceptron submodule.
Optionally, the encoder further includes a sub-encoder and a segmentation head network, and the training of the semantic segmentation network initial model with the sample data set to obtain a target model further includes:
and acquiring a processing result of the second convolution processing, performing second up-sampling on the processing result, and sequentially inputting the processing result into the sub-encoder and the segmentation head network to acquire a target semantic segmentation result.
Optionally, the mathematical expression of the loss function L of the target model is:

L = λ1·L1 + λ2·L2 + λ3·L3

W_c = f_t / f_c

L1 = −Σ_c W_c · y_c · log(ŷ_c)

L2 = (1/C) · Σ_c ΔJ̄_c(m(c))

m_i(c) = ŷ_i(c) if c = y_i(c), and m_i(c) = 1 − ŷ_i(c) otherwise

y_pd^b = pool(1 − y_pd, θ0) − (1 − y_pd), y_gt^b = pool(1 − y_gt, θ0) − (1 − y_gt)

P_C = Σ(y_pd^b · y_gt^b) / Σ y_pd^b, R_C = Σ(y_pd^b · y_gt^b) / Σ y_gt^b

L3 = 1 − 2·P_C·R_C / (P_C + R_C)

wherein L is the loss function of the semantic segmentation network; L1 is the first loss function and λ1 its weight; L2 is the second loss function and λ2 its weight; L3 is the third loss function and λ3 its weight; C is the total number of object classes corresponding to the three-dimensional point cloud data and c is the class label; f_t is the median of all class frequencies, f_c is the frequency of class c, and W_c is the class weight of class c; i is the index of a pixel, y_c is the true value and ŷ_c the predicted value; ΔJ̄_c is the Lovász extension of the Jaccard index; m_i(c) is the probability function of the i-th pixel for class c, ŷ_i(c) is the predicted probability of the i-th pixel for class c, and y_i(c) is the true probability of the i-th pixel for class c; y_pd is the predicted boundary map and y_gt the ground truth of class c; P_C is the precision of the predicted boundary map with respect to the class-c ground truth y_gt, and R_C is its recall; θ0 is the size of the sliding window, and pool(·) denotes max pooling over a sliding window of size θ0.
The invention also provides a semantic segmentation system for automatic driving, which comprises:
the mapping module is used for acquiring three-dimensional point cloud data, mapping the three-dimensional point cloud data into a two-dimensional depth map comprising a plurality of channel data, and forming a sample data set according to the plurality of channel data;
the target model establishing module is used for establishing a semantic segmentation network initial model comprising a first model and a second model, training the semantic segmentation network initial model by adopting the sample data set and acquiring a target model, wherein the second model comprises an encoder and a decoder;
and the target result acquisition module is used for acquiring target three-dimensional point cloud data, mapping the target three-dimensional point cloud data into a target two-dimensional depth map, inputting the target two-dimensional depth map into the target model and acquiring a target semantic segmentation result, and the mapping module, the target model establishment module and the target result acquisition module are connected.
The present invention also provides an electronic device comprising: a processor and a memory;
the memory is configured to store a computer program, and the processor is configured to execute the computer program stored by the memory to cause the electronic device to perform the semantic segmentation method for autonomous driving.
The present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the semantic segmentation method for automatic driving as described above.
As described above, the present invention provides a semantic segmentation method, system, electronic device, and medium for automatic driving, with the following advantages: first, the acquired three-dimensional point cloud data is mapped into a two-dimensional depth map comprising a plurality of channel data, and a high-performance semantic segmentation network model is constructed on this basis; then, target point cloud data is acquired and mapped into a target two-dimensional depth map; finally, the target two-dimensional depth map is input into the trained semantic segmentation network model to obtain the target semantic segmentation result. By mapping the three-dimensional point cloud data into a two-dimensional depth map, a spherical projection strategy converts the three-dimensional point cloud segmentation problem into a two-dimensional depth map segmentation problem, so that the excellent performance of two-dimensional convolutional neural networks can be fully exploited to reduce computational complexity and memory requirements while also improving the three-dimensional semantic segmentation effect. This solves the problem of low semantic segmentation performance on three-dimensional point clouds in the prior art, enables accurate and efficient understanding of the three-dimensional scene, and provides technical support for automatic driving.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed for the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.
FIG. 1 is a flow chart of a semantic segmentation method for automatic driving according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a feature extraction model in an embodiment of the invention;
FIG. 3 is a schematic diagram of an encoder according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a converter module according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a semantic segmentation system for automatic driving according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of an electronic device in an embodiment of the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.
In order to explain the technical means of the present invention, the following description will be given by way of specific examples.
Fig. 1 is a flow chart illustrating a semantic segmentation method for automatic driving according to an embodiment of the present invention.
As shown in fig. 1, the semantic segmentation method for automatic driving includes steps S110 to S130:
s110, acquiring three-dimensional point cloud data, mapping the three-dimensional point cloud data into a two-dimensional depth map comprising a plurality of channel data, and forming a sample data set according to the plurality of channel data;
s120, constructing a semantic segmentation network initial model comprising a first model and a second model, training the semantic segmentation network initial model by adopting the sample data set, and acquiring a target model, wherein the second model comprises an encoder and a decoder;
s130, acquiring target three-dimensional point cloud data, mapping the target three-dimensional point cloud data into a target two-dimensional depth map, inputting the target two-dimensional depth map into the target model, and acquiring a target semantic segmentation result.
In step S110 of this embodiment, the three-dimensional point cloud data may be point cloud data scanned by a radar in an automatic driving task. Mapping the three-dimensional point cloud data into a two-dimensional depth map adopts a spherical projection strategy to convert the three-dimensional point cloud segmentation problem into a two-dimensional depth map segmentation problem, so that the excellent performance of two-dimensional convolutional neural networks can be fully exploited to reduce computational complexity and memory requirements, thereby enabling accurate and efficient understanding of the three-dimensional scene and providing technical support for automatic driving. The mathematical expression for mapping the three-dimensional point cloud data into a two-dimensional depth map is:

u = (1/2)·[1 − arctan2(y, x)/π]·W

v = [1 − (arcsin(z/d) + f_down)/(f_up + f_down)]·H

d = √(x² + y² + z²)

wherein (x, y, z) are the coordinates of a point in the point cloud, (u, v) are the coordinates in the two-dimensional depth map, W is the width of the two-dimensional depth map, H is the height of the two-dimensional depth map, d is the depth of each point, f_up denotes the upper limit of the vertical field of view, and f_down denotes the lower limit of the vertical field of view;
the two-dimensional depth map comprises a plurality of channel data, the mathematical expression of channel data m being:
m=(H,W,n);
wherein n is one of x, y, z, d and r, and r is the reflection intensity information; further, n may be one of xyz, d and r, where xyz denotes the three coordinates of a point in the point cloud.
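For illustration, the projection above can be sketched in NumPy as follows; the 64 × 1024 resolution and the vertical field-of-view bounds are illustrative values (the resolution matches the SemanticKITTI setting reported later in this description), and the function interface is an assumption rather than the patent's code:

```python
import numpy as np

def spherical_projection(points, H=64, W=1024, fov_up_deg=3.0, fov_down_deg=-25.0):
    """Map an (N, 4) point cloud [x, y, z, r] to a (5, H, W) depth map.

    Sketch of u = 1/2*(1 - arctan2(y, x)/pi)*W and
    v = (1 - (arcsin(z/d) + f_down)/(f_up + f_down))*H described above.
    """
    fov_up = abs(fov_up_deg) / 180.0 * np.pi
    fov_down = abs(fov_down_deg) / 180.0 * np.pi
    fov = fov_up + fov_down

    x, y, z, r = points[:, 0], points[:, 1], points[:, 2], points[:, 3]
    d = np.maximum(np.linalg.norm(points[:, :3], axis=1), 1e-8)  # depth of each point

    u = 0.5 * (1.0 - np.arctan2(y, x) / np.pi) * W               # horizontal pixel
    v = (1.0 - (np.arcsin(z / d) + fov_down) / fov) * H          # vertical pixel
    u = np.clip(np.floor(u), 0, W - 1).astype(np.int64)
    v = np.clip(np.floor(v), 0, H - 1).astype(np.int64)

    # Five channels (x, y, z, d, r); pixels hit by no point keep the fill value -1.
    depth_map = np.full((5, H, W), -1.0, dtype=np.float32)
    order = np.argsort(d)[::-1]          # write farthest first, so the closest point wins
    for ch, val in enumerate((x, y, z, d, r)):
        depth_map[ch, v[order], u[order]] = val[order]
    return depth_map
```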
In step S120 of this embodiment, the first model comprises a feature extraction model and a feature fusion model. The feature extraction model comprises at least two first components and one second component; a first component comprises a first convolution module and a residual module, a second component comprises a second convolution module and a residual module, and the feature fusion model is a spatial channel attention module. A specific implementation of training the semantic segmentation network initial model with the sample data set to obtain the target model is as follows: the sample data set is input into the feature extraction model, the features of each channel's data in the sample data set are extracted separately, and the channel features of each channel are obtained; the channel features of each channel are then input into the feature fusion model, which performs feature fusion processing to obtain fused features.
In an embodiment, the projected two-dimensional depth image contains different channel data corresponding to different modality information, whose values follow different distributions and therefore contribute differently to feature learning. Extracting the channel features separately and then fusing them allows the feature space of each independent channel to be learned, so that the semantic segmentation network model built on this basis can understand the three-dimensional scene more accurately and efficiently. In the feature extraction model, the input end of the second component is connected with one first component, the output end of the second component is connected with the other first component, and the output end of the feature extraction model is connected with the input end of the feature fusion model. The first convolution module may be formed by connecting convolution layers with several different kernel sizes in parallel and then cascading them with a 1 × 1 convolution layer; in the first component, the first convolution module is connected in parallel with the residual module. In the second component, the second convolution module is a dilated (hole) convolution connected in parallel with the residual module. The residual module is a structure with a 1 × 1 convolution. The fusion model is a spatial channel attention module: channel features are obtained from the independent channel data and then fused, so that local-to-global context features with spatially fine-grained information are output.
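Since the patent describes the fusion model only as a spatial channel attention module without detailing its layers, the following PyTorch sketch uses a common CBAM-style stand-in (channel gating from global pooling, then a spatial mask); the class name, reduction ratio, and 7 × 7 spatial kernel are assumptions:

```python
import torch
import torch.nn as nn

class SpatialChannelAttention(nn.Module):
    """Stand-in for the spatial channel attention fusion: re-weights the
    concatenated per-channel features along the channel axis, then spatially."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3), nn.Sigmoid())

    def forward(self, x):                 # x: concatenated channel features
        x = x * self.channel_gate(x)      # channel attention
        return x * self.spatial_gate(x)   # spatial attention

# usage sketch: fused = SpatialChannelAttention(total_ch)(torch.cat(feats, dim=1))
```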
In a specific embodiment, a schematic structural diagram of the feature extraction model is shown in fig. 2. The feature extraction model is composed of a first component 1, a second component 2 and a first component 3. The first component 1 comprises a first convolution module 11 and a residual module 12, where the first convolution module 11 comprises a convolution layer 111 with a 3 × 3 kernel, a convolution layer 112 with a 5 × 5 kernel, a convolution layer 113 with a 7 × 7 kernel, and a convolution layer 114 with a 1 × 1 kernel. The second component 2 comprises a second convolution module 21 and a residual module 22, where the second convolution module 21 comprises a convolution layer 211 with a 3 × 3 kernel and a dilated convolution 212 with a 3 × 3 kernel and a dilation rate of 2. The first component 3 comprises a first convolution module 31 and a residual module 32, where the first convolution module 31 comprises a convolution layer 311 with a 3 × 3 kernel, a convolution layer 312 with a 5 × 5 kernel, a convolution layer 313 with a 7 × 7 kernel, and a convolution layer 314 with a 1 × 1 kernel. In the figure, C denotes concatenation (cascade) and + denotes residual addition.
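The first component of fig. 2 translates into a short PyTorch sketch as follows; channel counts, the absence of activations, and the class name are illustrative assumptions, not the patent's implementation:

```python
import torch
import torch.nn as nn

class FirstComponent(nn.Module):
    """Parallel 3x3 / 5x5 / 7x7 convolutions cascaded (C) into a 1x1
    convolution, in parallel with a 1x1 residual branch (+), following
    the first component of fig. 2."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, k, padding=k // 2) for k in (3, 5, 7)])
        self.fuse = nn.Conv2d(3 * out_ch, out_ch, 1)   # 1x1 after the cascade
        self.residual = nn.Conv2d(in_ch, out_ch, 1)    # 1x1 residual module

    def forward(self, x):
        cascade = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.fuse(cascade) + self.residual(x)   # residual addition
```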
In step S120 of this embodiment, the specific implementation of training the semantic segmentation network initial model with the sample data set to obtain the target model further includes: acquiring the output result of the first model and inputting it into the encoder for the first convolution processing to obtain a first processing result, wherein the encoder comprises a third convolution module and a residual module. Specifically, referring to fig. 3, the encoder 4 comprises a third convolution module 41, a residual module 42, and a residual module 43, where each residual module has a 1 × 1 convolution structure. The third convolution module comprises a convolution layer 411 with a 3 × 3 kernel, a convolution layer 412 with a 3 × 3 kernel, and a dilated convolution 413 with a 3 × 3 kernel, where the dilation rate is 2. With this design, the encoder can fully exploit multi-level, multi-scale and multi-channel information.
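A corresponding sketch of the fig. 3 encoder block, with the two 1 × 1 residual modules collapsed into a single skip branch for brevity (channel counts and activations assumed):

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Two 3x3 convolutions followed by a dilated 3x3 convolution
    (dilation rate 2), with a 1x1 convolution residual branch."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=2, dilation=2))
        self.skip = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return self.conv(x) + self.skip(x)
```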
In one implementation, the two-dimensional depth map suffers from a missing-point problem, which this embodiment defines as missing noise; it stems mainly from sensor limitations, specular reflection, occlusion by the objects themselves, and the spherical projection. Such noise negatively affects the performance of a convolutional neural network model. To address this, the present embodiment uses the context aggregation module from the SqueezeSegV2 method and constructs a residual context aggregation module after the base encoder module to enhance the robustness of the context features to such noise.
In an embodiment, the decoder includes a converter module that is built based on a multi-head attention submodule and a multi-layered perceptron submodule. The implementation method for training the initial model of the semantic segmentation network by adopting the sample data set and acquiring the target model comprises the following steps: acquiring an output result of the encoder, and performing first up-sampling processing on the output result to acquire a second processing result; and after the second processing result is input into the converter module, performing second convolution processing on the second processing result by adopting a non-square sliding window to obtain a target semantic segmentation result.
In a specific embodiment, a schematic structural diagram of the converter module is shown in fig. 4, where LN denotes Layer Normalization, MSA denotes Multi-head Self-Attention, SMSA denotes Shifted-window Multi-head Self-Attention, and MLP denotes Multi-Layer Perceptron. y_l is the output of the multi-head self-attention module of the l-th module and z_l is the output of the multi-layer perceptron module of the l-th module; y_{l-1} and z_{l-1} are the corresponding outputs of the (l-1)-th module, and y_{l+1} and z_{l+1} those of the (l+1)-th module. The converter module is formalized as:

y_l = MSA(LN(z_{l-1})) + z_{l-1}

z_l = MLP(LN(y_l)) + y_l

y_{l+1} = SMSA(LN(z_l)) + z_l

z_{l+1} = MLP(LN(y_{l+1})) + y_{l+1}
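These residual sub-steps map directly to code. The following PyTorch sketch implements one attention + MLP pair with the generic nn.MultiheadAttention as a stand-in, omitting the (shifted) window partitioning for brevity; dimensions and head counts are assumptions:

```python
import torch.nn as nn

class ConverterBlock(nn.Module):
    """One attention + MLP pair: y_l = MSA(LN(z_{l-1})) + z_{l-1},
    z_l = MLP(LN(y_l)) + y_l. A full converter module alternates MSA
    with shifted-window attention (SMSA) inside non-square windows."""
    def __init__(self, dim, heads=4, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim))

    def forward(self, z):                                   # z: (B, N, dim) tokens
        h = self.norm1(z)
        y = self.attn(h, h, h, need_weights=False)[0] + z   # y_l
        return self.mlp(self.norm2(y)) + y                  # z_l
```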
in an embodiment, compared with a common two-dimensional color image, the depth image has a width much larger than its height, and this embodiment defines it as a size imbalance problem, and in order to solve this problem, this embodiment adopts a non-square sliding window instead of a square sliding window to adapt to the depth image size imbalance problem. To enable better feature learning, each decoder adds four stacked converter modules after the upsampling layer. Converter modules with different resolutions are used to build a hierarchical decoder structure for learning multi-scale information. Specifically, the implementation method of the balanced non-square conversion module comprises the following steps: and performing 2 times of upsampling processing on a first processing result (characteristic diagram) of the encoder by the encoder, and then connecting the upsampling processing result with the corresponding characteristic diagram output by the encoder to complete the fusion of multi-scale information. Finally, the number of channels is adjusted using the linear layer. The whole encoder stage consists of three equal non-square converter modules of different scales.
In one embodiment, although the balanced non-square converter module can achieve superior performance, it has difficulty decoding feature maps of larger resolution. Therefore, a convolution-based decoder module is added at the end of the decoder flow to solve this problem. Specifically, the encoder further includes a sub-encoder and a segmentation head network, and the specific implementation of training the semantic segmentation network initial model with the sample data set to obtain the target model further includes: acquiring the processing result of the second convolution processing, performing a second upsampling on it, and inputting the result sequentially into the sub-encoder and the segmentation head network to obtain the target semantic segmentation result.
In step S130 of this embodiment, the target three-dimensional point cloud data may be point cloud data scanned by a radar during automatic driving. The target three-dimensional point cloud data is mapped into a target two-dimensional depth map using the same method as in step S110, and the target two-dimensional depth map is then input into the target model to obtain the target semantic segmentation result. Mapping the target three-dimensional point cloud data into a target two-dimensional depth map converts the three-dimensional point cloud segmentation problem into a two-dimensional depth map segmentation problem via the spherical projection strategy, so that the excellent performance of two-dimensional convolutional neural networks can be fully exploited to reduce computational complexity and memory requirements while also improving the three-dimensional semantic segmentation effect. Inputting the target two-dimensional depth map into the semantic segmentation network model to obtain the target semantic segmentation result thus solves the problem of low semantic segmentation performance on three-dimensional point clouds in the prior art, enables accurate and efficient understanding of the three-dimensional scene, and provides technical support for automatic driving.
In one embodiment, the mathematical expression of the loss function L of the semantic segmentation network model is:

L = λ1·L1 + λ2·L2 + λ3·L3

W_c = f_t / f_c

L1 = −Σ_c W_c · y_c · log(ŷ_c)

L2 = (1/C) · Σ_c ΔJ̄_c(m(c))

m_i(c) = ŷ_i(c) if c = y_i(c), and m_i(c) = 1 − ŷ_i(c) otherwise

y_pd^b = pool(1 − y_pd, θ0) − (1 − y_pd), y_gt^b = pool(1 − y_gt, θ0) − (1 − y_gt)

P_C = Σ(y_pd^b · y_gt^b) / Σ y_pd^b, R_C = Σ(y_pd^b · y_gt^b) / Σ y_gt^b

L3 = 1 − 2·P_C·R_C / (P_C + R_C)

wherein L is the loss function of the semantic segmentation network; L1 is the first loss function and λ1 its weight; L2 is the second loss function and λ2 its weight; L3 is the third loss function and λ3 its weight; C is the total number of object classes corresponding to the three-dimensional point cloud data and c is the class label; f_t is the median of all class frequencies, f_c is the frequency of class c, and W_c is the class weight of class c; i is the index of a pixel, y_c is the true value and ŷ_c the predicted value; ΔJ̄_c is the Lovász extension of the Jaccard index; m_i(c) is the probability function of the i-th pixel for class c, ŷ_i(c) is the predicted probability of the i-th pixel for class c, and y_i(c) is the true probability of the i-th pixel for class c; y_pd is the predicted boundary map and y_gt the ground truth of class c; P_C is the precision of the predicted boundary map with respect to the class-c ground truth y_gt, and R_C is its recall; θ0 is the size of the sliding window, and pool(·) denotes max pooling over a sliding window of size θ0.
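A hedged PyTorch sketch of two of the three terms follows: the median-frequency class weights W_c and the boundary term L3 with its max-pooling boundary extraction (θ0 is assumed odd so the pooled map keeps its spatial size). The Lovász-Softmax term L2 is omitted for brevity, and the function names are illustrative, not the patent's:

```python
import torch
import torch.nn.functional as F

def class_weights(freq):
    """W_c = f_t / f_c with f_t the median of all class frequencies."""
    freq = torch.as_tensor(freq, dtype=torch.float64)
    return freq.median() / freq

def boundary_map(y, theta0=3):
    """Boundary extraction via max pooling: pool(1 - y, theta0) - (1 - y)."""
    inv = 1.0 - y
    return F.max_pool2d(inv, theta0, stride=1, padding=(theta0 - 1) // 2) - inv

def boundary_loss(pred, gt, theta0=3, eps=1e-7):
    """L3 = 1 - 2*P_c*R_c / (P_c + R_c), averaged over classes.
    pred: (B, C, H, W) softmax probabilities; gt: (B, C, H, W) one-hot."""
    b_pd, b_gt = boundary_map(pred, theta0), boundary_map(gt, theta0)
    inter = (b_pd * b_gt).sum(dim=(2, 3))
    p_c = inter / (b_pd.sum(dim=(2, 3)) + eps)   # precision of the boundary map
    r_c = inter / (b_gt.sum(dim=(2, 3)) + eps)   # recall of the boundary map
    return (1.0 - 2.0 * p_c * r_c / (p_c + r_c + eps)).mean()
```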
The following is a description of a specific embodiment:
example 1
(1) Selection of data sets
In order to verify the performance of the semantic segmentation network model of this embodiment, the three-dimensional point cloud data of this embodiment comes from the large-scale SemanticKITTI dataset and the small-scale SemanticPOSS dataset. The SemanticKITTI dataset provides dense point-wise semantic annotations for the full 360-degree scans of the KITTI Odometry Benchmark. The dataset contains sequences 00 to 21 with about 43000 scans; the roughly 21000 scans of sequences 00 to 10 are used for training, with sequence 08 used for validation, and sequences 11 to 21 are used for testing. SemanticPOSS is a small-scale dataset created by Peking University that contains 2988 complex scenes with a large number of dynamic objects; it follows the same data format as SemanticKITTI. The dataset has 6 parts, of which parts 2 and 3 are used for testing and the remainder for training.
(2) Performance evaluation metrics
In order to evaluate the performance of the semantic segmentation network and make a fair comparison with mainstream methods, this embodiment adopts the standard mean intersection-over-union (mIoU) as the performance metric, defined as follows:

mIoU = (1/N) · Σ_c TP_c / (TP_c + FP_c + FN_c)

wherein mIoU is the performance metric, N is the total number of object classes corresponding to the three-dimensional point cloud data, c is the class label, and TP_c, FP_c and FN_c denote the number of true positives, false positives and false negatives of class c, respectively.
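The metric computes directly from predicted and ground-truth label maps; a small NumPy sketch (the interface is assumed):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """mIoU = (1/N) * sum_c TP_c / (TP_c + FP_c + FN_c), over observed classes."""
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        if tp + fp + fn > 0:
            ious.append(tp / (tp + fp + fn))
    return float(np.mean(ious))
```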
(3) Implementation details
In this embodiment, the TransRVNet framework is implemented using PyTorch, and all models are optimized with the stochastic gradient descent algorithm, where the momentum is set to 0.9 and the weight decay to 0.0001. During training, the data is augmented with random rotation, random point dropping, and random sign flipping of the X and Y values to avoid overfitting. For the two datasets, the corresponding settings are as follows:
for the SemanticKITTI dataset, the model of this example uses a training block of size 4 to train the model for 100 rounds, with an initial learning rate set to 0.005 and a decay of the learning rate of 0.01 after each round of training is completed. The projected depth map is set to a height of 64 and a width of 1024. The sliding window size of the converter modules in the model is set to [4,64 ]. For the K nearest neighbor post-processing stage, a window with the size of 7 is adopted to search the neighborhood, and the truncation is 2 meters.
For the SemanticPOSS dataset, the initial learning rate is set to 0.0025 and is dynamically adjusted with a cosine annealing strategy. The batch size is 2 and training runs for 50 epochs. The depth map has a height of 64 and a width of 1600. The sliding window size of the converter is set to [4, 100]. Because SemanticPOSS is sparser, the window size is set to 11 and the cut-off to 5 meters in the post-processing stage.
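For reference, this optimization setup maps onto PyTorch as follows; `model` is assumed to be the network instance, and the momentum value 0.9 is the standard reading of the text above:

```python
import torch

# SGD with momentum 0.9 and weight decay 0.0001, as described above.
optimizer = torch.optim.SGD(model.parameters(), lr=0.0025,
                            momentum=0.9, weight_decay=0.0001)
# Cosine annealing over the 50 SemanticPOSS training epochs; the
# SemanticKITTI run instead starts at lr 0.005 with per-epoch decay.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
```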
(4) Results
Quantitative results: Table 1 shows the quantitative comparison between the TransRVNet of this example and mainstream methods on the SemanticKITTI dataset. The mainstream methods are point-based methods and image-based methods. From the experimental results, the following conclusions can be drawn:
it is more desirable to obtain the value of the split mlou using the depth map of size 64 × 2048 as input than using 64 × 1024. This is mainly because the larger depth map means that the learning ability of the network to segment small-scale and fine-grained objects can be improved. In addition, large scale input also means that more points are back projected into the three-dimensional point cloud. The TransRVNet of this embodiment achieves the best performance over not only the point-based method but also the image-based method. It is further noted that the performance of the TransRVNet method in this embodiment is far better than that of the image-based reference method RangeNet + +, and the mliou score is improved by 12.7%, even by 8.5% over that of RangeNet + + when a 64 × 2048 depth map is used as input.
Table 2 shows the comparison between the TransRVNet of this example and mainstream methods on sequence 02 of SemanticPOSS, and Table 3 shows the comparison on sequence 03 of SemanticPOSS. From the tables, it is clear that TransRVNet achieves the best performance on both test sequences. On sequence 02, it exceeds the previous mainstream methods RangeNet++ and MINet by 16.6% and 4.3%, respectively. On sequence 03, it improves over RandLA-Net and MinkNet by 11.6% and 6.9%, respectively. The mIoU scores of all methods are lower here because, compared with the SemanticKITTI dataset, SemanticPOSS is smaller in size and its data is sparser.
TABLE 1 Quantitative comparison of the TransRVNet of this example and mainstream methods on the SemanticKITTI dataset
TABLE 2 Comparison of the TransRVNet of this example with mainstream methods on sequence 02 of SemanticPOSS

TABLE 3 Comparison of the TransRVNet of this example with mainstream methods on sequence 03 of SemanticPOSS
(5) Ablation experiment
To further understand the TransRVNet of this embodiment, ablation experiments are performed to examine various aspects of the network design, including the effects of the network structure, the sliding window size, and the first and second models on segmentation performance. All experiments are performed on SemanticKITTI, with sequence 08 used as the test set and the other sequences used for training.
Network structure: In order to explore the performance of the convolutional neural network-converter structure adopted in this embodiment on the semantic segmentation task, the following comparative experiment is performed. Based on the TransRVNet structure, two converter-convolutional neural network models are designed, in which all converter decoding modules are replaced by the convolutional decoder modules used in this embodiment, and a high-performance sliding-window converter module and a visual converter module are respectively adopted as the encoder. The specific parameters are set as follows:
1) The query dimension of each head in the multi-head self-attention mechanism is set to 32, and each multi-layer perceptron uses an expansion ratio of 4. 2) For the sliding-window converter module, the number of channels in the first stage is set to 96, and the number of converter modules in each encoder module is set to {2, 2, 6, 2}. 3) For the visual converter, the number of channels in the first stage is set to 64, and the number of converter modules in each encoder module is set to {4, 4, 4, 4}.
Table 4 gives a comparison between these converter-convolutional neural network structures and the convolutional neural network-converter structure of the present invention. From the accuracy perspective, the convolutional neural network-balanced non-square converter structure of this embodiment achieves the best performance and is better suited to the radar semantic segmentation task.
TABLE 4 Impact of the network structure on semantic segmentation performance
Effect of window size: As discussed above, the non-square sliding window strategy designed in this embodiment is better suited to processing depth images. It is a crucial design choice of TransRVNet, affecting not only segmentation performance but also the parameter count and computational complexity. Table 5 compares the performance obtained with different window sizes. Clearly: 1) semantic segmentation performance improves continuously as the sliding window grows, mainly because a larger window means a larger receptive field, so a wider range of context information can be aggregated; a large window, however, entails high computation and memory consumption. 2) Non-square windows achieve better performance than square windows. For example, the window sizes [4,8], [4,16] and [4,32] improve the mIoU score over the corresponding square window [4,4] by 0.2%, 0.9% and 1.2%, respectively. This further indicates that non-square sliding windows are better suited to handling depth images with unbalanced dimensions and are the more desirable selection strategy.
TABLE 5 Effect of the sliding window size
Effect of the feature extraction model (multi-residual context learning module) and the feature fusion model (residual context aggregation module): Table 6 gives the ablation experiments on these two modules. In the experiment, the two modules are added to the backbone network one by one to verify the effectiveness of each. The experimental results show that adding only the multi-residual context learning module or only the residual context aggregation module raises the performance of the TransRVNet network to 57.2% and 56.6%, exceeding the backbone network by 2.2% and 1.6%, respectively. Adding both to the backbone network improves overall performance by 3.1%. On the other hand, to further verify the superiority of the multi-residual context learning module, Table 7 gives the mIoU performance over several categories (e.g., pedestrian, trunk, truck, building, fence, parking, etc.) after adding this module. Compared with the backbone network, the network model with the multi-residual context learning module improves markedly in all categories. This further illustrates that the module's ability to capture multi-scale context information at the channel level and to model inter-channel and intra-channel feature dependencies plays a crucial role in improving radar segmentation performance.
TABLE 6 Effect of the feature extraction model and the feature fusion model
TABLE 7 Impact analysis of the multi-residual context learning module
Based on the same inventive concept as the semantic segmentation method for automatic driving, correspondingly, the embodiment also provides a semantic segmentation system. In this embodiment, the semantic segmentation system executes the semantic segmentation method described in any of the above embodiments, and specific functions and technical effects are described with reference to the above embodiments, which are not described herein again.
Fig. 5 is a schematic structural diagram of a semantic segmentation system for automatic driving according to the present invention.
As shown in fig. 5, the semantic segmentation system for automatic driving includes: a mapping module 51, a target model establishing module 52, and a target result obtaining module 53.
The mapping module is used for acquiring three-dimensional point cloud data, mapping the three-dimensional point cloud data into a two-dimensional depth map comprising a plurality of channel data, and forming a sample data set according to the plurality of channel data;
the target model establishing module is used for establishing a semantic segmentation network initial model comprising a first model and a second model, training the semantic segmentation network initial model by adopting the sample data set and acquiring a target model, wherein the second model comprises an encoder and a decoder;
and the target result acquisition module is used for acquiring target three-dimensional point cloud data, mapping the target three-dimensional point cloud data into a target two-dimensional depth map, inputting the target two-dimensional depth map into the target model and acquiring a target semantic segmentation result.
The present embodiment also provides a computer-readable storage medium on which a computer program is stored, which when executed by a processor implements any of the methods in the present embodiments.
In an embodiment, referring to fig. 6, the embodiment further provides an electronic device 600, which includes a memory 601, a processor 602, and a computer program stored on the memory and executable on the processor, and when the processor 602 executes the computer program, the steps of the method according to any one of the above embodiments are implemented.
The computer-readable storage medium in the present embodiment can be understood by those skilled in the art as follows: all or part of the steps for implementing the above method embodiments may be performed by hardware associated with a computer program. The aforementioned computer program may be stored in a computer readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
The electronic device provided by this embodiment comprises a processor, a memory, a transceiver and a communication interface. The memory and the communication interface are connected with the processor and the transceiver to enable mutual communication; the memory stores a computer program, the communication interface is used for communication, and the processor and the transceiver run the computer program to cause the electronic device to execute the steps of the above method.
In this embodiment, the memory may include random access memory (RAM) and may also include non-volatile memory, such as at least one disk storage.
The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In the above-described embodiments, references in the specification to "the present embodiment," "an embodiment," "another embodiment," "in some exemplary embodiments," or "other embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiments is included in at least some embodiments, but not necessarily all embodiments. The various appearances of the phrase "the present embodiment," "one embodiment," or "another embodiment" are not necessarily all referring to the same embodiment.
In the embodiments described above, although the present invention has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those skilled in the art in light of the foregoing description. For example, other memory structures (e.g., dynamic RAM (DRAM)) may use the discussed embodiments. The embodiments of the invention are intended to embrace all such alternatives, modifications and variances that fall within the broad scope of the appended claims.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The invention is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The foregoing embodiments are merely illustrative of the principles of the present invention and its efficacy, and are not to be construed as limiting the invention. Any person skilled in the art can modify or change the above-mentioned embodiments without departing from the spirit and scope of the present invention. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical spirit of the present invention be covered by the claims of the present invention.

Claims (10)

1. A semantic segmentation method for automatic driving, characterized by comprising:
acquiring three-dimensional point cloud data, mapping the three-dimensional point cloud data into a two-dimensional depth map comprising a plurality of channel data, and forming a sample data set according to the plurality of channel data;
constructing a semantic segmentation network initial model comprising a first model and a second model, training the semantic segmentation network initial model by adopting the sample data set, and acquiring a target model, wherein the second model comprises an encoder and a decoder;
and acquiring target three-dimensional point cloud data, mapping the target three-dimensional point cloud data into a target two-dimensional depth map, inputting the target two-dimensional depth map into the target model, and acquiring a target semantic segmentation result.
2. The semantic segmentation method for automatic driving according to claim 1, wherein the first model comprises a feature extraction model and a feature fusion model, and the training of the semantic segmentation network initial model with the sample data set to obtain a target model comprises:
inputting the sample data set into the feature extraction model, respectively extracting features of each channel data in the sample data set, and obtaining the channel features of each channel, wherein the feature extraction model at least comprises two first components and one second component, the first components comprise a first convolution module and a residual error module, and the second components comprise a second convolution module and a residual error module.
3. The semantic segmentation method for automatic driving according to claim 2, wherein training the initial semantic segmentation network model by using the sample data set, obtaining a target model, further comprises:
and inputting the channel features of each channel into the feature fusion model, which performs feature fusion processing to obtain fused features, wherein the feature fusion model is a spatial channel attention module.
4. The semantic segmentation method for automatic driving according to claim 1, wherein training the initial semantic segmentation network model by using the sample data set to obtain a target model comprises:
and acquiring an output result of the first model, inputting the output result into the encoder to perform first convolution processing, and acquiring a first processing result, wherein the encoder comprises a third convolution module and a residual error module.
5. The semantic segmentation method for automatic driving according to claim 4, wherein the decoder comprises a converter module, and the training of the semantic segmentation network initial model with the sample data set to obtain a target model further comprises:
acquiring an output result of the encoder, and performing first up-sampling processing on the output result to acquire a second processing result;
and after the second processing result is input into the converter module, performing second convolution processing on the second processing result by adopting a non-square sliding window to obtain a target semantic segmentation result, wherein the converter module is constructed based on a multi-head attention submodule and a multilayer perceptron submodule.
6. The semantic segmentation method for automatic driving according to claim 5, wherein the encoder further comprises a sub-encoder and a segmentation head network, and the training of the initial model of the semantic segmentation network with the sample data set to obtain a target model further comprises:
and acquiring a processing result of the second convolution processing, performing second up-sampling on the processing result, and sequentially inputting the processing result into the sub-encoder and the segmentation head network to acquire a target semantic segmentation result.
7. The semantic segmentation method for automatic driving according to any one of claims 1 to 6, characterized in that the mathematical expression of the loss function L of the target model is:

L = λ1·L1 + λ2·L2 + λ3·L3

W_c = f_t / f_c

L1 = −Σ_c W_c · y_c · log(ŷ_c)

L2 = (1/C) · Σ_c ΔJ̄_c(m(c))

m_i(c) = ŷ_i(c) if c = y_i(c), and m_i(c) = 1 − ŷ_i(c) otherwise

y_pd^b = pool(1 − y_pd, θ0) − (1 − y_pd), y_gt^b = pool(1 − y_gt, θ0) − (1 − y_gt)

P_C = Σ(y_pd^b · y_gt^b) / Σ y_pd^b, R_C = Σ(y_pd^b · y_gt^b) / Σ y_gt^b

L3 = 1 − 2·P_C·R_C / (P_C + R_C)

wherein L is the loss function of the semantic segmentation network; L1 is the first loss function and λ1 its weight; L2 is the second loss function and λ2 its weight; L3 is the third loss function and λ3 its weight; C is the total number of object classes corresponding to the three-dimensional point cloud data and c is the class label; f_t is the median of all class frequencies, f_c is the frequency of class c, and W_c is the class weight of class c; i is the index of a pixel, y_c is the true value and ŷ_c the predicted value; ΔJ̄_c is the Lovász extension of the Jaccard index; m_i(c) is the probability function of the i-th pixel for class c, ŷ_i(c) is the predicted probability of the i-th pixel for class c, and y_i(c) is the true probability of the i-th pixel for class c; y_pd is the predicted boundary map and y_gt the ground truth of class c; P_C is the precision of the predicted boundary map with respect to the class-c ground truth y_gt, and R_C is its recall; θ0 is the size of the sliding window, and pool(·) denotes max pooling over a sliding window of size θ0.
8. A semantic segmentation system for automatic driving, comprising:
the mapping module is used for acquiring three-dimensional point cloud data, mapping the three-dimensional point cloud data into a two-dimensional depth map comprising a plurality of channel data, and forming a sample data set according to the plurality of channel data;
the target model establishing module is used for establishing a semantic segmentation network initial model comprising a first model and a second model, training the semantic segmentation network initial model by adopting the sample data set and acquiring a target model, wherein the second model comprises an encoder and a decoder;
and the target result acquisition module is used for acquiring target three-dimensional point cloud data, mapping the target three-dimensional point cloud data into a target two-dimensional depth map, inputting the target two-dimensional depth map into the target model and acquiring a target semantic segmentation result, and the mapping module, the target model establishment module and the target result acquisition module are connected.
9. An electronic device comprising a processor, a memory, and a communication bus;
the communication bus is used for connecting the processor and the memory;
the processor is configured to execute a computer program stored in the memory to implement the method of any one of claims 1-7.
10. A computer-readable storage medium, having stored thereon a computer program for causing a computer to perform the method of any one of claims 1-7.
CN202111361700.XA 2021-10-18 2021-11-17 Semantic segmentation method, system, electronic device and medium for automatic driving Pending CN114022858A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2021112116812 2021-10-18
CN202111211681 2021-10-18

Publications (1)

Publication Number Publication Date
CN114022858A true CN114022858A (en) 2022-02-08

Family

ID=80064946

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111361700.XA Pending CN114022858A (en) 2021-10-18 2021-11-17 Semantic segmentation method, system, electronic device and medium for automatic driving

Country Status (1)

Country Link
CN (1) CN114022858A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114792372A (en) * 2022-06-22 2022-07-26 广东工业大学 Three-dimensional point cloud semantic segmentation method and system based on multi-head two-stage attention
CN114792372B (en) * 2022-06-22 2022-11-04 广东工业大学 Three-dimensional point cloud semantic segmentation method and system based on multi-head two-stage attention
WO2024001093A1 (en) * 2022-07-01 2024-01-04 北京京东乾石科技有限公司 Semantic segmentation method, environment perception method, apparatus, and unmanned vehicle
WO2024045942A1 (en) * 2022-09-02 2024-03-07 中兴通讯股份有限公司 Ambient information sensing method, apparatus, and system, computer device, and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination