CN116206082A - Semantic scene completion method, system, equipment and storage medium - Google Patents

Semantic scene completion method, system, equipment and storage medium Download PDF

Info

Publication number
CN116206082A
Authority
CN
China
Prior art keywords
network
complement
semantic
frame
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310250553.1A
Other languages
Chinese (zh)
Inventor
侯跃南
夏朝阳
刘有权
李鑫
李怡康
乔宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai AI Innovation Center
Original Assignee
Shanghai AI Innovation Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai AI Innovation Center filed Critical Shanghai AI Innovation Center
Priority to CN202310250553.1A priority Critical patent/CN116206082A/en
Publication of CN116206082A publication Critical patent/CN116206082A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T17/20Finite element generation, e.g. wire-frame surface description, tesselation
    • G06T17/205Re-meshing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Geometry (AREA)
  • Computer Graphics (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application relates to the technical field of three-dimensional scene completion, in particular to a semantic scene completion method, system, device, and storage medium, wherein the semantic scene completion method comprises the following steps: constructing a completion sub-network, and completing the semantic scenes of target objects with different scales based on the completion sub-network, wherein the completion sub-network directly processes the voxel features generated by voxelization to produce sparse voxel features and includes multipath blocks for fusing multi-scale features; constructing a multi-frame teacher network that takes dense point clouds as input, and transferring dense, relation-based semantic knowledge from the multi-frame teacher network to a single-frame student network by a knowledge distillation method, thereby completing the semantic scene of the input single-frame point cloud; and correcting the completion labels with panoramic segmentation labels, thereby completing the semantic scenes of moving objects. The semantic scene completion method provided by the embodiment of the application realizes accurate semantic scene completion of the whole scene.

Description

Semantic scene completion method, system, equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of three-dimensional scene completion, in particular to a semantic scene completion method, system, device, and storage medium.
Background
The semantic scene completion task can be divided into two subtasks, scene completion and semantic segmentation. The purpose of scene completion is to complete a sparse, incomplete input scene into a dense, complete one, and the purpose of semantic segmentation is to further obtain the category of every object in the completed scene. Early work on semantic scene completion focused mainly on indoor scenes, where the point cloud is dense, small in scale, and uniform in density. In contrast, the point cloud in an outdoor scene is sparse, large in scale, and variable in density, which poses a great challenge for semantic scene completion algorithms.
Existing completion methods deduce the geometry and semantics of the scene from a single-frame input point cloud, but a single frame is sparse, incomplete, and may even lack key scene information, making accurate semantic scene completion of the whole scene difficult. Moreover, existing completion methods involve both downsampling and upsampling operations: downsampling inevitably loses information of the original point cloud, while upsampling causes excessive expansion and shape distortion, resulting in serious completion and classification errors for small objects and crowded scenes. The semantic scene completion task for outdoor large-scale traffic scenes therefore presents several challenges: single-frame lidar point clouds are often sparse and incomplete; a single-frame lidar point cloud contains a large number of objects of different scales; and the semantic scene completion labels of existing public datasets suffer from moving-object smear.
Disclosure of Invention
The embodiment of the application provides a semantic scene completion method, system, device, and storage medium for realizing accurate semantic scene completion of the whole scene.
In order to solve the above technical problems, in a first aspect, an embodiment of the present application provides a semantic scene completion method, including the following steps: constructing a completion sub-network, and completing the semantic scenes of target objects with different scales based on the completion sub-network, wherein the completion sub-network directly processes the voxel features generated by the voxelization process to produce sparse voxel features and comprises multipath blocks for fusing multi-scale features; constructing a multi-frame teacher network that takes dense point clouds as input, and transferring dense, relation-based semantic knowledge from the multi-frame teacher network to a single-frame student network by a knowledge distillation method, thereby completing the semantic scene of the input single-frame point cloud; and correcting the completion labels with panoramic segmentation labels, thereby completing the semantic scenes of moving objects.
In some exemplary embodiments, after constructing the completion sub-network and before completing the semantic scenes of target objects of different scales, the method comprises: modifying the segmentation sub-network, wherein the sparse voxel features generated by the completion sub-network are fed into the modified segmentation sub-network to generate voxel-by-voxel segmentation output. Modifying the segmentation sub-network comprises: replacing the cylindrical partition of the segmentation sub-network with a cube partition; and removing the point optimization module in the segmentation sub-network.
In some exemplary embodiments, employing a knowledge distillation method to transfer dense, relation-based semantic knowledge from the multi-frame teacher network to the single-frame student network comprises: computing the pairwise relational knowledge of the student model and the corresponding relational knowledge of the teacher model, respectively; and extracting the pairwise similarity information of each pair of voxel features based on this relational knowledge.
In some exemplary embodiments, the pairwise relationship knowledge of the student model is shown as follows:
$$P^{S}_{ij} = \frac{F_S(i)\,F_S(j)^{\top}}{\lVert F_S(i)\rVert_2\,\lVert F_S(j)\rVert_2}$$

wherein the voxel features of the teacher, the voxel features of the student, and the voxel-feature indices of the teacher and the student are denoted as $F_T \in \mathbb{R}^{N_m \times C_f}$, $F_S \in \mathbb{R}^{N_s \times C_f}$, $I_T \in \mathbb{R}^{N_m \times 3}$, and $I_S \in \mathbb{R}^{N_s \times 3}$, respectively; $N_m$ is the number of non-empty voxel features obtained from the multi-frame input, $N_s$ is the number of non-empty voxel features obtained from the single-frame input, and $C_f$ is the number of channels of the voxel features.
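For illustration, a minimal sketch of computing such a pairwise similarity map follows, assuming the relational knowledge is the cosine similarity between every pair of non-empty voxel features (the function name and PyTorch usage are illustrative, not the patent's implementation):

```python
import torch
import torch.nn.functional as F

def pairwise_relation(feats: torch.Tensor) -> torch.Tensor:
    # feats: (N, Cf) non-empty voxel features -> (N, N) affinity map
    f = F.normalize(feats, p=2, dim=1)  # L2-normalize each voxel feature
    return f @ f.t()                    # cosine similarity of every pair
```

Applied to the student features this yields $P^S$; applied to the index-aligned teacher features it yields $P^T$.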
In some exemplary embodiments, the indices of the teacher's voxel features and of the student's voxel features have been sorted, and $I_S(i,j) = I_T(i,j)$, where $i \in \{1, \dots, N_s\}$ and $j \in \{1, 2, 3\}$.
In some exemplary embodiments, correcting the completion labels with the panoramic segmentation labels and completing the semantic scenes of moving objects comprises: voxelizing the panoramic segmentation labels to obtain voxel panoramic labels for each category, the categories including moving objects; computing the boundary of each instance of a category to form a cube corresponding to that instance; and merging the cubes corresponding to all instances, using the merged cubes to process the original voxel completion labels, and filtering out voxels outside the cubes to remove the smear of moving objects.
In a second aspect, the embodiment of the application also provides a semantic scene completion system, which comprises a completion sub-network construction module, a knowledge distillation module, and a completion label correction module connected in sequence. The completion sub-network construction module is used for constructing a completion sub-network and completing the semantic scenes of target objects with different scales based on the completion sub-network; the completion sub-network directly processes the voxel features generated by voxelization to produce sparse voxel features and comprises multipath blocks for fusing multi-scale features. The knowledge distillation module is used for constructing a multi-frame teacher network that takes dense point clouds as input, and for transferring dense, relation-based semantic knowledge from the multi-frame teacher network to a single-frame student network by a knowledge distillation method, thereby completing the semantic scene of the input single-frame point cloud. The completion label correction module is used for correcting the completion labels with panoramic segmentation labels and completing the semantic scenes of moving objects.
In some exemplary embodiments, the completion sub-network includes an upper branch, a middle branch, and a lower branch connected in sequence; the upper branch includes a multipath block; the multipath blocks comprise convolution blocks with different convolution kernel sizes; the middle branch comprises a plurality of convolution blocks and a plurality of multipath blocks; and the lower branch comprises a residual connection block.
In addition, the application also provides an electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor, the memory storing instructions executable by the at least one processor to enable the at least one processor to perform the semantic scene completion method described above.
In addition, the application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the semantic scene completion method described above.
The technical scheme provided by the embodiment of the application has at least the following advantages:
the embodiment of the application provides a semantic scene completion method, a semantic scene completion system, semantic scene completion equipment and a storage medium, wherein the semantic scene completion method comprises the following steps: constructing a complement sub-network, and complementing semantic scenes of target objects with different scales based on the complement sub-network; the complement sub-network directly processes the voxelization process to generate sparse voxel characteristics, and the complement sub-network comprises multipath blocks for fusing the multi-scale characteristics; constructing a multi-frame teacher network; the multi-frame teacher network takes dense point clouds as input; transferring dense semantic knowledge based on a relation from a multi-frame teacher network to a single-frame student network by adopting a knowledge distillation method, and complementing the semantic scene of the input single-frame point cloud; and correcting the complement label by adopting the panoramic segmentation label, and complementing the semantic scene of the moving object.
According to the semantic scene completion method, on one hand, the constructed completion sub-network maintains sparsity, avoids downsampling, and fuses multi-scale features; this redesign improves the completion performance of the semantic scene completion task on objects of different scales, especially small-scale and distant objects. On the other hand, to cope with the sparsity and incompleteness of the input, the application proposes distilling dense, relation-based knowledge from a multi-frame model, applying knowledge distillation to the semantic scene completion task: dense, relation-based semantic knowledge is transferred from a multi-frame teacher network to a single-frame student network, significantly improving the representation learning of the single-frame model. In addition, to solve the smear problem of moving objects in the completion labels, the application also proposes a completion label correction strategy that removes the smear of moving objects from the completion labels using panoramic segmentation labels, greatly improving the performance of the depth model.
Drawings
One or more embodiments are illustrated by way of example and not limitation in the figures of the accompanying drawings, which are not to be construed as limiting the embodiments unless specifically indicated otherwise.
FIG. 1 is a schematic flow chart of a semantic scene completion method according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a semantic scene completion system according to an embodiment of the present application;
FIG. 3 is a schematic diagram illustrating challenges in the semantic scene completion task on the SemanticKITTI dataset provided in an embodiment of the present application, wherein (a) illustrates sparse and incomplete input and a large number of objects of different scales, and (b) illustrates the inherent label noise of moving objects;
FIG. 4 is a frame diagram of an SCPNet provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of a complementary subnetwork according to an embodiment of the present application;
FIG. 6 is a diagram illustrating a comparison of a single-frame point cloud and a multi-frame splice point cloud according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of the complement labels before and after the label correction method according to an embodiment of the present disclosure;
FIG. 8 is a diagram of a comparison of the visualization of different methods on a SemanticKITTI validation set;
fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
As known from the background art, the existing semantic scene completion task faces several challenges, including: single-frame lidar point clouds are often sparse and incomplete; a single-frame lidar point cloud contains a large number of objects of different scales; and the semantic scene completion labels of existing public datasets suffer from moving-object smear.
Semantic scene completion aims at deducing the geometry and semantics of a scene from incomplete and sparse observations, and is a key component of three-dimensional scene understanding. Early work on semantic scene completion focused mainly on indoor scenes, where the point cloud is dense, small in scale, and uniform in density. In contrast, the point cloud in an outdoor scene is sparse, large in scale, and variable in density, which poses a great challenge for semantic scene completion algorithms.
The main goal of knowledge distillation (KD) is to transfer black-box knowledge from a large, over-parameterized teacher model to a small, compact student model, where the knowledge includes intermediate features, visual attention, similarity scores of different samples, and so on. Notably, most distillation methods focus on two-dimensional tasks; only a few address the 3D domain.
Existing completion methods involve both downsampling and upsampling operations. Downsampling inevitably loses information of the original point cloud, and upsampling causes excessive expansion and shape distortion, resulting in serious completion and classification errors for small objects and crowded scenes. In addition, the prior art deduces the geometry and semantics of the scene from a single-frame input point cloud, but a single frame is sparse, incomplete, and may even lack key scene information, making accurate semantic scene completion of the whole scene difficult. Furthermore, the moving-object smear present in the semantic scene completion labels of existing public datasets can hinder the learning of the completion model, a problem not addressed by existing methods.
In order to solve the above technical problems, an embodiment of the present application provides a semantic scene completion method, which includes the following steps: constructing a completion sub-network, and completing the semantic scenes of target objects with different scales based on the completion sub-network, wherein the completion sub-network directly processes the voxel features generated by voxelization to produce sparse voxel features and comprises multipath blocks for fusing multi-scale features; constructing a multi-frame teacher network that takes dense point clouds as input, and transferring dense, relation-based semantic knowledge from the multi-frame teacher network to a single-frame student network by a knowledge distillation method, thereby completing the semantic scene of the input single-frame point cloud; and correcting the completion labels with panoramic segmentation labels, thereby completing the semantic scenes of moving objects. The method thus completes the semantic scenes of target objects of different scales, of the input single-frame point cloud, and of moving objects, realizing accurate semantic scene completion of the whole scene.
Embodiments of the present application will be described in detail below with reference to the accompanying drawings. As will be appreciated by those of ordinary skill in the art, numerous technical details are set forth in the various embodiments in order to provide a better understanding of the present application; however, the technical solutions claimed in the present application can still be implemented without these technical details, or with various changes and modifications based on the following embodiments.
Referring to fig. 1, an embodiment of the present application provides a semantic scene completion method, including the following steps:
s1, constructing a complement sub-network, and complementing semantic scenes of target objects with different scales based on the complement sub-network; wherein the complement subnetwork directly processes the voxelization process to produce sparse voxel features, and the complement subnetwork includes multipath blocks for fusing the multi-scale features.
S2, constructing a multi-frame teacher network that takes dense point clouds as input; and transferring dense, relation-based semantic knowledge from the multi-frame teacher network to a single-frame student network by a knowledge distillation method, thereby completing the semantic scene of the input single-frame point cloud.
S3, correcting the completion labels with panoramic segmentation labels, thereby completing the semantic scenes of moving objects.
As mentioned previously, semantic scene completion in outdoor scenes is challenging due to sparse and incomplete inputs, a large number of objects of different scales, and the inherent label noise of moving objects. Existing semantic scene completion methods rely heavily on voxel-by-voxel completion labels and exhibit poor completion performance on small and distant objects and in crowded scenes. In addition, the long smears of moving objects in the original completion labels can hinder the learning of the completion model. To solve these three problems, the application proposes three solutions: completion sub-network redesign, dense-to-sparse knowledge distillation, and completion label correction.
1) Aiming at the serious completion and classification errors that existing completion methods cause for small objects and crowded scenes, a novel completion sub-network is provided; it is composed of multiple multi-path blocks (MPBs) to aggregate multi-scale features and has no lossy downsampling operation. The voxel features generated by the completion sub-network maintain their sparse character, and the segmentation sub-network can process these sparse voxel features using sparse convolution.
2) Aiming at the difficulty of realizing accurate semantic scene completion of the whole scene from a sparse and incomplete single-frame input point cloud, the application completes the semantic scene of the input single-frame point cloud. To combat sparse and incomplete input signals, the application designs a novel knowledge distillation objective, called dense-to-sparse knowledge distillation (DSKD), which transfers dense, relation-based semantic knowledge from a multi-frame teacher network to a single-frame student network and markedly improves the representation learning of the single-frame model.
3) The moving-object smear in the semantic scene completion labels of existing public datasets can hinder the learning of the completion model. Aiming at this technical problem, the application provides a simple and effective label correction strategy that uses panoramic segmentation labels to remove the smears of moving objects from the completion labels, greatly improving the performance of the depth model, especially on moving objects.
As shown in fig. 2, the embodiment of the present application further provides a semantic scene completion system, which includes a completion sub-network construction module 101, a knowledge distillation module 102, and a completion label correction module 103 connected in sequence. The completion sub-network construction module 101 is used for constructing a completion sub-network and completing the semantic scenes of target objects of different scales based on the completion sub-network; the completion sub-network directly processes the voxel features generated by voxelization to produce sparse voxel features and comprises multipath blocks for fusing multi-scale features. The knowledge distillation module 102 is used for constructing a multi-frame teacher network that takes dense point clouds as input, and for transferring dense, relation-based semantic knowledge from the multi-frame teacher network to a single-frame student network by a knowledge distillation method, thereby completing the semantic scene of the input single-frame point cloud. The completion label correction module 103 is configured to correct the completion labels with panoramic segmentation labels and to complete the semantic scenes of moving objects.
FIG. 3 illustrates the challenges in the semantic scene completion task on the SemanticKITTI dataset provided in an embodiment of the present application, wherein (a) illustrates sparse and incomplete input and a large number of objects of different scales, and (b) illustrates the inherent label noise of moving objects.
To solve the problems in the semantic scene completion task shown in fig. 3, the proposed semantic scene completion system realizes accurate semantic scene completion of the whole scene through the following three solutions. First, the completion sub-network is redesigned to improve the completion performance on objects of different scales, especially small-scale and distant objects. Second, a dense-to-sparse knowledge distillation method is designed so that a single-frame network can distill useful knowledge from a multi-frame model, improving the semantic scene completion of the input single-frame point cloud. Finally, a simple and effective completion label correction strategy is designed, which uses existing panoramic segmentation labels to remove the smears of moving objects from the completion labels, greatly improving the semantic scene completion performance on moving objects.
First, the completion sub-network is designed comprehensively. A completion-first principle is adopted so that the completion module in the completion sub-network directly processes the original voxel features. Furthermore, the application avoids downsampling operations, which inevitably introduce information loss and cause serious completion and classification errors for small objects and crowded scenes. To improve the completion quality for objects of different scales, the application designs multi-path blocks (MPBs) with different convolution kernel sizes, which aggregate multi-scale features and make full use of rich context information.
As shown in fig. 4, the SCPNet designed in the present application is composed of two sub-networks, namely a completion sub-network and a segmentation sub-network. The application proposes design principles critical to building a powerful completion sub-network: maintaining sparsity, no downsampling, and fusing multi-scale features. The segmentation sub-network is built on Cylinder3D and modified so that the whole network retains these three characteristics.
Maintaining sparsity: the completion sub-network requires ordinary dense convolution to dilate the input, while the segmentation sub-network uses sparse convolution for efficient processing. However, the variance and mean of the batch normalization (BN) layer would destroy the sparsity of the original voxel features, greatly increasing the computational burden of the segmentation sub-network. Therefore, to reduce the overall computational cost and retain the efficiency of sparse convolution, the application removes all convolution biases and BN layers in the completion sub-network. In this case, the voxel features generated by the completion sub-network still maintain their sparse character, and the segmentation sub-network can process these sparse voxel features with sparse convolution.
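The effect of removing biases and BN layers can be illustrated with a minimal check (an assumed configuration, not the patent's exact layers): a dense 3D convolution without bias maps an all-zero region to exactly zero, so empty voxels stay empty and remain cheap for the downstream sparse convolution.

```python
import torch
import torch.nn as nn

conv = nn.Conv3d(16, 16, kernel_size=3, padding=1, bias=False)  # no bias, no BN
empty = torch.zeros(1, 16, 8, 8, 8)           # an entirely empty region
assert conv(empty).abs().max().item() == 0.0  # output is still exactly zero
```

With a bias term or a BN layer (whose mean shift is non-zero), every output voxel would become non-zero and the sparsity of the feature volume would be destroyed.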
No downsampling: in popular completion networks such as S3CNet and JS3C-Net, the completion branch contains multiple downsampling and upsampling blocks. The downsampling operation inevitably loses information of the original point cloud, causing serious completion and classification errors for small objects and crowded scenes. Therefore, the completion sub-network of the application abandons all downsampling and upsampling operations to reduce information loss and maximally retain the information of the original point cloud. In addition, unlike JS3C-Net, which uses a segmentation-first baseline, the application uses the completion-first principle: the completion sub-network directly handles the original voxel features generated by the voxelization process, and the completion sub-network can also benefit from the large number of parameters of the segmentation sub-network.
In some embodiments, the completion sub-network includes an upper branch, a middle branch, and a lower branch connected in sequence; the upper branch includes a multipath block; the multipath blocks comprise convolution blocks with different convolution kernel sizes; the middle branch comprises a plurality of convolution blocks and a plurality of multipath blocks; and the lower branch comprises a residual connection block.
In order to fuse multi-scale features, the application designs multi-path blocks composed of convolution blocks with different kernel sizes. As shown in fig. 5, there are 3 branches in the completion sub-network (3D Completion Sub-network in fig. 5). The upper branch contains a multipath block (MPB), and the lower branch is a residual connection. The middle branch consists of a 3×3×3 convolution block, two multipath blocks, and another 3×3×3 convolution block. After the completion sub-network, dense complete voxel features are obtained. The application extracts the non-empty voxel features and their voxel indices from the complete voxel features. The generated sparse voxel features are sent to the segmentation sub-network to produce voxel-by-voxel segmentation output.
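A minimal sketch of such a multi-path block follows; the parallel kernel sizes and channel width here are assumptions for illustration, not the patent's exact configuration:

```python
import torch
import torch.nn as nn

class MultiPathBlock(nn.Module):
    """Parallel dense 3D convolutions with different kernel sizes,
    summed to fuse multi-scale context without any downsampling."""

    def __init__(self, channels: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        # bias=False and no BatchNorm so that all-empty regions stay zero
        self.paths = nn.ModuleList(
            nn.Conv3d(channels, channels, k, padding=k // 2, bias=False)
            for k in kernel_sizes
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(sum(path(x) for path in self.paths))

# dense voxel features of shape (batch, channels, D, H, W)
y = MultiPathBlock(16)(torch.randn(1, 16, 32, 32, 32))
```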
It should be noted that the multi-path blocks designed in the application, composed of convolution blocks with different convolution kernel sizes, realize completion without downsampling and upsampling operations. The completion sub-network can be modified; for example, an artificial neural network with a different number of convolution kernels, multi-path blocks, or branches, likewise without downsampling and upsampling operations, can also be used.
In some embodiments, after the completion sub-network is constructed in step S1 and before the semantic scenes of target objects of different scales are completed, the method comprises: modifying the segmentation sub-network, wherein the sparse voxel features generated by the completion sub-network are fed into the modified segmentation sub-network to generate voxel-by-voxel segmentation output. Modifying the segmentation sub-network comprises: replacing the cylindrical partition of the segmentation sub-network with a cube partition; and removing the point optimization module in the segmentation sub-network.
Specifically, the segmentation sub-network is modified as follows. For segmentation, the application takes Cylinder3D as the backbone. Since the voxel completion labels are defined on a cube partition, the application replaces the cylindrical partition of Cylinder3D with a traditional cube partition. In addition, the original point optimization module consumes a large amount of GPU memory while bringing limited gain, so it is discarded to save memory.
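For illustration, a minimal sketch of the cube partition is given below (the function signature, grid origin, and voxel size are assumptions):

```python
import numpy as np

def cube_partition(points: np.ndarray, voxel_size: float, origin) -> np.ndarray:
    """Map each point to an integer index on a regular Cartesian grid,
    i.e. the cube partition that replaces the cylindrical partition."""
    idx = np.floor((points[:, :3] - np.asarray(origin)) / voxel_size)
    return np.unique(idx.astype(np.int64), axis=0)  # non-empty voxel indices
```

Unlike Cylinder3D's cylindrical partition (radius, angle, height), this grid is axis-aligned and therefore matches the cube-partitioned voxel completion labels directly.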
In some embodiments, employing a knowledge distillation method to transfer dense, relation-based semantic knowledge from the multi-frame teacher network to the single-frame student network comprises: computing the pairwise relational knowledge of the student model and the corresponding relational knowledge of the teacher model, respectively; and extracting the pairwise similarity information of each pair of voxel features based on this relational knowledge.
In some embodiments, the pairwise relationship knowledge of the student model is shown as follows:
$$P^{S}_{ij} = \frac{F_S(i)\,F_S(j)^{\top}}{\lVert F_S(i)\rVert_2\,\lVert F_S(j)\rVert_2}$$

wherein the voxel features of the teacher, the voxel features of the student, and the voxel-feature indices of the teacher and the student are denoted as $F_T \in \mathbb{R}^{N_m \times C_f}$, $F_S \in \mathbb{R}^{N_s \times C_f}$, $I_T \in \mathbb{R}^{N_m \times 3}$, and $I_S \in \mathbb{R}^{N_s \times 3}$, respectively; $N_m$ is the number of non-empty voxel features obtained from the multi-frame input, $N_s$ is the number of non-empty voxel features obtained from the single-frame input, and $C_f$ is the number of channels of the voxel features.
In some embodiments, the indices of the teacher's voxel features and of the student's voxel features have been sorted, and $I_S(i,j) = I_T(i,j)$, where $i \in \{1, \dots, N_s\}$ and $j \in \{1, 2, 3\}$.
To combat sparse and incomplete input signals, the application lets a single-frame student model extract knowledge from a multi-frame teacher model. However, modeling the probabilistic knowledge of each point/voxel brings only marginal benefit. Instead, the application extracts pairwise similarity information. Considering the sparsity and disorder of the features, their indices are used to align them, and consistency is then enforced between the pairwise similarity maps of the student features and the teacher features, so that the student benefits from the teacher's relational knowledge. The resulting objective, designed specifically for the scene completion task, is called dense-to-sparse knowledge distillation (DSKD).
As is evident from fig. 6, the multi-frame input contains more points of the scene, and its objects are more easily identified, which significantly reduces the difficulty of completion; the difficulty decreases as the number of input point cloud frames increases. The application therefore constructs a multi-frame teacher network that takes the denser point cloud as input and obtains better completion performance.
Inspired by PVKD, the application lets the single-frame model extract relation-based structural knowledge from the multi-frame teacher network. Since the original voxel features are in sparse form, knowledge distillation is performed using the sparse features and their indices. The application first calculates the pairwise relational knowledge $P^S$ of the student model; the teacher's $P^T$ is calculated in a similar manner. This relational knowledge captures the similarity of each pair of voxel features, serves as an important cue about the surrounding environment, and acts as high-level knowledge to be learned by the single-frame student model. The proposed dense-to-sparse knowledge distillation (DSKD) loss is as follows:
$$L_{DSKD} = \frac{1}{N_s^{2}} \sum_{i=1}^{N_s} \sum_{j=1}^{N_s} \left( P^{T}_{ij} - P^{S}_{ij} \right)^{2}$$

The overall loss function for training the deep learning network consists of three terms, namely the cross-entropy loss, the Lovasz-softmax loss, and the proposed distillation loss:

$$L = L_{ce} + \alpha L_{lovasz} + \beta L_{DSKD}$$

where α and β are loss coefficients balancing the effect of each loss term.
It should be noted that the application designs the dense-to-sparse knowledge distillation (DSKD) method so that a student network taking a single-frame point cloud as input can distill useful knowledge from a teacher network taking multi-frame point clouds as input, improving the completion performance of the single-frame network; an alternative scheme is to change the way the distillation loss function is calculated.
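A minimal sketch of the DSKD loss under the stated assumptions follows: both inputs are voxel features, the teacher's already aligned to the student's Ns sorted voxel indices, and the loss is taken as the mean squared error between the two pairwise affinity maps (variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def dskd_loss(feat_s: torch.Tensor, feat_t: torch.Tensor) -> torch.Tensor:
    # feat_s, feat_t: (Ns, Cf) student / index-aligned teacher features
    p_s = F.normalize(feat_s, dim=1) @ F.normalize(feat_s, dim=1).t()
    p_t = F.normalize(feat_t, dim=1) @ F.normalize(feat_t, dim=1).t()
    return F.mse_loss(p_s, p_t)  # consistency of the two affinity maps

# overall objective with balancing coefficients alpha and beta:
# loss = ce_loss + alpha * lovasz_loss + beta * dskd_loss(f_s, f_t)
```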
In some embodiments, correcting the completion labels with the panoramic segmentation labels and completing the semantic scenes of moving objects includes: voxelizing the panoramic segmentation labels to obtain voxel panoramic labels for each category, the categories including moving objects; computing the boundary of each instance of a category to form a cube corresponding to that instance; and merging the cubes corresponding to all instances, using the merged cubes to process the original voxel completion labels, and filtering out voxels outside the cubes to remove the smear of moving objects.
To solve the long smear of moving objects in the completion labels, the application provides a simple and effective label correction strategy. The core idea is to use existing panoramic segmentation labels to remove the long smears of moving objects from the completion labels. The corrected completion labels are more accurate and reliable, greatly improving the completion quality of the depth model on moving objects.
For outdoor semantic scene completion, the ground-truth completion label is obtained by stitching the segmentation labels of multiple consecutive point cloud frames. Specifically, for the t-th frame, the corresponding completion label $L^c_t$ is constructed as follows:
$$L^c_t = \mathrm{Concat}\left[L^s_t;\; T_{t+1\to t}L^s_{t+1};\; \dots;\; T_{t+T-1\to t}L^s_{t+T-1}\right]$$
where $L^s_t$ is the segmentation label of the t-th frame, T is the number of frames used for stitching, $T_{t+1\to t}$ is the transformation matrix that transforms coordinates from the (t+1)-th frame to the t-th frame, and Concat[...] is the concatenation operation. Multi-frame stitching produces long smears of moving objects such as cars and people. As shown in the two upper diagrams of fig. 7, the long smear of the circled moving object is obviously unreasonable and hinders the learning of the depth model.
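A minimal sketch of this stitching operation is given below (array layouts and names are assumptions for illustration):

```python
import numpy as np

def stitch_completion_label(frames, transforms):
    """frames[k]: (Nk, 4) array of xyz + semantic label for frame t+k;
    transforms[k]: 4x4 matrix T_{t+k -> t} mapping frame t+k into
    frame t (identity for k = 0). Returns the stitched label cloud."""
    merged = []
    for pts, T in zip(frames, transforms):
        xyz1 = np.c_[pts[:, :3], np.ones(len(pts))]   # homogeneous coords
        merged.append(np.c_[(xyz1 @ T.T)[:, :3], pts[:, 3]])
    return np.concatenate(merged, axis=0)             # the Concat[...] above
```

Because a moving object occupies a different pose in every frame, its stitched labels trace out the long smear visible in fig. 7.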
To solve this problem, the application proposes to remove the long smears of moving objects from the completion labels using the panoramic segmentation labels. Specifically, given the panoramic label of class i, the application first voxelizes the label to obtain the voxel panoramic label of class i. For each instance of class i, the boundary of the instance is computed, forming a cube. The cubes of all instances are merged and used to process the original voxel completion labels, filtering out those voxels outside the cubes; the process is repeated for all classes containing moving objects. The pseudo-code of the designed label correction method is shown in table 1.
TABLE 1 pseudo code for tag correction method
(pseudo-code listing of the label correction method; see the reconstruction sketched below)
In table 1, GetInd(a, b) returns the indices at which matrix a takes the value b; Bound(M) returns the boundary of the set of index triplets M; and Difference(A, B) is the set difference of A minus B.
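Based on that description, a hedged reconstruction of the procedure in Python follows; the grid representation (separate semantic and instance label volumes) and all names are assumptions, not the patent's exact listing:

```python
import numpy as np

def correct_labels(comp, sem, inst, moving_classes):
    """comp: voxel completion label grid; sem/inst: voxelized panoramic
    semantic and instance label grids; moving_classes: ids of movable
    classes. Erases completion voxels of a moving class that lie outside
    every instance's bounding cube, i.e. the motion smear."""
    out = comp.copy()
    for c in moving_classes:
        keep = np.zeros(comp.shape, dtype=bool)
        cls_mask = (sem == c)                            # cf. GetInd(sem, c)
        for i in np.unique(inst[cls_mask]):
            ijk = np.argwhere(cls_mask & (inst == i))
            lo, hi = ijk.min(axis=0), ijk.max(axis=0)    # cf. Bound(M)
            keep[lo[0]:hi[0] + 1, lo[1]:hi[1] + 1, lo[2]:hi[2] + 1] = True
        out[(comp == c) & ~keep] = 0   # cf. Difference: filter smear voxels
    return out
```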
As shown in fig. 7, compared with the two upper diagrams, the two lower diagrams have the long smears of moving objects removed, demonstrating that the proposed label correction method can effectively remove the long smears of moving objects and make the completion labels more accurate.
In summary, the application first comprehensively redesigns the completion sub-network around three key factors: sparsity maintenance, no downsampling, and multi-scale feature fusion. The novel completion sub-network is composed of multiple multi-path blocks (MPBs) to aggregate multi-scale features, has no lossy downsampling operation, and improves the completion performance on objects of different scales, especially small-scale and distant objects, in the semantic scene completion task.
To cope with the sparsity and incompleteness of the input, the application also proposes distilling dense, relation-based knowledge from a multi-frame model, applying knowledge distillation to the semantic scene completion task for the first time. The designed dense-to-sparse knowledge distillation (DSKD) objective transfers dense, relation-based semantic knowledge from a multi-frame teacher network to a single-frame student network, markedly improving the representation learning of the single-frame model.
To solve the long smear of moving objects in the completion labels, the application provides a simple and effective completion label correction strategy that uses existing panoramic segmentation labels to remove the long smears of moving objects from the completion labels, greatly improving the performance of the depth model, especially on moving objects.
And (3) effect verification:
the application proves to be feasible through experiments, simulation and use. SCPNet designed by the application ranks first in SemanticKITTI semantic scene completion challenge, 36.7mIoU is obtained on SemanticKITTI test set, and the SCPNet exceeds competitive S3CNet7.2mIoU. SCPNet has performance over S3CNet of more than 10IoU in the categories of automobiles, other vehicles, roads, parking lots, sidewalks, fences, terrains and other floors. SCPNet also performs better than existing complement algorithms on SemanticPOSS datasets, being at least 5IoU higher than JS3C-Net on the categories of automobiles, luggage, utility poles, fences, bicycles, and the like. In addition, the method also obtains competitive results on the SemanticKITTI semantic segmentation task, and the Cylinder3D initialized according to the training weight of the completion task is 2.6mIoU higher than the original Cylinder3D model, which shows that the knowledge learned in scene completion is favorable for segmentation of the task.
Compared with FitNet, which directly imitates the teacher's features, the proposed DSKD method brings an improvement of 2.8 mIoU, showing the effectiveness of the relation-based distillation algorithm.
The proposed completion label correction method brings IoU improvements of 8.1, 24.5, 26.1, and 17.6 for cars, people, bicyclists, and motorcyclists, respectively, effectively demonstrating the effectiveness of the label correction algorithm.
FIG. 8 shows a visual comparison of partial semantic scene completion results against the ground truth for JS3C-Net, SCPNet (single-frame), and SCPNet (multi-frame) on the SemanticKITTI validation set. Compared with JS3C-Net, SCPNet (single-frame) makes more accurate completion predictions on roads and vegetation. On long and thin objects such as utility poles, the single-frame SCPNet produces higher-quality completion results than JS3C-Net. The predictions of the single-frame SCPNet are also similar to those of the multi-frame SCPNet, proving the effectiveness of the proposed DSKD algorithm.
Referring to fig. 9, another embodiment of the present application provides an electronic device, including: at least one processor 110; and a memory 111 communicatively coupled to the at least one processor; the memory 111 stores instructions executable by the at least one processor 110, the instructions being executable by the at least one processor 110 to enable the at least one processor 110 to perform any one of the method embodiments described above.
Where the memory 111 and the processor 110 are connected by a bus, the bus may comprise any number of interconnected buses and bridges, the buses connecting the various circuits of the one or more processors 110 and the memory 111 together. The bus may also connect various other circuits such as peripherals, voltage regulators, and power management circuits, which are well known in the art, and therefore, will not be described any further herein. The bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or may be a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. The data processed by the processor 110 is transmitted over a wireless medium via an antenna, which further receives the data and transmits the data to the processor 110.
The processor 110 is responsible for managing the bus and general processing and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. And memory 111 may be used to store data used by processor 110 in performing operations.
Another embodiment of the present application relates to a computer-readable storage medium storing a computer program. The computer program implements the above-described method embodiments when executed by a processor.
That is, it will be understood by those skilled in the art that all or part of the steps of the methods in the embodiments described above may be implemented by a program stored in a storage medium, where the program includes several instructions for causing a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to perform all or part of the steps of the methods in the embodiments described above. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Through the above technical solutions, the embodiments of the present application provide a semantic scene completion method, system, device, and storage medium, where the method includes the following steps: constructing a completion sub-network, and completing the semantic scenes of target objects with different scales based on the completion sub-network, wherein the completion sub-network directly processes the voxel features generated by voxelization to produce sparse voxel features and comprises multipath blocks for fusing multi-scale features; constructing a multi-frame teacher network that takes dense point clouds as input, and transferring dense, relation-based semantic knowledge from the multi-frame teacher network to a single-frame student network by a knowledge distillation method, thereby completing the semantic scene of the input single-frame point cloud; and correcting the completion labels with panoramic segmentation labels, thereby completing the semantic scenes of moving objects.
According to the semantic scene completion method, on one hand, the constructed completion sub-network maintains sparsity, avoids downsampling, and fuses multi-scale features; this redesign improves the completion performance of the semantic scene completion task on objects of different scales, especially small-scale and distant objects. On the other hand, to cope with the sparsity and incompleteness of the input, the application proposes distilling dense, relation-based knowledge from a multi-frame model, applying knowledge distillation to the semantic scene completion task: dense, relation-based semantic knowledge is transferred from a multi-frame teacher network to a single-frame student network, significantly improving the representation learning of the single-frame model. In addition, to solve the smear problem of moving objects in the completion labels, the application also proposes a completion label correction strategy that removes the smear of moving objects from the completion labels using panoramic segmentation labels, greatly improving the performance of the depth model.
It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific examples of implementing the present application, and that various changes in form and detail may be made therein without departing from the spirit and scope of the present application; the scope of the application shall be defined by the appended claims.

Claims (10)

1. A semantic scene completion method, comprising:
constructing a completion sub-network, and completing the semantic scenes of target objects with different scales based on the completion sub-network; wherein the completion sub-network directly processes the voxel features generated by the voxelization process to produce sparse voxel features, and comprises multi-path blocks for fusing multi-scale features;
constructing a multi-frame teacher network, wherein the multi-frame teacher network takes dense point clouds as input; transferring dense, relation-based semantic knowledge from the multi-frame teacher network to a single-frame student network by a knowledge distillation method, and completing the semantic scene of the input single-frame point cloud;
and correcting the completion labels with panoramic segmentation labels, and completing the semantic scenes of moving objects.
2. The semantic scene completion method according to claim 1, comprising, after constructing the completion sub-network and before completing the semantic scenes of target objects of different scales:
modifying the segmentation sub-network; wherein the sparse voxel features generated by the completion sub-network are fed into the modified segmentation sub-network to generate voxel-by-voxel segmentation output;
the modifying the split sub-network includes:
replacing the cylinder division of the split sub-network with a cube division;
the point optimization module in the split sub-network is removed.
3. The semantic scene completion method according to claim 1, wherein transferring dense, relation-based semantic knowledge from the multi-frame teacher network to the single-frame student network by a knowledge distillation method comprises:
computing the pairwise relational knowledge of the student model and the corresponding relational knowledge of the teacher model, respectively; and
extracting the pairwise similarity information of each pair of voxel features based on the pairwise relational knowledge and the corresponding relational knowledge.
4. A semantic scene completion method according to claim 3, wherein the pairwise relationship knowledge of the student model is represented by the formula:
$$P^{S}_{ij} = \frac{F_S(i)\,F_S(j)^{\top}}{\lVert F_S(i)\rVert_2\,\lVert F_S(j)\rVert_2}$$

wherein the voxel features of the teacher, the voxel features of the student, and the voxel-feature indices of the teacher and the student are denoted as $F_T \in \mathbb{R}^{N_m \times C_f}$, $F_S \in \mathbb{R}^{N_s \times C_f}$, $I_T \in \mathbb{R}^{N_m \times 3}$, and $I_S \in \mathbb{R}^{N_s \times 3}$, respectively; $N_m$ is the number of non-empty voxel features obtained from the multi-frame input, $N_s$ is the number of non-empty voxel features obtained from the single-frame input, and $C_f$ is the number of channels of the voxel features.
5. The semantic scene completion method according to claim 4, wherein the indices of the teacher's voxel features and of the student's voxel features have been sorted, and $I_S(i,j) = I_T(i,j)$,
where $i \in \{1, \dots, N_s\}$ and $j \in \{1, 2, 3\}$.
6. The semantic scene completion method according to claim 1, wherein correcting the completion labels with the panoramic segmentation labels and completing the semantic scenes of moving objects comprises:
voxelizing the panoramic segmentation labels to obtain voxel panoramic labels for each category, the categories including moving objects;
computing the boundary of each instance of a category to form a cube corresponding to that instance;
and merging the cubes corresponding to all instances, using the merged cubes to process the original voxel completion labels, and filtering out voxels outside the cubes to remove the smear of moving objects.
7. A semantic scene completion system, comprising a completion sub-network construction module, a knowledge distillation module, and a completion label correction module connected in sequence;
wherein the completion sub-network construction module is used for constructing a completion sub-network and completing the semantic scenes of target objects with different scales based on the completion sub-network; the completion sub-network directly processes the voxel features generated by the voxelization process to produce sparse voxel features, and comprises multi-path blocks for fusing multi-scale features;
the knowledge distillation module is used for constructing a multi-frame teacher network, wherein the multi-frame teacher network takes dense point clouds as input; and for transferring dense, relation-based semantic knowledge from the multi-frame teacher network to a single-frame student network by a knowledge distillation method, completing the semantic scene of the input single-frame point cloud;
and the completion label correction module is used for correcting the completion labels with panoramic segmentation labels and completing the semantic scenes of moving objects.
8. The semantic scene completion system according to claim 7, wherein the completion sub-network comprises an upper branch, a middle branch, and a lower branch connected in sequence;
the upper branch includes a multipath block; the multipath blocks comprise convolution blocks of different convolution kernel sizes;
the intermediate branches comprise a plurality of convolution blocks and a plurality of multipath blocks;
the lower branch includes a residual connection block.
9. An electronic device, comprising:
at least one processor; the method comprises the steps of,
a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the semantic scene completion method according to any of claims 1 to 6.
10. A computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the semantic scene completion method of any of claims 1 to 6.
CN202310250553.1A 2023-03-15 2023-03-15 Semantic scene completion method, system, equipment and storage medium Pending CN116206082A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310250553.1A CN116206082A (en) 2023-03-15 2023-03-15 Semantic scene completion method, system, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310250553.1A CN116206082A (en) 2023-03-15 2023-03-15 Semantic scene completion method, system, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116206082A 2023-06-02

Family

ID=86519078

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310250553.1A Pending CN116206082A (en) 2023-03-15 2023-03-15 Semantic scene completion method, system, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116206082A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117092612A (en) * 2023-10-18 2023-11-21 湘潭大学 Automatic driving navigation method based on laser radar

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117092612A (en) * 2023-10-18 2023-11-21 湘潭大学 Automatic driving navigation method based on laser radar
CN117092612B (en) * 2023-10-18 2024-01-26 湘潭大学 Automatic driving navigation method based on laser radar

Similar Documents

Publication Publication Date Title
Melekhov et al. Dgc-net: Dense geometric correspondence network
Garcia-Garcia et al. A survey on deep learning techniques for image and video semantic segmentation
Cui et al. Semantic segmentation of remote sensing images using transfer learning and deep convolutional neural network with dense connection
JP6395158B2 (en) How to semantically label acquired images of a scene
Rateke et al. Road surface detection and differentiation considering surface damages
CN111325165B (en) Urban remote sensing image scene classification method considering spatial relationship information
CN113052254B (en) Multi-attention ghost residual fusion classification model and classification method thereof
CN115984494A (en) Deep learning-based three-dimensional terrain reconstruction method for lunar navigation image
CN105787867A (en) Method and apparatus for processing video images based on neural network algorithm
CN104216974A (en) Unmanned aerial vehicle aerial image matching method based on vocabulary tree blocking and clustering
Jiang et al. Dfnet: Semantic segmentation on panoramic images with dynamic loss weights and residual fusion block
CN113792641A (en) High-resolution lightweight human body posture estimation method combined with multispectral attention mechanism
CN116206082A (en) Semantic scene completion method, system, equipment and storage medium
CN115100238A (en) Knowledge distillation-based light single-target tracker training method
CN113344869A (en) Driving environment real-time stereo matching method and device based on candidate parallax
CN114219824A (en) Visible light-infrared target tracking method and system based on deep network
CN113297961A (en) Target tracking method based on boundary feature fusion twin circulation neural network
Zheng et al. CLMIP: cross-layer manifold invariance based pruning method of deep convolutional neural network for real-time road type recognition
Liu et al. Road segmentation with image-LiDAR data fusion in deep neural network
CN115272670A (en) SAR image ship instance segmentation method based on mask attention interaction
Jiang et al. Multi-level graph convolutional recurrent neural network for semantic image segmentation
CN108921852B (en) Double-branch outdoor unstructured terrain segmentation network based on parallax and plane fitting
Zhang et al. Hvdistill: Transferring knowledge from images to point clouds via unsupervised hybrid-view distillation
Lu et al. Siamese graph attention networks for robust visual object tracking
Khan et al. Towards generalizing sensorimotor control across weather conditions

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination