CN117911662B - Digital twin scene semantic segmentation method and system based on deep Hough voting - Google Patents
Digital twin scene semantic segmentation method and system based on deep Hough voting
- Publication number: CN117911662B (application CN202410318275.3A)
- Authority
- CN
- China
- Prior art keywords
- dimensional
- scene
- semantic
- semantic segmentation
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T19/00—Manipulating 3D models or images for computer graphics
- G06T19/20—Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Abstract
The invention relates to the technical field of digital twins and provides a digital twin scene semantic segmentation method and system based on deep Hough voting, which solve the technical problem of low semantic segmentation accuracy caused by insufficient extraction of spatial-point context information in current deep-learning-based three-dimensional point cloud semantic segmentation methods. The proposed method achieves fast, high-accuracy semantic segmentation of the three-dimensional measurement data of any industrial digital twin scene, thereby providing high-accuracy three-dimensional semantic information for digital twin industrial applications, facilitating subsequent rapid localization of key production elements in the industrial scene, enabling intelligent editing of and interaction with objects in the scene, and enhancing user experience.
Description
Technical Field
The invention relates to the technical field of digital twins, and in particular to a digital twin scene semantic segmentation method and system based on deep Hough voting.
Background
In the field of intelligent factory construction in the manufacturing industry, using digital twin technology to realize virtual-real mapping of production scenes, business processes, processing equipment and the like has become a research hotspot at home and abroad. Three-dimensional semantic segmentation technology can effectively extract and analyze geometric structure information in a digital twin scene, divide the objects in the scene into different semantic categories and assign labels, and is widely applied in digital twin modeling and analysis. Based on the segmentation result, high-quality three-dimensional semantic information can be provided for the digital twin workshop, so that the position of an object of interest in the scene can be rapidly located, objects in the scene can be intelligently edited and interacted with, and the intelligence level of workshop management is improved.
In the prior art, three-dimensional point cloud deep learning methods can be divided into three main types: projection-based methods, voxel-based methods, and spatial-point-based methods. For handling irregular three-dimensional point clouds, the most intuitive approach is to convert the irregular representation into a regular one. In a projection-based deep learning framework, the geometric information inside the three-dimensional point cloud is collapsed during the projection phase: when a dense pixel grid is formed on the projection plane, the sparsity of the point cloud may be ignored, and the choice of projection plane can also severely affect three-dimensional point cloud feature extraction. Another way to convert an irregular point cloud into a regular representation is three-dimensional voxelization with three-dimensional convolution for feature extraction, but applying conventional three-dimensional discrete convolution generally incurs large computation and memory overhead. Spatial-point-based deep learning methods design deep network architectures that act directly on point sets embedded in continuous space. Although these approaches have achieved impressive results in three-dimensional object recognition and semantic segmentation, there is still considerable room to improve twin-scene understanding through context-information analysis.
Disclosure of Invention
Aiming at the shortcomings of the prior art, the invention provides a digital twin scene semantic segmentation method and system based on deep Hough voting, which solve the technical problem of low semantic segmentation accuracy caused by insufficient extraction of spatial-point context information in current deep-learning-based three-dimensional point cloud semantic segmentation methods.
In order to solve the above technical problems, the invention provides the following technical solution: a digital twin scene semantic segmentation method based on deep Hough voting, comprising the following steps:
S1, extracting, via a U-shaped feature encoding module, context information at different scales of each three-dimensional measurement point's neighborhood from the acquired three-dimensional measurement data of an industrial scene to form a first-level feature expression;
S2, capturing object-level context information in the industrial scene through a Hough voting module and a vote aggregation module to form a second-level feature expression;
S3, concatenating, with a semantic prediction module, the first-level feature expression and the corresponding second-level feature expression of each three-dimensional measurement point, and predicting the semantic information of each three-dimensional measurement point;
S4, determining the center regression loss and the three-dimensional semantic segmentation loss of the object objects in the industrial scene, training on an industrial scene dataset, and updating the parameters of all the modules to form a three-dimensional semantic segmentation system for industrial digital twin scenes;
S5, taking the three-dimensional measurement data of the industrial digital twin scene as input, the three-dimensional semantic segmentation system outputs high-quality three-dimensional semantic information of the scene.
Further, in step S1, the specific process includes the following steps:
S11, preprocess the data: voxelize the three-dimensional measurement data P = {p_i | i = 1, …, N} to convert the unordered spatial points into ordered three-dimensional voxels, where each three-dimensional measurement point p_i is represented by its coordinates (x_i, y_i, z_i) and N denotes the number of spatial points;
S12, taking the ordered three-dimensional voxels as input, the U-shaped feature encoding module obtains a first-level feature vector for each spatial point from the mapping between spatial points and three-dimensional voxels, forming the first-level feature set F = {f_i | i = 1, …, N}.
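The voxelization in step S11 can be sketched as follows. This is a minimal numpy illustration under stated assumptions (the voxel size and the array layout are placeholders; the patent does not prescribe an implementation), showing how unordered points are mapped to ordered voxel indices while keeping the point-to-voxel mapping needed in S12:

```python
import numpy as np

def voxelize(points, voxel_size=0.02):
    """Map unordered 3-D points to ordered voxel indices.

    points: (N, 3) array of measurement coordinates (x, y, z).
    Returns the unique occupied voxel coordinates and, for each point,
    the index of its voxel -- the point-to-voxel mapping later used to
    scatter voxel features back onto spatial points.
    """
    voxel_coords = np.floor(points / voxel_size).astype(np.int64)
    # Unique voxels, plus the inverse mapping point -> voxel.
    unique_voxels, point_to_voxel = np.unique(
        voxel_coords, axis=0, return_inverse=True)
    return unique_voxels, point_to_voxel

pts = np.array([[0.010, 0.010, 0.010],
                [0.015, 0.012, 0.011],   # falls into the same voxel as point 0
                [0.050, 0.050, 0.050]])
voxels, p2v = voxelize(pts)
print(len(voxels))       # 2 occupied voxels
print(p2v[0] == p2v[1])  # True: the first two points share a voxel
```

In practice the occupied voxels (not a dense grid) are what a sparse-convolution encoder consumes, which is why only the unique coordinates are kept.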
Further, in step S2, the specific process includes the following steps:
S21, taking the three-dimensional measurement data P = {p_i | i = 1, …, N} and the corresponding first-level feature set F = {f_i | i = 1, …, N} as input, the Hough voting module outputs, for each three-dimensional measurement point p_i, a Euclidean spatial offset Δx_i and a feature offset Δf_i, and generates the vote information v_i = [x_i + Δx_i; f_i + Δf_i] for that point;
S22, the vote information v_i of all three-dimensional measurement points constitutes the vote set V = {v_i | i = 1, …, N};
S23, the farthest point sampling method is used to sample M vote positions from the spatial positions of the vote set V, obtaining the sampled set {c_k | k = 1, …, M};
S24, for each c_k, a cluster C_k is formed by finding the votes adjacent to it in Euclidean space;
S25, the clusters C_k are processed by the three fully connected layers in the vote aggregation module to obtain the cluster feature set G = {g_k | k = 1, …, M}.
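Steps S21-S24 can be sketched end to end in numpy. This is a hedged illustration only: the learned Hough-voting MLP is replaced by random placeholder offsets, and the sizes, radius, and sample count are arbitrary stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, M = 200, 16, 8                 # points, feature channels, samples (illustrative)
x = rng.random((N, 3))               # spatial positions x_i
f = rng.random((N, C))               # first-level features f_i

# S21: the offsets would come from the trained voting MLP; random stand-ins here.
dx = rng.normal(scale=0.05, size=(N, 3))
df = rng.normal(scale=0.05, size=(N, C))
votes = np.concatenate([x + dx, f + df], axis=1)   # S22: vote set V, shape (N, 3 + C)

# S23: farthest point sampling over the vote positions.
def farthest_point_sampling(xyz, m):
    chosen = [0]
    dist = np.linalg.norm(xyz - xyz[0], axis=1)
    for _ in range(m - 1):
        nxt = int(np.argmax(dist))   # farthest from the chosen set so far
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(xyz - xyz[nxt], axis=1))
    return np.array(chosen)

vote_xyz = votes[:, :3]
centers = vote_xyz[farthest_point_sampling(vote_xyz, M)]

# S24: cluster the votes lying within a radius of each sampled position.
radius = 0.2
clusters = [np.where(np.linalg.norm(vote_xyz - c, axis=1) <= radius)[0]
            for c in centers]
print(votes.shape, len(clusters))    # (200, 19) and 8 clusters
```

Because each sampled center is itself a vote, every cluster is non-empty; farthest point sampling spreads the M centers across the scene regardless of point density.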
Further, step S3 specifically includes: for each three-dimensional measurement point p_i, the first-level feature expression f_i output by the U-shaped feature encoding module is concatenated with the second-level feature expression g_k of the cluster in which the point lies, and the concatenated feature is fed into the semantic prediction module to predict the three-dimensional semantic category of that spatial point.
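The concatenate-and-predict step can be sketched as below; random weights stand in for the trained semantic prediction module, and all dimensions are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
N, C1, C2, K = 50, 32, 64, 8      # points, feature widths, classes (illustrative)
f_point = rng.random((N, C1))     # first-level feature per point
f_cluster = rng.random((N, C2))   # second-level feature of the point's cluster

h = np.concatenate([f_point, f_cluster], axis=1)   # (N, C1 + C2)

# Two fully connected layers with a ReLU in between (random weights here;
# trained parameters in practice).
W1, b1 = rng.normal(size=(C1 + C2, 64)), np.zeros(64)
W2, b2 = rng.normal(size=(64, K)), np.zeros(K)
logits = np.maximum(h @ W1 + b1, 0) @ W2 + b2

pred = logits.argmax(axis=1)      # predicted semantic category per point
print(pred.shape)                 # (50,)
```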
Further, in step S4, the specific process includes the following steps:
S41, determining the center regression loss L_center of the object objects in the industrial scene;
S42, determining the three-dimensional semantic segmentation loss L_sem of the object objects in the industrial scene;
S43, calculating the joint loss function L from the center regression loss L_center and the three-dimensional semantic segmentation loss L_sem, with the calculation formula:

L = L_sem + λ · L_center

In the above formula, λ is a hyper-parameter used to balance the two different loss function terms;
S44, training on the industrial scene dataset based on the joint loss function L;
S45, updating the parameters of the feature encoding module, the Hough voting module, the vote aggregation module, and the semantic prediction module with a stochastic gradient descent optimization algorithm.
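The loss combination in S43 and the update rule in S45 amount to the following sketch (the ordering of the two loss terms and the learning rate are assumptions for illustration; λ = 0.6 follows the embodiment described later):

```python
import numpy as np

def joint_loss(l_sem, l_center, lam=0.6):
    """Combine segmentation and center-regression losses; lam balances them."""
    return l_sem + lam * l_center

def sgd_step(param, grad, lr=0.01):
    """Plain stochastic gradient descent update for one parameter tensor."""
    return param - lr * grad

w = np.array([1.0, -2.0])
g = np.array([0.5, -0.5])
print(sgd_step(w, g))        # [ 0.995 -1.995]
print(joint_loss(1.0, 0.5))  # 1.3
```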
Further, in step S41, the expression of the center regression loss L_center is:

L_center = (1/N) Σ_{i=1..N} 1(p_i) · ‖Δx_i − Δx_i*‖

In the above formula, N denotes the number of spatial points; 1(p_i) is an indicator function marking whether the spatial point p_i belongs to a semantic category included in the scene; Δx_i* is the ground-truth Euclidean offset from the initial position x_i of point p_i to the center of the object to which it belongs; Δx_i is the Euclidean offset predicted for p_i by the Hough voting module.
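The center regression loss above computes directly as a masked mean of offset errors. A numpy sketch (the toy offsets are illustrative):

```python
import numpy as np

def center_regression_loss(pred_offset, gt_offset, on_object):
    """L_center: mean Euclidean error between predicted and ground-truth
    offsets, with the indicator masking points that belong to no labeled
    object category.

    pred_offset, gt_offset: (N, 3) arrays; on_object: (N,) boolean mask.
    """
    err = np.linalg.norm(pred_offset - gt_offset, axis=1)
    return (err * on_object).sum() / len(err)

pred = np.array([[0.1, 0.0, 0.0], [0.0, 0.2, 0.0]])
gt = np.array([[0.1, 0.0, 0.0], [0.0, 0.0, 0.0]])
mask = np.array([True, True])
print(center_regression_loss(pred, gt, mask))  # 0.1
```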
Further, in step S42, the expression of the three-dimensional semantic segmentation loss L_sem is:

L_sem = −(1/N) Σ_{i=1..N} Σ_{j=1..K} w_j · y_{ij} · log(s_{ij})

In the above formula, K denotes the number of semantic categories in the dataset; y_{ij} is a sign function: y_{ij} = 1 when the true class of spatial point p_i equals j, and y_{ij} = 0 otherwise; s_{ij} is the probability predicted by the semantic prediction module that point p_i belongs to category j; w_j is the weight of the j-th category, determined by the dataset.
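This class-weighted cross-entropy can be sketched in numpy as follows (uniform weights and toy probabilities, purely for illustration):

```python
import numpy as np

def weighted_cross_entropy(probs, labels, class_weights):
    """L_sem: class-weighted cross-entropy averaged over N points.

    probs: (N, K) predicted probabilities; labels: (N,) true classes;
    class_weights: (K,) per-class weights w_j.
    """
    n = len(labels)
    picked = probs[np.arange(n), labels]          # s_{i, true class of i}
    return -(class_weights[labels] * np.log(picked)).mean()

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
labels = np.array([0, 1])
w = np.ones(3)
loss = weighted_cross_entropy(probs, labels, w)
print(round(loss, 4))  # 0.2899
```

Raising w_j for rare categories (e.g. small equipment amid large floor and wall regions) counteracts class imbalance in industrial scenes.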
Further, step S5 is specifically: the three-dimensional measurement data of the industrial digital twin scene are input into the three-dimensional semantic segmentation system, and the U-shaped feature encoding module in the system extracts the first-level feature expression of each three-dimensional measurement point;
on this basis, the Hough voting module and the vote aggregation module in the system extract the second-level feature expression; the feature expressions of the two levels are concatenated and passed to the semantic prediction module to obtain the semantic category of each three-dimensional measurement point, yielding high-quality three-dimensional semantic information for the whole scene.
The present technical solution also provides a system for implementing the above digital twin scene semantic segmentation method. The system integrates several modules into a three-dimensional semantic segmentation framework for industrial digital twin scenes, including:
The U-shaped feature coding module is built based on sub-manifold sparse convolution and is used for extracting context information of different scales of a three-dimensional measurement point neighborhood from three-dimensional measurement data in an industrial scene to form a first-level feature expression;
The Hough voting module and the vote aggregation module are formed based on a Hough voting mechanism and are used for capturing context information of object levels in a scene to form a second-level feature expression;
The three-dimensional measurement point semantic prediction module is used for splicing the first-level feature expression and the second-level feature expression of each three-dimensional measurement point and predicting semantic information of each three-dimensional measurement point.
Further, the U-shaped feature encoding module is a U-shaped architecture built from a combination of conventional sparse convolution layers, conventional sparse deconvolution layers, and submanifold sparse convolution blocks;
The Hough voting module is based on a multi-layer perceptron and consists of a full connection layer FC, an activation function ReLU and batch normalization BN;
the vote aggregation module consists of a full connection layer FC, an activation function ReLU, batch normalization BN and maximum pooling MaxPooling;
the semantic prediction module consists of two fully connected layers FC.
By means of the above technical solution, the digital twin scene semantic segmentation method and system based on deep Hough voting provided by the invention have the following beneficial effects:
1. The digital twin scene semantic segmentation method provided by the invention achieves fast, high-accuracy semantic segmentation of the three-dimensional measurement data of any industrial digital twin scene, thereby providing high-accuracy three-dimensional semantic information for digital twin industrial applications, facilitating subsequent rapid localization of key production elements in the industrial scene, enabling intelligent editing of and interaction with objects in the scene, and enhancing user experience.
2. Compared with existing three-dimensional semantic segmentation methods, the digital twin scene semantic segmentation method provided by the invention effectively captures and fuses context information at different levels and improves the semantic segmentation accuracy for three-dimensional scenes.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a flow chart of a digital twin scene semantic segmentation method of the present invention;
FIG. 2 is a network block diagram of a U-shaped feature encoding module of the present invention;
Fig. 3 is a network structure diagram of the Hough voting module according to the present invention;
FIG. 4 is a network block diagram of the ballot aggregation module of the present invention;
FIG. 5 is a schematic representation of three-dimensional measurement data of a typical industrial power plant scenario of the present invention;
FIG. 6 is a schematic diagram of the semantic segmentation result of the digital twin industrial scene of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present application may be more readily understood, the application is described in further detail below with reference to the accompanying drawings and specific embodiments, so that the process of applying the technical means to solve the technical problems and achieve the technical effects can be fully understood and implemented.
Digital twin technology creates a virtual model of a physical entity in digital form and, through virtual-real interactive feedback, data fusion analysis, iterative decision optimization and the like, makes full use of multidisciplinary model, data, and intelligence technologies to add or extend capabilities for the physical entity and establish its virtual representation on a computer. The digital twin model is a three-dimensional digital model of the physical workshop, serving as the twin carrier and digital base for production process optimization, while the workshop generates twin data in real time during production. The digital twin model is a digital reconstruction of the production site and a virtual mapping of the geometry, physics, behavior, rules and other characteristics of the workshop, realizing an accurate mapping from the physical world to virtual space. In the field of intelligent factory construction in the manufacturing industry, using digital twin technology to realize virtual-real mapping of production scenes, business processes, processing equipment and the like has become a research hotspot at home and abroad. Three-dimensional semantic segmentation technology can effectively extract and analyze geometric structure information in a digital twin scene, divide the objects in the scene into different semantic categories and assign labels, and is widely applied in digital twin modeling and analysis. Based on the segmentation result, high-quality three-dimensional semantic information can be provided for the digital twin workshop, so that the position of an object of interest in the scene can be rapidly located, objects in the scene can be intelligently edited and interacted with, and the intelligence level of workshop management is improved.
In the intelligent factory application based on digital twin, three-dimensional semantic segmentation can be realized through three-dimensional point cloud measurement data, so that a high-fidelity digital twin model is established. Unlike a regular two-dimensional grid of pixels, a three-dimensional point cloud is a discrete set embedded in a continuous space, with irregular and disordered features. Thus, while deep convolutional networks exhibit excellent performance in structured two-dimensional computer vision tasks, they cannot be directly applied to such unstructured data.
To address this challenge, a variety of three-dimensional point cloud deep learning approaches have emerged, which can be divided into three main types: projection-based methods, voxel-based methods, and spatial-point-based methods. For handling irregular three-dimensional point clouds, the most intuitive approach is to convert the irregular representation into a regular one. In a projection-based deep learning framework, the geometric information inside the three-dimensional point cloud is collapsed during the projection phase: when a dense pixel grid is formed on the projection plane, the sparsity of the point cloud may be ignored, and the choice of projection plane can also severely affect three-dimensional point cloud feature extraction. Another way to convert an irregular point cloud into a regular representation is three-dimensional voxelization with three-dimensional convolution for feature extraction, but applying conventional three-dimensional discrete convolution generally incurs large computation and memory overhead. Spatial-point-based deep learning methods design deep network architectures that act directly on point sets embedded in continuous space. Although these approaches have achieved impressive results in three-dimensional object recognition and semantic segmentation, there is still considerable room to improve twin-scene understanding through context-information analysis.
In view of the technical shortcomings of the prior art, referring to figs. 1-6, this embodiment proceeds as follows: the three-dimensional measurement data of an industrial digital twin scene are input into the three-dimensional semantic segmentation framework, and the U-shaped feature encoding module inside the framework extracts the first-level feature expression of each three-dimensional measurement point; on this basis, the Hough voting module and the vote aggregation module in the framework extract the second-level feature expression; the feature expressions of the two levels are concatenated and passed to the semantic prediction module to obtain the semantic category of each three-dimensional measurement point, yielding high-quality three-dimensional semantic information for the whole scene.
Referring to fig. 1, the present embodiment provides a digital twin scene semantic segmentation method based on deep Hough voting, which includes the following steps:
S1, extracting context information of different scales of a neighborhood of a three-dimensional measurement point from three-dimensional measurement data in an obtained industrial scene through a U-shaped feature coding module to form a first-level feature expression; in step S1, the specific process includes the following steps:
S11, preprocess the data: voxelize the three-dimensional measurement data P = {p_i | i = 1, …, N} to convert the unordered spatial points into ordered three-dimensional voxels, where each three-dimensional measurement point p_i is represented by its coordinates (x_i, y_i, z_i), N denotes the number of spatial points, and the voxel size is set to 0.02 m;
S12, taking the ordered three-dimensional voxels as input, the U-shaped feature encoding module obtains a first-level feature vector for each spatial point from the mapping between spatial points and three-dimensional voxels, forming the feature set F = {f_i | i = 1, …, N}, where f_i is the first-level feature vector, i.e., the first-level feature expression, of the i-th spatial point. As shown in fig. 2, the five layers of the U-shaped feature encoding module are designed with the same combination pattern; only the first and fifth layers are drawn in the figure, and the middle three layers are omitted. The U-shaped feature encoding module efficiently processes the ordered three-dimensional voxels and finally uses the mapping between spatial points and voxels to obtain the first-level feature vector of each spatial point, forming the first-level feature set F.
In the step, the U-shaped feature encoding module can effectively extract context information of different scales in the three-dimensional measurement data, so that the local semantics of the measurement data can be understood more deeply, and the robustness to noise data is improved.
S2, capturing context information of object levels in the industrial scene through a Hough voting module and a vote aggregation module to form a second-level feature expression; in step S2, the specific process includes the following steps:
S21, taking the three-dimensional measurement data P and the corresponding first-level feature set F as input, the Hough voting module outputs, for each three-dimensional measurement point p_i, a Euclidean spatial offset Δx_i and a feature offset Δf_i, i.e., (Δx_i, Δf_i) = MLP(x_i, f_i), and generates the vote information v_i = [x_i + Δx_i; f_i + Δf_i] for that point. As shown in fig. 3, the network of the Hough voting module consists of three fully connected layers FC, two batch normalization BN layers, and an activation function ReLU;
S22, the vote information v_i of all three-dimensional measurement points constitutes the vote set V = {v_i | i = 1, …, N};
S23, the farthest point sampling method is used to sample M vote positions from the spatial positions of the vote set V, obtaining {c_k | k = 1, …, M}, a set formed by sampling M elements from the vote set V;
S24, for each c_k, a cluster C_k = {v_i : ‖x_i^v − x_k^c‖ ≤ r} is formed by finding the votes adjacent to it in Euclidean space, where v_i denotes the i-th vote belonging to the k-th cluster, x_i^v denotes the three-dimensional spatial position of vote v_i, x_k^c denotes the three-dimensional spatial position of the cluster center c_k, and r denotes the vote aggregation radius threshold, set by the user and set to 0.2 in this embodiment;
S25, the clusters C_k are processed by the three fully connected layers in the vote aggregation module to obtain the cluster feature set G = {g_k | k = 1, …, M}. As shown in fig. 4, the network of the vote aggregation module obtains the cluster features after sampling and clustering through the fully connected layer FC, batch normalization BN, activation function ReLU, and max pooling MaxPooling, where g_k is the second-level cluster feature, i.e., the second-level feature expression, of the k-th vote position. The value of M is independent of the number of object objects in the industrial scene; M = 128 in this embodiment.
In this step, the Hough voting module and the vote aggregation module capture contextual relationships between object objects, such as their relative positions and correlations, to enhance understanding of the overall scene.
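The max pooling at the end of the vote aggregation (FC, BN, ReLU, then MaxPooling) reduces each variable-size cluster to a single fixed-length feature. A numpy sketch with random per-vote features standing in for the fully-connected-layer outputs:

```python
import numpy as np

rng = np.random.default_rng(3)
# Three clusters of different sizes, each vote carrying an 8-D feature.
clusters = [rng.random((5, 8)), rng.random((12, 8)), rng.random((3, 8))]

# Channel-wise max over the votes in a cluster -> one feature per cluster,
# independent of how many votes the cluster contains.
cluster_feats = np.stack([c.max(axis=0) for c in clusters])
print(cluster_feats.shape)  # (3, 8)
```

The max operation is permutation-invariant, which is why it suits the unordered votes inside a cluster.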
S3, a semantic prediction module is adopted to concatenate the first-level feature expression and the corresponding second-level feature expression of each three-dimensional measurement point and predict the semantic information of each point, yielding high-quality three-dimensional semantic information of the whole scene to support subsequent digital twin applications, such as rapidly locating objects of interest in the scene and intelligently editing and interacting with objects in the scene.
In step S3, specifically: for each three-dimensional measurement point p_i, the first-level feature expression f_i output by the U-shaped feature encoding module is concatenated with the second-level feature expression g_k of the cluster in which the point lies, and the concatenated feature is fed into the semantic prediction module to predict the three-dimensional semantic category of that spatial point.
In the step, the multi-level and multi-scale semantic information fusion of each three-dimensional measurement point is realized by splicing the feature expressions of the first level and the second level. The semantic prediction module predicts semantic information by utilizing the fused feature expression, so that finer and accurate semantic label prediction for each measurement point can be realized. By utilizing the feature expressions of two different layers, the model can more comprehensively utilize information from different modules, so that the understanding and analyzing capability of the whole system to industrial scenes is improved.
S4, determining the center regression loss and the three-dimensional semantic segmentation loss of the object objects in the industrial scene, training on an industrial scene dataset, and updating the parameters of all the modules to form a three-dimensional semantic segmentation framework for industrial digital twin scenes; in step S4, the specific process includes the following steps:
S41, determining the center regression loss L_center of the object objects in the industrial scene. The goal of L_center is to supervise the Hough voting module to learn object-level context features, constraining each three-dimensional spatial point x_i to perceive the center position of the object to which it belongs. The expression of the center regression loss L_center is:

L_center = (1/N) Σ_{i=1..N} 1(p_i) · ‖Δx_i − Δx_i*‖

In the above formula, N denotes the number of spatial points; 1(p_i) is an indicator function marking whether the spatial point p_i belongs to a semantic category included in the scene; Δx_i* is the ground-truth Euclidean offset from the initial position x_i of point p_i to the center of the object to which it belongs; Δx_i is the Euclidean offset predicted for p_i by the Hough voting module.
S42, determining the three-dimensional semantic segmentation loss L_sem of the object objects in the industrial scene. The goal of L_sem is to supervise the whole system to accurately predict the semantic category of each three-dimensional spatial point p_i. The expression of L_sem is:

L_sem = −(1/N) Σ_{i=1..N} Σ_{j=1..K} w_j · y_{ij} · log(s_{ij})

In the above formula, K denotes the number of semantic categories in the dataset; y_{ij} is a sign function: y_{ij} = 1 when the true class of spatial point p_i equals j, and y_{ij} = 0 otherwise; s_{ij} is the probability predicted by the semantic prediction module that point p_i belongs to category j; w_j is the weight of the j-th category, determined by the dataset.
S43, calculating the joint loss function L from the center regression loss L_center and the three-dimensional semantic segmentation loss L_seg; the calculation formula of the joint loss function L is:

L = L_seg + λ · L_center ;

In the above formula, λ is a hyper-parameter used to balance the two loss function terms; λ may be set to 0.6.
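Assuming the joint loss takes the weighted-sum form described above (the original formula image is not available, so this form is a reconstruction), it reduces to:

```python
def joint_loss(l_seg, l_center, lam=0.6):
    """Joint loss: segmentation loss plus lambda-weighted center regression
    loss, with lambda balancing the two terms (0.6 in the example above)."""
    return l_seg + lam * l_center

total = joint_loss(1.0, 0.5)  # -> 1.0 + 0.6 * 0.5 = 1.3
```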
S44, training on the industrial scene dataset based on the joint loss function L. Compared with a traditional semantic segmentation loss function, the joint loss exploits the complementarity of the scene-object center regression loss and the semantic segmentation cross-entropy loss: while making semantic predictions, the model also perceives the center position of the object containing each three-dimensional spatial point, which enhances the overall semantic segmentation performance of the model.
S45, updating the parameters of the feature encoding module (the network layers shown in fig. 2), the Hough voting module (the network layers shown in fig. 3), the vote aggregation module (the network layers shown in fig. 4) and the semantic prediction module (two fully connected layers) by a stochastic gradient descent optimization algorithm.
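The plain stochastic gradient descent update can be sketched as follows (a scalar-parameter toy, not the patent's training loop; a real optimizer would also handle momentum, weight decay, etc.):

```python
def sgd_step(params, grads, lr=0.01):
    """One stochastic gradient descent update: theta <- theta - lr * grad."""
    return [p - lr * g for p, g in zip(params, grads)]

updated = sgd_step([1.0, -2.0], [10.0, 10.0], lr=0.1)  # -> [0.0, -3.0]
```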
S5, taking three-dimensional measurement data of the industrial digital twin scene as input, the three-dimensional semantic segmentation system outputs high-quality three-dimensional semantic information of the scene. Specifically: the three-dimensional measurement data of the industrial digital twin scene are input into the three-dimensional semantic segmentation system, and the U-shaped feature encoding module in the system extracts a first-level feature expression for each three-dimensional measurement point; on this basis, the Hough voting module and the vote aggregation module in the system extract a second-level feature expression; the two levels of feature expressions are concatenated and passed to the semantic prediction module to obtain the semantic category of each three-dimensional measurement point, yielding high-quality three-dimensional semantic information for the whole scene. As shown in fig. 5, three-dimensional measurement data of a typical industrial power station scene can be used as input to the three-dimensional semantic segmentation system; fig. 6 shows the visualized output of the three-dimensional semantic segmentation system of the present invention, which contains 8 semantic categories: reactor 1, wire 2, filter 3, pole 4, voltage divider 5, ground 6, overhead 7, miscellaneous 8, with different depths representing different semantic categories.
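The inference pipeline above can be sketched end to end; the four module callables here are hypothetical placeholders standing in for the patent's networks, not the real implementation:

```python
def segment_scene(points, encode, vote_and_aggregate, predict):
    """Encode -> vote/aggregate -> fuse -> predict, one label per point."""
    first = encode(points)                        # first-level features
    second = vote_and_aggregate(points, first)    # second-level features
    fused = [f1 + f2 for f1, f2 in zip(first, second)]
    return [predict(f) for f in fused]            # semantic label per point

# Dummy stand-ins for the real modules, just to exercise the data flow:
labels = segment_scene(
    [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)],
    encode=lambda pts: [[1.0] for _ in pts],
    vote_and_aggregate=lambda pts, f: [[2.0] for _ in pts],
    predict=lambda f: int(sum(f)),
)  # -> [3, 3]
```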
The digital twin scene semantic segmentation method based on depth Hough voting can realize fast, high-precision semantic segmentation of three-dimensional measurement data of any industrial digital twin scene, thereby providing high-precision three-dimensional semantic information for digital twin industrial applications, facilitating subsequent rapid localization of key production elements in the industrial scene, and enabling intelligent editing of and interaction with objects in the scene. Compared with existing three-dimensional semantic segmentation methods, the method effectively captures and fuses context information at different levels and improves the semantic segmentation precision of three-dimensional scenes.
The system provided by this embodiment corresponds to the digital twin scene semantic segmentation method provided by the foregoing embodiment, so that implementation of the foregoing digital twin scene semantic segmentation method is also applicable to the system provided by this embodiment, and will not be described in detail in this embodiment.
Referring to figs. 2-4, which show the network architecture of the three-dimensional semantic segmentation framework provided by this embodiment: the system is integrated from several modules to form a three-dimensional semantic segmentation framework oriented to industrial digital twin scenes, and the modules integrated in the system include:
As shown in fig. 2, the U-shaped feature encoding module is built based on sub-manifold sparse convolution and is used for extracting context information at different scales of the neighborhood of each three-dimensional measurement point from the three-dimensional measurement data of an industrial scene, forming a first-level feature expression; it is a U-shaped architecture built from a combination of conventional sparse convolution layers, conventional sparse deconvolution layers and sub-manifold sparse convolution blocks.
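Sparse-convolution encoders of this kind consume a point-to-voxel mapping. A minimal illustration of that mapping in plain Python (the voxel size and function names are assumptions, not the patent's):

```python
import math

def voxelize(points, voxel_size=0.05):
    """Group unordered 3D points into ordered voxels: map each integer
    voxel key to the indices of the points falling inside it."""
    voxels = {}
    for i, (x, y, z) in enumerate(points):
        key = (math.floor(x / voxel_size),
               math.floor(y / voxel_size),
               math.floor(z / voxel_size))
        voxels.setdefault(key, []).append(i)
    return voxels

grid = voxelize([(0.01, 0.01, 0.01), (0.02, 0.01, 0.01), (0.30, 0.0, 0.0)])
# points 0 and 1 share a voxel; point 2 falls in a different one -> 2 voxels
```

This mapping is also what lets per-voxel features be scattered back to per-point features after encoding.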
The Hough voting module and the vote aggregation module are built on a Hough voting mechanism and are used for capturing object-level context information in the scene to form a second-level feature expression. Fig. 3 shows the network architecture of the Hough voting module, which is based on a multi-layer perceptron and consists of fully connected layers (FC), ReLU activation functions and batch normalization (BN); fig. 4 shows the network architecture of the vote aggregation module, which consists of fully connected layers (FC), ReLU activation functions, batch normalization (BN) and max pooling (MaxPooling).
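A toy sketch of the voting step this module performs: predicting an offset for a point and shifting it toward its object center. The one-layer "MLP" below is a hypothetical stand-in for the FC/ReLU/BN stack, with made-up weights:

```python
def relu(x):
    return [max(0.0, v) for v in x]

def fc(x, weights, bias):
    """Fully connected layer: y = W x + b."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def cast_vote(position, offset):
    """A vote's spatial part is the point position shifted by the offset."""
    return [p + d for p, d in zip(position, offset)]

# Predict a 3D offset from a 3D position with a tiny one-layer "MLP".
pos = [1.0, 2.0, 3.0]
offset = relu(fc(pos, weights=[[0.0, 0.0, 0.0]] * 3, bias=[0.5, -1.0, 0.5]))
vote = cast_vote(pos, offset)  # -> [1.5, 2.0, 3.5]
```

In the actual module the same network also predicts a feature offset, and batch normalization is applied between layers; both are omitted here for brevity.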
The semantic prediction module is used for concatenating the first-level and second-level feature expressions of each three-dimensional measurement point and predicting the semantic information of each three-dimensional measurement point; it consists of two fully connected layers.
In this embodiment, the three-dimensional measurement data of the industrial digital twin scene are input into the three-dimensional semantic segmentation framework, and the U-shaped feature encoding module in the framework extracts a first-level feature expression for each three-dimensional measurement point; on this basis, the Hough voting module and the vote aggregation module in the framework extract a second-level feature expression; the two levels of feature expressions are concatenated and passed to the semantic prediction module to obtain the semantic category of each three-dimensional measurement point, yielding high-quality three-dimensional semantic information for the whole scene.
Those of ordinary skill in the art will appreciate that all or a portion of the steps in a method of implementing an embodiment described above may be implemented by a program to instruct related hardware, and thus, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different manner from other embodiments, so that the same or similar parts between the embodiments are referred to each other. For each of the above embodiments, since it is substantially similar to the method embodiment, the description is relatively simple, and reference should be made to the description of the method embodiment for relevant points.
The foregoing is a detailed description of the invention; the specific examples herein are used to explain the principles and embodiments of the invention and are intended only to facilitate understanding of the method of the invention and its core concepts. Meanwhile, those skilled in the art may make variations to the specific embodiments and application scope in accordance with the ideas of the present invention; in view of the above, the content of this description should not be construed as limiting the present invention.
Claims (10)
1. The digital twin scene semantic segmentation method based on the depth hough voting is characterized by comprising the following steps of:
S1, extracting context information of different scales of a neighborhood of a three-dimensional measurement point from three-dimensional measurement data in an obtained industrial scene through a U-shaped feature coding module to form a first-level feature expression;
S2, capturing context information of object levels in the industrial scene through a Hough voting module and a vote aggregation module to form a second-level feature expression;
S3, splicing the first-level feature expression and the corresponding second-level feature expression of each three-dimensional measurement point by adopting a semantic prediction module, and predicting semantic information of each three-dimensional measurement point;
S4, determining the center regression loss and the three-dimensional semantic segmentation loss of objects in the industrial scene, training on an industrial scene dataset, and updating the parameters of all the modules to form a three-dimensional semantic segmentation system oriented to the industrial digital twin scene;

S5, taking three-dimensional measurement data of the industrial digital twin scene as input, the three-dimensional semantic segmentation system outputting high-quality three-dimensional semantic information of the scene.
2. The digital twin scene semantic segmentation method according to claim 1, wherein in step S1, the specific process comprises the steps of:
S11, preprocessing the data: voxelizing the obtained three-dimensional measurement data P = {p_i | i = 1, …, N} to convert unordered spatial points into ordered three-dimensional voxels, wherein each three-dimensional measurement point p_i is represented by its coordinates (x_i, y_i, z_i), and N represents the number of spatial points;

S12, taking the ordered three-dimensional voxels as input, obtaining a first-level feature vector for each spatial point through the U-shaped feature encoding module based on the mapping relation between the spatial points and the three-dimensional voxels, and forming a first-level feature set F = {f_i | i = 1, …, N}.
3. The digital twin scene semantic segmentation method according to claim 1, wherein in step S2, the specific process comprises the steps of:
S21, taking the three-dimensional measurement data P and the corresponding first-level feature set F as input, the Hough voting module outputs, for each three-dimensional measurement point p_i, a Euclidean spatial offset Δx_i and a feature offset Δf_i, and generates the vote information v_i = (x_i + Δx_i, f_i + Δf_i) for that point;

S22, the vote information v_i of each three-dimensional measurement point p_i constitutes the vote set V = {v_i | i = 1, …, N};

S23, sampling M vote positions from the spatial positions of the vote set V by the farthest point sampling method to obtain {c_m | m = 1, …, M};

S24, for each c_m, forming a cluster C_m by finding the votes adjacent to it in Euclidean space;

S25, processing each cluster C_m through three fully connected layers in the vote aggregation module to obtain the cluster feature set G = {g_m | m = 1, …, M}.
4. The digital twin scene semantic segmentation method according to claim 1, characterized in that step S3 specifically comprises: for each three-dimensional measurement point p_i, concatenating the first-level feature expression f_i output by the U-shaped feature encoding module with the second-level feature expression g_m of the cluster in which the point is located, and feeding the concatenated features into the semantic prediction module to predict the three-dimensional semantic category corresponding to the spatial point.
5. The digital twin scene semantic segmentation method according to claim 1, wherein in step S4, the specific process comprises the steps of:
S41, determining the center regression loss L_center of objects in the industrial scene;

S42, determining the three-dimensional semantic segmentation loss L_seg of objects in the industrial scene;

S43, calculating the joint loss function L from the center regression loss L_center and the three-dimensional semantic segmentation loss L_seg; the calculation formula of the joint loss function L is:

L = L_seg + λ · L_center ;

In the above formula, λ is a hyper-parameter used to balance the two loss function terms;
S44, training on the industrial scene data set based on the joint loss function L;
S45, updating the parameters of the feature encoding module, the Hough voting module, the vote aggregation module and the semantic prediction module by a stochastic gradient descent optimization algorithm.
6. The digital twin scene semantic segmentation method according to claim 5, wherein in step S41, the expression of the center regression loss L_center is:

L_center = (1/N) · Σ_{i=1}^{N} 1(p_i) · ‖Δx_i − Δx_i*‖ ;

In the above formula, N represents the number of spatial points; 1(p_i) is an indicator function indicating whether the spatial point p_i belongs to a semantic category included in the scene; Δx_i* is the ground-truth Euclidean spatial offset of the spatial point p_i from its initial position x_i to the center of the object to which it belongs; Δx_i is the Euclidean spatial offset predicted by the Hough voting module for the spatial point p_i.
7. The method of claim 5, wherein in step S42, the expression of the three-dimensional semantic segmentation loss L_seg is:

L_seg = −(1/N) · Σ_{i=1}^{N} Σ_{j=1}^{K} w_j · y_ij · log(s_ij) ;

In the above formula, K represents the number of semantic categories in the dataset; y_ij is a sign function, i.e. y_ij = 1 when the true class of the spatial point p_i equals j, and y_ij = 0 otherwise; s_ij is the probability predicted by the semantic prediction module that the spatial point p_i belongs to category j; w_j is the weight of the j-th category, as determined by the dataset.
8. The digital twin scene semantic segmentation method according to claim 1, characterized in that step S5 specifically comprises: inputting three-dimensional measurement data of the industrial digital twin scene into the three-dimensional semantic segmentation system, and extracting a first-level feature expression for each three-dimensional measurement point by the U-shaped feature encoding module in the system;

on this basis, extracting a second-level feature expression by the Hough voting module and the vote aggregation module in the system; and concatenating the two levels of feature expressions and passing them to the semantic prediction module to obtain the semantic category of each three-dimensional measurement point, thereby obtaining high-quality three-dimensional semantic information for the whole scene.
9. A system for implementing the digital twin scene semantic segmentation method according to any of claims 1-8, characterized in that the system is integrated by several modules to form a three-dimensional semantic segmentation framework for industrial digital twin scenes, the modules integrated by the system comprising:
The U-shaped feature coding module is built based on sub-manifold sparse convolution and is used for extracting context information of different scales of a three-dimensional measurement point neighborhood from three-dimensional measurement data in an industrial scene to form a first-level feature expression;
The Hough voting module and the vote aggregation module are formed based on a Hough voting mechanism and are used for capturing context information of object levels in a scene to form a second-level feature expression;
The semantic prediction module is used for concatenating the first-level feature expression and the second-level feature expression of each three-dimensional measurement point and predicting the semantic information of each three-dimensional measurement point.
10. The system of claim 9, wherein the U-shaped feature encoding module is a U-shaped architecture built from a combination of conventional sparse convolution layers, conventional sparse deconvolution layers and sub-manifold sparse convolution blocks;
The Hough voting module is based on a multi-layer perceptron and consists of a full connection layer FC, an activation function ReLU and batch normalization BN;
the vote aggregation module consists of a full connection layer FC, an activation function ReLU, batch normalization BN and maximum pooling MaxPooling;
the semantic prediction module consists of two fully connected layers FC.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410318275.3A CN117911662B (en) | 2024-03-20 | 2024-03-20 | Digital twin scene semantic segmentation method and system based on depth hough voting |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117911662A CN117911662A (en) | 2024-04-19 |
CN117911662B true CN117911662B (en) | 2024-05-14 |
Family
ID=90686284
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117911662B (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111563446A (en) * | 2020-04-30 | 2020-08-21 | 郑州轻工业大学 | Human-machine interaction safety early warning and control method based on digital twin |
CN111964575A (en) * | 2020-07-06 | 2020-11-20 | 北京卫星制造厂有限公司 | Digital twin modeling method for milling of mobile robot |
WO2022007753A1 (en) * | 2020-07-06 | 2022-01-13 | 北京卫星制造厂有限公司 | Digital twin modeling method oriented to mobile robot milling processing |
WO2023166175A1 (en) * | 2022-03-03 | 2023-09-07 | Samp | Method for generating a digital twin of a facility |
CN115512040A (en) * | 2022-08-26 | 2022-12-23 | 中国人民解放军军事科学院国防工程研究院 | Digital twinning-oriented three-dimensional indoor scene rapid high-precision reconstruction method and system |
Non-Patent Citations (1)
Title |
---|
Research on 3D detection and interaction algorithms for production lines based on digital twin; Chen Moran; Deng Changyi; Zhang Jian; Guo Ruifeng; Journal of Chinese Computer Systems; 2020-05-15 (05); full text * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||