WO2023082588A1 - A semantic annotation method, apparatus, electronic device, storage medium and computer program product - Google Patents

A semantic annotation method, apparatus, electronic device, storage medium and computer program product - Download PDF

Info

Publication number
WO2023082588A1
Authority
WO
WIPO (PCT)
Prior art keywords: initial, cluster, dimensional semantic, semantic annotation, dimensional
Application number
PCT/CN2022/093649
Other languages
English (en)
French (fr)
Inventor
段永利
周晓巍
鲍虎军
孙佳明
甄佳楠
Original Assignee
上海商汤智能科技有限公司
Application filed by 上海商汤智能科技有限公司
Publication of WO2023082588A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00: Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00: Manipulating 3D models or images for computer graphics
    • G06T19/20: Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2219/00: Indexing scheme for manipulating 3D models or images for computer graphics
    • G06T2219/20: Indexing scheme for editing of 3D models
    • G06T2219/2012: Colour editing, changing, or manipulating; Use of colour codes

Definitions

  • The embodiments of the present disclosure are based on the Chinese patent application with application number 202111320552.7, filed on November 9, 2021 and entitled "A semantic annotation method, apparatus, electronic device and storage medium"; priority to that Chinese patent application is claimed, and its entire content is hereby incorporated into this disclosure by reference.
  • Embodiments of the present disclosure relate to the technical field of semantic annotation, and in particular to a semantic annotation method, apparatus, electronic device, storage medium, and computer program product.
  • Semantic segmentation technology is a pixel-level understanding of an image: each pixel is classified by the object it belongs to, so that pixels belonging to the same type of object are grouped into one category and marked with a specified label. It is widely used in various technical scenarios, such as autonomous driving, indoor navigation, virtual reality, and image recognition.
  • In the related technology, video images are mainly segmented semantically by a pre-trained image segmentation network.
  • Existing semantic annotation methods mainly rely on manual, pixel-by-pixel annotation of two-dimensional video images, so annotation efficiency is low.
  • Embodiments of the present disclosure at least provide a semantic annotation method, apparatus, electronic device, storage medium, and computer program product.
  • In a first aspect, an embodiment of the present disclosure provides a semantic annotation method, the method including: acquiring a three-dimensional scene model reconstructed based on at least one scene image, and an initial two-dimensional semantic annotation result of each scene image; dividing the triangular patches included in the three-dimensional scene model into a plurality of clusters; performing three-dimensional semantic annotation on each divided cluster based on the initial two-dimensional semantic annotation results, and determining an initial three-dimensional semantic annotation result for each cluster; and determining, based on the initial three-dimensional semantic annotation result of each cluster, a corrected two-dimensional semantic annotation result for each scene image.
  • In a second aspect, the embodiments of the present disclosure also provide a semantic annotation apparatus, the apparatus including:
  • an acquisition module configured to acquire a three-dimensional scene model reconstructed based on at least one scene image, and an initial two-dimensional semantic annotation result of each scene image;
  • a division module configured to divide the triangular patches included in the 3D scene model into a plurality of clusters;
  • a first annotation module configured to perform three-dimensional semantic annotation on each divided cluster based on the initial two-dimensional semantic annotation results, and determine the initial three-dimensional semantic annotation result of each cluster;
  • a second annotation module configured to determine a corrected two-dimensional semantic annotation result for each scene image based on the initial three-dimensional semantic annotation result of each cluster.
  • In a third aspect, an embodiment of the present disclosure further provides an electronic device including a processor, a memory, and a bus. The memory stores machine-readable instructions executable by the processor; when the electronic device runs, the processor communicates with the memory through the bus, and when the machine-readable instructions are executed by the processor, the steps of the semantic annotation method described in the first aspect or any of its implementations are performed.
  • In a fourth aspect, the embodiments of the present disclosure also provide a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the semantic annotation method of the first aspect or any of its implementations are performed.
  • In a fifth aspect, an embodiment of the present disclosure provides a computer program product including a computer-readable storage medium storing program code; when the instructions included in the program code are executed by a processor of a computer device, the steps of the above semantic annotation method are realized.
  • With the above method, 3D semantic annotation can be performed on each divided cluster to determine a corresponding initial 3D semantic annotation result, and the corrected 2D semantic annotation result for each scene image can be determined based on the initial 3D semantic annotation results. That is, the embodiments of the present disclosure correct the initial two-dimensional semantic annotation results of scene images in two-dimensional space based on the initial three-dimensional semantic annotation results of clusters in three-dimensional space.
  • Because a cluster in three-dimensional space may correspond to multiple triangular patches, and each triangular patch may correspond to multiple scene images, the clusters can be labeled in aggregate based on these correspondences, which makes the corrected two-dimensional semantic annotation results more accurate.
  • At the same time, the above correction process is completed automatically, making annotation more efficient.
  • FIG. 1 shows a flow chart of a semantic annotation method provided by an embodiment of the present disclosure;
  • FIG. 2 shows a schematic diagram of a semantic annotation apparatus provided by an embodiment of the present disclosure;
  • FIG. 3 shows a schematic diagram of an electronic device provided by an embodiment of the present disclosure.
  • In the related technology, video images are usually segmented semantically by a pre-trained image segmentation network.
  • To train such an image segmentation network, a large number of labeled sample images need to be obtained.
  • Existing semantic annotation methods mainly rely on manual, pixel-by-pixel annotation of two-dimensional video images, so annotation efficiency is low.
  • To this end, embodiments of the present disclosure provide a semantic annotation method, apparatus, electronic device, storage medium, and computer program product, so as to improve annotation efficiency.
  • The execution subject of the semantic annotation method provided by the embodiments of the present disclosure is a computer device with certain computing capabilities.
  • The computer device includes, for example, a terminal device, a server, or another processing device; the terminal device may be user equipment (User Equipment, UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a personal digital assistant (Personal Digital Assistant, PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, etc.
  • In some embodiments, the semantic annotation method may be implemented by a processor invoking computer-readable instructions stored in a memory.
  • FIG. 1 is a flow chart of a semantic labeling method provided by an embodiment of the present disclosure
  • the method includes steps S101 to S104, wherein:
  • S101: Obtain a 3D scene model reconstructed based on at least one scene image, and an initial 2D semantic annotation result of each scene image;
  • S102: Divide the triangular patches included in the 3D scene model into a plurality of clusters;
  • S103: Based on the initial two-dimensional semantic annotation results, perform three-dimensional semantic annotation on each divided cluster, and determine the initial three-dimensional semantic annotation result of each cluster;
  • S104: Based on the initial three-dimensional semantic annotation result of each cluster, determine a corrected two-dimensional semantic annotation result for each scene image.
  • Semantic annotation can be applied to any scenario that requires it: for example, annotating vehicles and pedestrians in autonomous driving, annotating obstacles in indoor navigation, or other scenarios; the embodiments of the present disclosure do not limit the application scenario.
  • The related technology adopts a pixel-by-pixel semantic annotation method completed manually. Considering that network training often requires a large number of labeled sample images, this consumes a great deal of manpower and material resources.
  • To address this, the embodiments of the present disclosure provide an automatic semantic annotation method that performs semantic annotation in three-dimensional space and then, based on the conversion relationship between three-dimensional and two-dimensional space, projects the semantic annotation results from three-dimensional space into two-dimensional space, which can greatly improve annotation efficiency.
  • the 3D scene model reconstructed based on one or more scene images and the initial 2D semantic annotation result of each scene image can be obtained first.
  • the above-mentioned 3D scene model may be determined based on 3D reconstruction of the scene image.
  • The 3D reconstruction here can generally be divided into three steps: sparse reconstruction, dense reconstruction, and surface reconstruction.
  • Sparse reconstruction completes the initial calculation of all camera poses: its input can be the various scene images, and its output the pose of the camera that captured each scene image.
  • Dense reconstruction works on the premise that the camera poses are known: its input can be multi-view scene images and camera poses, and its output a dense point cloud.
  • Surface reconstruction completes the conversion from the dense point cloud to the 3D scene model.
  • COLMAP or other methods may be used to implement the above three-dimensional reconstruction process.
  • The initial two-dimensional semantic annotation result of a scene image in the embodiments of the present disclosure may be obtained using an initial semantic annotation neural network. Based on the initial two-dimensional semantic annotation results of the scene images and the coordinate transformation relationship between the two-dimensional space and the three-dimensional space, three-dimensional semantic annotation of the triangular patches included in the three-dimensional scene model can be realized.
  • The above-mentioned initial semantic annotation neural network may be trained using the labeled sample images stored in an existing database, based on the correspondence between the sample images and their annotation information.
  • The 3D scene model may be assembled from triangular patches; that is, the triangular patch may be the smallest constituent unit of the 3D scene model.
  • In the embodiments of the present disclosure, the triangular patches can be divided into clusters according to semantics, so that patches belonging to the same semantics fall into the same cluster and can be labeled uniformly, which improves labeling efficiency while improving labeling accuracy.
  • The initial 3D semantic annotation result of a cluster may be determined based on the above initial 2D semantic annotation results of the scene images. Considering the correspondence between a cluster and multiple triangular patches, and between a triangular patch and multiple scene images, once the initial two-dimensional semantic annotation results of the scene images are determined, the initial 3D semantic annotation results of the clusters can be determined based on these correspondences.
  • In this way, the corrected 2D semantic annotation results of all scene images can be determined at one time based on the coordinate transformation relationship between the 3D space and the 2D space; compared with manual annotation in two-dimensional space, annotation efficiency is greatly improved.
  • In addition, the initial semantic annotation neural network can be retrained based on the corrected two-dimensional semantic annotation results of the scene images. Since the corrected two-dimensional semantic annotation results are more accurate, the trained semantic annotation neural network labels more accurately, which facilitates subsequent application in various scenarios.
  • In some embodiments, the above-mentioned method for dividing clusters may include the following steps:
  • Step 1: Randomly select a preset number of triangular patches from the triangular patches included in the 3D scene model, and use the target vector of each selected patch as the center vector of a cluster to be divided;
  • Step 2: Determine the distance between the target vector of each triangular patch and each center vector, and divide the patch into the cluster whose center vector has the smallest distance;
  • Step 3: Determine the new center vector corresponding to each cluster, and return to the step of dividing clusters based on the new center vectors until the division cut-off condition is met.
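Steps 1 to 3 above describe a k-means-style procedure over the target vectors of the triangular patches. The following is a minimal sketch under the assumption of a plain Euclidean distance between target vectors; the function and variable names (`divide_clusters`, `target_vectors`, etc.) are illustrative and not taken from the disclosure.

```python
import numpy as np

def divide_clusters(target_vectors, k, max_iters=100, seed=0):
    """K-means-style division of triangular patches into k clusters."""
    rng = np.random.default_rng(seed)
    n = target_vectors.shape[0]
    # Step 1: randomly select k triangular patches; their target vectors
    # serve as the center vectors of the clusters to be divided.
    centers = target_vectors[rng.choice(n, size=k, replace=False)].copy()
    assignment = None
    for _ in range(max_iters):
        # Step 2: assign each patch to the cluster whose center is nearest.
        dists = np.linalg.norm(
            target_vectors[:, None, :] - centers[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)
        # Division cut-off condition: the assignment no longer changes.
        if assignment is not None and np.array_equal(new_assignment, assignment):
            break
        assignment = new_assignment
        # Step 3: the new center vector of each cluster is the mean of the
        # target vectors of the patches currently in that cluster.
        for c in range(k):
            members = target_vectors[assignment == c]
            if len(members):
                centers[c] = members.mean(axis=0)
    return assignment, centers
```

In practice, each row of `target_vectors` would concatenate a patch's normal and color components, as described below.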
  • The target vector here includes at least one of the following: a target normal vector and a target color vector, which characterize the triangular patch from different dimensions.
  • The target normal vector and the target color vector may be determined based on the 3D reconstruction result.
  • At least one division operation may be performed based on the distance between the center vector of each cluster and the target vector of each triangular patch, until the division cut-off condition is met and a final division result is obtained.
  • The center vector of a cluster may be the average vector determined from the target vectors of the triangular patches in the cluster, that is, the mean of the target vectors of all triangular patches in the cluster.
  • The division cut-off condition in the embodiments of the present disclosure can be that, for a cluster, the center vector computed in successive iterations no longer changes; it can also be that the number of divisions reaches a preset number, or another cut-off condition; the embodiments of the present disclosure impose no restrictions here.
  • At least one round of cluster division can thus be realized based on the target vectors of the triangular patches and the center vector of each cluster; every time a division is performed, a new center vector can be determined for each cluster for the next division. Since the multiple triangular patches divided into one cluster are strongly correlated, the initial 3D semantic annotation result of the cluster can be determined using the annotation results of its multiple triangular patches, which has high accuracy and further improves the accuracy of the subsequently corrected two-dimensional semantic annotation results.
  • In the case where the target vector includes both a target normal vector and a target color vector, two distances can be determined from the two vector components and combined by weighted summation. In some embodiments, this can be achieved through the following steps:
  • Step 1: For each triangular patch, based on the target normal vector of the patch and the center normal vector in each center vector, determine the first distance between the patch and the cluster where each center vector is located; and, based on the target color vector of the patch and the center color vector in each center vector, determine the second distance between the patch and the cluster where each center vector is located;
  • Step 2: Based on the first distance and its corresponding weight, and the second distance and its corresponding weight, determine the distance between the target vector of the triangular patch and each center vector.
  • The weight corresponding to the first distance and the weight corresponding to the second distance may be set according to the application scenario. For example, in a scene dominated by a single color, consistency between the color vector and the center color vector is crucial for cluster division, and a higher weight can be assigned to the second distance.
  • The division method provided by the embodiments of the present disclosure aims to make triangular patches with the same semantics belong to the same cluster, and patches with different semantics belong to different clusters.
  • For example, the target vector of a triangular patch x in a cluster may be denoted R_i, where R_i includes the target normal vector n(x) and the target color vector c(x); the center vector of the cluster may be denoted P_i, where P_i includes the center normal vector n_i and the center color vector c_i. The distance between the target vector of a triangular patch and each center vector can then be determined by weighted summation, which makes the method more widely applicable.
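With this notation, the weighted-sum distance can be written as below. The use of Euclidean norms and scalar weights $w_1$, $w_2$ is an assumption, since this extract does not give the exact form:

```latex
d(R_i, P_i) \;=\; w_1 \,\lVert n(x) - n_i \rVert \;+\; w_2 \,\lVert c(x) - c_i \rVert
```

where the first term is the first (normal) distance with its weight $w_1$, and the second term is the second (color) distance with its weight $w_2$.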
  • In some embodiments, the initial three-dimensional semantic annotation result of a cluster may be determined based on the initial two-dimensional semantic annotation results of the scene images, through the following steps:
  • Step 1: For each triangular patch, based on the conversion relationship between the first coordinate system corresponding to the 3D scene model and the second coordinate system corresponding to the scene image, determine the projected image area corresponding to the patch in at least one scene image;
  • Step 2: Based on the initial two-dimensional semantic annotation result of the projected image area corresponding to the patch in at least one scene image, determine the initial three-dimensional semantic annotation result of the patch;
  • Step 3: For each cluster obtained by division, based on the initial three-dimensional semantic annotation results of the triangular patches included in the cluster, determine the initial three-dimensional semantic annotation result of the cluster.
  • Here, the projected image area corresponding to the triangular patch in each scene image can be determined first.
  • The projected image area here may be the set of pixels obtained by projecting the triangular patch onto the scene image according to data such as the patch's center coordinates, the image pose, and the camera intrinsics. Since the initial two-dimensional semantic annotation result of each scene image is pre-annotated, the initial two-dimensional semantic annotation result of the corresponding projected image area can also be determined.
  • In some embodiments, the number of labels corresponding to each initial two-dimensional semantic annotation result can be counted; that is, the number of votes for the triangular patch belonging to each label class is obtained, and normalizing the vote counts of the label classes gives the semantic probability that the patch belongs to each label class. The initial two-dimensional semantic annotation result with the highest semantic probability, i.e., the largest number of labels, can then be selected as the initial three-dimensional semantic annotation result of the triangular patch.
  • The clusters here are composed of triangular patches.
  • The number of triangular patches corresponding to each initial 3D semantic annotation result can be counted based on the initial 3D semantic annotation results of the patches included in the cluster.
  • The greater the number of patches carrying a given initial 3D semantic annotation result, the more likely that result is the cluster's initial 3D semantic annotation result.
  • In some embodiments, the initial three-dimensional semantic annotation result with the largest number of corresponding triangular patches may be selected as the initial three-dimensional semantic annotation result of the cluster.
  • In this way, the initial three-dimensional semantic annotation result of a cluster is determined jointly from the initial three-dimensional semantic annotation results of the multiple triangular patches it contains. The cluster's result thus depends on the 3D annotation results of multiple patches, and the result of each patch in turn depends on the two-dimensional annotation results of the scene images, which makes the finally determined initial three-dimensional semantic annotation results of the clusters more accurate.
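The vote-and-normalize scheme described above (count label votes over a patch's projected image areas, normalize to probabilities, take the majority for the patch, then the majority over patches for the cluster) can be sketched as follows; the function and variable names are illustrative assumptions, not from the disclosure:

```python
from collections import Counter

def patch_label_by_vote(projected_labels):
    """Majority vote over the initial 2-D labels of a patch's projected
    image areas; returns the winning label and the per-label semantic
    probabilities obtained by normalizing the vote counts."""
    votes = Counter(projected_labels)
    total = sum(votes.values())
    probs = {label: count / total for label, count in votes.items()}
    return max(probs, key=probs.get), probs

def cluster_label_by_vote(patch_labels):
    """The cluster's initial 3-D label is the label carried by the
    largest number of its triangular patches."""
    return Counter(patch_labels).most_common(1)[0][0]
```

For example, a patch whose projected regions vote `["wall", "wall", "floor"]` would be labeled `wall` with semantic probability 2/3.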
  • The embodiments of the present disclosure may also determine the initial three-dimensional semantic annotation result of a cluster according to the following steps:
  • Step 1: Compute the ratio between the number of labels corresponding to each initial two-dimensional semantic annotation result and the total number of labels, to determine the probability value that the triangular patch belongs to each initial two-dimensional semantic annotation result;
  • Step 2: Based on the probability values of belonging to the various initial two-dimensional semantic annotation results, determine the initial three-dimensional semantic annotation result of the triangular patch;
  • Step 3: For each cluster obtained by division, based on the probability values of each triangular patch included in the cluster belonging to each initial two-dimensional semantic annotation result, and the weight pre-assigned to each triangular patch for each initial two-dimensional semantic annotation result, determine the probability value that the cluster belongs to each initial two-dimensional semantic annotation result;
  • Step 4: Based on the probability values that the cluster belongs to the various initial two-dimensional semantic annotation results, determine the initial three-dimensional semantic annotation result of the cluster.
  • That is, the probability value of belonging to each initial two-dimensional semantic annotation result can be determined for each triangular patch first, and the initial three-dimensional semantic annotation result of each cluster can then be determined by combining the triangular patches contained in the cluster.
  • Here, the probability value of a triangular patch belonging to each initial two-dimensional semantic annotation result is computed as a ratio of label counts: the number of votes (the number of corresponding labels) that each initial two-dimensional semantic annotation result receives across the patch's projected image areas in the scene images can be counted, along with the total number of votes, which corresponds to the total number of labels.
  • In the process of calculating the probability for a cluster, this can be done based on the weights pre-assigned to each initial two-dimensional semantic annotation result of each triangular patch included in the cluster.
  • The above-mentioned weights may be determined in conjunction with the area of the triangular patch.
  • A triangular patch with a larger area may be assigned a larger weight, and one with a smaller area a smaller weight, because large-area patches often play a more critical role in the voting process.
  • The probability value of a cluster belonging to the various initial two-dimensional semantic annotation results can be determined by weighted summation, and the initial three-dimensional semantic annotation result of a cluster can thus be regarded as a probability distribution.
  • In some embodiments, the clusters can be divided further, and the final annotation effect is realized by annotating the divided sub-clusters.
  • When it is determined that a cluster has multiple possible semantic labels, the cluster may first be divided into sub-clusters, and semantic annotation may then be performed on the divided sub-clusters, which improves annotation accuracy.
  • In some embodiments, manually corrected three-dimensional semantic annotation results for the initial three-dimensional semantic annotation results of at least some of the clusters can be obtained first; then, based on these corrected three-dimensional semantic annotation results, corrected two-dimensional semantic annotation results for each scene image can be determined, which further improves annotation accuracy.
  • Embodiments of the present disclosure can perform manual correction with a Web-based (World Wide Web) annotation tool: the initial three-dimensional semantic annotation results of each cluster can be loaded into the Web-based tool and corrected manually there.
  • When it is determined that, among multiple connected clusters, there is a target cluster whose semantics differ from the others, the target cluster can be called up; when its annotation is determined to be incorrect, it can be manually corrected to the right annotation result.
  • The probability of the class to which a corrected cluster belongs is set to 1, and the probabilities of the other classes are set to 0.
  • In the embodiments of the present disclosure, not only can the 3D semantic annotation results of the clusters be corrected manually, but they can also be updated based on a predefined energy function, so that the determined corrected two-dimensional semantic annotation results of the scene images are more accurate.
  • the update process of the above three-dimensional semantic annotation result may include the following steps:
  • Step 1: Based on the initial 3D semantic annotation result of each cluster, determine the probability-distribution error term corresponding to each cluster at the semantic annotation level; and, based on the angle information between any two adjacent clusters that do not share the same 3D semantic annotation result, determine the included-angle error term corresponding to each cluster at the geometric feature level;
  • Step 2: Determine the energy function from the probability-distribution error term and the included-angle error term;
  • Step 3: Determine the updated three-dimensional semantic annotation result for each cluster under the condition that the value of the energy function is minimized.
  • Here, e₁(·) and e₂(·) correspond to the probability-distribution error term and the included-angle error term respectively; F represents the set of all clusters; N represents the set of adjacent clusters; x_f represents the three-dimensional semantic annotation result to which cluster f belongs; P(x_f) represents the probability that cluster f belongs to the three-dimensional semantic annotation result x_f; and θ(f, g) represents the angle between cluster f and cluster g.
  • The manually corrected annotation results can also be taken into account; for example, a manually corrected annotation result can be given the maximum semantic probability.
  • In this way, the updated 3D semantic annotation result for each cluster can be determined.
  • The correction of the two-dimensional semantic annotation results of the scene images can then be realized based on the updated three-dimensional semantic annotation result of each cluster; since this is automatic, it further improves annotation efficiency.
  • Here, the projected image area of each triangular patch included in a cluster can be determined in at least one scene image.
  • The embodiments of the present disclosure can render the 3D scene model carrying 3D semantics back to 2D, so as to obtain the 2D semantic annotation corresponding to each scene image; this is performed automatically, with high efficiency and accuracy.
  • The above-mentioned rendering from three dimensions to two dimensions can be realized based on the Open Graphics Library (OpenGL), or based on other methods, which is not limited here.
  • The embodiments of the present disclosure also provide a semantic annotation apparatus corresponding to the semantic annotation method. Since the problem-solving principle of the apparatus is similar to that of the above semantic annotation method, the implementation of the apparatus can refer to the implementation of the method.
  • FIG. 2 is a schematic diagram of a semantic annotation apparatus provided by an embodiment of the present disclosure.
  • The apparatus includes: an acquisition module 201, a division module 202, a first annotation module 203, and a second annotation module 204; wherein,
  • the acquisition module 201 is configured to acquire a three-dimensional scene model reconstructed based on at least one scene image, and an initial two-dimensional semantic annotation result of each scene image;
  • the division module 202 is configured to divide the triangular patches included in the 3D scene model into a plurality of clusters;
  • the first annotation module 203 is configured to perform three-dimensional semantic annotation on each divided cluster based on the initial two-dimensional semantic annotation results, and determine the initial three-dimensional semantic annotation result of each cluster;
  • the second annotation module 204 is configured to determine a corrected two-dimensional semantic annotation result for each scene image based on the determined initial three-dimensional semantic annotation result of each cluster.
  • With the above apparatus, 3D semantic annotation can be performed on each divided cluster to determine the corresponding initial 3D semantic annotation result, and the corrected 2D semantic annotation result for each scene image can be determined based on the initial 3D semantic annotation results; that is, the embodiments of the present disclosure correct the initial two-dimensional semantic annotation results of scene images in two-dimensional space based on the initial three-dimensional semantic annotation results of clusters in three-dimensional space.
  • Because a cluster in three-dimensional space may correspond to multiple triangular patches, and each triangular patch may correspond to multiple scene images, the clusters can be labeled based on these correspondences, which makes the corrected two-dimensional semantic annotation results more accurate. At the same time, the correction process is automatic, making annotation more efficient.
  • the division module 202 is configured to divide each triangle patch included in the 3D scene model into multiple clusters according to the following steps:
  • a new center vector corresponding to each cluster is determined, and based on the new center vector, return to the step of dividing clusters until the division cut-off condition is met.
  • the division module 202 is configured to determine a new center vector corresponding to each cluster according to the following steps:
  • an average vector of each triangular patch is determined, and the average vector is used as a new center vector corresponding to the cluster.
  • the target vector includes a target normal vector and a target color vector
  • the center vector includes a center normal vector and a center color vector
  • the division module 202 is configured to determine the distance between the target vector of each triangular patch and each center vector according to the following steps:
  • for each triangular patch, based on the target normal vector of the triangular patch and the center normal vector in each center vector, determine a first distance between the triangular patch and the cluster where each center vector is located;
  • and, based on the target color vector of the triangular patch and the center color vector in each center vector, determine a second distance between the triangular patch and the cluster where each center vector is located;
  • the first labeling module 203 is configured to perform three-dimensional semantic labeling on each divided cluster based on the initial two-dimensional semantic labeling result according to the following steps, and determine the initial three-dimensional semantic labeling result of each cluster:
  • for each triangular patch, based on the conversion relationship between the first coordinate system corresponding to the three-dimensional scene model and the second coordinate system corresponding to the scene image, determine the projection image area corresponding to the triangular patch in at least one scene image; and,
  • the initial three-dimensional semantic annotation result of the cluster is determined.
  • the first tagging module 203 is configured to determine the initial 3D semantic tagging result of the triangular patch based on the initial 2D semantic tagging result corresponding to the projected image area of the triangular patch according to the following steps:
  • each triangular patch based on the initial two-dimensional semantic labeling result of the projection image area corresponding to the triangular patch in at least one scene image, determine the number of labels corresponding to various initial two-dimensional semantic labeling results;
  • the first labeling module 203 is configured to determine, for each divided cluster, the initial 3D semantic labeling result of the cluster based on the initial 3D semantic labeling result of each triangular patch included in the cluster, according to the following steps:
  • an initial two-dimensional semantic labeling result with the largest number of labels is determined as the initial three-dimensional semantic labeling result of the triangular patch;
  • the initial three-dimensional semantic annotation result corresponding to the largest number of triangular patches is determined as the initial three-dimensional semantic annotation result of the cluster.
  • the first labeling module 203 is configured to determine, for each divided cluster, the initial 3D semantic labeling result of the cluster based on the initial 3D semantic labeling result of each triangular patch included in the cluster, according to the following steps:
  • for each divided cluster, based on the probability value of each initial two-dimensional semantic labeling result corresponding to each triangular patch included in the cluster, and the pre-assigned weight of the triangular patch pointing to each initial two-dimensional semantic labeling result,
  • determine the probability value that the cluster belongs to each initial two-dimensional semantic labeling result;
  • based on the probability values that the cluster belongs to the various initial two-dimensional semantic labeling results, the initial three-dimensional semantic annotation result of the cluster is determined.
  • the above-mentioned dividing module 202 is further configured to:
  • the first labeling module 203 is configured to perform three-dimensional semantic labeling on each divided cluster based on the initial two-dimensional semantic labeling result according to the following steps, and determine the initial three-dimensional semantic labeling result of each cluster:
  • the second labeling module 204 is configured to determine a corrected two-dimensional semantic labeling result for each scene image based on the determined three-dimensional semantic labeling result of each cluster according to the following steps:
  • the corrected 2D semantic annotation results for each scene image are determined.
  • the second labeling module 204 is configured to determine the corrected two-dimensional semantic labeling result for each scene image based on the initial three-dimensional semantic labeling result of each cluster according to the following steps:
  • the corrected three-dimensional semantic annotation result is a three-dimensional semantic annotation result obtained by manually correcting the initial three-dimensional semantic annotation results of at least some of the clusters;
  • the corrected 2D semantic annotation results for each scene image are determined.
  • the second labeling module 204 is configured to determine the corrected two-dimensional semantic labeling result for each scene image based on the initial three-dimensional semantic labeling result of each cluster according to the following steps:
  • a corrected two-dimensional semantic annotation result for each scene image is determined.
  • the second labeling module 204 is configured to determine the corrected two-dimensional semantic labeling result for each scene image based on the initial three-dimensional semantic labeling result of each cluster according to the following steps:
  • for each cluster, based on the conversion relationship between the first coordinate system corresponding to the three-dimensional scene model and the second coordinate system corresponding to the scene image, determine, from the at least one scene image, the scene image corresponding to each triangular patch included in the cluster;
  • the corrected two-dimensional semantic annotation result for the scene image is determined.
  • the initial two-dimensional semantic labeling result is obtained by using the initial semantic labeling neural network to perform semantic labeling on each scene image, and the above-mentioned device also includes:
  • the training module 205 is configured to, after determining the corrected two-dimensional semantic annotation result for each scene image, train the initial semantic annotation neural network based on the corrected two-dimensional semantic annotation result for each scene image, and obtain the trained semantic annotation Neural Networks.
  • the embodiment of the present disclosure also provides an electronic device, as shown in FIG. 3 , which is a schematic structural diagram of the electronic device provided by the embodiment of the present disclosure, including: a processor 301 , a memory 302 , and a bus 303 .
  • the memory 302 stores machine-readable instructions executable by the processor 301 (for example, the execution instructions corresponding to the acquisition module 201, the division module 202, the first labeling module 203, and the second labeling module 204 in the apparatus in FIG. 2);
  • the processor 301 communicates with the memory 302 through the bus 303, and when the machine-readable instructions are executed by the processor 301, the following processes are performed:
  • the corrected 2D semantic annotation results for each scene image are determined.
  • Embodiments of the present disclosure further provide a computer-readable storage medium, on which a computer program is stored, and when the computer program is run by a processor, the steps of the semantic tagging method described in the foregoing method embodiments are executed.
  • the storage medium may be a volatile or non-volatile computer-readable storage medium.
  • the embodiments of the present disclosure also provide a computer program product carrying program code; the instructions included in the program code can be used to execute the steps of the semantic labeling method described in the above method embodiments; reference may be made to the above method embodiments.
  • the above-mentioned computer program product may be implemented by means of hardware, software or a combination thereof.
  • the computer program product is embodied as a computer storage medium, and in another optional embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK) and the like.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the functions are realized in the form of software function units and sold or used as independent products, they can be stored in a non-volatile computer-readable storage medium executable by a processor.
  • the technical solution of the present disclosure is essentially or the part that contributes to the prior art or the part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a storage medium, including Several instructions are used to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods described in various embodiments of the present disclosure.
  • the aforementioned storage media include: a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, an optical disc, and other media that can store program code.
  • the products applying the disclosed technical solution have clearly notified the personal information processing rules and obtained the individual's independent consent before processing personal information.
  • if the disclosed technical solution involves sensitive personal information, the products applying the disclosed technical solution have obtained the individual's separate consent before processing sensitive personal information and, at the same time, meet the requirement of "explicit consent". For example, at a personal information collection device such as a camera, a clear and prominent sign is set up to inform that the scope of personal information collection has been entered and that personal information will be collected.
  • the personal information processing rules may include information such as the personal information processor, the purpose of personal information processing, the processing method, and the types of personal information processed.


Abstract

The embodiments of the present disclosure provide a semantic annotation method and apparatus, an electronic device, a storage medium, and a computer program product. The method includes: acquiring a three-dimensional scene model reconstructed from at least one scene image, and an initial two-dimensional semantic annotation result of each scene image; dividing the triangular patches included in the three-dimensional scene model into a plurality of clusters; performing three-dimensional semantic annotation on each cluster obtained by the division based on the initial two-dimensional semantic annotation results, and determining an initial three-dimensional semantic annotation result of each cluster; and determining a corrected two-dimensional semantic annotation result for each scene image based on the initial three-dimensional semantic annotation result of each cluster. The embodiments of the present disclosure correct the initial two-dimensional semantic annotation results of the scene images in two-dimensional space based on the initial three-dimensional semantic annotation results of the clusters in three-dimensional space, making the corrected two-dimensional semantic annotation results more accurate; at the same time, the correction process is completed automatically, making annotation more efficient.

Description

A semantic annotation method and apparatus, electronic device, storage medium, and computer program product
Cross-Reference to Related Applications
The embodiments of the present disclosure are based on, and claim priority to, the Chinese patent application with application number 202111320552.7, filed on November 9, 2021 and entitled "Semantic annotation method and apparatus, electronic device and storage medium", the entire contents of which are incorporated into the present disclosure by reference.
Technical Field
The embodiments of the present disclosure relate to the technical field of semantic annotation, and in particular to a semantic annotation method and apparatus, an electronic device, a storage medium, and a computer program product.
Background
Semantic segmentation is a pixel-level understanding of an image: objects in the image are classified at the pixel level, that is, pixels belonging to the same class of object are grouped into one class and marked with a specified label. It is widely used in various technical scenarios, such as autonomous driving, indoor navigation, virtual reality, and image recognition.
In the related art, semantic segmentation of video images is mainly performed by a pre-trained image segmentation network, and the training of such a network requires a large number of annotated sample images. Existing semantic annotation methods mainly annotate two-dimensional video images pixel by pixel in a manual manner, so the annotation efficiency is low.
Summary
The embodiments of the present disclosure provide at least a semantic annotation method and apparatus, an electronic device, a storage medium, and a computer program product.
In a first aspect, an embodiment of the present disclosure provides a semantic annotation method, the method including:
acquiring a three-dimensional scene model reconstructed from at least one scene image, and an initial two-dimensional semantic annotation result of each scene image;
dividing the triangular patches included in the three-dimensional scene model into a plurality of clusters;
performing three-dimensional semantic annotation on each cluster obtained by the division based on the initial two-dimensional semantic annotation results, and determining an initial three-dimensional semantic annotation result of each cluster;
determining a corrected two-dimensional semantic annotation result for each scene image based on the initial three-dimensional semantic annotation result of each cluster.
In a second aspect, an embodiment of the present disclosure further provides a semantic annotation apparatus, the apparatus including:
an acquisition module, configured to acquire a three-dimensional scene model reconstructed from at least one scene image, and an initial two-dimensional semantic annotation result of each scene image;
a division module, configured to divide the triangular patches included in the three-dimensional scene model into a plurality of clusters;
a first annotation module, configured to perform three-dimensional semantic annotation on each cluster obtained by the division based on the initial two-dimensional semantic annotation results, and determine an initial three-dimensional semantic annotation result of each cluster;
a second annotation module, configured to determine a corrected two-dimensional semantic annotation result for each scene image based on the initial three-dimensional semantic annotation result of each cluster.
In a third aspect, an embodiment of the present disclosure further provides an electronic device, including a processor, a memory, and a bus. The memory stores machine-readable instructions executable by the processor; when the electronic device is running, the processor communicates with the memory through the bus, and when the machine-readable instructions are executed by the processor, the steps of the semantic annotation method according to the first aspect or any of its implementations are performed.
In a fourth aspect, an embodiment of the present disclosure further provides a computer-readable storage medium on which a computer program is stored; when the computer program is run by a processor, the steps of the semantic annotation method according to the first aspect or any of its implementations are performed.
In a fifth aspect, an embodiment of the present disclosure provides a computer program product, including a computer-readable storage medium storing program code; when the instructions included in the program code are run by a processor of a computer device, the steps of the above semantic annotation method are implemented.
With the above semantic annotation method, in the case where the three-dimensional scene model reconstructed from the scene images is divided into clusters, three-dimensional semantic annotation can be performed on each cluster based on the initial two-dimensional semantic annotation result corresponding to each scene image to determine the corresponding initial three-dimensional semantic annotation result, and the corrected two-dimensional semantic annotation result for each scene image can be determined based on the initial three-dimensional semantic annotation results. That is, the embodiments of the present disclosure correct the initial two-dimensional semantic annotation results of the scene images in two-dimensional space based on the initial three-dimensional semantic annotation results of the clusters in three-dimensional space. Since one cluster in three-dimensional space may point to multiple triangular patches, and each triangular patch may point to multiple scene images, the clusters can be annotated in an aggregated manner based on these correspondences, which makes the corrected two-dimensional semantic annotation results more accurate; at the same time, the above correction process is completed automatically, which makes the annotation more efficient.
For the description of the effects of the above semantic annotation apparatus, electronic device, computer-readable storage medium, and computer program product, reference may be made to the description of the above semantic annotation method.
It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and do not limit the present disclosure. To make the above objects, features, and advantages of the present disclosure more comprehensible, preferred embodiments are described in detail below in conjunction with the accompanying drawings.
Brief Description of the Drawings
To describe the technical solutions of the embodiments of the present disclosure more clearly, the drawings required in the embodiments are briefly introduced below. The drawings here are incorporated into and constitute a part of the specification; they illustrate embodiments consistent with the present disclosure and, together with the specification, serve to explain the technical solutions of the present disclosure. It should be understood that the following drawings show only some embodiments of the present disclosure and therefore should not be regarded as limiting the scope; for those of ordinary skill in the art, other related drawings can be obtained from these drawings without creative effort.
FIG. 1 shows a flowchart of a semantic annotation method provided by an embodiment of the present disclosure;
FIG. 2 shows a schematic diagram of a semantic annotation apparatus provided by an embodiment of the present disclosure;
FIG. 3 shows a schematic diagram of an electronic device provided by an embodiment of the present disclosure.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions in the embodiments of the present disclosure are described clearly and completely below in conjunction with the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present disclosure. The components of the embodiments of the present disclosure, as generally described and illustrated in the drawings herein, can be arranged and designed in a variety of different configurations. Therefore, the following detailed description of the embodiments of the present disclosure provided in the drawings is not intended to limit the scope of the claimed disclosure, but merely represents selected embodiments of the present disclosure. All other embodiments obtained by those skilled in the art based on the embodiments of the present disclosure without creative effort fall within the protection scope of the present disclosure.
It should be noted that similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it does not need to be further defined or explained in subsequent drawings.
The term "and/or" herein merely describes an association relationship and indicates that three relationships may exist; for example, A and/or B may indicate three cases: A alone, both A and B, and B alone. In addition, the term "at least one" herein indicates any one of multiple items or any combination of at least two of multiple items; for example, including at least one of A, B, and C may indicate including any one or more elements selected from the set consisting of A, B, and C.
In the related art, semantic segmentation of video images is usually performed by a pre-trained image segmentation network, and the training of such a network requires a large number of annotated sample images. Existing semantic annotation methods mainly annotate two-dimensional video images pixel by pixel in a manual manner, so the annotation efficiency is low.
On this basis, the embodiments of the present disclosure provide a semantic annotation method and apparatus, an electronic device, a storage medium, and a computer program product to improve annotation efficiency.
To facilitate understanding of the embodiments of the present disclosure, a semantic annotation method disclosed by an embodiment of the present disclosure is first introduced. The execution subject of the semantic annotation method provided by the embodiments of the present disclosure is a computer device with certain computing capability, for example a terminal device, a server, or another processing device; the terminal device may be a user equipment (User Equipment, UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a personal digital assistant (Personal Digital Assistant, PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like. In some possible implementations, the semantic annotation method may be implemented by a processor invoking computer-readable instructions stored in a memory.
Referring to FIG. 1, which is a flowchart of the semantic annotation method provided by an embodiment of the present disclosure, the method includes steps S101 to S104, in which:
S101: acquire a three-dimensional scene model reconstructed from at least one scene image, and an initial two-dimensional semantic annotation result of each scene image;
S102: divide the triangular patches included in the three-dimensional scene model into a plurality of clusters;
S103: perform three-dimensional semantic annotation on each cluster obtained by the division based on the initial two-dimensional semantic annotation results, and determine an initial three-dimensional semantic annotation result of each cluster;
S104: determine a corrected two-dimensional semantic annotation result for each scene image based on the initial three-dimensional semantic annotation result of each cluster.
To facilitate understanding of the semantic annotation method provided by the embodiments of the present disclosure, its application scenarios are first briefly described. The above semantic annotation method can be applied to scenarios requiring semantic annotation, for example, annotation of vehicles, pedestrians, and the like in autonomous driving, annotation of obstacles and the like in indoor navigation, or annotation in other scenarios, which is not limited by the embodiments of the present disclosure.
The related art adopts a pixel-by-pixel semantic annotation approach that is completed manually. Considering that a large number of annotated sample images are often required during network training, this consumes a lot of manpower and material resources.
To solve the above problem, the embodiments of the present disclosure provide a method for automatic semantic annotation: semantic annotation can be carried out in three-dimensional space, and then, based on the conversion relationship between the three-dimensional space and the two-dimensional space, the semantic annotation results in the three-dimensional space are projected into the two-dimensional space, which can greatly improve annotation efficiency.
To realize semantic annotation in three-dimensional space, a three-dimensional scene model reconstructed from one or more scene images, and the initial two-dimensional semantic annotation result of each scene image, may first be acquired.
The above three-dimensional scene model may be determined by three-dimensional reconstruction based on the scene images. Three-dimensional reconstruction here can generally be divided into three steps: sparse reconstruction, dense reconstruction, and surface reconstruction. Sparse reconstruction is used to complete the initialization calculation of all camera poses; its input may be the scene images, and its output may be the poses of the cameras that captured the scene images. Dense reconstruction is used to calculate, pixel by pixel and with the camera poses known, the three-dimensional point corresponding to each pixel in the images, obtaining a dense three-dimensional point cloud of the scene object surfaces; its input may be multi-view scene images and camera poses, and its output may be a dense point cloud. Surface reconstruction is used to complete the conversion from the dense point cloud to the three-dimensional scene model.
In some embodiments, the above three-dimensional reconstruction process may be implemented using colmap or other approaches.
In addition, the initial two-dimensional semantic annotation results of the scene images in the embodiments of the present disclosure may be obtained by annotation with an initial semantic annotation neural network. Based on the initial two-dimensional semantic annotation results of the scene images and the coordinate conversion relationship between the two-dimensional space and the three-dimensional space, three-dimensional semantic annotation of the triangular patches included in the three-dimensional scene model can be realized.
The above initial semantic annotation neural network may be trained using annotated sample images stored in an existing database, and what is trained may be the correspondence between the annotated sample images and the annotation information.
In the embodiments of the present disclosure, the three-dimensional scene model may be assembled from triangular patches; that is, the triangular patch may be the smallest constituent unit of the three-dimensional scene model. In practical applications, considering that the semantics of different triangular patches may be the same or different, the triangular patches may be divided into clusters according to semantics, so that unified semantic annotation can be performed on the triangular patches belonging to the same semantics within a cluster, thereby improving annotation efficiency while improving annotation accuracy.
The initial three-dimensional semantic annotation result of a cluster in the embodiments of the present disclosure may be determined based on the above initial two-dimensional semantic annotation results of the scene images. Considering the correspondence between one cluster and multiple triangular patches, and the correspondence between one triangular patch and multiple scene images, once the initial two-dimensional semantic annotation results of the scene images are determined, the initial three-dimensional semantic annotation result of a cluster can be determined based on these correspondences.
In the embodiments of the present disclosure, once the initial three-dimensional semantic annotation result of each cluster is determined, the corrected two-dimensional semantic annotation results of all scene images can be determined at once based on the coordinate conversion relationship between the three-dimensional space and the two-dimensional space, which greatly improves annotation efficiency compared with manual annotation in two-dimensional space.
In the embodiments of the present disclosure, the initial semantic annotation neural network can be trained based on the corrected two-dimensional semantic annotation results of the scene images. Since the corrected two-dimensional semantic annotation results are more accurate, the trained semantic annotation neural network annotates more accurately, which facilitates subsequent application in various scenarios.
Considering the key role of cluster division in subsequent semantic annotation, the method of dividing clusters is introduced next. The above cluster division method may include the following steps:
Step 1: from the triangular patches included in the three-dimensional scene model, randomly select a preset number of triangular patches, and use the target vector of each selected triangular patch as the center vector of a cluster to be divided;
Step 2: determine the distance between the target vector of each triangular patch and each center vector, and assign the triangular patch to the cluster of the center vector with the smallest corresponding distance;
Step 3: determine a new center vector corresponding to each cluster, and, based on the new center vectors, return to the cluster division step until a division cut-off condition is met.
The target vector here includes at least one of the following: a target normal vector; a target color vector. These can characterize the features of a triangular patch from different dimensions. The target normal vector and target color vector may be determined based on the three-dimensional reconstruction result.
The embodiments of the present disclosure may perform at least one division operation based on the distances between the center vector of each cluster and the target vector of each triangular patch, and the final division result is obtained once the division cut-off condition is met.
After each division is completed, a new center vector corresponding to each cluster can be determined, and the next division is performed based on this center vector. The center vector of a cluster may be the average vector determined from the target vectors of the triangular patches in the cluster; that is, it may be the average of the target vectors of all triangular patches in the cluster.
The division cut-off condition in the embodiments of the present disclosure may be that, for a cluster, the center vector computed over multiple iterations no longer changes, or that the number of divisions reaches a preset number, or another division cut-off condition, which is not limited by the embodiments of the present disclosure.
In the embodiments of the present disclosure, at least one cluster division can be realized based on the target vectors of the triangular patches and the center vectors of the clusters. Each time a division is performed, a new center vector can be determined for each cluster for the next division. Since the multiple triangular patches divided into one cluster are correlated to a large extent, the annotation results corresponding to the multiple triangular patches of a cluster can be used to determine the initial three-dimensional semantic annotation result of the cluster with high accuracy, thereby further improving the accuracy of the subsequent corrected two-dimensional semantic annotation results.
Considering that the normal vector and the color vector do not influence the cluster division to the same extent, two distances may be determined based on the two kinds of vectors, and the above distance may be determined by weighted summation. In some embodiments, this may be realized by the following steps:
Step 1: for each triangular patch, based on the target normal vector of the triangular patch and the center normal vector in each center vector, determine a first distance between the triangular patch and the cluster where each center vector is located; and, based on the target color vector of the triangular patch and the center color vector in each center vector, determine a second distance between the triangular patch and the cluster where each center vector is located;
Step 2: based on the first distance and its corresponding weight, and the second distance and its corresponding weight, determine the distance between the target vector of the triangular patch and each center vector.
Here, the weight corresponding to the first distance and the weight corresponding to the second distance may be set according to different application scenarios. For example, for a scene of a single color, the consistency between the color vector and the center color vector is crucial for cluster division, and a higher weight may be assigned to the second distance in that case.
The division method provided by the embodiments of the present disclosure aims to make triangular patches with the same semantics belong to the same cluster and triangular patches with different semantics belong to different clusters. Here, suppose the target vectors of the triangular patches of a cluster are denoted by R_i, where R_i includes the target normal vector n(x) and the target color vector c(x), and the center vector of the cluster is P_i, where P_i includes the center normal vector n_i and the center color vector c_i. To achieve the above technical purpose, a set of clusters is found that minimizes an error of the following form:
min over {R_i, P_i} of Σ_i Σ_{x ∈ R_i} E(x, P_i)
where the error function combines the first distance and the second distance with their corresponding weights w_n and w_c:
E(x, P_i) = w_n · ‖n(x) − n_i‖ + w_c · ‖c(x) − c_i‖
In the embodiments of the present disclosure, considering that the normal vector and the color vector of a triangular patch do not influence the cluster division to the same extent, the determination of the distances between the target vector of a triangular patch and each center vector can be realized based on a weighted summation method, which has wider applicability.
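The division procedure described above behaves like a k-means-style clustering over the faces' normal and color vectors with a weighted combined distance. The following is an illustrative sketch; the function name, the Euclidean distances, and the convergence test are assumptions made for the example, not details prescribed by the disclosure:

```python
import numpy as np

def cluster_faces(normals, colors, k, w_normal=1.0, w_color=1.0,
                  max_iters=50, seed=0):
    """Divide triangle faces into k clusters by their normal and color vectors."""
    rng = np.random.default_rng(seed)
    n_faces = normals.shape[0]
    # Step 1: randomly pick a preset number (k) of faces; their target vectors
    # become the initial center vectors of the clusters to be divided.
    idx = rng.choice(n_faces, size=k, replace=False)
    center_n, center_c = normals[idx].copy(), colors[idx].copy()
    labels = None
    for _ in range(max_iters):
        # Step 2: a "first distance" from the normals and a "second distance"
        # from the colors, combined by a weighted sum; each face is assigned
        # to the cluster whose center is nearest.
        d_n = np.linalg.norm(normals[:, None, :] - center_n[None, :, :], axis=2)
        d_c = np.linalg.norm(colors[:, None, :] - center_c[None, :, :], axis=2)
        new_labels = np.argmin(w_normal * d_n + w_color * d_c, axis=1)
        # Cut-off condition: the assignment no longer changes.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: the new center vector of a cluster is the average vector
        # of the faces assigned to it.
        for j in range(k):
            members = labels == j
            if members.any():
                center_n[j] = normals[members].mean(axis=0)
                center_c[j] = colors[members].mean(axis=0)
    return labels
```

Raising `w_color` relative to `w_normal` corresponds to the single-color-scene case discussed above, where color consistency is given a higher weight.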
In the embodiments of the present disclosure, the initial three-dimensional semantic annotation result of a cluster may be determined based on the initial two-dimensional semantic annotation results corresponding to the scene images, including the following steps:
Step 1: for each triangular patch, based on the conversion relationship between the first coordinate system corresponding to the three-dimensional scene model and the second coordinate system corresponding to the scene images, determine the projection image area corresponding to the triangular patch in at least one scene image;
Step 2: based on the initial two-dimensional semantic annotation result of the projection image area corresponding to the triangular patch in the at least one scene image, determine the initial three-dimensional semantic annotation result of the triangular patch;
Step 3: for each cluster obtained by the division, based on the initial three-dimensional semantic annotation result of each triangular patch included in the cluster, determine the initial three-dimensional semantic annotation result of the cluster.
Here, considering the conversion relationship between the first coordinate system corresponding to the three-dimensional scene model and the second coordinate system corresponding to the scene images, the projection image area corresponding to each triangular patch in each scene image may first be determined.
The projection image area here may be the set of pixels obtained by projecting onto the scene image according to data such as the center coordinates of the triangular patch, the image pose, and the intrinsic parameters. Since the initial two-dimensional semantic annotation result of each scene image is annotated in advance, the initial two-dimensional semantic annotation result of the projection image area corresponding to the scene image can also be determined.
Once the initial two-dimensional semantic annotation results of the projection image areas of a triangular patch are determined, the number of labels corresponding to the various initial two-dimensional semantic annotation results can be determined; that is, the number of votes for the triangular patch belonging to each label class can be obtained. Normalizing the votes of each label class yields the semantic probability that the triangular patch belongs to a label class. Here, the initial two-dimensional semantic annotation result with the largest semantic probability, that is, with the largest number of labels, may be selected as the initial three-dimensional semantic annotation result of the triangular patch.
A cluster here is composed of triangular patches. For each cluster, based on the initial three-dimensional semantic annotation results of the triangular patches included in the cluster, the number of triangular patches corresponding to each initial three-dimensional semantic annotation result can be determined; the more triangular patches an initial three-dimensional semantic annotation result corresponds to, the more likely it is to be the initial three-dimensional semantic annotation result of the cluster. Here, the initial three-dimensional semantic annotation result corresponding to the largest number of triangular patches may be selected as the initial three-dimensional semantic annotation result of the cluster.
In the embodiments of the present disclosure, the initial three-dimensional semantic annotation result of a triangular patch can be jointly determined based on the initial two-dimensional semantic annotation results of the scene images, and the initial three-dimensional semantic annotation result of a cluster can be determined based on the initial three-dimensional semantic annotation results of the multiple triangular patches included in the cluster. It can be seen that the determination of the initial three-dimensional semantic annotation result of a cluster depends on the three-dimensional annotation results of multiple triangular patches, and the annotation result of each triangular patch in turn depends on the two-dimensional annotation results of the scene images, which makes the finally determined initial three-dimensional semantic annotation result of the cluster more accurate.
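The per-patch voting described above can be sketched as follows; the data layout (per-image pixel lists for the projection image areas, and per-image label maps) is an assumption made for the example:

```python
from collections import Counter

def triangle_label_by_voting(projected_regions, annotation_maps):
    """Vote a triangle's 3D label from its projected 2D annotations.

    projected_regions: {image_id: list of (row, col) pixels covered by the
    triangle's projection in that scene image}.
    annotation_maps: {image_id: 2D grid of initial 2D semantic labels}.
    Returns (winning_label, votes): every covered pixel casts one vote for
    its label, and the label with the most votes wins.
    """
    votes = Counter()
    for image_id, pixels in projected_regions.items():
        labels = annotation_maps[image_id]
        for r, c in pixels:
            votes[labels[r][c]] += 1
    winner, _ = votes.most_common(1)[0]
    return winner, votes
```

Normalizing `votes` by its total gives the per-label semantic probabilities mentioned in the text.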
In addition, the embodiments of the present disclosure may also determine the initial three-dimensional semantic annotation result of a cluster according to the following steps:
Step 1: calculate the ratio of the number of labels corresponding to each initial two-dimensional semantic annotation result to the total number of labels, to determine the probability value that the triangular patch belongs to each initial two-dimensional semantic annotation result;
Step 2: based on the probability values of belonging to the various initial two-dimensional semantic annotation results, determine the initial three-dimensional semantic annotation result of the triangular patch;
Step 3: for each cluster obtained by the division, based on the probability value, corresponding to each triangular patch included in the cluster, of belonging to each initial two-dimensional semantic annotation result, and the weight pre-assigned to the triangular patch pointing to each initial two-dimensional semantic annotation result, determine the probability value that the cluster belongs to each initial two-dimensional semantic annotation result;
Step 4: based on the probability values that the cluster belongs to the various initial two-dimensional semantic annotation results, determine the initial three-dimensional semantic annotation result of the cluster.
Here, the probability value of a triangular patch belonging to each initial two-dimensional semantic annotation result may be determined first, and then the initial three-dimensional semantic annotation result of each cluster may be determined by combining the triangular patches contained in the cluster.
The probability value of a triangular patch belonging to each initial two-dimensional semantic annotation result is realized based on the ratio of label counts. Here, the number of votes (corresponding number of labels) determined for each initial two-dimensional semantic annotation result in the process of the triangular patch pointing to the corresponding projection image areas in the scene images can be counted, and the total number of votes (corresponding total number of labels) can also be counted; the above probability value can be determined by the ratio of the two numbers.
In the process of probability calculation for a cluster, this may be done based on the weight pre-assigned to each triangular patch included in the cluster pointing to each initial two-dimensional semantic annotation result.
In some embodiments, the above weight may be determined in combination with the area of the triangular patch: a triangular patch with a larger area may be assigned a larger weight, and a triangular patch with a smaller area may be assigned a smaller weight, considering that large-area triangular patches often play a more critical role in the voting process. Here, the probability values that a cluster belongs to the various initial two-dimensional semantic annotation results can be determined by weighted summation, and the initial three-dimensional semantic annotation result of a cluster can then be regarded as a probability distribution.
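The area-weighted aggregation of per-patch probabilities into a cluster-level probability distribution can be sketched as follows; the dictionary-based layout is an assumption made for the example:

```python
def cluster_label_distribution(face_probs, face_weights):
    """Aggregate per-face label probabilities into a cluster distribution.

    face_probs: list of {label: probability} dicts, one per triangular patch
    in the cluster.
    face_weights: per-face weights, e.g. proportional to triangle area, so
    that large faces dominate the vote.
    Returns {label: probability} for the cluster; weights are normalized so
    the resulting values sum to 1 when each face's probabilities do.
    """
    total_w = sum(face_weights)
    dist = {}
    for probs, w in zip(face_probs, face_weights):
        for label, p in probs.items():
            dist[label] = dist.get(label, 0.0) + (w / total_w) * p
    return dist
```

The cluster's initial 3D annotation can then be read off as the argmax of this distribution, or the whole distribution can be kept, as the text notes.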
In practical applications, it may happen that a cluster cannot be directly semantically annotated. To solve this problem, the embodiments of the present disclosure may further divide the cluster and achieve the final annotation effect by annotating the divided sub-clusters.
Here, for the way of dividing sub-clusters, refer to the above way of dividing clusters.
In the embodiments of the present disclosure, when it is determined that there are multiple semantic annotation possibilities for a cluster, the cluster may first be divided into sub-clusters, and semantic annotation may then be realized based on the divided sub-clusters, which improves annotation precision.
Considering the strong influence of manual correction on label accuracy, here, corrected three-dimensional semantic annotation results obtained by manually correcting the initial three-dimensional semantic annotation results of at least some of the multiple clusters may first be acquired, and then the corrected two-dimensional semantic annotation result for each scene image may be determined based on the acquired corrected three-dimensional semantic annotation results and the initial three-dimensional semantic annotation results of the other clusters that have not been manually corrected.
In the embodiments of the present disclosure, considering the key influence of manual correction on the accuracy of the subsequent corrected two-dimensional semantic annotation results, the corrected two-dimensional semantic annotation result for each scene image can be determined based on the manually corrected three-dimensional semantic annotation results of at least some clusters, further improving annotation accuracy.
The embodiments of the present disclosure may perform manual correction with a web-based (World Wide Web, Web) annotator; that is, the initial three-dimensional semantic annotation result of each cluster may be loaded into a web-based annotator and corrected manually with this annotation tool.
For example, when it is determined that a target cluster with different semantics exists among multiple concatenated clusters, this target cluster may be retrieved, and when it is determined that the target cluster is annotated incorrectly, the annotation may be manually corrected to the correct result. Here, the probability of the class to which the corrected cluster belongs is set to 1, and the probabilities of the other classes are set to 0.
It should be noted that, when an annotation result is manually corrected, the semantic annotation results of the cluster and of the triangular patches within the cluster will change accordingly.
In the embodiments of the present disclosure, the three-dimensional semantic annotation results of the clusters can be corrected not only by manual correction, but also by updating the three-dimensional semantic annotation results of the clusters based on a predefined energy function, so that the determined corrected two-dimensional semantic annotation results for the scene images are more accurate.
The above update process of the three-dimensional semantic annotation results may include the following steps:
Step 1: based on the initial three-dimensional semantic annotation result of each cluster, determine the probability distribution error term corresponding to each cluster at the semantic annotation level; and, based on the included-angle information between any two adjacent clusters that do not belong to the same three-dimensional semantic annotation result, determine the included-angle error term corresponding to each cluster at the geometric feature level;
Step 2: determine an energy function according to the probability distribution error term and the included-angle error term;
Step 3: determine, in the case where the value of the energy function is minimized, the updated three-dimensional semantic annotation result for each cluster.
To facilitate understanding of the above update process, it can be explained with an energy function of the following form:
E(x) = Σ_{f ∈ F} e_1(f, x_f) + Σ_{(f,g) ∈ N} e_2(f, g)
where the two terms may, for example, take the form
e_1(f, x_f) = 1 − P(f | x_f), and e_2(f, g) = 1 − θ(f, g)/π when x_f ≠ x_g (and 0 otherwise).
Here, e_1(·) and e_2(·) correspond to the probability distribution error term and the included-angle error term respectively; F denotes the set of all clusters, N denotes the set of adjacent clusters, x_f denotes the three-dimensional semantic annotation result to which cluster f belongs, P(f | x_f) denotes the probability that cluster f belongs to the three-dimensional semantic annotation result x_f, and θ(f, g) denotes the included angle between cluster f and cluster g.
It should be noted that, for the probability distribution error term, in addition to the initial three-dimensional semantic annotation results of the clusters, the manually corrected annotation results may also be considered; for example, the manually corrected annotation results may be given the largest semantic probability.
By minimizing the above energy function, the updated three-dimensional semantic annotation result of each cluster can be determined.
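A minimal sketch of this energy minimization is given below. The concrete forms of the data term and the angle-based smoothness term, and the use of iterated conditional modes (ICM) as the minimizer, are assumptions made for the example; the disclosure fixes only the overall structure of the energy function:

```python
import math

def energy(labels, probs, neighbors, angles, lam=1.0):
    """Total energy: a probability-distribution (data) term per cluster plus
    an included-angle (smoothness) term per pair of adjacent clusters whose
    labels differ. Both inner forms are assumptions for this sketch."""
    data = sum(1.0 - probs[f].get(labels[f], 0.0) for f in range(len(labels)))
    smooth = 0.0
    for f, g in neighbors:
        if labels[f] != labels[g]:
            # Penalize label changes across near-flat junctions (small angle)
            # more than across sharp creases.
            smooth += 1.0 - angles[(f, g)] / math.pi
    return data + lam * smooth

def icm(labels, probs, neighbors, angles, label_set, lam=1.0, sweeps=10):
    """Iterated conditional modes: greedily relabel one cluster at a time
    while doing so lowers the total energy."""
    labels = list(labels)
    for _ in range(sweeps):
        changed = False
        for f in range(len(labels)):
            best = min(label_set,
                       key=lambda lbl: energy(labels[:f] + [lbl] + labels[f + 1:],
                                              probs, neighbors, angles, lam))
            if best != labels[f]:
                labels[f] = best
                changed = True
        if not changed:
            break
    return labels
```

A manually corrected cluster can be pinned, as described above, by setting its probability for the corrected class to 1 and the other classes to 0, which makes any other label maximally expensive under the data term.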
In the embodiments of the present disclosure, the correction of the two-dimensional semantic annotation results of the scene images can be realized based on the updated three-dimensional semantic annotation result of each cluster; this is done automatically and can further improve annotation efficiency.
Here, first, based on the conversion relationship between the first coordinate system corresponding to the three-dimensional scene model and the second coordinate system corresponding to the scene images, the scene image corresponding to each triangular patch included in each cluster can be determined from the at least one scene image; then, for each scene image, the corrected two-dimensional semantic annotation result for the scene image is determined based on the updated three-dimensional semantic annotation result of the cluster to which the triangular patch corresponding to the scene image belongs. That is, the embodiments of the present disclosure can render the three-dimensional scene model carrying three-dimensional semantics into two dimensions to obtain the two-dimensional semantic annotation corresponding to the scene image; this is realized automatically, with high efficiency and accuracy.
The above rendering process from three dimensions to two dimensions may be implemented based on the Open Graphics Library (OpenGL), or in other ways, which is not limited here.
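The 3D-to-2D correspondence underlying this rendering step can be illustrated with a plain pinhole projection of a triangle's vertices. The matrix conventions and the simple visibility checks below are assumptions made for the example; in practice the rendering may be done with OpenGL as noted above:

```python
import numpy as np

def project_triangle(vertices_world, K, R, t, image_size):
    """Project a triangle's 3D vertices into one scene image.

    vertices_world: (3, 3) triangle vertices in the model (first) coordinate
    system; K: (3, 3) camera intrinsics; R: (3, 3) and t: (3,) world-to-camera
    extrinsics; image_size: (height, width) of the scene image.
    Returns the (3, 2) pixel coordinates in the image (second) coordinate
    system, or None if the triangle is behind the camera or outside the image.
    """
    cam = vertices_world @ R.T + t            # world -> camera coordinates
    if np.any(cam[:, 2] <= 0):
        return None                           # behind the camera
    pix = cam @ K.T
    pix = pix[:, :2] / pix[:, 2:3]            # perspective division
    h, w = image_size
    if np.any(pix < 0) or np.any(pix[:, 0] >= w) or np.any(pix[:, 1] >= h):
        return None                           # outside this scene image
    return pix
```

Filling the pixels inside the projected triangle with the label of the cluster it belongs to, for every visible triangle, yields the corrected 2D semantic annotation of that scene image.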
It should be noted that the above descriptions of the embodiments tend to emphasize the differences between the embodiments; for what they have in common, the embodiments may be referred to one another. Those skilled in the art can understand that, in the above methods of the detailed description, the order in which the steps are written does not imply a strict execution order or constitute any limitation on the implementation process; the execution order of the steps should be determined by their functions and possible internal logic.
Based on the same inventive concept, an embodiment of the present disclosure further provides a semantic annotation apparatus corresponding to the semantic annotation method. Since the apparatus in the embodiments of the present disclosure solves the problem on a principle similar to that of the above semantic annotation method of the embodiments of the present disclosure, the implementation of the apparatus may refer to the implementation of the method.
Referring to FIG. 2, which is a schematic diagram of a semantic annotation apparatus provided by an embodiment of the present disclosure, the apparatus includes: an acquisition module 201, a division module 202, a first annotation module 203, and a second annotation module 204; in which,
the acquisition module 201 is configured to acquire a three-dimensional scene model reconstructed from at least one scene image, and an initial two-dimensional semantic annotation result of each scene image;
the division module 202 is configured to divide the triangular patches included in the three-dimensional scene model into a plurality of clusters;
the first annotation module 203 is configured to perform three-dimensional semantic annotation on each cluster obtained by the division based on the initial two-dimensional semantic annotation results, and determine an initial three-dimensional semantic annotation result of each cluster;
the second annotation module 204 is configured to determine a corrected two-dimensional semantic annotation result for each scene image based on the determined initial three-dimensional semantic annotation result of each cluster.
In the embodiments of the present disclosure, in the case where the three-dimensional scene model reconstructed from the scene images is divided into clusters, three-dimensional semantic annotation can be performed on each cluster based on the initial two-dimensional semantic annotation result corresponding to each scene image to determine the corresponding initial three-dimensional semantic annotation result, and the corrected two-dimensional semantic annotation result for each scene image can be determined based on the initial three-dimensional semantic annotation results. That is, the embodiments of the present disclosure correct the initial two-dimensional semantic annotation results of the scene images in two-dimensional space based on the initial three-dimensional semantic annotation results of the clusters in three-dimensional space. Since one cluster in three-dimensional space may point to multiple triangular patches, and each triangular patch may point to multiple scene images, the clusters can be annotated in an aggregated manner based on these correspondences, which makes the corrected two-dimensional semantic annotation results more accurate; at the same time, the above correction process is completed automatically, which makes the annotation more efficient.
In a possible implementation, the division module 202 is configured to divide the triangular patches included in the three-dimensional scene model into a plurality of clusters according to the following steps:
from the triangular patches included in the three-dimensional scene model, randomly select a preset number of triangular patches, and use the target vector of each selected triangular patch as the center vector of a cluster to be divided;
determine the distance between the target vector of each triangular patch and each center vector, and assign the triangular patch to the cluster of the center vector with the smallest corresponding distance;
determine a new center vector corresponding to each cluster, and, based on the new center vectors, return to the cluster division step until a division cut-off condition is met.
In a possible implementation, the division module 202 is configured to determine the new center vector corresponding to each cluster according to the following step:
based on the target vectors of the triangular patches included in each cluster, determine the average vector of the triangular patches, and use the average vector as the new center vector corresponding to the cluster.
In a possible implementation, the target vector includes a target normal vector and a target color vector, and the center vector includes a center normal vector and a center color vector; the division module 202 is configured to determine the distance between the target vector of each triangular patch and each center vector according to the following steps:
for each triangular patch, based on the target normal vector of the triangular patch and the center normal vector in each center vector, determine a first distance between the triangular patch and the cluster where each center vector is located; and, based on the target color vector of the triangular patch and the center color vector in each center vector, determine a second distance between the triangular patch and the cluster where each center vector is located;
based on the first distance and its corresponding weight, and the second distance and its corresponding weight, determine the distance between the target vector of the triangular patch and each center vector.
In a possible implementation, the first annotation module 203 is configured to perform three-dimensional semantic annotation on each cluster obtained by the division based on the initial two-dimensional semantic annotation results and determine the initial three-dimensional semantic annotation result of each cluster according to the following steps:
for each triangular patch, based on the conversion relationship between the first coordinate system corresponding to the three-dimensional scene model and the second coordinate system corresponding to the scene images, determine the projection image area corresponding to the triangular patch in at least one scene image; and,
based on the initial two-dimensional semantic annotation result of the projection image area corresponding to the triangular patch in the at least one scene image, determine the initial three-dimensional semantic annotation result of the triangular patch;
for each cluster obtained by the division, based on the initial three-dimensional semantic annotation result of each triangular patch included in the cluster, determine the initial three-dimensional semantic annotation result of the cluster.
In a possible implementation, the first annotation module 203 is configured to determine the initial three-dimensional semantic annotation result of the triangular patch based on the initial two-dimensional semantic annotation result corresponding to the projection image area of the triangular patch according to the following steps:
for each triangular patch, based on the initial two-dimensional semantic annotation result of the projection image area corresponding to the triangular patch in the at least one scene image, determine the number of labels corresponding to the various initial two-dimensional semantic annotation results;
based on the number of labels corresponding to the various initial two-dimensional semantic annotation results, determine the initial three-dimensional semantic annotation result of the triangular patch.
In a possible implementation, the first annotation module 203 is configured to determine, for each cluster obtained by the division, the initial three-dimensional semantic annotation result of the cluster based on the initial three-dimensional semantic annotation result of each triangular patch included in the cluster according to the following steps:
determine the initial two-dimensional semantic annotation result with the largest number of labels as the initial three-dimensional semantic annotation result of the triangular patch;
based on the initial three-dimensional semantic annotation results of the triangular patches included in each cluster, determine the number of triangular patches corresponding to the various initial three-dimensional semantic annotation results;
determine the initial three-dimensional semantic annotation result corresponding to the largest number of triangular patches as the initial three-dimensional semantic annotation result of the cluster.
In a possible implementation, the first annotation module 203 is configured to determine, for each cluster obtained by the division, the initial three-dimensional semantic annotation result of the cluster based on the initial three-dimensional semantic annotation result of each triangular patch included in the cluster according to the following steps:
calculate the ratio of the number of labels corresponding to each initial two-dimensional semantic annotation result to the total number of labels, to determine the probability value that the triangular patch belongs to each initial two-dimensional semantic annotation result;
based on the probability values of belonging to the various initial two-dimensional semantic annotation results, determine the initial three-dimensional semantic annotation result of the triangular patch;
for each cluster obtained by the division, based on the probability value, corresponding to each triangular patch included in the cluster, of belonging to each initial two-dimensional semantic annotation result, and the weight pre-assigned to the triangular patch pointing to each initial two-dimensional semantic annotation result, determine the probability value that the cluster belongs to each initial two-dimensional semantic annotation result;
based on the probability values that the cluster belongs to the various initial two-dimensional semantic annotation results, determine the initial three-dimensional semantic annotation result of the cluster.
In a possible implementation, when it is determined that the probability values of any cluster belonging to multiple initial two-dimensional semantic annotation results are greater than a preset threshold, the above division module 202 is further configured to:
for the cluster, divide the cluster into a plurality of sub-clusters;
the first annotation module 203 is configured to perform three-dimensional semantic annotation on each cluster obtained by the division based on the initial two-dimensional semantic annotation results and determine the initial three-dimensional semantic annotation result of each cluster according to the following step:
based on the initial two-dimensional semantic annotation results, perform three-dimensional semantic annotation on each sub-cluster obtained by the division, and determine the initial three-dimensional semantic annotation result of each sub-cluster;
the second annotation module 204 is configured to determine the corrected two-dimensional semantic annotation result for each scene image based on the determined three-dimensional semantic annotation result of each cluster according to the following step:
based on the initial three-dimensional semantic annotation result of each sub-cluster, determine the corrected two-dimensional semantic annotation result for each scene image.
In a possible implementation, the second annotation module 204 is configured to determine the corrected two-dimensional semantic annotation result for each scene image based on the initial three-dimensional semantic annotation result of each cluster according to the following steps:
acquire corrected three-dimensional semantic annotation results; the corrected three-dimensional semantic annotation results are three-dimensional semantic annotation results obtained by manually correcting the initial three-dimensional semantic annotation results of at least some of the multiple clusters;
based on the corrected three-dimensional semantic annotation results, and the initial three-dimensional semantic annotation results of the other clusters that have not been manually corrected, determine the corrected two-dimensional semantic annotation result for each scene image.
In a possible implementation, the second annotation module 204 is configured to determine the corrected two-dimensional semantic annotation result for each scene image based on the initial three-dimensional semantic annotation result of each cluster according to the following steps:
based on the initial three-dimensional semantic annotation result of each cluster, determine the probability distribution error term corresponding to each cluster at the semantic annotation level; and, based on the included-angle information between any two adjacent clusters that do not belong to the same three-dimensional semantic annotation result, determine the included-angle error term corresponding to each cluster at the geometric feature level;
determine an energy function according to the probability distribution error term and the included-angle error term;
determine, in the case where the value of the energy function is minimized, the updated three-dimensional semantic annotation result for each cluster;
based on the updated three-dimensional semantic annotation result for each cluster, determine the corrected two-dimensional semantic annotation result for each scene image.
In a possible implementation, the second annotation module 204 is configured to determine the corrected two-dimensional semantic annotation result for each scene image based on the initial three-dimensional semantic annotation result of each cluster according to the following steps:
for each cluster, based on the conversion relationship between the first coordinate system corresponding to the three-dimensional scene model and the second coordinate system corresponding to the scene images, determine, from the at least one scene image, the scene image corresponding to each triangular patch included in the cluster;
for each scene image, based on the initial three-dimensional semantic annotation result of the cluster to which the triangular patch corresponding to the scene image belongs, determine the corrected two-dimensional semantic annotation result for the scene image.
In a possible implementation, the initial two-dimensional semantic annotation results are obtained by semantically annotating each scene image with an initial semantic annotation neural network, and the above apparatus further includes:
a training module 205, configured to, after the corrected two-dimensional semantic annotation result for each scene image is determined, train the initial semantic annotation neural network based on the corrected two-dimensional semantic annotation result of each scene image to obtain a trained semantic annotation neural network.
For the description of the processing flow of each module in the apparatus and the interaction flow between the modules, reference may be made to the relevant description in the above method embodiments.
An embodiment of the present disclosure further provides an electronic device. As shown in FIG. 3, which is a schematic structural diagram of the electronic device provided by an embodiment of the present disclosure, the device includes: a processor 301, a memory 302, and a bus 303. The memory 302 stores machine-readable instructions executable by the processor 301 (for example, the execution instructions corresponding to the acquisition module 201, the division module 202, the first annotation module 203, and the second annotation module 204 in the apparatus in FIG. 2). When the electronic device is running, the processor 301 communicates with the memory 302 through the bus 303, and when the machine-readable instructions are executed by the processor 301, the following processing is performed:
acquire a three-dimensional scene model reconstructed from at least one scene image, and an initial two-dimensional semantic annotation result of each scene image;
divide the triangular patches included in the three-dimensional scene model into a plurality of clusters;
perform three-dimensional semantic annotation on each cluster obtained by the division based on the initial two-dimensional semantic annotation results, and determine an initial three-dimensional semantic annotation result of each cluster;
determine a corrected two-dimensional semantic annotation result for each scene image based on the initial three-dimensional semantic annotation result of each cluster.
An embodiment of the present disclosure further provides a computer-readable storage medium on which a computer program is stored; when the computer program is run by a processor, the steps of the semantic annotation method described in the above method embodiments are executed. The storage medium may be a volatile or non-volatile computer-readable storage medium.
An embodiment of the present disclosure further provides a computer program product carrying program code; the instructions included in the program code can be used to execute the steps of the semantic annotation method described in the above method embodiments; reference may be made to the above method embodiments.
The above computer program product may be implemented by hardware, software, or a combination thereof. In an optional embodiment, the computer program product is embodied as a computer storage medium; in another optional embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK) or the like.
Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the system and apparatus described above, reference may be made to the corresponding processes in the foregoing method embodiments. In the several embodiments provided by the present disclosure, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative; for example, the division of the units is only a logical functional division, and there may be other division ways in actual implementation; for another example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some communication interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present disclosure may be integrated into one processing unit, each unit may exist physically alone, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a processor-executable non-volatile computer-readable storage medium. Based on this understanding, the technical solution of the present disclosure, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present disclosure. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, or an optical disc.
If the disclosed technical solution involves personal information, products applying the disclosed technical solution clearly notify the personal information processing rules and obtain the individual's consent before processing personal information. If the disclosed technical solution involves sensitive personal information, products applying the disclosed technical solution obtain the individual's separate consent before processing sensitive personal information and, at the same time, meet the requirement of "explicit consent". For example, at a personal information collection device such as a camera, a clear and prominent sign is set up to inform that the scope of personal information collection has been entered and that personal information will be collected; if an individual voluntarily enters the collection scope, this is regarded as consent to the collection of his or her personal information. Alternatively, on a personal information processing device, when the personal information processing rules are communicated by obvious signs or information, personal authorization is obtained through a pop-up message or by asking the individual to upload his or her personal information; the personal information processing rules may include information such as the personal information processor, the purpose of personal information processing, the processing method, and the types of personal information processed.
Finally, it should be noted that the above embodiments are only specific implementations of the present disclosure, used to illustrate, not to limit, the technical solutions of the present disclosure, and the protection scope of the present disclosure is not limited thereto. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that any person skilled in the art can still modify, or readily conceive of changes to, the technical solutions recorded in the foregoing embodiments, or make equivalent substitutions for some of the technical features therein, within the technical scope disclosed by the present disclosure; and such modifications, changes, or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present disclosure, and shall all be covered by the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (29)

  1. A semantic annotation method, the method comprising:
    acquiring a three-dimensional scene model reconstructed from at least one scene image, and an initial two-dimensional semantic annotation result of each scene image;
    dividing triangular patches included in the three-dimensional scene model into a plurality of clusters;
    performing three-dimensional semantic annotation on each cluster obtained by the division based on the initial two-dimensional semantic annotation results, and determining an initial three-dimensional semantic annotation result of each cluster;
    determining a corrected two-dimensional semantic annotation result for each scene image based on the initial three-dimensional semantic annotation result of each cluster.
  2. The method according to claim 1, wherein the dividing the triangular patches included in the three-dimensional scene model into a plurality of clusters comprises:
    from the triangular patches included in the three-dimensional scene model, randomly selecting a preset number of triangular patches, and using the target vector of each selected triangular patch as the center vector of a cluster to be divided;
    determining the distance between the target vector of each of the triangular patches and each of the center vectors, and assigning the triangular patch to the cluster of the center vector with the smallest corresponding distance;
    determining a new center vector corresponding to each cluster, and, based on the new center vectors, returning to the cluster division step until a division cut-off condition is met.
  3. The method according to claim 2, wherein the determining a new center vector corresponding to each cluster comprises:
    based on the target vectors of the triangular patches included in each cluster, determining the average vector of the triangular patches, and using the average vector as the new center vector corresponding to the cluster.
  4. The method according to claim 2 or 3, wherein the target vector comprises a target normal vector and a target color vector, and the center vector comprises a center normal vector and a center color vector; the determining the distance between the target vector of each of the triangular patches and each of the center vectors comprises:
    for each of the triangular patches, based on the target normal vector of the triangular patch and the center normal vector in each of the center vectors, determining a first distance between the triangular patch and the cluster where each center vector is located; and, based on the target color vector of the triangular patch and the center color vector in each of the center vectors, determining a second distance between the triangular patch and the cluster where each center vector is located;
    based on the first distance and its corresponding weight, and the second distance and its corresponding weight, determining the distance between the target vector of the triangular patch and each of the center vectors.
  5. The method according to any one of claims 1 to 4, wherein the performing three-dimensional semantic annotation on each cluster obtained by the division based on the initial two-dimensional semantic annotation results, and determining an initial three-dimensional semantic annotation result of each cluster, comprises:
    for each triangular patch, based on a conversion relationship between a first coordinate system corresponding to the three-dimensional scene model and a second coordinate system corresponding to the scene images, determining a projection image area corresponding to the triangular patch in at least one scene image;
    based on the initial two-dimensional semantic annotation result of the projection image area corresponding to the triangular patch in the at least one scene image, determining an initial three-dimensional semantic annotation result of the triangular patch;
    for each cluster obtained by the division, based on the initial three-dimensional semantic annotation result of each triangular patch included in the cluster, determining the initial three-dimensional semantic annotation result of the cluster.
  6. The method according to claim 5, wherein the determining the initial three-dimensional semantic annotation result of the triangular patch based on the initial two-dimensional semantic annotation result corresponding to the projection image area of the triangular patch comprises:
    for each of the triangular patches, based on the initial two-dimensional semantic annotation result of the projection image area corresponding to the triangular patch in the at least one scene image, determining the number of labels corresponding to various initial two-dimensional semantic annotation results;
    based on the number of labels corresponding to the various initial two-dimensional semantic annotation results, determining the initial three-dimensional semantic annotation result of the triangular patch.
  7. The method according to claim 6, wherein the determining the initial three-dimensional semantic annotation result of the triangular patch based on the number of labels corresponding to the various initial two-dimensional semantic annotation results comprises:
    determining the initial two-dimensional semantic annotation result with the largest number of labels as the initial three-dimensional semantic annotation result of the triangular patch;
    the determining, for each cluster obtained by the division, the initial three-dimensional semantic annotation result of the cluster based on the initial three-dimensional semantic annotation result of each triangular patch included in the cluster comprises:
    based on the initial three-dimensional semantic annotation results of the triangular patches included in each cluster, determining the number of triangular patches corresponding to various initial three-dimensional semantic annotation results;
    determining the initial three-dimensional semantic annotation result corresponding to the largest number of triangular patches as the initial three-dimensional semantic annotation result of the cluster.
  8. The method according to claim 6, wherein the determining the initial three-dimensional semantic annotation result of the triangular patch based on the number of labels corresponding to the various initial two-dimensional semantic annotation results comprises:
    calculating the ratio of the number of labels corresponding to each initial two-dimensional semantic annotation result to the total number of labels, to determine a probability value that the triangular patch belongs to each initial two-dimensional semantic annotation result;
    based on the probability values of belonging to the various initial two-dimensional semantic annotation results, determining the initial three-dimensional semantic annotation result of the triangular patch;
    the determining, for each cluster obtained by the division, the initial three-dimensional semantic annotation result of the cluster based on the initial three-dimensional semantic annotation result of each triangular patch included in the cluster comprises:
    for each cluster obtained by the division, based on the probability value, corresponding to each triangular patch included in the cluster, of belonging to each initial two-dimensional semantic annotation result, and a weight pre-assigned to the triangular patch pointing to each initial two-dimensional semantic annotation result, determining a probability value that the cluster belongs to each initial two-dimensional semantic annotation result;
    based on the probability values that the cluster belongs to the various initial two-dimensional semantic annotation results, determining the initial three-dimensional semantic annotation result of the cluster.
  9. The method according to claim 8, wherein, when it is determined that the probability values of any cluster belonging to multiple initial two-dimensional semantic annotation results are greater than a preset threshold, the method further comprises:
    for the cluster, dividing the cluster into a plurality of sub-clusters;
    the performing three-dimensional semantic annotation on each cluster obtained by the division based on the initial two-dimensional semantic annotation results, and determining an initial three-dimensional semantic annotation result of each cluster, comprises:
    based on the initial two-dimensional semantic annotation results, performing three-dimensional semantic annotation on each sub-cluster obtained by the division, and determining an initial three-dimensional semantic annotation result of each sub-cluster;
    the determining a corrected two-dimensional semantic annotation result for each scene image based on the three-dimensional semantic annotation result of each cluster comprises:
    based on the initial three-dimensional semantic annotation result of each sub-cluster, determining the corrected two-dimensional semantic annotation result for each scene image.
  10. The method according to any one of claims 1 to 9, wherein the determining a corrected two-dimensional semantic annotation result for each scene image based on the initial three-dimensional semantic annotation result of each cluster comprises:
    acquiring corrected three-dimensional semantic annotation results, the corrected three-dimensional semantic annotation results being three-dimensional semantic annotation results obtained by manually correcting the initial three-dimensional semantic annotation results of at least some of the plurality of clusters;
    based on the corrected three-dimensional semantic annotation results, and the initial three-dimensional semantic annotation results of the other clusters that have not been manually corrected, determining the corrected two-dimensional semantic annotation result for each scene image.
  11. The method according to any one of claims 1 to 10, wherein the determining a corrected two-dimensional semantic annotation result for each scene image based on the initial three-dimensional semantic annotation result of each cluster comprises:
    based on the initial three-dimensional semantic annotation result of each cluster, determining a probability distribution error term corresponding to each cluster at the semantic annotation level; and, based on included-angle information between any two adjacent clusters that do not belong to the same three-dimensional semantic annotation result, determining an included-angle error term corresponding to each cluster at the geometric feature level;
    determining an energy function according to the probability distribution error term and the included-angle error term;
    determining, in the case where the value of the energy function is minimized, an updated three-dimensional semantic annotation result for each cluster;
    based on the updated three-dimensional semantic annotation result for each cluster, determining the corrected two-dimensional semantic annotation result for each scene image.
  12. The method according to any one of claims 1 to 11, wherein the determining a corrected two-dimensional semantic annotation result for each scene image based on the initial three-dimensional semantic annotation result of each cluster comprises:
    for each cluster, based on the conversion relationship between the first coordinate system corresponding to the three-dimensional scene model and the second coordinate system corresponding to the scene images, determining, from the at least one scene image, the scene image corresponding to each triangular patch included in the cluster;
    for each scene image, based on the initial three-dimensional semantic annotation result of the cluster to which the triangular patch corresponding to the scene image belongs, determining the corrected two-dimensional semantic annotation result for the scene image.
  13. The method according to any one of claims 1 to 12, wherein the initial two-dimensional semantic annotation results are obtained by semantically annotating each scene image with an initial semantic annotation neural network, and after the determining a corrected two-dimensional semantic annotation result for each scene image, the method further comprises:
    training the initial semantic annotation neural network based on the corrected two-dimensional semantic annotation result of each scene image to obtain a trained semantic annotation neural network.
  14. 一种语义标注的装置,所述装置包括:
    获取模块,配置为获取基于至少一张场景图像重建的三维场景模型、以及每张场景图像的初始二维语义标注结果;
    划分模块,配置为将所述三维场景模型包括的各个三角面片划分为多个簇;
    第一标注模块,配置为基于所述初始二维语义标注结果,对划分得到的每个簇进行三维语义标注,确定每个所述簇的初始三维语义标注结果;
    第二标注模块,配置为基于每个所述簇的所述初始三维语义标注结果,确定针对所述每张场景图像的修正后二维语义标注结果。
  15. 根据权利要求14所述的装置,其中,所述划分模块,配置为按照如下步骤将三维场景模型包括的各个三角面片划分为多个簇:
    从三维场景模型包括的各个三角面片中,随机选取预设数量个三角面片,将选取的每个三角面片的目标向量分别作为待划分的簇的中心向量;
    确定各个三角面片中每个三角面片的目标向量,分别与每个中心向量之间的距离,并将该三角面片划分至对应的距离最小的中心向量所在的簇;
    确定每个簇对应的新的中心向量,并基于该新的中心向量,返回执行划分簇的步骤,直到满足划分截止条件。
  16. 根据权利要求15所述的装置,其中,所述划分模块,配置为按照如下步骤确定每个簇对应的新的中心向量:
    基于每个簇包括的各个三角面片的目标向量,确定各个三角面片的平均向量,将平均向量作为该簇对应的新的中心向量。
  17. The apparatus according to claim 15 or 16, wherein the target vector comprises a target normal vector and a target color vector, and the center vector comprises a center normal vector and a center color vector; the division module is configured to determine the distance between the target vector of each triangular face and each center vector by:
    for each triangular face, determining a first distance between the triangular face and the cluster of each center vector based on the target normal vector of the triangular face and the center normal vector of that center vector, and determining a second distance between the triangular face and the cluster of each center vector based on the target color vector of the triangular face and the center color vector of that center vector;
    determining, based on the first distance and its corresponding weight and the second distance and its corresponding weight, the distance between the target vector of the triangular face and each center vector.
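The two-part distance of claim 17 can be sketched as a weighted sum. The Euclidean metric and the equal default weights are assumptions; the claims only require a first distance over normal vectors, a second over color vectors, and a weight for each:

```python
import numpy as np

def face_to_center_distance(face_normal, face_color, center_normal, center_color,
                            w_normal=0.5, w_color=0.5):
    """Weighted distance between a face's target vector and a cluster center vector."""
    d_normal = np.linalg.norm(face_normal - center_normal)  # first distance (normals)
    d_color = np.linalg.norm(face_color - center_color)     # second distance (colors)
    return w_normal * d_normal + w_color * d_color
```

Tuning the two weights trades off geometric smoothness against appearance when forming clusters.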
  18. The apparatus according to any one of claims 14 to 17, wherein the first annotation module is configured to perform, based on the initial two-dimensional semantic annotation results, three-dimensional semantic annotation on each cluster obtained by the division and determine the initial three-dimensional semantic annotation result of each cluster by:
    for each triangular face, determining, based on a conversion relationship between a first coordinate system corresponding to the three-dimensional scene model and a second coordinate system corresponding to the scene images, the projected image region of the triangular face in at least one scene image;
    determining the initial three-dimensional semantic annotation result of the triangular face based on the initial two-dimensional semantic annotation results of the projected image regions of the triangular face in the at least one scene image;
    for each cluster obtained by the division, determining the initial three-dimensional semantic annotation result of the cluster based on the initial three-dimensional semantic annotation result of each triangular face included in the cluster.
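The projection step of claim 18 relies on the conversion between the model's first coordinate system and the image's second coordinate system. A sketch under a standard pinhole-camera assumption (the claims do not fix the camera model); `world_to_cam` and `intrinsics` are illustrative names:

```python
import numpy as np

def project_face(vertices_world, world_to_cam, intrinsics):
    """Project a triangular face's 3D vertices into a scene image.

    vertices_world: (3, 3) vertex coordinates in the model's coordinate system.
    world_to_cam: 4x4 rigid transform into the camera's coordinate system.
    intrinsics: 3x3 camera matrix.
    Returns (3, 2) pixel coordinates bounding the projected image region.
    """
    v = np.hstack([vertices_world, np.ones((3, 1))])  # homogeneous coordinates
    cam = (world_to_cam @ v.T)[:3]                    # camera-frame coordinates
    pix = intrinsics @ cam                            # image-plane coordinates
    return (pix[:2] / pix[2]).T                       # perspective divide
```

The pixels inside the triangle spanned by these three points form the projected image region whose 2D labels vote for the face.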
  19. The apparatus according to claim 18, wherein the first annotation module is configured to determine the initial three-dimensional semantic annotation result of the triangular face based on the initial two-dimensional semantic annotation results of its projected image regions by:
    for each triangular face, determining, based on the initial two-dimensional semantic annotation results of the projected image regions of the triangular face in the at least one scene image, the number of labels corresponding to each kind of initial two-dimensional semantic annotation result;
    determining the initial three-dimensional semantic annotation result of the triangular face based on the numbers of labels corresponding to the respective initial two-dimensional semantic annotation results.
  20. The apparatus according to claim 19, wherein the first annotation module is configured to determine, for each cluster obtained by the division, the initial three-dimensional semantic annotation result of the cluster based on the initial three-dimensional semantic annotation result of each triangular face included in the cluster by:
    determining the initial two-dimensional semantic annotation result with the largest number of labels as the initial three-dimensional semantic annotation result of the triangular face;
    determining, based on the initial three-dimensional semantic annotation results of the triangular faces included in each cluster, the number of triangular faces corresponding to each kind of initial three-dimensional semantic annotation result;
    determining the initial three-dimensional semantic annotation result corresponding to the largest number of triangular faces as the initial three-dimensional semantic annotation result of the cluster.
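The label counting of claims 19 and 20 reduces to two majority votes, one over a face's projected 2D labels and one over a cluster's face labels. A minimal sketch; the label strings are illustrative:

```python
from collections import Counter

def face_label_by_vote(projected_labels):
    """2D label that appears most often among a face's projected image regions."""
    return Counter(projected_labels).most_common(1)[0][0]

def cluster_label_by_vote(face_labels):
    """3D label carried by the largest number of faces in a cluster."""
    return Counter(face_labels).most_common(1)[0][0]
```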
  21. The apparatus according to claim 19, wherein the first annotation module is configured to determine, for each cluster obtained by the division, the initial three-dimensional semantic annotation result of the cluster based on the initial three-dimensional semantic annotation result of each triangular face included in the cluster by:
    taking the ratio of the number of labels corresponding to each kind of initial two-dimensional semantic annotation result to the total number of labels to determine the probability value of the triangular face belonging to each kind of initial two-dimensional semantic annotation result;
    determining the initial three-dimensional semantic annotation result of the triangular face based on the probability values of belonging to the respective initial two-dimensional semantic annotation results;
    for each cluster obtained by the division, determining the probability value of the cluster belonging to each kind of initial two-dimensional semantic annotation result based on the probability value, for each triangular face included in the cluster, of belonging to each kind of initial two-dimensional semantic annotation result and a weight pre-assigned to the triangular face for each kind of initial two-dimensional semantic annotation result;
    determining the initial three-dimensional semantic annotation result of the cluster based on the probability values of the cluster belonging to the respective initial two-dimensional semantic annotation results.
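The probabilistic variant of claim 21 replaces the majority votes with label frequencies and a weighted per-face combination. A sketch; the final renormalisation and the single weight per face (rather than one per face and label) are simplifying assumptions:

```python
from collections import Counter

def face_label_probs(projected_labels):
    """Ratio of each label's count to the total label count (claim 21, first step)."""
    counts = Counter(projected_labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def cluster_label_probs(per_face_probs, per_face_weights):
    """Weighted combination of per-face probabilities into cluster-level probabilities."""
    combined = {}
    for probs, weight in zip(per_face_probs, per_face_weights):
        for label, p in probs.items():
            combined[label] = combined.get(label, 0.0) + weight * p
    total = sum(combined.values())
    return {label: v / total for label, v in combined.items()}
```

The cluster's initial 3D label is then the argmax of `cluster_label_probs`; when several labels exceed the preset threshold, claim 22's sub-cluster split applies.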
  22. The apparatus according to claim 21, wherein, in a case where the probability values of any cluster belonging to multiple initial two-dimensional semantic annotation results are determined to be greater than a preset threshold, the division module is further configured to:
    divide that cluster into a plurality of sub-clusters;
    the first annotation module is configured to perform, based on the initial two-dimensional semantic annotation results, three-dimensional semantic annotation on each cluster obtained by the division and determine the initial three-dimensional semantic annotation result of each cluster by:
    performing, based on the initial two-dimensional semantic annotation results, three-dimensional semantic annotation on each sub-cluster obtained by the division to determine the initial three-dimensional semantic annotation result of each sub-cluster;
    the second annotation module is configured to determine, based on the determined three-dimensional semantic annotation result of each cluster, the corrected two-dimensional semantic annotation result for each scene image by:
    determining, based on the initial three-dimensional semantic annotation result of each sub-cluster, the corrected two-dimensional semantic annotation result for each scene image.
  23. The apparatus according to any one of claims 14 to 22, wherein the second annotation module is configured to determine, based on the initial three-dimensional semantic annotation result of each cluster, the corrected two-dimensional semantic annotation result for each scene image by:
    acquiring corrected three-dimensional semantic annotation results, the corrected three-dimensional semantic annotation results being three-dimensional semantic annotation results obtained by manually correcting the initial three-dimensional semantic annotation results of at least some of the plurality of clusters;
    determining, based on the corrected three-dimensional semantic annotation results and the initial three-dimensional semantic annotation results of the other clusters that have not been manually corrected, the corrected two-dimensional semantic annotation result for each scene image.
  24. The apparatus according to any one of claims 14 to 23, wherein the second annotation module is configured to determine, based on the initial three-dimensional semantic annotation result of each cluster, the corrected two-dimensional semantic annotation result for each scene image by:
    determining, based on the initial three-dimensional semantic annotation result of each cluster, a probability-distribution error term for each cluster at the semantic-annotation level; and determining, based on angle information between any two adjacent clusters that do not share the same three-dimensional semantic annotation result, an angle error term for each cluster at the geometric-feature level;
    determining an energy function from the probability-distribution error term and the angle error term;
    determining, for each cluster, an updated three-dimensional semantic annotation result that minimizes the value of the energy function;
    determining, based on the updated three-dimensional semantic annotation result of each cluster, the corrected two-dimensional semantic annotation result for each scene image.
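Claim 24 (like claim 11) combines a probability-distribution error term with an angle error term into an energy function and takes the label assignment that minimizes it. A sketch using negative log probabilities for the data term, a cos² penalty on the angle between differently-labelled neighbours, and a greedy coordinate-descent minimiser; all three are illustrative modelling choices that the claims leave open:

```python
import math

def energy(labels, cluster_probs, adjacency, angle_weight=1.0):
    """Energy = probability-distribution term + angle term.

    labels: label chosen per cluster; cluster_probs: per-cluster dict of
    label probabilities; adjacency: list of (i, j, angle) for neighbouring
    clusters, with angle the dihedral angle between them in radians.
    """
    data_term = sum(-math.log(max(cluster_probs[i].get(lab, 1e-9), 1e-9))
                    for i, lab in enumerate(labels))
    # Penalise a label change across a smooth boundary (small angle).
    smooth_term = sum(angle_weight * math.cos(angle) ** 2
                      for i, j, angle in adjacency if labels[i] != labels[j])
    return data_term + smooth_term

def minimize_energy(cluster_probs, adjacency, n_sweeps=5):
    """Greedy per-cluster relabelling sweep; a simple stand-in for the
    exact minimisation, which the claims do not specify."""
    labels = [max(p, key=p.get) for p in cluster_probs]
    candidates = sorted({lab for p in cluster_probs for lab in p})
    for _ in range(n_sweeps):
        changed = False
        for i in range(len(labels)):
            best = min(candidates,
                       key=lambda lab: energy(labels[:i] + [lab] + labels[i + 1:],
                                              cluster_probs, adjacency))
            if best != labels[i]:
                labels[i] = best
                changed = True
        if not changed:
            break
    return labels
```

With this choice of terms, a weakly supported label on a near-coplanar neighbour of a confident cluster gets flipped, while a sharp crease (angle near 90°) lets differing labels stand.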
  25. The apparatus according to any one of claims 14 to 24, wherein the second annotation module is configured to determine, based on the initial three-dimensional semantic annotation result of each cluster, the corrected two-dimensional semantic annotation result for each scene image by:
    for each cluster, determining, from the at least one scene image and based on a conversion relationship between a first coordinate system corresponding to the three-dimensional scene model and a second coordinate system corresponding to the scene images, the scene image corresponding to each triangular face included in the cluster;
    for each scene image, determining the corrected two-dimensional semantic annotation result for that scene image based on the initial three-dimensional semantic annotation result of the cluster to which the triangular faces corresponding to that scene image belong.
  26. The apparatus according to any one of claims 14 to 25, wherein the initial two-dimensional semantic annotation results are obtained by semantically annotating each scene image with an initial semantic annotation neural network, and the apparatus further comprises:
    a training module configured to, after the corrected two-dimensional semantic annotation result for each scene image is determined, train the initial semantic annotation neural network based on the corrected two-dimensional semantic annotation result of each scene image to obtain a trained semantic annotation neural network.
  27. An electronic device, comprising a processor, a memory, and a bus, wherein the memory stores machine-readable instructions executable by the processor; when the electronic device runs, the processor communicates with the memory via the bus, and the machine-readable instructions, when executed by the processor, perform the steps of the semantic annotation method according to any one of claims 1 to 13.
  28. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when run by a processor, performs the steps of the semantic annotation method according to any one of claims 1 to 13.
  29. A computer program product, comprising a computer-readable storage medium storing program code, wherein instructions included in the program code, when run by a processor of a computer device, implement the steps of the method according to any one of claims 1 to 13.
PCT/CN2022/093649 2021-11-09 2022-05-18 Semantic annotation method and apparatus, electronic device, storage medium, and computer program product WO2023082588A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111320552.7 2021-11-09
CN202111320552.7A CN113902856B (zh) 2021-11-09 2021-11-09 Semantic annotation method and apparatus, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
WO2023082588A1 true WO2023082588A1 (zh) 2023-05-19

Family

ID=79193738

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/093649 WO2023082588A1 (zh) 2021-11-09 2022-05-18 Semantic annotation method and apparatus, electronic device, storage medium, and computer program product

Country Status (2)

Country Link
CN (1) CN113902856B (zh)
WO (1) WO2023082588A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113902856B (zh) * 2021-11-09 2023-08-25 浙江商汤科技开发有限公司 Semantic annotation method and apparatus, electronic device, and storage medium
CN114827711B (zh) * 2022-06-24 2022-09-20 如你所视(北京)科技有限公司 Image information display method and apparatus

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170116781A1 (en) * 2015-10-21 2017-04-27 Nokia Technologies Oy 3d scene rendering
CN113034566A (zh) * 2021-05-28 2021-06-25 湖北亿咖通科技有限公司 High-precision map construction method and apparatus, electronic device, and storage medium
CN113160420A (zh) * 2021-05-17 2021-07-23 上海商汤临港智能科技有限公司 Three-dimensional point cloud reconstruction method and apparatus, electronic device, and storage medium
CN113178014A (zh) * 2021-05-27 2021-07-27 网易(杭州)网络有限公司 Scene model rendering method and apparatus, electronic device, and storage medium
US20210272355A1 (en) * 2020-02-27 2021-09-02 Apple Inc. Semantic labeling of point cloud clusters
CN113362458A (zh) * 2020-12-23 2021-09-07 深圳大学 Three-dimensional model interpretation method simulating multi-view imaging, terminal, and storage medium
CN113902856A (zh) * 2021-11-09 2022-01-07 浙江商汤科技开发有限公司 Semantic annotation method and apparatus, electronic device, and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109242903B (zh) * 2018-09-07 2020-08-07 百度在线网络技术(北京)有限公司 Three-dimensional data generation method, apparatus, device, and storage medium
CN112085840B (zh) * 2020-09-17 2024-03-29 腾讯科技(深圳)有限公司 Semantic segmentation method and apparatus, device, and computer-readable storage medium


Also Published As

Publication number Publication date
CN113902856B (zh) 2023-08-25
CN113902856A (zh) 2022-01-07


Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22891398

Country of ref document: EP

Kind code of ref document: A1