CN112837372A - Data generation method and device, electronic equipment and storage medium - Google Patents

Data generation method and device, electronic equipment and storage medium

Info

Publication number
CN112837372A
Authority
CN
China
Prior art keywords
target
information
voxel
image
pose
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110231700.1A
Other languages
Chinese (zh)
Inventor
段永利
孙佳明
周晓巍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Shangtang Technology Development Co Ltd
Zhejiang Sensetime Technology Development Co Ltd
Original Assignee
Zhejiang Shangtang Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Shangtang Technology Development Co Ltd
Priority to CN202110231700.1A
Publication of CN112837372A
Priority to KR1020227014409A (KR20220125715A)
Priority to PCT/CN2021/105485 (WO2022183656A1)
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20212 Image combination
    • G06T2207/20221 Image fusion; Image merging
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30244 Camera pose

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Processing Or Creating Images (AREA)
  • Image Processing (AREA)
  • Saccharide Compounds (AREA)
  • Medicines That Contain Protein Lipid Enzymes And Other Medicines (AREA)
  • Peptides Or Proteins (AREA)

Abstract

The disclosure relates to a data generation method and apparatus, an electronic device, and a storage medium. The method comprises the following steps: determining a first pose of a target image in a target scene, and establishing a voxel information set of the target image according to the first pose, wherein the target image comprises at least one frame of image obtained by data acquisition of the target scene; performing semantic segmentation on the target image to obtain semantic information of the target image; fusing the semantic information into the voxel information set to obtain a fused voxel information set; and obtaining map data of the target scene according to the fusion voxel information set corresponding to the target image. The embodiment of the disclosure can improve the data comprehensiveness and quality of the obtained map data.

Description

Data generation method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer vision, and in particular, to a data generation method and apparatus, an electronic device, and a storage medium.
Background
With the rapid development of computer vision technology, scene reconstruction has become an increasingly important application in the field of computer vision. In the related art, scene reconstruction generally recovers only geometric information such as the shape and position of each target object in a scene.
However, when the reconstructed scene needs to interact with the real world, such as being applied to an indoor robot or Augmented Reality (AR) scene, it is often difficult to achieve a good interaction effect by only acquiring geometric information of a target object in the scene.
Disclosure of Invention
The present disclosure presents a data generation scheme.
According to an aspect of the present disclosure, there is provided a data generation method including:
determining a first pose of a target image in a target scene, and establishing a voxel information set of the target image according to the first pose, wherein the target image comprises at least one frame of image obtained by data acquisition of the target scene; performing semantic segmentation on the target image to obtain semantic information of the target image; fusing the semantic information into the voxel information set to obtain a fused voxel information set; and obtaining map data of the target scene according to the fusion voxel information set corresponding to the target image.
In a possible implementation manner, the fusing the semantic information into the voxel information set to obtain a fused voxel information set includes: and projecting at least one voxel in the voxel information set to the target image, and determining semantic probability distribution information of the at least one voxel after projection according to the semantic information of at least one pixel in the target image to obtain a fusion voxel information set.
In a possible implementation manner, the fusing the semantic information into the voxel information set to obtain a fused voxel information set further includes: acquiring acquisition data obtained by acquiring data of the target scene, and fusing the acquisition data into the voxel information set to obtain a fused voxel information set.
In a possible implementation manner, the fusing the acquired data into the voxel information set to obtain a fused voxel information set includes: and performing information fusion on at least one voxel in the voxel information set according to the fusion weight of the acquired data and the projection relation between the at least one voxel in the voxel information set and the acquired data to obtain a fusion voxel information set.
In a possible implementation manner, the obtaining map data of the target scene according to the fused voxel information set corresponding to the target image includes: and storing the fused voxel information set corresponding to at least one target image into the map data of the target scene.
In one possible implementation, the method further includes: performing loop detection on at least one frame of target image, and taking a target image in which a loop is detected as a loop image; determining a second pose of the loop image, wherein the accuracy of the second pose is higher than the accuracy of the first pose of the target image corresponding to the loop image; and updating the map data of the target scene according to the second pose of the loop image.
In one possible implementation manner, the updating the map data of the target scene according to the second pose of the loop image includes: acquiring a fused voxel information set corresponding to the loop image in a first position as a first target set; based on the first pose and the second pose corresponding to the loop image, re-fusing the fused information in the first target set to obtain a second target set; and updating the map data of the target scene according to the second target set.
In a possible implementation manner, the re-fusing the information fused in the first target set based on the first pose and the second pose corresponding to the loop image to obtain a second target set includes: according to the first pose corresponding to the loop image, fused information in the first target set is subjected to de-fusion to obtain a third target set; and according to the second pose corresponding to the loop image, fusing the information of the loop image into the third target set to obtain a second target set.
According to an aspect of the present disclosure, there is provided a data generating apparatus including:
the system comprises a voxel information set establishing module, a processing module and a display module, wherein the voxel information set establishing module is used for determining a first pose of a target image in a target scene and establishing a voxel information set of the target image according to the first pose, and the target image comprises at least one frame of image obtained by data acquisition of the target scene; the semantic segmentation module is used for performing semantic segmentation on the target image to obtain semantic information of the target image; the fusion module is used for fusing the semantic information into the voxel information set to obtain a fused voxel information set; and the data generation module is used for obtaining the map data of the target scene according to the fusion voxel information set corresponding to the target image.
In one possible implementation, the fusion module is configured to: and projecting at least one voxel in the voxel information set to the target image, and determining semantic probability distribution information of the at least one voxel after projection according to the semantic information of at least one pixel in the target image to obtain a fusion voxel information set.
In one possible implementation, the fusion module is further configured to: acquiring acquisition data obtained by acquiring data of the target scene, and fusing the acquisition data into the voxel information set to obtain a fused voxel information set.
In one possible implementation, the fusion module is configured to: and performing information fusion on at least one voxel in the voxel information set according to the fusion weight of the acquired data and the projection relation between the at least one voxel in the voxel information set and the acquired data to obtain a fusion voxel information set.
In one possible implementation, the data generating module is configured to: and storing the fused voxel information set corresponding to at least one target image into the map data of the target scene.
In one possible implementation, the apparatus is further configured to: performing loop detection on at least one frame of target image, and taking a target image in which a loop is detected as a loop image; determining a second pose of the loop image, wherein the accuracy of the second pose is higher than the accuracy of the first pose of the target image corresponding to the loop image; and updating the map data of the target scene according to the second pose of the loop image.
In one possible implementation, the apparatus is further configured to: acquiring a fused voxel information set corresponding to the loop image in a first position as a first target set; based on the first pose and the second pose corresponding to the loop image, re-fusing the fused information in the first target set to obtain a second target set; and updating the map data of the target scene according to the second target set.
In one possible implementation, the apparatus is further configured to: according to the first pose corresponding to the loop image, fused information in the first target set is subjected to de-fusion to obtain a third target set; and according to the second pose corresponding to the loop image, fusing the information of the loop image into the third target set to obtain a second target set.
According to an aspect of the present disclosure, there is provided an electronic device including:
a processor; a memory for storing processor-executable instructions; wherein the processor is configured to perform the above-described data generation method.
According to an aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described data generation method.
In the embodiment of the disclosure, a first pose of a target image in a target scene is determined, a voxel information set of the target image is established according to the first pose, semantic information obtained by performing semantic segmentation on the target image is fused into the voxel information set, a fused voxel information set is obtained, and map data of the target scene is obtained according to the fused voxel information set. Through the process, the semantic information obtained by segmentation in the target image can be fused by establishing the voxel information set, so that the semantic information in the target scene is continuously fused, the map data containing the continuously fused semantic information is obtained, and the data comprehensiveness and the quality of the obtained map data are effectively improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure. Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.
Fig. 1 shows a flow diagram of a data generation method according to an embodiment of the present disclosure.
Fig. 2 illustrates a block diagram of a data generation apparatus according to an embodiment of the present disclosure.
Fig. 3 shows a schematic diagram of an application example according to the present disclosure.
Fig. 4 shows a schematic diagram of an application example according to the present disclosure.
Fig. 5 shows a block diagram of an electronic device in accordance with an embodiment of the disclosure.
Fig. 6 illustrates a block diagram of an electronic device in accordance with an embodiment of the disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
Fig. 1 shows a flowchart of a data generation method according to an embodiment of the present disclosure, which may be applied to a data generation apparatus, where the apparatus may be a terminal device, a server, or other processing device. The terminal device may be a User Equipment (UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like. In one example, the data generation method can be applied to a cloud server or a local server; the cloud server may be a public cloud server or a private cloud server, and can be flexibly selected according to the actual situation.
In some possible implementations, the data generation method may also be implemented by a processor calling computer readable instructions stored in a memory.
As shown in fig. 1, in one possible implementation manner, the data generation method may include:
step S11, determining a first pose of a target image in a target scene, and establishing a voxel information set of the target image according to the first pose, wherein the target image comprises at least one frame of image obtained by data acquisition of the target scene.
And step S12, performing semantic segmentation on the target image to obtain semantic information of the target image.
And step S13, fusing the semantic information into a voxel information set to obtain a fused voxel information set.
And step S14, obtaining map data of the target scene according to the fused voxel information set corresponding to the target image.
The target scene may be any scene having a reconstruction requirement, and the position, the range, and the like of the target scene may be flexibly selected according to an actual situation, which is not limited in the embodiments of the present disclosure and is not limited to the following embodiments of the present disclosure. In one possible implementation, the target scene may be an indoor scene, in one possible implementation, the target scene may be an outdoor scene, and in one possible implementation, the target scene may include both an indoor scene and an outdoor scene.
The target image may be at least one frame of image resulting from data acquisition of the target scene. The number of images included in the target image is not limited in the embodiment of the present disclosure, and may be flexibly determined according to the actual situation of data acquisition of the target scene. In a possible implementation manner, each frame of image obtained by acquiring data of a target scene may be used as a target image, in a possible implementation manner, one or more frames of images obtained by acquiring data of the target scene may also be selected as the target image, the selection manner may be flexibly selected according to an actual situation, in an example, the selection may be random, and in an example, the target image may also be selected by sampling the acquired images at a certain frequency.
The data acquisition mode of the target scene is not limited in the embodiments of the present disclosure, and may be flexibly determined according to actual situations, and is not limited to the following embodiments of the present disclosure. In one possible implementation, the target scene may be acquired by an image acquisition device such as a video camera or a still camera. In a possible implementation manner, the target scene may also be acquired by other apparatuses including an image acquisition device; in one example, the target scene may be acquired by an integrated device that includes an Inertial Measurement Unit (IMU) and a visual sensor as the image acquisition device, such as a smartphone with a camera. In some possible implementations, in the case of data acquisition of a target scene by an image acquisition device or other apparatus including an image acquisition device, the image acquisition device may also have a function of acquiring depth information; for example, in one example, the image acquisition device may include a Time of Flight (TOF) camera.
Along with different data acquisition modes, the acquired data can be flexibly changed and can be flexibly determined according to actual conditions, and the method is not limited to the following disclosure embodiments. In some possible implementations, the collected data may include at least one frame of the target image as described in the above disclosed embodiments; in some possible implementations, the acquisition data may also include other data, such as in one example, in the case of data acquisition of the target scene by an integrated device including an inertial measurement unit, the acquisition data may also include inertial measurement data (IMU data), in one example, in the case of the image acquisition device including a TOF camera, the acquisition data may also include depth information of the target image, and so forth.
Based on the above disclosed embodiments, in one possible implementation, step S11 may determine the first pose of the target image based on the target image in the target scene. The pose of an image can be understood as the pose of the device that captured the image; based on the pose of the image, the coordinate correspondence between the image and the world coordinate system can be determined, so that the position in space of each pixel point in the image can be determined. In one possible implementation, the first pose may be the pose of the image capture device at the time the target image was captured.
The determination manner of the first pose is not limited in the embodiments of the present disclosure, and any method for determining the pose of an image may be used as the method for determining the first pose in step S11, and is not limited to the following embodiments. In one possible implementation, the first pose of the target image may be determined by performing image pose estimation based only on the target image in the target scene; in a possible implementation manner, under the condition that the acquired data includes IMU data, a Simultaneous Localization and Mapping (SLAM) system may also be utilized to perform pose estimation on the target image by using a Visual-Inertial Odometry (VIO) method to obtain a first pose with six degrees of freedom; in some possible implementations, in a case that the acquisition data includes depth information, the first pose of the target image may also be determined based on pose estimation aided by the depth information.
In one possible implementation, step S11 may also establish a set of voxel information of the target image according to the first pose of the target image. A voxel (short for volume pixel, or volume element) is the smallest unit of digital data in a three-dimensional space partition; in a possible implementation, voxels may be used in the embodiments of the present disclosure as a representation of each position in the target scene.
Since the target image is an image in the target scene, it may represent scene content in the target scene under a certain field of view or views. Therefore, in a possible implementation, the target image may correspond to at least a portion of the positions in the target scene, and therefore, the pixel points in the target image may also correspond to at least a portion of the voxels representing the target scene. In one possible implementation, therefore, a set of voxel information of the target image may be established according to the first pose of the target image, and the set of voxel information may include relevant voxel information of at least some voxels in the target scene, so as to achieve correspondence between the target image and at least some voxels representing the target scene.
The specific information content of the relevant voxel information included in the voxel information set can be flexibly determined according to the actual situation, and is not limited to the following disclosed embodiments. In some possible implementations, the set of voxel information may include: the fusion weight W(v) of the voxel v in the fusion process, image information of the target image such as a color value C(v), and depth information of the target image such as a truncated signed distance function D(v), etc. In some possible implementations, the image information, the depth information, and the like of the target image may also be represented by other functions; in some possible implementations, the voxel information set may further include related information required by other map data, and may be flexibly expanded according to practical situations, which is not listed one by one. Because the voxel information set in the embodiments disclosed in the present application includes information about voxels, and voxels, as three-dimensional data, require depth to determine that information, the following embodiments all assume that the acquired data and the voxel information set include depth information, and that the depth information is fused in the process of fusing the voxel information set.
In step S11, the method of creating the voxel information set of the target image according to the first pose is not limited in the embodiments of the present disclosure, and may be flexibly determined according to actual situations, and is not limited to the following embodiments of the present disclosure. In a possible implementation manner, a voxel information set of each target image may be established by a voxel hashing method, and each voxel information set is stored and searched by using a hash table. In some possible implementation manners, the voxel information sets may also be established, stored, and searched through other data structures; which data structures are specifically selected, and how to establish each voxel information set based on these structures, may be flexibly chosen according to actual situations, and are not listed here.
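For illustration only, the following is a minimal Python sketch of a hash-indexed voxel store in the spirit of the voxel hashing approach mentioned above. It is not the implementation of the present disclosure; the per-voxel field names (tsdf, weight, color, semantics), the voxel size, and the NUM_CLASSES constant are assumptions chosen for the example.

import numpy as np

NUM_CLASSES = 10  # assumed number of semantic categories

class VoxelMap:
    """Sparse voxel grid indexed by a hash table (a Python dict),
    illustrating hash-based storage and lookup of voxel information."""

    def __init__(self, voxel_size=0.05):
        self.voxel_size = voxel_size
        self.voxels = {}  # integer grid coordinates -> per-voxel record

    def key(self, point):
        # Quantize a 3D world-coordinate point to integer grid
        # coordinates, which serve as the hash key.
        return tuple(np.floor(np.asarray(point) / self.voxel_size).astype(int))

    def get_or_create(self, point):
        k = self.key(point)
        if k not in self.voxels:
            self.voxels[k] = {
                "tsdf": 1.0,                # truncated signed distance D(v)
                "weight": 0.0,              # fusion weight W(v)
                "color": np.zeros(3),       # color value C(v)
                # uniform prior over semantic categories
                "semantics": np.full(NUM_CLASSES, 1.0 / NUM_CLASSES),
            }
        return self.voxels[k]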
In one possible implementation, the semantic information of the target image may also be obtained by performing semantic segmentation on the target image in step S12. The implementation sequence of step S11 and step S12 is not limited in the embodiment of the present disclosure, and step S11 and step S12 may be performed simultaneously, or may be performed sequentially according to a preset sequence, and a specific selection of which execution sequence may be flexibly determined according to an actual situation.
In step S12, the semantic segmentation of the target image may be to segment objects of different types contained in the target image, and the semantic information of the target image may be the type information of the object corresponding to at least one pixel point in the target image. For example, in an example, when the target scene is an indoor scene, the target image may be an image in the indoor scene, and the target image may include a wall object, a floor object, a ceiling object, a table object, a chair object, and the like, and the target image is semantically segmented, so that the objects included in the target image, such as the wall, the floor, the ceiling, the table, the chair, and the like, may be segmented, that is, it may be determined which pixel points in the target image correspond to the category information of the wall, which pixel points correspond to the category information of the floor, and the like. In some possible implementations, the category corresponding to the semantic information of the target image may also change according to the difference of the target scene, for example, other indoor object categories such as a cabinet or a window may also be included, and other outdoor object categories such as a sky, a tree, or a road may also be included.
The method for performing semantic segmentation on the target image is not limited in the embodiments of the present disclosure, and any method that can perform segmentation on the target image can be used as an implementation method of semantic segmentation in the embodiments of the present disclosure, and is not limited to the following embodiments of the present disclosure. In a possible implementation manner, the target image can be processed through a segmentation algorithm to obtain semantic information of the target image; in a possible implementation manner, the target image may also be input into the image segmentation neural network to obtain semantic information output by the image segmentation neural network.
In step S12, the semantic segmentation may process one target image at a time or multiple target images at a time, and the number of target images processed per segmentation may be flexibly determined according to the actual situation, which is not limited in the embodiments of the present disclosure.
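As a concrete illustration of the neural-network route mentioned above, the following Python sketch runs an off-the-shelf segmentation network and returns per-pixel class probabilities; the choice of DeepLabV3 with default torchvision weights is an assumption for the example, not a network prescribed by the disclosure.

import torch
from torchvision import transforms
from torchvision.models.segmentation import deeplabv3_resnet50
from PIL import Image

# Off-the-shelf network standing in for the image segmentation neural
# network; the specific model and weights are assumptions.
model = deeplabv3_resnet50(weights="DEFAULT").eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def segment(image: Image.Image) -> torch.Tensor:
    """Return per-pixel semantic class probabilities, shape (C, H, W)."""
    batch = preprocess(image).unsqueeze(0)
    with torch.no_grad():
        logits = model(batch)["out"]        # (1, C, H, W)
    return torch.softmax(logits, dim=1)[0]  # per-pixel class distribution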
After the voxel information set of the target image is established and the semantic information of the target image is obtained, the semantic information may be fused to the corresponding voxel information set through step S13 to obtain a fused voxel information set. As described in the foregoing embodiments, information content that can be included in the voxel information set can be flexibly selected according to actual conditions, so that, in addition to semantic information, other information can be fused into the fused voxel information set, and which information is specifically fused can be flexibly selected according to actual conditions. By fusing semantic information to the voxel information set, the voxel information set can be further fused with semantic information on the basis of set information including the shape and position of each object to be expressed, and the object scene can be expressed with higher quality.
The manner of fusing the semantic information to the voxel information set in step S13 is not limited in the embodiments of the present disclosure, and may be flexibly selected according to actual situations, which is described in the following disclosure embodiments without being expanded first.
After the fused voxel information set is obtained, the map data of the target scene can be obtained according to the fused voxel information set corresponding to the target image. The map data of the target scene may be data obtained by summarizing target images of frames in the target scene, and the data content included in the map data may be flexibly determined according to actual situations. Therefore, the implementation of step S14 can be flexibly decided according to the actual data requirements of the map data. The implementation of step S14 can be seen in detail in the following disclosure embodiments, which are not first expanded here.
In the embodiment of the disclosure, a first pose of a target image in a target scene is determined, a voxel information set of the target image is established according to the first pose, semantic information obtained by performing semantic segmentation on the target image is fused into the voxel information set, a fused voxel information set is obtained, and map data of the target scene is obtained according to the fused voxel information set. Through the process, the semantic information obtained by segmentation in the target image can be fused by establishing the voxel information set, and as the voxels in the voxel information set can continuously express the position of the target scene in the three-dimensional space, the semantic information is fused based on the voxel information set, so that the semantic information in the target scene can be continuously fused, the map data containing the continuously fused semantic information is obtained, and the data comprehensiveness and quality of the obtained map data are effectively improved.
As described in the above embodiments, the implementation manner of step S13 can be flexibly determined according to practical situations. In one possible implementation, step S13 includes:
and projecting at least one voxel in the voxel information set to a target image, and determining semantic probability distribution information of the projected at least one voxel according to the semantic information of at least one pixel in the target image to obtain a fused voxel information set.
At least one voxel in the voxel information set is projected to the target image; the projection mode, angle, and the like can be flexibly selected according to the actual situation, which is not limited in the embodiments of the present disclosure. Specifically, which voxels in the voxel information set are projected to the target image can be flexibly selected according to actual conditions. In a possible implementation manner, each voxel contained in the voxel information set can be projected to the target image; in one possible implementation, voxels may be randomly selected or sampled at a certain ratio and then projected onto the target image.
After projecting the at least one voxel to the target image, semantic probability distribution information of the at least one voxel after projection may be determined from semantic information of at least one pixel in the target image. The semantic probability distribution information may be probability distribution conditions of semantic information of a plurality of voxels, how to determine the semantic probability distribution information of at least one voxel after projection according to the semantic information of at least one pixel in the target image, and the implementation form of the semantic probability distribution information may be flexibly determined according to actual conditions. In a possible implementation manner, at least one pixel in the target image may be in one-to-one correspondence with the projected voxels, and after semantic information is fused according to the correspondence, semantic probability distribution information of at least one voxel is determined based on a fusion result of the semantic information of the plurality of voxels. In a possible implementation manner, the probability distribution condition of the semantic information in the target image may also be obtained according to the semantic information of at least one pixel in the target image, and the probability distribution condition of the semantic information in the target image is fused with the semantic probability distribution information of the projected voxel, and how to implement the method may be flexibly selected according to the actual situation. In one example, at least one voxel in the voxel information set is projected to a target image, semantic probability distribution information of the projected at least one voxel is determined according to semantic information of at least one pixel in the target image, and a manner of obtaining a fused voxel information set can be represented by the following formula (1):
P(l_i | I_{1,...,k}) = (1/Z) · P(O_{u(v,k)} = l_i | I_k) · P(l_i | I_{1,...,k-1})   (1)

wherein I_{1,...,k} are the target images of each frame, I_k is the current target image, P(l_i | I_{1,...,k}) is the semantic probability distribution information that the voxel belongs to the i-th category after fusing the current target image, Z is a normalization factor, P(l_i | I_{1,...,k-1}) is the semantic probability distribution information that the voxel belongs to the i-th category before fusing the semantic information of the current target image (i.e., after fusing the semantic information of the I_1 to I_{k-1} frame target images), P(O_{u(v,k)} = l_i | I_k) is the probability distribution of the semantic information obtained by performing semantic segmentation on the current target image, and O_{u(v,k)} is the projection result of projecting the voxel v onto the current target image.
As can be seen from the above formula (1), in one example, the semantic segmentation information P(O_{u(v,k)} = l_i | I_k) obtained by performing semantic segmentation on the current target image can be multiplied by the semantic probability distribution information of the voxel corresponding to the k-1 frames of target images before the current target image and normalized, so that the semantic information of the current target image is fused into the voxel information set to obtain a fused voxel information set.
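A minimal numerical sketch of formula (1) follows, assuming the projection step has already associated the voxel with a pixel and its segmentation distribution; the three-category example values are invented for illustration.

import numpy as np

def fuse_semantics(voxel_probs, pixel_probs):
    """Formula (1): multiply the stored distribution P(l_i | I_1..k-1)
    by the segmentation distribution P(O_u(v,k) = l_i | I_k) of the
    pixel the voxel projects to, then renormalize (the 1/Z factor)."""
    fused = voxel_probs * pixel_probs
    return fused / fused.sum()

# Example: a voxel currently believed to be "wall" observes a pixel that
# the segmentation network labels as "floor" with high confidence.
prior = np.array([0.7, 0.2, 0.1])        # e.g. wall / floor / other
observation = np.array([0.1, 0.8, 0.1])  # from semantic segmentation
print(fuse_semantics(prior, observation))  # posterior shifts toward "floor"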
Through the process, the semantic information of the two-dimensional target image can be fused into the two-dimensional projection of the three-dimensional voxel by projecting the voxel in the voxel information set to the target image, so that the probability distribution condition of the semantic information contained in the fused voxel information set can be continuously expressed with high quality in the target scene, and the map data of the target scene obtained based on the fused voxel information set can be more widely applied to a plurality of scenes needing the semantic information.
As described in the above embodiments, in the fused voxel information set, in addition to semantic information, other information may be fused. Therefore, in one possible implementation, step S13 may further include:
acquiring acquisition data obtained by acquiring data of a target scene, and fusing the acquisition data into a voxel information set to obtain a fused voxel information set.
The implementation manner of the data collection can refer to the above disclosed embodiments. As described in the foregoing embodiments, the data content included in the collected data may be flexible according to the data collection mode, such as including depth information, IMU data, or other information, for details, see the foregoing embodiments, and are not described herein again.
In a possible implementation manner, all or part of the collected data may be fused into the voxel information set according to the actual situation of the collected data, and specifically, which collected data is selected for fusion is not limited in the embodiment of the present disclosure, and may be flexibly selected according to the actual situation, and is not limited in the following disclosure embodiments. With the difference of the collected data, the fusion mode can also be changed flexibly, and the detailed description is given in the following disclosure embodiments, which are not first developed.
The comprehensiveness of data in the combination of the fused voxel information can be further increased through the process, and the map data of the target scene obtained based on the fused voxel information set can be more comprehensive and complete and has higher quality.
In one possible implementation, the acquired data includes depth information of the target image; in some possible implementations, the acquired data may further include color information and the like. In this case, fusing the acquired data into the voxel information set to obtain a fused voxel information set may include:
and performing information fusion on at least one voxel in the voxel information set according to the fusion weight of the acquired data and the projection relation between the acquired data and at least one voxel in the voxel information set to obtain a fusion voxel information set.
In one possible implementation, the depth information may be information acquired by a TOF camera as described in the above disclosed embodiments, and in one example, the depth information may be represented in the form of a truncated signed distance function D(v); in one possible implementation, the color information may be RGB color information obtained from the acquired target image itself, and in one example, the color information may be represented in the form of a color value C(v), or the like. It should be noted that the above-mentioned acquisition manner and representation form of the depth information and the color information are only exemplary implementation forms; in practical applications, other implementation forms may be flexibly selected according to practical situations, which is not limited in this disclosure.
The manner of fusing the acquired data may be flexibly changed according to the different forms of the acquired data. As can be seen from the above disclosed embodiments, in the case that the acquired data includes depth information and/or color information, the acquired data may be fused according to the fusion weight of the acquired data and the projection relationship between at least one voxel in the voxel information set and the acquired data. Specifically, how to perform fusion based on the fusion weight and the projection relationship of the voxel can be flexibly determined; in one example, the fusion process of the depth information can be represented by the following formulas (2) and (3):
D'(v) = (W(v) · D(v) + w_i(v) · d_i(v)) / (W(v) + w_i(v))   (2)

W'(v) = W(v) + w_i(v)   (3)

wherein D'(v) is the depth information of the voxel v after fusion, D(v) is the depth information of the voxel v before fusion, W(v) is the weight of the voxel v determined according to the information in the voxel information set, w_i(v) is the fusion weight, d_i(v) is the distance between the back-projection point corresponding to the voxel v in the depth information and the voxel v, and W'(v) is the updated weight after the voxel v is fused.
As can be seen from equations (2) and (3), in one example, the depth information can be fused into the set of voxel information based on the fusion weight w_i(v) of the depth information and the projection relation d_i(v) between the voxel v and the depth information. The fusion weight w_i(v) of the depth information can be set flexibly according to the actual situation, and the projection relation d_i(v) between the voxel and the depth information can be flexibly determined according to the actual situation of the voxel and the depth information; in one example, d_i(v) can be calculated by the following formulas (4) and (5):
d_i(v) = max(-1, min(1, η / μ))   (4)

η = d_c(v) - X(v)   (5)

wherein d_c(v) is the distance from the voxel v to the camera center, X(v) is the depth of the corresponding pixel obtained after the voxel v is projected into the depth information, and μ is a preset truncation parameter.
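The following Python sketch implements formulas (2) through (5) for a single voxel. It assumes d_c(v), X(v), μ, and the fusion weight are given externally; the example values are invented, and the clamp in formula (4) follows the standard truncated-signed-distance convention.

import numpy as np

def truncated_distance(d_c, X, mu):
    """Formulas (4) and (5): truncated signed distance d_i(v) from the
    voxel-to-camera distance d_c(v) and the measured depth X(v)."""
    eta = d_c - X                        # (5)
    return np.clip(eta / mu, -1.0, 1.0)  # (4)

def fuse_depth(D, W, d_i, w_i):
    """Formulas (2) and (3): weighted running average of the truncated
    signed distance, plus the fusion-weight update."""
    D_new = (W * D + w_i * d_i) / (W + w_i)  # (2)
    W_new = W + w_i                          # (3)
    return D_new, W_new

# Example: fuse one depth observation into a voxel.
d_i = truncated_distance(d_c=1.45, X=1.50, mu=0.1)  # invented values
print(fuse_depth(D=0.2, W=3.0, d_i=d_i, w_i=1.0))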
As can be seen from the above disclosed embodiments, in one possible implementation, depth information may be fused into a voxel information set based on the fusion weight of the depth information and the projection relationship between a voxel and the depth information. In some possible implementations, in the case that the acquired data includes color information, the color information may be fused by analogy with the depth information, except that the truncated signed distance function D(v) corresponding to the depth information is replaced by the color value C(v), and the projection relationship between the voxel and the depth information is replaced by the projection relationship between the voxel and the target image, and so on. In the case that the collected data includes other data forms, the fusion mode thereof can be flexibly replaced and expanded by referring to the above-mentioned disclosed embodiments, which are not described in detail herein.
By the process, under the condition that the acquired data comprises data in various forms, different information can be flexibly fused according to different forms of the acquired data, so that the data integrity of the fused voxel information set is improved, the fusion efficiency is improved, and the comprehensiveness of the obtained map data of the target scene and the data generation efficiency are improved.
In one possible implementation, step S14 may include:
and storing the fused voxel information set corresponding to at least one target image into the map data of the target scene.
As described in the foregoing embodiments, the map data of the target scene may be data obtained by summarizing the target images in the target scene, and the data content included in the map data may be determined flexibly according to actual situations, so that, in one possible implementation, the fused voxel information sets corresponding to different target images may be combined to form one data unit of fused voxel information as the map data of the target scene. The method includes the steps that fusion voxel information sets corresponding to target images are stored in map data, the fusion voxel information sets corresponding to the target images can be flexibly determined according to actual conditions, in a possible implementation mode, the fusion voxel information sets corresponding to each frame of target images can be stored in the map data of a target scene, in some possible implementation modes, the target images can be selected or screened to select fusion voxel information sets corresponding to partial target images to be stored in the map data, and the like, and how to achieve the fusion voxel information sets can be flexibly determined according to the actual conditions.
In some possible implementations, as described in the foregoing embodiments, the voxel information sets of the target image may be established by a voxel hashing method, and in this case, the map data may store, in addition to the fused voxel information sets, hash tables for searching the fused voxel information sets, and the like.
By storing the fused voxel information set corresponding to at least one target image into the map data of the target scene, the data integrity and comprehensiveness of the map data can be effectively improved, and the map data can continuously represent the target scene under the condition that the map data contains a plurality of continuous target images.
In a possible implementation manner, the data generation method provided in the embodiment of the present disclosure may further include:
performing loop detection on at least one frame of target image, and taking a target image in which a loop is detected as a loop image;
determining a second pose of the loop image, wherein the accuracy of the second pose is higher than the accuracy of the first pose of the target image corresponding to the loop image;
and updating the map data of the target scene according to the second pose of the loop image.
The loop detection may be to detect whether there are images corresponding to the same scene in the collected multiple frames of target images, a specific detection mode of the loop detection is not limited in the embodiment of the present disclosure, and any mode for performing the loop detection in the field of visual SLAM may be used as an implementation mode of the loop detection in the embodiment of the present disclosure, and is not limited to the following disclosed embodiments. In one possible implementation, loop detection may be implemented by building a bag-of-words model.
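One possible sketch of such a bag-of-words loop detector follows; the use of ORB features, a K-means visual vocabulary, and a cosine-similarity threshold are all implementation choices assumed for the example, not fixed by the disclosure.

import cv2
import numpy as np
from sklearn.cluster import KMeans

orb = cv2.ORB_create()

def orb_descriptors(gray_image):
    # Local ORB descriptors; cast to float32 so K-means can assign words.
    _, desc = orb.detectAndCompute(gray_image, None)
    return desc.astype(np.float32)

def word_histogram(desc, vocabulary: KMeans):
    # Describe a frame by its normalized histogram of visual words.
    words = vocabulary.predict(desc)
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(np.float32)
    return hist / np.linalg.norm(hist)

def detect_loop(new_hist, past_hists, threshold=0.9):
    """Return the index of an earlier frame seeing the same scene, or None."""
    for i, hist in enumerate(past_hists):
        if float(new_hist @ hist) > threshold:  # cosine similarity
            return i
    return None

# The vocabulary would be trained once on descriptors pooled from many
# frames, e.g. KMeans(n_clusters=500).fit(np.vstack(all_descriptors)).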
After the loop is detected, the target image with the detected loop can be used as a loop image, and the pose of the loop image can be optimized according to the loop detection result to determine the second pose of the loop image. The mode of optimizing the pose of the loop image is not limited in the embodiment of the present disclosure, and any mode of correcting and optimizing the pose of the loop image based on the detection result in the loop detection may be used as the determination mode of the second pose, which is not limited in the embodiment of the present disclosure. Because the loop image can be used for correcting and optimizing the first pose corresponding to the target image in the target scene, the obtained second pose has higher accuracy compared with the first pose determined by the target image corresponding to the loop image, and therefore the map data updated based on the second pose has higher precision.
After determining the second pose of the loop image, the map data of the target scene may be updated according to the second pose of the loop image. The updating mode can be flexibly selected in the embodiment of the disclosure, and the updating mode is detailed in each disclosed embodiment which is not expanded at first.
Through the process, the accumulated error in the map data can be effectively reduced, and the precision of the map data is improved.
In one possible implementation, updating the map data of the target scene according to the second pose of the loop image includes:
acquiring a fused voxel information set corresponding to the loop image in the first pose as a first target set;
based on the first pose and the second pose corresponding to the loopback image, re-fusing the fused information in the first target set to obtain a second target set;
and updating the map data of the target scene according to the second target set.
In the embodiment of the present disclosure, the fused voxel information set corresponding to the loop image in the first pose is a fused voxel information set established according to the first pose determined by the loop image and fused with information of the loop image.
As described in the above disclosed embodiments, in the case that a loop image is detected, the pose of the loop image may be optimized and the second pose of the loop image determined; in this case, the first target set corresponding to the loop image in the first pose may contain information that deviates from the data of the real target scene. Therefore, in a possible implementation manner, information in the first target set may be re-fused according to the first pose and the second pose corresponding to the loop image, so as to obtain the second target set.
As described in the foregoing disclosure embodiments, the fused voxel information set may fuse multiple types of information, such as semantic information, depth information, or color information, and the like, so that the information fused in the first target set may further include one or more of semantic information and color information on the basis that the information fused in the first target set includes the depth information, and which information is specifically included may be flexibly determined according to actual situations. The manner of re-fusion may also be flexibly changed according to the type of information fused in the first target set, which is described in the following disclosure embodiments, and is not first expanded herein.
After the second target set is obtained, the map data of the target scene may be updated according to the second target set, the updating mode may be flexibly determined according to the actual situation, and in a possible implementation mode, the data of the first target set may be replaced by the data of the second target set to implement the updating of the map data.
The method comprises the steps of obtaining a fused voxel information set corresponding to a loop image in a first pose as a first target set, re-fusing information fused in the first target set based on the first pose and a second pose corresponding to the loop image to obtain a second target set, and updating map data of a target scene according to the second target set.
In a possible implementation manner, based on the first pose and the second pose corresponding to the loop back image, re-fusing the fused information in the first target set to obtain a second target set, including:
according to the first pose corresponding to the loop image, fused information in the first target set is subjected to de-fusion to obtain a third target set;
and according to the second pose corresponding to the loop image, fusing the information of the loop image into a third target set to obtain a second target set.
The implementation manner of the information fused in the first target set is described in detail in the above disclosed embodiments, and is not described herein again.
As can be seen from the above disclosed embodiments, in one possible implementation, the manner of re-fusing the fused information in the first set of targets may include de-fusion performed according to the first pose and re-fusion performed according to the second pose.
The method for de-fusion can be flexibly changed according to different implementation forms of the information fused in the first target set. In some possible implementations, the de-fusion can be viewed as the reverse process of the fusion, and thus the de-fusion can be achieved by the reverse operation of the fusion. For example, as described in the foregoing embodiments, in one possible implementation manner, the information fused in the first target set includes depth information and/or color information, in which case, based on the reverse operation of the fusion process of the depth information and/or the color information, the manner of performing the de-fusion on the information fused in the first target set may include:
and according to the fusion weight of the information fused in the first target set and the projection relation between at least one voxel in the first target set and the information fused in the first target set under the first attitude, performing information de-fusion on at least one voxel in the first target set to obtain a third target set.
Wherein the specific process of performing the de-fusion based on the fusion weights and the projection relation of the voxels in the first pose can be flexibly determined. In one example, referring to the fusion process of the above equations (2) and (3), the de-fusion process of the depth information in the first target set can be represented by the following equations (6) and (7):
D''(v) = (W'(v) · D'(v) - w_i(v) · d_i(v)) / (W'(v) - w_i(v))   (6)

W''(v) = W'(v) - w_i(v)   (7)

wherein D''(v) is the depth information of the voxel v after de-fusion, D'(v) is the depth information of the voxel v after fusion in the first pose proposed in the above disclosed embodiments, W'(v) is the updated weight of the voxel v after fusion in the first pose proposed in the above disclosed embodiments, w_i(v) is the fusion weight proposed in the above disclosed embodiments, d_i(v) is the distance, in the first pose, between the back-projection point corresponding to the voxel v in the depth information and the voxel v, and W''(v) is the updated weight after the voxel v is de-fused.
For the fusion weight w_i(v) of the depth information and the projection relation d_i(v) between the voxel and the depth information, reference may be made to the above embodiments, which are not described herein again.
As can be seen from the above disclosed embodiments, in one possible implementation, the fused depth information in the first target set may be de-fused based on the fusion weight of the depth information and the projection relationship between the voxel and the depth information in the first pose. In some possible implementations, in the case that the information fused in the first target set includes color information, the color information is de-fused by analogy with the depth information, except that the truncated signed distance function D(v) corresponding to the depth information is replaced by the color value C(v), and the projection relationship between the voxel and the depth information in the first pose is replaced by the projection relationship between the voxel and the target image in the first pose, and the like. When the information fused in the first target set includes other data forms, the de-fusion mode may be flexibly replaced and expanded by referring to the above-mentioned embodiments, which is not described herein again.
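A minimal sketch of the depth de-fusion step, implementing formulas (6) and (7) as the inverse of the fusion in formulas (2) and (3); it assumes the same d_i(v) and w_i(v) used during the original fusion under the first pose are retained or recomputed, and that W_fused > w_i.

import numpy as np

def defuse_depth(D_fused, W_fused, d_i, w_i):
    """Formulas (6) and (7): remove the loop image's contribution
    (computed under the first pose) from the voxel, reversing the
    weighted average of formulas (2) and (3)."""
    W_new = W_fused - w_i                            # (7)
    D_new = (W_fused * D_fused - w_i * d_i) / W_new  # (6)
    return D_new, W_new

# Round trip: fusing a measurement with the earlier fuse_depth sketch and
# then calling defuse_depth with the same d_i and w_i restores (D, W).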
Through the above process, when the information fused in the first target set includes data in various forms, the different pieces of information can be flexibly de-fused according to their respective forms, which improves the efficiency and flexibility of the de-fusion and thus the efficiency of updating the map data.
In some possible implementations, the information fused in the first target set may include semantic information; in this case, based on the reverse of the semantic information fusion process, de-fusing the information in the first target set may include:
projecting at least one voxel in the first target set to the loop image in the first pose, and determining semantic probability distribution information of the projected at least one voxel according to the semantic information of at least one pixel in the loop image, so as to obtain a third target set.
As can be seen from the above embodiments, in the process of fusing the semantic information, at least one voxel in the voxel information set is projected to the target image, and the semantic probability distribution information of the projected voxel is determined from the semantic information of at least one pixel in the target image. Correspondingly, in the process of de-fusing the semantic information, the voxels in the first target set can be projected to the loop image in the first pose, and the previously fused semantic information of the at least one pixel in the loop image can be removed based on the projection result, so as to obtain the third target set.
In some possible implementations, referring to equation (1) above, it can be seen that in the semantic fusion process the fused voxel information set is obtained from the semantic probability distribution P(l_i | I_1,…,k−1) held by the voxel before the semantic information of the current target image is fused, together with the semantic likelihood P(O_u(v,k) = l_i | I_k) obtained by semantically segmenting the current target image. Correspondingly, in one possible implementation, the third target set may be obtained directly from the semantic probability distribution P(l_i | I_1,…,k−1) held by the voxel before the semantic information of the current target image was fused. Which way is selected to obtain the third target set after de-fusing the semantic information can be determined flexibly according to the actual situation, and is not limited in the embodiments of the present disclosure.
By projecting at least one voxel in the first target set to the loop image in the first pose and determining the semantic probability distribution information of the projected voxel according to the semantic information of at least one pixel in the loop image, a third target set is obtained. This realizes independent de-fusion of the semantic information through the reverse of the semantic fusion process, facilitates the subsequent re-fusion of the semantic information to update the map data without affecting the fusion and de-fusion of other information, and effectively improves the feasibility, flexibility, and efficiency of the map data update.
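As a sketch of the Bayesian fusion of equation (1) and its reversal, the snippet below stores a per-voxel label distribution and multiplies in, or divides out, a pixel's segmentation likelihood, renormalizing each time. The function names, the array layout, and the epsilon guard are assumptions made for illustration.

```python
import numpy as np

def fuse_label_dist(prior, likelihood):
    """Bayesian update (cf. equation (1)): P(l | I_1..k) is proportional
    to P(l | I_1..k-1) * P(O_u(v,k) = l | I_k), renormalized over labels."""
    post = prior * likelihood
    return post / post.sum()

def defuse_label_dist(posterior, likelihood, eps=1e-12):
    """Reverse update: divide out the likelihood fused from the loop
    image's pixel, recovering the distribution held before that step."""
    prior = posterior / np.maximum(likelihood, eps)
    return prior / prior.sum()

voxel = np.array([0.25, 0.25, 0.25, 0.25])   # 4 labels, uniform prior
obs   = np.array([0.70, 0.10, 0.10, 0.10])   # segmentation output at the pixel
fused = fuse_label_dist(voxel, obs)
assert np.allclose(defuse_label_dist(fused, obs), voxel)
```

Because the update is a per-label product followed by normalization, dividing the same likelihood back out is an exact inverse, which is what allows the semantic channel to be de-fused independently of the depth and color channels.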
After the third target set is obtained, the information of the loop image can be fused into the third target set according to the second pose corresponding to the loop image, to obtain the second target set. The content of the information of the loop image is not limited in the embodiments of the present disclosure and may follow the implementation forms of the information fused in the first target set described above; for example, it may be semantic information, depth information, or color information, which is not repeated here.
For example, in one case, in fusing the semantic information, at least one voxel in the third target set may be projected to the loop image in the second pose, and the semantic probability distribution information of the projected voxel is determined according to the semantic information of at least one pixel in the loop image, so as to obtain the second target set. The fusion of the depth information and the color information may likewise refer to the above embodiments and is not detailed here.
Through the above process, the fused information in the first target set is de-fused according to the first pose corresponding to the loop image to obtain the third target set, and the information of the loop image is then fused into the third target set according to the second pose corresponding to the loop image to obtain the second target set.
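The two-stage update just described can be summarized in one routine: every voxel of the first target set that projects into the loop image is de-fused under the drifted first pose and then fused again under the optimized second pose. The helpers `project_voxel`, `defuse_voxel`, and `fuse_voxel` are assumed names standing in for the per-modality operations; `project_voxel` maps a voxel center to pixel coordinates under a given pose, or None if the voxel falls outside the image.

```python
def refuse_target_set(first_set, loop_image, pose_first, pose_second):
    """Re-fusion = de-fusion at pose_first, then fusion at pose_second.
    All helper names are assumptions made for this sketch."""
    # Stage 1: de-fusion under the first pose -> third target set.
    third_set = {}
    for key, voxel in first_set.items():
        pix = project_voxel(voxel.center, pose_first, loop_image.K,
                            loop_image.width, loop_image.height)
        third_set[key] = defuse_voxel(voxel, loop_image, pix) if pix else voxel

    # Stage 2: fusion of the loop image under the second pose -> second set.
    second_set = {}
    for key, voxel in third_set.items():
        pix = project_voxel(voxel.center, pose_second, loop_image.K,
                            loop_image.width, loop_image.height)
        second_set[key] = fuse_voxel(voxel, loop_image, pix) if pix else voxel
    return second_set
```

Only voxels visible in the loop image are touched in either stage, which keeps the online update cheap relative to rebuilding the whole map.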
After the map data is obtained by any combination of the above embodiments, corresponding applications, such as control of an indoor robot or augmented reality, can be executed based on it. The specific application scenarios are not limited in the embodiments of the present disclosure. In one possible implementation, since the map data obtained in the embodiments of the present disclosure fuses continuous semantic information, it can be used to control a robot to perform tasks related to the semantic information in the target scene; in one example, the robot can be controlled to perform corresponding operations on a target object in the target scene, such as picking up a cup on a table. In another possible implementation, the map data obtained in the embodiments of the present disclosure can also be applied to an AR platform or the like having a semantic mapping function.
Fig. 2 illustrates a block diagram of a data generation apparatus according to an embodiment of the present disclosure. As shown in Fig. 2, the data generation apparatus 20 may include:
a voxel information set establishing module 21, configured to determine a first pose of a target image in a target scene and establish a voxel information set of the target image according to the first pose, where the target image includes at least one frame of image obtained by collecting data of the target scene;

a semantic segmentation module 22, configured to perform semantic segmentation on the target image to obtain semantic information of the target image;

a fusion module 23, configured to fuse the semantic information into the voxel information set to obtain a fused voxel information set; and

a data generation module 24, configured to obtain map data of the target scene according to the fused voxel information set corresponding to the target image.
In one possible implementation, the fusion module is configured to: project at least one voxel in the voxel information set to the target image, and determine semantic probability distribution information of the projected at least one voxel according to the semantic information of at least one pixel in the target image, so as to obtain the fused voxel information set.
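A minimal pinhole projection illustrates how such a voxel-to-pixel mapping could be realized before the semantic distribution is updated. The 4×4 world-to-camera transform, the 3×3 intrinsics, and the integer rounding are assumptions for this sketch; the patent does not fix a camera model.

```python
import numpy as np

def project_voxel(center_w, T_cw, K, width, height):
    """Maps a voxel center in world coordinates to pixel coordinates.
    T_cw: 4x4 world-to-camera transform; K: 3x3 camera intrinsics.
    Returns (u, v) or None if the voxel is not visible in the image."""
    p_c = T_cw[:3, :3] @ center_w + T_cw[:3, 3]   # world -> camera frame
    if p_c[2] <= 0:                               # behind the image plane
        return None
    uv = K @ (p_c / p_c[2])                       # perspective division
    u, v = int(round(uv[0])), int(round(uv[1]))
    return (u, v) if 0 <= u < width and 0 <= v < height else None
```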
In one possible implementation, the fusion module is further configured to: acquire acquisition data obtained by collecting data of the target scene, and fuse the acquisition data into the voxel information set to obtain the fused voxel information set.

In one possible implementation, the fusion module is configured to: perform information fusion on at least one voxel in the voxel information set according to the fusion weight of the acquired data and the projection relation between the acquired data and the at least one voxel, to obtain the fused voxel information set.

In one possible implementation, the data generation module is configured to: store the fused voxel information set corresponding to at least one target image into the map data of the target scene.
In one possible implementation, the apparatus is further configured to: perform loop detection on at least one frame of the target image, and take a target image in which a loop is detected as a loop image; determine a second pose of the loop image, wherein the accuracy of the second pose is higher than that of the first pose of the target image corresponding to the loop image; and update the map data of the target scene according to the second pose of the loop image.
In one possible implementation, the apparatus is further configured to: acquire the fused voxel information set corresponding to the loop image in the first pose as a first target set; re-fuse the fused information in the first target set based on the first pose and the second pose corresponding to the loop image to obtain a second target set; and update the map data of the target scene according to the second target set.
In one possible implementation, the apparatus is further configured to: de-fuse the fused information in the first target set according to the first pose corresponding to the loop image to obtain a third target set; and fuse the information of the loop image into the third target set according to the second pose corresponding to the loop image to obtain the second target set.
Application scenario example
The application example of the disclosure provides a data generation method, which can generate high-quality map data containing continuous semantic information.
Fig. 3 is a schematic diagram of an application example according to the present disclosure. As can be seen from Fig. 3, in this application example the data generation method may include the following steps:
In the first step, a six-degree-of-freedom pose is estimated using the target image and IMU data, and loop detection is performed.
In this step, the target image (an RGB image) and the IMU data obtained by collecting data of the target scene with a sensor device are used. A monocular VIO method featuring tight coupling, relocalization, self-calibration, nonlinear optimization, global pose graph optimization, and the like is used to obtain the six-degree-of-freedom pose of the device for each frame of the target image (i.e., the first pose in the above embodiments) and to determine whether a loop image exists among the target images.
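The patent does not prescribe how loop candidates are found. Purely as an illustration, the toy check below compares a global image descriptor (for instance a bag-of-words vector or a learned embedding, both assumptions here) against older keyframes by cosine similarity, flagging a loop when a sufficiently old frame matches well.

```python
import numpy as np

def detect_loop(desc, keyframe_descs, current_id, min_gap=30, thresh=0.85):
    """keyframe_descs: {frame_id: descriptor}.  Returns the id of the best
    matching, sufficiently old keyframe, or None if no loop is detected.
    min_gap and thresh are illustrative values, not from the patent."""
    best_id, best_sim = None, 0.0
    for kf_id, kf_desc in keyframe_descs.items():
        if current_id - kf_id < min_gap:          # ignore recent frames
            continue
        sim = float(desc @ kf_desc /
                    (np.linalg.norm(desc) * np.linalg.norm(kf_desc)))
        if sim > best_sim:
            best_id, best_sim = kf_id, sim
    return best_id if best_sim >= thresh else None
```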
In the second step, semantic segmentation is performed based on a convolutional neural network.
In this step, the target image obtained in the first step is semantically segmented by a convolutional neural network for image segmentation, yielding a semantic segmentation result for each target image. The network can be trained on training images that contain per-pixel semantic annotations, which may be produced by a relevant annotation method or annotation tool. In one possible implementation, the trained network yields comparatively accurate segmentation results on images acquired in indoor environments and has a certain generalization capability.
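For illustration, the per-pixel semantic likelihoods consumed by the fusion step can be obtained with any off-the-shelf segmentation network; the sketch below uses torchvision's fcn_resnet50 purely as a stand-in, since the patent does not fix a particular architecture (requires torchvision >= 0.13 for the weights argument).

```python
import torch
from torchvision.models.segmentation import fcn_resnet50

model = fcn_resnet50(weights="DEFAULT").eval()   # stand-in network, an assumption

def segment(image):
    """image: normalized float tensor of shape (1, 3, H, W).
    Returns per-pixel label probabilities of shape (C, H, W)."""
    with torch.no_grad():
        logits = model(image)["out"][0]          # raw class scores (C, H, W)
    return torch.softmax(logits, dim=0)          # probability per label per pixel
```

The softmax output plays the role of the likelihood P(O_u(v,k) = l_i | I_k) in equation (1).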
In the third step, the single-frame target image, the depth information, the semantic segmentation result, and the like are fused into the map data.
As shown in Fig. 3, in the application example of the present disclosure, besides the target image and IMU data collected in the first step, depth information of the target scene is also acquired. In this step, based on the first pose determined in the first step and the current depth information, a corresponding voxel information set is established for the current target image in the map data, which consists of a hash table and voxel block data units; the voxel information set corresponding to the current target image is then updated according to the semantic information, the target image, the depth information, and the like, to obtain the fused voxel information set. Since most voxels in the map data are invisible in the view of the current target image, this arrangement accelerates the fusion of the various kinds of information.
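One way to realize the "hash table plus voxel block" layout mentioned above is to key fixed-size blocks by their integer block coordinate and allocate them lazily. The block size, voxel size, and stored fields below are assumptions for illustration; the patent does not specify these parameters.

```python
import numpy as np

BLOCK = 8            # 8x8x8 voxels per block (assumed)
VOXEL = 0.01         # metres per voxel (assumed)

class VoxelBlock:
    def __init__(self, n_labels):
        shape = (BLOCK, BLOCK, BLOCK)
        self.sdf    = np.ones(shape, np.float32)                  # truncated signed distance
        self.weight = np.zeros(shape, np.float32)                 # accumulated fusion weight
        self.color  = np.zeros(shape + (3,), np.float32)          # fused RGB
        self.label  = np.full(shape + (n_labels,), 1.0 / n_labels,
                              np.float32)                         # semantic distribution

blocks = {}          # hash table: integer block coordinate -> VoxelBlock

def block_for(point_w, n_labels=21):
    """Lazily allocates and returns the block containing a world point,
    so only space near observed surfaces consumes memory."""
    key = tuple(np.floor(point_w / (BLOCK * VOXEL)).astype(int))
    if key not in blocks:
        blocks[key] = VoxelBlock(n_labels)
    return blocks[key]
```

Because only the blocks intersected by the current view need to be fetched from the hash table, the per-frame fusion touches a small fraction of the map, which is the acceleration noted above.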
The fusion manner is not limited in the application example of the present disclosure. In one example, the depth information and the color information may be fused by moving average; the specific process may refer to the above embodiments and equations (2) to (5). In one example, the semantic information may be fused based on Bayes' theorem; the specific process may refer to the above embodiments and equation (1), and is not repeated here. Fig. 4 is a schematic diagram of an application example according to the present disclosure, showing the semantic fusion effect for a target scene. As can be seen from Fig. 4, after the semantic information is fused, the semantic information of different areas of the target scene is expressed effectively and continuously.
In the fourth step, when a loop image is detected, pose optimization is performed on the loop image and the map data is updated online in real time.
Pose drift, or even erroneous pose estimation, can occur while the map data is being generated, distorting the map; loop detection eliminates the accumulated pose error. In this step, after a loop image is detected, its optimized second pose is obtained, and the map is updated online in real time by re-fusion based on the second pose. Re-fusion comprises the two processes of de-fusion and fusion; the specific implementation may refer to the above embodiments and is not repeated here.
In the application example of the present disclosure, the map data may be represented by voxels, with the probability distribution of semantic labels fused into the voxel information set corresponding to each voxel, so that a mesh with semantic labels can be generated and a continuous scene can be represented with high quality. Moreover, after a loop is detected and the pose updated, the application example of the present disclosure can update online, in real time and by re-fusion, the various types of information fused in the map data, such as the Signed Distance Field (SDF) value representing the depth information, the color information, and the probability distribution of the semantic labels, thereby eliminating distortion in the map in time.
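As a sketch of how a semantically labelled mesh could be extracted from such a volume, the snippet below runs marching cubes on the SDF's zero level set and tags each vertex with the most probable label of the nearest voxel. It assumes dense (X, Y, Z) arrays rather than the hashed blocks used at runtime, and the function name is an assumption.

```python
import numpy as np
from skimage import measure

def extract_labeled_mesh(sdf, label_probs):
    """sdf: (X, Y, Z) float array of signed distances;
    label_probs: (X, Y, Z, L) array of per-voxel label distributions.
    Returns mesh vertices, faces, and one semantic label per vertex."""
    verts, faces, _normals, _values = measure.marching_cubes(sdf, level=0.0)
    idx = np.clip(np.round(verts).astype(int), 0,
                  np.array(sdf.shape) - 1)        # nearest voxel per vertex
    labels = label_probs[idx[:, 0], idx[:, 1], idx[:, 2]].argmax(axis=-1)
    return verts, faces, labels
```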
It is understood that the above method embodiments of the present disclosure can be combined with one another to form combined embodiments without departing from the underlying principles and logic; owing to space limitations, the details are not repeated in the present disclosure.
It will be understood by those skilled in the art that, in the above methods of the present disclosure, the order in which the steps are written does not imply a strict execution order or any limitation on the implementation; the specific execution order of the steps should be determined by their functions and possible internal logic.
Embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the above-mentioned method. The computer readable storage medium may be a volatile computer readable storage medium or a non-volatile computer readable storage medium.
An embodiment of the present disclosure further provides an electronic device, including: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to perform the above method.
In practical applications, the memory may be a volatile memory such as a RAM, or a non-volatile memory such as a ROM, a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD), or a combination of the above types of memories, and it provides instructions and data to the processor.
The processor may be at least one of an ASIC, a DSP, a DSPD, a PLD, an FPGA, a CPU, a controller, a microcontroller, or a microprocessor. It is understood that the electronic device implementing the above processor functions may also be another device; the embodiments of the present disclosure are not particularly limited in this regard.
The electronic device may be provided as a terminal, server, or other form of device.
Based on the same technical concept of the foregoing embodiments, the embodiments of the present disclosure also provide a computer program, which when executed by a processor implements the above method.
Fig. 5 is a block diagram of an electronic device 800 according to an embodiment of the disclosure. For example, the electronic device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or another such terminal.
Referring to fig. 5, electronic device 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the electronic device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 806 provides power to the various components of the electronic device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 800 is in an operation mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the electronic device 800. For example, the sensor assembly 814 may detect the open/closed state of the electronic device 800 and the relative positioning of components, such as the display and keypad of the electronic device 800. The sensor assembly 814 may also detect a change in the position of the electronic device 800 or one of its components, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a change in its temperature. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact, and may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may further include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium, such as the memory 804, is also provided that includes computer program instructions executable by the processor 820 of the electronic device 800 to perform the above-described methods.
Fig. 6 is a block diagram of an electronic device 1900 according to an embodiment of the disclosure. For example, the electronic device 1900 may be provided as a server. Referring to fig. 6, electronic device 1900 includes a processing component 1922 further including one or more processors and memory resources, represented by memory 1932, for storing instructions, e.g., applications, executable by processing component 1922. The application programs stored in memory 1932 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1922 is configured to execute instructions to perform the above-described method.
The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium, such as the memory 1932, is also provided that includes computer program instructions executable by the processing component 1922 of the electronic device 1900 to perform the above-described methods.
The present disclosure may be systems, methods, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for causing a processor to implement various aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), can execute the computer-readable program instructions by utilizing state information of the computer-readable program instructions to personalize the electronic circuitry, in order to implement various aspects of the present disclosure.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. A method of generating data, comprising:
determining a first pose of a target image in a target scene, and establishing a voxel information set of the target image according to the first pose, wherein the target image comprises at least one frame of image obtained by data acquisition of the target scene;
performing semantic segmentation on the target image to obtain semantic information of the target image;
fusing the semantic information into the voxel information set to obtain a fused voxel information set;
and obtaining map data of the target scene according to the fusion voxel information set corresponding to the target image.
2. The method according to claim 1, wherein said fusing the semantic information into the set of voxel information resulting in a fused set of voxel information comprises:
and projecting at least one voxel in the voxel information set to the target image, and determining semantic probability distribution information of the at least one voxel after projection according to the semantic information of at least one pixel in the target image to obtain a fusion voxel information set.
3. The method according to claim 1 or 2, wherein said fusing the semantic information into the set of voxel information resulting in a fused set of voxel information, further comprises:
acquiring acquisition data obtained by acquiring data of the target scene, and fusing the acquisition data into the voxel information set to obtain a fused voxel information set.
4. The method of claim 3, wherein said fusing the acquired data into the set of voxel information resulting in a fused set of voxel information comprises:
and performing information fusion on at least one voxel in the voxel information set according to the fusion weight of the acquired data and the projection relation between the at least one voxel in the voxel information set and the acquired data to obtain a fusion voxel information set.
5. The method according to any one of claims 1 to 4, further comprising:
performing loop detection on at least one frame of the target image, and taking a target image in which a loop is detected as a loop image;
determining a second pose of the loop image, wherein the accuracy of the second pose is higher than the accuracy of the first pose of the target image corresponding to the loop image;
and updating the map data of the target scene according to the second pose of the loop image.
6. The method of claim 5, wherein updating the map data of the target scene according to the second pose of the loop image comprises:
acquiring a fused voxel information set corresponding to the loop image in the first pose as a first target set;
based on the first pose and the second pose corresponding to the loop image, re-fusing the fused information in the first target set to obtain a second target set;
and updating the map data of the target scene according to the second target set.
7. The method according to claim 6, wherein the re-fusing the fused information in the first target set based on the first pose and the second pose corresponding to the loop image to obtain a second target set comprises:
according to the first pose corresponding to the loop image, fused information in the first target set is subjected to de-fusion to obtain a third target set;
and according to the second pose corresponding to the loop image, fusing the information of the loop image into the third target set to obtain a second target set.
8. A data generation apparatus, comprising:
the system comprises a voxel information set establishing module, a processing module and a display module, wherein the voxel information set establishing module is used for determining a first pose of a target image in a target scene and establishing a voxel information set of the target image according to the first pose, and the target image comprises at least one frame of image obtained by data acquisition of the target scene;
the semantic segmentation module is used for performing semantic segmentation on the target image to obtain semantic information of the target image;
the fusion module is used for fusing the semantic information into the voxel information set to obtain a fused voxel information set;
and the data generation module is used for obtaining the map data of the target scene according to the fusion voxel information set corresponding to the target image.
9. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to invoke the memory-stored instructions to perform the method of any of claims 1 to 7.
10. A computer readable storage medium having computer program instructions stored thereon, which when executed by a processor implement the method of any one of claims 1 to 7.
CN202110231700.1A 2021-03-02 2021-03-02 Data generation method and device, electronic equipment and storage medium Pending CN112837372A (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN202110231700.1A CN112837372A (en) 2021-03-02 2021-03-02 Data generation method and device, electronic equipment and storage medium
KR1020227014409A KR20220125715A (en) 2021-03-02 2021-07-09 Data generation methods, devices, devices, storage media and programs
PCT/CN2021/105485 WO2022183656A1 (en) 2021-03-02 2021-07-09 Data generation method and apparatus, device, storage medium, and program


Publications (1)

Publication Number Publication Date
CN112837372A true CN112837372A (en) 2021-05-25

Family

ID=75934338


Country Status (3)

Country Link
KR (1) KR20220125715A (en)
CN (1) CN112837372A (en)
WO (1) WO2022183656A1 (en)


Also Published As

Publication number Publication date
KR20220125715A (en) 2022-09-14
WO2022183656A1 (en) 2022-09-09


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40045349
Country of ref document: HK
RJ01 Rejection of invention patent application after publication
Application publication date: 20210525