CN114140528A - Data annotation method and device, computer equipment and storage medium - Google Patents

Data annotation method and device, computer equipment and storage medium

Info

Publication number
CN114140528A
Authority
CN
China
Prior art keywords
dimensional
target
information
target scene
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202111396883.9A
Other languages
Chinese (zh)
Inventor
侯欣如
姜翰青
刘浩敏
陈东生
甄佳楠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd filed Critical Beijing Sensetime Technology Development Co Ltd
Priority to CN202111396883.9A
Publication of CN114140528A
Priority to PCT/CN2022/117915 (published as WO2023093217A1)
Legal status: Withdrawn

Classifications

    • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06T 17/00 Three-dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T 7/0004 Industrial image inspection
    • G06T 2200/08 Indexing scheme involving all processing steps from image acquisition to 3D model generation
    • G06T 2207/10016 Video; image sequence
    • G06T 2207/10028 Range image; depth image; 3D point clouds
    • G06T 2207/20081 Training; learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06T 2207/30108 Industrial image inspection

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • Quality & Reliability (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The present disclosure provides a data annotation method, apparatus, computer device and storage medium, the method being applied to a server. The method comprises: acquiring image data to be processed, where the image data carries label information and comprises a video or images obtained by capturing a target scene; performing three-dimensional reconstruction of the target scene based on the image data to obtain a three-dimensional model of the target scene; determining a target three-dimensional position of the label information in a model coordinate system corresponding to the three-dimensional model, based on the two-dimensional labeling position of the label information in the image data; and adding the label information to the three-dimensional model at the target three-dimensional position.

Description

Data annotation method and device, computer equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a data annotation method, an apparatus, a computer device, and a storage medium.
Background
To make it easier for staff to annotate objects in a target scene, the target scene can be photographed and data annotation performed on the captured images. However, when each object is annotated only on a two-dimensional image, the resulting data annotation cannot be displayed intuitively, and subsequent management, control and maintenance of the objects based on the annotation results is inconvenient.
Disclosure of Invention
The embodiment of the disclosure at least provides a data annotation method, a data annotation device, computer equipment and a storage medium.
In a first aspect, an embodiment of the present disclosure provides a data annotation method applied to a server. The method comprises: acquiring image data to be processed, where the image data carries label information and comprises a video or images obtained by capturing a target scene; performing three-dimensional reconstruction of the target scene based on the image data to obtain a three-dimensional model of the target scene; determining a target three-dimensional position of the label information in a model coordinate system corresponding to the three-dimensional model, based on the two-dimensional labeling position of the label information in the image data; and adding the label information to the three-dimensional model at the target three-dimensional position.
Thus, a three-dimensional model of the target scene is constructed, and the target three-dimensional position of the label information from the two-dimensional image is determined in the model coordinate system corresponding to the three-dimensional model, based on the two-dimensional labeling position of the label information in the image data. The label information is then added to the three-dimensional model at the target three-dimensional position, so that the generated three-dimensional model of the target scene carries the label information. The data labeling result corresponding to each target object in the target scene can thus be displayed intuitively through this model, which also facilitates subsequent management, control and maintenance of the target objects based on the data labeling results.
In an optional implementation manner, the tag information carried by the image data is obtained according to the following steps: and performing semantic segmentation processing on the image in the image data, and generating the label information based on the result of the semantic segmentation processing.
In this way, the semantic segmentation processing is carried out on the image in the image data, so that the label information is automatically generated, and the efficiency of data annotation is improved.
In an optional implementation manner, the tag information carried by the image data is obtained according to the following steps: receiving the label information sent by the terminal equipment; the tag information is generated by the terminal device in response to the labeling operation of the image in the image data.
In an optional embodiment, the determining, based on the two-dimensional labeling position of the tag information in the image data, a target three-dimensional position of the tag information in a model coordinate system corresponding to the three-dimensional model includes: determining a target image marked with the label information in at least one frame of image included in the image data; determining a target pixel point corresponding to the target image at the two-dimensional labeling position of the label information; based on the target image and the pose of the image acquisition equipment when acquiring the target image, carrying out three-dimensional position recovery on the target image to obtain the three-dimensional position of the target pixel point under the model coordinate system; and determining the target three-dimensional position of the label information in the model coordinate system corresponding to the three-dimensional model based on the three-dimensional position of the target pixel point in the model coordinate system.
In this way, using the target image corresponding to the label information and the pose of the image acquisition device when the target image was acquired, the target pixel points marked with the label information on the two-dimensional image can be accurately mapped to their target three-dimensional positions in the model coordinate system of the target scene. The label information can then be accurately added to the three-dimensional model of the target scene based on these target three-dimensional positions, which improves the accuracy of the subsequently generated three-dimensional model of the target scene carrying the label information.
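For illustration, the three-dimensional position recovery described above can be sketched as a simple back-projection, assuming a pinhole camera model, known intrinsics K, a camera-to-model pose (R, t) for the target image, and a depth value for the target pixel (for example taken from a depth image or from the reconstructed model); the function name and arguments are illustrative, not part of the disclosure:

```python
import numpy as np

def pixel_to_model_coords(u, v, depth, K, R_cw, t_cw):
    """Lift a labeled pixel (u, v) with known depth into the model coordinate system.

    K     : 3x3 camera intrinsic matrix
    R_cw  : 3x3 rotation, camera-to-model frame
    t_cw  : 3-vector, camera origin expressed in the model frame
    depth : depth of the pixel along the camera ray (same unit as the model)
    """
    # Back-project the pixel into a ray in the camera frame (z = 1 plane).
    ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])
    point_cam = ray_cam * depth                  # 3D point in camera coordinates
    # Transform the point into the model coordinate system with the camera pose.
    point_model = R_cw @ point_cam + t_cw
    return point_model
```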
In an optional implementation manner, the three-dimensional model includes three-dimensional sub-models corresponding to respective target objects in the target scene; adding label information corresponding to the target three-dimensional position to the three-dimensional model based on the target three-dimensional position includes: determining the target object to which the label information is to be added, based on the target three-dimensional position and the poses of the three-dimensional sub-models corresponding to the target objects in the three-dimensional model under the model coordinate system; and establishing an association relationship between the target object to which the label information is to be added and the label information.
In this way, the target object to which the label information is to be added can be accurately determined from the target objects contained in the target scene, based on the target three-dimensional position and the poses of the three-dimensional sub-models corresponding to the target objects in the target scene. The label information can then be added to that target object, which improves the accuracy of the generated three-dimensional model of the target scene carrying the label information.
In an optional implementation manner, establishing the association relationship between the target object to which the label information is to be added and the label information includes: determining, based on the target three-dimensional position, a three-dimensional labeling position of the label information on the target object to which the label information is to be added; and establishing an association relationship between the three-dimensional labeling position and the label information.
In this way, the corresponding label information can be accurately added, based on the target three-dimensional position, to the three-dimensional sub-model of the target object to which the label information is to be added, which improves the accuracy of the generated three-dimensional model of the target scene carrying the label information.
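As a minimal sketch of establishing the association relationship, assuming each three-dimensional sub-model is available as a set of points with an object identifier, the label could be attached to the sub-model closest to the target three-dimensional position; the nearest-point rule and all names here are illustrative assumptions, not the method defined by the disclosure:

```python
import numpy as np

def attach_tag_to_object(tag, target_pos_3d, sub_models):
    """Attach label information to the sub-model whose points lie closest to the
    target three-dimensional position, and record the 3D labeling position."""
    best, best_dist = None, np.inf
    for sub in sub_models:                       # each sub-model: {"object_id", "points" (Nx3)}
        dist = np.linalg.norm(sub["points"] - target_pos_3d, axis=1).min()
        if dist < best_dist:
            best, best_dist = sub, dist
    if best is None:
        return None
    # Association relation: object id + 3D labeling position + the label itself.
    return {"object_id": best["object_id"],
            "position_3d": tuple(target_pos_3d),
            "tag": tag}
```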
In an optional embodiment, the method further comprises: acquiring a display material; generating a label instance based on the display material and the label information; and in response to a label display event being triggered, showing the three-dimensional model of the target scene and the label instance.
Therefore, the tag instances containing the tag information and the three-dimensional model of the target scene are displayed to the user, so that the user can more intuitively know the spatial structure of the target scene, the pose information of each target object in the target scene, the structure of each target object and the tag information carried by the target object, and the subsequent user can conveniently manage and control and maintain the target object in the target scene based on the data labeling result.
In an alternative embodiment, the tag display event comprises: triggering the target object added with the label information; the displaying the three-dimensional model of the target scene and the label instance comprises: and displaying the three-dimensional model of the target scene and the label instance corresponding to the triggered target object.
In an alternative embodiment, the tag display event comprises: displaying, in a graphical user interface, the three-dimensional labeling position for which the association relationship with the label information has been established; and displaying the three-dimensional model of the target scene and the label instance comprises: displaying the three-dimensional model of the target scene and the label instance associated with the three-dimensional labeling position.
In an alternative embodiment, the tag information includes tag attribute information and/or tag content information; the tag attribute information includes at least one of: label size information, label color information, and label shape information; the tag content information includes at least one of: attribute information of the corresponding target object, defect inspection result information of the corresponding target object, and troubleshooting condition information of the corresponding target object.
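For illustration only, the tag attribute information and tag content information listed above could be organized as the following data structure; the class and field names are assumptions made for this sketch, not a format defined by the disclosure:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TagAttribute:
    size: Optional[str] = None     # label size information
    color: Optional[str] = None    # label color information
    shape: Optional[str] = None    # label shape information

@dataclass
class TagContent:
    object_attributes: Optional[str] = None   # attribute information of the target object
    defect_inspection: Optional[str] = None   # defect inspection result information
    troubleshooting: Optional[str] = None     # troubleshooting condition information

@dataclass
class TagInformation:
    attribute: TagAttribute = field(default_factory=TagAttribute)
    content: TagContent = field(default_factory=TagContent)
```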
In an alternative embodiment, the three-dimensional model comprises a three-dimensional point cloud model; the three-dimensional reconstruction of the target scene based on the image data to obtain a three-dimensional model of the target scene includes: performing three-dimensional point cloud reconstruction on the target scene based on the image data and the pose of the image acquisition equipment when acquiring the image data to obtain point cloud data of the target scene; the point cloud data includes: point cloud points corresponding to a plurality of target objects in the target scene respectively and position information corresponding to the point cloud points respectively; performing semantic segmentation processing on the point cloud data to obtain semantic information corresponding to a plurality of point cloud points respectively; generating a three-dimensional point cloud model of the target scene based on the point cloud data and the semantic information; the three-dimensional point cloud model of the target scene comprises three-dimensional sub-point cloud models corresponding to the target objects respectively.
Therefore, three-dimensional point cloud reconstruction is carried out on the target scene based on the image data and the pose of the image acquisition equipment when acquiring the image data, semantic segmentation is carried out on the obtained point cloud data, a three-dimensional point cloud model capable of reflecting the real space structure of each target object in the target scene and the pose information corresponding to each target object is generated, and a relatively accurate data basis is provided for adding label information to the three-dimensional model of the target scene subsequently.
In an alternative embodiment, the three-dimensional model comprises a three-dimensional dense model; the three-dimensional reconstruction of the target scene based on the image data to obtain a three-dimensional model of the target scene includes: performing three-dimensional dense reconstruction on the target scene based on the image data and the pose of the image acquisition equipment when acquiring the image data to obtain three-dimensional dense data of the target scene; the three-dimensional dense data includes: a plurality of dense points on the surfaces of a plurality of target objects in the target scene and position information corresponding to each dense point; performing semantic segmentation processing on the three-dimensional dense data to obtain semantic information respectively corresponding to a plurality of patches formed by the dense points; generating a three-dimensional dense model of the target scene based on the three-dimensional dense data and the semantic information; the three-dimensional dense model of the target scene comprises three-dimensional sub dense models corresponding to the target objects respectively.
Therefore, three-dimensional dense reconstruction is carried out on the target scene based on the image data and the pose of the image acquisition equipment when the image data is acquired, the obtained three-dimensional dense data is subjected to semantic segmentation, a three-dimensional dense model capable of reflecting the real space structure of each target object in the target scene and the pose information corresponding to each target object is generated, and a relatively accurate data basis is provided for adding label information to the three-dimensional model of the target scene subsequently.
In an alternative embodiment, the target object comprises at least one of: a building located within the target scene, and a device deployed within the target scene.
In a second aspect, an embodiment of the present disclosure further provides a data annotation device, which is applied to a server, where the device includes: the acquisition module is used for acquiring image data to be processed; the image data carries tag information, and the image data comprises a video or an image obtained by carrying out image acquisition on a target scene; the first processing module is used for carrying out three-dimensional reconstruction on the target scene based on the image data to obtain a three-dimensional model of the target scene; the determining module is used for determining a target three-dimensional position of the label information in a model coordinate system corresponding to the three-dimensional model based on the two-dimensional labeling position of the label information in the image data; and the second processing module is used for adding label information corresponding to the target three-dimensional position to the three-dimensional model based on the target three-dimensional position.
In an optional implementation manner, when the obtaining module obtains the tag information carried in the image data according to the following steps, the obtaining module is specifically configured to: and performing semantic segmentation processing on the image in the image data, and generating the label information based on the result of the semantic segmentation processing.
In an optional implementation manner, when the obtaining module obtains the tag information carried in the image data according to the following steps, the obtaining module is specifically configured to: receiving the label information sent by the terminal equipment; the tag information is generated by the terminal device in response to the labeling operation of the image in the image data.
In an optional embodiment, when the determining module determines the target three-dimensional position of the tag information in the model coordinate system corresponding to the three-dimensional model based on the two-dimensional labeling position of the tag information in the image data, the determining module is specifically configured to: determining a target image marked with the label information in at least one frame of image included in the image data; determining a target pixel point corresponding to the target image at the two-dimensional labeling position of the label information; based on the target image and the pose of the image acquisition equipment when acquiring the target image, carrying out three-dimensional position recovery on the target image to obtain the three-dimensional position of the target pixel point under the model coordinate system; and determining the target three-dimensional position of the label information in the model coordinate system corresponding to the three-dimensional model based on the three-dimensional position of the target pixel point in the model coordinate system.
In an optional implementation manner, the three-dimensional model includes three-dimensional sub-models corresponding to respective target objects in the target scene; the second processing module, when adding label information corresponding to the target three-dimensional position to the three-dimensional model based on the target three-dimensional position, is specifically configured to: determine the target object to which the label information is to be added, based on the target three-dimensional position and the poses of the three-dimensional sub-models corresponding to the target objects in the three-dimensional model under the model coordinate system; and establish an association relationship between the target object to which the label information is to be added and the label information.
In an optional implementation manner, when establishing the association relationship between the target object to which the label information is to be added and the label information, the second processing module is specifically configured to: determine, based on the target three-dimensional position, a three-dimensional labeling position of the label information on the target object to which the label information is to be added; and establish an association relationship between the three-dimensional labeling position and the label information.
In an alternative embodiment, the apparatus further comprises: the display module is used for acquiring display materials; generating a label instance based on the display material and the label information; and in response to a label display event being triggered, showing the three-dimensional model of the target scene and the label instance.
In an alternative embodiment, the tag display event comprises: triggering the target object added with the label information; the display module, when executing the three-dimensional model for displaying the target scene and the tag instance, is specifically configured to: and displaying the three-dimensional model of the target scene and the label instance corresponding to the triggered target object.
In an alternative embodiment, the tag display event comprises: displaying, in a graphical user interface, the three-dimensional labeling position for which the association relationship with the label information has been established; the display module, when displaying the three-dimensional model of the target scene and the tag instance, is specifically configured to: display the three-dimensional model of the target scene and the label instance associated with the three-dimensional labeling position.
In an alternative embodiment, the tag information includes tag attribute information and/or tag content information; the tag attribute information includes at least one of: label size information, label color information, and label shape information; the tag content information includes at least one of: attribute information of the corresponding target object, defect inspection result information of the corresponding target object, and troubleshooting condition information of the corresponding target object.
In an alternative embodiment, the three-dimensional model comprises a three-dimensional point cloud model; the first processing module, when performing the three-dimensional reconstruction of the target scene based on the image data to obtain a three-dimensional model of the target scene, is specifically configured to: performing three-dimensional point cloud reconstruction on the target scene based on the image data and the pose of the image acquisition equipment when acquiring the image data to obtain point cloud data of the target scene; the point cloud data includes: point cloud points corresponding to a plurality of target objects in the target scene respectively and position information corresponding to the point cloud points respectively; performing semantic segmentation processing on the point cloud data to obtain semantic information corresponding to a plurality of point cloud points respectively; generating a three-dimensional point cloud model of the target scene based on the point cloud data and the semantic information; the three-dimensional point cloud model of the target scene comprises three-dimensional sub-point cloud models corresponding to the target objects respectively.
In an alternative embodiment, the three-dimensional model comprises a three-dimensional dense model; the first processing module, when performing the three-dimensional reconstruction of the target scene based on the image data to obtain a three-dimensional model of the target scene, is specifically configured to: performing three-dimensional dense reconstruction on the target scene based on the image data and the pose of the image acquisition equipment when acquiring the image data to obtain three-dimensional dense data of the target scene; the three-dimensional dense data includes: a plurality of dense points on the surfaces of a plurality of target objects in the target scene and position information corresponding to each dense point; performing semantic segmentation processing on the three-dimensional dense data to obtain semantic information respectively corresponding to a plurality of patches formed by the dense points; generating a three-dimensional dense model of the target scene based on the three-dimensional dense data and the semantic information; the three-dimensional dense model of the target scene comprises three-dimensional sub dense models corresponding to the target objects respectively.
In an alternative embodiment, the target object comprises at least one of: a building located within the target scene, and a device deployed within the target scene.
In a third aspect, an embodiment of the present disclosure further provides a computer device comprising a processor and a memory, where the memory stores machine-readable instructions executable by the processor, and the processor is configured to execute the machine-readable instructions stored in the memory; when the machine-readable instructions are executed by the processor, the processor performs the steps in the first aspect or in any one of the possible implementations of the first aspect.
In a fourth aspect, an embodiment of the present disclosure further provides a computer-readable storage medium having a computer program stored thereon, where the computer program, when executed, performs the steps in the first aspect or in any one of the possible implementations of the first aspect.
For the description of the effects of the data annotation device, the computer device, and the computer-readable storage medium, reference is made to the description of the data annotation method, which is not repeated herein.
In order to make the aforementioned objects, features and advantages of the present disclosure more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
To illustrate the technical solutions of the embodiments of the present disclosure more clearly, the drawings used in the embodiments are briefly described below. The drawings, which are incorporated in and form a part of the specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the technical solutions of the present disclosure. It should be understood that the following drawings depict only certain embodiments of the disclosure and are therefore not to be considered limiting of its scope; those skilled in the art can derive further related drawings from them without inventive effort.
Fig. 1 shows a flowchart of a data annotation method provided by an embodiment of the present disclosure;
FIG. 2 is a schematic diagram illustrating a graphical user interface for indicating image acquisition in the data annotation method provided by the embodiment of the disclosure;
FIG. 3 is a flow chart illustrating a specific manner of generating a three-dimensional dense model in the data annotation method provided by the embodiment of the disclosure;
fig. 4 is a schematic diagram illustrating a display interface displaying a three-dimensional model of a target scene and a tag instance in the data annotation method provided in the embodiment of the present disclosure;
FIG. 5 is a schematic diagram illustrating a data annotation device provided by an embodiment of the disclosure;
fig. 6 shows a schematic diagram of a computer device provided by an embodiment of the present disclosure.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present disclosure clearer, the technical solutions of the embodiments are described below clearly and completely with reference to the drawings in the embodiments of the present disclosure. Obviously, the described embodiments are only some of the embodiments of the present disclosure, not all of them. The components of the embodiments of the present disclosure, as generally described and illustrated herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present disclosure is not intended to limit the scope of the disclosure as claimed, but merely represents selected embodiments of the disclosure. All other embodiments obtained by a person skilled in the art based on the embodiments of the present disclosure without creative effort shall fall within the protection scope of the present disclosure.
It has been found through research that, to make it easier for staff to annotate objects in a target scene, the target scene can be photographed and data annotation performed on the captured images. However, when each object is annotated only on a two-dimensional image, the resulting data annotation cannot be displayed intuitively, and subsequent management, control and maintenance of the objects based on the annotation results is inconvenient.
Based on this research, the present disclosure provides a data annotation method, apparatus, computer device and storage medium, which construct a three-dimensional model of the target scene and determine, based on the two-dimensional labeling position of the label information in the image data, the target three-dimensional position of the label information from the two-dimensional image in the model coordinate system corresponding to the three-dimensional model. The label information is then added to the three-dimensional model at the target three-dimensional position, so that the generated three-dimensional model of the target scene carries the label information. The data labeling result corresponding to each target object in the target scene can thus be displayed intuitively through this model, which also facilitates subsequent management, control and maintenance of the target objects based on the data labeling results.
The drawbacks identified above and the proposed solutions are the result of the inventors' practice and careful study; therefore, the discovery of the above problems and the solutions that the present disclosure proposes for them should both be regarded as contributions made by the inventors in the course of this disclosure.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
To facilitate understanding of the present embodiment, the data annotation method disclosed in the embodiments of the present disclosure is first described in detail. The execution subject of the data annotation method provided in the embodiments of the present disclosure is generally a computer device with certain computing capability, for example: a terminal device, which may be a user equipment (UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, a vehicle-mounted device or a wearable device; or a server or other processing device. In some possible implementations, the data annotation method can be implemented by a processor calling computer-readable instructions stored in a memory.
The data annotation method provided by the embodiment of the present disclosure is explained below.
Referring to fig. 1, a flowchart of a data annotation method provided in an embodiment of the present disclosure is shown, where the method is applied to a server, and the method includes steps S101 to S104, where:
s101, image data to be processed are obtained.
The image data carries label information and comprises a video or images obtained by capturing a target scene with an image acquisition device; the video may, for example, be a panoramic video. The image acquisition device may include, for example, at least one of a mobile phone, a camera, a video camera, a panoramic camera, an unmanned aerial vehicle, and the like. Because the image acquisition device obtains image data or a plurality of video frame images when shooting the target scene, it is suitable for capturing target scenes with larger spaces, such as machine rooms and factory buildings. Taking a machine room as an example, it may house computing devices, data storage devices, signal receiving devices, and the like; a factory building may house, for example, production facilities, handling facilities, and transportation facilities. Both the machine room and the factory building are physical spaces.
Illustratively, the target scenario may include, for example, a machine room having a large floor space, such as a machine room having a floor space of 20 square meters, 30 square meters, or 50 square meters. In the case of taking a machine room as a target scene, the scene in the machine room can be shot by using the image acquisition equipment.
In addition, the target scene may also be an outdoor scene, for example, in order to monitor the surrounding environment of the tower used for communication or for transmitting electric power, so as to prevent vegetation around the tower from affecting normal application of the tower during growth, the tower and the surrounding environment may be used as the target scene, the image data may be acquired, and modeling may be performed on the tower, vegetation near the tower, and buildings and the like that may exist near the tower.
In one possible case, the target scene for data acquisition and data annotation may include multiple regions, for example, multiple rooms may be included in a large target scene. Because the areas for data acquisition are similar, the data annotation method provided by the embodiment of the disclosure can be correspondingly applied to different multiple areas contained in the target scene. Additionally, at least one target object is also included in the target scene, and the target object may include, for example, but is not limited to: the system comprises a building positioned in a target scene, equipment deployed in the target scene and vegetation positioned in the target scene; for example, in the case that the target scene includes a machine room, the buildings located in the target scene may include, but are not limited to: at least one of a machine room ceiling, a machine room floor, a machine room wall, a machine room column, etc.; devices deployed within the target scene may include, for example, but are not limited to: the tower and the outdoor cabinet are arranged on the ceiling of the machine room, the cabling rack connected with the tower is arranged, and the indoor cabinet is arranged in the machine room.
Specifically, when acquiring images of the target scene, a robot carrying the image acquisition device may be controlled to move through the target scene to acquire the image data corresponding to the target scene; alternatively, a worker such as a surveyor may hold the image acquisition device and capture the target scene to acquire the corresponding image data; or an unmanned aerial vehicle equipped with the image acquisition device may be controlled to fly through the target scene to acquire its image data.
When the image of the target scene is acquired, in order to complete modeling of the target scene, the image acquisition device may be controlled to acquire images at different poses to form image data corresponding to the target scene.
When the image data acquired by the image acquisition device is processed, for example for three-dimensional reconstruction, the pose of the image acquisition device in the target scene needs to be determined. Therefore, before the image acquisition device captures the target scene, its gyroscope may be calibrated to help determine the pose of the image acquisition device in the target scene; for example, the optical axis of the image acquisition device may be adjusted to be parallel to the ground of the target scene.
After the gyroscope of the image acquisition equipment is calibrated, image data acquisition can be carried out by selecting an image data acquisition mode of the image acquisition equipment, and image data corresponding to a target scene is obtained.
For example, an image acquisition device installed on the server may be used to capture the target scene. Referring to fig. 2, a schematic diagram of a graphical user interface for guiding image acquisition is provided by the embodiment of the present disclosure; the graphical user interface shows the captured image data, which includes two target objects in the target scene: a cabinet and an integrated cabinet. In addition, instruction information prompting the user to shoot may be displayed on the graphical user interface, for example the prompt information 21 of "cabinet front" and "cabinet side" shown at the bottom of the interface. The prompt information 21 may be shown, for example, as operation buttons; in response to the user triggering the corresponding operation button, an acquisition button 22 for capturing an image may also be displayed at the corresponding position. In response to the acquisition button 22 being triggered, image acquisition is enabled and further images can be captured according to the instructions.
In addition, in the graphical user interface, a return control 23 and an edit control 24 may also be displayed accordingly. And responding to the user triggering the return control 23, returning to the previous operation step, and correspondingly displaying a display image corresponding to the previous operation step on the graphical user interface. In response to the user triggering the edit control 24, a corresponding data annotation page may also be correspondingly displayed to the user.
In an embodiment of the present disclosure, the tag information carried by the image data may also be obtained correspondingly.
The label information may include, but is not limited to: tag attribute information and/or tag content information. The tag attribute information is used for presenting the tag, and may include, for example, but is not limited to: label size information, label color information, and label shape information. The tag content information is used to describe information about the target object in the real scene, and may include, but is not limited to, at least one of the following: attribute information of the corresponding target object, defect inspection result information of the corresponding target object, and troubleshooting condition information of the corresponding target object.
Specifically, the label information carried by the image data may be acquired by at least one of, but not limited to, the following a1 to A3:
A1, semantic segmentation processing is performed on the images in the image data, and the label information is generated based on the result of the semantic segmentation processing.
In one embodiment, when an image acquisition device (e.g., a camera) installed on a server is used to acquire an image of a target scene, after image data of the target scene is acquired, a pre-trained neural network may be used to perform semantic segmentation processing on the image in the image data; based on the result of the semantic segmentation process, tag information is generated.
The pre-trained neural network may include, but is not limited to, at least one of the following: convolutional Neural Networks (CNN) and self-attention Neural Networks (transformers), which will not be described in detail below.
In another implementation, after acquiring image data of a target scene, the terminal device transmits the image data of the target scene to the server, and the server performs semantic segmentation processing on an image in the received image data of the target scene by using a pre-trained neural network; based on the result of the semantic segmentation process, tag information is generated.
Exemplarily, semantic segmentation processing is performed on an image in the image data using the pre-trained neural network, recognizing a cabinet and an integrated cabinet; the cabinet and the integrated cabinet are then labeled at the corresponding positions in the image data.
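A minimal sketch of A1, assuming a pre-trained segmentation network is available as a callable that maps an image to a per-pixel class-id map; the helper name, the use of segment centroids as two-dimensional labeling positions, and the class names are illustrative assumptions:

```python
import numpy as np

def generate_label_info(image, seg_model, class_names):
    """Generate label information from the result of semantic segmentation.

    seg_model   : pre-trained network mapping an HxWx3 image to an HxW class-id map
    class_names : list mapping class ids to names such as "cabinet", "integrated cabinet"
    """
    class_map = seg_model(image)                      # HxW array of class ids
    labels = []
    for class_id in np.unique(class_map):
        name = class_names[class_id]
        if name == "background":
            continue
        ys, xs = np.nonzero(class_map == class_id)
        # Use the segment centroid as the two-dimensional labeling position of the label.
        labels.append({"label": name,
                       "position_2d": (float(xs.mean()), float(ys.mean()))})
    return labels
```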
A2, after the image acquisition device installed on the server is used to acquire the image of the target scene and display the image on the graphical user interface, the label information can be generated in response to the marking operation of the user on the graphical user interface.
For example, after the image data acquired by the image acquisition device is acquired, the acquired image data may be correspondingly displayed on the graphical user interface. The image data shown on the graphical user interface can be viewed by a user such as a worker through operations such as dragging and sliding on the graphical user interface or by triggering controls such as playing and pausing.
In the process of viewing the image data, one or more frames of images can be selected for labeling, for example, a clear image can be determined in the process of viewing the image data, or an image of a target object can be displayed clearly, and data labeling is performed by triggering any pixel point in the determined image. Specifically, when data annotation is performed, for example, a corresponding data annotation control may be triggered, so as to fill in annotated data at a corresponding position on the graphical user interface.
For example, the respective at least one annotation control can be displayed on the graphical user interface in response to a user triggering any of the locations on the graphical user interface. By using different labeling controls, for example, the attribute information of the object, such as the equipment name, the service life, the specific function, the equipment responsible person, the equipment manufacturer, the equipment size specification, the relevant text remark and the like, can be filled and labeled. In addition, in order to better prompt the user to label the object on the graphical user interface, for example, indication labels, such as the indication label 25 corresponding to the integrated cabinet and the indication label 26 corresponding to the cabinet in fig. 2, may be displayed at the position triggered by the user. Here, the indication tag and the labeling data may be tag information obtained by labeling an image in the image data; in one possible case, the corresponding annotation data can also be viewed accordingly, for example, by triggering the indicator tag in the tag information.
A3, receiving label information sent by terminal equipment; the tag information is generated by the terminal device in response to the labeling operation of the image in the image data.
Specifically, after the terminal device is used for collecting the image data of the target scene, the collected image data can be correspondingly displayed on a graphical user interface of the terminal device; the image data shown on the graphical user interface can be checked by the user such as a worker through operations such as dragging and sliding on the graphical user interface or by triggering controls such as playing and pausing; in the process of viewing the image data, one or more frames of images may also be selected for manual labeling, and the specific labeling process may refer to the specific implementation shown in a2, and repeated details are not repeated.
The terminal equipment generates label information after responding to the marking operation of the image in the image data; and sending the label information to a server so that the server receives the label information obtained by labeling the image in the image data.
Following the above S101, the data annotation method provided in the embodiment of the present disclosure further includes:
s102, three-dimensional reconstruction is carried out on the target scene based on the image data, and a three-dimensional model of the target scene is obtained.
The three-dimensional model may include, for example: at least one of a three-dimensional point cloud model, and a three-dimensional dense model.
In a specific implementation, the three-dimensional reconstruction of the target scene may be performed based on the image data through at least one of, but not limited to, the following B1-B2, so as to obtain a three-dimensional model of the target scene:
b1, under the condition that the three-dimensional model comprises a three-dimensional point cloud model, performing three-dimensional point cloud reconstruction on the target scene based on the image data and the pose of the image data acquired by the image acquisition equipment to obtain point cloud data of the target scene; performing semantic segmentation processing on the point cloud data to obtain semantic information respectively corresponding to a plurality of point cloud points; generating a three-dimensional point cloud model of the target scene based on the point cloud data and the semantic information; the three-dimensional point cloud model of the target scene comprises three-dimensional sub-point cloud models corresponding to the target objects respectively.
Wherein the point cloud data comprises: point cloud points respectively corresponding to a plurality of target objects in the target scene, and position information corresponding to each point cloud point.
For example, but not limited to, at least one of the following C1 to C2 may be adopted to perform three-dimensional point cloud reconstruction on the target scene based on the image data, so as to obtain point cloud data of the target scene:
C1, if the image acquisition device includes a mobile phone, the pixels in each frame of the image data corresponding to the target scene do not have depth values. In this case, by shooting the target scene from different angles to obtain a plurality of images, the specific position of each point in the target scene can be calculated, so that the point cloud points corresponding to the target scene can be constructed and the point cloud data of the target scene obtained.
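The disclosure does not specify the multi-view geometry used in C1; as one illustrative possibility, a scene point observed in two images with known projection matrices (built from the intrinsics and the acquisition poses) can be recovered by linear (DLT) triangulation:

```python
import numpy as np

def triangulate_point(uv1, uv2, P1, P2):
    """Linear (DLT) triangulation of one scene point from two images taken at
    different angles, given their 3x4 projection matrices P = K [R | t]."""
    A = np.stack([
        uv1[0] * P1[2] - P1[0],
        uv1[1] * P1[2] - P1[1],
        uv2[0] * P2[2] - P2[0],
        uv2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]                    # homogeneous solution (smallest singular value)
    return X[:3] / X[3]           # inhomogeneous 3D coordinates of the point
```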
C2, if the image acquisition device includes a panoramic camera, each pixel in each frame of the image data corresponding to the target scene has a corresponding depth value, and the images containing the depth values are used to determine the position coordinates of each point in the target scene, that is, the point cloud data of the target scene.
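A minimal sketch of C2 under a pinhole-camera assumption (a real panoramic camera would use its own projection model): every pixel with a depth value is unprojected with the intrinsics and the acquisition pose to obtain point cloud points in the model coordinate system:

```python
import numpy as np

def depth_image_to_points(depth, K, R_cw, t_cw):
    """Unproject every pixel of a depth image into the model coordinate system.

    depth : HxW array of depth values (0 marks missing depth)
    K     : 3x3 intrinsic matrix; R_cw, t_cw : camera-to-model pose
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0
    pixels = np.stack([u[valid], v[valid], np.ones(valid.sum())], axis=0)  # 3xN homogeneous pixels
    rays = np.linalg.inv(K) @ pixels
    points_cam = rays * depth[valid]                     # 3xN points in the camera frame
    points_model = (R_cw @ points_cam) + t_cw[:, None]   # 3xN points in the model frame
    return points_model.T                                # Nx3 block of point cloud points
```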
After the point cloud data of the target scene is determined, the semantic information of each point cloud point can be determined by semantic segmentation. Since semantic segmentation of point cloud data is more complicated than semantic segmentation in a two-dimensional space, the problem can be converted into a semantic segmentation problem on a two-dimensional image by projecting the point cloud points to synthesize a two-dimensional image. Specifically, the point cloud points may be projected into a virtual two-dimensional image based on the position information of each point cloud point in the target scene, and semantic segmentation processing may then be performed using a pre-trained neural network. For a description of the pre-trained neural network, reference may be made to the related description of implementation A1 in S101 of the embodiments of the present disclosure, and repeated descriptions are omitted.
After semantic segmentation processing is carried out on the virtual two-dimensional image, a virtual semantic segmentation image can be obtained; each virtual pixel point in the virtual semantic segmentation image corresponds to scores in different categories, wherein the scores in different categories represent confidence degrees that the virtual pixel point belongs to corresponding categories, and corresponding semantic information can be correspondingly determined for the virtual pixel point according to the scores in different categories.
In this way, the semantic information determined in the virtual two-dimensional image can be mapped to the point cloud points according to the corresponding relationship between the virtual pixel points and the point cloud points, and the determination of the semantic information of each point cloud point is also completed.
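The projection round-trip described above can be sketched as follows, assuming a virtual pinhole camera with intrinsics K and pose (R, t) and a two-dimensional segmentation callable; the occupancy-style virtual image and all names are simplifying assumptions for illustration:

```python
import numpy as np

def segment_point_cloud_via_projection(points, K, R_wc, t_wc, seg_2d, img_size):
    """Project point cloud points into a virtual image, segment it in 2D,
    and map the per-pixel semantics back to the points.

    points      : Nx3 points in the model coordinate system
    R_wc, t_wc  : model-to-virtual-camera pose; K : virtual camera intrinsics
    seg_2d      : callable turning an HxW image into an HxW class-id map
    """
    h, w = img_size
    cam = (R_wc @ points.T) + t_wc[:, None]              # 3xN in the virtual camera frame
    z = cam[2]
    uv_h = K @ cam
    uv = np.round(uv_h[:2] / np.where(z > 0, z, 1.0)).astype(int)   # pixel coordinates
    in_view = (z > 0) & (uv[0] >= 0) & (uv[0] < w) & (uv[1] >= 0) & (uv[1] < h)

    # Synthesize the virtual two-dimensional image (here: a simple occupancy image).
    virtual_img = np.zeros((h, w), dtype=np.float32)
    virtual_img[uv[1, in_view], uv[0, in_view]] = 1.0

    class_map = seg_2d(virtual_img)                      # virtual semantic segmentation image
    semantics = np.full(len(points), -1, dtype=int)      # -1 = no semantics assigned
    semantics[in_view] = class_map[uv[1, in_view], uv[0, in_view]]
    return semantics
```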
After determining the semantic information of each point cloud point, a three-dimensional point cloud model of the target scene can be determined according to the position information of each point cloud point in the target scene and the semantic information corresponding to each point cloud point, wherein the three-dimensional point cloud model of the target scene comprises three-dimensional sub-point cloud models corresponding to a plurality of target objects respectively.
For example, after determining semantic information corresponding to each point cloud point, point cloud points with adjacent positions and the same semantic information may be used as point cloud points corresponding to the same target object. After point cloud points belonging to the same target object are determined, a three-dimensional sub-point cloud model corresponding to the target object can be generated based on the point cloud points respectively corresponding to the target objects; here, the three-dimensional sub-point cloud model corresponding to each target object may carry semantic information corresponding to each cloud point.
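The disclosure does not name a grouping algorithm; as one illustrative choice, points that share semantic information can be split into spatially adjacent clusters with DBSCAN, each cluster becoming a three-dimensional sub-point cloud model of one target object:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def split_into_sub_models(points, semantics, eps=0.10, min_samples=20):
    """Group point cloud points that are adjacent and share the same semantic label
    into per-object sub point clouds (illustrative DBSCAN-based grouping)."""
    sub_models = []
    for label in np.unique(semantics):
        if label < 0:                         # skip points without assigned semantics
            continue
        mask = semantics == label
        cluster_ids = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points[mask])
        for cid in np.unique(cluster_ids):
            if cid == -1:                     # DBSCAN marks noise points with -1
                continue
            sub_models.append({"semantic": int(label),
                               "points": points[mask][cluster_ids == cid]})
    return sub_models
```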
B2, under the condition that the three-dimensional model comprises a three-dimensional dense model, performing three-dimensional dense reconstruction on the target scene based on the image data and the pose of the image acquisition equipment when acquiring the image data to obtain three-dimensional dense data of the target scene; performing semantic segmentation processing on the three-dimensional dense data to obtain semantic information respectively corresponding to a plurality of patches formed by dense points; generating a three-dimensional dense model of the target scene based on the three-dimensional dense data and the semantic information; the three-dimensional dense model of the target scene comprises three-dimensional sub dense models corresponding to the target objects respectively.
Wherein the three-dimensional dense data comprises: the position information comprises a plurality of dense points positioned on the surfaces of a plurality of target objects in a target scene and position information corresponding to each dense point; here, any patch of the plurality of patches composed of dense points is composed of at least three dense points having a connection relationship, and the patches provided by the embodiments of the present disclosure may include, but are not limited to, at least one of triangular patches or quadrilateral patches, for example, and are not limited specifically here.
For example, when the three-dimensional model of the target scene is obtained by three-dimensionally reconstructing the target scene based on the image data, at least one of the following methods D1 to D2 may be used:
and D1, the image acquisition equipment only takes charge of image acquisition, and transmits the acquired image data and the pose of the image acquisition equipment when the image data is acquired to the data processing equipment by depending on network connection, so that the data processing equipment establishes a three-dimensional model of the target scene.
Network connections that may be relied upon may include, but are not limited to, Fiber Ethernet adapters, mobile communication technologies (e.g., fourth generation mobile communication technology (4G), or fifth generation mobile communication technology (5G)), and Wireless Fidelity (Wi-Fi), among others, for example; the data processing device may for example comprise, but is not limited to, the computer device described above.
When processing the image data, the data processing device may, for example, perform three-dimensional point cloud reconstruction of the target scene according to the image data and the pose of the image acquisition device when acquiring the image data (that is, the pose of the image acquisition device in the target scene), to obtain point cloud data of the target scene; perform semantic segmentation processing on the point cloud data by using at least one of a convolutional neural network (CNN) and a deep self-attention (Transformer) network, to obtain semantic information corresponding to each of a plurality of point cloud points; and generate a three-dimensional point cloud model of the target scene based on the point cloud data and the semantic information. Alternatively, it may perform three-dimensional dense reconstruction of the target scene according to the image data and the pose of the image acquisition device when acquiring the image data, to obtain three-dimensional dense data of the target scene; perform semantic segmentation processing on the three-dimensional dense data by using at least one of a CNN and a deep self-attention (Transformer) network, to obtain semantic information corresponding to each of a plurality of patches formed by the dense points; and generate a three-dimensional dense model of the target scene based on the three-dimensional dense data and the semantic information corresponding to each patch.
When the pose of the image acquisition device at the time of acquiring the image data is obtained, for example, relevant data of an Inertial Measurement Unit (IMU) of the image acquisition device at the time of acquisition may be obtained. The IMU of the image acquisition device may include, for example, three single-axis accelerometers and three single-axis gyroscopes, where the accelerometers detect the acceleration of the image acquisition device when acquiring the image data, and the gyroscopes detect its angular velocity. Therefore, by acquiring the relevant data of the IMU in the image acquisition device, the pose of the image acquisition device when acquiring the image data can be accurately determined.
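As an illustrative, non-limiting sketch of how IMU samples relate to a pose, the following Python code performs plain dead-reckoning integration of angular velocity and acceleration; the sampling rate, the gravity handling and the use of scipy's rotation utilities are assumptions made for the example, and a practical system would fuse the IMU data with the visual observations described below to limit drift.

```python
# Minimal sketch (an assumption, not the embodiment's method): integrating IMU
# samples (angular velocity + acceleration) into a rough 6DOF pose estimate.
import numpy as np
from scipy.spatial.transform import Rotation as R

def integrate_imu(gyro, accel, dt, gravity=np.array([0.0, 0.0, -9.81])):
    """gyro, accel: (N, 3) samples in the device frame; dt: sample period (s)."""
    rot = R.identity()
    vel = np.zeros(3)
    pos = np.zeros(3)
    for w, a in zip(gyro, accel):
        rot = rot * R.from_rotvec(w * dt)        # integrate angular velocity
        a_world = rot.apply(a) + gravity         # remove gravity in the world frame
        vel += a_world * dt
        pos += vel * dt
    return pos, rot.as_matrix()                  # translation + rotation = 6DOF pose

# Example: 1 s of 400 Hz samples from a device at rest, rotating slowly about Z.
gyro = np.tile([0.0, 0.0, 0.1], (400, 1))
accel = np.tile([0.0, 0.0, 9.81], (400, 1))
print(integrate_imu(gyro, accel, 1.0 / 400.0))
```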
For example, while the image acquisition device acquires the image data, a three-dimensional point cloud model or three-dimensional dense model covering the target scene may be generated gradually as the image acquisition device moves; alternatively, after the image acquisition device finishes acquiring the image data, the complete image data may be used to generate the three-dimensional point cloud model or three-dimensional dense model corresponding to the target scene.
In another embodiment of the present disclosure, the image data may include a panoramic video. In the case that the image data includes a panoramic video, the three-dimensional model of the target scene may be generated based on a panoramic video obtained by a panoramic camera performing image acquisition on the target scene. The panoramic camera may be implemented as two fisheye cameras arranged at the front and back of a scanner; the fisheye cameras are mounted on the scanner in preset poses so that the acquired panoramic video covers the complete target scene.
Referring to fig. 3, which is a flowchart of a specific manner in which a data processing device generates a three-dimensional dense model from a panoramic video obtained by a panoramic camera performing image acquisition on a target scene, provided by an embodiment of the present disclosure, wherein:
S301, the data processing device acquires two time-synchronized panoramic videos acquired in real time by the front and rear fisheye cameras of the scanner.
Each of the two panoramic videos includes multiple frames of video frame images. Because the two fisheye cameras acquire the two panoramic videos in real time and in time synchronization, the timestamps of the video frame images included in the two panoramic videos correspond to each other.
In addition, the precision of the timestamps and the acquisition frequency of the video frame images in the panoramic videos can be determined according to the specific instrument parameters of the two fisheye cameras; for example, the timestamps of the video frame images may be accurate to the nanosecond, and the acquisition frequency of the video frame images may be not lower than 30 hertz (Hz).
S302, the data processing equipment determines relevant data of the inertial measurement unit IMU when the two fisheye cameras respectively acquire the panoramic video.
Taking either one of the two fisheye cameras as an example, when the fisheye camera acquires the video frame images in the panoramic video, the relevant data of the inertial measurement unit IMU between two adjacent video frames, together with the timestamp at which the relevant data is acquired, can be correspondingly observed and recorded. In particular, a corresponding scanner coordinate system (which may be constituted, for example, by an X-axis, a Y-axis and a Z-axis) may also be determined for the fisheye camera, so as to determine the relevant data of the inertial measurement unit IMU in the scanner coordinate system, such as the accelerations and angular velocities along the X-axis, the Y-axis and the Z-axis of the scanner coordinate system.
In addition, the time stamp for acquiring the relevant data of the inertial measurement unit IMU can be determined according to the specific instrument parameters of the two fisheye cameras. For example, it may be determined that the observation frequency for acquiring the relevant data of the inertial measurement unit IMU is not lower than 400 Hz.
S303, the data processing equipment determines the poses of the two fisheye cameras in the world coordinate system based on the relevant data of the inertial measurement unit IMU.
Specifically, since the coordinate system transformation relationship between the scanner coordinate system and the world coordinate system can be determined, after the relevant data of the inertial measurement unit IMU is acquired, the poses of the two fisheye cameras in the world coordinate system can be determined according to this coordinate system transformation relationship; the poses may be expressed, for example, as 6-degree-of-freedom (6DOF) poses. For determining the poses of the two fisheye cameras in the world coordinate system according to the coordinate system transformation relationship between the scanner coordinate system and the world coordinate system, an existing coordinate system transformation method may be adopted, and details are not repeated here.
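As an illustrative, non-limiting sketch of such a coordinate system transformation, the following Python code chains a fixed scanner-to-world transformation with a pose expressed in the scanner coordinate system, both represented as 4x4 homogeneous matrices; the matrix names and example values are assumptions made for the example.

```python
# Minimal sketch (assumed representation): mapping a pose from the scanner
# coordinate system into the world coordinate system; the 6DOF pose is kept as
# a rotation matrix plus a translation vector inside a 4x4 matrix.
import numpy as np

def to_homogeneous(rotation: np.ndarray, translation: np.ndarray) -> np.ndarray:
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T

def scanner_pose_to_world(T_world_scanner: np.ndarray, T_scanner_cam: np.ndarray) -> np.ndarray:
    """Chain the fixed scanner->world transform with the camera pose in the scanner frame."""
    return T_world_scanner @ T_scanner_cam

# Example: scanner frame rotated 90 degrees about Z relative to the world frame.
Rz = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
T_world_scanner = to_homogeneous(Rz, np.array([1.0, 0.0, 0.0]))
T_scanner_cam = to_homogeneous(np.eye(3), np.array([0.0, 0.5, 0.0]))
T_world_cam = scanner_pose_to_world(T_world_scanner, T_scanner_cam)
print(T_world_cam[:3, 3])   # camera position in the world frame
```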
For the above S301 to S303, since the video frame images in the panoramic video are all panoramic images, the 6DOF pose of the image acquisition device can be accurately solved through the processing steps of image processing, key point extraction, key point tracking, and establishing association relationships between key points; that is, the 6DOF pose of the image acquisition device can be captured and calculated in real time. Moreover, the coordinates of dense point cloud points on the target objects can also be obtained.
When the video frame images in the panoramic video are processed, key frame images can be determined from the multiple video frame images in the panoramic video, so that during the three-dimensional dense reconstruction a sufficient amount of data to be processed is ensured while the amount of computation is reduced and efficiency is improved.
Specifically, the manner of determining the key frame image from the panoramic video may be, for example, but not limited to, at least one of the following manners E1 to E4:
E1, extracting at least one video frame image from the panoramic video as a key frame image by sampling every other frame.
E2, extracting at least one video frame image from the panoramic video as a key frame image at a preset frequency, i.e., extracting a preset number of video frame images per preset time period.
The preset frequency may include, for example, but is not limited to, two frames per second; a simple sketch of this frequency-based selection is given after the description of manners E1 to E4 below.
E3, recognizing the content of each video frame image in the panoramic video by using technologies such as image processing algorithms, image analysis algorithms and Natural Language Processing (NLP), determining the semantic information corresponding to each video frame image, and extracting the video frame images that include the target object as key frame images based on the semantic information corresponding to each video frame image.
E4, determining key frame images in the panoramic video in response to the selection of the video frame images in the panoramic video.
In a specific implementation, the panoramic video of the target scene can be presented to a user, and while the panoramic video is presented, some of its video frame images can be taken as key frame images in response to a selection operation performed by the user on those video frame images.
Illustratively, when the panoramic video is presented to the user, a prompt for selecting key frame images may be displayed. Specifically, a video frame image in the panoramic video may be selected in response to a specific user operation such as a long press or a double click, and the selected video frame image is taken as a key frame image. In addition, prompt information may be displayed, for example, a message containing the text "long-press a video frame image to select it"; when it is received that the user performs a long-press operation on any video frame image in the panoramic video, that video frame image is taken as a key frame image.
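As an illustrative, non-limiting sketch of the frequency-based selection of manners E1 and E2, the following Python code keeps a fixed fraction of the video frame images; the 30 Hz frame rate and the two-frames-per-second extraction rate are the example values mentioned above, not mandated parameters.

```python
# Minimal sketch of frequency-based key frame selection (manners E1/E2).
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")

def select_keyframes(frames: Sequence[Frame], video_fps: float = 30.0,
                     keyframes_per_second: float = 2.0) -> List[Frame]:
    """Keep roughly `keyframes_per_second` frames out of a `video_fps` stream."""
    step = max(1, int(round(video_fps / keyframes_per_second)))
    return [frame for i, frame in enumerate(frames) if i % step == 0]

# Example: 90 frames (3 s of 30 Hz video) -> 6 key frames.
print(len(select_keyframes(list(range(90)))))
```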
After the key frame images in the panoramic video are determined, a key frame image map may be stored in the background, so that when the image acquisition device is controlled to return to a previously acquired position, the two video frame images at that position can be compared to perform loop-closure detection on the image acquisition device, thereby correcting the accumulated positioning error of the image acquisition device under long-duration, long-distance operation.
S304, the data processing device takes the key frame images in the panoramic videos respectively acquired by the two fisheye cameras, together with the poses of the corresponding fisheye cameras, as input data of a real-time dense reconstruction algorithm.
For example, for the panoramic video acquired by either fisheye camera, after a new key frame image is determined in the panoramic video by using the above S301 to S303, all currently obtained key frame images and the fisheye camera pose corresponding to the new key frame image are taken as input data of the real-time dense reconstruction algorithm.
Key frame images that were already input to the real-time dense reconstruction algorithm, together with the corresponding fisheye camera poses, before the new key frame image was obtained are not input again, so that key frame images are not fed in repeatedly.
S305, the data processing equipment processes the input data by using a real-time dense reconstruction algorithm to obtain three-dimensional dense data corresponding to the target scene.
Exemplarily, the resulting three-dimensional dense data may include, for example, but is not limited to: a plurality of dense points located on the surfaces of a plurality of target objects in the target scene, and position information corresponding to each dense point.
S306, performing semantic segmentation processing on the three-dimensional dense data corresponding to the target scene to obtain semantic information corresponding to a plurality of patches.
In a specific implementation, for the relevant description of performing semantic segmentation on the multiple dense points on the surfaces of the multiple target objects in the target scene in S306 to obtain semantic information corresponding to each dense point, reference may be made to the relevant description in the specific implementation manner shown in B1 in S102 in this disclosure, and repeated details are not repeated.
After determining the semantic information corresponding to each dense point, the semantic information of a patch formed by dense points with adjacent positions and the same semantic information may be determined based on the position information corresponding to each dense point and the semantic information corresponding to each dense point.
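As an illustrative, non-limiting sketch of assigning patch-level semantic information from the dense points, the following Python code gives each triangular patch the label shared by (or, failing that, most common among) its vertex points; the label strings and face indices are hypothetical.

```python
# Minimal sketch (an assumed rule): a patch inherits the semantic label of its
# dense-point vertices, falling back to a majority vote when they disagree.
from collections import Counter
from typing import Dict, Sequence, Tuple

def patch_semantics(faces: Sequence[Tuple[int, int, int]],
                    point_labels: Dict[int, str]) -> Dict[Tuple[int, int, int], str]:
    result = {}
    for face in faces:
        labels = [point_labels[idx] for idx in face]
        result[face] = Counter(labels).most_common(1)[0][0]   # majority label
    return result

faces = [(0, 1, 2), (2, 3, 4)]
point_labels = {0: "cabinet", 1: "cabinet", 2: "cabinet", 3: "floor", 4: "floor"}
print(patch_semantics(faces, point_labels))  # {(0,1,2): 'cabinet', (2,3,4): 'floor'}
```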
S307, generating a three-dimensional dense model of the target scene based on the three-dimensional dense data and the semantic information.
When the three-dimensional dense model is generated, the dense point cloud may, for example but not limited to, be continuously updated and expanded as the acquisition of the panoramic video proceeds. The update frequency may be determined according to the frequency at which the key frame images and the fisheye camera poses are input into the real-time dense reconstruction algorithm.
For the above S304 to S307, when the real-time dense reconstruction algorithm is adopted, the dense depth maps corresponding to the keyframe images may be estimated by using a dense stereo matching technique, and the dense depth maps are fused into a three-dimensional dense model by using the pose of the corresponding fisheye camera, so that the three-dimensional dense model of the target scene is obtained after the target scene is completely acquired.
A dense depth map is also called a range image; unlike a gray-scale image, in which each pixel stores a brightness value, each pixel of a depth map stores the distance between that pixel and the image acquisition device, i.e., the depth value. Because the depth value depends only on distance and not on factors such as the environment, lighting or viewing direction, the dense depth map can truly and accurately represent the geometric depth information of the scene, so that a three-dimensional dense model representing the real target scene can be generated based on the dense depth map. In addition, in view of the limited resolution of the device, image enhancement processing such as denoising or inpainting may be performed on the dense depth map, so as to provide a high-quality dense depth map for the three-dimensional reconstruction.
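As an illustrative, non-limiting sketch of such enhancement, the following Python code applies a median filter to a depth map and fills isolated missing pixels from their neighbourhood; the filter size and the convention that 0 marks a missing measurement are assumptions made for the example.

```python
# Minimal sketch of one simple depth-map enhancement option (not a prescribed step).
import numpy as np
from scipy.ndimage import median_filter

def enhance_depth(depth: np.ndarray) -> np.ndarray:
    """depth: HxW array in metres, with 0 marking missing measurements."""
    smoothed = median_filter(depth, size=3)
    filled = np.where(depth > 0, depth, smoothed)   # keep valid pixels, fill holes
    return median_filter(filled, size=3)            # light final smoothing

depth = np.full((8, 8), 2.0)
depth[3, 3] = 0.0          # a missing pixel
depth[5, 5] = 9.0          # an outlier
print(enhance_depth(depth)[3, 3], enhance_depth(depth)[5, 5])  # both near 2.0
```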
In a possible case, for the processed key frame image, by using the pose of the image capturing device corresponding to the key frame image and the pose of the image capturing device corresponding to the new key frame image adjacent to the key frame image, it can be determined whether the pose of the image capturing device at the time of capturing the target scene is adjusted. If the pose is not adjusted, continuously performing real-time three-dimensional dense reconstruction on the target scene to obtain a three-dimensional dense model; and if the pose is adjusted, correspondingly adjusting the dense depth map according to the pose adjustment, and performing real-time three-dimensional dense reconstruction on the target object based on the adjusted dense depth map so as to obtain an accurate three-dimensional dense model.
D2, the image acquisition device has computing power capable of performing data processing on the image data, and after the image data are acquired, the image data are subjected to data processing by using the computing power of the image acquisition device, so that a three-dimensional model corresponding to the target scene is obtained.
Here, the specific manner of generating the three-dimensional model of the target scene by the image capturing device based on the image data may refer to the description of D1, and repeated descriptions are omitted.
Following the above S102, the data annotation method provided in the embodiment of the present disclosure further includes:
S103, determining a target three-dimensional position of the label information in a model coordinate system corresponding to the three-dimensional model based on the two-dimensional labeling position of the label information in the image data.
In specific implementation, after a three-dimensional model of a target scene is obtained, a target image marked with label information can be determined in at least one frame of image included in image data; determining a target pixel point corresponding to the target image at the two-dimensional labeling position of the label information; based on the target image and the pose of the image acquisition equipment when acquiring the target image, carrying out three-dimensional position recovery on the target image to obtain the three-dimensional position of a target pixel point under a model coordinate system; and determining the target three-dimensional position of the label information in the model coordinate system corresponding to the three-dimensional model based on the three-dimensional position of the target pixel point in the model coordinate system.
In a specific implementation, as seen from the specific implementation shown in S101, in the data labeling, labeling is usually performed on one or more frames of images in the video data to obtain the tag information. Therefore, the target image marked with the label information can be directly determined in the image data.
In addition, when the annotation is performed, corresponding data annotation can be carried out by triggering a certain pixel point of the image data in the graphical user interface. Therefore, after the target image marked with the label information is determined, the target pixel point corresponding to the two-dimensional labeling position of the label information in the target image can be determined in response to the labeling operation of the user.
Here, the target three-dimensional position of the tag information in the model coordinate system corresponding to the three-dimensional model is the target three-dimensional position of the tag information in the target scene; the target three-dimensional position of the label information in a model coordinate system corresponding to the three-dimensional model can be determined according to the three-dimensional position of the target pixel point in the target scene.
In a possible case, if the target pixel point is a pixel point in a panoramic image acquired by the panoramic camera, the depth value of the target pixel point in the panoramic image can be correspondingly determined. In this way, using the three-dimensional position information of the target pixel point in the camera coordinate system (here including the abscissa, the ordinate and the depth value of the target pixel point) and the determined coordinate system conversion relationship, the three-dimensional position of the target pixel point in the target image can be directly recovered, so as to determine the three-dimensional position of the target pixel point in the target scene.
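As an illustrative, non-limiting sketch of this recovery for a panoramic pixel, the following Python code assumes an equirectangular projection (the embodiment does not mandate a particular panoramic projection) and maps a pixel with a known depth value into the model coordinate system via a camera-to-model transformation.

```python
# Minimal sketch (assuming an equirectangular panorama): recovering the 3D
# position of an annotated pixel from its pixel coordinates and depth value.
import numpy as np

def panorama_pixel_to_model(u: float, v: float, depth: float,
                            width: int, height: int,
                            T_model_cam: np.ndarray) -> np.ndarray:
    lon = (u / width - 0.5) * 2.0 * np.pi          # longitude in [-pi, pi]
    lat = (0.5 - v / height) * np.pi               # latitude in [-pi/2, pi/2]
    direction = np.array([np.cos(lat) * np.sin(lon),
                          np.sin(lat),
                          np.cos(lat) * np.cos(lon)])
    p_cam = np.append(direction * depth, 1.0)      # homogeneous camera-frame point
    return (T_model_cam @ p_cam)[:3]

# Example: the pixel at the panorama centre, 3 m away, identity camera pose.
print(panorama_pixel_to_model(2048, 1024, 3.0, 4096, 2048, np.eye(4)))  # ~[0, 0, 3]
```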
In another possible case, if the target pixel point is a pixel point in the two-dimensional image, for example, a three-dimensional position of the target pixel point in the target scene may be determined by determining whether the target pixel point has a corresponding point cloud point in a three-dimensional model of the target scene; specifically, if the target pixel point has a corresponding point cloud point in the three-dimensional model of the target scene, the three-dimensional position information of the point cloud point corresponding to the camera coordinate system may be used as the three-dimensional coordinate information of the target pixel point corresponding to the camera coordinate system. In a similar manner to that in the above example, the three-dimensional position of the target pixel point may be restored to determine the three-dimensional position of the target pixel point in the target scene.
The specific manner of determining the three-dimensional position of the target pixel point in the target scene may be determined according to actual conditions, and is not specifically limited herein.
After the three-dimensional position of the target pixel point in the target scene is determined, the target three-dimensional position of the label information in the target scene can be determined through the three-dimensional position of the target pixel point in the target scene.
For example, the target three-dimensional position of the tag information in the target scene may be determined based on the three-dimensional position of the target pixel point in the target scene according to, but not limited to, at least one of the following F1 to F2:
F1, in the case that there is a single target pixel point, the three-dimensional position of the target pixel point in the target scene can be directly determined as the target three-dimensional position of the label information in the target scene; that is, the three-dimensional position of the target pixel point in the model coordinate system is determined as the target three-dimensional position of the label information in the model coordinate system corresponding to the three-dimensional model.
F2, in the case that there are a plurality of target pixel points, the target three-dimensional position of the label information in the target scene is calculated based on the three-dimensional positions of the target pixel points in the target scene.
Exemplarily, an average of the three-dimensional positions of the plurality of target pixel points in the target scene may be calculated, and this average three-dimensional position is determined as the target three-dimensional position of the label information in the target scene, that is, as the target three-dimensional position of the label information in the model coordinate system corresponding to the three-dimensional model. In a possible implementation, the three-dimensional positions of the target pixel points in the target scene may be weighted and summed, the average three-dimensional position may be calculated based on the result of the weighted summation, and that average three-dimensional position is determined as the target three-dimensional position of the label information in the target scene.
Specifically, based on the three-dimensional position of the target pixel point in the target scene, the manner of calculating the target three-dimensional position of the obtained tag information in the target scene may be set according to implementation requirements, and is not specifically limited herein.
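As an illustrative, non-limiting sketch of manner F2, the following Python code averages the three-dimensional positions of several target pixel points, optionally with weights; the weights (e.g., per-point confidences) are an assumption made for the example.

```python
# Minimal sketch of manner F2: plain or weighted averaging of target pixel positions.
from typing import Optional
import numpy as np

def tag_position(points: np.ndarray, weights: Optional[np.ndarray] = None) -> np.ndarray:
    """points: (N, 3) 3D positions of the target pixel points in the model frame."""
    if weights is None:
        return points.mean(axis=0)                      # plain average
    w = np.asarray(weights, dtype=float)
    return (points * w[:, None]).sum(axis=0) / w.sum()  # weighted average

pts = np.array([[1.0, 0.0, 2.0], [1.2, 0.1, 2.1], [0.8, -0.1, 1.9]])
print(tag_position(pts))                                # unweighted
print(tag_position(pts, np.array([0.2, 0.6, 0.2])))     # weighted toward the second point
```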
Following the above S103, the data annotation method provided in the embodiment of the present disclosure further includes:
S104, adding label information corresponding to the target three-dimensional position to the three-dimensional model based on the target three-dimensional position.
In a specific implementation, since the three-dimensional model includes three-dimensional submodels corresponding to the target objects in the target scene, after the target three-dimensional position of the tag information in the model coordinate system corresponding to the three-dimensional model is determined based on the step S103, the target object to which the tag information is to be added can be determined based on the target three-dimensional position and the poses of the three-dimensional submodels corresponding to the target objects in the three-dimensional model in the model coordinate system; and establishing an association relation between the target object to be added with the label information and the label information.
For example, a target point cloud point whose distance from the target three-dimensional position is less than or equal to a preset distance threshold may be determined from the plurality of point cloud points in the three-dimensional submodels corresponding to the target objects in the three-dimensional model, based on the target three-dimensional position and the position information of each point cloud point in those three-dimensional submodels; the target object to which the target point cloud point belongs is then determined as the target object to which the label information is to be added. Here, the preset distance threshold may be set according to actual requirements and is not specifically limited; for example, if the preset distance threshold is 0, the point cloud point located exactly at the target three-dimensional position is taken as the target point cloud point, and the target object to which that point cloud point belongs is taken as the target object to which the label information is to be added.
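As an illustrative, non-limiting sketch of this matching, the following Python code finds the nearest point cloud point to the target three-dimensional position and returns the identifier of the target object owning it when the distance is within the threshold; the object identifiers and the threshold value are assumptions made for the example.

```python
# Minimal sketch of distance-threshold matching between the tag position and
# the point cloud points of the three-dimensional submodels.
from typing import Optional
import numpy as np

def find_target_object(target_pos: np.ndarray, cloud_points: np.ndarray,
                       object_ids: np.ndarray, threshold: float = 0.05) -> Optional[int]:
    """cloud_points: (N, 3); object_ids: (N,) id of the sub-model each point belongs to."""
    dists = np.linalg.norm(cloud_points - target_pos, axis=1)
    nearest = int(np.argmin(dists))
    return int(object_ids[nearest]) if dists[nearest] <= threshold else None

cloud = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.02, 0.0, 0.0]])
ids = np.array([0, 1, 1])                      # point -> target object id
print(find_target_object(np.array([1.01, 0.0, 0.0]), cloud, ids))  # 1
```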
After the target object to which the label information is to be added is determined, an association relationship between the label information and the three-dimensional submodel or vector model corresponding to that target object can be established. The association relationship may include, for example, the correspondence between the label information and the three-dimensional submodel or vector model corresponding to the target object to which the label information is to be added, and the relative positional relationship between the label information and that three-dimensional submodel or vector model.
Specifically, the three-dimensional labeling position of the label information on the target object to which the label information is to be added can be determined based on the target three-dimensional position, and an association relationship between the three-dimensional labeling position and the label information is established.
For example, the target three-dimensional position may be used as a three-dimensional labeling position of the tag information on the target object to which the tag information is to be added; and adding label information on the three-dimensional labeling position.
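As an illustrative, non-limiting sketch, the following Python data structure records such an association relation; the field names and example values are hypothetical and not a format defined by this embodiment.

```python
# Minimal sketch (hypothetical data layout) of a tag-to-submodel association.
from dataclasses import dataclass, field
from typing import Dict, Tuple

Vec3 = Tuple[float, float, float]

@dataclass
class TagAssociation:
    object_id: str                 # id of the three-dimensional submodel / vector model
    labeling_position: Vec3        # three-dimensional labeling position in the model frame
    relative_position: Vec3        # position relative to the submodel origin
    tag_info: Dict[str, str] = field(default_factory=dict)

assoc = TagAssociation(
    object_id="cabinet_01",
    labeling_position=(1.2, 0.4, 0.9),
    relative_position=(0.1, 0.0, 0.5),
    tag_info={"name": "cabinet", "defect": "crack at left side cabinet door"},
)
print(assoc.object_id, assoc.tag_info["name"])
```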
In another embodiment of the present disclosure, a three-dimensional model of a target scene and label information may be correspondingly presented. Specifically, the following manner may be adopted: acquiring a display material; generating a label instance based on the display material and the label information; and in response to the label display event being triggered, showing the three-dimensional model of the target scene and the label instance.
The presentation material may include, for example, a new interface displayed on the graphical user interface, or pop-up information. The tag information may include, for example, at least one of tag attribute information and/or tag content information. The tag attribute information may include at least one of tag size information, tag color information, tag shape information and the like, where the tag size information may be determined according to the displayable size of the graphical user interface, the tag color information may be changed according to the selection of the user, and the tag shape information may include, for example, a text box or a table. The tag content information may include, for example, but is not limited to, at least one of: attribute information of the target object such as its name, service life, specific function, the person responsible for the equipment, the equipment manufacturer, the equipment size specification and relevant text remarks; defect detection result information of the target object; and fault maintenance condition information of the target object. The defect detection result information of the target object may include, but is not limited to, at least one of the position of the defect on the target object, the defect condition of the target object, and the like, for example, a crack at the left side cabinet door of the cabinet.
After the tag attribute information is determined, the style used for rendering and displaying the three-dimensional model of the target scene and the tag information can be determined; for example, when the three-dimensional model of the target scene is displayed, a wire frame with an outline of a certain color may be shown, and when the tag information is displayed, a tag in the form of a text box may be shown, in which the name of the target object, the position of the defect on the target object and the defect condition of the target object can be displayed. The specific style may be determined according to the actual situation, and details are not described here.
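As an illustrative, non-limiting sketch of generating a tag instance from the presentation material and the tag information, the following Python code uses a hypothetical data layout; the default values and field names are assumptions made for the example.

```python
# Minimal sketch (hypothetical structures): combining display material with tag
# information into a tag instance, including a rendering style derived from the
# tag attribute information (size, color, shape).
from dataclasses import dataclass
from typing import Dict

@dataclass
class TagInstance:
    material: str                 # e.g. "popup" or "new_interface"
    content: Dict[str, str]       # tag content information
    style: Dict[str, str]         # tag attribute information used for rendering

def build_tag_instance(material: str, tag_info: Dict[str, Dict[str, str]]) -> TagInstance:
    attrs = tag_info.get("attributes", {})
    style = {
        "size": attrs.get("size", "auto"),        # fit to the displayable GUI size
        "color": attrs.get("color", "#ffcc00"),
        "shape": attrs.get("shape", "text_box"),  # text box or table
    }
    return TagInstance(material=material, content=tag_info.get("content", {}), style=style)

instance = build_tag_instance("popup", {
    "attributes": {"color": "#00aaff", "shape": "text_box"},
    "content": {"name": "control cabinet", "defect": "crack at left side cabinet door"},
})
print(instance.style["shape"], instance.content["name"])
```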
Illustratively, a three-dimensional model of the target scene, and tag instances may be presented using, but not limited to, at least one of G1-G2:
G1, the tag display event includes: the target object to which the tag information has been added being triggered. In this case, the three-dimensional model of the target scene and the tag instance corresponding to the triggered target object can be displayed.
Illustratively, in response to a trigger of a user on a graphical user interface for a target object, if the target object matches with corresponding tag information, a corresponding tag instance may be displayed while a three-dimensional model of the target scene and a three-dimensional sub-model corresponding to the target object are displayed. In this way, the user can view the tag instance by triggering the corresponding position.
G2, the tag display event includes: the three-dimensional labeling position for which the association relationship with the tag information has been established being displayed in the graphical user interface. In this case, the three-dimensional model of the target scene and the tag instance having the association relationship with that three-dimensional labeling position can be displayed.
Illustratively, in response to a user inputting any three-dimensional position, that three-dimensional position is taken as the target three-dimensional position corresponding to the tag information; if tag information exists at or near that position, the three-dimensional model of the target scene and the tag instance are displayed correspondingly.
Through at least one of the above G1 to G2, the data labeled in the data labeling stage can be checked and verified via the displayed tag instances, thereby improving the accuracy of data labeling.
A specific display interface for displaying a three-dimensional model with a target scene and a tag instance may be as shown in fig. 4, where the target scene in fig. 4 includes a machine room, and the machine room includes two control cabinets and one cabinet; the illustration in FIG. 4 shows an example of a tag containing "control cabinets" at a corresponding location for each control cabinet, and an example of a tag containing "cabinets" at a corresponding location for cabinets; here, after triggering the tag instance corresponding to each control cabinet or cabinet, the detailed attribute information such as the service life, the specific function, the equipment responsible person, the equipment manufacturer, the equipment size specification, the relevant text remark, etc. of the target object corresponding to each control cabinet or cabinet, the defect detection result information of the target object, and the fault maintenance condition information of the target object may be obtained.
In the embodiment of the disclosure, a three-dimensional model of a target scene is constructed, and a target three-dimensional position of label information in a two-dimensional image under a model coordinate system corresponding to the three-dimensional model is determined based on a two-dimensional labeling position of the label information in image data; and adding label information to the three-dimensional model based on the three-dimensional position of the target, so that the generated three-dimensional model of the target scene carries the label information, the data labeling result corresponding to each target object in the target scene is conveniently and visually displayed through the three-dimensional model of the target scene carrying the label information, and the control and maintenance of the target object in the target scene are conveniently performed on the basis of the data labeling result.
It will be understood by those skilled in the art that, in the method of the present disclosure, the order in which the steps are written does not imply a strict order of execution or constitute any limitation on the implementation; the specific order of execution of the steps should be determined by their functions and possible inherent logic.
Based on the same inventive concept, a data labeling device corresponding to the data labeling method is also provided in the embodiments of the present disclosure, and as the principle of solving the problem of the device in the embodiments of the present disclosure is similar to the data labeling method in the embodiments of the present disclosure, the implementation of the device can refer to the implementation of the method, and repeated details are not repeated.
Referring to fig. 5, a schematic diagram of a data annotation device provided in an embodiment of the present disclosure is shown, where the device includes: an acquisition module 501, a first processing module 502, a determination module 503 and a second processing module 504; wherein:
an obtaining module 501, configured to obtain image data to be processed; the image data carries label information, and the image data comprises a video or an image obtained by image acquisition of a target scene; a first processing module 502, configured to perform three-dimensional reconstruction on the target scene based on the image data, so as to obtain a three-dimensional model of the target scene; a determining module 503, configured to determine, based on the two-dimensional labeling position of the tag information in the image data, a target three-dimensional position of the tag information in a model coordinate system corresponding to the three-dimensional model; a second processing module 504, configured to add, to the three-dimensional model, tag information corresponding to the target three-dimensional position based on the target three-dimensional position.
In an optional implementation manner, when obtaining the tag information carried in the image data, the obtaining module 501 is specifically configured to: perform semantic segmentation processing on the image in the image data, and generate the tag information based on the result of the semantic segmentation processing.
In an optional implementation manner, when obtaining the tag information carried in the image data, the obtaining module 501 is specifically configured to: receive the tag information sent by the terminal device; the tag information is generated by the terminal device in response to a labeling operation on the image in the image data.
In an optional implementation manner, when determining the target three-dimensional position of the label information in the model coordinate system corresponding to the three-dimensional model based on the two-dimensional labeling position of the label information in the image data, the determining module 503 is specifically configured to: determine a target image marked with the label information in at least one frame of image included in the image data; determine a target pixel point corresponding to the two-dimensional labeling position of the label information in the target image; perform three-dimensional position recovery on the target image based on the target image and the pose of the image acquisition device when acquiring the target image, to obtain the three-dimensional position of the target pixel point in the model coordinate system; and determine the target three-dimensional position of the label information in the model coordinate system corresponding to the three-dimensional model based on the three-dimensional position of the target pixel point in the model coordinate system.
In an optional implementation manner, the three-dimensional model includes three-dimensional submodels corresponding to respective target objects in the target scene; when adding the tag information corresponding to the target three-dimensional position to the three-dimensional model based on the target three-dimensional position, the second processing module 504 is specifically configured to: determine a target object to which the tag information is to be added, based on the target three-dimensional position and the poses of the three-dimensional submodels corresponding to the target objects in the three-dimensional model in the model coordinate system; and establish an association relationship between the target object to which the tag information is to be added and the tag information.
In an optional implementation manner, when establishing the association relationship between the target object to which the tag information is to be added and the tag information, the second processing module 504 is specifically configured to: determine the three-dimensional labeling position of the tag information on the target object to which the tag information is to be added, based on the target three-dimensional position; and establish an association relationship between the three-dimensional labeling position and the tag information.
In an alternative embodiment, the apparatus further comprises: the display module is used for acquiring display materials; generating a label instance based on the display material and the label information; and in response to a label display event being triggered, showing the three-dimensional model of the target scene and the label instance.
In an alternative embodiment, the tag display event comprises: the target object to which the tag information has been added being triggered; when displaying the three-dimensional model of the target scene and the tag instance, the display module is specifically configured to: display the three-dimensional model of the target scene and the tag instance corresponding to the triggered target object.
In an alternative embodiment, the tag display event comprises: the three-dimensional labeling position for which the association relationship with the tag information has been established being displayed in a graphical user interface; when displaying the three-dimensional model of the target scene and the tag instance, the display module is specifically configured to: display the three-dimensional model of the target scene and the tag instance having the association relationship with the three-dimensional labeling position.
In an alternative embodiment, the tag information includes at least one of: tag attribute information, and/or tag content information; wherein the tag attribute information includes at least one of: label size information, label color information, label shape information; the tag content information includes at least one of: attribute information of the corresponding target object, defect inspection result information of the corresponding target object, and trouble shooting condition information of the corresponding target object.
In an alternative embodiment, the three-dimensional model comprises a three-dimensional point cloud model; the first processing module 502 is specifically configured to, when the three-dimensional reconstruction is performed on the target scene based on the image data to obtain a three-dimensional model of the target scene: performing three-dimensional point cloud reconstruction on the target scene based on the image data and the pose of the image acquisition equipment when acquiring the image data to obtain point cloud data of the target scene; the point cloud data includes: point cloud points corresponding to a plurality of target objects in the target scene respectively and position information corresponding to the point cloud points respectively; performing semantic segmentation processing on the point cloud data to obtain semantic information corresponding to a plurality of point cloud points respectively; generating a three-dimensional point cloud model of the target scene based on the point cloud data and the semantic information; the three-dimensional point cloud model of the target scene comprises three-dimensional sub-point cloud models corresponding to the target objects respectively.
In an alternative embodiment, the three-dimensional model comprises a three-dimensional dense model; the first processing module 502 is specifically configured to, when the three-dimensional reconstruction is performed on the target scene based on the image data to obtain a three-dimensional model of the target scene: performing three-dimensional dense reconstruction on the target scene based on the image data and the pose of the image acquisition equipment when acquiring the image data to obtain three-dimensional dense data of the target scene; the three-dimensional dense data includes: a plurality of dense points on the surfaces of a plurality of target objects in the target scene and position information corresponding to each dense point; performing semantic segmentation processing on the three-dimensional dense data to obtain semantic information respectively corresponding to a plurality of patches formed by the dense points; generating a three-dimensional dense model of the target scene based on the three-dimensional dense data and the semantic information; the three-dimensional dense model of the target scene comprises three-dimensional sub dense models corresponding to the target objects respectively.
In an alternative embodiment, the target object comprises at least one of: a building located within the target scene, and a device deployed within the target scene.
The description of the processing flow of each module in the device and the interaction flow between the modules may refer to the related description in the above method embodiments, and will not be described in detail here.
Based on the same technical concept, an embodiment of the present application also provides a computer device. Referring to fig. 6, which is a schematic structural diagram of a computer device 600 provided in an embodiment of the present application, the computer device includes a processor 601, a memory 602 and a bus 603. The memory 602 is used for storing execution instructions and includes an internal memory 6021 and an external memory 6022. The internal memory 6021 is used for temporarily storing operation data of the processor 601 and data exchanged with the external memory 6022 such as a hard disk; the processor 601 exchanges data with the external memory 6022 through the internal memory 6021. When the computer device 600 operates, the processor 601 communicates with the memory 602 through the bus 603, so that the processor 601 executes the following instructions:
acquiring image data to be processed; the image data carries tag information, and the image data comprises a video or an image obtained by carrying out image acquisition on a target scene; performing three-dimensional reconstruction on the target scene based on the image data to obtain a three-dimensional model of the target scene; determining a target three-dimensional position of the label information in a model coordinate system corresponding to the three-dimensional model based on the two-dimensional labeling position of the label information in the image data; and adding label information corresponding to the target three-dimensional position to the three-dimensional model based on the target three-dimensional position.
The specific processing flow of the processor 601 may refer to the description of the above method embodiment, and is not described herein again.
The embodiments of the present disclosure also provide a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program performs the steps of the data annotation method in the above method embodiments. The storage medium may be a volatile or non-volatile computer-readable storage medium.
The embodiments of the present disclosure also provide a computer program product, where the computer program product carries a program code, and instructions included in the program code may be used to execute the steps of the data labeling method in the foregoing method embodiments, which may be referred to specifically in the foregoing method embodiments, and are not described herein again.
The computer program product may be implemented by hardware, software or a combination thereof. In an alternative embodiment, the computer program product is embodied in a computer storage medium, and in another alternative embodiment, the computer program product is embodied in a Software product, such as a Software Development Kit (SDK), or the like.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. In the several embodiments provided in the present disclosure, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that: the above-mentioned embodiments are merely specific embodiments of the present disclosure, which are used for illustrating the technical solutions of the present disclosure and not for limiting the same, and the scope of the present disclosure is not limited thereto, and although the present disclosure is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive of the technical solutions described in the foregoing embodiments or equivalent technical features thereof within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present disclosure, and should be construed as being included therein. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (16)

1. A data annotation method is characterized by being applied to a server; the method comprises the following steps:
acquiring image data to be processed; the image data carries tag information, and the image data comprises a video or an image obtained by carrying out image acquisition on a target scene;
performing three-dimensional reconstruction on the target scene based on the image data to obtain a three-dimensional model of the target scene;
determining a target three-dimensional position of the label information in a model coordinate system corresponding to the three-dimensional model based on the two-dimensional labeling position of the label information in the image data;
and adding label information corresponding to the target three-dimensional position to the three-dimensional model based on the target three-dimensional position.
2. The method according to claim 1, wherein the tag information carried by the image data is obtained according to the following steps:
and performing semantic segmentation processing on the image in the image data, and generating the label information based on the result of the semantic segmentation processing.
3. The method of claim 1, wherein obtaining the tag information carried by the image data comprises:
receiving the label information sent by the terminal equipment; the tag information is generated by the terminal device in response to the labeling operation of the image in the image data.
4. The method according to any one of claims 1 to 3, wherein the determining the target three-dimensional position of the tag information in the model coordinate system corresponding to the three-dimensional model based on the two-dimensional labeling position of the tag information in the image data comprises:
determining a target image marked with the label information in at least one frame of image included in the image data;
determining a target pixel point corresponding to the target image at the two-dimensional labeling position of the label information;
based on the target image and the pose of the image acquisition equipment when acquiring the target image, carrying out three-dimensional position recovery on the target image to obtain the three-dimensional position of the target pixel point under the model coordinate system;
and determining the target three-dimensional position of the label information in the model coordinate system corresponding to the three-dimensional model based on the three-dimensional position of the target pixel point in the model coordinate system.
5. The method according to any one of claims 1-4, wherein the three-dimensional model comprises a three-dimensional sub-model corresponding to each target object in the target scene; adding label information corresponding to the target three-dimensional position to the three-dimensional model based on the target three-dimensional position, wherein the label information comprises:
determining a target object to be added with the label information based on the target three-dimensional position and the pose of the three-dimensional sub-model corresponding to each target object in the three-dimensional model under the model coordinate system;
and establishing an association relation between the target object to which the label information is to be added and the label information.
6. The method according to claim 5, wherein the establishing of the association relationship between the target object to which the tag information is to be added and the tag information comprises:
determining a three-dimensional labeling position of the label information on the target object to which the label information is to be added based on the target three-dimensional position;
and establishing an association relation between the three-dimensional labeling position and the label information.
7. The method according to any one of claims 1-6, further comprising:
acquiring a display material;
generating a label instance based on the display material and the label information;
and in response to a label display event being triggered, showing the three-dimensional model of the target scene and the label instance.
8. The method of claim 7, wherein the tag display event comprises: triggering the target object added with the label information;
the displaying the three-dimensional model of the target scene and the label instance comprises: and displaying the three-dimensional model of the target scene and the label instance corresponding to the triggered target object.
9. The method of claim 7 or 8, wherein the tag display event comprises: displaying the three-dimensional labeling position which establishes the association relation with the label information in a graphical user interface;
the displaying the three-dimensional model of the target scene and the label instance comprises: and displaying the three-dimensional model of the target scene and the label instance having the association relation with the three-dimensional labeling position.
10. The method of any of claims 1-9, wherein the tag information comprises at least one of: tag attribute information, and/or tag content information; wherein the tag attribute information includes at least one of: label size information, label color information, label shape information; the tag content information includes at least one of: attribute information of the corresponding target object, defect inspection result information of the corresponding target object, and trouble shooting condition information of the corresponding target object.
11. The method of any one of claims 1-10, wherein the three-dimensional model comprises a three-dimensional point cloud model; the three-dimensional reconstruction of the target scene based on the image data to obtain a three-dimensional model of the target scene includes:
performing three-dimensional point cloud reconstruction on the target scene based on the image data and the pose of the image acquisition equipment when acquiring the image data to obtain point cloud data of the target scene; the point cloud data includes: point cloud points corresponding to a plurality of target objects in the target scene respectively and position information corresponding to the point cloud points respectively;
performing semantic segmentation processing on the point cloud data to obtain semantic information corresponding to a plurality of point cloud points respectively;
generating a three-dimensional point cloud model of the target scene based on the point cloud data and the semantic information; the three-dimensional point cloud model of the target scene comprises three-dimensional sub-point cloud models corresponding to the target objects respectively.
12. The method of any one of claims 1-10, wherein the three-dimensional model comprises a three-dimensional dense model; the three-dimensional reconstruction of the target scene based on the image data to obtain a three-dimensional model of the target scene includes:
performing three-dimensional dense reconstruction on the target scene based on the image data and the pose of the image acquisition equipment when acquiring the image data to obtain three-dimensional dense data of the target scene; the three-dimensional dense data includes: a plurality of dense points on the surfaces of a plurality of target objects in the target scene and position information corresponding to each dense point;
performing semantic segmentation processing on the three-dimensional dense data to obtain semantic information respectively corresponding to a plurality of patches formed by the dense points;
generating a three-dimensional dense model of the target scene based on the three-dimensional dense data and the semantic information; the three-dimensional dense model of the target scene comprises three-dimensional sub dense models corresponding to the target objects respectively.
13. The method of any one of claims 5-12, wherein the target object comprises at least one of: a building located within the target scene, and a device deployed within the target scene.
14. A data annotation device, applied to a server, the device comprising:
an acquisition module, configured to acquire image data to be processed, wherein the image data carries tag information and comprises a video or an image obtained by performing image acquisition on a target scene;
a first processing module, configured to perform three-dimensional reconstruction on the target scene based on the image data to obtain a three-dimensional model of the target scene;
a determining module, configured to determine, based on a two-dimensional annotation position of the tag information in the image data, a target three-dimensional position of the tag information in a model coordinate system corresponding to the three-dimensional model; and
a second processing module, configured to add, based on the target three-dimensional position, the tag information corresponding to the target three-dimensional position to the three-dimensional model.
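The determining module maps a two-dimensional annotation position in the image to a three-dimensional position in the model coordinate system. A common way to realize such a mapping is pinhole back-projection using the camera intrinsics, the acquisition pose, and a depth along the viewing ray (for example, obtained by intersecting the ray with the reconstructed model). The sketch below assumes this approach and uses hypothetical names; it is not necessarily the exact method of the claims:

```python
import numpy as np

def annotation_2d_to_3d(pixel_xy, depth, intrinsics, cam_to_model):
    """Back-project a 2D annotation position into the model coordinate system.

    pixel_xy:     (u, v) annotation position in the image.
    depth:        depth of the annotated surface point along the camera ray,
                  e.g. found by intersecting the ray with the reconstructed model.
    intrinsics:   3x3 camera intrinsic matrix K.
    cam_to_model: 4x4 pose transforming camera coordinates to model coordinates.
    Returns the target 3D position as a length-3 array.
    """
    u, v = pixel_xy
    fx, fy = intrinsics[0, 0], intrinsics[1, 1]
    cx, cy = intrinsics[0, 2], intrinsics[1, 2]
    # Point in camera coordinates (homogeneous).
    p_cam = np.array([(u - cx) * depth / fx, (v - cy) * depth / fy, depth, 1.0])
    # Transform into the model coordinate system.
    return (cam_to_model @ p_cam)[:3]
```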
15. A computer device, comprising a processor and a memory storing machine-readable instructions executable by the processor, wherein the processor is configured to execute the machine-readable instructions stored in the memory, and when the machine-readable instructions are executed by the processor, the processor performs the steps of the data annotation method according to any one of claims 1 to 13.
16. A computer-readable storage medium, having stored thereon a computer program which, when run by a computer device, causes the computer device to perform the steps of the data annotation method according to any one of claims 1 to 13.
CN202111396883.9A 2021-11-23 2021-11-23 Data annotation method and device, computer equipment and storage medium Withdrawn CN114140528A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111396883.9A CN114140528A (en) 2021-11-23 2021-11-23 Data annotation method and device, computer equipment and storage medium
PCT/CN2022/117915 WO2023093217A1 (en) 2021-11-23 2022-09-08 Data labeling method and apparatus, and computer device, storage medium and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111396883.9A CN114140528A (en) 2021-11-23 2021-11-23 Data annotation method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114140528A (en) 2022-03-04

Family

ID=80390981

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111396883.9A Withdrawn CN114140528A (en) 2021-11-23 2021-11-23 Data annotation method and device, computer equipment and storage medium

Country Status (2)

Country Link
CN (1) CN114140528A (en)
WO (1) WO2023093217A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117011503B (en) * 2023-08-07 2024-05-28 青岛星美装饰服务有限公司 Processing data determining method, device, equipment and readable storage medium
CN117197361B (en) * 2023-11-06 2024-01-26 四川省地质调查研究院测绘地理信息中心 Live three-dimensional database construction method, electronic device and computer readable medium
CN117557871B (en) * 2024-01-11 2024-03-19 子亥科技(成都)有限公司 Three-dimensional model labeling method, device, equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080143709A1 (en) * 2006-12-14 2008-06-19 Earthmine, Inc. System and method for accessing three dimensional information from a panoramic image
CN106683068B (en) * 2015-11-04 2020-04-07 北京文博远大数字技术有限公司 Three-dimensional digital image acquisition method
CN112768016A (en) * 2021-01-26 2021-05-07 马元 Clinical teaching method and system based on lung clinical image
CN114140528A (en) * 2021-11-23 2022-03-04 北京市商汤科技开发有限公司 Data annotation method and device, computer equipment and storage medium

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023093217A1 (en) * 2021-11-23 2023-06-01 上海商汤智能科技有限公司 Data labeling method and apparatus, and computer device, storage medium and program
CN114758075A (en) * 2022-04-22 2022-07-15 如你所视(北京)科技有限公司 Method, apparatus and storage medium for generating three-dimensional label
CN114842175A (en) * 2022-04-22 2022-08-02 如你所视(北京)科技有限公司 Interactive presentation method, device, equipment, medium and program product of three-dimensional label
WO2023202349A1 (en) * 2022-04-22 2023-10-26 如你所视(北京)科技有限公司 Interactive presentation method and apparatus for three-dimensional label, and device, medium and program product
CN114777671A (en) * 2022-04-25 2022-07-22 武汉中观自动化科技有限公司 Workpiece model processing method, server, front-end equipment and three-dimensional scanning system
CN117218131A (en) * 2023-11-09 2023-12-12 天宇正清科技有限公司 Method, system, equipment and storage medium for marking room examination problem

Also Published As

Publication number Publication date
WO2023093217A1 (en) 2023-06-01

Similar Documents

Publication Publication Date Title
CN114140528A (en) Data annotation method and device, computer equipment and storage medium
CN108830894B (en) Remote guidance method, device, terminal and storage medium based on augmented reality
US20210233275A1 (en) Monocular vision tracking method, apparatus and non-transitory computer-readable storage medium
KR102289745B1 (en) System and method for real-time monitoring field work
CN102647449B (en) Based on the intelligent photographic method of cloud service, device and mobile terminal
US11557083B2 (en) Photography-based 3D modeling system and method, and automatic 3D modeling apparatus and method
KR100953931B1 (en) System for constructing mixed reality and Method thereof
CN110400363A (en) Map constructing method and device based on laser point cloud
CN108200334B (en) Image shooting method and device, storage medium and electronic equipment
CN104484033A (en) BIM based virtual reality displaying method and system
JP2011239361A (en) System and method for ar navigation and difference extraction for repeated photographing, and program thereof
KR101181967B1 (en) 3D street view system using identification information.
CN111337037B (en) Mobile laser radar slam drawing device and data processing method
CN110062916A (en) For simulating the visual simulation system of the operation of moveable platform
CN112270702A (en) Volume measurement method and device, computer readable medium and electronic equipment
KR102566300B1 (en) Method for indoor localization and electronic device
CN114820800A (en) Real-time inspection method and equipment for power transmission line
CN114283243A (en) Data processing method and device, computer equipment and storage medium
CN113838193A (en) Data processing method and device, computer equipment and storage medium
CN113610702B (en) Picture construction method and device, electronic equipment and storage medium
CN114092646A (en) Model generation method and device, computer equipment and storage medium
CN113627005B (en) Intelligent vision monitoring method
CN113660469A (en) Data labeling method and device, computer equipment and storage medium
CN114332207A (en) Distance determination method and device, computer equipment and storage medium
CN113160270A (en) Visual map generation method, device, terminal and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40067434; Country of ref document: HK)
WW01 Invention patent application withdrawn after publication (Application publication date: 20220304)