CN113660469A - Data labeling method and device, computer equipment and storage medium

Info

Publication number
CN113660469A
Authority
CN
China
Prior art keywords
data
frame image
panoramic video
video
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110963206.4A
Other languages
Chinese (zh)
Inventor
侯欣如
刘浩敏
姜翰青
王楠
盛崇山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd filed Critical Beijing Sensetime Technology Development Co Ltd
Priority to CN202110963206.4A priority Critical patent/CN113660469A/en
Publication of CN113660469A publication Critical patent/CN113660469A/en
Withdrawn legal-status Critical Current


Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 7/00 - Television systems
    • H04N 7/18 - Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast
    • H04N 7/181 - Closed-circuit television [CCTV] systems for receiving images from a plurality of remote sources
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 - Three dimensional [3D] modelling, e.g. data description of 3D objects
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00 - Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 13/20 - Image signal generators
    • H04N 13/275 - Image signal generators from 3D object models, e.g. computer-generated stereoscopic image signals
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 23/00 - Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N 23/60 - Control of cameras or camera modules
    • H04N 23/698 - Control of cameras or camera modules for achieving an enlarged field of view, e.g. panoramic image capture

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The present disclosure provides a data labeling method, apparatus, computer device and storage medium, wherein the method comprises: acquiring a panoramic video obtained by using an image acquisition device to perform image acquisition on a target space; determining, from the panoramic video, a key frame image that includes an object to be labeled; in response to attribute labeling of the object to be labeled in the key frame image, generating attribute labeling data of the panoramic video based on the labeling data obtained by the attribute labeling; and generating target acquisition data based on the attribute labeling data and the panoramic video. In this way, omissions in data labeling can be reduced, and the loss of digital assets in a machine room can be avoided.

Description

Data labeling method and device, computer equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a data annotation method, an apparatus, a computer device, and a storage medium.
Background
Data in a machine room is usually labeled manually. For example, a worker may look up the asset information of a device in the machine room and record that information. A machine room contains many devices to be labeled, and different devices require different data to be collected, so manual labeling is prone to omissions, which results in the loss of digital assets.
Disclosure of Invention
The embodiment of the disclosure at least provides a data annotation method, a data annotation device, computer equipment and a storage medium.
In a first aspect, an embodiment of the present disclosure provides a data annotation method, including: acquiring a panoramic video obtained by utilizing image acquisition equipment to acquire an image of a target space; determining a key frame image comprising an object to be marked from the panoramic video; responding to attribute labeling of the object to be labeled in the key frame image, and generating attribute labeling data of the panoramic video based on labeling data obtained by attribute labeling; and generating target acquisition data based on the attribute labeling data and the panoramic video.
In this way, the key frame image screened from the panoramic video includes the object to be labeled in the target space, together with the labeling data already completed on that key frame image. When data labeling is performed, the objects in the target space whose labeling is unfinished, and their positions in the key frame image, can therefore be displayed more intuitively, so that the attribute labeling data of the panoramic video is obtained more easily and the target acquisition data can be generated. That is, the objects to be labeled in the target space can be displayed visually through the key frame image, and the existing labeling data can also be displayed to remind the user of the current state of data labeling, thereby assisting the user in further labeling, reducing omissions in data labeling, and avoiding the loss of digital assets in the machine room.
In an alternative embodiment, the image capture device comprises: a panoramic camera; the acquiring of the panoramic video acquired by image acquisition of the target space by using the image acquisition equipment comprises the following steps: controlling the panoramic camera to perform panoramic image acquisition on the target space to obtain a first panoramic video; determining that a complementary acquisition area to be subjected to complementary acquisition exists in the target space based on the first panoramic video and the pose of the panoramic camera when the first panoramic video is acquired; and controlling the panoramic camera to perform complementary mining on the target space based on the complementary mining area to obtain a second panoramic video.
In an optional embodiment, the determining, based on the first panoramic video and the pose of the panoramic camera when the first panoramic video is captured, that a complementary capture area to be complementary captured exists in the target space includes: performing three-dimensional reconstruction on the target space based on the first panoramic video and the pose of the panoramic camera when the first panoramic video is collected, and generating a three-dimensional scene model of the target space; and determining that a complementary acquisition region to be subjected to complementary acquisition exists in the target space based on the three-dimensional scene model.
In an optional implementation manner, based on the complementary mining area, controlling the panoramic camera to perform complementary mining on the target space to obtain a second panoramic video includes: and controlling the panoramic camera to perform complementary mining on the complementary mining area based on the current pose of the panoramic camera in the target space and the position of the complementary mining area in the three-dimensional scene model to obtain the second panoramic video.
Therefore, the acquisition strategy of the complementary acquisition area can be determined more accurately and efficiently through the determined complementary acquisition area and the determined current pose of the panoramic camera, and a second panoramic video for the complementary acquisition area is obtained. And the second panoramic video can be used for supplementing the missing part collected in the first panoramic video, so that the obtained panoramic video can more completely display images at different positions of the target space.
In an optional embodiment, the determining, based on the three-dimensional scene model, that there is a complementary acquisition region to be acquired additionally in the target space includes: detecting whether an area which is not modeled completely exists in the three-dimensional scene model or not based on three-dimensional position information of each dense point cloud point in the three-dimensional scene model in the target space; and if the area which is not modeled completely exists, determining the area which is not modeled completely as the complementary mining area.
Therefore, the accurate position of the complementary mining area can be reflected more accurately by utilizing the three-dimensional position information of the dense point cloud points in the three-dimensional scene model determined in the three-dimensional reconstruction, so that the area needing data complementary mining can be determined more accurately.
In an optional embodiment, the determining, based on the three-dimensional scene model, that there is a complementary acquisition region to be acquired additionally in the target space includes: displaying the three-dimensional scene model; and in response to the triggering of any region in the three-dimensional scene model by a user, determining the triggered region as the complementary mining region.
Therefore, a user can select the complementary acquisition region more flexibly by triggering any region in the three-dimensional scene model, and the data labeling is more flexible.
In an optional implementation manner, the performing attribute labeling on the object to be labeled in the key frame image includes: displaying the key frame image; and responding to a first labeling operation on the object to be labeled in the key frame image, and generating labeling data corresponding to the first labeling operation.
Therefore, the data annotation can be realized simply and conveniently by finishing the data annotation on the key frame image.
In an optional implementation manner, the performing attribute labeling on the object to be labeled in the key frame image includes: and performing semantic segmentation processing on the key frame image, and generating annotation data of the object to be annotated based on the result of the semantic segmentation processing.
Therefore, the method can realize automatic labeling of the objects to be labeled in a semantic segmentation processing mode, and is simpler and more convenient.
In an optional implementation manner, the performing attribute labeling on the object to be labeled in the key frame image includes: generating a preview image based on the key frame image, and displaying the preview image; wherein the preview image has a resolution lower than the resolution of the key frame image; responding to a second labeling operation on the object to be labeled in the preview image, and generating labeling data corresponding to the preview image; and obtaining the annotation data of the key frame image based on the annotation data corresponding to the preview image.
Thus, by previewing the image, the data transmission amount can be reduced, and the data annotation speed can be correspondingly increased.
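A minimal sketch of this preview-based labeling, assuming the annotations are axis-aligned boxes in pixel coordinates and that OpenCV is available for resizing (both are assumptions, not part of the disclosure):

```python
import cv2  # assumed available; any image-resize routine would do

def annotate_via_preview(key_frame, preview_scale, annotate_fn):
    """Label on a low-resolution preview image, then map the boxes back to the key frame.

    key_frame     -- full-resolution panoramic key frame (H x W x 3 array)
    preview_scale -- e.g. 0.25, so the preview carries far fewer pixels to transmit
    annotate_fn   -- returns a list of (x, y, w, h, attributes) boxes drawn on the preview
    """
    h, w = key_frame.shape[:2]
    preview = cv2.resize(key_frame, (int(w * preview_scale), int(h * preview_scale)))

    preview_boxes = annotate_fn(preview)   # the second labeling operation, on the preview image

    # Scale the preview annotations back up to key-frame resolution
    inv = 1.0 / preview_scale
    return [(x * inv, y * inv, bw * inv, bh * inv, attrs)
            for (x, y, bw, bh, attrs) in preview_boxes]
```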
In an optional embodiment, the generating attribute labeling data of the panoramic video based on the labeling data obtained by attribute labeling includes: for each frame of video frame image in the panoramic video, in response to the frame of video frame image not being a key frame image, determining a target key frame image matched with the frame of video frame image from the key frame images; and generating the annotation information of the frame of video frame image based on the annotation information of the target key frame image.
In this way, by determining the corresponding target key frame image for the video frame image, the annotation data in the target key frame image can be correspondingly synchronized to other video frame images in the panoramic video, so that less data needs to be annotated, but the annotation of data of all the video frame images in the panoramic video is completed faster and the efficiency is higher.
In an alternative embodiment, determining a target key frame image for the frame of video frame image from the key frame images includes: and determining the target key frame image matched with the frame video frame image in the key frame images based on the first position of the key frame image in the panoramic video and the second position of the frame video frame image in the panoramic video.
Therefore, the mode of determining the target key frame image according to the position of the image can perform data synchronous labeling on the video frame image of the non-key frame image more accurately in the follow-up process.
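One plausible reading of this position-based matching is to treat a frame's index in the panoramic video as its position and copy the labels of the nearest key frame; a sketch under that assumption (not the only matching rule the embodiment would cover):

```python
def propagate_labels(num_frames, key_frame_labels):
    """Copy each non-key frame's labels from the key frame closest to it in the video.

    num_frames       -- total number of video frame images in the panoramic video
    key_frame_labels -- {key_frame_index: label_dict} produced by attribute labeling
    """
    key_positions = sorted(key_frame_labels)               # first positions: the key frames
    labels = {}
    for frame_idx in range(num_frames):                    # second position: the current frame
        if frame_idx in key_frame_labels:
            labels[frame_idx] = key_frame_labels[frame_idx]
            continue
        target_key = min(key_positions, key=lambda k: abs(k - frame_idx))
        labels[frame_idx] = key_frame_labels[target_key]   # synchronize the key frame's annotation
    return labels

# Example: 10 frames with key frames at positions 2 and 7
print(propagate_labels(10, {2: {"cabinet": "A"}, 7: {"cabinet": "B"}}))
```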
In an optional embodiment, the method further comprises: acquiring the pose of the image acquisition equipment when acquiring the panoramic video; generating target acquisition data based on the attribute labeling data and the panoramic video, comprising: and generating the target acquisition data based on the attribute labeling data, the panoramic video and the pose.
In a second aspect, an embodiment of the present disclosure further provides a data annotation device, including: the first acquisition module is used for acquiring a panoramic video acquired by acquiring an image of a target space by using image acquisition equipment; the determining module is used for determining a key frame image comprising an object to be marked from the panoramic video; the processing module is used for responding to attribute labeling of the object to be labeled in the key frame image and generating attribute labeling data of the panoramic video based on labeling data obtained by attribute labeling; and the generating module is used for generating target acquisition data based on the attribute labeling data and the panoramic video.
In a third aspect, an embodiment of the present disclosure further provides a computer device, including a processor and a memory, where the memory stores machine-readable instructions executable by the processor, and the processor is configured to execute the machine-readable instructions stored in the memory; when the machine-readable instructions are executed by the processor, the processor performs the steps in the first aspect, or in any possible implementation of the first aspect.
In a fourth aspect, an embodiment of the present disclosure further provides a computer-readable storage medium having a computer program stored thereon, where the computer program, when run, performs the steps in the first aspect, or in any possible implementation of the first aspect.
For the description of the effects of the data annotation device, the computer device, and the computer-readable storage medium, reference is made to the description of the data annotation method, which is not repeated herein.
In order to make the aforementioned objects, features and advantages of the present disclosure more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings required in the embodiments are briefly described below. The drawings here are incorporated into and form a part of the specification; they illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the technical solutions of the present disclosure. It should be understood that the following drawings show only certain embodiments of the present disclosure and therefore should not be regarded as limiting its scope; for those skilled in the art, other related drawings can be derived from these drawings without inventive effort.
Fig. 1 shows a flowchart of a data annotation method provided by an embodiment of the present disclosure;
fig. 2 shows a flowchart corresponding to a specific embodiment of generating a three-dimensional scene model by a data processing device according to an embodiment of the present disclosure;
fig. 3 is a flowchart illustrating a specific embodiment of a data processing device when performing data complementary acquisition according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram illustrating a video frame image displayed on a graphical display interface according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram illustrating a video frame image and a three-dimensional scene model displayed on a graphical display interface according to an embodiment of the disclosure;
FIG. 6 is a schematic diagram illustrating a graphical display interface when displaying a key frame image and an annotation control according to an embodiment of the disclosure;
FIG. 7 is a schematic diagram illustrating a graphical display interface displaying preview images provided by an embodiment of the present disclosure;
FIG. 8 is a diagram illustrating a multi-frame video frame image provided by an embodiment of the present disclosure;
FIG. 9 is a schematic diagram illustrating a data annotation of a non-key frame image according to an embodiment of the disclosure;
FIG. 10 is a flow chart illustrating one embodiment of the present disclosure for performing data annotation;
fig. 11 shows a flowchart corresponding to an embodiment of performing data annotation on a machine room according to an embodiment of the present disclosure;
FIG. 12 is a schematic diagram of a data annotation device provided in an embodiment of the disclosure;
fig. 13 shows a schematic diagram of a computer device provided by an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, not all of the embodiments. The components of embodiments of the present disclosure, as generally described and illustrated herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present disclosure is not intended to limit the scope of the disclosure, as claimed, but is merely representative of selected embodiments of the disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the disclosure without making creative efforts, shall fall within the protection scope of the disclosure.
Research shows that a machine room covers a large area and involves many digital assets. Therefore, when the digital assets in the machine room are labeled manually, workers usually have to patrol the machine room, locate the many items of data to be labeled, and label them one by one. Such manual labeling is prone to omissions, resulting in the loss of digital assets in the machine room.
The above drawbacks were identified by the inventor only after practice and careful study; therefore, the discovery of the above problems, as well as the solutions that the present disclosure proposes for them, should be regarded as the inventor's contribution to the present disclosure.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
To facilitate understanding of the present embodiment, first, a data annotation method disclosed in the embodiments of the present disclosure is described in detail, where an execution subject of the data annotation method provided in the embodiments of the present disclosure is generally a data processing device with certain computing capability, and the data processing device includes, for example: a terminal device, which may be a User Equipment (UE), a mobile device (e.g., a tablet, or a cell phone in the examples described below), a User terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or a server or other processing device; in a possible case, the target space can also be equipped with a dedicated data processing device, for example a management computer in a computer room or a portable handheld management device. Specifically, the determination may be performed according to actual situations, and details are not described herein. In addition, in some possible implementations, the data annotation method can be implemented by a processor calling computer-readable instructions stored in a memory.
Referring to fig. 1, a flowchart of a data annotation method provided in the embodiment of the present disclosure is shown, where the method includes steps S101 to S104, where:
s101: acquiring a panoramic video obtained by utilizing image acquisition equipment to acquire an image of a target space;
s102: determining a key frame image comprising an object to be marked from the panoramic video;
s103: responding to attribute labeling of the object to be labeled in the key frame image, and generating attribute labeling data of the panoramic video based on labeling data obtained by attribute labeling;
s104: and generating target acquisition data based on the attribute labeling data and the panoramic video.
The following describes the details of S101 to S104.
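Before detailing each step, the overall flow of S101 to S104 can be summarized with a minimal sketch; all function and field names below are illustrative assumptions rather than the disclosed implementation, and each step is supplied by the caller:

```python
def run_data_labeling_pipeline(acquire_video, pick_key_frames, annotate, propagate):
    """Hypothetical orchestration of S101-S104; each step is a caller-supplied callable."""
    panoramic_video = acquire_video()                     # S101: capture the target space
    key_frames = pick_key_frames(panoramic_video)         # S102: frames containing objects to label
    key_frame_labels = {i: annotate(panoramic_video[i])   # S103: attribute labeling per key frame
                        for i in key_frames}
    attribute_labels = propagate(panoramic_video, key_frame_labels)   # S103: whole-video label data
    return {"video": panoramic_video, "labels": attribute_labels}     # S104: target acquisition data
```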
For the above S101, the image capturing device may include, for example, a panoramic camera; illustratively, the panoramic camera may include, for example, a fisheye camera provided on a scanner in the following examples. In particular, a panoramic camera obtains a panoramic image when shooting the target space, and is therefore well suited to omnidirectional capture in a relatively large target space such as a machine room or a factory building. A machine room may house, for example, computing devices, data storage devices, signal receiving devices and the like; a factory building may house, for example, production equipment, handling equipment and transportation equipment. Both the machine room and the factory building are physical spaces.
Illustratively, the target space may include, for example, a machine room having a large floor space, such as a machine room having a floor space of 20 square meters, 30 square meters, or 50 square meters. In the case of using a machine room as a target space, a scene in the machine room can be photographed by using an image acquisition device.
In one possible case, the target space in which data acquisition can be performed in one area may include a plurality of target spaces, for example, a plurality of machine rooms in one area. Because the target spaces for data acquisition are similar, the data annotation method provided by the embodiment of the disclosure can be correspondingly applied to different multiple target spaces. For example, when data labeling is performed on a plurality of machine rooms, whether data labeling is performed on the machine room or not, or whether data labeling is completed on the machine room or not, or whether data to be labeled exists in the machine room or not can be determined according to machine room identifiers corresponding to the plurality of different machine rooms respectively, so that whether the machine room is a target space which needs to be subjected to data labeling currently or not is determined.
In addition, many devices are usually installed in the target space, such as a rack installed indoors and connected to an outdoor tower; indoor cabinets are also usually installed on the floor of a machine room. Specifically, when the panoramic camera collects panoramic images of the target space, it may be mounted on a robot that moves through the target space, driving the panoramic camera to acquire the panoramic video; alternatively, a worker such as a surveyor may hold the panoramic camera to collect panoramic images of the target space and thereby acquire the panoramic video. The panoramic video may include, for example, the first panoramic video described below, or the first panoramic video and the second panoramic video described below.
Here, when the target space is subjected to image acquisition, for example, a plurality of video frame images (that is, panoramic images) corresponding to the target space may be obtained, and a panoramic video corresponding to the target space may also be obtained accordingly.
The panoramic camera can be used for image acquisition, and the panoramic video obtained by image acquisition of the panoramic camera can be used for three-dimensional scene model reconstruction when data processing is carried out, so that the pose of the panoramic camera can be determined. In this case, for example, before the panoramic camera performs panoramic image acquisition on the target space, the gyroscope of the panoramic camera may be calibrated to determine the pose of the panoramic camera in the target space; illustratively, this may be achieved, for example, by adjusting the optical axis of the panoramic camera to be parallel to the floor of the target space.
After the gyroscope of the panoramic camera is calibrated, the panoramic camera can acquire images in a video mode and acquire a first panoramic video corresponding to a target space.
In another embodiment of the present disclosure, since incomplete capturing may occur when the image capturing device is used to capture an image of the target space, for example, a captured image of a partial area such as a corner position inside the target space is absent, the panoramic camera may further acquire the second panoramic video in the following manner, for example: controlling the panoramic camera to perform panoramic image acquisition on the target space to obtain a first panoramic video; determining that a complementary acquisition area to be subjected to complementary acquisition exists in the target space based on the first panoramic video and the pose of the panoramic camera when the first panoramic video is acquired; and controlling the panoramic camera to perform complementary mining on the target space based on the complementary mining area to obtain a second panoramic video.
In a specific implementation, when determining the complementary capture area in the target space based on the first panoramic video and the pose of the panoramic camera capturing the first panoramic video, for example, the following manners may be adopted: performing three-dimensional reconstruction on the target space based on the first panoramic video and the pose of the panoramic camera when the first panoramic video is collected, and generating a three-dimensional scene model of the target space; and determining that a complementary acquisition region to be subjected to complementary acquisition exists in the target space based on the three-dimensional scene model.
Specifically, the determined pose of the panoramic camera when the first panoramic video is acquired can reflect the pose of the panoramic camera in the target space, and a plurality of dense point cloud points can be determined in the first panoramic video acquired by the panoramic camera through acquisition of the target space, so that the target space can be three-dimensionally reconstructed by using the first panoramic video and the pose of the panoramic camera when the first panoramic video is acquired, so as to generate a three-dimensional scene model of the target space. For example, the three-dimensional scene model may reflect objects including a target space, an object to be labeled in the target space, and the like.
In a specific implementation, when determining the three-dimensional scene model corresponding to the target space, for example, the following two ways including, but not limited to (a1) and (a2) may be adopted:
(A1): The image acquisition device only undertakes the task of image acquisition, and transmits the acquired first panoramic video, together with the pose of the panoramic camera when the first panoramic video was acquired, to the data processing device over a network connection, so that the data processing device establishes the three-dimensional scene model corresponding to the target space.
The network connections that may be relied upon may include, for example, a Fiber Ethernet Adapter (Fiber Ethernet Adapter), a mobile communication technology (e.g., a fourth generation mobile communication technology (4G) or a fifth generation mobile communication technology (5G)), and Wireless Fidelity (WiFi); the data processing device may for example comprise a computer device as described above. When the data processing device processes the first panoramic video, a corresponding three-dimensional scene model is determined for the target space according to dense point cloud points in the first panoramic video and the pose of the panoramic camera in the target space.
When the data processing device acquires the pose of the image capturing device when capturing the first panoramic video, for example, the data related to an Inertial Measurement Unit (IMU) of the image capturing device when capturing the first panoramic video may be acquired. For example, in the inertial measurement unit IMU of the image capturing device, for example, three single-axis accelerometers and three single-axis gyroscopes may be included, where the accelerometers may detect an acceleration of the image capturing device when capturing the first panoramic video in the target space, and the gyroscopes may detect an angular velocity of the image capturing device when capturing the first panoramic video in the target space. Therefore, the data processing equipment can accurately determine the pose of the image acquisition equipment when acquiring the first panoramic video by acquiring the relevant data of the inertial measurement unit IMU in the image acquisition equipment.
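As a rough illustration of how such IMU readings relate to pose, the sketch below dead-reckons a planar pose from angular velocity and forward acceleration samples; this is a deliberate simplification of what a real visual-inertial pipeline does, and every name and number is an assumption:

```python
import math

def dead_reckon_planar(imu_samples, dt):
    """Very simplified planar dead reckoning from IMU samples.

    imu_samples -- list of (forward_acceleration, yaw_rate) pairs in the scanner frame
    dt          -- sampling interval in seconds (e.g. 1/400 for a 400 Hz IMU)
    Returns the final (x, y, yaw); a real system fuses this with the video instead.
    """
    x = y = yaw = speed = 0.0
    for forward_acc, yaw_rate in imu_samples:
        yaw += yaw_rate * dt               # integrate gyroscope: angular velocity -> heading
        speed += forward_acc * dt          # integrate accelerometer: acceleration -> speed
        x += speed * math.cos(yaw) * dt    # advance along the current heading
        y += speed * math.sin(yaw) * dt
    return x, y, yaw

# Example: one second of gentle acceleration while turning slowly, sampled at 400 Hz
print(dead_reckon_planar([(0.5, 0.1)] * 400, 1.0 / 400))
```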
The data processing device may employ at least one algorithm of Simultaneous Localization And Mapping (SLAM) And real-time dense reconstruction, for example, when determining the three-dimensional scene model. For example, when the image capturing device captures the first panoramic video, the first panoramic video may be gradually captured as the image capturing device moves, and the data processing device gradually generates a three-dimensional scene model covering the target space; or after the image acquisition equipment finishes acquiring the first panoramic video, the data processing equipment generates a three-dimensional scene model corresponding to the target space by using the obtained complete first panoramic video.
In another embodiment of the present disclosure, a specific embodiment is further provided in which the data processing device generates the three-dimensional scene model of the target space by using a SLAM algorithm and a real-time dense reconstruction algorithm. Two fisheye cameras arranged at the front and rear positions on the scanner are selected as the panoramic camera; the fisheye cameras are mounted on the scanner in preset poses so that a complete panoramic video corresponding to the target space can be acquired. Referring to fig. 2, a flowchart corresponding to a specific embodiment of generating a three-dimensional scene model by the data processing device according to an embodiment of the present disclosure is shown, where:
s201: the data processing equipment acquires two panoramic videos with synchronous real-time acquisition time of the front fisheye camera and the rear fisheye camera of the scanner.
Wherein, the two panoramic videos respectively comprise a plurality of frames of video frame images. Because the two fisheye cameras collect two panoramic videos with synchronous time in real time, timestamps of multi-frame video frame images respectively included in the two panoramic videos respectively correspond to each other.
In addition, the data processing device can also determine the precision of the time stamp and the acquisition frequency when acquiring the video frame images in the panoramic video according to the specific instrument parameters of the two fisheye cameras. For example, setting the time stamp of the video frame image to be accurate to nanosecond; and when the video frame images in the panoramic video are acquired, the acquisition frequency is not lower than 30 hertz (Hz).
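Since the two fisheye videos are captured with synchronized timestamps, their frames can be paired by nearest timestamp; the sketch below assumes each video is a list of (timestamp_ns, frame) tuples sorted by time, with a tolerance value that is purely an assumption:

```python
def pair_synchronized_frames(front_video, rear_video, max_skew_ns=5_000_000):
    """Pair each front-camera frame with the rear-camera frame closest to it in time.

    front_video, rear_video -- lists of (timestamp_ns, frame), sorted by timestamp
    max_skew_ns             -- tolerance (here 5 ms) beyond which a frame stays unpaired
    """
    pairs, j = [], 0
    for ts_front, frame_front in front_video:
        # advance the rear pointer while the next rear frame is at least as close in time
        while (j + 1 < len(rear_video)
               and abs(rear_video[j + 1][0] - ts_front) <= abs(rear_video[j][0] - ts_front)):
            j += 1
        ts_rear, frame_rear = rear_video[j]
        if abs(ts_rear - ts_front) <= max_skew_ns:
            pairs.append((frame_front, frame_rear))
    return pairs
```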
After the panoramic video is acquired by the fisheye camera, the panoramic video is sent to the data processing device through the network connection that can be relied on in the above example, so that the data processing device receives the panoramic video.
S202: the data processing device determines relevant data of the inertial measurement units IMU when the two fisheye cameras respectively acquire the panoramic video.
Taking any one of the two fisheye cameras as an example, when the data processing device acquires a video frame image from a target space, the data processing device may correspondingly observe and acquire related data of the inertial measurement unit IMU between two adjacent video frames and a timestamp when the related data is acquired.
Specifically, in order to more accurately acquire the relevant data of the inertial measurement unit IMU, a corresponding scanner coordinate system (which may be composed of an X axis, a Y axis, and a Z axis, for example) may be determined for the fisheye camera, so as to determine the relevant data of the inertial measurement unit IMU on the scanner coordinate system, such as the acceleration and the angular velocity under the X axis, the Y axis, and the Z axis of the scanner coordinate system.
In addition, the time stamp for acquiring the relevant data of the inertial measurement unit IMU can be determined according to the specific instrument parameters of the two fisheye cameras. For example, it may be determined that the observation frequency for acquiring the relevant data of the inertial measurement unit IMU is not lower than 400 Hz. In this way, the data processing device can also determine the relevant data directly by acquiring the specific instrument parameters of the two fisheye cameras.
S203: the data processing device determines the poses of the two fisheye cameras in the world coordinate system based on the relevant data of the inertial measurement unit IMU.
Specifically, since the coordinate system transformation relationship between the scanner coordinate system and the world coordinate system can be determined, after acquiring the relevant data Of the inertial measurement unit IMU, the data processing device can determine the poses Of the two fisheye cameras in the world coordinate system according to the coordinate system transformation relationship, for example, the poses can be expressed as 6-Degree Of Freedom (6-Degree Of Freedom, 6DOF) poses, which is not described herein again in detail.
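The coordinate-system transformation mentioned here can be sketched as composing 4x4 homogeneous transforms: the scanner-to-world transform applied to a pose expressed in the scanner coordinate system yields the 6DOF pose in the world coordinate system. The numbers below are placeholders, not calibration values from the disclosure:

```python
import numpy as np

def scanner_pose_to_world(T_world_scanner, T_scanner_camera):
    """Compose the scanner->world transform with the camera pose in the scanner frame.

    Both arguments are 4x4 homogeneous transforms; the result is the camera's 6DOF pose
    (rotation plus translation) expressed in the world coordinate system.
    """
    return T_world_scanner @ T_scanner_camera

# Placeholder example: the scanner origin sits 1.5 m above the world origin
T_world_scanner = np.eye(4)
T_world_scanner[2, 3] = 1.5
T_scanner_camera = np.eye(4)          # camera at the scanner origin, axes aligned
T_world_camera = scanner_pose_to_world(T_world_scanner, T_scanner_camera)
print(T_world_camera[:3, 3])          # camera position in world coordinates: [0. 0. 1.5]
```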
For the above S201 to S203, when the SLAM algorithm is adopted, since the video frame images in the panoramic video are all panoramic images, the 6DOF pose of the image acquisition device can be accurately solved by the processing steps of image processing, key point extraction, key point tracking, and establishment of the association relationship between the key points, that is, the acquisition and calculation of the 6DOF pose of the image acquisition device in real time are realized; and moreover, the coordinates of dense point cloud points in the target space can be obtained.
When the video frame images in the panoramic video are processed, the key frame images can be determined in the multi-frame video frame images corresponding to the panoramic video, so that the SLAM algorithm is ensured to have enough processing data, the calculation amount is reduced, and the efficiency is improved. The specific manner of determining the key frame image from the panoramic video can be referred to the following description of S102, and is not described in detail here.
In this way, the key frame image map can be stored in the background of the SLAM algorithm, so that after the image acquisition device is controlled to return to the acquired position again, the two frames of video frame images at the position can be compared to perform loop detection on the image acquisition device, and the positioning accumulated error of the image acquisition device under long-time and long-distance operation can be corrected.
S204: and the data processing equipment processes the keyframe images in the panoramic video and the pose of the fisheye camera, which are respectively acquired by the fisheye camera, as input data of the real-time dense reconstruction algorithm.
For example, for a panoramic video acquired by any fisheye camera, after determining a new key frame image in the panoramic video by using the above-mentioned S201 to S203, the data processing device uses all currently obtained key frame images and the pose of the fisheye camera corresponding to the new key frame image as input data of the real-time dense reconstruction algorithm.
For key frame images transmitted before the new key frame image is obtained, the corresponding fisheye camera poses were already input into the real-time dense reconstruction algorithm when those key frame images were used as input data, so they are not input again together with the new key frame image; in this way, repeated input is avoided.
S205: and the data processing equipment processes the input data by using a real-time dense reconstruction algorithm to obtain a three-dimensional scene model corresponding to the target space.
For example, the resulting three-dimensional scene model may include a dense point cloud that may be shown in a preset color, for example. In generating the three-dimensional scene model, the dense point cloud may be updated, for example, as the panoramic video is captured. The updating frequency can be determined according to the input frequency of the key frame images and the pose of the fisheye camera when the real-time dense reconstruction algorithm is input.
For the above S204 to S205, when the data processing device adopts the real-time dense reconstruction algorithm, the data processing device may estimate a dense depth map of the scene three-dimensional model by using the new keyframe image, and fuse the dense depth map into the three-dimensional scene model by using the pose of the corresponding fisheye camera, thereby obtaining the scene three-dimensional model after the target space is completely acquired. In a possible case, for the processed key frame image, by using the pose of the image capturing device corresponding to the key frame image and the pose of the image capturing device corresponding to the new key frame image adjacent to the key frame image, it can be determined whether the pose of the image capturing device when capturing the target space is adjusted. If the pose is not adjusted, correspondingly continuing to carry out real-time dense reconstruction on the target space to obtain a three-dimensional scene model; and if the pose is adjusted, correspondingly adjusting the dense depth map according to the pose adjustment, so as to obtain an accurate three-dimensional scene model.
(A2): The image acquisition device itself has the computing power to process the panoramic video; after acquiring the first panoramic video, it uses this computing power to process the first panoramic video and obtain the three-dimensional scene model corresponding to the target space.
Here, the specific manner of determining the three-dimensional scene model by the image capturing device may refer to the description of establishing the three-dimensional scene model by the data processing device in (a1), which is not described herein again.
After the image acquisition device has completed building the three-dimensional scene model, the built three-dimensional scene model may be transmitted to the data processing device, for example, in dependence on the network connection described above, for further processing by the data processing device.
After the data processing device or the image acquisition device generates the three-dimensional scene model corresponding to the target space, for example, the data processing device may determine whether the target space has a complementary acquisition region by determining whether the three-dimensional scene model can completely express any one of the target space and the object to be labeled.
In a possible case, if it is determined that the three-dimensional scene model can completely express any one of the target space and the object to be labeled, the data processing device determines that no complementary acquisition area exists, and determines that the first panoramic video acquired by the panoramic camera through acquiring the panoramic image of the target space is the required panoramic video.
In another possible case, if it is determined that the three-dimensional scene model cannot completely express any one of the target space and the object to be labeled, the data processing device determines that the complementary mining area exists. In this case, it is determined to perform data complementary acquisition, and the second panoramic video is acquired to supplement the complementary acquisition area in the target space, in which image acquisition is not completed, with the second panoramic video. Here, the second panoramic video is also referred to as a complementary video.
Here, the complementary mining area can be determined through the three-dimensional scene model, that is, the specific position of the complementary mining area can be determined while the three-dimensional scene model corresponding to the complete acquisition area can be determined by using the first panoramic video, so that complementary mining can be directly performed on the complementary mining area, and acquisition work on other areas in the target space is not performed again. That is, the second panoramic video may include, for example, a video acquired for the complementary mining area, which does not include an area in which the three-dimensional scene model is built, or a small portion of the area in which the three-dimensional scene model is built, which is inevitably present in the target space during shooting. Therefore, the second panoramic video can be directly used for building the three-dimensional scene model of the complementary mining area, the corresponding poses, images and the like of the first panoramic video and the second panoramic video do not need to be spliced, the spliced panoramic video is reused for reconstructing the three-dimensional scene model again, the required calculation power is less, and the efficiency is higher.
In a specific implementation, when determining whether data complementary acquisition needs to be performed on an complementary acquisition region in a target space by using a three-dimensional scene model, the data processing device may adopt, for example, two ways including, but not limited to, the following (B1) or (B2):
(B1): Detecting, based on the three-dimensional position information of each dense point cloud point of the three-dimensional scene model in the target space, whether there is an area in the three-dimensional scene model for which modeling is not completed; and if such an area exists, determining the incompletely modeled area as the complementary mining area.
In this case, when the three-dimensional scene model is generated, the three-dimensional scene model includes a plurality of dense point cloud points, each dense point cloud point corresponds to the three-dimensional position information in the target space, so that by determining the three-dimensional position information of the dense point cloud points in the three-dimensional scene model, a region in which the dense point cloud points are not distributed in the three-dimensional scene model can be determined, and whether the region is not collected or not can be correspondingly determined. For example, for a vertex angle position in a target space, due to the influence of a shooting angle and the like, a video frame image at the vertex angle position may not be acquired after one image acquisition, and therefore, the generated three-dimensional scene model has a defect of the region, that is, the three-dimensional scene model does not have a dense point cloud point corresponding to the vertex angle position. At this time, the region where the vertex angle is located may be taken as a complementary mining region for which modeling is not completed.
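One way such a check could be implemented is to rasterize the dense point cloud into a coarse grid over the target space and report cells that contain no points as candidate regions where modeling is not completed; the cell size and all names below are assumptions:

```python
import numpy as np

def find_unmodeled_cells(points, space_min, space_max, cell_size=0.5):
    """Return grid cells of the target space that contain no dense point cloud points.

    points               -- (N, 3) array of dense point cloud coordinates in the target space
    space_min, space_max -- length-3 arrays bounding the target space (e.g. the machine room)
    cell_size            -- edge length of one grid cell in metres (assumed value)
    """
    dims = np.maximum(np.ceil((space_max - space_min) / cell_size).astype(int), 1)
    occupied = np.zeros(dims, dtype=bool)

    idx = np.floor((points - space_min) / cell_size).astype(int)
    idx = np.clip(idx, 0, dims - 1)                      # keep stray points inside the grid
    occupied[idx[:, 0], idx[:, 1], idx[:, 2]] = True     # mark cells touched by the point cloud

    # any cell never touched by a dense point is a candidate complementary mining area
    return np.argwhere(~occupied) * cell_size + space_min
```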
And under the condition that the complementary mining area exists, determining that data complementary mining needs to be carried out on the complementary mining area.
Specifically, after determining that data complementary acquisition needs to be performed on the complementary acquisition area, the data processing device may control the panoramic camera to perform complementary acquisition on the target space in the following manner to obtain the second panoramic video: and controlling the panoramic camera to perform complementary mining on the complementary mining area based on the current pose of the panoramic camera in the target space and the position of the complementary mining area in the three-dimensional scene model to obtain the second panoramic video.
For example, in the case of performing data complementary collection by using a robot equipped with a panoramic camera, the data processing device may determine the position of the robot in the target space by using the current pose of the panoramic camera in the target space, and further, the data processing device may determine the walking strategy of the robot by using the position of the complementary collection area in the three-dimensional scene model, and cause the panoramic camera to collect the complementary collection area again by relying on the movement of the robot in the target space. In this way, the data processing device can also control the robot to efficiently and directly move to the position where the complementary acquisition area can be acquired in the target space, and the image acquisition efficiency is correspondingly higher.
In addition, if the user holds the panoramic camera to perform data complementary collection, the data processing equipment can similarly display related prompt information for complementary collection of the complementary collection area to the user according to the current pose of the panoramic camera in the target space and the position of the complementary collection area in the three-dimensional scene model. Illustratively, the related prompt information may include at least one of voice prompt information, text prompt information, and image prompt information, for example. For example, the data processing device may issue a voice prompt to the user "there is a complementary acquisition area 5 meters ahead", or a text prompt "please advance 5 meters to the next complementary acquisition area", or an image prompt, such as when presenting the panoramic video and/or the three-dimensional scene model to the user, instructing the user to advance 5 meters with a marked arrow.
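As a toy illustration, the text prompt in this example could be derived from the panoramic camera's current position and the centre of the area still to be captured; the function name and the planar-distance simplification are assumptions:

```python
import math

def complementary_capture_prompt(camera_xy, area_center_xy):
    """Build a text prompt guiding the user towards the area that still needs capturing."""
    dx = area_center_xy[0] - camera_xy[0]
    dy = area_center_xy[1] - camera_xy[1]
    distance = math.hypot(dx, dy)            # planar distance from camera to area centre
    return f"Please advance {distance:.0f} meters to the next complementary acquisition area"

# Example matching the text: the area centre lies 5 m in front of the camera
print(complementary_capture_prompt((0.0, 0.0), (5.0, 0.0)))
```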
When the data processing device presents the panoramic video and/or the three-dimensional scene model to the user, the panoramic video and/or the three-dimensional scene model can be presented in a graphical display interface of the data processing device. In a possible case, if the user selects any one of the mobile phone and the dedicated capturing device as the data processing device, the panoramic video and/or the three-dimensional scene model may be displayed by directly using a graphical display interface corresponding to the mobile phone or the dedicated capturing device, for example, a mobile phone screen or a display screen connected to the dedicated capturing device. Specifically, the determination may be performed according to actual situations, and details are not described herein.
In this way, the data processing equipment can automatically complete the operation of determining the complementary mining area, so that the complementary mining area can be determined quickly, and the efficiency is high; and because the user does not need to determine the complementary mining area, the technical requirements on the working personnel can be reduced, and the working personnel can be assisted to complete related data labeling tasks more conveniently.
(B2): Displaying the three-dimensional scene model; and in response to a user triggering any region in the three-dimensional scene model, determining the triggered region as the complementary mining region.
In this case, the three-dimensional scene model may be presented to the user in a graphical user interface, which may be specifically referred to the description in (B1), and will not be described herein again.
In a possible case, when the three-dimensional scene model is established, there may be a case where fewer dense point cloud points correspond to a partial region determined by using the first panoramic video, so that the modeling result of the partial region is inaccurate. Therefore, the adoption of the automatic detection method for the three-dimensional scene model may omit the supplementary collection of the partial region, and further cause the inaccuracy of the three-dimensional model corresponding to the partial region in the obtained three-dimensional scene model. By means of displaying the three-dimensional scene model to the user, the user can flexibly select the area needing to be supplemented and collected through a mode of checking the three-dimensional scene model, the area needing to be collected more carefully is selected by the user, or the area where the three-dimensional scene model cannot be accurately established by utilizing the first panoramic video can be supplemented more completely and clearly, so that the three-dimensional scene model obtained after being supplemented and collected is more complete, and the actual precision requirement of the user and the actual labeling requirement of the user are met.
Here, the manner in which the data processing device controls the panoramic camera to perform the complementary acquisition on the target space to obtain the second panoramic video may be referred to the description in (B1) above, and details are also omitted here.
In another embodiment of the present disclosure, a specific embodiment of the data processing device during data complementary acquisition is further provided. In this embodiment, the data processing device may include, for example, a mobile phone, and perform acquisition of a panoramic video through a panoramic camera wirelessly connected to the mobile phone, and present the relevant panoramic video and the three-dimensional scene model to the user on a graphical display interface of the mobile phone. Specifically, referring to fig. 3, a flowchart of a specific embodiment when performing data complementary collection for a data processing device provided by the embodiment of the present disclosure is shown, where:
s301: the mobile phone acquires a panoramic video obtained by the panoramic camera performing image acquisition on a target space.
S302: and displaying the panoramic video on a graphical display interface of the mobile phone.
S303: and responding to the dragging action of the user on the panoramic video on the graphical display interface of the mobile phone, and determining a three-dimensional scene model corresponding to the frame of video frame image by using the video frame image in the displayed panoramic video.
S304: and displaying the three-dimensional scene model and rendering the image for the panoramic video on a graphical display interface of the mobile phone in an overlapping manner.
In steps S302 to S304, when the panoramic video is displayed on the graphic display interface of the mobile phone, for example, a rendered image with transparency and shown in a certain color may be superimposed on the video frame image. Referring to fig. 4, a schematic diagram of displaying a video frame image on a graphical display interface according to an embodiment of the present disclosure is provided. When the user drags to the frame of video frame image, three devices, three cabinets, antennas connected with the three devices, and a gray rendering image are displayed.
In addition, after the mobile phone performs data processing on the video frame image, a three-dimensional scene model corresponding to the video frame image can be obtained and displayed on a graphical display interface. And erasing the rendering image for the area for establishing the three-dimensional scene model. Referring to fig. 5, a schematic diagram of a video frame image and a three-dimensional scene model displayed on a graphical display interface according to an embodiment of the present disclosure is shown; in the figure, a three-dimensional scene model is shown by a dotted line. In addition, in fig. 5, it may be determined that a corresponding three-dimensional scene model is not established in the complementary acquisition region according to the position of the rendering image that is not erased, that is, the illustrated complementary acquisition region.
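The overlay-and-erase behaviour described for fig. 4 and fig. 5 can be approximated with a per-pixel mask of the already-modeled area; the sketch below uses NumPy arrays and assumed layouts rather than the disclosed rendering code:

```python
import numpy as np

def render_capture_overlay(frame, modeled_mask, alpha=0.5, tint=(128, 128, 128)):
    """Overlay a semi-transparent grey tint on a video frame, except where modeling is done.

    frame        -- (H, W, 3) uint8 video frame image from the panoramic video
    modeled_mask -- (H, W) bool array, True where the three-dimensional scene model exists
    Pixels that keep the tint correspond to the region still to be captured.
    """
    tinted = frame.astype(np.float32) * (1 - alpha) + np.array(tint, np.float32) * alpha
    out = frame.copy()
    out[~modeled_mask] = tinted[~modeled_mask].astype(np.uint8)   # tint only the un-modeled pixels
    return out

# Example: a 4x4 frame whose left half is already modeled (tint erased there)
frame = np.full((4, 4, 3), 200, dtype=np.uint8)
mask = np.zeros((4, 4), dtype=bool)
mask[:, :2] = True
print(render_capture_overlay(frame, mask)[0])
```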
S305: and responding to the selection operation of the user according to the complementary mining area shown by the graphical display interface, and determining whether to perform data complementary mining on the complementary mining area.
In this step, the graphic display interface may show the area for complementary mining, and the user may determine whether to perform complementary mining on the area according to the actual data acquisition requirement.
Specifically, the user may determine, for example, according to the video frame image and the complementary region shown in the graphical display interface in fig. 5, a device corresponding to the complementary region, that is, an antenna connected to the three devices. If data marking needs to be carried out on the antenna, because image acquisition is not carried out on the antenna at present, the fact that complementary acquisition needs to be carried out on a complementary acquisition area where the antenna is located again is determined, namely a second panoramic video needs to be acquired; if the data marking of the antenna is not needed, the establishment of the three-dimensional scene model can be continuously carried out on other areas which can be shown in the panoramic video, namely, the acquisition of the second panoramic video is not needed.
For example, if the user determines that data tagging needs to be performed on the antenna, for example, the complementary mining area corresponding to the antenna may be selected on the graphical display interface through operations such as clicking, so that the mobile phone determines to perform complementary mining on the complementary mining area.
With respect to S102 described above, after determining a panoramic video using the first panoramic video, or determining a panoramic video using the first panoramic video and the second panoramic video, the data processing apparatus may determine a key frame image from the panoramic video.
Specifically, the data processing apparatus, when determining the key frame image from the panoramic video, may adopt, for example, but not limited to, any of the following two manners (C1) and (C2):
(C1): determining a preset number of key frame images according to the number of video frame images contained in the panoramic video and the actual data annotation requirement.
Illustratively, suppose the panoramic video contains 100 video frame images and it is determined that 10 key frame images are to be annotated. In the case that the annotation data in all 100 video frame images of the panoramic video can be effectively and accurately determined from these key frames, the preset number of key frame images is set to 10, and the 10 key frame images are selected at an equal frame-number interval among the 100 video frame images, for example the 1st frame, the 11th frame, the 21st frame, ..., the 81st frame and the 91st frame.
Therefore, the key frame images can be determined from the multi-frame video frame images contained in the panoramic video more easily and conveniently, and the number of the key frame images can meet the requirement of subsequent actual data annotation.
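Illustratively, one possible way of selecting key frame indices at an equal frame-number interval, as in the 100-frame example above, is sketched below in Python; the function name and the use of 0-based indices are illustrative assumptions.

```python
def select_key_frame_indices(num_frames: int, num_key_frames: int) -> list:
    """Pick key frame indices at a fixed frame-number interval,
    e.g. num_frames=100, num_key_frames=10 -> frames 1, 11, 21, ..., 91
    (returned here as 0-based indices 0, 10, 20, ..., 90)."""
    if num_key_frames <= 0 or num_frames <= 0:
        return []
    interval = max(num_frames // num_key_frames, 1)
    return list(range(0, num_frames, interval))[:num_key_frames]

print(select_key_frame_indices(100, 10))  # [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
```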
(C2): in response to a user selection of video frame images in the panoramic video, determining the key frame images in the panoramic video.
In an implementation, when the panoramic video is presented to the user, for example, in response to a selection operation by the user on some of the video frames in the panoramic video, the video frames selected by the user are used as key frame images in the panoramic video.
Illustratively, when the panoramic video is presented to the user, a prompt for selecting key frame images may, for example, be displayed. Specifically, the user may select a video frame image in the panoramic video through a specific operation such as a long press or a double click, and the selected video frame image is used as a key frame image. In addition, prompt information may also be presented to the user, for example a message containing the text "please long-press to select a video frame image"; in the case where the user long-presses any video frame image in the panoramic video, that video frame image is taken as a key frame image.
In this way, the data processing device can select key frame images from the panoramic video more flexibly in response to the user's selection. In one possible case, the devices in the target space are not evenly distributed, so that several consecutive video frames of the panoramic video contain no device while the devices are concentrated in other frames; by responding to the manual selection of video frame images, the data processing device can avoid taking frames without devices as key frames, which improves the efficiency of the subsequent data annotation based on the key frames. In another possible case, some video frames may be unclear or their data may be damaged; by responding to the manual selection of video frame images, the data processing device can avoid taking such frames as key frames.
With respect to S103 described above, in the case of determining a key frame image in the panoramic video, the data processing apparatus may perform attribute annotation on an object to be annotated in the key frame image, for example. Specifically, when the data processing apparatus performs attribute labeling on the object to be labeled in the key frame image, for example, the following manners (D1) to (D3) may be adopted:
(D1): displaying the key frame image; and in response to a first annotation operation on the object to be annotated in the key frame image, generating annotation data corresponding to the first annotation operation.
When the data processing device presents the key frame image to the user, for example, the key frame image with the same size (in pixels) as the panoramic video can be presented to the user on a graphical display interface of the data processing device.
Specifically, when the key frame image is presented to the user, for example, the annotation control required when the first annotation operation is performed on the key frame image may also be provided to the user at the same time. Illustratively, referring to fig. 6, a schematic diagram of a graphical display interface when displaying a key frame image and labeling a control according to an embodiment of the present disclosure is provided.
Illustratively, the user can determine the position for data annotation on the graphical display interface through clicking operation. Referring to fig. 6, in the key frame image 61, the user can select a position 62 by a click operation, and accordingly, the relevant annotation data is filled in the data annotation area 63 corresponding to the key frame image 61.
For the data annotation area 63, fig. 6 shows, by way of example, some of the annotation data types it may contain, such as device name, service life, specific function, person in charge of the device, device manufacturer, device size specification, and related text notes. When data corresponding to the different data types are filled in, as shown in the data annotation area 63 in fig. 6, the data can be entered directly as text, for example the text "camera in cabinet 1" in the text input box under the device name, or the text "image acquisition of the area in front of cabinet 1" in the text input box under the specific function. Alternatively, a number of selectable input options may be provided to the user; for example, in response to the user clicking the selection box under the service life field, a pull-down menu containing several different durations is presented, with the options "1 year", "2 years" and "3 years", so that the user can determine the input data for the service life by selecting one of them.
In one possible case, the data annotation area 63 can also include a custom annotation segment, such as "custom annotation segment 1" shown in the data annotation area 63. In response to the user editing the custom annotation segment, a new annotation data type can be determined and a new input box generated, which makes data annotation more flexible. In another possible case, because the content that can be displayed on the graphical display interface is limited, a slider bar 64 is included in the data annotation area 63. In response to the user sliding the screen up and down on the graphical display interface, the slider bar 64 also slides up and down, to indicate to the user the current position within the data annotation area. In this way, sliding the screen removes the limitation imposed by the size of the graphical display interface and provides the user with more space for entering annotation data.
In another possible case, the key frame image 61 may also be part of the data annotation region 63, for example. Illustratively, for example, a selection box corresponding to "whether to use the key frame image as annotation data" may be displayed in the data annotation region 63, and the key frame image may be automatically used as annotation data in response to a user's selection operation on the selection box. Therefore, the original key frame image can be reserved in the annotation data, so that whether the annotation data has annotation errors or not can be checked back and corrected in time by calling the key frame image.
In this way, the data processing device may respond to the first annotation operation of the user to generate the annotation data corresponding to the first annotation operation.
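Illustratively, the annotation data produced by such a first annotation operation could be organized as sketched below; the field names mirror the annotation data types shown for the data annotation area 63, and the Python class name and types are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class AnnotationRecord:
    """One annotation entry produced by the first annotation operation.
    Field names follow the data types shown in the data annotation area
    (device name, service life, specific function, person in charge,
    manufacturer, size specification, text notes); they are illustrative."""
    key_frame_index: int                 # which key frame the entry belongs to
    position: tuple                      # (x, y) pixel position selected by clicking
    device_name: str = ""
    service_life: str = ""
    specific_function: str = ""
    person_in_charge: str = ""
    manufacturer: str = ""
    size_specification: str = ""
    text_notes: str = ""
    custom_fields: Dict[str, str] = field(default_factory=dict)  # user-defined segments
    key_frame_as_annotation: bool = False  # "use the key frame image as annotation data"

record = AnnotationRecord(
    key_frame_index=0,
    position=(320, 240),
    device_name="camera in cabinet 1",
    specific_function="image acquisition of the area in front of cabinet 1",
)
record.custom_fields["custom annotation segment 1"] = "example value"
```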
(D2): performing semantic segmentation processing on the key frame image, and generating annotation data of the object to be annotated based on the result of the semantic segmentation processing.
When the data processing device performs semantic segmentation processing on the key frame image, the algorithms it may adopt include, for example, at least one of the following: Convolutional Neural Networks (CNNs) and Transformer networks based on self-attention (Transformers).
After performing semantic segmentation processing on the key frame image, for example, a plurality of devices in the key frame image may be identified, and specific types of the plurality of devices may be determined. Specifically, for example, the description of the semantic segmentation processing in (D3) below may be referred to, and will not be described here.
After determining the specific types of the multiple devices, the relevant information of the devices, such as at least one of the device names, device service lives, device specific functions, device responsible persons, device manufacturers, device dimensions, and relevant text notes described above, may be retrieved from the corresponding database according to the specific types of the devices, and the determined relevant information is used as the labeling data of the object to be labeled.
Therefore, the data processing equipment can more efficiently label the data of the equipment and the like in the key frame image in a semantic segmentation mode without manual intervention, and the efficiency is higher.
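Illustratively, the following Python sketch shows one possible way of turning a semantic segmentation result into annotation data by looking up device attributes in a database, as described above; the segmentation interface, the database contents and the field names are assumptions and are not part of this embodiment.

```python
from typing import Callable, Dict, List

# Hypothetical attribute database keyed by device type (an assumption;
# in practice the attributes could come from the asset platform).
DEVICE_DB: Dict[str, Dict[str, str]] = {
    "cabinet_camera": {"device_name": "cabinet camera",
                       "manufacturer": "vendor A",
                       "specific_function": "monitor the area in front of the cabinet"},
    "cabinet": {"device_name": "cabinet", "manufacturer": "vendor B"},
}

def annotate_by_segmentation(key_frame,
                             segment: Callable[[object], List[Dict]]) -> List[Dict]:
    """Run semantic segmentation on the key frame image and turn each
    recognized device type into annotation data looked up from DEVICE_DB.

    `segment` is assumed to return a list of dicts like
    {"type": "cabinet_camera", "bbox": (x1, y1, x2, y2)}.
    """
    annotations = []
    for obj in segment(key_frame):
        attrs = DEVICE_DB.get(obj["type"], {})
        annotations.append({"bbox": obj["bbox"], "type": obj["type"], **attrs})
    return annotations

# Usage with a stubbed segmentation function.
fake_segment = lambda img: [{"type": "cabinet_camera", "bbox": (10, 10, 60, 60)}]
print(annotate_by_segmentation(None, fake_segment))
```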
(D3): generating a preview image based on the key frame image, and displaying the preview image, wherein the resolution of the preview image is lower than the resolution of the key frame image; in response to a second annotation operation on the object to be annotated in the preview image, generating annotation data corresponding to the preview image; and obtaining the annotation data of the key frame image based on the annotation data corresponding to the preview image.
The key frame image is relatively large, while the user does not need high image definition when performing data annotation. Therefore, when a key frame image in the panoramic video is shown to the user, its size can be reduced appropriately: a preview image corresponding to the key frame image is generated and shown to the user directly, so that the user can still clearly recognize each device in the preview image through the graphical display interface. This reduces the amount of data to be transmitted and processed after data annotation is performed on the preview image, and thus improves transmission efficiency and data processing efficiency.
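Illustratively, generating a lower-resolution preview image from a key frame image could be done as sketched below using OpenCV; the scale factor of 0.25 is only an example value, to be chosen so that every device remains clearly recognizable.

```python
import cv2

def make_preview(key_frame_path: str, preview_path: str, scale: float = 0.25) -> None:
    """Generate a lower-resolution preview image from a key frame image."""
    img = cv2.imread(key_frame_path)
    if img is None:
        raise FileNotFoundError(key_frame_path)
    # Downscale both dimensions by the same factor; INTER_AREA suits shrinking.
    preview = cv2.resize(img, None, fx=scale, fy=scale, interpolation=cv2.INTER_AREA)
    cv2.imwrite(preview_path, preview)
```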
In a possible case, the data processing apparatus may respond to the second annotation operation performed by the user on the object to be annotated in the preview image in a manner similar to (D1) above, which is not repeated here. In another possible case, the data processing apparatus may also combine the semantic segmentation processing of (D2) above when handling the second annotation operation. Illustratively, referring to fig. 7, a schematic diagram of a graphical display interface for displaying a preview image according to an embodiment of the present disclosure is provided.
Illustratively, a preview image 71 corresponding to the key frame image may be shown in fig. 7, for example. After the data processing device performs semantic segmentation processing on the key frame image in the manner described in (D2) above, it can determine the antenna in the area 72, the set of devices in the area 73, and the set of cabinets in the area 74 in the preview image 71. Further, the data processing apparatus may also identify in the area 73 a group of apparatuses including the apparatus 1, the apparatus 2, and the apparatus 3; a set of cabinets identified in area 74 includes cabinet 1, cabinet 2, and cabinet 3. In addition, for multiple cabinets in the area 74, the data processing apparatus may also determine cameras mounted on different cabinets through semantic segmentation processing, such as a camera 741 on the cabinet 1, a camera 742 on the cabinet 2, and a camera 743 on the cabinet 3.
After obtaining the semantic segmentation result, the data processing device may, in response to a selection operation by the user on any of these areas, display the selected device name in the identification annotation area 75 on the right side of the preview image 71. Taking the case where the selected area is the area 741 (partially shaded in the figure for identification), since the device in the selected area 741 has been determined by semantic segmentation, the semantic segmentation result "cabinet 1 camera" corresponding to the area 741 can be displayed directly in the text box under "selected device name".
In this way, in one possible case, the user can re-check the object currently being annotated through the preview image 71 and the selected device name in the corresponding identification annotation area 75, and the data processing device can respond to the user's checking operation to reduce data annotation errors; in another possible case, if the semantic segmentation fails to produce a correct recognition result, the annotation can be adjusted in response to the user modifying the selected device name, to ensure the correctness of the data annotation.
In addition, the identification annotation area 75 also contains other annotation data types similar to those in the data annotation area 63 of fig. 6, and a slider bar 76 is correspondingly provided; reference may be made to the corresponding description of fig. 6, which is not repeated here.
Here, since the panoramic video obtained by image acquisition of the target space contains the key frame images on which attribute annotation can be performed, the annotation of the key frame images by the data processing device can also be regarded as a process of performing attribute annotation on the panoramic video.
After the key frame image is subjected to data annotation, the attribute annotation data of the panoramic video can be generated according to the annotation data obtained based on the attribute annotation. Specifically, for example, the following manner may be adopted: for each frame of video frame image in the panoramic video, in response to the frame of video frame image not being a key frame image, determining a target key frame image matched with the frame of video frame image from the key frame images; and generating the annotation information of the frame of video frame image based on the annotation information of the target key frame image.
Specifically, for the multiple video frame images contained in the panoramic video, the data processing apparatus may determine the specific positions of the key frames among them. For convenience of description, assume that the panoramic video contains 10 video frame images, of which the 1st, 4th and 8th frames are key frame images. The video frame images that are not key frame images then comprise the 2nd frame, the 3rd frame, the 5th to 7th frames, the 9th frame and the 10th frame, that is, 7 video frame images in total.
In a specific implementation, when determining a target key frame image for a video frame image other than a key frame image, for example, the following manner may be adopted: and determining the target key frame image matched with the frame video frame image in the key frame images based on the first position of the key frame image in the panoramic video and the second position of the frame video frame image in the panoramic video.
Illustratively, the panoramic video including 10 frames of video frame images is described as an example. The first position of the key frame image in the panoramic video may be represented by, for example, the frame number of the key frame image in the panoramic video, for example, in a 10-frame video frame image, the first position corresponding to the 1 st frame key frame image is the 1 st frame. Similarly, the second position of the video frame image of the non-key frame in the panoramic video can also be directly characterized by the frame number of the video frame image in the panoramic video, for example, in the 10 frame video frame image, the second position corresponding to the 2 nd frame video frame image is the 2 nd frame.
After determining the location of each frame of video frame image in the panoramic video, the data processing device may determine a target key frame image for a video frame image that is not a key frame image corresponding thereto. Specifically, the data processing apparatus may take a key frame image of the latest frame positioned before the frame of video frame image as a target key frame image to which the frame of video frame image corresponds.
For example, when determining a target key frame image for a 3 rd frame video frame image, the target key frame image may be determined in a key frame image determined before the 3 rd frame video frame image. Here, the 3 rd frame video frame image only includes one key frame image before, that is, the 1 st frame key frame image, and accordingly the 1 st frame key frame image is used as the target key frame image corresponding to the 3 rd frame video frame image. When determining the target key frame image for the 9 th frame video frame image, the target key frame image may be determined among key frame images determined before the 9 th frame video frame image. Here, the 9 th frame of video frame image includes three key frame images, wherein the key frame image adjacent to the 9 th frame of video frame image is the 8 th frame of key frame image, and the 8 th frame of key frame image is correspondingly used as the target key frame image corresponding to the 9 th frame of video frame image.
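Illustratively, selecting the nearest preceding key frame image as the target key frame image can be sketched as follows in Python, using the 10-frame example with key frames 1, 4 and 8; the function name is illustrative.

```python
import bisect

def target_key_frame(frame_idx: int, key_frame_indices: list) -> int:
    """Return the latest key frame whose position is not after the given
    video frame image, i.e. the nearest preceding key frame."""
    pos = bisect.bisect_right(key_frame_indices, frame_idx) - 1
    if pos < 0:
        raise ValueError("no key frame precedes this video frame image")
    return key_frame_indices[pos]

key_frames = [1, 4, 8]
print(target_key_frame(3, key_frames))  # 1 -> the 1st frame is the target key frame
print(target_key_frame(9, key_frames))  # 8 -> the 8th frame is the target key frame
```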
Here, in a possible case, when determining the key frame images in the panoramic video, a video frame image in which a device appears for the first time in the panoramic video may be used as a key frame image. Referring to fig. 8, a schematic diagram of multiple video frame images provided in an embodiment of the present disclosure shows consecutive 1st to 6th video frame images, in which the 1st video frame image is used as the first key frame image in the panoramic video, and the 4th video frame image is used as the second key frame image in the panoramic video.
Specifically, since the 1 st frame video frame image shows the device in the target space, that is, the device 1, for the first time, the 1 st frame video frame image is taken as the first frame key frame image in the panoramic video. Then, the device 1 is shown in each of the 1 st frame video frame image to the 3 rd frame video frame image, and the device 2 is shown in the 4 th frame video frame image, so that the 4 th frame video frame image is taken as the second frame key frame image in the panoramic video. Similarly, the device 2 is shown in each of the 4 th frame video frame image to the 7 th frame video frame image (this portion is not shown in fig. 8, and only the description is given), and the device 3 is shown in the 8 th frame video frame image, so that the 8 th frame video frame image is taken as the third frame key frame image in the panoramic video.
Therefore, for a video frame image that is not a key frame image in the panoramic video, the devices contained in the nearest preceding key frame image can be considered the closest to the devices shown in that video frame image; for example, the 7th video frame image contains the device 2 first shown in the 4th video frame image, but does not contain the device 3 newly shown in the nearest following key frame image (that is, the 8th video frame image). In other words, when the 1st, 4th and 8th frames are used as key frame images for annotating the other video frame images in the panoramic video, taking the nearest key frame image preceding a non-key video frame image as its target key frame image allows the annotation data to be synchronized to the non-key video frame images more accurately in the subsequent steps.
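Illustratively, choosing key frames as the video frame images in which a device is shown for the first time could be sketched as follows; the per-frame device detections are assumed to be available (for example from the semantic segmentation described above) and their source is not specified by this embodiment.

```python
from typing import List, Set

def key_frames_by_first_appearance(detections_per_frame: List[Set[str]]) -> List[int]:
    """Pick as key frames those video frame images in which a device is shown
    for the first time in the panoramic video.

    detections_per_frame[i] is the set of device identifiers detected in frame i."""
    seen: Set[str] = set()
    key_frames: List[int] = []
    for idx, devices in enumerate(detections_per_frame):
        new_devices = devices - seen
        if new_devices:          # a device appears here for the first time
            key_frames.append(idx)
            seen |= new_devices
    return key_frames

# Frames 1-6 of fig. 8 (0-based): device 1 appears in frames 0-2, device 2 from frame 3.
frames = [{"device 1"}, {"device 1"}, {"device 1"},
          {"device 1", "device 2"}, {"device 2"}, {"device 2"}]
print(key_frames_by_first_appearance(frames))  # [0, 3]
```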
Specifically, when the data processing device generates the annotation information of a non-key video frame image based on the annotation information of its target key frame image, the data annotated in the target key frame image may, for example, be synchronized to the other video frame images by tracking key points.
Fig. 9 is a schematic diagram illustrating data annotation of a non-key frame image according to an embodiment of the disclosure. For the 1 st video frame image, for example, a plurality of keypoints, such as the corresponding keypoint 911 at the doorknob position of the device 1 or the corresponding keypoint 912 at one bottom corner of the device 1, are determined for the device 1 shown therein. Accordingly, since the device 1 is also shown in the 2 nd video frame image, the keypoint 921 and/or the keypoint 922 corresponding to the device 1 shown in the 2 nd video frame image can be determined when the keypoint 911 or the keypoint 912 is subjected to keypoint tracking.
In this way, once data annotation of a key frame image (e.g. the 1st video frame image shown in fig. 9) is completed, the data can be synchronized to the video frame images other than the key frame image (e.g. the 2nd video frame image shown in fig. 9) by key point tracking. In a possible case, a device may need to be annotated at a finer level of detail; in the preceding annotation step, for example, a component area on the device may be annotated, and the same component area of the same device in the other video frame images is then annotated by key point tracking, so that this finer annotation can also be completed during data annotation.
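Illustratively, the key point tracking used to synchronize annotation data from a key frame image to a neighboring video frame image could be realized, for example, with Lucas-Kanade optical flow from OpenCV, as sketched below; this particular tracker and the function signature are assumptions of this description, not a requirement of the embodiment.

```python
import cv2
import numpy as np

def propagate_keypoints(prev_frame: np.ndarray,
                        next_frame: np.ndarray,
                        keypoints_xy: np.ndarray) -> np.ndarray:
    """Track annotated key points (e.g. the door-handle corner of device 1)
    from an annotated key frame into the following video frame image.

    keypoints_xy -- N x 2 float32 array of (x, y) pixel positions.
    Returns the tracked positions; points that failed to track are dropped."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)
    pts = keypoints_xy.reshape(-1, 1, 2).astype(np.float32)
    new_pts, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts, None)
    return new_pts.reshape(-1, 2)[status.ravel() == 1]
```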
After the synchronization of the labeling data of the video frame images of all the non-key frame images in the panoramic video is completed, the panoramic video with the data labeling completed can be obtained.
In another embodiment of the present disclosure, a specific embodiment of data annotation is also provided. Referring to fig. 10, a flowchart of a specific embodiment of the present disclosure is provided when performing data annotation, where:
S1001: opening the data annotation environment.
Specifically, the data annotation method provided by the embodiment of the present disclosure may be applied to an Application (APP). After the user opens the APP, the data annotation environment can be correspondingly opened, for example, an entrance for data acquisition and annotation is provided; after the portal is started, a Software Development Kit (SDK) is called.
S1002: the data processing device acquires the site data and determines the collection task list corresponding to the site.
After the SDK is called, the site data can be acquired accordingly. A site is, for example, a location where many devices are installed; it may include, but is not limited to, at least one machine room, and at least one of the indoor control cabinet equipment deployed in the machine room, a tower installed on the roof of the machine room, and an outdoor control cabinet.
Specifically, for different sites, an identifier (ID) corresponding to the site may serve as its unique identification. The data processing device determines the site on which data annotation is to be performed, for example by having this identifier entered into it or by scanning a two-dimensional code of the site, and transmits the data required by the asset platform and the generation platform corresponding to the site so as to create the current collection task list. When data annotation is performed for the site, the data to be annotated can be determined according to the relevant tasks in the collection task list.
S1003: the data processing device requests the attribute platform over the network to determine the latest data annotation attribute package.
Specifically, after the current collection task list is created, the data processing device may first request the attribute platform over the network to determine whether the data annotation attribute package has been updated. When an updated data annotation attribute package exists, the latest package may, for example, be downloaded by way of an APP update, so that the annotation attributes and other data in the package can be called later in the specific data annotation process. When no updated data annotation attribute package exists, S1004 can be executed, that is, the collection process proceeds normally.
S1004: the data processing device acquires a panoramic video.
For a specific description of this step, reference may be made to the above description of S101, and details are not repeated here.
S1005: the data processing device performs data annotation on the video frame images contained in the panoramic video by using the data annotation attribute package, to obtain the attribute annotation data of the panoramic video.
For a specific description of this step, reference may be made to the above description of S102 and S103, and details are not repeated here again.
Here, in steps S1004 and S1005, image acquisition and data annotation do not depend on a network connection: the image acquisition device can capture images in an offline state, and the attribute annotation can likewise be performed in an offline state by using the data annotation attribute package.
S1006: the data processing device confirms whether the attribute annotation data is correct; if yes, step S1007 is executed; if not, step S1008 is executed.
S1007: the data processing device uploads the attribute annotation data.
When the attribute annotation data is uploaded, the device may, for example, wait for a network connection and upload the attribute annotation data in sequence once the connection is established. For the specific manner of uploading the attribute annotation data, reference may be made to the description of S104 below, which is not detailed here.
S1008: the data processing equipment determines whether an image acquisition error exists; if yes, return to step S1004; if not, the process returns to step S1005.
For S104 above, in another embodiment of the present disclosure, the pose of the image acquisition device when capturing the panoramic video may also be obtained. In that case, when generating the target acquisition data based on the attribute annotation data and the panoramic video, for example, the following manner may be adopted: generating the target acquisition data based on the attribute annotation data, the panoramic video and the pose.
Specifically, the attribute annotation data, the panoramic video and the pose may, for example, be combined according to their timestamps into corresponding target acquisition data for the different timestamps; alternatively, the attribute annotation data, the panoramic video and the pose may be packed directly according to their timestamps into a final collected data package to be submitted, and this collected data package is used as the target acquisition data.
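Illustratively, packing the attribute annotation data, the panoramic video and the poses by timestamp into one collected data package could look like the following sketch; the file layout, field names and archive format are assumptions for illustration only.

```python
import json
import tarfile
from pathlib import Path

def pack_target_collection_data(panoramic_video: str,
                                annotations: dict,
                                poses: dict,
                                out_path: str = "collected_data.tar.gz") -> str:
    """Pack the attribute annotation data, the panoramic video and the camera
    poses into one collected data package keyed by timestamp.

    annotations / poses -- dicts mapping a frame timestamp to the annotation
    data / pose recorded at that timestamp (layout is illustrative)."""
    manifest = {
        "panoramic_video": Path(panoramic_video).name,
        "frames": [
            {"timestamp": ts,
             "annotation": annotations.get(ts),
             "pose": poses.get(ts)}
            for ts in sorted(set(annotations) | set(poses))
        ],
    }
    manifest_path = Path("manifest.json")
    manifest_path.write_text(json.dumps(manifest, indent=2))
    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(panoramic_video, arcname=Path(panoramic_video).name)
        tar.add(manifest_path, arcname="manifest.json")
    return out_path
```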
In addition, in a possible case, the panoramic video may contain a large number of video frames and the video frame images may be large, so the amount of data to be transmitted is also large; if the entire panoramic video were uploaded, the upload would take a long time and transmission efficiency would be low. Therefore, frame extraction can be performed on the acquired original panoramic video, and the video data remaining after part of the frames have been extracted is used as part of the target acquisition data, which effectively reduces the amount of data to be transmitted and improves data transmission efficiency.
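Illustratively, frame extraction before uploading could be sketched as follows with OpenCV, keeping only every fifth frame of the panoramic video; the interval value of 5 is an example.

```python
import cv2

def extract_frames(video_path: str, out_path: str, keep_every: int = 5) -> int:
    """Write a reduced copy of the panoramic video that keeps only every
    `keep_every`-th frame, so less data has to be uploaded."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"),
                             fps / keep_every, (width, height))
    kept, idx = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % keep_every == 0:
            writer.write(frame)
            kept += 1
        idx += 1
    cap.release()
    writer.release()
    return kept
```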
In another embodiment of the present disclosure, a specific embodiment is also provided in which a panoramic camera and a mobile phone are used to perform data annotation on a machine room. In this embodiment, to make the respective processing on the panoramic camera side and the mobile phone side easier to follow, the specific operation steps corresponding to the two ends during data annotation are presented in the description and in the drawings. The data annotation method provided by the embodiment of the present disclosure, as applied in this embodiment, can be divided into three stages: stage I, a data acquisition stage; stage II, a data annotation stage; and stage III, a data uploading stage. Referring to fig. 11, a flowchart corresponding to an embodiment of annotating data in a machine room according to an embodiment of the present disclosure is shown, wherein:
Stage I, the data acquisition stage, comprises the following steps S1101-S1124, wherein:
S1101: the mobile phone starts the data annotation environment.
S1102: the mobile phone is connected with the panoramic camera.
S1103: the panoramic camera is connected with the mobile phone.
S1104: and the mobile phone performs gyroscope calibration on the panoramic camera.
S1105: and starting collection by the mobile phone.
S1106: the panoramic camera collects panoramic videos in real time.
S1107: the panoramic camera continuously generates data.
In this step, for example, a preview panoramic video acquired in real time may be generated, and the pose of the panoramic camera acquired accordingly, that is, data a shown in the figure: the preview panoramic video and pose acquired in real time. In addition, a panoramic video may be generated and the pose of the panoramic camera acquired accordingly, that is, data b shown in the figure: the panoramic video and pose.
S1108: the panoramic camera stops capturing.
S1109: the panoramic camera transmits data to the mobile phone through network transmission.
In this step, the panoramic camera may, for example, transmit the synchronized preview image and the pose of the panoramic camera to the mobile phone, that is, data c shown in the figure: the synchronized preview image and pose.
S1110: and the mobile phone completes the SLAM real-time calculation.
S1111: and the mobile phone completes the reconstruction of the three-dimensional scene model.
S1112: and generating a visual reconstruction result in real time by the mobile phone.
S1113: the mobile phone determines whether to preview on a graphical display interface during acquisition; if yes, go to step S1114, otherwise go to step S1115.
In this step and the other steps described below, the graphical display interface refers to the graphical display interface of the mobile phone.
S1114: and the graphical display interface displays the real-time preview result.
S1115: the graphical display interface displays other content.
In this step, the graphical display interface may display, for example, a prompt message, such as a text message "please determine whether to present the preview image", and accordingly provide a control for viewing the preview image.
S1116: the mobile phone determines to finish the acquisition.
For step S1108 above, after capture is stopped, a captured preview video may be generated accordingly, that is, data d shown in the figure: the preview video after acquisition is finished. A preview video set is determined accordingly from the captured preview video, that is, data e shown in the figure: the preview video set after acquisition is finished.
In addition, the panoramic video and the pose of the panoramic camera obtained after acquisition is finished can be acquired, that is, data f shown in the figure: the panoramic video and pose after acquisition is finished. Correspondingly, the panoramic video set and the pose of the panoramic camera after acquisition is finished are determined from them, that is, data g shown in the figure: the panoramic video set and pose after acquisition is finished.
In addition, the synchronized preview video may also be acquired, that is, data h shown in the figure: the synchronized preview video. A synchronized preview video set is obtained accordingly from this preview video, that is, data i shown in the figure: the synchronized preview video set.
S1117: the panoramic camera transmits data f, the captured panoramic video and pose, to the mobile phone over the network.
Here, after data f has been transmitted to the mobile phone over the network, the synchronized panoramic video and pose can be obtained on the mobile phone, that is, data j shown in the figure: the synchronized panoramic video and pose.
S1118: the panoramic camera transmits data g, the captured panoramic video set and pose, to the mobile phone over the network.
Here, after data g has been transmitted to the mobile phone over the network, the synchronized panoramic video set and pose can be obtained on the mobile phone, that is, data k shown in the figure: the synchronized panoramic video set and pose.
S1119: the mobile phone determines whether the mode of manually determining the complementary acquisition region is selected; if yes, step S1121 is executed; if not, step S1120 is executed.
S1120: the mobile phone automatically detects whether a complementary acquisition region exists in the three-dimensional scene model, and then jumps to step S1123.
S1121: the three-dimensional scene model is displayed on the graphical display interface of the mobile phone.
S1122: the mobile phone determines the complementary acquisition region in response to a manual confirmation operation.
S1123: the mobile phone confirms whether complementary acquisition is needed; if yes, the flow returns to step S1105; if not, step S1125 in stage II, the data annotation stage, is executed.
Stage II, the data annotation stage, comprises the following steps S1125-S1128, wherein:
S1125: the mobile phone determines key frame images from the preview video set.
S1126: and the mobile phone carries out data annotation on the key frame image.
S1127: the mobile phone determines to finish the data annotation of the key frame image.
S1128: and the mobile phone synchronizes the annotation data on the key frame image to the panoramic video.
In this step, annotation data matching the panoramic video can be generated accordingly, that is, data l shown in the figure: the annotation data matching the panoramic video.
Here, after data l, the annotation data matching the panoramic video, has been generated, stage III, the data uploading stage, can be entered accordingly at step S1129; after S1128 is completed, step S1130 can be executed accordingly.
Stage III, the data uploading stage, comprises the following steps S1129-S1134, wherein:
S1129: the mobile phone modifies the file name according to the site and/or the time.
The file described here may be, for example, a file storing the annotation data, the site information, the panoramic video, and the like.
S1130: and the mobile phone selects the files to be uploaded.
S1131: and the mobile phone performs frame extraction on the panoramic video.
In this step, for example, data m shown in the figure can be obtained: the panoramic video after frame extraction and the pose; and data n: the annotation data matching the panoramic video and a preview video package.
S1132: and packaging all files by the mobile phone.
S1133: and checking the uploading progress.
S1134: and finishing uploading.
It will be understood by those skilled in the art that, in the method of the present disclosure, the order in which the steps are written does not imply a strict order of execution or impose any limitation on the implementation; the specific order of execution of the steps should be determined by their functions and possible internal logic.
Based on the same inventive concept, a data labeling device corresponding to the data labeling method is also provided in the embodiments of the present disclosure, and as the principle of solving the problem of the device in the embodiments of the present disclosure is similar to the data labeling method in the embodiments of the present disclosure, the implementation of the device can refer to the implementation of the method, and repeated details are not repeated.
Referring to fig. 12, a schematic diagram of a data annotation apparatus provided in an embodiment of the present disclosure is shown. The apparatus includes: a first obtaining module 121, a determining module 122, a processing module 123, and a generating module 124, wherein:
a first obtaining module 121, configured to obtain a panoramic video obtained by performing image acquisition on a target space by using an image acquisition device; a determining module 122, configured to determine, from the panoramic video, a key frame image including an object to be annotated; the processing module 123 is configured to generate attribute annotation data of the panoramic video based on annotation data obtained by attribute annotation in response to attribute annotation on the object to be annotated in the key frame image; a generating module 124, configured to generate target acquisition data based on the attribute labeling data and the panoramic video.
In an alternative embodiment, the image acquisition device comprises: a panoramic camera; the first obtaining module 121, when obtaining a panoramic video obtained by performing image acquisition on a target space by using an image acquisition device, is configured to: controlling the panoramic camera to perform panoramic image acquisition on the target space to obtain a first panoramic video; determining that a complementary acquisition area to be subjected to complementary acquisition exists in the target space based on the first panoramic video and the pose of the panoramic camera when the first panoramic video is acquired; and controlling the panoramic camera to perform complementary acquisition on the target space based on the complementary acquisition area to obtain a second panoramic video.
In an optional implementation manner, when determining that there is a complementary acquisition area to be subjected to complementary acquisition in the target space based on the first panoramic video and the pose of the panoramic camera when collecting the first panoramic video, the first obtaining module 121 is configured to: performing three-dimensional reconstruction on the target space based on the first panoramic video and the pose of the panoramic camera when the first panoramic video is collected, and generating a three-dimensional scene model of the target space; and determining that a complementary acquisition region to be subjected to complementary acquisition exists in the target space based on the three-dimensional scene model.
In an optional implementation manner, when controlling the panoramic camera to perform complementary acquisition on the target space based on the complementary acquisition area to obtain the second panoramic video, the first obtaining module 121 is configured to: controlling the panoramic camera to perform complementary acquisition on the complementary acquisition area based on the current pose of the panoramic camera in the target space and the position of the complementary acquisition area in the three-dimensional scene model, to obtain the second panoramic video.
In an optional implementation manner, when determining, based on the three-dimensional scene model, that a complementary acquisition region to be subjected to complementary acquisition exists in the target space, the first obtaining module 121 is configured to: detecting whether an area which is not completely modeled exists in the three-dimensional scene model based on the three-dimensional position information of each dense point cloud point of the three-dimensional scene model in the target space; and if an area which is not completely modeled exists, determining that area as the complementary acquisition area.
In an optional implementation manner, when determining, based on the three-dimensional scene model, that a complementary acquisition region to be subjected to complementary acquisition exists in the target space, the first obtaining module 121 is configured to: displaying the three-dimensional scene model; and in response to the triggering of any region in the three-dimensional scene model by a user, determining the triggered region as the complementary acquisition region.
In an optional implementation manner, when performing attribute annotation on an object to be annotated in the key frame image, the processing module 123 is configured to: displaying the key frame image; and responding to a first labeling operation on the object to be labeled in the key frame image, and generating labeling data corresponding to the first labeling operation.
In an optional implementation manner, when performing attribute annotation on an object to be annotated in the key frame image, the processing module 123 is configured to: and performing semantic segmentation processing on the key frame image, and generating annotation data of the object to be annotated based on the result of the semantic segmentation processing.
In an optional implementation manner, when performing attribute annotation on an object to be annotated in the key frame image, the processing module 123 is configured to: generating a preview image based on the key frame image, and displaying the preview image; wherein the preview image has a resolution lower than the resolution of the key frame image; responding to a second labeling operation on the object to be labeled in the preview image, and generating labeling data corresponding to the preview image; and obtaining the annotation data of the key frame image based on the annotation data corresponding to the preview image.
In an optional embodiment, when generating the attribute annotation data of the panoramic video based on the annotation data obtained by attribute annotation, the processing module 123 is configured to: for each frame of video frame image in the panoramic video, in response to the frame of video frame image not being a key frame image, determining a target key frame image matched with the frame of video frame image from the key frame images; and generating the annotation information of the frame of video frame image based on the annotation information of the target key frame image.
In an alternative embodiment, the processing module 123, when determining the target key frame image for the frame of video frame image from the key frame images, is configured to: and determining the target key frame image matched with the frame video frame image in the key frame images based on the first position of the key frame image in the panoramic video and the second position of the frame video frame image in the panoramic video.
In an optional embodiment, the data annotation apparatus further includes a second obtaining module 125, configured to: acquiring the pose of the image acquisition equipment when acquiring the panoramic video; the generating module 124, when generating target acquisition data based on the attribute labeling data and the panoramic video, is configured to: and generating the target acquisition data based on the attribute labeling data, the panoramic video and the pose.
The description of the processing flow of each module in the device and the interaction flow between the modules may refer to the related description in the above method embodiments, and will not be described in detail here.
An embodiment of the present disclosure further provides a computer device, as shown in fig. 13, which is a schematic structural diagram of the computer device provided in the embodiment of the present disclosure, and includes:
a processor 10 and a memory 20; the memory 20 stores machine-readable instructions executable by the processor 10, the processor 10 being configured to execute the machine-readable instructions stored in the memory 20, the processor 10 performing the following steps when the machine-readable instructions are executed by the processor 10:
acquiring a panoramic video obtained by utilizing image acquisition equipment to acquire an image of a target space; determining key frame images from the panoramic video; responding to attribute labeling of the object to be labeled in the key frame image, and generating attribute labeling data of the panoramic video based on labeling data obtained by attribute labeling; and generating target acquisition data based on the attribute labeling data and the panoramic video.
The memory 20 includes an internal memory 210 and an external memory 220; the internal memory 210 temporarily stores operation data of the processor 10 and data exchanged with the external memory 220 such as a hard disk, and the processor 10 exchanges data with the external memory 220 through the internal memory 210.
The specific execution process of the instruction may refer to the steps of the data labeling method described in the embodiments of the present disclosure, and details are not repeated here.
The embodiments of the present disclosure also provide a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program performs the steps of the data annotation method in the above method embodiments. The storage medium may be a volatile or non-volatile computer-readable storage medium.
The embodiments of the present disclosure also provide a computer program product, where the computer program product carries a program code, and instructions included in the program code may be used to execute the steps of the data labeling method in the foregoing method embodiments, which may be referred to specifically in the foregoing method embodiments, and are not described herein again.
The computer program product may be implemented by hardware, software or a combination thereof. In an alternative embodiment, the computer program product is embodied in a computer storage medium, and in another alternative embodiment, the computer program product is embodied in a Software product, such as a Software Development Kit (SDK), or the like.
The disclosure relates to the field of augmented reality, and aims to detect or identify relevant features, states and attributes of a target object by means of various visual correlation algorithms by acquiring image information of the target object in a real environment, so as to obtain an AR effect combining virtual and reality matched with specific applications. For example, the target object may relate to a face, a limb, a gesture, an action, etc. associated with a human body, or a marker, a marker associated with an object, or a sand table, a display area, a display item, etc. associated with a venue or a place. The vision-related algorithms may involve visual localization, SLAM, three-dimensional reconstruction, image registration, background segmentation, key point extraction and tracking of objects, pose or depth detection of objects, and the like. The specific application can not only relate to interactive scenes such as navigation, explanation, reconstruction, virtual effect superposition display and the like related to real scenes or articles, but also relate to special effect treatment related to people, such as interactive scenes such as makeup beautification, limb beautification, special effect display, virtual model display and the like. The detection or identification processing of the relevant characteristics, states and attributes of the target object can be realized through the convolutional neural network. The convolutional neural network is a network model obtained by performing model training based on a deep learning framework.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. In the several embodiments provided in the present disclosure, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that: the above-mentioned embodiments are merely specific embodiments of the present disclosure, which are used for illustrating the technical solutions of the present disclosure and not for limiting the same, and the scope of the present disclosure is not limited thereto, and although the present disclosure is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive of the technical solutions described in the foregoing embodiments or equivalent technical features thereof within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present disclosure, and should be construed as being included therein. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (15)

1. A method for annotating data, comprising:
acquiring a panoramic video obtained by utilizing image acquisition equipment to acquire an image of a target space;
determining a key frame image comprising an object to be marked from the panoramic video;
responding to attribute labeling of the object to be labeled in the key frame image, and generating attribute labeling data of the panoramic video based on labeling data obtained by attribute labeling;
and generating target acquisition data based on the attribute labeling data and the panoramic video.
2. The data annotation method of claim 1, wherein the image capture device comprises: a panoramic camera; the acquiring of the panoramic video acquired by image acquisition of the target space by using the image acquisition equipment comprises the following steps:
controlling the panoramic camera to perform panoramic image acquisition on the target space to obtain a first panoramic video;
determining that a complementary acquisition area to be subjected to complementary acquisition exists in the target space based on the first panoramic video and the pose of the panoramic camera when the first panoramic video is acquired;
and controlling the panoramic camera to perform complementary acquisition on the target space based on the complementary acquisition area to obtain a second panoramic video.
3. The data annotation method of claim 2, wherein the determining that there is a complementary acquisition region to be complementary acquired in the target space based on the first panoramic video and the pose of the panoramic camera at the time of acquiring the first panoramic video comprises:
performing three-dimensional reconstruction on the target space based on the first panoramic video and the pose of the panoramic camera when the first panoramic video is collected, and generating a three-dimensional scene model of the target space;
and determining that a complementary acquisition region to be subjected to complementary acquisition exists in the target space based on the three-dimensional scene model.
4. The data annotation method of claim 2 or 3, wherein the controlling the panoramic camera to perform complementary acquisition on the target space based on the complementary acquisition area to obtain a second panoramic video comprises:
and controlling the panoramic camera to perform complementary acquisition on the complementary acquisition area based on the current pose of the panoramic camera in the target space and the position of the complementary acquisition area in the three-dimensional scene model to obtain the second panoramic video.
5. The data annotation method of claim 3, wherein the determining, based on the three-dimensional scene model, that there is a complementary acquisition region to be acquired in the target space comprises:
detecting whether an area which is not modeled completely exists in the three-dimensional scene model or not based on three-dimensional position information of each dense point cloud point in the three-dimensional scene model in the target space;
and if the area which is not modeled completely exists, determining the area which is not modeled completely as the complementary acquisition area.
6. The data annotation method of claim 3, wherein the determining, based on the three-dimensional scene model, that there is a complementary acquisition region to be acquired in the target space comprises:
displaying the three-dimensional scene model;
and in response to the triggering of any region in the three-dimensional scene model by a user, determining the triggered region as the complementary acquisition region.
7. The data annotation method according to any one of claims 1 to 6, wherein the attribute annotation of the object to be annotated in the key frame image comprises:
displaying the key frame image;
and in response to a first annotation operation on the object to be annotated in the key frame image, generating annotation data corresponding to the first annotation operation.
8. The data annotation method according to any one of claims 1 to 7, wherein the attribute annotation of the object to be annotated in the key frame image comprises:
and performing semantic segmentation processing on the key frame image, and generating annotation data of the object to be annotated based on the result of the semantic segmentation processing.
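A sketch of deriving annotation data from semantic segmentation using an off-the-shelf model; torchvision's DeepLabV3 is only an example backbone, since the claim does not prescribe any particular network. The snippet assumes torchvision 0.13 or newer for the weights argument.

```python
# Illustrative only: per-pixel class ids for one key frame image via a pretrained model.
import torch
from torchvision import transforms
from torchvision.models.segmentation import deeplabv3_resnet50

model = deeplabv3_resnet50(weights="DEFAULT").eval()
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def segment(image_pil):
    """Return a (H, W) tensor of per-pixel class ids for the given key frame image."""
    batch = preprocess(image_pil).unsqueeze(0)
    with torch.no_grad():
        logits = model(batch)["out"]                  # shape (1, num_classes, H, W)
    return logits.argmax(dim=1)[0]
```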
9. The data annotation method according to any one of claims 1 to 8, wherein the attribute annotation of the object to be annotated in the key frame image comprises:
generating a preview image based on the key frame image, and displaying the preview image; wherein the preview image has a resolution lower than the resolution of the key frame image;
in response to a second annotation operation on the object to be annotated in the preview image, generating annotation data corresponding to the preview image;
and obtaining the annotation data of the key frame image based on the annotation data corresponding to the preview image.
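A minimal sketch of the preview workflow: downscale the typically very high resolution panoramic key frame image for display, collect annotations on the preview, then scale the annotation geometry back to full key-frame coordinates. Box-style annotations and the 2048-pixel preview width are illustrative assumptions.

```python
# Sketch of preview generation and mapping preview annotations back to the key frame.
from PIL import Image

def make_preview(key_frame: Image.Image, max_width: int = 2048) -> Image.Image:
    """Downscale the key frame so its width does not exceed max_width."""
    scale = min(1.0, max_width / key_frame.width)
    size = (int(key_frame.width * scale), int(key_frame.height * scale))
    return key_frame.resize(size, Image.BILINEAR)

def preview_box_to_keyframe(box, preview_size, keyframe_size):
    """Map a (x0, y0, x1, y1) box drawn on the preview back to key-frame pixel coordinates."""
    sx = keyframe_size[0] / preview_size[0]
    sy = keyframe_size[1] / preview_size[1]
    x0, y0, x1, y1 = box
    return (x0 * sx, y0 * sy, x1 * sx, y1 * sy)
```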
10. The data annotation method according to any one of claims 1 to 9, wherein the generating attribute annotation data of the panoramic video based on the annotation data obtained by attribute annotation comprises:
for each video frame image in the panoramic video, in response to that video frame image not being a key frame image, determining, from the key frame images, a target key frame image that matches that video frame image;
and generating annotation information of that video frame image based on the annotation information of the target key frame image.
11. The data annotation method of claim 10, wherein determining, from the key frame images, the target key frame image that matches the video frame image comprises:
and determining, from among the key frame images, the target key frame image that matches the video frame image based on a first position of each key frame image in the panoramic video and a second position of the video frame image in the panoramic video.
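Reading the "first position" and "second position" of claims 10 and 11 as frame indices in the panoramic video (an interpretation, not a statement of the claimed method), annotation propagation can be sketched as matching each non-key frame to the nearest key frame in the timeline and copying its annotations:

```python
# Illustrative sketch: nearest-key-frame annotation propagation by frame index.
from bisect import bisect_left

def nearest_key_frame(frame_idx: int, key_frame_indices: list[int]) -> int:
    """Return the key-frame index closest to frame_idx (key_frame_indices must be sorted)."""
    pos = bisect_left(key_frame_indices, frame_idx)
    candidates = key_frame_indices[max(0, pos - 1): pos + 1]
    return min(candidates, key=lambda k: abs(k - frame_idx))

def propagate_annotations(num_frames: int, key_frame_annotations: dict[int, dict]) -> dict[int, dict]:
    """Give every frame an annotation: its own if it is a key frame, else the nearest key frame's."""
    keys = sorted(key_frame_annotations)
    return {
        i: key_frame_annotations[i] if i in key_frame_annotations
        else key_frame_annotations[nearest_key_frame(i, keys)]
        for i in range(num_frames)
    }
```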
12. The data annotation method according to any one of claims 1 to 11, further comprising:
acquiring the pose of the image acquisition equipment when acquiring the panoramic video;
wherein generating the target acquisition data based on the attribute annotation data and the panoramic video comprises:
and generating the target acquisition data based on the attribute annotation data, the panoramic video, and the pose.
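A sketch of packaging the target acquisition data: the panoramic video reference, the per-frame attribute annotations, and the capture poses are bundled into one record. The JSON layout below is an assumption made only to illustrate keeping the three parts aligned per frame; any serialisation that preserves this alignment would serve.

```python
# Illustrative packaging of target acquisition data (assumed layout).
import json

def build_target_acquisition_data(video_path, frame_annotations, frame_poses, out_path):
    """frame_annotations: {frame_idx: annotation dict}; frame_poses: {frame_idx: 4x4 pose as nested lists}."""
    record = {
        "panoramic_video": video_path,
        "frames": [
            {
                "index": idx,
                "annotations": frame_annotations.get(idx, {}),
                "pose": frame_poses.get(idx),
            }
            for idx in sorted(set(frame_annotations) | set(frame_poses))
        ],
    }
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump(record, fh, ensure_ascii=False, indent=2)
```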
13. A data annotation device, comprising:
a first acquisition module, configured to acquire a panoramic video obtained by capturing images of a target space with image acquisition equipment;
a determining module, configured to determine, from the panoramic video, a key frame image comprising an object to be annotated;
a processing module, configured to, in response to attribute annotation of the object to be annotated in the key frame image, generate attribute annotation data of the panoramic video based on the annotation data obtained by the attribute annotation;
and a generating module, configured to generate target acquisition data based on the attribute annotation data and the panoramic video.
14. A computer device, comprising: a processor and a memory storing machine-readable instructions executable by the processor; the processor is configured to execute the machine-readable instructions stored in the memory, and when the machine-readable instructions are executed by the processor, the processor performs the steps of the data annotation method according to any one of claims 1 to 12.
15. A computer-readable storage medium, having a computer program stored thereon, wherein the computer program, when executed by a computer device, performs the steps of the data annotation method according to any one of claims 1 to 12.
CN202110963206.4A 2021-08-20 2021-08-20 Data labeling method and device, computer equipment and storage medium Withdrawn CN113660469A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110963206.4A CN113660469A (en) 2021-08-20 2021-08-20 Data labeling method and device, computer equipment and storage medium


Publications (1)

Publication Number Publication Date
CN113660469A true CN113660469A (en) 2021-11-16

Family

ID=78491877

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110963206.4A Withdrawn CN113660469A (en) 2021-08-20 2021-08-20 Data labeling method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113660469A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110866936A (en) * 2018-08-07 2020-03-06 阿里巴巴集团控股有限公司 Video labeling method, tracking method, device, computer equipment and storage medium
CN110705405A (en) * 2019-09-20 2020-01-17 阿里巴巴集团控股有限公司 Target labeling method and device
CN112598805A (en) * 2020-12-24 2021-04-02 浙江商汤科技开发有限公司 Prompt message display method, device, equipment and storage medium
CN112927349A (en) * 2021-02-22 2021-06-08 北京市商汤科技开发有限公司 Three-dimensional virtual special effect generation method and device, computer equipment and storage medium
CN113141498A (en) * 2021-04-09 2021-07-20 深圳市慧鲤科技有限公司 Information generation method and device, computer equipment and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114529621A (en) * 2021-12-30 2022-05-24 北京城市网邻信息技术有限公司 Household type graph generation method and device, electronic equipment and medium
CN114529621B (en) * 2021-12-30 2022-11-22 北京城市网邻信息技术有限公司 Household type graph generation method and device, electronic equipment and medium
CN114973056A (en) * 2022-03-28 2022-08-30 华中农业大学 Information density-based fast video image segmentation and annotation method
CN114973056B (en) * 2022-03-28 2023-04-18 华中农业大学 Information density-based fast video image segmentation and annotation method

Similar Documents

Publication Publication Date Title
CN109584295B (en) Method, device and system for automatically labeling target object in image
US10685489B2 (en) System and method for authoring and sharing content in augmented reality
US9760987B2 (en) Guiding method and information processing apparatus
US10482659B2 (en) System and method for superimposing spatially correlated data over live real-world images
JP6329343B2 (en) Image processing system, image processing apparatus, image processing program, and image processing method
KR101841668B1 (en) Apparatus and method for producing 3D model
US9576183B2 (en) Fast initialization for monocular visual SLAM
US20160269631A1 (en) Image generation method, system, and apparatus
KR102289745B1 (en) System and method for real-time monitoring field work
CN113657307A (en) Data labeling method and device, computer equipment and storage medium
US10699165B2 (en) System and method using augmented reality for efficient collection of training data for machine learning
CN114140528A (en) Data annotation method and device, computer equipment and storage medium
CN104484033A (en) BIM based virtual reality displaying method and system
CN113660469A (en) Data labeling method and device, computer equipment and storage medium
CN104268939A (en) Transformer substation virtual-reality management system based on three-dimensional panoramic view and implementation method of transformer substation virtual-reality management system based on three-dimensional panoramic view
KR20120038316A (en) User equipment and method for providing ar service
JP2020098568A (en) Information management device, information management system, information management method, and information management program
CN112348933A (en) Animation generation method and device, electronic equipment and storage medium
CN113345108B (en) Augmented reality data display method and device, electronic equipment and storage medium
KR20120076175A (en) 3d street view system using identification information
US11395102B2 (en) Field cooperation system and management device
JP6110780B2 (en) Additional information display system
KR102199772B1 (en) Method for providing 3D modeling data
CN113838193A (en) Data processing method and device, computer equipment and storage medium
CN114565849B (en) Asset AI (artificial intelligence) identification and positioning method and system based on digital twin

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20211116)