WO2022170844A1 - A video annotation method, apparatus, device, and computer-readable storage medium

A video annotation method, apparatus, device, and computer-readable storage medium

Info

Publication number
WO2022170844A1
WO2022170844A1, PCT/CN2021/137580, CN2021137580W
Authority
WO
WIPO (PCT)
Prior art keywords
information
point cloud
video frame
marked
dimensional
Prior art date
Application number
PCT/CN2021/137580
Other languages
English (en)
French (fr)
Inventor
莫柠锴
陈世峰
Original Assignee
中国科学院深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院深圳先进技术研究院
Publication of WO2022170844A1

Classifications

    • G06T 7/207: Analysis of motion for motion estimation over a hierarchy of resolutions
    • G06N 3/045: Combinations of networks (neural network architectures)
    • G06N 3/08: Learning methods (neural networks)
    • G06T 7/13: Edge detection (segmentation; image analysis)
    • G06T 7/73: Determining position or orientation of objects or cameras using feature-based methods
    • G06T 2207/10016: Video; image sequence (image acquisition modality)
    • G06T 2207/10024: Color image (image acquisition modality)
    • G06T 2207/20081: Training; learning (special algorithmic details)
    • G06T 2207/20084: Artificial neural networks [ANN] (special algorithmic details)

Definitions

  • the present application relates to the technical field of computer vision and image processing, and in particular, to a video annotation method, apparatus, device, and computer-readable storage medium.
  • Robotic arms have a wide range of applications in industry, retail and service fields, including the grabbing and distribution of goods, the sorting of components on the assembly line, and the depalletizing and palletizing in logistics.
  • A traditional robotic arm lacks perception of its environment, so its actions and behaviors can only be determined by pre-programming in a static environment (such as the offline teaching widely used by industrial robots). Moreover, a traditional robotic arm often requires customized fixtures or mechanical structures so that the objects to be sorted move or are placed along a prescribed trajectory, which makes it inflexible; each different scene requires a customized design, resulting in high costs.
  • the visual guidance algorithm mainly includes two aspects: target detection and target pose estimation.
  • At present, deep-learning-based target detection and target pose estimation both require a large amount of labeled data for training.
  • A camera can collect a large amount of video data, which is then labeled manually, and the labeled video data is used to train the target detection model and the target pose estimation model.
  • However, manual labeling suffers from problems such as low labeling efficiency and high labor cost.
  • the embodiments of the present application provide a video labeling method, apparatus, device, and computer-readable storage medium, which can improve the labeling efficiency of video data and reduce labor costs to a certain extent.
  • collecting a video frame sequence of the working scene through an RGB-D image capture device, where each video frame in the video frame sequence includes an object to be marked;
  • acquiring the target device pose parameters at which the RGB-D image capture device collects each video frame, and constructing a three-dimensional scene point cloud of the working scene according to the target device pose parameters;
  • obtaining first object information of the object to be marked in the three-dimensional scene point cloud by setting a three-dimensional object model at the position of the object to be marked in the three-dimensional scene point cloud;
  • annotating the object to be marked included in each video frame with second object information according to the first object information and the target device pose parameters.
  • the present application can construct a 3D scene point cloud for the work scene, perform a single manual annotation on the objects to be labeled in the 3D scene point cloud, and map the manually annotated information in the 3D scene point cloud to each video frame. It can be seen that the present application realizes semi-automatic video labeling, reduces labeling time, improves labeling efficiency, and avoids the problem of inefficient manual labeling of data.
  • the present application provides a video annotation device, comprising:
  • a video capture module used for collecting a video frame sequence about the working scene through an RGB-D image capture device; each video frame in the video frame sequence includes an object to be marked;
  • a video processing module configured to acquire the target device posture parameters when the RGB-D image acquisition device collects the respective video frames, and construct a three-dimensional scene point cloud of the work scene according to the target device posture parameters;
  • to obtain the first object information of the object to be marked in the point cloud of the three-dimensional scene by setting a three-dimensional object model on the object to be marked in the point cloud of the three-dimensional scene; and
  • to mark the to-be-labeled objects included in the respective video frames with second object information according to the first object information.
  • In a fifth aspect, the present application provides a computer program product that, when run on a border and coastal defense monitoring device, enables the border and coastal defense monitoring device to perform the steps of the method described in the first aspect or any optional manner of the first aspect.
  • FIG. 1 is a schematic flowchart of a video labeling method provided by an embodiment of the present application.
  • FIG. 2 is a schematic flowchart of another video labeling method provided by an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of another video labeling method provided by an embodiment of the present application.
  • FIG. 4 is a schematic flowchart of another video labeling method provided by an embodiment of the present application.
  • FIG. 5 is a schematic flowchart of a target detection and target pose estimation provided by an embodiment of the present application.
  • FIG. 6 is a schematic flowchart of another video labeling method provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a video annotation device provided by an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a video annotation device provided by an embodiment of the present application.
  • references to "one embodiment” or “some embodiments” and the like described in the specification of this application mean that a particular feature, structure or characteristic described in connection with the embodiment is included in one or more embodiments of the application .
  • appearances of the phrases “in one embodiment,” “in some embodiments,” “in other embodiments,” “in other embodiments,” etc. in various places in this specification are not necessarily All refer to the same embodiment, but mean “one or more but not all embodiments” unless specifically emphasized otherwise.
  • the terms “including”, “including”, “having” and their variants mean “including but not limited to” unless specifically emphasized otherwise.
  • For example, in an automatic goods-sorting scenario, a target recognition model and a target pose estimation model need to be trained in advance with labeled video data, so that the category and pose of the goods in the scene can be obtained; compared with manual sorting, a robotic arm can then work 24 hours a day and at a lower operating cost.
  • As another example, in an AR (augmented reality) scenario based on a smart terminal, a target pose estimation model needs to be trained in advance with labeled video data, so that pose estimation can be performed accurately when the user interacts with virtual items.
  • the present application can also be applied to other scenarios in which deep learning is performed on video data that needs to be labeled, and the present application does not make special restrictions on specific application scenarios.
  • the video tagging method provided by the present application is exemplarily described below through specific embodiments.
  • FIG. 1 is a schematic flowchart of a video annotation method provided by the present application. As shown in Figure 1, the video annotation method may include the following steps:
  • a video frame sequence related to a work scene is collected by an RGB-D image collection device; each video frame in the video frame sequence includes an object to be marked.
  • the object to be annotated involved in the present application and the three-dimensional object model involved in the subsequent process may be rigid objects.
  • In this embodiment of the present application, a single video frame in the video frame sequence can be expressed as Formula 1:
  • I = {((r,g,b)_1, (u,v)_1, d_1), ..., ((r,g,b)_k, (u,v)_k, d_k)}    (Formula 1)
  • In Formula 1, ((r,g,b)_k, (u,v)_k, d_k) represents the information contained in the k-th pixel of the single video frame; (r,g,b)_k represents the color information of the k-th pixel, i.e. the gray value of the red component r, the gray value of the green component g, and the gray value of the blue component b; (u,v)_k represents the coordinate information of the k-th pixel; d_k represents the depth value of the k-th pixel; and k represents the total number of pixels in the single video frame.
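  • As an illustration of Formula 1, the following minimal NumPy sketch (not part of the patent; the function and variable names are invented for illustration) flattens an aligned RGB-D frame into the per-pixel tuples ((r,g,b)_k, (u,v)_k, d_k):

```python
import numpy as np

def frame_to_tuples(rgb, depth):
    """Flatten an aligned RGB-D frame into the per-pixel tuples of Formula 1:
    ((r, g, b)_k, (u, v)_k, d_k) for every pixel k."""
    h, w, _ = rgb.shape
    v, u = np.mgrid[0:h, 0:w]                          # v indexes rows (y), u indexes columns (x)
    colors = rgb.reshape(-1, 3)                        # (r, g, b)_k
    coords = np.stack([u.ravel(), v.ravel()], axis=1)  # (u, v)_k
    depths = depth.ravel()                             # d_k
    return colors, coords, depths

# usage: rgb is an HxWx3 array, depth an HxW depth map registered to it
# colors, coords, depths = frame_to_tuples(rgb, depth)
```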
  • each video frame in the video frame sequence may include a three-channel RGB image and a single-channel depth image, or include a four-channel RGB-D image, which is not particularly limited in this application.
  • the execution subject in this application may be a video annotation device, and the video annotation device has data processing capabilities, such as a terminal device, a server, and the like.
  • the number of RGB-D image acquisition devices may be one or more. If multiple RGB-D image acquisition devices are installed in the work scene, the acquisition efficiency of the work scene can be improved to a certain extent.
  • the RGB-D image acquisition device can be installed on a robotic arm.
  • the working scene can be the scene where the robotic arm works.
  • the RGB-D image acquisition device may be a built-in RGB-D camera of the smart terminal.
  • each video frame may include one or more objects to be labeled, and since each video frame is an image related to a work scene, the same object to be labeled may exist in different video frames. This is not limited.
  • In some embodiments of the present application, a SLAM (simultaneous localization and mapping) algorithm can be used, taking the initial device pose parameter at which the RGB-D image acquisition device collects a first video frame as the reference coordinate system, to obtain the target device pose parameters at which the RGB-D image acquisition device collects each video frame; the first video frame is one video frame in the video frame sequence.
  • the first video frame may be the first video frame in the sequence of video frames.
  • the first video frame may also be any video frame in the video frame sequence, which is not particularly limited in this application.
  • The target device pose parameters corresponding to the video frames in the video frame sequence can be written as Formula 2:
  • P_list = (P_1, P_2, ..., P_n) = SLAM(I_1, I_2, ..., I_n)    (Formula 2)
  • where P_list represents the list of target device pose parameters corresponding to the video frames in the video frame sequence, P_n represents the target device pose parameter at which the RGB-D image acquisition device collects the n-th video frame, I_n represents the n-th video frame, and SLAM() represents the simultaneous localization and mapping algorithm.
  • Taking the i-th video frame in the video frame sequence as an example (i being a positive integer), the target device pose parameter consists of a device rotation matrix R_1 and a device translation matrix T_1, i.e. P = [R_1 | T_1], where r_11, r_12, r_13, r_21, r_22, r_23, r_31, r_32, r_33 are the elements of the device rotation matrix R_1 and t_1, t_2, t_3 are the elements of the device translation matrix T_1, all of which are computed by the SLAM algorithm.
  • In this embodiment of the present application, after the target device pose parameters are obtained, a 3D scene point cloud of the work scene can be constructed from them. Specifically, the following steps may be included:
  • (1) According to the target device pose parameter corresponding to each video frame, convert the pixels of each video frame into the reference coordinate system, obtaining the video frame in the reference coordinate system. The conversion to the reference coordinate system can be performed through the following two steps a1 and a2:
  • a1. The pixels of each video frame are first converted into the camera coordinate system, giving the corresponding video frame in the camera coordinate system.
  • In this embodiment of the present application, a single video frame in the camera coordinate system can be written as Formula 3:
  • G_cam = {((r,g,b)_1, (x_cam, y_cam, z_cam)_1), ..., ((r,g,b)_k, (x_cam, y_cam, z_cam)_k)}    (Formula 3)
  • where G_cam represents the single video frame in the camera coordinate system; (r,g,b)_k represents the color information of its k-th pixel, i.e. the gray values of the red, green and blue components; and (x_cam, y_cam, z_cam)_k represents the coordinate of the k-th pixel in the camera coordinate system, with x_cam = d_k*(u_k - c_x)/f_x, y_cam = d_k*(v_k - c_y)/f_y and z_cam = d_k. Here, c_x and c_y represent the image coordinates of the principal point of the RGB-D image acquisition device on the x-axis and y-axis, f_x and f_y represent its focal lengths on the x-axis and y-axis, d_k represents the depth value of the k-th pixel, and u_k and v_k represent the x-axis and y-axis coordinates of the k-th pixel in the frame.
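  • A hedged NumPy sketch of the back-projection in Formula 3, assuming a depth map already aligned to the RGB image; fx, fy, cx, cy are the intrinsics named above, and the function name is illustrative:

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Back-project an aligned depth map into camera coordinates (Formula 3):
    x_cam = d*(u - cx)/fx, y_cam = d*(v - cy)/fy, z_cam = d."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    x = depth * (u - cx) / fx
    y = depth * (v - cy) / fy
    pts = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    return pts[depth.ravel() > 0]          # keep only pixels with a valid depth reading
```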
  • a2. According to the target device pose parameter corresponding to each video frame, the video frame in the camera coordinate system is converted into the reference coordinate system, giving the video frame in the reference coordinate system. A single video frame in the reference coordinate system can be written as Formula 4:
  • G_ref = {((r,g,b)_1, (x_ref, y_ref, z_ref)_1), ..., ((r,g,b)_k, (x_ref, y_ref, z_ref)_k)}    (Formula 4)
  • where G_ref represents the single video frame in the reference coordinate system, (r,g,b)_k represents the color information of its k-th pixel, and (x_ref, y_ref, z_ref)_k represents the coordinate of the k-th pixel in the reference coordinate system, obtained by applying the target device pose parameter P of that video frame to (x_cam, y_cam, z_cam)_k.
  • each video frame in the video frame sequence can be converted into the reference coordinate system through the steps described in (1) above.
  • (2) The video frames in the reference coordinate system are merged. The merged video frames can be written as Formula 5:
  • G_raw = Merge(G_1^ref, G_2^ref, ..., G_n^ref)    (Formula 5)
  • where G_raw represents the merged video frames, Merge() represents the merging function, and G_n^ref represents the n-th video frame in the reference coordinate system.
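  • A rough sketch of Formulas 4 and 5: each camera-frame cloud is mapped into the reference coordinate system with its device pose P = [R | T] and the results are concatenated. Whether the SLAM pose maps camera coordinates into the reference frame or the inverse depends on the SLAM convention, so the direction used here is an assumption:

```python
import numpy as np

def to_reference(pts_cam, R, T):
    """Map camera-frame points into the reference frame using the device pose
    P = [R | T] of that video frame (Formula 4)."""
    return pts_cam @ R.T + np.asarray(T).reshape(1, 3)

def merge_frames(clouds):
    """Concatenate all reference-frame clouds into one raw scene cloud (Formula 5)."""
    return np.concatenate(clouds, axis=0)

# usage sketch:
# G_ref = [to_reference(backproject(d, fx, fy, cx, cy), R_n, T_n)
#          for d, (R_n, T_n) in zip(depth_maps, poses)]
# G_raw = merge_frames(G_ref)
```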
  • (3) The merged video frames are denoised and smoothed to obtain the 3D scene point cloud of the work scene. In this embodiment, a preset processing function in the PCL (point cloud library) can be used to perform the denoising and smoothing.
  • For example, the preset processing function can be the MovingLeastSquares function, in which case the step can be expressed as G_final = MovingLeastSquares(G_raw), where G_final represents the 3D scene point cloud.
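  • The MovingLeastSquares filter itself lives in the C++ PCL. As a stand-in only (a substitute, not the filter named in the text), a Python sketch of the denoise-and-smooth step using Open3D's statistical outlier removal and voxel downsampling might look like this:

```python
import numpy as np
import open3d as o3d

def denoise_and_smooth(points, voxel_size=0.005):
    """Stand-in for the PCL MovingLeastSquares step: statistical outlier removal
    followed by voxel downsampling (a substitute for the MLS filter, not the filter itself)."""
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points)
    pcd, _ = pcd.remove_statistical_outlier(nb_neighbors=20, std_ratio=2.0)
    pcd = pcd.voxel_down_sample(voxel_size=voxel_size)
    return np.asarray(pcd.points)
```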
  • It should be noted that, in some other optional embodiments of the present application, an SFM (structure from motion) algorithm can be used to construct the 3D scene point cloud of the work scene from the video frame sequence; the present application does not limit the method used to construct the 3D scene point cloud.
  • S103 Obtain first object information of the object to be marked in the point cloud of the three-dimensional scene by setting the three-dimensional object model at the position of the object to be marked in the point cloud of the three-dimensional scene.
  • each video frame in the video frame sequence includes the object to be labeled
  • the object to be labeled still exists in the point cloud of the three-dimensional scene constructed by each video frame.
  • the present application can place the corresponding 3D object model at the object to be annotated in the 3D scene point cloud, so that the 3D object model fits the object to be annotated in the 3D scene point cloud to complete a single annotation.
  • Further, the placement of the 3D object model can be performed through the following steps:
  • (1) Acquire the 3D object model corresponding to the object to be annotated in the 3D scene point cloud; the 3D object model is set with corresponding object category information. In this embodiment of the present application, the 3D object model can be written as Formula 6:
  • OBJ = {(id, class, (r,g,b)_1, (x_obj, y_obj, z_obj)_1), ..., (id, class, (r,g,b)_s, (x_obj, y_obj, z_obj)_s)}    (Formula 6)
  • where OBJ represents the 3D object model; id represents the initial identification information of the model (such as a number or sequence code), which can be randomly set in advance; class represents the object category information of the model; (r,g,b)_s represents the color information of the s-th point of the model, i.e. the gray values of the red, green and blue components; and (x_obj, y_obj, z_obj)_s represents the coordinate of the s-th point of the model, which can be expressed in the model's own coordinate system.
  • (2) According to the annotation order of the objects to be annotated in the 3D scene point cloud, set the current identification information of each object to be annotated, and take the object category information of the 3D object model as the object category information of that object. For example, if there are five objects to be annotated, their current identification information can be set according to the order in which they are annotated in the 3D scene point cloud.
  • In addition, the present application can also change the initial identification information in the acquired 3D object model to the current identification information corresponding to the object to be annotated.
  • (3) Convert the 3D object model into the reference coordinate system where the 3D scene point cloud is located, obtaining the converted 3D object model. In this embodiment of the present application, initial pose parameters can be set for the 3D object model, and the model is converted into the reference coordinate system according to these initial pose parameters.
  • Optionally, the initial pose parameters include an initial translation matrix and an initial rotation matrix. Any point on the object to be marked that corresponds to the 3D object model can be selected in the 3D scene point cloud, and the coordinate of that point is used to assign the initial translation matrix of the initial pose parameters; the initial rotation matrix can be set to the identity matrix.
  • Optionally, the initial pose parameters can be written as Formula 7:
  • P_obj = [R_2 | T_2]    (Formula 7)
  • where P_obj represents the initial pose parameter of the 3D object model; R_2 represents its initial rotation matrix, determined by the initial rotation angles of the model about the x-axis (φ), the y-axis (θ) and the z-axis; and T_2 = [tx, ty, tz]^T represents its initial translation matrix, where tx, ty and tz are the initial translation distances of the model along the x-axis, y-axis and z-axis respectively.
  • The converted 3D object model can be expressed as Formula 8:
  • OBJ_new(x_s, y_s, z_s) = P_obj · OBJ(x_s, y_s, z_s)    (Formula 8)
  • where OBJ_new(x_s, y_s, z_s) represents the s-th point of the converted 3D object model, P_obj represents the initial pose parameter of the 3D object model, and OBJ(x_s, y_s, z_s) represents the s-th point of the original 3D object model.
  • each point in the three-dimensional object model can be converted to the reference coordinate system where the point cloud of the three-dimensional scene is located, and the converted three-dimensional object model can be obtained.
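  • A minimal sketch of Formulas 7 and 8, assuming the initial pose is the 3x4 matrix [R_2 | T_2] described above (identity rotation, translation copied from one point picked on the object in the scene cloud); function names are illustrative:

```python
import numpy as np

def initial_pose(seed_point):
    """Initial pose of the 3D object model (Formula 7): identity rotation R_2 plus a
    translation T_2 copied from one point picked on the object in the scene cloud."""
    R2 = np.eye(3)
    T2 = np.asarray(seed_point, dtype=float).reshape(3, 1)
    return np.hstack([R2, T2])                 # 3x4 matrix [R_2 | T_2]

def transform_model(model_pts, pose):
    """Apply the pose to every model point (Formula 8), giving OBJ_new expressed in
    the reference coordinate system of the scene point cloud."""
    R, T = pose[:, :3], pose[:, 3]
    return model_pts @ R.T + T
```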
  • (4) If the converted 3D object model fits the object to be annotated in the 3D scene point cloud, the initial pose parameter is determined to be the first object pose information of the object to be annotated in the 3D scene point cloud.
  • The fit between the converted 3D object model and the object to be annotated in the 3D scene point cloud can be understood as the degree of coincidence between the converted 3D object model and the object to be annotated reaching its maximum.
  • the degree of coincidence can be manually identified to determine whether it is in a fit state.
  • If the converted 3D object model does not fit the object to be annotated in the 3D scene point cloud, the above initial pose parameters are adjusted and the converted 3D object model is re-acquired according to the adjusted initial pose parameters, until the re-acquired converted 3D object model fits the object to be annotated in the 3D scene point cloud; the adjusted initial pose parameters are then used as the first object pose information.
  • Optionally, the first object pose information includes an adjusted rotation matrix and an adjusted translation matrix.
  • the first object information of the object to be marked in the point cloud of the three-dimensional scene can be obtained, and the object to be marked is marked with the first object information.
  • the first object information may include object category information of the object to be marked in the 3D scene point cloud, first object pose information and current identification information, and the object category information is the same as the object category information of the 3D object model.
  • the above-mentioned three-dimensional object model is a model obtained in advance through a reverse engineering three-dimensional modeling method.
  • The second object information may include object category information of the object to be marked in the video frame, second object pose information, current identification information, bounding box information, and mask boundary information.
  • the i-th video frame in the video frame sequence and the j-th object to be annotated in the point cloud of the three-dimensional scene are described as examples. Specifically, the following steps may be included:
  • the object category information and current identification information of the j-th object to be labeled in the 3D scene point cloud are taken as the object category information and current identification information of the j-th object to be labeled in the ith video frame, respectively.
  • According to the target device pose parameter corresponding to the i-th video frame and the first object pose information of the j-th object to be annotated, the second object pose information of the j-th object to be annotated relative to the i-th video frame is obtained, as expressed in Formula 9.
  • In Formula 9, the second object pose information of the j-th object relative to the i-th video frame consists of rotation pose information, computed from the device rotation matrix R_i corresponding to the i-th video frame and the adjusted rotation matrix R_j corresponding to the j-th object to be annotated, and translation pose information, computed from the device translation matrix T_i corresponding to the i-th video frame and the adjusted translation matrix T_j corresponding to the j-th object to be annotated.
  • the object point cloud of the jth object to be labeled is mapped to the image coordinate system of the ith video frame, and the jth object to be labeled in the image coordinate system is obtained.
  • The j-th object to be annotated in the image coordinate system can be written as Formula 10:
  • s · (u, v, 1)^T = K · [R_j^i | T_j^i] · (x_obj, y_obj, z_obj, 1)^T    (Formula 10)
  • where K is the camera intrinsic matrix of the RGB-D image acquisition device; [R_j^i | T_j^i] is the pose of the j-th object to be annotated relative to the i-th video frame; (x_obj, y_obj, z_obj) is the coordinate of any point of the j-th object to be annotated contained in the 3D scene point cloud; (u, v) is the coordinate obtained by mapping that point into the image coordinate system of the i-th video frame; and s is the scaling factor. Applying this mapping to all k points of the object gives their coordinates in the image coordinate system of the i-th video frame.
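  • A hedged sketch of Formulas 9 and 10. The exact composition in Formula 9 is not spelled out in the text, so the sketch assumes the standard convention that the pose of object j relative to frame i is the frame pose inverted and composed with the object pose; Formula 10 is the usual pinhole projection with intrinsics K:

```python
import numpy as np

def relative_pose(R_i, T_i, R_j, T_j):
    """Pose of object j in the camera frame of video frame i (Formula 9), assuming
    both the device pose (R_i, T_i) and the object pose (R_j, T_j) map into the
    common reference frame."""
    R_ij = R_i.T @ R_j
    T_ij = R_i.T @ (np.asarray(T_j) - np.asarray(T_i))
    return R_ij, T_ij

def project(points, R_ij, T_ij, K):
    """Pinhole projection of the object point cloud into frame i (Formula 10):
    s * (u, v, 1)^T = K [R | T] (x_obj, y_obj, z_obj, 1)^T."""
    cam = points @ R_ij.T + np.asarray(T_ij).reshape(1, 3)
    uvw = cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]            # divide out the scale factor s
```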
  • the convex hull algorithm is used to calculate the mask boundary information corresponding to the j-th object to be labeled in the image coordinate system, and the bounding box information corresponding to the j-th object to be labeled in the image coordinate system is obtained according to the mask boundary information.
  • the convex hull algorithm may be the Graham convex hull algorithm or the like.
  • The mask boundary information of the j-th object to be labeled in the image coordinate system can be obtained through the Graham convex hull algorithm, as shown in Formula 11: M = Graham({(u,v)_1, ..., (u,v)_k}), where M denotes the set of mask boundary points and (u,v)_1, ..., (u,v)_k are the projected coordinates of the object's points in the image coordinate system of the i-th video frame.
  • The bounding box information is then obtained from the mask boundary information as Box(M) = ((u_top, v_top), (u_bottom, v_bottom)), where Box(M) represents the function that derives the bounding box from the mask boundary information, (u_top, v_top) represents the coordinate of the upper-left mask boundary point, and (u_bottom, v_bottom) represents the coordinate of the lower-right mask boundary point.
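  • A small sketch of the mask-boundary and bounding-box step (Formula 11). The text names the Graham convex hull algorithm; scipy's ConvexHull is used here as a stand-in, and the bounding box is taken as the axis-aligned box spanned by the extreme projected points:

```python
import numpy as np
from scipy.spatial import ConvexHull

def mask_and_box(uv):
    """Mask boundary from a convex hull over the projected pixel coordinates, and
    the bounding box spanned by the extreme mask points (Formula 11)."""
    hull = ConvexHull(uv)
    boundary = uv[hull.vertices]               # ordered boundary points of the mask
    u_top, v_top = uv.min(axis=0)              # upper-left corner
    u_bottom, v_bottom = uv.max(axis=0)        # lower-right corner
    return boundary, ((u_top, v_top), (u_bottom, v_bottom))
```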
  • In this embodiment, the annotation record information of the j-th object to be annotated in the i-th video frame can be expressed as a tuple containing: class_j, the object category information of the j-th object; the second object pose information of the j-th object relative to the i-th video frame; id_j, the current identification information of the j-th object; the bounding box information of the j-th object in the i-th video frame; and the mask boundary information of the j-th object in the i-th video frame.
  • information annotation can be performed for each object to be annotated in each video frame.
  • In addition, the present application only needs to manually label the objects to be labeled in the 3D scene point cloud once, and then automatically labels the objects to be labeled in each video frame according to this single manual annotation, which removes the process of manually labeling every object to be labeled in every video frame.
  • For example, suppose each scene contains 5 to 6 objects to be labeled and each video has 20 to 30 valid video frames. With the traditional manual labeling method, each video frame takes 7 minutes to annotate; with the video annotation method described in this application, each video frame takes 1.5 minutes. It can be seen that the video annotation method proposed in this application greatly reduces annotation time.
  • the present application can construct a 3D scene point cloud for the work scene, perform a single manual annotation on the objects to be labeled in the 3D scene point cloud, and map the manually annotated information in the 3D scene point cloud to each video frame. It can be seen that the present application realizes semi-automatic video labeling, reduces labeling time, improves labeling efficiency, and avoids the problem of inefficient manual labeling of data.
  • FIG. 2 is a schematic flowchart of a video annotation method provided based on the embodiment shown in FIG. 1 .
  • the video annotation method may further include the following steps:
  • S105: A fixed-focus camera is used to photograph the object to be marked from multiple directions, so as to obtain multiple frames of images of the object to be marked.
  • the RGB-D image acquisition device can be used to photograph the object to be marked in multiple directions to obtain multiple frames of images of the object to be marked.
  • the fixed-focus camera in the present application can shoot the object to be annotated in all directions, and obtain multiple frames of images of the object to be annotated.
  • S105 may be executed before S101, or may be executed before S102, and this application does not limit the sequence of S105.
  • S106 Acquire an object point cloud of the object to be marked according to the multiple frames of images.
  • In this embodiment, the present application can use photogrammetry to obtain the object point cloud of the object to be marked from the multi-frame images.
  • the object point cloud of the object to be annotated can be obtained directly using the photogrammetry software Meshroom.
  • the object point cloud can be three-dimensional position coordinate information.
  • the object point cloud may also include color information or intensity information, etc., which is not particularly limited in this application.
  • In the present application, since there may be noise points in the object point cloud, the surface of the reconstructed three-dimensional object model may not be smooth; therefore, the object point cloud can first be denoised, and the three-dimensional object model of the object to be annotated is then constructed from the denoised object point cloud.
  • Specifically, the points in the object point cloud can be connected to form surfaces, and the three-dimensional object model of the object to be marked is constructed from these surfaces.
  • the present application can obtain the 3D object model of each object in advance by reverse engineering, so that the 3D object model can be directly set in the constructed 3D scene point cloud, so as to perform information annotation on the object to be annotated.
  • FIG. 3 shows a schematic diagram of a video annotation method. As shown in Figure 3, it mainly includes two parts: the first part is the construction of the 3D object model; the second part is the construction and information annotation of the 3D scene point cloud.
  • the construction of the three-dimensional object model may include: acquiring multiple frames of images of the object to be marked, then acquiring the object point cloud of the object to be marked from the multi-frame images, and then constructing the three-dimensional object model of the object to be marked according to the object point cloud.
  • The construction and information annotation of the 3D scene point cloud may include: obtaining the target device pose parameters and the 3D scene point cloud when the RGB-D image acquisition device collects each video frame; then setting the corresponding 3D object model at the position of the object to be annotated in the 3D scene point cloud, so as to annotate the object with a single pass of information; and then, according to the target device pose parameters and the first object information from this single annotation, annotating each video frame in the video frame sequence with second object information.
  • The second object information includes the object category information of the object to be labeled (i.e., classes in FIG. 3), the bounding box information (i.e., 2D Boxes in FIG. 3), the mask boundary information (i.e., 2D Masks in FIG. 3), and the second object pose information (i.e., 6D Poses in FIG. 3).
  • Vision guidance mainly includes target detection and target pose estimation.
  • Target detection can use the RCNN (region-based convolutional neural networks) algorithm, the Fast RCNN algorithm, the Faster R-CNN algorithm, the SSD (single shot multibox detector) algorithm, the YOLO (you only look once) algorithm, and so on.
  • Target pose estimation can use a point cloud template registration method, an ICP (iterative closest point) registration method, and so on.
  • However, existing target pose estimation is sensitive to noisy data and has difficulty handling problems such as occlusion and incomplete observations.
  • FIG. 4 shows a schematic flowchart of a video annotation method. As shown in Figure 4, after S104, the following steps may also be included:
  • S108 Acquire a to-be-processed video frame, where the to-be-processed video frame includes a to-be-processed RGB image and a to-be-processed depth image.
  • the to-be-processed RGB image and the to-be-processed depth image may be acquired by an RGB-D image acquisition device.
  • S109 Obtain target object category information, target bounding box information, and target mask boundary information of the target object in the RGB image to be processed.
  • Figure 5 shows a schematic diagram of a target detection and target pose estimation process.
  • The target detection model involved in the target detection process may include: a backbone network layer (i.e., the backbone in FIG. 5, such as the residual network ResNet50), an FPN (feature pyramid networks) layer connected to the backbone network layer, multiple RCNN network layers connected to the FPN, and an NMS (non maximum suppression) network layer connected to the multiple RCNN network layers.
  • Figure 5 is an example of FPN outputting feature images of three scales
  • the target detection model includes three RCNN network layers, and the output channels of FPN and RCNN network layers are in a one-to-one correspondence.
  • the target detection model in FIG. 5 is used as an example for description.
  • The application inputs the RGB image to be processed into the backbone network layer to obtain a first feature image; the first feature image is then input into the FPN, which outputs second feature images at multiple scales; each second feature image is then input into the corresponding RCNN network layer, which outputs initial object information of the target object in the target RGB image, where the initial object information may include initial category information, initial bounding box information and initial decoding coefficients. The initial category information, initial bounding box information and initial decoding coefficients are then screened through the NMS network layer to obtain the target object category information, target bounding box information and target decoding coefficients of the target object in the target RGB image.
  • For example, the target objects include object1, object2, ..., objectn.
  • The objectn info is further detailed in FIG. 5: it includes the target object category information of objectn (i.e., the class in FIG. 5), the target bounding box information (i.e., the 2D box in FIG. 5), and the target decoding coefficients (i.e., the coefficients in FIG. 5), where the target decoding coefficients can be a 32x1 vector.
  • the above examples are just examples, which are not specifically limited in the present application.
  • the target detection model may further include a first convolutional neural network layer.
  • the first feature image is input into the first convolutional neural network layer, and the third feature image with a preset scale is output.
  • After obtaining the first feature block of the target object from the third feature image according to the target bounding box information, the present application multiplies the first feature block of the target object with the target decoding coefficients of the target object (a matrix product) to obtain a heat map of the target object, where the heat map is a single-channel image; the heat map is thus obtained through the target decoding coefficients. The heat map is then binarized with a preset threshold to obtain the target mask boundary information of the target object in the target RGB image.
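  • The mask decoding described here closely mirrors prototype-mask schemes such as YOLACT. A hedged sketch, with shapes assumed (a HxWx32 prototype feature tensor and a 32-dimensional coefficient vector per detection):

```python
import numpy as np

def decode_mask(prototypes, coeffs, box, threshold=0.5):
    """Crop the feature block inside the detected box, multiply it with the 32x1
    decoding coefficients to get a single-channel heat map, and binarise it with a
    preset threshold to obtain the mask."""
    x1, y1, x2, y2 = box                        # bounding box in feature-map pixels
    block = prototypes[y1:y2, x1:x2, :]         # first feature block of the object
    heat = block @ coeffs                       # (h, w, 32) @ (32,) -> (h, w) heat map
    return (heat > threshold).astype(np.uint8)  # binary mask inside the box
```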
  • If the preset scale of the third feature image is the same as the scale of the target RGB image, the pixels indicated by the target bounding box information can be taken directly from the third feature image as the first feature block. If the preset scale of the third feature image differs from the scale of the target RGB image, the third feature image can first be rescaled to the scale of the target RGB image, and the pixels indicated by the target bounding box information are then taken from the rescaled third feature image as the first feature block; alternatively, the scale ratio between the third feature image and the target RGB image can be obtained, the target bounding box information is scaled according to this ratio, and the pixels indicated by the scaled bounding box information are taken from the third feature image as the first feature block. The above process is only an example, and the present application places no special restriction on it.
  • S110 Acquire a target point cloud image of the target object according to the target mask boundary information and the depth image to be processed.
  • Specifically, the pixels containing the target object can be extracted from the depth image to be processed through the target mask boundary information; coordinate transformation is then performed on these pixels to obtain the target point cloud information, and the target point cloud image is constructed from the target point cloud information.
  • S111: Input the target point cloud image and the target object image into the pre-trained target pose estimation model, and obtain the target object pose information of the target object and the target confidence corresponding to the target object pose information; the target object image is obtained from the RGB image to be processed based on the target bounding box information.
  • the target pose estimation model obtained by pre-training is obtained by training according to the video frame marked with the second object information.
  • the pre-trained target pose estimation model may include a feature extraction network layer.
  • the feature extraction network layer may include a local feature extraction network layer, a global feature extraction network layer and a feature aggregation network layer.
  • The present application can input the target point cloud image and the target object image into the local feature extraction network layer to obtain the local feature image of the target object; input the local features of the target object and the coordinate information corresponding to the local features into the global feature extraction network layer to obtain the global feature image of the target object; and input the local feature image and the global feature image into the feature aggregation network layer to obtain the aggregated feature image, from which the target object pose information of the target object and the target confidence corresponding to that pose information are obtained.
  • Optionally, the target pose estimation model may also include three second convolutional neural network layers, where the first of these layers is used to obtain the target translation matrix in the target object pose information, the second is used to obtain the target rotation matrix in the target object pose information, and the third is used to obtain the confidence of the target object pose information. The aggregated feature image can be input into the first and second of these convolutional layers to obtain the target object pose information of the target object, and into the third to obtain the corresponding confidence. The three convolutional neural network layers may use 1x1 convolution kernels.
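  • A hedged PyTorch sketch of the three 1x1-convolution heads described above; the framework, channel sizes and the quaternion parameterisation of the rotation are assumptions, not statements about the patent's implementation:

```python
import torch
import torch.nn as nn

class PoseHeads(nn.Module):
    """Three 1x1-convolution heads: translation, rotation (a quaternion here, which
    is an assumption) and a per-prediction confidence."""
    def __init__(self, in_channels=1024):
        super().__init__()
        self.translation = nn.Conv1d(in_channels, 3, kernel_size=1)
        self.rotation = nn.Conv1d(in_channels, 4, kernel_size=1)
        self.confidence = nn.Conv1d(in_channels, 1, kernel_size=1)

    def forward(self, aggregated):               # aggregated: (batch, channels, points)
        t = self.translation(aggregated)          # candidate translations
        r = self.rotation(aggregated)             # candidate rotations
        c = torch.sigmoid(self.confidence(aggregated))  # confidence of each candidate
        return t, r, c
```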
  • Since the present application uses an instance segmentation algorithm in the target detection model, and the instance segmentation algorithm can provide the precise outline of the object to be marked, background noise can be reduced during target pose estimation. In addition, both local features and global features are combined in the target pose estimation process; in this way, even if some local features are occluded, pose estimation can still be performed from other unoccluded local features, which mitigates the problem of object occlusion to a certain extent.
  • S112 Determine final object posture information of the target object from the target object posture information of the target object according to the target confidence corresponding to the target object posture information, and perform visual guidance through the final object posture information and target object category information.
  • the robotic arm can be controlled to automatically grab objects at any position and posture and achieve sorting.
  • the device used for video annotation in this application may have the following hardware: Intel(R) Xeon(R) 2.4GHz CPU, NVIDIA GTX 1080 Ti GPU.
  • the pose of the object can be quickly acquired, which is convenient for visual guidance.
  • the present application may use each video frame in the embodiment shown in FIG. 1 or FIG. 2 to perform model training on the target detection model and the target pose estimation model.
  • each video frame in the video frame sequence may include an RGB image and a depth image.
  • the following steps may also be included:
  • S113 Input the RGB image into the target detection model to be trained, and obtain object category information to be matched, bounding box information to be matched, and decoding coefficients of the object to be marked in the RGB image.
  • S114: Perform instance segmentation on the RGB image according to the to-be-matched bounding box information and the decoding coefficients, to obtain the to-be-matched mask boundary information of the object to be marked in the RGB image.
  • S115 Acquire a point cloud image of the object to be marked according to the mask boundary information to be matched and the depth image.
  • S117 Perform model training on the target recognition model and the target posture estimation model according to the category information of the object to be matched, the bounding box information to be matched, the boundary information of the mask to be matched, and the pose information of the object to be matched.
  • Optionally, a loss function can be constructed from the annotated object category information, bounding box information, mask boundary information and second object pose information, together with the to-be-matched object category information, to-be-matched bounding box information, to-be-matched mask boundary information and to-be-matched object pose information, and the to-be-trained target recognition model and target pose estimation model are trained through this loss function.
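  • A hedged sketch of such a joint objective; the individual loss terms and their equal weighting are assumptions for illustration only:

```python
import torch.nn.functional as F

def joint_loss(pred, target):
    """Compare predicted ("to be matched") category / box / mask / pose against the
    semi-automatically annotated second object information."""
    cls_loss = F.cross_entropy(pred["class_logits"], target["class"])
    box_loss = F.smooth_l1_loss(pred["box"], target["box"])
    mask_loss = F.binary_cross_entropy_with_logits(pred["mask_logits"], target["mask"])
    pose_loss = F.mse_loss(pred["pose"], target["pose"])
    return cls_loss + box_loss + mask_loss + pose_loss
```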
  • the present application can quickly perform model training by using semi-automatically labeled video frames, thereby improving the efficiency of model training and ensuring the accuracy of the model to a certain extent.
  • the embodiments of the present invention further provide embodiments of apparatuses for implementing the foregoing method embodiments.
  • FIG. 7 is a schematic structural diagram of a video annotation apparatus provided by an embodiment of the present application.
  • the included modules are used to execute the steps in the embodiments corresponding to FIG. 1 to FIG. 6 .
  • the video annotation device 7 includes:
  • a video capture module 71 configured to capture a video frame sequence related to a work scene through an RGB-D image capture device; each video frame in the video frame sequence includes an object to be marked;
  • a video processing module 72 configured to acquire the target device posture parameters when the RGB-D image acquisition device collects the respective video frames, and construct a three-dimensional scene point cloud of the work scene according to the target device posture parameters;
  • to obtain the first object information of the object to be marked in the point cloud of the three-dimensional scene by setting a three-dimensional object model on the object to be marked in the point cloud of the three-dimensional scene; and
  • to mark the to-be-labeled objects included in the respective video frames with second object information according to the first object information.
  • Optionally, the video processing module 72 is further configured to adopt the simultaneous localization and mapping (SLAM) algorithm, taking the initial device pose parameter at which the RGB-D image acquisition device collects the first video frame as the reference coordinate system, to obtain the target device pose parameters at which the RGB-D image acquisition device collects each video frame; and
  • to convert the pixels in each video frame into the reference coordinate system to obtain the video frames in the reference coordinate system.
  • the video processing module 72 is further configured to convert the pixels in each video frame to the camera coordinate system, and correspondingly obtain the video frame under the camera coordinate system;
  • the video frame in the camera coordinate system is converted into the reference coordinate system to obtain the video frame in the reference coordinate system.
  • the first object information includes: object category information, current identification information and first object pose information of the object to be marked in the three-dimensional scene point cloud;
  • the video processing module 72 is further configured to acquire the three-dimensional object model corresponding to the object to be marked in the three-dimensional scene point cloud, and the three-dimensional object model is set with corresponding object category information;
  • to set the current identification information of the objects to be labelled in the 3D scene point cloud, and to use the object category information of the 3D object model as the object category information of the objects to be labelled in the 3D scene point cloud;
  • to convert the 3D object model into the reference coordinate system where the 3D scene point cloud is located, obtaining the converted 3D object model; and
  • to determine the first object pose information of the object to be marked in the 3D scene point cloud by setting the converted 3D object model at the position of the object to be marked in the 3D scene point cloud.
  • Optionally, the video processing module 72 is further configured to set initial pose parameters for the 3D object model and to convert the 3D object model into the reference coordinate system according to the initial pose parameters;
  • if the converted 3D object model fits the object to be marked in the 3D scene point cloud, to determine the initial pose parameters as the first object pose information of the object to be marked in the 3D scene point cloud;
  • if the converted 3D object model does not fit the object to be marked in the 3D scene point cloud, to adjust the initial pose parameters and re-acquire the converted 3D object model according to the adjusted initial pose parameters until the re-acquired converted 3D object model fits the object to be marked in the 3D scene point cloud, and to use the adjusted initial pose parameters as the first object pose information.
  • the second object information includes: object category information, current identification information, second object pose information, mask boundary information and bounding box information corresponding to the object to be marked in the video frame;
  • Optionally, the video processing module 72 is further configured to obtain, according to the target device pose parameter corresponding to the i-th video frame and the first object pose information of the j-th object to be labeled, the second object pose information of the j-th object to be labeled relative to the i-th video frame; and
  • to use the convex hull algorithm to calculate the mask boundary information of the j-th object to be labeled in the i-th video frame under the image coordinate system, and to obtain the corresponding bounding box information under the image coordinate system according to the mask boundary information.
  • the video acquisition module 71 is further configured to use a fixed-focus camera to photograph the object to be marked in multiple directions to obtain multiple frames of images of the object to be marked;
  • the video processing module 72 is further configured to obtain the object point cloud of the object to be marked according to the multi-frame images; and,
  • and, according to the object point cloud, to construct a three-dimensional object model of the object to be marked.
  • FIG. 8 is a schematic structural diagram of a video annotation device provided by an embodiment of the present application.
  • the video annotation device 8 of this embodiment includes: a processor 80 , a memory 81 , and a computer program 82 stored in the memory 81 and executable on the processor 80 , such as a video annotation program.
  • the processor 80 executes the computer program 82, the steps in each of the foregoing video tagging method embodiments are implemented, for example, S101-S104 shown in FIG. 1 .
  • the processor 80 executes the computer program 82
  • the functions of the modules/units in the above-mentioned device embodiments for example, the functions of the modules 71 and 72 shown in FIG. 7 are implemented.
  • the computer program 82 can be divided into one or more modules/units, and the one or more modules/units are stored in the memory 81 and executed by the processor 80 to complete the present application .
  • the one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, and the instruction segments are used to describe the execution process of the computer program 82 in the video annotation device 8 .
  • the computer program 82 may be divided into an acquisition module and a processing module. For specific functions of each module, please refer to the relevant descriptions in the corresponding embodiments in FIG. 1 to FIG. 6 , which will not be repeated here.
  • the video annotation device may include, but is not limited to, a processor 80 and a memory 81 .
  • It should be noted that FIG. 8 is only an example of the video annotation device 8 and does not constitute a limitation on the video annotation device 8, which may include more or fewer components than shown, or combine certain components, or use different components.
  • the video annotation device may also include input and output devices, network access devices, buses, and the like.
  • The so-called processor 80 may be a central processing unit (Central Processing Unit, CPU), and may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), an off-the-shelf field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the memory 81 may be an internal storage unit of the video annotation device 8 , such as a hard disk or a memory of the video annotation device 8 .
  • the memory 81 can also be an external storage device of the video labeling device 8, such as a plug-in hard disk equipped on the video labeling device 8, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, flash memory card (Flash Card), etc.
  • the memory 81 may also include both an internal storage unit of the video annotation device 8 and an external storage device.
  • the memory 81 is used to store the computer program and other programs and data required by the video annotation device.
  • the memory 81 can also be used to temporarily store data that has been output or will be output.
  • Embodiments of the present application further provide a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the foregoing border and coastal defense monitoring method can be implemented.
  • An embodiment of the present application provides a computer program product; when the computer program product runs on a video annotation device, the video annotation device, when executing it, can implement the above-mentioned border and coastal defense monitoring method.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The present application provides a video annotation method, apparatus, device, and computer-readable storage medium, relating to the technical field of computer vision and image processing, which can improve the annotation efficiency of video data and reduce labor costs to a certain extent. The method includes: collecting a video frame sequence of a working scene through an RGB-D image acquisition device, each video frame in the video frame sequence including an object to be annotated; acquiring target device pose parameters at which the RGB-D image acquisition device collects each video frame, and constructing a three-dimensional scene point cloud of the working scene according to the target device pose parameters; obtaining first object information of the object to be annotated in the three-dimensional scene point cloud by setting a three-dimensional object model at the position of the object to be annotated in the three-dimensional scene point cloud; and annotating the object to be annotated included in each video frame with second object information according to the first object information and the target device pose parameters.

Description

A video annotation method, apparatus, device, and computer-readable storage medium
Technical Field
The present application relates to the technical field of computer vision and image processing, and in particular to a video annotation method, apparatus, device, and computer-readable storage medium.
Background
Robotic arms are widely used in industry, retail and services, including grabbing and distributing goods, sorting components on production lines, and depalletizing and palletizing in logistics. A traditional robotic arm lacks perception of its environment, so its actions and behaviors can only be determined by pre-programming in a static environment (for example, the offline teaching widely used by industrial robots); moreover, a traditional robotic arm often requires custom fixtures or mechanical structures so that the objects to be sorted move or are placed along a prescribed trajectory. This makes it inflexible, and every different scene requires a customized design, resulting in high costs.
With the development of technology, visual guidance algorithms based on deep learning have emerged. Visual guidance mainly involves two aspects: target detection and target pose estimation. At present, deep-learning-based algorithms all require a large amount of annotated data for training. A camera can collect a large amount of video data, which is then annotated manually, and the annotated video data is used to train the target detection model and the target pose estimation model. However, manual annotation suffers from problems such as low annotation efficiency and high labor costs.
Summary of the Invention
Embodiments of the present application provide a video annotation method, apparatus, device, and computer-readable storage medium, which can improve the annotation efficiency of video data and reduce labor costs to a certain extent.
In view of this, in a first aspect, the present application provides a video annotation method, including:
collecting a video frame sequence of a working scene through an RGB-D image acquisition device, each video frame in the video frame sequence including an object to be annotated;
acquiring target device pose parameters at which the RGB-D image acquisition device collects each video frame, and constructing a three-dimensional scene point cloud of the working scene according to the target device pose parameters;
obtaining first object information of the object to be annotated in the three-dimensional scene point cloud by setting a three-dimensional object model at the position of the object to be annotated in the three-dimensional scene point cloud;
annotating the object to be annotated included in each video frame with second object information according to the first object information and the target device pose parameters.
With the above method, the present application can construct a three-dimensional scene point cloud for the working scene, perform a single manual annotation of the objects to be annotated in the three-dimensional scene point cloud, and map the manually annotated information in the three-dimensional scene point cloud to each video frame. It can be seen that the present application achieves semi-automatic video annotation, shortens annotation time, improves annotation efficiency, and avoids the inefficiency of manually annotating data.
In a second aspect, the present application provides a video annotation apparatus, including:
a video capture module, configured to collect a video frame sequence of a working scene through an RGB-D image acquisition device, each video frame in the video frame sequence including an object to be annotated;
a video processing module, configured to acquire target device pose parameters at which the RGB-D image acquisition device collects each video frame, and to construct a three-dimensional scene point cloud of the working scene according to the target device pose parameters;
to obtain first object information of the object to be annotated in the three-dimensional scene point cloud by setting a three-dimensional object model on the object to be annotated in the three-dimensional scene point cloud; and
to annotate the object to be annotated included in each video frame with second object information according to the first object information.
In a third aspect, the present application provides a video annotation device, including a processor, a memory, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the method described in the first aspect or any optional manner of the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method described in the first aspect or any optional manner of the first aspect.
In a fifth aspect, the present application provides a computer program product which, when run on a border and coastal defense monitoring device, causes the border and coastal defense monitoring device to perform the steps of the method described in the first aspect or any optional manner of the first aspect.
It can be understood that, for the beneficial effects of the second to fifth aspects above, reference may be made to the relevant description in the first aspect, which is not repeated here.
Brief Description of the Drawings
In order to explain the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of a video annotation method provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of another video annotation method provided by an embodiment of the present application;
FIG. 3 is a schematic flowchart of another video annotation method provided by an embodiment of the present application;
FIG. 4 is a schematic flowchart of another video annotation method provided by an embodiment of the present application;
FIG. 5 is a schematic flowchart of target detection and target pose estimation provided by an embodiment of the present application;
FIG. 6 is a schematic flowchart of another video annotation method provided by an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a video annotation apparatus provided by an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a video annotation device provided by an embodiment of the present application.
Detailed Description
In the following description, for the purpose of illustration rather than limitation, specific details such as particular system structures and technologies are set forth in order to provide a thorough understanding of the embodiments of the present application. However, it should be clear to those skilled in the art that the present application can also be implemented in other embodiments without these specific details. In other cases, detailed descriptions of well-known systems, apparatuses, circuits and methods are omitted so that unnecessary details do not obscure the description of the present application.
It should be understood that the term "and/or" used in the specification and the appended claims of the present application refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations. In addition, in the description of the specification and the appended claims of the present application, the terms "first", "second", "third", etc. are only used to distinguish the descriptions and cannot be understood as indicating or implying relative importance.
It should also be understood that references to "one embodiment" or "some embodiments" and the like in the specification of the present application mean that a particular feature, structure or characteristic described in connection with that embodiment is included in one or more embodiments of the present application. Thus, the phrases "in one embodiment", "in some embodiments", "in some other embodiments", "in still other embodiments", etc. appearing in different places in this specification do not necessarily all refer to the same embodiment, but mean "one or more but not all embodiments", unless specifically emphasized otherwise. The terms "comprising", "including", "having" and their variants all mean "including but not limited to", unless specifically emphasized otherwise.
First, the application scenarios involved in the present application are described by way of example.
For example, in an automatic goods-sorting scenario, a target recognition model and a target pose estimation model need to be trained in advance with annotated video data, so that the category and pose of the goods in the scene can be obtained through the trained target recognition model and target pose estimation model. In this way, when a robotic arm is used to sort the goods, compared with manual sorting, the robotic arm can work 24 hours a day without interruption and at a lower operating cost.
As another example, in an AR (augmented reality) scenario based on a smart terminal, a target pose estimation model needs to be trained in advance with annotated video data, so that pose estimation can be performed accurately when the user interacts with virtual items. Of course, the present application can also be applied to other scenarios in which deep learning is performed on video data that needs to be annotated, and the present application places no special restriction on the specific application scenario.
The video annotation method provided by the present application is described below by way of example through specific embodiments.
Referring to FIG. 1, FIG. 1 is a schematic flowchart of a video annotation method provided by the present application. As shown in FIG. 1, the video annotation method may include the following steps:
S101: Collect a video frame sequence of the working scene through an RGB-D image acquisition device; each video frame in the video frame sequence includes an object to be annotated.
The object to be annotated involved in the present application, and the three-dimensional object model involved in the subsequent process, may be rigid objects.
In this embodiment of the present application, a single video frame in the video frame sequence can be expressed as Formula 1:
I = {((r,g,b)_1, (u,v)_1, d_1), ..., ((r,g,b)_k, (u,v)_k, d_k)}    (Formula 1)
In Formula 1, ((r,g,b)_k, (u,v)_k, d_k) represents the information contained in the k-th pixel of the single video frame; (r,g,b)_k represents the color information of the k-th pixel, i.e. the gray value of the red component r, the gray value of the green component g, and the gray value of the blue component b; (u,v)_k represents the coordinate information of the k-th pixel; d_k represents the depth value of the k-th pixel; and k represents the total number of pixels in the single video frame.
It can be understood that, since the present application uses an RGB-D image acquisition device, the gray values of the color components and the depth values of the pixels can be collected at the same time. Therefore, each video frame in the video frame sequence may include a three-channel RGB image and a single-channel depth image, or a four-channel RGB-D image, which is not specially limited in the present application.
The execution subject in the present application may be a video annotation device with data processing capability, such as a terminal device or a server.
The number of RGB-D image acquisition devices may be one or more. If multiple RGB-D image acquisition devices are installed in the working scene, the acquisition efficiency for the working scene can be improved to a certain extent.
Optionally, if the present application is applied to an automatic goods-sorting scenario, the RGB-D image acquisition device may be installed on a robotic arm; in this case, the working scene may be the scene in which the robotic arm works.
Optionally, if the present application is applied to a smart-terminal-based AR scenario, the RGB-D image acquisition device may be the RGB-D camera built into the smart terminal.
In an optional embodiment of the present application, each video frame may include one or more objects to be annotated, and since each video frame is an image of the working scene, the same object to be annotated may appear in different video frames; the present application does not limit this.
S102,获取RGB-D图像采集设备采集各个视频帧时的目标设备姿态参数,以及根据目标设备姿态参数构建工作场景的三维场景点云。
在本申请的一些实施例中,可以采用SLAM(simultaneous localization and mapping,同时定位与建图)算法,以RGB-D图像采集设备采集第一视频帧时的初始设备姿态参数为参考坐标系,获取RGB-D图像采集设备采集各个视频帧时的目标设备姿态参数;第一视频帧为视频帧序列中的一张视频帧。
示例性的,第一视频帧可以为视频帧序列中的首张视频帧。当然,第一视频帧还可以为视频帧序列中的任一视频帧,本申请对此不作特殊限制。
进一步地,视频帧序列中各个视频帧对应的目标设备姿态参数可以如公式2所示:
P_list = (P_1, P_2, ..., P_n) = SLAM(I_1, I_2, ..., I_n)          公式2
其中,P list表示视频帧序列中各个视频帧对应的目标设备姿态参数的列表,P n表示RGB-D图像采集设备采集视频帧序列中第n张视频帧时的目标设备姿态参数,I n表示视频帧序列中的第n张视频帧,SLAM()表示同时定位与建图算法。
可以理解的是,若以RGB-D图像采集设备采集首张视频帧I 1时的初始设备姿态参数为参考坐标系,则公式2中获取到的P n为以该初始设备姿态参数为参考坐标系得到的设备姿态参数。
以视频帧序列中的第i张视频帧为例进行说明,i为正整数。目标设备姿态参数可以由设备旋转矩阵R 1和设备平移矩阵T 1组成,具体可以表示为:
P_i = [R_1  T_1] =
[ r_11  r_12  r_13  t_1 ]
[ r_21  r_22  r_23  t_2 ]
[ r_31  r_32  r_33  t_3 ]
r 11、r 12、r 13、r 21、r 22、r 23、r 31、r 32、r 33为设备旋转矩阵R 1中的元素,且通过SLAM算法计算得到的;t 1、t 2、t 3为设备平移矩阵T 1中的元素,且同样通过SLAM算法计算得到的。
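下面给出由SLAM输出的设备旋转矩阵与设备平移矩阵组合成姿态矩阵的示意代码(仅为说明性草图,SLAM算法本身的接口不在此展开):

```python
import numpy as np

def compose_pose(R, T):
    """由 3×3 的设备旋转矩阵 R 与 3×1 的设备平移矩阵 T 组成 3×4 姿态矩阵 [R T]。"""
    R = np.asarray(R, dtype=np.float64).reshape(3, 3)
    T = np.asarray(T, dtype=np.float64).reshape(3, 1)
    return np.hstack([R, T])

def to_homogeneous(P):
    """将 3×4 姿态矩阵扩展为 4×4 齐次矩阵,便于姿态的连乘与求逆。"""
    return np.vstack([P, np.array([[0.0, 0.0, 0.0, 1.0]])])
```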
在本申请实施例中,在获取到目标设备姿态参数之后,可以根据目标设备姿态参数构建工作场景的三维场景点云。具体地,可以包括以下步骤:
(1)根据各个视频帧对应的目标设备姿态参数,将每张视频帧中的像素点转换至参考坐标系下,得到参考坐标系下的视频帧。
具体地,可以通过以下两个步骤a1和a2进行参考坐标系的转换:
a1、可以将每张视频帧中的像素点转换至相机坐标系下,对应得到相机坐标系下的视频帧。
在本申请实施例中,相机坐标系下的单张视频帧可以如公式3所示:
G_cam = {((r,g,b)_1, (x_cam, y_cam, z_cam)_1), ..., ((r,g,b)_k, (x_cam, y_cam, z_cam)_k)}   公式3
其中,G cam表示相机坐标系下的单张视频帧;(r,g,b) k表示相机坐标系下的单张视频帧中第k个像素点对应的颜色信息,即红色颜色分量r的灰度值,对应绿色颜色分量g的灰度值,以及对应蓝色颜色分量b的灰度值;(x cam,y cam,z cam) k表示相机坐标系下的单张视频帧中第k个像素点的坐标信息,x cam=d k*(u k-c x)/f x,y cam=d k*(v k-c y)/f y,z cam=d k,c x表示RGB-D图像采集设备中像主点在x轴上的图像坐标,c y表示RGB-D图像采集设备中像主点在y轴上的图像坐标,f x表示RGB-D图像采集设备在x轴上的焦距;f y表示RGB-D图像采集设备在y轴上的焦距,d k表示单张视频帧中第k个像素点的深度值,u k表示该单张视频帧中第k个像素点在x轴上的坐标信息,v k表示该单张视频帧中第k个像素点在y轴上的坐标信息。
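步骤a1中由像素坐标和深度值反投影得到相机坐标的计算,可以用如下示意代码表达(按上文给出的x_cam、y_cam、z_cam公式实现,内参符号与正文一致):

```python
import numpy as np

def backproject_to_camera(depth, fx, fy, cx, cy):
    """按 x_cam = d*(u - c_x)/f_x、y_cam = d*(v - c_y)/f_y、z_cam = d
    将深度图反投影为相机坐标系下的点。"""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w].astype(np.float64)
    z = depth.astype(np.float64)
    x = z * (u - cx) / fx
    y = z * (v - cy) / fy
    return np.stack([x, y, z], axis=-1)             # H×W×3 的 (x_cam, y_cam, z_cam)
```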
a2、根据各个视频帧对应的目标设备姿态参数,将相机坐标系下的视频帧转换至参考坐标系下,得到参考坐标系下的视频帧。
在本申请实施例中,参考坐标系下的单张视频帧可以如公式4所示:
G_ref = {((r,g,b)_1, (x_ref, y_ref, z_ref)_1), ..., ((r,g,b)_k, (x_ref, y_ref, z_ref)_k)}   公式4
其中,G ref表示参考坐标系下的单张视频帧;(r,g,b) k表示参考坐标系下的单张视频帧中第k个像素点对应红色颜色分量r的灰度值,对应绿色颜色分量g的灰度值,以及对应蓝色颜色分量b的灰度值;(x ref,y ref,z ref) k表示参考坐标 系下的单张视频帧中第k个像素点的坐标信息。
进一步地,
[x_ref, y_ref, z_ref]_k^T = P·[x_cam, y_cam, z_cam, 1]_k^T
P表示该单张视频帧对应的目标设备姿态参数。
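步骤a2中利用目标设备姿态参数P将相机坐标系下的点转换到参考坐标系,可以用如下示意代码表达(假设P为[R T]形式的3×4矩阵):

```python
import numpy as np

def camera_to_reference(points_cam, P):
    """利用设备姿态参数 P = [R T](3×4)将相机坐标系下的 N×3 点集转换到参考坐标系。"""
    R, T = P[:, :3], P[:, 3:]
    return (R @ points_cam.T + T).T                 # N×3 的 (x_ref, y_ref, z_ref)
```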
由此可见,可以通过上述(1)所述的步骤将视频帧序列中的每张视频帧转换至参考坐标系下。
(2)对参考坐标系下的视频帧进行数据合并处理,得到合并处理后的视频帧。
在本申请实施例中,合并处理后的视频帧可以如公式5所示的方式表示:
G_raw = Merge(G_1^ref, G_2^ref, ..., G_n^ref)         公式5
其中,G raw表示合并处理后的视频帧,Merge()表示合并函数,G n ref表示参考坐标系下的第n张视频帧。
(3)对合并处理后的视频帧进行去噪平滑处理,得到工作场景的三维场景点云。
在本申请实施例中,可以采用PCL(point cloud library,点云库)中的预设处理函数对合并处理后的视频帧进行去噪声处理和平滑处理,得到三维场景点云。
示例性的,该预设处理函数可以为MovingLeastSquares函数。此时,去噪平滑处理的过程可以表示为:G final=MovingLeastSquares(G raw),其中,G final表示三维场景点云。
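合并与去噪平滑的整体流程可参考如下示意代码;其中去噪部分以Open3D的统计滤波作为MovingLeastSquares的替代示意,并非本申请限定的实现方式:

```python
import numpy as np
import open3d as o3d

def merge_and_denoise(frames_ref):
    """frames_ref: 参考坐标系下各视频帧的列表,每帧为 N_i×6 数组(前3列坐标、后3列颜色)。
    先做数据合并(对应公式5),再用统计滤波近似代替去噪平滑。"""
    merged = np.vstack(frames_ref)
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(merged[:, :3])
    pcd.colors = o3d.utility.Vector3dVector(merged[:, 3:] / 255.0)
    pcd, _ = pcd.remove_statistical_outlier(nb_neighbors=20, std_ratio=2.0)
    return pcd                                      # 近似的三维场景点云
```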
需要说明的是,在本申请的另一些可选实施例中,可以采用SFM(structure from motion,从运动恢复结构)算法,根据视频帧序列构建工作场景的三维场景点云,本申请对构建三维场景点云的方法不作限定。
S103,通过对三维场景点云中待标注对象所在的位置处设置三维对象模型,得到三维场景点云中待标注对象的第一对象信息。
可以理解的是,由于视频帧序列中的各个视频帧包括待标注对象,故通过各个视频帧构建的三维场景点云中仍然存在待标注对象。这样,本申请可以在三维场景点云中的待标注对象处,放置对应的三维对象模型,使得三维对象模型贴合三维场景点云中的待标注对象,以完成单次标注。
进一步地,可以通过以下步骤进行三维对象模型的放置:
(1)获取三维场景点云中待标注对象对应的三维对象模型,三维对象模型 设置有对应的对象类别信息。
在本申请实施例中,三维对象模型可以如公式6所示的方式表示:
OBJ = {(id, class, (r,g,b)_1, (x_obj, y_obj, z_obj)_1), ..., (id, class, (r,g,b)_s, (x_obj, y_obj, z_obj)_s)}   公式6
其中,OBJ表示三维对象模型;id表示三维对象模型的初始标识信息(如编号、顺序编码等),初始标识信息可以为预先随机设置的;class表示三维对象模型的对象类别信息;(r,g,b) s表示三维对象模型中第s个点的颜色信息,即对应红色颜色分量r的灰度值,对应绿色颜色分量g的灰度值,以及对应蓝色颜色分量b的灰度值;(x obj,y obj,z obj) s表示三维对象模型中第s个点对应的坐标信息,其中该坐标信息可以以三维对象模型的自身坐标系为基准。
(2)按照三维场景点云中待标注对象的标注顺序,设置三维场景点云中待标注对象的当前标识信息,以及将三维对象模型的对象类别信息作为三维场景点云中待标注对象的对象类别信息。
示例性的,若待标注对象包括5个对象,则可以按照三维场景点云中5个待标注对象的标注顺序,设置5个待标注对象的当前标识信息。另外,本申请还可以将获取到的三维对象模型中的初始标识信息更改为对应待标注对象的当前标识信息。
(3)通过将三维对象模型转换至三维场景点云所在的参考坐标系下,得到转换后的三维对象模型。
在本申请实施例中,可以对三维对象模型设置初始姿态参数,并根据初始姿态参数将三维对象模型转换至三维场景点云所在的参考坐标系下,得到转换后的三维对象模型。
可选地,初始姿态参数包括初始平移矩阵和初始旋转矩阵。其中,可以在三维场景点云中,选取三维对象模型对应的待标注对象上的任一点,并使用该任一点的坐标信息赋值初始姿态参数的初始平移矩阵。另外,初始旋转矩阵可以设置为单位矩阵。
可选地,初始姿态参数可以如公式7所示:
P_obj = [R_2  T_2]    公式7
其中,P_obj表示三维对象模型的初始姿态参数;R_2表示三维对象模型的初始旋转矩阵,由三维对象模型绕x、y、z轴的初始旋转角度φ、θ、ψ确定;T_2表示三维对象模型的初始平移矩阵,T_2 = [tx, ty, tz]^T;φ表示三维对象模型在x轴方向上的初始旋转角度,θ表示三维对象模型在y轴方向上的初始旋转角度,ψ表示三维对象模型在z轴方向上的初始旋转角度,tx表示三维对象模型在x轴方向上的初始平移距离,ty表示三维对象模型在y轴方向上的初始平移距离,tz表示三维对象模型在z轴方向上的初始平移距离。
转换后的三维对象模型可以如公式8所示:
OBJ_new(x_s, y_s, z_s) = P_obj·OBJ(x_s, y_s, z_s)          公式8
其中,OBJ_new(x_s, y_s, z_s)表示转换后的三维对象模型中的第s个点;P_obj表示三维对象模型的初始姿态参数;OBJ(x_s, y_s, z_s)表示三维对象模型中的第s个点。
通过公式8,可以将三维对象模型中的各个点均转换至三维场景点云所在的参考坐标系下,得到转换后的三维对象模型。
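将三维对象模型按初始姿态参数变换到参考坐标系的过程,可以用如下示意代码表达(初始旋转取单位阵、初始平移取所选点坐标的赋值方式与正文一致,具体数值仅为示意):

```python
import numpy as np

def transform_object_model(obj_points, R2, T2):
    """按公式8将 S×3 的三维对象模型点集变换到三维场景点云所在的参考坐标系。"""
    return (R2 @ obj_points.T + T2.reshape(3, 1)).T

# 用法示意:初始旋转矩阵取单位阵,初始平移矩阵取待标注对象上所选点的坐标
# R2 = np.eye(3)
# T2 = np.array([0.5, 0.2, 1.0])      # 仅为示意数值
# OBJ_new = transform_object_model(OBJ_xyz, R2, T2)
```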
(4)通过在三维场景点云中的待标注对象所在的位置处设置转换后的三维对象模型,确定三维场景点云中待标注对象的第一对象姿态信息。
在本申请的一些实施例中,在转换后的三维对象模型与三维场景点云中的待标注对象之间贴合的情况下,确定初始姿态参数为三维场景点云中待标注对象的第一对象姿态信息。
可以理解的是,转换后的三维对象模型与三维场景点云中的待标注对象之间贴合可以理解为:转换后的三维对象模型与三维场景点云中的待标注对象之间的重合度为最大重合度。其中,可以通过人工识别重合度,以判断是否处于贴合状态。
在本申请的另一些实施例中,在转换后的三维对象模型与三维场景点云中的待标注对象之间不贴合的情况下,调整上述所述的初始姿态参数,根据调整后的初始姿态参数重新获取转换后的三维对象模型,直至重新获取到的转换后的三维对象模型与三维场景点云中的待标注对象之间贴合,将调整后的初始姿态参数作为第一对象姿态信息。
可以理解的是,第一对象姿态信息包括调整后的旋转矩阵和调整后的平移矩阵。
需要说明的是,由初始姿态参数的公式可知,通过调整φ、θ、ψ、tx、ty、tz这几个参数,即可达到调整初始姿态参数的目的。
这样,在对三维对象模型设置完成后,可以获取到该三维场景点云中待标注对象的第一对象信息,并对待标注对象标注该第一对象信息。其中,第一对象信息可以包括三维场景点云中待标注对象的对象类别信息、第一对象姿态信息以及当前标识信息,对象类别信息与三维对象模型的对象类别信息相同。
可选地,可以对该次标注进行记录,该待标注对象的标注记录信息可以表示为:RES={id,class,P obj},class表示对象类别信息,P obj表示第一对象姿态信息,id表示当前标识信息。
重复执行上述(1)至(4)所述的步骤,直至三维场景点云中的待标注对象全部完成标注,该三维场景点云中全部待标注对象的标注记录信息可以表示为:RES list={RES 1,...,RES m},RES m表示该三维场景点云中对第m个待标注对象对应的标注记录信息。
需要说明的是,上述涉及的三维对象模型为预先通过逆向工程三维建模方式得到的模型。
S104,根据第一对象信息和目标设备姿态参数,对各个视频帧包括的待标注对象标注第二对象信息。
其中,第二对象信息可以包括视频帧中待标注对象的对象类别信息、第二对象姿态信息、当前标识信息、边界框信息以及掩膜边界信息。
在本申请实施例中,以对视频帧序列中第i张视频帧,以及针对三维场景点云中第j个待标注对象进行标注为例进行说明。具体地,可以包括以下步骤:
(1)获取视频帧序列中第i张视频帧对应的目标设备姿态参数,以及获取三维场景点云中第j个待标注对象的对象类别信息、当前标识信息以及第一对象姿态信息,i、j均为正整数。
(2)将三维场景点云中第j个待标注对象的对象类别信息、当前标识信息,分别作为第i张视频帧中第j个待标注对象的对象类别信息和当前标识信息。
(3)根据第i张视频帧对应的目标设备姿态参数以及第j个待标注对象的第一对象姿态信息,获取第j个待标注对象相对第i张视频帧的第二对象姿态信息。
其中,第j个待标注对象相对第i张视频帧的第二对象姿态信息可以如公式9所示:
P_j^i = [R_j^i  T_j^i]          公式9
在公式9中,P_j^i表示第j个待标注对象相对第i张视频帧的第二对象姿态信息;R_j^i表示第j个待标注对象相对第i张视频帧的旋转姿态信息,R_j^i = R_i^(-1)·R_j,R_i表示第i张视频帧对应的设备旋转矩阵,R_j表示第j个待标注对象对应的调整后的旋转矩阵;T_j^i表示第j个待标注对象相对第i张视频帧的平移姿态信息,T_j^i = R_i^(-1)·(T_j − T_i),T_i表示第i张视频帧对应的设备平移矩阵,T_j表示第j个待标注对象对应的调整后的平移矩阵。
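基于上述关系,第二对象姿态信息的计算可以用如下示意代码表达(假设设备姿态与对象的第一对象姿态均以同一参考坐标系为基准):

```python
import numpy as np

def relative_object_pose(R_i, T_i, R_j, T_j):
    """由第i帧的设备姿态 (R_i, T_i) 与第j个待标注对象的第一对象姿态 (R_j, T_j),
    计算对象相对该帧的第二对象姿态 (R_j^i, T_j^i)。"""
    R_i_inv = R_i.T                                 # 旋转矩阵的逆即其转置
    R_rel = R_i_inv @ R_j
    T_rel = R_i_inv @ (T_j.reshape(3, 1) - T_i.reshape(3, 1))
    return R_rel, T_rel
```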
(4)将第j个待标注对象的对象点云映射至第i张视频帧的图像坐标系下,得到图像坐标系下的第j个待标注对象。
在本申请实施例中,图像坐标系下的第j个待标注对象可以如公式10所示:
s·[u, v, 1]^T = K·P_j^i·[x_obj, y_obj, z_obj, 1]^T          公式10
其中,K为RGB-D图像采集设备的相机内参矩阵;P_j^i表示第j个待标注对象相对第i张视频帧的第二对象姿态信息;(x_obj, y_obj, z_obj)表示三维场景点云包括的第j个待标注对象中任一点的坐标信息;(u,v)表示该任一点映射至第i张视频帧的图像坐标系下得到的坐标信息;s表示缩放系数。
通过公式10,可以得到三维场景点云包括的待标注对象中各个点映射至第i张视频帧的图像坐标系下的坐标信息,故图像坐标系下的第j个待标注对象的坐标信息可以表示为UV={(u 1,v 1),...,(u k,v k)},其中,(u k,v k)表示三维场景点云包括的第j个待标注对象中第k个点映射至第i张视频帧的图像坐标系下得到的坐标信息。
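公式10所述的投影过程可以用如下示意代码表达(K、R_j^i、T_j^i的含义与正文一致):

```python
import numpy as np

def project_to_image(obj_points, K, R_rel, T_rel):
    """按公式10将对象点云投影到第i张视频帧的图像坐标系,返回 N×2 的 (u, v) 集合。"""
    cam = R_rel @ obj_points.T + T_rel.reshape(3, 1)    # 变换到该帧相机坐标系
    uvs = K @ cam                                       # 得到 s·[u, v, 1]^T
    return (uvs[:2] / uvs[2:3]).T                       # 除以缩放系数 s
```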
(5)采用凸包算法计算图像坐标系下的第j个待标注对象对应的掩膜边界信息,并根据掩膜边界信息获取图像坐标系下的第j个待标注对象对应的边界框信息。
其中,凸包算法可以为Graham凸包算法等。
示例性的,通过Graham凸包算法获取掩膜边界信息可以如公式11所示:
M = {(u_1, v_1), ..., (u_c, v_c)} = Graham(UV)              公式11
其中,M表示图像坐标系下的第j个待标注对象对应的掩膜边界信息,(u c,v c)表示图像坐标系下的第j个待标注对象对应的掩膜边界点的坐标信息,Graham()表示Graham凸包算法的函数。
示例性的,获取边界框信息的过程可以如公式12所示:
B = {(u_top, v_top), (u_bottom, v_bottom)} = Box(M)             公式12
其中,Box(M)表示基于掩膜边界信息获取边界框信息的函数,(u top,v top)表示掩膜边界信息中左上角掩膜边界点的坐标信息,(u bottom,v bottom)表示掩膜边界信息中右下角掩膜边界点的坐标信息。
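掩膜边界信息与边界框信息的计算可以用如下示意代码表达;这里借助OpenCV的convexHull求凸包(凸包结果与Graham算法一致,仅实现不同):

```python
import numpy as np
import cv2

def mask_and_box(uv):
    """uv: N×2 的投影像素坐标。先求掩膜边界信息 M,再由 M 求边界框信息 B。"""
    pts = np.asarray(uv, dtype=np.float32).reshape(-1, 1, 2)
    hull = cv2.convexHull(pts).reshape(-1, 2)           # 掩膜边界点集合 M
    u_top, v_top = hull.min(axis=0)                     # 左上角坐标
    u_bottom, v_bottom = hull.max(axis=0)               # 右下角坐标
    return hull, ((u_top, v_top), (u_bottom, v_bottom))
```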
这样,第j个待标注对象在第i张视频帧中的标注记录信息可以表示为:
RES_j^i = {class_j, P_j^i, id_j, B_j^i, M_j^i}
其中,class_j表示第j个待标注对象的对象类别信息,P_j^i表示第j个待标注对象相对第i张视频帧的第二对象姿态信息,id_j表示第j个待标注对象的当前标识信息,B_j^i表示第j个待标注对象在第i张视频帧中的边界框信息,M_j^i表示第j个待标注对象在第i张视频帧中的掩膜边界信息。
通过执行上述(1)至(5)所述的步骤,可以针对每张视频帧中的各个待标注对象均进行信息标注。
还应理解,考虑到单张视频帧中可能存在多个待标注对象(例如5个待标注对象),这样,在现有的人工标注过程中,需要对单张视频帧中的每个待标注对象人工标注对象类别信息、对象标识信息、边界框信息、掩膜边界信息以及对象姿态信息,导致耗时较长;并且,对象姿态信息的标注较复杂,人工标注效率较低;以及,视频帧序列包括多张视频帧,需要针对每张视频帧中的待标注对象均进行人工标注。综上所述,现有的人工标注存在标注效率较低的问题。而本申请可以仅对三维场景点云中的待标注对象进行单次人工标注,然后根据单次人工标注的信息自动标注各个视频帧中的待标注对象,无需在各个视频帧中进行人工标注,减少对各个待标注对象人工标注的过程。
为了进一步地验证本申请中视频标注方法的标注效率,以真实场景做测试,每个场景有5~6个待标注对象,每段视频有20~30张有效视频帧,采用传统的人工标注方法,每张视频帧耗时为7分钟;采用本申请所述的视频标注方法,每张视频帧耗时为1.5分钟。由此可见,本申请提出的视频标注方法大大降低了标注耗时。
采用上述所述的方法,本申请可以对工作场景构建三维场景点云,对三维场景点云中的待标注对象进行单次人工标注,并将三维场景点云中人工标注的信息映射至各个视频帧中。由此可见,本申请实现了半自动视频标注,减少了标注时长,提高了标注效率,避免了人工标注数据低效的问题。
请参见图2,图2是基于图1所示实施例提供的一种视频标注方法的流程示意图。如图2所示,该视频标注方法还可以包括以下步骤:
S105,通过定焦相机对待标注对象进行多个方位的拍摄,得到待标注对象的多帧图像。
需要说明的是,若RGB-D图像采集设备为定焦,则本步骤可以通过RGB-D图像采集设备对待标注对象进行多个方位的拍摄,得到待标注对象的多帧图像。
可选地,为了使得待标注对象的信息较为完整,本申请中的定焦相机可以对待标注对象进行全方位的拍摄,得到待标注对象的多帧图像。
需要说明的是,S105可以在S101之前执行,也可以在S102之前执行,本申请对S105的时序不作限定。
S106,根据多帧图像,获取待标注对象的对象点云。
其中,本申请可以采用摄像测量技术,从多帧图像中获取待标注对象的对象点云。例如,可以直接采用摄影测量软件Meshroom获取待标注对象的对象点云。
可以理解的是,对象点云可以为三维的位置坐标信息。当然,对象点云中还可以包括颜色信息或者强度信息等,本申请对此不作特殊限制。
S107,根据对象点云,构建待标注对象的三维对象模型。
在本申请实施例中,由于对象点云中可能存在噪声点云,这样会导致重构的三维对象模型曲面不光滑,故本申请可以首先对对象点云进行去噪处理。然后可以针对去噪处理后的对象点云构建待标注对象的三维对象模型。
可选地,本申请可以将对象点云进行连接构成平面,并由平面构建待标注对象的三维对象模型。
采用上述所述的方法,本申请可以采用逆向工程预先获取到各个对象的三维对象模型,这样可以在构建的三维场景点云中直接设置三维对象模型,以便对待标注对象进行信息标注。
为了便于理解,图3示出了一种视频标注方法的示意图。如图3所示,主要包括两部分:第一部分为三维对象模型的构建;第二部分为三维场景点云的构建和信息标注。
其中,三维对象模型的构建可以包括:获取待标注对象的多帧图像,接着从多帧图像中获取待标注对象的对象点云,然后根据对象点云构建待标注对象的三维对象模型。
三维场景点云的构建和信息标注可以包括:获取RGB-D图像采集设备在 采集各个视频帧时的目标设备姿态参数和三维场景点云,接着通过对三维场景点云中的待标注对象所在的位置处设置对应的三维对象模型,以对待标注对象进行单次信息标注,然后根据目标设备姿态参数和单次标注的第一对象信息,对视频帧序列中的各个视频帧标注第二对象信息,第二对象信息包括待标注对象的对象类别信息(即图3中的classes)、边界框信息(即图3中的2D Boxes)、掩膜边界信息(即图3中的2D Masks)以及第二对象姿态信息(即图3中的6D Poses)。
在本申请的一些实施例中,考虑到基于深度学习的视觉引导具备较好的性能,故应用较为广泛。视觉引导主要包括目标检测和目标姿态估计,目标检测可以使用RCNN(region-based convolution neural networks,基于区域的卷积神经网络)算法、Fast RCNN算法、Faster R-CNN算法,SSD(single shot multibox detector,单阶段的多框预测)算法,YOLO(you only look once,你只看一眼)算法等,目标姿态估计可以使用点云模板配准方法或者ICP(iterative closest point,迭代就近点)配准方法等。但是,现有的目标姿态估计对噪声数据敏感,难以处理遮挡残缺等问题。
基于上述问题,本申请在图1或者图2的基础上,进一步地进行视觉引导。请参考图4,图4示出了一种视频标注方法的流程示意图。如图4所示,在S104之后,还可以包括以下步骤:
S108,获取待处理视频帧,待处理视频帧包括待处理RGB图像和待处理深度图像。
在本申请实施例中,待处理RGB图像和待处理深度图像可以通过RGB-D图像采集设备采集得到。
S109,获取待处理RGB图像中目标对象的目标对象类别信息、目标边界框信息以及目标掩膜边界信息。
在本申请的一些实施例中,可以将待处理RGB图像输入至预先训练得到的目标检测模型中,得到待处理RGB图像中目标对象的目标对象类别信息、目标边界框信息以及目标解码系数;预先训练得到的目标检测模型为根据标注有第二对象信息的视频帧训练得到的。紧接着,可以根据目标边界框信息和目标解码系数,对待处理RGB图像进行实例分割,得到待处理RGB图像中目标对象的目标掩膜边界信息。
图5示出了一种目标检测和目标姿态估计过程的示意图。如图5所示,目标检测过程中涉及的目标检测模型可以包括:主干网络层(即图5中的backbone,例如可以为残差网络层ResNet50等)、与主干网络层连接的FPN(feature pyramid networks,特征金字塔网络层)、与FPN连接的多个RCNN网络层、与多个RCNN网络层连接的NMS(non maximum suppression,非极大值抑制)网络层。其中,图5是以FPN输出三个尺度的特征图像,且目标检测模型包括三个RCNN网络层为例进行说明的,且FPN的输出通道与RCNN网络层为一一对应关系。
进一步地,以图5中的目标检测模型为例进行说明。本申请需要将待处理RGB图像输入至主干网络层,得到第一特征图像;接着将第一特征图像输入至FPN中,可以输出多个尺度的第二特征图像;然后将第二特征图像输入至对应的RCNN网络层,输出目标RGB图像中目标对象的初始对象信息,初始对象信息可以包括初始类别信息、初始边界框信息以及初始解码系数。紧接着通过NMS网络层对初始类别信息、初始边界框信息以及初始解码系数进行筛选,得到目标RGB图像中目标对象的目标对象类别信息、目标边界框信息以及目标解码系数。
示例性的,图5中是以目标对象包括object1、object2、…、objectn为例进行说明的,该目标对象依次对应的初始对象信息为object1info、object2info、…、objectn info。图5中进一步地将objectn info进行详细解释,即包括objectn的目标对象类别信息(即图5中的class)、目标边界框信息(即图5中的2D box)以及目标解码系数(即图5中的coefficients),其中目标解码系数可以为32×1的向量。上述示例只是举例说明,本申请对此不作特殊限制。
如图5所示,目标检测模型还可以包括第一卷积神经网络层。这样,将第一特征图像输入至第一卷积神经网络层中,输出预设尺度的第三特征图像。考虑到通过第一卷积神经网络层可以对第一特征图像进行编码,故本申请可以在根据目标边界框信息从第三特征图像中获取目标对象的第一特征块之后,将目标对象的第一特征块与目标对象的目标解码系数进行矩阵乘法,得到目标对象的热力图像,其中,热力图像为单通道的图像,由此可见,通过目标解码系数可以达到获取热力图像的效果;紧接着根据预设阈值对热力图像进行二值化处理,得到目标RGB图像中目标对象的目标掩膜边界信息。
可以理解的是,若第三特征图像的预设尺度与目标RGB图像的尺度相同,则可以直接在第三特征图像中获取目标边界框信息指示的像素点作为第一特征块;若第三特征图像的预设尺度与目标RGB图像的尺度不相同,则可以将第三特征图像转换为目标RGB图像的尺度,然后可以在转换后的第三特征图像中获取目标边界框信息指示的像素点作为第一特征块,或者,获取第三特征图像与目标RGB图像之间的尺度缩放比例,根据尺度缩放比例将目标边界框信息进行缩放处理,并在第三特征图像中获取缩放处理后的目标边界框信息指示的像素点作为第一特征块,上述过程只是举例说明本申请对此不作特殊限制。
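上述由第一特征块与目标解码系数做矩阵乘法得到热力图像、再按预设阈值二值化得到掩膜的过程,可以用如下示意代码表达(是否先经sigmoid压缩属于此处的假设):

```python
import numpy as np

def assemble_mask(feature_block, coefficients, threshold=0.5):
    """feature_block: h×w×32 的第一特征块;coefficients: 32×1 的目标解码系数。
    两者做矩阵乘法得到单通道热力图像,再按预设阈值二值化得到目标掩膜。"""
    h, w, c = feature_block.shape
    heat = feature_block.reshape(-1, c) @ coefficients  # (h·w)×1
    heat = 1.0 / (1.0 + np.exp(-heat))                  # 压缩到 (0, 1),便于取阈值(属假设)
    return (heat.reshape(h, w) > threshold).astype(np.uint8)
```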
S110,根据目标掩膜边界信息和待处理深度图像,获取目标对象的目标点云图像。
在本申请的可选实施例中,可以通过目标掩膜边界信息从待处理深度图像中提取到包括目标对象的像素点;接着对包括目标对象的像素点进行坐标转换得到目标点云信息,然后根据目标点云信息构建目标点云图像。
S111,将目标点云图像和目标对象图像输入至预先训练得到的目标姿态估计模型,得到目标对象的目标对象姿态信息以及目标对象姿态信息对应的目标置信度;目标对象图像为根据目标边界框信息从待处理RGB图像中裁剪到的关于目标对象的图像;预先训练得到的目标姿态估计模型为根据标注有第二对象信息的视频帧训练得到的。
在本申请的可选实施例中,预先训练得到的目标姿态估计模型可以包括特征提取网络层。进一步地,特征提取网络层可以包括局部特征提取网络层、全局特征提取网络层和特征聚合网络层。这样,本申请可以将目标点云图像和目标对象图像输入至局部特征提取网络层,得到目标对象的局部特征图像;将目标对象的局部特征和局部特征对应的坐标信息输入至全局特征提取网络层,得到目标对象的全局特征图像;将局部特征图像和全局特征图像输入至特征聚合网络层,得到聚合特征图像,并通过聚合特征图像,获取目标对象的目标对象姿态信息以及目标对象姿态信息对应的目标置信度。
如图5所示,目标姿态估计模型中还可以包括三个第二卷积神经网络层,其中第一个卷积神经网络层用于获取目标对象姿态信息中的目标平移矩阵,第二个卷积神经网络层用于获取目标对象姿态信息中的目标旋转矩阵,第三个卷积神经网络层用于获取目标对象姿态信息的置信度。这样,可以将聚合特征图 像输入至第一个和第二个卷积神经网络层,得到目标对象的目标对象姿态信息。可选地,该三个卷积神经网络层可以为1x1卷积核。
可以理解的是,由于本申请在目标检测模型中使用实例分割算法,且实例分割算法能提供待标注对象的精确轮廓,这样在目标姿态估计中可以减少背景噪声。另外,目标姿态估计过程中同时结合了局部特征和全局特征,这样,即使部分局部特征存在遮挡,还可以通过未遮挡的另一部分局部特征进行姿态估计,在一定程度上能够避免对象遮挡的问题。
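局部特征与全局特征聚合后经三个1×1卷积分别回归平移、旋转与置信度的结构,可以用如下PyTorch示意代码表达(特征提取主干从略,旋转以四元数表示属于此处的假设,并非本申请的完整模型):

```python
import torch
import torch.nn as nn

class PoseHead(nn.Module):
    """聚合局部特征与全局特征后,用三个 1×1 卷积分别回归平移、旋转与置信度的示意结构。"""
    def __init__(self, local_dim=128):
        super().__init__()
        self.global_pool = nn.AdaptiveMaxPool1d(1)       # 由逐点局部特征池化出全局特征
        fused_dim = local_dim * 2                        # 局部特征与全局特征拼接
        self.trans_head = nn.Conv1d(fused_dim, 3, 1)     # 目标平移
        self.rot_head = nn.Conv1d(fused_dim, 4, 1)       # 目标旋转(此处假设用四元数表示)
        self.conf_head = nn.Conv1d(fused_dim, 1, 1)      # 姿态置信度

    def forward(self, local_feat):
        # local_feat: B×C×N 的逐点局部特征(由点云图像与目标对象图像提取,主干从略)
        g = self.global_pool(local_feat)                 # B×C×1 的全局特征
        g = g.expand(-1, -1, local_feat.shape[2])        # 广播到每个点,便于拼接
        fused = torch.cat([local_feat, g], dim=1)        # 聚合特征
        return (self.trans_head(fused),
                self.rot_head(fused),
                torch.sigmoid(self.conf_head(fused)))
```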
S112,根据目标对象姿态信息对应的目标置信度,从目标对象的目标对象姿态信息中确定目标对象的最终对象姿态信息,并通过最终对象姿态信息和目标对象类别信息进行视觉引导。
可以理解的是,由于本申请在视觉引导过程中可以根据对象的姿态和机械臂的姿态合理进行路径规划,故可以控制机械臂自动抓取任意位置及姿态的对象并实现分拣。
为了进一步地说明通过本申请模型训练得到的目标识别模型和目标姿态估计模型的性能,如表1所示,对不同的目标识别模型进行了比较:
表1目标识别模型的性能对比
(表1的具体数据以图像形式给出,此处从略。)
如表2所示,对不同的目标姿态估计模型进行了比较:
表2目标姿态估计模型的性能对比
(表2的具体数据以图像形式给出,此处从略。)
由表1和表2可知,通过本申请所述的目标识别模型,可以在提高检测平均精度的同时缩短运行时间;同样地,通过本申请所述的目标姿态估计模型,可以在提高姿态平均精度的同时缩短运行时间。
需要说明的是,本申请中用于进行视频标注的设备可以具备以下硬件:Intel(R)Xeon(R)2.4GHz CPU,NVIDIA GTX 1080 Ti GPU。这样本申请可以通过Intel(R)Xeon(R)2.4GHz CPU获取视频帧序列,并通过NVIDIA GTX 1080 Ti GPU对视频帧序列进行信息标注,以及进行目标检测和目标姿态估计等。
采用上述所述的方法,通过预先训练得到的目标检测模型和目标姿态估计模型,可以快速地获取到对象的姿态,便于进行视觉引导。
在本申请的可选实施例中,本申请可以采用图1或图2所示实施例中的各个视频帧对目标检测模型和目标姿态估计模型进行模型训练。
请参考图6,图6示出了一种视频标注方法的流程示意图。其中,视频帧序列中的各个视频帧可以均包括RGB图像和深度图像,如图6所示,在S104之后,还可以包括以下步骤:
S113,将RGB图像输入至待训练的目标检测模型,得到RGB图像中待标注对象的待匹配对象类别信息、待匹配边界框信息以及解码系数。
S114,根据待匹配边界框信息和解码系数,对RGB图像进行实例分割,得到RGB图像中待标注对象的待匹配掩膜边界信息。
S115,根据待匹配掩膜边界信息和深度图像,获取待标注对象的点云图像。
S116,将点云图像和对象图像输入至待训练的目标姿态估计模型,得到待标注对象的待匹配对象姿态信息;对象图像为根据待匹配边界框信息从RGB图像中裁剪到的关于待标注对象的图像。
S117,根据待匹配对象类别信息、待匹配边界框信息、待匹配掩膜边界信息以及待匹配对象姿态信息,对目标识别模型和目标姿态估计模型进行模型训练。
在本申请实施例中,由于待标注对象预先标注有对象类别信息、边界框信息、掩膜边界信息以及第二对象姿态信息。故本步骤可以通过对象类别信息、边界框信息、掩膜边界信息、第二对象姿态信息、待匹配对象类别信息、待匹配边界框信息、待匹配掩膜边界信息以及待匹配对象姿态信息获取损失函数,并通过损失函数对待训练的目标识别模型和目标姿态估计模型进行模型训练。
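将类别、边界框、掩膜与姿态四部分误差组合为训练损失的做法,可以用如下示意代码表达(各项损失的具体形式与权重均为此处的假设):

```python
import torch
import torch.nn.functional as F

def total_loss(pred, target, w_pose=1.0):
    """pred、target 为包含 'cls'、'box'、'mask'、'pose' 张量的字典,
    按正文思路把四部分误差加权求和作为训练损失(各项形式与权重均为假设)。"""
    loss_cls = F.cross_entropy(pred['cls'], target['cls'])
    loss_box = F.smooth_l1_loss(pred['box'], target['box'])
    loss_mask = F.binary_cross_entropy_with_logits(pred['mask'], target['mask'])
    loss_pose = F.mse_loss(pred['pose'], target['pose'])
    return loss_cls + loss_box + loss_mask + w_pose * loss_pose
```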
图6实施例中的具体内容可以参考图4实施例中的描述内容,此处不再赘述。
采用上述所述的方法,考虑到进行模型训练的过程中需要标注好的视频数据,且视频数据越多,模型的训练结果越准确。这样,本申请通过使用半自动标注的视频帧可以快速进行模型训练,从而提高了模型训练的效率,以及一定程度上保证了模型的准确率。
应理解,上述实施例中各步骤的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。
还应理解,本申请中不同实施例中的步骤可以进行组合,本申请对实施例的实施方式不作任何限定。
基于上述实施例所提供的视频标注方法,本发明实施例进一步给出实现上述方法实施例的装置实施例。
请参见图7,图7是本申请实施例提供的一种视频标注装置的结构示意图。包括的各模块用于执行图1至图6对应的实施例中的各步骤。具体请参阅图1至图6对应的实施例中的相关描述。为了便于说明,仅示出了与本实施例相关的部分。参见图7,视频标注装置7包括:
视频采集模块71,用于通过RGB-D图像采集设备采集关于工作场景的视频帧序列;所述视频帧序列中的各个视频帧包括待标注对象;
视频处理模块72,用于获取所述RGB-D图像采集设备采集所述各个视频帧时的目标设备姿态参数,以及根据所述目标设备姿态参数构建所述工作场景的三维场景点云;
通过对所述三维场景点云中待标注对象设置三维对象模型,得到所述三维场景点云中待标注对象的第一对象信息;以及,
根据所述第一对象信息,对所述各个视频帧包括的待标注对象标注第二对象信息。
可选地,视频处理模块72,进一步用于采用同时定位与建图SLAM算法,以所述RGB-D图像采集设备采集第一视频帧时的初始设备姿态参数为参考坐标系,获取所述RGB-D图像采集设备采集所述各个视频帧时的目标设备姿态参数;所述第一视频帧为所述视频帧序列中的一张视频帧;
根据所述各个视频帧对应的目标设备姿态参数,将每张视频帧中的像素点转换至参考坐标系下,得到参考坐标系下的视频帧;
对所述参考坐标系下的视频帧进行数据合并处理,得到合并处理后的视频帧;
对所述合并处理后的视频帧进行去噪平滑处理,得到所述工作场景的三维场景点云。
可选地,视频处理模块72,进一步用于将所述每张视频帧中的像素点转换至相机坐标系下,对应得到相机坐标系下的视频帧;
根据所述各个视频帧对应的目标设备姿态参数,将所述相机坐标系下的视频帧转换至参考坐标系下,得到参考坐标系下的视频帧。
可选地,所述第一对象信息包括:所述三维场景点云中待标注对象的对象类别信息、当前标识信息以及第一对象姿态信息;
视频处理模块72,进一步用于获取所述三维场景点云中待标注对象对应的三维对象模型,所述三维对象模型设置有对应的对象类别信息;
按照所述三维场景点云中待标注对象的标注顺序,设置所述三维场景点云中待标注对象的当前标识信息,以及将所述三维对象模型的对象类别信息作为所述三维场景点云中待标注对象的对象类别信息;
通过将所述三维对象模型转换至所述三维场景点云所在的参考坐标系下,得到转换后的三维对象模型;
通过在所述三维场景点云中的待标注对象所在的位置处设置转换后的三维对象模型,确定所述三维场景点云中待标注对象的第一对象姿态信息。
可选地,视频处理模块72,进一步用于对所述三维对象模型设置初始姿态参数;
根据所述初始姿态参数将所述三维对象模型转换至所述三维场景点云所在的参考坐标系下,得到转换后的三维对象模型;以及,
在所述转换后的三维对象模型与所述三维场景点云中的待标注对象之间贴合的情况下,确定所述初始姿态参数为所述三维场景点云中待标注对象的第一对象姿态信息;
在所述转换后的三维对象模型与所述三维场景点云中的待标注对象之间不贴合的情况下,调整所述初始姿态参数,根据调整后的初始姿态参数重新获取转换后的三维对象模型,直至重新获取到的转换后的三维对象模型与所述三维 场景点云中的待标注对象之间贴合,将调整后的初始姿态参数作为第一对象姿态信息。
可选地,所述第二对象信息包括:所述视频帧中待标注对象对应的对象类别信息、当前标识信息、第二对象姿态信息、掩膜边界信息以及边界框信息;
视频处理模块72,进一步用于获取所述视频帧序列中第i张视频帧对应的目标设备姿态参数,以及获取所述三维场景点云中第j个待标注对象的对象类别信息、当前标识信息以及第一对象姿态信息,i、j均为正整数;
将所述三维场景点云中第j个待标注对象的对象类别信息、当前标识信息,分别作为所述第i张视频帧中第j个待标注对象的对象类别信息和当前标识信息;
根据所述第i张视频帧对应的目标设备姿态参数以及所述第j个待标注对象的第一对象姿态信息,获取所述第j个待标注对象相对所述第i张视频帧的第二对象姿态信息;
将所述第j个待标注对象的对象点云映射至所述第i张视频帧的图像坐标系下,得到图像坐标系下的第j个待标注对象;
采用凸包算法计算所述图像坐标系下的第j个待标注对象在所述第i张视频帧中的掩膜边界信息,并根据所述掩膜边界信息获取所述图像坐标系下的第j个待标注对象在所述第i张视频帧中的边界框信息。
可选地,视频采集模块71,还用于通过定焦相机对所述待标注对象进行多个方位的拍摄,得到待标注对象的多帧图像;
视频处理模块72,还用于根据所述多帧图像,获取所述待标注对象的对象点云;以及,
根据所述对象点云,构建所述待标注对象的三维对象模型。
图8是本申请实施例提供一种视频标注设备的结构示意图。如图8所示,该实施例的视频标注设备8包括:处理器80、存储器81以及存储在所述存储器81中并可在所述处理器80上运行的计算机程序82,例如视频标注程序。处理器80执行所述计算机程序82时实现上述各个视频标注方法实施例中的步骤,例如图1所示的S101-S104。或者,所述处理器80执行所述计算机程序82时实现上述各装置实施例中各模块/单元的功能,例如图7所示模块71、72的功能。
示例性的,所述计算机程序82可以被分割成一个或多个模块/单元,所述一个或者多个模块/单元被存储在所述存储器81中,并由处理器80执行,以完成本申请。所述一个或多个模块/单元可以是能够完成特定功能的一系列计算机程序指令段,该指令段用于描述所述计算机程序82在所述视频标注设备8中的执行过程。例如,所述计算机程序82可以被分割成获取模块、处理模块,各模块具体功能请参阅图1至图6对应地实施例中地相关描述,此处不赘述。
所述视频标注设备可包括,但不仅限于,处理器80、存储器81。本领域技术人员可以理解,图8仅仅是视频标注设备8的示例,并不构成对视频标注设备8的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件,例如所述视频标注设备还可以包括输入输出设备、网络接入设备、总线等。
所称处理器80可以是中央处理单元(Central Processing Unit,CPU),还可以是其他通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现成可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。
所述存储器81可以是所述视频标注设备8的内部存储单元,例如视频标注设备8的硬盘或内存。所述存储器81也可以是所述视频标注设备8的外部存储设备,例如所述视频标注设备8上配备的插接式硬盘,智能存储卡(Smart Media Card,SMC),安全数字(Secure Digital,SD)卡,闪存卡(Flash Card)等。进一步地,所述存储器81还可以既包括所述视频标注设备8的内部存储单元也包括外部存储设备。所述存储器81用于存储所述计算机程序以及所述视频标注设备所需的其他程序和数据。所述存储器81还可以用于暂时地存储已经输出或者将要输出的数据。
本申请实施例还提供了一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,所述计算机程序被处理器执行时可实现上述视频标注方法。
本申请实施例提供了一种计算机程序产品,当计算机程序产品在视频标注设备上运行时,使得视频标注设备执行时可实现上述视频标注方法。
所属领域的技术人员可以清楚地了解到,为了描述的方便和简洁,仅以上述各功能单元、模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能单元、模块完成,即将所述装置的内部结构划分成不同的功能单元或模块,以完成以上描述的全部或者部分功能。实施例中的各功能单元、模块可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中,上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。另外,各功能单元、模块的具体名称也只是为了便于相互区分,并不用于限制本申请的保护范围。上述***中单元、模块的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述或记载的部分,可以参见其它实施例的相关描述。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
以上所述实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围,均应包含在本申请的保护范围之内。

Claims (10)

  1. 一种视频标注方法,其特征在于,包括:
    通过RGB-D图像采集设备采集关于工作场景的视频帧序列;所述视频帧序列中的各个视频帧包括待标注对象;
    获取所述RGB-D图像采集设备采集所述各个视频帧时的目标设备姿态参数,以及根据所述目标设备姿态参数构建所述工作场景的三维场景点云;
    通过对所述三维场景点云中待标注对象所在的位置处设置三维对象模型,得到所述三维场景点云中待标注对象的第一对象信息;
    根据所述第一对象信息和所述目标设备姿态参数,对所述各个视频帧包括的待标注对象标注第二对象信息。
  2. 根据权利要求1所述的方法,其特征在于,所述获取所述RGB-D图像采集设备采集所述各个视频帧时的目标设备姿态参数,以及根据所述目标设备姿态参数构建所述工作场景的三维场景点云,包括:
    采用同时定位与建图SLAM算法,以所述RGB-D图像采集设备采集第一视频帧时的初始设备姿态参数为参考坐标系,获取所述RGB-D图像采集设备采集所述各个视频帧时的目标设备姿态参数;所述第一视频帧为所述视频帧序列中的一张视频帧;
    根据所述各个视频帧对应的目标设备姿态参数,将每张视频帧中的像素点转换至参考坐标系下,得到参考坐标系下的视频帧;
    对所述参考坐标系下的视频帧进行数据合并处理,得到合并处理后的视频帧;
    对所述合并处理后的视频帧进行去噪平滑处理,得到所述工作场景的三维场景点云。
  3. 根据权利要求2所述的方法,其特征在于,所述根据所述各个视频帧对应的目标设备姿态参数,将每张视频帧中的像素点转换至参考坐标系下,得到参考坐标系下的视频帧,包括:
    将所述每张视频帧中的像素点转换至相机坐标系下,对应得到相机坐标系下的视频帧;
    根据所述各个视频帧对应的目标设备姿态参数,将所述相机坐标系下的视频帧转换至参考坐标系下,得到参考坐标系下的视频帧。
  4. 根据权利要求1所述的方法,其特征在于,所述第一对象信息包括:所述三维场景点云中待标注对象的对象类别信息、当前标识信息以及第一对象姿态信息;
    所述通过对所述三维场景点云中待标注对象所在的位置处设置三维对象模型,得到所述三维场景点云中待标注对象的第一对象信息,包括:
    获取所述三维场景点云中待标注对象对应的三维对象模型,所述三维对象模型设置有对应的对象类别信息;
    按照所述三维场景点云中待标注对象的标注顺序,设置所述三维场景点云中待标注对象的当前标识信息,以及将所述三维对象模型的对象类别信息作为所述三维场景点云中待标注对象的对象类别信息;
    通过将所述三维对象模型转换至所述三维场景点云所在的参考坐标系下,得到转换后的三维对象模型;
    通过在所述三维场景点云中的待标注对象所在的位置处设置转换后的三维对象模型,确定所述三维场景点云中待标注对象的第一对象姿态信息。
  5. 根据权利要求4所述的方法,其特征在于,所述通过将所述三维对象模型转换至所述三维场景点云所在的参考坐标系下,得到转换后的三维对象模型,包括:
    对所述三维对象模型设置初始姿态参数;
    根据所述初始姿态参数将所述三维对象模型转换至所述三维场景点云所在的参考坐标系下,得到转换后的三维对象模型;
    对应地,所述通过在所述三维场景点云中的待标注对象所在的位置处设置转换后的三维对象模型,确定所述三维场景点云中待标注对象的第一对象姿态信息,包括:
    在所述转换后的三维对象模型与所述三维场景点云中的待标注对象之间贴合的情况下,确定所述初始姿态参数为所述三维场景点云中待标注对象的第一对象姿态信息;
    在所述转换后的三维对象模型与所述三维场景点云中的待标注对象之间不贴合的情况下,调整所述初始姿态参数,根据调整后的初始姿态参数重新获取转换后的三维对象模型,直至重新获取到的转换后的三维对象模型与所述三维场景点云中的待标注对象之间贴合,将调整后的初始姿态参数作为第一对象姿态信息。
  6. 根据权利要求4所述的方法,其特征在于,所述第二对象信息包括:所述视频帧中待标注对象对应的对象类别信息、当前标识信息、第二对象姿态信息、掩膜边界信息以及边界框信息;
    所述根据所述第一对象信息和所述目标设备姿态参数,对所述各个视频帧包括的待标注对象标注第二对象信息,包括:
    获取所述视频帧序列中第i张视频帧对应的目标设备姿态参数,以及获取所述三维场景点云中第j个待标注对象的对象类别信息、当前标识信息以及第一对象姿态信息,i、j均为正整数;
    将所述三维场景点云中第j个待标注对象的对象类别信息、当前标识信息,分别作为所述第i张视频帧中第j个待标注对象的对象类别信息和当前标识信息;
    根据所述第i张视频帧对应的目标设备姿态参数以及所述第j个待标注对象的第一对象姿态信息,获取所述第j个待标注对象相对所述第i张视频帧的第二对象姿态信息;
    将所述第j个待标注对象的对象点云映射至所述第i张视频帧的图像坐标系下,得到图像坐标系下的第j个待标注对象;
    采用凸包算法计算所述图像坐标系下的第j个待标注对象在所述第i张视频帧中的掩膜边界信息,并根据所述掩膜边界信息获取所述图像坐标系下的第j个待标注对象在所述第i张视频帧中的边界框信息。
  7. 根据权利要求1至6任一项所述的方法,其特征在于,所述方法还包括:
    通过定焦相机对所述待标注对象进行多个方位的拍摄,得到待标注对象的多帧图像;
    根据所述多帧图像,获取所述待标注对象的对象点云;
    根据所述对象点云,构建所述待标注对象的三维对象模型。
  8. 一种视频标注装置,其特征在于,包括:
    视频采集模块,用于通过RGB-D图像采集设备采集关于工作场景的视频帧序列;所述视频帧序列中的各个视频帧包括待标注对象;
    视频处理模块,用于获取所述RGB-D图像采集设备采集所述各个视频帧时的目标设备姿态参数,以及根据所述目标设备姿态参数构建所述工作场景的三维场景点云;
    通过对所述三维场景点云中待标注对象设置三维对象模型,得到所述三维场景点云中待标注对象的第一对象信息;以及,
    根据所述第一对象信息,对所述各个视频帧包括的待标注对象标注第二对象信息。
  9. 一种视频标注设备,包括处理器、存储器以及存储在所述存储器中并可在所述处理器上运行的计算机程序,其特征在于,所述处理器执行所述计算机程序时实现如权利要求1至7任一项所述的方法。
  10. 一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,其特征在于,所述计算机程序被处理器执行时实现如权利要求1至7任一项所述的方法。
PCT/CN2021/137580 2021-02-10 2021-12-13 一种视频标注方法、装置、设备及计算机可读存储介质 WO2022170844A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110185443.2A CN112950667B (zh) 2021-02-10 2021-02-10 一种视频标注方法、装置、设备及计算机可读存储介质
CN202110185443.2 2021-02-10

Publications (1)

Publication Number Publication Date
WO2022170844A1 true WO2022170844A1 (zh) 2022-08-18

Family

ID=76245718

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/137580 WO2022170844A1 (zh) 2021-02-10 2021-12-13 一种视频标注方法、装置、设备及计算机可读存储介质

Country Status (2)

Country Link
CN (1) CN112950667B (zh)
WO (1) WO2022170844A1 (zh)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115582827A (zh) * 2022-10-20 2023-01-10 大连理工大学 一种基于2d和3d视觉定位的卸货机器人抓取方法
CN116152783A (zh) * 2023-04-18 2023-05-23 安徽蔚来智驾科技有限公司 目标元素标注数据的获取方法、计算机设备及存储介质
CN116153472A (zh) * 2023-02-24 2023-05-23 萱闱(北京)生物科技有限公司 图像多维可视化方法、装置、介质和计算设备
CN117372632A (zh) * 2023-12-08 2024-01-09 魔视智能科技(武汉)有限公司 二维图像的标注方法、装置、计算机设备及存储介质
CN117746250A (zh) * 2023-12-29 2024-03-22 重庆市地理信息和遥感应用中心(重庆市测绘产品质量检验测试中心) 一种融合实景三维与视频的烟火智能识别与精准定位方法
WO2024124670A1 (zh) * 2022-12-14 2024-06-20 珠海普罗米修斯视觉技术有限公司 视频播放方法、装置、计算机设备及计算机可读存储介质

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112950667B (zh) * 2021-02-10 2023-12-22 中国科学院深圳先进技术研究院 一种视频标注方法、装置、设备及计算机可读存储介质
CN113591568A (zh) * 2021-06-28 2021-11-02 北京百度网讯科技有限公司 目标检测方法、目标检测模型的训练方法及其装置
CN113808198B (zh) * 2021-11-17 2022-03-08 季华实验室 一种吸取面的标注方法、装置、电子设备和存储介质
CN117953142A (zh) * 2022-10-20 2024-04-30 华为技术有限公司 一种图像标注方法、装置、电子设备及存储介质
WO2024087067A1 (zh) * 2022-10-26 2024-05-02 北京小米移动软件有限公司 图像标注方法及装置、神经网络训练方法及装置
CN115661577B (zh) * 2022-11-01 2024-04-16 吉咖智能机器人有限公司 用于对象检测的方法、设备和计算机可读存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110047142A (zh) * 2019-03-19 2019-07-23 中国科学院深圳先进技术研究院 无人机三维地图构建方法、装置、计算机设备及存储介质
CN110221690A (zh) * 2019-05-13 2019-09-10 Oppo广东移动通信有限公司 基于ar场景的手势交互方法及装置、存储介质、通信终端
US20200226360A1 (en) * 2019-01-10 2020-07-16 9138-4529 Quebec Inc. System and method for automatically detecting and classifying an animal in an image
CN112950667A (zh) * 2021-02-10 2021-06-11 中国科学院深圳先进技术研究院 一种视频标注方法、装置、设备及计算机可读存储介质

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105678748B (zh) * 2015-12-30 2019-01-15 清华大学 三维监控***中基于三维重构的交互式标定方法和装置
CN110336973B (zh) * 2019-07-29 2021-04-13 联想(北京)有限公司 信息处理方法及其装置、电子设备和介质
CN110503074B (zh) * 2019-08-29 2022-04-15 腾讯科技(深圳)有限公司 视频帧的信息标注方法、装置、设备及存储介质
CN111274426B (zh) * 2020-01-19 2023-09-12 深圳市商汤科技有限公司 类别标注方法及装置、电子设备和存储介质

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200226360A1 (en) * 2019-01-10 2020-07-16 9138-4529 Quebec Inc. System and method for automatically detecting and classifying an animal in an image
CN110047142A (zh) * 2019-03-19 2019-07-23 中国科学院深圳先进技术研究院 无人机三维地图构建方法、装置、计算机设备及存储介质
CN110221690A (zh) * 2019-05-13 2019-09-10 Oppo广东移动通信有限公司 基于ar场景的手势交互方法及装置、存储介质、通信终端
CN112950667A (zh) * 2021-02-10 2021-06-11 中国科学院深圳先进技术研究院 一种视频标注方法、装置、设备及计算机可读存储介质

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115582827A (zh) * 2022-10-20 2023-01-10 大连理工大学 一种基于2d和3d视觉定位的卸货机器人抓取方法
WO2024124670A1 (zh) * 2022-12-14 2024-06-20 珠海普罗米修斯视觉技术有限公司 视频播放方法、装置、计算机设备及计算机可读存储介质
CN116153472A (zh) * 2023-02-24 2023-05-23 萱闱(北京)生物科技有限公司 图像多维可视化方法、装置、介质和计算设备
CN116153472B (zh) * 2023-02-24 2023-10-24 萱闱(北京)生物科技有限公司 图像多维可视化方法、装置、介质和计算设备
CN116152783A (zh) * 2023-04-18 2023-05-23 安徽蔚来智驾科技有限公司 目标元素标注数据的获取方法、计算机设备及存储介质
CN116152783B (zh) * 2023-04-18 2023-08-04 安徽蔚来智驾科技有限公司 目标元素标注数据的获取方法、计算机设备及存储介质
CN117372632A (zh) * 2023-12-08 2024-01-09 魔视智能科技(武汉)有限公司 二维图像的标注方法、装置、计算机设备及存储介质
CN117372632B (zh) * 2023-12-08 2024-04-19 魔视智能科技(武汉)有限公司 二维图像的标注方法、装置、计算机设备及存储介质
CN117746250A (zh) * 2023-12-29 2024-03-22 重庆市地理信息和遥感应用中心(重庆市测绘产品质量检验测试中心) 一种融合实景三维与视频的烟火智能识别与精准定位方法

Also Published As

Publication number Publication date
CN112950667B (zh) 2023-12-22
CN112950667A (zh) 2021-06-11

Similar Documents

Publication Publication Date Title
WO2022170844A1 (zh) 一种视频标注方法、装置、设备及计算机可读存储介质
US20200279121A1 (en) Method and system for determining at least one property related to at least part of a real environment
He et al. Sparse template-based 6-D pose estimation of metal parts using a monocular camera
JP5822322B2 (ja) ローカライズされ、セグメンテーションされた画像のネットワークキャプチャ及び3dディスプレイ
Azad et al. Stereo-based 6d object localization for grasping with humanoid robot systems
CN109886124B (zh) 一种基于线束描述子图像匹配的无纹理金属零件抓取方法
JP5538868B2 (ja) 画像処理装置、その画像処理方法及びプログラム
JP2004334819A (ja) ステレオキャリブレーション装置とそれを用いたステレオ画像監視装置
CN111507908B (zh) 图像矫正处理方法、装置、存储介质及计算机设备
CN113111844B (zh) 一种作业姿态评估方法、装置、本地终端及可读存储介质
WO2021136386A1 (zh) 数据处理方法、终端和服务器
CN109479082A (zh) 图象处理方法及装置
CN109272577B (zh) 一种基于Kinect的视觉SLAM方法
CN112861870B (zh) 指针式仪表图像矫正方法、***及存储介质
WO2024012333A1 (zh) 位姿估计方法及相关模型的训练方法、装置、电子设备、计算机可读介质和计算机程序产品
CN111191582A (zh) 三维目标检测方法、检测装置、终端设备及计算机可读存储介质
CN113011401A (zh) 人脸图像姿态估计和校正方法、***、介质及电子设备
WO2023284358A1 (zh) 相机标定方法、装置、电子设备及存储介质
CN111488766A (zh) 目标检测方法和装置
CN114742789A (zh) 一种基于面结构光的通用零件拾取方法、***及电子设备
JP5704909B2 (ja) 注目領域検出方法、注目領域検出装置、及びプログラム
WO2022247126A1 (zh) 视觉定位方法、装置、设备、介质及程序
CN111325828A (zh) 一种基于三目相机的三维人脸采集方法及装置
CN111161348A (zh) 一种基于单目相机的物***姿估计方法、装置及设备
CN117253022A (zh) 一种对象识别方法、装置及查验设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21925491

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21925491

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 15/01/2024)