WO2022170844A1 - Video annotation method, apparatus and device, and computer-readable storage medium - Google Patents

Video annotation method, apparatus and device, and computer-readable storage medium

Info

Publication number
WO2022170844A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
point cloud
video frame
marked
dimensional
Prior art date
Application number
PCT/CN2021/137580
Other languages
English (en)
Chinese (zh)
Inventor
莫柠锴
陈世峰
Original Assignee
中国科学院深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院深圳先进技术研究院 filed Critical 中国科学院深圳先进技术研究院
Publication of WO2022170844A1 publication Critical patent/WO2022170844A1/fr

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/207 Analysis of motion for motion estimation over a hierarchy of resolutions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/13 Edge detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10024 Color image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]

Definitions

  • the present application relates to the technical field of computer vision and image processing, and in particular, to a video annotation method, apparatus, device, and computer-readable storage medium.
  • Robotic arms have a wide range of applications in industry, retail and service fields, including the grabbing and distribution of goods, the sorting of components on the assembly line, and the depalletizing and palletizing in logistics.
  • A traditional robotic arm lacks perception of its environment, so its actions and behavior can only be determined by pre-programming in a static environment (such as the offline teaching widely used with industrial robots). A traditional robotic arm also often requires customized fixtures and mechanical structures that move or place the objects to be sorted along a specified trajectory, which makes it inflexible; each different scene requires a customized design, resulting in high cost.
  • the visual guidance algorithm mainly includes two aspects: target detection and target pose estimation.
  • Both target detection and target pose estimation require a large amount of labeled data for training.
  • Generally, a camera collects a large amount of video data, the video data is labeled manually, and the target detection model and the target pose estimation model are then trained with the labeled video data.
  • However, manual labeling suffers from problems such as low labeling efficiency and high labor cost.
  • the embodiments of the present application provide a video labeling method, apparatus, device, and computer-readable storage medium, which can improve the labeling efficiency of video data and reduce labor costs to a certain extent.
  • a video frame sequence of the working scene is collected by an RGB-D image capture device; each video frame in the video frame sequence includes the object to be marked;
  • the first object information of the object to be marked in the three-dimensional scene point cloud is obtained;
  • the object to be annotated included in each video frame is annotated with second object information.
  • the present application can construct a 3D scene point cloud for the work scene, perform a single manual annotation on the objects to be labeled in the 3D scene point cloud, and map the manually annotated information in the 3D scene point cloud to each video frame. It can be seen that the present application realizes semi-automatic video labeling, reduces labeling time, improves labeling efficiency, and avoids the problem of inefficient manual labeling of data.
  • the present application provides a video annotation device, comprising:
  • a video capture module used for collecting a video frame sequence about the working scene through an RGB-D image capture device; each video frame in the video frame sequence includes an object to be marked;
  • a video processing module configured to acquire the target device posture parameters when the RGB-D image acquisition device collects the respective video frames, and construct a three-dimensional scene point cloud of the work scene according to the target device posture parameters;
  • the first object information of the object to be marked in the point cloud of the three-dimensional scene is obtained;
  • the to-be-labeled objects included in the respective video frames are marked with second object information.
  • The present application provides a computer program product that, when the computer program product runs on a video annotation device, enables the video annotation device to perform the steps of the method described in the first aspect or any optional manner of the first aspect.
  • FIG. 1 is a schematic flowchart of a video labeling method provided by an embodiment of the present application.
  • FIG. 2 is a schematic flowchart of another video labeling method provided by an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of another video labeling method provided by an embodiment of the present application.
  • FIG. 4 is a schematic flowchart of another video labeling method provided by an embodiment of the present application.
  • FIG. 5 is a schematic flowchart of a target detection and target pose estimation provided by an embodiment of the present application.
  • FIG. 6 is a schematic flowchart of another video labeling method provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a video annotation device provided by an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a video annotation device provided by an embodiment of the present application.
  • References to "one embodiment" or "some embodiments" and the like described in the specification of this application mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the application.
  • Appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in various places in this specification do not necessarily all refer to the same embodiment, but mean "one or more but not all embodiments" unless specifically emphasized otherwise.
  • The terms "including", "comprising", "having" and their variants mean "including but not limited to" unless specifically emphasized otherwise.
  • the robotic arm can work 24 hours a day, and the operating cost is lower.
  • In an AR (augmented reality) scene based on a smart terminal, it is necessary to train the target pose estimation model with labeled video data in advance, so that pose estimation can be performed accurately when the user interacts with virtual items.
  • the present application can also be applied to other scenarios in which deep learning is performed on video data that needs to be labeled, and the present application does not make special restrictions on specific application scenarios.
  • the video tagging method provided by the present application is exemplarily described below through specific embodiments.
  • FIG. 1 is a schematic flowchart of a video annotation method provided by the present application. As shown in Figure 1, the video annotation method may include the following steps:
  • a video frame sequence related to a work scene is collected by an RGB-D image collection device; each video frame in the video frame sequence includes an object to be marked.
  • the object to be annotated involved in the present application and the three-dimensional object model involved in the subsequent process may be rigid objects.
  • A single video frame in the video frame sequence can be expressed as shown in Formula 1:
  • $I = \{((r,g,b)_1,(u,v)_1,d_1),\ldots,((r,g,b)_k,(u,v)_k,d_k)\}$ (Formula 1)
  • where $((r,g,b)_k,(u,v)_k,d_k)$ represents the information contained in the k-th pixel of the single video frame;
  • $(r,g,b)_k$ represents the color information of the k-th pixel, that is, the gray value of the red component r, the gray value of the green component g, and the gray value of the blue component b;
  • $(u,v)_k$ represents the coordinate information of the k-th pixel in the single video frame;
  • $d_k$ represents the depth value of the k-th pixel, and the index k runs up to the total number of pixels in the single video frame.
  • each video frame in the video frame sequence may include a three-channel RGB image and a single-channel depth image, or include a four-channel RGB-D image, which is not particularly limited in this application.
  • the execution subject in this application may be a video annotation device, and the video annotation device has data processing capabilities, such as a terminal device, a server, and the like.
  • the number of RGB-D image acquisition devices may be one or more. If multiple RGB-D image acquisition devices are installed in the work scene, the acquisition efficiency of the work scene can be improved to a certain extent.
  • the RGB-D image acquisition device can be installed on a robotic arm.
  • the working scene can be the scene where the robotic arm works.
  • the RGB-D image acquisition device may be a built-in RGB-D camera of the smart terminal.
  • Each video frame may include one or more objects to be labeled, and since each video frame is an image of the work scene, the same object to be labeled may exist in different video frames, which is not limited in this application.
  • Taking the initial device pose parameter when the RGB-D image acquisition device collects the first video frame as the reference coordinate system, a SLAM (simultaneous localization and mapping) algorithm can be used to obtain the target device pose parameters when the RGB-D image acquisition device collects each video frame; the first video frame is a video frame in the video frame sequence.
  • the first video frame may be the first video frame in the sequence of video frames.
  • the first video frame may also be any video frame in the video frame sequence, which is not particularly limited in this application.
  • The target device pose parameters corresponding to the respective video frames in the video frame sequence can be expressed as shown in Formula 2:
  • $P_{list} = \{P_1,\ldots,P_n\} = \mathrm{SLAM}(I_1,\ldots,I_n)$ (Formula 2)
  • where $P_{list}$ represents the list of target device pose parameters corresponding to the respective video frames in the video frame sequence; $P_n$ represents the target device pose parameter when the RGB-D image acquisition device collects the n-th video frame $I_n$ in the video frame sequence; and $\mathrm{SLAM}(\cdot)$ represents the simultaneous localization and mapping algorithm.
  • The target device pose parameter is composed of the device rotation matrix $R_1$ and the device translation matrix $T_1$, which can be expressed as $P = [R_1 \mid T_1]$, where
  • $R_1 = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix}, \quad T_1 = \begin{bmatrix} t_1 \\ t_2 \\ t_3 \end{bmatrix}$
  • $r_{11},r_{12},r_{13},r_{21},r_{22},r_{23},r_{31},r_{32},r_{33}$ are the elements of the device rotation matrix $R_1$, and $t_1,t_2,t_3$ are the elements of the device translation matrix $T_1$; all are calculated by the SLAM algorithm.
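  • For illustration only, the following minimal NumPy sketch shows one way such a pose (R, T) can be packed into a 4x4 homogeneous matrix and inverted; the function names and the homogeneous convention are assumptions of this sketch, not part of the patent.

```python
import numpy as np

def make_pose(R: np.ndarray, T: np.ndarray) -> np.ndarray:
    """Pack a 3x3 rotation R and a 3-vector translation T into a 4x4 homogeneous pose."""
    P = np.eye(4)
    P[:3, :3] = R
    P[:3, 3] = np.ravel(T)
    return P

def invert_pose(P: np.ndarray) -> np.ndarray:
    """Invert a rigid-body pose without a general matrix inverse."""
    R, t = P[:3, :3], P[:3, 3]
    P_inv = np.eye(4)
    P_inv[:3, :3] = R.T
    P_inv[:3, 3] = -R.T @ t
    return P_inv
```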
  • a 3D scene point cloud of the work scene can be constructed according to the target device posture parameters. Specifically, the following steps may be included:
  • the transformation of the reference coordinate system can be performed through the following two steps a1 and a2:
  • the pixels in each video frame can be converted to the camera coordinate system, and the corresponding video frame in the camera coordinate system can be obtained.
  • A single video frame in the camera coordinate system may be expressed as shown in Formula 3:
  • $G_{cam} = \{((r,g,b)_1,(x_{cam},y_{cam},z_{cam})_1),\ldots,((r,g,b)_k,(x_{cam},y_{cam},z_{cam})_k)\}$ (Formula 3)
  • where $G_{cam}$ represents the single video frame in the camera coordinate system; $(r,g,b)_k$ represents the color information of the k-th pixel, that is, the gray values of the red component r, the green component g, and the blue component b; and $(x_{cam},y_{cam},z_{cam})_k$ represents the coordinates of the k-th pixel in the camera coordinate system.
  • In the conversion, $c_x$ represents the image coordinate of the image principal point on the x-axis of the RGB-D image acquisition device, and $c_y$ represents the image coordinate of the image principal point on the y-axis of the RGB-D image acquisition device.
  • A single video frame in the reference coordinate system may be expressed as shown in Formula 4:
  • $G_{ref} = \{((r,g,b)_1,(x_{ref},y_{ref},z_{ref})_1),\ldots,((r,g,b)_k,(x_{ref},y_{ref},z_{ref})_k)\}$ (Formula 4)
  • where $P$ represents the target device pose parameter corresponding to the single video frame.
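  • As an illustration only (not taken from the patent), steps a1 and a2 can be sketched with the standard pinhole camera model; the intrinsics fx, fy, cx, cy and all variable names below are assumptions of this sketch.

```python
import numpy as np

def pixels_to_camera(u, v, d, fx, fy, cx, cy):
    """Step a1: back-project pixels (u, v) with depth d into camera coordinates.

    u, v, d are 1-D arrays of equal length; fx, fy, cx, cy are pinhole intrinsics.
    """
    z = np.asarray(d, dtype=float)
    x = (np.asarray(u) - cx) * z / fx
    y = (np.asarray(v) - cy) * z / fy
    return np.stack([x, y, z], axis=1)                     # shape (K, 3)

def camera_to_reference(points_cam, P):
    """Step a2: transform camera-frame points into the reference frame with 4x4 pose P."""
    homo = np.hstack([points_cam, np.ones((len(points_cam), 1))])
    return (P @ homo.T).T[:, :3]
```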
  • each video frame in the video frame sequence can be converted into the reference coordinate system through the steps described in (1) above.
  • The merged video frame can be expressed as shown in Formula 5:
  • $G_{raw} = \mathrm{Merge}(G^{ref}_1,\ldots,G^{ref}_n)$ (Formula 5)
  • where $G_{raw}$ represents the merged video frame, $\mathrm{Merge}(\cdot)$ represents the merging function, and $G^{ref}_n$ represents the n-th video frame in the reference coordinate system.
  • A preset processing function in the PCL (Point Cloud Library) may be used to perform denoising and smoothing on the merged video frames to obtain the 3D scene point cloud.
  • the preset processing function may be the MovingLeastSquares function.
  • In other embodiments, an SFM (structure from motion) algorithm may also be used to construct the 3D scene point cloud; the present application does not limit the method of constructing the 3D scene point cloud.
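  • Purely as an illustrative stand-in (the patent names PCL's MovingLeastSquares; Open3D is used below only because it has a compact Python API), merging and denoising the per-frame points might look like the following sketch, where every parameter value is an assumption.

```python
import numpy as np
import open3d as o3d

def build_scene_cloud(frames_ref, frame_colors):
    """Merge per-frame points (already in the reference frame) and denoise them."""
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(np.vstack(frames_ref))    # (sum K, 3) coordinates
    pcd.colors = o3d.utility.Vector3dVector(np.vstack(frame_colors))  # RGB values in [0, 1]
    # Drop points whose distance to their neighbors is statistically abnormal (denoising).
    pcd, _ = pcd.remove_statistical_outlier(nb_neighbors=20, std_ratio=2.0)
    # Light smoothing stand-in: voxel down-sampling averages the points inside each voxel.
    return pcd.voxel_down_sample(voxel_size=0.005)
```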
  • S103 Obtain first object information of the object to be marked in the point cloud of the three-dimensional scene by setting the three-dimensional object model at the position of the object to be marked in the point cloud of the three-dimensional scene.
  • Since each video frame in the video frame sequence includes the object to be labeled, the object to be labeled also exists in the 3D scene point cloud constructed from these video frames.
  • the present application can place the corresponding 3D object model at the object to be annotated in the 3D scene point cloud, so that the 3D object model fits the object to be annotated in the 3D scene point cloud to complete a single annotation.
  • placement of the 3D object model can be performed by the following steps:
  • The three-dimensional object model can be expressed as shown in Formula 6:
  • $OBJ = \{(id,class,(r,g,b)_1,(x_{obj},y_{obj},z_{obj})_1),\ldots,(id,class,(r,g,b)_s,(x_{obj},y_{obj},z_{obj})_s)\}$ (Formula 6)
  • where $OBJ$ represents the three-dimensional object model;
  • $id$ represents the initial identification information of the three-dimensional object model (such as a number or sequence code), which can be randomly set in advance;
  • $class$ represents the object category information of the three-dimensional object model;
  • $(r,g,b)_s$ represents the color information of the s-th point in the three-dimensional object model, that is, the gray values of the red component r, the green component g, and the blue component b;
  • $(x_{obj},y_{obj},z_{obj})_s$ represents the coordinate information of the s-th point in the three-dimensional object model, where the coordinates may be expressed in the coordinate system of the three-dimensional object model itself.
  • For example, if there are five objects to be labeled, the current identification information of the five objects may be set according to the order in which they are labeled in the 3D scene point cloud.
  • the present application can also change the initial identification information in the acquired three-dimensional object model to the current identification information corresponding to the object to be annotated.
  • the initial pose parameters include an initial translation matrix and an initial rotation matrix.
  • any point on the object to be marked corresponding to the three-dimensional object model may be selected, and the coordinate information of the point may be used to assign the initial translation matrix of the initial attitude parameter.
  • the initial rotation matrix can be set to the identity matrix.
  • The initial pose parameters can be expressed as shown in Formula 7:
  • $P_{obj} = [R_2 \mid T_2]$ (Formula 7)
  • where $P_{obj}$ represents the initial pose parameter of the three-dimensional object model;
  • $R_2$ represents the initial rotation matrix of the three-dimensional object model, determined by its initial rotation angles about the x-axis, the y-axis, and the z-axis;
  • $T_2 = (t_x, t_y, t_z)^T$ represents the initial translation matrix of the three-dimensional object model, where $t_x$, $t_y$, and $t_z$ represent the initial translation distances of the three-dimensional object model along the x-axis, the y-axis, and the z-axis, respectively.
  • The converted three-dimensional object model can be expressed as shown in Formula 8:
  • $OBJ_{new}(x_s,y_s,z_s) = P_{obj} \cdot OBJ(x_s,y_s,z_s)$ (Formula 8)
  • where $OBJ_{new}(x_s,y_s,z_s)$ represents the s-th point in the converted three-dimensional object model, $P_{obj}$ represents the initial pose parameter of the three-dimensional object model, and $OBJ(x_s,y_s,z_s)$ represents the s-th point in the original three-dimensional object model.
  • each point in the three-dimensional object model can be converted to the reference coordinate system where the point cloud of the three-dimensional scene is located, and the converted three-dimensional object model can be obtained.
  • If the converted three-dimensional object model fits the object to be annotated in the 3D scene point cloud, the initial pose parameter is determined to be the first object pose information of the object to be annotated in the 3D scene point cloud.
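  • A minimal sketch, for illustration only, of the placement just described: the initial translation is taken from a point picked on the object, the initial rotation is the identity, and the resulting pose is applied to every model point as in Formula 8 (all names and array shapes are assumptions).

```python
import numpy as np

def initial_pose_from_point(anchor_point):
    """Initial pose: identity rotation, translation taken from a point picked on the object."""
    P_obj = np.eye(4)
    P_obj[:3, 3] = anchor_point
    return P_obj

def place_model(model_points, P_obj):
    """Apply P_obj to every model point, yielding the converted model OBJ_new."""
    homo = np.hstack([model_points, np.ones((len(model_points), 1))])   # (S, 4)
    return (P_obj @ homo.T).T[:, :3]
```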
  • The fit between the converted 3D object model and the object to be annotated in the 3D scene point cloud can be understood as the degree of coincidence between the converted 3D object model and the object to be annotated in the 3D scene point cloud reaching its maximum.
  • the degree of coincidence can be manually identified to determine whether it is in a fit state.
  • If they are not fitted, the above-mentioned initial pose parameters are adjusted, and the converted 3D object model is re-acquired according to the adjusted initial pose parameters until the re-acquired converted 3D object model fits the object to be marked in the 3D scene point cloud; the adjusted initial pose parameters are then used as the first object pose information.
  • the first object pose information includes an adjusted rotation matrix and an adjusted translation matrix.
  • the first object information of the object to be marked in the point cloud of the three-dimensional scene can be obtained, and the object to be marked is marked with the first object information.
  • the first object information may include object category information of the object to be marked in the 3D scene point cloud, first object pose information and current identification information, and the object category information is the same as the object category information of the 3D object model.
  • the above-mentioned three-dimensional object model is a model obtained in advance through a reverse engineering three-dimensional modeling method.
  • The second object information may include object category information of the object to be marked in the video frame, second object pose information, current identification information, bounding box information, and mask boundary information.
  • the i-th video frame in the video frame sequence and the j-th object to be annotated in the point cloud of the three-dimensional scene are described as examples. Specifically, the following steps may be included:
  • the object category information and current identification information of the j-th object to be labeled in the 3D scene point cloud are taken as the object category information and current identification information of the j-th object to be labeled in the ith video frame, respectively.
  • According to the target device pose parameter corresponding to the i-th video frame and the first object pose information of the j-th object to be labeled, the second object pose information $(R^i_j, T^i_j)$ of the j-th object to be labeled relative to the i-th video frame is obtained as shown in Formula 9.
  • In Formula 9, $R^i_j$ represents the rotational pose information of the j-th object to be labeled relative to the i-th video frame, obtained from $R_i$, the device rotation matrix corresponding to the i-th video frame, and $R_j$, the adjusted rotation matrix corresponding to the j-th object to be marked; $T^i_j$ represents the translational pose information of the j-th object to be labeled relative to the i-th video frame, obtained from $T_i$, the device translation matrix corresponding to the i-th video frame, and $T_j$, the adjusted translation matrix corresponding to the j-th object to be labeled.
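  • The patent text does not spell Formula 9 out here, so the following sketch is an assumption on our part: one common formulation, valid if $(R_i, T_i)$ maps camera-i coordinates into the reference frame and $(R_j, T_j)$ places the model in the reference frame, brings the object pose back into the camera frame of video frame i.

```python
import numpy as np

def object_pose_in_frame(R_i, T_i, R_j, T_j):
    """Relative pose of object j w.r.t. video frame i (sign convention assumed above)."""
    R_ij = R_i.T @ R_j
    T_ij = R_i.T @ (np.asarray(T_j) - np.asarray(T_i))
    return R_ij, T_ij
```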
  • the object point cloud of the jth object to be labeled is mapped to the image coordinate system of the ith video frame, and the jth object to be labeled in the image coordinate system is obtained.
  • The j-th object to be labeled in the image coordinate system may be obtained as shown in Formula 10:
  • $s\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K\,[R^i_j \mid T^i_j]\begin{bmatrix} x_{obj} \\ y_{obj} \\ z_{obj} \\ 1 \end{bmatrix}$ (Formula 10)
  • where $K$ is the camera intrinsic parameter matrix of the RGB-D image acquisition device; $[R^i_j \mid T^i_j]$ represents the pose information of the j-th object to be labeled relative to the i-th video frame; $(x_{obj},y_{obj},z_{obj})$ represents the coordinate information of any point of the j-th object to be labeled contained in the 3D scene point cloud; $(u,v)$ represents the coordinate information obtained by mapping that point into the image coordinate system of the i-th video frame; and $s$ represents the scaling factor.
  • In this way, all the points of the j-th object to be labeled are mapped to coordinates in the image coordinate system of the i-th video frame.
  • the convex hull algorithm is used to calculate the mask boundary information corresponding to the j-th object to be labeled in the image coordinate system, and the bounding box information corresponding to the j-th object to be labeled in the image coordinate system is obtained according to the mask boundary information.
  • the convex hull algorithm may be the Graham convex hull algorithm or the like.
  • The mask boundary information M can be obtained through the Graham convex hull algorithm, and the bounding box information is then obtained as shown in Formula 11:
  • $[(u_{top},v_{top}),(u_{bottom},v_{bottom})] = \mathrm{Box}(M)$ (Formula 11)
  • where $\mathrm{Box}(M)$ represents the function that obtains the bounding box information from the mask boundary information $M$; $(u_{top},v_{top})$ represents the coordinate information of the upper-left mask boundary point in the mask boundary information; and $(u_{bottom},v_{bottom})$ represents the coordinate information of the lower-right mask boundary point in the mask boundary information.
  • The annotation record information of the j-th object to be marked in the i-th video frame can be expressed as $(class_j, id_j, (R^i_j,T^i_j), Box^i_j, M^i_j)$, where $class_j$ represents the object category information of the j-th object to be labeled; $id_j$ represents the current identification information of the j-th object to be labeled; $(R^i_j,T^i_j)$ represents the second object pose information of the j-th object to be labeled relative to the i-th video frame; $Box^i_j$ represents the bounding box information of the j-th object to be labeled in the i-th video frame; and $M^i_j$ represents the mask boundary information of the j-th object to be annotated in the i-th video frame.
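  • For illustration only, a compact sketch of the projection, convex-hull mask, and bounding-box steps above; SciPy's ConvexHull stands in for the Graham scan, and the intrinsic matrix K and all names are assumptions of this sketch.

```python
import numpy as np
from scipy.spatial import ConvexHull

def annotate_object_in_frame(points_obj, R_ij, T_ij, K):
    """Project one labeled object into video frame i and derive its mask boundary and box."""
    cam = R_ij @ points_obj.T + np.reshape(T_ij, (3, 1))   # object points in camera i
    uvw = K @ cam                                          # Formula 10 (up to the scale s)
    uv = (uvw[:2] / uvw[2]).T                              # (N, 2) pixel coordinates
    hull = ConvexHull(uv)                                  # convex hull of the projected points
    mask_boundary = uv[hull.vertices]                      # ordered boundary points (the mask M)
    u_top, v_top = mask_boundary.min(axis=0)               # upper-left corner
    u_bottom, v_bottom = mask_boundary.max(axis=0)         # lower-right corner
    return mask_boundary, (u_top, v_top, u_bottom, v_bottom)
```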
  • information annotation can be performed for each object to be annotated in each video frame.
  • In the present application, the objects to be labeled only need to be labeled manually once in the 3D scene point cloud; the objects to be labeled in each video frame are then labeled automatically according to the information from that single manual labeling, without manual labeling in every video frame, which reduces the work of manually labeling each object to be labeled.
  • each scene has 5 to 6 objects to be labelled, and each video has 20 to 30 valid video frames.
  • With the traditional manual labeling method, each video frame takes 7 minutes; using the video annotation method described in this application, each video frame takes 1.5 minutes. It can be seen that the video annotation method proposed in this application greatly reduces the time spent on annotation.
  • the present application can construct a 3D scene point cloud for the work scene, perform a single manual annotation on the objects to be labeled in the 3D scene point cloud, and map the manually annotated information in the 3D scene point cloud to each video frame. It can be seen that the present application realizes semi-automatic video labeling, reduces labeling time, improves labeling efficiency, and avoids the problem of inefficient manual labeling of data.
  • FIG. 2 is a schematic flowchart of a video annotation method provided based on the embodiment shown in FIG. 1 .
  • the video annotation method may further include the following steps:
  • the fixed-focus camera is used to photograph the object to be marked in multiple directions, so as to obtain multiple frames of images of the object to be marked.
  • the RGB-D image acquisition device can be used to photograph the object to be marked in multiple directions to obtain multiple frames of images of the object to be marked.
  • the fixed-focus camera in the present application can shoot the object to be annotated in all directions, and obtain multiple frames of images of the object to be annotated.
  • S105 may be executed before S101, or may be executed before S102, and this application does not limit the sequence of S105.
  • S106 Acquire an object point cloud of the object to be marked according to the multiple frames of images.
  • the present application can use the camera measurement technology to obtain the object point cloud of the object to be marked from the multi-frame images.
  • the object point cloud of the object to be annotated can be obtained directly using the photogrammetry software Meshroom.
  • the object point cloud can be three-dimensional position coordinate information.
  • the object point cloud may also include color information or intensity information, etc., which is not particularly limited in this application.
  • In the present application, since there may be noise points in the object point cloud, the surface of the reconstructed three-dimensional object model may not be smooth; therefore, the present application may first perform denoising on the object point cloud, and then construct the three-dimensional object model of the object to be annotated from the denoised object point cloud.
  • Specifically, the points of the object point cloud can be connected to form a surface, and the three-dimensional object model of the object to be marked is constructed from that surface.
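  • As a rough illustration only (the patent does not name a specific reconstruction method here), denoising the object point cloud and meshing it into a surface could be sketched with Open3D as a stand-in; every parameter below is an assumption.

```python
import open3d as o3d

def build_object_model(pcd: o3d.geometry.PointCloud) -> o3d.geometry.TriangleMesh:
    """Denoise an object point cloud and reconstruct a surface mesh from it."""
    pcd, _ = pcd.remove_statistical_outlier(nb_neighbors=20, std_ratio=2.0)
    pcd.estimate_normals(
        search_param=o3d.geometry.KDTreeSearchParamHybrid(radius=0.01, max_nn=30))
    mesh, _ = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(pcd, depth=9)
    return mesh
```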
  • the present application can obtain the 3D object model of each object in advance by reverse engineering, so that the 3D object model can be directly set in the constructed 3D scene point cloud, so as to perform information annotation on the object to be annotated.
  • FIG. 3 shows a schematic diagram of a video annotation method. As shown in Figure 3, it mainly includes two parts: the first part is the construction of the 3D object model; the second part is the construction and information annotation of the 3D scene point cloud.
  • the construction of the three-dimensional object model may include: acquiring multiple frames of images of the object to be marked, then acquiring the object point cloud of the object to be marked from the multi-frame images, and then constructing the three-dimensional object model of the object to be marked according to the object point cloud.
  • The construction and information annotation of the 3D scene point cloud may include: obtaining the target device pose parameters when the RGB-D image acquisition device collects each video frame and constructing the 3D scene point cloud; setting the corresponding three-dimensional object model at the position of the object to be annotated in the 3D scene point cloud so as to annotate the object with information a single time; and then, according to the target device pose parameters and the first object information from that single annotation, annotating second object information on each video frame in the video frame sequence.
  • The second object information includes object category information of the object to be labeled (i.e., classes in FIG. 3), bounding box information (i.e., 2D Boxes in FIG. 3), mask boundary information (i.e., 2D Masks in FIG. 3), and second object pose information (i.e., 6D Poses in FIG. 3).
  • Vision guidance mainly includes target detection and target pose estimation.
  • Target detection can use the RCNN (region-based convolutional neural network) algorithm, the Fast R-CNN algorithm, the Faster R-CNN algorithm, the SSD (single shot multibox detector) algorithm, the YOLO (you only look once) algorithm, and so on.
  • Target pose estimation can use the point cloud template registration method, the ICP (iterative closest point) registration method, and so on.
  • the existing target pose estimation is sensitive to noise data, and it is difficult to deal with problems such as incomplete occlusion.
  • FIG. 4 shows a schematic flowchart of a video annotation method. As shown in Figure 4, after S104, the following steps may also be included:
  • S108 Acquire a to-be-processed video frame, where the to-be-processed video frame includes a to-be-processed RGB image and a to-be-processed depth image.
  • the to-be-processed RGB image and the to-be-processed depth image may be acquired by an RGB-D image acquisition device.
  • S109 Obtain target object category information, target bounding box information, and target mask boundary information of the target object in the RGB image to be processed.
  • Figure 5 shows a schematic diagram of a target detection and target pose estimation process.
  • The target detection model involved in the target detection process may include: a backbone network layer (that is, the backbone in Figure 5, such as the residual network ResNet50), an FPN (feature pyramid network) layer connected to the backbone network layer, multiple RCNN network layers connected to the FPN, and an NMS (non-maximum suppression) network layer connected to the multiple RCNN network layers.
  • Figure 5 is an example of FPN outputting feature images of three scales
  • the target detection model includes three RCNN network layers, and the output channels of FPN and RCNN network layers are in a one-to-one correspondence.
  • the target detection model in FIG. 5 is used as an example for description.
  • The present application inputs the RGB image to be processed into the backbone network layer to obtain a first feature image; the first feature image is then input into the FPN, which outputs second feature images at multiple scales; each second feature image is then input into the corresponding RCNN network layer, which outputs the initial object information of the target object in the target RGB image, where the initial object information may include initial category information, initial bounding box information, and initial decoding coefficients. The initial category information, initial bounding box information, and initial decoding coefficients are then screened through the NMS network layer to obtain the target object category information, target bounding box information, and target decoding coefficients of the target object in the target RGB image.
  • For example, the target object includes object1, object2, ..., objectn.
  • The objectn info is further explained in detail in Figure 5: it includes the target object category information of objectn (i.e., the class in Figure 5), the target bounding box information (i.e., the 2D box in Figure 5), and the target decoding coefficients (i.e., the coefficients in Figure 5), where the target decoding coefficients can be a 32×1 vector.
  • the above examples are just examples, which are not specifically limited in the present application.
  • the target detection model may further include a first convolutional neural network layer.
  • the first feature image is input into the first convolutional neural network layer, and the third feature image with a preset scale is output.
  • After the first feature block of the target object is obtained from the third feature image according to the target bounding box information, the first feature block of the target object is matrix-multiplied with the target decoding coefficients of the target object to obtain the thermal image of the target object, where the thermal image is a single-channel image. It can be seen that the thermal image is obtained through the target decoding coefficients.
  • the thermal image is binarized with a preset threshold to obtain the target mask boundary information of the target object in the target RGB image.
  • If the preset scale of the third feature image is the same as the scale of the target RGB image, the pixels indicated by the target bounding box information can be taken directly from the third feature image as the first feature block.
  • If the preset scale of the third feature image is not the same as the scale of the target RGB image, the third feature image can be resized to the scale of the target RGB image, and the pixels indicated by the target bounding box information can then be taken from the resized third feature image as the first feature block; alternatively, the scaling ratio between the third feature image and the target RGB image can be obtained, the target bounding box information can be scaled according to that ratio, and the pixels indicated by the scaled bounding box information can be taken from the third feature image as the first feature block. The above process is only an example, and the present application does not place any special restrictions on it.
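  • For illustration only, the following sketch shows how a mask could be assembled from a cropped feature block and a 32-dimensional decoding coefficient vector as described above; the sigmoid, the threshold value, and all names are assumptions of this sketch.

```python
import numpy as np

def mask_from_coefficients(feature_block, coeffs, thresh=0.5):
    """Combine a cropped feature block with per-object decoding coefficients.

    feature_block: (H, W, 32) crop of the third feature image inside the target box.
    coeffs:        (32,) target decoding coefficients of this object.
    """
    heat = feature_block @ coeffs               # (H, W) single-channel thermal image
    heat = 1.0 / (1.0 + np.exp(-heat))          # squash to [0, 1] (assumed)
    return heat > thresh                        # binarize with a preset threshold
```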
  • S110 Acquire a target point cloud image of the target object according to the target mask boundary information and the depth image to be processed.
  • Specifically, the pixels containing the target object can be extracted from the depth image to be processed through the target mask boundary information; coordinate transformation is then performed on these pixels to obtain the target point cloud information, and the target point cloud image is constructed according to the target point cloud information.
  • S111 Input the target point cloud image and the target object image into the pre-trained target pose estimation model to obtain the target object pose information of the target object and the target confidence corresponding to the target object pose information.
  • The target object image is obtained from the RGB image to be processed based on the target bounding box information.
  • The pre-trained target pose estimation model is obtained by training with the video frames marked with the second object information.
  • the pre-trained target pose estimation model may include a feature extraction network layer.
  • the feature extraction network layer may include a local feature extraction network layer, a global feature extraction network layer and a feature aggregation network layer.
  • The present application can input the target point cloud image and the target object image into the local feature extraction network layer to obtain the local feature image of the target object; input the local features of the target object and the coordinate information corresponding to the local features into the global feature extraction network layer to obtain the global feature image of the target object; input the local feature image and the global feature image into the feature aggregation network layer to obtain the aggregated feature image; and, through the aggregated feature image, obtain the target object pose information of the target object and the target confidence corresponding to the target object pose information.
  • The target pose estimation model may also include three second convolutional neural network layers, where the first convolutional neural network layer is used to obtain the target translation matrix in the target object pose information, the second convolutional neural network layer is used to obtain the target rotation matrix in the target object pose information, and the third convolutional neural network layer is used to obtain the confidence of the target object pose information.
  • the aggregated feature images can be input to the first and second convolutional neural network layers to obtain the target object pose information of the target object.
  • The three convolutional neural network layers may use 1×1 convolution kernels.
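  • A minimal sketch, purely illustrative, of three 1×1 convolution heads operating on an aggregated per-point feature map as described above; the channel counts, the quaternion rotation output, and all names are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class PoseHeads(nn.Module):
    """Three 1x1 convolution heads over aggregated features: translation, rotation, confidence."""

    def __init__(self, in_channels: int = 1408):
        super().__init__()
        self.translation = nn.Conv1d(in_channels, 3, kernel_size=1)  # (tx, ty, tz)
        self.rotation = nn.Conv1d(in_channels, 4, kernel_size=1)     # quaternion (assumed)
        self.confidence = nn.Conv1d(in_channels, 1, kernel_size=1)

    def forward(self, feats):                    # feats: (B, C, N) aggregated features
        t = self.translation(feats)
        r = self.rotation(feats)
        c = torch.sigmoid(self.confidence(feats))
        return t, r, c
```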
  • Since the present application uses an instance segmentation algorithm in the target detection model, and the instance segmentation algorithm can provide the precise outline of the object to be marked, background noise can be reduced in the target pose estimation.
  • Moreover, both local features and global features are combined in the target pose estimation process. In this way, even if some local features are occluded, pose estimation can still be performed through other unoccluded local features, which can avoid the problem of object occlusion to a certain extent.
  • S112 Determine final object posture information of the target object from the target object posture information of the target object according to the target confidence corresponding to the target object posture information, and perform visual guidance through the final object posture information and target object category information.
  • the robotic arm can be controlled to automatically grab objects at any position and posture and achieve sorting.
  • the device used for video annotation in this application may have the following hardware: Intel(R) Xeon(R) 2.4GHz CPU, NVIDIA GTX 1080 Ti GPU.
  • the pose of the object can be quickly acquired, which is convenient for visual guidance.
  • the present application may use each video frame in the embodiment shown in FIG. 1 or FIG. 2 to perform model training on the target detection model and the target pose estimation model.
  • each video frame in the video frame sequence may include an RGB image and a depth image.
  • the following steps may also be included:
  • S113 Input the RGB image into the target detection model to be trained, and obtain object category information to be matched, bounding box information to be matched, and decoding coefficients of the object to be marked in the RGB image.
  • S114 Perform instance segmentation on the RGB image according to the bounding box information to be matched and the decoding coefficients to obtain the mask boundary information of the object to be marked in the RGB image to be matched.
  • S115 Acquire a point cloud image of the object to be marked according to the mask boundary information to be matched and the depth image.
  • S117 Perform model training on the target detection model and the target pose estimation model according to the to-be-matched object category information, the to-be-matched bounding box information, the to-be-matched mask boundary information, and the to-be-matched object pose information.
  • Specifically, a loss function can be obtained from the object category information, bounding box information, mask boundary information, and second object pose information together with the to-be-matched object category information, to-be-matched bounding box information, to-be-matched mask boundary information, and to-be-matched object pose information, and the target detection model to be trained and the target pose estimation model are trained through the loss function.
  • the present application can quickly perform model training by using semi-automatically labeled video frames, thereby improving the efficiency of model training and ensuring the accuracy of the model to a certain extent.
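  • The patent does not give a concrete form for this loss; purely as an illustration, one common way to combine such terms (cross-entropy for the class, smooth L1 for the box, binary cross-entropy for the mask, L2 for the pose) is sketched below, with arbitrary assumed weights.

```python
import torch
import torch.nn.functional as F

def joint_loss(pred, target, w_cls=1.0, w_box=1.0, w_mask=1.0, w_pose=1.0):
    """Weighted sum of detection and pose-estimation losses (illustrative only)."""
    loss_cls = F.cross_entropy(pred["class_logits"], target["class"])
    loss_box = F.smooth_l1_loss(pred["box"], target["box"])
    loss_mask = F.binary_cross_entropy_with_logits(pred["mask_logits"], target["mask"])
    loss_pose = F.mse_loss(pred["pose"], target["pose"])
    return w_cls * loss_cls + w_box * loss_box + w_mask * loss_mask + w_pose * loss_pose
```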
  • the embodiments of the present invention further provide embodiments of apparatuses for implementing the foregoing method embodiments.
  • FIG. 7 is a schematic structural diagram of a video annotation apparatus provided by an embodiment of the present application.
  • the included modules are used to execute the steps in the embodiments corresponding to FIG. 1 to FIG. 6 .
  • the video annotation device 7 includes:
  • a video capture module 71 configured to capture a video frame sequence related to a work scene through an RGB-D image capture device; each video frame in the video frame sequence includes an object to be marked;
  • a video processing module 72 configured to acquire the target device posture parameters when the RGB-D image acquisition device collects the respective video frames, and construct a three-dimensional scene point cloud of the work scene according to the target device posture parameters;
  • the first object information of the object to be marked in the point cloud of the three-dimensional scene is obtained;
  • the to-be-labeled objects included in the respective video frames are marked with second object information.
  • In some embodiments, the video processing module 72 is further configured to, taking the initial device pose parameter when the RGB-D image acquisition device collects the first video frame as the reference coordinate system, adopt the simultaneous localization and mapping (SLAM) algorithm to obtain the target device pose parameters when the RGB-D image acquisition device collects the respective video frames;
  • convert the pixels in each video frame into the reference coordinate system to obtain the video frames in the reference coordinate system;
  • the video processing module 72 is further configured to convert the pixels in each video frame to the camera coordinate system, and correspondingly obtain the video frame under the camera coordinate system;
  • the video frame in the camera coordinate system is converted into the reference coordinate system to obtain the video frame in the reference coordinate system.
  • the first object information includes: object category information, current identification information and first object pose information of the object to be marked in the three-dimensional scene point cloud;
  • the video processing module 72 is further configured to acquire the three-dimensional object model corresponding to the object to be marked in the three-dimensional scene point cloud, and the three-dimensional object model is set with corresponding object category information;
  • the current identification information of the objects to be labeled in the 3D scene point cloud is set, and the object category information of the 3D object model is used as the object category information of the object to be labeled in the 3D scene point cloud;
  • the converted three-dimensional object model is obtained;
  • by setting the converted three-dimensional object model at the position of the object to be marked in the three-dimensional scene point cloud, the first object pose information of the object to be marked in the three-dimensional scene point cloud is determined.
  • In some embodiments, the video processing module 72 is further configured to set initial pose parameters for the three-dimensional object model and obtain the converted three-dimensional object model accordingly;
  • if the converted three-dimensional object model fits the object to be marked in the three-dimensional scene point cloud, determine the initial pose parameter as the first object pose information of the object to be marked in the three-dimensional scene point cloud;
  • if the converted three-dimensional object model does not fit the object to be marked in the three-dimensional scene point cloud, adjust the initial pose parameter and re-acquire the converted three-dimensional object model according to the adjusted initial pose parameter until the re-acquired converted three-dimensional object model fits the object to be marked in the three-dimensional scene point cloud, and use the adjusted initial pose parameter as the first object pose information.
  • the second object information includes: object category information, current identification information, second object pose information, mask boundary information and bounding box information corresponding to the object to be marked in the video frame;
  • according to the target device pose parameter corresponding to the i-th video frame and the first object pose information of the j-th object to be labeled, obtain the second object pose information of the j-th object to be labeled relative to the i-th video frame;
  • use the convex hull algorithm to calculate the mask boundary information of the j-th object to be labeled in the i-th video frame under the image coordinate system, and obtain the bounding box information under the image coordinate system according to the mask boundary information.
  • the video acquisition module 71 is further configured to use a fixed-focus camera to photograph the object to be marked in multiple directions to obtain multiple frames of images of the object to be marked;
  • the video processing module 72 is further configured to obtain the object point cloud of the object to be marked according to the multi-frame images; and,
  • according to the object point cloud, construct a three-dimensional object model of the object to be marked.
  • FIG. 8 is a schematic structural diagram of a video annotation device provided by an embodiment of the present application.
  • the video annotation device 8 of this embodiment includes: a processor 80 , a memory 81 , and a computer program 82 stored in the memory 81 and executable on the processor 80 , such as a video annotation program.
  • When the processor 80 executes the computer program 82, the steps in each of the foregoing video annotation method embodiments are implemented, for example, S101-S104 shown in FIG. 1.
  • Alternatively, when the processor 80 executes the computer program 82, the functions of the modules/units in the above-mentioned apparatus embodiments, for example, the functions of the modules 71 and 72 shown in FIG. 7, are implemented.
  • Exemplarily, the computer program 82 can be divided into one or more modules/units, and the one or more modules/units are stored in the memory 81 and executed by the processor 80 to complete the present application.
  • the one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, and the instruction segments are used to describe the execution process of the computer program 82 in the video annotation device 8 .
  • the computer program 82 may be divided into an acquisition module and a processing module. For specific functions of each module, please refer to the relevant descriptions in the corresponding embodiments in FIG. 1 to FIG. 6 , which will not be repeated here.
  • the video annotation device may include, but is not limited to, a processor 80 and a memory 81 .
  • FIG. 8 is only an example of the video annotation device 8 and does not constitute a limitation on the video annotation device 8, which may include more or fewer components than shown, combine some components, or use different components.
  • the video annotation device may also include input and output devices, network access devices, buses, and the like.
  • The so-called processor 80 may be a central processing unit (Central Processing Unit, CPU), and may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the memory 81 may be an internal storage unit of the video annotation device 8 , such as a hard disk or a memory of the video annotation device 8 .
  • the memory 81 can also be an external storage device of the video labeling device 8, such as a plug-in hard disk equipped on the video labeling device 8, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, flash memory card (Flash Card), etc.
  • the memory 81 may also include both an internal storage unit of the video annotation device 8 and an external storage device.
  • the memory 81 is used to store the computer program and other programs and data required by the video annotation device.
  • the memory 81 can also be used to temporarily store data that has been output or will be output.
  • Embodiments of the present application further provide a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the foregoing video annotation method can be implemented.
  • An embodiment of the present application provides a computer program product; when the computer program product runs on a video annotation device, the video annotation device, when executing it, can implement the above-mentioned video annotation method.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The present application relates to the technical field of computer vision and image processing, and provides a video annotation method, apparatus, and device, and a computer-readable storage medium, which can improve the annotation efficiency of video data to a certain extent and reduce labor costs. The method comprises: collecting, by means of an RGB-D image acquisition device, a video frame sequence of a working scene, the video frames in the video frame sequence comprising an object to be annotated; obtaining target device pose parameters when the RGB-D image acquisition device collects the video frames, and constructing a three-dimensional scene point cloud of the working scene according to the target device pose parameters; setting a three-dimensional object model at the position where said object is located in the three-dimensional scene point cloud to obtain first object information of said object in the three-dimensional scene point cloud; and annotating, according to the first object information and the target device pose parameters, second object information on said object comprised in the video frames.
PCT/CN2021/137580 2021-02-10 2021-12-13 Procédé, appareil et dispositif d'annotation de vidéo, et support de stockage lisible par ordinateur WO2022170844A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110185443.2 2021-02-10
CN202110185443.2A CN112950667B (zh) 2021-02-10 2021-02-10 一种视频标注方法、装置、设备及计算机可读存储介质

Publications (1)

Publication Number Publication Date
WO2022170844A1 true WO2022170844A1 (fr) 2022-08-18

Family

ID=76245718

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/137580 WO2022170844A1 (fr) 2021-02-10 2021-12-13 Procédé, appareil et dispositif d'annotation de vidéo, et support de stockage lisible par ordinateur

Country Status (2)

Country Link
CN (1) CN112950667B (fr)
WO (1) WO2022170844A1 (fr)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115582827A (zh) * 2022-10-20 2023-01-10 大连理工大学 一种基于2d和3d视觉定位的卸货机器人抓取方法
CN115834983A (zh) * 2022-11-18 2023-03-21 中国船舶重工集团公司第七一九研究所 一种多源信息融合的数字化环境监控方法及***
CN116152783A (zh) * 2023-04-18 2023-05-23 安徽蔚来智驾科技有限公司 目标元素标注数据的获取方法、计算机设备及存储介质
CN116153472A (zh) * 2023-02-24 2023-05-23 萱闱(北京)生物科技有限公司 图像多维可视化方法、装置、介质和计算设备
CN117372632A (zh) * 2023-12-08 2024-01-09 魔视智能科技(武汉)有限公司 二维图像的标注方法、装置、计算机设备及存储介质
CN117746250A (zh) * 2023-12-29 2024-03-22 重庆市地理信息和遥感应用中心(重庆市测绘产品质量检验测试中心) 一种融合实景三维与视频的烟火智能识别与精准定位方法
WO2024124670A1 (fr) * 2022-12-14 2024-06-20 珠海普罗米修斯视觉技术有限公司 Procédé et appareil de lecture vidéo, dispositif informatique et support de stockage lisible par ordinateur

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112950667B (zh) * 2021-02-10 2023-12-22 中国科学院深圳先进技术研究院 一种视频标注方法、装置、设备及计算机可读存储介质
CN113591568A (zh) * 2021-06-28 2021-11-02 北京百度网讯科技有限公司 目标检测方法、目标检测模型的训练方法及其装置
CN113808198B (zh) * 2021-11-17 2022-03-08 季华实验室 一种吸取面的标注方法、装置、电子设备和存储介质
CN117953142A (zh) * 2022-10-20 2024-04-30 华为技术有限公司 一种图像标注方法、装置、电子设备及存储介质
WO2024087067A1 (fr) * 2022-10-26 2024-05-02 北京小米移动软件有限公司 Procédé et appareil d'annotation d'image, et procédé et appareil d'entraînement de réseau neuronal
CN115661577B (zh) * 2022-11-01 2024-04-16 吉咖智能机器人有限公司 用于对象检测的方法、设备和计算机可读存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110047142A (zh) * 2019-03-19 2019-07-23 中国科学院深圳先进技术研究院 无人机三维地图构建方法、装置、计算机设备及存储介质
CN110221690A (zh) * 2019-05-13 2019-09-10 Oppo广东移动通信有限公司 基于ar场景的手势交互方法及装置、存储介质、通信终端
US20200226360A1 (en) * 2019-01-10 2020-07-16 9138-4529 Quebec Inc. System and method for automatically detecting and classifying an animal in an image
CN112950667A (zh) * 2021-02-10 2021-06-11 中国科学院深圳先进技术研究院 一种视频标注方法、装置、设备及计算机可读存储介质

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105678748B (zh) * 2015-12-30 2019-01-15 清华大学 三维监控***中基于三维重构的交互式标定方法和装置
CN110336973B (zh) * 2019-07-29 2021-04-13 联想(北京)有限公司 信息处理方法及其装置、电子设备和介质
CN110503074B (zh) * 2019-08-29 2022-04-15 腾讯科技(深圳)有限公司 视频帧的信息标注方法、装置、设备及存储介质
CN111274426B (zh) * 2020-01-19 2023-09-12 深圳市商汤科技有限公司 类别标注方法及装置、电子设备和存储介质

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200226360A1 (en) * 2019-01-10 2020-07-16 9138-4529 Quebec Inc. System and method for automatically detecting and classifying an animal in an image
CN110047142A (zh) * 2019-03-19 2019-07-23 中国科学院深圳先进技术研究院 无人机三维地图构建方法、装置、计算机设备及存储介质
CN110221690A (zh) * 2019-05-13 2019-09-10 Oppo广东移动通信有限公司 基于ar场景的手势交互方法及装置、存储介质、通信终端
CN112950667A (zh) * 2021-02-10 2021-06-11 中国科学院深圳先进技术研究院 一种视频标注方法、装置、设备及计算机可读存储介质

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115582827A (zh) * 2022-10-20 2023-01-10 大连理工大学 一种基于2d和3d视觉定位的卸货机器人抓取方法
CN115834983A (zh) * 2022-11-18 2023-03-21 中国船舶重工集团公司第七一九研究所 一种多源信息融合的数字化环境监控方法及***
WO2024124670A1 (fr) * 2022-12-14 2024-06-20 珠海普罗米修斯视觉技术有限公司 Procédé et appareil de lecture vidéo, dispositif informatique et support de stockage lisible par ordinateur
CN116153472A (zh) * 2023-02-24 2023-05-23 萱闱(北京)生物科技有限公司 图像多维可视化方法、装置、介质和计算设备
CN116153472B (zh) * 2023-02-24 2023-10-24 萱闱(北京)生物科技有限公司 图像多维可视化方法、装置、介质和计算设备
CN116152783A (zh) * 2023-04-18 2023-05-23 安徽蔚来智驾科技有限公司 目标元素标注数据的获取方法、计算机设备及存储介质
CN116152783B (zh) * 2023-04-18 2023-08-04 安徽蔚来智驾科技有限公司 目标元素标注数据的获取方法、计算机设备及存储介质
CN117372632A (zh) * 2023-12-08 2024-01-09 魔视智能科技(武汉)有限公司 二维图像的标注方法、装置、计算机设备及存储介质
CN117372632B (zh) * 2023-12-08 2024-04-19 魔视智能科技(武汉)有限公司 二维图像的标注方法、装置、计算机设备及存储介质
CN117746250A (zh) * 2023-12-29 2024-03-22 重庆市地理信息和遥感应用中心(重庆市测绘产品质量检验测试中心) 一种融合实景三维与视频的烟火智能识别与精准定位方法

Also Published As

Publication number Publication date
CN112950667A (zh) 2021-06-11
CN112950667B (zh) 2023-12-22

Similar Documents

Publication Publication Date Title
WO2022170844A1 (fr) Procédé, appareil et dispositif d'annotation de vidéo, et support de stockage lisible par ordinateur
US20200279121A1 (en) Method and system for determining at least one property related to at least part of a real environment
He et al. Sparse template-based 6-D pose estimation of metal parts using a monocular camera
JP5822322B2 (ja) ローカライズされ、セグメンテーションされた画像のネットワークキャプチャ及び3dディスプレイ
Azad et al. Stereo-based 6d object localization for grasping with humanoid robot systems
CN109479082B (zh) 图象处理方法及装置
CN109886124B (zh) 一种基于线束描述子图像匹配的无纹理金属零件抓取方法
JP2004334819A (ja) ステレオキャリブレーション装置とそれを用いたステレオ画像監視装置
CN111507908B (zh) 图像矫正处理方法、装置、存储介质及计算机设备
WO2021136386A1 (fr) Procédé de traitement de données, terminal et serveur
CN113111844B (zh) 一种作业姿态评估方法、装置、本地终端及可读存储介质
CN108537214B (zh) 一种室内语义地图的自动化构建方法
CN109272577B (zh) 一种基于Kinect的视觉SLAM方法
CN112861870B (zh) 指针式仪表图像矫正方法、***及存储介质
JP2011134012A (ja) 画像処理装置、その画像処理方法及びプログラム
WO2024012333A1 (fr) Procédé et appareil d'estimation de pose, procédé et appareil d'apprentissage de modèle associés, dispositif électronique, support lisible par ordinateur et produit programme d'ordinateur
CN111488766A (zh) 目标检测方法和装置
CN111191582A (zh) 三维目标检测方法、检测装置、终端设备及计算机可读存储介质
CN111325828A (zh) 一种基于三目相机的三维人脸采集方法及装置
WO2023284358A1 (fr) Procédé et appareil d'étalonnage de caméra, dispositif électronique et support de stockage
JP5704909B2 (ja) 注目領域検出方法、注目領域検出装置、及びプログラム
WO2022247126A1 (fr) Procédé et appareil de localisation visuelle, dispositif, support et programme
CN111161348A (zh) 一种基于单目相机的物***姿估计方法、装置及设备
CN112749664A (zh) 一种手势识别方法、装置、设备、***及存储介质
CN117253022A (zh) 一种对象识别方法、装置及查验设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21925491

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21925491

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 15/01/2024)