CN114119739A - Binocular vision-based hand key point space coordinate acquisition method


Info

Publication number
CN114119739A
Authority
CN
China
Prior art keywords
camera
coordinate system
hand
model
key point
Legal status
Pending
Application number
CN202111230723.7A
Other languages
Chinese (zh)
Inventor
胡朕朕
李舒
王俊
Current Assignee
Hangzhou Innovation Research Institute of Beihang University
Original Assignee
Hangzhou Innovation Research Institute of Beihang University
Priority date
2021-10-22
Filing date
2021-10-22
Publication date
2022-03-01
Application filed by Hangzhou Innovation Research Institute of Beihang University
Priority to CN202111230723.7A
Publication of CN114119739A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/70 - Determining position or orientation of objects or cameras
    • G06T7/73 - Determining position or orientation of objects or cameras using feature-based methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 - Image enhancement or restoration
    • G06T5/80 - Geometric correction

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a binocular vision-based hand key point space coordinate acquisition method. First, the binocular camera is stereo-calibrated and the conversion models between the coordinate systems are established, i.e. the intrinsic and extrinsic parameters and distortion coefficients of each camera and the rotation and translation matrices between the two cameras are obtained. Second, the video shot by the binocular camera is preprocessed, including splitting and distortion correction. Then, each video frame is processed with a machine learning pipeline to obtain the pixel coordinates of the 21 hand key points. Finally, the real three-dimensional coordinates of the hand key points are computed by a least squares method based on the optical axis convergence model. By simulating the structure of the human eyes, the invention uses binocular visual information to accurately locate and recover the three-dimensional coordinates of 21 hand key points covering all joints, reconstructs the hand state more accurately, and provides accurate technical support for applied research on hand key point localization in human-computer interaction.

Description

Binocular vision-based hand key point space coordinate acquisition method
Technical Field
The invention relates to the field of computer vision, in particular to a binocular vision-based method for acquiring the three-dimensional coordinates of human hand key points, and also relates to the technical fields of digital image processing, spatial three-dimensional information acquisition and human-computer interaction.
Background
Binocular stereo vision is an important branch of computer vision: it simulates the human visual system, perceives the three-dimensional information of a scene using the principle of parallax, and reconstructs the shape and position of objects. Accurately detecting and locating the spatial position of a hand in video has long been a hot problem in this field, with high application value in virtual reality, augmented reality, motion-sensing games, human-computer interaction, three-dimensional measurement and other areas.

However, current acquisition of three-dimensional hand coordinates is limited to individual key points such as the fingertips and the palm centre, which is insufficient to reconstruct the motion posture of the whole hand and its location in space. Moreover, most existing schemes segment the hand by skin colour or edge detection and then search the boundary for fingertips; they are easily affected by the background and ambient light, which reduces the robustness of the algorithm.

Meanwhile, most feature extraction schemes used by existing techniques for recovering the 3D coordinates of spatial points are based on the SIFT algorithm. Although SIFT has good stability and invariance, it extracts features poorly from targets with smooth edges and performs badly in the subsequent coarse and fine matching stages, so it is not the best solution for extracting hand key points.

Although many full-hand posture estimation algorithms based on depth images have appeared in recent years, the fingertip area is small and moves quickly, so the finger regions of the generated depth images are of poor quality, detection accuracy is low, and the computed three-dimensional coordinates carry larger errors.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a binocular vision-based hand key point space coordinate acquisition method that acquires not only the three-dimensional coordinates of the fingertips and the palm centre but those of all 21 key points, including every joint, so as to perceive the shape and motion trajectory of the hand, while avoiding the errors introduced by the depth imaging process in depth image-based detection methods, making the localization of the hand key points more accurate.

In order to solve the technical problems, the invention provides the following technical scheme:

A binocular vision-based hand key point space coordinate acquisition method comprises the following steps:

Step 1: fix the binocular camera horizontally, calibrate its left and right cameras, establish the coordinate system conversion models, obtain each camera's intrinsic matrix and distortion coefficients and the pose parameters between the two cameras, and then establish the camera imaging model.

Step 2: preprocess the synchronized video shot by the binocular camera.

Step 3: process the preprocessed left and right videos with a machine learning pipeline, infer the 21 3D hand key points in each frame, and obtain the pixel coordinates of the key points in the left and right camera images.

Step 4: compute the spatial coordinates of each key point in the world coordinate system by the least squares method, using the camera parameters obtained in step 1.
Further, step 1 comprises:

Step 1-1: construct the conversion model from the world coordinate system to the camera coordinate system, $X_c = rX_w + t$, i.e. in homogeneous form:

$$\begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix} = r\begin{bmatrix} x_w \\ y_w \\ z_w \end{bmatrix} + t$$

where $X_c$ denotes a point in the camera coordinate system, $X_w$ a point in the world coordinate system, $r$ is a 3 × 3 rotation matrix and $t$ a 3 × 1 translation vector. The rotation matrix $r$ is jointly determined by components about the three directions X, Y and Z, so it has three degrees of freedom; $r$ is the composition of the rotations about the X, Y and Z axes.
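As an illustrative sketch (not part of the patent text), this world-to-camera conversion takes a few lines of NumPy; the composition order Rz·Ry·Rx and all numeric values below are assumptions:

```python
import numpy as np

def rotation_matrix(rx, ry, rz):
    """Compose a 3x3 rotation from rotations about the X, Y, Z axes (radians)."""
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(rx), -np.sin(rx)],
                   [0, np.sin(rx),  np.cos(rx)]])
    Ry = np.array([[ np.cos(ry), 0, np.sin(ry)],
                   [0, 1, 0],
                   [-np.sin(ry), 0, np.cos(ry)]])
    Rz = np.array([[np.cos(rz), -np.sin(rz), 0],
                   [np.sin(rz),  np.cos(rz), 0],
                   [0, 0, 1]])
    return Rz @ Ry @ Rx  # one common composition order (an assumption)

r = rotation_matrix(0.00, 0.10, 0.05)   # hypothetical camera orientation
t = np.array([0.0, 0.0, 0.5])           # hypothetical translation, metres
Xw = np.array([0.1, 0.2, 1.0])          # a world point
Xc = r @ Xw + t                         # the same point in camera coordinates
```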
Step 1-2: construct the conversion model from the camera coordinate system to the image coordinate system, i.e. the pinhole imaging model, in homogeneous form:

$$z_c\begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}\begin{bmatrix} x_c \\ y_c \\ z_c \\ 1 \end{bmatrix}$$

where $(x, y)$ are the coordinates in the image coordinate system, $(x_c, y_c, z_c)$ are the coordinates in the camera coordinate system, and $f$ is the camera focal length. $z_c$ can be obtained by triangulation, i.e.

$$z_c = \frac{fB}{X_L - X_R}$$

where $B$ is the distance between the optical centres of the left and right cameras, and $X_L$, $X_R$ are the abscissae of the corresponding pixel point in the left and right images.
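A minimal numeric sketch of this triangulation; the focal length, baseline and pixel abscissae below are assumed values:

```python
# Depth from disparity: z_c = f * B / (X_L - X_R). All values are hypothetical.
f = 1200.0                 # focal length expressed in pixels
B = 0.06                   # baseline between the optical centres, in metres
X_L, X_R = 652.0, 608.0    # abscissae of the matched point in the left/right images
z_c = f * B / (X_L - X_R)  # 1200 * 0.06 / 44 ≈ 1.64 m
```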
Step 1-3: construct the conversion model from the image coordinate system to the pixel coordinate system, in homogeneous form:

$$\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} 1/d_x & 0 & u_0 \\ 0 & 1/d_y & v_0 \\ 0 & 0 & 1 \end{bmatrix}\begin{bmatrix} x \\ y \\ 1 \end{bmatrix}$$

where $(u, v)$ are the pixel coordinate system coordinates, $(x, y)$ the image coordinate system coordinates, $d_x$ and $d_y$ denote the physical size of each pixel along the horizontal axis x and the vertical axis y respectively, and $(u_0, v_0)$ are the coordinates, in the pixel coordinate system, of the origin of the image coordinate system (i.e. the intersection of the camera optical axis with the image plane).
Step 1-4: combine the models of steps 1-1, 1-2 and 1-3 to obtain the conversion model from the world coordinate system to the pixel coordinate system:

$$z_c\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f/d_x & 0 & u_0 & 0 \\ 0 & f/d_y & v_0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}\begin{bmatrix} r_{3\times3} & t_{3\times1} \\ 0 & 1 \end{bmatrix}\begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix}$$

In particular,

$$K = \begin{bmatrix} f/d_x & 0 & u_0 & 0 \\ 0 & f/d_y & v_0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}$$

is the camera intrinsic matrix, where $f/d_x$ and $f/d_y$ express the focal length in units of the actual physical pixel size along the horizontal axis x and the vertical axis y respectively, and

$$\begin{bmatrix} r_{3\times3} & t_{3\times1} \\ 0 & 1 \end{bmatrix}$$

is the camera extrinsic matrix, where $r_{3\times3}$ is the rotation matrix and $t_{3\times1}$ the translation vector. The transformation matrix from the world coordinate system to the pixel coordinate system, i.e. the projection matrix $P$ of the camera, is

$$P = K\begin{bmatrix} r_{3\times3} & t_{3\times1} \\ 0 & 1 \end{bmatrix}$$
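A short sketch, under assumed parameter values, of assembling the intrinsic matrix, the extrinsic block and the projection matrix $P$ described above:

```python
import numpy as np

def projection_matrix(f, dx, dy, u0, v0, r, t):
    """Build P = K [r | t], mapping homogeneous world points to pixel coordinates."""
    K = np.array([[f / dx, 0.0,    u0],
                  [0.0,    f / dy, v0],
                  [0.0,    0.0,    1.0]])
    Rt = np.hstack([r, t.reshape(3, 1)])   # 3x4 extrinsic block [r | t]
    return K @ Rt                          # 3x4 projection matrix

# Hypothetical parameters: 4 mm lens, 3 um square pixels, principal point (640, 360).
P = projection_matrix(0.004, 3e-6, 3e-6, 640.0, 360.0, np.eye(3), np.zeros(3))
Xw = np.array([0.10, 0.05, 1.20, 1.0])     # homogeneous world point, metres
u, v, w = P @ Xw
print(u / w, v / w)                        # pixel coordinates of the projection
```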
Step 1-5: perform a Taylor series expansion around the principal point (i.e. the centre point of the image) to construct the lens distortion model, take the first few coefficients as the camera distortion coefficients, and compute the 3 × 3 rotation matrix R and the 3 × 1 translation vector T of the right camera coordinate system relative to the left camera coordinate system.
Further, step 2 comprises:

Step 2-1: split the synchronized video shot by the binocular camera to obtain the video shot by the left camera and the video shot by the right camera respectively.

Step 2-2: correct the videos of the two cameras frame by frame for distortion, using the distortion coefficients obtained in step 1-5, so that the imaging process conforms to the pinhole imaging model.
Further, step 3 comprises:

Step 3-1: detect the whole image with the palm detection model and return the hand region bounding box. Palm detection uses the BlazePalm model, whose encoder-decoder feature extractor, similar to a feature pyramid network (FPN), extracts features at each scale to produce a multi-scale feature representation, so that the feature maps at every level carry strong semantic information at high resolution; this handles well the scale changes caused by distance changes during palm detection. Focal Loss is adopted to cope with the large number of anchors produced by the multi-scale design.

To improve detection efficiency, the palm detector detects only the palm rather than the whole hand: the hand lacks high-contrast feature regions, so reliable hand detection is difficult to achieve from visual features alone, and compared with detecting a hand with joints and fingers, the palm detector only needs to detect the bounding box of a rigid object such as a palm or fist, which is obviously a much simpler task.
Step 3-2: locate the coordinates of the 21 3D key points in the hand region detected in step 3-1 by predicting Gaussian heat maps with the hand key part detection model.

The hand key part detection model uses a CNN (convolutional neural network) to predict a Gaussian heat map for each key point, then applies argmax to each heat map to find the index of its peak, which yields the coordinates of each key point. The model regresses the 21 key points of the hand, so the predicted output feature map has 21 channels, each channel being the heat map of one key point. The loss function is the Euclidean distance loss:

$$L = \sum_{i=1}^{21} \lVert R_i - P_i \rVert_2$$

where $R_i$ is the labelled ground-truth coordinate of key point $i$ and $P_i$ the coordinate predicted by the model.
Further, step 4 comprises:

Obtain the coordinate conversion formulas from the world coordinate system to the pixel coordinate system according to the optical axis convergence model; for the left and right cameras:

$$z_{c1}\begin{bmatrix} u_1 \\ v_1 \\ 1 \end{bmatrix} = P^{1}\begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix}, \qquad z_{c2}\begin{bmatrix} u_2 \\ v_2 \\ 1 \end{bmatrix} = P^{2}\begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix}$$

where $P^{1} = (p^{1}_{ij})$ is the projection matrix of the left camera, $P^{2} = (p^{2}_{ij})$ is the projection matrix of the right camera, $(u_1, v_1)$ and $(u_2, v_2)$ are the pixel coordinates of the key point in the left and right images, and $z_{c1}$, $z_{c2}$ are the z coordinates of the key point in the left and right camera coordinate systems, obtainable by step 1-2. Specifically, the rotation matrix of the left camera is set to the 3 × 3 identity matrix and its translation vector to the 3 × 1 zero vector, while the rotation matrix of the right camera is set to the matrix R and its translation vector to the vector T obtained in step 1-5; the origin of the world coordinate system is then the optical centre of the left camera.
Eliminating $z_{c1}$ and $z_{c2}$ yields four linear equations:

$$(u_1 p^{1}_{31} - p^{1}_{11})x_w + (u_1 p^{1}_{32} - p^{1}_{12})y_w + (u_1 p^{1}_{33} - p^{1}_{13})z_w = p^{1}_{14} - u_1 p^{1}_{34}$$

$$(v_1 p^{1}_{31} - p^{1}_{21})x_w + (v_1 p^{1}_{32} - p^{1}_{22})y_w + (v_1 p^{1}_{33} - p^{1}_{23})z_w = p^{1}_{24} - v_1 p^{1}_{34}$$

$$(u_2 p^{2}_{31} - p^{2}_{11})x_w + (u_2 p^{2}_{32} - p^{2}_{12})y_w + (u_2 p^{2}_{33} - p^{2}_{13})z_w = p^{2}_{14} - u_2 p^{2}_{34}$$

$$(v_2 p^{2}_{31} - p^{2}_{21})x_w + (v_2 p^{2}_{32} - p^{2}_{22})y_w + (v_2 p^{2}_{33} - p^{2}_{23})z_w = p^{2}_{24} - v_2 p^{2}_{34}$$

Solving this overdetermined system by the least squares method yields the real 3D spatial coordinates $(x_w, y_w, z_w)$ of the hand key point.
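A sketch of this least squares solution (NumPy); P1 and P2 are the 3 × 4 projection matrices from step 1, and uv1, uv2 the measured pixel coordinates:

```python
import numpy as np

def triangulate(P1, P2, uv1, uv2):
    """Recover (xw, yw, zw) from the four linear equations above by least squares."""
    rows = []
    for P, (u, v) in ((P1, uv1), (P2, uv2)):
        rows.append(u * P[2] - P[0])   # u*(p31 p32 p33 p34) - (p11 p12 p13 p14)
        rows.append(v * P[2] - P[1])
    A = np.asarray(rows)               # 4x4 system acting on (xw, yw, zw, 1)
    # A[:, :3] @ X = -A[:, 3] is exactly the four equations written out above.
    X, residuals, rank, _ = np.linalg.lstsq(A[:, :3], -A[:, 3], rcond=None)
    return X                           # (xw, yw, zw) in the world frame
```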
As can be seen from the above technical scheme, the binocular vision-based hand key point space coordinate acquisition method of the invention has high recognition accuracy. The invention uses a binocular camera to simulate the human eyes, establishes the coordinate system conversion models and obtains the pixel coordinates of the same target point in the two cameras, differing from the traditional scheme of obtaining image features with the SIFT algorithm.
The invention has the advantages and beneficial effects that:
1. according to the technical scheme, the machine learning model is adopted, the images acquired by the left and right cameras are processed in a pipeline mode, the pixel coordinates of the key points of the hands in each frame of image are acquired, the traditional SIFT algorithm is not adopted for feature point extraction, the captured key points are more accurate, and the accuracy of three-dimensional space coordinate positioning of the key points is improved.
2. According to the technical scheme, after the pixel coordinates of the key points on the left and right eye images are obtained, the three-dimensional space coordinates of the key points are solved by adopting a least square method, so that the final result is more accurate.
3. Compared with the prior technical scheme of only detecting the fingertips and the palms, the technical scheme of the invention has the advantages that the detected key points are more comprehensive, all 21 key points including the fingertips, the palms and the joint points are covered, and the dynamic reconstruction of the fingertips is more accurate.
Drawings
In order to illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the invention, and a person of ordinary skill in the art could obtain other drawings from them without creative effort.
Fig. 1 is a schematic flow chart of a binocular vision-based hand key point space coordinate acquisition method of the present invention.
Fig. 2 is a schematic flow chart of coordinate system conversion in an embodiment of the present invention.
Fig. 3 is a schematic diagram of 21 hand key points in an embodiment of the present invention.
Fig. 4 is a flow chart of the machine learning pipeline of step 3 in an embodiment of the present invention.
Fig. 5 is a schematic diagram of an optical axis convergence model in an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
The method requires no specific operating environment; the hardware only needs a computer and a binocular camera. Fig. 1 is a flow chart of the steps of a preferred embodiment of the binocular vision-based hand key point space coordinate acquisition method. The binocular camera shoots and transmits the captured video stream to the computer, where the stream is preprocessed, including left/right view separation and image correction. Key points are then detected in the left and right video frames with the machine learning model to obtain their coordinates, and finally the three-dimensional coordinates of the hand key points are recovered by the least squares method. The method is described in detail below with reference to Fig. 1:

Step 1: fix the binocular camera horizontally, calibrate its left and right cameras, establish the coordinate system conversion models, obtain each camera's intrinsic matrix and distortion coefficients and the pose parameters between the two cameras as shown in Fig. 2, and establish the camera imaging geometric model.

Step 1-1: fix the binocular camera so that the line through the two optical centres is horizontal and both cameras can capture the complete hand.

Step 1-2: perform binocular calibration of the left and right cameras. First a printed calibration board is prepared; in this embodiment it is a 7 × 10 checkerboard of alternating black and white squares with a side length of 25 mm, and the corner points where black and white squares meet are taken as feature points, 54 in total.

Step 1-3: shoot the calibration board with the binocular camera from different orientations. For a more accurate calibration result, the board should be shot from 10 or more different orientations, and it must remain inside the shooting area of the cameras throughout, yielding left and right images of the calibration board.

Step 1-4: extract the checkerboard feature points in the left and right images respectively; the origin of the image coordinate system is the intersection of the camera imaging plane with its optical axis. Match the feature points of the left and right checkerboard images of the same orientation using the epipolar constraint principle, solve the homography matrix with the Levenberg-Marquardt algorithm, then solve the intrinsic and extrinsic parameters of the left and right cameras, obtain the pose parameters between the cameras, and write the obtained parameters into the system. The intrinsic parameters are determined by the internal optical and geometric properties of the camera, including the actual pixel size $(d_x, d_y)$, the principal point pixel coordinates $(u_0, v_0)$, the focal length $f$, the axis skew coefficient $s$ and the distortion coefficients $(k_1, k_2, k_3, p_1, p_2)$. The extrinsic parameters describe the relative position and orientation between the camera coordinate system and the world coordinate system, comprising a rotation matrix and a translation vector, as well as the rotation matrix R and the translation vector T of the right camera coordinate system relative to the left camera coordinate system.
Further, the epipolar constraint principle of step 1-4 states that the projection of any spatial point onto the image plane necessarily lies in the epipolar plane formed by that point and the optical centres of the two cameras, so for a feature point in one image, its matching point in the other view must lie on the corresponding epipolar line. The epipolar constraint thus reduces feature matching from a two-dimensional search to a one-dimensional search, greatly increasing computation speed and reducing mismatches. The homography matrix describes the mapping between the world coordinate system and the pixel coordinate system, i.e. the projection matrix of the camera. The Levenberg-Marquardt (LM) algorithm is an optimization algorithm whose purpose here is to obtain the optimal solution of the homography matrix even when the computed feature point pairs are noisy or contain mismatches. The pose parameters between the cameras comprise the rotation matrix R and the translation vector T between them.
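In practice this calibration is commonly carried out with OpenCV, whose routines perform the corner extraction and LM optimization described above. A sketch under assumed file names, using the 6 × 9 inner-corner grid implied by the 7 × 10 board (6 × 9 = 54 feature points):

```python
import cv2
import numpy as np

pattern = (9, 6)                    # inner corners of the 7x10-square checkerboard
square = 0.025                      # 25 mm square side length
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square

obj_pts, left_pts, right_pts = [], [], []
for i in range(10):                 # 10 or more orientations, as above
    imgL = cv2.imread(f"left_{i}.png", cv2.IMREAD_GRAYSCALE)    # hypothetical paths
    imgR = cv2.imread(f"right_{i}.png", cv2.IMREAD_GRAYSCALE)
    okL, cornersL = cv2.findChessboardCorners(imgL, pattern)
    okR, cornersR = cv2.findChessboardCorners(imgR, pattern)
    if okL and okR:
        obj_pts.append(objp)
        left_pts.append(cornersL)
        right_pts.append(cornersR)

size = imgL.shape[::-1]             # image size as (width, height)
# Per-camera intrinsics and distortion coefficients, then the pose (R, T)
# of the right camera relative to the left camera:
_, K1, D1, _, _ = cv2.calibrateCamera(obj_pts, left_pts, size, None, None)
_, K2, D2, _, _ = cv2.calibrateCamera(obj_pts, right_pts, size, None, None)
_, K1, D1, K2, D2, R, T, E, F = cv2.stereoCalibrate(
    obj_pts, left_pts, right_pts, K1, D1, K2, D2, size,
    flags=cv2.CALIB_FIX_INTRINSIC)
```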
Step 2: preprocess the synchronized video shot by the binocular camera.
Step 2-1: after calibration, shoot the hand with the binocular camera. The captured images are 720 × 2560 digital images in RGB colour space, and the hand must stay within the shooting range of the cameras throughout. Since the binocular camera used in this embodiment shoots synchronously, the shot synchronized video must be split to obtain the left camera video and the right camera video respectively, each with a resolution of 720 × 1280.
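A sketch of this split, assuming the synchronized stream is stored as a side-by-side video file (the file name is hypothetical):

```python
import cv2

cap = cv2.VideoCapture("stereo.mp4")        # synchronized 720x2560 stream
while True:
    ok, frame = cap.read()                  # frame shape: (720, 2560, 3)
    if not ok:
        break
    left, right = frame[:, :1280], frame[:, 1280:]   # two 720x1280 views
    # ... undistort each view and feed it to the key point pipeline
cap.release()
```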
Step 2-2: correct the videos of the two cameras for distortion using the distortion coefficients obtained in step 1-4, so that the imaging process conforms to the pinhole imaging model. The correction formulas are:

Radial distortion correction:

$$x_0 = x(1 + k_1 r^2 + k_2 r^4 + k_3 r^6)$$

$$y_0 = y(1 + k_1 r^2 + k_2 r^4 + k_3 r^6)$$

Tangential distortion correction:

$$x_0 = x + 2p_1 xy + p_2(r^2 + 2x^2)$$

$$y_0 = y + p_1(r^2 + 2y^2) + 2p_2 xy$$

where $(x_0, y_0)$ is the original position of the distorted point in the image plane, $(x, y)$ is the new position after distortion correction, $r^2 = x^2 + y^2$, $k_1$, $k_2$, $k_3$ are the radial distortion coefficients and $p_1$, $p_2$ are the tangential distortion coefficients.
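As a sketch, the model above maps an ideal point to its distorted position; per-frame correction inverts this mapping (in practice, e.g., with OpenCV's undistortion routines). The coefficient values below are hypothetical:

```python
def distort(x, y, k1, k2, k3, p1, p2):
    """Apply the radial + tangential model above: ideal (x, y) -> distorted (x0, y0)."""
    r2 = x * x + y * y
    radial = 1.0 + k1 * r2 + k2 * r2**2 + k3 * r2**3
    x0 = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    y0 = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    return x0, y0

# Hypothetical calibration coefficients, normalized image coordinates:
x0, y0 = distort(0.10, -0.05, k1=-0.12, k2=0.03, k3=0.0, p1=1e-4, p2=-2e-4)
```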
Step 3: process the preprocessed left and right videos with the machine learning pipeline and infer the 21 3D key points in each frame; a schematic diagram of the key points is shown in Fig. 3, and the pixel coordinates of the key points in the left and right images are obtained. In particular, the machine learning pipeline organizes the machine learning tasks as a dataflow pipeline, which manages computing resources effectively to achieve low-latency performance. In this step the pipeline consists mainly of two models, a palm detection model and a hand key part detection model; this part of the flow is shown in Fig. 4. The specific steps are as follows:

To improve processing efficiency, after the preprocessed left and right video streams are obtained, they are processed in parallel using multiple threads, specifically:
step 3-1: the palm detector is used to detect the ROI (i.e., the hand region) in the first frame image of the video stream and return to the bounding box.
Step 3-2: and (4) cutting the image of the ROI, accurately positioning key points of the cut image by using a hand key part detection model, and outputting coordinates of the key points and confidence coefficients of hand existence and reasonable alignment in the cut image.
To improve detection efficiency and reduce computation time, the palm detector is not run again when processing subsequent frames; instead the hand region in the current frame is inferred from the hand key points computed in the previous frame, avoiding running the palm detector on every frame. Only when the confidence output by the hand key part detection model falls below a set threshold, or the hand is lost, is the palm detection model reapplied to the whole frame. In this embodiment the threshold is set to 0.8; a sketch of this loop follows.
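In the sketch below, `detect_palm` and `predict_keypoints` are hypothetical stand-ins for the BlazePalm and hand key part models, which the patent does not specify at code level:

```python
import numpy as np

def keypoints_to_roi(keypoints, margin=0.2):
    """Bounding box around the 21 (u, v) key points, padded by a relative margin."""
    lo, hi = keypoints.min(axis=0), keypoints.max(axis=0)
    pad = (hi - lo) * margin
    return np.concatenate([lo - pad, hi + pad])      # (u_min, v_min, u_max, v_max)

def track_hand(frames, detect_palm, predict_keypoints, conf_threshold=0.8):
    """Run the palm detector only on the first frame and after the hand is lost.

    detect_palm(frame) -> ROI; predict_keypoints(frame, roi) -> (21x2 array, conf).
    """
    roi, results = None, []
    for frame in frames:
        if roi is None:
            roi = detect_palm(frame)                 # full-frame palm detection
        keypoints, conf = predict_keypoints(frame, roi)
        if conf < conf_threshold:
            roi = None                               # hand lost: re-detect next frame
        else:
            roi = keypoints_to_roi(keypoints)        # infer the next frame's ROI
        results.append(keypoints)
    return results
```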
When performing ROI detection, this embodiment uses the palm detector to detect the palm rather than the whole hand, because the hand detection task is more complex than palm detection: hand detection has to cope with a wide range of hand sizes, which requires a larger detection range, and the hand lacks high-contrast feature regions, so reliable hand detection is difficult to achieve from visual features alone. In contrast to hand detection, which requires detecting a hand with joints and fingers, the palm detector only needs to detect the bounding box of a rigid object such as a palm or fist, which is obviously much simpler. This step also greatly reduces the need for data augmentation such as rotation, translation and scaling of the images, allowing the network to spend more capacity on key point localization accuracy.

To improve palm detection accuracy, palm detection in this embodiment uses the BlazePalm model, which has a large zoom range and can recognize a wide variety of palm sizes. With the NMS algorithm, the palm region can be detected well even when the hands occlude each other, and the palm can be located accurately through recognition of the arm, torso or personal features, compensating for the hand's lack of high-contrast texture features.
The hand key part detection model in this embodiment needs enough human hand samples for training. The model uses a CNN to predict a Gaussian heat map for each key point, so each label is the Gaussian map generated from that key point. The model regresses the 21 hand key points, so the predicted output feature map has 21 channels, each channel being the heat map of one key point; applying argmax to each channel then yields the integer coordinates of the key points. In particular, to reduce the bias of the prediction results, the loss function of the Gaussian heat map regression in this embodiment is not the mainstream MSE but the Euclidean distance loss:

$$L = \sum_{i=1}^{21} \lVert R_i - P_i \rVert_2$$

where $R_i$ is the labelled ground-truth coordinate and $P_i$ the model-predicted coordinate. This embodiment learns the heat maps indirectly by optimizing the loss of the predicted coordinates output by the whole model, i.e. the loss is computed from the predicted and real key points, and the heat maps are learned spontaneously by the network. Compared with regressing coordinate points directly through fully connected layers, the scheme of predicting Gaussian heat maps has stronger spatial generalization ability and higher accuracy of the predicted coordinates.
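A sketch of this heat map decoding and loss (NumPy; the array shapes are assumptions):

```python
import numpy as np

def heatmaps_to_keypoints(heatmaps):
    """argmax per channel: (21, H, W) Gaussian heat maps -> (21, 2) integer (u, v)."""
    n, h, w = heatmaps.shape
    peaks = heatmaps.reshape(n, -1).argmax(axis=1)   # flat index of each peak
    vs, us = np.unravel_index(peaks, (h, w))
    return np.stack([us, vs], axis=1)

def euclidean_loss(pred, truth):
    """Sum over the 21 key points of the Euclidean distance ||R_i - P_i||."""
    return np.linalg.norm(pred - truth, axis=1).sum()
```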
Step 4: compute the spatial coordinates of each key point in the world coordinate system by the least squares method, from the camera parameters obtained in step 1 and the pixel coordinates of the hand key points in the left and right images obtained in step 3.
The optical axis convergence model is shown in Fig. 5. From this model, for the left and right cameras respectively:

$$z_{c1}\begin{bmatrix} u_1 \\ v_1 \\ 1 \end{bmatrix} = P^{1}\begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix}, \qquad z_{c2}\begin{bmatrix} u_2 \\ v_2 \\ 1 \end{bmatrix} = P^{2}\begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix}$$

where $P^{1} = (p^{1}_{ij})$ is the projection matrix of the left camera, $P^{2} = (p^{2}_{ij})$ is the projection matrix of the right camera, and $(u_1, v_1)$, $(u_2, v_2)$ are the pixel coordinates of the key point in the left and right images respectively. Specifically, the rotation matrix of the left camera is set to the 3 × 3 identity matrix and its translation vector to the 3 × 1 zero vector, while the rotation matrix of the right camera is set to the rotation matrix R and its translation vector to the vector T obtained in step 1-4; the origin of the world coordinate system is then the optical centre of the left camera.
Eliminating $z_{c1}$ and $z_{c2}$ yields four linear equations:

$$(u_1 p^{1}_{31} - p^{1}_{11})x_w + (u_1 p^{1}_{32} - p^{1}_{12})y_w + (u_1 p^{1}_{33} - p^{1}_{13})z_w = p^{1}_{14} - u_1 p^{1}_{34}$$

$$(v_1 p^{1}_{31} - p^{1}_{21})x_w + (v_1 p^{1}_{32} - p^{1}_{22})y_w + (v_1 p^{1}_{33} - p^{1}_{23})z_w = p^{1}_{24} - v_1 p^{1}_{34}$$

$$(u_2 p^{2}_{31} - p^{2}_{11})x_w + (u_2 p^{2}_{32} - p^{2}_{12})y_w + (u_2 p^{2}_{33} - p^{2}_{13})z_w = p^{2}_{14} - u_2 p^{2}_{34}$$

$$(v_2 p^{2}_{31} - p^{2}_{21})x_w + (v_2 p^{2}_{32} - p^{2}_{22})y_w + (v_2 p^{2}_{33} - p^{2}_{23})z_w = p^{2}_{24} - v_2 p^{2}_{34}$$

Solving this system by the least squares method yields the real 3D spatial coordinates $(x_w, y_w, z_w)$ of the hand key point.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solution of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the exemplary embodiments of the present invention, and all such modifications and alterations should therefore fall within the scope of the appended claims.

Claims (7)

1. A binocular vision-based hand key point space coordinate acquisition method, characterized by comprising the following steps:
step 1: fixing the binocular camera horizontally, calibrating its left and right cameras, establishing each coordinate system conversion model, and obtaining each camera's intrinsic matrix and distortion coefficients and the pose parameters between the two cameras, i.e. establishing the camera imaging model;
step 2: preprocessing the synchronized video shot by the binocular camera;
step 3: processing the preprocessed left and right videos with a machine learning pipeline, inferring the 21 3D hand key points in each frame, and obtaining the pixel coordinates of the key points in the left and right camera images;
step 4: computing the spatial coordinates of each key point in the world coordinate system by the least squares method, from the camera parameters obtained in step 1.
2. The binocular vision-based hand key point space coordinate acquisition method according to claim 1, characterized in that step 1 comprises:
step 1-1: constructing the conversion model from the world coordinate system to the camera coordinate system, $X_c = rX_w + t$, i.e. in homogeneous form:

$$\begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix} = r\begin{bmatrix} x_w \\ y_w \\ z_w \end{bmatrix} + t$$

wherein $X_c$ denotes a point in the camera coordinate system, $X_w$ a point in the world coordinate system, $r$ is a 3 × 3 rotation matrix, and $t$ is a 3 × 1 translation vector; the rotation matrix $r$ is jointly determined by components about the three directions X, Y and Z, so it has three degrees of freedom, $r$ being the composition of the rotations about the X, Y and Z axes;
step 1-2: constructing the conversion model from the camera coordinate system to the image coordinate system, i.e. the pinhole imaging model,

$$z_c\begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}\begin{bmatrix} x_c \\ y_c \\ z_c \\ 1 \end{bmatrix}$$

wherein $(x, y)$ are the coordinates in the image coordinate system, $(x_c, y_c, z_c)$ the coordinates in the camera coordinate system, and $f$ is the camera focal length; $z_c$ is obtained by triangulation, i.e.

$$z_c = \frac{fB}{X_L - X_R}$$

wherein $B$ is the distance between the optical centres of the left and right cameras, and $X_L$, $X_R$ are the abscissae of the corresponding pixel point in the left and right images;
step 1-3: constructing the conversion model from the image coordinate system to the pixel coordinate system:

$$\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} 1/d_x & 0 & u_0 \\ 0 & 1/d_y & v_0 \\ 0 & 0 & 1 \end{bmatrix}\begin{bmatrix} x \\ y \\ 1 \end{bmatrix}$$

wherein $(u, v)$ are pixel coordinate system coordinates, $(x, y)$ image coordinate system coordinates, $d_x$ and $d_y$ denote the physical size of each pixel along the horizontal axis x and the vertical axis y respectively, and $(u_0, v_0)$ are the coordinates of the origin of the image coordinate system in the pixel coordinate system;
step 1-4: combining the models of steps 1-1, 1-2 and 1-3 to obtain the conversion model from the world coordinate system to the pixel coordinate system:

$$z_c\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f/d_x & 0 & u_0 & 0 \\ 0 & f/d_y & v_0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}\begin{bmatrix} r_{3\times3} & t_{3\times1} \\ 0 & 1 \end{bmatrix}\begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix}$$

wherein

$$K = \begin{bmatrix} f/d_x & 0 & u_0 & 0 \\ 0 & f/d_y & v_0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}$$

is the camera intrinsic matrix, $f/d_x$ and $f/d_y$ expressing the focal length in units of the actual physical pixel size along the horizontal axis x and the vertical axis y respectively, and

$$\begin{bmatrix} r_{3\times3} & t_{3\times1} \\ 0 & 1 \end{bmatrix}$$

is the camera extrinsic matrix, wherein $r_{3\times3}$ is the rotation matrix and $t_{3\times1}$ the translation vector; the transformation matrix from the world coordinate system to the pixel coordinate system, i.e. the projection matrix $P$ of the camera, is:

$$P = K\begin{bmatrix} r_{3\times3} & t_{3\times1} \\ 0 & 1 \end{bmatrix}$$

step 1-5: performing a Taylor series expansion around the principal point to construct the lens distortion model, taking the first few coefficients as the camera distortion coefficients, and computing the 3 × 3 rotation matrix R and the 3 × 1 translation vector T of the right camera coordinate system relative to the left camera coordinate system.
3. The binocular vision-based hand key point space coordinate acquisition method according to claim 1, characterized in that step 2 comprises:
step 2-1: splitting the synchronized video shot by the binocular camera to obtain the video shot by the left camera and the video shot by the right camera respectively;
step 2-2: correcting the videos of the two cameras frame by frame for distortion, using the distortion coefficients obtained in step 1-5, so that the imaging process conforms to the pinhole imaging model.
4. The binocular vision-based hand key point space coordinate acquisition method according to claim 1, characterized in that step 3 comprises:
step 3-1: detecting the whole image with the palm detection model and returning the hand region bounding box; palm detection uses the BlazePalm model, whose encoder-decoder feature extractor, similar to a feature pyramid network (FPN), extracts features at each scale to produce a multi-scale feature representation, so that the feature maps at every level carry strong semantic information at high resolution, handling well the scale changes caused by distance changes during palm detection; Focal Loss is adopted to cope with the large number of anchors produced by the multi-scale design;
step 3-2: locating the coordinates of the 21 3D key points in the hand region detected in step 3-1 by predicting Gaussian heat maps with the hand key part detection model.
5. The binocular vision-based hand key point space coordinate acquisition method according to claim 1, characterized in that in step 3-1, the palm detector is used to detect only the palm rather than the whole hand, because the hand lacks high-contrast feature regions and reliable hand detection is difficult to achieve from visual features alone; compared with detecting a hand with joints and fingers, the palm detector only needs to detect the bounding box of a rigid object such as a palm or fist.
6. The binocular vision-based hand key point space coordinate acquisition method according to claim 1, characterized in that in step 3-2, the hand key part detection model predicts a Gaussian heat map for each key point with a CNN (convolutional neural network), then applies argmax to each heat map to find the index of its peak, obtaining the coordinates of each key point; the model regresses the 21 hand key points, the predicted output feature map has 21 channels, and each channel is the heat map of one key point; the loss function is the Euclidean distance loss:

$$L = \sum_{i=1}^{21} \lVert R_i - P_i \rVert_2$$

wherein $R_i$ is the labelled ground-truth coordinate and $P_i$ the model-predicted coordinate.
7. The binocular vision-based hand key point space coordinate acquisition method according to claim 1, characterized in that step 4 comprises:
obtaining the coordinate conversion formulas from the world coordinate system to the pixel coordinate system according to the optical axis convergence model, which for the left and right cameras give:

$$z_{c1}\begin{bmatrix} u_1 \\ v_1 \\ 1 \end{bmatrix} = P^{1}\begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix}, \qquad z_{c2}\begin{bmatrix} u_2 \\ v_2 \\ 1 \end{bmatrix} = P^{2}\begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix}$$

wherein $P^{1} = (p^{1}_{ij})$ is the projection matrix of the left camera, $P^{2} = (p^{2}_{ij})$ is the projection matrix of the right camera, $(u_1, v_1)$ and $(u_2, v_2)$ are the pixel coordinates of the key point in the left and right images, and $z_{c1}$, $z_{c2}$ are the z coordinates of the key point in the left and right camera coordinate systems, obtained through step 1-2; the rotation matrix of the left camera is set to the 3 × 3 identity matrix and its translation vector to the 3 × 1 zero vector, the rotation matrix of the right camera is set to the matrix R obtained in step 1-5 and its translation vector to the vector T obtained in step 1-5, the origin of the world coordinate system being the optical centre of the left camera;
eliminating $z_{c1}$ and $z_{c2}$ yields:

$$(u_1 p^{1}_{31} - p^{1}_{11})x_w + (u_1 p^{1}_{32} - p^{1}_{12})y_w + (u_1 p^{1}_{33} - p^{1}_{13})z_w = p^{1}_{14} - u_1 p^{1}_{34}$$

$$(v_1 p^{1}_{31} - p^{1}_{21})x_w + (v_1 p^{1}_{32} - p^{1}_{22})y_w + (v_1 p^{1}_{33} - p^{1}_{23})z_w = p^{1}_{24} - v_1 p^{1}_{34}$$

$$(u_2 p^{2}_{31} - p^{2}_{11})x_w + (u_2 p^{2}_{32} - p^{2}_{12})y_w + (u_2 p^{2}_{33} - p^{2}_{13})z_w = p^{2}_{14} - u_2 p^{2}_{34}$$

$$(v_2 p^{2}_{31} - p^{2}_{21})x_w + (v_2 p^{2}_{32} - p^{2}_{22})y_w + (v_2 p^{2}_{33} - p^{2}_{23})z_w = p^{2}_{24} - v_2 p^{2}_{34}$$

and solving the above equations by the least squares method yields the real 3D spatial coordinates $(x_w, y_w, z_w)$ of the hand key point.
CN202111230723.7A (priority date 2021-10-22, filing date 2021-10-22): Binocular vision-based hand key point space coordinate acquisition method. Status: Pending. Published as CN114119739A.

Priority Applications (1)

Application Number: CN202111230723.7A
Priority Date / Filing Date: 2021-10-22
Title: Binocular vision-based hand key point space coordinate acquisition method

Applications Claiming Priority (1)

Application Number: CN202111230723.7A
Priority Date / Filing Date: 2021-10-22
Title: Binocular vision-based hand key point space coordinate acquisition method

Publications (1)

Publication Number: CN114119739A
Publication Date: 2022-03-01

Family

ID=80376605

Family Applications (1)

Application Number: CN202111230723.7A
Title: Binocular vision-based hand key point space coordinate acquisition method
Priority Date / Filing Date: 2021-10-22
Status: Pending

Country Status (1)

Country Link
CN (1) CN114119739A (en)


Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112362034A (en) * 2020-11-11 2021-02-12 上海电器科学研究所(集团)有限公司 Solid engine multi-cylinder section butt joint guiding measurement algorithm based on binocular vision
CN114820485A (en) * 2022-04-15 2022-07-29 华南理工大学 Method for measuring wave climbing height based on airborne image
CN114820485B (en) * 2022-04-15 2024-03-26 华南理工大学 Method for measuring wave climbing based on airborne image
CN114842091A (en) * 2022-04-29 2022-08-02 广东工业大学 Binocular egg size assembly line measuring method
CN114842091B (en) * 2022-04-29 2023-05-23 广东工业大学 Binocular egg size assembly line measuring method
CN114979611A (en) * 2022-05-19 2022-08-30 国网智能科技股份有限公司 Binocular sensing system and method
CN114757822A (en) * 2022-06-14 2022-07-15 之江实验室 Binocular-based human body three-dimensional key point detection method and system
CN117218320A (en) * 2023-11-08 2023-12-12 济南大学 Space labeling method based on mixed reality
CN117218320B (en) * 2023-11-08 2024-02-27 济南大学 Space labeling method based on mixed reality

Similar Documents

Publication Title
CN114119739A (en) Binocular vision-based hand key point space coordinate acquisition method
WO2022002150A1 (en) Method and device for constructing visual point cloud map
Wang et al. 360sd-net: 360 stereo depth estimation with learnable cost volume
US10109055B2 (en) Multiple hypotheses segmentation-guided 3D object detection and pose estimation
CN106251399B (en) A kind of outdoor scene three-dimensional rebuilding method and implementing device based on lsd-slam
CN107833181B (en) Three-dimensional panoramic image generation method based on zoom stereo vision
CN103839277B (en) A kind of mobile augmented reality register method of outdoor largescale natural scene
CN111968129A (en) Instant positioning and map construction system and method with semantic perception
CN106485207B (en) A kind of Fingertip Detection and system based on binocular vision image
CN107204010A (en) A kind of monocular image depth estimation method and system
CN110555408B (en) Single-camera real-time three-dimensional human body posture detection method based on self-adaptive mapping relation
CN109359514B (en) DeskVR-oriented gesture tracking and recognition combined strategy method
CN115205489A (en) Three-dimensional reconstruction method, system and device in large scene
CN103337094A (en) Method for realizing three-dimensional reconstruction of movement by using binocular camera
CN109758756B (en) Gymnastics video analysis method and system based on 3D camera
CN106155299B (en) A kind of pair of smart machine carries out the method and device of gesture control
CN113706699A (en) Data processing method and device, electronic equipment and computer readable storage medium
CN109272577B (en) Kinect-based visual SLAM method
CN115272271A (en) Pipeline defect detecting and positioning ranging system based on binocular stereo vision
CN113850865A (en) Human body posture positioning method and system based on binocular vision and storage medium
CN113393439A (en) Forging defect detection method based on deep learning
CN115035546B (en) Three-dimensional human body posture detection method and device and electronic equipment
CN111582036A (en) Cross-view-angle person identification method based on shape and posture under wearable device
CN115359127A (en) Polarization camera array calibration method suitable for multilayer medium environment
CN117711066A (en) Three-dimensional human body posture estimation method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination