CN114119739A - Binocular vision-based hand key point space coordinate acquisition method - Google Patents
- Publication number
- CN114119739A (application number CN202111230723.7A)
- Authority
- CN
- China
- Prior art keywords
- camera
- coordinate system
- hand
- model
- key point
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/80—Geometric correction
Abstract
The invention provides a binocular vision-based method for acquiring the spatial coordinates of hand key points. First, the binocular camera is stereoscopically calibrated and conversion models between the coordinate systems are established, yielding the intrinsic and extrinsic parameters of each camera, the distortion coefficients, and the rotation and translation matrix between the two cameras. Second, the video captured by the binocular camera is preprocessed, including cropping and distortion correction. Next, the video is processed frame by frame with a machine learning pipeline to obtain the pixel coordinates of 21 hand key points. Finally, the real three-dimensional coordinates of the hand key points are computed with a least squares method based on the optical axis convergence model. By simulating the structure of human eyes, the invention uses binocular visual information to accurately locate and recover the three-dimensional coordinates of 21 hand key points covering all joints, reconstructs the hand state more accurately, and provides precise technical support for applications of hand key point localization in human-computer interaction.
Description
Technical Field
The invention relates to the field of computer vision, in particular to a binocular vision-based method for acquiring the three-dimensional coordinates of human hand key points, and also relates to the technical fields of digital image processing, three-dimensional spatial information acquisition, and human-computer interaction.
Background
Binocular stereo vision is an important branch of computer vision. It simulates the human visual system, perceiving the three-dimensional spatial information of an object through the principle of parallax and reconstructing the object's shape and position in the scene. Accurately detecting and locating the spatial position of a hand in video has long been a hot research problem in this field, with high application value in virtual reality, augmented reality, motion sensing games, human-computer interaction, three-dimensional measurement, and related areas.
However, current acquisition of three-dimensional hand coordinates is limited to individual key points, such as the fingertips and the palm center, which is not enough to reconstruct the motion posture of the whole hand and its location in space. Moreover, most existing schemes segment the hand based on skin color or edge detection and then search the boundary for fingertips; they are easily affected by the background and ambient light, which reduces algorithm robustness.
Meanwhile, most feature extraction schemes used by existing 3D coordinate recovery techniques rely on the SIFT algorithm. Although SIFT has good stability and invariance, it extracts features poorly from targets with smooth edges and performs badly in the subsequent coarse and fine matching stages, so it is not the best solution for extracting hand key points.
Although many full-hand posture estimation algorithms based on depth images have appeared in recent years, the fingertip area is small and moves quickly, so the finger regions of the generated depth images are of poor quality, detection precision is low, and the computed three-dimensional coordinates carry larger errors.
Disclosure of Invention
To address the defects of the prior art, the invention provides a binocular vision-based method for acquiring the spatial coordinates of hand key points. It acquires not only the three-dimensional coordinates of the fingertips and palm center but those of all 21 key points covering every joint of the hand, so that the shape and motion trajectory of the hand can be perceived, while avoiding the errors introduced by the depth imaging process in depth image-based detection methods, making hand key point localization more accurate.
In order to solve the technical problems, the invention provides the following technical scheme:
a binocular vision-based hand key point space coordinate acquisition method comprises the following steps:
step 1: the method comprises the steps of horizontally and fixedly placing a binocular camera, calibrating a left camera and a right camera of the binocular camera, establishing coordinate system conversion models, respectively obtaining an internal reference matrix, a distortion coefficient and posture parameters between the two cameras, and then establishing a camera imaging model.
Step 2: and preprocessing the synchronous video shot by the binocular camera.
And step 3: and respectively processing the preprocessed left and right eye videos by using a machine learning production line, deducing 21 3D hand key points in each frame of image, and obtaining pixel coordinates of the key points in the left and right eye camera shooting images.
And 4, step 4: and (3) calculating the space coordinates of each key point in a world coordinate system based on a least square method according to the camera parameters obtained in the step (1).
Further, the step 1 comprises:
step 1-1: constructing a conversion model from a world coordinate system to a camera coordinate system: xc=rXw+ t, that is:
wherein XcRepresenting the camera coordinate system, XwRepresenting the world coordinate system, r is a 3 × 3 rotation matrix and t is a 3 × 1 translation vector. The rotation matrix r is commonly controlled by X, Y, Z components in three directions, so that with three degrees of freedom rWhich are the sum of the effects of rotation about three axes X, Y, Z, respectively.
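As an illustrative sketch (not part of the claimed method), the composition of r from rotations about the X, Y and Z axes and the transform X_c = rX_w + t can be written in NumPy as follows; the function names are hypothetical:

```python
import numpy as np

def rotation_matrix(rx, ry, rz):
    """Compose a 3x3 rotation from rotations about the X, Y, Z axes (radians)."""
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(rx), -np.sin(rx)],
                   [0, np.sin(rx),  np.cos(rx)]])
    Ry = np.array([[ np.cos(ry), 0, np.sin(ry)],
                   [0, 1, 0],
                   [-np.sin(ry), 0, np.cos(ry)]])
    Rz = np.array([[np.cos(rz), -np.sin(rz), 0],
                   [np.sin(rz),  np.cos(rz), 0],
                   [0, 0, 1]])
    return Rz @ Ry @ Rx          # three degrees of freedom, one per axis

def world_to_camera(Xw, r, t):
    """X_c = r X_w + t."""
    return r @ Xw + t
```

Any such composition yields an orthonormal matrix with determinant 1, which is what makes r a valid rotation.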
Step 1-2: constructing a conversion model from a camera coordinate system to an image coordinate system, namely a pinhole imaging model
Wherein (x, y) is the coordinate in the image coordinate system, (x)c,yc,zc) Is the coordinate in the camera coordinate system, and f is the camera focal length. Wherein z iscCan be obtained by triangulation, i.e.B is the distance between the optical centers of the left and right eye cameras, XL、XRThe abscissa of the pixel point corresponding to the left and right eye images.
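A minimal numeric sketch of the triangulation formula z_c = f·B / (X_L − X_R), assuming the focal length is expressed in pixel units (the function name is hypothetical):

```python
def depth_from_disparity(f_px, baseline, x_left, x_right):
    """z_c = f * B / (X_L - X_R): depth from the horizontal pixel disparity."""
    disparity = x_left - x_right
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    return f_px * baseline / disparity

# e.g. f = 800 px, baseline B = 0.06 m, disparity 20 px -> depth 2.4 m
```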
Step 1-3: constructing a conversion model from an image coordinate system to a pixel coordinate system:
Wherein (u, v) is pixel coordinate system coordinate, (x, y) is image coordinate system coordinate, dx、dyDenotes the physical size of each pixel on the horizontal axis x and the vertical axis y, respectively, (u)0,v0) Is the coordinate of the origin of the image coordinate system (i.e. the intersection of the camera optical axis and the image plane) in the pixel coordinate system.
Step 1-4: and (3) combining the models in the steps 1-1, 1-2 and 1-3 to obtain a conversion model from a world coordinate system to a pixel coordinate system, wherein the conversion model comprises the following steps:
in particular, it is possible to use, for example,is a reference matrix in the camera, and the reference matrix is a reference matrix in the camera,f/dx、f/dyrespectively representing the focal length in units of the actual physical size of each pixel on the horizontal axis x and the vertical axis y,is a camera external reference matrix, where r3×3As a rotation matrix, t3×1For translation vectors, the world coordinate system to pixel coordinate system transformation matrix, i.e., the projection matrix P of the camera, is
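The composition of the projection matrix and its application to a world point can be sketched in NumPy as follows (function names are hypothetical; skew is assumed to be zero):

```python
import numpy as np

def intrinsic_matrix(f, dx, dy, u0, v0):
    """K with the focal length expressed in pixel units f/dx, f/dy."""
    return np.array([[f / dx, 0.0, u0],
                     [0.0, f / dy, v0],
                     [0.0, 0.0, 1.0]])

def projection_matrix(K, r, t):
    """P = K [r | t], the world-to-pixel transform of step 1-4."""
    return K @ np.hstack([r, t.reshape(3, 1)])

def project(P, Xw):
    """Apply z_c (u, v, 1)^T = P (x_w, y_w, z_w, 1)^T and dehomogenize."""
    h = P @ np.append(Xw, 1.0)
    return h[:2] / h[2]
```

For instance, with f = 4 mm and 5 µm pixels, f/d_x = 800 pixels, and a point on the optical axis projects to the principal point (u_0, v_0).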
Step 1-5: and (3) carrying out Taylor series expansion around the principal point (namely the central point of the image) to construct a lens distortion model, taking the first several coefficients to obtain a camera distortion coefficient, and calculating a 3 multiplied by 3 rotation matrix R and a 3 multiplied by 1 translation vector T of the coordinate system of the right eye camera relative to the coordinate system of the left eye camera.
Further, the step 2 comprises:
step 2-1: and cutting the synchronous video shot by the binocular camera to respectively obtain the video shot by the left eye camera and the video shot by the right eye camera.
Step 2-2: and (3) distortion correction is respectively carried out on the videos shot by the two cameras frame by using the distortion coefficients of the two cameras obtained in the step (1-5), so that the imaging process of the videos accords with the pinhole imaging model.
Further, the step 3 comprises:
step 3-1: the entire image is detected using the palm detection model and returned to the hand region bounding box. The palm detection part adopts a BlazePalm model, the model adopts a coding-decoding feature extractor similar to FPN (feature pyramid), and the image of each scale is subjected to feature extraction to generate multi-scale feature representation, so that feature graphs of all levels have stronger semantic information and higher resolution, and the problem of scale change caused by distance change in the palm detection process can be well solved. While Focal local (Loss of focus) is correspondingly adopted to solve the problem of a large number of anchor points due to multi-scale.
In order to improve the detection efficiency, the palm detector is used for detecting only the palm rather than the whole hand, because the hand lacks a high-contrast characteristic region, reliable hand detection is difficult to realize only by visual characteristics, and compared with the detection of the hand with joints and fingers, the palm detector only needs to detect the boundary frame of a fixed object such as the palm and the fist, which obviously needs much simpler task.
Step 3-2: and (3) positioning 21 3D key point coordinates in the hand region detected in the step (3-1) by predicting a Gaussian heatmap by using a hand key part detection model.
The hand key part detection model uses a CNN (convolutional neural network) to predict a Gaussian heatmap for the key points; argmax is then applied to each heatmap to find the index of its peak, giving the coordinates of each key point. The model regresses the 21 key points of the hand: the predicted output feature map has 21 channels, each channel being the heatmap of one key point. The loss function is the Euclidean distance loss, that is:

Loss = Σ_{i=1}^{21} ‖R_i − P_i‖₂

where R_i is the annotated real coordinate of key point i and P_i is the coordinate predicted by the model.
Further, the step 4 comprises:
According to the optical axis convergence model, the coordinate conversion from the world coordinate system to the pixel coordinate system for the left and right cameras is:

z_c1 (u_1, v_1, 1)^T = P_l (x_w, y_w, z_w, 1)^T
z_c2 (u_2, v_2, 1)^T = P_r (x_w, y_w, z_w, 1)^T

where P_l is the projection matrix of the left camera, P_r is the projection matrix of the right camera, (u_1, v_1) and (u_2, v_2) are the pixel coordinates of a key point in the left and right images, and z_c1, z_c2 are the z coordinates of the key point in the left and right camera coordinate systems, obtainable by step 1-2. Specifically, the rotation matrix of the left camera is set to the 3×3 identity matrix and its translation vector to the 3×1 zero vector, while the rotation matrix of the right camera is set to the matrix R obtained in step 1-5 and its translation vector to the vector T obtained in step 1-5; the origin of the world coordinate system is therefore the optical center of the left camera.
Eliminating z_c1 and z_c2 yields the overdetermined linear system:

(u_1 P_l^(3) − P_l^(1)) · X_w = 0
(v_1 P_l^(3) − P_l^(2)) · X_w = 0
(u_2 P_r^(3) − P_r^(1)) · X_w = 0
(v_2 P_r^(3) − P_r^(2)) · X_w = 0

where P^(i) denotes the i-th row of a projection matrix and X_w = (x_w, y_w, z_w, 1)^T. Solving this system by the least squares method gives the real 3D spatial coordinates (x_w, y_w, z_w) of the hand key point.
According to the above technical scheme, the binocular vision-based hand key point spatial coordinate acquisition method has high recognition precision. The invention uses a binocular camera to simulate human eyes, establishes the coordinate system conversion models, and obtains the pixel coordinates of the same target point in the two cameras, which differs from the traditional scheme of extracting image features with the SIFT algorithm.
The invention has the advantages and beneficial effects that:
1. The technical scheme processes the images collected by the left and right cameras in a machine learning pipeline to obtain the pixel coordinates of the hand key points in every frame. Because feature points are not extracted with the traditional SIFT algorithm, the captured key points are more accurate, improving the precision of the three-dimensional localization of the key points.
2. After the pixel coordinates of the key points in the left and right images are obtained, their three-dimensional coordinates are solved by the least squares method, making the final result more accurate.
3. Compared with prior schemes that detect only the fingertips and palm center, the detected key points are more comprehensive: all 21 key points, including the fingertips, palm center and joints, are covered, so the dynamic reconstruction of the fingertips is more accurate.
Drawings
In order to more clearly illustrate the embodiments or technical solutions in the prior art of the present invention, the drawings needed to be used in the description of the embodiments or prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic flow chart of a binocular vision-based hand key point space coordinate acquisition method of the present invention.
Fig. 2 is a schematic flow chart of coordinate system conversion in an embodiment of the present invention.
Fig. 3 is a schematic diagram of 21 hand key points in an embodiment of the present invention.
Fig. 4 is a flow chart of a step 3 machine learning pipeline in an embodiment of the present invention.
Fig. 5 is a schematic diagram of an optical axis convergence model in an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
The method needs no specific operating environment; the hardware requires only a computer and a binocular camera. Fig. 1 is a flow chart of the steps of a preferred embodiment of the binocular vision-based hand key point spatial coordinate acquisition method. The binocular camera is operated to shoot, and the collected video stream is transmitted to the computer, where it is preprocessed, including left/right eye image separation and image correction. A machine learning model then performs key point detection on the left and right video frames to obtain their coordinates, and finally the three-dimensional coordinates of the hand key points are recovered by the least squares method. The method is described in detail below with reference to fig. 1:
step 1: the method comprises the steps of horizontally and fixedly placing a binocular camera, calibrating a left camera and a right camera of the binocular camera, establishing coordinate system conversion models, respectively obtaining an internal reference matrix, a distortion coefficient and posture parameters between the two cameras as shown in figure 2, and establishing a camera imaging geometric model.
Step 1-1: fix two mesh cameras, ensure the straight line level at two camera light centers place, and two cameras homoenergetic shoot complete hand.
Step 1-2: and carrying out binocular calibration on the left and right eye cameras. Firstly, a printed calibration board needs to be prepared, wherein the calibration board in the embodiment is a 7 × 10 checkerboard formed by alternately arranging black and white squares, the side length of each square is 25mm, and the intersection point of the black and white squares is taken as a characteristic point, and the total number of the characteristic points is 54.
Step 1-3: the calibration plate is shot by using the binocular camera from different directions, in order to enable a calibration result to be more accurate, the calibration plate needs to be shot from 10 or more different directions, and the calibration plate needs to be arranged in a shooting area of the camera all the time in the process to obtain left and right target images of the calibration plate.
Step 1-4: and (3) respectively extracting checkerboard feature points in the left and right eye images, wherein the origin of an image coordinate system is the intersection point of the camera imaging surface and the optical axis of the camera imaging surface. And respectively matching characteristic points of left and right eye chessboard images in the same direction by using an epipolar constraint principle, solving a homography matrix by using a Levenberg-Marquardt algorithm, further solving internal and external parameters of left and right eye cameras, acquiring attitude parameters between the cameras, and writing the obtained parameters into a system. The intrinsic parameters are intrinsic parameters determined by the internal optical and geometric properties of the camera, including the actual size (d) of the pixelx,dy) Principal point pixel coordinate (u)0,v0) Focal length f, coordinate axis tilt coefficient s, distortion coefficient (k)1,k2,k3,p1,p2) The extrinsic parameters are parameters representing relative position and orientation information between the pixel coordinate system and the world coordinate system, and include a rotation matrix R, a translation vector T, and a rotation matrix R and a translation vector T of the right-eye camera coordinate system relative to the left-eye camera coordinate system.
Further, the epipolar constraint principle of step 1-4 states that the projection of any point in space onto an image plane must lie on the epipolar plane formed by that point and the optical centers of the two cameras, so for a feature point in one image, the matching point in the other view must lie on the corresponding epipolar line. The epipolar constraint reduces feature matching from a two-dimensional search to a one-dimensional search, greatly increasing computation speed and reducing mismatches. The homography matrix describes the mapping between the world coordinate system and the pixel coordinate system, i.e. the projection matrix of the camera. The Levenberg-Marquardt (LM) algorithm is an optimization algorithm used to obtain the optimal solution of the homography matrix when the computed feature point pairs are noisy or even mismatched. The pose parameters between the cameras comprise the rotation matrix R and translation vector T between the cameras.
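The epipolar constraint can be checked numerically. A sketch under the usual convention X_r = R·X_l + T (names hypothetical): for matched pixel points x_l, x_r in homogeneous form, x_r^T F x_l = 0 with F = K_r^{-T} [T]_× R K_l^{-1}:

```python
import numpy as np

def skew(v):
    """Cross-product matrix [v]x, so that skew(v) @ w == np.cross(v, w)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def fundamental_matrix(K_left, K_right, R, T):
    """F = K_r^{-T} [T]x R K_l^{-1}; x_r^T F x_l = 0 for matched pixel points."""
    E = skew(T) @ R                                   # essential matrix
    return np.linalg.inv(K_right).T @ E @ np.linalg.inv(K_left)
```

The epipolar line of x_l in the right image is F x_l, which is what restricts matching to a one-dimensional search.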
Step 2: and preprocessing the synchronous video shot by the binocular camera.
Step 2-1: after calibration is finished, a binocular camera is used for shooting hand pictures, the collected images are 720 multiplied by 2560 digital images of RGB color space, and the hands need to be always in the shooting range of the camera in the shooting process. Since the binocular camera used in this embodiment is synchronous shooting, the shot synchronous video needs to be divided to obtain the left eye camera video and the right eye camera video, respectively, and the resolution of the divided videos is 720 × 1280.
Step 2-2: and (3) distortion correction is carried out on the videos shot by the two cameras by using the distortion coefficients of the two cameras obtained in the step (1) to step (4), so that the imaging process of the videos accords with a pinhole imaging model, and the correction formula is as follows:
x0=x(1+k1r2+k2r4+k3r6)
and (3) correcting radial distortion: y is0=y(1+k1r2+k2r4+k3r6)
x0=2p1xy+p2(r2+2x2)+1
Tangential distortion correction: y is0=p2(r2+2y2)+2p2xy+1
In the formula (x)0,y0) Is the original position of the distortion point in the image plane, (x, y) is the new position after distortion correction, r2=x2+y2,k1、k2、k3As radial distortion coefficient, p1、p2Is the tangential distortion coefficient.
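A sketch of the forward distortion model (mapping corrected normalized coordinates to their distorted positions, combining the radial and tangential terms); the function name is hypothetical:

```python
def distort(x, y, k1, k2, k3, p1, p2):
    """Apply radial + tangential distortion to corrected coordinates (x, y)."""
    r2 = x * x + y * y
    radial = 1.0 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3
    x0 = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    y0 = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    return x0, y0
```

Distortion correction inverts this mapping, typically numerically (e.g. by fixed-point iteration), since no closed-form inverse exists in general.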
And step 3: and (3) respectively processing the preprocessed left and right eye videos by using a machine learning production line, deducing 21 3D key points in each frame of image, wherein a schematic diagram of the key points is shown in FIG. 3, and obtaining pixel coordinates of the key points in the left and right eye images. In particular, the machine learning pipeline constructs machine learning tasks as data flow pipelines, which can effectively manage computing resources to achieve low latency performance. In the step, the machine learning production line mainly comprises two models, namely a palm detection model and a hand key part detection model, the partial flow is shown in fig. 4, and the specific steps are as follows:
in order to improve the processing efficiency, after the preprocessed left and right eye video streams are obtained, the left and right eye video streams are processed in parallel by using multiple threads, specifically:
step 3-1: the palm detector is used to detect the ROI (i.e., the hand region) in the first frame image of the video stream and return to the bounding box.
Step 3-2: and (4) cutting the image of the ROI, accurately positioning key points of the cut image by using a hand key part detection model, and outputting coordinates of the key points and confidence coefficients of hand existence and reasonable alignment in the cut image.
To improve detection efficiency and reduce computation time, the palm detector is not run on subsequent frames; instead, the hand region in the current frame is inferred from the hand key points computed in the previous frame, so that the palm detector is not invoked on every frame. Only when the confidence output by the hand key part detection model falls below a set threshold, or the hand is lost, is the palm detection model reapplied to the whole frame. In this embodiment the threshold is set to 0.8.
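The frame-to-frame tracking logic above can be sketched as follows; the detector and landmark callables are stand-ins for the actual models, and the helper names are hypothetical:

```python
def bbox_from_keypoints(keypoints, margin=0.0):
    """Axis-aligned box around the key points, used as the next frame's ROI."""
    xs = [p[0] for p in keypoints]
    ys = [p[1] for p in keypoints]
    return (min(xs) - margin, min(ys) - margin, max(xs) + margin, max(ys) + margin)

def track_hand(frames, palm_detector, landmark_model, conf_threshold=0.8):
    """Run the palm detector only on the first frame or after tracking is lost."""
    roi, results = None, []
    for frame in frames:
        if roi is None:
            roi = palm_detector(frame)        # full-frame palm detection
        if roi is None:                       # still no hand found
            results.append(None)
            continue
        keypoints, confidence = landmark_model(frame, roi)
        if confidence < conf_threshold:       # hand lost: re-detect next frame
            roi = None
            results.append(None)
        else:                                 # infer next ROI from the key points
            roi = bbox_from_keypoints(keypoints)
            results.append(keypoints)
    return results
```

With stable tracking, the expensive full-frame detector runs once and the landmark model alone carries the remaining frames.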
When performing ROI detection, this embodiment uses the palm detector to detect the palm rather than the entire hand region, because hand detection is a more complex task than palm detection: it must handle a wide range of hand sizes, requiring a larger detection range, and the hand lacks high-contrast feature regions, making reliable hand detection difficult from visual features alone. In contrast to detecting a hand with joints and fingers, the palm detector only needs to detect the bounding box of a rigid object such as the palm or fist, which is clearly much simpler. This step greatly reduces the need for data augmentation such as rotation, translation and scaling of the image, letting the network devote more capacity to key point localization accuracy.
To improve palm detection accuracy, palm detection in this embodiment adopts the BlazePalm model, which has a large scale range and can recognize many different palm sizes. With the NMS (non-maximum suppression) algorithm, the palm region can be detected well even when the hands occlude each other, and the palm can be located accurately through recognition of the arm, torso or personal features, compensating for the hand's lack of high-contrast texture features.
The hand key part detection model in this embodiment needs enough human hand samples for training; the model uses a CNN to predict a Gaussian heatmap for the key points, so each label is the Gaussian map generated from a key point. The model regresses the 21 hand key points: the predicted output feature map has 21 channels, each channel being the heatmap of one key point, and applying argmax to each channel gives the integer coordinates of that key point. In particular, to reduce the deviation of the prediction result, the loss function of the Gaussian heatmap regression in this embodiment is not the mainstream MSE but the Euclidean distance loss, that is:

Loss = Σ_{i=1}^{21} ‖R_i − P_i‖₂

where R_i is the annotated real coordinate of key point i and P_i is the coordinate predicted by the model. This embodiment learns the heatmap indirectly by optimizing the loss of the predicted coordinates output by the whole model; i.e. the loss is computed from predicted and real key points, and the heatmap is learned spontaneously by the network. Compared with directly regressing coordinate points through fully connected layers, predicting a Gaussian heatmap has stronger spatial generalization capability and higher coordinate accuracy.
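The heatmap-to-coordinate step (per-channel argmax) can be sketched as follows (function name hypothetical):

```python
import numpy as np

def keypoints_from_heatmaps(heatmaps):
    """heatmaps: array of shape (21, H, W); returns one (x, y) peak per channel."""
    coords = []
    for channel in heatmaps:
        y, x = np.unravel_index(np.argmax(channel), channel.shape)
        coords.append((int(x), int(y)))
    return coords
```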
Step 4: compute the spatial coordinates of each key point in the world coordinate system by the least squares method, from the camera parameters obtained in step 1 and the pixel coordinates of the hand key points in the left and right images obtained in step 3.
The optical axis convergence model is shown in fig. 5:
From this model, for the left and right cameras respectively:

z_c1 (u_1, v_1, 1)^T = P_l (x_w, y_w, z_w, 1)^T
z_c2 (u_2, v_2, 1)^T = P_r (x_w, y_w, z_w, 1)^T

where P_l is the projection matrix of the left camera, P_r is the projection matrix of the right camera, and (u_1, v_1), (u_2, v_2) are the pixel coordinates of a key point in the left and right images. Specifically, the rotation matrix of the left camera is set to the 3×3 identity matrix and its translation vector to the 3×1 zero vector; the rotation matrix of the right camera is set to the rotation matrix R obtained in step 1-4 and its translation vector to the vector T obtained in step 1-4; the origin of the world coordinate system is therefore the optical center of the left camera.
Eliminating z_c1 and z_c2 yields:

(u_1 P_l^(3) − P_l^(1)) · X_w = 0
(v_1 P_l^(3) − P_l^(2)) · X_w = 0
(u_2 P_r^(3) − P_r^(1)) · X_w = 0
(v_2 P_r^(3) − P_r^(2)) · X_w = 0

where P^(i) denotes the i-th row of a projection matrix and X_w = (x_w, y_w, z_w, 1)^T. Solving this system by the least squares method gives the real 3D spatial coordinates (x_w, y_w, z_w) of the hand key point.
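The least-squares recovery can be sketched with the standard DLT formulation, solving the homogeneous system by SVD (function names hypothetical):

```python
import numpy as np

def triangulate(P_left, P_right, uv_left, uv_right):
    """Least-squares 3D point from two projection matrices and pixel coords."""
    u1, v1 = uv_left
    u2, v2 = uv_right
    A = np.stack([u1 * P_left[2] - P_left[0],
                  v1 * P_left[2] - P_left[1],
                  u2 * P_right[2] - P_right[0],
                  v2 * P_right[2] - P_right[1]])
    _, _, Vt = np.linalg.svd(A)        # least-squares solution of A X = 0
    X = Vt[-1]
    return X[:3] / X[3]                # dehomogenize to (x_w, y_w, z_w)
```

With noise-free matches the system has an exact solution; with real, noisy key point coordinates the SVD returns the least-squares optimum.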
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solution of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the exemplary embodiments of the present invention, and all such modifications and alterations should therefore fall within the scope of the appended claims.
Claims (7)
1. A binocular vision-based hand key point space coordinate acquisition method, characterized by comprising the following steps:
step 1: horizontally and fixedly placing a binocular camera, calibrating a left camera and a right camera of the binocular camera, establishing each coordinate system conversion model, respectively obtaining an internal reference matrix, a distortion coefficient and posture parameters between the two cameras, namely establishing a camera imaging model;
step 2: preprocessing a synchronous video shot by a binocular camera;
step 3: process the preprocessed left- and right-eye videos separately with a machine learning pipeline, infer the 21 3D hand key points in each frame of image, and obtain the pixel coordinates of the key points in the images captured by the left- and right-eye cameras;
step 4: calculate the space coordinates of each key point in a world coordinate system by the least square method, based on the camera parameters obtained in step 1 and the pixel coordinates of the hand key points obtained in step 3.
2. The binocular vision-based hand key point space coordinate acquisition method according to claim 1, characterized in that step 1 comprises:
step 1-1: construct the conversion model from the world coordinate system to the camera coordinate system, X_c = r·X_w + t, that is:

[x_c, y_c, z_c]^T = r·[x_w, y_w, z_w]^T + t

where X_c denotes camera coordinate system coordinates, X_w denotes world coordinate system coordinates, r is a 3 × 3 rotation matrix, and t is a 3 × 1 translation vector; the rotation matrix r is determined by rotations about the X, Y and Z axes, so it has three degrees of freedom, r being the composition of the rotations about the three axes respectively;
step 1-2: constructing a conversion model from a camera coordinate system to an image coordinate system, namely a pinhole imaging model
Wherein (x, y) is the coordinate in the image coordinate system, (x)c,yc,zc) Is the coordinate in the camera coordinate system, f is the camera focal length; wherein z iscObtained by triangulation, i.e.B is the distance between the optical centers of the left and right eye cameras, XL、XRThe horizontal coordinates of the corresponding pixel points of the left eye image and the right eye image are obtained;
step 1-3: construct the conversion model from the image coordinate system to the pixel coordinate system:

u = x/d_x + u_0,  v = y/d_y + v_0

where (u, v) are pixel coordinate system coordinates, (x, y) are image coordinate system coordinates, d_x and d_y denote the physical size of each pixel along the horizontal axis x and the vertical axis y respectively, and (u_0, v_0) are the coordinates of the origin of the image coordinate system in the pixel coordinate system;
step 1-4: combine the models of steps 1-1, 1-2 and 1-3 to obtain the conversion model from the world coordinate system to the pixel coordinate system:

z_c·[u, v, 1]^T = K·[r_3×3 t_3×1]·[x_w, y_w, z_w, 1]^T

where

K = [ f/d_x, 0, u_0 ; 0, f/d_y, v_0 ; 0, 0, 1 ]

is the camera internal reference matrix, f/d_x and f/d_y respectively express the focal length in units of the actual physical size of a pixel along the horizontal axis x and the vertical axis y, and [r_3×3 t_3×1] is the camera external reference matrix, where r_3×3 is the rotation matrix and t_3×1 the translation vector; the transformation matrix from the world coordinate system to the pixel coordinate system, i.e. the projection matrix P of the camera, is:

P = K·[r_3×3 t_3×1]
step 1-5: perform a Taylor series expansion around the principal point to construct a lens distortion model, retain the first few coefficients to obtain the camera distortion coefficients, and calculate the 3 × 3 rotation matrix R and the 3 × 1 translation vector T of the right-eye camera coordinate system relative to the left-eye camera coordinate system.
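The chain of conversions in steps 1-1 to 1-4 can be sketched as follows (a minimal NumPy illustration of building K and P and projecting a world point; the numeric values in the usage note are assumed for demonstration and distortion is omitted):

```python
import numpy as np

def intrinsic(f, dx, dy, u0, v0):
    """Internal reference matrix K of steps 1-3/1-4 (focal length f,
    pixel sizes dx, dy, principal point (u0, v0))."""
    return np.array([[f / dx, 0.0, u0],
                     [0.0, f / dy, v0],
                     [0.0, 0.0, 1.0]])

def projection(K, r, t):
    """Projection matrix P = K [r | t] of step 1-4 (3x4)."""
    return K @ np.hstack([r, t.reshape(3, 1)])

def world_to_pixel(P, Xw):
    """Project a world point (xw, yw, zw) to pixel coordinates (u, v)."""
    h = P @ np.append(Xw, 1.0)   # homogeneous pixel coordinates, scaled by zc
    return h[0] / h[2], h[1] / h[2]
```

For the left-eye camera (r the identity, t zero, so the world origin is its optical center), a point on the optical axis at (0, 0, 1) projects exactly to the principal point (u_0, v_0).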
3. The binocular vision-based hand key point space coordinate acquisition method according to claim 1, characterized in that step 2 comprises:
step 2-1: split the synchronized video captured by the binocular camera to obtain the video captured by the left-eye camera and the video captured by the right-eye camera respectively;
step 2-2: perform distortion correction frame by frame on the videos captured by the two cameras, using the distortion coefficients of the two cameras obtained in step 1-5, so that the imaging process of the videos conforms to the pinhole imaging model.
4. The binocular vision-based hand key point space coordinate acquisition method according to claim 1, characterized in that step 3 comprises:
step 3-1: detect the whole image with a palm detection model and return a hand region bounding box. Palm detection adopts the BlazePalm model, which uses an encoder–decoder feature extractor similar to a feature pyramid network (FPN): features are extracted at each scale to produce a multi-scale representation, so that feature maps at all levels carry both strong semantic information and high resolution, handling well the scale variation caused by distance changes during palm detection; meanwhile, Focal Loss is adopted to cope with the large number of anchors generated by the multiple scales;
step 3-2: locate the coordinates of the 21 3D key points in the hand region detected in step 3-1 by predicting Gaussian heatmaps with a hand key part detection model.
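Step 3-2 predicts key points inside a cropped hand region, while the triangulation of step 4 needs full-image pixel coordinates, so the local coordinates must be mapped back. A small sketch of that bookkeeping (the bounding-box format — top-left corner plus width/height in pixels, with the landmark model run on a square resized crop — is an assumption, not specified by this patent):

```python
def crop_to_image_coords(keypoints, box, crop_size=256):
    """Map key points predicted on a resized hand crop back to
    pixel coordinates in the original image.

    keypoints : list of (u, v) in the crop_size x crop_size crop.
    box       : (x, y, w, h) hand bounding box in the original image.
    """
    x, y, w, h = box
    # scale from crop pixels to box pixels, then shift by the box corner
    return [(x + u * w / crop_size, y + v * h / crop_size)
            for u, v in keypoints]
```

The same mapping is applied independently to the left- and right-eye detections before the coordinates enter the least-squares solve.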
5. The binocular vision-based hand key point space coordinate acquisition method according to claim 1, characterized in that, in step 3-1, the palm detector detects only the palm rather than the entire hand: because the hand lacks high-contrast feature regions, reliable hand detection from visual features alone is difficult, whereas the palm detector only needs to detect the bounding box of the palm and fist, which are rigid objects, rather than a hand with articulated joints and fingers.
6. The binocular vision-based hand key point space coordinate acquisition method according to claim 1, characterized in that, in step 3-2, the hand key part detection model predicts a Gaussian heatmap for each key point using a CNN (convolutional neural network), then applies argmax to each heatmap to find the index of its peak, giving the coordinates of each key point; the model regresses the 21 hand key points, so the predicted output feature map has 21 channels, each channel being a heatmap predicting one key point; the loss function is the Euclidean distance loss, namely:

Loss = Σ_{i=1}^{21} ||R_i − P_i||_2

where R_i is the annotated real coordinate of the i-th key point and P_i is the coordinate predicted by the model.
7. The binocular vision-based hand key point space coordinate acquisition method according to claim 1, characterized in that step 4 comprises:
obtain the coordinate conversion formula from the world coordinate system to the pixel coordinate system according to the optical axis convergence model; for the left and right cameras, respectively:

z_c1·[u_1, v_1, 1]^T = P_1·[x_w, y_w, z_w, 1]^T
z_c2·[u_2, v_2, 1]^T = P_2·[x_w, y_w, z_w, 1]^T

where P_1 is the projection matrix of the left-eye camera, P_2 is the projection matrix of the right-eye camera, (u_1, v_1), (u_2, v_2) are the pixel coordinates of the key point on the left- and right-eye images, and z_c1, z_c2 are the z coordinates of the key point in the left- and right-eye camera coordinate systems, obtained through step 1-2; the rotation matrix of the left-eye camera is set to the 3 × 3 identity matrix and its translation vector to the 3 × 1 zero vector, the rotation matrix of the right-eye camera is set to the matrix R obtained in step 1-5 and its translation vector to the vector T obtained in step 1-5, and the origin of the world coordinate system is set to the optical center of the left-eye camera;
the formula is as follows:
obtaining the real 3D space coordinate (x) of the key point of the hand by solving the above formula by adopting a least square methodw,yw,zw)。
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111230723.7A CN114119739A (en) | 2021-10-22 | 2021-10-22 | Binocular vision-based hand key point space coordinate acquisition method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114119739A true CN114119739A (en) | 2022-03-01 |
Family
ID=80376605
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111230723.7A Pending CN114119739A (en) | 2021-10-22 | 2021-10-22 | Binocular vision-based hand key point space coordinate acquisition method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114119739A (en) |
2021-10-22: application CN202111230723.7A filed in CN; status: active, pending
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112362034A (en) * | 2020-11-11 | 2021-02-12 | 上海电器科学研究所(集团)有限公司 | Solid engine multi-cylinder section butt joint guiding measurement algorithm based on binocular vision |
CN114820485A (en) * | 2022-04-15 | 2022-07-29 | 华南理工大学 | Method for measuring wave climbing height based on airborne image |
CN114820485B (en) * | 2022-04-15 | 2024-03-26 | 华南理工大学 | Method for measuring wave climbing based on airborne image |
CN114842091A (en) * | 2022-04-29 | 2022-08-02 | 广东工业大学 | Binocular egg size assembly line measuring method |
CN114842091B (en) * | 2022-04-29 | 2023-05-23 | 广东工业大学 | Binocular egg size assembly line measuring method |
CN114979611A (en) * | 2022-05-19 | 2022-08-30 | 国网智能科技股份有限公司 | Binocular sensing system and method |
CN114757822A (en) * | 2022-06-14 | 2022-07-15 | 之江实验室 | Binocular-based human body three-dimensional key point detection method and system |
CN117218320A (en) * | 2023-11-08 | 2023-12-12 | 济南大学 | Space labeling method based on mixed reality |
CN117218320B (en) * | 2023-11-08 | 2024-02-27 | 济南大学 | Space labeling method based on mixed reality |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114119739A (en) | Binocular vision-based hand key point space coordinate acquisition method | |
WO2022002150A1 (en) | Method and device for constructing visual point cloud map | |
Wang et al. | 360sd-net: 360 stereo depth estimation with learnable cost volume | |
US10109055B2 (en) | Multiple hypotheses segmentation-guided 3D object detection and pose estimation | |
CN106251399B (en) | A kind of outdoor scene three-dimensional rebuilding method and implementing device based on lsd-slam | |
CN107833181B (en) | Three-dimensional panoramic image generation method based on zoom stereo vision | |
CN103839277B (en) | A kind of mobile augmented reality register method of outdoor largescale natural scene | |
CN111968129A (en) | Instant positioning and map construction system and method with semantic perception | |
CN106485207B (en) | A kind of Fingertip Detection and system based on binocular vision image | |
CN107204010A (en) | A kind of monocular image depth estimation method and system | |
CN110555408B (en) | Single-camera real-time three-dimensional human body posture detection method based on self-adaptive mapping relation | |
CN109359514B (en) | DeskVR-oriented gesture tracking and recognition combined strategy method | |
CN115205489A (en) | Three-dimensional reconstruction method, system and device in large scene | |
CN103337094A (en) | Method for realizing three-dimensional reconstruction of movement by using binocular camera | |
CN109758756B (en) | Gymnastics video analysis method and system based on 3D camera | |
CN106155299B (en) | A kind of pair of smart machine carries out the method and device of gesture control | |
CN113706699A (en) | Data processing method and device, electronic equipment and computer readable storage medium | |
CN109272577B (en) | Kinect-based visual SLAM method | |
CN115272271A (en) | Pipeline defect detecting and positioning ranging system based on binocular stereo vision | |
CN113850865A (en) | Human body posture positioning method and system based on binocular vision and storage medium | |
CN113393439A (en) | Forging defect detection method based on deep learning | |
CN115035546B (en) | Three-dimensional human body posture detection method and device and electronic equipment | |
CN111582036A (en) | Cross-view-angle person identification method based on shape and posture under wearable device | |
CN115359127A (en) | Polarization camera array calibration method suitable for multilayer medium environment | |
CN117711066A (en) | Three-dimensional human body posture estimation method, device, equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||