CN114119739A - Binocular vision-based hand key point space coordinate acquisition method


Info

Publication number
CN114119739A
Authority
CN
China
Prior art keywords
camera
coordinate system
hand
model
key point
Legal status
Pending
Application number
CN202111230723.7A
Other languages
Chinese (zh)
Inventor
胡朕朕
李舒
王俊
Current Assignee
Hangzhou Innovation Research Institute of Beihang University
Original Assignee
Hangzhou Innovation Research Institute of Beihang University
Priority date
2021-10-22
Filing date
2021-10-22
Publication date
2022-03-01
Application filed by Hangzhou Innovation Research Institute of Beihang University
Priority to CN202111230723.7A
Publication of CN114119739A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/70 - Determining position or orientation of objects or cameras
    • G06T7/73 - Determining position or orientation of objects or cameras using feature-based methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 - Image enhancement or restoration
    • G06T5/80 - Geometric correction

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a binocular vision-based hand key point space coordinate acquisition method. First, the binocular camera is stereo-calibrated and the conversion models between the coordinate systems are established, i.e. the intrinsic and extrinsic parameters and distortion coefficients of each camera and the rotation and translation matrices between the two cameras are obtained. Second, the video shot by the binocular camera is preprocessed, including splitting and distortion correction. Then, each video frame is processed with a machine learning pipeline to obtain the pixel coordinates of the 21 hand key points. Finally, the real three-dimensional coordinates of the hand key points are computed by a least squares method based on the optical axis convergence model. By simulating the structure of the human eyes, the invention uses binocular visual information to accurately locate and recover the three-dimensional coordinates of 21 hand key points covering all joints, reconstructs the hand state more accurately, and provides accurate technical support for applied research on hand key point localization in human-computer interaction.

Description

Binocular vision-based hand key point space coordinate acquisition method
Technical Field
The invention relates to the field of computer vision, in particular to a binocular vision-based method for acquiring the three-dimensional coordinates of human hand key points, and also relates to the technical fields of digital image processing, spatial three-dimensional information acquisition and human-computer interaction.
Background
Binocular stereo vision is an important branch of computer vision: it simulates the human visual system, perceives the three-dimensional information of a scene using the principle of parallax, and reconstructs the shape and position of objects. Accurately detecting and locating the spatial position of a hand in video has long been a hot problem in this field, with high application value in virtual reality, augmented reality, motion-sensing games, human-computer interaction, three-dimensional measurement and other areas.

However, current acquisition of three-dimensional hand coordinates is limited to individual key points such as the fingertips and the palm centre, which is insufficient to reconstruct the motion posture of the whole hand and its location in space. Moreover, most existing schemes segment the hand by skin colour or edge detection and then search the boundary for fingertips; they are easily affected by the background and ambient light, which reduces the robustness of the algorithm.

Meanwhile, most feature extraction schemes used by existing techniques for recovering the 3D coordinates of spatial points are based on the SIFT algorithm. Although SIFT has good stability and invariance, it extracts features poorly from targets with smooth edges and performs badly in the subsequent coarse and fine matching stages, so it is not the best solution for extracting hand key points.

Although many full-hand posture estimation algorithms based on depth images have appeared in recent years, the fingertip area is small and moves quickly, so the finger regions of the generated depth images are of poor quality, detection accuracy is low, and the computed three-dimensional coordinates carry larger errors.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a binocular vision-based hand key point space coordinate acquisition method that acquires not only the three-dimensional coordinates of the fingertips and the palm centre but those of all 21 key points, including every joint, so as to perceive the shape and motion trajectory of the hand, while avoiding the errors introduced by the depth imaging process in depth image-based detection methods, making the localization of the hand key points more accurate.

In order to solve the technical problems, the invention provides the following technical scheme:

A binocular vision-based hand key point space coordinate acquisition method comprises the following steps:

Step 1: fix the binocular camera horizontally, calibrate its left and right cameras, establish the coordinate system conversion models, obtain each camera's intrinsic matrix and distortion coefficients and the pose parameters between the two cameras, and then establish the camera imaging model.

Step 2: preprocess the synchronized video shot by the binocular camera.

Step 3: process the preprocessed left and right videos with a machine learning pipeline, infer the 21 3D hand key points in each frame, and obtain the pixel coordinates of the key points in the left and right camera images.

Step 4: compute the spatial coordinates of each key point in the world coordinate system by the least squares method, using the camera parameters obtained in step 1.
Further, step 1 comprises:

Step 1-1: construct the conversion model from the world coordinate system to the camera coordinate system, $X_c = rX_w + t$, i.e. in homogeneous form:

$$\begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix} = r\begin{bmatrix} x_w \\ y_w \\ z_w \end{bmatrix} + t$$

where $X_c$ denotes a point in the camera coordinate system, $X_w$ a point in the world coordinate system, $r$ is a 3 × 3 rotation matrix and $t$ a 3 × 1 translation vector. The rotation matrix $r$ is jointly determined by components about the three directions X, Y and Z, so it has three degrees of freedom; $r$ is the composition of the rotations about the X, Y and Z axes.
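As an illustrative sketch (not part of the patent text), this world-to-camera conversion takes a few lines of NumPy; the composition order Rz·Ry·Rx and all numeric values below are assumptions:

```python
import numpy as np

def rotation_matrix(rx, ry, rz):
    """Compose a 3x3 rotation from rotations about the X, Y, Z axes (radians)."""
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(rx), -np.sin(rx)],
                   [0, np.sin(rx),  np.cos(rx)]])
    Ry = np.array([[ np.cos(ry), 0, np.sin(ry)],
                   [0, 1, 0],
                   [-np.sin(ry), 0, np.cos(ry)]])
    Rz = np.array([[np.cos(rz), -np.sin(rz), 0],
                   [np.sin(rz),  np.cos(rz), 0],
                   [0, 0, 1]])
    return Rz @ Ry @ Rx  # one common composition order (an assumption)

r = rotation_matrix(0.00, 0.10, 0.05)   # hypothetical camera orientation
t = np.array([0.0, 0.0, 0.5])           # hypothetical translation, metres
Xw = np.array([0.1, 0.2, 1.0])          # a world point
Xc = r @ Xw + t                         # the same point in camera coordinates
```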
Step 1-2: construct the conversion model from the camera coordinate system to the image coordinate system, i.e. the pinhole imaging model, in homogeneous form:

$$z_c\begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}\begin{bmatrix} x_c \\ y_c \\ z_c \\ 1 \end{bmatrix}$$

where $(x, y)$ are the coordinates in the image coordinate system, $(x_c, y_c, z_c)$ are the coordinates in the camera coordinate system, and $f$ is the camera focal length. $z_c$ can be obtained by triangulation, i.e.

$$z_c = \frac{fB}{X_L - X_R}$$

where $B$ is the distance between the optical centres of the left and right cameras, and $X_L$, $X_R$ are the abscissae of the corresponding pixel point in the left and right images.
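A minimal numeric sketch of this triangulation; the focal length, baseline and pixel abscissae below are assumed values:

```python
# Depth from disparity: z_c = f * B / (X_L - X_R). All values are hypothetical.
f = 1200.0                 # focal length expressed in pixels
B = 0.06                   # baseline between the optical centres, in metres
X_L, X_R = 652.0, 608.0    # abscissae of the matched point in the left/right images
z_c = f * B / (X_L - X_R)  # 1200 * 0.06 / 44 ≈ 1.64 m
```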
Step 1-3: construct the conversion model from the image coordinate system to the pixel coordinate system, in homogeneous form:

$$\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} 1/d_x & 0 & u_0 \\ 0 & 1/d_y & v_0 \\ 0 & 0 & 1 \end{bmatrix}\begin{bmatrix} x \\ y \\ 1 \end{bmatrix}$$

where $(u, v)$ are the pixel coordinate system coordinates, $(x, y)$ the image coordinate system coordinates, $d_x$ and $d_y$ denote the physical size of each pixel along the horizontal axis x and the vertical axis y respectively, and $(u_0, v_0)$ are the coordinates, in the pixel coordinate system, of the origin of the image coordinate system (i.e. the intersection of the camera optical axis with the image plane).
Step 1-4: combine the models of steps 1-1, 1-2 and 1-3 to obtain the conversion model from the world coordinate system to the pixel coordinate system:

$$z_c\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f/d_x & 0 & u_0 & 0 \\ 0 & f/d_y & v_0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}\begin{bmatrix} r_{3\times3} & t_{3\times1} \\ 0 & 1 \end{bmatrix}\begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix}$$

In particular,

$$K = \begin{bmatrix} f/d_x & 0 & u_0 & 0 \\ 0 & f/d_y & v_0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}$$

is the camera intrinsic matrix, where $f/d_x$ and $f/d_y$ express the focal length in units of the actual physical pixel size along the horizontal axis x and the vertical axis y respectively, and

$$\begin{bmatrix} r_{3\times3} & t_{3\times1} \\ 0 & 1 \end{bmatrix}$$

is the camera extrinsic matrix, where $r_{3\times3}$ is the rotation matrix and $t_{3\times1}$ the translation vector. The transformation matrix from the world coordinate system to the pixel coordinate system, i.e. the projection matrix $P$ of the camera, is

$$P = K\begin{bmatrix} r_{3\times3} & t_{3\times1} \\ 0 & 1 \end{bmatrix}$$
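A short sketch, under assumed parameter values, of assembling the intrinsic matrix, the extrinsic block and the projection matrix $P$ described above:

```python
import numpy as np

def projection_matrix(f, dx, dy, u0, v0, r, t):
    """Build P = K [r | t], mapping homogeneous world points to pixel coordinates."""
    K = np.array([[f / dx, 0.0,    u0],
                  [0.0,    f / dy, v0],
                  [0.0,    0.0,    1.0]])
    Rt = np.hstack([r, t.reshape(3, 1)])   # 3x4 extrinsic block [r | t]
    return K @ Rt                          # 3x4 projection matrix

# Hypothetical parameters: 4 mm lens, 3 um square pixels, principal point (640, 360).
P = projection_matrix(0.004, 3e-6, 3e-6, 640.0, 360.0, np.eye(3), np.zeros(3))
Xw = np.array([0.10, 0.05, 1.20, 1.0])     # homogeneous world point, metres
u, v, w = P @ Xw
print(u / w, v / w)                        # pixel coordinates of the projection
```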
Step 1-5: perform a Taylor series expansion around the principal point (i.e. the centre point of the image) to construct the lens distortion model, take the first few coefficients as the camera distortion coefficients, and compute the 3 × 3 rotation matrix R and the 3 × 1 translation vector T of the right camera coordinate system relative to the left camera coordinate system.
Further, step 2 comprises:

Step 2-1: split the synchronized video shot by the binocular camera to obtain the video shot by the left camera and the video shot by the right camera respectively.

Step 2-2: correct the videos of the two cameras frame by frame for distortion, using the distortion coefficients obtained in step 1-5, so that the imaging process conforms to the pinhole imaging model.
Further, step 3 comprises:

Step 3-1: detect the whole image with the palm detection model and return the hand region bounding box. Palm detection uses the BlazePalm model, whose encoder-decoder feature extractor, similar to a feature pyramid network (FPN), extracts features at each scale to produce a multi-scale feature representation, so that the feature maps at every level carry strong semantic information at high resolution; this handles well the scale changes caused by distance changes during palm detection. Focal Loss is adopted to cope with the large number of anchors produced by the multi-scale design.

To improve detection efficiency, the palm detector detects only the palm rather than the whole hand: the hand lacks high-contrast feature regions, so reliable hand detection is difficult to achieve from visual features alone, and compared with detecting a hand with joints and fingers, the palm detector only needs to detect the bounding box of a rigid object such as a palm or fist, which is obviously a much simpler task.
Step 3-2: locate the coordinates of the 21 3D key points in the hand region detected in step 3-1 by predicting Gaussian heat maps with the hand key part detection model.

The hand key part detection model uses a CNN (convolutional neural network) to predict a Gaussian heat map for each key point, then applies argmax to each heat map to find the index of its peak, which yields the coordinates of each key point. The model regresses the 21 key points of the hand, so the predicted output feature map has 21 channels, each channel being the heat map of one key point. The loss function is the Euclidean distance loss:

$$L = \sum_{i=1}^{21} \lVert R_i - P_i \rVert_2$$

where $R_i$ is the labelled ground-truth coordinate of key point $i$ and $P_i$ the coordinate predicted by the model.
Further, step 4 comprises:

Obtain the coordinate conversion formulas from the world coordinate system to the pixel coordinate system according to the optical axis convergence model; for the left and right cameras:

$$z_{c1}\begin{bmatrix} u_1 \\ v_1 \\ 1 \end{bmatrix} = P^{1}\begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix}, \qquad z_{c2}\begin{bmatrix} u_2 \\ v_2 \\ 1 \end{bmatrix} = P^{2}\begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix}$$

where $P^{1} = (p^{1}_{ij})$ is the projection matrix of the left camera, $P^{2} = (p^{2}_{ij})$ is the projection matrix of the right camera, $(u_1, v_1)$ and $(u_2, v_2)$ are the pixel coordinates of the key point in the left and right images, and $z_{c1}$, $z_{c2}$ are the z coordinates of the key point in the left and right camera coordinate systems, obtainable by step 1-2. Specifically, the rotation matrix of the left camera is set to the 3 × 3 identity matrix and its translation vector to the 3 × 1 zero vector, while the rotation matrix of the right camera is set to the matrix R and its translation vector to the vector T obtained in step 1-5; the origin of the world coordinate system is then the optical centre of the left camera.
Eliminating $z_{c1}$ and $z_{c2}$ yields four linear equations:

$$(u_1 p^{1}_{31} - p^{1}_{11})x_w + (u_1 p^{1}_{32} - p^{1}_{12})y_w + (u_1 p^{1}_{33} - p^{1}_{13})z_w = p^{1}_{14} - u_1 p^{1}_{34}$$

$$(v_1 p^{1}_{31} - p^{1}_{21})x_w + (v_1 p^{1}_{32} - p^{1}_{22})y_w + (v_1 p^{1}_{33} - p^{1}_{23})z_w = p^{1}_{24} - v_1 p^{1}_{34}$$

$$(u_2 p^{2}_{31} - p^{2}_{11})x_w + (u_2 p^{2}_{32} - p^{2}_{12})y_w + (u_2 p^{2}_{33} - p^{2}_{13})z_w = p^{2}_{14} - u_2 p^{2}_{34}$$

$$(v_2 p^{2}_{31} - p^{2}_{21})x_w + (v_2 p^{2}_{32} - p^{2}_{22})y_w + (v_2 p^{2}_{33} - p^{2}_{23})z_w = p^{2}_{24} - v_2 p^{2}_{34}$$

Solving this overdetermined system by the least squares method yields the real 3D spatial coordinates $(x_w, y_w, z_w)$ of the hand key point.
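A sketch of this least squares solution (NumPy); P1 and P2 are the 3 × 4 projection matrices from step 1, and uv1, uv2 the measured pixel coordinates:

```python
import numpy as np

def triangulate(P1, P2, uv1, uv2):
    """Recover (xw, yw, zw) from the four linear equations above by least squares."""
    rows = []
    for P, (u, v) in ((P1, uv1), (P2, uv2)):
        rows.append(u * P[2] - P[0])   # u*(p31 p32 p33 p34) - (p11 p12 p13 p14)
        rows.append(v * P[2] - P[1])
    A = np.asarray(rows)               # 4x4 system acting on (xw, yw, zw, 1)
    # A[:, :3] @ X = -A[:, 3] is exactly the four equations written out above.
    X, residuals, rank, _ = np.linalg.lstsq(A[:, :3], -A[:, 3], rcond=None)
    return X                           # (xw, yw, zw) in the world frame
```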
As can be seen from the above technical scheme, the binocular vision-based hand key point space coordinate acquisition method of the invention has high recognition accuracy. The invention uses a binocular camera to simulate the human eyes, establishes the coordinate system conversion models and obtains the pixel coordinates of the same target point in the two cameras, differing from the traditional scheme of obtaining image features with the SIFT algorithm.
The invention has the advantages and beneficial effects that:
1. according to the technical scheme, the machine learning model is adopted, the images acquired by the left and right cameras are processed in a pipeline mode, the pixel coordinates of the key points of the hands in each frame of image are acquired, the traditional SIFT algorithm is not adopted for feature point extraction, the captured key points are more accurate, and the accuracy of three-dimensional space coordinate positioning of the key points is improved.
2. According to the technical scheme, after the pixel coordinates of the key points on the left and right eye images are obtained, the three-dimensional space coordinates of the key points are solved by adopting a least square method, so that the final result is more accurate.
3. Compared with the prior technical scheme of only detecting the fingertips and the palms, the technical scheme of the invention has the advantages that the detected key points are more comprehensive, all 21 key points including the fingertips, the palms and the joint points are covered, and the dynamic reconstruction of the fingertips is more accurate.
Drawings
In order to illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the invention, and a person of ordinary skill in the art could obtain other drawings from them without creative effort.
Fig. 1 is a schematic flow chart of a binocular vision-based hand key point space coordinate acquisition method of the present invention.
Fig. 2 is a schematic flow chart of coordinate system conversion in an embodiment of the present invention.
Fig. 3 is a schematic diagram of 21 hand key points in an embodiment of the present invention.
Fig. 4 is a flow chart of the machine learning pipeline of step 3 in an embodiment of the present invention.
Fig. 5 is a schematic diagram of an optical axis convergence model in an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
The method requires no specific operating environment; the hardware only needs a computer and a binocular camera. Fig. 1 is a flow chart of the steps of a preferred embodiment of the binocular vision-based hand key point space coordinate acquisition method. The binocular camera shoots and transmits the captured video stream to the computer, where the stream is preprocessed, including left/right view separation and image correction. Key points are then detected in the left and right video frames with the machine learning model to obtain their coordinates, and finally the three-dimensional coordinates of the hand key points are recovered by the least squares method. The method is described in detail below with reference to Fig. 1:

Step 1: fix the binocular camera horizontally, calibrate its left and right cameras, establish the coordinate system conversion models, obtain each camera's intrinsic matrix and distortion coefficients and the pose parameters between the two cameras as shown in Fig. 2, and establish the camera imaging geometric model.

Step 1-1: fix the binocular camera so that the line through the two optical centres is horizontal and both cameras can capture the complete hand.

Step 1-2: perform binocular calibration of the left and right cameras. First a printed calibration board is prepared; in this embodiment it is a 7 × 10 checkerboard of alternating black and white squares with a side length of 25 mm, and the corner points where black and white squares meet are taken as feature points, 54 in total.

Step 1-3: shoot the calibration board with the binocular camera from different orientations. For a more accurate calibration result, the board should be shot from 10 or more different orientations, and it must remain inside the shooting area of the cameras throughout, yielding left and right images of the calibration board.

Step 1-4: extract the checkerboard feature points in the left and right images respectively; the origin of the image coordinate system is the intersection of the camera imaging plane with its optical axis. Match the feature points of the left and right checkerboard images of the same orientation using the epipolar constraint principle, solve the homography matrix with the Levenberg-Marquardt algorithm, then solve the intrinsic and extrinsic parameters of the left and right cameras, obtain the pose parameters between the cameras, and write the obtained parameters into the system. The intrinsic parameters are determined by the internal optical and geometric properties of the camera, including the actual pixel size $(d_x, d_y)$, the principal point pixel coordinates $(u_0, v_0)$, the focal length $f$, the axis skew coefficient $s$ and the distortion coefficients $(k_1, k_2, k_3, p_1, p_2)$. The extrinsic parameters describe the relative position and orientation between the camera coordinate system and the world coordinate system, comprising a rotation matrix and a translation vector, as well as the rotation matrix R and the translation vector T of the right camera coordinate system relative to the left camera coordinate system.
Further, the epipolar constraint principle of step 1-4 states that the projection of any spatial point onto the image plane necessarily lies in the epipolar plane formed by that point and the optical centres of the two cameras, so for a feature point in one image, its matching point in the other view must lie on the corresponding epipolar line. The epipolar constraint thus reduces feature matching from a two-dimensional search to a one-dimensional search, greatly increasing computation speed and reducing mismatches. The homography matrix describes the mapping between the world coordinate system and the pixel coordinate system, i.e. the projection matrix of the camera. The Levenberg-Marquardt (LM) algorithm is an optimization algorithm whose purpose here is to obtain the optimal solution of the homography matrix even when the computed feature point pairs are noisy or contain mismatches. The pose parameters between the cameras comprise the rotation matrix R and the translation vector T between them.
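In practice this calibration is commonly carried out with OpenCV, whose routines perform the corner extraction and LM optimization described above. A sketch under assumed file names, using the 6 × 9 inner-corner grid implied by the 7 × 10 board (6 × 9 = 54 feature points):

```python
import cv2
import numpy as np

pattern = (9, 6)                    # inner corners of the 7x10-square checkerboard
square = 0.025                      # 25 mm square side length
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square

obj_pts, left_pts, right_pts = [], [], []
for i in range(10):                 # 10 or more orientations, as above
    imgL = cv2.imread(f"left_{i}.png", cv2.IMREAD_GRAYSCALE)    # hypothetical paths
    imgR = cv2.imread(f"right_{i}.png", cv2.IMREAD_GRAYSCALE)
    okL, cornersL = cv2.findChessboardCorners(imgL, pattern)
    okR, cornersR = cv2.findChessboardCorners(imgR, pattern)
    if okL and okR:
        obj_pts.append(objp)
        left_pts.append(cornersL)
        right_pts.append(cornersR)

size = imgL.shape[::-1]             # image size as (width, height)
# Per-camera intrinsics and distortion coefficients, then the pose (R, T)
# of the right camera relative to the left camera:
_, K1, D1, _, _ = cv2.calibrateCamera(obj_pts, left_pts, size, None, None)
_, K2, D2, _, _ = cv2.calibrateCamera(obj_pts, right_pts, size, None, None)
_, K1, D1, K2, D2, R, T, E, F = cv2.stereoCalibrate(
    obj_pts, left_pts, right_pts, K1, D1, K2, D2, size,
    flags=cv2.CALIB_FIX_INTRINSIC)
```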
Step 2: preprocess the synchronized video shot by the binocular camera.
Step 2-1: after calibration, shoot the hand with the binocular camera. The captured images are 720 × 2560 digital images in RGB colour space, and the hand must stay within the shooting range of the cameras throughout. Since the binocular camera used in this embodiment shoots synchronously, the shot synchronized video must be split to obtain the left camera video and the right camera video respectively, each with a resolution of 720 × 1280.
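A sketch of this split, assuming the synchronized stream is stored as a side-by-side video file (the file name is hypothetical):

```python
import cv2

cap = cv2.VideoCapture("stereo.mp4")        # synchronized 720x2560 stream
while True:
    ok, frame = cap.read()                  # frame shape: (720, 2560, 3)
    if not ok:
        break
    left, right = frame[:, :1280], frame[:, 1280:]   # two 720x1280 views
    # ... undistort each view and feed it to the key point pipeline
cap.release()
```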
Step 2-2: correct the videos of the two cameras for distortion using the distortion coefficients obtained in step 1-4, so that the imaging process conforms to the pinhole imaging model. The correction formulas are:

Radial distortion correction:

$$x_0 = x(1 + k_1 r^2 + k_2 r^4 + k_3 r^6)$$

$$y_0 = y(1 + k_1 r^2 + k_2 r^4 + k_3 r^6)$$

Tangential distortion correction:

$$x_0 = x + 2p_1 xy + p_2(r^2 + 2x^2)$$

$$y_0 = y + p_1(r^2 + 2y^2) + 2p_2 xy$$

where $(x_0, y_0)$ is the original position of the distorted point in the image plane, $(x, y)$ is the new position after distortion correction, $r^2 = x^2 + y^2$, $k_1$, $k_2$, $k_3$ are the radial distortion coefficients and $p_1$, $p_2$ are the tangential distortion coefficients.
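As a sketch, the model above maps an ideal point to its distorted position; per-frame correction inverts this mapping (in practice, e.g., with OpenCV's undistortion routines). The coefficient values below are hypothetical:

```python
def distort(x, y, k1, k2, k3, p1, p2):
    """Apply the radial + tangential model above: ideal (x, y) -> distorted (x0, y0)."""
    r2 = x * x + y * y
    radial = 1.0 + k1 * r2 + k2 * r2**2 + k3 * r2**3
    x0 = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    y0 = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    return x0, y0

# Hypothetical calibration coefficients, normalized image coordinates:
x0, y0 = distort(0.10, -0.05, k1=-0.12, k2=0.03, k3=0.0, p1=1e-4, p2=-2e-4)
```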
Step 3: process the preprocessed left and right videos with the machine learning pipeline and infer the 21 3D key points in each frame; a schematic diagram of the key points is shown in Fig. 3, and the pixel coordinates of the key points in the left and right images are obtained. In particular, the machine learning pipeline organizes the machine learning tasks as a dataflow pipeline, which manages computing resources effectively to achieve low-latency performance. In this step the pipeline consists mainly of two models, a palm detection model and a hand key part detection model; this part of the flow is shown in Fig. 4. The specific steps are as follows:

To improve processing efficiency, after the preprocessed left and right video streams are obtained, they are processed in parallel using multiple threads, specifically:
step 3-1: the palm detector is used to detect the ROI (i.e., the hand region) in the first frame image of the video stream and return to the bounding box.
Step 3-2: and (4) cutting the image of the ROI, accurately positioning key points of the cut image by using a hand key part detection model, and outputting coordinates of the key points and confidence coefficients of hand existence and reasonable alignment in the cut image.
To improve detection efficiency and reduce computation time, the palm detector is not run again when processing subsequent frames; instead the hand region in the current frame is inferred from the hand key points computed in the previous frame, avoiding running the palm detector on every frame. Only when the confidence output by the hand key part detection model falls below a set threshold, or the hand is lost, is the palm detection model reapplied to the whole frame. In this embodiment the threshold is set to 0.8; a sketch of this loop follows.
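In the sketch below, `detect_palm` and `predict_keypoints` are hypothetical stand-ins for the BlazePalm and hand key part models, which the patent does not specify at code level:

```python
import numpy as np

def keypoints_to_roi(keypoints, margin=0.2):
    """Bounding box around the 21 (u, v) key points, padded by a relative margin."""
    lo, hi = keypoints.min(axis=0), keypoints.max(axis=0)
    pad = (hi - lo) * margin
    return np.concatenate([lo - pad, hi + pad])      # (u_min, v_min, u_max, v_max)

def track_hand(frames, detect_palm, predict_keypoints, conf_threshold=0.8):
    """Run the palm detector only on the first frame and after the hand is lost.

    detect_palm(frame) -> ROI; predict_keypoints(frame, roi) -> (21x2 array, conf).
    """
    roi, results = None, []
    for frame in frames:
        if roi is None:
            roi = detect_palm(frame)                 # full-frame palm detection
        keypoints, conf = predict_keypoints(frame, roi)
        if conf < conf_threshold:
            roi = None                               # hand lost: re-detect next frame
        else:
            roi = keypoints_to_roi(keypoints)        # infer the next frame's ROI
        results.append(keypoints)
    return results
```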
When performing ROI detection, this embodiment uses the palm detector to detect the palm rather than the whole hand, because the hand detection task is more complex than palm detection: hand detection has to cope with a wide range of hand sizes, which requires a larger detection range, and the hand lacks high-contrast feature regions, so reliable hand detection is difficult to achieve from visual features alone. In contrast to hand detection, which requires detecting a hand with joints and fingers, the palm detector only needs to detect the bounding box of a rigid object such as a palm or fist, which is obviously much simpler. This step also greatly reduces the need for data augmentation such as rotation, translation and scaling of the images, allowing the network to spend more capacity on key point localization accuracy.

To improve palm detection accuracy, palm detection in this embodiment uses the BlazePalm model, which has a large zoom range and can recognize a wide variety of palm sizes. With the NMS algorithm, the palm region can be detected well even when the hands occlude each other, and the palm can be located accurately through recognition of the arm, torso or personal features, compensating for the hand's lack of high-contrast texture features.
The hand key part detection model in this embodiment needs enough human hand samples for training. The model uses a CNN to predict a Gaussian heat map for each key point, so each label is the Gaussian map generated from that key point. The model regresses the 21 hand key points, so the predicted output feature map has 21 channels, each channel being the heat map of one key point; applying argmax to each channel then yields the integer coordinates of the key points. In particular, to reduce the bias of the prediction results, the loss function of the Gaussian heat map regression in this embodiment is not the mainstream MSE but the Euclidean distance loss:

$$L = \sum_{i=1}^{21} \lVert R_i - P_i \rVert_2$$

where $R_i$ is the labelled ground-truth coordinate and $P_i$ the model-predicted coordinate. This embodiment learns the heat maps indirectly by optimizing the loss of the predicted coordinates output by the whole model, i.e. the loss is computed from the predicted and real key points, and the heat maps are learned spontaneously by the network. Compared with regressing coordinate points directly through fully connected layers, the scheme of predicting Gaussian heat maps has stronger spatial generalization ability and higher accuracy of the predicted coordinates.
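A sketch of this heat map decoding and loss (NumPy; the array shapes are assumptions):

```python
import numpy as np

def heatmaps_to_keypoints(heatmaps):
    """argmax per channel: (21, H, W) Gaussian heat maps -> (21, 2) integer (u, v)."""
    n, h, w = heatmaps.shape
    peaks = heatmaps.reshape(n, -1).argmax(axis=1)   # flat index of each peak
    vs, us = np.unravel_index(peaks, (h, w))
    return np.stack([us, vs], axis=1)

def euclidean_loss(pred, truth):
    """Sum over the 21 key points of the Euclidean distance ||R_i - P_i||."""
    return np.linalg.norm(pred - truth, axis=1).sum()
```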
Step 4: compute the spatial coordinates of each key point in the world coordinate system by the least squares method, from the camera parameters obtained in step 1 and the pixel coordinates of the hand key points in the left and right images obtained in step 3.
The optical axis convergence model is shown in Fig. 5. From this model, for the left and right cameras respectively:

$$z_{c1}\begin{bmatrix} u_1 \\ v_1 \\ 1 \end{bmatrix} = P^{1}\begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix}, \qquad z_{c2}\begin{bmatrix} u_2 \\ v_2 \\ 1 \end{bmatrix} = P^{2}\begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix}$$

where $P^{1} = (p^{1}_{ij})$ is the projection matrix of the left camera, $P^{2} = (p^{2}_{ij})$ is the projection matrix of the right camera, and $(u_1, v_1)$, $(u_2, v_2)$ are the pixel coordinates of the key point in the left and right images respectively. Specifically, the rotation matrix of the left camera is set to the 3 × 3 identity matrix and its translation vector to the 3 × 1 zero vector, while the rotation matrix of the right camera is set to the rotation matrix R and its translation vector to the vector T obtained in step 1-4; the origin of the world coordinate system is then the optical centre of the left camera.
Eliminating $z_{c1}$ and $z_{c2}$ yields four linear equations:

$$(u_1 p^{1}_{31} - p^{1}_{11})x_w + (u_1 p^{1}_{32} - p^{1}_{12})y_w + (u_1 p^{1}_{33} - p^{1}_{13})z_w = p^{1}_{14} - u_1 p^{1}_{34}$$

$$(v_1 p^{1}_{31} - p^{1}_{21})x_w + (v_1 p^{1}_{32} - p^{1}_{22})y_w + (v_1 p^{1}_{33} - p^{1}_{23})z_w = p^{1}_{24} - v_1 p^{1}_{34}$$

$$(u_2 p^{2}_{31} - p^{2}_{11})x_w + (u_2 p^{2}_{32} - p^{2}_{12})y_w + (u_2 p^{2}_{33} - p^{2}_{13})z_w = p^{2}_{14} - u_2 p^{2}_{34}$$

$$(v_2 p^{2}_{31} - p^{2}_{21})x_w + (v_2 p^{2}_{32} - p^{2}_{22})y_w + (v_2 p^{2}_{33} - p^{2}_{23})z_w = p^{2}_{24} - v_2 p^{2}_{34}$$

Solving this system by the least squares method yields the real 3D spatial coordinates $(x_w, y_w, z_w)$ of the hand key point.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solution of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the exemplary embodiments of the present invention, and all such modifications and alterations should therefore fall within the scope of the appended claims.

Claims (7)

1. A binocular vision-based hand key point space coordinate acquisition method, characterized by comprising the following steps:
step 1: fixing the binocular camera horizontally, calibrating its left and right cameras, establishing each coordinate system conversion model, and obtaining each camera's intrinsic matrix and distortion coefficients and the pose parameters between the two cameras, i.e. establishing the camera imaging model;
step 2: preprocessing the synchronized video shot by the binocular camera;
step 3: processing the preprocessed left and right videos with a machine learning pipeline, inferring the 21 3D hand key points in each frame, and obtaining the pixel coordinates of the key points in the left and right camera images;
step 4: computing the spatial coordinates of each key point in the world coordinate system by the least squares method, from the camera parameters obtained in step 1.
2. The binocular vision-based hand key point space coordinate acquisition method according to claim 1, characterized in that step 1 comprises:
step 1-1: constructing the conversion model from the world coordinate system to the camera coordinate system, $X_c = rX_w + t$, i.e. in homogeneous form:

$$\begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix} = r\begin{bmatrix} x_w \\ y_w \\ z_w \end{bmatrix} + t$$

wherein $X_c$ denotes a point in the camera coordinate system, $X_w$ a point in the world coordinate system, $r$ is a 3 × 3 rotation matrix, and $t$ is a 3 × 1 translation vector; the rotation matrix $r$ is jointly determined by components about the three directions X, Y and Z, so it has three degrees of freedom, $r$ being the composition of the rotations about the X, Y and Z axes;
step 1-2: constructing the conversion model from the camera coordinate system to the image coordinate system, i.e. the pinhole imaging model,

$$z_c\begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}\begin{bmatrix} x_c \\ y_c \\ z_c \\ 1 \end{bmatrix}$$

wherein $(x, y)$ are the coordinates in the image coordinate system, $(x_c, y_c, z_c)$ the coordinates in the camera coordinate system, and $f$ is the camera focal length; $z_c$ is obtained by triangulation, i.e.

$$z_c = \frac{fB}{X_L - X_R}$$

wherein $B$ is the distance between the optical centres of the left and right cameras, and $X_L$, $X_R$ are the abscissae of the corresponding pixel point in the left and right images;
step 1-3: constructing the conversion model from the image coordinate system to the pixel coordinate system:

$$\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} 1/d_x & 0 & u_0 \\ 0 & 1/d_y & v_0 \\ 0 & 0 & 1 \end{bmatrix}\begin{bmatrix} x \\ y \\ 1 \end{bmatrix}$$

wherein $(u, v)$ are pixel coordinate system coordinates, $(x, y)$ image coordinate system coordinates, $d_x$ and $d_y$ denote the physical size of each pixel along the horizontal axis x and the vertical axis y respectively, and $(u_0, v_0)$ are the coordinates of the origin of the image coordinate system in the pixel coordinate system;
step 1-4: combining the models of steps 1-1, 1-2 and 1-3 to obtain the conversion model from the world coordinate system to the pixel coordinate system:

$$z_c\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f/d_x & 0 & u_0 & 0 \\ 0 & f/d_y & v_0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}\begin{bmatrix} r_{3\times3} & t_{3\times1} \\ 0 & 1 \end{bmatrix}\begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix}$$

wherein

$$K = \begin{bmatrix} f/d_x & 0 & u_0 & 0 \\ 0 & f/d_y & v_0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}$$

is the camera intrinsic matrix, $f/d_x$ and $f/d_y$ expressing the focal length in units of the actual physical pixel size along the horizontal axis x and the vertical axis y respectively, and

$$\begin{bmatrix} r_{3\times3} & t_{3\times1} \\ 0 & 1 \end{bmatrix}$$

is the camera extrinsic matrix, wherein $r_{3\times3}$ is the rotation matrix and $t_{3\times1}$ the translation vector; the transformation matrix from the world coordinate system to the pixel coordinate system, i.e. the projection matrix $P$ of the camera, is:

$$P = K\begin{bmatrix} r_{3\times3} & t_{3\times1} \\ 0 & 1 \end{bmatrix}$$

step 1-5: performing a Taylor series expansion around the principal point to construct the lens distortion model, taking the first few coefficients as the camera distortion coefficients, and computing the 3 × 3 rotation matrix R and the 3 × 1 translation vector T of the right camera coordinate system relative to the left camera coordinate system.
3. The binocular vision-based hand key point space coordinate acquisition method according to claim 1, characterized in that step 2 comprises:
step 2-1: splitting the synchronized video shot by the binocular camera to obtain the video shot by the left camera and the video shot by the right camera respectively;
step 2-2: correcting the videos of the two cameras frame by frame for distortion, using the distortion coefficients obtained in step 1-5, so that the imaging process conforms to the pinhole imaging model.
4. The binocular vision-based hand key point space coordinate acquisition method according to claim 1, characterized in that step 3 comprises:
step 3-1: detecting the whole image with the palm detection model and returning the hand region bounding box; palm detection uses the BlazePalm model, whose encoder-decoder feature extractor, similar to a feature pyramid network (FPN), extracts features at each scale to produce a multi-scale feature representation, so that the feature maps at every level carry strong semantic information at high resolution, handling well the scale changes caused by distance changes during palm detection; Focal Loss is adopted to cope with the large number of anchors produced by the multi-scale design;
step 3-2: locating the coordinates of the 21 3D key points in the hand region detected in step 3-1 by predicting Gaussian heat maps with the hand key part detection model.
5. The binocular vision-based hand key point space coordinate acquisition method according to claim 1, characterized in that in step 3-1, the palm detector is used to detect only the palm rather than the whole hand, because the hand lacks high-contrast feature regions and reliable hand detection is difficult to achieve from visual features alone; compared with detecting a hand with joints and fingers, the palm detector only needs to detect the bounding box of a rigid object such as a palm or fist.
6. The binocular vision-based hand key point space coordinate acquisition method according to claim 1, characterized in that in step 3-2, the hand key part detection model predicts a Gaussian heat map for each key point with a CNN (convolutional neural network), then applies argmax to each heat map to find the index of its peak, obtaining the coordinates of each key point; the model regresses the 21 hand key points, the predicted output feature map has 21 channels, and each channel is the heat map of one key point; the loss function is the Euclidean distance loss:

$$L = \sum_{i=1}^{21} \lVert R_i - P_i \rVert_2$$

wherein $R_i$ is the labelled ground-truth coordinate and $P_i$ the model-predicted coordinate.
7. The binocular vision-based hand key point space coordinate acquisition method according to claim 1, characterized in that step 4 comprises:
obtaining the coordinate conversion formulas from the world coordinate system to the pixel coordinate system according to the optical axis convergence model, which for the left and right cameras give:

$$z_{c1}\begin{bmatrix} u_1 \\ v_1 \\ 1 \end{bmatrix} = P^{1}\begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix}, \qquad z_{c2}\begin{bmatrix} u_2 \\ v_2 \\ 1 \end{bmatrix} = P^{2}\begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix}$$

wherein $P^{1} = (p^{1}_{ij})$ is the projection matrix of the left camera, $P^{2} = (p^{2}_{ij})$ is the projection matrix of the right camera, $(u_1, v_1)$ and $(u_2, v_2)$ are the pixel coordinates of the key point in the left and right images, and $z_{c1}$, $z_{c2}$ are the z coordinates of the key point in the left and right camera coordinate systems, obtained through step 1-2; the rotation matrix of the left camera is set to the 3 × 3 identity matrix and its translation vector to the 3 × 1 zero vector, the rotation matrix of the right camera is set to the matrix R obtained in step 1-5 and its translation vector to the vector T obtained in step 1-5, the origin of the world coordinate system being the optical centre of the left camera;
eliminating $z_{c1}$ and $z_{c2}$ yields:

$$(u_1 p^{1}_{31} - p^{1}_{11})x_w + (u_1 p^{1}_{32} - p^{1}_{12})y_w + (u_1 p^{1}_{33} - p^{1}_{13})z_w = p^{1}_{14} - u_1 p^{1}_{34}$$

$$(v_1 p^{1}_{31} - p^{1}_{21})x_w + (v_1 p^{1}_{32} - p^{1}_{22})y_w + (v_1 p^{1}_{33} - p^{1}_{23})z_w = p^{1}_{24} - v_1 p^{1}_{34}$$

$$(u_2 p^{2}_{31} - p^{2}_{11})x_w + (u_2 p^{2}_{32} - p^{2}_{12})y_w + (u_2 p^{2}_{33} - p^{2}_{13})z_w = p^{2}_{14} - u_2 p^{2}_{34}$$

$$(v_2 p^{2}_{31} - p^{2}_{21})x_w + (v_2 p^{2}_{32} - p^{2}_{22})y_w + (v_2 p^{2}_{33} - p^{2}_{23})z_w = p^{2}_{24} - v_2 p^{2}_{34}$$

and solving the above equations by the least squares method yields the real 3D spatial coordinates $(x_w, y_w, z_w)$ of the hand key point.
CN202111230723.7A (priority date 2021-10-22, filing date 2021-10-22): Binocular vision-based hand key point space coordinate acquisition method. Status: Pending. Published as CN114119739A.

Priority Applications (1)

Application Number: CN202111230723.7A
Priority Date / Filing Date: 2021-10-22
Title: Binocular vision-based hand key point space coordinate acquisition method

Applications Claiming Priority (1)

Application Number: CN202111230723.7A
Priority Date / Filing Date: 2021-10-22
Title: Binocular vision-based hand key point space coordinate acquisition method

Publications (1)

Publication Number: CN114119739A
Publication Date: 2022-03-01

Family

ID=80376605

Family Applications (1)

Application Number: CN202111230723.7A
Title: Binocular vision-based hand key point space coordinate acquisition method
Priority Date / Filing Date: 2021-10-22
Status: Pending

Country Status (1)

Country Link
CN (1) CN114119739A (en)


Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112362034A (en) * 2020-11-11 2021-02-12 上海电器科学研究所(集团)有限公司 Solid engine multi-cylinder section butt joint guiding measurement algorithm based on binocular vision
CN114820485A (en) * 2022-04-15 2022-07-29 华南理工大学 Method for measuring wave climbing height based on airborne image
CN114820485B (en) * 2022-04-15 2024-03-26 华南理工大学 Method for measuring wave climbing based on airborne image
CN114842091A (en) * 2022-04-29 2022-08-02 广东工业大学 Binocular egg size assembly line measuring method
CN114842091B (en) * 2022-04-29 2023-05-23 广东工业大学 Binocular egg size assembly line measuring method
CN114979611A (en) * 2022-05-19 2022-08-30 国网智能科技股份有限公司 Binocular sensing system and method
CN114757822A (en) * 2022-06-14 2022-07-15 之江实验室 Binocular-based human body three-dimensional key point detection method and system
CN117218320A (en) * 2023-11-08 2023-12-12 济南大学 Space labeling method based on mixed reality
CN117218320B (en) * 2023-11-08 2024-02-27 济南大学 Space labeling method based on mixed reality

Similar Documents

Publication Title
CN114119739A (en) Binocular vision-based hand key point space coordinate acquisition method
WO2022002150A1 (en) Method and device for constructing visual point cloud map
Wang et al. 360sd-net: 360 stereo depth estimation with learnable cost volume
US10109055B2 (en) Multiple hypotheses segmentation-guided 3D object detection and pose estimation
CN106251399B (en) A kind of outdoor scene three-dimensional rebuilding method and implementing device based on lsd-slam
CN107833181B (en) Three-dimensional panoramic image generation method based on zoom stereo vision
CN103839277B (en) A kind of mobile augmented reality register method of outdoor largescale natural scene
CN111968129A (en) Instant positioning and map construction system and method with semantic perception
CN106485207B (en) A kind of Fingertip Detection and system based on binocular vision image
CN107204010A (en) A kind of monocular image depth estimation method and system
CN110555408B (en) Single-camera real-time three-dimensional human body posture detection method based on self-adaptive mapping relation
CN109359514B (en) DeskVR-oriented gesture tracking and recognition combined strategy method
CN115205489A (en) Three-dimensional reconstruction method, system and device in large scene
CN103337094A (en) Method for realizing three-dimensional reconstruction of movement by using binocular camera
CN109758756B (en) Gymnastics video analysis method and system based on 3D camera
CN106155299B (en) A kind of pair of smart machine carries out the method and device of gesture control
CN113706699A (en) Data processing method and device, electronic equipment and computer readable storage medium
CN109272577B (en) Kinect-based visual SLAM method
CN115272271A (en) Pipeline defect detecting and positioning ranging system based on binocular stereo vision
CN113850865A (en) Human body posture positioning method and system based on binocular vision and storage medium
CN113393439A (en) Forging defect detection method based on deep learning
CN115035546B (en) Three-dimensional human body posture detection method and device and electronic equipment
CN111582036A (en) Cross-view-angle person identification method based on shape and posture under wearable device
CN115359127A (en) Polarization camera array calibration method suitable for multilayer medium environment
CN117711066A (en) Three-dimensional human body posture estimation method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination