CN115576426A - Hand interaction method for mixed reality flight simulator - Google Patents

Hand interaction method for mixed reality flight simulator

Info

Publication number
CN115576426A
Authority
CN
China
Prior art keywords
hand
key points
model
points
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211316738.XA
Other languages
Chinese (zh)
Inventor
郝天宇
赵永嘉
雷小永
戴树岭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangxi Research Institute Of Beijing University Of Aeronautics And Astronautics
Original Assignee
Jiangxi Research Institute Of Beijing University Of Aeronautics And Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangxi Research Institute Of Beijing University Of Aeronautics And Astronautics filed Critical Jiangxi Research Institute Of Beijing University Of Aeronautics And Astronautics
Priority to CN202211316738.XA priority Critical patent/CN115576426A/en
Publication of CN115576426A publication Critical patent/CN115576426A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01: Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/011: Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F 3/014: Hand-worn input/output arrangements, e.g. data gloves
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01: Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/017: Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/107: Static hand or arm
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20: Movements or behaviour, e.g. gesture recognition
    • G06V 40/28: Recognition of hand or arm movements, e.g. recognition of deaf sign language

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a hand interaction method for a mixed reality flight simulator, belonging to the technical field of mixed reality. The method comprises the following steps: building and training a deep learning network model that recognizes hands wearing data gloves and their key points; building a three-dimensional hand model and creating an operating-component positioning template based on a model of the components; during recognition, adjusting the three-dimensional hand model to fit the current user, and using Kalman filtering to fuse the key-point states acquired by the data-glove sensors with the key-point states tracked by a camera mounted on the helmet display; reconstructing the hand pose in the virtual environment in three dimensions from the fused key-point states; and, when interacting with an operating component, mapping the virtual hand onto the operating-component model according to the relative positions of the component's feature points and the hand key points. Performing hand tracking and reconstruction through the fusion of multiple sources of sensing information reduces the influence of environmental factors and improves the stability and accuracy of tracking and simulation.

Description

Hand interaction method for mixed reality flight simulator
Technical Field
The invention belongs to the technical field of mixed reality, and particularly relates to a hand interaction method for a mixed reality flight simulator.
Background
Flight simulators based on mixed reality technology offer high realism and strong immersion, and are increasingly used in scenarios such as pilot training. Flight simulation involves a large amount of hand interaction with the various control components in the aircraft cockpit. As an important part of virtual-real fusion, a method that correctly reflects the pilot's hand pose and the interaction state with the control components in the virtual cockpit is needed to enhance the interactive experience of the system.
Common hand interaction schemes in flight simulators are handle-based interaction, wearable data-glove-based interaction, and computer-vision-based bare-hand interaction. A handle device simulates a fixed interaction in a handheld state; it cannot reflect the natural state of the human hand and is unsuitable when interaction with physical control components is required. Data-glove and vision schemes can express the spatial position, rotation and complex gestures of the hand, but obtaining haptic feedback with them is expensive. At present these two schemes are used to provide visual feedback of the hand in the virtual environment, and combining this with the haptic feedback of touching physical control components is a comparatively ideal human-machine interaction mode.
The operating components in the cockpit are numerous and densely packed, which places high demands on interaction accuracy. When the two schemes above, interaction based on wearable data gloves and interaction based on computer vision, were actually tested in a physical cockpit, the interaction results were unsatisfactory; specifically, the following problems exist:
1. Data gloves have a relatively fixed mechanical structure and cannot automatically adapt to hands of different sizes. In addition, data gloves integrate numerous sensors and are easily disturbed by the complex electromagnetic environment inside a physical cockpit. These factors lead to incorrect spatial positioning and posture of the hand, so that real and virtual interaction actions cannot be matched accurately.
2. Hand interaction based on computer vision is affected by environmental factors such as illumination, noise and occlusion. Recognition and tracking are unstable and limited in range, so hand tracking easily becomes inconsistent or is even lost during interaction, compromising the correctness of the interaction information.
3. To cope with errors during interaction, hand interaction with a control component usually snaps a fixed gesture onto the component. This limits the freedom of interaction and cannot meet the requirement of flight simulation for natural interaction.
Disclosure of Invention
Aiming at the problem of hand interaction in mixed reality flight simulators, the invention provides a hand interaction method for a mixed reality flight simulator. It fuses data-glove and visual hand-recognition information to jointly determine accurate and stable hand key points for three-dimensional hand reconstruction, and maps real-time gestures onto an operating component using the relative relation between feature points on the component and the hand key points.
The invention provides a hand interaction method for a mixed reality flight simulator, which comprises the following steps:
step 1, acquiring hand images wearing data gloves, and constructing and training a deep learning network model for detecting key points of hands in the images in real time;
step 2, establishing a hand three-dimensional model, and setting static constraint and dynamic constraint of joints;
step 3, creating a model-based operating-component positioning template, which involves calibrating the outer contour of the physical operating-component region, the feature points on each operating component, and the relative coordinates of each operating component within the region;
step 4, the user places the hand in the recognition range of the deep learning network model for detecting the key points of the hand, and adjusts the three-dimensional hand model according to the detected key points so as to be suitable for the current user;
step 5, modeling the motion state of the hand key points, wherein the state vector of a key point comprises its current position coordinates and its velocity along the three axes of the virtual space coordinate system; for each hand key point: estimating the state of the key point at the current instant from its state at the previous instant through a motion model; acquiring the three-dimensional coordinates of the key point from the data-glove sensor information in real time to generate an observation vector; and fusing the estimated state with the observation vector by Kalman filtering to update and output the current state of the key point;
step 6, acquiring a hand image in the current visual field range of a user by using a camera arranged on a helmet display, and detecting a hand key point in the image by using the deep learning network model obtained by training in the step 1; for each hand key point, fusing the state of the hand key point output in the step 5 and the state of the hand key point detected by the deep learning network model by using Kalman filtering and outputting;
step 7, fitting the output states of the key points of the hand with the joints of the hand three-dimensional model adjusted in the step 4 to form the optimal posture of the current hand model;
and step 8, during interaction, updating the three-dimensional coordinates of the hand key points and the posture of the hand model through steps 5 to 7, acquiring an image with a fixed camera placed in the region where the operating components are concentrated, detecting the operating-component region and the feature points on the operating components in the image with the positioning template of step 3, calculating the relative pose of the hand with respect to the feature points on the operating component, and fitting the interaction state of the hand model with the operating component.
The invention has the advantages and positive effects that: (1) The method carries out hand tracking reconstruction by fusing various sensing information, effectively reduces the influence of environmental factors on the reconstruction result, and improves the stability and the accuracy of the mixed reality flight simulator system. (2) The method provided by the invention designs a fusion and fitting algorithm based on the hand key points, reduces the complexity of hand model reconstruction, and is beneficial to saving system overhead. (3) The method of the invention is not limited by fixed gesture actions, can adapt to various interactive operations, and is easy for system expansion.
Drawings
FIG. 1 is a block diagram of one implementation of the hand interaction method of the present invention for a mixed reality flight simulator;
FIG. 2 is a sample diagram of the regression training of the hand key points in the embodiment of the present invention;
fig. 3 is an exemplary diagram of fitting the position of the fingertip on the operation member by the least square method in the embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
The hand interaction method for the mixed reality flight simulator, which is realized by the embodiment of the invention, identifies and tracks hand key points in an image through deep learning, fuses corresponding joint point pose information acquired by a sensor in a data glove by using Kalman filtering, and three-dimensionally reconstructs hand pose by using the fused hand key point information in a virtual environment; when interacting with the operation component, the virtual hand is mapped to the operation component model according to the relative position relation between the operation component characteristics and the hand key points.
The specific process of the embodiment of the invention can be divided into an off-line stage and an on-line stage, and the whole process is shown in figure 1.
The offline stage mainly implements the method for obtaining hand key points from images. A deep learning network model for detecting hand key points is trained with the prepared training data set so that hand key points can be detected in images in real time; the detection process includes hand detection and hand key-point regression.
Specifically, the operation process in the offline stage includes the following:
step 1.1, constructing a deep learning network model for hand detection, wherein the process specifically comprises the following steps:
Step 1.1.1, acquire hand images with the data glove worn, perform data enhancement by flipping, random cropping, scaling and similar operations, label the image data containing hand information, and build a hand detection training set. In this embodiment a rectangular box is used to label the hand boundary, and the label information includes the category, center point, width and height of the labeled region.
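For illustration, a minimal Python sketch of the flip / random-crop / scale data enhancement described in step 1.1.1 is given below; the (cx, cy, w, h) box format, the 640-pixel network input size and the crop ratios are assumptions made for the sketch, not values taken from the patent.

import cv2
import numpy as np

def augment_hand_image(image, bbox, rng=np.random.default_rng()):
    # image: HxWx3 uint8 array; bbox: (cx, cy, w, h) in pixels (assumed format).
    # Returns the augmented image and the bounding box in the new coordinates.
    h, w = image.shape[:2]
    cx, cy, bw, bh = bbox

    # Horizontal flip with 50% probability; mirror the box centre accordingly.
    if rng.random() < 0.5:
        image = cv2.flip(image, 1)
        cx = w - cx

    # Random crop keeping at least 80% of each side, then move the box
    # centre into the cropped coordinate frame.
    crop_w = int(w * rng.uniform(0.8, 1.0))
    crop_h = int(h * rng.uniform(0.8, 1.0))
    x0 = int(rng.integers(0, w - crop_w + 1))
    y0 = int(rng.integers(0, h - crop_h + 1))
    image = image[y0:y0 + crop_h, x0:x0 + crop_w]
    cx, cy = cx - x0, cy - y0

    # Rescale to the (assumed) 640x640 network input and scale the box with it.
    target = 640
    sx, sy = target / crop_w, target / crop_h
    image = cv2.resize(image, (target, target))
    return image, (cx * sx, cy * sy, bw * sx, bh * sy)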
Step 1.1.2, input the training set built in step 1.1.1 into a YOLOv5 network and train it to detect hands in images. The YOLOv5 network is modified according to the characteristics of the training set as follows:
(1) The detection-box regression loss function adopts CIoU_Loss in place of GIoU_Loss (a minimal sketch of this loss is given after item (2) below).
The CIoU is calculated as:
CIoU = IoU - ρ²(b, b^gt)/c² - αv
where IoU is the intersection-over-union of the predicted and ground-truth detection boxes, ρ is the Euclidean distance function, b and b^gt are the center points of the predicted and ground-truth boxes respectively, c is the diagonal length of the smallest region enclosing both boxes, α is a weight coefficient, and v is a parameter measuring the similarity between the predicted and ground-truth aspect ratios.
(2) Compared with a bare hand, the data glove shows fewer texture details, so a feature dimension-reduction step using PCA (principal component analysis) is added in the feature extraction network, and the reduced features form a new feature map in the feature fusion network.
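For reference, here is a minimal NumPy sketch of the CIoU loss named in item (1); the (cx, cy, w, h) box format and the usual definitions of the weight α and the aspect-ratio term v follow the published CIoU formulation and are assumptions, since the patent only lists the variables.

import numpy as np

def ciou_loss(pred, gt):
    # pred, gt: boxes as (cx, cy, w, h). Returns 1 - CIoU,
    # with CIoU = IoU - rho^2(b, b_gt) / c^2 - alpha * v.
    px1, py1 = pred[0] - pred[2] / 2, pred[1] - pred[3] / 2
    px2, py2 = pred[0] + pred[2] / 2, pred[1] + pred[3] / 2
    gx1, gy1 = gt[0] - gt[2] / 2, gt[1] - gt[3] / 2
    gx2, gy2 = gt[0] + gt[2] / 2, gt[1] + gt[3] / 2

    # Intersection-over-union of the two boxes.
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    union = pred[2] * pred[3] + gt[2] * gt[3] - inter
    iou = inter / (union + 1e-9)

    # rho^2: squared distance between the box centres;
    # c^2: squared diagonal of the smallest box enclosing both.
    rho2 = (pred[0] - gt[0]) ** 2 + (pred[1] - gt[1]) ** 2
    c2 = (max(px2, gx2) - min(px1, gx1)) ** 2 + (max(py2, gy2) - min(py1, gy1)) ** 2 + 1e-9

    # v measures aspect-ratio consistency; alpha is its trade-off weight.
    v = (4 / np.pi ** 2) * (np.arctan(gt[2] / gt[3]) - np.arctan(pred[2] / pred[3])) ** 2
    alpha = v / ((1.0 - iou) + v + 1e-9)

    return 1.0 - (iou - rho2 / c2 - alpha * v)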
Step 1.2, constructing a deep learning model of hand key point regression, wherein the process specifically comprises the following steps:
Step 1.2.1, build a hand key-point data set consisting of images annotated with the 21 key points of a single hand, the hand bounding box, and the three-dimensional coordinates of the key points in the current camera coordinate system; the data set is created from hand images with the data glove worn. As shown in FIG. 2, the left side shows the data-glove hand image with the labeled key points, and the right side shows the three-dimensional coordinate plot of the hand key points.
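Purely as an illustration of what one entry of such a data set could look like, a hypothetical sample record is shown below; the field names and file name are invented for the sketch and are not the patent's annotation format.

# One (hypothetical) sample: the hand image with the data glove worn, the hand
# bounding box, the 21 labeled key points in image coordinates, and their
# three-dimensional coordinates in the current camera coordinate system.
sample = {
    "image": "glove_hand_000123.png",
    "bbox": [212, 148, 196, 210],               # x, y, w, h of the hand region (pixels)
    "keypoints_2d": [[305.0, 190.5]] * 21,      # 21 x (u, v) image coordinates
    "keypoints_3d": [[0.02, -0.01, 0.45]] * 21, # 21 x (X, Y, Z), camera frame, metres
}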
Step 1.2.2, input the training set built in step 1.2.1 into a SqueezeNet network and train it to compute the key-point coordinates within the detected hand regions; the key-point regression loss function adopts Wing Loss.
The Wing Loss function is defined as:
Wing(x) = ω·ln(1 + |x|/ε),  if |x| < ω
Wing(x) = |x| - C,          otherwise
where ω sets the extent of the nonlinear part, ε constrains the curvature of the nonlinear region, C = ω - ω·ln(1 + ω/ε) is the constant that connects the linear and nonlinear parts of the function, and x is the input quantity.
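A minimal PyTorch sketch of this loss is given below; the values ω = 10 and ε = 2 are commonly used defaults and are assumptions here, since the patent does not state them.

import torch

def wing_loss(pred, target, omega=10.0, epsilon=2.0):
    # Wing(x) = omega * ln(1 + |x| / epsilon)  if |x| < omega
    #         = |x| - C                        otherwise,
    # with C = omega - omega * ln(1 + omega / epsilon) joining the two branches.
    x = (pred - target).abs()
    C = omega - omega * torch.log(torch.tensor(1.0 + omega / epsilon))
    loss = torch.where(x < omega, omega * torch.log(1.0 + x / epsilon), x - C)
    return loss.mean()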
Preferably, an open-source pretrained bare-hand key-point regression model is used for transfer learning in step 1.2.2, to compensate for the small size of the data set and to speed up model convergence.
When the models trained in steps 1.1 and 1.2 are applied to an input video image, the hand detection deep learning network model first detects the hand region, and the hand key-point regression model then detects the key-point information within the detected region; in this embodiment of the invention the three-dimensional coordinates of 21 key points are detected.
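A sketch of how the two trained models could be chained at inference time follows; hand_detector and keypoint_regressor are hypothetical wrappers around the YOLOv5 detector and the SqueezeNet regressor, and their input/output conventions are assumptions.

import numpy as np

def detect_hand_keypoints(frame, hand_detector, keypoint_regressor):
    # hand_detector(frame)     -> iterable of (x, y, w, h) hand boxes   (assumed API)
    # keypoint_regressor(crop) -> (21, 2) crop-relative key points      (assumed API)
    hands = []
    for box in hand_detector(frame):
        x, y, w, h = map(int, box)
        crop = frame[y:y + h, x:x + w]
        kpts = np.asarray(keypoint_regressor(crop), dtype=float)  # (21, 2)
        kpts += np.array([x, y], dtype=float)   # back to full-frame coordinates
        hands.append(kpts)
    return hands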
Step 1.3, establish a three-dimensional hand model, taking the 21 key points of a single hand as joints, connecting the joints to form a skeleton, and covering the outer layer with a skin. In addition, kinematic constraints exist among the joints; they are designed as follows:
Step 1.3.1, design the static joint constraints as follows:
The forward-backward (flexion) angles of the metacarpal joint, proximal phalangeal joint and distal phalangeal joint, θ_M, θ_P and θ_D, are each limited to the range [0°, 90°]. The left-right (abduction) angle of the metacarpal joint is limited to [-30°, 0°] for the little finger, [-15°, 0°] for the ring finger, 0° for the middle finger, [0°, 15°] for the index finger, and [0°, 45°] for the thumb.
Step 1.3.2, design the dynamic joint constraints as follows: the forward-backward (flexion) angle of the distal phalangeal joint is coupled to that of the proximal phalangeal joint; the flexion angle of the metacarpal joint is coupled to that of the proximal phalangeal joint; and the flexion and left-right (abduction) angles of the metacarpal joint satisfy a further dynamic constraint relation.
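A small sketch of how the static limits of step 1.3.1 could be enforced when a candidate pose is processed is given below; the per-finger ranges follow the values listed above, while the dynamic couplings of step 1.3.2 are not reproduced in this sketch.

import numpy as np

# Static flexion range (degrees) shared by the metacarpal (M), proximal (P) and
# distal (D) phalangeal joints, and per-finger abduction ranges of the
# metacarpal joint, as listed in step 1.3.1.
FLEXION_RANGE = (0.0, 90.0)
ABDUCTION_RANGE = {
    "little": (-30.0, 0.0),
    "ring":   (-15.0, 0.0),
    "middle": (0.0, 0.0),
    "index":  (0.0, 15.0),
    "thumb":  (0.0, 45.0),
}

def clamp_finger_angles(finger, theta_m, theta_p, theta_d, abduction):
    # Clamp one finger's joint angles to the static constraint ranges.
    lo, hi = FLEXION_RANGE
    theta_m = float(np.clip(theta_m, lo, hi))
    theta_p = float(np.clip(theta_p, lo, hi))
    theta_d = float(np.clip(theta_d, lo, hi))
    a_lo, a_hi = ABDUCTION_RANGE[finger]
    abduction = float(np.clip(abduction, a_lo, a_hi))
    return theta_m, theta_p, theta_d, abduction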
Step 1.4, create a model-based operating-component positioning template, which mainly involves calibrating the outer contour of the physical operating-component region, the feature-point information on each operating component, and the relative coordinate position of each operating component within the region.
In the online stage, data-glove sensor information and visual information about the hand key points are fused for stable hand tracking, and the optimal interaction posture is fitted when interacting with an operating component. During three-dimensional hand reconstruction from the fused key-point information, the visual information is the hand information within the user's current field of view obtained by the camera mounted on the helmet display; during interaction, the hand interaction posture is computed from a fixed camera placed at the operating-component concentration region and then corrected appropriately for the user's current viewpoint using the helmet-mounted camera.
Specifically, the process of performing three-dimensional hand reconstruction by hand key point information fusion in the online stage includes:
and 2.1, placing the hand in an identifiable range of a hand key point detection network trained in an off-line stage, and adjusting the joint position and the skeleton length of the hand three-dimensional model established in the step 1.3 according to the detected key point coordinates to adapt to hands with different sizes.
Step 2.2, when no hand key point is detected in the image, Kalman filtering is used for each hand key point to fuse the motion model with the data-glove sensor information and determine its three-dimensional coordinates. The process is as follows:
and 2.2.1, acquiring three-dimensional coordinates of 21 key points of a single hand in the information of the data glove sensor, converting the three-dimensional coordinates into a current virtual space coordinate system, and marking the three-axis directions of the virtual space coordinate system as x, y and z.
Step 2.2.2, describe the motion state of the hand key points.
The state of a hand key point at time t consists of its current position coordinates and motion velocity, expressed as the state vector X_t = [x(t), y(t), z(t), v_x(t), v_y(t), v_z(t)]^T, where (x(t), y(t), z(t)) are the three-dimensional coordinates of the key point at time t and (v_x(t), v_y(t), v_z(t)) are its velocities along the x, y and z directions at time t.
Step 2.2.3, describe the observation state of the hand key points using the data-glove sensor data. The observation vector Z_t has the same form as the state vector X_t in step 2.2.2, where the velocity of the key point along the x, y and z directions at time t is computed from the key-point coordinate at time t-1 and the current coordinate.
Step 2.2.4, use the motion equation
X̂_t = F · X_{t-1}
to update the estimated motion state X̂_t of the hand key point at time t, where F is the constant-velocity state-transition matrix, i.e. the predicted position is x(t-1) + v_x(t-1) · Δt (and likewise for y and z) while the velocity components are carried over unchanged. Δt denotes the time interval between two sampling instants, for example from time t-1 to time t, and X_{t-1} = [x(t-1), y(t-1), z(t-1), v_x(t-1), v_y(t-1), v_z(t-1)]^T is the known state vector of the hand key point at the previous instant.
Step 2.2.5, update the state X of each hand key point using
X_t = X̂_t + K_t · (Z_t - H · X̂_t)
where X̂_t denotes the state estimate at time t, K_t is the Kalman gain at time t, Z_t is the observation vector at time t, and H is the transformation matrix between the state and the observation.
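Below is a minimal constant-velocity Kalman filter for a single key point, following the six-dimensional state of step 2.2.2 and the predict/update steps 2.2.4 and 2.2.5; the process and measurement noise covariances are illustrative placeholders, as the patent does not specify them.

import numpy as np

class KeypointKalman:
    # State X = [x, y, z, vx, vy, vz]^T (step 2.2.2). Observations Z use the
    # same layout, with velocities obtained by finite differences (step 2.2.3).
    def __init__(self, x0, q=1e-3, r=1e-2):
        self.X = np.asarray(x0, dtype=float)   # state estimate
        self.P = np.eye(6)                     # state covariance
        self.Q = q * np.eye(6)                 # process noise (assumed value)
        self.R = r * np.eye(6)                 # measurement noise (assumed value)
        self.H = np.eye(6)                     # observation matrix

    def predict(self, dt):
        # Step 2.2.4: positions advance by velocity * dt, velocities persist.
        F = np.eye(6)
        F[0, 3] = F[1, 4] = F[2, 5] = dt
        self.X = F @ self.X
        self.P = F @ self.P @ F.T + self.Q
        return self.X

    def update(self, z):
        # Step 2.2.5: X = X_hat + K (Z - H X_hat).
        z = np.asarray(z, dtype=float)
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.X = self.X + K @ (z - self.H @ self.X)
        self.P = (np.eye(6) - K @ self.H) @ self.P
        return self.X

In step 2.3 the same update call can be applied a second time with the vision-derived observation Z_t', so that the glove and camera measurements are fused sequentially.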
Step 2.3, when hand key points are detected in the image, the visual key-point information is fused once more for each hand key point, based on the motion state obtained in step 2.2, to determine a new state. The process is as follows:
and 2.3.1, acquiring three-dimensional coordinates of 21 key points of a single hand output by the hand key point detection network, and converting the current camera coordinate system into a virtual space coordinate system.
Step 2.3.2, describe the observation state Z_t' of the hand key points acquired from the image; the description method is consistent with step 2.2.3.
Step 2.3.3, apply the method of step 2.2.5 again using Z_t': the state X_t updated in step 2.2.5 is taken as the state estimate X̂_t in this step, Z_t' is taken as the state observation at time t, and the formula of step 2.2.5 is applied to update the key-point state, giving X_t' = X_t + K_t · (Z_t' - H · X_t).
Step 2.4, fit the three-dimensional coordinates of the 21 single-hand key points continuously computed in steps 2.2 to 2.3 to the joints of the hand three-dimensional model adjusted in step 2.1 by a PnP method, forming the optimal posture of the hand model.
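Step 2.4 names a PnP fit; one common way to realise such a fit with OpenCV is to recover the rigid pose of the adjusted hand model from its model-space joint positions and the corresponding 2-D key points detected in the image, as sketched below. This is an illustrative interpretation, not necessarily the exact procedure of the patent; the per-joint angles can then be refined within the constraints of step 1.3.

import cv2
import numpy as np

def fit_hand_pose_pnp(model_joints_3d, image_keypoints_2d, camera_matrix):
    # model_joints_3d    : (21, 3) joint positions of the adjusted hand model
    # image_keypoints_2d : (21, 2) detected key points in the image
    # camera_matrix      : (3, 3) camera intrinsic matrix
    obj = np.asarray(model_joints_3d, dtype=np.float64).reshape(-1, 1, 3)
    img = np.asarray(image_keypoints_2d, dtype=np.float64).reshape(-1, 1, 2)
    ok, rvec, tvec = cv2.solvePnP(obj, img, camera_matrix, None,
                                  flags=cv2.SOLVEPNP_ITERATIVE)
    if not ok:
        raise RuntimeError("PnP failed to converge")
    R, _ = cv2.Rodrigues(rvec)   # rotation vector -> 3x3 rotation matrix
    return R, tvec.ravel()       # hand-model pose in the camera frame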
After the hand posture has been three-dimensionally reconstructed in real time through the above steps, the hand reconstruction result must be further fitted to the interaction state of the control component during interaction. The specific process is as follows:
and 3.1, matching the outer edge contour and the internal feature point of the operation component region in the detected image with the positioning template by using a fixed camera placed in the operation component concentration region, obtaining a two-dimensional and three-dimensional projection transformation relation of the operation component concentration region, and converting a matching result into a current virtual space coordinate system.
Step 3.2, compute the relative posture of the hand with respect to the feature points on the operating component, fuse the fixed camera's hand key-point detection results with the hand reconstruction result once more, and determine the final, adjusted hand interaction posture. The specific process is as follows:
Step 3.2.1, the parts of the hand that interact with an operating component are mostly the fingertips. The positional relation between a hand key point and the feature points on the operating component is expressed as
D_i² = (x - x_i)² + (y - y_i)² + (z - z_i)²
where D_i is the distance from the hand key point to the i-th feature point of the operating component, (x, y, z) are the unknown coordinates of the hand key point, and (x_i, y_i, z_i) are the coordinates of the i-th feature point on the operating component. The optimal estimated position of the hand key point relative to the operating component is computed by the least squares method, and the postures of the other hand joints are fitted through an inverse kinematics model. This step yields the optimal position of the hand key point as estimated from the feature points of the operating component.
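A minimal SciPy sketch of the least-squares estimate in step 3.2.1 follows, using the distance relation D_i² = (x - x_i)² + (y - y_i)² + (z - z_i)² as the residual definition; the initial guess at the feature-point centroid is an assumption.

import numpy as np
from scipy.optimize import least_squares

def estimate_keypoint_position(feature_points, distances, x0=None):
    # feature_points : (N, 3) coordinates (x_i, y_i, z_i) of the operating-part feature points
    # distances      : (N,) target distances D_i from the hand key point to each feature point
    pts = np.asarray(feature_points, dtype=float)
    d = np.asarray(distances, dtype=float)
    if x0 is None:
        x0 = pts.mean(axis=0)   # start from the feature-point centroid

    def residuals(p):
        # r_i = ||p - p_i|| - D_i, zero exactly when the distance relation holds.
        return np.linalg.norm(pts - p, axis=1) - d

    return least_squares(residuals, x0).x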
Step 3.2.2, following the processes of steps 2.2 and 2.3, the result of step 3.2.1 is fused with the result of step 2.2 or step 2.3, the choice depending on whether hand key points are detected by the head-mounted camera. In this step the hand key-point positions estimated in step 3.2.1 and the hand key-point positions output in step 2.2 or 2.3 may be fused by Kalman filtering, and the fused key-point positions are output.
With a camera fixed at the operating-component concentration region, hand tracking can be performed stably even when the head-mounted camera moves to a position from which the operating components are not visible. Because the fixed camera and the moving camera have different shooting angles, and errors and differing precision remain after their results are converted into the virtual coordinate system, the hand key-point positions are updated through Kalman filtering.
Step 3.2.3, compute the distance between the hand key points of step 3.2.2 and the operating component. If the computed distance is smaller than a preset distance threshold, perform collision detection between the hand model and the operating component, attach the joint to the operating component in the virtual interface, and use a larger smoothing coefficient during this process to keep the hand posture stable.
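A small sketch of the distance test and the "larger smoothing coefficient" of step 3.2.3 is shown below, implemented as exponential smoothing toward the contact point on the operating component; the threshold and smoothing values are illustrative only.

import numpy as np

def attach_joint(joint_pos, contact_pos, prev_pos,
                 threshold=0.015, alpha_contact=0.9, alpha_free=0.5):
    # joint_pos   : current fused joint position (virtual-space coordinates)
    # contact_pos : nearest point on the operating-component surface
    # prev_pos    : joint position rendered in the previous frame
    joint_pos = np.asarray(joint_pos, dtype=float)
    contact_pos = np.asarray(contact_pos, dtype=float)
    prev_pos = np.asarray(prev_pos, dtype=float)

    if np.linalg.norm(joint_pos - contact_pos) < threshold:
        # Within interaction distance: pull the joint onto the component and
        # smooth strongly so the attached posture stays stable.
        target, alpha = contact_pos, alpha_contact
    else:
        target, alpha = joint_pos, alpha_free
    return alpha * target + (1.0 - alpha) * prev_pos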
Step 3.2.4, as the user's observation position changes, the feature points on the operating component detected by the camera mounted on the helmet display that have the highest reliability at the current observation position are used to compute the relative joint postures by the least squares method, and the joint postures obtained in steps 3.2.2 and 3.2.3 are adjusted by filtering.
The camera mounted on the helmet display also captures an image of the operating component at the current viewing angle in real time. For all feature points on the operating component, factors such as visibility (whether occluded), stability and distance are evaluated, the feature points are weighted and ranked, and the several feature points above the confidence threshold are taken. As in step 3.2.1, the least squares method is applied again to compute the hand key-point positions, and the posture of the hand model is adjusted.
Apart from the technical features described in this specification, the remaining technology is known to those skilled in the art. Descriptions of well-known components and techniques are omitted so as not to obscure the present invention unnecessarily. The embodiments described above do not represent all embodiments consistent with the present application, and modifications or variations made by those skilled in the art without inventive effort on the basis of the technical solution of the present invention still fall within the protective scope of the present invention.

Claims (6)

1. A hand interaction method for a mixed reality flight simulator is characterized by comprising the following steps:
step 1, acquiring hand images wearing data gloves, and constructing and training a deep learning network model for detecting key points of hands in the images in real time;
step 2, establishing a hand three-dimensional model, and setting static constraint and dynamic constraint of joints;
step 3, creating a model-based operating-component positioning template, which involves calibrating the outer contour of the physical operating-component region, the feature points on each operating component, and the relative coordinates of each operating component within the region;
step 4, the user places the hand in the recognition range of the deep learning network model for detecting the key points of the hand, and adjusts the three-dimensional hand model according to the detected key points so as to be suitable for the current user;
step 5, modeling the motion state of the hand key points, wherein the state vector of a key point comprises its current position coordinates and its velocity along the three axes of the virtual space coordinate system; for each hand key point: estimating the state of the key point at the current instant from its state at the previous instant through a motion model; acquiring the three-dimensional coordinates of the key point from the data-glove sensor information in real time to generate an observation vector; and fusing the estimated state with the observation vector by Kalman filtering to update and output the current state of the key point;
step 6, acquiring a hand image in the current visual field range of a user by using a camera arranged on a helmet display, and detecting key points of the hand in the image by using the deep learning network model obtained by training in the step 1; for each hand key point, fusing the state of the hand key point output in the step 5 and the state of the hand key point detected by the deep learning network model by using Kalman filtering and outputting;
step 7, fitting the output states of the key points of the hand and the joints of the hand three-dimensional model adjusted in the step 4 to form the optimal posture of the current hand model;
and step 8, during interaction, updating the three-dimensional coordinates of the hand key points and the posture of the hand model through steps 5 to 7, acquiring an image with a fixed camera placed in the region where the operating components are concentrated, detecting the operating-component region and the feature points on the operating components in the image with the positioning template of step 3, calculating the relative pose of the hand with respect to the feature points on the operating component, and fitting the interaction state of the hand model with the operating component.
2. The method according to claim 1, wherein the step 1 comprises:
(1) Constructing a deep learning network model for hand detection, comprising: acquiring hand images with the data glove worn, performing data enhancement, and labeling the images to obtain a hand detection training set, wherein the labeled regions are hand regions and the label information comprises category, center point, width and height; modifying the YOLOv5 network as follows: (a) adopting a CIoU_Loss function as the detection-box regression loss, and (b) adding a PCA-based feature dimension-reduction step in the feature extraction network to form a new feature map in the feature fusion network; and training the YOLOv5 network with the hand detection training set to obtain the hand detection deep learning network model;
(2) Constructing a deep learning model for hand key-point regression, comprising: labeling hand key-point information on hand images with the data glove worn, building a hand key-point detection training set, and training a SqueezeNet network to obtain the deep learning network model for detecting hand key points; the SqueezeNet network is a SqueezeNet network pretrained beforehand with bare-hand key-point samples.
3. The method according to claim 1, wherein in step 5, for each hand key point, the motion state of the hand key point at the current time t is estimated by the motion model as X̂_t = F · X_{t-1}, where F is the constant-velocity state-transition matrix and X_{t-1} is the state vector at the previous instant; the three-dimensional coordinates of the hand key point are acquired in real time from the data-glove sensor information and converted into the virtual space coordinate system to obtain the observation vector Z_t at the current time t; and the motion state of the hand key point at time t is obtained by Kalman filtering fusion as X_t = X̂_t + K_t · (Z_t - H · X̂_t), wherein K_t is the Kalman gain at time t and H is the transformation matrix between the state and the observation.
4. The method according to claim 1, wherein in step 6, the deep learning network model trained in step 1 is used to detect the hand key points in the image, and the hand key points are converted into the virtual space coordinate system; the resulting state vector of a hand key point at the current time t is denoted Z_t', the state vector of the hand key point at time t output in step 5 is used as the state estimate X̂_t, and the state vector of the hand key point at time t is obtained by Kalman filtering fusion as X_t' = X̂_t + K_t · (Z_t' - H · X̂_t), wherein K_t is the Kalman gain at time t and H is the transformation matrix between the state and the observation.
5. The method according to claim 1, wherein in step 6, when the hand key points cannot be recognized from the photographed image, the state of the hand key points output in step 5 is input to step 7.
6. The method of claim 1, wherein step 8 comprises:
(1) Detecting the outer edge outline of an operation component area in the image and the characteristic points on the operation component through a positioning template, matching with the positioning template, converting the matching result into a virtual space coordinate system, and obtaining the positions of the characteristic points on the operation component;
(2) Calculating the relative posture of the feature point on the operation component and the hand, comprising:
(2.1) calculating the optimal estimated positions of the fingertip hand key points relative to the operating component by the least squares method, and fitting the positions of the other hand key points through an inverse kinematics model;
(2.2) performing Kalman filtering fusion on the positions of the hand key points obtained in the step (2.1) and the positions of the hand key points output in the step (5) or the step (6), and updating the positions of the hand key points;
(2.3) calculating the distance between the output position of the key point of the hand and the operation part in the step (2.2), and when the calculated distance is smaller than a preset distance threshold value, performing collision detection on the hand model and the operation part, and attaching a hand joint to the operation part;
and (2.4) capturing the image of the operating component in real time with the camera mounted on the helmet display, detecting the feature points on the operating component, selecting the feature points above the confidence threshold, recalculating the hand key-point positions as in step (2.1), and adjusting the posture of the hand model.
CN202211316738.XA 2022-10-26 2022-10-26 Hand interaction method for mixed reality flight simulator Pending CN115576426A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211316738.XA CN115576426A (en) 2022-10-26 2022-10-26 Hand interaction method for mixed reality flight simulator

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211316738.XA CN115576426A (en) 2022-10-26 2022-10-26 Hand interaction method for mixed reality flight simulator

Publications (1)

Publication Number Publication Date
CN115576426A true CN115576426A (en) 2023-01-06

Family

ID=84586546

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211316738.XA Pending CN115576426A (en) 2022-10-26 2022-10-26 Hand interaction method for mixed reality flight simulator

Country Status (1)

Country Link
CN (1) CN115576426A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116309850A (en) * 2023-05-17 2023-06-23 中数元宇数字科技(上海)有限公司 Virtual touch identification method, device and storage medium
CN116309850B (en) * 2023-05-17 2023-08-08 中数元宇数字科技(上海)有限公司 Virtual touch identification method, device and storage medium
CN116880687A (en) * 2023-06-07 2023-10-13 黑龙江科技大学 Suspension touch method based on monocular multi-algorithm
CN116880687B (en) * 2023-06-07 2024-03-19 黑龙江科技大学 Suspension touch method based on monocular multi-algorithm

Similar Documents

Publication Publication Date Title
Kang et al. Toward automatic robot instruction from perception-temporal segmentation of tasks from human hand motion
Lien et al. Model-based articulated hand motion tracking for gesture recognition
KR101865655B1 (en) Method and apparatus for providing service for augmented reality interaction
CN110570455B (en) Whole body three-dimensional posture tracking method for room VR
Erol et al. A review on vision-based full DOF hand motion estimation
EP2904472B1 (en) Wearable sensor for tracking articulated body-parts
Ueda et al. A hand-pose estimation for vision-based human interfaces
KR101652535B1 (en) Gesture-based control system for vehicle interfaces
Erol et al. Vision-based hand pose estimation: A review
Chua et al. Model-based 3D hand posture estimation from a single 2D image
JP5695758B2 (en) Method, circuit and system for human machine interface with hand gestures
JP5931215B2 (en) Method and apparatus for estimating posture
CN115576426A (en) Hand interaction method for mixed reality flight simulator
Tao et al. A novel sensing and data fusion system for 3-D arm motion tracking in telerehabilitation
Nickels et al. Model-based tracking of complex articulated objects
JP2014501011A5 (en)
CN105159452B (en) A kind of control method and system based on human face modeling
CN108334198B (en) Virtual sculpture method based on augmented reality
Maycock et al. Robust tracking of human hand postures for robot teaching
Scherfgen et al. Estimating the pose of a medical manikin for haptic augmentation of a virtual patient in mixed reality training
Kang et al. A robot system that observes and replicates grasping tasks
El-Sawah et al. A framework for 3D hand tracking and gesture recognition using elements of genetic programming
Usabiaga et al. Global hand pose estimation by multiple camera ellipse tracking
KR102456872B1 (en) System and method for tracking hand motion using strong coupling fusion of image sensor and inertial sensor
Shah et al. Gesture recognition technique: a review

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination