CN111460976A - Data-driven real-time hand motion evaluation method based on RGB video - Google Patents

Data-driven real-time hand motion evaluation method based on RGB video

Info

Publication number
CN111460976A
Authority
CN
China
Prior art keywords
hand
evaluation
video
module
real
Prior art date
Legal status
Granted
Application number
CN202010237076.1A
Other languages
Chinese (zh)
Other versions
CN111460976B (en)
Inventor
李冕
王天予
王毅杰
Current Assignee
Shanghai Pnx Information Technology Co ltd
Shanghai Jiaotong University
Original Assignee
Shanghai Pnx Information Technology Co ltd
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Pnx Information Technology Co ltd, Shanghai Jiaotong University filed Critical Shanghai Pnx Information Technology Co ltd
Priority to CN202010237076.1A priority Critical patent/CN111460976B/en
Publication of CN111460976A publication Critical patent/CN111460976A/en
Application granted granted Critical
Publication of CN111460976B publication Critical patent/CN111460976B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G06V40/28 - Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 - Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A - TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 - Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 - Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A data-driven real-time hand motion evaluation method based on RGB video, belonging to the field of human behavior analysis based on video processing. The method comprises a hand posture estimation unit and an action evaluation unit. The hand posture estimation unit extracts hand key-point coordinates from each frame image; the action evaluation unit predicts a score for the quality of the hand motion and gives suggestions on how to improve the score. Posture estimation and organization are performed by a deep-learning-based method, followed by action quality evaluation. The method ensures accurate matching between the extracted features and the details of the human hand in the real scene while the camera viewing angle changes continuously, improves the computational efficiency of the overall action recognition and evaluation, realizes real-time virtual reconstruction of hand motion, evaluates human hand motion accurately and in real time, and improves the accuracy and robustness of the overall action evaluation. The method can be widely applied in fields such as vision-based hand posture estimation and motion quality evaluation.

Description

Data-driven real-time hand motion evaluation method based on RGB video
Technical Field
The invention belongs to the field of human behavior analysis based on video processing, and particularly relates to a real-time hand motion evaluation method based on RGB video.
Background
In recent years, rapid developments in computer vision have produced a number of reliable methods for object detection and motion recognition from images and videos. On this basis, academia has gradually started to explore video-based human motion quality assessment.
At present, considerable progress has been made on macroscopic body motions.
In the paper "Assessing the Quality of Actions" (Hamed Pirsiavash, Carl Vondrick, and Antonio Torralba. 2014. In 2014 European Conference on Computer Vision (ECCV). Springer International Publishing, 556-571), a method based on linear support vector regression (L-SVR) was proposed that is trained on low-level and high-level spatio-temporal features.
In the paper "Kinect-based body-building exercise identification and evaluation" (Wang Yi et al., Computer Science and Application, July 2018, 1134-1145), a KNN-based fine-tuning method is proposed, which classifies the performance of moving subjects according to the similarity between static skeleton data and template models.
However, these existing methods cannot ensure a consistent one-to-one match between the extracted human features and the actual physical parts shown in the video. For example, in the paper "Assessing the Quality of Actions", a change in camera angle can map the real right leg onto the features of the left leg in the image. When human hands are involved instead of the body, great difficulties arise: the cooperation within one hand and between two hands is more complicated, and the impact of such matching errors on the evaluation is not negligible.
Therefore, exploration of hand motion evaluation remains very limited; to date, no method has been proposed for the corresponding assessment of hand motions.
In fact, the evaluation of hand motion is substantially different from the evaluation of human motion in the general sense.
For those hand-based actions, performance is typically dependent on the details in the hand gesture. For example, trainees may be evaluated for surgical training based in part on their hand posture (e.g., a scalpel-holding posture). Also, hand movements involve both hands of a person and different states of each hand (front and back, etc.), which need to be recognized when making an assessment in order to constitute a reasonable feature.
Video-based hand motion quality assessment is important, and many scenarios require this technology to enable automated assessment of hand-based training. The assessment typically comprises two parts: a performance score, and an indication of which part of the gesture most needs improvement and how to adjust it. Trainees can autonomously improve their performance according to the feedback provided by the technology, which addresses the common absence or shortage of expert instruction. In addition, since the technique is camera-based, the trainee is freed from wearable sensors and can train in a more realistic and natural manner.
The evaluation of hand motion is based on hand pose estimation. Hand pose estimation refers to a process and method of extracting two-dimensional or three-dimensional coordinates of each joint of a hand to estimate a pose.
Conventional pose estimation methods, whether generative or discriminative, depend on RGB-D (RGB + depth map) pictures or video acquired by a depth camera, which results in very high implementation cost and high requirements on device performance.
In recent years, the academia has proposed some effective methods based on deep learning that rely only on common RGB images and video. This method generally comprises three parts: hand segmentation, pose estimation, and refinement of estimated poses.
However, these methods still have some unsolved problems. First, they pay little attention to computational efficiency. For practical application scenarios, real-time online evaluation is a very important requirement, which places high demands on computational efficiency. In addition, when refining the estimated pose, the integrity of the two hands is not considered. Integrity here means mapping the two hands in the video to the real left and right hands while ensuring that the extracted hand features correspond correctly to each finger joint in the real situation.
In addition to developments in academia, many patent documents propose techniques related to hand motion analysis. For example, the Chinese invention patent with publication number CN 105160323 B, published on 2018-11-27, discloses a gesture recognition method aimed at quickly and accurately recognizing a user's gesture based on the depth information and color information of an image. That technical solution relies heavily on preset information: a preset hand structure template determines the sequence of feature points to be detected on the hand contour, a preset feature point sequence is matched with action names and positions, and a gesture table is matched with the gesture. As a result, scene changes require modification of the preset templates. Moreover, the solution is based on images containing depth information and relies on the depth information to extract the required feature points, which adds extra requirements on the data acquisition equipment.
As another example, the Chinese invention patent with publication number CN 103034851 B, published on 2015-08-26, discloses a hand tracking device and method based on a self-learning skin color model, which provides a precise hand motion recognition method based on color-depth information and scale-invariant feature transform that maintains stability and accuracy when the hand is disturbed or occluded. However, that solution realizes tracking based on the hand contour and fingertip positions, which is not sufficient for hand motion evaluation; features reflecting the complete gesture are required for that purpose. The solution also relies on depth information to extract the required feature points. Meanwhile, it identifies the hand motion type in the video by comparison against reference samples, so reference samples must be established at considerable cost for each new scene.
The technical solutions in the above patent documents are mainly directed to hand recognition and tracking and do not address real-time hand motion evaluation. Meanwhile, these solutions analyze macroscopic characteristics of the hand and do not capture detailed information of the hand structure (such as the state or motion change of a particular finger). Further, they do not consider the incorrect correspondence caused by changes of the camera angle, i.e. a stable camera angle is assumed.
How to ensure accurate matching between the extracted features and the hand details in the real scene while the camera viewing angle changes continuously, and how to improve the computational efficiency, accuracy and robustness of the overall action recognition and evaluation, are therefore pressing problems in real-time hand motion evaluation.
Disclosure of Invention
The invention aims to provide a data-driven real-time hand motion evaluation method based on RGB video. Aiming at the research gap of hand motion evaluation in the field of video-based human behavior analysis, the problem of incorrect correspondence between features and real physical parts in human motion evaluation, and the real-time requirements of a motion evaluation system, a real-time hand motion recognition and evaluation method based on RGB video is provided. The method ensures accurate matching between the extracted features and the details of the human hand in the real scene while the camera viewing angle changes continuously, improves the computational efficiency of the overall action recognition and evaluation, realizes real-time virtual reconstruction of hand motion, evaluates human hand motion accurately and in real time, and improves the accuracy and robustness of the overall action evaluation.
The technical scheme of the invention is as follows: the data-driven real-time hand motion evaluation method based on the RGB video is characterized by comprising the following steps of:
1) acquiring a video to be recognized of a hand;
2) performing hand region segmentation on the video to be identified;
3) calculating the probability of selecting each position as a key point according to the segmented hand area to obtain the position of the two-dimensional hand key point;
4) predicting three-dimensional hand key point positions according to the extracted two-dimensional hand key point positions;
5) recognizing the state of the hand according to the skin color and the structure of the hand, correspondingly adjusting the gesture characteristics, and obtaining space-time characteristics for the whole video;
6) based on the space-time characteristics, three models of a long short-term memory network, a discrete cosine transform and support vector classifier and a discrete Fourier transform and support vector classifier are used for comparison and verification, and the quality level of the hand motion is predicted;
7) the gesture estimation and organization are carried out through a deep learning-based method, real-time virtual reconstruction of hand motions is achieved, and quality evaluation of the hand motions is carried out.
Wherein the gesture estimation and organization at least comprises: for conventional RGB video containing image frames showing both hands of a person, the pose of both hands is extracted as a feature of each still frame.
The gesture estimation and organization are realized through a hand segmentation module, a 2D hand posture estimation module, a 3D hand posture estimation module and a hand posture organization module.
Specifically, the hand segmentation module is used for identifying and segmenting an area where a human hand is located in each frame of image, and constructing a model by using data from an Egohands data set;
the 2D hand posture estimation module is used for extracting two-dimensional coordinate information of each joint point of the hand, and obtaining a two-dimensional hand posture by utilizing a key point score map of the probability that each pixel point is selected as a key point;
the 3D hand posture estimation module is used for lifting the two-dimensional hand posture extracted by the previous module to three-dimensional, and predicting relative and standardized three-dimensional coordinates according to incomplete and noisy key point score maps obtained from the 2D hand posture estimation module;
the hand gesture organization module is used to distinguish the left and right hands in each frame of the video, and the different geometric states of each hand, and then the original rough gesture should be adjusted to conform to the actual situation.
Furthermore, the hand segmentation module greatly improves its computational efficiency, without affecting accuracy, by cropping a flexible area that covers the part to be segmented.
Specifically, the computational efficiency is represented by a reconstruction computation cost RCC and an evaluation computation ratio ACR;
the reconstruction calculation cost RCC is used for representing the calculation time of each frame of gesture estimation;
the evaluation calculation ratio ACR is used for representing the ratio of the calculation time of the action quality evaluation to the video duration;
the reconstruction computation cost RCC quantifies the degree of synchronization of the virtual representation of the action with the real action;
the evaluation calculation measures the degree of providing timely evaluation and feedback over ACR.
Further, the quality evaluation of the hand motion is realized through a performance evaluation module and a feedback indication module.
Further, the performance evaluation comprises: the three-dimensional relative position of the joints is used as a feature for each frame, and then a model is built to analyze spatiotemporal information of the whole video and output scores.
Further, the feedback indication includes: the feedback provided instructs the trainee how each of his static postures should be adjusted, giving the joints that need to be adjusted most and the corresponding adjustment direction that maximizes the final score, achieving the maximum improvement of the trainee's final score by establishing a causal relationship between the actions and the score.
The feedback indication is achieved by maximizing the gradient of the final score with respect to the features of each frame.
Compared with the prior art, the invention has the advantages that:
1. the technical scheme integrally solves the problem of lack of real-time supervision of experts in a training scene by realizing automatic hand action evaluation based on the RGB video, and greatly improves the training efficiency;
2. the technical scheme improves the overall operational efficiency by improving the operational efficiency during hand segmentation and a multi-thread parallel operation framework (video reading, video processing and feature organization), provides opportunities for real-time virtual reconstruction of hand actions, achieves the effect of basically synchronizing with the video playing speed in feature extraction, and achieves the effect of timely giving feedback after video playing in action evaluation;
3. according to the technical scheme, the step of characteristic organization is added in the characteristic extraction, so that the matching accuracy of the extracted characteristics and the hand details in the real scene in the continuous change process of the visual angle of the camera is improved, and the accuracy and the robustness of the whole action evaluation are improved.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention;
FIG. 2 is a schematic structural diagram of a hand segmentation module according to the present invention;
FIG. 3 is a schematic diagram of the structure of a two-dimensional pose estimation module of the present invention;
FIG. 4 is a schematic diagram of the structure of the three-dimensional pose estimation module of the present invention;
FIG. 5 is a schematic diagram of a hand gesture integration module according to the present invention;
FIG. 6 is a schematic diagram of the structure of the action quality assessment module according to the present invention;
FIG. 7 is a schematic flow chart of the present invention for determining the geometric status of the left hand;
FIGS. 8a, 8b, and 8c are diagrams of graphical effects in an embodiment of the present invention;
FIG. 9 is a plot of the mean and standard deviation of the RCC index for the poor performance level on the origami data set;
FIG. 10 is a plot of the mean and standard deviation of the RCC index for the medium performance level on the origami data set;
FIG. 11 is a plot of the mean and standard deviation of the RCC index for the good performance level on the origami data set;
FIG. 12 is a schematic of the top 5 joints by number of occurrences for the poor performance level on the origami data set;
FIG. 13 is a schematic of the top 5 joints by number of occurrences for the medium performance level on the origami data set;
FIG. 14 is a schematic of the top 5 joints by number of occurrences for the good performance level on the origami data set.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
An image containing depth information (also called image depth) refers to the number of bits used to store each pixel, which is also used to measure the color resolution of the image. It determines the number of colors each pixel of a color image may have or determines the number of gray levels each pixel of a gray scale image may have, which determines the maximum number of colors that may appear in a color image or the maximum gray scale level in a gray scale image. Although the pixel depth or image depth may be deep, the color depth of various display devices is limited. For example, a standard VGA supports 4-bit 16-color images, and at least 8-bit 256 colors are recommended for multimedia applications. Due to device limitations, coupled with limitations on human eye resolution, typically, a particularly deep pixel depth is not necessarily sought. Furthermore, the deeper the pixel depth, the more storage space is occupied. Conversely, if the pixel depth is too shallow, that also affects the quality of the image, which appears to be rough and unnatural to humans.
Aiming at the current situation that the existing three-dimensional hand posture estimation method cannot identify the left hand and the right hand simultaneously and the video-based action automatic evaluation mainly focuses on the body action, the technical scheme of the invention provides a data-driven RGB-video-based real-time hand action evaluation method.
The technical scheme of the invention mainly comprises two main components:
1) a hand posture estimation unit: comprising a hand segmentation module, a two-dimensional pose estimation module, a three-dimensional pose estimation module and a hand pose integration module, used for extracting hand key-point coordinates from each frame image;
2) an action evaluation unit: comprising a long short-term memory network model, a discrete cosine transform + support vector classifier, and a discrete Fourier transform + support vector classifier, used for predicting the score of hand motion quality and giving suggestions on how to improve the score;
the hand segmentation module of the hand posture estimation unit can lay a solid foundation for hand posture estimation. Applying hand segmentation in real scenarios requires careful consideration of computational efficiency. However, many existing methods do not meet these requirements. The performance of these methods can be drastically degraded if the background is complex or the skin tone changes dramatically. At the same time, these methods ignore the problem of computational efficiency.
In the prior art, hand segmentation is realized by a model built with data from the Egohands dataset, which has high-quality annotations and 4800 images containing hands in 48 different environments. The model is trained and applied to each frame of the video. It should be noted that cropping the segmented part from the original image is usually a speed bottleneck, because this operation requires identifying all pixels on the boundary; therefore the solution of the invention does not crop along the exact boundary of the detected box, but instead crops a flexible area covering the detected box.
The two-dimensional hand pose estimation module implements this step according to an encoder-decoder structure. The 2D hand pose is estimated using a keypoint score map, which represents the probability of each position being selected as a keypoint. The initial score map is predicted from image features generated by an encoder. For a specific implementation of the two-dimensional hand pose estimation module using the encoder-decoder structure, reference may be made to the paper "Convolutional Pose Machines" (Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. 2016. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Press, 4724-4732).
The three-dimensional hand pose estimation module estimates the relative, normalized three-dimensional coordinates using the PosePrior network from the paper "Learning to Estimate 3D Hand Pose from Single RGB Images" (Christian Zimmermann and Thomas Brox. 2017. In IEEE International Conference on Computer Vision (ICCV). IEEE Press, 4903-4911).
The hand pose integration module estimates a rough hand pose based on the previous steps.
Existing limb motion quality evaluations do not guarantee that the positions of the extracted limb key points correspond well to each part of the human body in the video. For example, players' left and right legs are sometimes confused due to large changes in camera angle. Such a mismatch degrades evaluation quality, so it is necessary to maintain a one-to-one match between the extracted spatial information and the actual situation. In the two-hand scenario of this technical solution, the left and right hands are distinguished, the different states of each hand in each frame of the video are further distinguished, and the original rough hand pose is then adjusted to be consistent with reality under different camera positions.
Fig. 1 is a flowchart of a method for evaluating a hand motion according to an embodiment of the present invention. The hand motion evaluation method comprises the following two key steps: and performing gesture estimation and organization through a deep learning-based method, and performing action quality evaluation.
Further, each frame of a given video first goes through four modules: hand segmentation, two-dimensional hand pose estimation, three-dimensional hand pose estimation, and hand pose organization. The static gestures then form a spatio-temporal dynamic action, and the whole dynamic action is passed through the evaluation module to judge its quality.
Still further, the evaluation module also provides corresponding feedback, indicating the most promising improvements.
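The overall flow can be illustrated with the following minimal Python sketch, assuming OpenCV for frame reading; the five stage functions (segment_hands, estimate_2d, lift_to_3d, organize_pose, score_model) are hypothetical placeholders standing in for the modules described above, not the patent's actual implementation.

```python
# Hypothetical per-frame pipeline sketch; the stage functions are placeholders
# for the hand segmentation, 2D/3D pose estimation, pose organization and
# evaluation modules, which are not reproduced here.
import cv2

def evaluate_video(path, segment_hands, estimate_2d, lift_to_3d, organize_pose, score_model):
    """Two-stage pipeline: per-frame pose extraction, then whole-video scoring."""
    cap = cv2.VideoCapture(path)
    features = []                                    # phi(t) for every frame t
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        regions = segment_hands(frame)               # crop a flexible area around each hand
        kp2d = estimate_2d(regions)                  # 2 x m two-dimensional keypoints
        kp3d = lift_to_3d(kp2d)                      # relative, normalized 3D coordinates
        features.append(organize_pose(kp3d, frame))  # left/right and state corrected
    cap.release()
    score, feedback = score_model(features)          # spatio-temporal evaluation + feedback
    return score, feedback
```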
Specifically, the first step is gesture estimation and organization. For image frames containing a person's two hands, the pose of both hands is extracted as a feature of each still frame. The features of the t-th frame are defined as a set of coordinates $p_j(t) = (x_j(t), y_j(t), z_j(t)),\ j \in [1, 2m]$, where the coordinates indicate the positions of the 2m key points corresponding to the joints of both hands (m is 21 in this example). There are four modules to implement this process: hand segmentation, 2D hand pose estimation, 3D hand pose estimation, and hand pose organization.
Fig. 2 is a schematic structural diagram of the hand segmentation module of the present invention, and it can be seen that the hand segmentation module includes a single-shot multi-box detector.
The hand segmentation module is used for identifying and segmenting the area where the human hand is located in each frame of image. The robust hand segmentation module lays a solid foundation for subsequent extraction of accurate hand gestures. Computational efficiency is also an important aspect in view of its application in real scenarios.
However, most of the existing methods do not meet these requirements. The performance of those methods drops dramatically if the background is unusual or the skin tone changes significantly. Moreover, they do not take into account the impact of computational efficiency carefully. The module constructs a model using data from the Egohands dataset with high quality annotations, with hands in 4800 images in 48 different environments.
The trained model is applied to each frame of the video stream. It can be noted that the bottleneck of computational efficiency is usually to crop the region to be segmented in the original image, since it is often necessary to identify all pixels on the boundary. The module does not cut along the exact boundaries of the detected box. Instead, the flexible area covering the part to be segmented is cropped. This allows a substantial increase in the computational efficiency of the module with little impact on accuracy.
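A minimal sketch of this flexible-area cropping, assuming the detector returns an axis-aligned box (x_min, y_min, x_max, y_max) and using an illustrative 15% margin (the patent does not specify the margin):

```python
import numpy as np

def crop_flexible_region(image, box, margin=0.15):
    """Crop a rectangular region that covers the detected hand box with a safety
    margin, instead of tracing the exact segmentation boundary pixel by pixel.
    `box` is assumed to be (x_min, y_min, x_max, y_max); `margin` is a guess."""
    h, w = image.shape[:2]
    x0, y0, x1, y1 = box
    dx, dy = margin * (x1 - x0), margin * (y1 - y0)
    x0, y0 = max(int(x0 - dx), 0), max(int(y0 - dy), 0)
    x1, y1 = min(int(x1 + dx), w), min(int(y1 + dy), h)
    return image[y0:y1, x0:x1]
```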
FIG. 3 is a schematic diagram of the structure of the two-dimensional pose estimation module of the present invention.
The two-dimensional hand posture estimation module is used for extracting two-dimensional coordinate information of each joint point of the hand.
This module is implemented according to an encoder-decoder architecture. Extracting two-dimensional coordinate information of each joint point of the hand, and converting the two-dimensional coordinate information into a task of finding out key points in the hand image. And obtaining the two-dimensional hand gesture by utilizing the key point score map of the probability that each pixel point is selected as the key point.
Specifically, for each hand, the pixel coordinate of the j-th joint is denoted $Y_j$, and the goal is to predict all coordinates $Y = (Y_1, \ldots, Y_m)$. The module comprises a series of multi-class predictors $g_t(\cdot)$. For each frame $t \in \{1, \ldots, T\}$, the corresponding predictor assigns a pixel coordinate $z$ of the image to each joint $Y_l$, based on the feature $x_z$ extracted at position $z$ and the context information from the previous classifier for the pixels neighboring each $Y_l$. The probability with which the classifier of frame $t$ assigns joint $l$ to position $z = (u, v)^{\top}$ is denoted $b_t^l(Y_l = z)$. Define $\psi_t(z, b_{t-1})$ as the mapping from $b_{t-1}$ to the context features. Then

$$g_t\bigl(x_z, \psi_t(z, b_{t-1})\bigr) \rightarrow \bigl\{ b_t^l(Y_l = z) \bigr\}_{l \in \{1, \ldots, m\}}.$$
By updating on all frames, a complete key point score map can be constructed, thereby extracting the two-dimensional coordinate information of all joint points.
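As a sketch of how 2D coordinates can be read off the completed score maps, assuming the maps are stacked as an array of shape (m, H, W):

```python
import numpy as np

def keypoints_from_score_maps(score_maps):
    """Given per-joint score maps of shape (m, H, W), where each map holds the
    probability of every pixel being that joint, return the (u, v) pixel
    coordinates with the highest score for each of the m joints."""
    m, h, w = score_maps.shape
    flat_idx = score_maps.reshape(m, -1).argmax(axis=1)
    v, u = np.unravel_index(flat_idx, (h, w))
    return np.stack([u, v], axis=1)      # shape (m, 2), one (u, v) per joint
```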
FIG. 4 is a schematic diagram of the three-dimensional pose estimation module according to the present invention.
The three-dimensional hand posture estimation module is used for lifting the two-dimensional hand posture extracted by the previous module to three dimensions. The modules predict relative and normalized three-dimensional coordinates from incomplete and noisy keypoint score maps obtained from previous modules.
First, based on the two-dimensional position information, a network is trained to predict the corresponding three-dimensional coordinates in a canonical frame; next, the transformation between the canonical frame and the relative frame is estimated. In particular, for the latter, a rotation matrix $R(w_{rel})$ needs to be estimated, which comprises two steps. First, a rotation $R_{xz}$ about the x-axis and the z-axis is found so that a reference joint is aligned with the y-axis of the canonical frame:

$$R_{xz}\, w^{c} = \lambda\,(0, 1, 0)^{\top}, \quad \lambda \ge 0.$$

Second, a rotation $R_y$ about the y-axis is calculated so that the remaining rotational degree of freedom about the y-axis is fixed, and the entire rotation matrix is the product of these two rotations, $R(w_{rel}) = R_y\, R_{xz}$. These estimates are all a matter of viewpoint estimation.
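A sketch of composing the viewpoint rotation as the product of the two rotations described above, under the assumption that the network outputs three rotation angles; the angle parametrization and composition order are illustrative, not the PosePrior network's exact formulation:

```python
import numpy as np

def rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_y(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def viewpoint_rotation(ax, az, ay):
    """Compose the full viewpoint rotation as R = R_y * R_xz, where R_xz first
    aligns a reference joint with the y-axis and R_y fixes the remaining
    rotation about the y-axis. The angle parametrization is an assumption."""
    r_xz = rot_z(az) @ rot_x(ax)
    return rot_y(ay) @ r_xz

# Canonical coordinates w_c (shape (m, 3)) would then map to relative
# coordinates via w_rel = w_c @ viewpoint_rotation(ax, az, ay).T
```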
Through the previous modules, a rough three-dimensional gesture has been estimated. In the prior art, the subsequent evaluation of human body actions is carried out directly on such a rough body pose, without confirming whether the extracted position information is consistent with each part of the human body shown in the video. For example, in some experiments based on Olympic motion data sets, the left and right legs of the athlete sometimes correspond incorrectly due to a large change in camera angle (the position information extracted from the right leg corresponds to features of the left leg). Since such matching errors reduce the accuracy of the subsequent evaluation, a hand pose organization module is added to ensure a consistent one-to-one match between the extracted joint position information and the actual situation.
Fig. 5 is a schematic structural diagram of the hand posture integration module according to the present invention.
As shown in fig. 5, the hand pose integration module is used to distinguish between the left and right hands in each frame of the video, and the different geometric states of each hand. The original coarse pose should then be adjusted to the actual situation, regardless of the condition of the camera.
In particular, although the hand segmentation module is trained on a first-person-view data set that distinguishes between left and right hands, the module employs a correction mechanism to increase its robustness. The part of the forearm to which each hand is attached is first detected in the image. If these regions extend to the lower boundary of the frame, the camera is in a first-person view, in which case the left hand segment in the image corresponds to the left hand and the right hand segment to the right hand. Otherwise, the camera is at the observer's viewing angle, and the correspondence is reversed.
Four geometric states are defined and distinguished for each hand: fist with the back facing up, fist with the palm facing up, open with the back facing up, and open with the palm facing up.
Fig. 7 shows the detailed algorithm for distinguishing the four geometric states of the left hand; the right-hand algorithm works on the same principle. The image of each hand segment is first converted from RGB to HSV color space and then processed with a slight Gaussian blur, so that human skin can be better distinguished from other objects with similar HSV values (which are treated as noise in this solution). After that, an elliptical region is sampled from the surface of the hand and its gray level is checked. Since the back of a person's hand tends to be darker than the palm, the front and back of the hand can be distinguished. Further, the perimeter and area of the hand are calculated to determine whether the hand is in a fist or a spread state: when both the area and the perimeter are larger than or equal to their respective thresholds, the hand is judged to be spread; otherwise, it is judged to be a fist. For the fist state, the fifteenth frame before the current frame (representing the hand state just before the fist was made) is used to calculate the gray value and judge front versus back, because the gray value of a clenched hand contains elements of both the front and the back and is therefore prone to misjudgment.
Considering the final effect, each state of the left hand is equivalent to the corresponding mirrored state of the right hand (for example, a left-hand fist with the back up corresponds to a right-hand fist with the palm up). Thus, all states of both hands reduce to different treatments of the left and right hands: for the left hand, the originally extracted three-dimensional coordinates are used directly; for the right hand, the originally extracted three-dimensional coordinates are flipped along the z-axis.
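The Fig. 7 heuristic and the right-hand coordinate flip can be sketched as follows with OpenCV; all thresholds, the ellipse size and the skin-mask input are illustrative assumptions, since the patent does not disclose concrete values:

```python
import cv2
import numpy as np

# Illustrative thresholds only; the patent does not state concrete values.
AREA_THRESH, PERIM_THRESH, GRAY_THRESH = 6000, 400, 110

def classify_hand_state(hand_bgr, skin_mask):
    """Heuristic sketch of the Fig. 7 logic: decide fist vs. open from the hand
    contour's area and perimeter, and front vs. back from the mean gray level
    of an elliptical patch sampled on the hand surface."""
    hsv = cv2.cvtColor(hand_bgr, cv2.COLOR_BGR2HSV)
    hsv = cv2.GaussianBlur(hsv, (5, 5), 0)          # slight blur to suppress HSV noise
    contours, _ = cv2.findContours(skin_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    cnt = max(contours, key=cv2.contourArea)
    area, perim = cv2.contourArea(cnt), cv2.arcLength(cnt, True)
    spread = area >= AREA_THRESH and perim >= PERIM_THRESH

    # Sample an elliptical patch at the contour centroid and check its gray level:
    # the back of the hand tends to be darker than the palm.
    gray = cv2.cvtColor(hand_bgr, cv2.COLOR_BGR2GRAY)
    m = cv2.moments(cnt)
    cx, cy = int(m["m10"] / m["m00"]), int(m["m01"] / m["m00"])
    patch_mask = np.zeros_like(gray)
    cv2.ellipse(patch_mask, (cx, cy), (20, 12), 0, 0, 360, 255, -1)
    palm_up = cv2.mean(gray, mask=patch_mask)[0] >= GRAY_THRESH

    return ("open" if spread else "fist") + ("_palm_up" if palm_up else "_back_up")

def unify_right_hand(kp3d_right):
    """Map right-hand features onto the left-hand convention by flipping the
    extracted 3D coordinates along the z-axis, as described above."""
    flipped = kp3d_right.copy()
    flipped[:, 2] *= -1
    return flipped
```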
Specifically, the second step is hand motion assessment.
Fig. 6 is a schematic structural diagram of the action quality evaluation module according to the present invention.
Based on the three-dimensional poses of the two hands organized, a quality assessment of the hand movements shown in the video can be made. The evaluation includes two modules: performance assessment and feedback indication. For the former, the three-dimensional relative position of the joint (from left hand to right hand) is used as a feature for each frame. A model is then built to analyze the spatiotemporal information of the entire video and output scores. For the latter, feedback is provided indicating how the trainee should adjust each of its static postures to achieve the maximum improvement in the final score. This is achieved by maximizing the gradient of the final score with respect to the features of each frame.
For the first module, the performance evaluation module, the feature of the j-th joint in the t-th frame is $p_j(t) = [x_j(t), y_j(t), z_j(t)],\ j \in [1, 2m]$, all normalized with respect to the palm center. The features of all 2m joints (both hands) are then concatenated, yielding $\phi(t) = [p_1(t), \ldots, p_{2m}(t)]$, a high-level representation of the action in each frame.
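A minimal sketch of assembling $\phi(t)$, assuming each hand is given as an m x 3 array and the palm-center joint has index 0 (an assumption):

```python
import numpy as np

def frame_feature(left_joints, right_joints, palm_idx=0):
    """Build phi(t): normalize each hand's m x 3 joint coordinates relative to
    its palm-center joint (index assumed to be 0), then concatenate both hands
    into a single 2m x 3 array that is flattened for the downstream models."""
    left = left_joints - left_joints[palm_idx]
    right = right_joints - right_joints[palm_idx]
    return np.concatenate([left, right], axis=0).reshape(-1)   # length 6m
```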
There are two main approaches to automatic evaluation: learning and comparison against typical examples. The former poses a machine learning problem in which labeled data is collected from experts (an expert database) to train a scoring model, which then generalizes to complete the task. The latter compares the observed video with baseline videos of the hand motion to assess its quality. This module uses the first approach, because the second performs poorly unless many ideal demonstrations are available; specifically, to ensure that the scores are unbiased, the second method must include a large number of well-performed actions as references.
Given that static gestures are interrelated in completing an action, it is necessary to explore the spatiotemporal features of the entire video.
In the technical scheme of the invention, three models of a long-short term memory network, a discrete cosine transform and support vector classifier and a discrete Fourier transform and support vector classifier are used for comparison and verification.
For the first model, the long short-term memory network, this task has similarities to other time-series data, for which a recurrent neural network augmented with long short-term memory units has proven to perform well; therefore, the temporal aspect of the actions is modeled using a 1-layer LSTM network.
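A minimal PyTorch sketch of such a 1-layer LSTM scorer over the per-frame features; the hidden size and the use of PyTorch are illustrative choices, not the patent's configuration:

```python
import torch
import torch.nn as nn

class LSTMScorer(nn.Module):
    """Single-layer LSTM over the per-frame features phi(t), followed by a fully
    connected layer that outputs probabilities for the three performance levels.
    The hidden size is an illustrative choice, not the patent's setting."""
    def __init__(self, feat_dim, hidden=128, n_levels=3):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=1, batch_first=True)
        self.head = nn.Linear(hidden, n_levels)

    def forward(self, x):                  # x: (batch, T, feat_dim)
        _, (h, _) = self.lstm(x)           # h: (1, batch, hidden), last time step
        return torch.softmax(self.head(h[-1]), dim=-1)
```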
For the second model, discrete cosine transform + support vector classifier, a discrete-time cosine transform is applied to the time-series features to obtain a compact representation in the frequency domain. A defined number of low-frequency components is then passed to a support vector classifier (regressor) to give a score.
For the third model, discrete Fourier transform + support vector classifier, the discrete-time cosine transform is simply replaced by a discrete-time Fourier transform.
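The two frequency-domain variants can be sketched as follows, assuming SciPy and scikit-learn; the number of retained low-frequency components is an illustrative value:

```python
import numpy as np
from scipy.fft import dct
from sklearn.svm import SVC

def frequency_features(video_feats, n_low=10, transform="dct"):
    """Apply a discrete cosine (or Fourier) transform along the time axis of the
    (T, feat_dim) feature sequence and keep only the first n_low low-frequency
    components per dimension; n_low is an illustrative value."""
    if transform == "dct":
        spec = dct(video_feats, axis=0, norm="ortho")[:n_low]
    else:
        spec = np.abs(np.fft.rfft(video_feats, axis=0))[:n_low]
    return spec.reshape(-1)

# Usage sketch: X = [frequency_features(f) for f in all_videos]; y = levels
# clf = SVC(kernel="rbf").fit(X, y); clf.predict(...) gives the performance level.
```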
The network is trained to minimize the mean absolute error (MAE) between the target score and the predicted score:

$$\mathrm{MAE} = \frac{1}{N} \sum_{n=1}^{N} \bigl| S_n - \hat{S}_n \bigr|,$$

where $S_n$ and $\hat{S}_n$ are, respectively, the true score and the score obtained from the predicted probability vector for video $n$.
For the second module, the feedback indication module: in addition to the performance evaluation quantifying the effect of the gestures, feedback must be provided to indicate how the trainee should adjust the gestures to improve the overall score.
The module computes the gradient of the output score with respect to the features extracted in the last frame ($t_0$). Because the LSTM model captures the important spatio-temporal features over the course of the video and ignores relatively trivial ones, the output gate at frame $t_0$ is denoted as

$$O(t_0) = \sigma\bigl(W_o\, \phi(t_0) + U_o\, O(t_0 - 1) + b_o\bigr),$$

where $\sigma(\cdot)$ denotes the sigmoid function, $W_o$ and $U_o$ are shared parameter matrices, and $b_o$ is the bias term. For simplicity, write

$$Q_m(t_0) = \sigma\bigl(W_m\, \phi(t_0) + U_m\, O(t_0 - 1) + b_m\bigr),$$
$$K_m(t_0) = \tanh\bigl(W_m\, \phi(t_0) + U_m\, O(t_0 - 1) + b_m\bigr).$$

The gradient of the overall predicted probability vector $\hat{S}$ with respect to the input $\phi(t_0)$ is then computed through the fully connected output layer and the LSTM gates, where $H(\cdot)$ is the derivative of the softmax function, $W'$ and $b'$ are the weight matrix and bias term of the fully connected layer, and

$$A = \tanh\bigl(K_a(t_0)\, Q_i(t_0) + Q_f(t_0)\, h(t_0 - 1)\bigr).$$

The calculated gradient can be represented as a $3 \times 6m$ matrix. The goal is to select the largest element of the row vector that corresponds to the good-performance class; the result reflects the joint that most needs to be adjusted and the corresponding adjustment direction that maximizes the improvement of the final score. Further, the corresponding values for frames $t = 1, 2, \ldots, t_0 - 1$ can also be calculated by back-propagation through the LSTM.
Thus, a causal relationship between the action and the score may be established.
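The gradient-based feedback can be sketched with automatic differentiation, assuming a trained model such as the LSTM scorer above that maps a (1, T, 3·2m) feature tensor to class probabilities; the class index of the good level and the joint count are assumptions:

```python
import torch

def feedback_for_last_frame(model, feats, good_idx=2, m_joints=42):
    """Compute d(probability of the good-performance class)/d(phi(t0)) by
    back-propagation through the trained model, reshape it to (m_joints, 3),
    and return the joint with the largest gradient magnitude plus the sign of
    its gradient as the adjustment direction."""
    x = feats.clone().detach().requires_grad_(True)        # (1, T, 3*m_joints)
    probs = model(x)
    probs[0, good_idx].backward()
    grad_last = x.grad[0, -1].reshape(m_joints, 3)          # gradient w.r.t. phi(t0)
    joint = int(grad_last.norm(dim=1).argmax())
    direction = torch.sign(grad_last[joint])                 # move each coordinate this way
    return joint, direction
```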
To our knowledge, there is currently no public data set for hand motion assessment; most related datasets deal with gesture recognition and body-based motion quality assessment. We therefore built our own origami video data set.
The task of hand motion assessment requires the capture of clear motions to fully reflect the performer's gestures.
Clearly, medical surgery is a good choice. However, it is almost impossible to view and record those surgical videos on a large scale. Furthermore, the skills exhibited in medical procedures are typically reflected on the tools that are operated, rather than on the hands themselves.
Thus, to demonstrate the effect of the method of the invention, only one basic origami action is selected: folding a square sheet into 8 × 8 small squares.
Experts classify performance into three levels according to the following rules:
1. A good level requires the paper to be folded very carefully during the operation: when the paper is folded lengthwise, the two edges must overlap strictly; in addition, the creases should be thin and clear, so that after being folded four times the paper can still be bent without breaking. These strict requirements add some extra steps to the conventional procedure.
2. A medium level means the paper is folded with relative care; however, it is not required that all edges overlap exactly in the process, and the final square edges are clear but not particularly precise.
3. A poor level means the paper is folded very carelessly, without regard to whether the two edges strictly overlap; the final edges of the squares are ambiguous and indistinguishable, and the basic strategy is simply to keep folding the sheet lengthwise without stopping to check the edges.
144 short videos of paper folding actions were collected, of which 44 are labeled by the experts as good actions, 66 as medium actions, and 34 as poor actions. The data set is divided into a training set and a test set at a 5:1 ratio, and each labeled action level is guaranteed to be distributed between the two sets in equal proportion.
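A sketch of such a stratified 5:1 split with scikit-learn; the numeric label encoding is illustrative:

```python
from sklearn.model_selection import train_test_split

# 144 clips labeled good / medium / poor; a 5:1 split kept stratified so each
# level appears in the same proportion in both sets (label values are illustrative).
videos = list(range(144))
labels = [2] * 44 + [1] * 66 + [0] * 34          # good, medium, poor
train_v, test_v, train_y, test_y = train_test_split(
    videos, labels, test_size=1 / 6, stratify=labels, random_state=0)
```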
The Bayesian optimization method provides the best hyper-parameter set for the three evaluation models on the training set.
Accuracy and computational efficiency are two crucial considerations in performance assessment.
The accuracy of the aforementioned three models ("long-short term memory network", "discrete cosine transform + support vector classifier", and "discrete fourier transform + support vector classifier") on the test set was compared under different criteria (see table 1 below) and classes (see table 2 below).
The LSTM gives the highest accuracy but the lowest AUC: although it predicts the data of the good performance level well, its performance decreases at the other levels (especially the poor level).
The underlying reason is that the LSTM is sensitive to time-domain characteristics (e.g., time span and motion phases), and good-level performances are distinctive in these respects.
It can be concluded that the LSTM is more suitable for time-sensitive actions, whereas DCT + SVC shows more advantages in evaluating actions that strictly follow standard rules.
TABLE 1: accuracy of the three models on the test set under different criteria (reproduced as an image).
TABLE 2: accuracy of the three models on the test set by performance class (reproduced as an image).
Computational efficiency is also very important for video-based methods, especially when they are applied in real scenes. Given that the solution of the invention is applied to training systems aimed at facilitating virtual reality, computational efficiency should be ensured so that real-time hand reconstruction and timely action quality assessment can be performed.
Therefore, the technical scheme defines two new indexes: the reconstruction computation cost (RCC), representing the computation time of the gesture estimation for each frame, and the evaluation calculation ratio (ACR), representing the ratio of the computation time of the motion quality evaluation to the video duration. Since the extraction and organization of hand gestures proceed as the video motion progresses, the RCC quantifies how synchronized the virtual representation of the motion is with the real motion; the ACR, on the other hand, measures the degree to which timely evaluation and feedback are provided.
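A sketch of how the two indexes could be measured, assuming per-frame pose-estimation times and the evaluation time have been collected with a wall-clock timer:

```python
import time

def timed(fn, *args):
    """Return (result, elapsed_seconds) for one call."""
    t0 = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - t0

def rcc_and_acr(frame_times, eval_time, fps):
    """RCC: average per-frame pose-estimation time over a video of len(frame_times) frames.
    ACR: evaluation time divided by the video duration (frame count / fps)."""
    rcc = sum(frame_times) / len(frame_times)
    acr = eval_time / (len(frame_times) / fps)
    return rcc, acr
```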
The experiments were performed on a computer equipped with CPU Intel Xeon Bronze 3106, GPU 1080Ti and a memory size of 16 GB. The mean and variance of the reconstruction computation cost RCC for each video in the data set were studied for each performance level of the action (see fig. 9 to 11).
In FIGS. 9, 10 and 11, the figure labels are Time, Mean, Variance, and Video ID (video number).
It has been found that this value fluctuates mainly around 0.08 s. This illustrates that for normal 12 frames per second (fps) video (0.083 s per frame), a well-organized gesture is nearly synchronized with the motion being performed. For higher fps video, RCC can be improved by skipping frames, i.e., disregarding those of poor quality in each sliding window. The trade-off between RCC and accuracy can be balanced according to specific requirements.
Thus, this approach provides the opportunity for real-time virtual reconstruction of hand movements.
Further, the average values of the videos of the three action performance levels on the evaluation calculation ratio ACR were 0.23, 0.077, and 0.11, respectively. This indicates that this method can provide feedback in a short time after the action is completed.
In addition to performance assessment, feedback is also shown for each frame on how to adjust the gesture.
Several examples are shown in fig. 8a to 8 c. FIGS. 8a and 8b show the regions of two hands obtained after the gesture estimation and organization steps; fig. 8c shows the joint that requires the most adjustment from all joints in both hands (i.e., the joint with the greatest amount of gradient, which is indicated by the circle in fig. 8c, and the direction of the greatest gradient by the arrow).
Further, joints that often present problems when performing a particular action may be studied. This may provide an incentive for the manager to better refine the training program. Specifically, for each hand, the total number of times each joint is selected as most needed to be adjusted throughout the video is recorded.
The five joints with the highest totals for each hand at each performance level are shown in FIGS. 12 to 14, reflecting the joints most in need of adjustment.
In FIGS. 12, 13 and 14, the figure labels are Number of Occurrences, Thumb, Ring (ring finger), Palm (palm center), Index (index finger), Pinky (little finger), and Middle (middle finger).
Where fig. 12 corresponds to a bad performance level, fig. 13 corresponds to a medium performance level, and fig. 14 corresponds to a good performance level. The left side of each figure represents the left-hand case and the right side represents the right-hand case. By observing these figures, some interesting insight can be drawn about the hands of the action performers.
It can be observed that the tip of the thumb appears far more often than almost all other joints, indicating its importance in origami. The underlying reason is that this joint contributes the most to clearly identifiable creases (a core part of the evaluation rules).
The ring finger is used to help wrap the folds from behind. It shows more problems on the left hand than on the right hand, which suggests that the paper folders are more accustomed to using their right hands.
The index finger is another important finger in paper folding, appearing in the top five for the left hand in all three cases. This conveys the same information about the paper folders as the previous paragraph.
Since the totals for the left and right hands are extremely unbalanced in the poor-performance case, it can be concluded that a lack of dexterity in the left hand of a paper folder directly leads to a lower final performance level.
In summary, the technical solution of the present invention mainly includes two major components: hand pose estimation (feature extraction) and motion estimation.
For the first part, the technical scheme of the invention operates on each frame of a video: first, a convolutional neural network is used to segment the person's two hands in the image, and when cropping the segmented region only a small number of parameters are used instead of all pixel points, which improves computational efficiency; second, the key points of each hand are detected with a convolutional pose machine, and the two-dimensional joint coordinates are extracted; then, a neural network lifts the two-dimensional joint coordinates to three dimensions, which serve as the gesture feature of each hand; finally, the left and right hands in the sliding window and their states (fist with back up, fist with palm up, open with back up, open with palm up) are recognized according to skin color and hand structure, the gesture features are adjusted accordingly, and spatio-temporal features are obtained for the whole video.
For the second part, the technical scheme of the invention establishes three models (a long-short term memory network, a discrete cosine transform + a support vector classifier, a discrete Fourier transform + a support vector classifier) based on space-time characteristics to respectively predict the performance of the hand action reflected in the video.
Further, the technical solution of the present invention also provides a mechanism that can calculate the gradient of the final performance with respect to its three-dimensional coordinates for each joint in each frame and then give an indication of how to adjust the posture in order to improve the performance quickly.
Aiming at the research gap of hand motion evaluation in the field of video-based human behavior analysis, the problem of incorrect correspondence between features and real physical parts in human motion evaluation, and the real-time requirements of a motion evaluation system, the technical scheme of the invention provides a real-time hand motion recognition and evaluation method based on RGB video. The method ensures accurate matching between the extracted features and the details of the human hand in the real scene while the camera viewing angle changes continuously, improves the computational efficiency of the overall action recognition and evaluation, realizes real-time virtual reconstruction of hand motion, evaluates human hand motion accurately and in real time, and improves the accuracy and robustness of the overall action evaluation.
The method can be widely applied to the fields of vision-based hand posture estimation, motion quality evaluation methods and the like.

Claims (10)

1. A data-driven real-time hand motion evaluation method based on RGB video is characterized by comprising the following steps:
1) acquiring a video to be recognized of a hand;
2) performing hand region segmentation on the video to be identified;
3) calculating the probability of selecting each position as a key point according to the segmented hand area to obtain the position of the two-dimensional hand key point;
4) predicting three-dimensional hand key point positions according to the extracted two-dimensional hand key point positions;
5) recognizing the state of the hand according to the skin color and the structure of the hand, correspondingly adjusting the gesture characteristics, and obtaining space-time characteristics for the whole video;
6) based on the space-time characteristics, three models of a long short-term memory network, a discrete cosine transform and support vector classifier and a discrete Fourier transform and support vector classifier are used for comparison and verification, and the quality level of the hand motion is predicted;
7) the gesture estimation and organization are carried out through a deep learning-based method, real-time virtual reconstruction of hand motions is achieved, and quality evaluation of the hand motions is carried out.
2. The data-driven RGB video-based real-time hand motion assessment method as claimed in claim 1, wherein said gesture estimation and organization at least comprises: for conventional RGB video containing image frames showing both hands of a person, the pose of both hands is extracted as a feature of each still frame.
3. The data-driven RGB video-based real-time hand motion assessment method as claimed in claim 1, wherein the gesture estimation and organization is implemented by a hand segmentation module, a 2D hand pose estimation module, a 3D hand pose estimation module and a hand pose organization module.
4. The data-driven RGB video-based real-time hand motion assessment method according to claim 3, wherein said hand segmentation module is used to identify and segment the region in which the human hand is located in each frame of image, its model being constructed using data from the EgoHands dataset;
the 2D hand pose estimation module is used to extract the two-dimensional coordinate information of each joint point of the hand, obtaining the two-dimensional hand pose from a key-point score map that gives the probability of each pixel being selected as a key point;
the 3D hand pose estimation module is used to lift the two-dimensional hand pose extracted by the previous module to three dimensions, predicting relative, normalized three-dimensional coordinates from the incomplete and noisy key-point score maps obtained from the 2D hand pose estimation module;
the hand pose organization module is used to distinguish the left and right hands in each frame of the video, as well as the different geometric states of each hand, and then to adjust the original rough pose so that it conforms to the actual situation.
5. The data-driven RGB video-based real-time hand motion assessment method as claimed in claim 3, wherein the hand segmentation module greatly improves its computational efficiency, without affecting accuracy, by cropping a flexible region that covers the portion to be segmented.
6. The data-driven RGB video-based real-time hand motion assessment method according to claim 5, wherein the computational efficiency is characterized by a reconstruction computation cost RCC and an assessment computation ratio ACR;
the reconstruction computation cost RCC represents the computation time of the pose estimation for each frame and quantifies the degree to which the virtual representation of the action is synchronized with the real action;
the assessment computation ratio ACR represents the ratio of the computation time of the action quality assessment to the video duration and measures the degree to which timely assessment and feedback can be provided (an illustrative computation of both metrics is sketched after the claims).
7. The data-driven RGB video-based real-time hand motion assessment method as claimed in claim 1, wherein the quality assessment of the hand motion is performed by a performance assessment module and a feedback indication module.
8. The data-driven RGB video-based real-time hand motion assessment method as claimed in claim 7, wherein said performance assessment includes: the three-dimensional relative positions of the joints are used as the features of each frame, and a model is then built to analyze the spatio-temporal information of the whole video and output a score.
9. The data-driven RGB video-based real-time hand motion assessment method as claimed in claim 7, wherein said feedback indication comprises: the feedback provided instructs the trainee how each of his static postures should be adjusted, identifying the joints that most need adjustment and the corresponding adjustment directions that maximize the final score; by establishing a causal relationship between the action and the score, the maximum improvement of the trainee's final score is achieved.
10. The data-driven RGB video-based real-time hand motion assessment method as claimed in claim 9, wherein said feedback indication is achieved by maximizing the gradient of the final score with respect to the features of each frame.
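As referenced in claim 6, the sketch below illustrates one way the two efficiency metrics could be computed: the reconstruction computation cost RCC as the average per-frame pose-estimation time, and the assessment computation ratio ACR as the assessment time divided by the video duration. The timing strategy and function names are assumptions for illustration only.

import time

def reconstruction_computation_cost(pose_fn, frames):
    """RCC: average pose-estimation time per frame, in seconds."""
    start = time.perf_counter()
    for frame in frames:
        pose_fn(frame)
    return (time.perf_counter() - start) / len(frames)

def assessment_computation_ratio(assess_fn, features, video_duration_s):
    """ACR: assessment time divided by video duration (lower means more real-time)."""
    start = time.perf_counter()
    assess_fn(features)
    return (time.perf_counter() - start) / video_duration_s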
CN202010237076.1A 2020-03-30 2020-03-30 Data-driven real-time hand motion assessment method based on RGB video Active CN111460976B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010237076.1A CN111460976B (en) 2020-03-30 2020-03-30 Data-driven real-time hand motion assessment method based on RGB video

Publications (2)

Publication Number Publication Date
CN111460976A true CN111460976A (en) 2020-07-28
CN111460976B CN111460976B (en) 2023-06-06

Family

ID=71680513

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010237076.1A Active CN111460976B (en) 2020-03-30 2020-03-30 Data-driven real-time hand motion assessment method based on RGB video

Country Status (1)

Country Link
CN (1) CN111460976B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180307319A1 (en) * 2017-04-20 2018-10-25 Microsoft Technology Licensing, Llc Gesture recognition
CN107992858A (en) * 2017-12-25 2018-05-04 深圳市唯特视科技有限公司 A kind of real-time three-dimensional gesture method of estimation based on single RGB frame
CN110047591A (en) * 2019-04-23 2019-07-23 吉林大学 One kind is for doctor's posture appraisal procedure in surgical procedures
CN110147767A (en) * 2019-05-22 2019-08-20 深圳市凌云视迅科技有限责任公司 Three-dimension gesture attitude prediction method based on two dimensional image
CN110738192A (en) * 2019-10-29 2020-01-31 腾讯科技(深圳)有限公司 Human motion function auxiliary evaluation method, device, equipment, system and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
谢清超, 晁建刚 et al.: "Multi-camera hand pose estimation method based on joint-point occlusion inference" *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112233516A (en) * 2020-10-12 2021-01-15 萱闱(北京)生物科技有限公司 Grading method and system for physician CPR examination training and examination
CN112233515A (en) * 2020-10-12 2021-01-15 萱闱(北京)生物科技有限公司 Unmanned examination and intelligent scoring method applied to physician CPR examination
CN112329571A (en) * 2020-10-27 2021-02-05 同济大学 Self-adaptive human body posture optimization method based on posture quality evaluation
CN112329571B (en) * 2020-10-27 2022-12-16 同济大学 Self-adaptive human body posture optimization method based on posture quality evaluation
CN113435320A (en) * 2021-06-25 2021-09-24 中国科学技术大学 Human body posture estimation method with multiple models configured in self-adaption mode
CN113435320B (en) * 2021-06-25 2022-07-15 中国科学技术大学 Human body posture estimation method with multiple models configured in self-adaption mode
CN113223364A (en) * 2021-06-29 2021-08-06 中国人民解放军海军工程大学 Submarine cable diving buoy simulation training system

Also Published As

Publication number Publication date
CN111460976B (en) 2023-06-06

Similar Documents

Publication Publication Date Title
CN111460976B (en) Data-driven real-time hand motion assessment method based on RGB video
Bourdev et al. Poselets: Body part detectors trained using 3d human pose annotations
Li et al. Model-based segmentation and recognition of dynamic gestures in continuous video streams
CN102576259B (en) Hand position detection method
US11783615B2 (en) Systems and methods for language driven gesture understanding
CN112800903B (en) Dynamic expression recognition method and system based on space-time diagram convolutional neural network
CN108256421A (en) A kind of dynamic gesture sequence real-time identification method, system and device
CN108647654A (en) The gesture video image identification system and method for view-based access control model
Liang et al. Resolving ambiguous hand pose predictions by exploiting part correlations
Chen et al. Combining unsupervised learning and discrimination for 3D action recognition
CN108921011A (en) A kind of dynamic hand gesture recognition system and method based on hidden Markov model
CN114821764A (en) Gesture image recognition method and system based on KCF tracking detection
CN114445853A (en) Visual gesture recognition system recognition method
Neverova Deep learning for human motion analysis
CN108595014A (en) A kind of real-time dynamic hand gesture recognition system and method for view-based access control model
US11361467B2 (en) Pose selection and animation of characters using video data and training techniques
CN108108648A (en) A kind of new gesture recognition system device and method
Hasan et al. Gesture feature extraction for static gesture recognition
Otberdout et al. Hand pose estimation based on deep learning depth map for hand gesture recognition
Achmed Upper body pose recognition and estimation towards the translation of South African Sign Language
Lin Visual hand tracking and gesture analysis
Leow et al. 3-D–2-D spatiotemporal registration for sports motion analysis
Micilotta Detection and tracking of humans for visual interaction
Psaltis Optical flow for dynamic facial expression recognition.
Poppe Discriminative vision-based recovery and recognition of human motion.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant