CN111460976A - Data-driven real-time hand motion evaluation method based on RGB video - Google Patents

Data-driven real-time hand motion evaluation method based on RGB video

Info

Publication number
CN111460976A
Authority
CN
China
Prior art keywords
hand
evaluation
video
module
real
Prior art date
Legal status
Granted
Application number
CN202010237076.1A
Other languages
Chinese (zh)
Other versions
CN111460976B (en)
Inventor
李冕
王天予
王毅杰
Current Assignee
Shanghai Pnx Information Technology Co ltd
Shanghai Jiaotong University
Original Assignee
Shanghai Pnx Information Technology Co ltd
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Pnx Information Technology Co ltd, Shanghai Jiaotong University filed Critical Shanghai Pnx Information Technology Co ltd
Priority to CN202010237076.1A priority Critical patent/CN111460976B/en
Publication of CN111460976A publication Critical patent/CN111460976A/en
Application granted granted Critical
Publication of CN111460976B publication Critical patent/CN111460976B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G06V40/28 - Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 - Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A - TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 - Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 - Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A data-driven real-time hand motion evaluation method based on RGB video, belonging to the field of human behavior analysis based on video processing. The method comprises a hand posture estimation unit and an action evaluation unit. The hand posture estimation unit extracts hand key-point coordinates from each frame image; the action evaluation unit predicts a score for the quality of the hand motion and gives suggestions on how to improve the score. Posture estimation and organization are performed by a deep-learning-based method, followed by action quality evaluation. The method ensures accurate matching between the extracted features and the details of the human hand in the real scene while the camera viewing angle changes continuously, improves the computational efficiency of the overall action recognition and evaluation, realizes real-time virtual reconstruction of hand motion, evaluates human hand motion accurately and in real time, and improves the accuracy and robustness of the overall action evaluation. The method can be widely applied in fields such as vision-based hand posture estimation and motion quality evaluation.

Description

Data-driven real-time hand motion evaluation method based on RGB video
Technical Field
The invention belongs to the field of human behavior analysis based on video processing, and particularly relates to a real-time hand motion evaluation method based on RGB video.
Background
In recent years, rapid developments in computer vision have produced a number of reliable methods for object detection and motion recognition from images and videos. On this basis, academia has gradually started to explore video-based human motion quality assessment.
At present, considerable progress has been made on macroscopic body motions.
In the paper "Assessing the Quality of Actions" (Hamed Pirsiavash, Carl Vondrick, and Antonio Torralba. 2014. In 2014 European Conference on Computer Vision (ECCV). Springer International Publishing, 556-571), a method based on linear support vector regression (L-SVR) was proposed that is trained on low-level and high-level spatio-temporal features.
In the paper "Kinect-based body-building exercise identification and evaluation" (Wang Yi et al., Computer Science and Application, July 2018, 1134-1145), a KNN-based fine-tuning method is proposed, which classifies the performance of moving subjects according to the similarity between static skeleton data and template models.
However, these existing methods cannot ensure a consistent one-to-one match between the extracted human features and the actual physical parts shown in the video. For example, in the paper "Assessing the Quality of Actions", a change in camera angle can map the real right leg onto the features of the left leg in the image. When human hands are involved instead of the body, great difficulties arise: the cooperation within one hand and between two hands is more complicated, and the impact of such matching errors on the evaluation is not negligible.
Therefore, exploration of hand motion evaluation remains very limited; to date, no method has been proposed for the corresponding assessment of hand motions.
In fact, the evaluation of hand motion is substantially different from the evaluation of human motion in the general sense.
For those hand-based actions, performance is typically dependent on the details in the hand gesture. For example, trainees may be evaluated for surgical training based in part on their hand posture (e.g., a scalpel-holding posture). Also, hand movements involve both hands of a person and different states of each hand (front and back, etc.), which need to be recognized when making an assessment in order to constitute a reasonable feature.
Video-based hand motion quality assessment is important, and many scenarios require this technology to enable automated assessment of hand-based training. The assessment typically comprises two parts: a performance score, and an indication of which part of the gesture most needs improvement and how to adjust it. Trainees can autonomously improve their performance according to the feedback provided by the technology, which addresses the common absence or shortage of expert instruction. In addition, since the technique is camera-based, the trainee is freed from wearable sensors and can train in a more realistic and natural manner.
The evaluation of hand motion is based on hand pose estimation. Hand pose estimation refers to a process and method of extracting two-dimensional or three-dimensional coordinates of each joint of a hand to estimate a pose.
Conventional pose estimation methods, whether generative or discriminative, depend on RGB-D (RGB + depth map) pictures or video acquired by a depth camera, which results in very high implementation cost and high requirements on device performance.
In recent years, the academia has proposed some effective methods based on deep learning that rely only on common RGB images and video. This method generally comprises three parts: hand segmentation, pose estimation, and refinement of estimated poses.
However, these methods still have some unsolved problems. First, they pay little attention to computational efficiency. For practical application scenarios, real-time online evaluation is a very important requirement, which places high demands on computational efficiency. In addition, when refining the estimated pose, the integrity of the two hands is not considered. Integrity here means mapping the two hands in the video to the real left and right hands while ensuring that the extracted hand features correspond correctly to each finger joint in the real situation.
In addition to developments in academia, many patent documents propose techniques related to hand motion analysis. For example, the Chinese invention patent with publication number CN 105160323 B, published on 2018-11-27, discloses a gesture recognition method aimed at quickly and accurately recognizing a user's gesture based on the depth information and color information of an image. That technical solution relies heavily on preset information: a preset hand structure template determines the sequence of feature points to be detected on the hand contour, a preset feature point sequence is matched with action names and positions, and a gesture table is matched with the gesture. As a result, scene changes require modification of the preset templates. Moreover, the solution is based on images containing depth information and relies on the depth information to extract the required feature points, which adds extra requirements on the data acquisition equipment.
As another example, the Chinese invention patent with publication number CN 103034851 B, published on 2015-08-26, discloses a hand tracking device and method based on a self-learning skin color model, which provides a precise hand motion recognition method based on color-depth information and scale-invariant feature transform that maintains stability and accuracy when the hand is disturbed or occluded. However, that solution realizes tracking based on the hand contour and fingertip positions, which is not sufficient for hand motion evaluation; features reflecting the complete gesture are required for that purpose. The solution also relies on depth information to extract the required feature points. Meanwhile, it identifies the hand motion type in the video by comparison against reference samples, so reference samples must be established at considerable cost for each new scene.
The technical solutions in the above patent documents are mainly directed to hand recognition and tracking and do not address real-time hand motion evaluation. Meanwhile, these solutions analyze macroscopic characteristics of the hand and do not capture detailed information of the hand structure (such as the state or motion change of a particular finger). Further, they do not consider the incorrect correspondence caused by changes of the camera angle, i.e. a stable camera angle is assumed.
How to ensure accurate matching between the extracted features and the hand details in the real scene while the camera viewing angle changes continuously, and how to improve the computational efficiency, accuracy and robustness of the overall action recognition and evaluation, are therefore pressing problems in real-time hand motion evaluation.
Disclosure of Invention
The invention aims to provide a data-driven real-time hand motion evaluation method based on RGB video. Aiming at the research gap of hand motion evaluation in the field of video-based human behavior analysis, the problem of incorrect correspondence between features and real physical parts in human motion evaluation, and the real-time requirements of a motion evaluation system, a real-time hand motion recognition and evaluation method based on RGB video is provided. The method ensures accurate matching between the extracted features and the details of the human hand in the real scene while the camera viewing angle changes continuously, improves the computational efficiency of the overall action recognition and evaluation, realizes real-time virtual reconstruction of hand motion, evaluates human hand motion accurately and in real time, and improves the accuracy and robustness of the overall action evaluation.
The technical scheme of the invention is as follows: the data-driven real-time hand motion evaluation method based on the RGB video is characterized by comprising the following steps of:
1) acquiring a video to be recognized of a hand;
2) performing hand region segmentation on the video to be identified;
3) calculating the probability of selecting each position as a key point according to the segmented hand area to obtain the position of the two-dimensional hand key point;
4) predicting three-dimensional hand key point positions according to the extracted two-dimensional hand key point positions;
5) recognizing the state of the hand according to the skin color and the structure of the hand, correspondingly adjusting the gesture characteristics, and obtaining space-time characteristics for the whole video;
6) based on the space-time characteristics, three models of a long short-term memory network, a discrete cosine transform and support vector classifier and a discrete Fourier transform and support vector classifier are used for comparison and verification, and the quality level of the hand motion is predicted;
7) the gesture estimation and organization are carried out through a deep learning-based method, real-time virtual reconstruction of hand motions is achieved, and quality evaluation of the hand motions is carried out.
Wherein the gesture estimation and organization at least comprises: for conventional RGB video containing image frames showing both hands of a person, the pose of both hands is extracted as a feature of each still frame.
The gesture estimation and organization are realized through a hand segmentation module, a 2D hand posture estimation module, a 3D hand posture estimation module and a hand posture organization module.
Specifically, the hand segmentation module is used for identifying and segmenting an area where a human hand is located in each frame of image, and constructing a model by using data from an Egohands data set;
the 2D hand posture estimation module is used for extracting two-dimensional coordinate information of each joint point of the hand, and obtaining a two-dimensional hand posture by utilizing a key point score map of the probability that each pixel point is selected as a key point;
the 3D hand posture estimation module is used for lifting the two-dimensional hand posture extracted by the previous module to three-dimensional, and predicting relative and standardized three-dimensional coordinates according to incomplete and noisy key point score maps obtained from the 2D hand posture estimation module;
the hand gesture organization module is used to distinguish the left and right hands in each frame of the video, and the different geometric states of each hand, and then the original rough gesture should be adjusted to conform to the actual situation.
Furthermore, the hand segmentation module greatly improves its computational efficiency, without affecting accuracy, by cropping a flexible area that covers the part to be segmented.
Specifically, the computational efficiency is represented by a reconstruction computation cost RCC and an evaluation computation ratio ACR;
the reconstruction calculation cost RCC is used for representing the calculation time of each frame of gesture estimation;
the evaluation calculation ratio ACR is used for representing the ratio of the calculation time of the action quality evaluation to the video duration;
the reconstruction computation cost RCC quantifies the degree of synchronization of the virtual representation of the action with the real action;
the evaluation calculation measures the degree of providing timely evaluation and feedback over ACR.
Further, the quality evaluation of the hand motion is realized through a performance evaluation module and a feedback indication module.
Further, the performance evaluation comprises: the three-dimensional relative position of the joints is used as a feature for each frame, and then a model is built to analyze spatiotemporal information of the whole video and output scores.
Further, the feedback indication includes: the feedback provided instructs the trainee how each of his static postures should be adjusted, giving the joints that need to be adjusted most and the corresponding adjustment direction that maximizes the final score, achieving the maximum improvement of the trainee's final score by establishing a causal relationship between the actions and the score.
The feedback indication is achieved by maximizing the gradient of the final score with respect to the features of each frame.
Compared with the prior art, the invention has the advantages that:
1. the technical scheme integrally solves the problem of lack of real-time supervision of experts in a training scene by realizing automatic hand action evaluation based on the RGB video, and greatly improves the training efficiency;
2. the technical scheme improves the overall operational efficiency by improving the operational efficiency during hand segmentation and a multi-thread parallel operation framework (video reading, video processing and feature organization), provides opportunities for real-time virtual reconstruction of hand actions, achieves the effect of basically synchronizing with the video playing speed in feature extraction, and achieves the effect of timely giving feedback after video playing in action evaluation;
3. according to the technical scheme, the step of characteristic organization is added in the characteristic extraction, so that the matching accuracy of the extracted characteristics and the hand details in the real scene in the continuous change process of the visual angle of the camera is improved, and the accuracy and the robustness of the whole action evaluation are improved.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention;
FIG. 2 is a schematic structural diagram of a hand segmentation module according to the present invention;
FIG. 3 is a schematic diagram of the structure of a two-dimensional pose estimation module of the present invention;
FIG. 4 is a schematic diagram of the structure of the three-dimensional pose estimation module of the present invention;
FIG. 5 is a schematic diagram of a hand gesture integration module according to the present invention;
FIG. 6 is a schematic diagram of the structure of the action quality assessment module according to the present invention;
FIG. 7 is a schematic flow chart of the present invention for determining the geometric status of the left hand;
FIGS. 8a, 8b, and 8c are diagrams of graphical effects in an embodiment of the present invention;
FIG. 9 is a plot of the mean and standard deviation of the RCC index for the poor performance level on the origami data set;
FIG. 10 is a plot of the mean and standard deviation of the RCC index for the medium performance level on the origami data set;
FIG. 11 is a plot of the mean and standard deviation of the RCC index for the good performance level on the origami data set;
FIG. 12 is a schematic of the top 5 joints by number of occurrences for the poor performance level on the origami data set;
FIG. 13 is a schematic of the top 5 joints by number of occurrences for the medium performance level on the origami data set;
FIG. 14 is a schematic of the top 5 joints by number of occurrences for the good performance level on the origami data set.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
An image containing depth information (also called image depth) refers to the number of bits used to store each pixel, which is also used to measure the color resolution of the image. It determines the number of colors each pixel of a color image may have or determines the number of gray levels each pixel of a gray scale image may have, which determines the maximum number of colors that may appear in a color image or the maximum gray scale level in a gray scale image. Although the pixel depth or image depth may be deep, the color depth of various display devices is limited. For example, a standard VGA supports 4-bit 16-color images, and at least 8-bit 256 colors are recommended for multimedia applications. Due to device limitations, coupled with limitations on human eye resolution, typically, a particularly deep pixel depth is not necessarily sought. Furthermore, the deeper the pixel depth, the more storage space is occupied. Conversely, if the pixel depth is too shallow, that also affects the quality of the image, which appears to be rough and unnatural to humans.
Aiming at the current situation that the existing three-dimensional hand posture estimation method cannot identify the left hand and the right hand simultaneously and the video-based action automatic evaluation mainly focuses on the body action, the technical scheme of the invention provides a data-driven RGB-video-based real-time hand action evaluation method.
The technical scheme of the invention mainly comprises two main components:
1) a hand posture estimation unit: comprising a hand segmentation module, a two-dimensional pose estimation module, a three-dimensional pose estimation module and a hand pose integration module, used for extracting hand key-point coordinates from each frame image;
2) an action evaluation unit: comprising a long short-term memory network model, a discrete cosine transform + support vector classifier, and a discrete Fourier transform + support vector classifier, used for predicting the score of hand motion quality and giving suggestions on how to improve the score;
the hand segmentation module of the hand posture estimation unit can lay a solid foundation for hand posture estimation. Applying hand segmentation in real scenarios requires careful consideration of computational efficiency. However, many existing methods do not meet these requirements. The performance of these methods can be drastically degraded if the background is complex or the skin tone changes dramatically. At the same time, these methods ignore the problem of computational efficiency.
In the prior art, hand segmentation is realized by a model built with data from the Egohands dataset, which has high-quality annotations and 4800 images containing hands in 48 different environments. The model is trained and applied to each frame of the video. It should be noted that cropping the segmented part from the original image is usually a speed bottleneck, because this operation requires identifying all pixels on the boundary; therefore the solution of the invention does not crop along the exact boundary of the detected box, but instead crops a flexible area covering the detected box.
The two-dimensional hand pose estimation module implements this step according to an encoder-decoder structure. The 2D hand pose is estimated using a keypoint score map, which represents the probability of each position being selected as a keypoint. The initial score map is predicted from image features generated by an encoder. For a specific implementation of the two-dimensional hand pose estimation module using the encoder-decoder structure, reference may be made to the paper "Convolutional Pose Machines" (Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. 2016. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Press, 4724-4732).
The three-dimensional hand pose estimation module estimates the relative, normalized three-dimensional coordinates using the PosePrior network from the paper "Learning to Estimate 3D Hand Pose from Single RGB Images" (Christian Zimmermann and Thomas Brox. 2017. In IEEE International Conference on Computer Vision (ICCV). IEEE Press, 4903-4911).
The hand pose integration module estimates a rough hand pose based on the previous steps.
Existing limb motion quality evaluations do not guarantee that the positions of the extracted limb key points correspond well to each part of the human body in the video. For example, players' left and right legs are sometimes confused due to large changes in camera angle. Such a mismatch degrades evaluation quality, so it is necessary to maintain a one-to-one match between the extracted spatial information and the actual situation. In the two-hand scenario of this technical solution, the left and right hands are distinguished, the different states of each hand in each frame of the video are further distinguished, and the original rough hand pose is then adjusted to be consistent with reality under different camera positions.
Fig. 1 is a flowchart of a method for evaluating a hand motion according to an embodiment of the present invention. The hand motion evaluation method comprises the following two key steps: and performing gesture estimation and organization through a deep learning-based method, and performing action quality evaluation.
Further, each frame of a given video first goes through four modules: hand segmentation, two-dimensional hand pose estimation, three-dimensional hand pose estimation, and hand pose organization. The static gestures then form a spatio-temporal dynamic action, and the whole dynamic action is passed through the evaluation module to judge its quality.
Still further, the evaluation module also provides corresponding feedback, indicating the most promising improvements.
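The overall flow can be illustrated with the following minimal Python sketch, assuming OpenCV for frame reading; the five stage functions (segment_hands, estimate_2d, lift_to_3d, organize_pose, score_model) are hypothetical placeholders standing in for the modules described above, not the patent's actual implementation.

```python
# Hypothetical per-frame pipeline sketch; the stage functions are placeholders
# for the hand segmentation, 2D/3D pose estimation, pose organization and
# evaluation modules, which are not reproduced here.
import cv2

def evaluate_video(path, segment_hands, estimate_2d, lift_to_3d, organize_pose, score_model):
    """Two-stage pipeline: per-frame pose extraction, then whole-video scoring."""
    cap = cv2.VideoCapture(path)
    features = []                                    # phi(t) for every frame t
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        regions = segment_hands(frame)               # crop a flexible area around each hand
        kp2d = estimate_2d(regions)                  # 2 x m two-dimensional keypoints
        kp3d = lift_to_3d(kp2d)                      # relative, normalized 3D coordinates
        features.append(organize_pose(kp3d, frame))  # left/right and state corrected
    cap.release()
    score, feedback = score_model(features)          # spatio-temporal evaluation + feedback
    return score, feedback
```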
Specifically, the first step is gesture estimation and organization. For image frames containing a person's two hands, the pose of both hands is extracted as a feature of each still frame. The features of the t-th frame are defined as a set of coordinates $p_j(t) = (x_j(t), y_j(t), z_j(t)),\ j \in [1, 2m]$, where the coordinates indicate the positions of the 2m key points corresponding to the joints of both hands (m is 21 in this example). There are four modules to implement this process: hand segmentation, 2D hand pose estimation, 3D hand pose estimation, and hand pose organization.
Fig. 2 is a schematic structural diagram of the hand segmentation module of the present invention, and it can be seen that the hand segmentation module includes a single-shot multi-box detector.
The hand segmentation module is used for identifying and segmenting the area where the human hand is located in each frame of image. The robust hand segmentation module lays a solid foundation for subsequent extraction of accurate hand gestures. Computational efficiency is also an important aspect in view of its application in real scenarios.
However, most of the existing methods do not meet these requirements. The performance of those methods drops dramatically if the background is unusual or the skin tone changes significantly. Moreover, they do not take into account the impact of computational efficiency carefully. The module constructs a model using data from the Egohands dataset with high quality annotations, with hands in 4800 images in 48 different environments.
The trained model is applied to each frame of the video stream. It can be noted that the bottleneck of computational efficiency is usually to crop the region to be segmented in the original image, since it is often necessary to identify all pixels on the boundary. The module does not cut along the exact boundaries of the detected box. Instead, the flexible area covering the part to be segmented is cropped. This allows a substantial increase in the computational efficiency of the module with little impact on accuracy.
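A minimal sketch of this flexible-area cropping, assuming the detector returns an axis-aligned box (x_min, y_min, x_max, y_max) and using an illustrative 15% margin (the patent does not specify the margin):

```python
import numpy as np

def crop_flexible_region(image, box, margin=0.15):
    """Crop a rectangular region that covers the detected hand box with a safety
    margin, instead of tracing the exact segmentation boundary pixel by pixel.
    `box` is assumed to be (x_min, y_min, x_max, y_max); `margin` is a guess."""
    h, w = image.shape[:2]
    x0, y0, x1, y1 = box
    dx, dy = margin * (x1 - x0), margin * (y1 - y0)
    x0, y0 = max(int(x0 - dx), 0), max(int(y0 - dy), 0)
    x1, y1 = min(int(x1 + dx), w), min(int(y1 + dy), h)
    return image[y0:y1, x0:x1]
```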
FIG. 3 is a schematic diagram of the structure of the two-dimensional pose estimation module of the present invention.
The two-dimensional hand posture estimation module is used for extracting two-dimensional coordinate information of each joint point of the hand.
This module is implemented according to an encoder-decoder architecture. Extracting two-dimensional coordinate information of each joint point of the hand, and converting the two-dimensional coordinate information into a task of finding out key points in the hand image. And obtaining the two-dimensional hand gesture by utilizing the key point score map of the probability that each pixel point is selected as the key point.
Specifically, for each hand, the pixel coordinate of the j-th joint is denoted $Y_j$, and the goal is to predict all coordinates $Y = (Y_1, \ldots, Y_m)$. The module comprises a series of multi-class predictors $g_t(\cdot)$. For each frame $t \in \{1, \ldots, T\}$, the corresponding predictor assigns a pixel coordinate $z$ of the image to each joint $Y_l$, based on the feature $x_z$ extracted at position $z$ and the context information from the previous classifier for the pixels neighboring each $Y_l$. The probability with which the classifier of frame $t$ assigns joint $l$ to position $z = (u, v)^{\top}$ is denoted $b_t^l(Y_l = z)$. Define $\psi_t(z, b_{t-1})$ as the mapping from $b_{t-1}$ to the context features. Then

$$g_t\bigl(x_z, \psi_t(z, b_{t-1})\bigr) \rightarrow \bigl\{ b_t^l(Y_l = z) \bigr\}_{l \in \{1, \ldots, m\}}.$$
By updating on all frames, a complete key point score map can be constructed, thereby extracting the two-dimensional coordinate information of all joint points.
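As a sketch of how 2D coordinates can be read off the completed score maps, assuming the maps are stacked as an array of shape (m, H, W):

```python
import numpy as np

def keypoints_from_score_maps(score_maps):
    """Given per-joint score maps of shape (m, H, W), where each map holds the
    probability of every pixel being that joint, return the (u, v) pixel
    coordinates with the highest score for each of the m joints."""
    m, h, w = score_maps.shape
    flat_idx = score_maps.reshape(m, -1).argmax(axis=1)
    v, u = np.unravel_index(flat_idx, (h, w))
    return np.stack([u, v], axis=1)      # shape (m, 2), one (u, v) per joint
```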
FIG. 4 is a schematic diagram of the three-dimensional pose estimation module according to the present invention.
The three-dimensional hand posture estimation module is used for lifting the two-dimensional hand posture extracted by the previous module to three dimensions. The modules predict relative and normalized three-dimensional coordinates from incomplete and noisy keypoint score maps obtained from previous modules.
First, based on the two-dimensional position information, a network is trained to predict the corresponding three-dimensional coordinates in a canonical frame; next, the transformation between the canonical frame and the relative frame is estimated. In particular, for the latter, a rotation matrix $R(w_{rel})$ needs to be estimated, which comprises two steps. First, a rotation $R_{xz}$ about the x-axis and the z-axis is found so that a reference joint is aligned with the y-axis of the canonical frame:

$$R_{xz}\, w^{c} = \lambda\,(0, 1, 0)^{\top}, \quad \lambda \ge 0.$$

Second, a rotation $R_y$ about the y-axis is calculated so that the remaining rotational degree of freedom about the y-axis is fixed, and the entire rotation matrix is the product of these two rotations, $R(w_{rel}) = R_y\, R_{xz}$. These estimates are all a matter of viewpoint estimation.
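A sketch of composing the viewpoint rotation as the product of the two rotations described above, under the assumption that the network outputs three rotation angles; the angle parametrization and composition order are illustrative, not the PosePrior network's exact formulation:

```python
import numpy as np

def rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_y(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def viewpoint_rotation(ax, az, ay):
    """Compose the full viewpoint rotation as R = R_y * R_xz, where R_xz first
    aligns a reference joint with the y-axis and R_y fixes the remaining
    rotation about the y-axis. The angle parametrization is an assumption."""
    r_xz = rot_z(az) @ rot_x(ax)
    return rot_y(ay) @ r_xz

# Canonical coordinates w_c (shape (m, 3)) would then map to relative
# coordinates via w_rel = w_c @ viewpoint_rotation(ax, az, ay).T
```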
Through the previous modules, a rough three-dimensional gesture has been estimated. In the prior art, the subsequent evaluation of human body actions is carried out directly on such a rough body pose, without confirming whether the extracted position information is consistent with each part of the human body shown in the video. For example, in some experiments based on Olympic motion data sets, the left and right legs of the athlete sometimes correspond incorrectly due to a large change in camera angle (the position information extracted from the right leg corresponds to features of the left leg). Since such matching errors reduce the accuracy of the subsequent evaluation, a hand pose organization module is added to ensure a consistent one-to-one match between the extracted joint position information and the actual situation.
Fig. 5 is a schematic structural diagram of the hand posture integration module according to the present invention.
As shown in fig. 5, the hand pose integration module is used to distinguish between the left and right hands in each frame of the video, and the different geometric states of each hand. The original coarse pose should then be adjusted to the actual situation, regardless of the condition of the camera.
In particular, although the hand segmentation module is trained on a first-person-view data set that distinguishes between left and right hands, the module employs a correction mechanism to increase its robustness. The part of the forearm to which each hand is attached is first detected in the image. If these regions extend to the lower boundary of the frame, the camera is in a first-person view, in which case the left hand segment in the image corresponds to the left hand and the right hand segment to the right hand. Otherwise, the camera is at the observer's viewing angle, and the correspondence is reversed.
Four geometric states are defined and distinguished for each hand: fist with the back facing up, fist with the palm facing up, open with the back facing up, and open with the palm facing up.
Fig. 7 shows the detailed algorithm for distinguishing the four geometric states of the left hand; the right-hand algorithm works on the same principle. The image of each hand segment is first converted from RGB to HSV color space and then processed with a slight Gaussian blur, so that human skin can be better distinguished from other objects with similar HSV values (which are treated as noise in this solution). After that, an elliptical region is sampled from the surface of the hand and its gray level is checked. Since the back of a person's hand tends to be darker than the palm, the front and back of the hand can be distinguished. Further, the perimeter and area of the hand are calculated to determine whether the hand is in a fist or a spread state: when both the area and the perimeter are larger than or equal to their respective thresholds, the hand is judged to be spread; otherwise, it is judged to be a fist. For the fist state, the fifteenth frame before the current frame (representing the hand state just before the fist was made) is used to calculate the gray value and judge front versus back, because the gray value of a clenched hand contains elements of both the front and the back and is therefore prone to misjudgment.
Considering the final effect, each state of the left hand is equivalent to the corresponding mirrored state of the right hand (for example, a left-hand fist with the back up corresponds to a right-hand fist with the palm up). Thus, all states of both hands reduce to different treatments of the left and right hands: for the left hand, the originally extracted three-dimensional coordinates are used directly; for the right hand, the originally extracted three-dimensional coordinates are flipped along the z-axis.
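The Fig. 7 heuristic and the right-hand coordinate flip can be sketched as follows with OpenCV; all thresholds, the ellipse size and the skin-mask input are illustrative assumptions, since the patent does not disclose concrete values:

```python
import cv2
import numpy as np

# Illustrative thresholds only; the patent does not state concrete values.
AREA_THRESH, PERIM_THRESH, GRAY_THRESH = 6000, 400, 110

def classify_hand_state(hand_bgr, skin_mask):
    """Heuristic sketch of the Fig. 7 logic: decide fist vs. open from the hand
    contour's area and perimeter, and front vs. back from the mean gray level
    of an elliptical patch sampled on the hand surface."""
    hsv = cv2.cvtColor(hand_bgr, cv2.COLOR_BGR2HSV)
    hsv = cv2.GaussianBlur(hsv, (5, 5), 0)          # slight blur to suppress HSV noise
    contours, _ = cv2.findContours(skin_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    cnt = max(contours, key=cv2.contourArea)
    area, perim = cv2.contourArea(cnt), cv2.arcLength(cnt, True)
    spread = area >= AREA_THRESH and perim >= PERIM_THRESH

    # Sample an elliptical patch at the contour centroid and check its gray level:
    # the back of the hand tends to be darker than the palm.
    gray = cv2.cvtColor(hand_bgr, cv2.COLOR_BGR2GRAY)
    m = cv2.moments(cnt)
    cx, cy = int(m["m10"] / m["m00"]), int(m["m01"] / m["m00"])
    patch_mask = np.zeros_like(gray)
    cv2.ellipse(patch_mask, (cx, cy), (20, 12), 0, 0, 360, 255, -1)
    palm_up = cv2.mean(gray, mask=patch_mask)[0] >= GRAY_THRESH

    return ("open" if spread else "fist") + ("_palm_up" if palm_up else "_back_up")

def unify_right_hand(kp3d_right):
    """Map right-hand features onto the left-hand convention by flipping the
    extracted 3D coordinates along the z-axis, as described above."""
    flipped = kp3d_right.copy()
    flipped[:, 2] *= -1
    return flipped
```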
Specifically, the second step is hand motion assessment.
Fig. 6 is a schematic structural diagram of the action quality evaluation module according to the present invention.
Based on the three-dimensional poses of the two hands organized, a quality assessment of the hand movements shown in the video can be made. The evaluation includes two modules: performance assessment and feedback indication. For the former, the three-dimensional relative position of the joint (from left hand to right hand) is used as a feature for each frame. A model is then built to analyze the spatiotemporal information of the entire video and output scores. For the latter, feedback is provided indicating how the trainee should adjust each of its static postures to achieve the maximum improvement in the final score. This is achieved by maximizing the gradient of the final score with respect to the features of each frame.
For the first module, the performance evaluation module, the feature of the j-th joint in the t-th frame is $p_j(t) = [x_j(t), y_j(t), z_j(t)],\ j \in [1, 2m]$, all normalized with respect to the palm center. The features of all 2m joints (both hands) are then concatenated, yielding $\phi(t) = [p_1(t), \ldots, p_{2m}(t)]$, a high-level representation of the action in each frame.
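A minimal sketch of assembling $\phi(t)$, assuming each hand is given as an m x 3 array and the palm-center joint has index 0 (an assumption):

```python
import numpy as np

def frame_feature(left_joints, right_joints, palm_idx=0):
    """Build phi(t): normalize each hand's m x 3 joint coordinates relative to
    its palm-center joint (index assumed to be 0), then concatenate both hands
    into a single 2m x 3 array that is flattened for the downstream models."""
    left = left_joints - left_joints[palm_idx]
    right = right_joints - right_joints[palm_idx]
    return np.concatenate([left, right], axis=0).reshape(-1)   # length 6m
```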
There are two main approaches to automatic evaluation: learning and comparison against typical examples. The former poses a machine learning problem in which labeled data is collected from experts (an expert database) to train a scoring model, which then generalizes to complete the task. The latter compares the observed video with baseline videos of the hand motion to assess its quality. This module uses the first approach, because the second performs poorly unless many ideal demonstrations are available; specifically, to ensure that the scores are unbiased, the second method must include a large number of well-performed actions as references.
Given that static gestures are interrelated in completing an action, it is necessary to explore the spatiotemporal features of the entire video.
In the technical scheme of the invention, three models of a long-short term memory network, a discrete cosine transform and support vector classifier and a discrete Fourier transform and support vector classifier are used for comparison and verification.
For the first model, the long short-term memory network, this task has similarities to other time-series data, for which a recurrent neural network augmented with long short-term memory units has proven to perform well; therefore, the temporal aspect of the actions is modeled using a 1-layer LSTM network.
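A minimal PyTorch sketch of such a 1-layer LSTM scorer over the per-frame features; the hidden size and the use of PyTorch are illustrative choices, not the patent's configuration:

```python
import torch
import torch.nn as nn

class LSTMScorer(nn.Module):
    """Single-layer LSTM over the per-frame features phi(t), followed by a fully
    connected layer that outputs probabilities for the three performance levels.
    The hidden size is an illustrative choice, not the patent's setting."""
    def __init__(self, feat_dim, hidden=128, n_levels=3):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=1, batch_first=True)
        self.head = nn.Linear(hidden, n_levels)

    def forward(self, x):                  # x: (batch, T, feat_dim)
        _, (h, _) = self.lstm(x)           # h: (1, batch, hidden), last time step
        return torch.softmax(self.head(h[-1]), dim=-1)
```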
For the second model, discrete cosine transform + support vector classifier, a discrete-time cosine transform is applied to the time-series features to obtain a compact representation in the frequency domain. A defined number of low-frequency components is then passed to a support vector classifier (regressor) to give a score.
For the third model, discrete Fourier transform + support vector classifier, the discrete-time cosine transform is simply replaced by a discrete-time Fourier transform.
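The two frequency-domain variants can be sketched as follows, assuming SciPy and scikit-learn; the number of retained low-frequency components is an illustrative value:

```python
import numpy as np
from scipy.fft import dct
from sklearn.svm import SVC

def frequency_features(video_feats, n_low=10, transform="dct"):
    """Apply a discrete cosine (or Fourier) transform along the time axis of the
    (T, feat_dim) feature sequence and keep only the first n_low low-frequency
    components per dimension; n_low is an illustrative value."""
    if transform == "dct":
        spec = dct(video_feats, axis=0, norm="ortho")[:n_low]
    else:
        spec = np.abs(np.fft.rfft(video_feats, axis=0))[:n_low]
    return spec.reshape(-1)

# Usage sketch: X = [frequency_features(f) for f in all_videos]; y = levels
# clf = SVC(kernel="rbf").fit(X, y); clf.predict(...) gives the performance level.
```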
The network is trained to minimize the mean absolute error (MAE) between the target score and the predicted score:

$$\mathrm{MAE} = \frac{1}{N} \sum_{n=1}^{N} \bigl| S_n - \hat{S}_n \bigr|,$$

where $S_n$ and $\hat{S}_n$ are, respectively, the true score and the score obtained from the predicted probability vector for video $n$.
For the second module, the feedback indication module: in addition to the performance evaluation quantifying the effect of the gestures, feedback must be provided to indicate how the trainee should adjust the gestures to improve the overall score.
The module computes the gradient of the output score with respect to the features extracted in the last frame ($t_0$). Because the LSTM model captures the important spatio-temporal features over the course of the video and ignores relatively trivial ones, the output gate at frame $t_0$ is denoted as

$$O(t_0) = \sigma\bigl(W_o\, \phi(t_0) + U_o\, O(t_0 - 1) + b_o\bigr),$$

where $\sigma(\cdot)$ denotes the sigmoid function, $W_o$ and $U_o$ are shared parameter matrices, and $b_o$ is the bias term. For simplicity, write

$$Q_m(t_0) = \sigma\bigl(W_m\, \phi(t_0) + U_m\, O(t_0 - 1) + b_m\bigr),$$
$$K_m(t_0) = \tanh\bigl(W_m\, \phi(t_0) + U_m\, O(t_0 - 1) + b_m\bigr).$$

The gradient of the overall predicted probability vector $\hat{S}$ with respect to the input $\phi(t_0)$ is then computed through the fully connected output layer and the LSTM gates, where $H(\cdot)$ is the derivative of the softmax function, $W'$ and $b'$ are the weight matrix and bias term of the fully connected layer, and

$$A = \tanh\bigl(K_a(t_0)\, Q_i(t_0) + Q_f(t_0)\, h(t_0 - 1)\bigr).$$

The calculated gradient can be represented as a $3 \times 6m$ matrix. The goal is to select the largest element of the row vector that corresponds to the good-performance class; the result reflects the joint that most needs to be adjusted and the corresponding adjustment direction that maximizes the improvement of the final score. Further, the corresponding values for frames $t = 1, 2, \ldots, t_0 - 1$ can also be calculated by back-propagation through the LSTM.
Thus, a causal relationship between the action and the score may be established.
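The gradient-based feedback can be sketched with automatic differentiation, assuming a trained model such as the LSTM scorer above that maps a (1, T, 3·2m) feature tensor to class probabilities; the class index of the good level and the joint count are assumptions:

```python
import torch

def feedback_for_last_frame(model, feats, good_idx=2, m_joints=42):
    """Compute d(probability of the good-performance class)/d(phi(t0)) by
    back-propagation through the trained model, reshape it to (m_joints, 3),
    and return the joint with the largest gradient magnitude plus the sign of
    its gradient as the adjustment direction."""
    x = feats.clone().detach().requires_grad_(True)        # (1, T, 3*m_joints)
    probs = model(x)
    probs[0, good_idx].backward()
    grad_last = x.grad[0, -1].reshape(m_joints, 3)          # gradient w.r.t. phi(t0)
    joint = int(grad_last.norm(dim=1).argmax())
    direction = torch.sign(grad_last[joint])                 # move each coordinate this way
    return joint, direction
```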
To our knowledge, there is currently no public data set for hand motion assessment; most related datasets deal with gesture recognition and body-based motion quality assessment. We therefore built our own origami video data set.
The task of hand motion assessment requires the capture of clear motions to fully reflect the performer's gestures.
Clearly, medical surgery is a good choice. However, it is almost impossible to view and record those surgical videos on a large scale. Furthermore, the skills exhibited in medical procedures are typically reflected on the tools that are operated, rather than on the hands themselves.
Thus, to demonstrate the effect of the method of the invention, only one basic origami action is selected: folding a square sheet into 8 × 8 small squares.
Experts classify performance into three levels according to the following rules:
1. A good level requires the paper to be folded very carefully during the operation: when the paper is folded lengthwise, the two edges must overlap strictly; in addition, the creases should be thin and clear, so that after being folded four times the paper can still be bent without breaking. These strict requirements add some extra steps to the conventional procedure.
2. A medium level means the paper is folded with relative care; however, it is not required that all edges overlap exactly in the process, and the final square edges are clear but not particularly precise.
3. A poor level means the paper is folded very carelessly, without regard to whether the two edges strictly overlap; the final edges of the squares are ambiguous and indistinguishable, and the basic strategy is simply to keep folding the sheet lengthwise without stopping to check the edges.
144 short videos of paper folding actions were collected, of which 44 are labeled by the experts as good actions, 66 as medium actions, and 34 as poor actions. The data set is divided into a training set and a test set at a 5:1 ratio, and each labeled action level is guaranteed to be distributed between the two sets in equal proportion.
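A sketch of such a stratified 5:1 split with scikit-learn; the numeric label encoding is illustrative:

```python
from sklearn.model_selection import train_test_split

# 144 clips labeled good / medium / poor; a 5:1 split kept stratified so each
# level appears in the same proportion in both sets (label values are illustrative).
videos = list(range(144))
labels = [2] * 44 + [1] * 66 + [0] * 34          # good, medium, poor
train_v, test_v, train_y, test_y = train_test_split(
    videos, labels, test_size=1 / 6, stratify=labels, random_state=0)
```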
The Bayesian optimization method provides the best hyper-parameter set for the three evaluation models on the training set.
Accuracy and computational efficiency are two crucial considerations in performance assessment.
The accuracy of the aforementioned three models ("long-short term memory network", "discrete cosine transform + support vector classifier", and "discrete fourier transform + support vector classifier") on the test set was compared under different criteria (see table 1 below) and classes (see table 2 below).
The LSTM gives the highest accuracy but the lowest AUC: although it predicts the data of the good performance level well, its performance decreases at the other levels (especially the poor level).
The underlying reason is that the LSTM is sensitive to time-domain characteristics (e.g., time span and motion phases), and good-level performances are distinctive in these respects.
It can be concluded that the LSTM is more suitable for time-sensitive actions, whereas DCT + SVC shows more advantages in evaluating actions that strictly follow standard rules.
TABLE 1: accuracy of the three models on the test set under different criteria (reproduced as an image).
TABLE 2: accuracy of the three models on the test set by performance class (reproduced as an image).
Computational efficiency is also very important for video-based methods, especially when they are applied in real scenes. Given that the solution of the invention is applied to training systems aimed at facilitating virtual reality, computational efficiency should be ensured so that real-time hand reconstruction and timely action quality assessment can be performed.
Therefore, the technical scheme defines two new indexes: the reconstruction computation cost (RCC), representing the computation time of the gesture estimation for each frame, and the evaluation calculation ratio (ACR), representing the ratio of the computation time of the motion quality evaluation to the video duration. Since the extraction and organization of hand gestures proceed as the video motion progresses, the RCC quantifies how synchronized the virtual representation of the motion is with the real motion; the ACR, on the other hand, measures the degree to which timely evaluation and feedback are provided.
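A sketch of how the two indexes could be measured, assuming per-frame pose-estimation times and the evaluation time have been collected with a wall-clock timer:

```python
import time

def timed(fn, *args):
    """Return (result, elapsed_seconds) for one call."""
    t0 = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - t0

def rcc_and_acr(frame_times, eval_time, fps):
    """RCC: average per-frame pose-estimation time over a video of len(frame_times) frames.
    ACR: evaluation time divided by the video duration (frame count / fps)."""
    rcc = sum(frame_times) / len(frame_times)
    acr = eval_time / (len(frame_times) / fps)
    return rcc, acr
```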
The experiments were performed on a computer equipped with CPU Intel Xeon Bronze 3106, GPU 1080Ti and a memory size of 16 GB. The mean and variance of the reconstruction computation cost RCC for each video in the data set were studied for each performance level of the action (see fig. 9 to 11).
In FIGS. 9, 10 and 11, the figure labels are Time, Mean, Variance, and Video ID (video number).
It has been found that this value fluctuates mainly around 0.08 s. This illustrates that for normal 12 frames per second (fps) video (0.083 s per frame), a well-organized gesture is nearly synchronized with the motion being performed. For higher fps video, RCC can be improved by skipping frames, i.e., disregarding those of poor quality in each sliding window. The trade-off between RCC and accuracy can be balanced according to specific requirements.
Thus, this approach provides the opportunity for real-time virtual reconstruction of hand movements.
Further, the average values of the videos of the three action performance levels on the evaluation calculation ratio ACR were 0.23, 0.077, and 0.11, respectively. This indicates that this method can provide feedback in a short time after the action is completed.
In addition to performance assessment, feedback is also shown for each frame on how to adjust the gesture.
Several examples are shown in fig. 8a to 8 c. FIGS. 8a and 8b show the regions of two hands obtained after the gesture estimation and organization steps; fig. 8c shows the joint that requires the most adjustment from all joints in both hands (i.e., the joint with the greatest amount of gradient, which is indicated by the circle in fig. 8c, and the direction of the greatest gradient by the arrow).
Further, joints that often present problems when performing a particular action may be studied. This may provide an incentive for the manager to better refine the training program. Specifically, for each hand, the total number of times each joint is selected as most needed to be adjusted throughout the video is recorded.
The five joints with the highest totals for each hand at each performance level are shown in FIGS. 12 to 14, reflecting the joints most in need of adjustment.
In FIGS. 12, 13 and 14, the figure labels are Number of Occurrences, Thumb, Ring (ring finger), Palm (palm center), Index (index finger), Pinky (little finger), and Middle (middle finger).
Where fig. 12 corresponds to a bad performance level, fig. 13 corresponds to a medium performance level, and fig. 14 corresponds to a good performance level. The left side of each figure represents the left-hand case and the right side represents the right-hand case. By observing these figures, some interesting insight can be drawn about the hands of the action performers.
It can be observed that the tip of the thumb appears far more often than almost all other joints, indicating its importance in origami. The underlying reason is that this joint contributes the most to clearly identifiable creases (a core part of the evaluation rules).
The ring finger is used to help wrap the folds from behind. It shows more problems on the left hand than on the right hand, which suggests that the paper folders are more accustomed to using their right hands.
The index finger is another important finger in paper folding, appearing in the top five for the left hand in all three cases. This conveys the same information about the paper folders as the previous paragraph.
Since the totals for the left and right hands are extremely unbalanced in the poor-performance case, it can be concluded that a lack of dexterity in the left hand of a paper folder directly leads to a lower final performance level.
In summary, the technical solution of the present invention mainly includes two major components: hand pose estimation (feature extraction) and motion estimation.
For the first part, the technical scheme of the invention operates on each frame of a video: first, a convolutional neural network is used to segment the person's two hands in the image, and when cropping the segmented region only a small number of parameters are used instead of all pixel points, which improves computational efficiency; second, the key points of each hand are detected with a convolutional pose machine, and the two-dimensional joint coordinates are extracted; then, a neural network lifts the two-dimensional joint coordinates to three dimensions, which serve as the gesture feature of each hand; finally, the left and right hands in the sliding window and their states (fist with back up, fist with palm up, open with back up, open with palm up) are recognized according to skin color and hand structure, the gesture features are adjusted accordingly, and spatio-temporal features are obtained for the whole video.
For the second part, the technical scheme of the invention establishes three models (a long-short term memory network, a discrete cosine transform + a support vector classifier, a discrete Fourier transform + a support vector classifier) based on space-time characteristics to respectively predict the performance of the hand action reflected in the video.
Further, the technical solution of the present invention also provides a mechanism that can calculate the gradient of the final performance with respect to its three-dimensional coordinates for each joint in each frame and then give an indication of how to adjust the posture in order to improve the performance quickly.
Aiming at the research gap of hand motion evaluation in the field of video-based human behavior analysis, the problem of incorrect correspondence between features and real physical parts in human motion evaluation, and the real-time requirements of a motion evaluation system, the technical scheme of the invention provides a real-time hand motion recognition and evaluation method based on RGB video. The method ensures accurate matching between the extracted features and the details of the human hand in the real scene while the camera viewing angle changes continuously, improves the computational efficiency of the overall action recognition and evaluation, realizes real-time virtual reconstruction of hand motion, evaluates human hand motion accurately and in real time, and improves the accuracy and robustness of the overall action evaluation.
The method can be widely applied to the fields of vision-based hand posture estimation, motion quality evaluation methods and the like.

Claims (10)

1. A data-driven real-time hand motion evaluation method based on RGB video is characterized by comprising the following steps:
1) acquiring a video to be recognized of a hand;
2) performing hand region segmentation on the video to be identified;
3) calculating the probability of selecting each position as a key point according to the segmented hand area to obtain the position of the two-dimensional hand key point;
4) predicting three-dimensional hand key point positions according to the extracted two-dimensional hand key point positions;
5) recognizing the state of the hand according to the skin color and the structure of the hand, correspondingly adjusting the gesture characteristics, and obtaining space-time characteristics for the whole video;
6) based on the space-time characteristics, three models of a long short-term memory network, a discrete cosine transform and support vector classifier and a discrete Fourier transform and support vector classifier are used for comparison and verification, and the quality level of the hand motion is predicted;
7) the gesture estimation and organization are carried out through a deep learning-based method, real-time virtual reconstruction of hand motions is achieved, and quality evaluation of the hand motions is carried out.
2. The data-driven RGB video-based real-time hand motion assessment method as claimed in claim 1, wherein said gesture estimation and organization at least comprises: for conventional RGB video containing image frames showing both hands of a person, the pose of both hands is extracted as a feature of each still frame.
3. The data-driven RGB video-based real-time hand motion assessment method as claimed in claim 1, wherein the gesture estimation and organization is implemented by a hand segmentation module, a 2D hand pose estimation module, a 3D hand pose estimation module and a hand pose organization module.
4. The data-driven RGB video-based real-time hand motion assessment method according to claim 3, wherein said hand segmentation module is used to identify and segment the region in which the human hand is located in each frame of image, its model being constructed using data from the EgoHands dataset;
the 2D hand pose estimation module is used to extract the two-dimensional coordinate information of each joint point of the hand, obtaining the two-dimensional hand pose from a key-point score map that gives the probability of each pixel being selected as a key point;
the 3D hand pose estimation module is used to lift the two-dimensional hand pose extracted by the previous module to three dimensions, predicting relative, normalized three-dimensional coordinates from the incomplete and noisy key-point score maps obtained from the 2D hand pose estimation module;
the hand pose organization module is used to distinguish the left and right hands in each frame of the video, as well as the different geometric states of each hand, and then to adjust the original rough pose so that it conforms to the actual situation.
5. The data-driven RGB video-based real-time hand motion assessment method as claimed in claim 3, wherein the hand segmentation module greatly improves its computational efficiency, without affecting accuracy, by cropping a flexible region that covers the portion to be segmented.
6. The data-driven RGB video-based real-time hand motion assessment method according to claim 5, wherein the computational efficiency is characterized by a reconstruction computation cost RCC and an assessment computation ratio ACR;
the reconstruction computation cost RCC represents the computation time of the pose estimation for each frame and quantifies the degree to which the virtual representation of the action is synchronized with the real action;
the assessment computation ratio ACR represents the ratio of the computation time of the action quality assessment to the video duration and measures the degree to which timely assessment and feedback can be provided (an illustrative computation of both metrics is sketched after the claims).
7. The data-driven RGB video-based real-time hand motion assessment method as claimed in claim 1, wherein the quality assessment of the hand motion is performed by a performance assessment module and a feedback indication module.
8. The data-driven RGB video-based real-time hand motion assessment method as claimed in claim 7, wherein said performance assessment includes: the three-dimensional relative positions of the joints are used as the features of each frame, and a model is then built to analyze the spatio-temporal information of the whole video and output a score.
9. The data-driven RGB video-based real-time hand motion assessment method as claimed in claim 7, wherein said feedback indication comprises: the feedback provided instructs the trainee how each of his static postures should be adjusted, identifying the joints that most need adjustment and the corresponding adjustment directions that maximize the final score; by establishing a causal relationship between the action and the score, the maximum improvement of the trainee's final score is achieved.
10. The data-driven RGB video-based real-time hand motion assessment method as claimed in claim 9, wherein said feedback indication is achieved by maximizing the gradient of the final score with respect to the features of each frame.
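As referenced in claim 6, the sketch below illustrates one way the two efficiency metrics could be computed: the reconstruction computation cost RCC as the average per-frame pose-estimation time, and the assessment computation ratio ACR as the assessment time divided by the video duration. The timing strategy and function names are assumptions for illustration only.

import time

def reconstruction_computation_cost(pose_fn, frames):
    """RCC: average pose-estimation time per frame, in seconds."""
    start = time.perf_counter()
    for frame in frames:
        pose_fn(frame)
    return (time.perf_counter() - start) / len(frames)

def assessment_computation_ratio(assess_fn, features, video_duration_s):
    """ACR: assessment time divided by video duration (lower means more real-time)."""
    start = time.perf_counter()
    assess_fn(features)
    return (time.perf_counter() - start) / video_duration_s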
CN202010237076.1A 2020-03-30 2020-03-30 Data-driven real-time hand motion assessment method based on RGB video Active CN111460976B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010237076.1A CN111460976B (en) 2020-03-30 2020-03-30 Data-driven real-time hand motion assessment method based on RGB video

Publications (2)

Publication Number Publication Date
CN111460976A true CN111460976A (en) 2020-07-28
CN111460976B CN111460976B (en) 2023-06-06

Family

ID=71680513

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010237076.1A Active CN111460976B (en) 2020-03-30 2020-03-30 Data-driven real-time hand motion assessment method based on RGB video

Country Status (1)

Country Link
CN (1) CN111460976B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180307319A1 (en) * 2017-04-20 2018-10-25 Microsoft Technology Licensing, Llc Gesture recognition
CN107992858A (en) * 2017-12-25 2018-05-04 深圳市唯特视科技有限公司 A kind of real-time three-dimensional gesture method of estimation based on single RGB frame
CN110047591A (en) * 2019-04-23 2019-07-23 吉林大学 One kind is for doctor's posture appraisal procedure in surgical procedures
CN110147767A (en) * 2019-05-22 2019-08-20 深圳市凌云视迅科技有限责任公司 Three-dimension gesture attitude prediction method based on two dimensional image
CN110738192A (en) * 2019-10-29 2020-01-31 腾讯科技(深圳)有限公司 Human motion function auxiliary evaluation method, device, equipment, system and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
谢清超, 晁建刚 et al.: "Multi-camera hand pose estimation method based on joint-point occlusion inference" *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112233516A (en) * 2020-10-12 2021-01-15 萱闱(北京)生物科技有限公司 Grading method and system for physician CPR examination training and examination
CN112233515A (en) * 2020-10-12 2021-01-15 萱闱(北京)生物科技有限公司 Unmanned examination and intelligent scoring method applied to physician CPR examination
CN112329571A (en) * 2020-10-27 2021-02-05 同济大学 Self-adaptive human body posture optimization method based on posture quality evaluation
CN112329571B (en) * 2020-10-27 2022-12-16 同济大学 Self-adaptive human body posture optimization method based on posture quality evaluation
CN113435320A (en) * 2021-06-25 2021-09-24 中国科学技术大学 Human body posture estimation method with multiple models configured in self-adaption mode
CN113435320B (en) * 2021-06-25 2022-07-15 中国科学技术大学 Human body posture estimation method with multiple models configured in self-adaption mode
CN113223364A (en) * 2021-06-29 2021-08-06 中国人民解放军海军工程大学 Submarine cable diving buoy simulation training system

Also Published As

Publication number Publication date
CN111460976B (en) 2023-06-06

Similar Documents

Publication Publication Date Title
CN111460976B (en) Data-driven real-time hand motion assessment method based on RGB video
Bourdev et al. Poselets: Body part detectors trained using 3d human pose annotations
Li et al. Model-based segmentation and recognition of dynamic gestures in continuous video streams
CN102576259B (en) Hand position detection method
US11783615B2 (en) Systems and methods for language driven gesture understanding
CN112800903B (en) Dynamic expression recognition method and system based on space-time diagram convolutional neural network
CN108256421A (en) A kind of dynamic gesture sequence real-time identification method, system and device
CN108647654A (en) The gesture video image identification system and method for view-based access control model
Liang et al. Resolving ambiguous hand pose predictions by exploiting part correlations
Chen et al. Combining unsupervised learning and discrimination for 3D action recognition
CN108921011A (en) A kind of dynamic hand gesture recognition system and method based on hidden Markov model
CN114821764A (en) Gesture image recognition method and system based on KCF tracking detection
CN114445853A (en) Visual gesture recognition system recognition method
Neverova Deep learning for human motion analysis
CN108595014A (en) A kind of real-time dynamic hand gesture recognition system and method for view-based access control model
US11361467B2 (en) Pose selection and animation of characters using video data and training techniques
CN108108648A (en) A kind of new gesture recognition system device and method
Hasan et al. Gesture feature extraction for static gesture recognition
Otberdout et al. Hand pose estimation based on deep learning depth map for hand gesture recognition
Achmed Upper body pose recognition and estimation towards the translation of South African Sign Language
Lin Visual hand tracking and gesture analysis
Leow et al. 3-D–2-D spatiotemporal registration for sports motion analysis
Micilotta Detection and tracking of humans for visual interaction
Psaltis Optical flow for dynamic facial expression recognition.
Poppe Discriminative vision-based recovery and recognition of human motion.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant