CN114613006A - Remote gesture recognition method and device - Google Patents

Remote gesture recognition method and device

Info

Publication number
CN114613006A
CN114613006A
Authority
CN
China
Prior art keywords
gesture
frame
video
hand position
gesture recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210225062.7A
Other languages
Chinese (zh)
Inventor
刘丹
张立波
武延军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Software of CAS
Original Assignee
Institute of Software of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Software of CAS
Priority to CN202210225062.7A
Publication of CN114613006A
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a remote gesture recognition method and device, the method comprising the following steps: obtaining the hand position h_1 in frame p_1 of a target video and calculating the gesture estimation region of frame p_1 based on the hand position h_1; calculating the gesture estimation region of frame p_i based on the hand position h_i in frame p_i; segmenting the target video based on the gesture estimation regions to obtain a plurality of video streams; and performing gesture recognition on each video stream s_t to obtain the gesture recognition result of the target video. The invention can detect and recognize multiple gestures appearing in a video at long range, copes with the varying durations of different gesture types and the differing gesture speeds of individuals, and offers stronger interference resistance and higher recognition accuracy.

Description

Remote gesture recognition method and device
Technical Field
The invention relates to the field of computer vision and gesture recognition, in particular to a remote gesture recognition method and device.
Background
Gesture recognition plays an important role in communication between people and in human-machine interaction, with broad application prospects in sign language recognition and natural human-computer interaction. Gestures are complex and varied, their duration is highly uncertain, and shooting angle, distance, and lighting conditions all affect their appearance, so detecting and recognizing gestures is a challenging task.
Gesture recognition must account for changes in both the shape and the position of the hand. Manually designed feature descriptors struggle to cover the fine details of gestures, whereas deep neural networks have strong feature representation capability and have shown powerful advantages on image and video vision tasks. The current mainstream approach is therefore to learn and express the complex spatial morphology and temporal motion characteristics of gestures with deep neural networks. Convolutional neural networks (CNNs) are widely used to extract the spatial features of images. For representing temporal motion features there are three main methods. The first is based on optical flow and motion vectors; its computation is very heavy, it is easily affected by illumination and occlusion, and its robustness is poor. The second uses recurrent neural networks (RNNs) to extract temporal features, feeding the image features extracted by a CNN into the RNN; the resulting model is large and complex, hard to optimize, and the original video usually has to be heavily down-sampled, so key information is easily lost. The third is based on 3D convolution: a three-dimensional kernel convolves over two spatial dimensions and one temporal dimension, extracting spatial and temporal features simultaneously.
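To make the third approach concrete, the short sketch below shows a single 3D convolution in PyTorch whose kernel spans two spatial dimensions and one temporal dimension; the layer sizes are illustrative assumptions, not values from the patent.

```python
# A 3D convolution extracts spatial and temporal features in one operation.
import torch
import torch.nn as nn

clip = torch.randn(1, 3, 16, 112, 112)       # (batch, channels, time, H, W)
conv3d = nn.Conv3d(in_channels=3, out_channels=64,
                   kernel_size=(3, 7, 7),     # 3 frames x 7 x 7 pixels
                   stride=(1, 2, 2), padding=(1, 3, 3))
features = conv3d(clip)                       # -> shape (1, 64, 16, 56, 56)
```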
Existing gesture recognition methods focus on close-range interaction scenarios such as face-to-face communication and gesture-controlled driving, for example Chinese invention patents CN108932500A and CN113255602A. In these scenarios the gesture-maker is very close to the camera, so the hand is conspicuous and easily recognized in the captured image. Many scenarios, however, require remote control and interaction: in a meeting, a participant may want to control a large-screen conference terminal to present slides through gestures; while watching a movie at home, a viewer may want to adjust playback progress, volume, and so on through gestures. At long range, the region where the gesture occurs occupies only a small fraction of the camera's field of view, gesture details are insufficient, and the background introduces more interference, so gesture recognition is considerably harder.
Disclosure of Invention
The invention aims to solve the problems of the prior art, namely low gesture recognition accuracy, low speed, and the inability to recognize remote gestures, and provides a remote gesture recognition method and device that extract more robust video features and capture and accurately recognize remote gestures.
To achieve this purpose, the invention adopts the following technical scheme:
a remote gesture recognition method comprises the following steps:
obtaining frame p in target video1Hand position h1And based on said hand position h1Calculating a frame p1Gesture estimation region q of1
Obtaining frame p in target videoiHand position hiWhen the hand position h is reachediFalls on frame pi-1Gesture estimation region q ofjWithin, the gesture estimation area q is estimatedjAs the frame piOtherwise based on the hand position hiCalculating a frame piGesture estimation region q ofj+1
Estimating region q based on gesturesjSegmenting the target video to obtain a plurality of video streams st
For each of said video streams stAnd performing gesture recognition to obtain a gesture recognition result of the target video.
Further, obtaining the hand position h_1 in frame p_1 of the target video comprises:
performing supervised training of a YOLO V4 Tiny detection model on a hand position training set to obtain a hand detector;
inputting the image of frame p_1 into the hand detector to obtain the hand position h_1.
Further, calculating the gesture estimation region q_1 of frame p_1 based on the hand position h_1 comprises: taking a rectangular region centered on the hand position h_1 and extended outward to r_w times the hand width and r_h times the hand height.
Further, segmenting the target video based on the gesture estimation regions q_j to obtain the video streams s_t comprises:
acquiring key frames from the frames p_i;
composing the key frames having the same gesture estimation region q_j into one video stream s_t.
Further, acquiring key frames from the frames p_i comprises:
for frames p_i and p_{i-1} having the same gesture estimation region q_j, converting their gesture estimation regions q_j into grayscale images F_cur and F_pre respectively;
calculating the frame difference map of the grayscale images F_cur and F_pre;
converting the frame difference map into a binary map based on a set pixel value threshold;
counting, based on the binary map, the number of pixels within the gesture estimation region q_j that exceed the pixel value threshold;
calculating the proportion of that number of pixels to the total pixels of the gesture estimation region q_j, and determining from the proportion whether frame p_i is a key frame.
Further, performing gesture recognition on each video stream s_t to obtain the gesture recognition result of the target video comprises:
obtaining a plurality of windows of the video stream s_t using a sliding window;
inputting each window's video stream into a multi-modal gesture recognition model based on the 3D ResNeXt-101 convolutional neural network to predict the gesture category of the window, wherein after each ResNeXt residual module of the gesture recognition model, feature maps from the different modality video streams undergo weighted fusion;
when the gesture categories of n consecutive windows are all predicted to be gesture category L_c, taking gesture category L_c as one prediction result of the video stream s_t;
aggregating the prediction results of the video streams s_t to obtain the gesture recognition result of the target video.
Further, the different modality video streams include: RGB video streams and depth video streams.
Further, when the gesture categories of m consecutive windows are predicted not to be gesture category L_c, the gesture of category L_c is judged to have ended.
A storage medium having a computer program stored therein, wherein the computer program is arranged to perform any of the above methods when executed.
An electronic device comprising a memory having a computer program stored therein and a processor configured to execute the computer program to perform any of the methods described above.
Compared with the prior art, the invention has the following beneficial effects:
(1) Multi-modal data driven: through the designed multi-modal gesture recognition model, the data features of the RGB and depth modalities are fused for prediction. The two kinds of features complement each other, giving stronger interference resistance and higher recognition accuracy;
(2) Remote recognition: a gesture region estimation method is provided, so gestures can be located rapidly at long range;
(3) Fast and accurate recognition: a key frame sampling method is provided, so accurate gesture prediction can be made from fewer frames, coping with the varying durations of different gesture types and the differing gesture speeds of individuals;
(4) Localizing where gestures occur: the specific position at which a gesture appears in the video is detected while its category is recognized, and multiple gestures appearing in the video can be detected and recognized.
Drawings
FIG. 1 is a schematic flow chart of an intelligent remote gesture recognition system according to the present invention.
FIG. 2 is a diagram of a multi-modal gesture recognition network model.
Detailed Description
The invention is explained in further detail below with reference to the drawings and a specific embodiment, to which the invention is not limited.
The method performs remote gesture recognition based on RGB-D data and a 3D convolutional neural network, so a remote gesture data set must first be constructed to provide training data for the model. In this embodiment, a Kinect V4 device acquires the RGB-D video data; the resolutions of the RGB stream and the depth stream are 1280 × 720 and 640 × 570 respectively, and the two streams are captured synchronously at a frame rate of 30 fps. 30 subjects were randomly assigned to 5 scenes for gesture acquisition, at distances between 1 m and 4 m. After acquisition, each gesture in the video is annotated with its category and its start and end frames, and the hand position (x, y, w, h) is annotated in each video frame, where (x, y) are the coordinates of the center of the hand's rectangular bounding box and w and h are the box's width and height.
To estimate the gesture occurrence region, a hand detector must be trained. This embodiment adopts the YOLO V4 Tiny detection model, performs supervised training on the constructed data set using the hand position labels, and saves the best-performing model weights.
At long range the gesture is not conspicuous in the camera's field of view, so the proposed method estimates the gesture occurrence region from the hand position in order to quickly narrow the search range and recognize more accurately. The specific steps are as follows:
1) detecting the hand position in the current frame with the trained hand detector, denoted R_hand = (x, y, w, h), where (x, y) is the center of the hand's rectangular bounding region R_hand and w and h are the region's width and height;
2) estimating the possible gesture region as a rectangle centered on the hand and extended outward to r_w times (e.g. 5 times) the hand width and r_h times (e.g. 4 times) the hand height, denoted R_ges = (x, y, r_w × w, r_h × h);
3) repeating step 1) for each new frame; if the hand position in the current frame falls within the gesture region estimated for the previous frame, the previously estimated gesture region is kept unchanged; otherwise step 2) is performed to estimate a new gesture region.
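The following sketch illustrates steps 1)-3) in Python. It is a minimal reading of the procedure, not the patented implementation: `detect_hand` stands in for the trained YOLO V4 Tiny detector, and the expansion factors default to the example values r_w = 5 and r_h = 4.

```python
# Gesture-region estimation: reuse the previous region while the hand stays
# inside it; otherwise expand the detected hand box into a new region.

def inside(hand, region):
    """Check whether the hand center (x, y) falls within a region (x, y, w, h)."""
    hx, hy, _, _ = hand
    rx, ry, rw, rh = region
    return abs(hx - rx) <= rw / 2 and abs(hy - ry) <= rh / 2

def estimate_gesture_region(hand, r_w=5, r_h=4):
    """Step 2: expand the hand box (x, y, w, h) into a gesture region R_ges."""
    x, y, w, h = hand
    return (x, y, r_w * w, r_h * h)

def track_gesture_regions(frames, detect_hand):
    """Steps 1-3: one gesture region per frame, reused while the hand stays inside."""
    region = None
    regions = []
    for frame in frames:
        hand = detect_hand(frame)            # step 1: R_hand = (x, y, w, h)
        if hand is not None:
            if region is None or not inside(hand, region):
                region = estimate_gesture_region(hand)   # step 2: new region
        regions.append(region)               # step 3: otherwise keep old region
    return regions
```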
Different gestures last different lengths of time, and each individual makes gestures at a different speed. Gestures of long duration can only be recognized from long-range temporal features, but simply enlarging the feature window makes the model computation heavy and recognition slow; the method therefore samples key frames so that a window of fixed size covers the informative motion. The key frame sampling steps are as follows:
1) cropping the corresponding gesture estimation region from the current frame and the previous frame, converting the crops into grayscale images F_cur and F_pre, and computing the frame difference map of the two regions:
F_diff = |F_cur - F_pre|
2) with a pixel value threshold of 25, converting the frame difference map F_diff into a binary map F_bin: a position whose value exceeds 25 is assigned 1, indicating that the change between the two frames at that position is sufficiently large; otherwise it is assigned 0;
3) counting the proportion of pixels with value 1 in the binary map to the total number of pixels:
r = (1 / (W × H)) × Σ_{i=1..W} Σ_{j=1..H} F_bin(i, j)
where W and H are the width and height of the binary map F_bin. When r > 0.3, the current frame is regarded as a key frame and kept for gesture recognition; otherwise it is regarded as a redundant frame and discarded directly.
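A compact illustration of the key frame test in Python with OpenCV, using the thresholds stated above (pixel threshold 25, ratio threshold 0.3); the crop handling is an assumption for the sketch, with bounds clamping omitted for brevity.

```python
# Key-frame test: threshold the frame difference inside the gesture region
# and keep the frame when enough pixels changed.
import cv2
import numpy as np

def is_key_frame(cur_bgr, pre_bgr, region, pix_thresh=25, ratio_thresh=0.3):
    """Return True when the change inside the gesture region is large enough."""
    x, y, w, h = map(int, region)            # region center (x, y) and size (w, h)
    x0, y0 = x - w // 2, y - h // 2          # top-left corner
    cur = cv2.cvtColor(cur_bgr[y0:y0 + h, x0:x0 + w], cv2.COLOR_BGR2GRAY)
    pre = cv2.cvtColor(pre_bgr[y0:y0 + h, x0:x0 + w], cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(cur, pre)             # F_diff = |F_cur - F_pre|
    binary = (diff > pix_thresh).astype(np.uint8)   # F_bin: 1 where change is large
    r = float(binary.sum()) / binary.size    # proportion of changed pixels
    return r > ratio_thresh                  # key frame when r > 0.3
```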
The gesture recognition model is designed on the ResNeXt-101 convolutional neural network. As shown in FIG. 2, two 3D ResNeXt-101 networks extract features from the RGB video stream and the depth video stream respectively; after each ResNeXt residual module, the feature maps are re-weighted and fused, and by fusing and re-distributing the weights of the two modalities' features, the model focuses on the more effective and robust features. A sketch of this fusion step appears after the training steps below. The model training steps are as follows:
1) Preprocessing: perform region estimation and key frame sampling on the gestures in the data set, then split it into a training set and a validation set at a 7:3 ratio;
2) Training: train the multi-modal gesture recognition model on the preprocessed RGB-D video data, using the gesture category and gesture start/end annotations as supervision signals;
3) Testing: evaluate the model on the validation set and save the best-performing model weights.
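The patent does not spell out the fusion operator, so the sketch below shows one plausible reading as an assumption: a learned per-channel weighting that mixes the RGB and depth feature maps after a residual stage and feeds the fused result back to both streams (PyTorch). In the full model, one such module would follow each residual stage of the two 3D ResNeXt-101 backbones.

```python
# Hypothetical cross-modal fusion module inserted after each ResNeXt residual
# block; the weighting scheme is an assumption, not the disclosed design.
import torch
import torch.nn as nn

class ModalityFusion(nn.Module):
    """Re-weight and exchange RGB / depth feature maps after a residual block."""
    def __init__(self, channels):
        super().__init__()
        # one learned weight per channel and per modality
        self.w_rgb = nn.Parameter(torch.ones(1, channels, 1, 1, 1))
        self.w_depth = nn.Parameter(torch.ones(1, channels, 1, 1, 1))

    def forward(self, f_rgb, f_depth):
        # weighted sum shared back to both streams (5D tensors: N, C, T, H, W)
        fused = self.w_rgb * f_rgb + self.w_depth * f_depth
        return f_rgb + fused, f_depth + fused
```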
For a given video, or a video stream captured by a camera in real time, the proposed remote gesture recognition method proceeds as follows:
1) obtaining the hand position h_1 in frame p_1 of the target video, and calculating the gesture estimation region q_1 of frame p_1 based on the hand position h_1;
2) obtaining the hand position h_i in frame p_i of the target video; when the hand position h_i falls within the gesture estimation region q_j of frame p_{i-1}, taking the gesture estimation region q_j as that of frame p_i; otherwise calculating the gesture estimation region q_{j+1} of frame p_i based on the hand position h_i;
3) segmenting the target video based on the gesture estimation regions q_j to obtain a plurality of video streams s_t;
4) obtaining a plurality of windows of each video stream s_t using a sliding window;
5) inputting each window's video stream into the multi-modal gesture recognition model based on the 3D ResNeXt-101 convolutional neural network to predict the window's gesture category, where after each ResNeXt residual module of the gesture recognition model, feature maps from the different modality video streams undergo weighted fusion;
6) when the gesture categories of n consecutive windows are all predicted to be gesture category L_c, taking gesture category L_c as one prediction result of the video stream s_t;
7) aggregating the prediction results of the video streams s_t to obtain the gesture recognition result of the target video.
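A minimal sketch of the window-voting logic of steps 4)-7), treating the multi-modal model as a black-box `predict_window`; the values of n and m and the encoding of the no-gesture class are assumptions for illustration. It also applies the optional rule that a gesture L_c is judged ended after m consecutive windows predicted not to be L_c.

```python
# Confirm a gesture once n consecutive windows agree on the same class;
# judge it ended after m consecutive non-matching windows.

def recognize_stream(windows, predict_window, n=3, m=3, no_gesture=0):
    """Return (label, start_index) pairs for each confirmed gesture."""
    results = []
    prev, run = None, 0          # current run of identical predictions
    active, miss = None, 0       # currently confirmed gesture, if any
    for idx, win in enumerate(windows):
        label = predict_window(win)
        run = run + 1 if label == prev else 1
        prev = label
        if label != no_gesture and run == n and label != active:
            active, miss = label, 0
            results.append((label, idx - n + 1))   # gesture L_c confirmed
        elif active is not None and label != active:
            miss += 1
            if miss >= m:                          # L_c judged ended
                active, miss = None, 0
        else:
            miss = 0
    return results
```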
The proposed method estimates the gesture occurrence region from the hand position, which greatly narrows the gesture recognition range, removes a large amount of the background interference present at long range, and focuses on gesture features, improving the accuracy of remote gesture recognition. Key frame sampling reduces the influence of gesture duration and speed while cutting the model's computation, preserving both recognition accuracy and speed. The designed gesture recognition model based on a 3D convolutional neural network makes comprehensive use of multi-modal data for recognition, with stronger interference resistance and higher accuracy.
The above is only a preferred embodiment of the present invention. It should be understood that any modifications, equivalents, and the like made within the scope of the basic principles of the present invention are included in its protection scope.

Claims (10)

1. A remote gesture recognition method, comprising the following steps:
obtaining the hand position h_1 in frame p_1 of a target video, and calculating the gesture estimation region q_1 of frame p_1 based on the hand position h_1;
obtaining the hand position h_i in frame p_i of the target video; when the hand position h_i falls within the gesture estimation region q_j of frame p_{i-1}, taking the gesture estimation region q_j as that of frame p_i; otherwise calculating the gesture estimation region q_{j+1} of frame p_i based on the hand position h_i;
segmenting the target video based on the gesture estimation regions q_j to obtain a plurality of video streams s_t;
performing gesture recognition on each video stream s_t to obtain a gesture recognition result of the target video.
2. The method of claim 1, wherein obtaining the hand position h_1 in frame p_1 of the target video comprises:
performing supervised training of a YOLO V4 Tiny detection model on a hand position training set to obtain a hand detector;
inputting the image of frame p_1 into the hand detector to obtain the hand position h_1.
3. The method of claim 1, wherein calculating the gesture estimation region q_1 of frame p_1 based on the hand position h_1 comprises: taking a rectangular region centered on the hand position h_1 and extended outward to r_w times the hand width and r_h times the hand height.
4. The method of claim 1, wherein segmenting the target video based on the gesture estimation regions q_j to obtain the video streams s_t comprises:
acquiring key frames from the frames p_i;
composing the key frames having the same gesture estimation region q_j into one video stream s_t.
5. The method of claim 4, wherein acquiring key frames from the frames p_i comprises:
for frames p_i and p_{i-1} having the same gesture estimation region q_j, converting their gesture estimation regions q_j into grayscale images F_cur and F_pre respectively;
calculating the frame difference map of the grayscale images F_cur and F_pre;
converting the frame difference map into a binary map based on a set pixel value threshold;
counting, based on the binary map, the number of pixels within the gesture estimation region q_j that exceed the pixel value threshold;
calculating the proportion of that number of pixels to the total pixels of the gesture estimation region q_j, and determining from the proportion whether frame p_i is a key frame.
6. The method of claim 1, wherein performing gesture recognition on each video stream s_t to obtain the gesture recognition result of the target video comprises:
obtaining a plurality of windows of the video stream s_t using a sliding window;
inputting each window's video stream into a multi-modal gesture recognition model based on the 3D ResNeXt-101 convolutional neural network to predict the gesture category of the window, wherein after each ResNeXt residual module of the gesture recognition model, feature maps from the different modality video streams undergo weighted fusion;
when the gesture categories of n consecutive windows are all predicted to be gesture category L_c, taking gesture category L_c as one prediction result of the video stream s_t;
aggregating the prediction results of the video streams s_t to obtain the gesture recognition result of the target video.
7. The method of claim 6, wherein the different modality video streams include: an RGB video stream and a depth video stream.
8. The method of claim 6, wherein when the gesture categories of m consecutive windows are predicted not to be gesture category L_c, the gesture of category L_c is judged to have ended.
9. A storage medium having a computer program stored thereon, wherein the computer program is arranged to, when run, perform the method of any of claims 1-8.
10. An electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the method according to any of claims 1-8.
CN202210225062.7A 2022-03-09 2022-03-09 Remote gesture recognition method and device Pending CN114613006A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210225062.7A CN114613006A (en) 2022-03-09 2022-03-09 Remote gesture recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210225062.7A CN114613006A (en) 2022-03-09 2022-03-09 Remote gesture recognition method and device

Publications (1)

Publication Number Publication Date
CN114613006A true CN114613006A (en) 2022-06-10

Family

ID=81861159

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210225062.7A Pending CN114613006A (en) 2022-03-09 2022-03-09 Remote gesture recognition method and device

Country Status (1)

Country Link
CN (1) CN114613006A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117111530A (en) * 2023-09-27 2023-11-24 浙江加力仓储设备股份有限公司 Intelligent control system and method for carrier through gestures
CN117111530B (en) * 2023-09-27 2024-05-03 浙江加力仓储设备股份有限公司 Intelligent control system and method for carrier through gestures
CN117523669A (en) * 2023-11-17 2024-02-06 中国科学院自动化研究所 Gesture recognition method, gesture recognition device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
WO2019023921A1 (en) Gesture recognition method, apparatus, and device
CN109035304B (en) Target tracking method, medium, computing device and apparatus
US9020195B2 (en) Object tracking device, object tracking method, and control program
EP2959454B1 (en) Method, system and software module for foreground extraction
US11443454B2 (en) Method for estimating the pose of a camera in the frame of reference of a three-dimensional scene, device, augmented reality system and computer program therefor
CN109145708B (en) Pedestrian flow statistical method based on RGB and D information fusion
CN107452015B (en) Target tracking system with re-detection mechanism
CN109727275B (en) Object detection method, device, system and computer readable storage medium
US9536321B2 (en) Apparatus and method for foreground object segmentation
CN114613006A (en) Remote gesture recognition method and device
JP4682820B2 (en) Object tracking device, object tracking method, and program
CN104966304A (en) Kalman filtering and nonparametric background model-based multi-target detection tracking method
CN107169503B (en) Indoor scene classification method and device
JP6924064B2 (en) Image processing device and its control method, and image pickup device
WO2021147055A1 (en) Systems and methods for video anomaly detection using multi-scale image frame prediction network
CN113869258A (en) Traffic incident detection method and device, electronic equipment and readable storage medium
KR100348357B1 (en) An Effective Object Tracking Method of Apparatus for Interactive Hyperlink Video
JP4918615B2 (en) Object number detection device and object number detection method
KR102584708B1 (en) System and Method for Crowd Risk Management by Supporting Under and Over Crowded Environments
CN116363753A (en) Tumble detection method and device based on motion history image and electronic equipment
Chuang et al. Moving object segmentation and tracking using active contour and color classification models
JP4674920B2 (en) Object number detection device and object number detection method
CN111583341B (en) Cloud deck camera shift detection method
Liu et al. A scalable automated system to measure user experience on smart devices
JP2012242947A (en) Method for measuring number of passing objects, number-of-passing object measuring device, and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination