CN114613006A - Remote gesture recognition method and device - Google Patents

Remote gesture recognition method and device

Info

Publication number
CN114613006A
CN114613006A
Authority
CN
China
Prior art keywords
gesture
frame
video
hand position
gesture recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210225062.7A
Other languages
Chinese (zh)
Inventor
刘丹
张立波
武延军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Software of CAS
Original Assignee
Institute of Software of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Software of CAS
Priority to CN202210225062.7A
Publication of CN114613006A
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a remote gesture recognition method and device, the method comprising the following steps: obtaining the hand position h_1 in frame p_1 of a target video and calculating the gesture estimation region of frame p_1 based on the hand position h_1; calculating the gesture estimation region of frame p_i based on the hand position h_i in frame p_i; segmenting the target video based on the gesture estimation regions to obtain a plurality of video streams; and performing gesture recognition on each video stream s_t to obtain the gesture recognition result of the target video. The invention can detect and recognize multiple gestures appearing in a video at long range, copes with the varying durations of different gesture types and the differing gesture speeds of individuals, and offers stronger interference resistance and higher recognition accuracy.

Description

Remote gesture recognition method and device
Technical Field
The invention relates to the field of computer vision and gesture recognition, in particular to a remote gesture recognition method and device.
Background
Gesture recognition plays an important role in communication between people and in human-machine interaction, with broad application prospects in sign language recognition and natural human-computer interaction. Gestures are complex and varied, their duration is highly uncertain, and shooting angle, distance, and lighting conditions all affect their appearance, so detecting and recognizing gestures is a challenging task.
Gesture recognition must account for changes in both the shape and the position of the hand. Manually designed feature descriptors struggle to cover the fine details of gestures, whereas deep neural networks have strong feature representation capability and have shown powerful advantages on image and video vision tasks. The current mainstream approach is therefore to learn and express the complex spatial morphology and temporal motion characteristics of gestures with deep neural networks. Convolutional neural networks (CNNs) are widely used to extract the spatial features of images. For representing temporal motion features there are three main methods. The first is based on optical flow and motion vectors; its computation is very heavy, it is easily affected by illumination and occlusion, and its robustness is poor. The second uses recurrent neural networks (RNNs) to extract temporal features, feeding the image features extracted by a CNN into the RNN; the resulting model is large and complex, hard to optimize, and the original video usually has to be heavily down-sampled, so key information is easily lost. The third is based on 3D convolution: a three-dimensional kernel convolves over two spatial dimensions and one temporal dimension, extracting spatial and temporal features simultaneously.
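To make the third approach concrete, the short sketch below shows a single 3D convolution in PyTorch whose kernel spans two spatial dimensions and one temporal dimension; the layer sizes are illustrative assumptions, not values from the patent.

```python
# A 3D convolution extracts spatial and temporal features in one operation.
import torch
import torch.nn as nn

clip = torch.randn(1, 3, 16, 112, 112)       # (batch, channels, time, H, W)
conv3d = nn.Conv3d(in_channels=3, out_channels=64,
                   kernel_size=(3, 7, 7),     # 3 frames x 7 x 7 pixels
                   stride=(1, 2, 2), padding=(1, 3, 3))
features = conv3d(clip)                       # -> shape (1, 64, 16, 56, 56)
```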
Existing gesture recognition methods focus on close-range interaction scenarios such as face-to-face communication and gesture-controlled driving, for example Chinese invention patents CN108932500A and CN113255602A. In these scenarios the gesture-maker is very close to the camera, so the hand is conspicuous and easily recognized in the captured image. Many scenarios, however, require remote control and interaction: in a meeting, a participant may want to control a large-screen conference terminal to present slides through gestures; while watching a movie at home, a viewer may want to adjust playback progress, volume, and so on through gestures. At long range, the region where the gesture occurs occupies only a small fraction of the camera's field of view, gesture details are insufficient, and the background introduces more interference, so gesture recognition is considerably harder.
Disclosure of Invention
The invention aims to solve the problems of the prior art, namely low gesture recognition accuracy, low speed, and the inability to recognize remote gestures, and provides a remote gesture recognition method and device that extract more robust video features and capture and accurately recognize remote gestures.
To achieve this purpose, the invention adopts the following technical scheme:
a remote gesture recognition method comprises the following steps:
obtaining frame p in target video1Hand position h1And based on said hand position h1Calculating a frame p1Gesture estimation region q of1
Obtaining frame p in target videoiHand position hiWhen the hand position h is reachediFalls on frame pi-1Gesture estimation region q ofjWithin, the gesture estimation area q is estimatedjAs the frame piOtherwise based on the hand position hiCalculating a frame piGesture estimation region q ofj+1
Estimating region q based on gesturesjSegmenting the target video to obtain a plurality of video streams st
For each of said video streams stAnd performing gesture recognition to obtain a gesture recognition result of the target video.
Further, obtaining the hand position h_1 in frame p_1 of the target video comprises:
performing supervised training of a YOLO V4 Tiny detection model on a hand position training set to obtain a hand detector;
inputting the image of frame p_1 into the hand detector to obtain the hand position h_1.
Further, calculating the gesture estimation region q_1 of frame p_1 based on the hand position h_1 comprises: taking a rectangular region centered on the hand position h_1 and extended outward to r_w times the hand width and r_h times the hand height.
Further, segmenting the target video based on the gesture estimation regions q_j to obtain the video streams s_t comprises:
acquiring key frames from the frames p_i;
composing the key frames having the same gesture estimation region q_j into one video stream s_t.
Further, acquiring key frames from the frames p_i comprises:
for frames p_i and p_{i-1} having the same gesture estimation region q_j, converting their gesture estimation regions q_j into grayscale images F_cur and F_pre respectively;
calculating the frame difference map of the grayscale images F_cur and F_pre;
converting the frame difference map into a binary map based on a set pixel value threshold;
counting, based on the binary map, the number of pixels within the gesture estimation region q_j that exceed the pixel value threshold;
calculating the proportion of that number of pixels to the total pixels of the gesture estimation region q_j, and determining from the proportion whether frame p_i is a key frame.
Further, performing gesture recognition on each video stream s_t to obtain the gesture recognition result of the target video comprises:
obtaining a plurality of windows of the video stream s_t using a sliding window;
inputting each window's video stream into a multi-modal gesture recognition model based on the 3D ResNeXt-101 convolutional neural network to predict the gesture category of the window, wherein after each ResNeXt residual module of the gesture recognition model, feature maps from the different modality video streams undergo weighted fusion;
when the gesture categories of n consecutive windows are all predicted to be gesture category L_c, taking gesture category L_c as one prediction result of the video stream s_t;
aggregating the prediction results of the video streams s_t to obtain the gesture recognition result of the target video.
Further, the different modality video streams include: RGB video streams and depth video streams.
Further, when the gesture categories of m consecutive windows are predicted not to be gesture category L_c, the gesture of category L_c is judged to have ended.
A storage medium having a computer program stored therein, wherein the computer program is arranged to perform any of the above methods when executed.
An electronic device comprising a memory having a computer program stored therein and a processor configured to execute the computer program to perform any of the methods described above.
Compared with the prior art, the invention has the following beneficial effects:
(1) Multi-modal data driven: through the designed multi-modal gesture recognition model, the data features of the RGB and depth modalities are fused for prediction. The two kinds of features complement each other, giving stronger interference resistance and higher recognition accuracy;
(2) Remote recognition: a gesture region estimation method is provided, so gestures can be located rapidly at long range;
(3) Fast and accurate recognition: a key frame sampling method is provided, so accurate gesture prediction can be made from fewer frames, coping with the varying durations of different gesture types and the differing gesture speeds of individuals;
(4) Localizing where gestures occur: the specific position at which a gesture appears in the video is detected while its category is recognized, and multiple gestures appearing in the video can be detected and recognized.
Drawings
FIG. 1 is a schematic flow chart of an intelligent remote gesture recognition system according to the present invention.
FIG. 2 is a diagram of a multi-modal gesture recognition network model.
Detailed Description
The invention is explained in further detail below with reference to the drawings and a specific embodiment, to which the invention is not limited.
The method performs remote gesture recognition based on RGB-D data and a 3D convolutional neural network, so a remote gesture data set must first be constructed to provide training data for the model. In this embodiment, a Kinect V4 device acquires the RGB-D video data; the resolutions of the RGB stream and the depth stream are 1280 × 720 and 640 × 570 respectively, and the two streams are captured synchronously at a frame rate of 30 fps. 30 subjects were randomly assigned to 5 scenes for gesture acquisition, at distances between 1 m and 4 m. After acquisition, each gesture in the video is annotated with its category and its start and end frames, and the hand position (x, y, w, h) is annotated in each video frame, where (x, y) are the coordinates of the center of the hand's rectangular bounding box and w and h are the box's width and height.
To estimate the gesture occurrence region, a hand detector must be trained. This embodiment adopts the YOLO V4 Tiny detection model, performs supervised training on the constructed data set using the hand position labels, and saves the best-performing model weights.
At long range the gesture is not conspicuous in the camera's field of view, so the proposed method estimates the gesture occurrence region from the hand position in order to quickly narrow the search range and recognize more accurately. The specific steps are as follows:
1) detecting the hand position in the current frame with the trained hand detector, denoted R_hand = (x, y, w, h), where (x, y) is the center of the hand's rectangular bounding region R_hand and w and h are the region's width and height;
2) estimating the possible gesture region as a rectangle centered on the hand and extended outward to r_w times (e.g. 5 times) the hand width and r_h times (e.g. 4 times) the hand height, denoted R_ges = (x, y, r_w × w, r_h × h);
3) repeating step 1) for each new frame; if the hand position in the current frame falls within the gesture region estimated for the previous frame, the previously estimated gesture region is kept unchanged; otherwise step 2) is performed to estimate a new gesture region.
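The following sketch illustrates steps 1)-3) in Python. It is a minimal reading of the procedure, not the patented implementation: `detect_hand` stands in for the trained YOLO V4 Tiny detector, and the expansion factors default to the example values r_w = 5 and r_h = 4.

```python
# Gesture-region estimation: reuse the previous region while the hand stays
# inside it; otherwise expand the detected hand box into a new region.

def inside(hand, region):
    """Check whether the hand center (x, y) falls within a region (x, y, w, h)."""
    hx, hy, _, _ = hand
    rx, ry, rw, rh = region
    return abs(hx - rx) <= rw / 2 and abs(hy - ry) <= rh / 2

def estimate_gesture_region(hand, r_w=5, r_h=4):
    """Step 2: expand the hand box (x, y, w, h) into a gesture region R_ges."""
    x, y, w, h = hand
    return (x, y, r_w * w, r_h * h)

def track_gesture_regions(frames, detect_hand):
    """Steps 1-3: one gesture region per frame, reused while the hand stays inside."""
    region = None
    regions = []
    for frame in frames:
        hand = detect_hand(frame)            # step 1: R_hand = (x, y, w, h)
        if hand is not None:
            if region is None or not inside(hand, region):
                region = estimate_gesture_region(hand)   # step 2: new region
        regions.append(region)               # step 3: otherwise keep old region
    return regions
```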
Different gestures last different lengths of time, and each individual makes gestures at a different speed. Gestures of long duration can only be recognized from long-range temporal features, but simply enlarging the feature window makes the model computation heavy and recognition slow; the method therefore samples key frames so that a window of fixed size covers the informative motion. The key frame sampling steps are as follows:
1) cropping the corresponding gesture estimation region from the current frame and the previous frame, converting the crops into grayscale images F_cur and F_pre, and computing the frame difference map of the two regions:
F_diff = |F_cur - F_pre|
2) with a pixel value threshold of 25, converting the frame difference map F_diff into a binary map F_bin: a position whose value exceeds 25 is assigned 1, indicating that the change between the two frames at that position is sufficiently large; otherwise it is assigned 0;
3) counting the proportion of pixels with value 1 in the binary map to the total number of pixels:
r = (1 / (W × H)) × Σ_{i=1..W} Σ_{j=1..H} F_bin(i, j)
where W and H are the width and height of the binary map F_bin. When r > 0.3, the current frame is regarded as a key frame and kept for gesture recognition; otherwise it is regarded as a redundant frame and discarded directly.
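A compact illustration of the key frame test in Python with OpenCV, using the thresholds stated above (pixel threshold 25, ratio threshold 0.3); the crop handling is an assumption for the sketch, with bounds clamping omitted for brevity.

```python
# Key-frame test: threshold the frame difference inside the gesture region
# and keep the frame when enough pixels changed.
import cv2
import numpy as np

def is_key_frame(cur_bgr, pre_bgr, region, pix_thresh=25, ratio_thresh=0.3):
    """Return True when the change inside the gesture region is large enough."""
    x, y, w, h = map(int, region)            # region center (x, y) and size (w, h)
    x0, y0 = x - w // 2, y - h // 2          # top-left corner
    cur = cv2.cvtColor(cur_bgr[y0:y0 + h, x0:x0 + w], cv2.COLOR_BGR2GRAY)
    pre = cv2.cvtColor(pre_bgr[y0:y0 + h, x0:x0 + w], cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(cur, pre)             # F_diff = |F_cur - F_pre|
    binary = (diff > pix_thresh).astype(np.uint8)   # F_bin: 1 where change is large
    r = float(binary.sum()) / binary.size    # proportion of changed pixels
    return r > ratio_thresh                  # key frame when r > 0.3
```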
The gesture recognition model is designed on the ResNeXt-101 convolutional neural network. As shown in FIG. 2, two 3D ResNeXt-101 networks extract features from the RGB video stream and the depth video stream respectively; after each ResNeXt residual module, the feature maps are re-weighted and fused, and by fusing and re-distributing the weights of the two modalities' features, the model focuses on the more effective and robust features. A sketch of this fusion step appears after the training steps below. The model training steps are as follows:
1) Preprocessing: perform region estimation and key frame sampling on the gestures in the data set, then split it into a training set and a validation set at a 7:3 ratio;
2) Training: train the multi-modal gesture recognition model on the preprocessed RGB-D video data, using the gesture category and gesture start/end annotations as supervision signals;
3) Testing: evaluate the model on the validation set and save the best-performing model weights.
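The patent does not spell out the fusion operator, so the sketch below shows one plausible reading as an assumption: a learned per-channel weighting that mixes the RGB and depth feature maps after a residual stage and feeds the fused result back to both streams (PyTorch). In the full model, one such module would follow each residual stage of the two 3D ResNeXt-101 backbones.

```python
# Hypothetical cross-modal fusion module inserted after each ResNeXt residual
# block; the weighting scheme is an assumption, not the disclosed design.
import torch
import torch.nn as nn

class ModalityFusion(nn.Module):
    """Re-weight and exchange RGB / depth feature maps after a residual block."""
    def __init__(self, channels):
        super().__init__()
        # one learned weight per channel and per modality
        self.w_rgb = nn.Parameter(torch.ones(1, channels, 1, 1, 1))
        self.w_depth = nn.Parameter(torch.ones(1, channels, 1, 1, 1))

    def forward(self, f_rgb, f_depth):
        # weighted sum shared back to both streams (5D tensors: N, C, T, H, W)
        fused = self.w_rgb * f_rgb + self.w_depth * f_depth
        return f_rgb + fused, f_depth + fused
```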
For a given video, or a video stream captured by a camera in real time, the proposed remote gesture recognition method proceeds as follows:
1) obtaining the hand position h_1 in frame p_1 of the target video, and calculating the gesture estimation region q_1 of frame p_1 based on the hand position h_1;
2) obtaining the hand position h_i in frame p_i of the target video; when the hand position h_i falls within the gesture estimation region q_j of frame p_{i-1}, taking the gesture estimation region q_j as that of frame p_i; otherwise calculating the gesture estimation region q_{j+1} of frame p_i based on the hand position h_i;
3) segmenting the target video based on the gesture estimation regions q_j to obtain a plurality of video streams s_t;
4) obtaining a plurality of windows of each video stream s_t using a sliding window;
5) inputting each window's video stream into the multi-modal gesture recognition model based on the 3D ResNeXt-101 convolutional neural network to predict the window's gesture category, where after each ResNeXt residual module of the gesture recognition model, feature maps from the different modality video streams undergo weighted fusion;
6) when the gesture categories of n consecutive windows are all predicted to be gesture category L_c, taking gesture category L_c as one prediction result of the video stream s_t;
7) aggregating the prediction results of the video streams s_t to obtain the gesture recognition result of the target video.
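A minimal sketch of the window-voting logic of steps 4)-7), treating the multi-modal model as a black-box `predict_window`; the values of n and m and the encoding of the no-gesture class are assumptions for illustration. It also applies the optional rule that a gesture L_c is judged ended after m consecutive windows predicted not to be L_c.

```python
# Confirm a gesture once n consecutive windows agree on the same class;
# judge it ended after m consecutive non-matching windows.

def recognize_stream(windows, predict_window, n=3, m=3, no_gesture=0):
    """Return (label, start_index) pairs for each confirmed gesture."""
    results = []
    prev, run = None, 0          # current run of identical predictions
    active, miss = None, 0       # currently confirmed gesture, if any
    for idx, win in enumerate(windows):
        label = predict_window(win)
        run = run + 1 if label == prev else 1
        prev = label
        if label != no_gesture and run == n and label != active:
            active, miss = label, 0
            results.append((label, idx - n + 1))   # gesture L_c confirmed
        elif active is not None and label != active:
            miss += 1
            if miss >= m:                          # L_c judged ended
                active, miss = None, 0
        else:
            miss = 0
    return results
```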
The proposed method estimates the gesture occurrence region from the hand position, which greatly narrows the gesture recognition range, removes a large amount of the background interference present at long range, and focuses on gesture features, improving the accuracy of remote gesture recognition. Key frame sampling reduces the influence of gesture duration and speed while cutting the model's computation, preserving both recognition accuracy and speed. The designed gesture recognition model based on a 3D convolutional neural network makes comprehensive use of multi-modal data for recognition, with stronger interference resistance and higher accuracy.
The above is only a preferred embodiment of the present invention. It should be understood that any modifications, equivalents, and the like made within the scope of the basic principles of the present invention are included in its protection scope.

Claims (10)

1. A remote gesture recognition method, comprising the following steps:
obtaining the hand position h_1 in frame p_1 of a target video, and calculating the gesture estimation region q_1 of frame p_1 based on the hand position h_1;
obtaining the hand position h_i in frame p_i of the target video; when the hand position h_i falls within the gesture estimation region q_j of frame p_{i-1}, taking the gesture estimation region q_j as that of frame p_i; otherwise calculating the gesture estimation region q_{j+1} of frame p_i based on the hand position h_i;
segmenting the target video based on the gesture estimation regions q_j to obtain a plurality of video streams s_t;
performing gesture recognition on each video stream s_t to obtain a gesture recognition result of the target video.
2. The method of claim 1, wherein obtaining the hand position h_1 in frame p_1 of the target video comprises:
performing supervised training of a YOLO V4 Tiny detection model on a hand position training set to obtain a hand detector;
inputting the image of frame p_1 into the hand detector to obtain the hand position h_1.
3. The method of claim 1, wherein calculating the gesture estimation region q_1 of frame p_1 based on the hand position h_1 comprises: taking a rectangular region centered on the hand position h_1 and extended outward to r_w times the hand width and r_h times the hand height.
4. The method of claim 1, wherein segmenting the target video based on the gesture estimation regions q_j to obtain the video streams s_t comprises:
acquiring key frames from the frames p_i;
composing the key frames having the same gesture estimation region q_j into one video stream s_t.
5. The method of claim 4, wherein acquiring key frames from the frames p_i comprises:
for frames p_i and p_{i-1} having the same gesture estimation region q_j, converting their gesture estimation regions q_j into grayscale images F_cur and F_pre respectively;
calculating the frame difference map of the grayscale images F_cur and F_pre;
converting the frame difference map into a binary map based on a set pixel value threshold;
counting, based on the binary map, the number of pixels within the gesture estimation region q_j that exceed the pixel value threshold;
calculating the proportion of that number of pixels to the total pixels of the gesture estimation region q_j, and determining from the proportion whether frame p_i is a key frame.
6. The method of claim 1, wherein performing gesture recognition on each video stream s_t to obtain the gesture recognition result of the target video comprises:
obtaining a plurality of windows of the video stream s_t using a sliding window;
inputting each window's video stream into a multi-modal gesture recognition model based on the 3D ResNeXt-101 convolutional neural network to predict the gesture category of the window, wherein after each ResNeXt residual module of the gesture recognition model, feature maps from the different modality video streams undergo weighted fusion;
when the gesture categories of n consecutive windows are all predicted to be gesture category L_c, taking gesture category L_c as one prediction result of the video stream s_t;
aggregating the prediction results of the video streams s_t to obtain the gesture recognition result of the target video.
7. The method of claim 6, wherein the different modality video streams include: an RGB video stream and a depth video stream.
8. The method of claim 6, wherein when the gesture categories of m consecutive windows are predicted not to be gesture category L_c, the gesture of category L_c is judged to have ended.
9. A storage medium having a computer program stored thereon, wherein the computer program is arranged to, when run, perform the method of any of claims 1-8.
10. An electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the method according to any of claims 1-8.
CN202210225062.7A 2022-03-09 2022-03-09 Remote gesture recognition method and device Pending CN114613006A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210225062.7A CN114613006A (en) 2022-03-09 2022-03-09 Remote gesture recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210225062.7A CN114613006A (en) 2022-03-09 2022-03-09 Remote gesture recognition method and device

Publications (1)

Publication Number Publication Date
CN114613006A true CN114613006A (en) 2022-06-10

Family

ID=81861159

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210225062.7A Pending CN114613006A (en) 2022-03-09 2022-03-09 Remote gesture recognition method and device

Country Status (1)

Country Link
CN (1) CN114613006A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117111530A (en) * 2023-09-27 2023-11-24 浙江加力仓储设备股份有限公司 Intelligent control system and method for carrier through gestures
CN117111530B (en) * 2023-09-27 2024-05-03 浙江加力仓储设备股份有限公司 Intelligent control system and method for carrier through gestures
CN117523669A (en) * 2023-11-17 2024-02-06 中国科学院自动化研究所 Gesture recognition method, gesture recognition device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
WO2019023921A1 (en) Gesture recognition method, apparatus, and device
CN109035304B (en) Target tracking method, medium, computing device and apparatus
US9020195B2 (en) Object tracking device, object tracking method, and control program
EP2959454B1 (en) Method, system and software module for foreground extraction
US11443454B2 (en) Method for estimating the pose of a camera in the frame of reference of a three-dimensional scene, device, augmented reality system and computer program therefor
CN109145708B (en) Pedestrian flow statistical method based on RGB and D information fusion
CN107452015B (en) Target tracking system with re-detection mechanism
CN109727275B (en) Object detection method, device, system and computer readable storage medium
US9536321B2 (en) Apparatus and method for foreground object segmentation
CN114613006A (en) Remote gesture recognition method and device
JP4682820B2 (en) Object tracking device, object tracking method, and program
CN104966304A (en) Kalman filtering and nonparametric background model-based multi-target detection tracking method
CN107169503B (en) Indoor scene classification method and device
JP6924064B2 (en) Image processing device and its control method, and image pickup device
WO2021147055A1 (en) Systems and methods for video anomaly detection using multi-scale image frame prediction network
CN113869258A (en) Traffic incident detection method and device, electronic equipment and readable storage medium
KR100348357B1 (en) An Effective Object Tracking Method of Apparatus for Interactive Hyperlink Video
JP4918615B2 (en) Object number detection device and object number detection method
KR102584708B1 (en) System and Method for Crowd Risk Management by Supporting Under and Over Crowded Environments
CN116363753A (en) Tumble detection method and device based on motion history image and electronic equipment
Chuang et al. Moving object segmentation and tracking using active contour and color classification models
JP4674920B2 (en) Object number detection device and object number detection method
CN111583341B (en) Cloud deck camera shift detection method
Liu et al. A scalable automated system to measure user experience on smart devices
JP2012242947A (en) Method for measuring number of passing objects, number-of-passing object measuring device, and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination