CN110889387A - Real-time dynamic gesture recognition method based on multi-track matching

Real-time dynamic gesture recognition method based on multi-track matching

Info

Publication number
CN110889387A
CN110889387A (application CN201911215465.8A)
Authority
CN
China
Prior art keywords
gesture recognition
real-time
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911215465.8A
Other languages
Chinese (zh)
Inventor
简琤峰 (Jian Chengfeng)
李俊杰 (Li Junjie)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN201911215465.8A priority Critical patent/CN110889387A/en
Publication of CN110889387A publication Critical patent/CN110889387A/en
Legal status: Pending

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/56 Extraction of image or video features relating to colour
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a real-time dynamic gesture recognition method based on multi-track matching. A positive sample containing all fingertip points is obtained by a convolutional neural network based on the FAST corner detection algorithm; the positive sample of the obtained fingertip points is clustered together with another unprocessed video image to obtain the minimum point set in one frame; the fingertips in two frames are matched by a global nearest neighbor matching algorithm; and finally an LSTM neural network carries out multi-track classification and dynamically recognizes the gesture. By combining CNN with the FAST corner detection algorithm, the invention can quickly detect the positions of the fingertip points, and by fusing the asymmetric point set matching algorithm with LSTM it classifies dynamic gestures with high robustness; the gesture recognition efficiency is high and the recognition effect is good.

Description

Real-time dynamic gesture recognition method based on multi-track matching
Technical Field
The present invention relates to the technical field of computing, calculating and counting, and in particular to a real-time dynamic gesture recognition method based on multi-track matching in the fields of human-computer interaction and computer vision.
Background
In the field of human-computer interaction, gesture interaction is one of the most common and important interaction forms, and gesture recognition is a crucial link within it, directly determining the accuracy and robustness of gesture interaction.
Currently, from the perspective of whether it carries a temporal meaning, gesture recognition can be divided into static gesture recognition and dynamic gesture recognition: static gesture recognition aims at classifying the gesture in a single frame through its morphological characteristics, while dynamic gesture recognition classifies gesture actions over a time sequence. In practical applications, gestures almost always carry a temporal meaning, such as rotating and grabbing, so dynamic gesture recognition has more practical significance.
A monocular camera or a sensor can be used for information acquisition in dynamic gesture recognition. Common sensors for acquiring gesture information include Kinect, Leap Motion and the like; although such sensors can acquire rich features such as depth information and joint information of the hand, they are generally expensive, computationally heavy and unsuitable for mobile terminals. Gesture recognition using only a monocular camera avoids these problems well: it is low-cost, can be widely applied on various mobile platforms, and needs no support from other hardware. However, a monocular camera provides only RGB information, cannot directly segment gestures, and is easily affected by noise.
In the prior art, the method for performing dynamic gesture recognition through a monocular camera mainly includes two methods:
(1) training the frames within an interval as one group through 3D convolution so as to classify the actions within a period of time; the drawback of this method is that long-duration gestures are difficult to recognize;
(2) after segmenting the gesture through a color space, tracking the centroid of the hand and predicting the track with DTW or an HMM.
Disclosure of Invention
The invention solves the problems in the prior art and provides an optimized real-time dynamic gesture recognition method based on multi-track matching which can dynamically recognize gestures in real time with high robustness.
The invention adopts the technical scheme that a real-time dynamic gesture recognition method based on multi-track matching comprises the following steps:
step 1: acquiring a video stream;
step 2: copying a video image, and obtaining a hand region image by segmenting one of the video images;
step 3: constructing a convolutional neural network based on the FAST corner detection algorithm, and acquiring a positive sample containing all the fingertip points;
step 4: clustering based on the acquired positive sample of the fingertip points and another unprocessed video image;
step 5: matching the fingertips in the two frames through a global nearest neighbor matching algorithm;
step 6: carrying out multi-track classification by using an LSTM neural network, and dynamically recognizing the gesture.
Preferably, the step 2 comprises the steps of:
step 2.1: compressing each frame of video image of the video stream to a preset resolution;
step 2.2: converting the compressed video images from an RGB color space to a YCrCb color space in sequence;
step 2.3: and taking the Cb component and the Cr component, and segmenting to obtain a hand region image.
Preferably, the step 3 comprises the steps of:
step 3.1: acquiring the hand region image segmented in the step 2;
step 3.2: detecting all corners of the hand region image by using a FAST corner detection algorithm, and cutting the original image by taking each corner as a center to obtain a plurality of image slices;
step 3.3: and constructing a lightweight convolutional neural network, classifying the image slices, and if the classification probability is greater than or equal to 50%, determining the image slices as positive samples, otherwise, determining the image slices as negative samples.
Preferably, in step 3.2, the image slices are 32 × 32 pixels.
Preferably, in the step 3.3, the lightweight convolutional neural network comprises four groups of sub-blocks connected in sequence, each sub-block comprising a depth convolutional layer and a max pooling layer, the depth convolutional layers of the second to fourth groups of sub-blocks being 1 × 1 convolutional layers; the fourth group of sub-blocks is followed in sequence by a 1 × 1 depth convolutional layer, a global mean pooling layer and a fully-connected layer; the image slices are input through the depth convolutional layer of the first group of sub-blocks, and the classification probabilities are output from the fully-connected layer.
Preferably, the step 4 comprises the steps of:
step 4.1: traversing all the positive samples obtained in the step 3, constructing a set C,
(the two construction formulas for set C are rendered as images in the original: Figure BDA0002299382770000031, Figure BDA0002299382770000032)
wherein any positive sample r_i has the coordinates (x_i, y_i), D_1 is a distance threshold, and i and j index different positive samples;
step 4.2: constructing a set T for keeping the scores of all the elements in the set C, updating the set T,
(the update formula for set T is rendered as an image in the original: Figure BDA0002299382770000033)
wherein n is the index of the element t_n, incremented up to the length of set T;
step 4.3: reordering the elements in the set T according to a descending order, and correspondingly modifying the ordering of the elements in the set C;
step 4.4: let the minimum set of points after clustering that contains all fingertips be set R,
(the definition of set R is rendered as images in the original: Figure BDA0002299382770000034, Figure BDA0002299382770000035)
preferably, the step 5 comprises the steps of:
step 5.1: constructing two frame imagesTwo minimum point sets A, B and distance matrices D, D obtained after step 4i,j=||Ai,Bj||2Wherein i is less than or equal to the number of elements in the point set A and is greater than or equal to 0, and j is less than or equal to the number of elements in the point set B and is greater than or equal to 0; taking col and row as the row number and column number of D respectively;
step 5.2: if col > row, transpose D (the transposition formula is rendered as an image in the original: Figure BDA0002299382770000041) and carry out the next step; otherwise, directly carry out the next step;
step 5.3: constructing sets seq, Dis, gloseq and glodis for storing temporary variables;
step 5.4: starting the search from D_{i,j}, add i to seq, let Dis_i = Dis_i + D_{i,j}, and add seq to gloseq;
if the condition (rendered as an image in the original: Figure BDA0002299382770000042) is satisfied, add Dis to glodis; otherwise, let i = i + 1;
step 5.5: let j = j + 1 and Dis_i = Dis_i - D_{i,j}; if j = row - 1, carry out the next step, otherwise return to step 5.4;
step 5.6: obtain the minimum element glodis_i in glodis, and take glodis_i as the optimal solution for point matching.
Preferably, in the step 6, after the matching solution between two frames is obtained, the direction angles of the corresponding points of the two frames are calculated and input as one unit of the LSTM sequence.
Preferably, the direction angle is encoded by mapping it from (0, 360°] to an integer from 1 to 12.
Preferably, if the corresponding points of the two frames cannot be matched, the point pair which cannot be matched is marked as "-1".
The invention relates to an optimized real-time dynamic gesture recognition method based on multi-track matching, which comprises the steps of obtaining video streams, copying video images, obtaining hand region images by dividing one video image, constructing a convolutional neural network based on a FAST corner detection algorithm, obtaining a positive sample containing all fingertip points, clustering the positive sample based on the obtained fingertip points and the other unprocessed video image to obtain a minimum point set in one frame, matching fingertips in two frames by a global nearest neighbor point matching algorithm, and finally performing multi-track classification by using an LSTM neural network to dynamically recognize gestures.
The method combines CNN with the FAST corner detection algorithm to quickly detect the positions of the fingertip points, and at the same time fuses the asymmetric point set matching algorithm with LSTM, thereby classifying dynamic gestures with high robustness; the gesture recognition efficiency is high and the recognition effect is good.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a schematic diagram of a lightweight convolutional neural network according to the present invention;
FIG. 3 is a schematic diagram of clustering the positive samples of the lightweight convolutional neural network according to the present invention, where the dots at the fingertips are the clustered positive sample points;
FIG. 4 is a schematic diagram of matching fingertip points between different frames, where the positions indicated by the arrows are the matched positions.
Detailed Description
The present invention is described in further detail with reference to the following examples, but the scope of the present invention is not limited thereto.
The invention relates to a real-time dynamic gesture recognition method based on multi-track matching, which combines the FAST corner detection algorithm with an improved deep CNN, uses a new global nearest neighbor matching algorithm to match the fingertips in two frames of images and form tracks from them, and finally uses LSTM to classify the matched tracks to obtain the dynamic gesture recognition result.
The method comprises the following steps.
Step 1: a video stream is acquired.
Step 2: and copying the video image, and segmenting the video image to obtain a hand area image.
The step 2 comprises the following steps:
step 2.1: compressing each frame of video image of the video stream to a preset resolution;
step 2.2: converting the compressed video images from an RGB color space to a YCrCb color space in sequence;
step 2.3: and taking the Cb component and the Cr component, and segmenting to obtain a hand region image.
In the invention, human skin color clusters strongly in the Cb component and the Cr component of YCrCb, so non-skin-color regions can be effectively removed by using the Cb component and the Cr component.
In the present invention, specifically, the Cb component is a blue chrominance component and the Cr component is a red chrominance component.
In the invention, before processing, the resolution of the image can be compressed to 480 × 640, and the effectiveness of segmentation is ensured.
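As a non-limiting illustration of step 2, the following Python sketch (using OpenCV) performs the compression, color-space conversion and Cb/Cr segmentation described above; the skin-color thresholds (Cr in 133-173, Cb in 77-127) are commonly used default values assumed here for illustration, not values specified by the invention.

import cv2
import numpy as np

def segment_hand(frame_bgr):
    # step 2.1: compress the frame to a preset resolution (480 x 640 assumed)
    frame = cv2.resize(frame_bgr, (480, 640))
    # step 2.2: convert from the RGB (BGR in OpenCV) color space to YCrCb
    ycrcb = cv2.cvtColor(frame, cv2.COLOR_BGR2YCrCb)
    # step 2.3: take the Cr and Cb components and keep only the skin-color range
    _, cr, cb = cv2.split(ycrcb)
    mask = ((cr >= 133) & (cr <= 173) & (cb >= 77) & (cb <= 127)).astype(np.uint8) * 255
    # apply the mask to obtain the hand region image
    return cv2.bitwise_and(frame, frame, mask=mask), mask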
Step 3: a convolutional neural network based on the FAST corner detection algorithm is constructed, and a positive sample containing all the fingertip points is acquired.
The step 3 comprises the following steps:
step 3.1: acquiring the hand region image segmented in the step 2;
step 3.2: detecting all corners of the hand region image by using a FAST corner detection algorithm, and cutting the original image by taking each corner as a center to obtain a plurality of image slices;
in said step 3.2, the image slice is 32 x 32 pixels.
Step 3.3: and constructing a lightweight convolutional neural network, classifying the image slices, and if the classification probability is greater than or equal to 50%, determining the image slices as positive samples, otherwise, determining the image slices as negative samples.
In the step 3.3, the lightweight convolutional neural network comprises four groups of sub-blocks connected in sequence, each sub-block comprising a depth convolutional layer and a max pooling layer, the depth convolutional layers of the second to fourth groups of sub-blocks being 1 × 1 convolutional layers; the fourth group of sub-blocks is followed in sequence by a 1 × 1 depth convolutional layer, a global mean pooling layer and a fully-connected layer; the image slices are input through the depth convolutional layer of the first group of sub-blocks, and the classification probabilities are output from the fully-connected layer.
In the invention, a corner is a category of feature points; the FAST corner detection algorithm is prior art in the field, and a person skilled in the art can detect corners as required.
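As a non-limiting illustration of step 3.2, the sketch below detects FAST corners with OpenCV and crops a 32 x 32 slice centred on each corner; the detector threshold of 20 is an assumed value, and corners whose slice would cross the image border are simply skipped here.

import cv2

def corner_slices(hand_img, size=32):
    # detect all corners of the hand region image with the FAST algorithm
    fast = cv2.FastFeatureDetector_create(threshold=20)
    keypoints = fast.detect(hand_img, None)
    half = size // 2
    h, w = hand_img.shape[:2]
    slices, centers = [], []
    for kp in keypoints:
        x, y = int(kp.pt[0]), int(kp.pt[1])
        # cut the original image with the corner as the centre of the slice
        if half <= x < w - half and half <= y < h - half:
            slices.append(hand_img[y - half:y + half, x - half:x + half])
            centers.append((x, y))
    return slices, centers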
In the present invention, the positive samples are all candidate points that may be fingertips, and the negative samples are points that are not fingertips; the positive samples obtained in step 3 include the fingertips and interference points near the fingertips, which need to be excluded in step 4.
In the invention, the lightweight convolutional neural network is constructed on the assumption that the cross-channel correlation mapping and the spatial correlation mapping in a feature map can be completely decoupled, which greatly reduces the number of parameters of the network. To reduce the parameters further, depthwise separable convolution is not used directly; instead, a max pooling layer is first added after the depth convolution layer, and the number of channels is then expanded with a 1 × 1 convolution. In addition, multiple fully-connected layers are not adopted; a global mean pooling layer followed by a single fully-connected layer is used instead, which weakens overfitting of the network and reduces the parameter count of the fully-connected layer.
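The PyTorch sketch below shows one plausible reading of this lightweight network; the channel counts, kernel sizes and activation functions are assumptions, since the text specifies only the block structure (a depth convolution plus a max pooling layer per sub-block, 1 x 1 convolutions for channel expansion, then a final 1 x 1 convolutional layer, global mean pooling and a single fully-connected layer).

import torch
import torch.nn as nn

def sub_block(in_ch, out_ch):
    # depthwise convolution -> max pooling -> 1 x 1 convolution (channel expansion)
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
        nn.Conv2d(in_ch, out_ch, kernel_size=1),
        nn.ReLU(inplace=True),
    )

class FingertipNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            sub_block(3, 16), sub_block(16, 32), sub_block(32, 64), sub_block(64, 128),
            nn.Conv2d(128, 128, kernel_size=1),  # final 1 x 1 convolutional layer
            nn.AdaptiveAvgPool2d(1),             # global mean pooling layer
        )
        self.fc = nn.Linear(128, 2)              # single fully-connected layer

    def forward(self, x):                        # x: (N, 3, 32, 32) image slices
        f = self.features(x).flatten(1)
        return torch.softmax(self.fc(f), dim=1)  # classification probabilities

A slice is then taken as a positive sample when its positive-class probability is greater than or equal to 50%, as in step 3.3.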
Step 4: clustering is performed based on the acquired positive sample of fingertip points and another unprocessed video image.
The step 4 comprises the following steps:
step 4.1: traversing all the positive samples obtained in the step 3, constructing a set C,
(the two construction formulas for set C are rendered as images in the original: Figure BDA0002299382770000071, Figure BDA0002299382770000072)
wherein any positive sample r_i has the coordinates (x_i, y_i), D_1 is a distance threshold, and i and j index different positive samples;
step 4.2: constructing a set T for keeping the scores of all the elements in the set C, updating the set T,
(the update formula for set T is rendered as an image in the original: Figure BDA0002299382770000073)
wherein n is the index of the element t_n, incremented up to the length of set T;
step 4.3: reordering the elements in the set T according to a descending order, and correspondingly modifying the ordering of the elements in the set C;
step 4.4: let the minimum set of points after clustering that contains all fingertips be set R,
(the definition of set R is rendered as images in the original: Figure BDA0002299382770000074, Figure BDA0002299382770000075)
in the invention, because the corners obtained by the FAST corner detection algorithm are gathered at the fingertip points and the areas nearby the fingertip points, a plurality of fingertip points can be detected, and the CNN cannot eliminate redundant points, the points need to be combined by the clustering algorithm.
In the present invention, the formula of step 4.1 means: when the condition is satisfied, an element in the set is replaced, and when the condition is not satisfied, the element is added to the set. At the outset the set is an empty set, i.e. {c_j} is empty; after the first positive sample point is traversed, the c_j satisfying the condition is replaced by r_i; when subsequent points are traversed, whether a new point is added to the set is judged according to the condition. The formula of step 4.2 works in the same way.
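A Python sketch of one plausible reading of this clustering step follows; because the exact formulas of steps 4.1-4.2 are rendered as images in the original, the score kept in set T is reduced here to a simple hit counter, and the distance threshold D1 = 20 pixels is an assumed value.

import numpy as np

def cluster_fingertips(points, d1=20.0):
    reps, scores = [], []              # set C of representatives and its score set T
    for p in points:                   # step 4.1: traverse all positive samples
        p = np.asarray(p, dtype=float)
        if reps:
            dists = [np.linalg.norm(p - r) for r in reps]
            k = int(np.argmin(dists))
            if dists[k] < d1:          # condition satisfied: replace the element
                reps[k] = p
                scores[k] += 1         # step 4.2: update the score of the element
                continue
        reps.append(p)                 # condition not satisfied: add a new element
        scores.append(1)
    # step 4.3: reorder by descending score; the result approximates the minimum set R
    order = np.argsort(scores)[::-1]
    return [reps[i] for i in order]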
Step 5: the fingertips in the two frames are matched by the global nearest neighbor matching algorithm.
The step 5 comprises the following steps:
step 5.1: constructing two minimum point sets A, B and distance matrixes D, D obtained after step 4 of two-frame imagei,j=||Ai,Bj||2Wherein i is less than or equal to the number of elements in the point set A and is greater than or equal to 0, and j is less than or equal to the number of elements in the point set B and is greater than or equal to 0; taking col and row as the row number and column number of D respectively;
step 5.2: if col > row, transpose D (the transposition formula is rendered as an image in the original: Figure BDA0002299382770000081) and carry out the next step; otherwise, directly carry out the next step;
step 5.3: constructing sets seq, Dis, gloseq and glodis for storing temporary variables;
step 5.4: starting the search from D_{i,j}, add i to seq, let Dis_i = Dis_i + D_{i,j}, and add seq to gloseq;
if the condition (rendered as an image in the original: Figure BDA0002299382770000082) is satisfied, add Dis to glodis; otherwise, let i = i + 1;
step 5.5: let j = j + 1 and Dis_i = Dis_i - D_{i,j}; if j = row - 1, carry out the next step, otherwise return to step 5.4;
step 5.6: obtain the minimum element glodis_i in glodis, and take glodis_i as the optimal solution for point matching.
In the invention, step 4 obtains the minimum point set in one frame, and step 5 matches the minimum point sets of two frames. The main problem in combining multiple points into multiple tracks is the asymmetric matching between two point sets, i.e. the lengths of the two point sets are not equal. If the fingertips in two independent frames are to be matched into tracks, it must be considered that fingertips cannot be recognized with 100% reliability and misrecognition is inevitable; the global nearest neighbor matching algorithm is adopted to solve this problem.
In the present invention, D holds the Euclidean distances between all points of the point sets A and B.
In the present invention, step 5.2 means to ensure that the number of columns of D is necessarily greater than the number of rows, otherwise, the transposition is performed.
In the present invention, starting the search from D_{i,j} means starting the search from i = 0 and j = 0, and the corresponding subscript is incremented by one each time a cycle is completed.
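Because the branch conditions of steps 5.4-5.5 are rendered as images in the original, the sketch below reaches the same goal, globally optimal matching between two point sets of unequal length, by exhaustively enumerating every injective assignment of the smaller set into the larger one and keeping the assignment with minimum total Euclidean distance; this brute force is affordable because a hand has at most five fingertips.

from itertools import permutations
import numpy as np

def match_fingertips(set_a, set_b):
    a, b = np.asarray(set_a, dtype=float), np.asarray(set_b, dtype=float)
    if len(a) > len(b):                # step 5.2: ensure cols >= rows (transpose)
        pairs = match_fingertips(set_b, set_a)
        return [(j, i) for i, j in pairs]
    # distance matrix D with D[i, j] = ||A_i - B_j||_2
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    best_cost, best = float("inf"), None
    for perm in permutations(range(len(b)), len(a)):
        cost = sum(d[i, j] for i, j in enumerate(perm))
        if cost < best_cost:           # keep the globally minimal total distance
            best_cost, best = cost, list(enumerate(perm))
    return best                        # list of (index in A, index in B) pairs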
Step 6: and carrying out multi-track classification by using an LSTM neural network, and dynamically identifying the gesture.
In step 6, after the matching solution between two frames is obtained, the direction angles of the corresponding points of the two frames are calculated and input as one unit of the LSTM sequence.
The direction angle is encoded by mapping it from (0, 360°] to an integer from 1 to 12.
If the corresponding points of the two frames cannot be matched, the point pairs that cannot be matched are marked as '-1'.
In the invention, after the matching solution between two frames is obtained, the direction angles of the corresponding points of the two frames are calculated to represent the input data of the LSTM; however, because the fingertips between two frames are not matched completely, the track is discontinuous. To enable the LSTM to recognize a dynamic gesture with a sparse track, the point pairs that cannot be matched are marked as '-1', and the azimuth angle of each matchable point pair is calculated and input as one unit of the LSTM sequence.
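As an illustration of this encoding, the sketch below turns one matched point pair into one unit of the LSTM input sequence; the 30-degree binning (so that (0, 360°] maps onto the integers 1 to 12) and the angle convention are assumptions made for this sketch.

import math

def encode_direction(p_prev, p_curr):
    if p_prev is None or p_curr is None:
        return -1                           # unmatched point pair is marked as -1
    dx, dy = p_curr[0] - p_prev[0], p_curr[1] - p_prev[1]
    angle = math.degrees(math.atan2(dy, dx)) % 360.0
    if angle == 0.0:
        angle = 360.0                       # keep the angle inside (0, 360]
    return math.ceil(angle / 30.0)          # integer from 1 to 12

A sequence of such integers per track can then be fed to a standard LSTM classifier (for example torch.nn.LSTM followed by a linear layer) for the multi-track classification of step 6.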
The method comprises the steps of obtaining video streams, copying video images, obtaining hand region images by dividing one video image, constructing a convolutional neural network based on a FAST corner detection algorithm, obtaining a positive sample containing all fingertip points, clustering the obtained positive sample of the fingertip points and another unprocessed video image to obtain a minimum point set in one frame, matching fingertips in two frames by a global nearest neighbor point matching algorithm, and finally performing multi-track classification by using an LSTM neural network to dynamically identify gestures.
The method combines CNN with the FAST corner detection algorithm to quickly detect the positions of the fingertip points, and at the same time fuses the asymmetric point set matching algorithm with LSTM, thereby classifying dynamic gestures with high robustness; the gesture recognition efficiency is high and the recognition effect is good.

Claims (10)

1. A real-time dynamic gesture recognition method based on multi-track matching, characterized by comprising the following steps:
step 1: acquiring a video stream;
step 2: copying a video image, and obtaining a hand region image by segmenting one of the video images;
step 3: constructing a convolutional neural network based on the FAST corner detection algorithm, and acquiring a positive sample containing all the fingertip points;
step 4: clustering based on the acquired positive sample of the fingertip points and another unprocessed video image;
step 5: matching the fingertips in the two frames through a global nearest neighbor matching algorithm;
step 6: carrying out multi-track classification by using an LSTM neural network, and dynamically recognizing the gesture.
2. The real-time dynamic gesture recognition method based on multi-track matching according to claim 1, characterized in that: the step 2 comprises the following steps:
step 2.1: compressing each frame of video image of the video stream to a preset resolution;
step 2.2: converting the compressed video images from an RGB color space to a YCrCb color space in sequence;
step 2.3: and taking the Cb component and the Cr component, and segmenting to obtain a hand region image.
3. The real-time dynamic gesture recognition method based on multi-track matching according to claim 1, characterized in that: the step 3 comprises the following steps:
step 3.1: acquiring the hand region image segmented in the step 2;
step 3.2: detecting all corners of the hand region image by using a FAST corner detection algorithm, and cutting the original image by taking each corner as a center to obtain a plurality of image slices;
step 3.3: and constructing a lightweight convolutional neural network, classifying the image slices, and if the classification probability is greater than or equal to 50%, determining the image slices as positive samples, otherwise, determining the image slices as negative samples.
4. The real-time dynamic gesture recognition method based on multi-track matching according to claim 3, characterized in that: in said step 3.2, the image slice is 32 x 32 pixels.
5. The real-time dynamic gesture recognition method based on multi-track matching according to claim 3, characterized in that: in the step 3.3, the lightweight convolutional neural network comprises four groups of sub-blocks connected in sequence, each sub-block comprising a depth convolutional layer and a max pooling layer, the depth convolutional layers of the second to fourth groups of sub-blocks being 1 × 1 convolutional layers; the fourth group of sub-blocks is followed in sequence by a 1 × 1 depth convolutional layer, a global mean pooling layer and a fully-connected layer; the image slices are input through the depth convolutional layer of the first group of sub-blocks, and the classification probabilities are output from the fully-connected layer.
6. The real-time dynamic gesture recognition method based on multi-track matching according to claim 1, characterized in that: the step 4 comprises the following steps:
step 4.1: traversing all the positive samples obtained in the step 3, constructing a set C,
(the construction formula for set C is rendered as an image in the original: Figure FDA0002299382760000021)
wherein any positive sample r_i has the coordinates (x_i, y_i), D_1 is a distance threshold, and i and j index different positive samples;
step 4.2: constructing a set T for keeping the scores of all the elements in the set C, updating the set T,
(the update formula for set T is rendered as an image in the original: Figure FDA0002299382760000022)
wherein n is the index of the element t_n, incremented up to the length of set T;
step 4.3: reordering the elements in the set T according to a descending order, and correspondingly modifying the ordering of the elements in the set C;
step 4.4: let the minimum set of points after clustering that contains all fingertips be set R,
(the definition of set R is rendered as images in the original: Figure FDA0002299382760000031, Figure FDA0002299382760000032)
7. the real-time dynamic gesture recognition method based on multi-track matching according to claim 1, characterized in that: the step 5 comprises the following steps:
step 5.1: constructing two minimum point sets A, B and distance matrixes D, D obtained after step 4 of two-frame imagei,j=||Ai,Bj||2Wherein i is less than or equal to the number of elements in the point set A and is greater than or equal to 0, and j is less than or equal to the number of elements in the point set B and is greater than or equal to 0; taking col and row as the row number and column number of D respectively;
step 5.2: if col > row, transpose D (the transposition formula is rendered as an image in the original: Figure FDA0002299382760000033) and carry out the next step; otherwise, directly carry out the next step;
step 5.3: constructing sets seq, Dis, gloseq and glodis for storing temporary variables;
step 5.4: starting the search from D_{i,j}, add i to seq, let Dis_i = Dis_i + D_{i,j}, and add seq to gloseq;
if the condition (rendered as an image in the original: Figure FDA0002299382760000034) is satisfied, add Dis to glodis; otherwise, let i = i + 1;
step 5.5: let j = j + 1 and Dis_i = Dis_i - D_{i,j}; if j = row - 1, carry out the next step, otherwise return to step 5.4;
step 5.6: obtain the minimum element glodis_i in glodis, and take glodis_i as the optimal solution for point matching.
8. The real-time dynamic gesture recognition method based on multi-track matching according to claim 1, characterized in that: in the step 6, after the matching solution between two frames is obtained, the direction angles of the corresponding points of the two frames are calculated and input as one unit of the LSTM sequence.
9. The real-time dynamic gesture recognition method based on multi-track matching according to claim 8, characterized in that: the direction angle is encoded by mapping it from (0, 360°] to an integer from 1 to 12.
10. The real-time dynamic gesture recognition method based on multi-track matching according to claim 8, characterized in that: if the corresponding points of the two frames cannot be matched, the point pairs that cannot be matched are marked as '-1'.
CN201911215465.8A 2019-12-02 2019-12-02 Real-time dynamic gesture recognition method based on multi-track matching Pending CN110889387A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911215465.8A CN110889387A (en) 2019-12-02 2019-12-02 Real-time dynamic gesture recognition method based on multi-track matching

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911215465.8A CN110889387A (en) 2019-12-02 2019-12-02 Real-time dynamic gesture recognition method based on multi-track matching

Publications (1)

Publication Number Publication Date
CN110889387A 2020-03-17

Family

ID=69749985

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911215465.8A Pending CN110889387A (en) 2019-12-02 2019-12-02 Real-time dynamic gesture recognition method based on multi-track matching

Country Status (1)

Country Link
CN (1) CN110889387A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111460933A (en) * 2020-03-18 2020-07-28 哈尔滨拓博科技有限公司 Method for real-time recognition of continuous handwritten pattern
CN111680618A (en) * 2020-06-04 2020-09-18 西安邮电大学 Dynamic gesture recognition method based on video data characteristics, storage medium and device
CN111797709A (en) * 2020-06-14 2020-10-20 浙江工业大学 Real-time dynamic gesture track recognition method based on regression detection
CN112052724A (en) * 2020-07-23 2020-12-08 深圳市玩瞳科技有限公司 Finger tip positioning method and device based on deep convolutional neural network
CN112985415A (en) * 2021-04-15 2021-06-18 武汉光谷信息技术股份有限公司 Indoor positioning method and system
CN113420752A (en) * 2021-06-23 2021-09-21 湖南大学 Three-finger gesture generation method and system based on grabbing point detection
WO2021227933A1 (en) * 2020-05-14 2021-11-18 索尼集团公司 Image processing apparatus, image processing method, and computer-readable storage medium


Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5454043A (en) * 1993-07-30 1995-09-26 Mitsubishi Electric Research Laboratories, Inc. Dynamic and static hand gesture recognition through low-level image analysis
US20120113241A1 (en) * 2010-11-09 2012-05-10 Qualcomm Incorporated Fingertip tracking for touchless user interface
CN102622601A (en) * 2012-03-12 2012-08-01 李博男 Fingertip detection method
CN202815864U (en) * 2012-03-12 2013-03-20 李博男 Gesture identification system
US20140010441A1 (en) * 2012-07-09 2014-01-09 Qualcomm Incorporated Unsupervised movement detection and gesture recognition
CN103984928A (en) * 2014-05-20 2014-08-13 桂林电子科技大学 Finger gesture recognition method based on field depth image
CN107180226A (en) * 2017-04-28 2017-09-19 华南理工大学 A kind of dynamic gesture identification method based on combination neural net
CN107977604A (en) * 2017-11-06 2018-05-01 浙江工业大学 A kind of hand detection method based on improvement converging channels feature
CN108171133A (en) * 2017-12-20 2018-06-15 华南理工大学 A kind of dynamic gesture identification method of feature based covariance matrix
CN109614922A (en) * 2018-12-07 2019-04-12 南京富士通南大软件技术有限公司 A kind of dynamic static gesture identification method and system
CN110443128A (en) * 2019-06-28 2019-11-12 广州中国科学院先进技术研究所 One kind being based on SURF characteristic point accurately matched finger vein identification method
CN110458059A (en) * 2019-07-30 2019-11-15 北京科技大学 A kind of gesture identification method based on computer vision and identification device

Non-Patent Citations (15)

* Cited by examiner, † Cited by third party
Title
CHEN TIANDING: "A solution of computer vision based real-time hand pointing recognition", 2008 27th Chinese Control Conference, 22 August 2008, pages 384-388 *
CHENGFENG JIAN ET AL.: "Mobile terminal gesture recognition based on improved FAST corner detection", IET Image Processing, vol. 13, no. 6, 11 April 2019, page 1 *
CHENGFENG JIAN ET AL.: "Mobile terminal trajectory recognition based on improved LSTM model", IET Image Processing, vol. 13, no. 11, 1 August 2019, pages 1914-1921, XP006082437, DOI: 10.1049/iet-ipr.2019.0183 *
CHENGFENG JIAN ET AL.: "Real-time multi-trajectory matching for dynamic hand gesture recognition", IET Image Processing, vol. 14, no. 2, 28 November 2019, pages 3-5 *
HO-SUB YOON ET AL.: "Hand gesture recognition using combined features of location, angle and velocity", Pattern Recognition, vol. 34, no. 7, 7 June 2001, pages 1491-1501, XP004362560, DOI: 10.1016/S0031-3203(00)00096-0 *
NASSER H. DARDAS ET AL.: "Real-Time Hand Gesture Detection and Recognition Using Bag-of-Features and Support Vector Machine Techniques", IEEE Transactions on Instrumentation and Measurement, vol. 60, no. 11, 15 August 2011, pages 3592-3607, XP011384965, DOI: 10.1109/TIM.2011.2161140 *
XIANG GAO ET AL.: "RGBD finger detection based on the 3D K-curvature", 2017 2nd International Conference on Advanced Robotics and Mechatronics (ICARM), 1 February 2018, pages 120-125 *
JIANG XIAOHENG: "Real-time fingertip detection system based on convex-hull analysis" (in Chinese), China Master's Theses Full-text Database, Information Science and Technology, 24 April 2014, pages 138-2563 *
ZHANG MEIYU ET AL.: "A hand detection method based on improved ACF features" (in Chinese), Journal of Chinese Computer Systems, no. 7, July 2018, pages 1574-1578 *
ZHANG MEIYU ET AL.: "A fast gesture segmentation optimization method for mobile terminals" (in Chinese), Journal of Chinese Computer Systems, no. 6, June 2019, pages 1346-1349 *
ZHU ZHENGTAO: "Application of 3D multi-fingertip detection and recognition technology in human-computer interaction" (in Chinese), China Master's Theses Full-text Database, Information Science and Technology, 25 December 2012, pages 138-817 *
PAN ZHENGRONG: "Gesture recognition and classification based on Kinect depth images" (in Chinese), Techniques of Automation and Applications, vol. 38, no. 4, April 2019, pages 143-147 *
CHEN JIAJUN: "Research on gesture recognition algorithms based on Harris corner detection and optical flow" (in Chinese), China Master's Theses Full-text Database, Information Science and Technology, vol. 2018, no. 1, 15 January 2018, pages 138-1260 *
MA JIANPING ET AL.: "Adaptive gesture recognition method for Android smartphones" (in Chinese), Journal of Chinese Computer Systems, no. 7, July 2013, pages 1703-1707 *
GAO YAPING: "Research on gesture recognition methods based on a monocular camera" (in Chinese), China Master's Theses Full-text Database, Information Science and Technology, vol. 2014, no. 8, 15 August 2014, pages 138-1330 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111460933A (en) * 2020-03-18 2020-07-28 哈尔滨拓博科技有限公司 Method for real-time recognition of continuous handwritten pattern
CN111460933B (en) * 2020-03-18 2022-08-09 哈尔滨拓博科技有限公司 Method for real-time recognition of continuous handwritten pattern
WO2021227933A1 (en) * 2020-05-14 2021-11-18 索尼集团公司 Image processing apparatus, image processing method, and computer-readable storage medium
CN111680618A (en) * 2020-06-04 2020-09-18 西安邮电大学 Dynamic gesture recognition method based on video data characteristics, storage medium and device
CN111680618B (en) * 2020-06-04 2023-04-18 西安邮电大学 Dynamic gesture recognition method based on video data characteristics, storage medium and device
CN111797709A (en) * 2020-06-14 2020-10-20 浙江工业大学 Real-time dynamic gesture track recognition method based on regression detection
CN112052724A (en) * 2020-07-23 2020-12-08 深圳市玩瞳科技有限公司 Finger tip positioning method and device based on deep convolutional neural network
CN112985415A (en) * 2021-04-15 2021-06-18 武汉光谷信息技术股份有限公司 Indoor positioning method and system
CN113420752A (en) * 2021-06-23 2021-09-21 湖南大学 Three-finger gesture generation method and system based on grabbing point detection

Similar Documents

Publication Publication Date Title
CN110889387A (en) Real-time dynamic gesture recognition method based on multi-track matching
CN106960195B (en) Crowd counting method and device based on deep learning
CN110738207B (en) Character detection method for fusing character area edge information in character image
CN109344701B (en) Kinect-based dynamic gesture recognition method
CN107808143B (en) Dynamic gesture recognition method based on computer vision
CN108304765B (en) Multi-task detection device for face key point positioning and semantic segmentation
CN110097044B (en) One-stage license plate detection and identification method based on deep learning
CN108062525B (en) Deep learning hand detection method based on hand region prediction
Zhou et al. Robust vehicle detection in aerial images using bag-of-words and orientation aware scanning
CN112784810B (en) Gesture recognition method, gesture recognition device, computer equipment and storage medium
Yan et al. Crowd counting via perspective-guided fractional-dilation convolution
CN1207924C (en) Method for testing face by image
CN109377441B (en) Tongue image acquisition method and system with privacy protection function
CN111639577A (en) Method for detecting human faces of multiple persons and recognizing expressions of multiple persons through monitoring video
CN109977834B (en) Method and device for segmenting human hand and interactive object from depth image
CN112949440A (en) Method for extracting gait features of pedestrian, gait recognition method and system
Yi et al. Human action recognition based on action relevance weighted encoding
CN111414910A (en) Small target enhancement detection method and device based on double convolutional neural network
CN109840498B (en) Real-time pedestrian detection method, neural network and target detection layer
CN109615610B (en) Medical band-aid flaw detection method based on YOLO v2-tiny
CN112183148B (en) Batch bar code positioning method and identification system
CN106022226B (en) A kind of pedestrian based on multi-direction multichannel strip structure discrimination method again
CN110490210B (en) Color texture classification method based on t sampling difference between compact channels
Zhu et al. Scene text relocation with guidance
CN110490170A (en) A kind of face candidate frame extracting method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200317