CN108537147B - Gesture recognition method based on deep learning - Google Patents
- Publication number: CN108537147B (application CN201810242638.4A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/107—Static hand or arm
- G06V40/113—Recognition of static hand signs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/017—Gesture based interaction, e.g. based on a set of recognized hand gestures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/56—Extraction of image or video features relating to colour
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10024—Color image
Abstract
The invention provides a gesture recognition method based on deep learning, comprising the following steps: training a binary convolution neural network with a gesture training set and a test set; segmenting the preprocessed original image based on the color information reflected by skin color and extracting the gesture outline; judging the gesture instruction corresponding to the gesture outline with the trained binary convolution neural network; and locating the start and stop points of the dynamic gesture corresponding to a series of gesture outlines, tracking the gesture trajectory with the TLD algorithm, correcting deviations in the tracking process with a Haar classifier, and recognizing the dynamic gesture with the HMM algorithm. The method solves the problems of conventional gesture recognition: low recognition accuracy, poor stability, poor real-time performance and a limited set of gestures.
Description
Technical Field
The invention relates to a gesture recognition method based on deep learning, and belongs to the technical field of gesture recognition.
Background
The advent of the computer has profoundly influenced social production and daily life: on one hand it has greatly improved the efficiency of information processing, and on the other it has promoted the development of intelligent living. How to interact with a computer efficiently and conveniently has therefore become a research hotspot.
With the development of information technology, human-computer interaction (HCI) has become an important part of daily life. As a new mode of human-computer interaction, gesture recognition technology has broad application prospects in many fields: (1) digital life and entertainment. For example, in 2008 Ericsson introduced the R520m smartphone, which collected the user's gesture information through its built-in camera and used it in place of the keyboard or touch screen of the phone interface, thereby controlling the alarm clock and incoming calls. (2) Scientific and technological innovation. In space exploration and military research, dangerous or special environments that are inconvenient for direct human control are often encountered, and the relevant information can be obtained by remotely controlling a robot through gestures. (3) Intelligent transportation, such as driverless driving. As early as 2010, Google publicly demonstrated its driverless cars, opening a new era of intelligent transportation.
The gesture recognition technology in the technical field of human-computer interaction can play the following roles:
(1) for users, it helps them use a product more conveniently, saves time and improves the user experience;
(2) for products, it removes the need for lengthy instruction manuals: providing guidance for a few universal gestures is enough.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: the traditional gesture recognition algorithm generally has the problems of low recognition precision, poor stability, poor real-time performance, single gesture function and the like.
In order to solve the technical problem, the technical scheme of the invention is to provide a gesture recognition method based on deep learning, which is characterized by comprising the following steps:
step 1, training a binary convolution neural network by utilizing a gesture training set and a test set;
step 2, after the original gesture image is collected, preprocessing the original gesture image to remove the influence of illumination on the original image;
step 3, segmenting the preprocessed original image based on color information by using the color information reflected by skin color, and extracting a gesture outline;
step 4, judging whether the gesture outline extracted in step 3 marks a start or stop point of a dynamic gesture; if so, the gesture outlines extracted from the subsequent series of images form a dynamic gesture, and the method proceeds to step 6; otherwise the extracted gesture outline is a static gesture, and the method proceeds to step 5;
step 5, judging a gesture instruction corresponding to the gesture outline by using the trained binary convolution neural network;
and 6, positioning start points and stop points of dynamic gestures corresponding to a series of gesture outlines, tracking gesture tracks by using a TLD algorithm, correcting deviations in the tracking process by using a Haar classifier, and recognizing the dynamic gestures by using an HMM algorithm.
Preferably, in step 2, the preprocessing includes brightness correction and light compensation;
during brightness correction, highlight areas of the original gesture image are corrected with a modified exponential transformation, darker areas are corrected with a parameterized logarithmic transformation, and other areas are left unchanged;
light compensation is based on a dynamic threshold: an algorithm based on the total reflection theory converts the original gesture image into the YCbCr color space and then takes the set of points with the largest Y components in the YCbCr image as the white reference points.
Preferably, in step 3, a skin color segmentation algorithm based on the YCbCr color space is adopted when segmenting the original image.
The method provided by the invention solves the problems of conventional gesture recognition: low recognition accuracy, poor stability, poor real-time performance and a limited set of gestures.
Due to the adoption of the technical scheme, compared with the prior art, the invention has the following advantages and positive effects:
the method improves the traditional gesture recognition algorithm based on the conventional technology, uses the improved illumination compensation strategy to enable the original image to be easier to process, uses the improved skin color model to segment the gesture so as to improve the segmentation accuracy, and uses the improved deep convolutional network to classify the static gesture so as to improve the static gesture recognition rate; the improved TLD and HMM algorithm is used for tracking and recognizing the dynamic gesture, so that the robustness, the real-time performance and the recognition rate of a gesture system are improved.
Drawings
FIG. 1 is a system architecture schematic of the design of the deep learning based gesture recognition system of the present invention;
FIG. 2 is a diagram of a binary convolutional neural network architecture of the present invention;
FIG. 3 is a TLD algorithm framework diagram;
FIG. 4 is a detailed flow chart of the TLD algorithm;
FIG. 5 is a flow chart of the modified TLD algorithm;
FIG. 6 is a flow chart of system software design.
Detailed Description
The invention will be further illustrated with reference to the following specific examples. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and such equivalents may fall within the scope of the present invention as defined in the appended claims.
The embodiment of the invention relates to a gesture recognition method based on deep learning, which comprises the following steps as shown in figure 1:
step 1, training a binary convolution neural network by utilizing a gesture training set and a test set;
step 2, after the original gesture image is collected, preprocessing the original gesture image to remove the influence of illumination on the original image;
step 3, segmenting the preprocessed original image based on color information by using the color information reflected by skin color, and extracting a gesture outline;
step 4, judging whether the gesture outline extracted in step 3 marks a start or stop point of a dynamic gesture; if so, the gesture outlines extracted from the subsequent series of images form a dynamic gesture, and the method proceeds to step 6; otherwise the extracted gesture outline is a static gesture, and the method proceeds to step 5;
step 5, judging a gesture instruction corresponding to the gesture outline by using the trained binary convolution neural network;
and 6, positioning start points and stop points of dynamic gestures corresponding to a series of gesture outlines, tracking gesture tracks by using a TLD algorithm, correcting deviations in the tracking process by using a Haar classifier, and recognizing the dynamic gestures by using an HMM algorithm.
The above steps are further described in detail with reference to the following examples:
1. The preprocessing of the original gesture image in step 2 mainly comprises luminance correction based on exponential and logarithmic transformations, and light compensation based on a dynamic threshold. Specifically:
(1) luminance correction based on exponential and logarithmic transformations.
The exponential transformation corrects only the bright areas of an image well, while the logarithmic transformation corrects only the dark areas well; combining the two yields a light compensation strategy for the human hand, as shown in formula (1): bright areas are corrected with the modified exponential transformation, dark areas with the parameterized logarithmic transformation, and other areas are left unchanged.
The parameters of formula (1) are as follows:
g(x, y) is the corrected image; f(x, y) is the original gesture image; a is the highlight adjustment coefficient, a = 0 in this embodiment; b is the average luminance of the image, b = 120/log T1 in this embodiment; c is a positive constant found by experimental adjustment, c = T2 in this embodiment; d is a positive constant found by experimental adjustment, d = 1/(255 − T2) in this embodiment; T1 is the lower light threshold, for dim lighting conditions, T1 = 115 in this embodiment; T2 is the upper light threshold, for brighter lighting conditions, T2 = 135 in this embodiment.
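As a sketch of the piecewise correction that formula (1) describes (the formula itself appears only as an image in the source, so the exact form below is an assumption pieced together from the parameter descriptions): dark pixels below T1 get the parameterized logarithmic transform, pixels above T2 get a clipped exponential transform, and the rest pass through unchanged.

```python
import numpy as np

# Hedged sketch of the piecewise brightness correction described for formula (1).
# The exact formula is an image in the source; this reading is an assumption:
# dark pixels (f < T1) get a parameterized log transform, bright pixels (f > T2)
# a corrected (clipped) exponential transform, mid-range pixels pass through.
T1, T2 = 115.0, 135.0          # lower/upper light thresholds from the text
b = 120.0 / np.log(T1)         # log-transform gain, b = 120 / log(T1)
c = T2                         # exponential-transform constant, c = T2
d = 1.0 / (255.0 - T2)         # exponential-transform rate, d = 1 / (255 - T2)

def correct_brightness(f: np.ndarray) -> np.ndarray:
    """Apply the piecewise correction to a grayscale image f in [0, 255]."""
    f = f.astype(np.float64)
    g = f.copy()                               # mid-range: unchanged
    dark = f < T1
    g[dark] = b * np.log(f[dark] + 1.0)        # log transform for dark areas
    bright = f > T2
    g[bright] = np.minimum(255.0, c * np.exp(d * (f[bright] - T2)))
    return g.astype(np.uint8)
```

With b = 120/log T1, a pixel at the dark threshold T1 maps to roughly 120, which is consistent with the stated role of b as an average-luminance target.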
(2) Light compensation based on dynamic threshold
The algorithm, based on the total reflection theory, converts the image into the YCbCr color space and then takes the set of points with the largest Y component as the white reference points. The detailed process is as follows:
assuming that the original gesture image is f (x, y) and the size is m × n, then:
step 1, firstly, converting an original gesture image f (x, y) from an RGB color space to a YCbCr color space by using a formula (2):
step 2, obtaining a reference white point
(a) Cutting the converted image into M × N blocks, where M is 3 and N is 4 in this embodiment;
(b) for each segmented block, compute the average values Mb and Mr of the Cb and Cr components in YCbCr space;
(c) using Mb and Mr, compute the mean absolute errors Db and Dr of the Cb and Cr components of each block, as in formula (3):
Db = Σ |Cb(i, j) − Mb| / sum,  Dr = Σ |Cr(i, j) − Mr| / sum    (3)
In formula (3), Cb(i, j) is the offset of the B component of each pixel relative to the luminance, Cr(i, j) is the offset of the R component relative to the luminance, and sum is the total number of pixels in the current block.
2. When the preprocessed original image is segmented based on color information in step 3, a skin color segmentation algorithm based on the YCbCr color space is adopted. Specifically:
The YCbCr color space is also called the YUV color space; Y denotes luminance, and Cb and Cr denote chroma and saturation, where Cr reflects the difference between the red part of the RGB input signal and the luminance of the RGB signal, and Cb reflects the difference between the blue part of the RGB input signal and the luminance of the RGB signal. The conversion from the RGB color space to YCbCr is shown in formula (4):
through repeated experiments, the basic values of the parameters are as follows:
77≤Cb≤127 AND 132≤Cr≤172 (5)
however, the formula (5) contains more skin color ranges, and the provided value range is too large, so that interference such as orange or brown objects is easily introduced. Aiming at the unique skin color characteristics of the yellow race, the invention can effectively eliminate the interference of skin color-like objects by adjusting the values after debugging for many times, and the values are as follows:
3. The binarized convolutional neural network of step 1 is a Binarized Convolutional Neural Network (BCNN) based on the MPCNN. Specifically:
The currently popular deep convolutional neural network algorithms share a common defect: their computational cost, in both memory and arithmetic, is huge, and optimization of a network's cost therefore revolves around these two aspects. On the basis of the MPCNN gesture classification method, a Binarized Convolutional Neural Network (BCNN) gesture classification method is proposed, which improves the neural network with a binary approximation strategy to reduce the consumption of computing resources. A binarized network reduces computing resource consumption in two main ways: first, the original double-precision weights are represented by approximately binarized weights, which reduces the memory footprint of the network during computation; second, the inputs and weights of the multiplications that dominate each layer's computational cost are replaced by binary approximations, so that multiplication can be reduced to addition and subtraction or even bit operations. The method comprises the modification of the convolution block and the modification of the fully connected block.
(1) Binarization of the convolution block.
The specific way of performing binary approximate reconstruction on the convolutional neural network is as follows:
Firstly, during forward propagation, the weight matrix w of the convolution network is binarized according to formula (7) to obtain wb, while the original weight matrix w is retained, namely:
wb = sign(w)    (7)
In formula (7), wb is the matrix of weights obtained by binary approximation; cf, wf and hf are the number, width and height of the convolution kernels, with w ∈ R^(cf × wf × hf). In the standard sign function sign(w) = 0 when w = 0; here, so that no third value can occur and the binarization effect is achieved, it is specified that sign(w) = 1 when w = 0.
Secondly, a binarized activation layer, which replaces the original ReLU activation layer, is added before each layer to obtain its input values, as in formula (8), namely:
X^(i) = L(X^(i−1)) = sign(X^(i−1))    (8)
In formula (8), X^(i) is the input value of the i-th layer of the binarized network, with X ∈ R^(c × w × h), where c, w and h are the number of channels, the width and the height of the input image; L(X^(i−1)) is the value obtained by the i-th binarized activation layer; X^(i−1) is the input value of layer i − 1 of the binarized network.
The sign function in formula (8) follows the same convention as in formula (7). Finally, the convolution operation of the binarized convolution layer is performed with the weight wb, as in formula (9), namely:
Lb(Xb) = wb ⊛ Xb    (9)
In formula (9), Lb(Xb) is the binarized network layer function; ⊛ is the convolution operation; wb and Xb are obtained from formula (7) and formula (8) respectively.
The structure of the convolution block also needs some adjustment: the BatchNorm normalization layer and the binarized activation layer are placed before the convolution operation, which prevents the output of the binarized activation layer from being mostly 1 after passing through the max-pooling layer. The specific network structure is shown in fig. 2.
The back propagation of training proceeds as follows: the gradient of the last layer is computed; the node gradients and weight gradients are back-propagated layer by layer from the penultimate layer to the first; and the retained pre-binarization weights w are updated to obtain wu, after which a relaxation (clipping) operation as in formula (10) is performed, namely:
In formula (10), wu is the updated value of the floating-point weights retained during forward propagation; σ(wu) is the probability that the weight wu > 0; clip(·) denotes the clipping operation, built from the max function.
(2) Binarization of the fully connected block.
The binarization of the fully connected block is basically the same as that of the convolution block: the binarized convolution layer is replaced by a binarized fully connected layer, and the max-pooling layer is removed. The calculation of the binarized fully connected layer is shown in formula (11).
Lb(Xb) = wb Xb    (11)
In formula (11), Lb(Xb) is the binarized fully connected layer function; Xb and wb are obtained from formula (8) and formula (7) respectively. The binarized fully connected layer omits the bias b.
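The binarization of formulas (7), (8) and (11) can be sketched as follows; the sign(0) = 1 convention and the bias-free fully connected product follow the text, while the layer shapes are simplified for illustration.

```python
import numpy as np

# Hedged sketch of the binarization of formulas (7), (8) and (11): weights and
# inputs are mapped to {-1, +1} by a sign function with sign(0) = 1, and the
# binarized fully connected layer is a plain matrix product without bias.
def binarize(x: np.ndarray) -> np.ndarray:
    """sign(x) with the convention sign(0) = +1, as specified for formula (7)."""
    return np.where(x >= 0, 1.0, -1.0)

def binary_fc(w: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Formula (11): L_b(X_b) = w_b X_b, with w_b = sign(w) and X_b = sign(x)."""
    return binarize(w) @ binarize(x)
```

Because both operands lie in {−1, +1}, the products reduce to sign flips and additions, which is the source of the claimed computational savings.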
4. In step 6, a TLD algorithm is used for tracking gesture tracks, deviations in the tracking process are corrected by using a Haar classifier, and the specific method for identifying dynamic gestures by using an HMM algorithm comprises the following steps:
4.1. The TLD algorithm framework consists of three parts, tracking, learning and detection, as shown in fig. 3:
Within this framework, the three parts cooperate and complement each other to track the object. The tracking module assumes that the object moves slowly, undergoes no large displacement between two adjacent frames, and stays within the range of the camera; on this assumption it estimates the moving target, and tracking fails if the target disappears from the field of view. The detection module assumes no interference between the frames of the video; using the previously detected and learned model, it searches each frame image for the target and marks the regions where the target may appear. When the detection module makes an error, the learning module evaluates the error from the result obtained by the tracking module, generates training samples to update the target model of the detection module and the key feature points of the tracking module, and thus avoids similar errors later. A detailed flow chart of the TLD algorithm is shown in fig. 4.
The TLD algorithm tracks targets with good real-time performance, and it can re-identify and continue tracking a target that was occluded or left the camera area and reappeared. However, the algorithm requires the tracked target to be selected manually with the mouse during initialization, which hinders the automation of target tracking; and although the LBP features used in the detection module are cheap to compute and easily meet the real-time requirement, position deviations occur during tracking and lead to tracking failure. The system therefore combines the characteristics of static gesture recognition and gesture tracking on the basis of the original TLD algorithm, and improves it as follows:
To remove the need to select the target area manually at initialization, a static gesture recognition database is added to the detection module, and the TLD tracking algorithm is initialized automatically when a gesture matching the gesture database appears in a video frame. Moreover, because a trained static gesture database is used, the learning module of the original TLD algorithm can be removed: when the user's gesture changes, it is only necessary to search the video frames again for a gesture present in the gesture database and then re-initialize the TLD algorithm. The flow of the improved TLD algorithm is shown in fig. 5.
4.2 Correcting deviations in the tracking process with the Haar classifier
This mainly comprises Haar feature extraction and classifier training. Haar features mainly include center features, linear features, edge features and diagonal features. To obtain the final Haar classifier, the invention trains with an improved Adaboost algorithm: different weak classifiers are trained from the Haar features extracted from the samples, and the weak classifiers are then combined into the final strong classifier, i.e. the Haar classifier required here.
The implementation flow of the improved Adaboost algorithm is as follows:
Let X be the sample space and Y the set of sample class labels. For the typical two-class problem, Y = {0, 1}. Let S = {(xi, yi) | i = 1, 2, 3, …, m} be the labeled training sample set, with xi ∈ X and yi ∈ Y, and assume a total of T iterations are needed to reach the final goal.
Step 1, initializing weights of m samples:
In the formula, Dt(i) denotes the weight of sample (xi, yi) in the t-th iteration.
Step 2, for each t = 1, 2, 3, …, T, compute:
(a) For each feature f of a sample x, train a weak classifier ht(x, f, p, θ):
In formula (13), θ is the threshold of the weak classifier for feature f, and p adjusts the direction of the inequality. Compute the classification error rate εf of every weak classifier under the weights qi:
εf = Σi qi |ht(x, f, p, θ) − yi|    (14)
In formula (14), yi is an element of the sample class label space and qi is the weight of the i-th training sample.
(b) Select the optimal weak classifier ht with the minimum error rate εt:
εt = min(f, p, θ) Σi qi |ht(x, f, p, θ) − yi|    (15)
(c) Modify the sample weights using the best weak classifier:
βt = εt / (1 − εt)    (17)
In formula (16), Dt+1(i) is the weight of the i-th training sample at iteration t + 1; Dt+1 and Dt are related iteratively, so Dt+1 can be updated from Dt.
In formula (17), βt is a normalization constant.
If sample xi is classified correctly, then ei = 0; otherwise ei = 1.
Step 3, the final Haar classifier C(x):
αt=log(1/βt) (19)
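The weight update of step 2(c) and the final vote of step 3 can be sketched as follows; formulas (16) and (18) appear only as images in the source, so the usual Viola-Jones forms are assumed: Dt+1(i) ∝ Dt(i) βt^(1 − ei) followed by renormalization, and a weighted majority vote thresholded at half the sum of the αt.

```python
import math

# Hedged sketch of the Adaboost weight update of step 2(c) and the final vote.
# Formulas (16) and (18) are images in the source; the Viola-Jones form is
# assumed: beta_t = eps_t / (1 - eps_t), D_{t+1}(i) ~ D_t(i) * beta_t**(1 - e_i),
# alpha_t = log(1 / beta_t), C(x) = 1 iff sum_t alpha_t h_t(x) >= 0.5 * sum_t alpha_t.
def update_weights(D, errors, eps_t):
    """D: sample weights; errors[i] = 0 if sample i was classified correctly."""
    beta = eps_t / (1.0 - eps_t)
    D = [d * beta ** (1 - e) for d, e in zip(D, errors)]  # shrink correct samples
    s = sum(D)
    return [d / s for d in D], math.log(1.0 / beta)       # renormalize; alpha_t

def strong_classify(alphas, weak_outputs):
    """Final Haar classifier C(x): weighted majority vote of the weak outputs."""
    score = sum(a * h for a, h in zip(alphas, weak_outputs))
    return 1 if score >= 0.5 * sum(alphas) else 0
```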
4.3 HMM-based dynamic gesture trajectory recognition
In the invention, a hidden Markov model is used to recognize the dynamic gesture trajectory; the recognition corresponds to the three problems solved with a hidden Markov model:
(1) The evaluation problem
Given a hidden Markov model λ = (π, A, B) and an observation sequence O = (o1, o2, …, oT) generated by the model, compute the likelihood P(O | λ) of the observation sequence O. An effective algorithm for this problem is the forward-backward recursion algorithm.
The forward variables are defined as:
αt(i) = P(o1, o2, … ot, qt = θi | λ), 1 ≤ t ≤ T    (19)
In formula (19), P(·) is the likelihood of the observation sequence; o1, o2, … oT is the observation sequence; qt is the system state at time t; θi is a state value of the system; λ is the hidden Markov model; T is the total observation time; t is the time index, taking values between 1 and T.
Write bj(ot) = bjk when ot = vk, where bj(ot) comes from the observation probability matrix, bjk is the probability of observing symbol vk in state θj at any time t, and vk is the k-th observation symbol. The forward algorithm then proceeds as follows:
initialization:
α1(i) = πi bi(o1), 1 ≤ i ≤ N    (20)
In formula (20), α1(i) is the probability of observing o1 at time 1 with the system in state θi; πi is the initial state probability distribution.
Recursion:
αt+1(j) = [Σi αt(i) aij] bj(ot+1), 1 ≤ t ≤ T − 1, 1 ≤ j ≤ N    (21)
In formula (21), αt+1(j) is the probability of having observed o1 … ot+1 and being in state θj at time t + 1; aij is the state transition probability from θi to θj at any time t.
Calculate P(O | λ):
P(O | λ) = Σi αT(i)
Here P(O | λ) is the likelihood of generating the observation sequence O under the current model λ. The backward variable is defined as:
βt(i) = P(ot+1, ot+2, … oT | qt = θi, λ), 1 ≤ t ≤ T    (22)
In formula (22), βt(i) is the probability of observing the partial sequence ot+1 … oT given that the system is in state θi at time t under model λ.
The backward algorithm comprises the following steps:
initialization:
βT(i) = 1, 1 ≤ i ≤ N    (23)
recursion:
βt(i) = Σj aij bj(ot+1) βt+1(j), t = T − 1, T − 2, …, 1, 1 ≤ i ≤ N    (24)
calculate P(O | λ):
P(O | λ) = Σi πi bi(o1) β1(i)
More generally, computing times 1 to t with the forward algorithm and times t to T with the backward algorithm gives P(O | λ) = Σi αt(i) βt(i) for any t.
(2) The decoding problem
For a hidden Markov model λ = (π, A, B) and an observation sequence O = (o1, o2, … oT) generated by the model, compute the optimal state sequence the model passed through while generating the observation sequence. The Viterbi algorithm is used here.
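The Viterbi decoding named above can be sketched as follows (a generic implementation, not code from the patent); A, B and pi are as in the evaluation problem.

```python
import numpy as np

# Generic sketch of Viterbi decoding for the decoding problem. A[i, j]: state
# transition probability; B[j, k]: probability of observing symbol k in state j;
# pi: initial state distribution (toy parameters).
def viterbi(A, B, pi, obs):
    """Return the most likely state sequence for the observation sequence obs."""
    N, T = A.shape[0], len(obs)
    delta = pi * B[:, obs[0]]                 # best path probability so far
    psi = np.zeros((T, N), dtype=int)         # back-pointers
    for t in range(1, T):
        trans = delta[:, None] * A            # trans[i, j] = delta_i * a_ij
        psi[t] = trans.argmax(axis=0)
        delta = trans.max(axis=0) * B[:, obs[t]]
    states = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):             # follow back-pointers
        states.append(int(psi[t, states[-1]]))
    return states[::-1]
```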
(3) The learning problem
Given an observation sequence O = (o1, o2, … oT) generated by the model, but without knowledge of the hidden Markov model parameters, adjust the model parameters so that the likelihood P(O | λ) is maximized. In this system the learning problem is solved with the Baum-Welch algorithm.
The gesture recognition platform collects gesture images through the camera and converts the gesture commands in them into instructions the computer can execute. A sample database is needed first, and static gesture recognition and dynamic gesture trajectory recognition are performed on its basis: the gesture image can be acquired from the camera or directly from a local video file; once a gesture image is obtained, gesture segmentation, image binarization, feature extraction and similar operations are performed on it; finally, gesture recognition is performed on the image and the recognition result is returned for easy observation of the process. The system software design flow is shown in fig. 6. The system is developed with multiple threads: image preprocessing and gesture segmentation are done in sub-thread 1, and dynamic gesture tracking and recognition in sub-thread 3.
Claims (1)
1. A gesture recognition method based on deep learning is characterized by comprising the following steps:
step 1, training a binary convolution neural network by utilizing a gesture training set and a test set;
the binarization-based convolutional neural network BCNN based on the MPCNN is adopted by the binarization-based convolutional neural network in the step 1, and specifically comprises the following steps:
on the basis of the MPCNN gesture classification method, a convolution neural network gesture classification method based on binaryzation is provided, and a strategy of binaryzation approximation is adopted to improve the neural network, so that the consumption of the neural network on computing resources is reduced; the binarization network has two ways to reduce the consumption of computing resources: firstly, the original double-precision weight is represented by using a weight value approximate to binaryzation, so that the memory occupation of a network in calculation is reduced; secondly, the input and the weight value in the multiplication calculation with the maximum calculation consumption in each layer are replaced by binary approximate values, so that the multiplication calculation can be simplified into addition and subtraction or even bit operation, including the transformation of a rolling block and the transformation of a full connecting block;
a first part: binarization of the rolling blocks;
the specific way of performing binary approximate reconstruction on the convolutional neural network is as follows:
step 1011, in the forward propagation process, binarize the weight matrix w of the convolution network according to formula (7) to obtain w_b, while retaining the original weight matrix w, namely:

w_b = sign(w) = 1 if w ≥ 0, −1 if w < 0   (7)

In formula (7), w_b ∈ {−1, 1}^(c_f × w_f × h_f) is the weight matrix obtained by binary approximation; c_f, w_f and h_f denote the number, width and height of the convolution kernels; sign(w) is specified to be 1 when w = 0;
step 1012, add a binarized activation layer before each layer to binarize the node values from the previous layer, replacing the original ReLU activation layer, as shown in formula (8), namely:

X_b^(i) = L(X^(i−1)) = sign(X^(i−1))   (8)

In formula (8), X_b^(i) ∈ {−1, 1}^(c × w × h) is the input value of the i-th layer of the binarized network, where c, w and h denote the number of channels, the width and the height of the input image; L(X^(i−1)) is the value obtained by the i-th binarized activation layer; X^(i−1) is the input value of the (i−1)-th layer of the binarized network;
the sign function is the same as in formula (7); finally, the convolution operation is performed in the binarized convolution layer with the weights w_b, as shown in formula (9), namely:

L_b(X_b) = w_b ⊛ X_b   (9)

In formula (9), L_b(X_b) is the binarized network layer function; ⊛ is the convolution operation; w_b and X_b are obtained by formula (7) and formula (8) respectively;
for the convolution block, the BatchNorm normalization layer and the binarized activation layer are placed before the convolution operation; this prevents the situation in which most outputs of the binarized activation layer become 1 after passing through the max-pooling layer;
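The forward pass of formulas (7)-(9) can be sketched as follows in Python/NumPy (the patent names no language, so the implementation details are illustrative): both the input and the kernel are binarized to {−1, +1}, after which every product in the convolution is just a sign agreement and the multiply-accumulate degenerates into counting matches.

```python
import numpy as np

def sign(x):
    """Binarize to {-1, +1}; sign(0) is defined as +1, as in formula (7)."""
    return np.where(x >= 0, 1.0, -1.0)

def binary_conv2d(X, w):
    """Valid 2-D convolution where input and kernel are first binarized
    (formulas (7)-(9)); every product is +/-1, so no real multiplies remain."""
    Xb, wb = sign(X), sign(w)
    kh, kw = wb.shape
    H, W = Xb.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # each term is +1 (signs agree) or -1 (signs differ)
            out[i, j] = np.sum(Xb[i:i+kh, j:j+kw] * wb)
    return out

X = np.random.randn(5, 5)
w = np.random.randn(3, 3)
y = binary_conv2d(X, w)
print(y.shape)  # (3, 3)
```

Because each of the 3 × 3 products is ±1, every output entry is an odd integer in [−9, 9]; on hardware this reduces to XNOR and popcount, which is the bit-operation saving the claim describes.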
the back-propagation process of binarized convolutional neural network training is as follows: compute the gradient of the last layer, back-propagate layer by layer from the penultimate layer to the first layer, compute the node gradients and the weight gradients, update the retained pre-binarization weights w to obtain w_u, and perform the relaxation operation of formula (10), namely:

w = clip(w_u, −1, 1) = max(−1, min(1, w_u))   (10)

In formula (10), w_u is the updated value of the floating-point weights retained during forward propagation; σ(w_u) = clip((w_u + 1)/2, 0, 1) represents the probability that the weight w_u > 0; clip(·) is built from the max and min functions as above;
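The update-then-relax step of formula (10) can be sketched in Python/NumPy as follows (illustrative names; `grad` stands in for the gradient computed through the binarized weights): the gradient step is applied to the retained float weights, which are then clipped back into [−1, 1].

```python
import numpy as np

def clip(x, lo=-1.0, hi=1.0):
    """Relaxation operation of formula (10): element-wise max/min."""
    return np.maximum(lo, np.minimum(hi, x))

def update_weights(w, grad, lr=0.01):
    """Apply the gradient to the retained float weights w, then clip,
    so the weights stay in the range the sign() binarization expects."""
    w_u = w - lr * grad   # plain SGD step on the float copy
    return clip(w_u)      # formula (10)

w = np.array([0.9, -1.5, 0.2])
grad = np.array([-20.0, 0.0, 5.0])
print(update_weights(w, grad))  # 1.0 (clipped), -1.0 (clipped), 0.15
```

Clipping keeps the float weights from drifting far outside {−1, +1}, where further gradient steps would no longer change the binarized value.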
Part 2: binarization of the fully connected blocks;
the binarization of the fully connected block is basically the same as that of the convolution block; the differences are that the binarized convolution layer is replaced by a binarized fully connected layer and the max-pooling layer is removed; the binarized fully connected layer is computed as in formula (11):

L_b(X_b) = w_b X_b   (11)

In formula (11), L_b(X_b) is the binarized fully connected layer function; w_b and X_b are obtained by formulas (7) and (8) respectively; the bias b is removed from the binarized fully connected layer;
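The claim notes that with binary values, multiplication can be reduced to bit operations. A small Python sketch (illustrative, not from the patent) of how a {−1, +1} dot product — the core of the binarized fully connected layer of formula (11) — becomes XNOR plus popcount:

```python
def binary_dot(xb, wb):
    """Dot product of two {-1, +1} vectors via bit operations.

    Pack the signs into integer bit masks, XNOR them (a set bit means
    the two signs agree), popcount the matches; then dot = 2*matches - n.
    """
    n = len(xb)
    x_bits = sum(1 << i for i, v in enumerate(xb) if v > 0)
    w_bits = sum(1 << i for i, v in enumerate(wb) if v > 0)
    matches = bin(~(x_bits ^ w_bits) & ((1 << n) - 1)).count("1")
    return 2 * matches - n

xb = [1, -1, 1, 1]
wb = [1, 1, -1, 1]
print(binary_dot(xb, wb))                  # 0
print(sum(x * w for x, w in zip(xb, wb)))  # 0, the ordinary dot product
```

On real hardware the packed masks are machine words, so one XNOR + popcount replaces up to 64 floating-point multiply-adds.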
step 2, after the original gesture image is collected, preprocessing the original gesture image to remove the influence of illumination on the original image;
in step 2, the preprocessing of the original gesture image includes: the luminance correction based on exponential transformation and logarithmic transformation and the light compensation based on dynamic threshold specifically comprise the following steps:
step 201, luminance correction based on exponential transformation and logarithmic transformation:
the exponential transformation gives a good correction only in the bright areas of an image, while the logarithmic transformation gives a good correction in the dark areas; the two are combined into a light compensation strategy for the human hand, as shown in formula (1): the corrected exponential transformation is used for the bright areas, the parametric logarithmic transformation for the dark areas, and the other areas are left uncorrected:
the parameters used in formula (1) are as follows:
g(x, y) is the corrected image; f(x, y) is the original gesture image; a is the highlight adjustment coefficient; b is the average brightness of the image; c is a positive constant obtained by experimental debugging; d = 1/(255 − T_2); T_1 is the lower light threshold under dark illumination; T_2 is the upper light threshold under brighter illumination;
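Formula (1) itself is not reproduced in this extract, so the following Python/NumPy sketch only illustrates the piecewise strategy described above under assumed forms: a parametric log transform below T_1, a compressive power transform above T_2 built around d = 1/(255 − T_2), and a pass-through in between. The constants and the exact closed forms are stand-ins, not the patent's formula (1).

```python
import numpy as np

def correct_brightness(f, T1=60, T2=200, a=1.1, c=30.0):
    """Piecewise correction in the spirit of formula (1); the exact
    forms used below T1 and above T2 are illustrative assumptions."""
    g = f.astype(np.float64).copy()
    dark = f < T1
    bright = f > T2
    g[dark] = c * np.log1p(g[dark])        # log transform lifts dark areas
    d = 1.0 / (255.0 - T2)                 # d = 1/(255 - T2) from the text
    g[bright] = T2 + (255.0 - T2) * ((g[bright] - T2) * d) ** a  # compress highlights
    return np.clip(g, 0, 255).astype(np.uint8)

img = np.array([[10, 128, 250]], dtype=np.uint8)
print(correct_brightness(img))  # dark pixel lifted, mid-tone untouched
```

Mid-tone pixels between T_1 and T_2 pass through unchanged, matching the "other areas are not corrected" clause.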
step 202, dynamic threshold based light compensation
the algorithm, based on the total reflection theory, converts the image into the YCbCr color space and then takes the set of points with the largest Y components in that space as the white reference points:
assuming that the original gesture image is f (x, y) and the size is m × n, then:
step 2021, convert the original gesture image f (x, y) from RGB color space to YCbCr color space using equation (2):
step 2022, obtaining the reference white points, comprising the following steps:
(a) divide the converted image into M × N blocks;
(b) for each block, compute the averages M_b and M_r of the C_b and C_r components in YCbCr space;
(c) using M_b and M_r, compute the mean absolute deviations D_b and D_r of the C_b and C_r components of each block, as in formula (3):

D_b = (1/sum) Σ_(i,j) |C_b(i, j) − M_b|,  D_r = (1/sum) Σ_(i,j) |C_r(i, j) − M_r|   (3)

In formula (3), C_b(i, j) is the offset of the B component of each pixel relative to the luminance, C_r(i, j) is the offset of the R component relative to the luminance, and sum is the total number of pixels in the current block;
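Steps (a)-(c) and formula (3) can be sketched in Python/NumPy as follows (illustrative; the block counts and channel layout are assumptions): each block's chroma mean and mean absolute deviation are computed, which is the statistic the algorithm uses when selecting white-reference regions.

```python
import numpy as np

def block_stats(Cb, Cr, M=4, N=4):
    """Per-block chroma statistics: means (M_b, M_r) and mean absolute
    deviations (D_b, D_r) as in formula (3). Returns one tuple per block."""
    h, w = Cb.shape
    bh, bw = h // M, w // N
    stats = []
    for i in range(M):
        for j in range(N):
            cb = Cb[i*bh:(i+1)*bh, j*bw:(j+1)*bw]
            cr = Cr[i*bh:(i+1)*bh, j*bw:(j+1)*bw]
            Mb, Mr = cb.mean(), cr.mean()
            Db = np.abs(cb - Mb).mean()  # formula (3), C_b channel
            Dr = np.abs(cr - Mr).mean()  # formula (3), C_r channel
            stats.append((Mb, Mr, Db, Dr))
    return stats

# a flat chroma image: every block has zero deviation
Cb = np.full((8, 8), 110.0)
Cr = np.full((8, 8), 150.0)
stats = block_stats(Cb, Cr, M=2, N=2)
```

A block with small D_b and D_r is chromatically uniform, which is what makes its brightest pixels usable as a white reference.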
step 3, segmenting the preprocessed original image based on the color information reflected by skin color, and extracting a gesture outline;
in step 3, when the preprocessed original image is segmented based on the color information, a segmentation algorithm based on skin color in the YCbCr color space is adopted, specifically:
the YCbCr color space is also called YUV color space, Y represents brightness, and Cr and Cb represent chroma and saturation, where Cr reflects the difference between the red part of the RGB input signal and the brightness value of the RGB signal, and Cb reflects the difference between the blue part of the RGB input signal and the brightness value of the RGB signal;
the conversion formula from the RGB color space to YCbCr is shown in formula (4) (the standard ITU-R BT.601 conversion):

Y = 0.299·R + 0.587·G + 0.114·B
Cb = −0.1687·R − 0.3313·G + 0.500·B + 128
Cr = 0.500·R − 0.4187·G − 0.0813·B + 128   (4)
through repeated experiments, the basic values of the parameters are as follows:
77≤Cb≤127 AND 132≤Cr≤172 (5)
however, formula (5) covers too wide a skin color range; with such a large value range, interference from orange or brown objects is easily introduced; the values are therefore adjusted through repeated debugging so that interference from skin-color-like objects can be effectively eliminated; the adjusted values are as follows:
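A minimal Python/NumPy sketch of the skin color mask using the broad thresholds of formula (5) (the tightened values the claim alludes to are not reproduced in this extract, so the broad range is used here):

```python
import numpy as np

def skin_mask(Cb, Cr):
    """Binary skin mask from the chroma thresholds of formula (5):
    77 <= Cb <= 127 AND 132 <= Cr <= 172."""
    return ((Cb >= 77) & (Cb <= 127) &
            (Cr >= 132) & (Cr <= 172)).astype(np.uint8)

Cb = np.array([[100, 60]])
Cr = np.array([[150, 150]])
print(skin_mask(Cb, Cr))  # [[1 0]]: first pixel in range, second not
```

Thresholding chroma only (not Y) is what makes the segmentation relatively robust to the illumination changes step 2 could not fully remove.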
step 4, judging whether the gesture outline extracted in step 3 is the start or stop point of a dynamic gesture; if so, the series of subsequent images is extracted, the gesture outlines obtained constitute a dynamic gesture, and the method proceeds to step 6; otherwise the extracted gesture outline is a static gesture and the method proceeds to step 5;
step 5, judging the gesture instruction corresponding to the gesture outline with the trained binarized convolutional neural network;
step 6, locating the start and stop points of the dynamic gesture corresponding to a series of gesture outlines, tracking the gesture track with the TLD algorithm, correcting deviations in the tracking process with a Haar classifier, and recognizing the dynamic gesture with an HMM algorithm;
in step 6, the gesture track is tracked with the TLD algorithm, deviations in the tracking process are corrected with a Haar classifier, and the dynamic gesture is recognized with an HMM algorithm, as follows:
step 601, the TLD algorithm framework consists of three parts: tracking, learning and detection; within the framework the three cooperate and complement one another to track the object. The tracking module assumes that the object moves slowly, does not undergo large displacement between two adjacent frames, and stays within the camera's field of view; on this basis it estimates the moving target, and tracking fails if the target leaves the field of view. The detection module assumes the frames of the video are mutually independent; using the model learned from past detections, it searches each frame for the target and marks the regions where the target may appear. When the detection module makes an error, the learning module evaluates that error from the result obtained by the tracking module, generates training samples, and updates the target model of the detection module and the key feature points of the tracking module, so that similar errors are avoided;
step 602, correcting deviations in the tracking process with a Haar classifier:
the work mainly comprises extracting Haar features and training the classifier; the Haar features mainly comprise central features, linear features, edge features and diagonal features; to obtain the final Haar classifier, an improved Adaboost algorithm is used for training: different weak classifiers are first trained with the Haar features extracted from the samples, and these weak classifiers are then combined into the final strong classifier, namely the Haar classifier;
the implementation flow of the improved Adaboost algorithm is as follows:
suppose X is the sample space and Y is the set of sample class labels; for a typical two-class problem, Y = {0, 1}; let S = {(x_i, y_i) | i = 1, 2, 3, …, m} be the set of labeled training samples, where x_i ∈ X and y_i ∈ Y; T iterations in total are performed to reach the final target;
step 6021, initialize the weights of the m samples:

D_1(i) = 1/m,  i = 1, 2, …, m   (12)

In the formula, D_t(i) denotes the weight of sample (x_i, y_i) in the t-th iteration;
step 6022, for t = 1, 2, 3, …, T, compute:
(a) train a weak classifier h_t(x, f, p, θ) for each feature f of a sample x:

h_t(x, f, p, θ) = 1 if p·f(x) < p·θ, 0 otherwise   (13)

In formula (13), θ is the threshold of the weak classifier corresponding to f and p adjusts the direction of the inequality sign; with the normalized sample weights q_i, the weighted classification error rate ε_f of the weak classifiers of all features is computed as:

ε_f = Σ_i q_i |h_t(x, f, p, θ) − y_i|   (14)

In formula (14), y_i is an element of the sample class label set and q_i is the weight of the i-th training sample;
(b) select the optimal weak classifier h_t, the one with the minimum error rate ε_t:

ε_t = min_(f,p,θ) Σ_i q_i |h_t(x, f, p, θ) − y_i|   (15)
(c) modify the sample weights using the optimal weak classifier:

D_(t+1)(i) = D_t(i)·β_t^(1−e_i)   (16)
β_t = ε_t/(1 − ε_t)   (17)

In formula (16), D_(t+1)(i) is the weight of the i-th training sample in iteration t+1; D_(t+1) and D_t are in an iterative relationship, D_(t+1) being updated from D_t; in formula (17), β_t is the weight adjustment factor; if sample x_i is correctly classified, e_i = 0, otherwise e_i = 1;
step 6023, the final Haar classifier C(x):

C(x) = 1 if Σ_(t=1)^T α_t·h_t(x) ≥ (1/2)·Σ_(t=1)^T α_t, 0 otherwise   (18)
α_t = log(1/β_t)   (19)
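The training loop of steps 6021-6023 can be sketched in Python/NumPy as follows — a minimal Adaboost over decision stumps on scalar features, with the formula numbers marked in the comments. The toy data set and the eps clamping are illustrative assumptions, and β_t is read as the ratio ε_t/(1 − ε_t):

```python
import numpy as np

def train_adaboost(feats, y, T=10):
    """Improved-Adaboost sketch over decision stumps h(x, f, p, theta)."""
    m, F = feats.shape
    D = np.full(m, 1.0 / m)                    # formula (12)
    stumps = []
    for _ in range(T):
        best = None
        q = D / D.sum()                        # normalized weights q_i
        for f in range(F):
            for theta in np.unique(feats[:, f]):
                for p in (1, -1):
                    h = (p * feats[:, f] < p * theta).astype(int)  # formula (13)
                    eps = np.sum(q * np.abs(h - y))                # formula (14)
                    if best is None or eps < best[0]:              # formula (15)
                        best = (eps, f, theta, p, h)
        eps, f, theta, p, h = best
        eps = max(eps, 1e-10)                  # guard against a perfect stump
        beta = eps / (1.0 - eps)               # formula (17)
        D = D * beta ** (1 - np.abs(h - y))    # formula (16): downweight correct samples
        stumps.append((np.log(1.0 / beta), f, theta, p))
    alphas = [a for a, *_ in stumps]
    def classify(x):                           # formula (18)
        s = sum(a * (p * x[f] < p * theta) for a, f, theta, p in stumps)
        return int(s >= 0.5 * sum(alphas))
    return classify

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
clf = train_adaboost(X, y, T=3)
print([clf(x) for x in X])  # [0, 0, 1, 1]
```

Multiplying correctly classified samples by β_t (which is below 1 when ε_t < 1/2) shifts the weight toward the samples the current stump gets wrong, which is what forces each later round to attend to the hard cases.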
step 603, HMM-based dynamic gesture track recognition
the hidden Markov model is used to recognize the dynamic gesture track; the recognition corresponds to the three problems solved for hidden Markov models:
step 6031, the estimation problem:
the problem is: given a hidden Markov model λ = (π, A, B) and an observation sequence O = (o_1, o_2, …, o_T) generated by the model, compute the likelihood P(O|λ); an effective algorithm for solving this problem is the forward-backward recursion algorithm:
the forward variable is defined as:

α_t(i) = P(o_1, o_2, …, o_t, q_t = θ_i | λ),  1 ≤ t ≤ T   (19)

In formula (19), P(·) denotes the likelihood probability of the observation sequence; o_1, o_2, …, o_t are the observations up to time t; q_t is the state at time t; θ_i is a system state value; λ is the hidden Markov model; T is the total observation time; t is the time index, taking values between 1 and T;
write b_j(o_t) = b_jk when o_t = v_k, where b_j(o_t) is the probability of observing o_t in state θ_j, b_jk is an entry of the observation probability matrix B, and v_k is the k-th observation symbol; the forward algorithm then comprises the following steps:
initialization:
α_1(i) = π_i·b_i(o_1),  1 ≤ i ≤ N   (20)

In formula (20), α_1(i) is the probability of observing o_1 at time 1 with the system in state θ_i; π_i is the initial probability distribution;
recursion:

α_(t+1)(j) = [Σ_(i=1)^N α_t(i)·a_ij]·b_j(o_(t+1)),  1 ≤ t ≤ T−1   (21)

In formula (21), α_(t+1)(j) is the probability of the partial observation sequence up to time t+1 with the system in state θ_j at time t+1; a_ij is an entry of the state transition matrix A;
calculate P (O | λ):
in equation (15), P (O | λ) represents the likelihood probability of generating the observation sequence O under the current model λ, and the defined variables are:
β_t(i) = P(o_(t+1), o_(t+2), …, o_T | q_t = θ_i, λ),  1 ≤ t ≤ T   (22)

In formula (22), β_t(i) is the probability of the observations after time t given that the system is in state θ_i at time t;
the backward algorithm comprises the following steps:
initialization:
β_T(i) = 1,  1 ≤ i ≤ N   (23)
recursion:

β_t(i) = Σ_(j=1)^N a_ij·b_j(o_(t+1))·β_(t+1)(j),  t = T−1, …, 1   (24)

calculate P(O|λ):

P(O|λ) = Σ_(i=1)^N π_i·b_i(o_1)·β_1(i)
using the forward algorithm for the first half of the calculation, over the period 0 to t, and the backward algorithm for the second half, over the period t to T, the probability is obtained as:

P(O|λ) = Σ_(i=1)^N α_t(i)·β_t(i)
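The forward recursion of formulas (20)-(21) and the backward recursion of formulas (23)-(24) can be sketched in Python/NumPy as follows (the toy two-state model is illustrative); both must yield the same likelihood P(O|λ), which makes a useful sanity check:

```python
import numpy as np

def forward(pi, A, B, O):
    """Forward recursion, formulas (20)-(21); returns P(O|lambda)."""
    alpha = pi * B[:, O[0]]            # formula (20): alpha_1(i) = pi_i * b_i(o_1)
    for o in O[1:]:
        alpha = (alpha @ A) * B[:, o]  # formula (21)
    return alpha.sum()                 # termination: sum_i alpha_T(i)

def backward(pi, A, B, O):
    """Backward recursion, formulas (23)-(24); same likelihood."""
    beta = np.ones(len(pi))            # formula (23): beta_T(i) = 1
    for o in O[:0:-1]:                 # o_T, ..., o_2
        beta = A @ (B[:, o] * beta)    # formula (24)
    return (pi * B[:, O[0]] * beta).sum()

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
O = [0, 1, 0]
print(abs(forward(pi, A, B, O) - backward(pi, A, B, O)) < 1e-12)  # True
```

Both directions cost O(N²·T), versus the exponential cost of summing over all state paths directly.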
step 6032, the decoding problem:
for a hidden Markov model λ = (π, A, B) and an observation sequence O = (o_1, o_2, …, o_T) generated by the model, compute the optimal state sequence the model passes through while generating the observation sequence; the Viterbi algorithm is used here;
step 6033, the learning problem:
with the hidden Markov model parameters unknown, adjust the model parameters so that the likelihood P(O|λ) of generating the observation sequence O = (o_1, o_2, …, o_T) from the model is maximized.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810242638.4A CN108537147B (en) | 2018-03-22 | 2018-03-22 | Gesture recognition method based on deep learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108537147A CN108537147A (en) | 2018-09-14 |
CN108537147B true CN108537147B (en) | 2021-12-10 |
Family
ID=63483626
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810242638.4A Active CN108537147B (en) | 2018-03-22 | 2018-03-22 | Gesture recognition method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108537147B (en) |
Families Citing this family (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109508670B (en) * | 2018-11-12 | 2021-10-12 | 东南大学 | Static gesture recognition method based on infrared camera |
CN109614922B (en) * | 2018-12-07 | 2023-05-02 | 南京富士通南大软件技术有限公司 | Dynamic and static gesture recognition method and system |
CN109634415B (en) * | 2018-12-11 | 2019-10-18 | 哈尔滨拓博科技有限公司 | It is a kind of for controlling the gesture identification control method of analog quantity |
CN109684959B (en) * | 2018-12-14 | 2021-08-03 | 武汉大学 | Video gesture recognition method and device based on skin color detection and deep learning |
CN110908581B (en) * | 2019-11-20 | 2021-04-23 | 网易(杭州)网络有限公司 | Gesture recognition method and device, computer storage medium and electronic equipment |
CN113449573A (en) * | 2020-03-27 | 2021-09-28 | 华为技术有限公司 | Dynamic gesture recognition method and device |
CN111753764A (en) * | 2020-06-29 | 2020-10-09 | 济南浪潮高新科技投资发展有限公司 | Gesture recognition method of edge terminal based on attitude estimation |
CN112183639B (en) * | 2020-09-30 | 2022-04-19 | 四川大学 | Mineral image identification and classification method |
CN112270220B (en) * | 2020-10-14 | 2022-02-25 | 西安工程大学 | Sewing gesture recognition method based on deep learning |
CN112784812B (en) * | 2021-02-08 | 2022-09-23 | 安徽工程大学 | Deep squatting action recognition method |
US11983327B2 (en) * | 2021-10-06 | 2024-05-14 | Fotonation Limited | Method for identifying a gesture |
CN114049539B (en) * | 2022-01-10 | 2022-04-26 | 杭州海康威视数字技术股份有限公司 | Collaborative target identification method, system and device based on decorrelation binary network |
CN114627561B (en) * | 2022-05-16 | 2022-09-23 | 南昌虚拟现实研究院股份有限公司 | Dynamic gesture recognition method and device, readable storage medium and electronic equipment |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106502570A (en) * | 2016-10-25 | 2017-03-15 | 科世达(上海)管理有限公司 | A kind of method of gesture identification, device and onboard system |
US20170220122A1 (en) * | 2010-07-13 | 2017-08-03 | Intel Corporation | Efficient Gesture Processing |
Non-Patent Citations (4)
Title |
---|
A gesture detection and recognition method based on skin color feature extraction; Fan Wenbing et al.; Modern Electronics Technique; 2017-09-15; vol. 40, no. 18, pp. 85-88 *
Research on a face detection algorithm using skin color information and geometric features; Wei Yanliu et al.; Wireless Internet Technology; 2016-11-30; no. 21, pp. 107-111 *
Research on a gesture classification method based on binarized convolutional neural networks; Hu Junfei et al.; Journal of Hunan University of Technology; 2017-01-31; vol. 31, no. 1, pp. 75-80 *
Research progress in robot visual gesture interaction technology; Qi Jing et al.; Robot; 2017-07-31; vol. 39, no. 4, pp. 565-584 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108537147B (en) | Gesture recognition method based on deep learning | |
CN108154118B (en) | A kind of target detection system and method based on adaptive combined filter and multistage detection | |
CN108133188B (en) | Behavior identification method based on motion history image and convolutional neural network | |
CN109460702B (en) | Passenger abnormal behavior identification method based on human body skeleton sequence | |
CN107609460B (en) | Human body behavior recognition method integrating space-time dual network flow and attention mechanism | |
CN111241931B (en) | Aerial unmanned aerial vehicle target identification and tracking method based on YOLOv3 | |
CN111079847B (en) | Remote sensing image automatic labeling method based on deep learning | |
CN111191583A (en) | Space target identification system and method based on convolutional neural network | |
CN110334656B (en) | Multi-source remote sensing image water body extraction method and device based on information source probability weighting | |
Yu et al. | Research of image main objects detection algorithm based on deep learning | |
CN113608663B (en) | Fingertip tracking method based on deep learning and K-curvature method | |
CN112906550B (en) | Static gesture recognition method based on watershed transformation | |
CN110728694A (en) | Long-term visual target tracking method based on continuous learning | |
Chen et al. | Research on moving object detection based on improved mixture Gaussian model | |
CN113312973A (en) | Method and system for extracting features of gesture recognition key points | |
CN113326735A (en) | Multi-mode small target detection method based on YOLOv5 | |
CN114548256A (en) | Small sample rare bird identification method based on comparative learning | |
Yang et al. | A Face Detection Method Based on Skin Color Model and Improved AdaBoost Algorithm. | |
CN111310827A (en) | Target area detection method based on double-stage convolution model | |
CN111695507B (en) | Static gesture recognition method based on improved VGGNet network and PCA | |
CN112581502A (en) | Target tracking method based on twin network | |
CN112487927B (en) | Method and system for realizing indoor scene recognition based on object associated attention | |
CN113591607B (en) | Station intelligent epidemic situation prevention and control system and method | |
Raj et al. | Deep manifold clustering based optimal pseudo pose representation (dmc-oppr) for unsupervised person re-identification | |
Yamashita et al. | Facial point detection using convolutional neural network transferred from a heterogeneous task |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||