CN108537147B - Gesture recognition method based on deep learning - Google Patents

Gesture recognition method based on deep learning

Info

Publication number
CN108537147B
CN108537147B (application CN201810242638.4A)
Authority
CN
China
Prior art keywords
gesture
formula
algorithm
layer
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810242638.4A
Other languages
Chinese (zh)
Other versions
CN108537147A (en)
Inventor
董训锋
陈镜超
李国振
马啸天
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Donghua University
Original Assignee
Donghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Donghua University filed Critical Donghua University
Priority to CN201810242638.4A priority Critical patent/CN108537147B/en
Publication of CN108537147A publication Critical patent/CN108537147A/en
Application granted granted Critical
Publication of CN108537147B publication Critical patent/CN108537147B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107 Static hand or arm
    • G06V40/113 Recognition of static hand signs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017 Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/56 Extraction of image or video features relating to colour
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10024 Color image

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a gesture recognition method based on deep learning, characterized by comprising the following steps: training a binarized convolutional neural network with a gesture training set and a test set; segmenting the preprocessed original image according to the color information reflected by skin color and extracting the gesture contour; determining the gesture instruction corresponding to the gesture contour with the trained binarized convolutional neural network; and locating the start and end points of the dynamic gesture corresponding to a series of gesture contours, tracking the gesture trajectory with the TLD algorithm, correcting deviations during tracking with a Haar classifier, and recognizing the dynamic gesture with an HMM algorithm. The method addresses the problems of low recognition accuracy, poor stability, poor real-time performance and a limited set of supported gestures that affect conventional gesture recognition.

Description

Gesture recognition method based on deep learning
Technical Field
The invention relates to a gesture recognition method based on deep learning, and belongs to the technical field of gesture recognition.
Background
The advent of the computer has had a profound influence on social production and daily life: on the one hand it has greatly improved the efficiency of information processing, and on the other hand it has driven the development of intelligent living. How to interact with a computer efficiently and conveniently has therefore become a research hotspot.
With the development of information technology, human-computer interaction (HCI) has become an important part of daily life. As a new mode of human-computer interaction, gesture recognition has broad application prospects in many fields: (1) digital life and entertainment; for example, in 2008 Ericsson introduced the R520m smartphone, which collected the user's gesture information through its built-in camera and used it in place of the keyboard or touch screen, enabling control of the alarm clock and incoming calls. (2) Scientific and technological innovation; in fields such as space exploration and military research, dangerous or special environments that are inconvenient for direct human control are often encountered, and the relevant information can be obtained by remotely controlling a robot through gestures. (3) Intelligent transportation, such as autonomous driving; as early as 2010, Google had publicly demonstrated its self-driving cars, opening a new era of intelligent transportation.
Gesture recognition technology can play the following roles in human-computer interaction:
(1) for users, it makes products easier to use, saves time and improves the user experience;
(2) for products, it removes redundant usage instructions: only a set of general gesture guidelines needs to be provided.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: conventional gesture recognition algorithms generally suffer from low recognition accuracy, poor stability, poor real-time performance and a limited set of supported gestures.
To solve this technical problem, the invention provides a gesture recognition method based on deep learning, characterized by comprising the following steps:
Step 1, training a binarized convolutional neural network with a gesture training set and a test set;
Step 2, after acquiring the original gesture image, preprocessing it to remove the influence of illumination on the original image;
Step 3, segmenting the preprocessed original image according to the color information reflected by skin color and extracting the gesture contour;
Step 4, judging whether the gesture contour extracted in step 3 marks the start or end point of a dynamic gesture; if so, the gesture contours extracted from the subsequent series of images form a dynamic gesture and the method proceeds to step 6; otherwise the extracted gesture contour is a static gesture and the method proceeds to step 5;
Step 5, determining the gesture instruction corresponding to the gesture contour with the trained binarized convolutional neural network;
Step 6, locating the start and end points of the dynamic gesture corresponding to the series of gesture contours, tracking the gesture trajectory with the TLD algorithm, correcting deviations during tracking with a Haar classifier, and recognizing the dynamic gesture with an HMM algorithm.
Preferably, in step 2 the preprocessing includes brightness correction and light compensation.
During brightness correction, highlight regions of the original gesture image are corrected with a corrected exponential transformation, dark regions are corrected with a parametric logarithmic transformation, and other regions are left uncorrected.
Light compensation is based on a dynamic threshold: an algorithm based on the total reflection theory converts the original gesture image into the YCbCr color space, and the set of points with the largest Y components in the YCbCr image is then taken as the white reference points.
Preferably, in step 3 a skin color segmentation algorithm based on the YCbCr color space is used to segment the original image.
The method provided by the invention addresses the problems of low recognition accuracy, poor stability, poor real-time performance and limited gesture vocabulary that affect conventional gesture recognition.
Compared with the prior art, the adopted technical scheme gives the invention the following advantages and positive effects:
The method improves on the conventional gesture recognition pipeline. An improved illumination compensation strategy makes the original image easier to process; an improved skin color model segments the gesture, improving segmentation accuracy; an improved deep convolutional network classifies static gestures, improving the static gesture recognition rate; and improved TLD and HMM algorithms track and recognize dynamic gestures, improving the robustness, real-time performance and recognition rate of the gesture system.
Drawings
FIG. 1 is a system architecture schematic of the design of the deep learning based gesture recognition system of the present invention;
FIG. 2 is a diagram of a binary convolutional neural network architecture of the present invention;
FIG. 3 is a TLD algorithm framework diagram;
FIG. 4 is a detailed flow chart of the TLD algorithm;
FIG. 5 is a flow chart of the modified TLD algorithm;
FIG. 6 is a flow chart of system software design.
Detailed Description
The invention will be further illustrated with reference to the following specific examples. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and such equivalents may fall within the scope of the present invention as defined in the appended claims.
The embodiment of the invention relates to a gesture recognition method based on deep learning, which comprises the following steps as shown in figure 1:
Step 1, training a binarized convolutional neural network with a gesture training set and a test set;
Step 2, after acquiring the original gesture image, preprocessing it to remove the influence of illumination on the original image;
Step 3, segmenting the preprocessed original image according to the color information reflected by skin color and extracting the gesture contour;
Step 4, judging whether the gesture contour extracted in step 3 marks the start or end point of a dynamic gesture; if so, the gesture contours extracted from the subsequent series of images form a dynamic gesture and the method proceeds to step 6; otherwise the extracted gesture contour is a static gesture and the method proceeds to step 5;
Step 5, determining the gesture instruction corresponding to the gesture contour with the trained binarized convolutional neural network;
Step 6, locating the start and end points of the dynamic gesture corresponding to the series of gesture contours, tracking the gesture trajectory with the TLD algorithm, correcting deviations during tracking with a Haar classifier, and recognizing the dynamic gesture with an HMM algorithm.
The above steps are further described in detail with reference to the following examples:
1. The preprocessing of the original gesture image in step 2 mainly comprises luminance correction based on exponential and logarithmic transformations, and light compensation based on a dynamic threshold, as follows:
(1) Luminance correction based on exponential and logarithmic transformations.
Exponential transformation corrects only the bright regions of an image well, while logarithmic transformation corrects the dark regions well. The two are therefore combined into a light compensation strategy for the human hand, as shown in formula (1): a corrected exponential transformation is applied to bright regions, a parametric logarithmic transformation is applied to dark regions, and other regions are left unchanged.
[Formula (1), shown as an image in the original document: piecewise correction g(x, y) combining the corrected exponential transformation for bright regions and the parametric logarithmic transformation for dark regions.]
The parameters used in formula (1) are as follows:
g(x, y) is the corrected image; f(x, y) is the original gesture image; a is the highlight adjustment coefficient, a = 0 in this embodiment; b is the average luminance of the image, b = 120/log T1 in this embodiment; c is a positive constant found by experimental tuning, c = T2 in this embodiment; d is a positive constant found by experimental tuning, d = 1/255 − T2 in this embodiment; T1 is the lower light threshold, used under dim lighting, T1 = 115 in this embodiment; T2 is the upper light threshold, used under brighter lighting, T2 = 135 in this embodiment.
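Purely as an illustration of this strategy, the Python sketch below applies a piecewise correction with the embodiment's thresholds T1 = 115 and T2 = 135; since formula (1) itself is only shown as an image in the original, the exact exponential and logarithmic branch functions used here are assumptions.

```python
import numpy as np

T1, T2 = 115, 135          # lower / upper light thresholds from this embodiment
A = 0.0                    # highlight adjustment coefficient (embodiment value)
B = 120.0 / np.log(T1)     # average-luminance parameter (embodiment value)

def correct_luminance(f):
    """Piecewise luminance correction sketch: bright regions are compressed
    with an exponential-style curve, dark regions are lifted with a
    logarithmic curve, and mid-range pixels are left unchanged.  The exact
    branch formulas of formula (1) are not reproduced and are assumed."""
    f = f.astype(np.float64)
    g = f.copy()
    bright = f > T2
    dark = f < T1
    g[bright] = 255.0 * (f[bright] / 255.0) ** 1.5 + A   # assumed exponential branch
    g[dark] = B * np.log(1.0 + f[dark])                  # assumed logarithmic branch
    return np.clip(g, 0, 255).astype(np.uint8)
```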
(2) Light compensation based on dynamic threshold
An algorithm based on the total reflection theory converts the image into the YCbCr color space and then takes the set of points with the largest Y components in the YCbCr image as the white reference points. The detailed process is as follows:
Assuming the original gesture image is f(x, y) with size m × n, then:
step 1, firstly, converting an original gesture image f (x, y) from an RGB color space to a YCbCr color space by using a formula (2):
[Formula (2), shown as an image in the original document: the RGB-to-YCbCr color space conversion matrix.]
step 2, obtaining a reference white point
(a) Cutting the converted image into M × N blocks, where M is 3 and N is 4 in this embodiment;
(b) for each block, compute the mean values Mb and Mr of the Cb and Cr components in YCbCr space;
(c) using Mb and Mr, compute the mean absolute deviations Db and Dr of the Cb and Cr components of each block, as shown in formula (3):
Db = (1/sum) Σ |Cb(i, j) − Mb|, Dr = (1/sum) Σ |Cr(i, j) − Mr| (3)
In formula (3), Cb(i, j) is the offset of the blue component of each pixel relative to luminance, Cr(i, j) is the offset of the red component relative to luminance, and sum is the total number of pixels in the current block.
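The patent text stops at formula (3); the candidate-point rule and channel gains in the sketch below follow the commonly used dynamic-threshold white-balance formulation and are therefore assumptions added only to make the step concrete.

```python
import cv2
import numpy as np

def dynamic_threshold_white_balance(bgr, blocks=(3, 4), top_ratio=0.1):
    """Dynamic-threshold light compensation sketch.

    Steps (a)-(c) follow the text: split the YCbCr image into M x N blocks
    and compute per-block means Mb, Mr and mean absolute deviations Db, Dr.
    The near-white candidate rule and the gain computation are assumed, since
    the patent does not spell them out."""
    ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb).astype(np.float64)
    y, cr, cb = cv2.split(ycrcb)

    M, N = blocks
    h, w = y.shape
    mb = np.zeros_like(cb); mr = np.zeros_like(cr)
    db = np.zeros_like(cb); dr = np.zeros_like(cr)
    for i in range(M):
        for j in range(N):
            ys, ye = i * h // M, (i + 1) * h // M
            xs, xe = j * w // N, (j + 1) * w // N
            cb_blk, cr_blk = cb[ys:ye, xs:xe], cr[ys:ye, xs:xe]
            Mb, Mr = cb_blk.mean(), cr_blk.mean()
            Db = np.abs(cb_blk - Mb).mean()       # formula (3) for this block
            Dr = np.abs(cr_blk - Mr).mean()
            mb[ys:ye, xs:xe], mr[ys:ye, xs:xe] = Mb, Mr
            db[ys:ye, xs:xe], dr[ys:ye, xs:xe] = Db, Dr

    # assumed near-white candidate rule
    cand = (np.abs(cb - (mb + db * np.sign(mb - 128))) < 1.5 * db) & \
           (np.abs(cr - (1.5 * mr + dr * np.sign(mr - 128))) < 1.5 * dr)
    ys_cand = y[cand]
    if ys_cand.size == 0:
        return bgr
    thresh = np.quantile(ys_cand, 1.0 - top_ratio)   # brightest candidates as reference white
    ref = cand & (y >= thresh)

    gains = y.max() / np.maximum(bgr[ref].mean(axis=0), 1e-6)   # per-channel gains
    out = np.clip(bgr.astype(np.float64) * gains, 0, 255)
    return out.astype(np.uint8)
```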
2. When the preprocessed original image is segmented based on color information in step 3, a skin color segmentation algorithm based on the YCbCr color space is adopted, as follows:
The YCbCr color space is often loosely called the YUV color space; Y denotes luminance, and Cb and Cr denote chrominance and saturation. Cr reflects the difference between the red component of the RGB input signal and the luminance of the RGB signal, and Cb reflects the difference between the blue component and that luminance. The conversion from the RGB color space to YCbCr is shown in formula (4):
[Formula (4), shown as an image in the original document: the RGB-to-YCbCr conversion.]
through repeated experiments, the basic values of the parameters are as follows:
77≤Cb≤127 AND 132≤Cr≤172 (5)
However, the range in formula (5) covers too wide a band of skin colors and easily introduces interference from orange or brown objects. Targeting the skin color characteristics of East Asian skin tones, the invention narrows these values after repeated tuning so that skin-color-like objects are effectively excluded; the adjusted values are as follows:
[Formula (6), shown as an image in the original document: the narrowed Cb/Cr thresholds obtained after tuning.]
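As an illustration only, the following sketch segments the hand with OpenCV using the published ranges of formula (5); the narrowed thresholds of formula (6) are not reproduced because they appear only as an image, and the choice of the largest skin-colored contour as the hand is an assumption.

```python
import cv2
import numpy as np

# Skin-color ranges from formula (5); the narrowed ranges of formula (6)
# are only shown as an image in the original and are not reproduced here.
CB_MIN, CB_MAX = 77, 127
CR_MIN, CR_MAX = 132, 172

def extract_hand_contour(bgr):
    """Segment skin-colored pixels in YCbCr and return the largest contour,
    assumed here to be the hand region (illustrative sketch)."""
    ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)
    _, cr, cb = cv2.split(ycrcb)
    mask = ((cb >= CB_MIN) & (cb <= CB_MAX) &
            (cr >= CR_MIN) & (cr <= CR_MAX)).astype(np.uint8) * 255
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None, mask
    hand = max(contours, key=cv2.contourArea)
    return hand, mask
```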
3. The binarized convolutional neural network of step 1 is a binarization-based convolutional neural network (BCNN) built on MPCNN, as follows:
Popular deep convolutional neural network algorithms share a common drawback: their computational cost is huge. Optimization of a network's computational consumption therefore focuses mainly on memory footprint and arithmetic cost. Building on the MPCNN gesture classification method, a binarized convolutional neural network (BCNN) gesture classification method is proposed; a binary approximation strategy modifies the neural network and reduces its consumption of computing resources. A binarized network reduces computational cost in two main ways: first, the original double-precision weights are represented by binarized approximate weights, reducing the memory the network occupies during computation; second, the inputs and weights in each layer's most expensive multiplications are replaced by their binary approximations, so that multiplication can be reduced to addition and subtraction or even bit operations. The modification covers both the convolution blocks and the fully connected blocks.
(1) Binarization of the convolution block.
The specific way of performing binary approximate reconstruction on the convolutional neural network is as follows:
First, during forward propagation, the weight matrix w of the convolutional network is binarized according to formula (7) to obtain wb, while the original floating-point weight matrix w is retained:
wb = sign(w), wb ∈ {−1, +1}^(cf × wf × hf) (7)
In formula (7), wb is the matrix of binarized approximate weights, and cf, wf and hf denote the number, width and height of the convolution kernels. In the standard sign function, sign(w) = 0 when w = 0; here a third value is not allowed, so that the binarization stays two-valued, and sign(w) is therefore defined to be 1 when w = 0.
Second, a binarization activation layer is added before each layer to binarize the node values coming from the previous layer, replacing the original ReLU activation layer, as shown in formula (8):
Xb(i) = L(X(i−1)) = sign(X(i−1)) (8)
In formula (8), Xb(i) ∈ {−1, +1}^(c × w × h) is the binarized input of the i-th layer of the network, where c, w and h are the number of channels, the width and the height of the input image; L(X(i−1)) is the value obtained from the i-th binarization activation layer; and X(i−1) is the input value of layer i−1 of the binarized network.
The sign function is the same as in formula (8). Finally, the binarized weights wb and inputs Xb are convolved in the binary convolution layer, as shown in formula (9):
Lb(Xb) = Xb ⊛ wb (9)
In formula (9), Lb(Xb) is the binary network layer function, ⊛ denotes the convolution operation, Xb is the binarized input Xb(i), and wb and Xb are obtained from formula (7) and formula (8) respectively.
The structure of the convolution block also needs some adjustment: the BatchNorm normalization layer and the binarization activation layer are placed before the convolution operation, to prevent the output of the binarization activation layer from being mostly 1 after passing through the max-pooling layer. The specific network structure is shown in Fig. 2.
The back-propagation pass of training proceeds as follows: the gradient of the last layer is computed, the node gradients and weight gradients are then propagated back layer by layer from the penultimate layer to the first layer, the retained pre-binarization weights w are updated to obtain wu, and the clipping operation of formula (10) is applied:
[Formula (10), shown as an image in the original document: the clipping operation applied to the updated weights wu.]
In formula (10), wu is the updated value of the floating-point weights retained during forward propagation; σ(wu) is the probability that the weight wu is greater than 0; and clip(·) denotes the max (clipping) function.
(2) Binarization of the fully connected block.
Binarization of the fully connected block is essentially the same as that of the convolution block, except that the binarized convolution layer is replaced by a binarized fully connected layer and the max-pooling layer is removed. The binarized fully connected layer is computed as in formula (11):
Lb(Xb) = wbXb (11)
In formula (11), Lb(Xb) is the binarized fully connected layer function, and Xb and wb are obtained from formula (8) and formula (7) respectively. The binarized fully connected layer omits the bias b.
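For illustration, the PyTorch sketch below mirrors the forward binarization of formulas (7)-(9), the layer ordering of Fig. 2, and the post-update weight clipping; because the patent gives only the forward formulas, the straight-through estimator used for the sign gradient is an assumption, and the layer sizes are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinarizeSTE(torch.autograd.Function):
    """sign() binarization with sign(0) = 1, using a straight-through
    estimator for the gradient (an assumption; the patent gives only the
    forward formulas)."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        out = torch.ones_like(x)
        out[x < 0] = -1.0                   # sign(w) = 1 when w == 0, as specified
        return out

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * (x.abs() <= 1).float()   # pass gradients only where |x| <= 1

binarize = BinarizeSTE.apply

class BinaryConvBlock(nn.Module):
    """Convolution block of the BCNN: BatchNorm and the binarization
    activation are placed before the convolution, followed by max pooling
    (the ordering described for Fig. 2)."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.bn = nn.BatchNorm2d(in_ch)
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.1)

    def forward(self, x):
        x_b = binarize(self.bn(x))           # formula (8): binarized inputs
        w_b = binarize(self.weight)          # formula (7): binarized weights
        out = F.conv2d(x_b, w_b, padding=1)  # formula (9): binary convolution, no bias
        return F.max_pool2d(out, 2)

    def clip_weights(self):
        # keep the retained floating-point weights in [-1, 1] after each
        # optimizer step (the post-update clipping of formula (10), as is
        # commonly done in binarized networks)
        with torch.no_grad():
            self.weight.clamp_(-1.0, 1.0)
```

The fully connected block of formula (11) would be analogous, with F.linear in place of F.conv2d and without the pooling; after each training step one would call clip_weights() on every block.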
4. In step 6, the TLD algorithm tracks the gesture trajectory, a Haar classifier corrects deviations during tracking, and the HMM algorithm recognizes the dynamic gesture; the specific method is as follows:
4.1, the TLD Algorithm framework consists of three parts: tracking, learning and detecting, as shown in fig. 3:
Within this framework the three parts cooperate and complement each other to track the object. The tracking module assumes that the object does not move fast, that it undergoes no large displacement between adjacent frames, and that the tracked target stays within the camera's field of view; on this basis it estimates the moving target, and if the target leaves the field of view, tracking fails. The detection module assumes that the video frames are independent of each other; using the models obtained from past detection and learning, it searches every frame for the target and marks the regions where the target may appear. When the detection module makes errors, the learning module evaluates those errors against the result obtained by the tracking module, generates training samples, and updates the target model of the detection module and the key feature points of the tracking module, so that similar errors are avoided. A detailed flow chart of the TLD algorithm is shown in Fig. 4.
The TLD algorithm tracks targets with good real-time performance, and when the target is occluded or leaves the camera's field of view and then reappears, it can still be identified and tracked. However, the algorithm requires the tracked target to be selected manually with the mouse at initialization, which hinders automated tracking; and although the LBP features used in the detection module are cheap to compute and easily meet the real-time requirement, positional drift can occur during tracking and lead to tracking failure. The system therefore combines static gesture recognition with gesture tracking on top of the original TLD algorithm and improves it as follows:
To remove the need to select the target region manually at initialization, a static gesture recognition database is added to the detection module; when a gesture matching the gesture database appears in a video frame, the TLD tracking algorithm is initialized automatically. Moreover, because a trained static gesture database is used, the learning module of the original TLD algorithm can be removed: when the user's gesture changes, it is only necessary to search the video frame again for a gesture from the gesture database and then re-initialize the TLD algorithm. The flow of the improved TLD algorithm is shown in Fig. 5.
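A minimal sketch of this automatic-initialization loop is given below; detect_static_gesture is a hypothetical stand-in for the trained static gesture classifier (for example the BCNN above), and the MedianFlow tracker from opencv-contrib is used only as a stand-in for TLD's tracking part, so both are assumptions rather than the patent's exact implementation.

```python
import cv2

def run_improved_tld(video_path, detect_static_gesture):
    """Automatic-initialization tracking loop sketched from Fig. 5.

    detect_static_gesture(frame) is a hypothetical helper: it returns
    (label, bbox) when a gesture from the database is found, else (None, None).
    Requires opencv-contrib-python for the legacy tracker used here."""
    cap = cv2.VideoCapture(video_path)
    tracker = None
    trajectory = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if tracker is None:
            label, bbox = detect_static_gesture(frame)
            if bbox is not None:
                tracker = cv2.legacy.TrackerMedianFlow_create()
                tracker.init(frame, bbox)        # automatic initialization, no mouse selection
        else:
            ok, bbox = tracker.update(frame)
            if ok:
                x, y, w, h = bbox
                trajectory.append((x + w / 2, y + h / 2))   # track the gesture centroid
            else:
                tracker = None                   # target lost: fall back to detection
    cap.release()
    return trajectory
```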
4.2 Correcting deviations during tracking with a Haar classifier
The method mainly comprises extracting Haar features and training a classifier. The Haar features mainly include center features, linear features, edge features and diagonal features. To obtain the final Haar classifier, the invention trains with an improved Adaboost algorithm: different weak classifiers are first trained on the Haar features extracted from the samples, and these weak classifiers are then combined into the final strong classifier, i.e. the Haar classifier required here.
The implementation flow of the improved Adaboost algorithm is as follows:
Let X be the sample space and Y the set of sample class labels. For a typical two-class problem, Y = {0, 1}. Let S = {(xi, yi) | i = 1, 2, 3, ..., m} be the labeled training set, where xi ∈ X and yi ∈ Y, and suppose T iterations are performed before the final goal is reached.
Step 1, initializing weights of m samples:
D1(i) = 1/m, i = 1, 2, ..., m (12)
In the formula, Dt(i) denotes the weight of sample (xi, yi) in the t-th iteration.
Step 2: for t = 1, 2, 3, ..., T, compute:
(a) for each feature f of a sample x, train a weak classifier ht(x, f, p, θ):
ht(x, f, p, θ) = 1 if p·f(x) < p·θ, and 0 otherwise (13)
In formula (13), θ is the threshold of the weak classifier corresponding to f, and p adjusts the direction of the inequality. The classification error rate εf of every feature's weak classifier, weighted by qi, is computed as:
εf = Σi qi |ht(xi, f, p, θ) − yi| (14)
In formula (14), yi is an element of the sample class label space and qi is the weight of the i-th training sample.
(b) select the optimal weak classifier ht, i.e. the one with the minimum error rate εt:
εt = minf,p,θ Σi qi |ht(xi, f, p, θ) − yi| (15)
(c) Sample weights are modified using the best weak classifier:
Dt+1(i) = Dt(i) βt^(1−ei) (16)
βt = εt / (1 − εt) (17)
In formula (16), Dt+1(i) is the weight (probability value) of the i-th training sample in iteration t+1; Dt+1 is related to Dt iteratively, so Dt+1 can be updated from Dt.
In formula (17), βt is a normalization constant.
If sample xi is correctly classified, then ei = 0; otherwise ei = 1.
Step 3: the final Haar classifier C(x) is:
C(x) = 1 if Σt αt ht(x) ≥ (1/2) Σt αt, and 0 otherwise (18)
αt = log(1/βt) (19)
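Purely for illustration, a compact sketch of this weighted-error AdaBoost loop over precomputed Haar feature values follows; the Haar feature extraction itself is omitted, and the exhaustive threshold search is a simplification of the procedure above.

```python
import numpy as np

def train_adaboost(features, labels, T=50):
    """AdaBoost with threshold weak classifiers (decision stumps).

    features: (n_samples, n_features) matrix of Haar feature values
              (feature extraction itself is omitted in this sketch).
    labels:   array of 0/1 class labels.
    Returns a list of (feature index, polarity p, threshold theta, alpha)."""
    n, _ = features.shape
    q = np.full(n, 1.0 / n)                  # formula (12): uniform initial weights
    strong = []
    for _ in range(T):
        q = q / q.sum()
        best = None
        for f in range(features.shape[1]):
            vals = features[:, f]
            for theta in np.unique(vals):
                for p in (1, -1):
                    pred = (p * vals < p * theta).astype(float)   # formula (13)
                    err = np.sum(q * np.abs(pred - labels))        # formula (14)
                    if best is None or err < best[0]:
                        best = (err, f, p, theta, pred)
        eps, f, p, theta, pred = best                              # formula (15)
        eps = min(max(eps, 1e-10), 1 - 1e-10)
        beta = eps / (1.0 - eps)                                   # formula (17)
        e = np.abs(pred - labels)                                  # e_i = 0 if correct
        q = q * beta ** (1.0 - e)                                  # formula (16)
        strong.append((f, p, theta, np.log(1.0 / beta)))           # alpha_t, formula (19)
    return strong

def classify(strong, x):
    """Final strong classifier C(x) of formula (18)."""
    s = sum(a * float(p * x[f] < p * theta) for f, p, theta, a in strong)
    return int(s >= 0.5 * sum(a for _, _, _, a in strong))
```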
4.3 HMM-based dynamic gesture trajectory recognition
In the invention, a hidden Markov model (HMM) is used to recognize the dynamic gesture trajectory; the recognition process corresponds to the three classical problems solved with a hidden Markov model:
(1) The evaluation problem
Given a hidden Markov model λ = (π, A, B) and an observation sequence O = (o1, o2, ..., oT) generated by the model, compute the likelihood P(O | λ) of that observation sequence. One effective algorithm for this problem is the forward-backward recursion.
The forward variables are defined as:
αt(i) = P(o1, o2, ..., ot, qt = θi | λ), 1 ≤ t ≤ T (19)
In formula (19), P(·) denotes the likelihood probability of the observation sequence; o1, o2, ..., oT is the observation sequence; qt is the state at time t; θi is a system state value; λ denotes the hidden Markov model; T is the total observation time; and t is the time index, taking values between 0 and T.
Writing bj(ot) = bjk when ot = vk, where bj(ot) is the observation probability, bjk is an entry of the system observation matrix B, and vk is the k-th observation symbol, the forward algorithm proceeds as follows:
Initialization:
α1(i) = πi bi(o1), 1 ≤ i ≤ N (20)
In formula (20), α1(i) is the probability of observing o1 at time 1 while being in hidden state θi; πi is the initial probability distribution.
Recursion:
αt+1(j) = [ Σi αt(i) aij ] bj(ot+1), 1 ≤ t ≤ T−1, 1 ≤ j ≤ N (21)
In formula (21), αt+1(j) is the forward probability of being in hidden state θj at time t+1, and aij is an entry of the system state transition matrix A.
Calculate P(O | λ):
P(O | λ) = Σi αT(i)
P(O | λ) is the likelihood of generating the observation sequence O under the current model λ. The backward variables are then defined as:
βt(i) = P(ot+1, ot+2, ..., oT | qt = θi, λ), 1 ≤ t ≤ T (22)
In formula (22), βt(i) is the backward probability at time t, i.e. the probability of the remaining observations given the state at time t.
The backward algorithm comprises the following steps:
initialization:
βT(i) = 1, 1 ≤ i ≤ N (23)
Recursion:
βt(i) = Σj aij bj(ot+1) βt+1(j) (24)
t = T−1, T−2, ..., 1; 1 ≤ i ≤ N
Calculate P(O | λ):
P(O | λ) = Σi πi bi(o1) β1(i) (25)
Using the forward algorithm for the first part of the computation (over the period 0 to t) and the backward algorithm for the second part (over the period t to T), the probability can also be obtained as:
P(O | λ) = Σi αt(i) βt(i) (26)
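For illustration, a compact numpy sketch of the forward and backward recursions of formulas (20)-(26) follows; the model parameters π, A and B are assumed to be given (for example estimated with the Baum-Welch algorithm mentioned below), and the toy values at the end are placeholders.

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward recursion: alpha[t, i] = P(o_1..o_t, q_t = i | lambda)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                           # formula (20)
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]]   # formula (21)
    return alpha

def backward(A, B, obs):
    """Backward recursion: beta[t, i] = P(o_{t+1}..o_T | q_t = i, lambda)."""
    T, N = len(obs), A.shape[0]
    beta = np.zeros((T, N))
    beta[T - 1] = 1.0                                      # formula (23)
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])     # formula (24)
    return beta

def likelihood(pi, A, B, obs):
    """P(O | lambda) via the combined identity of formula (26)."""
    alpha = forward(pi, A, B, obs)
    beta = backward(A, B, obs)
    return float(np.sum(alpha[0] * beta[0]))               # same value for any t

# minimal usage example with a toy 2-state, 3-symbol model (placeholder values)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(likelihood(pi, A, B, [0, 1, 2]))
```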
(2) The decoding problem
Given a hidden Markov model λ = (π, A, B) and an observation sequence O = (o1, o2, ..., oT) generated by the model, compute, on the basis of the observation sequence, the optimal state sequence Q* = (q1*, q2*, ..., qT*) experienced by the model while generating that sequence.
Here, the Viterbi algorithm is used.
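As a sketch only, the decoding step could use the standard Viterbi recursion below, with the same π, A, B conventions as the forward-backward sketch above:

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely state sequence Q* for the observation sequence obs."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))            # best path probability ending in each state
    psi = np.zeros((T, N), dtype=int)   # backpointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        trans = delta[t - 1][:, None] * A                 # candidate transitions
        psi[t] = np.argmax(trans, axis=0)
        delta[t] = trans[psi[t], np.arange(N)] * B[:, obs[t]]
    # backtrack the optimal path
    path = [int(np.argmax(delta[T - 1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return list(reversed(path))
```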
(3) The learning problem
Given an observation sequence O = (o1, o2, ..., oT) generated by the model but without knowledge of the hidden Markov model parameters, adjust the model parameters so that the likelihood P(O | λ) is maximized. In the present system the learning problem is solved with the Baum-Welch algorithm.
The gesture recognition platform collects gesture images through a camera and converts the gesture commands they contain into instructions the computer can execute. A sample database is needed first, and static gesture recognition and dynamic gesture trajectory recognition are carried out on its basis: gesture images can be acquired from the camera or directly from a local video file; once a gesture image is obtained, gesture segmentation, image binarization, feature extraction and similar operations are performed on it; finally, gesture recognition is performed on the image and the recognition result is returned so that the process can be observed. The system software design flow is shown in Fig. 6. The system is developed with multiple threads: image preprocessing and gesture segmentation are done in sub-thread 1, and dynamic gesture tracking and recognition are done in sub-thread 3.

Claims (1)

1. A gesture recognition method based on deep learning is characterized by comprising the following steps:
step 1, training a binary convolution neural network by utilizing a gesture training set and a test set;
the binarization-based convolutional neural network BCNN based on the MPCNN is adopted by the binarization-based convolutional neural network in the step 1, and specifically comprises the following steps:
Building on the MPCNN gesture classification method, a binarization-based convolutional neural network gesture classification method is provided; a binary approximation strategy modifies the neural network and reduces its consumption of computing resources; the binarized network reduces computational consumption in two ways: first, the original double-precision weights are represented by binarized approximate weights, reducing the memory the network occupies during computation; second, the inputs and weights in each layer's most expensive multiplications are replaced by binary approximate values, so that multiplication can be reduced to addition and subtraction or even bit operations, covering the modification of the convolution blocks and of the fully connected blocks;
A first part: binarization of the convolution blocks;
the specific way of performing binary approximate reconstruction on the convolutional neural network is as follows:
Step 1011, during forward propagation, the weight matrix w of the convolutional network is binarized according to formula (7) to obtain wb, while the original weight matrix w is retained:
wb = sign(w), wb ∈ {−1, +1}^(cf × wf × hf) (7)
In formula (7), wb is the matrix of binarized approximate weights, and cf, wf and hf denote the number, width and height of the convolution kernels;
sign(w) is specified to be 1 when w = 0;
Step 1012, a binarization activation layer is added before each layer to binarize the node values coming from the previous layer, replacing the original ReLU activation layer, as shown in formula (8):
Xb(i) = L(X(i−1)) = sign(X(i−1)) (8)
In formula (8), Xb(i) ∈ {−1, +1}^(c × w × h) is the binarized input of the i-th layer of the network, where c, w and h are the number of channels, the width and the height of the input image; L(X(i−1)) is the value obtained from the i-th binarization activation layer; X(i−1) is the input value of layer i−1 of the binarized network;
The sign function is the same as in formula (8); finally, the binarized weights wb and inputs Xb are convolved in the binary convolution layer, as shown in formula (9):
Lb(Xb) = Xb ⊛ wb (9)
In formula (9), Lb(Xb) is the binary network layer function, ⊛ denotes the convolution operation, Xb is the binarized input Xb(i), and wb and Xb are obtained from formula (7) and formula (8) respectively;
For the convolution block, the BatchNorm normalization layer and the binarization activation layer are placed before the convolution operation, to prevent the output of the binarization activation layer from being mostly 1 after passing through the max-pooling layer;
The back-propagation pass of the binarized convolutional neural network training proceeds as follows: the gradient of the last layer is computed, the node gradients and weight gradients are propagated back layer by layer from the penultimate layer to the first layer, the retained pre-binarization weights w are updated to obtain wu, and the clipping operation of formula (10) is applied:
[Formula (10), shown as an image in the original document: the clipping operation applied to the updated weights wu.]
In formula (10), wu is the updated value of the floating-point weights retained during forward propagation; σ(wu) is the probability that the weight wu is greater than 0; clip(·) denotes the max (clipping) function;
A second part: binarization of the fully connected block;
Binarization of the fully connected block is essentially the same as that of the convolution block, except that the binarized convolution layer is replaced by a binarized fully connected layer and the max-pooling layer is removed; the binarized fully connected layer is computed as in formula (11):
Lb(Xb) = wbXb (11)
In formula (11), Lb(Xb) is the binarized fully connected layer function; Xb and wb are obtained from formula (8) and formula (7) respectively; the binarized fully connected layer omits the bias b;
step 2, after the original gesture image is collected, preprocessing the original gesture image to remove the influence of illumination on the original image;
in step 2, the preprocessing of the original gesture image includes: the luminance correction based on exponential transformation and logarithmic transformation and the light compensation based on dynamic threshold specifically comprise the following steps:
step 201, luminance correction based on exponential transformation and logarithmic transformation:
Exponential transformation corrects only the bright regions of an image well, while logarithmic transformation corrects the dark regions well; the two are combined into a light compensation strategy for the human hand, as shown in formula (1): a corrected exponential transformation is applied to bright regions, a parametric logarithmic transformation is applied to dark regions, and other regions are left unchanged:
[Formula (1), shown as an image in the original document: piecewise correction g(x, y) combining the corrected exponential transformation for bright regions and the parametric logarithmic transformation for dark regions.]
The parameters used in formula (1) are as follows:
g(x, y) is the corrected image; f(x, y) is the original gesture image; a is the highlight adjustment coefficient; b is the average luminance of the image; c is a positive constant obtained by experimental tuning; d = 1/255 − T2; T1 is the lower light threshold under dim lighting; T2 is the upper light threshold under brighter lighting;
step 202, dynamic threshold based light compensation
The method comprises the following steps of converting an image into a YCbCr color space by an algorithm based on a total reflection theory, and then taking a set of points with larger Y components in the YCbCr color space as a white reference point:
assuming that the original gesture image is f (x, y) and the size is m × n, then:
step 2021, convert the original gesture image f (x, y) from RGB color space to YCbCr color space using equation (2):
[Formula (2), shown as an image in the original document: the RGB-to-YCbCr color space conversion matrix.]
Step 2022, obtaining the reference white point, comprising the following steps:
(a) cut the converted image into M × N blocks;
(b) for each block, compute the mean values Mb and Mr of the Cb and Cr components in YCbCr space;
(c) using Mb and Mr, compute the mean absolute deviations Db and Dr of the Cb and Cr components of each block, as shown in formula (3):
Db = (1/sum) Σ |Cb(i, j) − Mb|, Dr = (1/sum) Σ |Cr(i, j) − Mr| (3)
In formula (3), Cb(i, j) is the offset of the blue component of each pixel relative to luminance, Cr(i, j) is the offset of the red component relative to luminance, and sum is the total number of pixels in the current block;
step 3, segmenting the preprocessed original image based on color information by using the color information reflected by skin color, and extracting a gesture outline;
in step 3, when the preprocessed original image is segmented based on the color information, a segmentation algorithm based on the YCbCr color space skin color is adopted, and the method specifically comprises the following steps:
the YCbCr color space is also called YUV color space, Y represents brightness, and Cr and Cb represent chroma and saturation, where Cr reflects the difference between the red part of the RGB input signal and the brightness value of the RGB signal, and Cb reflects the difference between the blue part of the RGB input signal and the brightness value of the RGB signal;
the conversion formula from RGB color space to YCrCb is shown in equation (4):
[Formula (4), shown as an image in the original document: the RGB-to-YCbCr conversion.]
through repeated experiments, the basic values of the parameters are as follows:
77≤Cb≤127 AND 132≤Cr≤172 (5)
However, the range in formula (5) covers too wide a band of skin colors and easily introduces interference from orange or brown objects; the values are therefore narrowed after repeated tuning so that skin-color-like objects are effectively excluded; the adjusted values are as follows:
[Formula (6), shown as an image in the original document: the narrowed Cb/Cr thresholds obtained after tuning.]
Step 4, judging whether the gesture contour extracted in step 3 marks the start or end point of a dynamic gesture; if so, the gesture contours extracted from the subsequent series of images form a dynamic gesture and the method proceeds to step 6; otherwise the extracted gesture contour is a static gesture and the method proceeds to step 5;
step 5, judging a gesture instruction corresponding to the gesture outline by using the trained binary convolution neural network;
step 6, positioning start points and stop points of dynamic gestures corresponding to a series of gesture outlines, tracking gesture tracks by using a TLD algorithm, correcting deviations in the tracking process by using a Haar classifier, and recognizing the dynamic gestures by using an HMM algorithm;
in step 6, tracking the gesture track by using a TLD algorithm, correcting the deviation in the tracking process by using a Haar classifier, and identifying the dynamic gesture by using an HMM algorithm, wherein the specific method comprises the following steps:
step 601, the TLD algorithm framework consists of three parts: tracking, learning and detecting, wherein in an algorithm frame, the three parts are cooperated and complemented to finish the tracking of the object; in the tracking module, the precondition is that the object motion speed is not high, the object can not generate large-amplitude displacement between two adjacent frames, and the tracked target is always within the range of the camera, so as to estimate the moving target, and if the target disappears from the visual field, the tracking failure can be caused; in the detection module, on the premise that no interference is generated between each frame of the video, a detection algorithm is used for searching targets in each frame of image respectively through a model detected and learned in the past, and possible occurring areas of the targets are calibrated; when the detection module has errors, the learning module evaluates the errors of the detection module according to the result obtained by the tracking module, generates a training sample, and updates the target model of the detection module and the key characteristic points of the tracking module, thereby avoiding similar errors;
step 602, correcting bias in tracking process by using Haar classifier
The method mainly comprises the steps of extracting Haar features and training a classifier; the Haar features mainly comprise central features, linear features, edge features and diagonal features; in order to obtain the final Haar classifier, an improved Adaboost algorithm is adopted to train: firstly, training different weak classifiers by using Haar features extracted from a sample, and then integrating the weak classifiers to obtain a final strong classifier, namely a Haar classifier;
the implementation flow of the improved Adaboost algorithm is as follows:
Suppose X is the sample space and Y the set of sample class labels; for a typical two-class problem, Y = {0, 1}; let S = {(xi, yi) | i = 1, 2, 3, ..., m} be the set of labeled training samples, where xi ∈ X and yi ∈ Y, and suppose T iterations in total are performed before the final target is reached;
step 6021, initializing the weights of the m samples:
D1(i) = 1/m, i = 1, 2, ..., m (12)
In the formula, Dt(i) denotes the weight of sample (xi, yi) in the t-th iteration;
Step 6022, for t = 1, 2, 3, ..., T, compute:
(a) for each feature f of a sample x, train a weak classifier ht(x, f, p, θ):
ht(x, f, p, θ) = 1 if p·f(x) < p·θ, and 0 otherwise (13)
In formula (13), θ is the threshold of the weak classifier corresponding to f, and p adjusts the direction of the inequality; the classification error rate εf of every feature's weak classifier, weighted by qi, is computed as:
εf = Σi qi |ht(xi, f, p, θ) − yi| (14)
In formula (14), yi is an element of the sample class label space and qi is the weight of the i-th training sample;
(b) select the optimal weak classifier ht, i.e. the one with the minimum error rate εt:
εt = minf,p,θ Σi qi |ht(xi, f, p, θ) − yi| (15)
(c) Sample weights are modified using the best weak classifier:
Dt+1(i) = Dt(i) βt^(1−ei) (16)
βt = εt / (1 − εt) (17)
In formula (16), Dt+1(i) is the weight (probability value) of the i-th training sample in iteration t+1; Dt+1 is related to Dt iteratively, so Dt+1 is updated from Dt;
In formula (17), βt is a normalization constant;
If sample xi is correctly classified, then ei = 0; otherwise ei = 1;
Step 6023, the final Haar classifier C(x) is:
C(x) = 1 if Σt αt ht(x) ≥ (1/2) Σt αt, and 0 otherwise (18)
αt = log(1/βt) (19)
step 603, HMM-based dynamic gesture track recognition
The hidden Markov model is used for recognizing the dynamic gesture track, and the recognition process corresponds to three processes solved by the hidden Markov model:
Step 6031, the evaluation problem
Given a hidden Markov model λ = (π, A, B) and an observation sequence O = (o1, o2, ..., oT) generated by the model, compute the likelihood P(O | λ); one effective algorithm for this problem is the forward-backward recursion:
The forward variables are defined as:
αt(i) = P(o1, o2, ..., ot, qt = θi | λ), 1 ≤ t ≤ T (19)
In formula (19), P(·) denotes the likelihood probability of the observation sequence; o1, o2, ..., oT is the observation sequence; qt is the state at time t; θi is a system state value; λ denotes the hidden Markov model; T is the total observation time; t is the time index, taking values between 0 and T;
Writing bj(ot) = bjk when ot = vk, where bj(ot) is the observation probability, bjk is an entry of the system observation matrix B, and vk is the k-th observation symbol, the forward algorithm proceeds as follows:
initialization:
α1(i) = πi bi(o1), 1 ≤ i ≤ N (20)
In formula (20), α1(i) is the probability of observing o1 at time 1 while being in hidden state θi; πi represents the initial probability distribution matrix;
recursion:
αt+1(j) = [ Σi αt(i) aij ] bj(ot+1), 1 ≤ t ≤ T−1, 1 ≤ j ≤ N (21)
In formula (21), αt+1(j) is the forward probability of being in hidden state θj at time t+1, and aij is an entry of the system state transition matrix A;
Calculate P(O | λ):
P(O | λ) = Σi αT(i)
P(O | λ) is the likelihood of generating the observation sequence O under the current model λ; the backward variables are defined as:
βt(i) = P(ot+1, ot+2, ..., oT | qt = θi, λ), 1 ≤ t ≤ T (22)
In formula (22), βt(i) is the backward probability at time t, i.e. the probability of the remaining observations given the state at time t;
the backward algorithm comprises the following steps:
initialization:
βT(i) = 1, 1 ≤ i ≤ N (23)
Recursion:
βt(i) = Σj aij bj(ot+1) βt+1(j), t = T−1, T−2, ..., 1; 1 ≤ i ≤ N (24)
Calculate P(O | λ):
P(O | λ) = Σi πi bi(o1) β1(i) (25)
Using the forward algorithm for the first part of the computation (over the period 0 to t) and the backward algorithm for the second part (over the period t to T), the probability is obtained as:
P(O | λ) = Σi αt(i) βt(i) (26)
Step 6032, the decoding problem
Given a hidden Markov model λ = (π, A, B) and an observation sequence O = (o1, o2, ..., oT) generated by the model, compute, on the basis of the observation sequence, the optimal state sequence Q* = (q1*, q2*, ..., qT*) experienced by the model while generating that sequence;
Here the Viterbi algorithm is used;
Step 6033, the learning problem
Given an observation sequence O = (o1, o2, ..., oT) generated by the model but without knowledge of the hidden Markov model parameters, adjust the model parameters so that the likelihood P(O | λ) is maximized.
CN201810242638.4A 2018-03-22 2018-03-22 Gesture recognition method based on deep learning Active CN108537147B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810242638.4A CN108537147B (en) 2018-03-22 2018-03-22 Gesture recognition method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810242638.4A CN108537147B (en) 2018-03-22 2018-03-22 Gesture recognition method based on deep learning

Publications (2)

Publication Number Publication Date
CN108537147A CN108537147A (en) 2018-09-14
CN108537147B true CN108537147B (en) 2021-12-10

Family

ID=63483626

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810242638.4A Active CN108537147B (en) 2018-03-22 2018-03-22 Gesture recognition method based on deep learning

Country Status (1)

Country Link
CN (1) CN108537147B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109508670B (en) * 2018-11-12 2021-10-12 东南大学 Static gesture recognition method based on infrared camera
CN109614922B (en) * 2018-12-07 2023-05-02 南京富士通南大软件技术有限公司 Dynamic and static gesture recognition method and system
CN109634415B (en) * 2018-12-11 2019-10-18 哈尔滨拓博科技有限公司 It is a kind of for controlling the gesture identification control method of analog quantity
CN109684959B (en) * 2018-12-14 2021-08-03 武汉大学 Video gesture recognition method and device based on skin color detection and deep learning
CN110908581B (en) * 2019-11-20 2021-04-23 网易(杭州)网络有限公司 Gesture recognition method and device, computer storage medium and electronic equipment
CN113449573A (en) * 2020-03-27 2021-09-28 华为技术有限公司 Dynamic gesture recognition method and device
CN111753764A (en) * 2020-06-29 2020-10-09 济南浪潮高新科技投资发展有限公司 Gesture recognition method of edge terminal based on attitude estimation
CN112183639B (en) * 2020-09-30 2022-04-19 四川大学 Mineral image identification and classification method
CN112270220B (en) * 2020-10-14 2022-02-25 西安工程大学 Sewing gesture recognition method based on deep learning
CN112784812B (en) * 2021-02-08 2022-09-23 安徽工程大学 Deep squatting action recognition method
US11983327B2 (en) * 2021-10-06 2024-05-14 Fotonation Limited Method for identifying a gesture
CN114049539B (en) * 2022-01-10 2022-04-26 杭州海康威视数字技术股份有限公司 Collaborative target identification method, system and device based on decorrelation binary network
CN114627561B (en) * 2022-05-16 2022-09-23 南昌虚拟现实研究院股份有限公司 Dynamic gesture recognition method and device, readable storage medium and electronic equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106502570A (en) * 2016-10-25 2017-03-15 科世达(上海)管理有限公司 A kind of method of gesture identification, device and onboard system
US20170220122A1 (en) * 2010-07-13 2017-08-03 Intel Corporation Efficient Gesture Processing

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170220122A1 (en) * 2010-07-13 2017-08-03 Intel Corporation Efficient Gesture Processing
CN106502570A (en) * 2016-10-25 2017-03-15 科世达(上海)管理有限公司 A kind of method of gesture identification, device and onboard system

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
A gesture detection and recognition method based on skin color feature extraction; 范文兵 et al.; Modern Electronics Technique (现代电子技术); 2017-09-15; vol. 40, no. 18; pp. 85-88 *
Research on a face detection algorithm using skin color information and geometric features; 韦艳柳 et al.; Wireless Internet Technology (无线互联科技); 2016-11-30; no. 21; pp. 107-111 *
Research on a gesture classification method based on binarized convolutional neural networks; 胡骏飞 et al.; Journal of Hunan University of Technology (湖南工业大学学报); 2017-01-31; vol. 31, no. 1; pp. 75-80 *
Research progress on robot vision-based gesture interaction technology; 齐静 et al.; Robot (机器人); 2017-07-31; vol. 39, no. 4; pp. 565-584 *

Also Published As

Publication number Publication date
CN108537147A (en) 2018-09-14

Similar Documents

Publication Publication Date Title
CN108537147B (en) Gesture recognition method based on deep learning
CN108154118B (en) A kind of target detection system and method based on adaptive combined filter and multistage detection
CN108133188B (en) Behavior identification method based on motion history image and convolutional neural network
CN109460702B (en) Passenger abnormal behavior identification method based on human body skeleton sequence
CN107609460B (en) Human body behavior recognition method integrating space-time dual network flow and attention mechanism
CN111241931B (en) Aerial unmanned aerial vehicle target identification and tracking method based on YOLOv3
CN111079847B (en) Remote sensing image automatic labeling method based on deep learning
CN111191583A (en) Space target identification system and method based on convolutional neural network
CN110334656B (en) Multi-source remote sensing image water body extraction method and device based on information source probability weighting
Yu et al. Research of image main objects detection algorithm based on deep learning
CN113608663B (en) Fingertip tracking method based on deep learning and K-curvature method
CN112906550B (en) Static gesture recognition method based on watershed transformation
CN110728694A (en) Long-term visual target tracking method based on continuous learning
Chen et al. Research on moving object detection based on improved mixture Gaussian model
CN113312973A (en) Method and system for extracting features of gesture recognition key points
CN113326735A (en) Multi-mode small target detection method based on YOLOv5
CN114548256A (en) Small sample rare bird identification method based on comparative learning
Yang et al. A Face Detection Method Based on Skin Color Model and Improved AdaBoost Algorithm.
CN111310827A (en) Target area detection method based on double-stage convolution model
CN111695507B (en) Static gesture recognition method based on improved VGGNet network and PCA
CN112581502A (en) Target tracking method based on twin network
CN112487927B (en) Method and system for realizing indoor scene recognition based on object associated attention
CN113591607B (en) Station intelligent epidemic situation prevention and control system and method
Raj et al. Deep manifold clustering based optimal pseudo pose representation (dmc-oppr) for unsupervised person re-identification
Yamashita et al. Facial point detection using convolutional neural network transferred from a heterogeneous task

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant