CN108537147B - Gesture recognition method based on deep learning - Google Patents
- Publication number: CN108537147B (application CN201810242638.4A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/107—Static hand or arm
- G06V40/113—Recognition of static hand signs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/017—Gesture based interaction, e.g. based on a set of recognized hand gestures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/56—Extraction of image or video features relating to colour
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10024—Color image
Abstract
The invention provides a gesture recognition method based on deep learning, comprising the following steps: training a binary convolution neural network with a gesture training set and a test set; segmenting the preprocessed original image based on the color information reflected by skin color and extracting the gesture outline; judging the gesture instruction corresponding to the gesture outline with the trained binary convolution neural network; and locating the start and stop points of the dynamic gesture corresponding to a series of gesture outlines, tracking the gesture trajectory with the TLD algorithm, correcting deviations in the tracking process with a Haar classifier, and recognizing the dynamic gesture with the HMM algorithm. The method solves the problems of conventional gesture recognition: low recognition accuracy, poor stability, poor real-time performance and a limited set of gestures.
Description
Technical Field
The invention relates to a gesture recognition method based on deep learning, and belongs to the technical field of gesture recognition.
Background
The advent of the computer has profoundly influenced social production and daily life: on one hand it has greatly improved the efficiency of information processing, and on the other it has promoted the development of intelligent living. How to interact with a computer efficiently and conveniently has therefore become a research hotspot.
With the development of information technology, human-computer interaction (HCI) has become an important part of daily life. As a new mode of human-computer interaction, gesture recognition technology has broad application prospects in many fields: (1) digital life and entertainment. For example, in 2008 Ericsson introduced the R520m smartphone, which collected the user's gesture information through its built-in camera and used it in place of the keyboard or touch screen of the phone interface, thereby controlling the alarm clock and incoming calls. (2) Scientific and technological innovation. In space exploration and military research, dangerous or special environments that are inconvenient for direct human control are often encountered, and the relevant information can be obtained by remotely controlling a robot through gestures. (3) Intelligent transportation, such as driverless driving. As early as 2010, Google publicly demonstrated its driverless cars, opening a new era of intelligent transportation.
The gesture recognition technology in the technical field of human-computer interaction can play the following roles:
(1) for users, it helps them use a product more conveniently, saves time and improves the user experience;
(2) for products, it removes the need for lengthy instruction manuals: providing guidance for a few universal gestures is enough.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: the traditional gesture recognition algorithm generally has the problems of low recognition precision, poor stability, poor real-time performance, single gesture function and the like.
In order to solve the technical problem, the technical scheme of the invention is to provide a gesture recognition method based on deep learning, which is characterized by comprising the following steps:
step 1, training a binary convolution neural network by utilizing a gesture training set and a test set;
step 2, after the original gesture image is collected, preprocessing the original gesture image to remove the influence of illumination on the original image;
step 3, segmenting the preprocessed original image based on color information by using the color information reflected by skin color, and extracting a gesture outline;
step 4, judging whether the gesture outline extracted in step 3 marks a start or stop point of a dynamic gesture; if so, the gesture outlines extracted from the subsequent series of images form a dynamic gesture, and the method proceeds to step 6; otherwise the extracted gesture outline is a static gesture, and the method proceeds to step 5;
step 5, judging a gesture instruction corresponding to the gesture outline by using the trained binary convolution neural network;
and 6, positioning start points and stop points of dynamic gestures corresponding to a series of gesture outlines, tracking gesture tracks by using a TLD algorithm, correcting deviations in the tracking process by using a Haar classifier, and recognizing the dynamic gestures by using an HMM algorithm.
Preferably, in step 2, the preprocessing includes brightness correction and light compensation;
during brightness correction, highlight areas of the original gesture image are corrected with a modified exponential transformation, darker areas are corrected with a parameterized logarithmic transformation, and other areas are left unchanged;
light compensation is based on a dynamic threshold: an algorithm based on the total reflection theory converts the original gesture image into the YCbCr color space and then takes the set of points with the largest Y components in the YCbCr image as the white reference points.
Preferably, in step 3, a skin color segmentation algorithm based on the YCbCr color space is adopted when segmenting the original image.
The method provided by the invention solves the problems of conventional gesture recognition: low recognition accuracy, poor stability, poor real-time performance and a limited set of gestures.
Due to the adoption of the technical scheme, compared with the prior art, the invention has the following advantages and positive effects:
the method improves the traditional gesture recognition algorithm based on the conventional technology, uses the improved illumination compensation strategy to enable the original image to be easier to process, uses the improved skin color model to segment the gesture so as to improve the segmentation accuracy, and uses the improved deep convolutional network to classify the static gesture so as to improve the static gesture recognition rate; the improved TLD and HMM algorithm is used for tracking and recognizing the dynamic gesture, so that the robustness, the real-time performance and the recognition rate of a gesture system are improved.
Drawings
FIG. 1 is a system architecture schematic of the design of the deep learning based gesture recognition system of the present invention;
FIG. 2 is a diagram of a binary convolutional neural network architecture of the present invention;
FIG. 3 is a TLD algorithm framework diagram;
FIG. 4 is a detailed flow chart of the TLD algorithm;
FIG. 5 is a flow chart of the modified TLD algorithm;
FIG. 6 is a flow chart of system software design.
Detailed Description
The invention will be further illustrated with reference to the following specific examples. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and such equivalents may fall within the scope of the present invention as defined in the appended claims.
The embodiment of the invention relates to a gesture recognition method based on deep learning, which comprises the following steps as shown in figure 1:
step 1, training a binary convolution neural network by utilizing a gesture training set and a test set;
step 2, after the original gesture image is collected, preprocessing the original gesture image to remove the influence of illumination on the original image;
step 3, segmenting the preprocessed original image based on color information by using the color information reflected by skin color, and extracting a gesture outline;
step 4, judging whether the gesture outline extracted in step 3 marks a start or stop point of a dynamic gesture; if so, the gesture outlines extracted from the subsequent series of images form a dynamic gesture, and the method proceeds to step 6; otherwise the extracted gesture outline is a static gesture, and the method proceeds to step 5;
step 5, judging a gesture instruction corresponding to the gesture outline by using the trained binary convolution neural network;
and 6, positioning start points and stop points of dynamic gestures corresponding to a series of gesture outlines, tracking gesture tracks by using a TLD algorithm, correcting deviations in the tracking process by using a Haar classifier, and recognizing the dynamic gestures by using an HMM algorithm.
The above steps are further described in detail with reference to the following examples:
1. The preprocessing of the original gesture image in step 2 mainly comprises luminance correction based on exponential and logarithmic transformations, and light compensation based on a dynamic threshold. Specifically:
(1) luminance correction based on exponential and logarithmic transformations.
The exponential transformation corrects only the bright areas of an image well, while the logarithmic transformation corrects only the dark areas well; combining the two yields a light compensation strategy for the human hand, as shown in formula (1): bright areas are corrected with the modified exponential transformation, dark areas with the parameterized logarithmic transformation, and other areas are left unchanged.
The parameters of formula (1) are as follows:
g(x, y) is the corrected image; f(x, y) is the original gesture image; a is the highlight adjustment coefficient, a = 0 in this embodiment; b is the average luminance of the image, b = 120/log T1 in this embodiment; c is a positive constant found by experimental adjustment, c = T2 in this embodiment; d is a positive constant found by experimental adjustment, d = 1/(255 − T2) in this embodiment; T1 is the lower light threshold, for dim lighting conditions, T1 = 115 in this embodiment; T2 is the upper light threshold, for brighter lighting conditions, T2 = 135 in this embodiment.
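As a sketch of the piecewise correction that formula (1) describes (the formula itself appears only as an image in the source, so the exact form below is an assumption pieced together from the parameter descriptions): dark pixels below T1 get the parameterized logarithmic transform, pixels above T2 get a clipped exponential transform, and the rest pass through unchanged.

```python
import numpy as np

# Hedged sketch of the piecewise brightness correction described for formula (1).
# The exact formula is an image in the source; this reading is an assumption:
# dark pixels (f < T1) get a parameterized log transform, bright pixels (f > T2)
# a corrected (clipped) exponential transform, mid-range pixels pass through.
T1, T2 = 115.0, 135.0          # lower/upper light thresholds from the text
b = 120.0 / np.log(T1)         # log-transform gain, b = 120 / log(T1)
c = T2                         # exponential-transform constant, c = T2
d = 1.0 / (255.0 - T2)         # exponential-transform rate, d = 1 / (255 - T2)

def correct_brightness(f: np.ndarray) -> np.ndarray:
    """Apply the piecewise correction to a grayscale image f in [0, 255]."""
    f = f.astype(np.float64)
    g = f.copy()                               # mid-range: unchanged
    dark = f < T1
    g[dark] = b * np.log(f[dark] + 1.0)        # log transform for dark areas
    bright = f > T2
    g[bright] = np.minimum(255.0, c * np.exp(d * (f[bright] - T2)))
    return g.astype(np.uint8)
```

With b = 120/log T1, a pixel at the dark threshold T1 maps to roughly 120, which is consistent with the stated role of b as an average-luminance target.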
(2) Light compensation based on dynamic threshold
The algorithm, based on the total reflection theory, converts the image into the YCbCr color space and then takes the set of points with the largest Y component as the white reference points. The detailed process is as follows:
assuming that the original gesture image is f (x, y) and the size is m × n, then:
step 1, firstly, converting an original gesture image f (x, y) from an RGB color space to a YCbCr color space by using a formula (2):
step 2, obtaining a reference white point
(a) Cutting the converted image into M × N blocks, where M is 3 and N is 4 in this embodiment;
(b) for each segmented block, compute the average values Mb and Mr of the Cb and Cr components in YCbCr space;
(c) using Mb and Mr, compute the mean absolute errors Db and Dr of the Cb and Cr components of each block, as in formula (3):
Db = Σ |Cb(i, j) − Mb| / sum,  Dr = Σ |Cr(i, j) − Mr| / sum    (3)
In formula (3), Cb(i, j) is the offset of the B component of each pixel relative to the luminance, Cr(i, j) is the offset of the R component relative to the luminance, and sum is the total number of pixels in the current block.
2. When the preprocessed original image is segmented based on color information in step 3, a skin color segmentation algorithm based on the YCbCr color space is adopted. Specifically:
The YCbCr color space is also called the YUV color space; Y denotes luminance, and Cb and Cr denote chroma and saturation, where Cr reflects the difference between the red part of the RGB input signal and the luminance of the RGB signal, and Cb reflects the difference between the blue part of the RGB input signal and the luminance of the RGB signal. The conversion from the RGB color space to YCbCr is shown in formula (4):
through repeated experiments, the basic values of the parameters are as follows:
77≤Cb≤127 AND 132≤Cr≤172 (5)
however, the formula (5) contains more skin color ranges, and the provided value range is too large, so that interference such as orange or brown objects is easily introduced. Aiming at the unique skin color characteristics of the yellow race, the invention can effectively eliminate the interference of skin color-like objects by adjusting the values after debugging for many times, and the values are as follows:
3. The binarized convolutional neural network of step 1 is a Binarized Convolutional Neural Network (BCNN) based on the MPCNN. Specifically:
The currently popular deep convolutional neural network algorithms share a common defect: their computational cost, in both memory and arithmetic, is huge, and optimization of a network's cost therefore revolves around these two aspects. On the basis of the MPCNN gesture classification method, a Binarized Convolutional Neural Network (BCNN) gesture classification method is proposed, which improves the neural network with a binary approximation strategy to reduce the consumption of computing resources. A binarized network reduces computing resource consumption in two main ways: first, the original double-precision weights are represented by approximately binarized weights, which reduces the memory footprint of the network during computation; second, the inputs and weights of the multiplications that dominate each layer's computational cost are replaced by binary approximations, so that multiplication can be reduced to addition and subtraction or even bit operations. The method comprises the modification of the convolution block and the modification of the fully connected block.
(1) Binarization of the convolution block.
The specific way of performing binary approximate reconstruction on the convolutional neural network is as follows:
Firstly, during forward propagation, the weight matrix w of the convolution network is binarized according to formula (7) to obtain wb, while the original weight matrix w is retained, namely:
wb = sign(w)    (7)
In formula (7), wb is the matrix of weights obtained by binary approximation; cf, wf and hf are the number, width and height of the convolution kernels, with w ∈ R^(cf × wf × hf). In the standard sign function sign(w) = 0 when w = 0; here, so that no third value can occur and the binarization effect is achieved, it is specified that sign(w) = 1 when w = 0.
Secondly, a binarized activation layer, which replaces the original ReLU activation layer, is added before each layer to obtain its input values, as in formula (8), namely:
X^(i) = L(X^(i−1)) = sign(X^(i−1))    (8)
In formula (8), X^(i) is the input value of the i-th layer of the binarized network, with X ∈ R^(c × w × h), where c, w and h are the number of channels, the width and the height of the input image; L(X^(i−1)) is the value obtained by the i-th binarized activation layer; X^(i−1) is the input value of layer i − 1 of the binarized network.
The sign function in formula (8) follows the same convention as in formula (7). Finally, the convolution operation of the binarized convolution layer is performed with the weight wb, as in formula (9), namely:
Lb(Xb) = wb ⊛ Xb    (9)
In formula (9), Lb(Xb) is the binarized network layer function; ⊛ is the convolution operation; wb and Xb are obtained from formula (7) and formula (8) respectively.
The structure of the convolution block also needs some adjustment: the BatchNorm normalization layer and the binarized activation layer are placed before the convolution operation, which prevents the output of the binarized activation layer from being mostly 1 after passing through the max-pooling layer. The specific network structure is shown in fig. 2.
The back propagation of training proceeds as follows: the gradient of the last layer is computed; the node gradients and weight gradients are back-propagated layer by layer from the penultimate layer to the first; and the retained pre-binarization weights w are updated to obtain wu, after which a relaxation (clipping) operation as in formula (10) is performed, namely:
In formula (10), wu is the updated value of the floating-point weights retained during forward propagation; σ(wu) is the probability that the weight wu > 0; clip(·) denotes the clipping operation, built from the max function.
(2) Binarization of the fully connected block.
The binarization of the fully connected block is basically the same as that of the convolution block: the binarized convolution layer is replaced by a binarized fully connected layer, and the max-pooling layer is removed. The calculation of the binarized fully connected layer is shown in formula (11).
Lb(Xb) = wb Xb    (11)
In formula (11), Lb(Xb) is the binarized fully connected layer function; Xb and wb are obtained from formula (8) and formula (7) respectively. The binarized fully connected layer omits the bias b.
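The binarization of formulas (7), (8) and (11) can be sketched as follows; the sign(0) = 1 convention and the bias-free fully connected product follow the text, while the layer shapes are simplified for illustration.

```python
import numpy as np

# Hedged sketch of the binarization of formulas (7), (8) and (11): weights and
# inputs are mapped to {-1, +1} by a sign function with sign(0) = 1, and the
# binarized fully connected layer is a plain matrix product without bias.
def binarize(x: np.ndarray) -> np.ndarray:
    """sign(x) with the convention sign(0) = +1, as specified for formula (7)."""
    return np.where(x >= 0, 1.0, -1.0)

def binary_fc(w: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Formula (11): L_b(X_b) = w_b X_b, with w_b = sign(w) and X_b = sign(x)."""
    return binarize(w) @ binarize(x)
```

Because both operands lie in {−1, +1}, the products reduce to sign flips and additions, which is the source of the claimed computational savings.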
4. In step 6, a TLD algorithm is used for tracking gesture tracks, deviations in the tracking process are corrected by using a Haar classifier, and the specific method for identifying dynamic gestures by using an HMM algorithm comprises the following steps:
4.1. The TLD algorithm framework consists of three parts, tracking, learning and detection, as shown in fig. 3:
Within this framework, the three parts cooperate and complement each other to track the object. The tracking module assumes that the object moves slowly, undergoes no large displacement between two adjacent frames, and stays within the range of the camera; on this assumption it estimates the moving target, and tracking fails if the target disappears from the field of view. The detection module assumes no interference between the frames of the video; using the previously detected and learned model, it searches each frame image for the target and marks the regions where the target may appear. When the detection module makes an error, the learning module evaluates the error from the result obtained by the tracking module, generates training samples to update the target model of the detection module and the key feature points of the tracking module, and thus avoids similar errors later. A detailed flow chart of the TLD algorithm is shown in fig. 4.
The TLD algorithm tracks targets with good real-time performance, and it can re-identify and continue tracking a target that was occluded or left the camera area and reappeared. However, the algorithm requires the tracked target to be selected manually with the mouse during initialization, which hinders the automation of target tracking; and although the LBP features used in the detection module are cheap to compute and easily meet the real-time requirement, position deviations occur during tracking and lead to tracking failure. The system therefore combines the characteristics of static gesture recognition and gesture tracking on the basis of the original TLD algorithm, and improves it as follows:
To remove the need to select the target area manually at initialization, a static gesture recognition database is added to the detection module, and the TLD tracking algorithm is initialized automatically when a gesture matching the gesture database appears in a video frame. Moreover, because a trained static gesture database is used, the learning module of the original TLD algorithm can be removed: when the user's gesture changes, it is only necessary to search the video frames again for a gesture present in the gesture database and then re-initialize the TLD algorithm. The flow of the improved TLD algorithm is shown in fig. 5.
4.2 Correcting deviations in the tracking process with the Haar classifier
This mainly comprises Haar feature extraction and classifier training. Haar features mainly include center features, linear features, edge features and diagonal features. To obtain the final Haar classifier, the invention trains with an improved Adaboost algorithm: different weak classifiers are trained from the Haar features extracted from the samples, and the weak classifiers are then combined into the final strong classifier, i.e. the Haar classifier required here.
The implementation flow of the improved Adaboost algorithm is as follows:
Let X be the sample space and Y the set of sample class labels. For the typical two-class problem, Y = {0, 1}. Let S = {(xi, yi) | i = 1, 2, 3, …, m} be the labeled training sample set, with xi ∈ X and yi ∈ Y, and assume a total of T iterations are needed to reach the final goal.
Step 1, initializing weights of m samples:
In the formula, Dt(i) denotes the weight of sample (xi, yi) in the t-th iteration.
Step 2, for each t = 1, 2, 3, …, T, compute:
(a) For each feature f of a sample x, train a weak classifier ht(x, f, p, θ):
In formula (13), θ is the threshold of the weak classifier for feature f, and p adjusts the direction of the inequality. Compute the classification error rate εf of every weak classifier under the weights qi:
εf = Σi qi |ht(x, f, p, θ) − yi|    (14)
In formula (14), yi is an element of the sample class label space and qi is the weight of the i-th training sample.
(b) Select the optimal weak classifier ht with the minimum error rate εt:
εt = min(f, p, θ) Σi qi |ht(x, f, p, θ) − yi|    (15)
(c) Modify the sample weights using the best weak classifier:
βt = εt / (1 − εt)    (17)
In formula (16), Dt+1(i) is the weight of the i-th training sample at iteration t + 1; Dt+1 and Dt are related iteratively, so Dt+1 can be updated from Dt.
In formula (17), βt is a normalization constant.
If sample xi is classified correctly, then ei = 0; otherwise ei = 1.
Step 3, the final Haar classifier C(x):
αt=log(1/βt) (19)
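The weight update of step 2(c) and the final vote of step 3 can be sketched as follows; formulas (16) and (18) appear only as images in the source, so the usual Viola-Jones forms are assumed: Dt+1(i) ∝ Dt(i) βt^(1 − ei) followed by renormalization, and a weighted majority vote thresholded at half the sum of the αt.

```python
import math

# Hedged sketch of the Adaboost weight update of step 2(c) and the final vote.
# Formulas (16) and (18) are images in the source; the Viola-Jones form is
# assumed: beta_t = eps_t / (1 - eps_t), D_{t+1}(i) ~ D_t(i) * beta_t**(1 - e_i),
# alpha_t = log(1 / beta_t), C(x) = 1 iff sum_t alpha_t h_t(x) >= 0.5 * sum_t alpha_t.
def update_weights(D, errors, eps_t):
    """D: sample weights; errors[i] = 0 if sample i was classified correctly."""
    beta = eps_t / (1.0 - eps_t)
    D = [d * beta ** (1 - e) for d, e in zip(D, errors)]  # shrink correct samples
    s = sum(D)
    return [d / s for d in D], math.log(1.0 / beta)       # renormalize; alpha_t

def strong_classify(alphas, weak_outputs):
    """Final Haar classifier C(x): weighted majority vote of the weak outputs."""
    score = sum(a * h for a, h in zip(alphas, weak_outputs))
    return 1 if score >= 0.5 * sum(alphas) else 0
```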
4.3 HMM-based dynamic gesture trajectory recognition
In the invention, a hidden Markov model is used to recognize the dynamic gesture trajectory; the recognition corresponds to the three problems solved with a hidden Markov model:
(1) The evaluation problem
Given a hidden Markov model λ = (π, A, B) and an observation sequence O = (o1, o2, …, oT) generated by the model, compute the likelihood P(O | λ) of the observation sequence O. An effective algorithm for this problem is the forward-backward recursion algorithm.
The forward variables are defined as:
αt(i) = P(o1, o2, … ot, qt = θi | λ), 1 ≤ t ≤ T    (19)
In formula (19), P(·) is the likelihood of the observation sequence; o1, o2, … oT is the observation sequence; qt is the system state at time t; θi is a state value of the system; λ is the hidden Markov model; T is the total observation time; t is the time index, taking values between 1 and T.
Write bj(ot) = bjk when ot = vk, where bj(ot) comes from the observation probability matrix, bjk is the probability of observing symbol vk in state θj at any time t, and vk is the k-th observation symbol. The forward algorithm then proceeds as follows:
initialization:
α1(i) = πi bi(o1), 1 ≤ i ≤ N    (20)
In formula (20), α1(i) is the probability of observing o1 at time 1 with the system in state θi; πi is the initial state probability distribution.
Recursion:
αt+1(j) = [Σi αt(i) aij] bj(ot+1), 1 ≤ t ≤ T − 1, 1 ≤ j ≤ N    (21)
In formula (21), αt+1(j) is the probability of having observed o1 … ot+1 and being in state θj at time t + 1; aij is the state transition probability from θi to θj at any time t.
Calculate P(O | λ):
P(O | λ) = Σi αT(i)
Here P(O | λ) is the likelihood of generating the observation sequence O under the current model λ. The backward variable is defined as:
βt(i) = P(ot+1, ot+2, … oT | qt = θi, λ), 1 ≤ t ≤ T    (22)
In formula (22), βt(i) is the probability of observing the partial sequence ot+1 … oT given that the system is in state θi at time t under model λ.
The backward algorithm comprises the following steps:
initialization:
βT(i) = 1, 1 ≤ i ≤ N    (23)
recursion:
βt(i) = Σj aij bj(ot+1) βt+1(j), t = T − 1, T − 2, …, 1, 1 ≤ i ≤ N    (24)
calculate P(O | λ):
P(O | λ) = Σi πi bi(o1) β1(i)
More generally, computing times 1 to t with the forward algorithm and times t to T with the backward algorithm gives P(O | λ) = Σi αt(i) βt(i) for any t.
(2) The decoding problem
For a hidden Markov model λ = (π, A, B) and an observation sequence O = (o1, o2, … oT) generated by the model, compute the optimal state sequence the model passed through while generating the observation sequence. The Viterbi algorithm is used here.
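The Viterbi decoding named above can be sketched as follows (a generic implementation, not code from the patent); A, B and pi are as in the evaluation problem.

```python
import numpy as np

# Generic sketch of Viterbi decoding for the decoding problem. A[i, j]: state
# transition probability; B[j, k]: probability of observing symbol k in state j;
# pi: initial state distribution (toy parameters).
def viterbi(A, B, pi, obs):
    """Return the most likely state sequence for the observation sequence obs."""
    N, T = A.shape[0], len(obs)
    delta = pi * B[:, obs[0]]                 # best path probability so far
    psi = np.zeros((T, N), dtype=int)         # back-pointers
    for t in range(1, T):
        trans = delta[:, None] * A            # trans[i, j] = delta_i * a_ij
        psi[t] = trans.argmax(axis=0)
        delta = trans.max(axis=0) * B[:, obs[t]]
    states = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):             # follow back-pointers
        states.append(int(psi[t, states[-1]]))
    return states[::-1]
```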
(3) The learning problem
Given an observation sequence O = (o1, o2, … oT) generated by the model, but without knowledge of the hidden Markov model parameters, adjust the model parameters so that the likelihood P(O | λ) is maximized. In this system the learning problem is solved with the Baum-Welch algorithm.
The gesture recognition platform collects gesture images through the camera and converts the gesture commands in them into instructions the computer can execute. A sample database is needed first, and static gesture recognition and dynamic gesture trajectory recognition are performed on its basis: the gesture image can be acquired from the camera or directly from a local video file; once a gesture image is obtained, gesture segmentation, image binarization, feature extraction and similar operations are performed on it; finally, gesture recognition is performed on the image and the recognition result is returned for easy observation of the process. The system software design flow is shown in fig. 6. The system is developed with multiple threads: image preprocessing and gesture segmentation are done in sub-thread 1, and dynamic gesture tracking and recognition in sub-thread 3.
Claims (1)
1. A gesture recognition method based on deep learning is characterized by comprising the following steps:
step 1, training a binary convolution neural network by utilizing a gesture training set and a test set;
the binarization-based convolutional neural network BCNN based on the MPCNN is adopted by the binarization-based convolutional neural network in the step 1, and specifically comprises the following steps:
on the basis of the MPCNN gesture classification method, a convolution neural network gesture classification method based on binaryzation is provided, and a strategy of binaryzation approximation is adopted to improve the neural network, so that the consumption of the neural network on computing resources is reduced; the binarization network has two ways to reduce the consumption of computing resources: firstly, the original double-precision weight is represented by using a weight value approximate to binaryzation, so that the memory occupation of a network in calculation is reduced; secondly, the input and the weight value in the multiplication calculation with the maximum calculation consumption in each layer are replaced by binary approximate values, so that the multiplication calculation can be simplified into addition and subtraction or even bit operation, including the transformation of a rolling block and the transformation of a full connecting block;
a first part: binarization of the rolling blocks;
the specific way of performing binary approximate reconstruction on the convolutional neural network is as follows:
step 1011, in the forward propagation process, binarize the weight matrix w of the convolution network according to formula (7) to obtain w_b, while retaining the original weight matrix w, namely:

w_b = sign(w) = 1 if w ≥ 0, −1 if w < 0   (7)

In formula (7), w_b ∈ {−1, 1}^(c_f × w_f × h_f) is the weight matrix obtained by binary approximation; c_f, w_f and h_f denote the number, width and height of the convolution kernels; sign(w) is specified to be 1 when w = 0;
step 1012, add a binarized activation layer before each layer to binarize the node values from the previous layer, replacing the original ReLU activation layer, as shown in formula (8), namely:

X_b^(i) = L(X^(i−1)) = sign(X^(i−1))   (8)

In formula (8), X_b^(i) ∈ {−1, 1}^(c × w × h) is the input value of the i-th layer of the binarized network, where c, w and h denote the number of channels, the width and the height of the input image; L(X^(i−1)) is the value obtained by the i-th binarized activation layer; X^(i−1) is the input value of the (i−1)-th layer of the binarized network;
the sign function is the same as in formula (7); finally, the convolution operation is performed in the binarized convolution layer with the weights w_b, as shown in formula (9), namely:

L_b(X_b) = w_b ⊛ X_b   (9)

In formula (9), L_b(X_b) is the binarized network layer function; ⊛ is the convolution operation; w_b and X_b are obtained by formula (7) and formula (8) respectively;
for the convolution block, the BatchNorm normalization layer and the binarized activation layer are placed before the convolution operation; this prevents the situation in which most outputs of the binarized activation layer become 1 after passing through the max-pooling layer;
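The forward pass of formulas (7)-(9) can be sketched as follows in Python/NumPy (the patent names no language, so the implementation details are illustrative): both the input and the kernel are binarized to {−1, +1}, after which every product in the convolution is just a sign agreement and the multiply-accumulate degenerates into counting matches.

```python
import numpy as np

def sign(x):
    """Binarize to {-1, +1}; sign(0) is defined as +1, as in formula (7)."""
    return np.where(x >= 0, 1.0, -1.0)

def binary_conv2d(X, w):
    """Valid 2-D convolution where input and kernel are first binarized
    (formulas (7)-(9)); every product is +/-1, so no real multiplies remain."""
    Xb, wb = sign(X), sign(w)
    kh, kw = wb.shape
    H, W = Xb.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # each term is +1 (signs agree) or -1 (signs differ)
            out[i, j] = np.sum(Xb[i:i+kh, j:j+kw] * wb)
    return out

X = np.random.randn(5, 5)
w = np.random.randn(3, 3)
y = binary_conv2d(X, w)
print(y.shape)  # (3, 3)
```

Because each of the 3 × 3 products is ±1, every output entry is an odd integer in [−9, 9]; on hardware this reduces to XNOR and popcount, which is the bit-operation saving the claim describes.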
the back-propagation process of binarized convolutional neural network training is as follows: compute the gradient of the last layer, back-propagate layer by layer from the penultimate layer to the first layer, compute the node gradients and the weight gradients, update the retained pre-binarization weights w to obtain w_u, and perform the relaxation operation of formula (10), namely:

w = clip(w_u, −1, 1) = max(−1, min(1, w_u))   (10)

In formula (10), w_u is the updated value of the floating-point weights retained during forward propagation; σ(w_u) = clip((w_u + 1)/2, 0, 1) represents the probability that the weight w_u > 0; clip(·) is built from the max and min functions as above;
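The update-then-relax step of formula (10) can be sketched in Python/NumPy as follows (illustrative names; `grad` stands in for the gradient computed through the binarized weights): the gradient step is applied to the retained float weights, which are then clipped back into [−1, 1].

```python
import numpy as np

def clip(x, lo=-1.0, hi=1.0):
    """Relaxation operation of formula (10): element-wise max/min."""
    return np.maximum(lo, np.minimum(hi, x))

def update_weights(w, grad, lr=0.01):
    """Apply the gradient to the retained float weights w, then clip,
    so the weights stay in the range the sign() binarization expects."""
    w_u = w - lr * grad   # plain SGD step on the float copy
    return clip(w_u)      # formula (10)

w = np.array([0.9, -1.5, 0.2])
grad = np.array([-20.0, 0.0, 5.0])
print(update_weights(w, grad))  # 1.0 (clipped), -1.0 (clipped), 0.15
```

Clipping keeps the float weights from drifting far outside {−1, +1}, where further gradient steps would no longer change the binarized value.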
Part 2: binarization of the fully connected blocks;
the binarization of the fully connected block is basically the same as that of the convolution block; the differences are that the binarized convolution layer is replaced by a binarized fully connected layer and the max-pooling layer is removed; the binarized fully connected layer is computed as in formula (11):

L_b(X_b) = w_b X_b   (11)

In formula (11), L_b(X_b) is the binarized fully connected layer function; w_b and X_b are obtained by formulas (7) and (8) respectively; the bias b is removed from the binarized fully connected layer;
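The claim notes that with binary values, multiplication can be reduced to bit operations. A small Python sketch (illustrative, not from the patent) of how a {−1, +1} dot product — the core of the binarized fully connected layer of formula (11) — becomes XNOR plus popcount:

```python
def binary_dot(xb, wb):
    """Dot product of two {-1, +1} vectors via bit operations.

    Pack the signs into integer bit masks, XNOR them (a set bit means
    the two signs agree), popcount the matches; then dot = 2*matches - n.
    """
    n = len(xb)
    x_bits = sum(1 << i for i, v in enumerate(xb) if v > 0)
    w_bits = sum(1 << i for i, v in enumerate(wb) if v > 0)
    matches = bin(~(x_bits ^ w_bits) & ((1 << n) - 1)).count("1")
    return 2 * matches - n

xb = [1, -1, 1, 1]
wb = [1, 1, -1, 1]
print(binary_dot(xb, wb))                  # 0
print(sum(x * w for x, w in zip(xb, wb)))  # 0, the ordinary dot product
```

On real hardware the packed masks are machine words, so one XNOR + popcount replaces up to 64 floating-point multiply-adds.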
step 2, after the original gesture image is collected, preprocessing the original gesture image to remove the influence of illumination on the original image;
in step 2, the preprocessing of the original gesture image includes: the luminance correction based on exponential transformation and logarithmic transformation and the light compensation based on dynamic threshold specifically comprise the following steps:
step 201, luminance correction based on exponential transformation and logarithmic transformation:
the exponential transformation gives a good correction only in the bright areas of an image, while the logarithmic transformation gives a good correction in the dark areas; the two are combined into a light compensation strategy for the human hand, as shown in formula (1): the corrected exponential transformation is used for the bright areas, the parametric logarithmic transformation for the dark areas, and the other areas are left uncorrected:
the parameters used in formula (1) are as follows:
g(x, y) is the corrected image; f(x, y) is the original gesture image; a is the highlight adjustment coefficient; b is the average brightness of the image; c is a positive constant obtained by experimental debugging; d = 1/(255 − T_2); T_1 is the lower light threshold under dark illumination; T_2 is the upper light threshold under brighter illumination;
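Formula (1) itself is not reproduced in this extract, so the following Python/NumPy sketch only illustrates the piecewise strategy described above under assumed forms: a parametric log transform below T_1, a compressive power transform above T_2 built around d = 1/(255 − T_2), and a pass-through in between. The constants and the exact closed forms are stand-ins, not the patent's formula (1).

```python
import numpy as np

def correct_brightness(f, T1=60, T2=200, a=1.1, c=30.0):
    """Piecewise correction in the spirit of formula (1); the exact
    forms used below T1 and above T2 are illustrative assumptions."""
    g = f.astype(np.float64).copy()
    dark = f < T1
    bright = f > T2
    g[dark] = c * np.log1p(g[dark])        # log transform lifts dark areas
    d = 1.0 / (255.0 - T2)                 # d = 1/(255 - T2) from the text
    g[bright] = T2 + (255.0 - T2) * ((g[bright] - T2) * d) ** a  # compress highlights
    return np.clip(g, 0, 255).astype(np.uint8)

img = np.array([[10, 128, 250]], dtype=np.uint8)
print(correct_brightness(img))  # dark pixel lifted, mid-tone untouched
```

Mid-tone pixels between T_1 and T_2 pass through unchanged, matching the "other areas are not corrected" clause.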
step 202, dynamic threshold based light compensation
the algorithm, based on the total reflection theory, converts the image into the YCbCr color space and then takes the set of points with the largest Y components in that space as the white reference points:
assuming that the original gesture image is f (x, y) and the size is m × n, then:
step 2021, convert the original gesture image f (x, y) from RGB color space to YCbCr color space using equation (2):
step 2022, obtaining the reference white points, comprising the following steps:
(a) divide the converted image into M × N blocks;
(b) for each block, compute the averages M_b and M_r of the C_b and C_r components in YCbCr space;
(c) using M_b and M_r, compute the mean absolute deviations D_b and D_r of the C_b and C_r components of each block, as in formula (3):

D_b = (1/sum) Σ_(i,j) |C_b(i, j) − M_b|,  D_r = (1/sum) Σ_(i,j) |C_r(i, j) − M_r|   (3)

In formula (3), C_b(i, j) is the offset of the B component of each pixel relative to the luminance, C_r(i, j) is the offset of the R component relative to the luminance, and sum is the total number of pixels in the current block;
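Steps (a)-(c) and formula (3) can be sketched in Python/NumPy as follows (illustrative; the block counts and channel layout are assumptions): each block's chroma mean and mean absolute deviation are computed, which is the statistic the algorithm uses when selecting white-reference regions.

```python
import numpy as np

def block_stats(Cb, Cr, M=4, N=4):
    """Per-block chroma statistics: means (M_b, M_r) and mean absolute
    deviations (D_b, D_r) as in formula (3). Returns one tuple per block."""
    h, w = Cb.shape
    bh, bw = h // M, w // N
    stats = []
    for i in range(M):
        for j in range(N):
            cb = Cb[i*bh:(i+1)*bh, j*bw:(j+1)*bw]
            cr = Cr[i*bh:(i+1)*bh, j*bw:(j+1)*bw]
            Mb, Mr = cb.mean(), cr.mean()
            Db = np.abs(cb - Mb).mean()  # formula (3), C_b channel
            Dr = np.abs(cr - Mr).mean()  # formula (3), C_r channel
            stats.append((Mb, Mr, Db, Dr))
    return stats

# a flat chroma image: every block has zero deviation
Cb = np.full((8, 8), 110.0)
Cr = np.full((8, 8), 150.0)
stats = block_stats(Cb, Cr, M=2, N=2)
```

A block with small D_b and D_r is chromatically uniform, which is what makes its brightest pixels usable as a white reference.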
step 3, segmenting the preprocessed original image based on the color information reflected by skin color, and extracting a gesture outline;
in step 3, when the preprocessed original image is segmented based on the color information, a segmentation algorithm based on skin color in the YCbCr color space is adopted, specifically:
the YCbCr color space is also called YUV color space, Y represents brightness, and Cr and Cb represent chroma and saturation, where Cr reflects the difference between the red part of the RGB input signal and the brightness value of the RGB signal, and Cb reflects the difference between the blue part of the RGB input signal and the brightness value of the RGB signal;
the conversion formula from the RGB color space to YCbCr is shown in formula (4) (the standard ITU-R BT.601 conversion):

Y = 0.299·R + 0.587·G + 0.114·B
Cb = −0.1687·R − 0.3313·G + 0.500·B + 128
Cr = 0.500·R − 0.4187·G − 0.0813·B + 128   (4)
through repeated experiments, the basic values of the parameters are as follows:
77≤Cb≤127 AND 132≤Cr≤172 (5)
however, formula (5) covers too wide a skin color range; with such a large value range, interference from orange or brown objects is easily introduced; the values are therefore adjusted through repeated debugging so that interference from skin-color-like objects can be effectively eliminated; the adjusted values are as follows:
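A minimal Python/NumPy sketch of the skin color mask using the broad thresholds of formula (5) (the tightened values the claim alludes to are not reproduced in this extract, so the broad range is used here):

```python
import numpy as np

def skin_mask(Cb, Cr):
    """Binary skin mask from the chroma thresholds of formula (5):
    77 <= Cb <= 127 AND 132 <= Cr <= 172."""
    return ((Cb >= 77) & (Cb <= 127) &
            (Cr >= 132) & (Cr <= 172)).astype(np.uint8)

Cb = np.array([[100, 60]])
Cr = np.array([[150, 150]])
print(skin_mask(Cb, Cr))  # [[1 0]]: first pixel in range, second not
```

Thresholding chroma only (not Y) is what makes the segmentation relatively robust to the illumination changes step 2 could not fully remove.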
step 4, judging whether the gesture outline extracted in step 3 is the start or stop point of a dynamic gesture; if so, the series of subsequent images is extracted, the gesture outlines obtained constitute a dynamic gesture, and the method proceeds to step 6; otherwise the extracted gesture outline is a static gesture and the method proceeds to step 5;
step 5, judging the gesture instruction corresponding to the gesture outline with the trained binarized convolutional neural network;
step 6, locating the start and stop points of the dynamic gesture corresponding to a series of gesture outlines, tracking the gesture track with the TLD algorithm, correcting deviations in the tracking process with a Haar classifier, and recognizing the dynamic gesture with an HMM algorithm;
in step 6, the gesture track is tracked with the TLD algorithm, deviations in the tracking process are corrected with a Haar classifier, and the dynamic gesture is recognized with an HMM algorithm, as follows:
step 601, the TLD algorithm framework consists of three parts: tracking, learning and detection; within the framework the three cooperate and complement one another to track the object. The tracking module assumes that the object moves slowly, does not undergo large displacement between two adjacent frames, and stays within the camera's field of view; on this basis it estimates the moving target, and tracking fails if the target leaves the field of view. The detection module assumes the frames of the video are mutually independent; using the model learned from past detections, it searches each frame for the target and marks the regions where the target may appear. When the detection module makes an error, the learning module evaluates that error from the result obtained by the tracking module, generates training samples, and updates the target model of the detection module and the key feature points of the tracking module, so that similar errors are avoided;
step 602, correcting deviations in the tracking process with a Haar classifier:
the work mainly comprises extracting Haar features and training the classifier; the Haar features mainly comprise central features, linear features, edge features and diagonal features; to obtain the final Haar classifier, an improved Adaboost algorithm is used for training: different weak classifiers are first trained with the Haar features extracted from the samples, and these weak classifiers are then combined into the final strong classifier, namely the Haar classifier;
the implementation flow of the improved Adaboost algorithm is as follows:
suppose X is the sample space and Y is the set of sample class labels; for a typical two-class problem, Y = {0, 1}; let S = {(x_i, y_i) | i = 1, 2, 3, …, m} be the set of labeled training samples, where x_i ∈ X and y_i ∈ Y; T iterations in total are performed to reach the final target;
step 6021, initialize the weights of the m samples:

D_1(i) = 1/m,  i = 1, 2, …, m   (12)

In the formula, D_t(i) denotes the weight of sample (x_i, y_i) in the t-th iteration;
step 6022, for t = 1, 2, 3, …, T, compute:
(a) train a weak classifier h_t(x, f, p, θ) for each feature f of a sample x:

h_t(x, f, p, θ) = 1 if p·f(x) < p·θ, 0 otherwise   (13)

In formula (13), θ is the threshold of the weak classifier corresponding to f and p adjusts the direction of the inequality sign; with the normalized sample weights q_i, the weighted classification error rate ε_f of the weak classifiers of all features is computed as:

ε_f = Σ_i q_i |h_t(x, f, p, θ) − y_i|   (14)

In formula (14), y_i is an element of the sample class label set and q_i is the weight of the i-th training sample;
(b) select the optimal weak classifier h_t, the one with the minimum error rate ε_t:

ε_t = min_(f,p,θ) Σ_i q_i |h_t(x, f, p, θ) − y_i|   (15)
(c) modify the sample weights using the optimal weak classifier:

D_(t+1)(i) = D_t(i)·β_t^(1−e_i)   (16)
β_t = ε_t/(1 − ε_t)   (17)

In formula (16), D_(t+1)(i) is the weight of the i-th training sample in iteration t+1; D_(t+1) and D_t are in an iterative relationship, D_(t+1) being updated from D_t; in formula (17), β_t is the weight adjustment factor; if sample x_i is correctly classified, e_i = 0, otherwise e_i = 1;
step 6023, the final Haar classifier C(x):

C(x) = 1 if Σ_(t=1)^T α_t·h_t(x) ≥ (1/2)·Σ_(t=1)^T α_t, 0 otherwise   (18)
α_t = log(1/β_t)   (19)
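The training loop of steps 6021-6023 can be sketched in Python/NumPy as follows — a minimal Adaboost over decision stumps on scalar features, with the formula numbers marked in the comments. The toy data set and the eps clamping are illustrative assumptions, and β_t is read as the ratio ε_t/(1 − ε_t):

```python
import numpy as np

def train_adaboost(feats, y, T=10):
    """Improved-Adaboost sketch over decision stumps h(x, f, p, theta)."""
    m, F = feats.shape
    D = np.full(m, 1.0 / m)                    # formula (12)
    stumps = []
    for _ in range(T):
        best = None
        q = D / D.sum()                        # normalized weights q_i
        for f in range(F):
            for theta in np.unique(feats[:, f]):
                for p in (1, -1):
                    h = (p * feats[:, f] < p * theta).astype(int)  # formula (13)
                    eps = np.sum(q * np.abs(h - y))                # formula (14)
                    if best is None or eps < best[0]:              # formula (15)
                        best = (eps, f, theta, p, h)
        eps, f, theta, p, h = best
        eps = max(eps, 1e-10)                  # guard against a perfect stump
        beta = eps / (1.0 - eps)               # formula (17)
        D = D * beta ** (1 - np.abs(h - y))    # formula (16): downweight correct samples
        stumps.append((np.log(1.0 / beta), f, theta, p))
    alphas = [a for a, *_ in stumps]
    def classify(x):                           # formula (18)
        s = sum(a * (p * x[f] < p * theta) for a, f, theta, p in stumps)
        return int(s >= 0.5 * sum(alphas))
    return classify

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
clf = train_adaboost(X, y, T=3)
print([clf(x) for x in X])  # [0, 0, 1, 1]
```

Multiplying correctly classified samples by β_t (which is below 1 when ε_t < 1/2) shifts the weight toward the samples the current stump gets wrong, which is what forces each later round to attend to the hard cases.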
step 603, HMM-based dynamic gesture track recognition
the hidden Markov model is used to recognize the dynamic gesture track; the recognition corresponds to the three problems solved for hidden Markov models:
step 6031, the estimation problem:
the problem is: given a hidden Markov model λ = (π, A, B) and an observation sequence O = (o_1, o_2, …, o_T) generated by the model, compute the likelihood P(O|λ); an effective algorithm for solving this problem is the forward-backward recursion algorithm:
the forward variable is defined as:

α_t(i) = P(o_1, o_2, …, o_t, q_t = θ_i | λ),  1 ≤ t ≤ T   (19)

In formula (19), P(·) denotes the likelihood probability of the observation sequence; o_1, o_2, …, o_t are the observations up to time t; q_t is the state at time t; θ_i is a system state value; λ is the hidden Markov model; T is the total observation time; t is the time index, taking values between 1 and T;
write b_j(o_t) = b_jk when o_t = v_k, where b_j(o_t) is the probability of observing o_t in state θ_j, b_jk is an entry of the observation probability matrix B, and v_k is the k-th observation symbol; the forward algorithm then comprises the following steps:
initialization:
α_1(i) = π_i·b_i(o_1),  1 ≤ i ≤ N   (20)

In formula (20), α_1(i) is the probability of observing o_1 at time 1 with the system in state θ_i; π_i is the initial probability distribution;
recursion:

α_(t+1)(j) = [Σ_(i=1)^N α_t(i)·a_ij]·b_j(o_(t+1)),  1 ≤ t ≤ T−1   (21)

In formula (21), α_(t+1)(j) is the probability of the partial observation sequence up to time t+1 with the system in state θ_j at time t+1; a_ij is an entry of the state transition matrix A;
calculate P (O | λ):
in equation (15), P (O | λ) represents the likelihood probability of generating the observation sequence O under the current model λ, and the defined variables are:
β_t(i) = P(o_(t+1), o_(t+2), …, o_T | q_t = θ_i, λ),  1 ≤ t ≤ T   (22)

In formula (22), β_t(i) is the probability of the observations after time t given that the system is in state θ_i at time t;
the backward algorithm comprises the following steps:
initialization:
β_T(i) = 1,  1 ≤ i ≤ N   (23)
recursion:

β_t(i) = Σ_(j=1)^N a_ij·b_j(o_(t+1))·β_(t+1)(j),  t = T−1, …, 1   (24)

calculate P(O|λ):

P(O|λ) = Σ_(i=1)^N π_i·b_i(o_1)·β_1(i)
using the forward algorithm for the first half of the calculation, over the period 0 to t, and the backward algorithm for the second half, over the period t to T, the probability is obtained as:

P(O|λ) = Σ_(i=1)^N α_t(i)·β_t(i)
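The forward recursion of formulas (20)-(21) and the backward recursion of formulas (23)-(24) can be sketched in Python/NumPy as follows (the toy two-state model is illustrative); both must yield the same likelihood P(O|λ), which makes a useful sanity check:

```python
import numpy as np

def forward(pi, A, B, O):
    """Forward recursion, formulas (20)-(21); returns P(O|lambda)."""
    alpha = pi * B[:, O[0]]            # formula (20): alpha_1(i) = pi_i * b_i(o_1)
    for o in O[1:]:
        alpha = (alpha @ A) * B[:, o]  # formula (21)
    return alpha.sum()                 # termination: sum_i alpha_T(i)

def backward(pi, A, B, O):
    """Backward recursion, formulas (23)-(24); same likelihood."""
    beta = np.ones(len(pi))            # formula (23): beta_T(i) = 1
    for o in O[:0:-1]:                 # o_T, ..., o_2
        beta = A @ (B[:, o] * beta)    # formula (24)
    return (pi * B[:, O[0]] * beta).sum()

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
O = [0, 1, 0]
print(abs(forward(pi, A, B, O) - backward(pi, A, B, O)) < 1e-12)  # True
```

Both directions cost O(N²·T), versus the exponential cost of summing over all state paths directly.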
step 6032, the decoding problem:
for a hidden Markov model λ = (π, A, B) and an observation sequence O = (o_1, o_2, …, o_T) generated by the model, compute the optimal state sequence the model passes through while generating the observation sequence; the Viterbi algorithm is used here;
step 6033, the learning problem:
with the hidden Markov model parameters unknown, adjust the model parameters so that the likelihood P(O|λ) of generating the observation sequence O = (o_1, o_2, …, o_T) from the model is maximized.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810242638.4A CN108537147B (en) | 2018-03-22 | 2018-03-22 | Gesture recognition method based on deep learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108537147A CN108537147A (en) | 2018-09-14 |
CN108537147B true CN108537147B (en) | 2021-12-10 |
Family
ID=63483626
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810242638.4A Active CN108537147B (en) | 2018-03-22 | 2018-03-22 | Gesture recognition method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108537147B (en) |
Families Citing this family (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109508670B (en) * | 2018-11-12 | 2021-10-12 | 东南大学 | Static gesture recognition method based on infrared camera |
CN109614922B (en) * | 2018-12-07 | 2023-05-02 | 南京富士通南大软件技术有限公司 | Dynamic and static gesture recognition method and system |
CN109634415B (en) * | 2018-12-11 | 2019-10-18 | 哈尔滨拓博科技有限公司 | It is a kind of for controlling the gesture identification control method of analog quantity |
CN109684959B (en) * | 2018-12-14 | 2021-08-03 | 武汉大学 | Video gesture recognition method and device based on skin color detection and deep learning |
CN110908581B (en) * | 2019-11-20 | 2021-04-23 | 网易(杭州)网络有限公司 | Gesture recognition method and device, computer storage medium and electronic equipment |
CN113449573A (en) * | 2020-03-27 | 2021-09-28 | 华为技术有限公司 | Dynamic gesture recognition method and device |
CN111753764A (en) * | 2020-06-29 | 2020-10-09 | 济南浪潮高新科技投资发展有限公司 | Gesture recognition method of edge terminal based on attitude estimation |
CN112183639B (en) * | 2020-09-30 | 2022-04-19 | 四川大学 | Mineral image identification and classification method |
CN112270220B (en) * | 2020-10-14 | 2022-02-25 | 西安工程大学 | Sewing gesture recognition method based on deep learning |
CN112784812B (en) * | 2021-02-08 | 2022-09-23 | 安徽工程大学 | Deep squatting action recognition method |
US11983327B2 (en) * | 2021-10-06 | 2024-05-14 | Fotonation Limited | Method for identifying a gesture |
CN114049539B (en) * | 2022-01-10 | 2022-04-26 | 杭州海康威视数字技术股份有限公司 | Collaborative target identification method, system and device based on decorrelation binary network |
CN114627561B (en) * | 2022-05-16 | 2022-09-23 | 南昌虚拟现实研究院股份有限公司 | Dynamic gesture recognition method and device, readable storage medium and electronic equipment |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106502570A (en) * | 2016-10-25 | 2017-03-15 | 科世达(上海)管理有限公司 | A kind of method of gesture identification, device and onboard system |
US20170220122A1 (en) * | 2010-07-13 | 2017-08-03 | Intel Corporation | Efficient Gesture Processing |
Non-Patent Citations (4)
Title |
---|
A gesture detection and recognition method based on skin color feature extraction; Fan Wenbing et al.; Modern Electronics Technique; 2017-09-15; vol. 40, no. 18, pp. 85-88 *
Research on a face detection algorithm using skin color information and geometric features; Wei Yanliu et al.; Wireless Internet Technology; 2016-11-30; no. 21, pp. 107-111 *
Research on a gesture classification method based on binarized convolutional neural networks; Hu Junfei et al.; Journal of Hunan University of Technology; 2017-01-31; vol. 31, no. 1, pp. 75-80 *
Research progress in robot visual gesture interaction technology; Qi Jing et al.; Robot; 2017-07-31; vol. 39, no. 4, pp. 565-584 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108537147B (en) | Gesture recognition method based on deep learning | |
CN108154118B (en) | A kind of target detection system and method based on adaptive combined filter and multistage detection | |
CN108133188B (en) | Behavior identification method based on motion history image and convolutional neural network | |
CN109460702B (en) | Passenger abnormal behavior identification method based on human body skeleton sequence | |
CN107609460B (en) | Human body behavior recognition method integrating space-time dual network flow and attention mechanism | |
CN111241931B (en) | Aerial unmanned aerial vehicle target identification and tracking method based on YOLOv3 | |
CN111079847B (en) | Remote sensing image automatic labeling method based on deep learning | |
CN111191583A (en) | Space target identification system and method based on convolutional neural network | |
CN110334656B (en) | Multi-source remote sensing image water body extraction method and device based on information source probability weighting | |
Yu et al. | Research of image main objects detection algorithm based on deep learning | |
CN113608663B (en) | Fingertip tracking method based on deep learning and K-curvature method | |
CN112906550B (en) | Static gesture recognition method based on watershed transformation | |
CN110728694A (en) | Long-term visual target tracking method based on continuous learning | |
Chen et al. | Research on moving object detection based on improved mixture Gaussian model | |
CN113312973A (en) | Method and system for extracting features of gesture recognition key points | |
CN113326735A (en) | Multi-mode small target detection method based on YOLOv5 | |
CN114548256A (en) | Small sample rare bird identification method based on comparative learning | |
Yang et al. | A Face Detection Method Based on Skin Color Model and Improved AdaBoost Algorithm. | |
CN111310827A (en) | Target area detection method based on double-stage convolution model | |
CN111695507B (en) | Static gesture recognition method based on improved VGGNet network and PCA | |
CN112581502A (en) | Target tracking method based on twin network | |
CN112487927B (en) | Method and system for realizing indoor scene recognition based on object associated attention | |
CN113591607B (en) | Station intelligent epidemic situation prevention and control system and method | |
Raj et al. | Deep manifold clustering based optimal pseudo pose representation (dmc-oppr) for unsupervised person re-identification | |
Yamashita et al. | Facial point detection using convolutional neural network transferred from a heterogeneous task |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||