CN109271840A - A kind of video gesture classification method - Google Patents
- Publication number
- CN109271840A (application CN201810826221.2A)
- Authority
- CN
- China
- Prior art keywords
- video
- visible light
- saliency
- feature
- fusion
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/28—Recognition of hand or arm movements, e.g. recognition of deaf sign language
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Psychiatry (AREA)
- Social Psychology (AREA)
- Human Computer Interaction (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
The present invention relates to a video gesture classification method. The method comprises: normalizing the visible-light video and depth video collected from the same target; computing the saliency of each frame of the visible-light video to obtain a saliency video; using the normalized visible-light, depth, and saliency videos to train a three-dimensional convolutional neural network comprising, in sequence, an input layer, 18 convolutional layers, a global average pooling layer, and a fully connected layer; fusing the features of the spatial-dimension global average pooling layer for the visible-light, depth, and saliency videos to obtain a fusion feature; and classifying the video gesture according to the fusion feature. The invention targets large-scale video-based gesture recognition, with far more categories and far more video data than prior inventions, and can recognize videos in real time with greater reliability and practicality.
Description
Technical field
The invention belongs to the field of applied deep learning research, and more particularly relates to a dynamic gesture classification method based on three-dimensional convolutional neural networks. It can be applied wherever gesture recognition is involved, such as human-computer interaction, intelligent driving, smart wearables, and gaming entertainment.
Background technique
A gesture is essentially a language system. Gestures can be divided into static and dynamic gestures: static gestures are mainly characterized by factors such as the shape, contour, and center of gravity of the hand, whereas dynamic gestures record the motion trajectory of the hand and the direction in which it is waved. Gesture recognition techniques are correspondingly divided into static and dynamic recognition. Static gesture recognition mainly covers the orientation, shape, and texture of the hand; it expresses limited meaning and suits simple recognition scenes. Dynamic gesture recognition mainly tracks and identifies the motion process and trajectory of the gesture; it suits changing scenes and irregular motion, and can express richer and more precise meaning.
Early gesture recognition research was based on dedicated data gloves: a computer obtains information about the hand through the glove's sensors. In the patent "An intelligent sign language recognition device based on data gloves", Luo Ying, Cao Jinling et al. perceive the motion and amplitude of each hand joint through sensors in a data glove in order to recognize the sign language of deaf-mute users and thereby help them communicate with hearing people in daily life. In the patent "A device for rapid recognition of small gestures", Zhou Yanming, Huang Changzheng et al. achieve rapid recognition of small motions through a microcontroller, data acquisition module, Bluetooth wireless communication module, and haptic feedback module arranged on a clamping device, with the haptic feedback module providing real-time feedback of corresponding actions appropriate to different scenes. However, data-glove-based gesture recognition places high demands on hardware; the gloves are expensive and complicated to use, which hampers their adoption. In the patent "A static gesture recognition method based on finger contours and decision trees", Lu Tong, Hu Wei et al. segment a depth image to obtain the hand contour, compute the distances between fingers as a feature vector, and train and classify that vector with a decision tree to obtain the meaning of the gesture. This method, however, rests on two-dimensional static gestures and cannot be used in true three-dimensional scenes. In the thesis "Research on dynamic gesture recognition" (Dalian University of Technology, 2013), Wang Peng proposes an online visual tracking algorithm with adaptivity and re-capture ability: the tracker is partly based on a particle filter framework with an online-learned support vector machine as the observation model, while the detector uses histogram of oriented gradients (HOG) descriptors as sample features. This method achieves preliminary dynamic gesture recognition, but the dataset used for testing contains only 10 classes of specific gestures.
In conclusion although existing Gesture Recognition achieves certain achievement, but mostly there is certain limitation
Property, due to ignoring the complementarity of different modalities data, to the extensive identification based on video, that there are still accuracy of identification is low, vulnerable to back
The problems such as scape interferes also requires further study in terms of the dynamic hand gesture recognition in real scene based on space-time characteristic.
Summary of the invention
The present invention addresses the limitations of existing gesture recognition techniques in true three-dimensional scenes, in particular the low recognition rate of dynamic gesture recognition on large-scale classification problems, and provides a video gesture classification method.
The video gesture classification method of the invention comprises:
Step 1: normalize the visible-light video and depth video collected from the same target, so that after normalization they have identical height, width, and length;
Step 2: compute the saliency of each frame of the visible-light video to obtain a saliency video;
Step 3: use the normalized visible-light, depth, and saliency videos to train a three-dimensional convolutional neural network comprising, in sequence, an input layer, 18 convolutional layers, a global average pooling layer, and a fully connected layer;
Step 4: fuse the features of the spatial-dimension global average pooling layer obtained in step 3 for the visible-light, depth, and saliency videos into a fusion feature, and classify the video gesture according to that fusion feature.
Further, the normalization of the invention comprises spatial normalization and temporal normalization. The spatial normalization makes the visible-light and depth videos have identical height and width by scaling or cropping. The temporal normalization adaptively takes the mode of all video frame counts after spatial normalization as the normalization standard: frames are removed from videos exceeding the standard, and videos below the standard are padded, so that all videos have the same number of frames.
Preferably, the feature fusion in step 4 of the invention uses mean fusion, concatenation fusion, or canonical correlation analysis (CCA) fusion.
Preferably, in step 4 of the invention the fusion feature is input into a support vector machine for classification.
Compared with the prior art, the invention has the following advantages:
First, the invention studies three-dimensional dynamic gesture recognition rather than two-dimensional static gesture recognition, and does not depend on special hardware such as data gloves; it is therefore closer to real-life scenes and easier to apply.
Second, the invention targets large-scale video-based gesture recognition, with far more categories and far more video data than prior inventions. The method can recognize videos in real time, with greater reliability and practicality.
Third, before gesture recognition the invention first applies spatio-temporal normalization to the data, giving the new samples high consistency in the temporal and spatial dimensions while reasonably retaining as many features of the gesture videos as possible.
Fourth, the invention uses saliency images to remove interference from gesture-irrelevant factors. In visible-light video, the background color, skin color, and the performer's clothing are all factors unrelated to the meaning of the gesture, yet they interfere with learning-based gesture classification. The saliency video used by the invention masks these irrelevant factors, letting the network focus on the motion of the gesture itself.
Fifth, the preferred canonical correlation analysis (CCA) fusion method of the invention learns the degree of association between the features of the same sample under different modalities and designs corresponding weights accordingly; through weighted fusion it exploits the complementarity of the different modalities to the fullest extent, obtaining a fusion feature that reflects the gesture more comprehensively.
Detailed description of the invention
Fig. 1 is the flow chart of the invention.
Fig. 2 is the network structure of the three-dimensional convolutional neural network.
Fig. 3 shows the effect of the saliency image; (a) is the RGB original and (b) is the saliency result.
Fig. 4 shows the feature extraction simulation results; (a) plots the accuracy and loss curves of the three features and (b) the recognition accuracy of each single feature.
Fig. 5 shows the feature fusion simulation results.
Specific embodiment
The visible-light video and depth video of the same target collection refer to the visible-light and depth videos obtained by recording the same target, where visible-light video and depth video are understood in their conventional sense in this field. They may also be obtained from video data stored on a computer hard disk, for example the visible-light video (RGB video) and depth video captured by a Kinect sensor.
The purpose of the normalization of the invention is to give the videos identical height, width, and length; any data normalization method that achieves this purpose is suitable for the invention.
The saliency video of the invention can be obtained according to saliency theory: compute the saliency value of each frame of the visible-light video, then concatenate the per-frame saliency images into a saliency video.
The three-dimensional convolutional neural network of the invention can process information in the temporal and spatial dimensions to obtain spatio-temporal gesture features. One specific network uses 64, 64, 64, 64, 64, 128, 128, 128, 128, 256, 256, 256, 256, 512, 512, 512, 512 convolution kernels in its successive layers, followed by an additional convolutional layer of 1024 kernels of size 1*1*1 that maps the features to a higher dimension. This specific network structure, shown in Fig. 2, comprises 18 convolutional layers, 1 global average pooling layer, and 1 fully connected layer. Most convolutional layers in the network use kernels of size 3*3*3. In this network structure, traditional pooling layers are replaced by convolutional layers of stride 2, which achieve the downsampling effect. After 8 residual units and the additional convolutional layer, a spatial-dimension global average pooling layer with a 7*7 pooling window is attached. Finally, a fully connected layer maps the input to 249 classes.
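The downsampling arithmetic of this architecture can be checked with a short sketch. The stride positions below are an assumption for illustration (stride 2 at the first convolution and at each channel-doubling convolution, in the ResNet-18 style the residual units suggest); the patent states only that stride-2 convolutional layers replace pooling:

```python
def conv_out(size, k=3, s=1, p=1):
    # One spatial dimension of a convolution: floor((size + 2p - k) / s) + 1
    return (size + 2 * p - k) // s + 1

# 17 convolutional layers from the channel list, plus the final 1*1*1 layer of 1024 kernels
channels = [64] * 5 + [128] * 4 + [256] * 4 + [512] * 4 + [1024]
kernels = [3] * 17 + [1]
# Assumed stride-2 positions: first conv and the first conv of each new channel stage
strides = [2, 1, 1, 1, 1] + [2, 1, 1, 1] * 3 + [1]

size = 112
for k, s in zip(kernels, strides):
    size = conv_out(size, k=k, s=s, p=k // 2)

print(len(channels), size)  # 18 conv layers; a 7*7 map feeds the global average pool
```

With these assumptions a 112*112 input is halved four times to the 7*7 map consumed by the spatial-dimension global average pooling layer.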
Embodiment:
The embodiment is carried out on the Chalearn LAP IsoGD dataset, proposed by Wan et al. in the CVPRW 2016 paper "Chalearn looking at people rgb-d isolated and continuous datasets for gesture recognition". The dataset contains 47933 gesture videos, each showing one gesture performed by one volunteer, and covers 249 gesture classes in total: 35878 videos form the training set, 5784 the validation set, and 6271 the test set.
Referring to Fig. 1, the video gesture classification method of the embodiment comprises:
Step 1: reading the video data
The video data are read with MATLAB software, and include the visible-light video (RGB video) and depth video captured by sensors such as Kinect.
Step 2: spatio-temporal normalization of the video data
2a) Spatial normalization of the video data. The invention normalizes the videos from step 1 in the spatial dimension by scaling and cropping, so that they have identical height and width.
One specific spatial normalization proceeds as follows:
scale each frame of the video down to 320*240 (width*height);
crop each frame to obtain a 112*112 video, where the top-left coordinate (i, j) of the crop is selected at random (i ≤ 208, j ≤ 128).
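A minimal NumPy sketch of this spatial normalization (the embodiment itself uses MATLAB; the nearest-neighbour resize and the function names here are illustrative stand-ins). The crop bounds follow directly from 320 − 112 = 208 and 240 − 112 = 128:

```python
import numpy as np

def resize_nn(frame, w, h):
    # Nearest-neighbour resize of an H*W frame to h*w
    ys = (np.arange(h) * frame.shape[0] // h).astype(int)
    xs = (np.arange(w) * frame.shape[1] // w).astype(int)
    return frame[ys][:, xs]

def spatial_normalize(frame, crop=112, rng=np.random.default_rng(0)):
    small = resize_nn(frame, 320, 240)        # scale to 320*240 (width*height)
    i = int(rng.integers(0, 320 - crop + 1))  # horizontal crop offset, i <= 208
    j = int(rng.integers(0, 240 - crop + 1))  # vertical crop offset, j <= 128
    return small[j:j + crop, i:i + crop]

out = spatial_normalize(np.zeros((480, 640)))
print(out.shape)  # (112, 112)
```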
2b) Temporal normalization of the video data. Based on statistics over the frame counts of all input videos, the embodiment adaptively takes the mode S of all frame counts after step 2a) as the normalization standard: videos exceeding the standard are subsampled, and videos below the standard are padded, so that all videos have the same number of frames.
One specific temporal normalization proceeds as follows:
For a video V whose frame count exceeds the standard, perform evenly spaced uniform sampling: keep one frame out of every m frames of the original video and delete the remaining m−1 frames, yielding a video with the standard frame count, where m is obtained by
m = ⌊total/s⌋ (1)
with s the standard frame count and total the actual frame count of the original video.
For a video whose frame count falls below the standard, compute the ratio of the standard frame count to the actual frame count:
ratio = ⌊s/total⌋ (2)
Then replicate each frame of the video ratio times, interpolating the copies after the frame's position. The difference between the resulting frame count and the standard is
diff = s − total·ratio (3)
If diff > 0, randomly select diff frames of the original total frames, replicate each once, and insert each copy after the position of its source frame. This completes the frame expansion/padding of videos whose frame count is below the standard.
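The temporal normalization above can be sketched as follows. Evenly spaced index selection stands in for the keep-one-in-m rule of equation (1), and duplicating the first diff frames one extra time is a deterministic stand-in for the patent's random selection:

```python
import numpy as np

def temporal_normalize(frames, s):
    total = len(frames)
    if total >= s:
        # Uniform subsampling down to the standard frame count s
        idx = np.linspace(0, total - 1, s).round().astype(int)
        return [frames[i] for i in idx]
    ratio, diff = s // total, s % total  # diff = s - total*ratio, as in eq. (3)
    # Replicate every frame `ratio` times; the first `diff` frames once more
    return [f for k, f in enumerate(frames) for _ in range(ratio + (k < diff))]

print(len(temporal_normalize(list(range(100)), 32)),
      len(temporal_normalize(list(range(10)), 32)))  # 32 32
```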
Step 3: generating the saliency video
The saliency video is generated from the visible-light video obtained in step 2, as follows:
[1] Apply Gaussian filtering to the image and take the Lab values of the pixels of the filtered image I_ωhc(x, y), where the luminance component L has value range [0, 100], the a component (green to red) has value range [−128, 127], and the b component (blue to yellow) has value range [−128, 127].
[2] Compute the mean of the image in the Lab space, I_μ = (L_μ, a_μ, b_μ), where L_μ is the mean luminance, a_μ the mean green-red component, and b_μ the mean blue-yellow component.
[3] Take the Euclidean distance between the mean and each filtered pixel to obtain the saliency map:
S(x, y) = ‖I_μ − I_ωhc(x, y)‖
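A NumPy sketch of this per-frame saliency computation, assuming the frame has already been converted to Lab; the 3*3 box blur is a simple stand-in for the Gaussian filtering:

```python
import numpy as np

def saliency_map(lab):
    h, w, _ = lab.shape
    # 3*3 box blur per channel (stand-in for the Gaussian filter)
    pad = np.pad(lab, ((1, 1), (1, 1), (0, 0)), mode='edge')
    blur = sum(pad[dy:dy + h, dx:dx + w]
               for dy in range(3) for dx in range(3)) / 9.0
    mu = blur.reshape(-1, 3).mean(axis=0)     # mean Lab vector I_mu
    return np.linalg.norm(blur - mu, axis=2)  # S(x, y) = ||I_mu - I(x, y)||

lab = np.zeros((8, 8, 3))
lab[4, 4] = [50.0, 40.0, 40.0]  # a single pixel differing from the background
s = saliency_map(lab)
print(s[4, 4] > s[0, 0])  # True: the differing pixel is more salient than the corner
```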
Step 4: training the three-dimensional convolutional neural network model
The visible-light and depth videos obtained in step 2 and the saliency video obtained in step 3 are fed into the three-dimensional convolutional neural network model of Fig. 2 for training; after 120000 iterations the corresponding network models are obtained.
Step 5: feature extraction
Using the models trained in step 4, features are extracted from the visible-light, depth, and saliency gesture videos respectively. Specifically, the features at the pool6 layer are extracted.
Step 6: feature fusion
The embodiment uses the method disclosed in the paper by Q.-S. Sun, S.-G. Zeng, Y. Liu, P.-A. Heng, and D.-S. Xia, "A new method of feature fusion and its application in image recognition", Pattern Recognition, to obtain weights for the different modality data from step 5 by canonical correlation analysis (CCA) and realize their weighted fusion. Concretely, the CCA method first fuses the visible-light features with the depth features to obtain an intermediate fusion feature vector, and then fuses this intermediate fusion feature with the saliency feature vector to obtain the final fusion vector.
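The two-view step of this fusion can be illustrated with a minimal CCA in NumPy. This is a sketch of the principle rather than the cited Pattern Recognition algorithm: it extracts the leading canonical pair and sums the two canonical variates weighted by their correlation ρ (the weighting scheme shown is an assumption):

```python
import numpy as np

def cca_fuse(X, Y, reg=1e-6):
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    n = len(X)
    Sxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / n

    def inv_sqrt(S):  # S^(-1/2) via eigendecomposition (S is symmetric PD)
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    U, sv, Vt = np.linalg.svd(inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy))
    rho = sv[0]                          # leading canonical correlation
    zx = Xc @ (inv_sqrt(Sxx) @ U[:, 0])  # canonical variate of view X
    zy = Yc @ (inv_sqrt(Syy) @ Vt[0])    # canonical variate of view Y
    return rho * zx + rho * zy, rho      # correlation-weighted sum fusion

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
Y = X @ rng.standard_normal((5, 4))  # Y is a linear function of X
fused, rho = cca_fuse(X, Y)
print(rho > 0.99)  # True: fully dependent views have canonical correlation near 1
```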
Step 7: classification
The invention classifies with an SVM classifier. In this specific embodiment, the SVM classifier of the MATLAB toolbox is used to classify the gesture videos. Since an SVM performs binary classification while the number of gesture classes is generally greater than 2, the invention trains as many binary classifiers as there are gesture classes to realize multi-class classification. Specifically, the fusion features of the training data obtained in step 6 are input to the SVMs for training, and the fusion features of the test data are then input to the trained SVMs for classification, yielding the classification results. The training and test data follow the division of the Chalearn LAP IsoGD dataset, consistent with the split used for the deep neural network above.
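The "as many binary classifiers as gesture classes" scheme can be sketched as a one-vs-rest wrapper. The binary scorer below is a regularized least-squares linear model, a lightweight stand-in for the MATLAB SVM classifier used in the embodiment:

```python
import numpy as np

class OneVsRest:
    """One binary scorer per class; prediction takes the most confident scorer.
    The scorer is a least-squares linear model, a stand-in for a binary SVM."""
    def fit(self, X, y, reg=1e-3):
        self.classes = np.unique(y)
        Xb = np.hstack([X, np.ones((len(X), 1))])            # bias column
        T = np.where(y[:, None] == self.classes, 1.0, -1.0)  # +1/-1 targets per class
        self.W = np.linalg.solve(Xb.T @ Xb + reg * np.eye(Xb.shape[1]), Xb.T @ T)
        return self

    def predict(self, X):
        Xb = np.hstack([X, np.ones((len(X), 1))])
        return self.classes[(Xb @ self.W).argmax(axis=1)]

# Three well-separated clusters stand in for per-class "fusion features"
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.1, (20, 2)) for c in ([0, 0], [5, 0], [0, 5])])
y = np.repeat([0, 1, 2], 20)
acc = (OneVsRest().fit(X, y).predict(X) == y).mean()
print(acc)  # 1.0 on this separable toy problem
```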
Simulation results of the embodiment:
Simulation 1: the saliency data obtained in steps 2 and 3. The simulation is carried out in MATLAB R2016a.
The saliency experiment is shown in Fig. 3: Fig. 3(a) is a visible-light sample and Fig. 3(b) its corresponding saliency sample. Fig. 3 shows that the saliency data eliminate the interference of gesture-irrelevant content in the visible-light data while reflecting the region of gesture variation, and are thus well complementary to the visible-light data.
Simulation 2: the feature extraction of steps 4 and 5. The simulation is carried out under the deep learning framework Caffe.
The feature extraction experiment is shown in Fig. 4: Fig. 4(a) plots the loss curves of the three modalities, reflecting how well the network model fits the data features (lower loss means a better fit); Fig. 4(b) shows the recognition results of each modality.
Figs. 4(a) and 4(b) show that the feature extraction method of the invention fits the different modalities well: after sufficient iterations the loss is low, while the recognition rates on the visible-light, depth, and saliency modalities reach 45.07%, 48.44%, and 43.21%, respectively.
Simulation 3: the feature fusion of step 6. The simulation is carried out in MATLAB R2016a.
The fusion experiment is shown in Fig. 5, which compares the fusion method of direct head-to-tail concatenation with canonical correlation analysis (CCA) fusion.
Fig. 5 shows that the algorithm of the invention achieves a good fusion effect: the accuracy of CCA fusion reaches 61.98%, higher than both the direct concatenation fusion method and the single-feature recognition results.
Claims (4)
1. A video gesture classification method, characterized in that the method comprises:
Step 1: normalizing the visible-light video and depth video collected from the same target, so that after normalization they have identical height, width, and length;
Step 2: computing the saliency of each frame of the visible-light video to obtain a saliency video;
Step 3: using the normalized visible-light, depth, and saliency videos to train a three-dimensional convolutional neural network comprising, in sequence, an input layer, 18 convolutional layers, a global average pooling layer, and a fully connected layer;
Step 4: fusing the features of the spatial-dimension global average pooling layer obtained in step 3 for the visible-light, depth, and saliency videos into a fusion feature, and classifying the video gesture according to the obtained fusion feature.
2. The video gesture classification method of claim 1, characterized in that the normalization comprises spatial normalization and temporal normalization; the spatial normalization makes the visible-light and depth videos have identical height and width by scaling or cropping; the temporal normalization adaptively takes the mode of all video frame counts after spatial normalization as the normalization standard, removes frames from videos exceeding the standard, and pads videos below the standard, so that all videos have the same number of frames.
3. The video gesture classification method of claim 1, characterized in that the feature fusion in step 4 uses mean fusion, concatenation fusion, or canonical correlation analysis fusion.
4. The video gesture classification method of claim 1, characterized in that in step 4 the fusion feature is input into a support vector machine for classification.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810826221.2A CN109271840A (en) | 2018-07-25 | 2018-07-25 | A kind of video gesture classification method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810826221.2A CN109271840A (en) | 2018-07-25 | 2018-07-25 | A kind of video gesture classification method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109271840A true CN109271840A (en) | 2019-01-25 |
Family
ID=65148327
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810826221.2A Pending CN109271840A (en) | 2018-07-25 | 2018-07-25 | A kind of video gesture classification method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109271840A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110309761A (*) | 2019-06-26 | 2019-10-08 | Shenzhen Institute of Micro-Nano Integrated Circuits and System Applications | Continuous gesture recognition method based on a three-dimensional convolutional neural network with gated recurrent units |
CN110309726A (*) | 2019-06-10 | 2019-10-08 | University of Jinan | A micro-gesture recognition method |
CN110363093A (*) | 2019-06-19 | 2019-10-22 | Shenzhen University | A driver action recognition method and device |
CN111476158A (*) | 2020-04-07 | 2020-07-31 | Jinling Institute of Technology | Multi-channel physiological signal somatosensory gesture recognition method based on PSO-PCA-SVM |
CN112818936A (*) | 2021-03-02 | 2021-05-18 | Chengdu Shihaixintu Microelectronics Co., Ltd. | Rapid recognition and classification method and system for continuous gestures |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103455794A (*) | 2013-08-23 | 2013-12-18 | University of Jinan | Dynamic gesture recognition method based on frame fusion technology |
CN103903238A (*) | 2014-03-21 | 2014-07-02 | Xi'an University of Technology | Method for fusing the salient structure and relevant structure of image features |
CN104134061A (*) | 2014-08-15 | 2014-11-05 | University of Shanghai for Science and Technology | Number gesture recognition method for a support vector machine based on feature fusion |
CN106599789A (*) | 2016-07-29 | 2017-04-26 | Beijing SenseTime Technology Development Co., Ltd. | Video class identification method and device, data processing device and electronic device |
2018
- 2018-07-25: CN application CN201810826221.2A filed; published as CN109271840A (status: Pending)
Non-Patent Citations (3)
Title |
---|
QIGUANG MIAO et al.: ""Multimodal Gesture Recognition Based on the ResC3D Network"", 《2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS》 * |
YUNAN LI et al.: ""Large-Scale Gesture Recognition With a Fusion of RGB-D Data Based on Saliency Theory and C3D Model"", 《IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY》 * |
YANG HONGLING et al.: ""Gesture Recognition Based on Convolutional Neural Networks"", 《COMPUTER TECHNOLOGY AND DEVELOPMENT》 * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110309726A (*) | 2019-06-10 | 2019-10-08 | University of Jinan | A micro-gesture recognition method |
CN110309726B (*) | 2019-06-10 | 2022-09-13 | University of Jinan | Micro-gesture recognition method |
CN110363093A (*) | 2019-06-19 | 2019-10-22 | Shenzhen University | A driver action recognition method and device |
CN110309761A (*) | 2019-06-26 | 2019-10-08 | Shenzhen Institute of Micro-Nano Integrated Circuits and System Applications | Continuous gesture recognition method based on a three-dimensional convolutional neural network with gated recurrent units |
CN111476158A (*) | 2020-04-07 | 2020-07-31 | Jinling Institute of Technology | Multi-channel physiological signal somatosensory gesture recognition method based on PSO-PCA-SVM |
CN112818936A (*) | 2021-03-02 | 2021-05-18 | Chengdu Shihaixintu Microelectronics Co., Ltd. | Rapid recognition and classification method and system for continuous gestures |
CN112818936B (*) | 2021-03-02 | 2022-12-09 | Chengdu Shihaixintu Microelectronics Co., Ltd. | Rapid recognition and classification method and system for continuous gestures |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109271840A (en) | A kind of video gesture classification method | |
CN107748873B (en) | A kind of multimodal method for tracking target merging background information | |
CN110929578B (en) | Anti-shielding pedestrian detection method based on attention mechanism | |
CA2934514C (en) | System and method for identifying faces in unconstrained media | |
CN104834922B (en) | Gesture identification method based on hybrid neural networks | |
CN104598915B (en) | A kind of gesture identification method and device | |
CN103824089B (en) | Cascade regression-based face 3D pose recognition method | |
CN103810503B (en) | Depth study based method for detecting salient regions in natural image | |
CN102324025B (en) | Human face detection and tracking method based on Gaussian skin color model and feature analysis | |
CN104966085B (en) | A kind of remote sensing images region of interest area detecting method based on the fusion of more notable features | |
CN108038466B (en) | Multi-channel human eye closure recognition method based on convolutional neural network | |
CN104835175B (en) | Object detection method in a kind of nuclear environment of view-based access control model attention mechanism | |
CN108230338B (en) | Stereo image segmentation method based on convolutional neural network | |
CN103020992B (en) | A kind of video image conspicuousness detection method based on motion color-associations | |
Wang et al. | Multifocus image fusion using convolutional neural networks in the discrete wavelet transform domain | |
CN103714181B (en) | A kind of hierarchical particular persons search method | |
CN105739702A (en) | Multi-posture fingertip tracking method for natural man-machine interaction | |
CN105550678A (en) | Human body motion feature extraction method based on global remarkable edge area | |
CN106778474A (en) | 3D human body recognition methods and equipment | |
CN108647625A (en) | A kind of expression recognition method and device | |
CN108197534A (en) | A kind of head part's attitude detecting method, electronic equipment and storage medium | |
CN113269089B (en) | Real-time gesture recognition method and system based on deep learning | |
CN103618918A (en) | Method and device for controlling display of smart television | |
CN104063871B (en) | The image sequence Scene Segmentation of wearable device | |
CN109190559A (en) | A kind of gesture identification method, gesture identifying device and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20190125 |