CN112307892A - Hand motion recognition method based on first-person-view RGB-D data - Google Patents

Hand motion recognition method based on first-person-view RGB-D data

Info

Publication number
CN112307892A
CN112307892A
Authority
CN
China
Prior art keywords
information
rgb
data
network
optical flow
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011018265.6A
Other languages
Chinese (zh)
Inventor
杨谦
许屹
郑星
华晓
严伟雄
张晓�
汪勇
周伟红
许潜航
杨永峰
黄炎阶
段凌霄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Quzhou Power Supply Co of State Grid Zhejiang Electric Power Co Ltd
Original Assignee
Quzhou Power Supply Co of State Grid Zhejiang Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Quzhou Power Supply Co of State Grid Zhejiang Electric Power Co Ltd
Priority to CN202011018265.6A
Publication of CN112307892A
Pending legal-status Current

Classifications

    • G06V 40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches, based on parametric or probabilistic models
    • G06F 18/253 Fusion techniques of extracted features
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06T 7/194 Segmentation; Edge detection involving foreground-background segmentation
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/269 Analysis of motion using gradient-based methods
    • G06T 7/38 Registration of image sequences
    • G06T 2207/10016 Video; Image sequence
    • G06T 2207/10024 Color image
    • G06T 2207/10028 Range image; Depth image; 3D point clouds
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Biomedical Technology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a hand motion recognition method based on first-person-view RGB-D data, which comprises the following steps: a wearable RGB-D sensor collects a plurality of video clips; the data collected by the RGB-D sensor are preprocessed and augmented, and corresponding labels are produced to form a data set; after the different actions are normalized to a uniform size, spatial information is extracted from the RGB image sequence; the optical flow between adjacent frames of the RGB image sequence is computed to obtain a corresponding optical-flow image sequence, and temporal information is extracted from the optical-flow images with a Resnet network; structural information is extracted from the depth image sequence with an attention-based method; finally, for the features extracted from the three data streams, a multi-modal learning network extracts their common information and modality-specific information for training, and the two are fused to recognize the action. The method fully combines the information of the RGB video and the depth video, and achieves better robustness and higher recognition accuracy.

Description

Hand motion recognition method based on first-person-view RGB-D data
Technical Field
The invention relates to the technical field of behavior recognition in computer vision, and in particular to a hand motion recognition method based on first-person-view RGB-D data.
Background
Traditional third-person-view video images distant targets at low resolution, is easily disturbed by factors such as occlusion and illumination, and can hardly support subsequent visual tasks such as recognition and tracking. From the first-person view, high-resolution RGB-D video can be acquired, the wearer naturally moves to find the viewing angle with the least occlusion, and the line of sight stays concentrated around the hands; these characteristics lay the foundation for high-precision action recognition from the data. Existing hand motion recognition methods tend to recognize the motion characteristics of the hand and of the manipulated object jointly. Using deep convolutional neural networks (CNN), Minghuang Ma et al. proposed a first-person two-stream network framework in which one sub-network analyzes the appearance information of the hands and the manipulated object while the other analyzes the motion information of the operator's head and hands, so that object attributes and hand motion characteristics can be acquired simultaneously. Suriya Singh et al. proposed a three-stream network framework for first-person action recognition in which the first network extracts the motion information of the operator's hands and head while the second and third networks extract the spatial and temporal information in the images, respectively. Guillermo Garcia-Hernando et al. studied first-person hand action recognition and collected more than 100,000 frames of RGB-D video sequences covering 45 daily action categories and 26 different objects; RGB-D action recognition and 3D pose estimation are both relatively new fields, and that work was an early attempt to study them jointly. Most existing action recognition methods are based on the third-person view, and first-person methods are relatively few. Moreover, the existing first-person methods process RGB data or skeleton data, but high-precision skeleton data are difficult to obtain in real scenes, and skeleton data of insufficient precision seriously affect the subsequent recognition results. Few existing methods combine RGB data with depth data, and the recognition accuracy and robustness of the other methods still need to be improved.
Disclosure of Invention
The invention addresses the problem that high-precision skeleton data are difficult to obtain in real scenes, which seriously affects subsequent recognition results, and provides a hand motion recognition method based on first-person-view RGB-D data. For RGB-D data, the method fully combines the information of the RGB video and the depth video; based on the first-person view, it overcomes the low-resolution and occlusion problems of traditional third-person video, and achieves better robustness and higher recognition accuracy.
To achieve this purpose, the following technical solution is provided:
a hand motion recognition method based on first visual angle RGB-D data comprises the following steps:
Step 1: wear an RGB-D sensor to collect a plurality of video clips, including RGB video clips and depth video clips; convert the RGB video clips and the depth video clips into single-frame RGB image sequences and single-frame depth image sequences, and register the RGB image sequence with the depth image sequence;
Step 2: preprocess the data collected by the RGB-D sensor, augment the data, and produce corresponding labels to form a data set;
Step 3: after normalizing the different actions to a uniform size, extract spatial information from the RGB image sequence; extract features of the image sequence with an attention-based method, and extract temporal information of the RGB images with an LSTM network;
Step 4: compute the optical flow between adjacent frames of the RGB image sequence to obtain a corresponding optical-flow image sequence, and extract temporal information of the optical-flow images with a Resnet network;
Step 5: extract structural information of the depth image sequence with an attention-based method;
Step 6: for the features extracted from the three data streams, use a multi-modal learning network to extract their common information and modality-specific information respectively for training, and finally fuse the common information and the specific information to recognize the action.
Hand motion recognition based on visible light alone is easily affected by illumination and background changes. The invention makes full use of the depth information provided by the RGB-D camera, which is mainly used to separate the foreground from the background and is not affected by illumination; adding the spatial features of the RGB data to the structural features of the depth information improves the accuracy of motion recognition under complex conditions. When processing the RGB image sequence, the invention adopts an attention mechanism that uses the prior information of a pre-trained CNN object-recognition encoder to obtain probability maps of different regions, which are weighted and fused with the output of the feature-extraction network, so that training focuses on the region around the object manipulated by the hand. The invention realizes the fusion of multi-modal features; compared with methods that use only spatial or only temporal features, combining the spatial information of the RGB images, the temporal information of the optical-flow images and the structural information of the depth images achieves information complementarity, high accuracy and strong robustness.
Preferably, step 3 specifically includes:

pre-training a CNN and using the prior information in its object-recognition encoding to obtain the weights of different regions; following class activation mapping (CAM), in the last convolutional layer of the feature-extraction CNN, the activation value of unit l at spatial position i is defined as f_l(i), and w_l^c is the weight corresponding to class c in unit l, so the CAM can be expressed as

M_c(i) = Σ_l w_l^c f_l(i)

with Resnet-34 as the backbone network, the CAM is computed for each RGB frame and a softmax is applied over the spatial dimension to convert the CAM into a probability map; the obtained attention heat map is fused with the output map of the last convolutional layer to obtain a weighted feature map, and the feature map of each frame is finally fed into an LSTM network to extract the temporal information.
When processing the depth images, the attention mechanism is built into the LSTM and the output gate is modified, which makes the extraction of the attention map across consecutive frames smoother; the improved output gate of the recurrent unit not only influences the overall prediction but also controls the recursion, which helps to smooth and track the latent memory state of the depth image sequence and clearly improves the extraction of structural information from the depth data.
Preferably, step 4 specifically includes:

computing the optical flow between two adjacent frames of the RGB video with the TV-L1 algorithm, the optical flow giving the velocity vector of every pixel in the image; from the small-motion and constant-brightness assumptions, I(x, y, t) = I(x + dx, y + dy, t + dt), and a first-order Taylor expansion gives

I(x + dx, y + dy, t + dt) ≈ I(x, y, t) + (∂I/∂x)dx + (∂I/∂y)dy + (∂I/∂t)dt

letting u = dx/dt and v = dy/dt, under the assumption

(∂I/∂x)u + (∂I/∂y)v + ∂I/∂t = 0

the optical flow is solved with the least-squares method; the optical flow of adjacent frames is extracted over the whole video frame, and irrelevant noise motion is then removed; after the optical-flow images are obtained, temporal information is extracted from the optical-flow image sequence: 5 flow maps are stacked together as an optical-flow stack and fed into a Resnet network to extract the temporal information of the image sequence.
Preferably, step 5 specifically includes: modifying the LSTM network structure. In the input part, a pooling operation is applied to the feature X_t to obtain a corresponding value υ_a, which is passed to the RNN part; combined with a_{t-1} and s_{t-1} of the previous frame, a_t and s_t are obtained. From a_t and s_t combined with υ_a, the attention heat map s of the current frame is obtained through a softmax function, and s is fused with X_t to obtain the extracted feature map. Combined with c_{t-1} and o_{t-1} of the previous frame, c_t and o_t of the current frame are obtained. A pooling operation υ_c couples the input-gate and output-gate parts, and finally υ_c ⊙ c_t is taken as the output of the network. Here X_t is the input feature, a_t is the memory state and s_t the output state of the RNN part, c_t is the memory state and o_t the output state of the LSTM part, and υ_a and υ_c are mutually coupled pooling operations.
Preferably, step 6 specifically includes:

the features extracted by each network are denoted X = {X_1, X_2, ..., X_K}, where X_i is the feature of the i-th modality and K is the total number of modalities; the fusion function is defined as X → h(X); two intermediate features f_i(X_i) and g(X) are introduced, where g(X) contains the features common to the different modalities;

the relationship between the feature X_i and the feature function g_i(X_i) is

g_i(X_i) = F(W_i X_i + b_i)

where F is a nonlinear function and W_i and b_i are the weight matrix and the bias matrix, respectively;

the correlation between the different data sources is computed with a Cauchy estimator Φ_s(g_i(X_i), g_j(X_j));

the modality-specific features f_i(X_i) of the different modalities are defined in the same way as g_i(X_i):

f_i(X_i) = F(W'_i X_i + b'_i)

orthogonality constraints are adopted to compute the specific information of the different data streams, and these two parts are added, with weights, to the original multi-class cross-entropy function to form the loss function of the whole network:

Φ_d(f_i(X_i), f_j(X_j)) = |f_i(X_i) ⊙ f_j(X_j)|
Φ_d(f_i(X_i), g_i(X_i)) = |f_i(X_i) ⊙ g_i(X_i)|

different weights are assigned to the two intermediate features, and the fused feature is finally obtained by weighted fusion; the fusion function is

h(X) = α_1 f_1(X_1) + α_2 f_2(X_2) + ... + α_K f_K(X_K) + β g(X)

0 ≤ α_1, α_2, ..., α_K, β ≤ 1

where the hyper-parameters α and β correspond to the weights of the intermediate features.
Preferably, the weight ratio of the common features to the modality-specific features is 4:1.
Preferably, within the modality-specific features, the RGB data, the optical-flow data and the depth data are weighted and fused with weights 4:4:2, and the action label is predicted from the weighted-fused information to obtain the recognition result.
The beneficial effects of the invention are as follows:
1. Hand motion recognition based on visible light alone is easily affected by illumination and background changes. The invention makes full use of the depth information provided by the RGB-D camera, which is mainly used to separate the foreground from the background and is not affected by illumination. Adding the spatial features of the RGB data to the structural features of the depth information improves the accuracy of motion recognition under complex conditions.
2. When processing the RGB image sequence, the invention adopts an attention mechanism that uses the prior information of a pre-trained CNN object-recognition encoder to obtain probability maps of different regions, which are weighted and fused with the output of the feature-extraction network, so that training focuses on the region around the object manipulated by the hand.
3. When processing the depth images, the attention mechanism is built into the LSTM and the output gate is modified, which makes the extraction of the attention map across consecutive frames smoother; the improved output gate of the recurrent unit not only influences the overall prediction but also controls the recursion, which helps to smooth and track the latent memory state of the depth image sequence and clearly improves the extraction of structural information from the depth data.
4. The invention realizes the fusion of multi-modal features. Compared with methods that use only spatial or only temporal features, combining the spatial information of the RGB images, the temporal information of the optical-flow images and the structural information of the depth images achieves information complementarity, high accuracy and strong robustness.
Drawings
Fig. 1 is a general flow chart of the present invention.
Fig. 2 is a schematic diagram of a feature extraction network for RGB data.
Fig. 3 is a schematic diagram of a feature extraction network for depth data.
Fig. 4 is a schematic diagram of a multimodal learning network.
Detailed Description
Embodiment:
This embodiment provides a hand motion recognition method based on first-person-view RGB-D data, comprising the following steps:
Step 1: wear an RGB-D sensor to collect a plurality of video clips, including RGB video clips and depth video clips; convert the RGB video clips and the depth video clips into single-frame RGB image sequences and single-frame depth image sequences, and register the RGB image sequence with the depth image sequence;
Step 2: preprocess the data collected by the RGB-D sensor, augment the data, and produce corresponding labels to form a data set;
Step 3: after normalizing the different actions to a uniform size, extract spatial information from the RGB image sequence; extract features of the image sequence with an attention-based method, and extract temporal information of the RGB images with an LSTM network;
the step 3 specifically comprises the following steps:
pre-training prior information in object identification codes by using a CNN network to obtain weights of different areas; referring to CAM class activation mapping, in the last convolution layer of the CNN network for feature extraction, the activation value of a unit l at a spatial position i is defined as fl(i),
Figure BDA0002699823830000041
For the weight corresponding to class c in cell l, CAM can be expressed as
Figure BDA0002699823830000042
Based on Resnet-34 as a backbone network, performing CAM calculation on each frame of RGB image frame, performing softmax operation on spatial dimension, converting CAM into a probability map, fusing the obtained attribute map attention heat map and the last layer of convolutional layer output map to obtain a weighted feature map, and finally inputting the feature map of each frame into an LSTM network and extracting timing information.
Step 4: compute the optical flow between adjacent frames of the RGB image sequence to obtain a corresponding optical-flow image sequence, and extract temporal information of the optical-flow images with a Resnet network;
the step 4 specifically comprises the following steps:
calculating optical flow between two adjacent frames of the RGB video by using a TVL1 algorithm, finding a velocity vector of each pixel point in an image by the optical flow, obtaining I (x, y, t) ═ I (x + dx, y + dy, t + dt) according to the tiny motion of the optical flow and the assumption that the brightness is constant, and expanding by using a dielectric Taylor as the following formula
Figure BDA0002699823830000043
Order to
Figure BDA0002699823830000044
In that
Figure BDA0002699823830000045
Under the assumption that the optical flow is solved by using a least square method; extracting optical flows of adjacent frames on the whole picture frame of the video, and then removing irrelevant noise actions; after obtaining the optical flow images, time series information is extracted for the optical flow image sequence, 5 optical flow graphs are overlapped together in the form of an optical flow stack, and the optical flow graphs are input into a Resnet network to extract the time series information of the image sequence.
Step 5: extract structural information of the depth image sequence with an attention-based method;
the step 5 specifically comprises the following steps: modifying the LSTM network structure, in the input part, to the feature XtTo carry out
Figure BDA00026998238300000411
Pooling operation to obtain a corresponding value upsilonaIt is delivered to RNN network, and a of the previous frame is combinedt-1,st-1Can obtain at,st(ii) a According to at,stBinds to upsilonaObtaining an attention map heat map s of the frame through a softmax function, and adding s to XtFusing to obtain an extracted characteristic diagram; combining c of previous framet-1,ot-1C of this frame can be obtainedt,ot(ii) a By passing
Figure BDA00026998238300000412
A pooling operation coupling the input gate and output gate sections, and finally upsilonc⊙ctAs an output of the network; wherein XtIs an input feature, atIs a token in RNN networksMemory state, stIs an output state in the RNN network, ctIs the memory state in the LSTM network, otIs the output state, upsilon, in the LSTM networkaAnd upsiloncAre mutually coupled pooling operations.
Step 6: for the features extracted from the three data streams, use a multi-modal learning network to extract their common information and modality-specific information respectively for training, and finally fuse the common information and the specific information to recognize the action.
Step 6 specifically comprises the following:

the features extracted by each network are denoted X = {X_1, X_2, ..., X_K}, where X_i is the feature of the i-th modality and K is the total number of modalities; the fusion function is defined as X → h(X); two intermediate features f_i(X_i) and g(X) are introduced, where g(X) contains the features common to the different modalities;

the relationship between the feature X_i and the feature function g_i(X_i) is

g_i(X_i) = F(W_i X_i + b_i)

where F is a nonlinear function and W_i and b_i are the weight matrix and the bias matrix, respectively;

the correlation between the different data sources is computed with a Cauchy estimator Φ_s(g_i(X_i), g_j(X_j));

the modality-specific features f_i(X_i) of the different modalities are defined in the same way as g_i(X_i):

f_i(X_i) = F(W'_i X_i + b'_i)

orthogonality constraints are adopted to compute the specific information of the different data streams, and these two parts are added, with weights, to the original multi-class cross-entropy function to form the loss function of the whole network:

Φ_d(f_i(X_i), f_j(X_j)) = |f_i(X_i) ⊙ f_j(X_j)|
Φ_d(f_i(X_i), g_i(X_i)) = |f_i(X_i) ⊙ g_i(X_i)|

different weights are assigned to the two intermediate features, and the fused feature is finally obtained by weighted fusion:

h(X) = α_1 f_1(X_1) + α_2 f_2(X_2) + ... + α_K f_K(X_K) + β g(X)

0 ≤ α_1, α_2, ..., α_K, β ≤ 1

where the hyper-parameters α and β correspond to the weights of the intermediate features. The weight ratio of the common features to the modality-specific features is 4:1. Within the modality-specific features, the RGB data, the optical-flow data and the depth data are weighted and fused with weights 4:4:2, and the action label is predicted from the weighted-fused information to obtain the recognition result.
The following detailed description of the embodiments of the present invention will be made with reference to the accompanying drawings and specific examples, which are provided for illustration of the present invention and are not intended to limit the scope of the present invention.
The method preprocesses the collected RGB-D data, extracts the spatial information of the RGB images, the temporal information of the optical-flow images and the structural information of the depth images, learns the common features and the modality-specific features of the different modalities from the extracted feature information, and finally fuses the feature information to predict the category of the action in the video. Referring to Fig. 1, the method specifically includes the following steps:
Step 1: wear an RGB-D sensor to collect a plurality of video segments, and register the RGB data with the corresponding depth data.
The acquisition system adopts a hardware architecture consisting of a CPU, a ToF depth sensor and an RGB image acquisition device. In this architecture, the CPU is responsible for initializing the system, managing and configuring the ToF sensor and the RGB image acquisition device, processing the depth phase data into a depth image, and registering the depth image with the RGB image. The ToF depth sensor acquires the depth phase data of the scene, and the RGB image acquisition device acquires the RGB visible-light images of the scene.
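For illustration, a minimal sketch of the depth-to-RGB registration performed by the CPU is given below; it assumes both cameras share the same resolution, and the intrinsic matrices, the extrinsic transform and the function name are illustrative placeholders rather than values specified by the method.

import numpy as np

def register_depth_to_rgb(depth, K_d, K_rgb, R, t):
    """Reproject a ToF depth map into the RGB camera frame.

    depth: H x W depth in meters (ToF camera frame)
    K_d, K_rgb: 3x3 intrinsic matrices of the depth and RGB cameras
    R, t: rotation (3x3) and translation (3,) from the depth camera to the RGB camera
    Returns a depth map aligned with the RGB image grid (same resolution assumed).
    """
    h, w = depth.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([us, vs, np.ones_like(us)], axis=-1).reshape(-1, 3).T   # 3 x N homogeneous pixels
    # Back-project depth pixels to 3-D points in the ToF camera frame.
    pts = np.linalg.inv(K_d) @ pix * depth.reshape(1, -1)
    # Transform into the RGB camera frame and project with the RGB intrinsics.
    pts_rgb = R @ pts + t.reshape(3, 1)
    proj = K_rgb @ pts_rgb
    z = proj[2]
    valid = z > 1e-6
    u = np.zeros(z.shape, dtype=int)
    v = np.zeros(z.shape, dtype=int)
    u[valid] = np.round(proj[0, valid] / z[valid]).astype(int)
    v[valid] = np.round(proj[1, valid] / z[valid]).astype(int)
    valid &= (u >= 0) & (u < w) & (v >= 0) & (v < h)
    registered = np.zeros_like(depth)
    registered[v[valid], u[valid]] = z[valid]
    return registered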
Step 2: preprocess the sampled RGB-D data, augment the data, and manually annotate corresponding labels to form a data set.
Because the acquisition system is worn on the head, the captured video shakes severely with the movement of the person, the swinging of the head and the shift of gaze. For high-precision recognition, the captured raw video is therefore stabilized and the converted images are denoised, which provides a foundation for subsequent high-precision recognition.
In the power industry, relevant video data are relatively scarce, while training an effective model requires a large data set, so the obtained RGB data and depth data need to be augmented. The existing data are processed, for example by flipping, translation or rotation, to create more data, which gives the model trained by the network stronger generalization ability.
Each video segment is labeled, recording the type of the action and the starting and ending frame numbers of the action sequence.
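As an illustration of the augmentation described above, the following sketch applies the same random flip, translation and rotation to a registered RGB-D pair; the helper name, parameter ranges and array conventions are assumptions made for the example.

import numpy as np

def augment_rgbd_pair(rgb, depth, rng):
    """Apply the same random flip/translation/rotation to an RGB frame and its registered depth map.

    rgb:   H x W x 3 uint8 array
    depth: H x W float32 array (registered to the RGB frame)
    rng:   numpy.random.Generator
    """
    # Horizontal flip with probability 0.5 (applied jointly so registration is preserved).
    if rng.random() < 0.5:
        rgb, depth = rgb[:, ::-1], depth[:, ::-1]

    # Random translation by up to +/- 10 pixels.
    dx, dy = rng.integers(-10, 11, size=2)
    rgb = np.roll(rgb, (dy, dx), axis=(0, 1))
    depth = np.roll(depth, (dy, dx), axis=(0, 1))

    # Random rotation by a multiple of 90 degrees (k = 0..3).
    k = int(rng.integers(0, 4))
    rgb = np.rot90(rgb, k, axes=(0, 1))
    depth = np.rot90(depth, k, axes=(0, 1))
    return np.ascontiguousarray(rgb), np.ascontiguousarray(depth)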
Step 3: after normalizing the different actions to a uniform size, spatial information is extracted from the RGB image sequence. As shown in Fig. 2, an attention-based method is adopted: the prior information in the object-recognition encoding of a pre-trained CNN is used to obtain the weights of different regions. The invention uses CAM (class activation mapping): in the last convolutional layer of the feature-extraction CNN, the activation value of unit l at spatial position i is defined as f_l(i), and w_l^c is the weight corresponding to class c in unit l, so the CAM can be represented by formula (1):

M_c(i) = Σ_l w_l^c f_l(i)    (1)

The method extracts the highest-scoring class in the image region, and the image generated by the CAM represents a saliency map of the image, so the network can be trained with emphasis on the region near the manipulated object. With Resnet-34 as the backbone network, the CAM is computed for each RGB frame and a softmax is applied over the spatial dimension, as expressed in formula (2), converting the CAM into a probability map; the obtained attention map is then fused with the output map of the last convolutional layer to obtain the weighted feature map. The feature map of each frame is then fed into the LSTM network to extract the temporal information.

f_SA(i) = softmax_i(M_c(i)) · f(i)    (2)

where f(i) is the output of the last convolutional layer of the feature-extraction network at position i, M_c(i) is the CAM of class c at position i, and f_SA(i) is the image feature weighted by the spatial attention mechanism.
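A minimal PyTorch-style sketch of this CAM-based spatial attention (formulas (1) and (2)) is given below; the use of a pre-trained torchvision Resnet-34 and the module name are assumptions made for the example, not the exact network of the method.

import torch
import torch.nn.functional as F
import torchvision

class CAMSpatialAttention(torch.nn.Module):
    """Weight backbone features with a CAM-derived spatial probability map (formulas (1)-(2))."""

    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet34(weights="IMAGENET1K_V1")
        # All layers up to (and including) the last convolutional block -> B x 512 x H x W features.
        self.features = torch.nn.Sequential(*list(backbone.children())[:-2])
        # Reuse the pre-trained classifier weights w_l^c for the CAM.
        self.fc_weights = backbone.fc.weight            # shape: num_classes x 512

    def forward(self, x):
        f = self.features(x)                            # B x 512 x H x W   (f_l(i))
        b, c, h, w = f.shape
        # M_c(i) = sum_l w_l^c f_l(i), computed for every class.
        cam = torch.einsum("kl,blhw->bkhw", self.fc_weights, f)
        scores = cam.mean(dim=(2, 3))                   # class scores (equivalent to GAP + fc)
        top = scores.argmax(dim=1)                      # highest-scoring class c
        cam_c = cam[torch.arange(b), top]               # B x H x W
        # Spatial softmax turns the CAM into a probability map (attention heat map).
        attn = F.softmax(cam_c.view(b, -1), dim=1).view(b, 1, h, w)
        return f * attn                                 # weighted feature map f_SA

In the full pipeline, the weighted feature map of each frame is then passed to the convLSTM described below.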
Having acquired the image features, the next step is to encode the features of each frame over time. The invention performs this operation with an LSTM network, which is widely used in other methods; the convLSTM used in the invention operates on the same principle as a conventional LSTM. Using a convLSTM network for temporal encoding, changes in the spatial and temporal dimensions can be observed at the same time. The operation of the convLSTM module is expressed by the following formulas.
i_t = σ(W_xi ∗ X_t + W_hi ∗ h_{t-1} + b_i)    (3)
f_t = σ(W_xf ∗ X_t + W_hf ∗ h_{t-1} + b_f)    (4)
o_t = σ(W_xo ∗ X_t + W_ho ∗ h_{t-1} + b_o)    (5)
g_t = tanh(W_xc ∗ X_t + W_hc ∗ h_{t-1} + b_c)    (6)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t    (7)
h_t = o_t ⊙ tanh(c_t)    (8)
where σ is the sigmoid function, and i_t, f_t, o_t, c_t and h_t are the input gate, forget gate, output gate, memory state and hidden state of the convLSTM network; W and b are the weights and biases learned during training. The memory state c_t of the convLSTM network stores the features of the whole video; a spatial average-pooling operation is then applied to it to obtain the feature descriptor of the whole video, which represents the feature information of the entire RGB video segment.
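A compact convLSTM cell consistent with formulas (3)-(8) is sketched below in PyTorch; the layer sizes, kernel size and class name are assumptions made for the example.

import torch

class ConvLSTMCell(torch.nn.Module):
    """Single convLSTM step implementing formulas (3)-(8)."""

    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2
        # One convolution produces the pre-activations of i, f, o and the cell candidate g.
        self.conv = torch.nn.Conv2d(in_channels + hidden_channels,
                                    4 * hidden_channels, kernel_size, padding=padding)
        self.hidden_channels = hidden_channels

    def forward(self, x_t, state):
        h_prev, c_prev = state
        z = self.conv(torch.cat([x_t, h_prev], dim=1))
        i, f, o, g = torch.chunk(z, 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)   # formulas (3)-(5)
        g = torch.tanh(g)                                                # formula (6)
        c_t = f * c_prev + i * g                                         # formula (7)
        h_t = o * torch.tanh(c_t)                                        # formula (8)
        return h_t, c_t

Iterating this cell over the per-frame weighted feature maps and spatially average-pooling the final memory state gives the descriptor of the whole RGB video described above.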
Step 4: compute the optical flow between two adjacent frames of the RGB video with the TV-L1 algorithm; the optical flow gives the velocity vector of every pixel in the image. From the small-motion and constant-brightness assumptions, I(x, y, t) = I(x + dx, y + dy, t + dt), and a first-order Taylor expansion gives formula (9):

I(x + dx, y + dy, t + dt) ≈ I(x, y, t) + (∂I/∂x)dx + (∂I/∂y)dy + (∂I/∂t)dt    (9)

Letting u = dx/dt and v = dy/dt, under the assumption

(∂I/∂x)u + (∂I/∂y)v + ∂I/∂t = 0

the optical flow is solved with the least-squares method. The optical flow of adjacent frames is extracted over the whole video frame, and irrelevant noise motion is then removed; to remove the noise caused by sensor vibration, the displacement values of the optical-flow points between consecutive frames are filtered with a displacement threshold. After the optical-flow images are obtained, temporal information is extracted from the optical-flow image sequence: 5 flow maps are stacked together as an optical-flow stack and fed into a Resnet network to extract the temporal information of the image sequence.
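An illustrative sketch of this step is given below; it assumes the opencv-contrib-python package, where the TV-L1 implementation is exposed as cv2.optflow.DualTVL1OpticalFlow_create, and the threshold value and function name are assumptions made for the example.

import cv2
import numpy as np

def flow_stack(gray_frames, magnitude_threshold=0.5, stack_size=5):
    """Compute TV-L1 optical flow between consecutive frames and build a 5-frame flow stack.

    gray_frames: list of H x W uint8 grayscale frames.
    Returns an array of shape (2 * stack_size) x H x W, ready to feed a Resnet-style network.
    """
    tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()
    flows = []
    for prev, nxt in zip(gray_frames[:-1], gray_frames[1:]):
        flow = tvl1.calc(prev, nxt, None)                       # H x W x 2 (u, v)
        # Suppress small displacements caused by sensor vibration.
        mag = np.linalg.norm(flow, axis=2, keepdims=True)
        flow = np.where(mag < magnitude_threshold, 0.0, flow)
        flows.append(flow)
    # Stack the last `stack_size` flow maps channel-wise: (u1, v1, u2, v2, ...).
    stacked = np.concatenate([f.transpose(2, 0, 1) for f in flows[-stack_size:]], axis=0)
    return stacked.astype(np.float32)

The resulting 2 x 5-channel stack matches the optical-flow stack described above and can be fed to a Resnet whose first convolution accepts 10 input channels.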
Step 5: extract structural information from the depth image sequence, using an attention-based method. Compared with the processing of the RGB data, the attention mechanism is here built into the LSTM network and the output gate is modified, which makes the extraction of the attention map across consecutive depth frames smoother; after the output gate of the recurrent unit is improved, it not only influences the overall prediction but also controls the recursion, which helps to smooth and track the latent memory state of the sequence.
As shown in Fig. 3, the LSTM network structure is modified. In the input part, a pooling operation is applied to the feature X_t to obtain a corresponding value υ_a, which is passed to the RNN part; combined with a_{t-1} and s_{t-1} of the previous frame, a_t and s_t are obtained. From a_t and s_t combined with υ_a, the attention heat map s of the current frame is obtained through a softmax function, and s is fused with X_t to obtain the extracted feature map. Combined with c_{t-1} and o_{t-1} of the previous frame, c_t and o_t of the current frame are obtained. A pooling operation υ_c couples the input-gate and output-gate parts, and finally υ_c ⊙ c_t is taken as the output of the network. The process is as follows:
υ_a = ς_a(X_t)    (10)
(i_a, f_a, s_t, a) = (σ, σ, σ, η)(W_a ∗ [υ_a, s_{t-1} ⊙ η(a_{t-1})])    (11)
a_t = f_a ⊙ a_{t-1} + i_a ⊙ a    (12)
s = softmax(υ_a + s_t ⊙ η(a_t))    (13)
(i_c, f_c, c) = (σ, σ, η)(W_c ∗ [s ⊙ X_t, o_{t-1} ⊙ η(c_{t-1})])    (14)
c_t = f_c ⊙ c_{t-1} + i_c ⊙ c    (15)
υ_c = ς_c(W_c ∗ [s ⊙ X_t, o_{t-1} ⊙ η(c_{t-1})])    (16)
o_t = σ(W_o ∗ [υ_c ⊙ c_t, o_{t-1} ⊙ η(c_{t-1})])    (17)
wherein XtIs an input feature, atIs a memory state, s, in the RNN networktIs an output state in the RNN network, ctIs the memory state in the LSTM network, otIs the output state, upsilon, in the LSTM networkaAnd upsiloncAre mutually coupled pooling operations. σ and η are both activation functions.
Step 6: fuse the features extracted from the multi-modal data sources; their common information and modality-specific information are extracted separately for training, and the action in the video is finally recognized.
As shown in Fig. 4, this embodiment denotes the features extracted by the networks of steps 3, 4 and 5 as

X = {X_1, X_2, ..., X_K}

where X_i is the feature of the i-th modality and K is the total number of modalities, here taken to be 3. The invention defines the fusion function as X → h(X), which merges the input features X into the output feature h(X). In order to fully exploit the common features and the specific features of the different modalities, two intermediate features f_i(X_i) and g(X) are introduced, where g(X) contains the features common to the different modalities (formula (18)). The relationship between the feature X_i and the feature function g_i(X_i) is given by formula (19):

g_i(X_i) = F(W_i X_i + b_i)    (19)

where F is a nonlinear function and W_i and b_i are the weight matrix and the bias matrix, respectively.
Considering that illumination changes and the camera motion caused by head movement in first-person video make a small part of the data abnormal, directly adopting the L1 or L2 norm is not robust enough; in learning the common features, the correlation between the different data sources is therefore computed with a Cauchy estimator Φ_s(g_i(X_i), g_j(X_j)), formula (20), which is smoother than the L1 and L2 norms.
The modality-specific features f_i(X_i) of the different modalities are defined in the same way as g_i(X_i), as in formula (21):

f_i(X_i) = F(W'_i X_i + b'_i)    (21)

where W'_i and b'_i are the weight matrix and the bias matrix of the specific branch.
In learning the specific features, orthogonality constraints, formula (22), are adopted to compute the specific information of the different data streams, so that the specific information of each data stream is independent of the others and also independent of the common information; these two parts are added, with weights, to the original multi-class cross-entropy function to form the loss function of the whole network.

Φ_d(f_i(X_i), f_j(X_j)) = |f_i(X_i) ⊙ f_j(X_j)|
Φ_d(f_i(X_i), g_i(X_i)) = |f_i(X_i) ⊙ g_i(X_i)|    (22)
The two intermediate features are assigned different weights, and the fused feature is finally obtained by weighted fusion; the fusion function is shown in formulas (23) and (24):

h(X) = α_1 f_1(X_1) + α_2 f_2(X_2) + ... + α_K f_K(X_K) + β g(X)    (23)

0 ≤ α_1, α_2, ..., α_K, β ≤ 1    (24)

where the hyper-parameters α and β correspond to the weights of the intermediate features.
According to the experimental results, the network fusion weights are chosen so that the ratio of the common-information part to the specific-information part is 4:1, and within the specific-information part the ratio of the RGB data stream, the optical-flow data stream and the depth data stream is 2:2:1.
Finally, the common features and the specific features are weighted and summed, and the result is passed through a softmax function to predict the action label and obtain the recognition result.
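An illustrative PyTorch sketch of this multi-modal fusion is given below; the Cauchy similarity term is written as a generic log(1 + ||.||^2) penalty, and the feature dimensions, loss weights, class count and module name are assumptions made for the example.

import torch
import torch.nn.functional as F

class MultiModalFusion(torch.nn.Module):
    """Common/specific feature learning and weighted fusion for K = 3 modalities
    (RGB spatial, optical-flow temporal and depth structural features)."""

    def __init__(self, dims=(512, 512, 512), hidden=256, num_classes=45):
        super().__init__()
        self.common = torch.nn.ModuleList([torch.nn.Linear(d, hidden) for d in dims])    # g_i
        self.specific = torch.nn.ModuleList([torch.nn.Linear(d, hidden) for d in dims])  # f_i
        # Fusion weights: common vs. specific part 4:1, specific split RGB:flow:depth = 2:2:1.
        self.beta = 0.8
        self.alpha = (0.08, 0.08, 0.04)
        self.classifier = torch.nn.Linear(hidden, num_classes)

    def forward(self, feats, labels=None):
        g = [torch.relu(layer(x)) for layer, x in zip(self.common, feats)]
        f = [torch.relu(layer(x)) for layer, x in zip(self.specific, feats)]
        # Common feature g(X): taken here as the mean of the per-modality common features (an assumption).
        g_common = torch.stack(g).mean(0)
        fused = self.beta * g_common + sum(a * fi for a, fi in zip(self.alpha, f))   # formula (23)
        logits = self.classifier(fused)
        if labels is None:
            return logits
        # Cauchy-style similarity between common features of different modalities (formula (20) analogue).
        sim = sum(torch.log1p((gi - gj).pow(2).sum(1)).mean()
                  for i, gi in enumerate(g) for gj in g[i + 1:])
        # Orthogonality penalties between specific features, and between specific and common features (22).
        orth = sum((fi * fj).abs().sum(1).mean()
                   for i, fi in enumerate(f) for fj in f[i + 1:])
        orth = orth + sum((fi * gi).abs().sum(1).mean() for fi, gi in zip(f, g))
        loss = F.cross_entropy(logits, labels) + 0.1 * sim + 0.1 * orth
        return logits, loss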
The beneficial effects of the invention are as follows:
1. Hand motion recognition based on visible light alone is easily affected by illumination and background changes. The invention makes full use of the depth information provided by the RGB-D camera, which is mainly used to separate the foreground from the background and is not affected by illumination. Adding the spatial features of the RGB data to the structural features of the depth information improves the accuracy of motion recognition under complex conditions.
2. When processing the RGB image sequence, the invention adopts an attention mechanism that uses the prior information of a pre-trained CNN object-recognition encoder to obtain probability maps of different regions, which are weighted and fused with the output of the feature-extraction network, so that training focuses on the region around the object manipulated by the hand.
3. When processing the depth images, the attention mechanism is built into the LSTM and the output gate is modified, which makes the extraction of the attention map across consecutive frames smoother; the improved output gate of the recurrent unit not only influences the overall prediction but also controls the recursion, which helps to smooth and track the latent memory state of the depth image sequence and clearly improves the extraction of structural information from the depth data.
4. The invention realizes the fusion of multi-modal features. Compared with methods that use only spatial or only temporal features, combining the spatial information of the RGB images, the temporal information of the optical-flow images and the structural information of the depth images achieves information complementarity, high accuracy and strong robustness.

Claims (7)

1. A hand motion recognition method based on first-person-view RGB-D data, characterized by comprising the following steps:
Step 1: wear an RGB-D sensor to collect a plurality of video clips, including RGB video clips and depth video clips; convert the RGB video clips and the depth video clips into single-frame RGB image sequences and single-frame depth image sequences, and register the RGB image sequence with the depth image sequence;
Step 2: preprocess the data collected by the RGB-D sensor, augment the data, and produce corresponding labels to form a data set;
Step 3: after normalizing the different actions to a uniform size, extract spatial information from the RGB image sequence; extract features of the image sequence with an attention-based method, and extract temporal information of the RGB images with an LSTM network;
Step 4: compute the optical flow between adjacent frames of the RGB image sequence to obtain a corresponding optical-flow image sequence, and extract temporal information of the optical-flow images with a Resnet network;
Step 5: extract structural information of the depth image sequence with an attention-based method;
Step 6: for the features extracted from the three data streams, use a multi-modal learning network to extract their common information and modality-specific information respectively for training, and finally fuse the common information and the specific information to recognize the action.
2. The method as claimed in claim 1, wherein step 3 comprises:
pre-training a CNN and using the prior information in its object-recognition encoding to obtain the weights of different regions; following class activation mapping (CAM), in the last convolutional layer of the feature-extraction CNN, the activation value of unit l at spatial position i is defined as f_l(i) and w_l^c is the weight corresponding to class c in unit l, so that the CAM can be expressed as

M_c(i) = Σ_l w_l^c f_l(i)

with Resnet-34 as the backbone network, computing the CAM for each RGB frame, applying a softmax over the spatial dimension to convert the CAM into a probability map, fusing the obtained attention heat map with the output map of the last convolutional layer to obtain a weighted feature map, and finally feeding the feature map of each frame into an LSTM network to extract the temporal information.
3. The method as claimed in claim 1, wherein step 4 comprises:
computing the optical flow between two adjacent frames of the RGB video with the TV-L1 algorithm, the optical flow giving the velocity vector of every pixel in the image; from the small-motion and constant-brightness assumptions, I(x, y, t) = I(x + dx, y + dy, t + dt), and a first-order Taylor expansion gives

I(x + dx, y + dy, t + dt) ≈ I(x, y, t) + (∂I/∂x)dx + (∂I/∂y)dy + (∂I/∂t)dt

letting u = dx/dt and v = dy/dt, under the assumption

(∂I/∂x)u + (∂I/∂y)v + ∂I/∂t = 0

the optical flow is solved with the least-squares method; the optical flow of adjacent frames is extracted over the whole video frame, and irrelevant noise motion is then removed; after the optical-flow images are obtained, temporal information is extracted from the optical-flow image sequence: 5 flow maps are stacked together as an optical-flow stack and fed into a Resnet network to extract the temporal information of the image sequence.
4. The method as claimed in claim 1, wherein step 5 comprises: modifying the LSTM network structure; in the input part, a pooling operation is applied to the feature X_t to obtain a corresponding value υ_a, which is passed to the RNN part; combined with a_{t-1} and s_{t-1} of the previous frame, a_t and s_t are obtained; from a_t and s_t combined with υ_a, the attention heat map s of the current frame is obtained through a softmax function, and s is fused with X_t to obtain the extracted feature map; combined with c_{t-1} and o_{t-1} of the previous frame, c_t and o_t of the current frame are obtained; a pooling operation υ_c couples the input-gate and output-gate parts, and finally υ_c ⊙ c_t is taken as the output of the network; wherein X_t is the input feature, a_t is the memory state and s_t the output state of the RNN part, c_t is the memory state and o_t the output state of the LSTM part, and υ_a and υ_c are mutually coupled pooling operations.
5. The method as claimed in claim 1, wherein step 6 comprises:
the features extracted by each network are denoted X = {X_1, X_2, ..., X_K}, where X_i is the feature of the i-th modality and K is the total number of modalities; the fusion function is defined as X → h(X); two intermediate features f_i(X_i) and g(X) are introduced, where g(X) contains the features common to the different modalities;
the relationship between the feature X_i and the feature function g_i(X_i) is

g_i(X_i) = F(W_i X_i + b_i)

where F is a nonlinear function and W_i and b_i are the weight matrix and the bias matrix, respectively;
the correlation between the different data sources is computed with a Cauchy estimator Φ_s(g_i(X_i), g_j(X_j));
the modality-specific features f_i(X_i) of the different modalities are defined in the same way as g_i(X_i):

f_i(X_i) = F(W'_i X_i + b'_i)

orthogonality constraints are adopted to compute the specific information of the different data streams, and these two parts are added, with weights, to the original multi-class cross-entropy function to form the loss function of the whole network:

Φ_d(f_i(X_i), f_j(X_j)) = |f_i(X_i) ⊙ f_j(X_j)|
Φ_d(f_i(X_i), g_i(X_i)) = |f_i(X_i) ⊙ g_i(X_i)|

different weights are assigned to the two intermediate features, and the fused feature is finally obtained by weighted fusion:

h(X) = α_1 f_1(X_1) + α_2 f_2(X_2) + ... + α_K f_K(X_K) + β g(X)

0 ≤ α_1, α_2, ..., α_K, β ≤ 1

where the hyper-parameters α and β correspond to the weights of the intermediate features.
6. The method as claimed in claim 5, wherein the weight ratio of the common features to the modality-specific features is 4:1.
7. The method as claimed in claim 6, wherein, within the modality-specific features, the RGB data, the optical-flow data and the depth data are weighted and fused with weights 4:4:2, and the action label is predicted from the weighted-fused information to obtain the recognition result.
CN202011018265.6A 2020-09-24 2020-09-24 Hand motion recognition method based on first visual angle RGB-D data Pending CN112307892A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011018265.6A CN112307892A (en) 2020-09-24 2020-09-24 Hand motion recognition method based on first visual angle RGB-D data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011018265.6A CN112307892A (en) 2020-09-24 2020-09-24 Hand motion recognition method based on first visual angle RGB-D data

Publications (1)

Publication Number Publication Date
CN112307892A true CN112307892A (en) 2021-02-02

Family

ID=74489178

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011018265.6A Pending CN112307892A (en) 2020-09-24 2020-09-24 Hand motion recognition method based on first visual angle RGB-D data

Country Status (1)

Country Link
CN (1) CN112307892A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113065451A (en) * 2021-03-29 2021-07-02 四川翼飞视科技有限公司 Multi-mode fused action recognition device and method and storage medium
CN113111842A (en) * 2021-04-26 2021-07-13 浙江商汤科技开发有限公司 Action recognition method, device, equipment and computer readable storage medium
CN114896307A (en) * 2022-06-30 2022-08-12 北京航空航天大学杭州创新研究院 Time series data enhancement method and device and electronic equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017206147A1 (en) * 2016-06-02 2017-12-07 Intel Corporation Recognition of activity in a video image sequence using depth information
CN109389621A (en) * 2018-09-11 2019-02-26 淮阴工学院 RGB-D method for tracking target based on the fusion of multi-mode depth characteristic

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017206147A1 (en) * 2016-06-02 2017-12-07 Intel Corporation Recognition of activity in a video image sequence using depth information
CN109389621A (en) * 2018-09-11 2019-02-26 淮阴工学院 RGB-D method for tracking target based on the fusion of multi-mode depth characteristic

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Swathikiran Sudhakaran et al., "Attention is All We Need: Nailing Down Object-centric Attention for Egocentric Activity Recognition", arXiv.org, 31 July 2018, page 4 *
Swathikiran Sudhakaran et al., "LSTA: Long Short-Term Attention for Egocentric Action Recognition", 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 9 January 2020, pages 9956-9958 *
Yansong Tang et al., "Multi-Stream Deep Neural Networks for RGB-D Egocentric Action Recognition", IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 10, 11 October 2018, pages 2-3 *
Zhao Xiaochuan, "MATLAB Image Processing: Program Implementation and Modular Simulation, 2nd Edition", Beijing: Beihang University Press, 30 November 2018, pages 206-208 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113065451A (en) * 2021-03-29 2021-07-02 四川翼飞视科技有限公司 Multi-mode fused action recognition device and method and storage medium
CN113065451B (en) * 2021-03-29 2022-08-09 四川翼飞视科技有限公司 Multi-mode fused action recognition device and method and storage medium
CN113111842A (en) * 2021-04-26 2021-07-13 浙江商汤科技开发有限公司 Action recognition method, device, equipment and computer readable storage medium
CN113111842B (en) * 2021-04-26 2023-06-27 浙江商汤科技开发有限公司 Action recognition method, device, equipment and computer readable storage medium
CN114896307A (en) * 2022-06-30 2022-08-12 北京航空航天大学杭州创新研究院 Time series data enhancement method and device and electronic equipment
CN114896307B (en) * 2022-06-30 2022-09-27 北京航空航天大学杭州创新研究院 Time series data enhancement method and device and electronic equipment

Similar Documents

Publication Publication Date Title
CN111311666B (en) Monocular vision odometer method integrating edge features and deep learning
CN107808131B (en) Dynamic gesture recognition method based on dual-channel deep convolutional neural network
CN108388882B (en) Gesture recognition method based on global-local RGB-D multi-mode
CN112307892A (en) Hand motion recognition method based on first visual angle RGB-D data
CN111931602B (en) Attention mechanism-based multi-flow segmented network human body action recognition method and system
Xu et al. Aligning correlation information for domain adaptation in action recognition
CN112651262B (en) Cross-modal pedestrian re-identification method based on self-adaptive pedestrian alignment
CN114187665B (en) Multi-person gait recognition method based on human skeleton heat map
CN111126223A (en) Video pedestrian re-identification method based on optical flow guide features
CN111582232A (en) SLAM method based on pixel-level semantic information
CN113743544A (en) Cross-modal neural network construction method, pedestrian retrieval method and system
Avola et al. 3D hand pose and shape estimation from RGB images for keypoint-based hand gesture recognition
CN113608663B (en) Fingertip tracking method based on deep learning and K-curvature method
CN112989889A (en) Gait recognition method based on posture guidance
CN112101262A (en) Multi-feature fusion sign language recognition method and network model
CN117671738A (en) Human body posture recognition system based on artificial intelligence
Yang et al. S3Net: A single stream structure for depth guided image relighting
Shabaninia et al. Transformers in action recognition: A review on temporal modeling
CN111582036A (en) Cross-view-angle person identification method based on shape and posture under wearable device
CN111680560A (en) Pedestrian re-identification method based on space-time characteristics
Rong et al. Picking point recognition for ripe tomatoes using semantic segmentation and morphological processing
Munsif et al. Attention-based deep learning framework for action recognition in a dark environment
CN113255429A (en) Method and system for estimating and tracking human body posture in video
CN117173792A (en) Multi-person gait recognition system based on three-dimensional human skeleton
Shi et al. Multilevel cross-aware RGBD indoor semantic segmentation for bionic binocular robot

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210202

RJ01 Rejection of invention patent application after publication