CN112307892A - Hand motion recognition method based on first-person-view RGB-D data - Google Patents

Hand motion recognition method based on first-person-view RGB-D data

Info

Publication number
CN112307892A
CN112307892A
Authority
CN
China
Prior art keywords
information
rgb
data
network
optical flow
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011018265.6A
Other languages
Chinese (zh)
Inventor
杨谦
许屹
郑星
华晓
严伟雄
张晓�
汪勇
周伟红
许潜航
杨永峰
黄炎阶
段凌霄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Quzhou Power Supply Co of State Grid Zhejiang Electric Power Co Ltd
Original Assignee
Quzhou Power Supply Co of State Grid Zhejiang Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Quzhou Power Supply Co of State Grid Zhejiang Electric Power Co Ltd
Priority to CN202011018265.6A
Publication of CN112307892A
Pending legal-status Current

Classifications

    • G06V 40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches, based on parametric or probabilistic models
    • G06F 18/253 Fusion techniques of extracted features
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06T 7/194 Segmentation; Edge detection involving foreground-background segmentation
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/269 Analysis of motion using gradient-based methods
    • G06T 7/38 Registration of image sequences
    • G06T 2207/10016 Video; Image sequence
    • G06T 2207/10024 Color image
    • G06T 2207/10028 Range image; Depth image; 3D point clouds
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Biomedical Technology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a hand motion recognition method based on first-person-view RGB-D data, which comprises the following steps: a wearable RGB-D sensor collects a plurality of video clips; the data collected by the RGB-D sensor are preprocessed and augmented, and corresponding labels are produced to form a data set; after the different actions are normalized to a uniform size, spatial information is extracted from the RGB image sequence; the optical flow between adjacent frames of the RGB image sequence is computed to obtain a corresponding optical-flow image sequence, and temporal information is extracted from the optical-flow images with a Resnet network; structural information is extracted from the depth image sequence with an attention-based method; finally, for the features extracted from the three data streams, a multi-modal learning network extracts their common information and modality-specific information for training, and the two are fused to recognize the action. The method fully combines the information of the RGB video and the depth video, and achieves better robustness and higher recognition accuracy.

Description

Hand motion recognition method based on first-person-view RGB-D data
Technical Field
The invention relates to the technical field of behavior recognition in computer vision, and in particular to a hand motion recognition method based on first-person-view RGB-D data.
Background
Traditional third-person-view video images distant targets at low resolution, is easily disturbed by factors such as occlusion and illumination, and can hardly support subsequent visual tasks such as recognition and tracking. From the first-person view, high-resolution RGB-D video can be acquired, the wearer naturally moves to find the viewing angle with the least occlusion, and the line of sight stays concentrated around the hands; these characteristics lay the foundation for high-precision action recognition from the data. Existing hand motion recognition methods tend to recognize the motion characteristics of the hand and of the manipulated object jointly. Using deep convolutional neural networks (CNN), Minghuang Ma et al. proposed a first-person two-stream network framework in which one sub-network analyzes the appearance information of the hands and the manipulated object while the other analyzes the motion information of the operator's head and hands, so that object attributes and hand motion characteristics can be acquired simultaneously. Suriya Singh et al. proposed a three-stream network framework for first-person action recognition in which the first network extracts the motion information of the operator's hands and head while the second and third networks extract the spatial and temporal information in the images, respectively. Guillermo Garcia-Hernando et al. studied first-person hand action recognition and collected more than 100,000 frames of RGB-D video sequences covering 45 daily action categories and 26 different objects; RGB-D action recognition and 3D pose estimation are both relatively new fields, and that work was an early attempt to study them jointly. Most existing action recognition methods are based on the third-person view, and first-person methods are relatively few. Moreover, the existing first-person methods process RGB data or skeleton data, but high-precision skeleton data are difficult to obtain in real scenes, and skeleton data of insufficient precision seriously affect the subsequent recognition results. Few existing methods combine RGB data with depth data, and the recognition accuracy and robustness of the other methods still need to be improved.
Disclosure of Invention
The invention addresses the problem that high-precision skeleton data are difficult to obtain in real scenes, which seriously affects subsequent recognition results, and provides a hand motion recognition method based on first-person-view RGB-D data. For RGB-D data, the method fully combines the information of the RGB video and the depth video; based on the first-person view, it overcomes the low-resolution and occlusion problems of traditional third-person video, and achieves better robustness and higher recognition accuracy.
To achieve this purpose, the following technical solution is provided:
a hand motion recognition method based on first visual angle RGB-D data comprises the following steps:
Step 1: wear an RGB-D sensor to collect a plurality of video clips, including RGB video clips and depth video clips; convert the RGB video clips and the depth video clips into single-frame RGB image sequences and single-frame depth image sequences, and register the RGB image sequence with the depth image sequence;
Step 2: preprocess the data collected by the RGB-D sensor, augment the data, and produce corresponding labels to form a data set;
Step 3: after normalizing the different actions to a uniform size, extract spatial information from the RGB image sequence; extract features of the image sequence with an attention-based method, and extract temporal information of the RGB images with an LSTM network;
Step 4: compute the optical flow between adjacent frames of the RGB image sequence to obtain a corresponding optical-flow image sequence, and extract temporal information of the optical-flow images with a Resnet network;
Step 5: extract structural information of the depth image sequence with an attention-based method;
Step 6: for the features extracted from the three data streams, use a multi-modal learning network to extract their common information and modality-specific information respectively for training, and finally fuse the common information and the specific information to recognize the action.
Hand motion recognition based on visible light alone is easily affected by illumination and background changes. The invention makes full use of the depth information provided by the RGB-D camera, which is mainly used to separate the foreground from the background and is not affected by illumination; adding the spatial features of the RGB data to the structural features of the depth information improves the accuracy of motion recognition under complex conditions. When processing the RGB image sequence, the invention adopts an attention mechanism that uses the prior information of a pre-trained CNN object-recognition encoder to obtain probability maps of different regions, which are weighted and fused with the output of the feature-extraction network, so that training focuses on the region around the object manipulated by the hand. The invention realizes the fusion of multi-modal features; compared with methods that use only spatial or only temporal features, combining the spatial information of the RGB images, the temporal information of the optical-flow images and the structural information of the depth images achieves information complementarity, high accuracy and strong robustness.
Preferably, step 3 specifically includes:

pre-training a CNN and using the prior information in its object-recognition encoding to obtain the weights of different regions; following class activation mapping (CAM), in the last convolutional layer of the feature-extraction CNN, the activation value of unit l at spatial position i is defined as f_l(i), and w_l^c is the weight corresponding to class c in unit l, so the CAM can be expressed as

M_c(i) = Σ_l w_l^c f_l(i)

with Resnet-34 as the backbone network, the CAM is computed for each RGB frame and a softmax is applied over the spatial dimension to convert the CAM into a probability map; the obtained attention heat map is fused with the output map of the last convolutional layer to obtain a weighted feature map, and the feature map of each frame is finally fed into an LSTM network to extract the temporal information.
When processing the depth images, the attention mechanism is built into the LSTM and the output gate is modified, which makes the extraction of the attention map across consecutive frames smoother; the improved output gate of the recurrent unit not only influences the overall prediction but also controls the recursion, which helps to smooth and track the latent memory state of the depth image sequence and clearly improves the extraction of structural information from the depth data.
Preferably, step 4 specifically includes:

computing the optical flow between two adjacent frames of the RGB video with the TV-L1 algorithm, the optical flow giving the velocity vector of every pixel in the image; from the small-motion and constant-brightness assumptions, I(x, y, t) = I(x + dx, y + dy, t + dt), and a first-order Taylor expansion gives

I(x + dx, y + dy, t + dt) ≈ I(x, y, t) + (∂I/∂x)dx + (∂I/∂y)dy + (∂I/∂t)dt

letting u = dx/dt and v = dy/dt, under the assumption

(∂I/∂x)u + (∂I/∂y)v + ∂I/∂t = 0

the optical flow is solved with the least-squares method; the optical flow of adjacent frames is extracted over the whole video frame, and irrelevant noise motion is then removed; after the optical-flow images are obtained, temporal information is extracted from the optical-flow image sequence: 5 flow maps are stacked together as an optical-flow stack and fed into a Resnet network to extract the temporal information of the image sequence.
Preferably, step 5 specifically includes: modifying the LSTM network structure. In the input part, a pooling operation is applied to the feature X_t to obtain a corresponding value υ_a, which is passed to the RNN part; combined with a_{t-1} and s_{t-1} of the previous frame, a_t and s_t are obtained. From a_t and s_t combined with υ_a, the attention heat map s of the current frame is obtained through a softmax function, and s is fused with X_t to obtain the extracted feature map. Combined with c_{t-1} and o_{t-1} of the previous frame, c_t and o_t of the current frame are obtained. A pooling operation υ_c couples the input-gate and output-gate parts, and finally υ_c ⊙ c_t is taken as the output of the network. Here X_t is the input feature, a_t is the memory state and s_t the output state of the RNN part, c_t is the memory state and o_t the output state of the LSTM part, and υ_a and υ_c are mutually coupled pooling operations.
Preferably, step 6 specifically includes:

the features extracted by each network are denoted X = {X_1, X_2, ..., X_K}, where X_i is the feature of the i-th modality and K is the total number of modalities; the fusion function is defined as X → h(X); two intermediate features f_i(X_i) and g(X) are introduced, where g(X) contains the features common to the different modalities;

the relationship between the feature X_i and the feature function g_i(X_i) is

g_i(X_i) = F(W_i X_i + b_i)

where F is a nonlinear function and W_i and b_i are the weight matrix and the bias matrix, respectively;

the correlation between the different data sources is computed with a Cauchy estimator Φ_s(g_i(X_i), g_j(X_j));

the modality-specific features f_i(X_i) of the different modalities are defined in the same way as g_i(X_i):

f_i(X_i) = F(W'_i X_i + b'_i)

orthogonality constraints are adopted to compute the specific information of the different data streams, and these two parts are added, with weights, to the original multi-class cross-entropy function to form the loss function of the whole network:

Φ_d(f_i(X_i), f_j(X_j)) = |f_i(X_i) ⊙ f_j(X_j)|
Φ_d(f_i(X_i), g_i(X_i)) = |f_i(X_i) ⊙ g_i(X_i)|

different weights are assigned to the two intermediate features, and the fused feature is finally obtained by weighted fusion; the fusion function is

h(X) = α_1 f_1(X_1) + α_2 f_2(X_2) + ... + α_K f_K(X_K) + β g(X)

0 ≤ α_1, α_2, ..., α_K, β ≤ 1

where the hyper-parameters α and β correspond to the weights of the intermediate features.
Preferably, the weight ratio of the common features to the modality-specific features is 4:1.
Preferably, within the modality-specific features, the RGB data, the optical-flow data and the depth data are weighted and fused with weights 4:4:2, and the action label is predicted from the weighted-fused information to obtain the recognition result.
The beneficial effects of the invention are as follows:
1. Hand motion recognition based on visible light alone is easily affected by illumination and background changes. The invention makes full use of the depth information provided by the RGB-D camera, which is mainly used to separate the foreground from the background and is not affected by illumination. Adding the spatial features of the RGB data to the structural features of the depth information improves the accuracy of motion recognition under complex conditions.
2. When processing the RGB image sequence, the invention adopts an attention mechanism that uses the prior information of a pre-trained CNN object-recognition encoder to obtain probability maps of different regions, which are weighted and fused with the output of the feature-extraction network, so that training focuses on the region around the object manipulated by the hand.
3. When processing the depth images, the attention mechanism is built into the LSTM and the output gate is modified, which makes the extraction of the attention map across consecutive frames smoother; the improved output gate of the recurrent unit not only influences the overall prediction but also controls the recursion, which helps to smooth and track the latent memory state of the depth image sequence and clearly improves the extraction of structural information from the depth data.
4. The invention realizes the fusion of multi-modal features. Compared with methods that use only spatial or only temporal features, combining the spatial information of the RGB images, the temporal information of the optical-flow images and the structural information of the depth images achieves information complementarity, high accuracy and strong robustness.
Drawings
Fig. 1 is a general flow chart of the present invention.
Fig. 2 is a schematic diagram of a feature extraction network for RGB data.
Fig. 3 is a schematic diagram of a feature extraction network for depth data.
Fig. 4 is a schematic diagram of a multimodal learning network.
Detailed Description
Embodiment:
This embodiment provides a hand motion recognition method based on first-person-view RGB-D data, comprising the following steps:
Step 1: wear an RGB-D sensor to collect a plurality of video clips, including RGB video clips and depth video clips; convert the RGB video clips and the depth video clips into single-frame RGB image sequences and single-frame depth image sequences, and register the RGB image sequence with the depth image sequence;
Step 2: preprocess the data collected by the RGB-D sensor, augment the data, and produce corresponding labels to form a data set;
Step 3: after normalizing the different actions to a uniform size, extract spatial information from the RGB image sequence; extract features of the image sequence with an attention-based method, and extract temporal information of the RGB images with an LSTM network;
the step 3 specifically comprises the following steps:
pre-training prior information in object identification codes by using a CNN network to obtain weights of different areas; referring to CAM class activation mapping, in the last convolution layer of the CNN network for feature extraction, the activation value of a unit l at a spatial position i is defined as fl(i),
Figure BDA0002699823830000041
For the weight corresponding to class c in cell l, CAM can be expressed as
Figure BDA0002699823830000042
Based on Resnet-34 as a backbone network, performing CAM calculation on each frame of RGB image frame, performing softmax operation on spatial dimension, converting CAM into a probability map, fusing the obtained attribute map attention heat map and the last layer of convolutional layer output map to obtain a weighted feature map, and finally inputting the feature map of each frame into an LSTM network and extracting timing information.
Step 4: compute the optical flow between adjacent frames of the RGB image sequence to obtain a corresponding optical-flow image sequence, and extract temporal information of the optical-flow images with a Resnet network;
the step 4 specifically comprises the following steps:
calculating optical flow between two adjacent frames of the RGB video by using a TVL1 algorithm, finding a velocity vector of each pixel point in an image by the optical flow, obtaining I (x, y, t) ═ I (x + dx, y + dy, t + dt) according to the tiny motion of the optical flow and the assumption that the brightness is constant, and expanding by using a dielectric Taylor as the following formula
Figure BDA0002699823830000043
Order to
Figure BDA0002699823830000044
In that
Figure BDA0002699823830000045
Under the assumption that the optical flow is solved by using a least square method; extracting optical flows of adjacent frames on the whole picture frame of the video, and then removing irrelevant noise actions; after obtaining the optical flow images, time series information is extracted for the optical flow image sequence, 5 optical flow graphs are overlapped together in the form of an optical flow stack, and the optical flow graphs are input into a Resnet network to extract the time series information of the image sequence.
Step 5: extract structural information of the depth image sequence with an attention-based method;
the step 5 specifically comprises the following steps: modifying the LSTM network structure, in the input part, to the feature XtTo carry out
Figure BDA00026998238300000411
Pooling operation to obtain a corresponding value upsilonaIt is delivered to RNN network, and a of the previous frame is combinedt-1,st-1Can obtain at,st(ii) a According to at,stBinds to upsilonaObtaining an attention map heat map s of the frame through a softmax function, and adding s to XtFusing to obtain an extracted characteristic diagram; combining c of previous framet-1,ot-1C of this frame can be obtainedt,ot(ii) a By passing
Figure BDA00026998238300000412
A pooling operation coupling the input gate and output gate sections, and finally upsilonc⊙ctAs an output of the network; wherein XtIs an input feature, atIs a token in RNN networksMemory state, stIs an output state in the RNN network, ctIs the memory state in the LSTM network, otIs the output state, upsilon, in the LSTM networkaAnd upsiloncAre mutually coupled pooling operations.
Step 6: for the features extracted from the three data streams, use a multi-modal learning network to extract their common information and modality-specific information respectively for training, and finally fuse the common information and the specific information to recognize the action.
Step 6 specifically comprises the following:

the features extracted by each network are denoted X = {X_1, X_2, ..., X_K}, where X_i is the feature of the i-th modality and K is the total number of modalities; the fusion function is defined as X → h(X); two intermediate features f_i(X_i) and g(X) are introduced, where g(X) contains the features common to the different modalities;

the relationship between the feature X_i and the feature function g_i(X_i) is

g_i(X_i) = F(W_i X_i + b_i)

where F is a nonlinear function and W_i and b_i are the weight matrix and the bias matrix, respectively;

the correlation between the different data sources is computed with a Cauchy estimator Φ_s(g_i(X_i), g_j(X_j));

the modality-specific features f_i(X_i) of the different modalities are defined in the same way as g_i(X_i):

f_i(X_i) = F(W'_i X_i + b'_i)

orthogonality constraints are adopted to compute the specific information of the different data streams, and these two parts are added, with weights, to the original multi-class cross-entropy function to form the loss function of the whole network:

Φ_d(f_i(X_i), f_j(X_j)) = |f_i(X_i) ⊙ f_j(X_j)|
Φ_d(f_i(X_i), g_i(X_i)) = |f_i(X_i) ⊙ g_i(X_i)|

different weights are assigned to the two intermediate features, and the fused feature is finally obtained by weighted fusion:

h(X) = α_1 f_1(X_1) + α_2 f_2(X_2) + ... + α_K f_K(X_K) + β g(X)

0 ≤ α_1, α_2, ..., α_K, β ≤ 1

where the hyper-parameters α and β correspond to the weights of the intermediate features. The weight ratio of the common features to the modality-specific features is 4:1. Within the modality-specific features, the RGB data, the optical-flow data and the depth data are weighted and fused with weights 4:4:2, and the action label is predicted from the weighted-fused information to obtain the recognition result.
The following detailed description of the embodiments of the present invention will be made with reference to the accompanying drawings and specific examples, which are provided for illustration of the present invention and are not intended to limit the scope of the present invention.
The method preprocesses the collected RGB-D data, extracts the spatial information of the RGB images, the temporal information of the optical-flow images and the structural information of the depth images, learns the common features and the modality-specific features of the different modalities from the extracted feature information, and finally fuses the feature information to predict the category of the action in the video. Referring to Fig. 1, the method specifically includes the following steps:
Step 1: wear an RGB-D sensor to collect a plurality of video segments, and register the RGB data with the corresponding depth data.
The acquisition system adopts a hardware architecture consisting of a CPU, a ToF depth sensor and an RGB image acquisition device. In this architecture, the CPU is responsible for initializing the system, managing and configuring the ToF sensor and the RGB image acquisition device, processing the depth phase data into a depth image, and registering the depth image with the RGB image. The ToF depth sensor acquires the depth phase data of the scene, and the RGB image acquisition device acquires the RGB visible-light images of the scene.
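For illustration, a minimal sketch of the depth-to-RGB registration performed by the CPU is given below; it assumes both cameras share the same resolution, and the intrinsic matrices, the extrinsic transform and the function name are illustrative placeholders rather than values specified by the method.

import numpy as np

def register_depth_to_rgb(depth, K_d, K_rgb, R, t):
    """Reproject a ToF depth map into the RGB camera frame.

    depth: H x W depth in meters (ToF camera frame)
    K_d, K_rgb: 3x3 intrinsic matrices of the depth and RGB cameras
    R, t: rotation (3x3) and translation (3,) from the depth camera to the RGB camera
    Returns a depth map aligned with the RGB image grid (same resolution assumed).
    """
    h, w = depth.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([us, vs, np.ones_like(us)], axis=-1).reshape(-1, 3).T   # 3 x N homogeneous pixels
    # Back-project depth pixels to 3-D points in the ToF camera frame.
    pts = np.linalg.inv(K_d) @ pix * depth.reshape(1, -1)
    # Transform into the RGB camera frame and project with the RGB intrinsics.
    pts_rgb = R @ pts + t.reshape(3, 1)
    proj = K_rgb @ pts_rgb
    z = proj[2]
    valid = z > 1e-6
    u = np.zeros(z.shape, dtype=int)
    v = np.zeros(z.shape, dtype=int)
    u[valid] = np.round(proj[0, valid] / z[valid]).astype(int)
    v[valid] = np.round(proj[1, valid] / z[valid]).astype(int)
    valid &= (u >= 0) & (u < w) & (v >= 0) & (v < h)
    registered = np.zeros_like(depth)
    registered[v[valid], u[valid]] = z[valid]
    return registered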
Step 2: preprocess the sampled RGB-D data, augment the data, and manually annotate corresponding labels to form a data set.
Because the acquisition system is worn on the head, the captured video shakes severely with the movement of the person, the swinging of the head and the shift of gaze. For high-precision recognition, the captured raw video is therefore stabilized and the converted images are denoised, which provides a foundation for subsequent high-precision recognition.
In the power industry, relevant video data are relatively scarce, while training an effective model requires a large data set, so the obtained RGB data and depth data need to be augmented. The existing data are processed, for example by flipping, translation or rotation, to create more data, which gives the model trained by the network stronger generalization ability.
Each video segment is labeled, recording the type of the action and the starting and ending frame numbers of the action sequence.
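As an illustration of the augmentation described above, the following sketch applies the same random flip, translation and rotation to a registered RGB-D pair; the helper name, parameter ranges and array conventions are assumptions made for the example.

import numpy as np

def augment_rgbd_pair(rgb, depth, rng):
    """Apply the same random flip/translation/rotation to an RGB frame and its registered depth map.

    rgb:   H x W x 3 uint8 array
    depth: H x W float32 array (registered to the RGB frame)
    rng:   numpy.random.Generator
    """
    # Horizontal flip with probability 0.5 (applied jointly so registration is preserved).
    if rng.random() < 0.5:
        rgb, depth = rgb[:, ::-1], depth[:, ::-1]

    # Random translation by up to +/- 10 pixels.
    dx, dy = rng.integers(-10, 11, size=2)
    rgb = np.roll(rgb, (dy, dx), axis=(0, 1))
    depth = np.roll(depth, (dy, dx), axis=(0, 1))

    # Random rotation by a multiple of 90 degrees (k = 0..3).
    k = int(rng.integers(0, 4))
    rgb = np.rot90(rgb, k, axes=(0, 1))
    depth = np.rot90(depth, k, axes=(0, 1))
    return np.ascontiguousarray(rgb), np.ascontiguousarray(depth)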
Step 3: after normalizing the different actions to a uniform size, spatial information is extracted from the RGB image sequence. As shown in Fig. 2, an attention-based method is adopted: the prior information in the object-recognition encoding of a pre-trained CNN is used to obtain the weights of different regions. The invention uses CAM (class activation mapping): in the last convolutional layer of the feature-extraction CNN, the activation value of unit l at spatial position i is defined as f_l(i), and w_l^c is the weight corresponding to class c in unit l, so the CAM can be represented by formula (1):

M_c(i) = Σ_l w_l^c f_l(i)    (1)

The method extracts the highest-scoring class in the image region, and the image generated by the CAM represents a saliency map of the image, so the network can be trained with emphasis on the region near the manipulated object. With Resnet-34 as the backbone network, the CAM is computed for each RGB frame and a softmax is applied over the spatial dimension, as expressed in formula (2), converting the CAM into a probability map; the obtained attention map is then fused with the output map of the last convolutional layer to obtain the weighted feature map. The feature map of each frame is then fed into the LSTM network to extract the temporal information.

f_SA(i) = softmax_i(M_c(i)) · f(i)    (2)

where f(i) is the output of the last convolutional layer of the feature-extraction network at position i, M_c(i) is the CAM of class c at position i, and f_SA(i) is the image feature weighted by the spatial attention mechanism.
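A minimal PyTorch-style sketch of this CAM-based spatial attention (formulas (1) and (2)) is given below; the use of a pre-trained torchvision Resnet-34 and the module name are assumptions made for the example, not the exact network of the method.

import torch
import torch.nn.functional as F
import torchvision

class CAMSpatialAttention(torch.nn.Module):
    """Weight backbone features with a CAM-derived spatial probability map (formulas (1)-(2))."""

    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet34(weights="IMAGENET1K_V1")
        # All layers up to (and including) the last convolutional block -> B x 512 x H x W features.
        self.features = torch.nn.Sequential(*list(backbone.children())[:-2])
        # Reuse the pre-trained classifier weights w_l^c for the CAM.
        self.fc_weights = backbone.fc.weight            # shape: num_classes x 512

    def forward(self, x):
        f = self.features(x)                            # B x 512 x H x W   (f_l(i))
        b, c, h, w = f.shape
        # M_c(i) = sum_l w_l^c f_l(i), computed for every class.
        cam = torch.einsum("kl,blhw->bkhw", self.fc_weights, f)
        scores = cam.mean(dim=(2, 3))                   # class scores (equivalent to GAP + fc)
        top = scores.argmax(dim=1)                      # highest-scoring class c
        cam_c = cam[torch.arange(b), top]               # B x H x W
        # Spatial softmax turns the CAM into a probability map (attention heat map).
        attn = F.softmax(cam_c.view(b, -1), dim=1).view(b, 1, h, w)
        return f * attn                                 # weighted feature map f_SA

In the full pipeline, the weighted feature map of each frame is then passed to the convLSTM described below.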
Having acquired the image features, the next step is to encode the features of each frame over time. The invention performs this operation with an LSTM network, which is widely used in other methods; the convLSTM used in the invention operates on the same principle as a conventional LSTM. Using a convLSTM network for temporal encoding, changes in the spatial and temporal dimensions can be observed at the same time. The operation of the convLSTM module is expressed by the following formulas.
i_t = σ(W_xi ∗ X_t + W_hi ∗ h_{t-1} + b_i)    (3)
f_t = σ(W_xf ∗ X_t + W_hf ∗ h_{t-1} + b_f)    (4)
o_t = σ(W_xo ∗ X_t + W_ho ∗ h_{t-1} + b_o)    (5)
g_t = tanh(W_xc ∗ X_t + W_hc ∗ h_{t-1} + b_c)    (6)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t    (7)
h_t = o_t ⊙ tanh(c_t)    (8)
where σ is the sigmoid function, and i_t, f_t, o_t, c_t and h_t are the input gate, forget gate, output gate, memory state and hidden state of the convLSTM network; W and b are the weights and biases learned during training. The memory state c_t of the convLSTM network stores the features of the whole video; a spatial average-pooling operation is then applied to it to obtain the feature descriptor of the whole video, which represents the feature information of the entire RGB video segment.
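A compact convLSTM cell consistent with formulas (3)-(8) is sketched below in PyTorch; the layer sizes, kernel size and class name are assumptions made for the example.

import torch

class ConvLSTMCell(torch.nn.Module):
    """Single convLSTM step implementing formulas (3)-(8)."""

    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2
        # One convolution produces the pre-activations of i, f, o and the cell candidate g.
        self.conv = torch.nn.Conv2d(in_channels + hidden_channels,
                                    4 * hidden_channels, kernel_size, padding=padding)
        self.hidden_channels = hidden_channels

    def forward(self, x_t, state):
        h_prev, c_prev = state
        z = self.conv(torch.cat([x_t, h_prev], dim=1))
        i, f, o, g = torch.chunk(z, 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)   # formulas (3)-(5)
        g = torch.tanh(g)                                                # formula (6)
        c_t = f * c_prev + i * g                                         # formula (7)
        h_t = o * torch.tanh(c_t)                                        # formula (8)
        return h_t, c_t

Iterating this cell over the per-frame weighted feature maps and spatially average-pooling the final memory state gives the descriptor of the whole RGB video described above.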
Step 4: compute the optical flow between two adjacent frames of the RGB video with the TV-L1 algorithm; the optical flow gives the velocity vector of every pixel in the image. From the small-motion and constant-brightness assumptions, I(x, y, t) = I(x + dx, y + dy, t + dt), and a first-order Taylor expansion gives formula (9):

I(x + dx, y + dy, t + dt) ≈ I(x, y, t) + (∂I/∂x)dx + (∂I/∂y)dy + (∂I/∂t)dt    (9)

Letting u = dx/dt and v = dy/dt, under the assumption

(∂I/∂x)u + (∂I/∂y)v + ∂I/∂t = 0

the optical flow is solved with the least-squares method. The optical flow of adjacent frames is extracted over the whole video frame, and irrelevant noise motion is then removed; to remove the noise caused by sensor vibration, the displacement values of the optical-flow points between consecutive frames are filtered with a displacement threshold. After the optical-flow images are obtained, temporal information is extracted from the optical-flow image sequence: 5 flow maps are stacked together as an optical-flow stack and fed into a Resnet network to extract the temporal information of the image sequence.
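An illustrative sketch of this step is given below; it assumes the opencv-contrib-python package, where the TV-L1 implementation is exposed as cv2.optflow.DualTVL1OpticalFlow_create, and the threshold value and function name are assumptions made for the example.

import cv2
import numpy as np

def flow_stack(gray_frames, magnitude_threshold=0.5, stack_size=5):
    """Compute TV-L1 optical flow between consecutive frames and build a 5-frame flow stack.

    gray_frames: list of H x W uint8 grayscale frames.
    Returns an array of shape (2 * stack_size) x H x W, ready to feed a Resnet-style network.
    """
    tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()
    flows = []
    for prev, nxt in zip(gray_frames[:-1], gray_frames[1:]):
        flow = tvl1.calc(prev, nxt, None)                       # H x W x 2 (u, v)
        # Suppress small displacements caused by sensor vibration.
        mag = np.linalg.norm(flow, axis=2, keepdims=True)
        flow = np.where(mag < magnitude_threshold, 0.0, flow)
        flows.append(flow)
    # Stack the last `stack_size` flow maps channel-wise: (u1, v1, u2, v2, ...).
    stacked = np.concatenate([f.transpose(2, 0, 1) for f in flows[-stack_size:]], axis=0)
    return stacked.astype(np.float32)

The resulting 2 x 5-channel stack matches the optical-flow stack described above and can be fed to a Resnet whose first convolution accepts 10 input channels.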
Step 5: extract structural information from the depth image sequence, using an attention-based method. Compared with the processing of the RGB data, the attention mechanism is here built into the LSTM network and the output gate is modified, which makes the extraction of the attention map across consecutive depth frames smoother; after the output gate of the recurrent unit is improved, it not only influences the overall prediction but also controls the recursion, which helps to smooth and track the latent memory state of the sequence.
As shown in Fig. 3, the LSTM network structure is modified. In the input part, a pooling operation is applied to the feature X_t to obtain a corresponding value υ_a, which is passed to the RNN part; combined with a_{t-1} and s_{t-1} of the previous frame, a_t and s_t are obtained. From a_t and s_t combined with υ_a, the attention heat map s of the current frame is obtained through a softmax function, and s is fused with X_t to obtain the extracted feature map. Combined with c_{t-1} and o_{t-1} of the previous frame, c_t and o_t of the current frame are obtained. A pooling operation υ_c couples the input-gate and output-gate parts, and finally υ_c ⊙ c_t is taken as the output of the network. The process is as follows:
υ_a = ς_a(X_t)    (10)
(i_a, f_a, s_t, a) = (σ, σ, σ, η)(W_a ∗ [υ_a, s_{t-1} ⊙ η(a_{t-1})])    (11)
a_t = f_a ⊙ a_{t-1} + i_a ⊙ a    (12)
s = softmax(υ_a + s_t ⊙ η(a_t))    (13)
(i_c, f_c, c) = (σ, σ, η)(W_c ∗ [s ⊙ X_t, o_{t-1} ⊙ η(c_{t-1})])    (14)
c_t = f_c ⊙ c_{t-1} + i_c ⊙ c    (15)
υ_c = ς_c(W_c ∗ [s ⊙ X_t, o_{t-1} ⊙ η(c_{t-1})])    (16)
o_t = σ(W_o ∗ [υ_c ⊙ c_t, o_{t-1} ⊙ η(c_{t-1})])    (17)
wherein XtIs an input feature, atIs a memory state, s, in the RNN networktIs an output state in the RNN network, ctIs the memory state in the LSTM network, otIs the output state, upsilon, in the LSTM networkaAnd upsiloncAre mutually coupled pooling operations. σ and η are both activation functions.
Step 6: fuse the features extracted from the multi-modal data sources; their common information and modality-specific information are extracted separately for training, and the action in the video is finally recognized.
As shown in Fig. 4, this embodiment denotes the features extracted by the networks of steps 3, 4 and 5 as

X = {X_1, X_2, ..., X_K}

where X_i is the feature of the i-th modality and K is the total number of modalities, here taken to be 3. The invention defines the fusion function as X → h(X), which merges the input features X into the output feature h(X). In order to fully exploit the common features and the specific features of the different modalities, two intermediate features f_i(X_i) and g(X) are introduced, where g(X) contains the features common to the different modalities (formula (18)). The relationship between the feature X_i and the feature function g_i(X_i) is given by formula (19):

g_i(X_i) = F(W_i X_i + b_i)    (19)

where F is a nonlinear function and W_i and b_i are the weight matrix and the bias matrix, respectively.
Considering that illumination changes and the camera motion caused by head movement in first-person video make a small part of the data abnormal, directly adopting the L1 or L2 norm is not robust enough; in learning the common features, the correlation between the different data sources is therefore computed with a Cauchy estimator Φ_s(g_i(X_i), g_j(X_j)), formula (20), which is smoother than the L1 and L2 norms.
The modality-specific features f_i(X_i) of the different modalities are defined in the same way as g_i(X_i), as in formula (21):

f_i(X_i) = F(W'_i X_i + b'_i)    (21)

where W'_i and b'_i are the weight matrix and the bias matrix of the specific branch.
In learning the specific features, orthogonality constraints, formula (22), are adopted to compute the specific information of the different data streams, so that the specific information of each data stream is independent of the others and also independent of the common information; these two parts are added, with weights, to the original multi-class cross-entropy function to form the loss function of the whole network.

Φ_d(f_i(X_i), f_j(X_j)) = |f_i(X_i) ⊙ f_j(X_j)|
Φ_d(f_i(X_i), g_i(X_i)) = |f_i(X_i) ⊙ g_i(X_i)|    (22)
The two intermediate features are assigned different weights, and the fused feature is finally obtained by weighted fusion; the fusion function is shown in formulas (23) and (24):

h(X) = α_1 f_1(X_1) + α_2 f_2(X_2) + ... + α_K f_K(X_K) + β g(X)    (23)

0 ≤ α_1, α_2, ..., α_K, β ≤ 1    (24)

where the hyper-parameters α and β correspond to the weights of the intermediate features.
According to the experimental results, the network fusion weights are chosen so that the ratio of the common-information part to the specific-information part is 4:1, and within the specific-information part the ratio of the RGB data stream, the optical-flow data stream and the depth data stream is 2:2:1.
Finally, the common features and the specific features are weighted and summed, and the result is passed through a softmax function to predict the action label and obtain the recognition result.
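An illustrative PyTorch sketch of this multi-modal fusion is given below; the Cauchy similarity term is written as a generic log(1 + ||.||^2) penalty, and the feature dimensions, loss weights, class count and module name are assumptions made for the example.

import torch
import torch.nn.functional as F

class MultiModalFusion(torch.nn.Module):
    """Common/specific feature learning and weighted fusion for K = 3 modalities
    (RGB spatial, optical-flow temporal and depth structural features)."""

    def __init__(self, dims=(512, 512, 512), hidden=256, num_classes=45):
        super().__init__()
        self.common = torch.nn.ModuleList([torch.nn.Linear(d, hidden) for d in dims])    # g_i
        self.specific = torch.nn.ModuleList([torch.nn.Linear(d, hidden) for d in dims])  # f_i
        # Fusion weights: common vs. specific part 4:1, specific split RGB:flow:depth = 2:2:1.
        self.beta = 0.8
        self.alpha = (0.08, 0.08, 0.04)
        self.classifier = torch.nn.Linear(hidden, num_classes)

    def forward(self, feats, labels=None):
        g = [torch.relu(layer(x)) for layer, x in zip(self.common, feats)]
        f = [torch.relu(layer(x)) for layer, x in zip(self.specific, feats)]
        # Common feature g(X): taken here as the mean of the per-modality common features (an assumption).
        g_common = torch.stack(g).mean(0)
        fused = self.beta * g_common + sum(a * fi for a, fi in zip(self.alpha, f))   # formula (23)
        logits = self.classifier(fused)
        if labels is None:
            return logits
        # Cauchy-style similarity between common features of different modalities (formula (20) analogue).
        sim = sum(torch.log1p((gi - gj).pow(2).sum(1)).mean()
                  for i, gi in enumerate(g) for gj in g[i + 1:])
        # Orthogonality penalties between specific features, and between specific and common features (22).
        orth = sum((fi * fj).abs().sum(1).mean()
                   for i, fi in enumerate(f) for fj in f[i + 1:])
        orth = orth + sum((fi * gi).abs().sum(1).mean() for fi, gi in zip(f, g))
        loss = F.cross_entropy(logits, labels) + 0.1 * sim + 0.1 * orth
        return logits, loss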
The beneficial effects of the invention are as follows:
1. Hand motion recognition based on visible light alone is easily affected by illumination and background changes. The invention makes full use of the depth information provided by the RGB-D camera, which is mainly used to separate the foreground from the background and is not affected by illumination. Adding the spatial features of the RGB data to the structural features of the depth information improves the accuracy of motion recognition under complex conditions.
2. When processing the RGB image sequence, the invention adopts an attention mechanism that uses the prior information of a pre-trained CNN object-recognition encoder to obtain probability maps of different regions, which are weighted and fused with the output of the feature-extraction network, so that training focuses on the region around the object manipulated by the hand.
3. When processing the depth images, the attention mechanism is built into the LSTM and the output gate is modified, which makes the extraction of the attention map across consecutive frames smoother; the improved output gate of the recurrent unit not only influences the overall prediction but also controls the recursion, which helps to smooth and track the latent memory state of the depth image sequence and clearly improves the extraction of structural information from the depth data.
4. The invention realizes the fusion of multi-modal features. Compared with methods that use only spatial or only temporal features, combining the spatial information of the RGB images, the temporal information of the optical-flow images and the structural information of the depth images achieves information complementarity, high accuracy and strong robustness.

Claims (7)

1. A hand motion recognition method based on first-person-view RGB-D data, characterized by comprising the following steps:
Step 1: wear an RGB-D sensor to collect a plurality of video clips, including RGB video clips and depth video clips; convert the RGB video clips and the depth video clips into single-frame RGB image sequences and single-frame depth image sequences, and register the RGB image sequence with the depth image sequence;
Step 2: preprocess the data collected by the RGB-D sensor, augment the data, and produce corresponding labels to form a data set;
Step 3: after normalizing the different actions to a uniform size, extract spatial information from the RGB image sequence; extract features of the image sequence with an attention-based method, and extract temporal information of the RGB images with an LSTM network;
Step 4: compute the optical flow between adjacent frames of the RGB image sequence to obtain a corresponding optical-flow image sequence, and extract temporal information of the optical-flow images with a Resnet network;
Step 5: extract structural information of the depth image sequence with an attention-based method;
Step 6: for the features extracted from the three data streams, use a multi-modal learning network to extract their common information and modality-specific information respectively for training, and finally fuse the common information and the specific information to recognize the action.
2. The method as claimed in claim 1, wherein step 3 comprises:
pre-training a CNN and using the prior information in its object-recognition encoding to obtain the weights of different regions; following class activation mapping (CAM), in the last convolutional layer of the feature-extraction CNN, the activation value of unit l at spatial position i is defined as f_l(i) and w_l^c is the weight corresponding to class c in unit l, so that the CAM can be expressed as

M_c(i) = Σ_l w_l^c f_l(i)

with Resnet-34 as the backbone network, computing the CAM for each RGB frame, applying a softmax over the spatial dimension to convert the CAM into a probability map, fusing the obtained attention heat map with the output map of the last convolutional layer to obtain a weighted feature map, and finally feeding the feature map of each frame into an LSTM network to extract the temporal information.
3. The method as claimed in claim 1, wherein step 4 comprises:
computing the optical flow between two adjacent frames of the RGB video with the TV-L1 algorithm, the optical flow giving the velocity vector of every pixel in the image; from the small-motion and constant-brightness assumptions, I(x, y, t) = I(x + dx, y + dy, t + dt), and a first-order Taylor expansion gives

I(x + dx, y + dy, t + dt) ≈ I(x, y, t) + (∂I/∂x)dx + (∂I/∂y)dy + (∂I/∂t)dt

letting u = dx/dt and v = dy/dt, under the assumption

(∂I/∂x)u + (∂I/∂y)v + ∂I/∂t = 0

the optical flow is solved with the least-squares method; the optical flow of adjacent frames is extracted over the whole video frame, and irrelevant noise motion is then removed; after the optical-flow images are obtained, temporal information is extracted from the optical-flow image sequence: 5 flow maps are stacked together as an optical-flow stack and fed into a Resnet network to extract the temporal information of the image sequence.
4. The method as claimed in claim 1, wherein step 5 comprises: modifying the LSTM network structure; in the input part, a pooling operation is applied to the feature X_t to obtain a corresponding value υ_a, which is passed to the RNN part; combined with a_{t-1} and s_{t-1} of the previous frame, a_t and s_t are obtained; from a_t and s_t combined with υ_a, the attention heat map s of the current frame is obtained through a softmax function, and s is fused with X_t to obtain the extracted feature map; combined with c_{t-1} and o_{t-1} of the previous frame, c_t and o_t of the current frame are obtained; a pooling operation υ_c couples the input-gate and output-gate parts, and finally υ_c ⊙ c_t is taken as the output of the network; wherein X_t is the input feature, a_t is the memory state and s_t the output state of the RNN part, c_t is the memory state and o_t the output state of the LSTM part, and υ_a and υ_c are mutually coupled pooling operations.
5. The method as claimed in claim 1, wherein step 6 comprises:
the features extracted by each network are denoted X = {X_1, X_2, ..., X_K}, where X_i is the feature of the i-th modality and K is the total number of modalities; the fusion function is defined as X → h(X); two intermediate features f_i(X_i) and g(X) are introduced, where g(X) contains the features common to the different modalities;
the relationship between the feature X_i and the feature function g_i(X_i) is

g_i(X_i) = F(W_i X_i + b_i)

where F is a nonlinear function and W_i and b_i are the weight matrix and the bias matrix, respectively;
the correlation between the different data sources is computed with a Cauchy estimator Φ_s(g_i(X_i), g_j(X_j));
the modality-specific features f_i(X_i) of the different modalities are defined in the same way as g_i(X_i):

f_i(X_i) = F(W'_i X_i + b'_i)

orthogonality constraints are adopted to compute the specific information of the different data streams, and these two parts are added, with weights, to the original multi-class cross-entropy function to form the loss function of the whole network:

Φ_d(f_i(X_i), f_j(X_j)) = |f_i(X_i) ⊙ f_j(X_j)|
Φ_d(f_i(X_i), g_i(X_i)) = |f_i(X_i) ⊙ g_i(X_i)|

different weights are assigned to the two intermediate features, and the fused feature is finally obtained by weighted fusion:

h(X) = α_1 f_1(X_1) + α_2 f_2(X_2) + ... + α_K f_K(X_K) + β g(X)

0 ≤ α_1, α_2, ..., α_K, β ≤ 1

where the hyper-parameters α and β correspond to the weights of the intermediate features.
6. The method as claimed in claim 5, wherein the weight ratio of the common features to the modality-specific features is 4:1.
7. The method as claimed in claim 6, wherein, within the modality-specific features, the RGB data, the optical-flow data and the depth data are weighted and fused with weights 4:4:2, and the action label is predicted from the weighted-fused information to obtain the recognition result.
CN202011018265.6A 2020-09-24 2020-09-24 Hand motion recognition method based on first visual angle RGB-D data Pending CN112307892A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011018265.6A CN112307892A (en) 2020-09-24 2020-09-24 Hand motion recognition method based on first visual angle RGB-D data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011018265.6A CN112307892A (en) 2020-09-24 2020-09-24 Hand motion recognition method based on first visual angle RGB-D data

Publications (1)

Publication Number Publication Date
CN112307892A true CN112307892A (en) 2021-02-02

Family

ID=74489178

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011018265.6A Pending CN112307892A (en) 2020-09-24 2020-09-24 Hand motion recognition method based on first visual angle RGB-D data

Country Status (1)

Country Link
CN (1) CN112307892A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113065451A (en) * 2021-03-29 2021-07-02 四川翼飞视科技有限公司 Multi-mode fused action recognition device and method and storage medium
CN113111842A (en) * 2021-04-26 2021-07-13 浙江商汤科技开发有限公司 Action recognition method, device, equipment and computer readable storage medium
CN114896307A (en) * 2022-06-30 2022-08-12 北京航空航天大学杭州创新研究院 Time series data enhancement method and device and electronic equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017206147A1 (en) * 2016-06-02 2017-12-07 Intel Corporation Recognition of activity in a video image sequence using depth information
CN109389621A (en) * 2018-09-11 2019-02-26 淮阴工学院 RGB-D method for tracking target based on the fusion of multi-mode depth characteristic

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017206147A1 (en) * 2016-06-02 2017-12-07 Intel Corporation Recognition of activity in a video image sequence using depth information
CN109389621A (en) * 2018-09-11 2019-02-26 淮阴工学院 RGB-D method for tracking target based on the fusion of multi-mode depth characteristic

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Swathikiran Sudhakaran et al., "Attention is All We Need: Nailing Down Object-centric Attention for Egocentric Activity Recognition", arXiv.org, 31 July 2018, page 4 *
Swathikiran Sudhakaran et al., "LSTA: Long Short-Term Attention for Egocentric Action Recognition", 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 9 January 2020, pages 9956-9958 *
Yansong Tang et al., "Multi-Stream Deep Neural Networks for RGB-D Egocentric Action Recognition", IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 10, 11 October 2018, pages 2-3 *
Zhao Xiaochuan, "MATLAB Image Processing: Program Implementation and Modular Simulation, 2nd Edition", Beijing: Beihang University Press, 30 November 2018, pages 206-208 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113065451A (en) * 2021-03-29 2021-07-02 四川翼飞视科技有限公司 Multi-mode fused action recognition device and method and storage medium
CN113065451B (en) * 2021-03-29 2022-08-09 四川翼飞视科技有限公司 Multi-mode fused action recognition device and method and storage medium
CN113111842A (en) * 2021-04-26 2021-07-13 浙江商汤科技开发有限公司 Action recognition method, device, equipment and computer readable storage medium
CN113111842B (en) * 2021-04-26 2023-06-27 浙江商汤科技开发有限公司 Action recognition method, device, equipment and computer readable storage medium
CN114896307A (en) * 2022-06-30 2022-08-12 北京航空航天大学杭州创新研究院 Time series data enhancement method and device and electronic equipment
CN114896307B (en) * 2022-06-30 2022-09-27 北京航空航天大学杭州创新研究院 Time series data enhancement method and device and electronic equipment

Similar Documents

Publication Publication Date Title
CN111311666B (en) Monocular vision odometer method integrating edge features and deep learning
CN107808131B (en) Dynamic gesture recognition method based on dual-channel deep convolutional neural network
CN108388882B (en) Gesture recognition method based on global-local RGB-D multi-mode
CN112307892A (en) Hand motion recognition method based on first visual angle RGB-D data
CN111931602B (en) Attention mechanism-based multi-flow segmented network human body action recognition method and system
Xu et al. Aligning correlation information for domain adaptation in action recognition
CN112651262B (en) Cross-modal pedestrian re-identification method based on self-adaptive pedestrian alignment
CN114187665B (en) Multi-person gait recognition method based on human skeleton heat map
CN111126223A (en) Video pedestrian re-identification method based on optical flow guide features
CN111582232A (en) SLAM method based on pixel-level semantic information
CN113743544A (en) Cross-modal neural network construction method, pedestrian retrieval method and system
Avola et al. 3D hand pose and shape estimation from RGB images for keypoint-based hand gesture recognition
CN113608663B (en) Fingertip tracking method based on deep learning and K-curvature method
CN112989889A (en) Gait recognition method based on posture guidance
CN112101262A (en) Multi-feature fusion sign language recognition method and network model
CN117671738A (en) Human body posture recognition system based on artificial intelligence
Yang et al. S3Net: A single stream structure for depth guided image relighting
Shabaninia et al. Transformers in action recognition: A review on temporal modeling
CN111582036A (en) Cross-view-angle person identification method based on shape and posture under wearable device
CN111680560A (en) Pedestrian re-identification method based on space-time characteristics
Rong et al. Picking point recognition for ripe tomatoes using semantic segmentation and morphological processing
Munsif et al. Attention-based deep learning framework for action recognition in a dark environment
CN113255429A (en) Method and system for estimating and tracking human body posture in video
CN117173792A (en) Multi-person gait recognition system based on three-dimensional human skeleton
Shi et al. Multilevel cross-aware RGBD indoor semantic segmentation for bionic binocular robot

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210202

RJ01 Rejection of invention patent application after publication