CN116740290A - Three-dimensional interaction double-hand reconstruction method and system based on deformable attention - Google Patents

Three-dimensional interaction double-hand reconstruction method and system based on deformable attention

Info

Publication number
CN116740290A
Authority
CN
China
Prior art keywords
hand
grid
attention
feature
vertex
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311024598.3A
Other languages
Chinese (zh)
Other versions
CN116740290B (en)
Inventor
杨文姬
黎家瑞
王映龙
钱文彬
钟表
李佳航
廖彦文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangxi Agricultural University
Original Assignee
Jiangxi Agricultural University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangxi Agricultural University filed Critical Jiangxi Agricultural University
Priority to CN202311024598.3A priority Critical patent/CN116740290B/en
Publication of CN116740290A publication Critical patent/CN116740290A/en
Application granted granted Critical
Publication of CN116740290B publication Critical patent/CN116740290B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00: Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G06N3/0455: Auto-encoder networks; Encoder-decoder networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/0464: Convolutional networks [CNN, ConvNet]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Geometry (AREA)
  • Computer Graphics (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention belongs to the technical field of image information processing, and particularly relates to a three-dimensional interaction double-hand reconstruction method and system based on deformable attention. The method obtains image grid features by extracting multi-scale image features and sampling the multi-scale image features on a grid; initializes a global feature vector into left-hand and right-hand vertex tokens, joint tokens and position embeddings through a multi-layer perceptron; and inputs the image grid features, the left-hand and right-hand vertex tokens, the joint tokens and the position embeddings into interaction blocks, where three interaction blocks reconstruct the interacting-hand mesh in a coarse-to-fine manner and the three-dimensional coordinates of the surface vertices of both hands are regressed directly from the reconstructed mesh to obtain the final reconstruction result. The invention fuses graph convolution attention and deformable multi-head self-attention to model the local and global interaction of the two hands, resolves mutual occlusion between the two hands, and reconstructs a high-quality interacting-hand mesh.

Description

Three-dimensional interaction double-hand reconstruction method and system based on deformable attention
Technical Field
The invention belongs to the technical field of image information processing, and particularly relates to a three-dimensional interaction double-hand reconstruction method and system based on deformable attention.
Background
Three-dimensional hand reconstruction is a hot topic in computer vision and human-computer interaction, with wide applications in virtual reality, human-computer interaction, robotics, digital medicine and other fields. For example, interactive two-hand reconstruction is used in the operating room to assist minimally invasive surgery. Interactive two-hand reconstruction technology can capture the motions and positions of a doctor's hands in real time by means of sensors or cameras and map them into a deep learning network model. During minimally invasive surgery, a physician may wear a sensor device that continuously tracks the motion of the physician's hands and transmits it to a computer system in real time. The computer system reconstructs the doctor's hand model with the deep learning network according to the captured hand motions and simulates the motions and positions of the surgical instruments. At the same time, the computer system may also provide haptic feedback, such as a slight vibration or pressure sensation, to enhance the physician's sense of touch. Through interactive two-hand reconstruction, doctors can observe their own hand motions and the positions of surgical instruments in real time in a virtual environment without relying directly on the surgical field of view. This provides physicians with more comprehensive information and helps them operate and make decisions more accurately. The surgeon can control surgical instruments in the virtual environment through gestures, such as rotating, grasping and cutting, and these gestures are precisely replicated in the actual procedure. Interactive two-hand reconstruction thus makes minimally invasive surgery more accurate and controllable, and reduces surgical risk and the possibility of complications.
Interactive two-hand reconstruction techniques can be broadly divided into two categories. 1) Non-parametric methods directly reconstruct the three-dimensional hand mesh. These methods aggregate neighboring vertex features with deep learning, mostly using graph convolutional networks in which each vertex is estimated by the network; multi-layer Transformer encoders are introduced to model the long-range dependencies of the hand, so that global interaction can be modeled without being limited by any mesh topology, and a hierarchical architecture is then designed to generate a high-precision mesh model from coarse to fine. 2) Parameterized methods use the parameterized hand model MANO, which regresses hand shape and pose parameters from a single RGB image to estimate the three-dimensional hand mesh: a 10-dimensional shape parameter describes the length and thickness of the hand, and a 48-dimensional pose parameter consists of the rotation vectors of 16 hand joints. Given the shape and pose parameters of MANO, a hand mesh model can be generated and the hand joint positions obtained at the same time. The network is trained by supervising the parameters of the MANO hand model.
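For orientation only, the following sketch illustrates the parameterized route just described: regressing the 10-dimensional MANO shape and 48-dimensional pose parameters from a global feature and decoding them into a hand mesh. The `mano_layer` callable, the head layout and the tensor shapes are assumptions made for illustration, not the patent's implementation or any specific library's API.

```python
import torch
import torch.nn as nn

class MANORegressor(nn.Module):
    """Illustrative regressor: global image feature -> MANO shape/pose -> hand mesh.

    `mano_layer` is assumed to map (pose[B, 48], shape[B, 10]) to
    (vertices[B, 778, 3], joints[B, 21, 3]); any MANO implementation
    exposing that interface would fit this sketch.
    """
    def __init__(self, feat_dim: int, mano_layer):
        super().__init__()
        self.shape_head = nn.Linear(feat_dim, 10)   # 10-D shape (hand length/thickness)
        self.pose_head = nn.Linear(feat_dim, 48)    # 16 joints x 3-D rotation vectors
        self.mano_layer = mano_layer

    def forward(self, global_feat: torch.Tensor):
        beta = self.shape_head(global_feat)           # (B, 10) shape parameters
        theta = self.pose_head(global_feat)           # (B, 48) pose parameters
        verts, joints = self.mano_layer(theta, beta)  # hand mesh vertices and joints
        return verts, joints
```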
Currently, a one-hand three-dimensional model can be reconstructed from a single RGB image, but most two-hand three-dimensional reconstruction still depends on depth images. Existing methods that reconstruct a two-hand three-dimensional model from a single RGB image often fail under tight interaction. As a result, existing methods have difficulty obtaining a two-hand reconstruction mesh that is accurately aligned with the image.
Disclosure of Invention
Reconstructing interacting hands from a single RGB image is a very challenging task. On the one hand, severe mutual occlusion and similar local appearance between the two hands confuse the extraction of visual features, resulting in an estimated hand mesh that is not aligned with the image. On the other hand, complex interaction patterns exist between the two interacting hands, which greatly enlarges the solution space of the hand poses and increases their complexity. Transformer-based methods effectively model non-local interactions between three-dimensional mesh vertices and hand joints, while graph convolutional neural networks can model local interactions of neighboring vertices on a pre-specified mesh topology. Therefore, the invention provides a three-dimensional interaction double-hand reconstruction method and system based on deformable attention, which fuse graph convolution attention and deformable multi-head self-attention in a Transformer encoder to model the local and global interaction of the two hands, resolve mutual occlusion between the two hands, reduce misalignment between the hand mesh and the image as well as collision between the two hands so that the generated hands have minimal artifacts and interpenetration, reconstruct a high-quality interacting-hand mesh, reduce model complexity, and improve model performance.
The invention is realized by the following technical scheme. The three-dimensional interaction double-hand reconstruction method based on deformable attention comprises the following steps:
step S1: inputting a single color (RGB) image containing both hands into the encoder-decoder structure, and extracting image features, a global feature vector, hand two-dimensional joint features, hand segmentation features and dense mapping coding features;
step S2: connecting the image features with two-dimensional joint features of the hands, hand segmentation features and dense mapping coding features, and obtaining multi-scale image features through a convolution layer;
step S3: sampling the multi-scale image feature grids to obtain image grid features;
step S4: initializing global feature vectors into vertex tokens of left and right hands, joint tokens and position embedding through a multi-layer perceptron;
step S5: inputting the image grid features, the left-hand and right-hand vertex tokens, the joint tokens and the position embeddings into the interaction blocks, reconstructing the interacting-hand mesh with a plurality of interaction blocks in a coarse-to-fine manner, and performing an up-sampling operation after each interaction block; each interaction block includes a graph convolution attention (Graphformer) module and an interacting-hand deformable attention module;
step S6: regressing the three-dimensional coordinates of the surface vertices of both hands directly from the interacting-hand reconstruction mesh to obtain the final interacting-hand reconstruction result.
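Purely as a reading aid for steps S1-S6, the skeleton below wires the stages together in PyTorch. Every injected sub-module (`encoder_decoder`, `fuse_and_sample`, `token_mlp`, the interaction blocks and up-samplers) and the `uv_per_level` sampling coordinates are placeholders assumed for illustration; the patent does not disclose code, and keeping the joint tokens and position embeddings fixed across levels is a simplification of this sketch.

```python
import torch.nn as nn

class InteractingHandsReconstructor(nn.Module):
    """Skeleton of steps S1-S6; every sub-module is injected, so this fixes only
    the data flow, not any concrete architecture."""
    def __init__(self, encoder_decoder, fuse_and_sample, token_mlp,
                 interaction_blocks, upsamplers):
        super().__init__()
        self.encoder_decoder = encoder_decoder            # S1: feature extraction
        self.fuse_and_sample = fuse_and_sample            # S2 + S3: fuse maps, grid-sample
        self.token_mlp = token_mlp                        # S4: global feature -> tokens
        self.blocks = nn.ModuleList(interaction_blocks)   # S5: coarse-to-fine interaction blocks
        self.upsamplers = nn.ModuleList(upsamplers)       # mesh up-sampling after each block

    def forward(self, rgb, uv_per_level):
        feats, global_feat, joints2d, segs, denses = self.encoder_decoder(rgb)   # S1
        grid_feats = [self.fuse_and_sample(f, j, s, d, uv)                        # S2 + S3
                      for f, j, s, d, uv in zip(feats, joints2d, segs, denses, uv_per_level)]
        verts_l, verts_r, joint_tok, pos_emb = self.token_mlp(global_feat)        # S4
        for block, up, g in zip(self.blocks, self.upsamplers, grid_feats):        # S5
            verts_l, verts_r = block(verts_l, verts_r, joint_tok, pos_emb, g)
            verts_l, verts_r = up(verts_l), up(verts_r)
        return verts_l, verts_r        # S6: regressed 3-D surface vertex coordinates
```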
The invention provides a three-dimensional interactive double-hand reconstruction system based on deformable attention, which comprises an encoder-decoder structure, a grid sampling module and an interaction module, wherein the interaction module uses a plurality of interaction blocks to reconstruct the interacting-hand mesh in a coarse-to-fine manner; each interaction block is followed by an up-sampling operation, each interaction block has the same structure, and each interaction block comprises a graph convolution attention (Graphformer) module and an interacting-hand deformable attention module.
Specifically, the invention adopts ResNet-50 as the backbone network to construct the encoder-decoder structure and pre-trains it; the trained encoder-decoder structure extracts the hand two-dimensional joint features, hand segmentation features, dense mapping coding features and global feature vector.
Specifically, the input of the interaction block includes a vertex query and a joint query for each hand h, where h ∈ {L, R} denotes the left or right hand, and the processing steps of the graph convolution attention module are as follows:
step 201: multiplying the left-hand and right-hand vertex tokens and joint tokens by trainable parameter matrices to obtain the corresponding left-hand and right-hand vertex queries and joint queries;
step 202: inputting the left-hand vertex query, the left-hand joint query and the image grid features to the graph convolution attention module, which outputs the left-hand vertex features; likewise, inputting the right-hand vertex query, the right-hand joint query and the image grid features to the graph convolution attention module, which outputs the right-hand vertex features.
Specifically, the graph convolution attention module fuses a graph residual block into the Transformer encoder, and the specific steps are as follows:
step 301: the left-hand vertex query, the left-hand joint query and the image grid features are first normalized by a normalization layer;
step 302: deformable multi-head attention is performed on the left-hand vertex query and joint query, followed by a residual connection, to obtain the left-hand features;
step 303: the left-hand features are input into the graph residual block to obtain the left-hand graph grid features;
step 304: the left-hand grid features output by the graph residual block pass through a normalization layer and a multi-layer perceptron and are then residual-connected to obtain the left-hand mesh vertex features; the right-hand mesh vertex features are obtained in the same way.
Specifically, the graph residual block comprises a main path and a branch path; the main path passes sequentially through a normalization layer, a multi-layer perceptron and two stacked graph convolution layers, in which the left-hand vertex features are progressively refined; the refined left-hand vertex features, after the normalization layer and the multi-layer perceptron, are residual-connected with the left-hand features from the branch path to obtain the left-hand graph grid features.
Specifically, the interacting-hand deformable attention module models the interaction between the two hands using symmetric deformable multi-head attention, and the specific process is as follows:
step 401: performing a deformable multi-headed self-attention on each hand to obtain query features, key features, and value features for each hand;
step 402: obtaining key features and value features of one hand through deformable multi-headed attention using the query features of the other hand;
step 403: performing interacting-hand deformable attention on the mesh vertex features of the left and right hands:
A_{R→L} = softmax( Q_L · K_R^T / √d + φ(B̂; Δr) ) · V_R
A_{L→R} = softmax( Q_R · K_L^T / √d + φ(B̂; Δr) ) · V_L
where A_{R→L} is the interactive attention feature encoding the right-hand-to-left-hand correlation, A_{L→R} is the interactive attention feature encoding the left-hand-to-right-hand correlation, d is a normalization constant, φ is the bilinear interpolation function, φ(B̂; Δr) is the deformable relative position bias, B̂ is the parameterized bias table, Δr denotes the relative positional offsets between the query features and the deformed key features, Q_L denotes the left-hand query features, Q_R denotes the right-hand query features, K_R^T is the transpose of the right-hand key features K_R, K_L^T is the transpose of the left-hand key features K_L, V_R denotes the right-hand value features, and V_L denotes the left-hand value features;
step 404: combining the interacting-hand attention features with the left-hand and right-hand mesh vertex features through a multi-layer perceptron:
F_out^L = MLP(F^L, A_{R→L}),  F_out^R = MLP(F^R, A_{L→R})
where F^L and F^R are the left-hand and right-hand mesh vertex features, F_out^L is the output left-hand mesh vertex feature, F_out^R is the output right-hand mesh vertex feature, and MLP denotes the multi-layer perceptron.
Step 405: and taking the output left and right hand grid vertex characteristics as the input of the next interaction block.
Specifically, the process of performing a deformable multi-headed self-attention on each hand to obtain query features, key features, and value features for each hand is:
1) The mesh vertex features of one hand are input, a global grid of reference points is established according to the hand mesh vertex features, and the grid is obtained by down-sampling the hand mesh vertex features by a factor;
2) The values of the grid reference points are the linear two-dimensional coordinates of the grid and are normalized to [-1, 1] according to the grid size, where (-1, -1) denotes the top-left grid point and (1, 1) denotes the bottom-right grid point;
3) The grid vertex characteristics of the hand are mapped through a first parameter matrix to obtain query characteristics, and then the query characteristics are input into a lightweight sub-network to obtain the offset of each query characteristic;
4) Adding the obtained value of the grid reference point and the obtained offset to obtain a deformed value of the grid reference point;
5) Feature sampling is carried out at the deformed grid reference point positions, using bilinear interpolation as the sampling function; the sampled deformed features are used as keys and values and are mapped through a second parameter matrix and a third parameter matrix to obtain the key features and value features, respectively;
6) And applying relative position deviation coding to the query features, the key features and the value features, then splicing the features output by each head through a multi-head attention layer, and obtaining final output features through a mapping weight matrix.
The invention also provides an electronic device, comprising: one or more processors; a memory for storing one or more programs; the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the deformable attention-based three-dimensional interactive two-hand reconstruction method.
The present invention provides a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the deformable attention-based three-dimensional interactive two-hand reconstruction method.
The invention maps the global image feature vector into left-hand and right-hand vertex tokens, joint tokens and position embeddings through an MLP (multi-layer perceptron), and then multiplies the vertex tokens and joint tokens by trainable parameter matrices to obtain the left-hand and right-hand vertex queries and joint queries. The graph convolutional network is integrated into the Transformer encoder, training is performed by feeding pixel features into the network, and the input dimension is reduced each time. In addition, deformable multi-head self-attention selects the key features and value features of the self-attention in a data-dependent manner, so that the interacting-hand deformable attention module can focus on relevant regions and capture more informative features, and the relation between the tokens of the left and right hands can be better modeled. Compared with the prior art, the technical scheme of the invention reduces the complexity of the network model and the time cost of training, overcomes the limitation that the sparse attention adopted in the prior art cannot establish a good relationship with the image data, and improves model performance. In interacting-hand reconstruction, the grid features are better aligned with the image, which reduces misalignment between the hand mesh and the image as well as collision between the two hands, giving a better interacting-hand reconstruction result.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 is a schematic structural diagram of the interaction block.
Fig. 3 is a schematic structural diagram of the graph convolution attention module.
Fig. 4 is a schematic structural diagram of the graph residual block.
Fig. 5 is a schematic structural diagram of the interacting-hand deformable attention module.
Detailed Description
The invention is further described in detail below with reference to the drawings and examples.
Referring to fig. 1, the three-dimensional interactive two-hand reconstruction method based on deformable attention comprises the following steps:
step S1: a single color (RGB) image containing both hands is input to the encoder-decoder structure; the encoder extracts the image features F_i (i = 1, 2, 3), where i denotes the feature level corresponding to the i-th interaction block; the decoder extracts the hand two-dimensional joint features T_i (i = 1, 2, 3), the hand segmentation features and the dense mapping coding features D_i (i = 1, 2, 3); the encoder-decoder also extracts the global feature vector F_G;
step S2: the image features F_i, the hand two-dimensional joint features T_i, the hand segmentation features and the dense mapping coding features D_i are concatenated, and the multi-scale image features φ_i ∈ R^{C_i×H_i×W_i} are obtained through a 1×1 convolution layer, where R^{C_i×H_i×W_i} denotes the image feature space, φ_i denotes the i-th multi-scale image feature, C_i the feature dimension, H_i the multi-scale image height, W_i the multi-scale image width, and H_i×W_i the multi-scale image resolution;
step S3: a grid sampling operation is performed on the multi-scale image features to obtain the image grid features (see the illustrative sketch after step S6);
Step S4: global feature vector F G Initializing vertex tokens of left and right hands, joint tokens and position embedding through a multi-layer perceptron;
step S5: embedding and inputting the image grid characteristics, the vertex tokens of the left hand and the right hand, the joint tokens and the positions into interaction blocks, reconstructing an interaction hand reconstruction grid by three interaction blocks in a mode from thick to thin, and performing up-sampling operation behind each interaction block;
step S6: the interactive hand reconstruction grid directly returns the three-dimensional coordinates of the vertexes of the surfaces of the two hands to obtain a result after the interactive hand is reconstructed.
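The sketch referenced in step S3 follows. It is a minimal illustration of steps S2-S3, assuming the decoder outputs are spatial maps and that "grid sampling" means bilinear sampling of the fused multi-scale map at normalized 2-D coordinates `uv`; both assumptions are illustrative rather than taken from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuseAndSample(nn.Module):
    """S2: concatenate image/joint/segmentation/dense maps and fuse with a 1x1 conv.
       S3: sample the fused multi-scale map at 2-D grid locations (bilinear)."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.fuse = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, feat, joints2d, seg, dense, uv):
        # feat, joints2d, seg, dense: (B, C_*, H_i, W_i); uv: (B, N, 2) in [-1, 1]
        phi = self.fuse(torch.cat([feat, joints2d, seg, dense], dim=1))   # (B, C_i, H_i, W_i)
        grid = uv.unsqueeze(2)                                            # (B, N, 1, 2)
        sampled = F.grid_sample(phi, grid, mode="bilinear", align_corners=False)
        return sampled.squeeze(-1).transpose(1, 2)                        # (B, N, C_i)
```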
The embodiment provides a three-dimensional interactive two-hand reconstruction system based on deformable attention, which comprises an encoder-decoder structure, a grid sampling module and an interaction module; the interaction module uses three interaction blocks to reconstruct the interacting-hand mesh in a coarse-to-fine manner. Each interaction block is followed by an up-sampling operation, each interaction block has the same structure, and each comprises a graph convolution attention (Graphformer) module and an interacting-hand deformable attention module.
Specifically, in this embodiment, the encoder-decoder structure is constructed with ResNet-50 as the backbone network and pre-trained to extract the hand two-dimensional joint features, hand segmentation features, dense mapping coding features and global feature vector.
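As a hedged illustration of this backbone choice, the snippet below pulls a global feature vector and several intermediate feature maps from a torchvision ResNet-50. Which levels feed which interaction block, and the decoder branches for joints, segmentation and dense maps, are left out; those are design choices of the implementer, not details given by the patent.

```python
import torch
import torchvision.models as models

# Illustrative encoder half only; requires a torchvision version that supports
# the `weights=` argument. The decoder for 2-D joints / segmentation / dense
# maps is omitted.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

def encode(x: torch.Tensor):
    x = backbone.conv1(x); x = backbone.bn1(x); x = backbone.relu(x)
    x = backbone.maxpool(x)
    c2 = backbone.layer1(x)    # 1/4  resolution
    c3 = backbone.layer2(c2)   # 1/8  resolution
    c4 = backbone.layer3(c3)   # 1/16 resolution
    c5 = backbone.layer4(c4)   # 1/32 resolution
    global_feat = torch.flatten(backbone.avgpool(c5), 1)   # global feature vector F_G
    return [c3, c4, c5], global_feat                        # multi-scale maps (assumed levels)
```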
FIG. 2 shows the detailed structure of the interaction blocks. Each interaction block contains a graph convolution attention (Graphformer) module and an interacting-hand deformable attention module, and its input includes two queries: a vertex query and a joint query for each hand h, where h ∈ {L, R} denotes the left or right hand. The processing steps of the graph convolution attention module are as follows:
step 201: the left-hand and right-hand vertex tokens and joint tokens are multiplied by trainable parameter matrices to obtain the corresponding left-hand and right-hand vertex queries and joint queries, h ∈ {L, R};
step 202: the left-hand vertex query, the left-hand joint query and the image grid features are input to the graph convolution attention module, which outputs the left-hand vertex features; likewise, the right-hand vertex query, the right-hand joint query and the image grid features are input to the graph convolution attention module, which outputs the right-hand vertex features.
Fig. 3 shows the detailed structure of the graph convolution attention module, which fuses a graph residual block into the Transformer encoder. The specific steps are as follows:
step 301: taking the left hand as an example, the left-hand vertex query, the left-hand joint query and the image grid features are first normalized by a normalization layer;
step 302: deformable multi-head attention is performed on the left-hand vertex query and joint query, followed by a residual connection, to obtain the left-hand features;
step 303: the left-hand features are input into the graph residual block to obtain the left-hand graph grid features. The graph residual block shown in fig. 4 comprises a main path and a branch path; the main path passes sequentially through a normalization layer, a multi-layer perceptron and two stacked graph convolution layers, in which the left-hand vertex features are progressively refined; the refined left-hand vertex features, after the normalization layer and the multi-layer perceptron, are residual-connected with the left-hand features from the branch path to obtain the left-hand graph grid features;
step 304: the left-hand grid features output by the graph residual block pass through a normalization layer and a multi-layer perceptron and are then residual-connected to obtain the left-hand mesh vertex features. The right-hand mesh vertex features can be obtained in the same way.
FIG. 5 shows the detailed structure of the interacting-hand deformable attention module, which models the interaction between the two hands using symmetric deformable multi-head attention. The specific process is as follows:
step 401: performing deformable multi-headed self-attention on each hand to obtain query features Q for each hand h Key feature K h Sum value feature V h . 1) Grid vertex feature of input handH W is the size of the grid vertex feature of the hand, H is the height, W is the width, C is the feature dimension of the grid vertex feature of the hand, according to the grid vertex feature F of the hand h Establishing a global grid reference point, wherein the size of the grid is determined by the grid vertex characteristics F of hands h Downsampling according to a factor r to obtain a grid,H G ×W G Is the size of the grid P, H G =h/r, representing grid height, W G =w/r, representing the mesh width; 2) The values of the grid reference points are the linear two-dimensional coordinates of grid P, ranging from { (0, 0), … (H G -1, W G -1) normalizing the values of the grid reference points to [ -1,1] according to the size of the grid](1, 1) represents the grid point of the upper left corner, and (1, 1) represents the grid point of the lower right corner; 3) To obtain the offset of each grid reference point, the grid vertex feature F of the hand h By a first parameter matrix W q Mapping to obtain query feature Q h Then query feature Q h Input to a lightweight subnetwork theta offset (-), get each query feature Q h Offset of +.>The method comprises the steps of carrying out a first treatment on the surface of the 4) The value p of the grid reference point obtained and the offset obtained are + ->Adding to obtain deformed gridValue p + of reference point>The method comprises the steps of carrying out a first treatment on the surface of the 5) Feature sampling at the position of deformed grid reference points using bilinear interpolation as sampling function +.>The deformation characteristics after sampling are +.>As keys and values, respectively pass through a second parameter matrix W k And a third parameter matrix W v Mapping to obtain key feature K h Sum value feature V h The method comprises the steps of carrying out a first treatment on the surface of the 6) For query feature Q h Key feature K h Sum value feature V h Applying relative position deviation coding, then passing through multiple head attention layers, splicing the characteristics output by each head together, and passing through a mapping weight matrix W 0 And obtaining the final output characteristics. Thus, not only the flexibility and efficiency of the original self-attention module are enhanced, but also more information features can be captured. The deformable attention calculation is as follows:
,/>,/>,/>
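The following single-head sketch follows the recipe of step 401 and the formula above: build a down-sampled reference grid, predict offsets from the query features with a lightweight sub-network, bilinearly sample deformed keys and values, and attend. The multi-head split, the output projection W_0 and the relative position bias φ(B̂; Δr) are omitted for brevity, and the shape of the offset sub-network is an assumption of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableSelfAttention(nn.Module):
    """Single-head sketch of step 401; feature maps are (B, C, H, W) and r is the
    reference-grid down-sampling factor."""
    def __init__(self, dim: int, r: int = 2):
        super().__init__()
        self.r = r
        self.w_q = nn.Linear(dim, dim)                      # first parameter matrix W_q
        self.w_k = nn.Linear(dim, dim)                      # second parameter matrix W_k
        self.w_v = nn.Linear(dim, dim)                      # third parameter matrix W_v
        self.offset_net = nn.Sequential(                    # lightweight sub-network theta_offset
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim), nn.GELU(), nn.Conv2d(dim, 2, 1))

    def forward(self, feat):                                # feat: (B, C, H, W)
        B, C, H, W = feat.shape
        q = self.w_q(feat.flatten(2).transpose(1, 2))       # (B, H*W, C) query features

        # Reference grid: linear 2-D coordinates down-sampled by r, normalized to [-1, 1]
        Hg, Wg = H // self.r, W // self.r
        ys = torch.linspace(-1.0, 1.0, Hg, device=feat.device)
        xs = torch.linspace(-1.0, 1.0, Wg, device=feat.device)
        ref = torch.stack(torch.meshgrid(xs, ys, indexing="xy"), dim=-1)   # (Hg, Wg, 2) as (x, y)
        ref = ref.unsqueeze(0).expand(B, -1, -1, -1)                        # (B, Hg, Wg, 2)

        # Offsets predicted from the query features, added to the reference points
        q_map = q.transpose(1, 2).reshape(B, C, H, W)
        offsets = self.offset_net(q_map)                                    # (B, 2, H, W)
        offsets = F.adaptive_avg_pool2d(offsets, (Hg, Wg)).permute(0, 2, 3, 1)
        deformed = (ref + offsets).clamp(-1.0, 1.0)                         # deformed reference points

        # Bilinear sampling at deformed positions -> keys and values
        sampled = F.grid_sample(feat, deformed, mode="bilinear", align_corners=True)
        sampled = sampled.flatten(2).transpose(1, 2)                        # (B, Hg*Wg, C)
        k, v = self.w_k(sampled), self.w_v(sampled)

        attn = torch.softmax(q @ k.transpose(1, 2) / C ** 0.5, dim=-1)      # (B, H*W, Hg*Wg)
        return attn @ v                                                      # (B, H*W, C)
```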
step 402: the query features Q_h of one hand are used to obtain the key features and value features of the other hand through deformable multi-head attention;
Step 403: performing interactive hand deformable attention to grid vertex features of left and right hands, specifically as follows:
is an interactive attention feature encoding a right-hand to left-hand correlation, < >>Is an interactive attention feature encoding left-hand to right-hand correlation, d is a normalization constant,/->For bilinear interpolation function, +.>For coding the relative position between each pair of query features and key features, enhancing general attention by spatial information, for deformable relative position deviations>For parameterized deviation table, +.>Representing the relative positional offset of the computed query feature and the deformed key feature, the normalized range [ -1,1] is computed because the deformable attention has successive key positions]Relative displacement in, interpolation in parameterized bias table +.>To cover all possible offset values, Q L Representing the query characteristics of the left hand, Q R Query feature representing right hand, +.>Key feature K representing right hand R Is a transpose of (2); />Key feature K representing left hand L Transpose of V R Representing the value characteristic of the right hand, V L A value characteristic representing the left hand;
step 404: combining the attention features of the interaction hand with the grid vertex features of the left hand and the right hand through a multi-layer perceptron:
is the output left-hand grid vertex feature, < >>Is the output right-hand grid vertex feature; MLP denotes a multi-layer perceptron.
Step 405: left-hand and right-hand grid vertex feature to be output、/>As input to the next interaction block.
The present embodiment provides an electronic device, including: one or more processors; a memory for storing one or more programs; the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the deformable attention-based three-dimensional interactive two-hand reconstruction method.
The present embodiment provides a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the deformable attention-based three-dimensional interactive two-hand reconstruction method.
The above-described invention is merely representative of embodiments of the present invention and should not be construed as limiting the scope of the invention, nor any limitation in any way as to the structure of the embodiments of the present invention. It should be noted that it will be apparent to those skilled in the art that various changes and modifications can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. The three-dimensional interaction double-hand reconstruction method based on deformable attention is characterized by comprising the following steps of:
step S1: inputting a single color (RGB) image containing both hands into the encoder-decoder structure, and extracting image features, a global feature vector, hand two-dimensional joint features, hand segmentation features and dense mapping coding features;
step S2: connecting the image features with two-dimensional joint features of the hands, hand segmentation features and dense mapping coding features, and obtaining multi-scale image features through a convolution layer;
step S3: sampling the multi-scale image feature grids to obtain image grid features;
step S4: initializing global feature vectors into vertex tokens of left and right hands, joint tokens and position embedding through a multi-layer perceptron;
step S5: inputting the image grid features, the left-hand and right-hand vertex tokens, the joint tokens and the position embeddings into the interaction blocks, reconstructing the interacting-hand mesh with a plurality of interaction blocks in a coarse-to-fine manner, and performing an up-sampling operation after each interaction block; each interaction block comprises a graph convolution attention module and an interacting-hand deformable attention module;
step S6: regressing the three-dimensional coordinates of the surface vertices of both hands directly from the interacting-hand reconstruction mesh to obtain the final interacting-hand reconstruction result.
2. The deformable attention-based three-dimensional interactive two-hand reconstruction method of claim 1, wherein the input of the interaction block comprises a vertex query and a joint query for each hand h, where h ∈ {L, R} denotes the left or right hand, and the processing steps of the graph convolution attention module are as follows:
step 201: multiplying the left-hand and right-hand vertex tokens and joint tokens by trainable parameter matrices to obtain the corresponding left-hand and right-hand vertex queries and joint queries;
step 202: inputting the left-hand vertex query, the left-hand joint query and the image grid features to the graph convolution attention module, which outputs the left-hand vertex features; likewise, inputting the right-hand vertex query, the right-hand joint query and the image grid features to the graph convolution attention module, which outputs the right-hand vertex features.
3. The deformable attention-based three-dimensional interactive two-hand reconstruction method according to claim 1, wherein the graph convolution attention module fuses a graph residual block into the Transformer encoder, comprising the following steps:
step 301: first normalizing the left-hand vertex query, the left-hand joint query and the image grid features through a normalization layer;
step 302: performing deformable multi-head attention on the left-hand vertex query and joint query, followed by a residual connection, to obtain the left-hand features;
step 303: inputting the left-hand features into the graph residual block to obtain the left-hand graph grid features;
step 304: passing the left-hand grid features output by the graph residual block through a normalization layer and a multi-layer perceptron and then performing a residual connection to obtain the left-hand mesh vertex features; the right-hand mesh vertex features are obtained in the same way.
4. The three-dimensional interactive double-hand reconstruction method based on deformable attention as claimed in claim 3, wherein the graph residual block comprises a main path and a branch path; the main path passes sequentially through a normalization layer, a multi-layer perceptron and two stacked graph convolution layers, in which the left-hand vertex features are progressively refined; the refined left-hand vertex features, after the normalization layer and the multi-layer perceptron, are residual-connected with the left-hand features from the branch path to obtain the left-hand graph grid features.
5. The three-dimensional interactive double-hand reconstruction method based on deformable attention according to claim 3, wherein the interacting-hand deformable attention module models the interaction between the two hands using symmetric deformable multi-head attention, and the specific process is as follows:
step 401: performing a deformable multi-headed self-attention on each hand to obtain query features, key features, and value features for each hand;
step 402: obtaining key features and value features of one hand through deformable multi-headed attention using the query features of the other hand;
step 403: performing interacting-hand deformable attention on the mesh vertex features of the left and right hands:
A_{R→L} = softmax( Q_L · K_R^T / √d + φ(B̂; Δr) ) · V_R
A_{L→R} = softmax( Q_R · K_L^T / √d + φ(B̂; Δr) ) · V_L
where A_{R→L} is the interactive attention feature encoding the right-hand-to-left-hand correlation, A_{L→R} is the interactive attention feature encoding the left-hand-to-right-hand correlation, d is a normalization constant, φ is the bilinear interpolation function, φ(B̂; Δr) is the deformable relative position bias, B̂ is the parameterized bias table, Δr denotes the relative positional offsets between the query features and the deformed key features, Q_L denotes the left-hand query features, Q_R denotes the right-hand query features, K_R^T is the transpose of the right-hand key features K_R, K_L^T is the transpose of the left-hand key features K_L, V_R denotes the right-hand value features, and V_L denotes the left-hand value features;
step 404: combining the interacting-hand attention features with the left-hand and right-hand mesh vertex features through a multi-layer perceptron:
F_out^L = MLP(F^L, A_{R→L}),  F_out^R = MLP(F^R, A_{L→R})
where F^L and F^R are the left-hand and right-hand mesh vertex features, F_out^L is the output left-hand mesh vertex feature, F_out^R is the output right-hand mesh vertex feature, and MLP denotes the multi-layer perceptron;
step 405: taking the output left-hand and right-hand mesh vertex features as the input of the next interaction block.
6. The method of deformable attention-based three-dimensional interactive two-hand reconstruction of claim 5, wherein the process of performing deformable multi-headed self-attention on each hand to obtain query features, key features and value features for each hand is:
inputting the mesh vertex features of one hand, establishing global grid reference points according to the hand mesh vertex features, and down-sampling the hand mesh vertex features by a factor to obtain the grid;
the values of the grid reference points are the linear two-dimensional coordinates of the grid and are normalized to [-1, 1] according to the grid size, where (-1, -1) denotes the top-left grid point and (1, 1) denotes the bottom-right grid point;
the grid vertex characteristics of the hand are mapped through a first parameter matrix to obtain query characteristics, and then the query characteristics are input into a lightweight sub-network to obtain the offset of each query characteristic;
adding the obtained value of the grid reference point and the obtained offset to obtain a deformed value of the grid reference point;
feature sampling is carried out at the deformed grid reference point positions, using bilinear interpolation as the sampling function; the sampled deformed features are used as keys and values and are mapped through a second parameter matrix and a third parameter matrix to obtain the key features and value features, respectively;
and applying relative position deviation coding to the query features, the key features and the value features, then splicing the features output by each head through a multi-head attention layer, and obtaining final output features through a mapping weight matrix.
7. A three-dimensional interactive double-hand reconstruction system based on deformable attention, comprising an encoder-decoder structure, a grid sampling module and an interaction module, characterized in that the interaction module uses a plurality of interaction blocks to reconstruct the interacting-hand mesh in a coarse-to-fine manner; an up-sampling operation is performed after each interaction block, the interaction blocks have the same structure, and each interaction block comprises a graph convolution attention module and an interacting-hand deformable attention module; the interacting-hand deformable attention module models the interaction between the two hands using symmetric deformable multi-head attention, and the specific process is as follows:
step 401: performing a deformable multi-headed self-attention on each hand to obtain query features, key features, and value features for each hand;
step 402: obtaining key features and value features of one hand through deformable multi-headed attention using the query features of the other hand;
step 403: performing interacting-hand deformable attention on the mesh vertex features of the left and right hands:
A_{R→L} = softmax( Q_L · K_R^T / √d + φ(B̂; Δr) ) · V_R
A_{L→R} = softmax( Q_R · K_L^T / √d + φ(B̂; Δr) ) · V_L
where A_{R→L} is the interactive attention feature encoding the right-hand-to-left-hand correlation, A_{L→R} is the interactive attention feature encoding the left-hand-to-right-hand correlation, d is a normalization constant, φ is the bilinear interpolation function, φ(B̂; Δr) is the deformable relative position bias, B̂ is the parameterized bias table, Δr denotes the relative positional offsets between the query features and the deformed key features, Q_L denotes the left-hand query features, Q_R denotes the right-hand query features, K_R^T is the transpose of the right-hand key features K_R, K_L^T is the transpose of the left-hand key features K_L, V_R denotes the right-hand value features, and V_L denotes the left-hand value features;
step 404: combining the interacting-hand attention features with the left-hand and right-hand mesh vertex features through a multi-layer perceptron:
F_out^L = MLP(F^L, A_{R→L}),  F_out^R = MLP(F^R, A_{L→R})
wherein F^L is the left-hand mesh vertex feature, F^R is the right-hand mesh vertex feature, F_out^L is the output left-hand mesh vertex feature, F_out^R is the output right-hand mesh vertex feature, and MLP denotes the multi-layer perceptron;
step 405: taking the output left-hand and right-hand mesh vertex features as the input of the next interaction block.
8. The deformable attention-based three-dimensional interactive two-hand reconstruction system of claim 7, wherein the graph convolution attention module fuses a graph residual block into the Transformer encoder through the following steps:
step 301: first normalizing the left-hand vertex query, the left-hand joint query and the image grid features through a normalization layer;
step 302: performing deformable multi-head attention on the left-hand vertex query and joint query, followed by a residual connection, to obtain the left-hand features;
step 303: inputting the left-hand features into the graph residual block to obtain the left-hand graph grid features;
step 304: passing the left-hand grid features output by the graph residual block through a normalization layer and a multi-layer perceptron and then performing a residual connection to obtain the left-hand mesh vertex features; the right-hand mesh vertex features are obtained in the same way.
9. An electronic device, comprising: one or more processors; a memory for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the deformable attention-based three-dimensional interactive two-hand reconstruction method of any one of claims 1-6.
10. A computer readable storage medium having stored thereon computer instructions, which when executed by a processor, implement the deformable attention-based three-dimensional interactive two-hand reconstruction method of any one of claims 1-6.
CN202311024598.3A 2023-08-15 2023-08-15 Three-dimensional interaction double-hand reconstruction method and system based on deformable attention Active CN116740290B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311024598.3A CN116740290B (en) 2023-08-15 2023-08-15 Three-dimensional interaction double-hand reconstruction method and system based on deformable attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311024598.3A CN116740290B (en) 2023-08-15 2023-08-15 Three-dimensional interaction double-hand reconstruction method and system based on deformable attention

Publications (2)

Publication Number Publication Date
CN116740290A true CN116740290A (en) 2023-09-12
CN116740290B CN116740290B (en) 2023-11-07

Family

ID=87901631

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311024598.3A Active CN116740290B (en) 2023-08-15 2023-08-15 Three-dimensional interaction double-hand reconstruction method and system based on deformable attention

Country Status (1)

Country Link
CN (1) CN116740290B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117333635A (en) * 2023-10-23 2024-01-02 中国传媒大学 Interactive two-hand three-dimensional reconstruction method and system based on single RGB image

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100207942A1 (en) * 2009-01-28 2010-08-19 Eigen, Inc. Apparatus for 3-d free hand reconstruction
CN113888697A (en) * 2021-09-28 2022-01-04 中国科学院软件研究所 Three-dimensional reconstruction method under two-hand interaction state
CN114998520A (en) * 2022-06-02 2022-09-02 东南大学 Three-dimensional interactive hand reconstruction method and system based on implicit expression
CN115170762A (en) * 2022-05-12 2022-10-11 中南民族大学 Single-view three-dimensional human hand reconstruction method, equipment and readable storage medium
CN115272608A (en) * 2022-07-12 2022-11-01 聚好看科技股份有限公司 Human hand reconstruction method and equipment
CN115880724A (en) * 2022-12-17 2023-03-31 杭州电子科技大学 Light-weight three-dimensional hand posture estimation method based on RGB image
CN116188695A (en) * 2023-02-28 2023-05-30 华中科技大学 Construction method of three-dimensional hand gesture model and three-dimensional hand gesture estimation method

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100207942A1 (en) * 2009-01-28 2010-08-19 Eigen, Inc. Apparatus for 3-d free hand reconstruction
CN113888697A (en) * 2021-09-28 2022-01-04 中国科学院软件研究所 Three-dimensional reconstruction method under two-hand interaction state
CN115170762A (en) * 2022-05-12 2022-10-11 中南民族大学 Single-view three-dimensional human hand reconstruction method, equipment and readable storage medium
CN114998520A (en) * 2022-06-02 2022-09-02 东南大学 Three-dimensional interactive hand reconstruction method and system based on implicit expression
CN115272608A (en) * 2022-07-12 2022-11-01 聚好看科技股份有限公司 Human hand reconstruction method and equipment
CN115880724A (en) * 2022-12-17 2023-03-31 杭州电子科技大学 Light-weight three-dimensional hand posture estimation method based on RGB image
CN116188695A (en) * 2023-02-28 2023-05-30 华中科技大学 Construction method of three-dimensional hand gesture model and three-dimensional hand gesture estimation method

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
KEVIN LIN et al.: "Mesh Graphormer", arXiv:2104.00272v2, pages 1-6 *
MENGCHENG LI et al.: "Interacting Attention Graph for Single Image Two-Hand Reconstruction", arXiv:2203.09364v2, pages 1-7 *
ZHUOFAN XIA et al.: "Vision Transformer with Deformable Attention", arXiv:2201.00520v3, pages 2-5 *
陈炫琦 et al.: "Dynamic gesture recognition based on attention-guided spatial graph convolutional SRU", Control and Decision *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117333635A (en) * 2023-10-23 2024-01-02 中国传媒大学 Interactive two-hand three-dimensional reconstruction method and system based on single RGB image
CN117333635B (en) * 2023-10-23 2024-04-26 中国传媒大学 Interactive two-hand three-dimensional reconstruction method and system based on single RGB image

Also Published As

Publication number Publication date
CN116740290B (en) 2023-11-07

Similar Documents

Publication Publication Date Title
Saito et al. Pifuhd: Multi-level pixel-aligned implicit function for high-resolution 3d human digitization
CN111047548B (en) Attitude transformation data processing method and device, computer equipment and storage medium
Moon et al. Deephandmesh: A weakly-supervised deep encoder-decoder framework for high-fidelity hand mesh modeling
CN113421328B (en) Three-dimensional human body virtual reconstruction method and device
CN112950775A (en) Three-dimensional face model reconstruction method and system based on self-supervision learning
CN111583399B (en) Image processing method, device, equipment, medium and electronic equipment
Jiang et al. Disentangled human body embedding based on deep hierarchical neural network
CN112330729A (en) Image depth prediction method and device, terminal device and readable storage medium
CN116740290B (en) Three-dimensional interaction double-hand reconstruction method and system based on deformable attention
CN115880724A (en) Light-weight three-dimensional hand posture estimation method based on RGB image
JP2023524252A (en) Generative nonlinear human shape model
CN111462274A Human body image synthesis method and system based on SMPL model
CN115937406A (en) Three-dimensional reconstruction method, device, equipment and storage medium
Lifkooee et al. Real-time avatar pose transfer and motion generation using locally encoded laplacian offsets
JP2024510230A (en) Multi-view neural human prediction using implicitly differentiable renderer for facial expression, body pose shape and clothing performance capture
Correia et al. 3D reconstruction of human bodies from single-view and multi-view images: A systematic review
Khan et al. Towards monocular neural facial depth estimation: Past, present, and future
Gan et al. Fine-grained multi-view hand reconstruction using inverse rendering
CA3177593A1 (en) Transformer-based shape models
Wang et al. Structure and motion recovery based on spatial-and-temporal-weighted factorization
Jian et al. Realistic face animation generation from videos
Huang et al. Detail-preserving controllable deformation from sparse examples
Qinran et al. Video‐Driven 2D Character Animation
WO2015042867A1 (en) Method for editing facial expression based on single camera and motion capture data
Duffy Advances in Applied Human Modeling and Simulation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant