CN116740290A - Three-dimensional interaction double-hand reconstruction method and system based on deformable attention - Google Patents

Three-dimensional interaction double-hand reconstruction method and system based on deformable attention

Info

Publication number
CN116740290A
Authority
CN
China
Prior art keywords
hand
grid
attention
feature
vertex
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311024598.3A
Other languages
Chinese (zh)
Other versions
CN116740290B (en)
Inventor
杨文姬
黎家瑞
王映龙
钱文彬
钟表
李佳航
廖彦文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangxi Agricultural University
Original Assignee
Jiangxi Agricultural University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangxi Agricultural University filed Critical Jiangxi Agricultural University
Priority to CN202311024598.3A priority Critical patent/CN116740290B/en
Publication of CN116740290A publication Critical patent/CN116740290A/en
Application granted granted Critical
Publication of CN116740290B publication Critical patent/CN116740290B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00: Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G06N3/0455: Auto-encoder networks; Encoder-decoder networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/0464: Convolutional networks [CNN, ConvNet]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Geometry (AREA)
  • Computer Graphics (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention belongs to the technical field of image information processing, and particularly relates to a three-dimensional interaction double-hand reconstruction method and system based on deformable attention. The method obtains image grid features by extracting multi-scale image features and sampling the multi-scale image features on a grid; initializes a global feature vector into left-hand and right-hand vertex tokens, joint tokens and position embeddings through a multi-layer perceptron; and inputs the image grid features, the left-hand and right-hand vertex tokens, the joint tokens and the position embeddings into interaction blocks, where three interaction blocks reconstruct the interacting-hand mesh in a coarse-to-fine manner and the three-dimensional coordinates of the surface vertices of both hands are regressed directly from the reconstructed mesh to obtain the final reconstruction result. The invention fuses graph convolution attention and deformable multi-head self-attention to model the local and global interaction of the two hands, resolves mutual occlusion between the two hands, and reconstructs a high-quality interacting-hand mesh.

Description

Three-dimensional interaction double-hand reconstruction method and system based on deformable attention
Technical Field
The invention belongs to the technical field of image information processing, and particularly relates to a three-dimensional interaction double-hand reconstruction method and system based on deformable attention.
Background
Three-dimensional hand reconstruction is a hot topic in computer vision and human-computer interaction, with wide applications in virtual reality, human-computer interaction, robotics, digital medicine and other fields. For example, interactive two-hand reconstruction is used in the operating room to assist minimally invasive surgery. Interactive two-hand reconstruction technology can capture the motions and positions of a doctor's hands in real time by means of sensors or cameras and map them into a deep learning network model. During minimally invasive surgery, a physician may wear a sensor device that continuously tracks the motion of the physician's hands and transmits it to a computer system in real time. The computer system reconstructs the doctor's hand model with the deep learning network according to the captured hand motions and simulates the motions and positions of the surgical instruments. At the same time, the computer system may also provide haptic feedback, such as a slight vibration or pressure sensation, to enhance the physician's sense of touch. Through interactive two-hand reconstruction, doctors can observe their own hand motions and the positions of surgical instruments in real time in a virtual environment without relying directly on the surgical field of view. This provides physicians with more comprehensive information and helps them operate and make decisions more accurately. The surgeon can control surgical instruments in the virtual environment through gestures, such as rotating, grasping and cutting, and these gestures are precisely replicated in the actual procedure. Interactive two-hand reconstruction thus makes minimally invasive surgery more accurate and controllable, and reduces surgical risk and the possibility of complications.
Interactive two-hand reconstruction techniques can be broadly divided into two categories. 1) Non-parametric methods directly reconstruct the three-dimensional hand mesh. These methods aggregate neighboring vertex features with deep learning, mostly using graph convolutional networks in which each vertex is estimated by the network; multi-layer Transformer encoders are introduced to model the long-range dependencies of the hand, so that global interaction can be modeled without being limited by any mesh topology, and a hierarchical architecture is then designed to generate a high-precision mesh model from coarse to fine. 2) Parameterized methods use the parameterized hand model MANO, which regresses hand shape and pose parameters from a single RGB image to estimate the three-dimensional hand mesh: a 10-dimensional shape parameter describes the length and thickness of the hand, and a 48-dimensional pose parameter consists of the rotation vectors of 16 hand joints. Given the shape and pose parameters of MANO, a hand mesh model can be generated and the hand joint positions obtained at the same time. The network is trained by supervising the parameters of the MANO hand model.
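For orientation only, the following sketch illustrates the parameterized route just described: regressing the 10-dimensional MANO shape and 48-dimensional pose parameters from a global feature and decoding them into a hand mesh. The `mano_layer` callable, the head layout and the tensor shapes are assumptions made for illustration, not the patent's implementation or any specific library's API.

```python
import torch
import torch.nn as nn

class MANORegressor(nn.Module):
    """Illustrative regressor: global image feature -> MANO shape/pose -> hand mesh.

    `mano_layer` is assumed to map (pose[B, 48], shape[B, 10]) to
    (vertices[B, 778, 3], joints[B, 21, 3]); any MANO implementation
    exposing that interface would fit this sketch.
    """
    def __init__(self, feat_dim: int, mano_layer):
        super().__init__()
        self.shape_head = nn.Linear(feat_dim, 10)   # 10-D shape (hand length/thickness)
        self.pose_head = nn.Linear(feat_dim, 48)    # 16 joints x 3-D rotation vectors
        self.mano_layer = mano_layer

    def forward(self, global_feat: torch.Tensor):
        beta = self.shape_head(global_feat)           # (B, 10) shape parameters
        theta = self.pose_head(global_feat)           # (B, 48) pose parameters
        verts, joints = self.mano_layer(theta, beta)  # hand mesh vertices and joints
        return verts, joints
```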
Currently, a one-hand three-dimensional model can be reconstructed from a single RGB image, but most two-hand three-dimensional reconstruction still depends on depth images. Existing methods that reconstruct a two-hand three-dimensional model from a single RGB image often fail under tight interaction. As a result, existing methods have difficulty obtaining a two-hand reconstruction mesh that is accurately aligned with the image.
Disclosure of Invention
Reconstructing interacting hands from a single RGB image is a very challenging task. On the one hand, severe mutual occlusion and similar local appearance between the two hands confuse the extraction of visual features, resulting in an estimated hand mesh that is not aligned with the image. On the other hand, complex interaction patterns exist between the two interacting hands, which greatly enlarges the solution space of the hand poses and increases their complexity. Transformer-based methods effectively model non-local interactions between three-dimensional mesh vertices and hand joints, while graph convolutional neural networks can model local interactions of neighboring vertices on a pre-specified mesh topology. Therefore, the invention provides a three-dimensional interaction double-hand reconstruction method and system based on deformable attention, which fuse graph convolution attention and deformable multi-head self-attention in a Transformer encoder to model the local and global interaction of the two hands, resolve mutual occlusion between the two hands, reduce misalignment between the hand mesh and the image as well as collision between the two hands so that the generated hands have minimal artifacts and interpenetration, reconstruct a high-quality interacting-hand mesh, reduce model complexity, and improve model performance.
The invention is realized by the following technical scheme. The three-dimensional interaction double-hand reconstruction method based on deformable attention comprises the following steps:
step S1: inputting a single color (RGB) image containing both hands into the encoder-decoder structure, and extracting image features, a global feature vector, hand two-dimensional joint features, hand segmentation features and dense mapping coding features;
step S2: connecting the image features with two-dimensional joint features of the hands, hand segmentation features and dense mapping coding features, and obtaining multi-scale image features through a convolution layer;
step S3: sampling the multi-scale image feature grids to obtain image grid features;
step S4: initializing global feature vectors into vertex tokens of left and right hands, joint tokens and position embedding through a multi-layer perceptron;
step S5: inputting the image grid features, the left-hand and right-hand vertex tokens, the joint tokens and the position embeddings into the interaction blocks, reconstructing the interacting-hand mesh with a plurality of interaction blocks in a coarse-to-fine manner, and performing an up-sampling operation after each interaction block; each interaction block includes a graph convolution attention (Graphformer) module and an interacting-hand deformable attention module;
step S6: regressing the three-dimensional coordinates of the surface vertices of both hands directly from the interacting-hand reconstruction mesh to obtain the final interacting-hand reconstruction result.
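Purely as a reading aid for steps S1-S6, the skeleton below wires the stages together in PyTorch. Every injected sub-module (`encoder_decoder`, `fuse_and_sample`, `token_mlp`, the interaction blocks and up-samplers) and the `uv_per_level` sampling coordinates are placeholders assumed for illustration; the patent does not disclose code, and keeping the joint tokens and position embeddings fixed across levels is a simplification of this sketch.

```python
import torch.nn as nn

class InteractingHandsReconstructor(nn.Module):
    """Skeleton of steps S1-S6; every sub-module is injected, so this fixes only
    the data flow, not any concrete architecture."""
    def __init__(self, encoder_decoder, fuse_and_sample, token_mlp,
                 interaction_blocks, upsamplers):
        super().__init__()
        self.encoder_decoder = encoder_decoder            # S1: feature extraction
        self.fuse_and_sample = fuse_and_sample            # S2 + S3: fuse maps, grid-sample
        self.token_mlp = token_mlp                        # S4: global feature -> tokens
        self.blocks = nn.ModuleList(interaction_blocks)   # S5: coarse-to-fine interaction blocks
        self.upsamplers = nn.ModuleList(upsamplers)       # mesh up-sampling after each block

    def forward(self, rgb, uv_per_level):
        feats, global_feat, joints2d, segs, denses = self.encoder_decoder(rgb)   # S1
        grid_feats = [self.fuse_and_sample(f, j, s, d, uv)                        # S2 + S3
                      for f, j, s, d, uv in zip(feats, joints2d, segs, denses, uv_per_level)]
        verts_l, verts_r, joint_tok, pos_emb = self.token_mlp(global_feat)        # S4
        for block, up, g in zip(self.blocks, self.upsamplers, grid_feats):        # S5
            verts_l, verts_r = block(verts_l, verts_r, joint_tok, pos_emb, g)
            verts_l, verts_r = up(verts_l), up(verts_r)
        return verts_l, verts_r        # S6: regressed 3-D surface vertex coordinates
```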
The invention provides a three-dimensional interactive double-hand reconstruction system based on deformable attention, which comprises an encoder-decoder structure, a grid sampling module and an interaction module, wherein the interaction module uses a plurality of interaction blocks to reconstruct the interacting-hand mesh in a coarse-to-fine manner; each interaction block is followed by an up-sampling operation, each interaction block has the same structure, and each interaction block comprises a graph convolution attention (Graphformer) module and an interacting-hand deformable attention module.
Specifically, the invention adopts ResNet-50 as the backbone network to construct the encoder-decoder structure and pre-trains it; the trained encoder-decoder structure extracts the hand two-dimensional joint features, hand segmentation features, dense mapping coding features and global feature vector.
Specifically, the input of the interaction block includes a vertex query and a joint query for each hand h, where h ∈ {L, R} denotes the left or right hand, and the processing steps of the graph convolution attention module are as follows:
step 201: multiplying the left-hand and right-hand vertex tokens and joint tokens by trainable parameter matrices to obtain the corresponding left-hand and right-hand vertex queries and joint queries;
step 202: inputting the left-hand vertex query, the left-hand joint query and the image grid features to the graph convolution attention module, which outputs the left-hand vertex features; likewise, inputting the right-hand vertex query, the right-hand joint query and the image grid features to the graph convolution attention module, which outputs the right-hand vertex features.
Specifically, the graph convolution attention module fuses a graph residual block into the Transformer encoder, and the specific steps are as follows:
step 301: the left-hand vertex query, the left-hand joint query and the image grid features are first normalized by a normalization layer;
step 302: deformable multi-head attention is performed on the left-hand vertex query and joint query, followed by a residual connection, to obtain the left-hand features;
step 303: the left-hand features are input into the graph residual block to obtain the left-hand graph grid features;
step 304: the left-hand grid features output by the graph residual block pass through a normalization layer and a multi-layer perceptron and are then residual-connected to obtain the left-hand mesh vertex features; the right-hand mesh vertex features are obtained in the same way.
Specifically, the graph residual block comprises a main path and a branch path; the main path passes sequentially through a normalization layer, a multi-layer perceptron and two stacked graph convolution layers, in which the left-hand vertex features are progressively refined; the refined left-hand vertex features, after the normalization layer and the multi-layer perceptron, are residual-connected with the left-hand features from the branch path to obtain the left-hand graph grid features.
Specifically, the interacting-hand deformable attention module models the interaction between the two hands using symmetric deformable multi-head attention, and the specific process is as follows:
step 401: performing a deformable multi-headed self-attention on each hand to obtain query features, key features, and value features for each hand;
step 402: obtaining key features and value features of one hand through deformable multi-headed attention using the query features of the other hand;
step 403: performing interacting-hand deformable attention on the mesh vertex features of the left and right hands:
A_{R→L} = softmax( Q_L · K_R^T / √d + φ(B̂; Δr) ) · V_R
A_{L→R} = softmax( Q_R · K_L^T / √d + φ(B̂; Δr) ) · V_L
where A_{R→L} is the interactive attention feature encoding the right-hand-to-left-hand correlation, A_{L→R} is the interactive attention feature encoding the left-hand-to-right-hand correlation, d is a normalization constant, φ is the bilinear interpolation function, φ(B̂; Δr) is the deformable relative position bias, B̂ is the parameterized bias table, Δr denotes the relative positional offsets between the query features and the deformed key features, Q_L denotes the left-hand query features, Q_R denotes the right-hand query features, K_R^T is the transpose of the right-hand key features K_R, K_L^T is the transpose of the left-hand key features K_L, V_R denotes the right-hand value features, and V_L denotes the left-hand value features;
step 404: combining the interacting-hand attention features with the left-hand and right-hand mesh vertex features through a multi-layer perceptron:
F_out^L = MLP(F^L, A_{R→L}),  F_out^R = MLP(F^R, A_{L→R})
where F^L and F^R are the left-hand and right-hand mesh vertex features, F_out^L is the output left-hand mesh vertex feature, F_out^R is the output right-hand mesh vertex feature, and MLP denotes the multi-layer perceptron.
Step 405: and taking the output left and right hand grid vertex characteristics as the input of the next interaction block.
Specifically, the process of performing a deformable multi-headed self-attention on each hand to obtain query features, key features, and value features for each hand is:
1) The mesh vertex features of one hand are input, a global grid of reference points is established according to the hand mesh vertex features, and the grid is obtained by down-sampling the hand mesh vertex features by a factor;
2) The values of the grid reference points are the linear two-dimensional coordinates of the grid and are normalized to [-1, 1] according to the grid size, where (-1, -1) denotes the top-left grid point and (1, 1) denotes the bottom-right grid point;
3) The grid vertex characteristics of the hand are mapped through a first parameter matrix to obtain query characteristics, and then the query characteristics are input into a lightweight sub-network to obtain the offset of each query characteristic;
4) Adding the obtained value of the grid reference point and the obtained offset to obtain a deformed value of the grid reference point;
5) Feature sampling is carried out at the deformed grid reference point positions, using bilinear interpolation as the sampling function; the sampled deformed features are used as keys and values and are mapped through a second parameter matrix and a third parameter matrix to obtain the key features and value features, respectively;
6) And applying relative position deviation coding to the query features, the key features and the value features, then splicing the features output by each head through a multi-head attention layer, and obtaining final output features through a mapping weight matrix.
The invention also provides an electronic device, comprising: one or more processors; a memory for storing one or more programs; the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the deformable attention-based three-dimensional interactive two-hand reconstruction method.
The present invention provides a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the deformable attention-based three-dimensional interactive two-hand reconstruction method.
The invention maps the global image feature vector into left-hand and right-hand vertex tokens, joint tokens and position embeddings through an MLP (multi-layer perceptron), and then multiplies the vertex tokens and joint tokens by trainable parameter matrices to obtain the left-hand and right-hand vertex queries and joint queries. The graph convolutional network is integrated into the Transformer encoder, training is performed by feeding pixel features into the network, and the input dimension is reduced each time. In addition, deformable multi-head self-attention selects the key features and value features of the self-attention in a data-dependent manner, so that the interacting-hand deformable attention module can focus on relevant regions and capture more informative features, and the relation between the tokens of the left and right hands can be better modeled. Compared with the prior art, the technical scheme of the invention reduces the complexity of the network model and the time cost of training, overcomes the limitation that the sparse attention adopted in the prior art cannot establish a good relationship with the image data, and improves model performance. In interacting-hand reconstruction, the grid features are better aligned with the image, which reduces misalignment between the hand mesh and the image as well as collision between the two hands, giving a better interacting-hand reconstruction result.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 is a schematic structural diagram of the interaction block.
Fig. 3 is a schematic structural diagram of the graph convolution attention module.
Fig. 4 is a schematic structural diagram of the graph residual block.
Fig. 5 is a schematic structural diagram of the interacting-hand deformable attention module.
Detailed Description
The invention is further described in detail below with reference to the drawings and examples.
Referring to fig. 1, the three-dimensional interactive two-hand reconstruction method based on deformable attention comprises the following steps:
step S1: a single color (RGB) image containing both hands is input to the encoder-decoder structure; the encoder extracts the image features F_i (i = 1, 2, 3), where i denotes the feature level corresponding to the i-th interaction block; the decoder extracts the hand two-dimensional joint features T_i (i = 1, 2, 3), the hand segmentation features and the dense mapping coding features D_i (i = 1, 2, 3); the encoder-decoder also extracts the global feature vector F_G;
step S2: the image features F_i, the hand two-dimensional joint features T_i, the hand segmentation features and the dense mapping coding features D_i are concatenated, and the multi-scale image features φ_i ∈ R^{C_i×H_i×W_i} are obtained through a 1×1 convolution layer, where R^{C_i×H_i×W_i} denotes the image feature space, φ_i denotes the i-th multi-scale image feature, C_i the feature dimension, H_i the multi-scale image height, W_i the multi-scale image width, and H_i×W_i the multi-scale image resolution;
step S3: a grid sampling operation is performed on the multi-scale image features to obtain the image grid features (see the illustrative sketch after step S6);
Step S4: global feature vector F G Initializing vertex tokens of left and right hands, joint tokens and position embedding through a multi-layer perceptron;
step S5: embedding and inputting the image grid characteristics, the vertex tokens of the left hand and the right hand, the joint tokens and the positions into interaction blocks, reconstructing an interaction hand reconstruction grid by three interaction blocks in a mode from thick to thin, and performing up-sampling operation behind each interaction block;
step S6: the interactive hand reconstruction grid directly returns the three-dimensional coordinates of the vertexes of the surfaces of the two hands to obtain a result after the interactive hand is reconstructed.
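The sketch referenced in step S3 follows. It is a minimal illustration of steps S2-S3, assuming the decoder outputs are spatial maps and that "grid sampling" means bilinear sampling of the fused multi-scale map at normalized 2-D coordinates `uv`; both assumptions are illustrative rather than taken from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuseAndSample(nn.Module):
    """S2: concatenate image/joint/segmentation/dense maps and fuse with a 1x1 conv.
       S3: sample the fused multi-scale map at 2-D grid locations (bilinear)."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.fuse = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, feat, joints2d, seg, dense, uv):
        # feat, joints2d, seg, dense: (B, C_*, H_i, W_i); uv: (B, N, 2) in [-1, 1]
        phi = self.fuse(torch.cat([feat, joints2d, seg, dense], dim=1))   # (B, C_i, H_i, W_i)
        grid = uv.unsqueeze(2)                                            # (B, N, 1, 2)
        sampled = F.grid_sample(phi, grid, mode="bilinear", align_corners=False)
        return sampled.squeeze(-1).transpose(1, 2)                        # (B, N, C_i)
```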
The embodiment provides a three-dimensional interactive two-hand reconstruction system based on deformable attention, which comprises an encoder-decoder structure, a grid sampling module and an interaction module; the interaction module uses three interaction blocks to reconstruct the interacting-hand mesh in a coarse-to-fine manner. Each interaction block is followed by an up-sampling operation, each interaction block has the same structure, and each comprises a graph convolution attention (Graphformer) module and an interacting-hand deformable attention module.
Specifically, in this embodiment, the encoder-decoder structure is constructed with ResNet-50 as the backbone network and pre-trained to extract the hand two-dimensional joint features, hand segmentation features, dense mapping coding features and global feature vector.
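As a hedged illustration of this backbone choice, the snippet below pulls a global feature vector and several intermediate feature maps from a torchvision ResNet-50. Which levels feed which interaction block, and the decoder branches for joints, segmentation and dense maps, are left out; those are design choices of the implementer, not details given by the patent.

```python
import torch
import torchvision.models as models

# Illustrative encoder half only; requires a torchvision version that supports
# the `weights=` argument. The decoder for 2-D joints / segmentation / dense
# maps is omitted.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

def encode(x: torch.Tensor):
    x = backbone.conv1(x); x = backbone.bn1(x); x = backbone.relu(x)
    x = backbone.maxpool(x)
    c2 = backbone.layer1(x)    # 1/4  resolution
    c3 = backbone.layer2(c2)   # 1/8  resolution
    c4 = backbone.layer3(c3)   # 1/16 resolution
    c5 = backbone.layer4(c4)   # 1/32 resolution
    global_feat = torch.flatten(backbone.avgpool(c5), 1)   # global feature vector F_G
    return [c3, c4, c5], global_feat                        # multi-scale maps (assumed levels)
```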
FIG. 2 shows the detailed structure of the interaction blocks. Each interaction block contains a graph convolution attention (Graphformer) module and an interacting-hand deformable attention module, and its input includes two queries: a vertex query and a joint query for each hand h, where h ∈ {L, R} denotes the left or right hand. The processing steps of the graph convolution attention module are as follows:
step 201: the left-hand and right-hand vertex tokens and joint tokens are multiplied by trainable parameter matrices to obtain the corresponding left-hand and right-hand vertex queries and joint queries, h ∈ {L, R};
step 202: the left-hand vertex query, the left-hand joint query and the image grid features are input to the graph convolution attention module, which outputs the left-hand vertex features; likewise, the right-hand vertex query, the right-hand joint query and the image grid features are input to the graph convolution attention module, which outputs the right-hand vertex features.
Fig. 3 shows the detailed structure of the graph convolution attention module, which fuses a graph residual block into the Transformer encoder. The specific steps are as follows:
step 301: taking the left hand as an example, the left-hand vertex query, the left-hand joint query and the image grid features are first normalized by a normalization layer;
step 302: deformable multi-head attention is performed on the left-hand vertex query and joint query, followed by a residual connection, to obtain the left-hand features;
step 303: the left-hand features are input into the graph residual block to obtain the left-hand graph grid features. The graph residual block shown in fig. 4 comprises a main path and a branch path; the main path passes sequentially through a normalization layer, a multi-layer perceptron and two stacked graph convolution layers, in which the left-hand vertex features are progressively refined; the refined left-hand vertex features, after the normalization layer and the multi-layer perceptron, are residual-connected with the left-hand features from the branch path to obtain the left-hand graph grid features;
step 304: the left-hand grid features output by the graph residual block pass through a normalization layer and a multi-layer perceptron and are then residual-connected to obtain the left-hand mesh vertex features. The right-hand mesh vertex features can be obtained in the same way.
FIG. 5 shows the detailed structure of the interacting-hand deformable attention module, which models the interaction between the two hands using symmetric deformable multi-head attention. The specific process is as follows:
step 401: performing deformable multi-headed self-attention on each hand to obtain query features Q for each hand h Key feature K h Sum value feature V h . 1) Grid vertex feature of input handH W is the size of the grid vertex feature of the hand, H is the height, W is the width, C is the feature dimension of the grid vertex feature of the hand, according to the grid vertex feature F of the hand h Establishing a global grid reference point, wherein the size of the grid is determined by the grid vertex characteristics F of hands h Downsampling according to a factor r to obtain a grid,H G ×W G Is the size of the grid P, H G =h/r, representing grid height, W G =w/r, representing the mesh width; 2) The values of the grid reference points are the linear two-dimensional coordinates of grid P, ranging from { (0, 0), … (H G -1, W G -1) normalizing the values of the grid reference points to [ -1,1] according to the size of the grid](1, 1) represents the grid point of the upper left corner, and (1, 1) represents the grid point of the lower right corner; 3) To obtain the offset of each grid reference point, the grid vertex feature F of the hand h By a first parameter matrix W q Mapping to obtain query feature Q h Then query feature Q h Input to a lightweight subnetwork theta offset (-), get each query feature Q h Offset of +.>The method comprises the steps of carrying out a first treatment on the surface of the 4) The value p of the grid reference point obtained and the offset obtained are + ->Adding to obtain deformed gridValue p + of reference point>The method comprises the steps of carrying out a first treatment on the surface of the 5) Feature sampling at the position of deformed grid reference points using bilinear interpolation as sampling function +.>The deformation characteristics after sampling are +.>As keys and values, respectively pass through a second parameter matrix W k And a third parameter matrix W v Mapping to obtain key feature K h Sum value feature V h The method comprises the steps of carrying out a first treatment on the surface of the 6) For query feature Q h Key feature K h Sum value feature V h Applying relative position deviation coding, then passing through multiple head attention layers, splicing the characteristics output by each head together, and passing through a mapping weight matrix W 0 And obtaining the final output characteristics. Thus, not only the flexibility and efficiency of the original self-attention module are enhanced, but also more information features can be captured. The deformable attention calculation is as follows:
,/>,/>,/>
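The following single-head sketch follows the recipe of step 401 and the formula above: build a down-sampled reference grid, predict offsets from the query features with a lightweight sub-network, bilinearly sample deformed keys and values, and attend. The multi-head split, the output projection W_0 and the relative position bias φ(B̂; Δr) are omitted for brevity, and the shape of the offset sub-network is an assumption of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableSelfAttention(nn.Module):
    """Single-head sketch of step 401; feature maps are (B, C, H, W) and r is the
    reference-grid down-sampling factor."""
    def __init__(self, dim: int, r: int = 2):
        super().__init__()
        self.r = r
        self.w_q = nn.Linear(dim, dim)                      # first parameter matrix W_q
        self.w_k = nn.Linear(dim, dim)                      # second parameter matrix W_k
        self.w_v = nn.Linear(dim, dim)                      # third parameter matrix W_v
        self.offset_net = nn.Sequential(                    # lightweight sub-network theta_offset
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim), nn.GELU(), nn.Conv2d(dim, 2, 1))

    def forward(self, feat):                                # feat: (B, C, H, W)
        B, C, H, W = feat.shape
        q = self.w_q(feat.flatten(2).transpose(1, 2))       # (B, H*W, C) query features

        # Reference grid: linear 2-D coordinates down-sampled by r, normalized to [-1, 1]
        Hg, Wg = H // self.r, W // self.r
        ys = torch.linspace(-1.0, 1.0, Hg, device=feat.device)
        xs = torch.linspace(-1.0, 1.0, Wg, device=feat.device)
        ref = torch.stack(torch.meshgrid(xs, ys, indexing="xy"), dim=-1)   # (Hg, Wg, 2) as (x, y)
        ref = ref.unsqueeze(0).expand(B, -1, -1, -1)                        # (B, Hg, Wg, 2)

        # Offsets predicted from the query features, added to the reference points
        q_map = q.transpose(1, 2).reshape(B, C, H, W)
        offsets = self.offset_net(q_map)                                    # (B, 2, H, W)
        offsets = F.adaptive_avg_pool2d(offsets, (Hg, Wg)).permute(0, 2, 3, 1)
        deformed = (ref + offsets).clamp(-1.0, 1.0)                         # deformed reference points

        # Bilinear sampling at deformed positions -> keys and values
        sampled = F.grid_sample(feat, deformed, mode="bilinear", align_corners=True)
        sampled = sampled.flatten(2).transpose(1, 2)                        # (B, Hg*Wg, C)
        k, v = self.w_k(sampled), self.w_v(sampled)

        attn = torch.softmax(q @ k.transpose(1, 2) / C ** 0.5, dim=-1)      # (B, H*W, Hg*Wg)
        return attn @ v                                                      # (B, H*W, C)
```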
step 402: the query features Q_h of one hand are used to obtain the key features and value features of the other hand through deformable multi-head attention;
Step 403: performing interactive hand deformable attention to grid vertex features of left and right hands, specifically as follows:
is an interactive attention feature encoding a right-hand to left-hand correlation, < >>Is an interactive attention feature encoding left-hand to right-hand correlation, d is a normalization constant,/->For bilinear interpolation function, +.>For coding the relative position between each pair of query features and key features, enhancing general attention by spatial information, for deformable relative position deviations>For parameterized deviation table, +.>Representing the relative positional offset of the computed query feature and the deformed key feature, the normalized range [ -1,1] is computed because the deformable attention has successive key positions]Relative displacement in, interpolation in parameterized bias table +.>To cover all possible offset values, Q L Representing the query characteristics of the left hand, Q R Query feature representing right hand, +.>Key feature K representing right hand R Is a transpose of (2); />Key feature K representing left hand L Transpose of V R Representing the value characteristic of the right hand, V L A value characteristic representing the left hand;
step 404: combining the attention features of the interaction hand with the grid vertex features of the left hand and the right hand through a multi-layer perceptron:
is the output left-hand grid vertex feature, < >>Is the output right-hand grid vertex feature; MLP denotes a multi-layer perceptron.
Step 405: left-hand and right-hand grid vertex feature to be output、/>As input to the next interaction block.
The present embodiment provides an electronic device, including: one or more processors; a memory for storing one or more programs; the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the deformable attention-based three-dimensional interactive two-hand reconstruction method.
The present embodiment provides a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the deformable attention-based three-dimensional interactive two-hand reconstruction method.
The above-described invention is merely representative of embodiments of the present invention and should not be construed as limiting the scope of the invention, nor any limitation in any way as to the structure of the embodiments of the present invention. It should be noted that it will be apparent to those skilled in the art that various changes and modifications can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. The three-dimensional interaction double-hand reconstruction method based on deformable attention is characterized by comprising the following steps of:
step S1: inputting a single color (RGB) image containing both hands into the encoder-decoder structure, and extracting image features, a global feature vector, hand two-dimensional joint features, hand segmentation features and dense mapping coding features;
step S2: connecting the image features with two-dimensional joint features of the hands, hand segmentation features and dense mapping coding features, and obtaining multi-scale image features through a convolution layer;
step S3: sampling the multi-scale image feature grids to obtain image grid features;
step S4: initializing global feature vectors into vertex tokens of left and right hands, joint tokens and position embedding through a multi-layer perceptron;
step S5: inputting the image grid features, the left-hand and right-hand vertex tokens, the joint tokens and the position embeddings into the interaction blocks, reconstructing the interacting-hand mesh with a plurality of interaction blocks in a coarse-to-fine manner, and performing an up-sampling operation after each interaction block; each interaction block comprises a graph convolution attention module and an interacting-hand deformable attention module;
step S6: regressing the three-dimensional coordinates of the surface vertices of both hands directly from the interacting-hand reconstruction mesh to obtain the final interacting-hand reconstruction result.
2. The deformable attention-based three-dimensional interactive two-hand reconstruction method of claim 1, wherein the input of the interaction block comprises a vertex query and a joint query for each hand h, where h ∈ {L, R} denotes the left or right hand, and the processing steps of the graph convolution attention module are as follows:
step 201: multiplying the left-hand and right-hand vertex tokens and joint tokens by trainable parameter matrices to obtain the corresponding left-hand and right-hand vertex queries and joint queries;
step 202: inputting the left-hand vertex query, the left-hand joint query and the image grid features to the graph convolution attention module, which outputs the left-hand vertex features; likewise, inputting the right-hand vertex query, the right-hand joint query and the image grid features to the graph convolution attention module, which outputs the right-hand vertex features.
3. The deformable attention-based three-dimensional interactive two-hand reconstruction method according to claim 1, wherein the graph convolution attention module fuses a graph residual block into the Transformer encoder, comprising the following steps:
step 301: first normalizing the left-hand vertex query, the left-hand joint query and the image grid features through a normalization layer;
step 302: performing deformable multi-head attention on the left-hand vertex query and joint query, followed by a residual connection, to obtain the left-hand features;
step 303: inputting the left-hand features into the graph residual block to obtain the left-hand graph grid features;
step 304: passing the left-hand grid features output by the graph residual block through a normalization layer and a multi-layer perceptron and then performing a residual connection to obtain the left-hand mesh vertex features; the right-hand mesh vertex features are obtained in the same way.
4. The three-dimensional interactive double-hand reconstruction method based on deformable attention as claimed in claim 3, wherein the graph residual block comprises a main path and a branch path; the main path passes sequentially through a normalization layer, a multi-layer perceptron and two stacked graph convolution layers, in which the left-hand vertex features are progressively refined; the refined left-hand vertex features, after the normalization layer and the multi-layer perceptron, are residual-connected with the left-hand features from the branch path to obtain the left-hand graph grid features.
5. The three-dimensional interactive double-hand reconstruction method based on deformable attention according to claim 3, wherein the interacting-hand deformable attention module models the interaction between the two hands using symmetric deformable multi-head attention, and the specific process is as follows:
step 401: performing a deformable multi-headed self-attention on each hand to obtain query features, key features, and value features for each hand;
step 402: obtaining key features and value features of one hand through deformable multi-headed attention using the query features of the other hand;
step 403: performing interacting-hand deformable attention on the mesh vertex features of the left and right hands:
A_{R→L} = softmax( Q_L · K_R^T / √d + φ(B̂; Δr) ) · V_R
A_{L→R} = softmax( Q_R · K_L^T / √d + φ(B̂; Δr) ) · V_L
where A_{R→L} is the interactive attention feature encoding the right-hand-to-left-hand correlation, A_{L→R} is the interactive attention feature encoding the left-hand-to-right-hand correlation, d is a normalization constant, φ is the bilinear interpolation function, φ(B̂; Δr) is the deformable relative position bias, B̂ is the parameterized bias table, Δr denotes the relative positional offsets between the query features and the deformed key features, Q_L denotes the left-hand query features, Q_R denotes the right-hand query features, K_R^T is the transpose of the right-hand key features K_R, K_L^T is the transpose of the left-hand key features K_L, V_R denotes the right-hand value features, and V_L denotes the left-hand value features;
step 404: combining the interacting-hand attention features with the left-hand and right-hand mesh vertex features through a multi-layer perceptron:
F_out^L = MLP(F^L, A_{R→L}),  F_out^R = MLP(F^R, A_{L→R})
where F^L and F^R are the left-hand and right-hand mesh vertex features, F_out^L is the output left-hand mesh vertex feature, F_out^R is the output right-hand mesh vertex feature, and MLP denotes the multi-layer perceptron;
step 405: taking the output left-hand and right-hand mesh vertex features as the input of the next interaction block.
6. The method of deformable attention-based three-dimensional interactive two-hand reconstruction of claim 5, wherein the process of performing deformable multi-headed self-attention on each hand to obtain query features, key features and value features for each hand is:
inputting the mesh vertex features of one hand, establishing global grid reference points according to the hand mesh vertex features, and down-sampling the hand mesh vertex features by a factor to obtain the grid;
the values of the grid reference points are the linear two-dimensional coordinates of the grid and are normalized to [-1, 1] according to the grid size, where (-1, -1) denotes the top-left grid point and (1, 1) denotes the bottom-right grid point;
the grid vertex characteristics of the hand are mapped through a first parameter matrix to obtain query characteristics, and then the query characteristics are input into a lightweight sub-network to obtain the offset of each query characteristic;
adding the obtained value of the grid reference point and the obtained offset to obtain a deformed value of the grid reference point;
feature sampling is carried out at the deformed grid reference point positions, using bilinear interpolation as the sampling function; the sampled deformed features are used as keys and values and are mapped through a second parameter matrix and a third parameter matrix to obtain the key features and value features, respectively;
and applying relative position deviation coding to the query features, the key features and the value features, then splicing the features output by each head through a multi-head attention layer, and obtaining final output features through a mapping weight matrix.
7. A three-dimensional interactive double-hand reconstruction system based on deformable attention, comprising an encoder-decoder structure, a grid sampling module and an interaction module, characterized in that the interaction module uses a plurality of interaction blocks to reconstruct the interacting-hand mesh in a coarse-to-fine manner; an up-sampling operation is performed after each interaction block, the interaction blocks have the same structure, and each interaction block comprises a graph convolution attention module and an interacting-hand deformable attention module; the interacting-hand deformable attention module models the interaction between the two hands using symmetric deformable multi-head attention, and the specific process is as follows:
step 401: performing a deformable multi-headed self-attention on each hand to obtain query features, key features, and value features for each hand;
step 402: obtaining key features and value features of one hand through deformable multi-headed attention using the query features of the other hand;
step 403: performing interacting-hand deformable attention on the mesh vertex features of the left and right hands:
A_{R→L} = softmax( Q_L · K_R^T / √d + φ(B̂; Δr) ) · V_R
A_{L→R} = softmax( Q_R · K_L^T / √d + φ(B̂; Δr) ) · V_L
where A_{R→L} is the interactive attention feature encoding the right-hand-to-left-hand correlation, A_{L→R} is the interactive attention feature encoding the left-hand-to-right-hand correlation, d is a normalization constant, φ is the bilinear interpolation function, φ(B̂; Δr) is the deformable relative position bias, B̂ is the parameterized bias table, Δr denotes the relative positional offsets between the query features and the deformed key features, Q_L denotes the left-hand query features, Q_R denotes the right-hand query features, K_R^T is the transpose of the right-hand key features K_R, K_L^T is the transpose of the left-hand key features K_L, V_R denotes the right-hand value features, and V_L denotes the left-hand value features;
step 404: combining the interacting-hand attention features with the left-hand and right-hand mesh vertex features through a multi-layer perceptron:
F_out^L = MLP(F^L, A_{R→L}),  F_out^R = MLP(F^R, A_{L→R})
wherein F^L is the left-hand mesh vertex feature, F^R is the right-hand mesh vertex feature, F_out^L is the output left-hand mesh vertex feature, F_out^R is the output right-hand mesh vertex feature, and MLP denotes the multi-layer perceptron;
step 405: taking the output left-hand and right-hand mesh vertex features as the input of the next interaction block.
8. The deformable attention-based three-dimensional interactive two-hand reconstruction system of claim 7, wherein the graph convolution attention module fuses a graph residual block into the Transformer encoder through the following steps:
step 301: first normalizing the left-hand vertex query, the left-hand joint query and the image grid features through a normalization layer;
step 302: performing deformable multi-head attention on the left-hand vertex query and joint query, followed by a residual connection, to obtain the left-hand features;
step 303: inputting the left-hand features into the graph residual block to obtain the left-hand graph grid features;
step 304: passing the left-hand grid features output by the graph residual block through a normalization layer and a multi-layer perceptron and then performing a residual connection to obtain the left-hand mesh vertex features; the right-hand mesh vertex features are obtained in the same way.
9. An electronic device, comprising: one or more processors; a memory for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the deformable attention-based three-dimensional interactive two-hand reconstruction method of any one of claims 1-6.
10. A computer readable storage medium having stored thereon computer instructions, which when executed by a processor, implement the deformable attention-based three-dimensional interactive two-hand reconstruction method of any one of claims 1-6.
CN202311024598.3A 2023-08-15 2023-08-15 Three-dimensional interaction double-hand reconstruction method and system based on deformable attention Active CN116740290B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311024598.3A CN116740290B (en) 2023-08-15 2023-08-15 Three-dimensional interaction double-hand reconstruction method and system based on deformable attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311024598.3A CN116740290B (en) 2023-08-15 2023-08-15 Three-dimensional interaction double-hand reconstruction method and system based on deformable attention

Publications (2)

Publication Number Publication Date
CN116740290A true CN116740290A (en) 2023-09-12
CN116740290B CN116740290B (en) 2023-11-07

Family

ID=87901631

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311024598.3A Active CN116740290B (en) 2023-08-15 2023-08-15 Three-dimensional interaction double-hand reconstruction method and system based on deformable attention

Country Status (1)

Country Link
CN (1) CN116740290B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117333635A (en) * 2023-10-23 2024-01-02 中国传媒大学 Interactive two-hand three-dimensional reconstruction method and system based on single RGB image

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100207942A1 (en) * 2009-01-28 2010-08-19 Eigen, Inc. Apparatus for 3-d free hand reconstruction
CN113888697A (en) * 2021-09-28 2022-01-04 中国科学院软件研究所 Three-dimensional reconstruction method under two-hand interaction state
CN114998520A (en) * 2022-06-02 2022-09-02 东南大学 Three-dimensional interactive hand reconstruction method and system based on implicit expression
CN115170762A (en) * 2022-05-12 2022-10-11 中南民族大学 Single-view three-dimensional human hand reconstruction method, equipment and readable storage medium
CN115272608A (en) * 2022-07-12 2022-11-01 聚好看科技股份有限公司 Human hand reconstruction method and equipment
CN115880724A (en) * 2022-12-17 2023-03-31 杭州电子科技大学 Light-weight three-dimensional hand posture estimation method based on RGB image
CN116188695A (en) * 2023-02-28 2023-05-30 华中科技大学 Construction method of three-dimensional hand gesture model and three-dimensional hand gesture estimation method

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100207942A1 (en) * 2009-01-28 2010-08-19 Eigen, Inc. Apparatus for 3-d free hand reconstruction
CN113888697A (en) * 2021-09-28 2022-01-04 中国科学院软件研究所 Three-dimensional reconstruction method under two-hand interaction state
CN115170762A (en) * 2022-05-12 2022-10-11 中南民族大学 Single-view three-dimensional human hand reconstruction method, equipment and readable storage medium
CN114998520A (en) * 2022-06-02 2022-09-02 东南大学 Three-dimensional interactive hand reconstruction method and system based on implicit expression
CN115272608A (en) * 2022-07-12 2022-11-01 聚好看科技股份有限公司 Human hand reconstruction method and equipment
CN115880724A (en) * 2022-12-17 2023-03-31 杭州电子科技大学 Light-weight three-dimensional hand posture estimation method based on RGB image
CN116188695A (en) * 2023-02-28 2023-05-30 华中科技大学 Construction method of three-dimensional hand gesture model and three-dimensional hand gesture estimation method

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
KEVIN LIN et al.: "Mesh Graphormer", arXiv:2104.00272v2, pages 1-6 *
MENGCHENG LI et al.: "Interacting Attention Graph for Single Image Two-Hand Reconstruction", arXiv:2203.09364v2, pages 1-7 *
ZHUOFAN XIA et al.: "Vision Transformer with Deformable Attention", arXiv:2201.00520v3, pages 2-5 *
陈炫琦 et al.: "Dynamic gesture recognition based on attention-guided spatial graph convolutional SRU", Control and Decision *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117333635A (en) * 2023-10-23 2024-01-02 中国传媒大学 Interactive two-hand three-dimensional reconstruction method and system based on single RGB image
CN117333635B (en) * 2023-10-23 2024-04-26 中国传媒大学 Interactive two-hand three-dimensional reconstruction method and system based on single RGB image

Also Published As

Publication number Publication date
CN116740290B (en) 2023-11-07

Similar Documents

Publication Publication Date Title
Saito et al. Pifuhd: Multi-level pixel-aligned implicit function for high-resolution 3d human digitization
CN111047548B (en) Attitude transformation data processing method and device, computer equipment and storage medium
Moon et al. Deephandmesh: A weakly-supervised deep encoder-decoder framework for high-fidelity hand mesh modeling
CN113421328B (en) Three-dimensional human body virtual reconstruction method and device
CN112950775A (en) Three-dimensional face model reconstruction method and system based on self-supervision learning
CN111583399B (en) Image processing method, device, equipment, medium and electronic equipment
Jiang et al. Disentangled human body embedding based on deep hierarchical neural network
CN112330729A (en) Image depth prediction method and device, terminal device and readable storage medium
CN116740290B (en) Three-dimensional interaction double-hand reconstruction method and system based on deformable attention
CN115880724A (en) Light-weight three-dimensional hand posture estimation method based on RGB image
JP2023524252A (en) Generative nonlinear human shape model
CN111462274A Human body image synthesis method and system based on SMPL model
CN115937406A (en) Three-dimensional reconstruction method, device, equipment and storage medium
Lifkooee et al. Real-time avatar pose transfer and motion generation using locally encoded laplacian offsets
JP2024510230A (en) Multi-view neural human prediction using implicitly differentiable renderer for facial expression, body pose shape and clothing performance capture
Correia et al. 3D reconstruction of human bodies from single-view and multi-view images: A systematic review
Khan et al. Towards monocular neural facial depth estimation: Past, present, and future
Gan et al. Fine-grained multi-view hand reconstruction using inverse rendering
CA3177593A1 (en) Transformer-based shape models
Wang et al. Structure and motion recovery based on spatial-and-temporal-weighted factorization
Jian et al. Realistic face animation generation from videos
Huang et al. Detail-preserving controllable deformation from sparse examples
Qinran et al. Video‐Driven 2D Character Animation
WO2015042867A1 (en) Method for editing facial expression based on single camera and motion capture data
Duffy Advances in Applied Human Modeling and Simulation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant