CN114067057A

CN114067057A - Human body reconstruction method, model and device based on attention mechanism

Info

Publication number: CN114067057A
Application number: CN202111382077.6A
Authority: CN
Inventors: 方贤勇; 汪楷; 汪粼波
Original assignee: Anhui University
Current assignee: Anhui University
Priority date: 2021-11-22
Filing date: 2021-11-22
Publication date: 2022-02-18

Abstract

The invention belongs to the field of computer vision, and particularly relates to a human body reconstruction method, a human body reconstruction model and a human body reconstruction device based on an attention mechanism. The reconstruction method comprises the following steps: step one, constructing a human body reconstruction network model comprising a feature extraction module, an attention module, a fusion module, a parameter inference module and an SMPL sub-module; step two, acquiring a plurality of original images containing persons, and preprocessing the original images to form a training data set; step three, training the human body reconstruction network model on the training data set from the previous step by minimizing a network loss function; and step four, preprocessing the human body image to be processed, inputting it into the trained network model, and generating a three-dimensional human body model with the corresponding pose. The method addresses the difficulty that existing methods have in reconstructing a three-dimensional human body model with accurate pose and shape from a single occluded human body image.

Description

Human body reconstruction method, model and device based on attention mechanism

Technical Field

The invention belongs to the field of computer vision, and particularly relates to a human body reconstruction method, a human body reconstruction model and a human body reconstruction device based on an attention mechanism.

Background

Virtual reality is an emerging technology that is widely applied in scenarios such as virtual fitting, body animation and human motion simulation games. In these applications, three-dimensional modeling of the human body from images is an important step. Existing methods for reconstructing a three-dimensional human body model from an image fall into two types: optimization-based methods and regression-based methods. The former fit a parameterized body model to the two-dimensional observations of a given image through an iterative optimization process, relying mainly on the two-dimensional joint point locations and contours for fitting and modeling. The latter construct a deep learning network that extracts features from a single input image in order to obtain information such as human model parameters, a volumetric representation of the three-dimensional human body, or model vertices, and then generate the three-dimensional human body model from this information.

Both types of method reconstruct the model well when the target person in the image is not occluded or is only slightly occluded. In practical applications, however, it is very common for the target person in the image to be occluded by other people or objects, which limits the applicability of these methods. In particular, when a deep learning network is used to reconstruct the three-dimensional model, the deep neural network cannot effectively distinguish key information from redundant information in the human body image, and predicts the parameters of the three-dimensional model from all pixel features in the image. Obvious errors therefore occur: the occluder seriously interferes with the reconstructed three-dimensional human body model, so that the pose and shape of the human body in the constructed model do not match the actual situation.

Disclosure of Invention

In order to solve the problem that existing three-dimensional human body reconstruction methods have difficulty reconstructing a three-dimensional human body model with accurate pose and shape from a single occluded human body image, a human body reconstruction method, a human body reconstruction model and a human body reconstruction device based on an attention mechanism are provided.

The invention is realized by adopting the following technical scheme:

a human body reconstruction method based on an attention mechanism comprises the following steps:

Step one: constructing a human body reconstruction network model, wherein the human body reconstruction network model comprises a feature extraction module, an attention module, a fusion module, a parameter inference module and an SMPL sub-module. The feature extraction module is used for generating a corresponding original feature map from the input human body image. The attention module comprises two pooling layers, a convolution layer and a Sigmoid operation layer; the two pooling layers are an average pooling layer and a maximum pooling layer, respectively. The attention module is used for generating an attention map from the input original feature map. The fusion module is used for fusing the original feature map and the attention map to obtain a body attention feature map. The parameter inference module comprises a pooling layer and three fully connected layers, and is used for generating the SMPL parameters of the target person in the human body image from the input body attention feature map. The SMPL submodule is used for generating a three-dimensional human body model of the target person from the SMPL parameters.

Step two: acquiring a plurality of human body images containing the target person as original images, and preprocessing the original images to form a training data set, wherein the original images in the training data set at least comprise human body images in which the person is partially occluded.

Step three: training the human body reconstruction network model on the training data set from step two by minimizing a network loss function.

Step four: storing the trained human body reconstruction network model; preprocessing the human body image to be processed, inputting it into the stored network model, and generating a three-dimensional human body model with the corresponding pose.

As a further improvement of the invention, the feature extraction module is obtained by simplifying and repackaging the deep convolutional neural network ResNet50, and only the convolutional part of the original network is retained in the simplification; the input human body image is processed by the convolutions of the feature extraction module to obtain the original feature map.

As a further improvement of the invention, the attention module takes the output of the feature extraction module as input; the input original feature map passes through the average pooling layer and the maximum pooling layer in the attention module respectively, and the two pooling results are concatenated and then passed through convolution and the Sigmoid operation in sequence to obtain the attention map.

In the attention module, the pooling operation formula for the average pooling layer is:

F_avg＝AvgPool(F)；

the pooling operation formula of the maximum pooling layer is as follows:

F_max＝MaxPool(F)；

in the above formulas, F represents the original feature map, F_avg represents the feature map after the average pooling operation, F_max represents the feature map after the maximum pooling operation, MaxPool(·) represents the maximum pooling operation, and AvgPool(·) represents the average pooling operation.

The generation operation formula of the attention map is as follows:

M(F)＝σ(f(cat(F_avg,F_max)))；

in the above formula, M(F) represents the attention map; σ(·) denotes the Sigmoid activation function; f(·) denotes a convolution operation; cat(·) represents the concatenation of feature maps.
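The three formulas above can be illustrated with a minimal pure-Python sketch. It is not the patented implementation: it assumes that both pooling operations reduce the feature map across its channels (as in common spatial-attention designs), that the convolution is zero-padded so the attention map keeps the spatial size, and the 3 × 3 kernels are hypothetical placeholders for learned weights.

```python
import math

def pool_over_channels(F, reduce):
    """Reduce a C x H x W feature map (nested lists) to one H x W map."""
    H, W = len(F[0]), len(F[0][0])
    return [[reduce([ch[y][x] for ch in F]) for x in range(W)] for y in range(H)]

def mean(values):
    return sum(values) / len(values)

def conv2d_same(maps, kernels, bias=0.0):
    """Zero-padded cross-correlation of several H x W maps into one H x W map."""
    H, W = len(maps[0]), len(maps[0][0])
    k = len(kernels[0])
    pad = k // 2
    out = [[bias] * W for _ in range(H)]
    for m, ker in zip(maps, kernels):
        for y in range(H):
            for x in range(W):
                for dy in range(k):
                    for dx in range(k):
                        yy, xx = y + dy - pad, x + dx - pad
                        if 0 <= yy < H and 0 <= xx < W:
                            out[y][x] += ker[dy][dx] * m[yy][xx]
    return out

def attention_map(F, kernels, bias=0.0):
    """M(F) = sigmoid(f(cat(F_avg, F_max)))."""
    F_avg = pool_over_channels(F, mean)   # F_avg = AvgPool(F), assumed over channels
    F_max = pool_over_channels(F, max)    # F_max = MaxPool(F), assumed over channels
    conv = conv2d_same([F_avg, F_max], kernels, bias)  # f(cat(F_avg, F_max))
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in conv]

# Two-channel 2 x 2 feature map with hypothetical kernels (all zeros here).
F = [[[1.0, 2.0], [3.0, 4.0]],
     [[4.0, 3.0], [2.0, 1.0]]]
zero_kernels = [[[0.0] * 3 for _ in range(3)] for _ in range(2)]
M = attention_map(F, zero_kernels)

# A kernel pair that passes F_avg through unchanged and ignores F_max.
identity = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
M_avg_only = attention_map(F, [identity, zero_kernels[1]])
```

With all-zero kernels every attention value is Sigmoid(0), that is 0.5; with learned kernels the values vary per location and lie strictly between 0 and 1.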

As a further improvement of the invention, in the fusion module, the fused body attention feature map is obtained by multiplying the attention map and the original feature map element by element. The formula of the fusion operation is as follows:

F′＝M(F)⊗F；

in the above formula, F′ represents the body attention feature map; M(F) represents the attention map; ⊗ represents element-wise multiplication; F denotes the original feature map.
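A small sketch of the fusion step, assuming the attention map is a single H × W map that is applied to every channel of the original feature map (the text does not state the attention map's shape, so this broadcasting is an assumption):

```python
def fuse(M, F):
    """Body attention feature map: multiply each channel of F by M element-wise."""
    return [[[M[y][x] * ch[y][x] for x in range(len(ch[0]))]
             for y in range(len(ch))]
            for ch in F]

# Illustrative values only: a 2-channel 2 x 2 feature map and an attention map.
F = [[[1.0, 2.0], [3.0, 4.0]],
     [[4.0, 3.0], [2.0, 1.0]]]
M = [[1.0, 0.5], [0.0, 0.25]]
F_prime = fuse(M, F)
```

Locations where the attention value is near 0 are suppressed in every channel, which is how features from an occluder can be down-weighted before parameter inference.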

As a further improvement of the present invention, the pooling layer in the parameter inference module is an average pooling layer. The first two of the three fully connected layers each have 1024 neurons and are connected through a Dropout operation. The third fully connected layer has 85 neurons and is directly connected to the second fully connected layer. The three fully connected layers form the iterative regression part of the parameter inference module.

As a further improvement of the present invention, in the parameter inference module, the SMPL parameter is generated as follows:

(1) Average pooling is applied to the input body attention feature map F′ to obtain a feature φ.

(2) The SMPL pose parameter θ, the shape parameter β and the camera parameter c are concatenated, which is formulated as:

Θ＝cat(θ,β,c)；

in the above formula, θ represents the pose parameter of the SMPL model; β represents the shape parameter of the SMPL model; c represents the camera parameter; Θ represents the parameter set obtained by concatenating the pose parameter θ, the shape parameter β and the camera parameter c.

(3) An initial parameter set Θ_0 is formed from the average pose parameter, the average shape parameter and the average camera parameter; the feature φ is concatenated with the parameter set Θ_0 and used as the input of the iterative regression part in the parameter inference module.

(4) The iterative regression part generates the residual of the parameter set corresponding to the current input, and the current parameter set is then updated by the formula:

Θ_t+1＝Θ_t+ΔΘ_t；

in the above formula, Θ_t represents the parameter set corresponding to the current input, Θ_t+1 represents the updated state of the parameter set Θ_t, and ΔΘ_t represents the residual of the parameter set Θ_t.

(5) The update operation of the previous step is iterated 3 times; in each iteration, the parameter set obtained from the last update is concatenated with the feature φ and used as the current input of the iterative regression part of the parameter inference module, and the parameter set is updated.

(6) After the iterations are finished, the final SMPL parameters are obtained, comprising the pose parameter θ, the shape parameter β and the corresponding camera parameter c.
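Steps (3) to (6) can be sketched as a short loop. The `regressor` argument is a hypothetical stand-in for the three fully connected layers, and `toy_regressor` exists only to make the sketch runnable. The 72 + 10 + 3 split of the 85 values is an assumption based on the usual SMPL parameterization (24 joints with 3 rotation values each, 10 shape coefficients) plus 3 camera values; the text itself only states the total of 85.

```python
def iterative_regression(phi, theta_init, regressor, n_iter=3):
    """Apply the update  Θ_{t+1} = Θ_t + ΔΘ_t  for n_iter iterations."""
    theta = list(theta_init)                   # Θ_0: mean pose, shape and camera
    for _ in range(n_iter):
        delta = regressor(list(phi) + theta)   # input is cat(φ, Θ_t), output is ΔΘ_t
        theta = [t + d for t, d in zip(theta, delta)]
    return theta

def split_params(theta_set, n_pose=72, n_shape=10, n_cam=3):
    """Split Θ = cat(θ, β, c); the sizes are assumed, not taken from the text."""
    assert len(theta_set) == n_pose + n_shape + n_cam
    return (theta_set[:n_pose],
            theta_set[n_pose:n_pose + n_shape],
            theta_set[n_pose + n_shape:])

def toy_regressor(x):
    """Hypothetical regressor: move Θ halfway toward φ (illustration only)."""
    half = len(x) // 2
    phi, theta = x[:half], x[half:]
    return [0.5 * (p - t) for p, t in zip(phi, theta)]

result = iterative_regression([8.0, 16.0], [0.0, 0.0], toy_regressor)
pose, shape, cam = split_params([0.0] * 85)
```

In the toy run the parameter set moves 0 → 4 → 6 → 7 in its first component, showing how each pass refines the previous estimate rather than predicting the parameters in one shot.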

As a further improvement of the invention, in the SMPL submodule, the SMPL parameters are input into the SMPL function to obtain the three-dimensional human body model; the expression of the SMPL function is:

M(β,θ)＝W(T(β,θ),J(β),θ,𝒲)；

T(β,θ)＝T̄+B_S(β)+B_P(θ)；

in the above formulas, T̄ denotes the vertex coordinates of the three-dimensional human body model in the T pose; B_P(θ) and B_S(β) represent the offsets from the vertices of the SMPL standard template caused by the pose parameter θ and the shape parameter β, respectively; J(β) is the position of the model joint points corresponding to the shape parameter β; W(·) is the linear blend skinning function; 𝒲 is the blend weight.
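The structure of the SMPL function can be illustrated with a toy sketch. It is heavily simplified and is not the SMPL library: the shape offsets are taken as a linear combination of hypothetical direction vectors, the pose offsets are passed in directly, and the per-joint rigid transforms (which in SMPL are derived from θ and J(β) along the kinematic chain) are assumed to be precomputed.

```python
def shaped_template(T_bar, shape_dirs, beta, pose_offsets):
    """Template plus offsets: T_bar + B_S(beta) + B_P(theta), B_S linear in beta."""
    out = []
    for v, vertex in enumerate(T_bar):
        p = list(vertex)
        for n, b in enumerate(beta):
            for a in range(3):
                p[a] += b * shape_dirs[n][v][a]
        for a in range(3):
            p[a] += pose_offsets[v][a]
        out.append(p)
    return out

def blend_skinning(vertices, transforms, weights):
    """Linear blend skinning: v' = sum_k weights[v][k] * (R_k v + t_k)."""
    out = []
    for v, p in enumerate(vertices):
        q = [0.0, 0.0, 0.0]
        for k, (R, t) in enumerate(transforms):
            w = weights[v][k]
            for a in range(3):
                q[a] += w * (sum(R[a][b] * p[b] for b in range(3)) + t[a])
        out.append(q)
    return out

# Toy data: 2 vertices, 1 shape direction, 2 joints (all values hypothetical).
T_bar = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
shape_dirs = [[[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]]
pose_offsets = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
identity = ([[1, 0, 0], [0, 1, 0], [0, 0, 1]], [0.0, 0.0, 0.0])
rot_z_90 = ([[0, -1, 0], [1, 0, 0], [0, 0, 1]], [0.0, 0.0, 0.0])
weights = [[1.0, 0.0], [0.5, 0.5]]

T_P = shaped_template(T_bar, shape_dirs, [1.0], pose_offsets)
posed = blend_skinning(T_P, [identity, rot_z_90], weights)
```

The second vertex is influenced equally by both joints, so its posed position is the average of the unrotated and the rotated position, which is the defining behaviour of linear blend skinning.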

As a further improvement of the present invention, the pre-processing process of the original image comprises:

(1) The target person in the human body image is located, and the image is cropped so that the target person lies in the central area of the human body image.

(2) The cropped human body image is resized so that all images have a uniform size of 224 × 224 pixels.

(3) The resized image is normalized to obtain a data element of the training data set.
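The three preprocessing steps can be sketched on a single-channel image stored as nested lists. The bounding box is assumed to come from an external person detector, and the nearest-neighbour resize and division by 255 are assumptions chosen for brevity; the text does not specify the interpolation or the normalization scheme.

```python
def crop_centered(image, box):
    """Crop a square window centred on the person's box (x0, y0, x1, y1)."""
    H, W = len(image), len(image[0])
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) // 2, (y0 + y1) // 2
    half = max(1, max(x1 - x0, y1 - y0) // 2)
    # Pixels that fall outside the image are filled with 0.
    return [[image[y][x] if 0 <= y < H and 0 <= x < W else 0
             for x in range(cx - half, cx + half)]
            for y in range(cy - half, cy + half)]

def resize_nearest(image, size=224):
    """Nearest-neighbour resize to size x size (assumed interpolation)."""
    h, w = len(image), len(image[0])
    return [[image[y * h // size][x * w // size] for x in range(size)]
            for y in range(size)]

def normalize(image, max_value=255.0):
    """Scale pixel values to [0, 1] (assumed normalization)."""
    return [[v / max_value for v in row] for row in image]

def preprocess(image, box, size=224):
    return normalize(resize_nearest(crop_centered(image, box), size))

# Synthetic 10 x 10 image and a hypothetical detector box.
image = [[(x + y) * 10 for x in range(10)] for y in range(10)]
out = preprocess(image, (2, 2, 8, 8))
```

Cropping before resizing keeps the person centred and reduces the share of the image taken up by background and occluders.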

As a further improvement of the invention, in the training process of the network model, all parameters in the network model are adjusted with the Adam algorithm so as to minimize the loss function, thereby training the network.

The loss function to be minimized is expressed as follows:

L＝λ_2D·L_2Djoint+λ_3D·L_3Djoint+λ_para·L_SMPL；

in the above formula, L_2Djoint represents the 2D joint loss function; L_3Djoint represents the 3D joint loss function; L_SMPL represents the SMPL parameter loss function; λ_2D represents the weight coefficient of the 2D joint loss function; λ_3D represents the weight coefficient of the 3D joint loss function; λ_para represents the weight coefficient of the SMPL parameter loss function.
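As a worked instance of the weighted sum, the sketch below combines three already-computed loss values. The weight values are hypothetical; the text does not state the coefficients used.

```python
def total_loss(l_2d, l_3d, l_smpl, w_2d=1.0, w_3d=1.0, w_para=1.0):
    """L = w_2d * L_2Djoint + w_3d * L_3Djoint + w_para * L_SMPL."""
    return w_2d * l_2d + w_3d * l_3d + w_para * l_smpl

# Illustrative numbers only: 1*2.0 + 4*0.5 + 2*0.25
loss = total_loss(2.0, 0.5, 0.25, w_2d=1.0, w_3d=4.0, w_para=2.0)
```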

Wherein the 2D joint loss function L_2Djoint is defined over the following quantities (the expression itself is not reproduced in this text): v_i represents the visibility of the i-th 2D joint point and takes the value 0 (invisible) or 1 (visible); N represents the number of 2D joint points; k̂_i represents the predicted value of the i-th 2D joint point; k_i represents the true value of the i-th 2D joint point. The predicted 2D joint points k̂_i are obtained by projecting the predicted 3D joint points.
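A visibility-masked 2D joint loss could look like the following sketch. The choice of the L1 distance and of a plain sum over joints is an assumption borrowed from common practice in regression-based reconstruction; the exact norm used by the invention is not recoverable from the text.

```python
def joint_loss_2d(pred, true, visible):
    """Sum over joints of v_i * |k_hat_i - k_i|_1 (L1 norm is an assumption)."""
    return sum(v * (abs(px - tx) + abs(py - ty))
               for (px, py), (tx, ty), v in zip(pred, true, visible))

# Two joints; the second one is marked invisible and does not contribute.
loss_2d = joint_loss_2d([(1.0, 1.0), (5.0, 5.0)],
                        [(0.0, 0.0), (0.0, 0.0)],
                        [1, 0])
```

Masking by v_i means that joints hidden by an occluder, for which no reliable annotation exists, do not produce a training signal.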

The 3D joint loss function L_3Djoint is defined over the following quantities (the expression itself is not reproduced in this text): M represents the number of images participating in the calculation of the 3D joint points; Ĵ_i represents the predicted 3D joint points of the i-th image; J_i represents the true 3D joint points of the i-th image.

The SMPL parameter loss function L_SMPL is defined over the following quantities (the expression itself is not reproduced in this text): O represents the number of images participating in the calculation of the SMPL parameters; θ̂_i and β̂_i respectively represent the predicted values of the pose parameter and the shape parameter of the i-th image; θ_i and β_i respectively represent the true values of the pose parameter and the shape parameter of the i-th image.
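The 3D joint loss and the SMPL parameter loss both compare a prediction with a ground truth per image, so one helper can serve both in a sketch. The squared Euclidean distance and the averaging over images are assumptions; the text only names the quantities involved.

```python
def mean_squared_distance(preds, trues):
    """Mean over images of the squared Euclidean distance between flat vectors."""
    total = 0.0
    for p, t in zip(preds, trues):
        total += sum((a - b) ** 2 for a, b in zip(p, t))
    return total / len(preds)

def joint_loss_3d(pred_joints, true_joints):
    """Compare predicted and true 3D joints; one flattened vector per image."""
    return mean_squared_distance(pred_joints, true_joints)

def smpl_param_loss(pred_pose, pred_shape, true_pose, true_shape):
    """Compare concatenated (pose, shape) parameters per image."""
    preds = [p + s for p, s in zip(pred_pose, pred_shape)]
    trues = [p + s for p, s in zip(true_pose, true_shape)]
    return mean_squared_distance(preds, trues)

# One image with two 3D joints (flattened), then one image with toy parameters.
loss_3d = joint_loss_3d([[1.0, 0.0, 0.0, 0.0, 2.0, 0.0]], [[0.0] * 6])
loss_smpl = smpl_param_loss([[1.0]], [[2.0]], [[0.0]], [[0.0]])
```

Since M and O count only the images that take part in each term, a training loop would presumably restrict each loss to the samples that carry the corresponding annotation.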

The invention also comprises a human body reconstruction model, which adopts the attention-mechanism-based human body reconstruction method described above to process an input occluded human body image so as to generate the three-dimensional human body model of the target person in the human body image. The human body reconstruction model comprises: a preprocessing module, a feature extraction module, an attention module, a fusion module, a parameter inference module and an SMPL submodule.

Wherein the preprocessing module is used for: (1) locating the target person in the human body image, and cropping the image so that the target person lies in the central area of the human body image; (2) resizing the cropped human body image to a uniform size of 224 × 224 pixels; (3) normalizing the resized image.

The feature extraction module adopts the convolution part in the deep convolution neural network Resnet50 as a backbone network. The output of the preprocessing module is used as the input of the feature extraction module; the feature extraction module is used for extracting features in the preprocessed human body image through convolution operation, and then generating a corresponding original feature map.

The attention module comprises a maximum pooling sub-module, an average pooling sub-module, a feature concatenation sub-module, a convolution sub-module and a Sigmoid operation sub-module. The output of the feature extraction module is used as the input of the attention module; in the attention module, the original feature map is processed by the maximum pooling sub-module and the average pooling sub-module respectively to obtain two feature maps, and the two feature maps are concatenated in the feature concatenation sub-module; the attention map is then obtained after convolution in the convolution sub-module and the Sigmoid operation in the Sigmoid operation sub-module.

And the fusion module uses the original feature map output by the feature extraction module and the attention map output by the attention module, and then multiplies the original feature map and the attention map by corresponding elements to obtain a fused body attention feature map.

The parameter inference module comprises an average pooling layer, a full-connection layer I, a full-connection layer II and a full-connection layer III. The first full connection layer and the second full connection layer are provided with 1024 neurons and are connected through Dropout operation; the full connection layer III is provided with 85 neurons, and the full connection layer II is directly connected with the full connection layer III. The full connection layer I, the full connection layer II and the full connection layer III form an iterative regression part of the network model. The output of the fusion module is used as the input of the parameter inference module; and the parameter inference module generates the SMPL parameters after iterative updating according to different input data.

The SMPL submodule is used for generating the three-dimensional human body model of the target person in the human body image from the SMPL parameters output by the parameter inference module.

The technical scheme provided by the invention has the following beneficial effects:

In the three-dimensional human body reconstruction method based on the attention mechanism, the introduced attention mechanism processes the features in the human body image so that the network focuses on features containing important information and pays less attention to unimportant information. The original feature map is weighted by the attention map generated by the attention mechanism, so that the network concentrates on information related to the human body parts in the image and attends less to other information, which reduces the interference of the occluder on the network. Meanwhile, the network model can infer the state of the occluded body parts from the features of the visible parts of the body, thereby ensuring the completeness of the extracted feature information. As a result, the pose and shape of the human body reflected in the finally constructed three-dimensional human body model match reality more closely.

Drawings

Fig. 1 is a flowchart illustrating steps of a human body reconstruction method based on an attention mechanism in embodiment 1 of the present invention.

Fig. 2 is a block diagram of an attention module in embodiment 1 of the present invention.

Fig. 3 is a flowchart of steps of a process of generating SMPL parameters of a target person in the parameter inference module according to embodiment 1 of the present invention.

Fig. 4 is a flowchart of steps of a three-dimensional human body model reconstruction process in embodiment 1 of the present invention.

Fig. 5 is a schematic block diagram of a human body reconstruction model provided in embodiment 2 of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

Example 1

The present embodiment provides a human body reconstruction method based on an attention mechanism, as shown in fig. 1, the human body reconstruction method includes the following steps:

S1: Construct a human body reconstruction network model, wherein the human body reconstruction network model comprises a feature extraction module, an attention module, a fusion module, a parameter inference module and an SMPL sub-module. This specifically comprises the following steps:

S11: Simplify the deep convolutional neural network ResNet50, retaining only the convolutional part of the original network, and then repackage the simplified network to obtain the required feature extraction module; the feature extraction module is used for generating a corresponding original feature map from the input human body image.

S12: Construct an attention module comprising two pooling layers, a convolutional layer and a Sigmoid operation layer. The two pooling layers in the attention module are an average pooling layer and a maximum pooling layer, respectively. The attention module is used for generating an attention map from the input original feature map. As shown in fig. 2, the processing procedure of the attention module is as follows:

The input original feature map first passes through the average pooling layer and the maximum pooling layer in the attention module respectively; the two pooling results are concatenated and then passed through convolution and the Sigmoid operation in sequence to obtain the attention map.

Wherein, in the attention module, the pooling operation formula of the average pooling layer is:

F_avg＝AvgPool(F)；

the pooling operation formula of the maximum pooling layer is as follows:

F_max＝MaxPool(F)；

The generation operation formula of the attention map is as follows:

M(F)＝σ(f(cat(F_avg,F_max)))；

S13: Construct a fusion module for fusing the original feature map and the attention map to obtain a body attention feature map. The fusion multiplies the attention map and the original feature map element by element.

The formula of the fusion operation is as follows:

F′＝M(F)⊗F；

S14: Construct a parameter inference module comprising a pooling layer and three fully connected layers. The parameter inference module is used for generating the SMPL parameters of the target person in the human body image from the input body attention feature map and the original feature map.

The pooling layer in the parameter inference module is an average pooling layer. The first two of the three fully connected layers each have 1024 neurons and are connected through a Dropout operation. The third fully connected layer has 85 neurons and is directly connected to the second fully connected layer. The three fully connected layers form the iterative regression part of the parameter inference module.

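Specifically, in this embodiment, as shown in fig. 3, the SMPL parameters of the target person are generated by iterative regression: the pooled feature φ is concatenated with the current parameter set and fed to the three fully connected layers, which output a residual used to update the parameter set, and this update is iterated 3 times starting from the average parameters. The parameter set is the concatenation of the pose parameter θ, the shape parameter β and the camera parameter c:

Θ＝cat(θ,β,c)；

and the update formula of each iteration is:

Θ_t+1＝Θ_t+ΔΘ_t；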

S15: The parameter inference module is followed by an SMPL submodule. SMPL (Skinned Multi-Person Linear Model) in this embodiment is a vertex-based three-dimensional model of the unclothed human body, which can accurately represent different shapes (shape) and postures (pose) of the human body.

In the SMPL submodule, the SMPL parameters are input into the SMPL function to obtain the three-dimensional human body model; the expression of the SMPL function is:

M(β,θ)＝W(T(β,θ),J(β),θ,𝒲)；

T(β,θ)＝T̄+B_S(β)+B_P(θ)；

in the above formulas, T̄ denotes the template vertex coordinates in the T pose; B_S(β) and B_P(θ) are the vertex offsets caused by the shape parameter β and the pose parameter θ; J(β) gives the joint point positions; W(·) is the linear blend skinning function; 𝒲 is the blend weight.

S2: Acquire a plurality of human body images containing the target person as original images, and preprocess the original images to form a training data set, wherein the original images in the training data set at least comprise human body images in which the person is partially occluded.

In this embodiment, the preprocessing of an original image comprises: locating the target person and cropping the image so that the target person lies in the central area; resizing the cropped image to 224 × 224 pixels; and normalizing the resized image.

S3: Train the human body reconstruction network model on the training data set from step S2 by minimizing the network loss function.

In the training process of the network model, all parameters in the network model are adjusted by adopting an Adam algorithm under the condition of minimizing a loss function, and the network is trained.

The expression of the loss function is as follows:

L＝λ_2D·L_2Djoint+λ_3D·L_3Djoint+λ_para·L_SMPL；

Wherein the 2D joint loss function L_2Djoint compares k̂_i, the predicted value of the i-th 2D joint point, with k_i, the true value of the i-th 2D joint point (the expression itself is not reproduced in this text); the predicted 2D joint points are obtained by orthographic projection of the predicted 3D joint points.

Specifically, in this embodiment, the projection formula is:

k̂＝Π(Ĵ)；

in the above formula, Ĵ represents the predicted 3D joint point, k̂ represents the predicted 2D joint point corresponding to Ĵ, and Π(·) represents the projection function based on the camera parameters c.
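One common way to realize an orthographic projection driven by three camera values is a scaled orthographic (weak-perspective) camera. The sketch below assumes the camera parameters are a scale s and a 2D translation (t_x, t_y); this reading is an assumption, since the text does not spell out the components of c.

```python
def project(joints_3d, cam):
    """Scaled orthographic projection: drop depth, then scale and translate."""
    s, tx, ty = cam  # assumed layout of the camera parameters
    return [(s * x + tx, s * y + ty) for x, y, _z in joints_3d]

# One predicted 3D joint and a hypothetical camera (s=2, t=(0.5, -1)).
k_hat = project([(1.0, 2.0, 5.0)], (2.0, 0.5, -1.0))
```

Because depth is discarded, the projected joints can be compared directly with 2D annotations even for training images that carry no 3D ground truth.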

The 3D joint loss function L_3Djoint and the SMPL parameter loss function L_SMPL are as defined in the disclosure of the invention above.

S4: Store the trained human body reconstruction network model; preprocess the human body image to be processed, input it into the stored network model, and generate a three-dimensional human body model with the corresponding pose.

The processing procedure of the human body reconstruction network model is shown in fig. 4. The preprocessed human body image is first convolved by the feature extraction module, which extracts the features related to the target person in the human body image and generates the original feature map. The original feature map is then split into two paths and transmitted to the attention module and the parameter inference module respectively. In the attention module, the original feature map first undergoes average pooling and maximum pooling to obtain an average-pooled feature map and a max-pooled feature map; the two pooled feature maps are concatenated and then passed through convolution and the Sigmoid operation in sequence to obtain the attention map. The fusion module then receives the attention map output by the attention module together with the original feature map output by the feature extraction module, and fuses them to obtain the body attention feature map. The body attention feature map is input into the parameter inference module, which generates the corresponding SMPL parameters from it and iteratively updates them. The SMPL parameters include the pose parameters and the shape parameters. Finally, the iteratively updated SMPL parameters are input into the SMPL submodule to generate the three-dimensional human body model of the target person.
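The data flow described above can be summarized as a chain of stages. In this sketch every stage is a hypothetical callable supplied by the caller; the stub stages only record the order in which they run and stand in for the real modules.

```python
def reconstruct(image, preprocess, extract, attend, fuse, infer, smpl):
    """Run the stages in the order described for the network model."""
    x = preprocess(image)      # crop, resize to 224 x 224, normalize
    F = extract(x)             # original feature map
    M = attend(F)              # attention map
    F_prime = fuse(M, F)       # body attention feature map
    params = infer(F_prime)    # iteratively updated SMPL parameters
    return smpl(params)        # three-dimensional human body model

calls = []

def stage(name):
    """Build a stub stage that logs its name and returns it as a dummy output."""
    def run(*args):
        calls.append(name)
        return name
    return run

mesh = reconstruct("image", stage("preprocess"), stage("extract"),
                   stage("attend"), stage("fuse"), stage("infer"),
                   stage("smpl"))
```

Passing the stages as arguments mirrors the later remark that the preprocessing and SMPL stages may live outside the model and be swapped for external components.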

Through the introduced attention mechanism, the method provided by the invention processes the features in the human body image so that the network focuses on features containing important information and pays less attention to unimportant information. The original feature map is weighted by the attention map generated by the attention mechanism, so that the network concentrates on information related to the human body parts in the image and attends less to other information, which reduces the interference of the occluder on the network. At the same time, the network model can infer the state of the occluded body parts from the features of the visible parts of the body. As a result, the pose and shape of the human body reflected in the finally constructed three-dimensional human body model match reality more closely.

Example 2

The present embodiment provides a human body reconstruction model, which uses the human body reconstruction method based on the attention mechanism of embodiment 1 to process an input occluded human body image so as to generate a three-dimensional human body model of the target person in the human body image. As shown in fig. 5, the human body reconstruction model comprises: a preprocessing module, a feature extraction module, an attention module, a fusion module, a parameter inference module and an SMPL submodule.

The attention module comprises a maximum pooling sub-module, an average pooling sub-module, a feature concatenation sub-module, a convolution sub-module and a Sigmoid operation sub-module. The output of the feature extraction module is used as the input of the attention module; in the attention module, the original feature map is processed by the maximum pooling sub-module and the average pooling sub-module respectively to obtain two feature maps, and the two feature maps are concatenated in the feature concatenation sub-module; the attention map is then obtained after convolution in the convolution sub-module and the Sigmoid operation in the Sigmoid operation sub-module. The detailed generation process of the attention map has already been described in embodiment 1 and is not repeated here.

The fusion module receives the original feature map output by the feature extraction module and the attention map output by the attention module, and multiplies them element by element to obtain the fused body attention feature map. The formula of the fusion operation is as follows:

F′＝M(F)⊗F；

The parameter inference module comprises an average pooling layer, a full-connection layer I, a full-connection layer II and a full-connection layer III. The fully connected layer I and the fully connected layer II are provided with 1024 neurons and are connected through Dropout operation; the full connection layer III is provided with 85 neurons, and the full connection layer II is directly connected with the full connection layer III. The full connection layer I, the full connection layer II and the full connection layer III form an iterative regression part of the network model. The output of the fusion module is used as the input of the parameter inference module; and the parameter inference module generates the SMPL parameters after iterative updating according to different input data.

In other embodiments, the preprocessing module and the SMPL submodule may or may not be part of the human body reconstruction model. When the preprocessing module is not part of the human body reconstruction model, each human body image can be processed manually before it is input into the model, so that the input image better meets the requirements. Manual processing can also place the target person at the center of the image and reduce the proportion of the image taken up by the occluder, which allows a more accurate three-dimensional human body model to be obtained.

When the SMPL submodule is not part of the human body reconstruction model, an existing SMPL model can be called through a suitable calling program: the SMPL parameters generated by the human body reconstruction model are input into the SMPL model, and the three-dimensional human body model generated by the SMPL model is retrieved. This simplifies the structure and scale of the human body reconstruction model, saves computing power and lowers the hardware requirements. In this architecture, distributed operation can also be adopted to increase the rate at which three-dimensional human body models are generated.

To verify the performance of the human body reconstruction model provided by this embodiment, a simulation experiment was also carried out. The experimental environment uses an Intel(R) Xeon(R) CPU E5-2609V [email protected], 16 GB of memory, the Ubuntu 18.04 system and a GTX 1080 Ti graphics card; the programming environment is PyCharm and the deep learning framework is PyTorch 1.1.0. The data sets are the 2D data sets Leeds Sports Pose (LSP) and MPII, and the 3D data sets 3DPW and Human3.6M.

The simulation shows that the human body reconstruction model provided by this embodiment retains good three-dimensional reconstruction performance on a variety of occluded human body images, and that the pose and shape of the reconstructed three-dimensional human body model are realistic. The model therefore has good practical value and is suitable for scenarios that rely on three-dimensional human body modeling, such as virtual fitting, body animation and human motion simulation games.

Example 3

The present embodiment provides an attention-based human body reconstruction apparatus, which is a computer device including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the attention-based human body reconstruction method according to embodiment 1.

The computer device may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, a tower server or a cabinet server (including an independent server or a server cluster composed of a plurality of servers), or any other device capable of executing programs. The computer device of this embodiment includes at least, but is not limited to, a memory and a processor communicatively coupled to each other via a system bus.

In this embodiment, the memory (i.e., the readable storage medium) includes a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory may be an internal storage unit of the computer device, such as a hard disk or a memory of the computer device. In other embodiments, the memory may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), etc. provided on the computer device. Of course, the memory may also include both internal and external storage devices for the computer device. In this embodiment, the memory is generally used for storing an operating system, various types of application software, and the like installed in the computer device. In addition, the memory may also be used to temporarily store various types of data that have been output or are to be output.

In some embodiments, the processor may be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor or another data processing chip. The processor is typically used to control the overall operation of the computer device. In this embodiment, the processor is configured to run the program code stored in the memory or to process data, thereby implementing the attention-based human body reconstruction method of the foregoing embodiment and constructing a three-dimensional human body model of the target person from a single given human body image of that person.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims

1. A human body reconstruction method based on an attention mechanism is characterized by comprising the following steps:

step one: constructing a human body reconstruction network model, wherein the human body reconstruction network model comprises a feature extraction module, an attention module, a fusion module, a parameter inference module and an SMPL sub-module; the feature extraction module is used for generating a corresponding original feature map according to the input human body image; the attention module comprises two pooling layers, a convolutional layer and a Sigmoid operation layer; the two pooling layers are an average pooling layer and a maximum pooling layer respectively; the attention module is used for generating an attention map according to the input original feature map; the fusion module is used for fusing the original feature map and the attention map to obtain a body attention feature map; the parameter inference module comprises a pooling layer and three fully connected layers; the parameter inference module is used for generating the SMPL parameters of the corresponding target person in the human body image according to the input body attention feature map; the SMPL sub-module is used for generating a three-dimensional human body model corresponding to the target person according to the SMPL parameters;

step two: acquiring a plurality of human body images containing target persons as original images, and preprocessing the original images to form a training data set, wherein the original images in the training data set at least comprise human body images in which the person is partially occluded;

step three: training the human body reconstruction network model with the training data set of the previous step by minimizing the network loss function;

2. The attention mechanism-based human body reconstruction method of claim 1, wherein: the feature extraction module is obtained by simplifying and repackaging a deep convolutional neural network Resnet50, and the simplification process only reserves the convolutional part in the original network model; and after the input human body image is subjected to convolution processing of the feature extraction module, the original feature map is obtained.

3. The attention mechanism-based human body reconstruction method of claim 1, wherein: the attention module takes the output of the feature extraction module as input, the input original feature map respectively passes through an average pooling layer and a maximum pooling layer in the attention module, and the two pooling results are subjected to feature splicing and then sequentially subjected to convolution processing and Sigmoid operation to obtain the attention map;

in the attention module, the pooling operation formula of the average pooling layer is:

F_avg＝AvgPool(F)；

the pooling operation formula of the maximum pooling layer is as follows:

F_max＝MaxPool(F)；

in the above formulas, F represents the original feature map; F_avg represents the feature map after the average pooling operation; F_max represents the feature map after the maximum pooling operation; MaxPool(·) represents the maximum pooling operation; and AvgPool(·) represents the average pooling operation;

the generation operation formula of the attention map is as follows:

M(F)＝σ(f(cat(F_avg,F_max)))；

in the above formula, M(F) represents the attention map; σ(·) represents the Sigmoid activation function; f(·) represents a convolution operation; and cat(·) represents the concatenation of feature maps.
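The attention map computation of claim 3 can be illustrated with a small dependency-free sketch. It assumes that both pooling operations act along the channel axis (as in common spatial-attention designs), that the convolution has a single output channel with an odd kernel size and zero padding, and that the kernel weights are given; none of these details is fixed by the text, and all function names are hypothetical.

```python
import math


def channel_avg_pool(F):
    """Average over channels: [C][H][W] -> [H][W] (assumed pooling axis)."""
    C, H, W = len(F), len(F[0]), len(F[0][0])
    return [[sum(F[c][y][x] for c in range(C)) / C for x in range(W)]
            for y in range(H)]


def channel_max_pool(F):
    """Maximum over channels: [C][H][W] -> [H][W] (assumed pooling axis)."""
    C, H, W = len(F), len(F[0]), len(F[0][0])
    return [[max(F[c][y][x] for c in range(C)) for x in range(W)]
            for y in range(H)]


def conv2d_same(maps, kernel, bias=0.0):
    """Single-output-channel convolution (cross-correlation, as in deep
    learning frameworks) with zero padding.
    maps: [Cin][H][W], kernel: [Cin][k][k] with odd k."""
    H, W = len(maps[0]), len(maps[0][0])
    k = len(kernel[0])
    pad = k // 2
    out = [[0.0] * W for _ in range(H)]
    for y in range(H):
        for x in range(W):
            acc = bias
            for c, plane in enumerate(maps):
                for dy in range(k):
                    for dx in range(k):
                        yy, xx = y + dy - pad, x + dx - pad
                        if 0 <= yy < H and 0 <= xx < W:
                            acc += kernel[c][dy][dx] * plane[yy][xx]
            out[y][x] = acc
    return out


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def attention_map(F, kernel, bias=0.0):
    """M(F) = sigmoid(f(cat(F_avg, F_max)))."""
    stacked = [channel_avg_pool(F), channel_max_pool(F)]  # cat(F_avg, F_max)
    conv = conv2d_same(stacked, kernel, bias)             # f(.)
    return [[sigmoid(v) for v in row] for row in conv]    # sigma(.)


# Toy input: 2 channels, 2 x 2 spatial size.
F = [[[1.0, 2.0], [3.0, 4.0]],
     [[3.0, 0.0], [1.0, 4.0]]]
# Toy 3 x 3 kernel that simply adds the two pooled maps at each pixel.
center_only = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
M = attention_map(F, [center_only, center_only])
```

Every entry of `M` lies strictly between 0 and 1, so the map can act as a per-pixel weight on the original feature map.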

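4. The attention mechanism-based human body reconstruction method of claim 1, wherein: in the fusion module, the fused body attention feature map is obtained by element-wise multiplication of the attention map and the original feature map; the formula of the fusion operation is as follows:

F'＝M(F)⊗F；

in the above formula, F' represents the body attention feature map, M(F) represents the attention map, F represents the original feature map, and ⊗ represents element-wise multiplication.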
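The element-wise fusion of claim 4 can be sketched as follows. The sketch assumes a single-channel attention map of size H × W that is broadcast across all channels of the original feature map; the text does not state the shape of the attention map, and the function name `fuse` is hypothetical.

```python
def fuse(F, M):
    """Multiply every channel of F ([C][H][W]) element-wise by M ([H][W])."""
    H, W = len(M), len(M[0])
    return [[[channel[y][x] * M[y][x] for x in range(W)] for y in range(H)]
            for channel in F]


# Toy example: 2 channels, 2 x 2 spatial size.
features = [[[1.0, 2.0], [3.0, 4.0]],
            [[5.0, 6.0], [7.0, 8.0]]]
attention = [[0.5, 1.0],
             [0.0, 0.25]]
fused = fuse(features, attention)
```

Positions with attention weights near 0 are suppressed in every channel, while positions with weights near 1 pass through almost unchanged.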

5. The attention mechanism-based human body reconstruction method of claim 1, wherein: the pooling layer in the parameter inference module is an average pooling layer; the first two of the three fully connected layers each have 1024 neurons and are connected through a Dropout operation; the third fully connected layer has 85 neurons and is directly connected to the second fully connected layer; wherein the three fully connected layers constitute the iterative regression part of the parameter inference module.

6. The attention mechanism-based human body reconstruction method of claim 5, wherein: in the parameter inference module, the SMPL parameter is generated as follows:

(1) obtaining a feature φ by average pooling the input body attention feature map F';

Θ＝cat(θ,β,c)；

in the above formula, θ represents the pose parameter of the SMPL model; β represents the shape parameter of the SMPL model; c represents the camera parameter; Θ represents the parameter set consisting of the pose parameter θ, the shape parameter β and the camera parameter c;

(3) forming the initial parameter set Θ₀ from the average pose parameter, the average shape parameter and the average camera parameter, and concatenating the feature φ with the parameter set Θ₀ as the input of the iterative regression part in the parameter inference module;

Θ_{t+1}＝Θ_t+ΔΘ_t；

in the above formula, Θ_t represents the parameter set corresponding to the current input, Θ_{t+1} represents the updated state of the parameter set Θ_t, and ΔΘ_t represents the residual of the parameter set Θ_t;

(5) iterating the update operation of the previous step 3 times; in each iterative updating process, splicing the parameter set obtained by last updating and the characteristic phi as the input of the iterative regression part of the parameter inference module at this time, and updating the parameter set;
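The iterative update described in these steps can be sketched as a short loop. The regressor below is a hypothetical stand-in for the three fully connected layers: it is a toy function chosen only so that the loop can run, and the 3-element parameter size is likewise a toy value rather than the 85-element set of the text.

```python
def iterative_regression(phi, theta0, regressor, num_iters=3):
    """Apply Theta_{t+1} = Theta_t + delta_t for num_iters iterations."""
    theta = list(theta0)
    for _ in range(num_iters):
        # cat(phi, Theta_t) is the input of the iterative regression part.
        delta = regressor(list(phi) + theta)
        theta = [t + d for t, d in zip(theta, delta)]
    return theta


PARAM_DIM = 3  # toy size for illustration only


def toy_regressor(x):
    """Hypothetical regressor: moves each parameter halfway toward the
    mean of the feature vector and returns that step as the residual."""
    phi, theta = x[:-PARAM_DIM], x[-PARAM_DIM:]
    target = sum(phi) / len(phi)
    return [0.5 * (target - t) for t in theta]


theta_final = iterative_regression([2.0, 4.0], [0.0] * PARAM_DIM, toy_regressor)
```

Because each pass predicts only a residual, the initial parameter set acts as a starting estimate that is refined step by step instead of being regressed in a single shot.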

7. The attention mechanism-based human body reconstruction method of claim 1, wherein: in the SMPL sub-module, the SMPL parameters are input into an SMPL function, and the SMPL function maps the shape parameters and the pose parameters to the vertices of the model to obtain the three-dimensional human body model; the expression of the SMPL function is:

in the above formula, [symbol not preserved] is the mixing weight.
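For orientation, the published SMPL model is commonly written as shown below. The notation is taken from the SMPL literature rather than from this text, so its correspondence to the expression in claim 7 is an assumption.

```latex
M(\beta, \theta) = W\big(T_P(\beta, \theta),\; J(\beta),\; \theta,\; \mathcal{W}\big),
\qquad
T_P(\beta, \theta) = \bar{T} + B_S(\beta) + B_P(\theta)
```

In this notation, W(·) is the linear blend skinning function, T̄ is the mean template mesh, B_S(β) and B_P(θ) are the shape and pose blend shapes, J(β) gives the joint locations, and 𝒲 denotes the blend (mixing) weights.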

8. The attention mechanism-based human body reconstruction method of claim 1, wherein: the preprocessing process of the original image comprises the following steps:

positioning a target person in the human body image, and performing cutting operation on the image to enable the target person to be located in the central area of the human body image;

adjusting the size of the cropped human body image so that all adjusted images have a uniform size of 224 × 224 pixels;

and carrying out normalization processing on the adjusted image to obtain data elements in the training data set.
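The three preprocessing steps of claim 8 can be sketched without external libraries. The sketch assumes a single-channel image stored as nested lists with 8-bit pixel values, a bounding box supplied by some external (hypothetical) person detector, nearest-neighbour resizing, and normalization with a mean and standard deviation of 0.5; the text specifies none of these choices beyond the 224 × 224 target size.

```python
def crop_center_on_box(img, box):
    """Crop a square window centred on the person's box (x0, y0, x1, y1).
    Pixels that fall outside the image are filled with 0."""
    H, W = len(img), len(img[0])
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) // 2, (y0 + y1) // 2
    half = max(1, max(x1 - x0, y1 - y0) // 2)
    return [[img[y][x] if 0 <= y < H and 0 <= x < W else 0
             for x in range(cx - half, cx + half)]
            for y in range(cy - half, cy + half)]


def resize_nearest(img, size=224):
    """Nearest-neighbour resize of an [h][w] image to [size][size]."""
    h, w = len(img), len(img[0])
    return [[img[y * h // size][x * w // size] for x in range(size)]
            for y in range(size)]


def normalize(img, mean=0.5, std=0.5):
    """Scale 8-bit values to [0, 1], then standardize (assumed constants)."""
    return [[(p / 255.0 - mean) / std for p in row] for row in img]


def preprocess(img, box, size=224):
    return normalize(resize_nearest(crop_center_on_box(img, box), size))


# Toy example: a uniform 100 x 120 image with a person box inside it.
image = [[255] * 120 for _ in range(100)]
sample = preprocess(image, box=(30, 20, 70, 80))
```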

9. The attention mechanism-based human body reconstruction method of claim 1, wherein: in the training process of the network model, all parameters in the network model are adjusted by adopting an Adam algorithm under the condition of minimizing a loss function, and the network is trained;

the expression of the loss function is as follows:

L＝λ_2D·L_{2D joint}+λ_3D·L_{3D joint}+λ_para·L_SMPL；

in the above formula, L_{2D joint} represents the 2D joint loss function; L_{3D joint} represents the 3D joint loss function; L_SMPL represents the SMPL parameter loss function; λ_2D represents the weight coefficient of the 2D joint loss function; λ_3D represents the weight coefficient of the 3D joint loss function; λ_para represents the weight coefficient of the SMPL parameter loss function;

wherein the expression of the 2D joint loss function L_{2D joint} is:

in the above formula, v_i represents the visibility of the i-th 2D joint point and takes the value 0 (invisible) or 1 (visible); N represents the number of 2D joint points; the predicted 2D joint point is derived from the projection of the predicted 3D joint;

the expression of the 3D joint loss function L_{3D joint} is:

in the above formula, the predicted term represents the 3D joint point prediction value of the i-th image; J_i represents the 3D joint point true value of the i-th image;

the expression of the SMPL parameter loss function L_SMPL is:

and
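The weighted combination of the three loss terms, together with a visibility-masked 2D joint term, can be sketched as follows. The weighted sum follows the stated total loss; the use of an L1 distance summed over joints in the 2D term, the default weights of 1.0 and all function names are assumptions, since the individual loss expressions are not reproduced in this text.

```python
def masked_2d_joint_loss(pred, target, visible):
    """Sum of L1 distances over 2D joints, counting visible joints only.
    The choice of the L1 norm and of a plain sum is an assumption."""
    total = 0.0
    for (px, py), (tx, ty), v in zip(pred, target, visible):
        total += v * (abs(px - tx) + abs(py - ty))
    return total


def total_loss(l_2d, l_3d, l_smpl, lam_2d=1.0, lam_3d=1.0, lam_para=1.0):
    """L = lam_2d * L_2D + lam_3d * L_3D + lam_para * L_SMPL."""
    return lam_2d * l_2d + lam_3d * l_3d + lam_para * l_smpl


# Toy example: the second joint is invisible and contributes nothing.
l_2d = masked_2d_joint_loss(pred=[(1.0, 1.0), (5.0, 5.0)],
                            target=[(0.0, 0.0), (0.0, 0.0)],
                            visible=[1, 0])
loss = total_loss(l_2d, 3.0, 4.0, lam_2d=1.0, lam_3d=0.5, lam_para=0.25)
```

Masking by visibility keeps occluded or unlabeled joints from penalizing the network for positions it cannot be supervised on.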

10. A human body reconstruction model, characterized in that the model uses the attention mechanism-based human body reconstruction method according to any one of claims 1 to 9 to process an input occluded human body image, so as to generate a three-dimensional human body model of the target person in the human body image; the human body reconstruction model comprises:

a preprocessing module used for: (1) locating the target person in the human body image and cropping the image so that the target person is located in the central area of the human body image; (2) adjusting the size of the cropped human body image so that all adjusted images have a uniform size of 224 × 224 pixels; (3) normalizing the adjusted image;

the feature extraction module adopts a convolution part in a deep convolution neural network Resnet50 as a backbone network; the output of the preprocessing module is used as the input of the feature extraction module; the feature extraction module is used for extracting features in the preprocessed human body image through convolution operation so as to generate a corresponding original feature map;

the attention module comprises a maximum pooling sub-module, an average pooling sub-module, a feature splicing sub-module, a convolution sub-module and a Sigmoid operation sub-module; the output of the feature extraction module is used as the input of the attention module; in the attention module, the original feature map is processed by the maximum pooling sub-module and the average pooling sub-module respectively to obtain two feature maps, and the two feature maps are concatenated in the feature splicing sub-module; the attention map is then obtained after convolution processing in the convolution sub-module and the Sigmoid operation in the Sigmoid operation sub-module;

the fusion module is used for simultaneously acquiring the original feature map output by the feature extraction module and the attention map output by the attention module, and then multiplying the original feature map and the attention map by corresponding elements to obtain a fused body attention feature map;

the parameter inference module comprises an average pooling layer, a fully connected layer I, a fully connected layer II and a fully connected layer III; fully connected layers I and II each have 1024 neurons and are connected through a Dropout operation; fully connected layer III has 85 neurons and is directly connected to fully connected layer II; fully connected layers I, II and III form the iterative regression part of the network model; the output of the fusion module is used as the input of the parameter inference module; the parameter inference module generates the iteratively updated SMPL parameters for each input;

and the SMPL sub-module is used for generating a three-dimensional human body model of the target person in the human body image according to the SMPL parameters output by the parameter inference module.