CN111311732B - 3D human body mesh acquisition method and device

Info

Publication number
CN111311732B (application CN202010085015.8A)
Authority
CN
China
Prior art keywords
graph
human body
image
neural network
preset scale
Prior art date
Legal status
Active
Application number
CN202010085015.8A
Other languages
Chinese (zh)
Other versions
CN111311732A
Inventor
牛新
赵杨
窦勇
姜晶菲
李荣春
苏华友
乔鹏
潘衡岳
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
2020-04-26
Filing date
2020-04-26
Publication date
2023-06-20
Application filed by National University of Defense Technology
Priority to CN202010085015.8A
Publication of CN111311732A
Application granted
Publication of CN111311732B
Legal status: Active

Classifications

    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06T3/4023 Scaling of whole images or parts thereof, based on decimating pixels or lines of pixels or on inserting pixels or lines of pixels
    • G06T3/4038 Image mosaicing, e.g. composing plane images from plane sub-images
    • G06T2200/32 Indexing scheme for image data processing or generation involving image mosaicing
    • G06T2207/10016 Video; Image sequence
    • G06T2207/10021 Stereoscopic video; Stereoscopic image sequence
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06T2207/30196 Human being; Person
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a 3D human body mesh acquisition method and device. The method comprises: acquiring the image features of each frame of a video and, for each frame, inputting the image features of the frame into a trained U-shaped graph neural network, so that the U-shaped graph neural network obtains the corresponding human body 3D mesh parameters from the image features, where the video is a video containing a person; and combining the human body 3D mesh parameters in the temporal order of the image frames and inputting them into a trained residual temporal graph network, which optimizes each set of human body 3D mesh parameters over time, so that the human body shape represented by the optimized human body 3D mesh matches the human body shape in the image.

Description

3D human body mesh acquisition method and device
Technical Field
The invention relates to the technical field of image processing, and in particular to a 3D human body mesh acquisition method and device.
Background
Recovering the 3D shape of a human body from an image is a fundamental task in computer vision. Compared with recovering only the skeletal joints, shape recovery requires details of the human body, and recovered 3D human bodies can be used in a variety of applications such as robotics, 3D animation and virtual reality.
Existing methods for recovering the 3D shape of a human body from an image fall into two categories. One is the parametric approach: model parameters corresponding to the image are regressed and fed into a predefined human model (such as the SCAPE model or the SMPL three-dimensional human model) to fit the 3D shape. These methods use the human model parameters as the regression target, but the parameters are discontinuous and hard to regress, and the limited number of parameters also limits the expressive capacity of the model. The other is the non-parametric approach, such as volumetric reconstruction or per-pixel depth regression, which expresses details better but loses semantic information and is not easily matched to existing model interfaces.
To combine the advantages of the two approaches, the vertices of the SMPL model are a good regression target. In the prior art, a graph neural network is used to regress the positions of the SMPL vertices, i.e., 6890 vertices are regressed to control the shape of the human body. However, the prior art recovers the 3D human shape from a single frame image, and the resulting 3D shape can differ considerably from the human shape in the image.
Disclosure of Invention
The invention aims to provide a 3D human body mesh acquisition method and device that address the above defects of the prior art. This aim is achieved through the following technical scheme.
The first aspect of the present invention proposes a 3D human body mesh acquisition method, the method comprising:
acquiring the image features of each frame of a video and, for each frame, inputting the image features of the frame into a trained U-shaped graph neural network, so that the U-shaped graph neural network obtains the corresponding human body 3D mesh parameters from the image features, where the video is a video containing a person; and combining the human body 3D mesh parameters in the temporal order of the image frames and inputting them into a trained residual temporal graph network, so that the residual temporal graph network optimizes each set of human body 3D mesh parameters over time and obtains the optimized human body 3D mesh parameters.
A second aspect of the present invention proposes a 3D human mesh acquisition device, the device comprising:
a feature acquisition module, used for acquiring the image features of each frame of the video;
an image module, used for inputting, for each frame, the image features of the frame into a trained U-shaped graph neural network, so that the U-shaped graph neural network obtains the corresponding human body 3D mesh parameters from the image features; the video is a video containing a person;
a video module, used for combining the human body 3D mesh parameters in the temporal order of the image frames and inputting them into a trained residual temporal graph network, so that the residual temporal graph network optimizes each set of human body 3D mesh parameters over time and obtains the optimized human body 3D mesh parameters.
In the embodiment of the application, after the image features of each frame of the video are acquired, the image features of each frame are processed by a U-shaped graph neural network, which regresses image-level human body 3D mesh parameters; each human body 3D mesh is then optimized over the frame sequence by a residual temporal graph network, so that the human body shape represented by the optimized human body 3D mesh is consistent with the human body shape in the image.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and do not constitute a limitation on the invention. In the drawings:
FIG. 1 is a flow chart of an embodiment of a method for acquiring a 3D body mesh according to an exemplary embodiment of the present invention;
FIG. 2 is a block diagram of the U-shaped graph neural network of the present invention;
FIG. 3 is an overview diagram of 3D human body mesh acquisition according to the present invention;
FIG. 4 shows the 3D human body meshes corresponding to the up-sampling and down-sampling of the U-shaped graph neural network of the present invention;
FIG. 5 is a schematic diagram of the residual temporal graph network of the present invention;
FIG. 6 is a block diagram of an embodiment of a 3D human body mesh acquisition device according to an exemplary embodiment of the present invention.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the invention; rather, they are merely examples of apparatus and methods consistent with aspects of the invention as detailed in the appended claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly second information may also be referred to as first information, without departing from the scope of the invention. The word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining", depending on the context.
The 3D human body shape obtained by regressing the 6890 SMPL vertices with the currently adopted graph neural network can differ considerably from the human body shape in the image.
To solve this technical problem, the inventors found that a video contains not only image-level information but also dynamic temporal information, which helps reduce the uncertainty of the human body shape.
The method is concretely realized as follows: after the image features of each frame of the video are acquired, the image features of each frame are input into a trained U-shaped graph neural network, which obtains the corresponding human body 3D mesh parameters from the image features; the human body 3D mesh parameters are then combined in the temporal order of the image frames and input into a trained residual temporal graph network, which optimizes each set of parameters over time, so that each optimized human body 3D mesh is consistent with the human body shape in the image.
The 3D human body mesh acquisition method proposed by the present invention is described in detail below with specific embodiments.
Fig. 1 is a flowchart of an embodiment of a 3D human body mesh acquisition method according to an exemplary embodiment of the present invention. The method may be applied to an electronic device (such as a PC or a terminal). As shown in fig. 1, the method includes the following steps:
step 101: and acquiring the image characteristics of each frame of image in the video.
The video is a video containing a person, i.e., a video whose main subject is a person.
In step 101, the video may be decomposed into single-frame images, which are respectively input into a preset human body detection system so that the system outputs images containing human body candidate boxes; the candidate boxes contained in each frame are scale-transformed so that the human body is at the center of the box; the single-frame images containing the candidate boxes are then respectively input into a trained feature extraction network, which outputs the image features of each image.
For example, the human body detection system may be implemented with a neural network model such as R-CNN; for each input frame, the system extracts the candidate box with the highest probability of containing a person.
In one example, the preliminary image features of each frame may be obtained through a ResNet feature extraction network.
In this embodiment, the human body candidate box contained in each frame may be scaled to a fixed size to ensure that the human body is centered in the box.
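As an illustration of step 101, the following is a minimal per-frame feature-extraction sketch in PyTorch. The use of ResNet-50, the 224x224 crop size and the box format are assumptions for illustration; the patent only specifies a preset human body detection system and a trained feature extraction network.

```python
# Hypothetical sketch of step 101: crop the detected human box from each frame
# and extract a per-frame feature vector with a ResNet backbone.
import torch
import torch.nn.functional as F
import torchvision

resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")
backbone = torch.nn.Sequential(*list(resnet.children())[:-1])  # drop classifier head
backbone.eval()

def frame_features(frames, boxes):
    """frames: list of (3, H, W) float tensors; boxes: one (x1, y1, x2, y2) per frame."""
    feats = []
    for img, (x1, y1, x2, y2) in zip(frames, boxes):
        crop = img[:, y1:y2, x1:x2].unsqueeze(0)       # human-centered candidate box
        crop = F.interpolate(crop, size=(224, 224))    # scale the box to a fixed size
        with torch.no_grad():
            feats.append(backbone(crop).flatten(1))    # (1, 2048) image feature
    return torch.cat(feats)                            # (T, 2048), one row per frame
```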
Step 102: for each frame, input the image features of the frame into a trained U-shaped graph neural network, so that the U-shaped graph neural network obtains the corresponding human body 3D mesh parameters from the image features.
In this embodiment, a graph neural network with a U-shaped structure is used to shrink and then restore the graph scale, i.e., the graph goes from large to small and back to large. This helps enlarge the receptive field of each node and deepens the network on the small-scale graph, which facilitates the extraction of high-level features.
The U-shaped graph neural network is formed by stacking several graph neural network modules. Its structure comprises an input module, two first graph neural network modules at a first preset scale, a down-sampling module, four graph neural network modules at a second preset scale, an up-sampling module, a splicing module, two second graph neural network modules at the first preset scale, a coordinate regressor and a camera coordinate regressor.
Based on this structure, the process by which the U-shaped graph neural network obtains the human body 3D mesh parameters from the image features is as follows: the input module concatenates the image features with the 3D coordinates of each vertex of the SMPL template mesh at the first preset scale, obtains the initial features of each vertex as the graph features at the first preset scale, and inputs them into the two serially connected first graph neural network modules at the first preset scale; the two serially connected first graph neural network modules process the input graph features in turn to obtain new graph features at the first preset scale, which are input into the down-sampling module and the splicing module; the down-sampling module converts the new graph features at the first preset scale into graph features at the second preset scale and inputs them into the four serially connected graph neural network modules at the second preset scale; the four serially connected graph neural network modules at the second preset scale process the input graph features in turn to obtain new graph features at the second preset scale, which are input into the up-sampling module; the up-sampling module restores the new graph features at the second preset scale to graph features at the first preset scale and inputs them into the splicing module; the splicing module concatenates the new graph features at the first preset scale with the restored graph features at the first preset scale, and inputs the concatenated graph features into the two serially connected second graph neural network modules at the first preset scale; the two serially connected second graph neural network modules process the input graph features in turn to obtain the final graph features at the first preset scale, which are input into the coordinate regressor and the camera coordinate regressor; the coordinate regressor regresses the 3D coordinates of each vertex from the graph features, and the camera coordinate regressor regresses the camera parameters corresponding to the graph features.
Thus, the human body 3D mesh parameters corresponding to each frame comprise the 3D coordinates of each vertex of the graph features at the first preset scale and the camera parameters.
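To make the data flow above concrete, here is a minimal sketch of the U-shaped forward pass, assuming PyTorch. GraphResBlock is the ResBlock-derived module sketched after the graph-convolution formula below; the layer width, the 2048-dimensional image feature and the weak-perspective camera parameters (s, tx, ty) are illustrative assumptions, not the patent's actual hyper-parameters.

```python
# Hypothetical sketch of the U-shaped graph neural network forward pass.
import torch
import torch.nn as nn

class UGraphNet(nn.Module):
    def __init__(self, template, A1, A2, D, U, feat_dim=2048, width=256):
        super().__init__()
        self.template = template        # (1723, 3) SMPL template mesh at the first scale
        self.D, self.U = D, U           # (431, 1723) down- / (1723, 431) up-sampling matrices
        self.inp = nn.Linear(feat_dim + 3, width)                             # input module
        self.enc = nn.ModuleList(GraphResBlock(width, A1) for _ in range(2))  # first modules
        self.mid = nn.ModuleList(GraphResBlock(width, A2) for _ in range(4))  # second-scale modules
        self.dec = nn.ModuleList(GraphResBlock(width, A1) for _ in range(2))  # second modules
        self.fuse = nn.Linear(2 * width, width)      # splicing module
        self.coord = nn.Linear(width, 3)             # coordinate regressor
        self.cam = nn.Linear(1723 * width, 3)        # camera regressor: (s, tx, ty) assumed

    def forward(self, img_feat):                     # img_feat: (2048,) per-frame feature
        x = img_feat.expand(1723, -1)                # attach the image feature to every vertex
        x = self.inp(torch.cat([x, self.template], dim=-1))
        for blk in self.enc:
            x = blk(x)                               # graph features, first preset scale
        y = self.D @ x                               # down-sample: 1723 -> 431 vertices
        for blk in self.mid:
            y = blk(y)
        y = self.U @ y                               # up-sample: 431 -> 1723 vertices
        x = self.fuse(torch.cat([x, y], dim=-1))     # splice skip features with restored ones
        for blk in self.dec:
            x = blk(x)                               # final graph features, first scale
        return self.coord(x), self.cam(x.flatten())  # per-vertex 3D coords, camera params
```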
In this embodiment, since the existing SMPL template mesh has 6890 vertices, which imposes a heavy computational burden, the invention uses a clustering algorithm to cluster the vertices of the template mesh and merge vertices of the same class, reducing the graph to the first preset scale, e.g., to roughly a quarter of the original vertices, i.e., 1723 vertices; the number of vertex features in the graph features produced by the input module is then consistent with the number of template-mesh vertices. The clustering algorithm is applied again to obtain a graph at the second preset scale, e.g., reduced again to roughly a quarter, i.e., 431 vertices. Meanwhile, the correspondence between the vertices of the graphs at different scales is saved, i.e., the up-sampling matrix used by the up-sampling module and the down-sampling matrix used by the down-sampling module are obtained.
It follows that the first preset scale is larger than the second preset scale. Taking the 6890, 1723 and 431 scales as an example, the 6890-scale graph is used for visual display, the 1723-scale graph is used for model input and output, and the 431-scale graph serves as the abstract graph for constructing the U-shaped structure.
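The sampling matrices can then be applied as plain matrix multiplications. Below is a minimal sketch, assuming PyTorch; the row-normalized cluster-membership construction (each coarse vertex is the mean of its cluster, up-sampling copies a cluster's feature back to its members) is an assumption consistent with merging vertices of the same class, since the patent does not spell out how the matrices are built.

```python
# Hypothetical construction and use of the down-/up-sampling matrices.
import torch

def downsample_matrix(cluster_of, n_coarse):
    """cluster_of[i] = cluster index of fine vertex i; returns (n_coarse, n_fine)."""
    n_fine = len(cluster_of)
    D = torch.zeros(n_coarse, n_fine)
    D[cluster_of, torch.arange(n_fine)] = 1.0
    return D / D.sum(dim=1, keepdim=True).clamp(min=1)  # coarse vertex = cluster mean

cluster_of = torch.randint(0, 431, (1723,))   # toy clustering, for illustration only
D = downsample_matrix(cluster_of, 431)        # (431, 1723), used by the down-sampling module
U = (D > 0).float().T                         # (1723, 431), used by the up-sampling module
X_fine = torch.randn(1723, 256)               # graph features at the first preset scale
X_coarse = D @ X_fine                         # 1723 -> 431 vertices
X_restored = U @ X_coarse                     # 431 -> 1723 vertices
```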
Each graph neural network module in the U-shaped graph neural network has the same structure; the modules differ only in the scale of the graph and in their parameters, and the computation of a module decreases as the graph scale decreases. Each module inherits the ResBlock structure: all 3x3 convolution layers in the ResBlock are replaced by graph convolution layers, all 1x1 convolution layers are replaced by per-vertex fully connected layers, and all BatchNorm layers are replaced by GroupNorm layers.
The mathematical formula of the graph convolution layer is:

$$ y = \hat{A} x W $$

where $x$ is the input feature vector, $W$ is the trained parameter matrix, and $\hat{A}$ is the adjacency matrix regularized by row normalization.
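A minimal sketch of this layer and of the ResBlock-derived module follows, assuming PyTorch. The bottleneck width (dim/2), the group count and the pre-activation ordering are illustrative assumptions; only the substitutions named above (graph convolution for 3x3 convolution, per-vertex fully connected layer for 1x1 convolution, GroupNorm for BatchNorm) come from the description.

```python
# Hypothetical graph convolution y = A_hat x W and ResBlock-derived module.
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    def __init__(self, A_hat, in_dim, out_dim):
        super().__init__()
        self.register_buffer("A_hat", A_hat)          # row-normalized adjacency matrix
        self.W = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x):                             # x: (N, in_dim)
        return self.A_hat @ self.W(x)                 # y = A_hat x W

class GraphResBlock(nn.Module):
    """ResBlock with 3x3 convs -> graph convs, 1x1 convs -> per-vertex FC,
    BatchNorm -> GroupNorm (bottleneck and ordering assumed)."""
    def __init__(self, dim, A_hat, groups=8):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim // 2)           # per-vertex fully connected
        self.gconv = GraphConv(A_hat, dim // 2, dim // 2)
        self.fc2 = nn.Linear(dim // 2, dim)
        self.norms = nn.ModuleList(
            nn.GroupNorm(groups, d) for d in (dim, dim // 2, dim // 2))

    @staticmethod
    def act(norm, x):                                 # GroupNorm over the feature dim + ReLU
        return torch.relu(norm(x.t().unsqueeze(0)).squeeze(0).t())

    def forward(self, x):                             # x: (N, dim)
        h = self.fc1(self.act(self.norms[0], x))
        h = self.gconv(self.act(self.norms[1], h))
        h = self.fc2(self.act(self.norms[2], h))
        return x + h                                  # residual connection
```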
It should be noted that, for the training process of the U-shaped graph neural network, a training sample set is obtained in which each training sample is marked with labels; the labels comprise human body 3D mesh point coordinates, 3D key point coordinates and 2D key point coordinates, or only 2D key point coordinates. Each training sample in the set is used to train the constructed U-shaped graph neural network until convergence.
in this embodiment, the loss functions of the U-shaped graph neural network include three types:
1) Loss function based on 3D mesh points:

$$ L_{mesh} = \frac{1}{N} \sum_{i=1}^{N} \left\| \hat{V}_i - V_i \right\| $$

where $V_i$ denotes the labeled human body 3D mesh point coordinates, $\hat{V}_i$ denotes the human body 3D mesh point coordinates calculated by the model, and $N$ denotes the number of vertices of the 3D mesh.
2) Loss function based on 3D key points:

$$ L_{3D} = \frac{1}{M} \sum_{t=1}^{M} \left\| \hat{J}_{3D,t} - J_{3D,t} \right\| $$

where $J_{3D,t}$ denotes the labeled human body 3D key point coordinates, $\hat{J}_{3D,t}$ denotes the human body 3D key point coordinates calculated by the model, and $M$ denotes the number of human body 3D key points.
3) Loss function based on 2D key points:

$$ L_{2D} = \frac{1}{M} \sum_{t=1}^{M} \left\| \hat{J}_{2D,t} - J_{2D,t} \right\| $$

where $J_{2D,t}$ denotes the labeled human body 2D key point coordinates, $\hat{J}_{2D,t}$ denotes the human body 2D key point coordinates calculated by the model, and $M$ denotes the number of human body 2D key points.
Further, the loss function of the U-shaped graph neural network is a linear combination of the three loss functions:

$$ L = \lambda_{mesh} L_{mesh} + \lambda_{3D} L_{3D} + \lambda_{2D} L_{2D} $$

where $\lambda_{mesh}$, $\lambda_{3D}$ and $\lambda_{2D}$ are preset coefficients.
Regarding the human body 3D key point coordinates and human body 2D key point coordinates calculated by the model: after the model obtains the human body 3D mesh parameters, the vertex 3D coordinates of the human body 3D mesh are multiplied by a regression matrix obtained in advance to obtain the human body 3D key point coordinates, and a camera transformation is then performed using the camera parameters and the obtained human body 3D key point coordinates to obtain the human body 2D key point coordinates.
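The following sketch ties the three losses and the key point regression together, assuming PyTorch. The Euclidean norm and the weak-perspective camera (scale s, translation tx, ty) follow common SMPL practice and are assumptions; the patent only states that a pre-obtained regression matrix and a camera transformation are used.

```python
# Hypothetical sketch of the key point regression and the combined training loss.
import torch

def keypoints_from_mesh(verts, J_reg, cam):
    """verts: (N, 3) mesh vertices; J_reg: (M, N) regression matrix; cam: (s, tx, ty)."""
    j3d = J_reg @ verts                           # (M, 3) human body 3D key points
    s, tx, ty = cam
    j2d = s * j3d[:, :2] + torch.stack([tx, ty])  # weak-perspective camera transformation
    return j3d, j2d

def total_loss(verts, cam, gt_verts, gt_j3d, gt_j2d, J_reg, lams=(1.0, 1.0, 1.0)):
    j3d, j2d = keypoints_from_mesh(verts, J_reg, cam)
    l_mesh = (verts - gt_verts).norm(dim=-1).mean()   # 3D mesh point loss
    l_3d = (j3d - gt_j3d).norm(dim=-1).mean()         # 3D key point loss
    l_2d = (j2d - gt_j2d).norm(dim=-1).mean()         # 2D key point loss
    return lams[0] * l_mesh + lams[1] * l_3d + lams[2] * l_2d
```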
Step 103: combine the human body 3D mesh parameters in the temporal order of the image frames and input them into a trained residual temporal graph network, so that the residual temporal graph network optimizes each set of human body 3D mesh parameters over time and obtains the optimized human body 3D mesh parameters.
In step 103, the residual temporal graph network obtains an optimization matrix corresponding to each set of human body 3D mesh parameters over the time sequence, and each set of human body 3D mesh parameters is added to its corresponding optimization matrix through a residual link to obtain the optimized human body 3D mesh parameters.
The residual temporal graph network is a small-scale residual temporal graph network whose structure is similar to that of the graph neural network modules in the U-shaped graph neural network, except that the graph convolution layer is replaced by a temporal graph neural network layer and the per-vertex fully connected layer is replaced by a per-frame per-vertex fully connected layer. The formula of the temporal graph neural network layer is:

$$ Y_{i,j} = C(\hat{A} X)_{i,j} $$

where $C(\cdot)$ denotes a two-dimensional convolution operation, $\hat{A}$ denotes the row-normalized adjacency matrix, $X_{i,j} \in \mathbb{R}^{N \times k}$ denotes the input feature vector, and $Y_{i,j}$ denotes the output feature vector.
For the training process of the residual temporal graph network, a training video containing a person is acquired, in which the human body in each frame is marked with human body 3D mesh point coordinates, 3D key point coordinates and 2D key point coordinates; the constructed residual temporal graph network is trained with the training video until convergence.
In this embodiment, the loss functions of the residual temporal graph network include two types:
1) Loss function based on 3D mesh points:

$$ L_{mesh} = \frac{1}{NT} \sum_{t=1}^{T} \sum_{i=1}^{N} \left\| \hat{V}_{i,t} - V_{i,t} \right\| $$

where $V_{i,t}$ denotes the labeled human body 3D mesh point coordinates, $\hat{V}_{i,t}$ denotes the human body 3D mesh point coordinates calculated by the model, $N$ denotes the number of vertices of the human body 3D mesh, and $T$ denotes the number of temporally ordered images in the video.
2) Loss function based on 3D key points:

$$ L_{3D} = \frac{1}{M} \sum_{t=1}^{M} \left\| \hat{J}_{3D,t} - J_{3D,t} \right\| $$

where $J_{3D,t}$ denotes the labeled human body 3D key point coordinates, $\hat{J}_{3D,t}$ denotes the human body 3D key point coordinates calculated by the model, and $M$ denotes the number of human body 3D key points.
Further, the loss function of the residual temporal graph network is a linear combination of the two loss functions:

$$ L = L_{mesh} + \lambda L_{3D} $$

where $\lambda$ denotes a preset coefficient.
Based on the description of steps 101 to 103, the U-shaped graph neural network successfully reduces the size of the graph by merging adjacent points, which enlarges the receptive field of each vertex and deepens the graph neural network so that it can extract high-level features; establishing skip links between graphs of the same size facilitates feature fusion between different layers and further improves performance.
As shown in the overall structure diagram of fig. 3: in the first step, the candidate boxes whose main subject is a person are extracted from each frame of the video containing a person (fig. 3 shows a video segment of 3 frames) to form a new person-centered video; in the second step, the image features of each frame of this video are extracted with a ResNet network; in the third step, the image features of each frame are input into the U-shaped graph neural network to obtain the human body 3D mesh parameters of the single frame; in the fourth step, the per-frame human body 3D mesh parameters are combined in the original frame order and input into the residual temporal graph network, which outputs the temporally optimized human body 3D mesh parameters of the video.
As shown in fig. 4, the SMPL template mesh has 6890 vertices, a number large enough to impose a heavy computational burden. The vertices of the SMPL template mesh are therefore clustered with a clustering algorithm and points of the same class are merged, reducing the graph to roughly a quarter of the original vertices, i.e., 1723 vertices; applying the same method again yields a further reduced graph of 431 vertices. Meanwhile, the correspondence between the graph vertices at different scales, i.e., the up-sampling matrix used by the up-sampling module and the down-sampling matrix used by the down-sampling module, is saved. The 6890-scale graph is used for visual display, the 1723-scale graph is used for model input and output, and the 431-scale graph serves as the abstract graph for constructing the U-shaped network.
As shown in the receptive field diagram of the residual temporal graph network in fig. 5, the graph is built simultaneously from the structure within each frame and the temporal relations between different frames, so that each node can obtain information from its neighbors in both time and structure, yielding a better result. The curves in fig. 5 represent residual links, the arrows represent the flow of data (from left to right), and the dashed lines represent the receptive field of a feature of a later layer within the features of the preceding layer.
Fig. 6 is a block diagram of an embodiment of a 3D human body mesh acquisition device according to an exemplary embodiment of the present invention. The device may be applied to an electronic device. As shown in fig. 6, the 3D human body mesh acquisition device includes:
a feature acquisition module 610, configured to acquire the image features of each frame of a video;
an image module 620, configured to input, for each frame, the image features of the frame into a trained U-shaped graph neural network, so that the U-shaped graph neural network obtains the corresponding human body 3D mesh parameters from the image features; the video is a video containing a person;
a video module 630, configured to combine the human body 3D mesh parameters in the temporal order of the image frames and input them into a trained residual temporal graph network, so that the residual temporal graph network optimizes each set of human body 3D mesh parameters over time and obtains the optimized human body 3D mesh parameters.
In an optional implementation, the feature acquisition module 610 is specifically configured to decompose the video into single-frame images and input them respectively into a preset human body detection system, so that the system outputs images containing human body candidate boxes; to scale-transform the candidate boxes contained in each frame so that the human body is at the center of the box; and to input the single-frame images containing the candidate boxes respectively into a trained feature extraction network, so that the network outputs the image features of each image.
In an optional implementation, the image module 620 is specifically configured such that, in the process of the U-shaped graph neural network obtaining the corresponding human body 3D mesh parameters from the image features: the input module of the U-shaped graph neural network concatenates the image features with the 3D coordinates of each vertex of the SMPL template mesh at a first preset scale, obtains the initial features of each vertex as the graph features at the first preset scale, and inputs them into the two serially connected first graph neural network modules at the first preset scale in the U-shaped graph neural network; the two serially connected first graph neural network modules process the input graph features in turn to obtain new graph features at the first preset scale, which are input into the down-sampling module and the splicing module of the U-shaped graph neural network; the down-sampling module converts the new graph features at the first preset scale into graph features at a second preset scale and inputs them into the four serially connected graph neural network modules at the second preset scale in the U-shaped graph neural network; the four serially connected graph neural network modules at the second preset scale process the input graph features in turn to obtain new graph features at the second preset scale, which are input into the up-sampling module of the U-shaped graph neural network; the up-sampling module restores the new graph features at the second preset scale to graph features at the first preset scale and inputs them into the splicing module; the splicing module concatenates the new graph features at the first preset scale with the restored graph features at the first preset scale and inputs the concatenated graph features into the two serially connected second graph neural network modules at the first preset scale in the U-shaped graph neural network; the two serially connected second graph neural network modules process the input graph features in turn to obtain the final graph features at the first preset scale, which are input into the coordinate regressor and the camera coordinate regressor of the U-shaped graph neural network; the coordinate regressor regresses the 3D coordinates of each vertex from the graph features, and the camera coordinate regressor regresses the camera parameters corresponding to the graph features; the 3D coordinates of each vertex and the camera parameters are taken as the human body 3D mesh parameters corresponding to the frame; the first preset scale is larger than the second preset scale.
In an optional implementation, the video module 630 is specifically configured such that, in the process of the residual temporal graph network optimizing each set of human body 3D mesh parameters over time and obtaining the optimized human body 3D mesh parameters: the residual temporal graph network obtains an optimization matrix corresponding to each set of human body 3D mesh parameters over the time sequence, and each set of human body 3D mesh parameters is added to its corresponding optimization matrix through a residual link to obtain the optimized human body 3D mesh parameters.
In an alternative implementation, the apparatus further comprises (not shown in fig. 6):
the first training module is used for acquiring a training sample set, wherein each training sample in the training sample set is marked with a label, and the label comprises a human body 3D grid point coordinate, a 3D key point coordinate, a 2D key point coordinate or only a 2D key point coordinate; training the constructed U-shaped graph neural network by utilizing each training sample in the training sample set until convergence; the loss value of the U-shaped graph neural network consists of a 3D grid point loss, a 3D key point loss and a 2D key point loss.
In an alternative implementation, the apparatus further comprises (not shown in fig. 6):
the second training module is used for acquiring training videos containing people, wherein the human body in each frame of image in the training videos is marked with human body 3D grid point coordinates, 3D key point coordinates and 2D key point coordinates; training the constructed residual time sequence diagram network by utilizing the training video until convergence; the loss value of the residual time chart network consists of a 3D grid point loss and a 3D key point loss.
The implementation of the functions and roles of each unit in the above device is described in detail in the implementation of the corresponding steps in the above method, and is not repeated here.
Since the device embodiments essentially correspond to the method embodiments, reference is made to the description of the method embodiments for the relevant points. The device embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purposes of the invention. Those of ordinary skill in the art can understand and implement the invention without undue burden.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.
The foregoing is only a description of preferred embodiments of the invention and is not intended to limit the invention; any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall be included in the scope of protection of the invention.

Claims (8)

1. A method for acquiring a 3D human mesh, the method comprising:
acquiring the image features of each frame of a video, and inputting, for each frame, the image features of the frame into a trained U-shaped graph neural network, so that the U-shaped graph neural network obtains the corresponding human body 3D mesh parameters from the image features; the video being obtained by shooting a human body;
combining the human body 3D mesh parameters in the temporal order of the image frames and inputting them into a trained residual temporal graph network, so that the residual temporal graph network optimizes each set of human body 3D mesh parameters over time and obtains the optimized human body 3D mesh parameters;
wherein acquiring the image features of each frame of the video comprises:
decomposing the video into single-frame images and inputting them respectively into a preset human body detection system, so that the human body detection system outputs images containing human body candidate boxes; scale-transforming the candidate boxes contained in each frame so that the human body is at the center of the box; and inputting the single-frame images containing the candidate boxes respectively into a trained feature extraction network, so that the feature extraction network outputs the image features of each image.
2. The method of claim 1, wherein the U-shaped graph neural network obtaining the corresponding human body 3D mesh parameters from the image features comprises:
the input module of the U-shaped graph neural network concatenates the image features with the 3D coordinates of each vertex of the template mesh of the three-dimensional human model SMPL at a first preset scale, obtains the initial features of each vertex as the graph features at the first preset scale, and inputs them into the two serially connected first graph neural network modules at the first preset scale in the U-shaped graph neural network;
the two serially connected first graph neural network modules process the input graph features in turn to obtain new graph features at the first preset scale, which are input into the down-sampling module and the splicing module of the U-shaped graph neural network;
the down-sampling module converts the new graph features at the first preset scale into graph features at a second preset scale and inputs them into the four serially connected graph neural network modules at the second preset scale in the U-shaped graph neural network;
the four serially connected graph neural network modules at the second preset scale process the input graph features in turn to obtain new graph features at the second preset scale, which are input into the up-sampling module of the U-shaped graph neural network;
the up-sampling module restores the new graph features at the second preset scale to graph features at the first preset scale and inputs them into the splicing module;
the splicing module concatenates the new graph features at the first preset scale with the restored graph features at the first preset scale, and inputs the concatenated graph features at the first preset scale into the two serially connected second graph neural network modules at the first preset scale in the U-shaped graph neural network;
the two serially connected second graph neural network modules process the input graph features in turn to obtain the final graph features at the first preset scale, which are input into the coordinate regressor and the camera coordinate regressor of the U-shaped graph neural network;
the coordinate regressor regresses the 3D coordinates of each vertex from the graph features, and the camera coordinate regressor regresses the camera parameters corresponding to the graph features;
the 3D coordinates of each vertex and the camera parameters are taken as the human body 3D mesh parameters corresponding to the frame;
wherein the first preset scale is larger than the second preset scale.
3. The method of claim 1, wherein the residual temporal graph network optimizing each set of human body 3D mesh parameters over time to obtain the optimized human body 3D mesh parameters comprises:
the residual temporal graph network obtains an optimization matrix corresponding to each set of human body 3D mesh parameters over the time sequence, and each set of human body 3D mesh parameters is added to its corresponding optimization matrix through a residual link to obtain the optimized human body 3D mesh parameters.
4. The method of claim 1, wherein the training process of the U-shaped graph neural network comprises:
acquiring a training sample set, wherein each training sample in the set is marked with labels comprising human body 3D mesh point coordinates, 3D key point coordinates and 2D key point coordinates, or only 2D key point coordinates;
training the constructed U-shaped graph neural network with each training sample in the set until convergence;
wherein the loss value of the U-shaped graph neural network consists of a 3D mesh point loss, a 3D key point loss and a 2D key point loss.
5. The method of claim 1, wherein the training process of the residual temporal graph network comprises:
acquiring a training video containing a person, wherein the human body in each frame of the training video is marked with human body 3D mesh point coordinates, 3D key point coordinates and 2D key point coordinates;
training the constructed residual temporal graph network with the training video until convergence;
wherein the loss value of the residual temporal graph network consists of a 3D mesh point loss and a 3D key point loss.
6. A 3D human body mesh acquisition device, the device comprising:
a feature acquisition module, used for acquiring the image features of each frame of a video;
an image module, used for inputting, for each frame, the image features of the frame into a trained U-shaped graph neural network, so that the U-shaped graph neural network obtains the corresponding human body 3D mesh parameters from the image features; the video being obtained by shooting a human body;
a video module, used for combining the human body 3D mesh parameters in the temporal order of the image frames and inputting them into a trained residual temporal graph network, so that the residual temporal graph network optimizes each set of human body 3D mesh parameters over time and obtains the optimized human body 3D mesh parameters;
wherein the feature acquisition module is specifically used for decomposing the video into single-frame images and inputting them respectively into a preset human body detection system, so that the human body detection system outputs images containing human body candidate boxes; scale-transforming the candidate boxes contained in each frame so that the human body is at the center of the box; and inputting the single-frame images containing the candidate boxes respectively into a trained feature extraction network, so that the feature extraction network outputs the image features of each image.
7. The apparatus of claim 6, wherein the image module is specifically configured such that, in the process of the U-shaped graph neural network obtaining the corresponding human body 3D mesh parameters from the image features: the input module of the U-shaped graph neural network concatenates the image features with the 3D coordinates of each vertex of the template mesh of the three-dimensional human model SMPL at a first preset scale, obtains the initial features of each vertex as the graph features at the first preset scale, and inputs them into the two serially connected first graph neural network modules at the first preset scale in the U-shaped graph neural network; the two serially connected first graph neural network modules process the input graph features in turn to obtain new graph features at the first preset scale, which are input into the down-sampling module and the splicing module of the U-shaped graph neural network; the down-sampling module converts the new graph features at the first preset scale into graph features at a second preset scale and inputs them into the four serially connected graph neural network modules at the second preset scale in the U-shaped graph neural network; the four serially connected graph neural network modules at the second preset scale process the input graph features in turn to obtain new graph features at the second preset scale, which are input into the up-sampling module of the U-shaped graph neural network; the up-sampling module restores the new graph features at the second preset scale to graph features at the first preset scale and inputs them into the splicing module; the splicing module concatenates the new graph features at the first preset scale with the restored graph features at the first preset scale and inputs the concatenated graph features at the first preset scale into the two serially connected second graph neural network modules at the first preset scale in the U-shaped graph neural network; the two serially connected second graph neural network modules process the input graph features in turn to obtain the final graph features at the first preset scale, which are input into the coordinate regressor and the camera coordinate regressor of the U-shaped graph neural network; the coordinate regressor regresses the 3D coordinates of each vertex from the graph features, and the camera coordinate regressor regresses the camera parameters corresponding to the graph features; the 3D coordinates of each vertex and the camera parameters are taken as the human body 3D mesh parameters corresponding to the frame; wherein the first preset scale is larger than the second preset scale.
8. The apparatus of claim 6, wherein the video module is specifically configured such that, in the process of the residual temporal graph network optimizing each set of human body 3D mesh parameters over time and obtaining the optimized human body 3D mesh parameters: the residual temporal graph network obtains an optimization matrix corresponding to each set of human body 3D mesh parameters over the time sequence, and each set of human body 3D mesh parameters is added to its corresponding optimization matrix through a residual link to obtain the optimized human body 3D mesh parameters.
CN202010085015.8A (filed 2020-04-26, priority 2020-04-26) 3D human body mesh acquisition method and device; granted as CN111311732B, status Active

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010085015.8A 2020-04-26 2020-04-26 3D human body mesh acquisition method and device (granted as CN111311732B)

Publications (2)

Publication Number Publication Date
CN111311732A 2020-06-19
CN111311732B 2023-06-20

Family

ID=71161682

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010085015.8A (Active, granted as CN111311732B) 3D human body mesh acquisition method and device 2020-04-26 2020-04-26

Country Status (1)

Country Link
CN (1) CN111311732B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112767534B (en) * 2020-12-31 2024-02-09 北京达佳互联信息技术有限公司 Video image processing method, device, electronic equipment and storage medium
CN113011516A (en) * 2021-03-30 2021-06-22 华南理工大学 Three-dimensional mesh model classification method and device based on graph topology and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101271589A (en) * 2007-03-22 2008-09-24 中国科学院计算技术研究所 Three-dimensional mannequin joint center extraction method
CN101833788A (en) * 2010-05-18 2010-09-15 南京大学 Three-dimensional human modeling method by using cartographical sketching
CN102982578A (en) * 2012-10-31 2013-03-20 北京航空航天大学 Estimation method for dressed body 3D model in single character image
CN105006014A (en) * 2015-02-12 2015-10-28 上海交通大学 Method and system for realizing fast fitting simulation of virtual clothing

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106611158A (en) * 2016-11-14 2017-05-03 深圳奥比中光科技有限公司 Method and equipment for obtaining human body 3D characteristic information
CN107392097B (en) * 2017-06-15 2020-07-07 中山大学 Three-dimensional human body joint point positioning method of monocular color video
CN108053480B (en) * 2017-12-08 2021-03-19 东华大学 Three-dimensional full-scale dressing human body mesh construction method based on reverse engineering technology
CN108629801B (en) * 2018-05-14 2020-11-24 华南理工大学 Three-dimensional human body model posture and shape reconstruction method of video sequence
CN108985259B (en) * 2018-08-03 2022-03-18 百度在线网络技术(北京)有限公司 Human body action recognition method and device
CN109199603B (en) * 2018-08-31 2020-11-03 浙江大学宁波理工学院 Intelligent positioning method for optimal screw placement point of pedicle screw
CN109859306A (en) * 2018-12-24 2019-06-07 青岛红创众投科技发展有限公司 A method of extracting manikin in the slave photo based on machine learning
CN109919122A (en) * 2019-03-18 2019-06-21 中国石油大学(华东) A kind of timing behavioral value method based on 3D human body key point
CN110059605A (en) * 2019-04-10 2019-07-26 厦门美图之家科技有限公司 A kind of neural network training method calculates equipment and storage medium
CN110074788B (en) * 2019-04-18 2020-03-17 梦多科技有限公司 Body data acquisition method and device based on machine learning
CN110399789B (en) * 2019-06-14 2021-04-20 佳都新太科技股份有限公司 Pedestrian re-identification method, model construction method, device, equipment and storage medium
CN110276316B (en) * 2019-06-26 2022-05-24 电子科技大学 Human body key point detection method based on deep learning
CN110619681B (en) * 2019-07-05 2022-04-05 杭州同绘科技有限公司 Human body geometric reconstruction method based on Euler field deformation constraint
CN110428493B (en) * 2019-07-12 2021-11-02 清华大学 Single-image human body three-dimensional reconstruction method and system based on grid deformation
CN110363862B (en) * 2019-07-15 2023-03-10 叠境数字科技(上海)有限公司 Three-dimensional grid sequence compression method based on human body template alignment


Also Published As

Publication number Publication date
CN111311732A (en) 2020-06-19


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant