CN111311732B - 3D human body mesh acquisition method and device

Info

Publication number
CN111311732B (application CN202010085015.8A)
Authority
CN
China
Prior art keywords
graph
human body
image
neural network
preset scale
Prior art date
Legal status
Active
Application number
CN202010085015.8A
Other languages
Chinese (zh)
Other versions
CN111311732A
Inventor
牛新
赵杨
窦勇
姜晶菲
李荣春
苏华友
乔鹏
潘衡岳
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
2020-04-26
Filing date
2020-04-26
Publication date
2023-06-20
Application filed by National University of Defense Technology
Priority to CN202010085015.8A
Publication of CN111311732A
Application granted
Publication of CN111311732B
Legal status: Active

Classifications

    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06T3/4023 Scaling of whole images or parts thereof, based on decimating pixels or lines of pixels or on inserting pixels or lines of pixels
    • G06T3/4038 Image mosaicing, e.g. composing plane images from plane sub-images
    • G06T2200/32 Indexing scheme for image data processing or generation involving image mosaicing
    • G06T2207/10016 Video; Image sequence
    • G06T2207/10021 Stereoscopic video; Stereoscopic image sequence
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06T2207/30196 Human being; Person
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a 3D human body mesh acquisition method and device. The method comprises: acquiring the image features of each frame of a video and, for each frame, inputting the image features of the frame into a trained U-shaped graph neural network, so that the U-shaped graph neural network obtains the corresponding human body 3D mesh parameters from the image features, where the video is a video containing a person; and combining the human body 3D mesh parameters in the temporal order of the image frames and inputting them into a trained residual temporal graph network, which optimizes each set of human body 3D mesh parameters over time, so that the human body shape represented by the optimized human body 3D mesh matches the human body shape in the image.

Description

3D human body mesh acquisition method and device
Technical Field
The invention relates to the technical field of image processing, and in particular to a 3D human body mesh acquisition method and device.
Background
Recovering the 3D shape of a human body from an image is a fundamental task in computer vision. Compared with recovering only the skeletal joints, shape recovery requires details of the human body, and recovered 3D human bodies can be used in a variety of applications such as robotics, 3D animation and virtual reality.
Existing methods for recovering the 3D shape of a human body from an image fall into two categories. One is the parametric approach: model parameters corresponding to the image are regressed and fed into a predefined human model (such as the SCAPE model or the SMPL three-dimensional human model) to fit the 3D shape. These methods use the human model parameters as the regression target, but the parameters are discontinuous and hard to regress, and the limited number of parameters also limits the expressive capacity of the model. The other is the non-parametric approach, such as volumetric reconstruction or per-pixel depth regression, which expresses details better but loses semantic information and is not easily matched to existing model interfaces.
To combine the advantages of the two approaches, the vertices of the SMPL model are a good regression target. In the prior art, a graph neural network is used to regress the positions of the SMPL vertices, i.e., 6890 vertices are regressed to control the shape of the human body. However, the prior art recovers the 3D human shape from a single frame image, and the resulting 3D shape can differ considerably from the human shape in the image.
Disclosure of Invention
The invention aims to provide a 3D human body mesh acquisition method and device that address the above defects of the prior art. This aim is achieved through the following technical scheme.
The first aspect of the present invention proposes a 3D human body mesh acquisition method, the method comprising:
acquiring the image features of each frame of a video and, for each frame, inputting the image features of the frame into a trained U-shaped graph neural network, so that the U-shaped graph neural network obtains the corresponding human body 3D mesh parameters from the image features, where the video is a video containing a person; and combining the human body 3D mesh parameters in the temporal order of the image frames and inputting them into a trained residual temporal graph network, so that the residual temporal graph network optimizes each set of human body 3D mesh parameters over time and obtains the optimized human body 3D mesh parameters.
A second aspect of the present invention proposes a 3D human mesh acquisition device, the device comprising:
a feature acquisition module, used for acquiring the image features of each frame of the video;
an image module, used for inputting, for each frame, the image features of the frame into a trained U-shaped graph neural network, so that the U-shaped graph neural network obtains the corresponding human body 3D mesh parameters from the image features; the video is a video containing a person;
a video module, used for combining the human body 3D mesh parameters in the temporal order of the image frames and inputting them into a trained residual temporal graph network, so that the residual temporal graph network optimizes each set of human body 3D mesh parameters over time and obtains the optimized human body 3D mesh parameters.
In the embodiment of the application, after the image features of each frame of the video are acquired, the image features of each frame are processed by a U-shaped graph neural network, which regresses image-level human body 3D mesh parameters; each human body 3D mesh is then optimized over the frame sequence by a residual temporal graph network, so that the human body shape represented by the optimized human body 3D mesh is consistent with the human body shape in the image.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and do not constitute a limitation on the invention. In the drawings:
FIG. 1 is a flow chart of an embodiment of a method for acquiring a 3D body mesh according to an exemplary embodiment of the present invention;
FIG. 2 is a block diagram of the U-shaped graph neural network of the present invention;
FIG. 3 is an overview diagram of 3D human body mesh acquisition according to the present invention;
FIG. 4 shows the 3D human body meshes corresponding to the up-sampling and down-sampling of the U-shaped graph neural network of the present invention;
FIG. 5 is a schematic diagram of the residual temporal graph network of the present invention;
FIG. 6 is a block diagram of an embodiment of a 3D human body mesh acquisition device according to an exemplary embodiment of the present invention.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the invention; rather, they are merely examples of apparatus and methods consistent with aspects of the invention as detailed in the appended claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly second information may also be referred to as first information, without departing from the scope of the invention. The word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining", depending on the context.
The 3D human body shape obtained by regressing the 6890 SMPL vertices with the currently adopted graph neural network can differ considerably from the human body shape in the image.
To solve this technical problem, the inventors found that a video contains not only image-level information but also dynamic temporal information, which helps reduce the uncertainty of the human body shape.
The method is concretely realized as follows: after the image features of each frame of the video are acquired, the image features of each frame are input into a trained U-shaped graph neural network, which obtains the corresponding human body 3D mesh parameters from the image features; the human body 3D mesh parameters are then combined in the temporal order of the image frames and input into a trained residual temporal graph network, which optimizes each set of parameters over time, so that each optimized human body 3D mesh is consistent with the human body shape in the image.
The 3D human body mesh acquisition method proposed by the present invention is described in detail below with specific embodiments.
Fig. 1 is a flowchart of an embodiment of a 3D human body mesh acquisition method according to an exemplary embodiment of the present invention. The method may be applied to an electronic device (such as a PC or a terminal). As shown in fig. 1, the method includes the following steps:
step 101: and acquiring the image characteristics of each frame of image in the video.
The video is a video containing a person, i.e., a video whose main subject is a person.
In step 101, the video may be decomposed into single-frame images, which are respectively input into a preset human body detection system so that the system outputs images containing human body candidate boxes; the candidate boxes contained in each frame are scale-transformed so that the human body is at the center of the box; the single-frame images containing the candidate boxes are then respectively input into a trained feature extraction network, which outputs the image features of each image.
For example, the human body detection system may be implemented with a neural network model such as R-CNN; for each input frame, the system extracts the candidate box with the highest probability of containing a person.
In one example, the preliminary image features of each frame may be obtained through a ResNet feature extraction network.
In this embodiment, the human body candidate box contained in each frame may be scaled to a fixed size to ensure that the human body is centered in the box.
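As an illustration of step 101, the following is a minimal per-frame feature-extraction sketch in PyTorch. The use of ResNet-50, the 224x224 crop size and the box format are assumptions for illustration; the patent only specifies a preset human body detection system and a trained feature extraction network.

```python
# Hypothetical sketch of step 101: crop the detected human box from each frame
# and extract a per-frame feature vector with a ResNet backbone.
import torch
import torch.nn.functional as F
import torchvision

resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")
backbone = torch.nn.Sequential(*list(resnet.children())[:-1])  # drop classifier head
backbone.eval()

def frame_features(frames, boxes):
    """frames: list of (3, H, W) float tensors; boxes: one (x1, y1, x2, y2) per frame."""
    feats = []
    for img, (x1, y1, x2, y2) in zip(frames, boxes):
        crop = img[:, y1:y2, x1:x2].unsqueeze(0)       # human-centered candidate box
        crop = F.interpolate(crop, size=(224, 224))    # scale the box to a fixed size
        with torch.no_grad():
            feats.append(backbone(crop).flatten(1))    # (1, 2048) image feature
    return torch.cat(feats)                            # (T, 2048), one row per frame
```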
Step 102: for each frame, input the image features of the frame into a trained U-shaped graph neural network, so that the U-shaped graph neural network obtains the corresponding human body 3D mesh parameters from the image features.
In this embodiment, a graph neural network with a U-shaped structure is used to shrink and then restore the graph scale, i.e., the graph goes from large to small and back to large. This helps enlarge the receptive field of each node and deepens the network on the small-scale graph, which facilitates the extraction of high-level features.
The U-shaped graph neural network is formed by stacking several graph neural network modules. Its structure comprises an input module, two first graph neural network modules at a first preset scale, a down-sampling module, four graph neural network modules at a second preset scale, an up-sampling module, a splicing module, two second graph neural network modules at the first preset scale, a coordinate regressor and a camera coordinate regressor.
Based on this structure, the process by which the U-shaped graph neural network obtains the human body 3D mesh parameters from the image features is as follows: the input module concatenates the image features with the 3D coordinates of each vertex of the SMPL template mesh at the first preset scale, obtains the initial features of each vertex as the graph features at the first preset scale, and inputs them into the two serially connected first graph neural network modules at the first preset scale; the two serially connected first graph neural network modules process the input graph features in turn to obtain new graph features at the first preset scale, which are input into the down-sampling module and the splicing module; the down-sampling module converts the new graph features at the first preset scale into graph features at the second preset scale and inputs them into the four serially connected graph neural network modules at the second preset scale; the four serially connected graph neural network modules at the second preset scale process the input graph features in turn to obtain new graph features at the second preset scale, which are input into the up-sampling module; the up-sampling module restores the new graph features at the second preset scale to graph features at the first preset scale and inputs them into the splicing module; the splicing module concatenates the new graph features at the first preset scale with the restored graph features at the first preset scale, and inputs the concatenated graph features into the two serially connected second graph neural network modules at the first preset scale; the two serially connected second graph neural network modules process the input graph features in turn to obtain the final graph features at the first preset scale, which are input into the coordinate regressor and the camera coordinate regressor; the coordinate regressor regresses the 3D coordinates of each vertex from the graph features, and the camera coordinate regressor regresses the camera parameters corresponding to the graph features.
Thus, the human body 3D mesh parameters corresponding to each frame comprise the 3D coordinates of each vertex of the graph features at the first preset scale and the camera parameters.
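To make the data flow above concrete, here is a minimal sketch of the U-shaped forward pass, assuming PyTorch. GraphResBlock is the ResBlock-derived module sketched after the graph-convolution formula below; the layer width, the 2048-dimensional image feature and the weak-perspective camera parameters (s, tx, ty) are illustrative assumptions, not the patent's actual hyper-parameters.

```python
# Hypothetical sketch of the U-shaped graph neural network forward pass.
import torch
import torch.nn as nn

class UGraphNet(nn.Module):
    def __init__(self, template, A1, A2, D, U, feat_dim=2048, width=256):
        super().__init__()
        self.template = template        # (1723, 3) SMPL template mesh at the first scale
        self.D, self.U = D, U           # (431, 1723) down- / (1723, 431) up-sampling matrices
        self.inp = nn.Linear(feat_dim + 3, width)                             # input module
        self.enc = nn.ModuleList(GraphResBlock(width, A1) for _ in range(2))  # first modules
        self.mid = nn.ModuleList(GraphResBlock(width, A2) for _ in range(4))  # second-scale modules
        self.dec = nn.ModuleList(GraphResBlock(width, A1) for _ in range(2))  # second modules
        self.fuse = nn.Linear(2 * width, width)      # splicing module
        self.coord = nn.Linear(width, 3)             # coordinate regressor
        self.cam = nn.Linear(1723 * width, 3)        # camera regressor: (s, tx, ty) assumed

    def forward(self, img_feat):                     # img_feat: (2048,) per-frame feature
        x = img_feat.expand(1723, -1)                # attach the image feature to every vertex
        x = self.inp(torch.cat([x, self.template], dim=-1))
        for blk in self.enc:
            x = blk(x)                               # graph features, first preset scale
        y = self.D @ x                               # down-sample: 1723 -> 431 vertices
        for blk in self.mid:
            y = blk(y)
        y = self.U @ y                               # up-sample: 431 -> 1723 vertices
        x = self.fuse(torch.cat([x, y], dim=-1))     # splice skip features with restored ones
        for blk in self.dec:
            x = blk(x)                               # final graph features, first scale
        return self.coord(x), self.cam(x.flatten())  # per-vertex 3D coords, camera params
```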
In this embodiment, since the existing SMPL template mesh has 6890 vertices, which imposes a heavy computational burden, the invention uses a clustering algorithm to cluster the vertices of the template mesh and merge vertices of the same class, reducing the graph to the first preset scale, e.g., to roughly a quarter of the original vertices, i.e., 1723 vertices; the number of vertex features in the graph features produced by the input module is then consistent with the number of template-mesh vertices. The clustering algorithm is applied again to obtain a graph at the second preset scale, e.g., reduced again to roughly a quarter, i.e., 431 vertices. Meanwhile, the correspondence between the vertices of the graphs at different scales is saved, i.e., the up-sampling matrix used by the up-sampling module and the down-sampling matrix used by the down-sampling module are obtained.
It follows that the first preset scale is larger than the second preset scale. Taking the 6890, 1723 and 431 scales as an example, the 6890-scale graph is used for visual display, the 1723-scale graph is used for model input and output, and the 431-scale graph serves as the abstract graph for constructing the U-shaped structure.
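The sampling matrices can then be applied as plain matrix multiplications. Below is a minimal sketch, assuming PyTorch; the row-normalized cluster-membership construction (each coarse vertex is the mean of its cluster, up-sampling copies a cluster's feature back to its members) is an assumption consistent with merging vertices of the same class, since the patent does not spell out how the matrices are built.

```python
# Hypothetical construction and use of the down-/up-sampling matrices.
import torch

def downsample_matrix(cluster_of, n_coarse):
    """cluster_of[i] = cluster index of fine vertex i; returns (n_coarse, n_fine)."""
    n_fine = len(cluster_of)
    D = torch.zeros(n_coarse, n_fine)
    D[cluster_of, torch.arange(n_fine)] = 1.0
    return D / D.sum(dim=1, keepdim=True).clamp(min=1)  # coarse vertex = cluster mean

cluster_of = torch.randint(0, 431, (1723,))   # toy clustering, for illustration only
D = downsample_matrix(cluster_of, 431)        # (431, 1723), used by the down-sampling module
U = (D > 0).float().T                         # (1723, 431), used by the up-sampling module
X_fine = torch.randn(1723, 256)               # graph features at the first preset scale
X_coarse = D @ X_fine                         # 1723 -> 431 vertices
X_restored = U @ X_coarse                     # 431 -> 1723 vertices
```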
Each graph neural network module in the U-shaped graph neural network has the same structure; the modules differ only in the scale of the graph and in their parameters, and the computation of a module decreases as the graph scale decreases. Each module inherits the ResBlock structure: all 3x3 convolution layers in the ResBlock are replaced by graph convolution layers, all 1x1 convolution layers are replaced by per-vertex fully connected layers, and all BatchNorm layers are replaced by GroupNorm layers.
The mathematical formula of the graph convolution layer is:

$$ y = \hat{A} x W $$

where $x$ is the input feature vector, $W$ is the trained parameter matrix, and $\hat{A}$ is the adjacency matrix regularized by row normalization.
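A minimal sketch of this layer and of the ResBlock-derived module follows, assuming PyTorch. The bottleneck width (dim/2), the group count and the pre-activation ordering are illustrative assumptions; only the substitutions named above (graph convolution for 3x3 convolution, per-vertex fully connected layer for 1x1 convolution, GroupNorm for BatchNorm) come from the description.

```python
# Hypothetical graph convolution y = A_hat x W and ResBlock-derived module.
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    def __init__(self, A_hat, in_dim, out_dim):
        super().__init__()
        self.register_buffer("A_hat", A_hat)          # row-normalized adjacency matrix
        self.W = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x):                             # x: (N, in_dim)
        return self.A_hat @ self.W(x)                 # y = A_hat x W

class GraphResBlock(nn.Module):
    """ResBlock with 3x3 convs -> graph convs, 1x1 convs -> per-vertex FC,
    BatchNorm -> GroupNorm (bottleneck and ordering assumed)."""
    def __init__(self, dim, A_hat, groups=8):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim // 2)           # per-vertex fully connected
        self.gconv = GraphConv(A_hat, dim // 2, dim // 2)
        self.fc2 = nn.Linear(dim // 2, dim)
        self.norms = nn.ModuleList(
            nn.GroupNorm(groups, d) for d in (dim, dim // 2, dim // 2))

    @staticmethod
    def act(norm, x):                                 # GroupNorm over the feature dim + ReLU
        return torch.relu(norm(x.t().unsqueeze(0)).squeeze(0).t())

    def forward(self, x):                             # x: (N, dim)
        h = self.fc1(self.act(self.norms[0], x))
        h = self.gconv(self.act(self.norms[1], h))
        h = self.fc2(self.act(self.norms[2], h))
        return x + h                                  # residual connection
```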
It should be noted that, for the training process of the U-shaped graph neural network, a training sample set is obtained in which each training sample is marked with labels; the labels comprise human body 3D mesh point coordinates, 3D key point coordinates and 2D key point coordinates, or only 2D key point coordinates. Each training sample in the set is used to train the constructed U-shaped graph neural network until convergence.
in this embodiment, the loss functions of the U-shaped graph neural network include three types:
1) Loss function based on 3D mesh points:

$$ L_{mesh} = \frac{1}{N} \sum_{i=1}^{N} \left\| \hat{V}_i - V_i \right\| $$

where $V_i$ denotes the labeled human body 3D mesh point coordinates, $\hat{V}_i$ denotes the human body 3D mesh point coordinates calculated by the model, and $N$ denotes the number of vertices of the 3D mesh.
2) Loss function based on 3D key points:

$$ L_{3D} = \frac{1}{M} \sum_{t=1}^{M} \left\| \hat{J}_{3D,t} - J_{3D,t} \right\| $$

where $J_{3D,t}$ denotes the labeled human body 3D key point coordinates, $\hat{J}_{3D,t}$ denotes the human body 3D key point coordinates calculated by the model, and $M$ denotes the number of human body 3D key points.
3) Loss function based on 2D key points:

$$ L_{2D} = \frac{1}{M} \sum_{t=1}^{M} \left\| \hat{J}_{2D,t} - J_{2D,t} \right\| $$

where $J_{2D,t}$ denotes the labeled human body 2D key point coordinates, $\hat{J}_{2D,t}$ denotes the human body 2D key point coordinates calculated by the model, and $M$ denotes the number of human body 2D key points.
Further, the loss function of the U-shaped graph neural network is a linear combination of the three loss functions:

$$ L = \lambda_{mesh} L_{mesh} + \lambda_{3D} L_{3D} + \lambda_{2D} L_{2D} $$

where $\lambda_{mesh}$, $\lambda_{3D}$ and $\lambda_{2D}$ are preset coefficients.
Regarding the human body 3D key point coordinates and human body 2D key point coordinates calculated by the model: after the model obtains the human body 3D mesh parameters, the vertex 3D coordinates of the human body 3D mesh are multiplied by a regression matrix obtained in advance to obtain the human body 3D key point coordinates, and a camera transformation is then performed using the camera parameters and the obtained human body 3D key point coordinates to obtain the human body 2D key point coordinates.
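The following sketch ties the three losses and the key point regression together, assuming PyTorch. The Euclidean norm and the weak-perspective camera (scale s, translation tx, ty) follow common SMPL practice and are assumptions; the patent only states that a pre-obtained regression matrix and a camera transformation are used.

```python
# Hypothetical sketch of the key point regression and the combined training loss.
import torch

def keypoints_from_mesh(verts, J_reg, cam):
    """verts: (N, 3) mesh vertices; J_reg: (M, N) regression matrix; cam: (s, tx, ty)."""
    j3d = J_reg @ verts                           # (M, 3) human body 3D key points
    s, tx, ty = cam
    j2d = s * j3d[:, :2] + torch.stack([tx, ty])  # weak-perspective camera transformation
    return j3d, j2d

def total_loss(verts, cam, gt_verts, gt_j3d, gt_j2d, J_reg, lams=(1.0, 1.0, 1.0)):
    j3d, j2d = keypoints_from_mesh(verts, J_reg, cam)
    l_mesh = (verts - gt_verts).norm(dim=-1).mean()   # 3D mesh point loss
    l_3d = (j3d - gt_j3d).norm(dim=-1).mean()         # 3D key point loss
    l_2d = (j2d - gt_j2d).norm(dim=-1).mean()         # 2D key point loss
    return lams[0] * l_mesh + lams[1] * l_3d + lams[2] * l_2d
```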
Step 103: combine the human body 3D mesh parameters in the temporal order of the image frames and input them into a trained residual temporal graph network, so that the residual temporal graph network optimizes each set of human body 3D mesh parameters over time and obtains the optimized human body 3D mesh parameters.
In step 103, the residual temporal graph network obtains an optimization matrix corresponding to each set of human body 3D mesh parameters over the time sequence, and each set of human body 3D mesh parameters is added to its corresponding optimization matrix through a residual link to obtain the optimized human body 3D mesh parameters.
The residual temporal graph network is a small-scale residual temporal graph network whose structure is similar to that of the graph neural network modules in the U-shaped graph neural network, except that the graph convolution layer is replaced by a temporal graph neural network layer and the per-vertex fully connected layer is replaced by a per-frame per-vertex fully connected layer. The formula of the temporal graph neural network layer is:

$$ Y_{i,j} = C(\hat{A} X)_{i,j} $$

where $C(\cdot)$ denotes a two-dimensional convolution operation, $\hat{A}$ denotes the row-normalized adjacency matrix, $X_{i,j} \in \mathbb{R}^{N \times k}$ denotes the input feature vector, and $Y_{i,j}$ denotes the output feature vector.
For the training process of the residual temporal graph network, a training video containing a person is acquired, in which the human body in each frame is marked with human body 3D mesh point coordinates, 3D key point coordinates and 2D key point coordinates; the constructed residual temporal graph network is trained with the training video until convergence.
In this embodiment, the loss functions of the residual temporal graph network include two types:
1) Loss function based on 3D mesh points:

$$ L_{mesh} = \frac{1}{NT} \sum_{t=1}^{T} \sum_{i=1}^{N} \left\| \hat{V}_{i,t} - V_{i,t} \right\| $$

where $V_{i,t}$ denotes the labeled human body 3D mesh point coordinates, $\hat{V}_{i,t}$ denotes the human body 3D mesh point coordinates calculated by the model, $N$ denotes the number of vertices of the human body 3D mesh, and $T$ denotes the number of temporally ordered images in the video.
2) Loss function based on 3D key points:

$$ L_{3D} = \frac{1}{M} \sum_{t=1}^{M} \left\| \hat{J}_{3D,t} - J_{3D,t} \right\| $$

where $J_{3D,t}$ denotes the labeled human body 3D key point coordinates, $\hat{J}_{3D,t}$ denotes the human body 3D key point coordinates calculated by the model, and $M$ denotes the number of human body 3D key points.
Further, the loss function of the residual temporal graph network is a linear combination of the two loss functions:

$$ L = L_{mesh} + \lambda L_{3D} $$

where $\lambda$ denotes a preset coefficient.
Based on the description of steps 101 to 103, the U-shaped graph neural network successfully reduces the size of the graph by merging adjacent points, which enlarges the receptive field of each vertex and deepens the graph neural network so that it can extract high-level features; establishing skip links between graphs of the same size facilitates feature fusion between different layers and further improves performance.
As shown in the overall structure diagram of fig. 3: in the first step, the candidate boxes whose main subject is a person are extracted from each frame of the video containing a person (fig. 3 shows a video segment of 3 frames) to form a new person-centered video; in the second step, the image features of each frame of this video are extracted with a ResNet network; in the third step, the image features of each frame are input into the U-shaped graph neural network to obtain the human body 3D mesh parameters of the single frame; in the fourth step, the per-frame human body 3D mesh parameters are combined in the original frame order and input into the residual temporal graph network, which outputs the temporally optimized human body 3D mesh parameters of the video.
As shown in fig. 4, the SMPL template mesh has 6890 vertices, a number large enough to impose a heavy computational burden. The vertices of the SMPL template mesh are therefore clustered with a clustering algorithm and points of the same class are merged, reducing the graph to roughly a quarter of the original vertices, i.e., 1723 vertices; applying the same method again yields a further reduced graph of 431 vertices. Meanwhile, the correspondence between the graph vertices at different scales, i.e., the up-sampling matrix used by the up-sampling module and the down-sampling matrix used by the down-sampling module, is saved. The 6890-scale graph is used for visual display, the 1723-scale graph is used for model input and output, and the 431-scale graph serves as the abstract graph for constructing the U-shaped network.
As shown in the receptive field diagram of the residual temporal graph network in fig. 5, the graph is built simultaneously from the structure within each frame and the temporal relations between different frames, so that each node can obtain information from its neighbors in both time and structure, yielding a better result. The curves in fig. 5 represent residual links, the arrows represent the flow of data (from left to right), and the dashed lines represent the receptive field of a feature of a later layer within the features of the preceding layer.
Fig. 6 is a block diagram of an embodiment of a 3D human body mesh acquisition device according to an exemplary embodiment of the present invention. The device may be applied to an electronic device. As shown in fig. 6, the 3D human body mesh acquisition device includes:
a feature acquisition module 610, configured to acquire the image features of each frame of a video;
an image module 620, configured to input, for each frame, the image features of the frame into a trained U-shaped graph neural network, so that the U-shaped graph neural network obtains the corresponding human body 3D mesh parameters from the image features; the video is a video containing a person;
a video module 630, configured to combine the human body 3D mesh parameters in the temporal order of the image frames and input them into a trained residual temporal graph network, so that the residual temporal graph network optimizes each set of human body 3D mesh parameters over time and obtains the optimized human body 3D mesh parameters.
In an optional implementation, the feature acquisition module 610 is specifically configured to decompose the video into single-frame images and input them respectively into a preset human body detection system, so that the system outputs images containing human body candidate boxes; to scale-transform the candidate boxes contained in each frame so that the human body is at the center of the box; and to input the single-frame images containing the candidate boxes respectively into a trained feature extraction network, so that the network outputs the image features of each image.
In an optional implementation, the image module 620 is specifically configured such that, in the process of the U-shaped graph neural network obtaining the corresponding human body 3D mesh parameters from the image features: the input module of the U-shaped graph neural network concatenates the image features with the 3D coordinates of each vertex of the SMPL template mesh at a first preset scale, obtains the initial features of each vertex as the graph features at the first preset scale, and inputs them into the two serially connected first graph neural network modules at the first preset scale in the U-shaped graph neural network; the two serially connected first graph neural network modules process the input graph features in turn to obtain new graph features at the first preset scale, which are input into the down-sampling module and the splicing module of the U-shaped graph neural network; the down-sampling module converts the new graph features at the first preset scale into graph features at a second preset scale and inputs them into the four serially connected graph neural network modules at the second preset scale in the U-shaped graph neural network; the four serially connected graph neural network modules at the second preset scale process the input graph features in turn to obtain new graph features at the second preset scale, which are input into the up-sampling module of the U-shaped graph neural network; the up-sampling module restores the new graph features at the second preset scale to graph features at the first preset scale and inputs them into the splicing module; the splicing module concatenates the new graph features at the first preset scale with the restored graph features at the first preset scale and inputs the concatenated graph features into the two serially connected second graph neural network modules at the first preset scale in the U-shaped graph neural network; the two serially connected second graph neural network modules process the input graph features in turn to obtain the final graph features at the first preset scale, which are input into the coordinate regressor and the camera coordinate regressor of the U-shaped graph neural network; the coordinate regressor regresses the 3D coordinates of each vertex from the graph features, and the camera coordinate regressor regresses the camera parameters corresponding to the graph features; the 3D coordinates of each vertex and the camera parameters are taken as the human body 3D mesh parameters corresponding to the frame; the first preset scale is larger than the second preset scale.
In an optional implementation, the video module 630 is specifically configured such that, in the process of the residual temporal graph network optimizing each set of human body 3D mesh parameters over time and obtaining the optimized human body 3D mesh parameters: the residual temporal graph network obtains an optimization matrix corresponding to each set of human body 3D mesh parameters over the time sequence, and each set of human body 3D mesh parameters is added to its corresponding optimization matrix through a residual link to obtain the optimized human body 3D mesh parameters.
In an alternative implementation, the apparatus further comprises (not shown in fig. 6):
the first training module is used for acquiring a training sample set, wherein each training sample in the training sample set is marked with a label, and the label comprises a human body 3D grid point coordinate, a 3D key point coordinate, a 2D key point coordinate or only a 2D key point coordinate; training the constructed U-shaped graph neural network by utilizing each training sample in the training sample set until convergence; the loss value of the U-shaped graph neural network consists of a 3D grid point loss, a 3D key point loss and a 2D key point loss.
In an alternative implementation, the apparatus further comprises (not shown in fig. 6):
the second training module is used for acquiring training videos containing people, wherein the human body in each frame of image in the training videos is marked with human body 3D grid point coordinates, 3D key point coordinates and 2D key point coordinates; training the constructed residual time sequence diagram network by utilizing the training video until convergence; the loss value of the residual time chart network consists of a 3D grid point loss and a 3D key point loss.
The implementation of the functions and roles of each unit in the above device is described in detail in the implementation of the corresponding steps in the above method, and is not repeated here.
Since the device embodiments essentially correspond to the method embodiments, reference is made to the description of the method embodiments for the relevant points. The device embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purposes of the invention. Those of ordinary skill in the art can understand and implement the invention without undue burden.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.
The foregoing is only a description of preferred embodiments of the invention and is not intended to limit the invention; any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall be included in the scope of protection of the invention.

Claims (8)

1. A method for acquiring a 3D human mesh, the method comprising:
acquiring the image features of each frame of a video, and inputting, for each frame, the image features of the frame into a trained U-shaped graph neural network, so that the U-shaped graph neural network obtains the corresponding human body 3D mesh parameters from the image features; the video being obtained by shooting a human body;
combining the human body 3D mesh parameters in the temporal order of the image frames and inputting them into a trained residual temporal graph network, so that the residual temporal graph network optimizes each set of human body 3D mesh parameters over time and obtains the optimized human body 3D mesh parameters;
wherein acquiring the image features of each frame of the video comprises:
decomposing the video into single-frame images and inputting them respectively into a preset human body detection system, so that the human body detection system outputs images containing human body candidate boxes; scale-transforming the candidate boxes contained in each frame so that the human body is at the center of the box; and inputting the single-frame images containing the candidate boxes respectively into a trained feature extraction network, so that the feature extraction network outputs the image features of each image.
2. The method of claim 1, wherein the U-shaped graph neural network obtaining the corresponding human body 3D mesh parameters from the image features comprises:
the input module of the U-shaped graph neural network concatenates the image features with the 3D coordinates of each vertex of the template mesh of the three-dimensional human model SMPL at a first preset scale, obtains the initial features of each vertex as the graph features at the first preset scale, and inputs them into the two serially connected first graph neural network modules at the first preset scale in the U-shaped graph neural network;
the two serially connected first graph neural network modules process the input graph features in turn to obtain new graph features at the first preset scale, which are input into the down-sampling module and the splicing module of the U-shaped graph neural network;
the down-sampling module converts the new graph features at the first preset scale into graph features at a second preset scale and inputs them into the four serially connected graph neural network modules at the second preset scale in the U-shaped graph neural network;
the four serially connected graph neural network modules at the second preset scale process the input graph features in turn to obtain new graph features at the second preset scale, which are input into the up-sampling module of the U-shaped graph neural network;
the up-sampling module restores the new graph features at the second preset scale to graph features at the first preset scale and inputs them into the splicing module;
the splicing module concatenates the new graph features at the first preset scale with the restored graph features at the first preset scale, and inputs the concatenated graph features at the first preset scale into the two serially connected second graph neural network modules at the first preset scale in the U-shaped graph neural network;
the two serially connected second graph neural network modules process the input graph features in turn to obtain the final graph features at the first preset scale, which are input into the coordinate regressor and the camera coordinate regressor of the U-shaped graph neural network;
the coordinate regressor regresses the 3D coordinates of each vertex from the graph features, and the camera coordinate regressor regresses the camera parameters corresponding to the graph features;
the 3D coordinates of each vertex and the camera parameters are taken as the human body 3D mesh parameters corresponding to the frame;
wherein the first preset scale is larger than the second preset scale.
3. The method of claim 1, wherein the residual temporal graph network optimizing each set of human body 3D mesh parameters over time to obtain the optimized human body 3D mesh parameters comprises:
the residual temporal graph network obtains an optimization matrix corresponding to each set of human body 3D mesh parameters over the time sequence, and each set of human body 3D mesh parameters is added to its corresponding optimization matrix through a residual link to obtain the optimized human body 3D mesh parameters.
4. The method of claim 1, wherein the training process of the U-shaped graph neural network comprises:
acquiring a training sample set, wherein each training sample in the set is marked with labels comprising human body 3D mesh point coordinates, 3D key point coordinates and 2D key point coordinates, or only 2D key point coordinates;
training the constructed U-shaped graph neural network with each training sample in the set until convergence;
wherein the loss value of the U-shaped graph neural network consists of a 3D mesh point loss, a 3D key point loss and a 2D key point loss.
5. The method of claim 1, wherein the training process of the residual temporal graph network comprises:
acquiring a training video containing a person, wherein the human body in each frame of the training video is marked with human body 3D mesh point coordinates, 3D key point coordinates and 2D key point coordinates;
training the constructed residual temporal graph network with the training video until convergence;
wherein the loss value of the residual temporal graph network consists of a 3D mesh point loss and a 3D key point loss.
6. A 3D human body mesh acquisition device, the device comprising:
a feature acquisition module, used for acquiring the image features of each frame of a video;
an image module, used for inputting, for each frame, the image features of the frame into a trained U-shaped graph neural network, so that the U-shaped graph neural network obtains the corresponding human body 3D mesh parameters from the image features; the video being obtained by shooting a human body;
a video module, used for combining the human body 3D mesh parameters in the temporal order of the image frames and inputting them into a trained residual temporal graph network, so that the residual temporal graph network optimizes each set of human body 3D mesh parameters over time and obtains the optimized human body 3D mesh parameters;
wherein the feature acquisition module is specifically used for decomposing the video into single-frame images and inputting them respectively into a preset human body detection system, so that the human body detection system outputs images containing human body candidate boxes; scale-transforming the candidate boxes contained in each frame so that the human body is at the center of the box; and inputting the single-frame images containing the candidate boxes respectively into a trained feature extraction network, so that the feature extraction network outputs the image features of each image.
7. The apparatus of claim 6, wherein the image module is specifically configured such that, in the process of the U-shaped graph neural network obtaining the corresponding human body 3D mesh parameters from the image features: the input module of the U-shaped graph neural network concatenates the image features with the 3D coordinates of each vertex of the template mesh of the three-dimensional human model SMPL at a first preset scale, obtains the initial features of each vertex as the graph features at the first preset scale, and inputs them into the two serially connected first graph neural network modules at the first preset scale in the U-shaped graph neural network; the two serially connected first graph neural network modules process the input graph features in turn to obtain new graph features at the first preset scale, which are input into the down-sampling module and the splicing module of the U-shaped graph neural network; the down-sampling module converts the new graph features at the first preset scale into graph features at a second preset scale and inputs them into the four serially connected graph neural network modules at the second preset scale in the U-shaped graph neural network; the four serially connected graph neural network modules at the second preset scale process the input graph features in turn to obtain new graph features at the second preset scale, which are input into the up-sampling module of the U-shaped graph neural network; the up-sampling module restores the new graph features at the second preset scale to graph features at the first preset scale and inputs them into the splicing module; the splicing module concatenates the new graph features at the first preset scale with the restored graph features at the first preset scale and inputs the concatenated graph features at the first preset scale into the two serially connected second graph neural network modules at the first preset scale in the U-shaped graph neural network; the two serially connected second graph neural network modules process the input graph features in turn to obtain the final graph features at the first preset scale, which are input into the coordinate regressor and the camera coordinate regressor of the U-shaped graph neural network; the coordinate regressor regresses the 3D coordinates of each vertex from the graph features, and the camera coordinate regressor regresses the camera parameters corresponding to the graph features; the 3D coordinates of each vertex and the camera parameters are taken as the human body 3D mesh parameters corresponding to the frame; wherein the first preset scale is larger than the second preset scale.
8. The apparatus of claim 6, wherein the video module is specifically configured such that, in the process of the residual temporal graph network optimizing each set of human body 3D mesh parameters over time and obtaining the optimized human body 3D mesh parameters: the residual temporal graph network obtains an optimization matrix corresponding to each set of human body 3D mesh parameters over the time sequence, and each set of human body 3D mesh parameters is added to its corresponding optimization matrix through a residual link to obtain the optimized human body 3D mesh parameters.
CN202010085015.8A (filed 2020-04-26, priority 2020-04-26) 3D human body mesh acquisition method and device; granted as CN111311732B, status Active

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010085015.8A 2020-04-26 2020-04-26 3D human body mesh acquisition method and device (granted as CN111311732B)

Publications (2)

Publication Number Publication Date
CN111311732A 2020-06-19
CN111311732B 2023-06-20

Family

ID=71161682

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010085015.8A (Active, granted as CN111311732B) 3D human body mesh acquisition method and device 2020-04-26 2020-04-26

Country Status (1)

Country Link
CN (1) CN111311732B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112767534B (en) * 2020-12-31 2024-02-09 北京达佳互联信息技术有限公司 Video image processing method, device, electronic equipment and storage medium
CN113011516A (en) * 2021-03-30 2021-06-22 华南理工大学 Three-dimensional mesh model classification method and device based on graph topology and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101271589A (en) * 2007-03-22 2008-09-24 中国科学院计算技术研究所 Three-dimensional mannequin joint center extraction method
CN101833788A (en) * 2010-05-18 2010-09-15 南京大学 Three-dimensional human modeling method by using cartographical sketching
CN102982578A (en) * 2012-10-31 2013-03-20 北京航空航天大学 Estimation method for dressed body 3D model in single character image
CN105006014A (en) * 2015-02-12 2015-10-28 上海交通大学 Method and system for realizing fast fitting simulation of virtual clothing

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106611158A (en) * 2016-11-14 2017-05-03 深圳奥比中光科技有限公司 Method and equipment for obtaining human body 3D characteristic information
CN107392097B (en) * 2017-06-15 2020-07-07 中山大学 Three-dimensional human body joint point positioning method of monocular color video
CN108053480B (en) * 2017-12-08 2021-03-19 东华大学 Three-dimensional full-scale dressing human body mesh construction method based on reverse engineering technology
CN108629801B (en) * 2018-05-14 2020-11-24 华南理工大学 Three-dimensional human body model posture and shape reconstruction method of video sequence
CN108985259B (en) * 2018-08-03 2022-03-18 百度在线网络技术(北京)有限公司 Human body action recognition method and device
CN109199603B (en) * 2018-08-31 2020-11-03 浙江大学宁波理工学院 Intelligent positioning method for optimal screw placement point of pedicle screw
CN109859306A (en) * 2018-12-24 2019-06-07 青岛红创众投科技发展有限公司 A method of extracting manikin in the slave photo based on machine learning
CN109919122A (en) * 2019-03-18 2019-06-21 中国石油大学(华东) A kind of timing behavioral value method based on 3D human body key point
CN110059605A (en) * 2019-04-10 2019-07-26 厦门美图之家科技有限公司 A kind of neural network training method calculates equipment and storage medium
CN110074788B (en) * 2019-04-18 2020-03-17 梦多科技有限公司 Body data acquisition method and device based on machine learning
CN110399789B (en) * 2019-06-14 2021-04-20 佳都新太科技股份有限公司 Pedestrian re-identification method, model construction method, device, equipment and storage medium
CN110276316B (en) * 2019-06-26 2022-05-24 电子科技大学 Human body key point detection method based on deep learning
CN110619681B (en) * 2019-07-05 2022-04-05 杭州同绘科技有限公司 Human body geometric reconstruction method based on Euler field deformation constraint
CN110428493B (en) * 2019-07-12 2021-11-02 清华大学 Single-image human body three-dimensional reconstruction method and system based on grid deformation
CN110363862B (en) * 2019-07-15 2023-03-10 叠境数字科技(上海)有限公司 Three-dimensional grid sequence compression method based on human body template alignment


Also Published As

Publication number Publication date
CN111311732A (en) 2020-06-19


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant