CN114758076A - Training method and device for deep learning model for building three-dimensional model

Info

Publication number
CN114758076A
Authority
CN
China
Prior art keywords
image data
dimensional
loss
sample
coordinate system
Prior art date
Legal status: Pending
Application number
CN202210430966.3A
Other languages
Chinese (zh)
Inventor
Zhang Jian (张健)
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210430966.3A
Publication of CN114758076A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Geometry (AREA)
  • Computer Graphics (AREA)
  • Image Analysis (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The disclosure provides a training method and a training device for a deep learning model used to build a three-dimensional model, and relates to the technical field of artificial intelligence, in particular to the technical field of computer vision. The specific implementation scheme is as follows: acquiring first sample image data and second sample image data; inputting the first sample image data and the second sample image data into a deep learning model to obtain predicted texture and illumination parameters of a sample object, predicted object modeling parameters and predicted camera coordinate system transformation parameters, wherein the predicted object modeling parameters represent the shape, position and scale of the sample object; calculating an imaging loss according to the predicted texture and illumination parameters; calculating a model consistency loss according to the predicted object modeling parameters; calculating a motion smoothness loss according to the predicted camera coordinate system transformation parameters; and adjusting parameters of the deep learning model according to the imaging loss, the model consistency loss and the motion smoothness loss. The present disclosure enables training of deep learning models for building three-dimensional models.

Description

Training method and device for deep learning model for building three-dimensional model
Technical Field
The present disclosure relates to the field of artificial intelligence technology, and more particularly, to the field of computer vision technology.
Background
With the rapid development of artificial intelligence, intelligent devices, intelligent applications and the like have emerged in large numbers, among which highly personalized and intelligent three-dimensional interactive devices and applications have become a popular research topic. At present, three-dimensional interaction commonly adopts interaction schemes based on gestures and human body postures to realize interaction between devices or applications and users. For such three-dimensional interaction, the real-time reconstruction and stable tracking of three-dimensional models of objects such as hands and human bodies are therefore the key to enhancing the playability and interactivity of these interaction schemes.
Disclosure of Invention
The disclosure provides a training method and a training device for establishing a deep learning model of a three-dimensional model.
According to an aspect of the present disclosure, there is provided a training method for building a deep learning model of a three-dimensional model, including:
acquiring first sample image data and second sample image data, wherein the first sample image data and the second sample image data both comprise sample objects;
inputting the first sample image data and the second sample image data into a deep learning model to obtain a predicted texture and illumination parameter, a predicted object modeling parameter and a predicted camera coordinate system transformation parameter of the sample object, wherein the predicted object modeling parameter represents the shape, position and scale of the sample object, the predicted camera coordinate system transformation parameter represents a coordinate system transformation parameter between a first camera coordinate system and a second camera coordinate system, the first camera coordinate system is a camera coordinate system when a camera acquires the first sample image data, and the second camera coordinate system is a camera coordinate system when the camera acquires the second sample image data;
calculating to obtain imaging loss according to the predicted texture and the illumination parameters;
calculating to obtain model consistency loss according to the prediction object modeling parameters;
calculating to obtain the loss of motion smoothness according to the predicted camera coordinate system transformation parameters;
and adjusting parameters of the deep learning model according to the imaging loss, the model consistency loss and the motion smoothness loss.
According to another aspect of the present disclosure, there is provided a three-dimensional model building method including:
acquiring first image data, wherein the first image data comprises a target object of a three-dimensional model to be constructed;
inputting the first image data into a pre-trained deep learning model, and obtaining first image features based on a feature extraction network in the deep learning model, wherein the deep learning model is obtained by training through a deep learning model training method for building a three-dimensional model;
inputting the first image characteristic into a texture and illumination prediction network of the deep learning model, and determining a predicted texture and illumination parameters of the target object;
inputting the first image feature into a modeling parameter prediction network of the deep learning model, and determining a prediction object modeling parameter of the target object, wherein the object modeling parameter represents the shape, the position and the scale of the target object;
and establishing a three-dimensional model of the target object based on the texture and illumination parameters and the object modeling parameters.
According to still another aspect of the present disclosure, there is provided a training apparatus for building a deep learning model of a three-dimensional model, including:
the data acquisition module is used for acquiring first sample image data and second sample image data, wherein the first sample image data and the second sample image data both comprise sample objects;
a parameter obtaining module, configured to input the first sample image data and the second sample image data into a deep learning model, so as to obtain a predicted texture and illumination parameter, a predicted object modeling parameter, and a predicted camera coordinate system transformation parameter of the sample object, where the predicted object modeling parameter represents a shape, a position, and a scale of the sample object, the predicted camera coordinate system transformation parameter represents a coordinate system transformation parameter between a first camera coordinate system and a second camera coordinate system, the first camera coordinate system is a camera coordinate system when a camera acquires the first sample image data, and the second camera coordinate system is a camera coordinate system when the camera acquires the second sample image data;
the first loss calculation module is used for calculating to obtain imaging loss according to the predicted texture and the illumination parameters;
the second loss calculation module is used for calculating to obtain model consistency loss according to the prediction object modeling parameters;
the third loss calculation module is used for calculating the motion smoothness loss according to the predicted camera coordinate system transformation parameters;
and the parameter adjusting module is used for adjusting the parameters of the deep learning model according to the imaging loss, the model consistency loss and the motion smoothness loss.
According to still another aspect of the present disclosure, there is provided a three-dimensional model building apparatus including:
the image data acquisition module is used for acquiring first image data, wherein the first image data comprises a target object of a three-dimensional model to be constructed;
the parameter determining module is used for inputting the first image data into a pre-trained deep learning model and obtaining first image features based on a feature extraction network in the deep learning model, wherein the deep learning model is obtained by training through a training device for establishing the deep learning model of the three-dimensional model; inputting the first image characteristic into a texture and illumination prediction network of the deep learning model, and determining a predicted texture and illumination parameters of the target object; inputting the first image feature into a modeling parameter prediction network of the deep learning model, and determining a prediction object modeling parameter of the target object, wherein the object modeling parameter represents the shape, the position and the scale of the target object;
and the model establishing module is used for establishing a three-dimensional model of the target object based on the texture and illumination parameters and the object modeling parameters.
The method includes the steps of obtaining first sample image data and second sample image data, wherein the first sample image data and the second sample image data both include sample objects; inputting the first sample image data and the second sample image data into a deep learning model to obtain a predicted texture and illumination parameter, a predicted object modeling parameter and a predicted camera coordinate system transformation parameter of the sample object, wherein the predicted object modeling parameter represents the shape, position and scale of the sample object, the predicted camera coordinate system transformation parameter represents a coordinate system transformation parameter between a first camera coordinate system and a second camera coordinate system, the first camera coordinate system is a camera coordinate system when a camera acquires the first sample image data, and the second camera coordinate system is a camera coordinate system when the camera acquires the second sample image data; calculating to obtain imaging loss according to the predicted texture and the illumination parameters; calculating to obtain model consistency loss according to the prediction object modeling parameters; calculating to obtain the loss of the motion smoothness according to the predicted camera coordinate system transformation parameters; and adjusting parameters of the deep learning model according to the imaging loss, the model consistency loss and the motion smoothness loss, thereby realizing the training of the deep learning model for establishing the three-dimensional model.
It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present disclosure, nor are they intended to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic flow chart diagram of a training method for building a deep learning model of a three-dimensional model according to the present disclosure;
FIG. 2 is one possible implementation of step S12 provided by the present disclosure;
FIG. 3 is one possible implementation of step S14 provided by the present disclosure;
FIG. 4 is one possible implementation of step S15 provided by the present disclosure;
FIG. 5 is one possible implementation of step S13 provided by the present disclosure;
FIG. 6 is a schematic flow chart diagram of a three-dimensional model building method provided by the present disclosure;
FIG. 7 is a schematic structural diagram of a training apparatus for building a deep learning model of a three-dimensional model according to the present disclosure;
FIG. 8 is a schematic structural diagram of a three-dimensional modeling apparatus provided by the present disclosure;
FIG. 9 is a block diagram of an electronic device for implementing a training method for building a deep learning model of a three-dimensional model according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the prior art, a three-dimensional model is often built based on deep learning technology. One approach is to collect RGB-D (RGB-Depth Map, i.e., depth image) data of the modeled object (such as a human hand or a human body) through dedicated hardware to build the three-dimensional model (three-dimensional modeling); another approach is to build the three-dimensional model based on RGB (three color channel image) data and three-dimensional supervision data of the modeled object. However, both schemes have certain problems. The first relies on depth data obtained by dedicated hardware for three-dimensional modeling; such hardware is often expensive, which hinders the general adoption of the modeling scheme and brings great inconvenience to the practical application of the technology. The second, three-dimensional modeling based on RGB image data, fails to account for the characteristics of an object in motion and the difficulty of acquiring three-dimensional supervision data, which can cause motion jitter in the three-dimensional modeling results in practical applications; moreover, three-dimensional supervision data for different scenes cannot be acquired quickly and simply, so the method cannot be applied at scale to general scenes.
To solve at least one of the above problems, the present disclosure provides a training method for building a deep learning model of a three-dimensional model, comprising:
acquiring first sample image data and second sample image data, wherein the first sample image data and the second sample image data both comprise sample objects;
inputting the first sample image data and the second sample image data into a deep learning model to obtain a predicted texture and illumination parameter, a predicted object modeling parameter and a predicted camera coordinate system transformation parameter of the sample object, wherein the predicted object modeling parameter represents the shape, position and scale of the sample object, the predicted camera coordinate system transformation parameter represents a coordinate system transformation parameter between a first camera coordinate system and a second camera coordinate system, the first camera coordinate system is a camera coordinate system when a camera acquires the first sample image data, and the second camera coordinate system is a camera coordinate system when the camera acquires the second sample image data;
calculating to obtain imaging loss according to the predicted texture and the illumination parameters;
calculating to obtain model consistency loss according to the prediction object modeling parameters;
calculating to obtain the loss of the motion smoothness according to the predicted camera coordinate system transformation parameters;
and adjusting parameters of the deep learning model according to the imaging loss, the model consistency loss and the motion smoothness loss.
As can be seen from the above, the training method for building a deep learning model of a three-dimensional model according to the present disclosure obtains a predicted texture and an illumination parameter of a sample object, a predicted object modeling parameter capable of representing a shape, a position, and a scale of the sample object, and a predicted camera coordinate system transformation parameter representing a coordinate system transformation parameter between a first camera coordinate system and a second camera coordinate system, respectively, by using first sample image data and second sample image data each including the sample object, thereby stably and rapidly predicting parameters required for three-dimensional modeling according to information in the sample image data, and further completing three-dimensional modeling. And then, calculating imaging loss, model consistency loss and motion smoothness loss according to the predicted parameters, and adjusting the parameters of the deep learning model so that the deep learning model gradually meets the requirements and the three-dimensional model is more accurately established.
The following describes in detail a training method for building a deep learning model of a three-dimensional model according to a specific embodiment.
The method of the embodiments of the present disclosure is applied to and may be implemented by an intelligent terminal; in practical use, the intelligent terminal may be a computer, a mobile phone, or the like.
Referring to fig. 1, fig. 1 is a schematic flowchart of a training method for building a deep learning model of a three-dimensional model according to the present disclosure, including:
step S11: first sample image data and second sample image data are acquired.
Wherein the first sample image data and the second sample image data each include a sample object therein.
The sample object is a target object needing to establish a three-dimensional model, for example, when the three-dimensional model is established for a human hand, the sample object is the human hand; when the human body is subjected to three-dimensional model building, the sample object is the human body.
The first sample image data and the second sample image data are both known image data used for training the deep learning model for building a three-dimensional model, and each of them includes the sample object. Information about the sample object, such as its scale and posture, is also known and serves as comparative reference data for the training results of the deep learning model for building the three-dimensional model.
In one example, the first sample image data and the second sample image data are two consecutive frames of image data extracted from continuous sample video data that includes the sample object, so that the time instants to which they correspond are consecutive.
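As an illustration of how such consecutive-frame training pairs might be prepared, the following sketch reads neighboring frames from sample video data with OpenCV; the function name and the video path are hypothetical and not part of the disclosure.

```python
import cv2

def sample_consecutive_frame_pairs(video_path):
    """Yield (first_sample_image, second_sample_image) pairs of consecutive frames.

    Assumes the video already contains the sample object (e.g. a human hand).
    """
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    while ok:
        ok, curr = cap.read()
        if not ok:
            break
        # prev and curr are two consecutive frames: first / second sample image data
        yield prev, curr
        prev = curr
    cap.release()

# Usage (hypothetical path):
# for img_t, img_t1 in sample_consecutive_frame_pairs("hand_video.mp4"):
#     ...feed the pair to the deep learning model...
```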
Step S12: and inputting the first sample image data and the second sample image data into a deep learning model to obtain the predicted texture and illumination parameters of the sample object, the predicted object modeling parameters and the predicted camera coordinate system transformation parameters.
The predicted object modeling parameters represent the shape, position and scale of the sample object, the predicted camera coordinate system transformation parameters represent coordinate system transformation parameters between a first camera coordinate system and a second camera coordinate system, the first camera coordinate system is a camera coordinate system when the camera collects the first sample image data, and the second camera coordinate system is a camera coordinate system when the camera collects the second sample image data.
It will be appreciated that different locations in the sample object may have different characteristics, for example different texture characteristics and different lighting information throughout the sample object. A plurality of key points are marked according to different positions of the sample object, and the complex features of the sample object can be represented by combining the features of the key points.
In an example, the number of the key points of the sample object and the selection of the key points can be determined according to requirements, for example, when the sample object is a human hand, a plurality of joints of the human hand can be marked as a plurality of key points, and the features of the key points corresponding to the plurality of joints are integrated to represent the complex features of the human hand; a plurality of fixed points (for example, 778) with different characteristics can be selected as key points according to the actual requirement for building the three-dimensional model of the human hand, and the characteristics of the key points are integrated to represent the complex characteristics of the human hand.
The predicted texture and illumination parameters described above represent the texture features and illumination information of the predicted sample object. In one example, the texture feature of the predicted sample object is the RGB information of each key point of the sample object, which may be represented as $C = \{c_i \in \mathbb{R}^3 \mid 1 \le i \le n\}$, where $C$ is the predicted texture feature, $c_i$ is the RGB information of the $i$-th key point, and $n$ is the total number of key points. The illumination information includes the ambient illumination and the directional illumination received by the sample object; specifically, it may include the received ambient light intensity, ambient light color, directional light intensity, directional light color, light direction, and so on, and may be represented as

$$L = (l_a,\, c_a,\, l_d,\, c_d,\, n_d)$$

where $l_a$ denotes the ambient light intensity, $c_a$ the ambient light color, $l_d$ the directional light intensity, $c_d$ the directional light color, and $n_d$ the light direction.
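A minimal sketch of how these predicted texture and illumination parameters could be held in code is shown below; the container name, field names and the example values are illustrative assumptions rather than structures defined by the disclosure.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TextureAndLighting:
    # Per-keypoint RGB texture: C = {c_i in R^3, 1 <= i <= n}, shape (n, 3)
    texture_rgb: np.ndarray
    # Illumination parameters L = (l_a, c_a, l_d, c_d, n_d)
    ambient_intensity: float          # l_a
    ambient_color: np.ndarray         # c_a, shape (3,)
    directional_intensity: float      # l_d
    directional_color: np.ndarray     # c_d, shape (3,)
    light_direction: np.ndarray       # n_d, unit vector, shape (3,)

# Example with n = 778 keypoints (the human-hand example mentioned below);
# all values here are placeholders.
params = TextureAndLighting(
    texture_rgb=np.zeros((778, 3)),
    ambient_intensity=1.0,
    ambient_color=np.ones(3),
    directional_intensity=0.5,
    directional_color=np.ones(3),
    light_direction=np.array([0.0, 0.0, 1.0]),
)
```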
The predicted object modeling parameter represents the shape, position, and scale of the sample object, and is used for representing the posture, shape, and the like of the sample object, and the predicted camera coordinate system transformation parameter represents the coordinate system transformation parameter between the first camera coordinate system and the second camera coordinate system, and is used for estimating the scale, rotation, translation, and the like of the sample object in the camera coordinate system. In one example, the first sample image data and the second sample image data are two consecutive frames of image data captured in a consecutive sample video data including the sample object, the first camera coordinate system is a camera coordinate system at a first time when the camera captures the first sample image data, and the second camera coordinate system is a camera coordinate system at a second time when the camera captures the second sample image data. Specifically, the camera that acquires the sample image data may be a monocular camera.
Step S13: and calculating to obtain the imaging loss according to the predicted texture and the illumination parameters.
The imaging loss is calculated according to the texture features and illumination information of the sample object represented by the predicted texture and illumination parameters; that is, the imaging difference between the predicted sample object and the real sample object in the sample image data is obtained from the features of the sample object represented by the predicted texture features and illumination information.
Step S14: and calculating to obtain the model consistency loss according to the model parameters of the prediction object.
The model consistency loss is calculated according to the shape, position and scale of the predicted sample object; that is, the difference in shape, position and scale between the predicted sample object and the real sample object in the sample image data is obtained from the predicted shape, position and scale. Specifically, since the predicted sample object is a three-dimensional model of the sample object while the sample image data is two-dimensional image data, the obtained model consistency loss is a two-dimensional/three-dimensional consistency loss, representing the difference in shape, position and scale between the three-dimensional model predicted from the two-dimensional image data and the original two-dimensional image data.
Step S15: and calculating to obtain the loss of the motion smoothness according to the predicted camera coordinate system transformation parameters.
It is to be understood that both the camera acquiring the sample image data and the sample object may be in motion. In particular, when the first sample image data and the second sample image data are two consecutive frames captured from continuous sample video data that includes the sample object, the sample object should exhibit motion consistency between the first sample image data and the second sample image data. The motion smoothness loss is therefore calculated from the coordinate system transformation parameters between the first camera coordinate system and the second camera coordinate system represented by the predicted camera coordinate system transformation parameters, so as to improve the smoothness of the predicted sample object in motion.
Step S16: and adjusting parameters of the deep learning model according to the imaging loss, the model consistency loss and the motion smoothness loss.
The parameters of the deep learning model are adjusted according to the calculated imaging loss, model consistency loss and motion smoothness loss, so that when the deep learning model establishes the three-dimensional model of the sample object, the imaging difference between the predicted sample object and the real sample object in the sample image data and the difference in shape, position and scale between the three-dimensional model and the original two-dimensional image data are reduced, and the smoothness of the predicted sample object under motion is improved.
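As a hedged illustration of how step S16 might combine the three losses in a single parameter update, the following PyTorch-style sketch is given; the model wrapper, the loss weights, the helper names and the optimizer setup are assumptions made for the sketch and are not specified by the disclosure. Simplified sketches of the individual losses appear later in this description.

```python
import torch

def train_step(model, optimizer, img1, img2, gt_keypoints_2d,
               w_pixel=1.0, w_con=1.0, w_smooth=1.0):
    """One parameter update using imaging, consistency and smoothness losses.

    `model` is assumed to return the three groups of predicted parameters for
    the two sample images; the loss helpers below are placeholders (conversion
    from modeling parameters to keypoints/renderings is omitted in this sketch).
    """
    tex_light, modeling, cam_transform = model(img1, img2)

    e_pixel = imaging_loss(img1, img2, tex_light, modeling)           # step S13
    e_con = model_consistency_loss(modeling, gt_keypoints_2d)         # step S14
    e_smooth = motion_smoothness_loss(modeling, cam_transform)        # step S15

    # Step S16: adjust the deep learning model parameters with the weighted sum.
    total = w_pixel * e_pixel + w_con * e_con + w_smooth * e_smooth
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item()
```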
As can be seen from the above, the training method for building a deep learning model of a three-dimensional model according to the present disclosure obtains a predicted texture and an illumination parameter of a sample object, a predicted object modeling parameter capable of representing a shape, a position, and a scale of the sample object, and a predicted camera coordinate system transformation parameter representing a coordinate system transformation parameter between a first camera coordinate system and a second camera coordinate system, respectively, by using first sample image data and second sample image data each including the sample object, so that parameters required for three-dimensional modeling can be stably and rapidly predicted according to information in the sample image data, thereby completing three-dimensional modeling. And then, calculating imaging loss, model consistency loss and motion smoothness loss according to the predicted parameters, and adjusting the parameters of the deep learning model so that the deep learning model gradually meets the requirements and the three-dimensional model is more accurately established.
In one embodiment of the present disclosure, the deep learning model includes a feature extraction network, a texture and illumination prediction network, a modeling parameter prediction network, a coordinate system transformation parameter prediction network;
in a possible implementation manner, as shown in fig. 2, the deep learning model includes a feature extraction network, a texture and illumination prediction network, a modeling parameter prediction network, and a coordinate system transformation parameter prediction network;
the step S12 of inputting the first sample image data and the second sample image data into a depth learning model to obtain the predicted texture and illumination parameters of the sample object, the predicted object modeling parameters, and the predicted camera coordinate system transformation parameters includes:
step S21: inputting the first sample image data and the second sample image data into the feature extraction network to obtain a first image feature and a second image feature;
step S22: inputting the first image characteristic and the second image characteristic into the texture and illumination prediction network to obtain the predicted texture and illumination parameters of the sample object;
step S23: inputting the first image characteristic and the second image characteristic into the modeling parameter prediction network to obtain a prediction object modeling parameter of the sample object;
Step S24: and inputting the first image characteristic and the second image characteristic into the coordinate system transformation parameter prediction network to obtain the transformation parameters of the coordinate system of the prediction camera.
When the predicted texture and illumination parameters of a sample object, the predicted object modeling parameters and the predicted camera coordinate system transformation parameters are obtained by using first sample image data and second sample image data, the first sample image data and the second sample image data are firstly input into a preset feature extraction network, first image features of the first sample image data and second image features of the second sample image data are extracted, and the first image features and the second image features are respectively input into a preset texture and illumination prediction network, a modeling parameter prediction network and a coordinate system transformation parameter prediction network based on the obtained image features, so that the predicted texture and illumination parameters, the predicted object modeling parameters and the predicted camera coordinate system transformation parameters of the sample object are respectively obtained.
The feature extraction network, the texture and illumination prediction network, the modeling parameter prediction network, and the coordinate system transformation parameter prediction network are all networks preset according to actual requirements. In one example, the feature extraction network may be an encoding network such as ResNet (a residual network) or EfficientNet (a convolutional neural network). The texture and illumination prediction network, the modeling parameter prediction network and the coordinate system transformation parameter prediction network may adopt a decoder-plus-classifier structure, or a structure consisting of a decoder, a pooling layer and a normalization layer, and may be customized according to the actual situation.
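Purely as an illustration of the structure just described (a shared feature extraction backbone feeding three prediction heads), a PyTorch sketch is given below; the ResNet-18 backbone, the head output dimensions and the two-image forward interface are assumptions made for the sketch, not the disclosed implementation.

```python
import torch
import torch.nn as nn
import torchvision

class ThreeDModelingNet(nn.Module):
    def __init__(self, n_keypoints=778, modeling_dim=61, transform_dim=7):
        super().__init__()
        # Feature extraction network (encoder), e.g. a ResNet backbone.
        backbone = torchvision.models.resnet18()
        backbone.fc = nn.Identity()
        self.feature_extractor = backbone          # outputs 512-d features
        # Texture and illumination prediction head:
        # per-keypoint RGB (n*3) + lighting params l_a, c_a, l_d, c_d, n_d (11 values).
        self.texture_light_head = nn.Linear(512, n_keypoints * 3 + 11)
        # Modeling parameter prediction head (e.g. MANO-style shape/pose/scale params;
        # dimension is an illustrative assumption).
        self.modeling_head = nn.Linear(512, modeling_dim)
        # Coordinate system transformation head (e.g. rotation + translation between frames).
        self.transform_head = nn.Linear(512 * 2, transform_dim)

    def forward(self, img1, img2):
        f1 = self.feature_extractor(img1)          # first image feature
        f2 = self.feature_extractor(img2)          # second image feature
        tex_light = self.texture_light_head(f1), self.texture_light_head(f2)
        modeling = self.modeling_head(f1), self.modeling_head(f2)
        cam_transform = self.transform_head(torch.cat([f1, f2], dim=1))
        return tex_light, modeling, cam_transform
```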
In one example, when the sample object is a human hand, the modeled parameter prediction network may be an MANO parametric model (a hand pose parameter estimation model), and the predicted object modeling parameters are MANO human hand parameters.
As can be seen from the above, the training method for building a deep learning model of a three-dimensional model provided by the present disclosure utilizes first sample image data and second sample image data, which both include sample objects, to be respectively input to a texture and illumination prediction network, a modeling parameter prediction network, and a coordinate system transformation parameter prediction network, and obtains three sets of required parameters by prediction, so that parameters required for three-dimensional modeling can be stably and rapidly obtained by prediction according to information in the sample image data, thereby efficiently completing three-dimensional modeling.
In a possible implementation manner, as shown in fig. 3, the step S14 of calculating a model consistency loss according to the predicted object modeling parameter includes:
step S31: and performing two-dimensional projection in the direction of the original image data based on the prediction object modeling parameters to obtain two-dimensional projection key point data.
Wherein the original image data includes at least one of first sample image data and second sample image data.
After the prediction object modeling parameters are obtained, a three-dimensional model of the sample object may be obtained based on them; the three-dimensional model is a three-dimensional stereoscopic sample object having planes in multiple directions, whereas the sample object in the first sample image data and the second sample image data is a two-dimensional plane of the sample object in one direction. The three-dimensional model is therefore projected in the direction of the original image data, that is, the three-dimensional stereoscopic sample object is projected onto the two-dimensional plane direction of the first sample image data and/or the second sample image data, and a plurality of key points are selected for each position of the projected sample object to obtain the two-dimensional projection key point data.
For example, when the sample object in the first sample image data and/or the second sample image data is a top view of the sample object in a top view direction, two-dimensional projection key point data is obtained by performing projection in the top view direction on the three-dimensional stereoscopic sample object and using each position in the top view of the sample object as a plurality of key points.
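A minimal sketch of such a two-dimensional projection, assuming a simple pinhole camera with known intrinsics (the intrinsic matrix values and the number of keypoints below are illustrative assumptions), is shown here:

```python
import numpy as np

def project_keypoints_2d(keypoints_3d, intrinsics):
    """Project 3D keypoints (k, 3) in the camera coordinate system to 2D pixels (k, 2)."""
    # Pinhole projection: p = K @ X, then divide by depth.
    p = keypoints_3d @ intrinsics.T            # (k, 3)
    return p[:, :2] / p[:, 2:3]

# Illustrative intrinsics (focal length 500 px, principal point at 320x240):
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
# Random keypoints pushed to positive depth, just to exercise the function.
proj_2d = project_keypoints_2d(np.random.rand(778, 3) + np.array([0.0, 0.0, 1.0]), K)
```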
Step S32: and acquiring two-dimensional key point data of a sample object in the original image data to obtain true value two-dimensional key point data.
After the two-dimensional projection key point data is obtained, selecting data of each key point corresponding to the sample object and the two-dimensional projection key point from the first sample image data and/or the second sample image data to obtain true value two-dimensional key point data.
Step S33: and calculating to obtain the model consistency loss according to the difference between the truth-value two-dimensional key point data and the two-dimensional projection key point data.
As mentioned above, the prediction object modeling parameters are used to represent the shape, position, and scale of the sample object; the two-dimensional projection keypoint data therefore represents the shape, position, and scale of the predicted three-dimensional model of the sample object, while the truth two-dimensional keypoint data gives the true shape, position, and scale of the sample object in the first sample image data and/or the second sample image data.
Comparing the difference between the true two-dimensional key point data and the two-dimensional projection key point data, namely comparing the shape, position and scale of the predicted sample object with the shape, position and scale of the real sample object, and calculating the model consistency loss to represent the difference between the shape, position and scale of the two-dimensional sample object and the three-dimensional sample object in the original image data.
In one embodiment of the present disclosure, the model consistency loss is obtained according to the following formula:

$$E_{con} = \sum_{i=1}^{k} \mathrm{smooth}_{L1}\!\left(p_i^{proj},\, p_i^{gt}\right)$$

where $E_{con}$ is the model consistency loss, $p_i^{proj}$ is the coordinate of the $i$-th two-dimensional projection key point in the two-dimensional projection key point data, $p_i^{gt}$ is the coordinate of the $i$-th truth key point in the truth two-dimensional key point data, $\mathrm{smooth}_{L1}(\cdot,\cdot)$ denotes the L1 smoothing loss calculated for the coordinates of the $i$-th two-dimensional projection key point and the $i$-th truth key point, and $k$ represents the total number of truth key points in the truth two-dimensional key point data.
Based on the model consistency loss, parameters of at least one of the feature extraction network, the modeled parameter prediction network, and the like may be adjusted.
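A hedged sketch of this loss in PyTorch is shown below; averaging over the k keypoints is an assumption, since the formula above only specifies a smooth-L1 comparison per truth keypoint.

```python
import torch
import torch.nn.functional as F

def model_consistency_loss(proj_keypoints_2d, gt_keypoints_2d):
    """E_con: smooth-L1 between projected 2D keypoints and truth 2D keypoints.

    Both tensors have shape (k, 2); the mean reduction is an assumption.
    """
    return F.smooth_l1_loss(proj_keypoints_2d, gt_keypoints_2d)
```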
In view of the above, the training method for establishing the deep learning model of the three-dimensional model provided by the disclosure performs difference comparison between the two-dimensional projection key point of the sample object and the true two-dimensional key point of the sample object, so that the model consistency loss can be calculated based on a plurality of key points of the sample object, thereby more accurately obtaining the model consistency loss, representing the difference in shape, position and scale between the two-dimensional sample object and the predicted three-dimensional sample object in the original image data, further adjusting the model parameters, and improving the accuracy of the model.
In one possible implementation manner, as shown in fig. 4, the step S15 calculating the motion smoothness loss according to the predicted camera coordinate system transformation parameters includes:
step S41: based on the predicted object modeling parameters, first three-dimensional coordinates of the respective keypoints of the sample object in the first camera coordinate system are determined, and second three-dimensional coordinates of the respective keypoints of the sample object in the second camera coordinate system are determined.
As mentioned above, the first camera coordinate system is the camera coordinate system at the first time point when the camera acquires the first sample image data, and the first three-dimensional coordinates are the coordinates of each key point of the sample object in the first sample image data; the second camera coordinate system is a camera coordinate system at a second time when the camera acquires the second sample image data, and the second three-dimensional coordinates are coordinates of each key point of the sample object in the second sample image data.
Step S42: and converting the first three-dimensional coordinate into the second camera coordinate system according to the predicted camera coordinate system transformation parameter to obtain a third three-dimensional coordinate.
The predicted camera coordinate system transformation parameters are the coordinate system transformation parameters between the first camera coordinate system and the second camera coordinate system. Based on the predicted camera coordinate system transformation parameters, the first three-dimensional coordinates of the sample object in the first camera coordinate system are converted into the second camera coordinate system to obtain the third three-dimensional coordinates corresponding to the first three-dimensional coordinates.
Step S43: and calculating to obtain the loss of the motion smoothness according to the difference between the second three-dimensional coordinate and the third three-dimensional coordinate.
As can be seen from the above, the second three-dimensional coordinates and the third three-dimensional coordinates lie in the same camera coordinate system, i.e., the second camera coordinate system, so the position difference between the second three-dimensional coordinates and the third three-dimensional coordinates can be calculated directly to obtain the position difference of the sample object between the first sample image data and the second sample image data.
In one embodiment of the present disclosure, the motion smoothness loss is obtained according to the following formula:

$$E_{smooth} = \sum_{i=1}^{k} \mathrm{smooth}_{L1}\!\left(X_i^{1\rightarrow 2},\, X_i^{2}\right)$$

where $E_{smooth}$ is the motion smoothness loss, $X_i^{1\rightarrow 2}$ is the third three-dimensional coordinate of the $i$-th key point of the sample object, $X_i^{2}$ is the second three-dimensional coordinate of the $i$-th key point of the sample object, $\mathrm{smooth}_{L1}(\cdot,\cdot)$ indicates that the L1 smoothing loss (a smoothing loss function) is calculated for the third three-dimensional coordinate and the second three-dimensional coordinate of the $i$-th key point, and $k$ indicates the total number of key points of the sample object.
Based on the motion smoothness loss, parameters of at least one of the feature extraction network, the modeled parameter prediction network, and the coordinate system transformation parameter prediction network may be adjusted.
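A hedged sketch of steps S41 to S43 in PyTorch follows; representing the predicted camera coordinate system transformation as a rotation matrix R and translation t, and averaging over keypoints, are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F

def motion_smoothness_loss(keypoints_cam1, keypoints_cam2, R, t):
    """E_smooth: transform first-frame keypoints into the second camera coordinate
    system, then compare with the second-frame keypoints via smooth-L1.

    keypoints_cam1, keypoints_cam2: (k, 3); R: (3, 3); t: (3,)
    """
    keypoints_1_to_2 = keypoints_cam1 @ R.T + t   # third three-dimensional coordinates
    return F.smooth_l1_loss(keypoints_1_to_2, keypoints_cam2)
```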
As can be seen from the above, in the training method for establishing the deep learning model of the three-dimensional model, the coordinates of the sample object in the first camera coordinate system are converted into the second camera coordinate system, so that the difference between the coordinates of the sample object in the two coordinate systems is obtained; the motion smoothness loss can then be obtained based on the position difference of the sample object between the first sample image data and the second sample image data. The motion smoothness loss of the sample object is thus obtained more accurately, the model parameters are adjusted, and the accuracy of the model is improved.
In a possible implementation manner, as shown in fig. 5, the step S13 of calculating an imaging loss according to the predicted texture and the illumination parameter includes:
step S51: and rendering to obtain reconstructed image data of the original image data based on the predicted texture, the illumination parameter and the predicted object modeling parameter.
The original image data includes at least one of first sample image data and second sample image data.
According to the prediction object modeling parameters, the positions of all key points of the target object can be obtained, and then the prediction texture and the illumination parameters are combined for rendering, so that the reconstructed image data of the original image data can be obtained.
Step S52: and calculating to obtain the imaging loss based on the difference between the original image data and the reconstructed image data.
The reconstructed image data is three-dimensional image data of the sample object. This three-dimensional image data is compared with the two-dimensional image data of the sample object in the original image data, and the resulting difference is the imaging loss of the predicted sample object.
The reconstructed image data obtained by rendering can be obtained based on any three-dimensional rendering mode.
In an embodiment of the present disclosure, the imaging loss is obtained from the original image data and the reconstructed image data according to the following formula:

$$E_{pixel} = \frac{1}{Z}\sum_{(u,v)\in S_{re}} \left\| I_{u,v} - \hat{I}_{u,v} \right\|$$

where $E_{pixel}$ represents the imaging loss, $S_{re}$ represents the set of key points in the reconstructed image data, $I_{u,v}$ represents the pixel value of the key point with coordinates $(u, v)$ in the original image data, $\hat{I}_{u,v}$ represents the pixel value of the corresponding key point in the reconstructed image data, and $Z$ represents a normalization parameter, indicating that the pixel value differences of the key points are obtained and then averaged.
Based on the imaging loss, parameters of at least one of the feature extraction network, the texture and illumination prediction network, and the modeled parameter prediction network may be adjusted.
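A hedged sketch of this loss in PyTorch is given below; sampling pixel values at the keypoint coordinates is assumed to have been done beforehand, and the per-keypoint L1 difference followed by a mean is an assumption based on the description of Z above.

```python
import torch

def imaging_loss(original_pixels, reconstructed_pixels):
    """E_pixel: mean pixel-value difference over the reconstructed keypoint set S_re.

    Both tensors have shape (|S_re|, 3), holding RGB values sampled at the
    keypoint coordinates (u, v) in the original and reconstructed images.
    """
    return (original_pixels - reconstructed_pixels).abs().sum(dim=1).mean()
```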
From the above, the training method for establishing the deep learning model of the three-dimensional model provided by the disclosure calculates the imaging loss based on the difference between the two-dimensional image data of the sample object in the original image data and the three-dimensional image data of the sample object in the reconstructed image data while considering the texture and illumination of the image data, so that the reconstructed image data more accurately restores the original image data, and more accurate imaging loss is obtained, and the model parameters are adjusted based on the imaging loss, thereby further improving the accuracy of the model.
Referring to fig. 6, fig. 6 is a schematic flow chart of a three-dimensional model building method provided by the present disclosure, including:
step S61: first image data is acquired.
The first image data comprises a target object of a three-dimensional model to be constructed;
step S62: inputting the first image data into a pre-trained deep learning model, and extracting a network based on the features in the deep learning model to obtain first image features;
the deep learning model is obtained through training by any one of the training devices for building the deep learning model of the three-dimensional model.
Step S63: inputting the first image characteristic into a texture and illumination prediction network of the deep learning model, and determining a predicted texture and illumination parameters of the target object;
Step S64: and inputting the first image characteristics into a modeling parameter prediction network of the deep learning model, and determining prediction object modeling parameters of the target object.
Wherein the object modeling parameters represent a shape, a position, a scale of the target object.
Step S65: and establishing a three-dimensional model of the target object based on the texture and illumination parameters and the object modeling parameters.
The first image data is image data of a target object including a three-dimensional model to be constructed, and specifically, may be two consecutive frames of image data in consecutive video data including the target object. And inputting the first image data into a pre-trained deep learning model, determining texture and illumination parameters representing texture features and illumination information of the target object and object modeling parameters representing the shape, position and scale of the target object, and establishing a three-dimensional model of the target object based on the parameters to obtain the three-dimensional target object.
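An illustrative sketch of this inference path is shown below, reusing the hypothetical ThreeDModelingNet from the training sketch above; the concrete construction of the final 3D representation (step S65) is only indicated, since it depends on the chosen parametric model.

```python
import torch

@torch.no_grad()
def build_3d_model(model, first_image):
    """Predict texture/illumination and modeling parameters for a single image
    and use them to build the three-dimensional model of the target object."""
    model.eval()
    features = model.feature_extractor(first_image)        # step S62
    tex_light = model.texture_light_head(features)         # step S63
    modeling = model.modeling_head(features)               # step S64
    # Step S65: construct the 3D model (e.g. a textured mesh) from the parameters;
    # the concrete reconstruction depends on the parametric model used (e.g. MANO).
    return tex_light, modeling
```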
In one embodiment of the present disclosure, the deep learning model is a deep learning model that removes a coordinate system transformation parameter prediction network.
The coordinate system transformation parameter prediction network is used to calculate the motion smoothness loss during the training of the deep learning model. After the deep learning model has been trained, the coordinate system transformation parameter prediction network is removed from the deep learning model, and the three-dimensional model of the target object is established using the deep learning model with the coordinate system transformation parameter prediction network removed.
The obtained three-dimensional model can be used in scenes such as identification, matching, object reconstruction, posture detection and two-dimensional image generation for the target object. In one example, if the first image data is an image including a human hand, the target object for which the three-dimensional model is to be constructed is the human hand. The image is input into a deep learning model trained in advance, and the image features of the human hand are obtained based on the feature extraction network in the deep learning model. The image features of the human hand are input into the texture and illumination prediction network of the deep learning model to determine the predicted texture and illumination parameters of the human hand; the image features of the human hand are input into the modeling parameter prediction network of the deep learning model to determine the predicted object modeling parameters of the human hand, where the object modeling parameters represent the shape, position and scale of the human hand. Finally, a three-dimensional model of the human hand is established based on the obtained texture and illumination parameters and the object modeling parameters; the obtained three-dimensional model of the human hand can be used, for example, to estimate the person's hand posture.
In view of the above, the three-dimensional model establishing method provided by the present disclosure inputs the first image data to the pre-trained deep learning model, determines the texture, the illumination parameter and the object modeling parameter, establishes the three-dimensional model of the target object, and establishes the three-dimensional model of the sample object according to the first image data, thereby achieving establishment of the three-dimensional model with high efficiency, low cost and high universality.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a training apparatus for building a deep learning model of a three-dimensional model according to the present disclosure, including:
a data obtaining module 701, configured to obtain first sample image data and second sample image data, where the first sample image data and the second sample image data both include a sample object;
a parameter obtaining module 702, configured to input the first sample image data and the second sample image data into a deep learning model, so as to obtain a predicted texture and illumination parameter, a predicted object modeling parameter, and a predicted camera coordinate system transformation parameter of the sample object, where the predicted object modeling parameter represents a shape, a position, and a scale of the sample object, the predicted camera coordinate system transformation parameter represents a coordinate system transformation parameter between a first camera coordinate system and a second camera coordinate system, the first camera coordinate system is a camera coordinate system when a camera acquires the first sample image data, and the second camera coordinate system is a camera coordinate system when the camera acquires the second sample image data;
a first loss calculating module 703, configured to calculate an imaging loss according to the predicted texture and the illumination parameter;
a second loss calculating module 704, configured to calculate a model consistency loss according to the prediction object modeling parameter;
a third loss calculating module 705, configured to calculate, according to the predicted camera coordinate system transformation parameter, a motion smoothness loss;
a parameter adjusting module 706, configured to adjust parameters of the deep learning model according to the imaging loss, the model consistency loss, and the motion smoothness loss.
As can be seen from the above, the training apparatus for building a deep learning model of a three-dimensional model according to the present disclosure obtains a predicted texture and an illumination parameter of a sample object, a predicted object modeling parameter capable of representing a shape, a position, and a scale of the sample object, and a predicted camera coordinate system transformation parameter representing a coordinate system transformation parameter between a first camera coordinate system and a second camera coordinate system, respectively, by using first sample image data and second sample image data each including the sample object, thereby stably and rapidly predicting parameters required for three-dimensional modeling according to information in the sample image data, and further completing three-dimensional modeling. And then, calculating imaging loss, model consistency loss and motion smoothness loss according to the predicted parameters, and adjusting the parameters of the deep learning model so that the deep learning model gradually meets the requirements and the three-dimensional model is more accurately established.
In one embodiment of the present disclosure, the deep learning model includes a feature extraction network, a texture and illumination prediction network, a modeling parameter prediction network, and a coordinate system transformation parameter prediction network;
the parameter obtaining module 706 is specifically configured to:
inputting the first sample image data and the second sample image data into the feature extraction network to obtain a first image feature and a second image feature;
inputting the first image characteristic and the second image characteristic into the texture and illumination prediction network to obtain the predicted texture and illumination parameters of the sample object;
inputting the first image characteristic and the second image characteristic into the modeling parameter prediction network to obtain a prediction object modeling parameter of the sample object;
and inputting the first image characteristic and the second image characteristic into the coordinate system transformation parameter prediction network to obtain the transformation parameters of the coordinate system of the prediction camera.
As can be seen from the above, the training device for building a deep learning model of a three-dimensional model according to the present disclosure uses first sample image data and second sample image data, which both include sample objects, to be respectively input to a texture and illumination prediction network, a modeling parameter prediction network, and a coordinate system transformation parameter prediction network, so as to predict three sets of required parameters, thereby stably and rapidly predicting parameters required for three-dimensional modeling according to information in the sample image data, and further efficiently completing three-dimensional modeling.
In an embodiment of the present disclosure, the second loss calculating module 704 includes:
a projection key point obtaining submodule configured to perform two-dimensional projection in a direction of original image data based on the prediction object modeling parameter to obtain two-dimensional projection key point data, where the original image data includes at least one of first sample image data and second sample image data;
a true value key point obtaining submodule, configured to obtain two-dimensional key point data of a sample object in the original image data, to obtain true value two-dimensional key point data;
and the second loss calculation submodule is used for calculating to obtain the consistency loss of the model according to the difference between the truth-value two-dimensional key point data and the two-dimensional projection key point data.
In an embodiment of the disclosure, the second loss calculating submodule is specifically configured to:
the model consistency loss is obtained according to the following formula:
$$E_{con} = \frac{1}{k}\sum_{i=1}^{k}\mathrm{smooth}_{L1}\left(\hat{p}_i,\ p_i\right)$$

wherein $E_{con}$ is the model consistency loss, $\hat{p}_i$ is the coordinate of the ith two-dimensional projection keypoint in the two-dimensional projection keypoint data, $p_i$ is the coordinate of the ith truth keypoint in the truth two-dimensional keypoint data, $\mathrm{smooth}_{L1}(\hat{p}_i, p_i)$ denotes the L1 smoothing loss calculated between the coordinates of the ith two-dimensional projection keypoint and the ith truth keypoint, and $k$ denotes the total number of truth keypoints in the truth two-dimensional keypoint data.
As can be seen from the above, the training apparatus for building a deep learning model of a three-dimensional model provided by the present disclosure compares the two-dimensional projection keypoints of the sample object with the true two-dimensional keypoints of the sample object, so that the model consistency loss can be calculated over a plurality of keypoints of the sample object. The model consistency loss obtained in this way more accurately represents the differences in shape, position, and scale between the two-dimensional sample object in the original image data and the predicted three-dimensional sample object, and adjusting the model parameters on this basis improves the accuracy of the model.
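As a non-authoritative sketch, the model consistency loss above reduces to a smooth L1 comparison between the projected and ground-truth two-dimensional keypoints; averaging over the k keypoints is an assumption about the normalization, and the function name is hypothetical.

```python
import torch.nn.functional as F

def model_consistency_loss(projected_kpts, gt_kpts):
    """Hypothetical sketch of E_con.

    projected_kpts: (k, 2) two-dimensional projection keypoint coordinates.
    gt_kpts:        (k, 2) truth two-dimensional keypoint coordinates.
    """
    # smooth L1 between each projected keypoint and its truth keypoint, averaged over k
    return F.smooth_l1_loss(projected_kpts, gt_kpts, reduction="mean")
```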
In an embodiment of the present disclosure, the third loss calculating module 705 includes:
a three-dimensional coordinate obtaining sub-module for determining, based on the predicted object modeling parameter, first three-dimensional coordinates of each keypoint of the sample object in the first camera coordinate system, and second three-dimensional coordinates of each keypoint of the sample object in the second camera coordinate system;
the three-dimensional coordinate conversion sub-module is used for converting the first three-dimensional coordinate into the second camera coordinate system according to the predicted camera coordinate system conversion parameter to obtain a third three-dimensional coordinate;
and the third loss calculation submodule is used for calculating the motion smoothness loss according to the difference between the second three-dimensional coordinate and the third three-dimensional coordinate.
In an embodiment of the disclosure, the third loss calculating sub-module is specifically configured to:
the motion smoothness loss is obtained according to the following formula:
$$E_{smooth} = \frac{1}{k}\sum_{i=1}^{k}\mathrm{smooth}_{L1}\left(\hat{P}_i,\ P_i\right)$$

wherein $E_{smooth}$ is the motion smoothness loss, $\hat{P}_i$ is the third three-dimensional coordinate of the ith keypoint of the sample object, $P_i$ is the second three-dimensional coordinate of the ith keypoint of the sample object, $\mathrm{smooth}_{L1}(\hat{P}_i, P_i)$ denotes the L1 smoothing loss calculated between the third three-dimensional coordinate and the second three-dimensional coordinate of the ith keypoint, and $k$ denotes the total number of keypoints of the sample object.
As can be seen from the above, the training device for building a deep learning model of a three-dimensional model provided by the present disclosure converts the coordinates of the sample object in the first camera coordinate system into the second camera coordinate system, and thereby obtains the difference between the coordinates of the sample object as observed in the two camera coordinate systems. The motion smoothness loss is obtained on the basis of this position difference between the sample object in the first sample image data and in the second sample image data, so that the motion smoothness loss of the sample object is obtained more accurately, the model parameters are adjusted accordingly, and the accuracy of the model is improved.
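A minimal sketch of this computation is shown below, assuming the predicted camera coordinate system transformation parameters are expressed as a rotation matrix and a translation vector; that parameterization, the averaging over keypoints, and the function name are assumptions of the sketch.

```python
import torch.nn.functional as F

def motion_smoothness_loss(kpts_cam1, kpts_cam2, rotation, translation):
    """Hypothetical sketch of E_smooth.

    kpts_cam1: (k, 3) first three-dimensional coordinates in the first camera coordinate system.
    kpts_cam2: (k, 3) second three-dimensional coordinates in the second camera coordinate system.
    rotation (3, 3) and translation (3,): predicted transformation from the first to the
        second camera coordinate system (an assumed parameterization).
    """
    # third three-dimensional coordinates: first coordinates expressed in the second camera frame
    kpts_cam1_in_cam2 = kpts_cam1 @ rotation.T + translation
    return F.smooth_l1_loss(kpts_cam1_in_cam2, kpts_cam2, reduction="mean")
```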
In an embodiment of the present disclosure, the first loss calculating module 703 includes:
the image rendering sub-module is used for rendering reconstructed image data of original image data based on the predicted texture and illumination parameters and the predicted object modeling parameters, wherein the original image data comprises at least one of first sample image data and second sample image data;
and the first loss calculation submodule is used for calculating to obtain the imaging loss based on the difference between the original image data and the reconstructed image data.
In an embodiment of the disclosure, the first loss calculation submodule is specifically configured to:
the imaging loss is obtained from the original image data and the reconstructed image data according to the following formula:
$$E_{pixel} = \frac{1}{Z}\sum_{(u,v)\in S_{re}}\left|I_{u,v} - \hat{I}_{u,v}\right|$$

wherein $E_{pixel}$ represents the imaging loss, $S_{re}$ represents the set of keypoints in the reconstructed image data, $Z$ represents a normalization parameter, $I_{u,v}$ represents the pixel value of the keypoint with coordinates $(u,v)$ in the original image data, and $\hat{I}_{u,v}$ represents the pixel value of the keypoint with coordinates $(u,v)$ in the reconstructed image data.
Therefore, the training device for building a deep learning model of a three-dimensional model provided by the present disclosure calculates the imaging loss based on the difference between the sample object in the original image data and the sample object rendered from the predicted three-dimensional representation in the reconstructed image data, while taking the texture and illumination of the image data into account. The reconstructed image data thus restores the original image data more accurately, a more accurate imaging loss is obtained, and adjusting the model parameters based on the imaging loss further improves the accuracy of the model.
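For illustration, the imaging loss can be sketched as a masked pixel difference between the original and reconstructed images; the use of a rendered coverage mask, with its pixel count serving as the normalization parameter Z, is an assumption of this sketch rather than something fixed by the disclosure.

```python
def imaging_loss(original_image, reconstructed_image, render_mask):
    """Hypothetical sketch of E_pixel; the inputs are assumed to be torch tensors.

    render_mask selects the set S_re of pixels covered by the rendered reconstruction.
    """
    pixel_diff = (original_image - reconstructed_image).abs() * render_mask
    z = render_mask.sum().clamp(min=1.0)   # normalization parameter Z (assumed to be the mask area)
    return pixel_diff.sum() / z
```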
Referring to fig. 8, fig. 8 is a schematic structural diagram of a three-dimensional model building apparatus provided in the present disclosure, including:
an image data obtaining module 801, configured to obtain first image data, where the first image data includes a target object of a three-dimensional model to be constructed;
a parameter determining module 802, configured to input the first image data into a pre-trained deep learning model, and extract a network based on features in the deep learning model to obtain first image features, where the deep learning model is obtained by training with any one of the above training apparatuses for building a deep learning model of a three-dimensional model; inputting the first image characteristic into a texture and illumination prediction network of the deep learning model, and determining a predicted texture and illumination parameters of the target object; inputting the first image feature into a modeling parameter prediction network of the deep learning model, and determining a prediction object modeling parameter of the target object, wherein the object modeling parameter represents the shape, the position and the scale of the target object;
a model building module 803, configured to build a three-dimensional model of the target object based on the texture and illumination parameters and the object modeling parameters.
In one embodiment of the present disclosure, the deep learning model is a deep learning model with a coordinate system transformation parameter prediction network removed.
As can be seen from the above, the three-dimensional model building apparatus provided by the present disclosure inputs the first image data into the pre-trained deep learning model, determines the texture and illumination parameters and the object modeling parameters, and builds the three-dimensional model of the target object from the first image data, thereby realizing three-dimensional model building with high efficiency, low cost, and high universality.
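A minimal inference sketch, under the same assumptions as the hypothetical network sketch above: only the feature extraction, texture and illumination, and modeling heads are used, and `build_mesh` stands in for whatever step turns the predicted parameters into the three-dimensional model; it is not an API defined by the disclosure.

```python
import torch

def build_three_dimensional_model(model, first_image, build_mesh):
    """Hypothetical sketch: build the 3D model of the target object from a single image."""
    model.eval()
    with torch.no_grad():
        features = model.feature_net(first_image)           # first image features
        texture_light = model.texture_light_net(features)   # predicted texture and illumination
        modeling = model.modeling_net(features)              # predicted shape, position and scale
    # the coordinate system transformation head is not needed at inference time
    return build_mesh(texture_light, modeling)
```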
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and other handling of the personal information of the users involved all comply with the provisions of relevant laws and regulations and do not violate public order and good morals.
It should be noted that the head model in this embodiment is not directed to any specific user and cannot reflect the personal information of any specific user.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 9 illustrates a schematic block diagram of an example electronic device 900 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in FIG. 9, the device 900 includes a computing unit 901, which can perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
A number of components in the device 900 are connected to the I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, and the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, optical disk, or the like; and a communication unit 909 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The calculation unit 901 performs the respective methods and processes described above, such as a training method for building a deep learning model of a three-dimensional model. For example, in some embodiments, the training method for building a deep learning model of a three-dimensional model may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 900 via ROM 902 and/or communications unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the training method for building a deep learning model of a three-dimensional model described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured by any other suitable means (e.g., by means of firmware) to perform a training method for building a deep learning model of a three-dimensional model.
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), system on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server with a combined blockchain.
It should be understood that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, which is not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims (22)

1. A training method for building a deep learning model of a three-dimensional model, the method comprising:
acquiring first sample image data and second sample image data, wherein the first sample image data and the second sample image data both comprise sample objects;
inputting the first sample image data and the second sample image data into a deep learning model to obtain a predicted texture and illumination parameter, a predicted object modeling parameter and a predicted camera coordinate system transformation parameter of the sample object, wherein the predicted object modeling parameter represents the shape, position and scale of the sample object, the predicted camera coordinate system transformation parameter represents a coordinate system transformation parameter between a first camera coordinate system and a second camera coordinate system, the first camera coordinate system is a camera coordinate system when a camera acquires the first sample image data, and the second camera coordinate system is a camera coordinate system when the camera acquires the second sample image data;
calculating to obtain imaging loss according to the predicted texture and the illumination parameters;
calculating to obtain model consistency loss according to the prediction object modeling parameters;
calculating to obtain the loss of motion smoothness according to the predicted camera coordinate system transformation parameters;
and adjusting parameters of the deep learning model according to the imaging loss, the model consistency loss and the motion smoothness loss.
2. The method of claim 1, wherein the deep learning model comprises a feature extraction network, a texture and illumination prediction network, a modeling parameter prediction network, a coordinate system transformation parameter prediction network;
and the inputting of the first sample image data and the second sample image data into the deep learning model to obtain the predicted texture and illumination parameters of the sample object, the predicted object modeling parameters, and the predicted camera coordinate system transformation parameters comprises the following steps:
inputting the first sample image data and the second sample image data into the feature extraction network to obtain a first image feature and a second image feature;
inputting the first image characteristic and the second image characteristic into the texture and illumination prediction network to obtain the predicted texture and illumination parameters of the sample object;
inputting the first image characteristic and the second image characteristic into the modeling parameter prediction network to obtain a prediction object modeling parameter of the sample object;
and inputting the first image characteristic and the second image characteristic into the coordinate system transformation parameter prediction network to obtain the transformation parameters of the coordinate system of the prediction camera.
3. The method of claim 1, wherein said calculating a model consistency loss based on said predicted object modeling parameters comprises:
performing two-dimensional projection in the direction of original image data based on the prediction object modeling parameter to obtain two-dimensional projection key point data, wherein the original image data comprises at least one of first sample image data and second sample image data;
acquiring two-dimensional key point data of a sample object in the original image data to obtain true value two-dimensional key point data;
and calculating to obtain the model consistency loss according to the difference between the truth-value two-dimensional key point data and the two-dimensional projection key point data.
4. The method of claim 3, wherein the calculating a model consistency loss from the difference of the true two-dimensional keypoint data and the two-dimensional projected keypoint data comprises:
The model consistency loss is obtained according to the following formula:
$$E_{con} = \frac{1}{k}\sum_{i=1}^{k}\mathrm{smooth}_{L1}\left(\hat{p}_i,\ p_i\right)$$

wherein $E_{con}$ is the model consistency loss, $\hat{p}_i$ is the coordinate of the ith two-dimensional projected keypoint in the two-dimensional projected keypoint data, $p_i$ is the coordinate of the ith truth keypoint in the truth two-dimensional keypoint data, $\mathrm{smooth}_{L1}(\hat{p}_i, p_i)$ denotes the L1 smoothing loss calculated between the coordinates of the ith two-dimensional projection keypoint and the ith truth keypoint, and $k$ denotes the total number of truth keypoints in the truth two-dimensional keypoint data.
5. The method of claim 1, wherein said calculating a motion smoothness loss from said predicted camera coordinate system transformation parameters comprises:
determining first three-dimensional coordinates of the respective keypoints of the sample object in the first camera coordinate system and second three-dimensional coordinates of the respective keypoints of the sample object in the second camera coordinate system based on the predicted object modeling parameters;
converting the first three-dimensional coordinate into the second camera coordinate system according to the predicted camera coordinate system transformation parameter to obtain a third three-dimensional coordinate;
and calculating to obtain the loss of the motion smoothness according to the difference between the second three-dimensional coordinate and the third three-dimensional coordinate.
6. The method of claim 5, wherein said calculating a motion smoothness loss based on a difference between said second three-dimensional coordinates and said third three-dimensional coordinates comprises:
the motion smoothness loss is obtained according to the following formula:
$$E_{smooth} = \frac{1}{k}\sum_{i=1}^{k}\mathrm{smooth}_{L1}\left(\hat{P}_i,\ P_i\right)$$

wherein $E_{smooth}$ is the motion smoothness loss, $\hat{P}_i$ is the third three-dimensional coordinate of the ith keypoint of the sample object, $P_i$ is the second three-dimensional coordinate of the ith keypoint of the sample object, $\mathrm{smooth}_{L1}(\hat{P}_i, P_i)$ denotes the L1 smoothing loss calculated between the third three-dimensional coordinate of the ith keypoint and the second three-dimensional coordinate of the ith keypoint, and $k$ denotes the total number of keypoints of the sample object.
7. The method of claim 3, wherein said calculating an imaging loss from said predicted texture and illumination parameters comprises:
based on the predicted texture and illumination parameters and the predicted object modeling parameters, obtaining reconstructed image data of original image data through rendering, wherein the original image data comprises at least one of first sample image data and second sample image data;
and calculating to obtain the imaging loss based on the difference between the original image data and the reconstructed image data.
8. The method of claim 7, wherein the calculating an imaging loss based on a difference of the original image data and the reconstructed image data comprises:
obtaining the imaging loss from the original image data and the reconstructed image data according to the following formula:
$$E_{pixel} = \frac{1}{Z}\sum_{(u,v)\in S_{re}}\left|I_{u,v} - \hat{I}_{u,v}\right|$$

wherein $E_{pixel}$ represents the imaging loss, $S_{re}$ represents the set of keypoints in the reconstructed image data, $Z$ represents a normalization parameter, $I_{u,v}$ represents the pixel value of the keypoint with coordinates $(u,v)$ in the original image data, and $\hat{I}_{u,v}$ represents the pixel value of the keypoint with coordinates $(u,v)$ in the reconstructed image data.
9. A three-dimensional model building method, comprising:
acquiring first image data, wherein the first image data comprises a target object of a three-dimensional model to be constructed;
inputting the first image data into a pre-trained deep learning model, and obtaining first image features based on a feature extraction network in the deep learning model, wherein the deep learning model is obtained by training according to the method of any one of claims 1 to 8;
inputting the first image feature into a texture and illumination prediction network of the deep learning model, and determining a predicted texture and illumination parameters of the target object;
inputting the first image feature into a modeling parameter prediction network of the deep learning model, and determining a prediction object modeling parameter of the target object, wherein the object modeling parameter represents the shape, the position and the scale of the target object;
and establishing a three-dimensional model of the target object based on the texture and illumination parameters and the object modeling parameters.
10. The method of claim 9, wherein the deep learning model is a deep learning model that removes a coordinate system transformation parameter prediction network.
11. A training apparatus for building a deep learning model of a three-dimensional model, the apparatus comprising:
a data acquisition module, used for acquiring first sample image data and second sample image data, wherein the first sample image data and the second sample image data both comprise sample objects;
a parameter obtaining module, configured to input the first sample image data and the second sample image data into a deep learning model, so as to obtain a predicted texture and illumination parameter, a predicted object modeling parameter, and a predicted camera coordinate system transformation parameter of the sample object, where the predicted object modeling parameter represents a shape, a position, and a scale of the sample object, the predicted camera coordinate system transformation parameter represents a coordinate system transformation parameter between a first camera coordinate system and a second camera coordinate system, the first camera coordinate system is a camera coordinate system when a camera acquires the first sample image data, and the second camera coordinate system is a camera coordinate system when the camera acquires the second sample image data;
the first loss calculation module is used for calculating to obtain imaging loss according to the predicted texture and the illumination parameters;
the second loss calculation module is used for calculating to obtain model consistency loss according to the prediction object modeling parameters;
the third loss calculation module is used for calculating the motion smoothness loss according to the predicted camera coordinate system transformation parameters;
and the parameter adjusting module is used for adjusting the parameters of the deep learning model according to the imaging loss, the model consistency loss and the motion smoothness loss.
12. The apparatus of claim 11, wherein the deep learning model comprises a feature extraction network, a texture and illumination prediction network, a modeling parameter prediction network, a coordinate system transformation parameter prediction network;
the parameter obtaining module is specifically configured to:
inputting the first sample image data and the second sample image data into the feature extraction network to obtain a first image feature and a second image feature;
inputting the first image characteristic and the second image characteristic into the texture and illumination prediction network to obtain the predicted texture and illumination parameters of the sample object;
inputting the first image characteristic and the second image characteristic into the modeling parameter prediction network to obtain a prediction object modeling parameter of the sample object;
and inputting the first image characteristic and the second image characteristic into the coordinate system transformation parameter prediction network to obtain the transformation parameters of the coordinate system of the prediction camera.
13. The apparatus of claim 11, wherein the second loss calculation module comprises:
a projection key point obtaining sub-module, configured to perform two-dimensional projection in a direction of original image data based on the prediction object modeling parameter to obtain two-dimensional projection key point data, where the original image data includes at least one of first sample image data and second sample image data;
a true value key point obtaining submodule, configured to obtain two-dimensional key point data of a sample object in the original image data, to obtain true value two-dimensional key point data;
and the second loss calculation submodule is used for calculating to obtain the model consistency loss according to the difference between the true value two-dimensional key point data and the two-dimensional projection key point data.
14. The apparatus of claim 13, wherein the second loss computation sub-module is specifically configured to:
The model consistency loss is obtained according to the following formula:
$$E_{con} = \frac{1}{k}\sum_{i=1}^{k}\mathrm{smooth}_{L1}\left(\hat{p}_i,\ p_i\right)$$

wherein $E_{con}$ is the model consistency loss, $\hat{p}_i$ is the coordinate of the ith projection keypoint in the two-dimensional projection keypoint data, $p_i$ is the coordinate of the ith truth keypoint in the truth two-dimensional keypoint data, $\mathrm{smooth}_{L1}(\hat{p}_i, p_i)$ denotes the L1 smoothing loss calculated between the coordinates of the ith projection keypoint and the ith truth keypoint, and $k$ denotes the total number of truth keypoints in the truth two-dimensional keypoint data.
15. The apparatus of claim 11, wherein the third loss calculation module comprises:
a three-dimensional coordinate obtaining sub-module for determining, based on the predicted object modeling parameter, first three-dimensional coordinates of each keypoint of the sample object in the first camera coordinate system, and second three-dimensional coordinates of each keypoint of the sample object in the second camera coordinate system;
the three-dimensional coordinate conversion sub-module is used for converting the first three-dimensional coordinate into the second camera coordinate system according to the predicted camera coordinate system conversion parameter to obtain a third three-dimensional coordinate;
and the third loss calculation submodule is used for calculating the loss of the motion smoothness according to the difference between the second three-dimensional coordinate and the third three-dimensional coordinate.
16. The apparatus of claim 15, wherein the third loss computation sub-module is specifically configured to:
the motion smoothness loss is obtained according to the following formula:
$$E_{smooth} = \frac{1}{k}\sum_{i=1}^{k}\mathrm{smooth}_{L1}\left(\hat{P}_i,\ P_i\right)$$

wherein $E_{smooth}$ is the motion smoothness loss, $\hat{P}_i$ is the third three-dimensional coordinate of the ith keypoint of the sample object, $P_i$ is the second three-dimensional coordinate of the ith keypoint of the sample object, $\mathrm{smooth}_{L1}(\hat{P}_i, P_i)$ denotes the L1 smoothing loss calculated between the third three-dimensional coordinate of the ith keypoint and the second three-dimensional coordinate of the ith keypoint, and $k$ denotes the total number of keypoints of the sample object.
17. The apparatus of claim 13, wherein the first loss calculation module comprises:
the image rendering sub-module is used for rendering reconstructed image data of original image data based on the predicted texture and illumination parameters and the predicted object modeling parameters, wherein the original image data comprises at least one of first sample image data and second sample image data;
and the first loss calculation submodule is used for calculating to obtain the imaging loss based on the difference between the original image data and the reconstructed image data.
18. The apparatus of claim 17, wherein the first loss computation submodule is specifically configured to:
obtaining the imaging loss from the original image data and the reconstructed image data according to the following formula:
$$E_{pixel} = \frac{1}{Z}\sum_{(u,v)\in S_{re}}\left|I_{u,v} - \hat{I}_{u,v}\right|$$

wherein $E_{pixel}$ represents the imaging loss, $S_{re}$ represents the set of keypoints in the reconstructed image data, $Z$ represents a normalization parameter, $I_{u,v}$ represents the pixel value of the keypoint with coordinates $(u,v)$ in the original image data, and $\hat{I}_{u,v}$ represents the pixel value of the keypoint with coordinates $(u,v)$ in the reconstructed image data.
19. A three-dimensional model building apparatus comprising:
the image data acquisition module is used for acquiring first image data, wherein the first image data comprises a target object of a three-dimensional model to be constructed;
a parameter determination module, configured to input the first image data into a pre-trained deep learning model, and obtain a first image feature based on a feature extraction network in the deep learning model, where the deep learning model is obtained by training with the apparatus according to any one of claims 11 to 18; inputting the first image characteristic into a texture and illumination prediction network of the deep learning model, and determining a predicted texture and illumination parameters of the target object; inputting the first image feature into a modeling parameter prediction network of the deep learning model, and determining a prediction object modeling parameter of the target object, wherein the object modeling parameter represents the shape, the position and the scale of the target object;
and the model establishing module is used for establishing a three-dimensional model of the target object based on the texture and illumination parameters and the object modeling parameters.
20. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-10.
21. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-10.
22. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-10.
CN202210430966.3A 2022-04-22 2022-04-22 Training method and device for deep learning model for building three-dimensional model Pending CN114758076A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210430966.3A CN114758076A (en) 2022-04-22 2022-04-22 Training method and device for deep learning model for building three-dimensional model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210430966.3A CN114758076A (en) 2022-04-22 2022-04-22 Training method and device for deep learning model for building three-dimensional model

Publications (1)

Publication Number Publication Date
CN114758076A true CN114758076A (en) 2022-07-15

Family

ID=82332979

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210430966.3A Pending CN114758076A (en) 2022-04-22 2022-04-22 Training method and device for deep learning model for building three-dimensional model

Country Status (1)

Country Link
CN (1) CN114758076A (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110176042A (en) * 2019-05-31 2019-08-27 北京百度网讯科技有限公司 Training method, device and the storage medium of camera self moving parameter estimation model
WO2021174939A1 (en) * 2020-03-03 2021-09-10 平安科技(深圳)有限公司 Facial image acquisition method and system
US20210281814A1 (en) * 2020-03-04 2021-09-09 Toyota Research Institute, Inc. Systems and methods for self-supervised depth estimation according to an arbitrary camera
US20220414911A1 (en) * 2020-03-04 2022-12-29 Huawei Technologies Co., Ltd. Three-dimensional reconstruction method and three-dimensional reconstruction apparatus
US20210406599A1 (en) * 2020-06-26 2021-12-30 Beijing Baidu Netcom Science And Technology Co., Ltd. Model training method and apparatus, and prediction method and apparatus
CN112330729A (en) * 2020-11-27 2021-02-05 中国科学院深圳先进技术研究院 Image depth prediction method and device, terminal device and readable storage medium
CN112907557A (en) * 2021-03-15 2021-06-04 腾讯科技(深圳)有限公司 Road detection method, road detection device, computing equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Liang Yanxing et al.: "A Survey of Three-Dimensional Structure Recovery Methods from a Single Visible-Light Image", Journal of Integration Technology (《集成技术》), vol. 10, no. 6, 30 November 2021 (2021-11-30) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115530855A (en) * 2022-09-30 2022-12-30 先临三维科技股份有限公司 Control method and device of three-dimensional data acquisition equipment and three-dimensional data acquisition equipment
WO2024067027A1 (en) * 2022-09-30 2024-04-04 先临三维科技股份有限公司 Control method and apparatus for three-dimensional data acquisition device, and three-dimensional data acquisition device

Similar Documents

Publication Publication Date Title
WO2022116677A1 (en) Target object grasping method and apparatus, storage medium, and electronic device
JP2020507850A (en) Method, apparatus, equipment, and storage medium for determining the shape of an object in an image
CN114186632B (en) Method, device, equipment and storage medium for training key point detection model
CN109858333B (en) Image processing method, image processing device, electronic equipment and computer readable medium
JP7273129B2 (en) Lane detection method, device, electronic device, storage medium and vehicle
CN111062263B (en) Method, apparatus, computer apparatus and storage medium for hand gesture estimation
CN113674421B (en) 3D target detection method, model training method, related device and electronic equipment
CN115880435B (en) Image reconstruction method, model training method, device, electronic equipment and medium
CN113793370B (en) Three-dimensional point cloud registration method and device, electronic equipment and readable medium
CN112862877A (en) Method and apparatus for training image processing network and image processing
CN112652057A (en) Method, device, equipment and storage medium for generating human body three-dimensional model
CN113688907A (en) Model training method, video processing method, device, equipment and storage medium
CN115690382A (en) Training method of deep learning model, and method and device for generating panorama
CN114898111B (en) Pre-training model generation method and device, and target detection method and device
CN114677572B (en) Object description parameter generation method and deep learning model training method
CN114758076A (en) Training method and device for deep learning model for building three-dimensional model
CN117409431B (en) Multi-mode large language model training method, electronic equipment and storage medium
KR20220088289A (en) Apparatus and method for estimating object pose
CN117456236A (en) Zero sample classification method, device and equipment for 3D point cloud data and storage medium
CN112085842A (en) Depth value determination method and device, electronic equipment and storage medium
CN115375740A (en) Pose determination method, three-dimensional model generation method, device, equipment and medium
CN114842066A (en) Image depth recognition model training method, image depth recognition method and device
CN116310615A (en) Image processing method, device, equipment and medium
CN113920023A (en) Image processing method and device, computer readable medium and electronic device
CN113610856A (en) Method and device for training image segmentation model and image segmentation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination