CN115049780A - Deep rendering model training method and device, and target rendering method and device - Google Patents

Deep rendering model training method and device, and target rendering method and device

Info

Publication number
CN115049780A
CN115049780A · CN202210582297.1A
Authority
CN
China
Prior art keywords
tensor
module
rendering
machine learning
control parameters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210582297.1A
Other languages
Chinese (zh)
Inventor
申童
张炜
梅涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Shangke Information Technology Co Ltd filed Critical Beijing Jingdong Shangke Information Technology Co Ltd
Priority to CN202210582297.1A priority Critical patent/CN115049780A/en
Publication of CN115049780A publication Critical patent/CN115049780A/en
Priority to PCT/CN2023/075887 priority patent/WO2023226479A1/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00: 3D [Three Dimensional] image rendering
    • G06T15/10: Geometric effects
    • G06T15/20: Perspective computation
    • G06T15/205: Image-based rendering
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • G06N20/20: Ensemble learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Graphics (AREA)
  • Medical Informatics (AREA)
  • Geometry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Image Analysis (AREA)
  • Image Generation (AREA)

Abstract

The disclosure provides a depth rendering model training method and device and a target rendering method and device, and relates to the field of artificial intelligence. The depth rendering model training method includes: rendering a target to be rendered using preset control parameters to generate a sample image frame; inputting the control parameters into a machine learning model; inputting a to-be-processed tensor into the machine learning model, so that the machine learning model renders the to-be-processed tensor according to the control parameters to generate a rendered image frame; determining a loss function value from the rendered image frame and the sample image frame; and training the machine learning model and the to-be-processed tensor with the loss function value to generate a depth rendering model and a target tensor. The target rendering method includes: acquiring control parameters associated with a rendering target; and inputting the control parameters into the depth rendering model, so that the depth rendering model renders the target tensor according to the control parameters to generate image frames. The rendering time cost can be effectively reduced.

Description

Deep rendering model training method and device, and target rendering method and device
Technical Field
The disclosure relates to the field of artificial intelligence, in particular to a depth rendering model training method and device and a target rendering method and device.
Background
CG (Computer Graphics) is a technology that uses computers and mathematical algorithms to convert three-dimensional objects into images or videos. With the growth of computing power, CG technology is widely used in movies, games, advertisements, interactive systems, and other scenarios. A CG pipeline generally involves a series of operations such as modeling, material design, lighting setup, and rendering, and the final picture quality depends on the performance of each of these stages.
CG character models are typically complex to design. Showing a realistic character requires fine modeling, detailed textures, a complex hair system, and a powerful rendering engine as support. At movie-industry quality, rendering each frame takes a great deal of effort; a single frame may take tens of hours or more. In interactive scenarios, however, the character needs to be controlled in real time, that is, control signals drive the character's expressions, gestures, poses, and so on, and the renderer must generate high-quality pictures in real time to form an interactive video.
Disclosure of Invention
The inventors have noticed that although a highly accurate character model can be created with CG technology and very realistic images can be rendered, the rendering time cost increases accordingly, so the approach cannot be applied to scenarios that require real-time performance.
Accordingly, the present disclosure provides a deep rendering model training scheme, which can effectively reduce the rendering time cost while ensuring the rendering effect of the target character.
According to a first aspect of the embodiments of the present disclosure, there is provided a depth rendering model training method, including: rendering the target to be rendered by using preset control parameters to generate a sample image frame; inputting the control parameters into a machine learning model; inputting a tensor to be processed into the machine learning model so as to render the tensor to be processed according to the control parameters by using the machine learning model to generate a rendered image frame; determining a loss function value according to the rendering image frame and the sample image frame; and training the machine learning model and the tensor to be processed by utilizing the loss function value so as to generate a depth rendering model and a target tensor.
In some embodiments, the machine learning model comprises M machine learning submodels, M being a positive integer; the rendering the to-be-processed tensor according to the control parameters by using the machine learning model includes: inputting an ith feature tensor into an ith machine learning submodel, so as to render the ith feature tensor by using the ith machine learning submodel according to the control parameters and output an (i+1)th feature tensor, wherein i is greater than or equal to 1 and less than or equal to M, and the 1st feature tensor is the tensor to be processed; and taking the (M+1)th feature tensor output by the Mth machine learning submodel as the rendered image frame.
In some embodiments, each machine learning submodel includes: a fusion module, a rendering module, a nonlinear operation module, and a resolution enhancement module; the rendering the ith feature tensor according to the control parameters by using the ith machine learning submodel includes: in the ith machine learning submodel, processing the ith feature tensor by using the fusion module according to the control parameters so as to output a 1st intermediate tensor; processing the 1st intermediate tensor with the rendering module to output a 2nd intermediate tensor; processing the 2nd intermediate tensor with the nonlinear operation module to output a 3rd intermediate tensor; and processing the 3rd intermediate tensor with the resolution enhancement module to output the (i+1)th feature tensor.
In some embodiments, the fusion module is a normalization layer module; the rendering module is a convolutional layer module; the nonlinear operation module is an activation layer module; and the resolution enhancement module is an upsampling layer module.
In some embodiments, said inputting said control parameters into a machine learning model comprises: processing the control parameters by utilizing a multilayer perceptron to generate hidden variables; and inputting the hidden variables into a fusion module in each machine learning submodel.
In some embodiments, the machine learning model further comprises a channel matching module; the using the (M+1)th feature tensor output by the Mth machine learning submodel as the rendered image frame comprises: inputting the (M+1)th feature tensor output by the Mth machine learning submodel into the channel matching module, so as to perform channel matching processing on the (M+1)th feature tensor by using the channel matching module to generate the rendered image frame.
In some embodiments, the channel matching module is a convolutional layer module.
In some embodiments, the control parameters include at least one of a pose, a gesture, or an expression of the target.
In some embodiments, the loss function value comprises at least one of a pixel loss value, a perceptual loss value, a feature matching loss value, or an adversarial loss value.
According to a second aspect of the embodiments of the present disclosure, there is provided a depth rendering model training apparatus, including: the first training processing module is configured to render the target to be rendered by using preset control parameters so as to generate a sample image frame; a second training processing module configured to input the control parameters into a machine learning model; a third training processing module configured to input a to-be-processed tensor into the machine learning model, to render the to-be-processed tensor according to the control parameters by using the machine learning model to generate a rendered image frame, to determine a loss function value according to the rendered image frame and a sample image frame, and to train the machine learning model and the to-be-processed tensor by using the loss function value to generate a depth rendering model and a target tensor.
In some embodiments, the machine learning model comprises M machine learning submodels, M being a positive integer; the third training processing module is configured to input an ith feature tensor into an ith machine learning submodel, so that the ith feature tensor is rendered by the ith machine learning submodel according to the control parameters to output an (i+1)th feature tensor, wherein i is greater than or equal to 1 and less than or equal to M, and the 1st feature tensor is the to-be-processed tensor, and to take the (M+1)th feature tensor output by the Mth machine learning submodel as the rendered image frame.
In some embodiments, each machine learning submodel includes: a fusion module, a rendering module, a nonlinear operation module, and a resolution enhancement module; the third training processing module is configured to, in the ith machine learning submodel, process the ith feature tensor according to the control parameters by using the fusion module to output a 1st intermediate tensor, process the 1st intermediate tensor by using the rendering module to output a 2nd intermediate tensor, process the 2nd intermediate tensor by using the nonlinear operation module to output a 3rd intermediate tensor, and process the 3rd intermediate tensor by using the resolution enhancement module to output the (i+1)th feature tensor.
In some embodiments, the fusion module is a normalization layer module; the rendering module is a convolutional layer module; the nonlinear operation module is an activation layer module; and the resolution enhancement module is an upsampling layer module.
In some embodiments, the second training processing module is configured to process the control parameters using a multi-layered perceptron to generate hidden variables, which are input to the fusion module in each of the machine learning submodels.
In some embodiments, the machine learning model further comprises a channel matching module; the third training processing module is configured to input the (M+1)th feature tensor output by the Mth machine learning submodel into the channel matching module, so as to perform channel matching processing on the (M+1)th feature tensor by using the channel matching module to generate the rendered image frame.
In some embodiments, the channel matching module is a convolutional layer module.
In some embodiments, the control parameters include at least one of a pose, a gesture, or an expression of the target.
In some embodiments, the loss function value comprises at least one of a pixel loss value, a perceptual loss value, a feature matching loss value, or an adversarial loss value.
According to a third aspect of the embodiments of the present disclosure, there is provided a depth rendering model training apparatus, including: a memory configured to store instructions; and a processor coupled to the memory, the processor configured to perform the training method of any one of the embodiments described above based on instructions stored by the memory.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a target rendering method, including: acquiring a control parameter associated with a rendering target; inputting control parameters into a depth rendering model, so as to render a target tensor according to the control parameters by using the depth rendering model to generate an image frame, wherein the depth rendering model and the target tensor are obtained by training by using the training method of any one of the above embodiments.
According to a fifth aspect of embodiments of the present disclosure, there is provided a target rendering apparatus including: a first rendering processing module configured to acquire a control parameter associated with a rendering target; and a second rendering processing module configured to input the control parameters into a depth rendering model, so as to render a target tensor according to the control parameters by using the depth rendering model to generate an image frame, wherein the depth rendering model and the target tensor are trained by using the training method described in any one of the above embodiments.
According to a sixth aspect of embodiments of the present disclosure, there is provided a target rendering apparatus including: a memory configured to store instructions; and a processor coupled to the memory, the processor configured to perform the target rendering method described in any one of the above embodiments based on instructions stored by the memory.
According to a seventh aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, in which computer instructions are stored, and the instructions, when executed by a processor, implement the method according to any one of the embodiments described above.
Other features of the present disclosure and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present disclosure, and that other drawings may be obtained from them by those skilled in the art without inventive labor.
Fig. 1 is a schematic flow chart of a depth rendering model training method according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a machine learning model according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a structure of a machine learning submodel according to an embodiment of the disclosure;
FIG. 4 is a schematic structural diagram of a machine learning model according to another embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a model training framework according to an embodiment of the present disclosure;
FIG. 6 is a schematic structural diagram of a depth rendering model training apparatus according to an embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of a depth rendering model training apparatus according to another embodiment of the present disclosure;
FIG. 8 is a flowchart illustrating a target rendering method according to an embodiment of the disclosure;
FIG. 9 is a model deployment framework diagram of one embodiment of the present disclosure;
FIG. 10 is a schematic structural diagram of a target rendering apparatus according to an embodiment of the present disclosure;
fig. 11 is a schematic structural diagram of a target rendering apparatus according to another embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely with reference to the accompanying drawings. Obviously, the embodiments described are only some, rather than all, of the embodiments of the present disclosure. The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses. All other embodiments obtained by a person skilled in the art from the embodiments disclosed herein without creative effort shall fall within the protection scope of the present disclosure.
The relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
Fig. 1 is a schematic flow chart diagram of a deep rendering model training method according to an embodiment of the present disclosure. In some embodiments, the following depth rendering model training method is performed by a depth rendering model training apparatus.
In step 101, a target to be rendered is rendered using preset control parameters to generate a sample image frame.
In some embodiments, the 3D mesh output by the CG model is provided to a renderer, which renders the 3D mesh with preset control parameters to generate a sample image frame.
For example, the control parameters include at least one of a pose, a gesture, or an expression of the target.
At step 102, control parameters are input into the machine learning model.
In some embodiments, the control parameters are processed by an MLP (Multi-Layer Perceptron) to generate hidden variables, and the hidden variables are input into the machine learning model, so as to improve the expressive power of the control parameters.
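For illustration only, the following is a minimal sketch (in PyTorch, which the patent does not mandate) of how control parameters could be mapped to a hidden variable by an MLP. The class name, layer counts, and dimensions are assumptions, not the patent's specification.

```python
import torch
import torch.nn as nn

class ControlMLP(nn.Module):
    """Map a control-parameter vector (pose/gesture/expression) to a hidden variable."""
    def __init__(self, ctrl_dim=64, hidden_dim=256, latent_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(ctrl_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, latent_dim),
        )

    def forward(self, control_params):      # control_params: (B, ctrl_dim)
        return self.net(control_params)     # hidden variable: (B, latent_dim)
```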
In step 103, the tensor to be processed is input into the machine learning model, so that the tensor to be processed is rendered by the machine learning model according to the control parameters to generate a rendered image frame.
In some embodiments, the to-be-processed tensor is randomly generated.
In some embodiments, the machine learning model includes M machine learning submodels, M being a positive integer.
In the rendering process, the ith feature tensor is input into the ith machine learning submodel, so that the ith feature tensor is rendered by the ith machine learning submodel according to the control parameters to output the (i+1)th feature tensor, where i is greater than or equal to 1 and less than or equal to M, and the 1st feature tensor is the to-be-processed tensor. The (M+1)th feature tensor output by the Mth machine learning submodel is then taken as the rendered image frame.
Fig. 2 is a schematic structural diagram of a machine learning model according to an embodiment of the present disclosure. As an example, in fig. 2, the machine learning model includes 3 machine learning submodels. The randomly generated to-be-processed tensor is input into the 1st machine learning submodel 21 as the 1st feature tensor, so that the 1st feature tensor is rendered according to the control parameters by the 1st machine learning submodel 21 to output the 2nd feature tensor.
Next, the 2nd feature tensor is input to the 2nd machine learning submodel 22, so that the 2nd feature tensor is rendered according to the control parameters by the 2nd machine learning submodel 22 to output the 3rd feature tensor.
Next, the 3rd feature tensor is input to the 3rd machine learning submodel 23, so that the 3rd feature tensor is rendered according to the control parameters by the 3rd machine learning submodel 23 to output the 4th feature tensor. The 4th feature tensor is then taken as the rendered image frame.
It should be noted that increasing the number of machine learning submodels increases the depth of the machine learning model and thereby improves its representational capability. However, as the number of machine learning submodels increases, the time required for training and inference grows and more computing resources are occupied. The number of machine learning submodels in the machine learning model can therefore be adjusted according to the specific use case.
Fig. 3 is a schematic structural diagram of a machine learning submodel according to an embodiment of the disclosure. As shown in fig. 3, each machine learning submodel includes: a fusion module 31, a rendering module 32, a non-linear operation module 33 and a resolution enhancement module 34.
For example, when each machine learning submodel renders the received feature tensor according to the control parameters, the fusion module 31 first processes the received feature tensor according to the control parameters to output the 1st intermediate tensor, thereby fusing the feature tensor with the control parameters.
In some embodiments, the fusion module is a normalization layer module. For example, the Normalization layer used here is an AdaIN (Adaptive Instance Normalization) layer.
Next, the rendering module 32 processes the 1st intermediate tensor to output the 2nd intermediate tensor, implementing the rendering of the intermediate tensor.
In some embodiments, the rendering module is a convolutional layer module.
Next, the nonlinear operation module 33 processes the 2nd intermediate tensor to output the 3rd intermediate tensor, so as to enhance the representational capability of the intermediate tensor.
In some embodiments, the nonlinear operation module is an activation layer module.
Next, the resolution enhancement module 34 processes the 3rd intermediate tensor to output the next feature tensor, so as to enhance the resolution of the feature tensor.
In some embodiments, the resolution enhancement module is an upsampling layer module.
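As an illustration, the sketch below assembles one such submodel from the four modules named above: AdaIN-style fusion conditioned on the hidden variable, a 3x3 convolution as the rendering module, a LeakyReLU activation, and 2x bilinear upsampling. The class name, channel sizes, and the exact AdaIN formulation are assumptions rather than the patent's prescribed implementation.

```python
import torch
import torch.nn as nn

class RenderBlock(nn.Module):
    """One machine learning submodel: fusion -> rendering -> nonlinearity -> upsampling."""
    def __init__(self, in_ch, out_ch, latent_dim=512):
        super().__init__()
        # Fusion module: AdaIN-style, per-channel scale/shift predicted from the hidden variable
        self.norm = nn.InstanceNorm2d(in_ch, affine=False)
        self.to_scale = nn.Linear(latent_dim, in_ch)
        self.to_shift = nn.Linear(latent_dim, in_ch)
        # Rendering module (convolutional layer), nonlinear operation module (activation),
        # resolution enhancement module (upsampling layer)
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.act = nn.LeakyReLU(0.2)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)

    def forward(self, feat, latent):
        # 1st intermediate tensor: fuse the feature tensor with the control information
        scale = self.to_scale(latent).unsqueeze(-1).unsqueeze(-1)
        shift = self.to_shift(latent).unsqueeze(-1).unsqueeze(-1)
        x = self.norm(feat) * (1 + scale) + shift
        x = self.conv(x)      # 2nd intermediate tensor
        x = self.act(x)       # 3rd intermediate tensor
        return self.up(x)     # next feature tensor at doubled resolution
```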
Fig. 4 is a schematic structural diagram of a machine learning model according to another embodiment of the present disclosure. Fig. 4 differs from fig. 2 in that in the embodiment shown in fig. 4, the machine learning model further includes a channel matching module 24.
As shown in fig. 4, the 4th feature tensor output by the 3rd machine learning submodel 23 is input to the channel matching module 24, so that the channel matching module 24 performs channel matching processing on the 4th feature tensor to generate the rendered image frame, whose number of channels then meets the preset channel number requirement. For example, the channel matching module 24 performs channel matching processing on the 4th feature tensor to generate a rendered image frame with 512 channels.
In some embodiments, the channel matching module is a convolutional layer module.
In some embodiments, as shown in fig. 4, the machine learning model further includes an MLP 25. The MLP 25 processes the control parameters to increase their expressive power and provides the output to the fusion module in each machine learning submodel.
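Continuing the assumptions above (and reusing the ControlMLP and RenderBlock sketches), the following is a minimal sketch of the overall generator: a learnable to-be-processed tensor, M cascaded submodels, and a final 1x1 convolution as the channel matching module. All names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class DeepRenderer(nn.Module):
    """Cascade of M submodels driven by control parameters; the feature tensor is trained jointly."""
    def __init__(self, ctrl_dim=64, base_ch=512, num_blocks=3, start_res=16, out_ch=3):
        super().__init__()
        # To-be-processed tensor: randomly initialised, optimised together with the model
        self.feature_tensor = nn.Parameter(torch.randn(1, base_ch, start_res, start_res))
        self.mlp = ControlMLP(ctrl_dim=ctrl_dim, latent_dim=base_ch)
        chans = [base_ch // (2 ** i) for i in range(num_blocks + 1)]
        self.blocks = nn.ModuleList(
            [RenderBlock(chans[i], chans[i + 1], latent_dim=base_ch) for i in range(num_blocks)]
        )
        self.channel_match = nn.Conv2d(chans[-1], out_ch, kernel_size=1)  # channel matching module

    def forward(self, control_params):
        latent = self.mlp(control_params)                                  # hidden variable
        feat = self.feature_tensor.expand(control_params.size(0), -1, -1, -1)
        for block in self.blocks:                                          # ith -> (i+1)th feature tensor
            feat = block(feat, latent)
        return self.channel_match(feat)                                    # rendered image frame
```

After training, the learned `feature_tensor` plays the role of the target tensor, so only control parameters are needed to produce frames.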
Returning to fig. 1. In step 104, a loss function value is determined from the rendered image frame and the sample image frame.
In some embodiments, the loss function value includes at least one of a pixel loss (L1 loss) value, a perceptual loss value, a feature matching loss value, or an adversarial loss value.
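The sketch below combines the four loss terms named above under stated assumptions: the discriminator is a hypothetical network returning `(logits, list_of_features)`, the loss weights are illustrative, and the VGG-based perceptual term assumes inputs already normalised for VGG. The patent does not fix specific networks or weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class PerceptualLoss(nn.Module):
    """Perceptual loss on frozen VGG-19 features (inputs assumed VGG-normalised)."""
    def __init__(self):
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT).features[:16].eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg = vgg

    def forward(self, fake, real):
        return F.l1_loss(self.vgg(fake), self.vgg(real))

def generator_loss(fake, real, disc, perceptual=None,
                   w_pix=10.0, w_perc=10.0, w_fm=10.0, w_adv=1.0):
    """Pixel (L1) + perceptual + feature matching + adversarial terms."""
    pixel = F.l1_loss(fake, real)
    perc = perceptual(fake, real) if perceptual is not None else fake.new_zeros(())
    fake_logits, fake_feats = disc(fake)
    _, real_feats = disc(real)
    fm = sum(F.l1_loss(f, r.detach()) for f, r in zip(fake_feats, real_feats)) / len(fake_feats)
    adv = -fake_logits.mean()                      # hinge-style generator adversarial term
    return w_pix * pixel + w_perc * perc + w_fm * fm + w_adv * adv
```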
In step 105, the machine learning model and the to-be-processed tensor are trained using the loss function values to generate a depth rendering model and a target tensor.
FIG. 5 is a diagram of a model training framework according to an embodiment of the present disclosure.
As shown in fig. 5, the CG model 51 outputs a 3D mesh, and the parameter configuration module 53 configures control parameters for the renderer 52 and the machine learning model 54. The renderer 52 renders the 3D mesh with the control parameters to generate a sample image frame. The machine learning model 54 renders the randomly generated to-be-processed tensor according to the control parameters to generate a rendered image frame. A loss function value is then determined from the rendered image frame and the sample image frame, and the machine learning model and the to-be-processed tensor are trained using the loss function value to generate a depth rendering model and a target tensor.
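The following is a minimal sketch of such a training loop under assumptions: `sample_control_params` and `cg_render` are hypothetical stand-ins for the parameter configuration module and the CG renderer, the discriminator returns `(logits, features)` as assumed above, and `generator_loss` is the sketch from the previous listing. Optimiser settings are illustrative.

```python
import torch

def train(model, discriminator, perceptual, cg_render, sample_control_params,
          steps=100_000, lr=2e-4, device="cuda"):
    model.to(device)
    discriminator.to(device)
    opt_g = torch.optim.Adam(model.parameters(), lr=lr, betas=(0.5, 0.999))
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=lr, betas=(0.5, 0.999))
    for step in range(steps):
        ctrl = sample_control_params().to(device)   # pose / gesture / expression parameters
        real = cg_render(ctrl).to(device)           # sample image frame from the CG pipeline
        fake = model(ctrl)                          # rendered image frame

        # Discriminator update (hinge loss, illustrative)
        d_real, _ = discriminator(real)
        d_fake, _ = discriminator(fake.detach())
        loss_d = torch.relu(1 - d_real).mean() + torch.relu(1 + d_fake).mean()
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()

        # Generator update: combined loss, trains the model and its feature tensor jointly
        loss_g = generator_loss(fake, real, discriminator, perceptual=perceptual)
        opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```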
In the depth rendering model training method provided by the above embodiments of the present disclosure, a target to be rendered is rendered using preset control parameters to generate a sample image frame. The machine learning model and the to-be-processed tensor input into the machine learning model are then trained using the sample image frame and the preset control parameters to generate a depth rendering model and a target tensor. Target rendering can subsequently be realized rapidly using the depth rendering model and the target tensor.
Fig. 6 is a schematic structural diagram of a depth rendering model training apparatus according to an embodiment of the present disclosure. As shown in fig. 6, the depth rendering model training apparatus includes a first training process module 61, a second training process module 62, and a third training process module 63.
The first training processing module 61 is configured to render the target to be rendered using preset control parameters to generate a sample image frame.
In some embodiments, the first training processing module 61 provides the 3D mesh output by the CG model to a renderer, which renders the 3D mesh with preset control parameters to generate a sample image frame.
For example, the control parameters include at least one of a pose, a gesture, or an expression of the target.
The second training process module 62 is configured to input control parameters into the machine learning model.
In some embodiments, the MLP is used to process the control parameters to generate hidden variables, and the hidden variables are input into the machine learning model, so as to improve the expression capability of the control parameters.
The third training processing module 63 is configured to input the to-be-processed tensor into the machine learning model, to render the to-be-processed tensor according to the control parameters using the machine learning model to generate a rendered image frame, to determine a loss function value from the rendered image frame and the sample image frame, and to train the machine learning model and the to-be-processed tensor using the loss function value to generate a depth rendering model and a target tensor.
In some embodiments, the to-be-processed tensor is randomly generated.
In some embodiments, the machine learning model includes M machine learning submodels, M being a positive integer.
In the rendering process, the third training processing module 63 inputs the ith feature tensor into the ith machine learning submodel, so that the ith machine learning submodel renders the ith feature tensor according to the control parameters and outputs the (i+1)th feature tensor, where i is greater than or equal to 1 and less than or equal to M, and the 1st feature tensor is the to-be-processed tensor. The (M+1)th feature tensor output by the Mth machine learning submodel is then taken as the rendered image frame.
It should be noted that increasing the number of machine learning submodels increases the depth of the machine learning model and thereby improves its representational capability. However, as the number of machine learning submodels increases, the time required for training and inference grows and more computing resources are occupied. The number of machine learning submodels in the machine learning model can therefore be adjusted according to the specific use case.
In some embodiments, each machine learning submodel includes a fusion module, a rendering module, a nonlinear operation module, and a resolution enhancement module.
The third training processing module is configured to, in the ith machine learning submodel, process the ith feature tensor according to the control parameters by using the fusion module to output a 1st intermediate tensor, so as to fuse the feature tensor with the control parameters. The 1st intermediate tensor is processed by the rendering module to output a 2nd intermediate tensor, realizing the rendering of the intermediate tensor. The 2nd intermediate tensor is next processed by the nonlinear operation module to output the 3rd intermediate tensor, so as to enhance the representational capability of the intermediate tensor. The 3rd intermediate tensor is then processed by the resolution enhancement module to output the (i+1)th feature tensor, thereby enhancing the resolution of the feature tensor.
In some embodiments, the fusion module is a normalization layer module. For example, the normalization layer used here is an AdaIN layer. The rendering module is a convolutional layer module, the nonlinear operation module is an activation layer module, and the resolution enhancement module is an upsampling layer module.
In some embodiments, the machine learning model further comprises a channel matching module.
The third training processing module 63 is configured to input the (M+1)th feature tensor output by the Mth machine learning submodel into the channel matching module, so as to perform channel matching processing on the (M+1)th feature tensor by using the channel matching module to generate the rendered image frame, whose number of channels then meets the preset channel number requirement.
In some embodiments, the channel matching module is a convolutional layer module.
In some embodiments, the machine learning model further comprises an MLP. The third training processing module 63 processes the control parameters using the MLP to increase their expressive power and provides the output to the fusion module in each machine learning submodel.
In some embodiments, the loss function values include at least one of a pixel loss value, a perceptual loss value, a feature matching loss value, or an adversarial loss value.
Fig. 7 is a schematic structural diagram of a depth rendering model training apparatus according to another embodiment of the present disclosure. As shown in fig. 7, the depth rendering model training apparatus includes a memory 71 and a processor 72.
The memory 71 is used for storing instructions, the processor 72 is coupled to the memory 71, and the processor 72 is configured to execute the method according to any embodiment in fig. 1 based on the instructions stored in the memory.
As shown in fig. 7, the depth rendering model training apparatus further includes a communication interface 73 for information interaction with other devices. Meanwhile, the deep rendering model training device further comprises a bus 74, and the processor 72, the communication interface 73 and the memory 71 are communicated with each other through the bus 74.
The memory 71 may comprise high-speed RAM, and may also include non-volatile memory, such as at least one disk memory. The memory 71 may also be a memory array. The memory 71 may also be partitioned, and its blocks may be combined into virtual volumes according to certain rules.
Further, the processor 72 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present disclosure.
The present disclosure also relates to a computer-readable storage medium, wherein the computer-readable storage medium stores computer instructions, and the instructions, when executed by a processor, implement the method according to any one of the embodiments in fig. 1.
Fig. 8 is a flowchart illustrating a target rendering method according to an embodiment of the disclosure. In some embodiments, the following target rendering method is performed by a target rendering apparatus.
In step 801, control parameters associated with a render target are obtained.
In some embodiments, the control parameters include at least one of a pose, a gesture, or an expression of the target.
In step 802, the control parameters are input into the depth rendering model, so that the target tensor is rendered according to the control parameters by using the depth rendering model to generate an image frame. The depth rendering model and the target tensor are obtained by training by using the training method related to any embodiment in fig. 1.
FIG. 9 is a model deployment framework diagram of one embodiment of the present disclosure.
As shown in fig. 9, the driving module 91 supplies control parameters for controlling the posture, gesture, expression, and the like of the target person to the depth rendering model 92. The depth rendering model 92 renders the target tensor with the control parameters to generate an image frame. The depth rendering model 92 and the target tensor are obtained by training according to the training method related to any one of the embodiments in fig. 1.
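For illustration, and under the same assumptions as the earlier sketches, deployment reduces to feeding control parameters to the trained model; the function name and value range below are assumptions, not part of the patent.

```python
import torch

@torch.no_grad()
def render_frames(model, control_stream, device="cuda"):
    """Generate image frames from a stream of control-parameter tensors (one per frame)."""
    model.eval().to(device)
    for ctrl in control_stream:                  # e.g. pose/gesture/expression per frame
        frame = model(ctrl.to(device))           # (B, 3, H, W) rendered image frame
        yield frame.clamp(-1, 1)                 # assumed output range
```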
Fig. 10 is a schematic structural diagram of a target rendering apparatus according to an embodiment of the present disclosure. As shown in fig. 10, the target rendering apparatus includes a first rendering processing module 1001 and a second rendering processing module 1002.
The first rendering processing module 1001 acquires a control parameter associated with a rendering target.
In some embodiments, the control parameters include at least one of a pose, a gesture, or an expression of the target.
The second rendering processing module 1002 inputs the control parameters into the depth rendering model to render the target tensor according to the control parameters using the depth rendering model to generate the image frame. The depth rendering model and the target tensor are obtained by training through the training method related to any embodiment of the fig. 1.
Fig. 11 is a schematic structural diagram of a target rendering apparatus according to another embodiment of the present disclosure. As shown in fig. 11, the target rendering apparatus includes a memory 111, a processor 112, a communication interface 113, and a bus 114. Fig. 11 differs from fig. 7 in that, in the embodiment shown in fig. 11, the processor 112 is configured to perform the method according to any of the embodiments in fig. 8 based on instructions stored in the memory.
The present disclosure also relates to a computer-readable storage medium, wherein the computer-readable storage medium stores computer instructions, and the instructions, when executed by a processor, implement the method according to any one of the embodiments in fig. 8.
By implementing the scheme of the disclosure, high-quality video image frames can be generated directly by inputting control parameters that control the posture, gestures, expressions, and the like of the target person into the deep rendering network, without using the existing high-precision CG model and renderer.
In some embodiments, the functional unit modules described above can be implemented as a general-purpose processor, a programmable logic controller (PLC), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any suitable combination thereof for performing the functions described in this disclosure.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The description of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the disclosure in the form disclosed. Many modifications and variations will be apparent to practitioners skilled in this art. The embodiments were chosen and described in order to best explain the principles of the disclosure and its practical application, and to enable others of ordinary skill in the art to understand the disclosure in its various embodiments with the various modifications suited to the particular use contemplated.

Claims (23)

1. A depth rendering model training method comprises the following steps:
rendering the target to be rendered by using preset control parameters to generate a sample image frame;
inputting the control parameters into a machine learning model;
inputting a tensor to be processed into the machine learning model so as to render the tensor to be processed by using the machine learning model according to the control parameters to generate a rendered image frame;
determining a loss function value according to the rendering image frame and the sample image frame;
and training the machine learning model and the tensor to be processed by utilizing the loss function value so as to generate a depth rendering model and a target tensor.
2. The method of claim 1, wherein the machine learning model includes M machine learning submodels, M being a positive integer;
the rendering the to-be-processed tensor according to the control parameters by using the machine learning model comprises:
inputting an ith feature tensor into an ith machine learning submodel, so that the ith machine learning submodel is used for rendering the ith feature tensor according to the control parameters and outputting an (i+1)th feature tensor, wherein i is greater than or equal to 1 and less than or equal to M, and the 1st feature tensor is the tensor to be processed;
and taking the (M+1)th feature tensor output by the Mth machine learning submodel as the rendered image frame.
3. The method of claim 2, wherein each machine learning submodel comprises: a fusion module, a rendering module, a nonlinear operation module, and a resolution enhancement module;
the rendering the ith feature tensor according to the control parameters by using the ith machine learning submodel comprises:
in the ith machine learning submodel, processing the ith feature tensor by using the fusion module according to the control parameters so as to output a 1st intermediate tensor;
processing the 1st intermediate tensor with the rendering module to output a 2nd intermediate tensor;
processing the 2nd intermediate tensor with the nonlinear operation module to output a 3rd intermediate tensor;
processing the 3rd intermediate tensor with the resolution enhancement module to output the (i+1)th feature tensor.
4. The method of claim 3, wherein:
the fusion module is a normalization layer module;
the rendering module is a convolutional layer module;
the nonlinear operation module is an activation layer module;
the resolution enhancement module is an upsampling layer module.
5. The method of claim 3, wherein said inputting the control parameters into a machine learning model comprises:
processing the control parameters by utilizing a multilayer perceptron to generate hidden variables;
and inputting the hidden variables into a fusion module in each machine learning submodel.
6. The method of claim 2, wherein the machine learning model further comprises a channel matching module;
the using, as the rendered image frame, the (M + 1) th feature tensor output by the (M) th machine learning submodel includes:
inputting the (M + 1) th feature tensor output by the M machine learning submodel into the channel matching module so as to perform channel matching processing on the (M + 1) th feature tensor by using the channel matching module to generate the rendering image frame.
7. The method of claim 6, wherein,
the channel matching module is a convolutional layer module.
8. The method of claim 1, wherein,
the control parameter includes at least one of a gesture, or an expression of the target.
9. The method of any one of claims 1-8,
the loss function value includes at least one of a pixel loss value, a perceptual loss value, a feature matching loss value, or an antagonistic loss value.
10. A depth rendering model training apparatus, comprising:
the first training processing module is configured to render the target to be rendered by using preset control parameters so as to generate a sample image frame;
a second training processing module configured to input the control parameters into a machine learning model;
a third training processing module configured to input a to-be-processed tensor into the machine learning model, to render the to-be-processed tensor according to the control parameters by using the machine learning model to generate a rendered image frame, to determine a loss function value according to the rendered image frame and a sample image frame, and to train the machine learning model and the to-be-processed tensor by using the loss function value to generate a depth rendering model and a target tensor.
11. The apparatus of claim 10, wherein the machine learning model comprises M machine learning submodels, M being a positive integer;
the third training processing module is configured to input an ith feature tensor into an ith machine learning submodel, so that the ith feature tensor is rendered by the ith machine learning submodel according to the control parameters to output an (i+1)th feature tensor, wherein i is greater than or equal to 1 and less than or equal to M, and the 1st feature tensor is the to-be-processed tensor, and the (M+1)th feature tensor output by the Mth machine learning submodel is taken as the rendered image frame.
12. The apparatus of claim 11, wherein each machine learning submodel comprises: a fusion module, a rendering module, a nonlinear operation module, and a resolution enhancement module;
the third training processing module is configured to, in the ith machine learning submodel, process the ith feature tensor according to the control parameters by using the fusion module to output a 1st intermediate tensor, process the 1st intermediate tensor by using the rendering module to output a 2nd intermediate tensor, process the 2nd intermediate tensor by using the nonlinear operation module to output a 3rd intermediate tensor, and process the 3rd intermediate tensor by using the resolution enhancement module to output the (i+1)th feature tensor.
13. The apparatus of claim 12, wherein:
the fusion module is a normalization layer module;
the rendering module is a convolutional layer module;
the nonlinear operation module is an activation layer module;
the resolution enhancement module is an upsampling layer module.
14. The apparatus of claim 12, wherein,
the second training processing module is configured to process the control parameters with a multi-layered perceptron to generate hidden variables, which are input to a fusion module in each of the machine learning submodels.
15. The apparatus of claim 11, wherein the machine learning model further comprises a channel matching module;
the third training processing module is configured to input the (M+1)th feature tensor output by the Mth machine learning submodel into the channel matching module, so as to perform channel matching processing on the (M+1)th feature tensor by using the channel matching module to generate the rendered image frame.
16. The apparatus of claim 15, wherein,
the channel matching module is a convolutional layer module.
17. The apparatus of claim 10, wherein,
the control parameter includes at least one of a gesture, or an expression of the target.
18. The apparatus of any one of claims 10-17,
the loss function value includes at least one of a pixel loss value, a perceptual loss value, a feature matching loss value, or an antagonistic loss value.
19. A depth rendering model training apparatus, comprising:
a memory configured to store instructions;
a processor coupled to the memory, the processor configured to perform the method of any one of claims 1-9 based on instructions stored by the memory.
20. A method of target rendering, comprising:
acquiring a control parameter associated with a rendering target;
inputting control parameters into a depth rendering model, so as to render a target tensor according to the control parameters by using the depth rendering model to generate an image frame, wherein the depth rendering model and the target tensor are obtained by training by using the training method of any one of claims 1-9.
21. A target rendering apparatus comprising:
a first rendering processing module configured to acquire a control parameter associated with a rendering target;
a second rendering processing module configured to input control parameters into a depth rendering model, so as to render a target tensor according to the control parameters by using the depth rendering model to generate an image frame, wherein the depth rendering model and the target tensor are trained by using the training method of any one of claims 1-9.
22. A target rendering apparatus comprising:
a memory configured to store instructions;
a processor coupled to the memory, the processor configured to perform the method of claim 20 based on instructions stored by the memory.
23. A computer readable storage medium, wherein the computer readable storage medium stores computer instructions which, when executed by a processor, implement the method of any one of claims 1-9, 20.
CN202210582297.1A 2022-05-26 2022-05-26 Deep rendering model training method and device, and target rendering method and device Pending CN115049780A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210582297.1A CN115049780A (en) 2022-05-26 2022-05-26 Deep rendering model training method and device, and target rendering method and device
PCT/CN2023/075887 WO2023226479A1 (en) 2022-05-26 2023-02-14 Deep rendering model training method and apparatus, and target rendering method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210582297.1A CN115049780A (en) 2022-05-26 2022-05-26 Deep rendering model training method and device, and target rendering method and device

Publications (1)

Publication Number Publication Date
CN115049780A true CN115049780A (en) 2022-09-13

Family

ID=83160184

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210582297.1A Pending CN115049780A (en) 2022-05-26 2022-05-26 Deep rendering model training method and device, and target rendering method and device

Country Status (2)

Country Link
CN (1) CN115049780A (en)
WO (1) WO2023226479A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116091675A (en) * 2023-04-06 2023-05-09 腾讯科技(深圳)有限公司 Image processing method, device, equipment and storage medium
WO2023226479A1 (en) * 2022-05-26 2023-11-30 北京京东尚科信息技术有限公司 Deep rendering model training method and apparatus, and target rendering method and apparatus

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017095948A1 (en) * 2015-11-30 2017-06-08 Pilot Ai Labs, Inc. Improved general object detection using neural networks
US11475608B2 (en) * 2019-09-26 2022-10-18 Apple Inc. Face image generation with pose and expression control
CN112749758B (en) * 2021-01-21 2023-08-11 北京百度网讯科技有限公司 Image processing method, neural network training method, device, equipment and medium
CN113538659A (en) * 2021-07-05 2021-10-22 广州虎牙科技有限公司 Image generation method and device, storage medium and equipment
CN113810597B (en) * 2021-08-10 2022-12-13 杭州电子科技大学 Rapid image and scene rendering method based on semi-predictive filtering
CN115049780A (en) * 2022-05-26 2022-09-13 北京京东尚科信息技术有限公司 Deep rendering model training method and device, and target rendering method and device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023226479A1 (en) * 2022-05-26 2023-11-30 北京京东尚科信息技术有限公司 Deep rendering model training method and apparatus, and target rendering method and apparatus
CN116091675A (en) * 2023-04-06 2023-05-09 腾讯科技(深圳)有限公司 Image processing method, device, equipment and storage medium

Also Published As

Publication number Publication date
WO2023226479A1 (en) 2023-11-30

Similar Documents

Publication Publication Date Title
CN109670558B (en) Digital image completion using deep learning
WO2023226479A1 (en) Deep rendering model training method and apparatus, and target rendering method and apparatus
CN110969589B (en) Dynamic scene blurred image blind restoration method based on multi-stream annotating countermeasure network
US20170278308A1 (en) Image modification and enhancement using 3-dimensional object model based recognition
CN111832745A (en) Data augmentation method and device and electronic equipment
CA3137297C (en) Adaptive convolutions in neural networks
JP7432005B2 (en) Methods, devices, equipment and computer programs for converting two-dimensional images into three-dimensional images
US20160086365A1 (en) Systems and methods for the conversion of images into personalized animations
US20240020810A1 (en) UNIVERSAL STYLE TRANSFER USING MULTl-SCALE FEATURE TRANSFORM AND USER CONTROLS
CN111357018A (en) Image segmentation using neural networks
CN113052962A (en) Model training method, information output method, device, equipment and storage medium
TW202333108A (en) System and method for performing semantic image segmentation
CN115512014A (en) Method for training expression driving generation model, expression driving method and device
CN113313631B (en) Image rendering method and device
CN111814534A (en) Visual task processing method and device and electronic system
Zhang et al. Video extrapolation in space and time
CN116385607A (en) Stylized face driving method and equipment, model construction and training method and device
JP7406654B2 (en) Methods for creating a virtual environment restore of a real location
CN116266376A (en) Rendering method and device
CN112561933A (en) Image segmentation method and device
US20230162425A1 (en) Real-Time Non-Photo-Realistic Rendering
CN114820908B (en) Virtual image generation method and device, electronic equipment and storage medium
Lee et al. Real-time Object Segmentation based on GPU
WO2023003528A1 (en) Guided contextual attention map for inpainting tasks
CN116309018A (en) Image processing method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination