CN118096978A - 3D artistic content rapid generation method based on arbitrary stylization - Google Patents

3D artistic content rapid generation method based on arbitrary stylization

Info

Publication number
CN118096978A
Authority
CN
China
Prior art keywords
content
style
artistic
feature
adaptive
Prior art date
Legal status
Pending
Application number
CN202410503092.9A
Other languages
Chinese (zh)
Inventor
邢树军
于迅博
汲鲁育
高鑫
许世鑫
刘博阳
高超
黄辉
Current Assignee
Shenzhen Zhenxiang Technology Co ltd
Beijing University of Posts and Telecommunications
Original Assignee
Shenzhen Zhenxiang Technology Co ltd
Beijing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Shenzhen Zhenxiang Technology Co ltd and Beijing University of Posts and Telecommunications
Priority to CN202410503092.9A
Publication of CN118096978A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a method for rapidly generating 3D artistic content based on arbitrary stylization. A feature grid radiance field rich in high-level semantic information is constructed from a plurality of input content images, and the storage structure of the radiance field is optimized through tensor decomposition. The content features are then adaptively enhanced in the channel and spatial dimensions, complementary multi-level style information is learned from the input artistic style image, and style transfer is performed on the feature map obtained by volume rendering of the feature grid. Finally, a decoder and the above components are jointly trained and optimized with a global quality loss function and a local detail loss function, so that stylized artistic images of the content scene can be generated rapidly at arbitrary viewing angles. Compared with the prior art, the method achieves rapid generation of high-quality, personalized 3D artistic content, is applicable to complex scenes, effectively avoids artifacts that would degrade the visual effect, and better meets application requirements.

Description

3D artistic content rapid generation method based on arbitrary stylization
Technical Field
The invention relates to the technical field of 3D content generation and non-photorealistic stylized rendering, and in particular to a method for rapidly generating 3D artistic content based on arbitrary stylization.
Background
With the development of 3D visualization devices and display technologies, the demand for 3D content keeps increasing. However, creating 3D content requires high labor and time costs, and the creation and personalized generation of 3D artistic content is even more difficult. In practical applications, stylization combines the content structure of an original image with another style to produce content with a different visual effect, providing a new approach to artistic creation and personalized generation of 3D content.
In the prior art, stylization is mostly performed on 2D images, and directly applying such methods to 3D stylization leads to multi-view inconsistency and artifacts. Current 3D stylization therefore faces multiple difficulties, such as inconsistent views, poor stylization quality, inability to generalize to arbitrary styles, and long training times. Point-cloud-based 3D stylization methods are limited by the accuracy of the depth estimation step, often produce noticeable artifacts when stylizing complex scenes, and require a great deal of time for training and optimization. Single-style and multi-style transfer methods can only handle one or several specific styles, cannot be applied to arbitrary styles, and are therefore inconvenient for users who want to create personalized 3D artistic content. Furthermore, optimization-based methods built on neural radiance fields preserve the colors and textures of the reference style poorly: the stylized result differs visibly from the original style, and the original content details are severely lost.
Disclosure of Invention
The technical problem to be solved by the invention is to provide, in view of the shortcomings of the prior art, a method for rapidly generating 3D artistic content based on arbitrary stylization that supports personalized creation by users, is applicable to complex scenes, and avoids artifacts that would degrade the visual effect.
In order to solve the technical problems, the invention adopts the following technical scheme.
A method for rapidly generating 3D artistic content based on arbitrary stylization, comprising: step S1, constructing a feature grid neural radiance field containing high-level semantic information from a plurality of input content images, wherein the high-level semantic information consists of abstract features obtained by a deep learning network; step S2, optimizing the storage structure of the feature grid neural radiance field through tensor decomposition; step S3, performing adaptive feature enhancement in the channel and spatial dimensions on the content features to obtain an enhanced feature grid, wherein the content features are feature maps of the high-level semantic information; step S4, learning complementary multi-level style information from an input artistic style image; step S5, performing style transfer on a content feature map obtained by volume rendering of the feature grid; and step S6, jointly training and optimizing a decoder, the adaptive feature enhancement component, and the component that extracts the multi-level style information according to a global quality loss function and a local detail loss function, and then generating stylized artistic images of the content scene at arbitrary viewing angles.
Preferably, the step S1 comprises: step S10, constructing an original voxel-grid-based neural radiance field from the input content images captured at a plurality of different viewpoints, wherein the features stored in each voxel comprise the volume density and the original scene features representing color; step S11, extracting the high-level semantic information of the content images through a pre-trained convolutional neural network to reconstruct the feature grid, yielding a neural radiance field in which each voxel contains the volume density and the high-level semantic information; step S12, applying volume-adaptive instance normalization to the features of each sampling point on the feature grid and continuously updating the mean and variance during the training stage, so as to eliminate the multi-view inconsistency caused by batch differences.
Preferably, in the step S2, the feature information of the voxel grid neural radiance field is stored as vectors and matrices along the X, Y, and Z directions, and tensor decomposition is used to reduce the memory complexity.
Preferably, the step S3 comprises: step S30, performing pyramid pooling on the content features, feeding the features of different levels into separate convolution layers to obtain adaptive channel attention maps at different scales, adding and fusing these maps, and multiplying the result with the original content features to achieve multi-level structure characterization enhancement in the channel dimension; step S31, compressing the adaptively enhanced content features along the channel dimension, computing a fused adaptive spatial attention map through convolution kernels of different sizes together with down- and up-sampling operations, and multiplying it with the content features to achieve multi-scale region perception enhancement.
Preferably, the step S30 comprises: step S300, performing multi-scale pooling on the input content features to obtain regional features at different scales; step S301, feeding the multiple regional features into different convolution layers to obtain multiple attention-weighted regional features; step S302, summing the multiple attention-weighted regional features, applying a nonlinear activation, and outputting an adaptive channel attention map.
Preferably, in the step S31: the channel dimension represents the number of channels of the feature map, each channel corresponding to a filter or convolution kernel; the spatial dimension is the spatial position of a point in the feature map along the height and width dimensions.
Preferably, the step S31 comprises: step S310, channel compression: performing max pooling and average pooling on the input features along the channel dimension and concatenating the two results along the channel dimension; step S311, feature refinement: comprising a global average pooling branch and a feature fusion perception branch at different pyramid scales, wherein the global average pooling branch globally averages the output features of step S310 over the spatial dimensions, passes them through one convolution layer, and then up-samples them, and the feature fusion perception branch has a U-shaped network structure; step S312, attention output: the outputs of the two branches in step S311 are added, a nonlinear activation is applied, and an adaptive spatial attention map is output.
Preferably, the step S4 comprises: step S40, obtaining a style feature map from the artistic style image through a pre-trained convolutional neural network, and then computing the mean and variance of the style feature map; step S41, serializing the style feature map, computing the feature covariance, and applying a linear transformation through a convolution layer to obtain a style transfer matrix; the mean, variance, and style transfer matrix of the style feature map computed in these steps serve as complementary multi-level information and are used to represent the style of the artistic style image.
Preferably, the step S5 comprises: step S50, performing volume rendering of the adaptively enhanced feature grid at an arbitrary viewing angle to obtain a content feature map; step S51, combining the style information extracted from the artistic style image with the content feature map through mathematical operations to obtain a stylized feature map.
Preferably, the step S6 comprises: step S60, feeding the stylized feature map into a decoder based on a convolutional neural network to obtain a stylized image of the corresponding viewing angle in RGB space; step S61, optimizing and training the decoder, the adaptive feature enhancement component, and the component that extracts the multi-level style information through a designed global and local joint loss function; the joint loss function comprises a global style loss, a global content loss, and a local detail preservation loss based on the Laplacian matrix, each part weighted by a corresponding coefficient.
According to the method for rapidly generating 3D artistic content based on arbitrary stylization, a feature grid radiance field rich in high-level semantic information is first constructed from a plurality of input content images and its storage structure is optimized through tensor decomposition; adaptive feature enhancement in the channel and spatial dimensions is then applied to the content features, complementary multi-level style information is learned from the input artistic style image, and style transfer is performed on the feature map obtained by volume rendering of the feature grid; finally, a decoder and the above components are jointly trained and optimized according to a global quality loss function and a local detail loss function, so that stylized artistic images of the content scene can be generated rapidly at arbitrary viewing angles. Compared with the prior art, the method achieves rapid generation of high-quality, personalized 3D artistic content and is applicable to complex scenes; the generated stylized artistic images not only inherit the style of the artistic image with high fidelity but also preserve the content and structural details of complex scenes, effectively avoiding artifacts that would degrade the visual effect.
Drawings
Fig. 1 is a flowchart of a method for rapidly generating 3D artistic content based on arbitrary stylization.
Detailed Description
The invention is described in more detail below with reference to the drawings and examples.
Referring to FIG. 1, the method for rapidly generating 3D artistic content based on arbitrary stylization disclosed by the invention comprises the following steps:
Step S1, constructing a feature grid neural radiance field containing high-level semantic information from a plurality of input content images, wherein the high-level semantic information consists of abstract features obtained by a deep learning network;
Step S2, optimizing the storage structure of the feature grid neural radiance field through tensor decomposition;
Step S3, performing adaptive feature enhancement in the channel and spatial dimensions on the content features to obtain an enhanced feature grid, wherein the content features are feature maps of the high-level semantic information;
Step S4, learning complementary multi-level style information from an input artistic style image;
Step S5, performing style transfer on a content feature map obtained by volume rendering of the feature grid;
Step S6, jointly training and optimizing a decoder, the adaptive feature enhancement component, and the component that extracts the multi-level style information according to a global quality loss function and a local detail loss function, and then generating stylized artistic images of the content scene at arbitrary viewing angles.
According to the method, a feature grid radiance field rich in high-level semantic information is first constructed from a plurality of input content images and its storage structure is optimized through tensor decomposition; adaptive feature enhancement in the channel and spatial dimensions is then applied to the content features, complementary multi-level style information is learned from the input artistic style image, and style transfer is performed on the feature map obtained by volume rendering of the feature grid; finally, a decoder and the above components are jointly trained and optimized according to a global quality loss function and a local detail loss function, so that stylized artistic images of the content scene can be generated rapidly at arbitrary viewing angles. Compared with the prior art, the method achieves rapid generation of high-quality, personalized 3D artistic content and is applicable to complex scenes; the generated stylized artistic images not only inherit the style of the artistic image with high fidelity but also preserve the content and structural details of complex scenes, effectively avoiding artifacts that would degrade the visual effect.
Further, the step S1 comprises:
Step S10, constructing an original voxel-grid-based neural radiance field from the input content images captured at a plurality of different viewpoints, wherein the features stored in each voxel comprise the volume density and the original scene features representing color;
Step S11, extracting the high-level semantic information of the content images through a pre-trained convolutional neural network to reconstruct the feature grid, yielding a neural radiance field in which each voxel contains the volume density and the high-level semantic information;
Step S12, applying volume-adaptive instance normalization to the features of each sampling point on the feature grid and continuously updating the mean and variance during the training stage, so as to eliminate the multi-view inconsistency caused by batch differences.
In the step S1, a feature grid radiance field rich in high-level semantic information is constructed from a plurality of input content images; the details are as follows.
Given multi-view images of a scene and a reference style image, the goal of 3D style transfer is to generate new views of the scene in the reference style. Style transfer allows designers as well as ordinary users to create artworks in an innovative manner: by applying the styles of different drawings or artworks to an image, novel and unique artistic effects can be produced, expanding the means of creation and personalized customization.
The plurality of input content images are RGB images photographed from different viewing angles of the target scene to be stylized. These images capture the target from different directions and angles in space, and the three-dimensional structure of the scene or objects can be recovered from the multi-view images by triangulation and similar methods. For the style transfer task, the content images provide the original content and structure information; the image after style transfer retains the same content and structure as the content images but changes in style.
The number of input content images is at least two, and preferably 20 to 40.
The high-level semantic information refers to deeper feature representations in a deep learning network, usually abstract features obtained by stacking convolution and pooling layers multiple times; these features contain more abstract, semantic descriptions of the objects, textures, and shapes in an image. High-level semantic information extracted by an image classification network helps classify the input image, enabling the network to learn the semantic structure of the image rather than just surface pixel information.
The radiance field here refers to the neural radiance field (NeRF), an effective three-dimensional scene representation for 3D reconstruction and novel view synthesis. The core idea of the original neural radiance field is to encode the three-dimensional coordinates and the camera viewing direction of the scene into a multi-layer perceptron (MLP) that directly predicts RGB values and volume density; the final RGB color of each pixel is then obtained by sampling along the corresponding ray and applying volume rendering. However, the large multi-layer perceptron and dense sampling of the original radiance field make this approach computationally expensive and slow to train and infer.
The feature grid radiance field is an improvement over the original neural radiance field; its main idea is to represent the scene with a hybrid of explicit and implicit structures. For example, explicit data structures such as discrete voxel grids or hash tables are used to store features, allowing fast convergence and inference. In this method, a discrete voxel grid stores the scene information, each voxel containing the volume density and the scene features of its sampling point; continuity of the density and feature values in space is achieved through interpolation, where optional methods include trilinear interpolation, nearest-neighbor interpolation, and bilinear interpolation, with trilinear interpolation preferred.
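As an illustration of this representation, the sketch below queries a discrete feature voxel grid with trilinear interpolation so that density and feature values vary continuously in space; the grid resolution, channel count, and the use of PyTorch's grid_sample are illustrative assumptions rather than the patent's exact implementation.

```python
# Sketch: query a discrete feature voxel grid with trilinear interpolation so that
# density/feature values vary continuously in space. Grid resolution and channel
# count are illustrative assumptions, not parameters fixed by the patent.
import torch
import torch.nn.functional as F

C = 16                                    # assumed number of feature channels per voxel
grid = torch.randn(1, C, 64, 64, 64)      # (batch, channels, D, H, W) voxel grid

def query_features(points: torch.Tensor) -> torch.Tensor:
    """Trilinearly interpolate voxel features at points given in [-1, 1]^3.

    points: (N, 3) xyz coordinates; returns (N, C) interpolated features.
    """
    coords = points.view(1, -1, 1, 1, 3)   # grid_sample expects (B, D', H', W', 3)
    feats = F.grid_sample(grid, coords, mode="bilinear", align_corners=True)
    return feats.view(C, -1).t()           # "bilinear" on a 5D input is trilinear

pts = torch.rand(1024, 3) * 2 - 1          # e.g. sample positions along camera rays
print(query_features(pts).shape)           # torch.Size([1024, 16])
```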
Constructing a feature grid radiance field rich in high-level semantic information involves two stages. (1) In the first stage, an original feature grid radiance field is constructed from the plurality of input content images; optional constructions include Plenoxels, TensoRF, and the like, with Plenoxels preferred. At this point the radiance field is represented by a discrete voxel grid, and each voxel stores the volume density and the spherical harmonic coefficients of the corresponding scene sampling point. The spherical harmonic coefficients represent the original color information of the scene, which contains little semantic information, so performing feature transfer directly on it does not favor high-quality style transfer. (2) The second stage therefore reconstructs a semantic feature grid. A feature map containing the semantic information of the content images is first obtained with a common pre-trained network, where optional networks include VGGNet16, VGGNet19, AlexNet, ResNet, and the like, with VGGNet19 preferred. The multi-channel feature of any ray $\mathbf{r}$ passing through the feature grid is then obtained, in a process analogous to volume rendering, by integrating the reconstructed semantic feature components that take the place of color at each sampling point along the ray:

$$F(\mathbf{r})=\sum_{i=1}^{N} w_i\,\mathbf{f}_i,\qquad w_i=T_i\bigl(1-e^{-\sigma_i\delta_i}\bigr),\qquad T_i=\exp\Bigl(-\sum_{j=1}^{i-1}\sigma_j\delta_j\Bigr)\tag{1}$$

where $\sigma_i$ and $\mathbf{f}_i\in\mathbb{R}^{C}$ are the volume density and reconstructed feature computed by the model at sampling position $\mathbf{x}_i$, $C$ is the number of feature channels, $N$ is the total number of sampling points on one ray, $\delta_i$ is the ray step size, $T_i$ is the transmittance, and $w_i$ is the weight of sampling point $i$ on ray $\mathbf{r}$. The invention is trained with the loss functions commonly used in novel view synthesis to align the semantic features with the radiance field voxel features at arbitrary viewing angles; optional losses include the mean square error (MSE) between predicted and real RGB images, perceptual loss, and the like. Preferably, the mean square error between the predicted and real RGB images and feature maps, together with a perceptual loss that enhances the quality of the generated images, is used as the training loss of this stage:

$$\mathcal{L}_{recon}=\sum_{\mathbf{r}\in\mathcal{R}}\bigl\lVert \hat{C}(\mathbf{r})-C(\mathbf{r})\bigr\rVert_2^2+\lambda\sum_{l}\bigl\lVert \phi_l(\hat{I})-\phi_l(I)\bigr\rVert_2^2\tag{2}$$

where $\hat{C}(\mathbf{r})$ is the predicted scene value for ray $\mathbf{r}$ in the current batch $\mathcal{R}$, $\phi_l$ denotes the $l$-th layer of the pre-trained network involved in computing the perceptual loss, and $\phi_l(I)$ is the layer-$l$ output feature map of image $I$.
At this point, the construction of the feature grid radiance field containing high-level semantic information is complete: each voxel stores the volume density and semantic features. Converting the style of the radiance field in this high-level feature space rich in semantic information produces stylized results closer to the reference style, with low computational cost and strong adaptability, and the approach generalizes to arbitrary styles.
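The sketch below illustrates the feature volume rendering of Eq. (1): densities and per-point features sampled along one ray are accumulated into a single multi-channel ray feature using transmittance-based weights. The sample count, channel count, and step size are illustrative assumptions.

```python
# Sketch of the feature volume rendering in Eq. (1): densities and per-point features
# sampled along one ray are accumulated into a single multi-channel ray feature with
# transmittance-based weights. Sample count, channels, and step size are assumptions.
import torch

def render_ray_features(sigma: torch.Tensor, feats: torch.Tensor,
                        delta: torch.Tensor) -> torch.Tensor:
    """sigma: (N,) volume densities; feats: (N, C) features; delta: (N,) step sizes."""
    alpha = 1.0 - torch.exp(-sigma * delta)                    # opacity of each segment
    # T_i = prod_{j<i} (1 - alpha_j): transmittance up to (not including) sample i
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10]), dim=0)[:-1]
    weights = trans * alpha                                    # w_i = T_i * (1 - e^{-sigma_i * delta_i})
    return (weights.unsqueeze(-1) * feats).sum(dim=0)          # (C,) ray feature F(r)

N, C = 128, 16
sigma = torch.rand(N)                       # densities at the sampled positions
feats = torch.randn(N, C)                   # reconstructed semantic features
delta = torch.full((N,), 0.01)              # ray step size
print(render_ray_features(sigma, feats, delta).shape)   # torch.Size([16])
```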
In a preferred manner, in the step S2, the feature information of the voxel grid neural radiance field is stored as vectors and matrices along the X, Y, and Z directions, and tensor decomposition is used to reduce the memory complexity.
Regarding the step S2, optimizing the storage structure of the radiance field through tensor decomposition specifically involves the following.
The tensor decomposition is a mathematical technique for representing a multi-dimensional array (tensor) as a set of low-rank components that can help extract potential structures and features in the data, thereby reducing the dimensionality of the data and simplifying the analysis.
Regarding optimization of the radiance field storage structure, vector-matrix (VM) decomposition is used to relax the low-rank constraint on two modes of the tensor, decomposing the tensor into compact vector factors $\mathbf{v}_r$ and matrix factors $\mathbf{M}_r$:

$$\mathcal{T}=\sum_{r=1}^{R_1}\mathbf{v}_r^{X}\circ\mathbf{M}_r^{YZ}+\sum_{r=1}^{R_2}\mathbf{v}_r^{Y}\circ\mathbf{M}_r^{XZ}+\sum_{r=1}^{R_3}\mathbf{v}_r^{Z}\circ\mathbf{M}_r^{XY}\tag{3}$$

where $R_1$, $R_2$, and $R_3$ are three hyper-parameters whose sizes depend on the complexity of the corresponding components, and for most scenes they can be set to equal values. The tensor here refers to the information stored in each voxel of the previously constructed radiance field, namely the volume density and the semantic features; the invention applies tensor decomposition to the volume density and the semantic features separately, stores the feature grid information as vectors and matrices along the X, Y, and Z directions, and uses the decomposition to reduce the memory complexity from $O(n^{3})$ to $O(n^{2})$, greatly improving storage efficiency.
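The sketch below shows what VM storage looks like for one grid quantity: instead of a dense (D, H, W) tensor, only per-axis vectors and matching plane matrices are kept, and values are reassembled on demand. The grid resolution and the choice R1 = R2 = R3 = 16 are illustrative assumptions.

```python
# Sketch of vector-matrix (VM) storage for one grid quantity: instead of a dense
# (D, H, W) tensor, keep per-axis vectors and matching plane matrices and reassemble
# values on demand (Eq. (3)). Grid size and R1 = R2 = R3 = 16 are assumptions.
import torch

D = H = W = 64
R = 16
vX, MYZ = torch.randn(R, D), torch.randn(R, H, W)   # factors for v^X_r ∘ M^{YZ}_r
vY, MXZ = torch.randn(R, H), torch.randn(R, D, W)   # factors for v^Y_r ∘ M^{XZ}_r
vZ, MXY = torch.randn(R, W), torch.randn(R, D, H)   # factors for v^Z_r ∘ M^{XY}_r

def reconstruct() -> torch.Tensor:
    """Reassemble the full (D, H, W) tensor from its VM factors."""
    t = torch.einsum("rd,rhw->dhw", vX, MYZ)
    t = t + torch.einsum("rh,rdw->dhw", vY, MXZ)
    t = t + torch.einsum("rw,rdh->dhw", vZ, MXY)
    return t

dense = D * H * W
stored = sum(p.numel() for p in (vX, MYZ, vY, MXZ, vZ, MXY))
print(reconstruct().shape, f"{stored} stored values vs {dense} dense")  # O(n^2) vs O(n^3)
```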
The step S3 of the present invention includes:
Step S30, performing pyramid pooling on the content features, feeding the features of different levels into separate convolution layers to obtain adaptive channel attention maps at different scales, adding and fusing these maps, and multiplying the result with the original content features to achieve multi-level structure characterization enhancement in the channel dimension;
Step S31, compressing the adaptively enhanced content features along the channel dimension, computing a fused adaptive spatial attention map through convolution kernels of different sizes together with down- and up-sampling operations, and multiplying it with the content features to achieve multi-scale region perception enhancement.
Specifically, the step S30 comprises:
Step S300, performing multi-scale pooling on the input content features to obtain regional features at different scales;
Step S301, feeding the multiple regional features into different convolution layers to obtain multiple attention-weighted regional features;
Step S302, summing the multiple attention-weighted regional features, applying a nonlinear activation, and outputting an adaptive channel attention map.
Specifically, in the step S31:
the channel dimension represents the number of channels of the feature map, each channel corresponding to a filter or convolution kernel;
the spatial dimension is the spatial position of a point in the feature map along the height and width dimensions.
Further, the step S31 comprises:
Step S310, channel compression: performing max pooling and average pooling on the input features along the channel dimension and concatenating the two results along the channel dimension;
Step S311, feature refinement: comprising a global average pooling branch and a feature fusion perception branch at different pyramid scales, wherein the global average pooling branch globally averages the output features of step S310 over the spatial dimensions, passes them through one convolution layer, and then up-samples them, and the feature fusion perception branch has a U-shaped network structure;
Step S312, attention output: the outputs of the two branches in step S311 are added, a nonlinear activation is applied, and an adaptive spatial attention map is output.
In the step S3, the adaptive feature enhancement in the channel and space dimensions is performed on the content features, and the specific content includes:
The content features are the feature maps rich in high-level semantic information generated in the previous step, and they have four dimensions, B, C, H, and W: batch (B), channel (C), height (H), and width (W). Batch refers to the number of samples used in a single training iteration, also called the batch size; when training a deep neural network, multiple samples are typically used simultaneously for each parameter update. The B dimension is thus the number of samples processed by the network at a time, the C dimension is the depth of the feature map, i.e., the number of feature channels, the H dimension is the size of the feature map in the vertical direction, and the W dimension is its size in the horizontal direction.
The channel dimension of a feature map represents its number of channels, also called its depth; each channel corresponds to a filter or convolution kernel and is responsible for extracting a specific type of feature. The spatial dimension refers to the position of a point in the feature map along the height and width dimensions.
The adaptive feature enhancement in the channel and spatial dimensions is a serial processing structure: adaptive feature enhancement in the channel dimension is performed first, followed by adaptive feature enhancement in the spatial dimension.
Specifically, the adaptive feature enhancement in the channel dimension feeds the content features into the constructed multi-level structure characterization enhancement network to compute an adaptive channel attention map, which is then multiplied with the original input content features to achieve multi-level structure characterization enhancement in the channel dimension. An attention map assigns different weights to different parts of the features; weighting the original features with it makes the network focus on information that is meaningful for the task. The multi-level structure characterization enhancement network comprises three serially connected parts. (1) First, multi-scale pooling is applied to the input content features to obtain regional features at different scales. Pooling is a common operation in deep learning, used to reduce the spatial size of a feature map, lower the computational complexity, and extract features to a certain extent; optional pooling types include max pooling and average pooling, with average pooling preferred, and multi-scale means selecting different pooling window sizes. (2) The multiple regional features are fed into different convolution layers to obtain multiple attention-weighted regional features. (3) The multiple attention-weighted regional features are summed and passed through a nonlinear activation, and the adaptive channel attention map is output. Optional nonlinear activation functions include ReLU, Sigmoid, and the like, with the Sigmoid function preferred. Through adaptive feature enhancement in the channel dimension, the model is made to focus on "what" is meaningful in the given content image: by optimizing the channel weights, the stylization model attends to valuable structural levels and preserves the necessary scene structure while improving the style transfer effect in complex scenes. In addition, pyramid pooling is used for multi-scale feature extraction, collecting various key cues about the scene features and enabling the fusion of global and local context information.
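A minimal sketch of such a channel-dimension enhancement module is given below: multi-scale average pooling, a small convolution branch per scale, and a Sigmoid-activated fused attention map that reweights the original content features. The channel count, pooling scales, and the upsampling used to fuse maps of different sizes are illustrative assumptions, since the patent does not fix these details.

```python
# Sketch of the channel-dimension enhancement: multi-scale average pooling, one small
# convolution branch per scale, upsampled and summed into a Sigmoid channel attention
# map that reweights the original content features. Scales/channels are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelEnhance(nn.Module):
    def __init__(self, channels: int, scales=(1, 2, 3)):
        super().__init__()
        self.scales = scales
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Conv2d(channels, channels, 1), nn.ReLU(),
                          nn.Conv2d(channels, channels, 1))
            for _ in scales)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = torch.zeros_like(x)
        for s, branch in zip(self.scales, self.branches):
            pooled = F.adaptive_avg_pool2d(x, s)               # regional features at scale s
            attn = attn + F.interpolate(branch(pooled), size=x.shape[-2:])
        attn = torch.sigmoid(attn)                             # adaptive channel attention map
        return x * attn                                        # reweight the content features

feat = torch.randn(1, 16, 32, 32)
print(ChannelEnhance(16)(feat).shape)                          # torch.Size([1, 16, 32, 32])
```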
For adaptive feature enhancement in the spatial dimension, the content features enhanced by the channel-adaptive step are first fed into the constructed multi-scale region perception enhancement network to compute an adaptive spatial attention map; the resulting adaptive spatial attention map is then multiplied with the input content features to achieve multi-scale region perception enhancement in the spatial dimension. The multi-scale region perception enhancement network comprises three sequential stages. (1) Channel compression: max pooling and average pooling are applied to the input features along the channel dimension, and the two results are concatenated along the channel dimension. (2) Feature refinement: this stage comprises two parallel branches, a global average pooling branch and a feature fusion perception branch at different pyramid scales. The global average pooling branch first averages the output features of stage (1) globally over the spatial dimensions, passes them through one convolution layer, and then up-samples them. In the feature fusion perception branch, the down-sampling path feeds the output features of stage (1) into different convolution layers in order of decreasing kernel size while down-sampling, and the up-sampling path up-samples the convolved feature maps and fuses them by addition with the outputs of the other convolution layers. (3) Attention output: the outputs of the two branches of stage (2) are added and passed through a nonlinear activation, and the adaptive spatial attention map is output. Adaptive feature enhancement in the spatial dimension makes the model focus on "where" the informative parts are. By assigning different degrees of attention to different regions of feature maps at different scales, the invention strengthens the model's perception and characterization of particularly sensitive regions and its enhancement of key detail textures.
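Below is a minimal sketch of such a spatial-dimension enhancement module: channel-wise max and average pooling are concatenated, refined by a global-average branch and a multi-kernel fusion branch, and turned into a Sigmoid-activated spatial attention map. The 7/5/3 kernel sizes follow the embodiment; the simplified single-resolution fusion branch and the other dimensions are assumptions.

```python
# Sketch of the spatial-dimension enhancement: channel-wise max/avg pooling is
# concatenated, refined by a global-average branch plus a multi-kernel fusion branch
# (7/5/3 kernels as in the embodiment), and turned into a Sigmoid spatial attention
# map. The single-resolution fusion branch is a simplification of the U-shape.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialEnhance(nn.Module):
    def __init__(self):
        super().__init__()
        self.fuse = nn.ModuleList(
            nn.Conv2d(2 if i == 0 else 1, 1, k, padding=k // 2)
            for i, k in enumerate((7, 5, 3)))
        self.global_conv = nn.Conv2d(2, 1, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # channel compression: max- and average-pool along the channel dimension
        squeezed = torch.cat([x.max(dim=1, keepdim=True).values,
                              x.mean(dim=1, keepdim=True)], dim=1)     # (B, 2, H, W)
        g = F.adaptive_avg_pool2d(squeezed, 1)                         # global average branch
        g = F.interpolate(self.global_conv(g), size=x.shape[-2:])
        f = squeezed
        for conv in self.fuse:                                         # fusion branch, kernels 7 -> 5 -> 3
            f = conv(f)
        attn = torch.sigmoid(g + f)                                    # adaptive spatial attention map
        return x * attn

feat = torch.randn(1, 16, 32, 32)
print(SpatialEnhance()(feat).shape)                                    # torch.Size([1, 16, 32, 32])
```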
As a preferred manner, the step S4 includes:
Step S40, obtaining a style feature map from the artistic style image through a pre-trained convolutional neural network, and then computing the mean and variance of the style feature map;
Step S41, serializing the style feature map, computing the feature covariance, and applying a linear transformation through a convolution layer to obtain a style transfer matrix;
the mean, variance, and style transfer matrix of the style feature map computed in these steps serve as complementary multi-level information and are used to represent the style of the artistic style image.
With respect to the step S4, complementary multi-level style information is learned from the input artistic style image, and specific contents include:
The artistic style image is the reference style image of the stylization task; one of the key steps of the task is to extract the style features of the style image and fuse them with the content features. Style information is information extracted from the artistic style image that can represent its style; common style information includes the mean, the variance, and so on. However, such simple statistics alone cannot achieve high-quality style extraction and conversion. The complementary multi-level style information therefore refers to the mean, the variance, and the style transfer matrix extracted from the style image feature map through the constructed style extraction network. These three kinds of style information come from different levels of the style image, contain different statistical characteristics, and together represent the brightness, color distribution, texture characteristics, and so on of the style image in a comprehensive and complementary way.
Learning the complementary multi-level style information is realized through the constructed style extraction network. The artistic style image is first fed into a pre-trained convolutional neural network to obtain a style feature map; optional networks include VGGNet16, VGGNet19, AlexNet, ResNet, and the like, with VGGNet19 preferred. The mean and variance of the style feature map are then computed, the style feature map is fed into a convolution layer, the feature covariance is computed, and finally a linear layer is applied to learn the style transfer matrix $W$. The mean, variance, and style transfer matrix obtained in this process serve as complementary multi-level information representing the style of the artistic style image.
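The sketch below illustrates one way such a style extractor could look: a pre-trained VGG19 feature map yields channel-wise mean and variance, and the covariance of the flattened (serialized) features is mapped through a small learnable layer to a style transfer matrix W. The relu4_1 cut-off, the channel reduction before the covariance, and the layer sizes are assumptions for illustration.

```python
# Sketch of the style extractor: a pre-trained VGG19 feature map gives channel-wise
# mean and variance, and the covariance of the serialized (flattened) features is
# mapped by a small learnable layer to a style transfer matrix W. The relu4_1 cut,
# the channel reduction before the covariance, and all sizes are assumptions.
import torch
import torch.nn as nn
from torchvision.models import vgg19

encoder = vgg19(weights="IMAGENET1K_V1").features[:21].eval()   # layers up to relu4_1

class StyleExtractor(nn.Module):
    def __init__(self, in_channels: int = 512, reduced: int = 32):
        super().__init__()
        self.compress = nn.Conv2d(in_channels, reduced, 1)      # convolution before covariance
        self.to_matrix = nn.Linear(reduced * reduced, reduced * reduced)

    def forward(self, style_img: torch.Tensor):
        with torch.no_grad():
            f = encoder(style_img)                              # (1, C, H, W) style feature map
        mu = f.mean(dim=(2, 3), keepdim=True)                   # channel-wise mean
        var = f.var(dim=(2, 3), keepdim=True)                   # channel-wise variance
        g = self.compress(f)
        b, c, h, w = g.shape
        flat = (g - g.mean(dim=(2, 3), keepdim=True)).view(b, c, h * w)   # serialized features
        cov = flat @ flat.transpose(1, 2) / (h * w - 1)         # feature covariance (c x c)
        W = self.to_matrix(cov.view(b, -1)).view(b, c, c)       # learned style transfer matrix
        return mu, var, W

mu, var, W = StyleExtractor()(torch.rand(1, 3, 256, 256))
print(mu.shape, var.shape, W.shape)   # (1, 512, 1, 1) (1, 512, 1, 1) (1, 32, 32)
```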
The step S5 of the present invention includes:
Step S50, performing volume rendering of the adaptively enhanced feature grid at an arbitrary viewing angle to obtain a content feature map;
Step S51, combining the style information extracted from the artistic style image with the content feature map through mathematical operations to obtain a stylized feature map.
The specific flow of step S5, performing style transfer on the feature map obtained by volume rendering of the feature grid, is as follows.
The feature grid here is the refined feature grid after adaptive feature enhancement: different regions and contents carry different attention, strengthening the model's perception and characterization of particularly sensitive regions.
Regarding volume rendering, conventional volume rendering is a computer graphics and visualization technique for rendering three-dimensional volume data; by adjusting the transparency and color of the volume data it simulates the propagation of light through a medium and finally produces an image. Here, volume rendering refers to the rendering mode used in a neural radiance field, whose goal is to generate the RGB pixel value of each ray: for a given viewing angle, a ray is generated for each image pixel, samples are taken in the radiance field along the ray direction, and the corresponding pixel color is produced by integrating the color and volume density of each sampling point.
Following the volume rendering idea above, the feature grid is sampled along the ray direction, and the volume density and refined features of the sampled points are integrated to obtain the ray feature $F(\mathbf{r})$ corresponding to ray $\mathbf{r}$:

$$F(\mathbf{r})=\sum_{i=1}^{N} w_i\,\hat{\mathbf{f}}_i,\qquad w_i=T_i\bigl(1-e^{-\sigma_i\delta_i}\bigr)\tag{4}$$

where $\sigma_i$ and $\hat{\mathbf{f}}_i\in\mathbb{R}^{C}$ are the volume density and refined feature computed by the model at sampling position $\mathbf{x}_i$, $C$ is the number of feature channels, $N$ is the total number of sampling points on one ray, $\delta_i$ is the ray step size, $T_i$ is the transmittance, and $w_i$ is the weight of sampling point $i$ on ray $\mathbf{r}$.
Style transfer on the feature map means transferring the multi-level style information learned in step S4 (the mean, variance, and style transfer matrix) onto the content feature map, realizing style transfer at the feature map level and finally producing the stylized radiance field through multiple optimization iterations. The specific transfer steps are: (1) multiply the style transfer matrix $W$ with the content feature map $F_c$; (2) multiply the output of (1) with the variance $\sigma_s$, then add the product of the mean $\mu_s$ and the weight map $w$, finally obtaining the stylized feature map $F_{cs}$. The weight map is two-dimensional data consisting of the weight corresponding to each pixel. The conversion process can be written as:

$$F_{cs}=\sigma_s\odot\bigl(W\otimes F_c\bigr)+w\odot\mu_s\tag{5}$$

where $\otimes$ denotes matrix multiplication and $F_c$ is the content feature map. Realizing style transfer on the feature map through these simple addition and multiplication operations avoids the traditional approach of directly transferring the style of the radiance field, reducing the computational cost and improving the running speed.
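A minimal sketch of Eq. (5) follows: the content feature map is multiplied by the style transfer matrix W, scaled by the style variance, and shifted by the style mean weighted by a per-pixel weight map. All tensor shapes below are illustrative assumptions.

```python
# Sketch of Eq. (5): the content feature map is multiplied by the style transfer
# matrix W, scaled by the style variance, and shifted by the style mean weighted by
# a per-pixel weight map. All tensor shapes below are illustrative assumptions.
import torch

def stylize_features(Fc, W, mu_s, var_s, weight_map):
    """Fc: (B, C, H, W) content features; W: (B, C, C) style transfer matrix;
    mu_s, var_s: (B, C, 1, 1) style statistics; weight_map: (B, 1, H, W)."""
    b, c, h, w = Fc.shape
    transferred = torch.bmm(W, Fc.view(b, c, h * w)).view(b, c, h, w)   # W applied per pixel
    return var_s * transferred + weight_map * mu_s                      # stylized feature map F_cs

B, C, H, W = 1, 32, 64, 64
Fcs = stylize_features(torch.randn(B, C, H, W), torch.randn(B, C, C),
                       torch.randn(B, C, 1, 1), torch.rand(B, C, 1, 1),
                       torch.rand(B, 1, H, W))
print(Fcs.shape)                                                        # torch.Size([1, 32, 64, 64])
```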
The step S6 of the present invention includes:
Step S60, feeding the stylized feature map into a decoder based on a convolutional neural network to obtain a stylized image of the corresponding viewing angle in RGB space;
Step S61, optimizing and training the decoder, the adaptive feature enhancement component, and the component that extracts the multi-level style information through a designed global and local joint loss function;
the joint loss function comprises a global style loss, a global content loss, and a local detail preservation loss based on the Laplacian matrix, each part weighted by a corresponding coefficient.
Regarding the step S6, the decoder and the above components are jointly trained and optimized according to the global quality loss function and the local detail loss function, so as to finally realize the rapid generation of the stylized artistic image under any view angle of the content scene, which specifically includes the following contents:
The decoder is a convolutional neural network constructed to convert the stylized feature map into an image in RGB space; it consists of convolution layers and nonlinear activation functions. The global quality loss function regulates the global spatial structure and overall style quality, and consists of two parts: a content loss and a style loss. The content loss $\mathcal{L}_{c}$ is the mean square error (MSE) between the features of the final output stylized image and the features of the content image:

$$\mathcal{L}_{c}=\bigl\lVert \phi(I_{cs})-\phi(I_{c})\bigr\rVert_2^2\tag{6}$$

The style loss $\mathcal{L}_{s}$ is the sum of the mean square errors between the means and between the variances of the layer-wise output features of the pre-trained VGG19, computed on the stylized image and the style image respectively:

$$\mathcal{L}_{s}=\sum_{l}\bigl\lVert \mu\bigl(\phi_l(I_{cs})\bigr)-\mu\bigl(\phi_l(I_{s})\bigr)\bigr\rVert_2^2+\bigl\lVert \sigma\bigl(\phi_l(I_{cs})\bigr)-\sigma\bigl(\phi_l(I_{s})\bigr)\bigr\rVert_2^2\tag{7}$$

where $I_{cs}$ denotes the stylized image, $I_{c}$ the input original content image, $I_{s}$ the style image, and $\phi_l(I)$ the layer-$l$ output feature map of image $I$ from the pre-trained convolutional neural network.
The local detail loss function measures the difference in local detail structure using the difference of the Laplacian matrices of the stylized image and the content image, optimizing the detail content of the stylized image through a Laplacian loss term:

$$\mathcal{L}_{lap}=\bigl\lVert \mathrm{Lap}(I_{cs})-\mathrm{Lap}(I_{c})\bigr\rVert_2^2\tag{8}$$

where the Laplacian matrix $\mathrm{Lap}(I)$ is obtained by convolving the input image $I$ with a Laplacian filter, which is commonly used for edge and contour detection.
Joint training means that the content loss, style loss, and Laplacian loss are combined through their weights to jointly optimize the training of the stylization model, so that the stylized image reflects the reference style while preserving the structural details of the original content. The total loss function of the stylization training phase is:

$$\mathcal{L}_{total}=\lambda_{c}\,\mathcal{L}_{c}+\lambda_{s}\,\mathcal{L}_{s}+\lambda_{lap}\,\mathcal{L}_{lap}\tag{9}$$
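A compact sketch of this joint objective is shown below, combining a VGG-feature content loss, a mean/variance style loss over several VGG19 layers, and a Laplacian detail loss with the 1/20/100 weights suggested in the embodiment; the particular layer used for the content loss and the Laplacian kernel definition are assumptions.

```python
# Sketch of the joint objective in Eqs. (6)-(9): VGG-feature content loss, mean/variance
# style loss over several VGG19 layers, and a Laplacian detail loss, combined with the
# 1 / 20 / 100 weights suggested in the embodiment. The layer used for the content loss
# and the Laplacian kernel are assumptions.
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

vgg = vgg19(weights="IMAGENET1K_V1").features.eval()
LAYERS = {1: "relu1_1", 6: "relu2_1", 11: "relu3_1", 20: "relu4_1"}

def vgg_feats(x):
    out, feats = x, {}
    for i, layer in enumerate(vgg):
        out = layer(out)
        if i in LAYERS:
            feats[LAYERS[i]] = out
        if i == max(LAYERS):
            break
    return feats

def laplacian(img):
    k = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]]).view(1, 1, 3, 3)
    return F.conv2d(img.mean(dim=1, keepdim=True), k, padding=1)   # Laplacian of grayscale

def total_loss(stylized, content, style, w_c=1.0, w_s=20.0, w_lap=100.0):
    fs, fc, fst = vgg_feats(stylized), vgg_feats(content), vgg_feats(style)
    loss_c = F.mse_loss(fs["relu4_1"], fc["relu4_1"])                         # Eq. (6)
    loss_s = sum(F.mse_loss(a.mean(dim=(2, 3)), b.mean(dim=(2, 3))) +
                 F.mse_loss(a.var(dim=(2, 3)), b.var(dim=(2, 3)))
                 for a, b in zip(fs.values(), fst.values()))                  # Eq. (7)
    loss_lap = F.mse_loss(laplacian(stylized), laplacian(content))            # Eq. (8)
    return w_c * loss_c + w_s * loss_s + w_lap * loss_lap                     # Eq. (9)

x = torch.rand(1, 3, 128, 128)
print(total_loss(x, torch.rand_like(x), torch.rand_like(x)).item())
```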
Regarding rapid generation: once the training phase is complete, high-quality 3D stylized artistic content of the scene can be generated by inference at any viewing angle and in any style, and the training and inference speed exceeds that of most existing 3D style transfer methods.
The invention discloses a 3D artistic content rapid generation method based on arbitrary stylization, and the following embodiments can be referred to in practical application.
Example 1
The embodiment specifically comprises the following steps:
Step 1, selecting 30 multi-view images of the target scene photographed from multiple viewing angles as the input content images; selecting the WikiArt dataset, which contains a large collection of artistic images, as the source of reference style images, taking 140 of these images as the test set and the remaining images as the training set.
Taking the content images as input, an original scene radiance field is constructed with Plenoxels; each voxel contains the volume density and the scene features of its sampling point, and continuity of the values in space is achieved through trilinear interpolation.
Using pre-trained VGGNet19 as the semantic information extraction network, the output of the content images at its ReLU3_1 layer is extracted as the semantic information, and the feature of any ray passing through the feature grid is rendered with formula (1) above. Training at this stage then uses formula (2) above as the loss function; the relu3_1 and relu4_1 layers of the pre-trained VGGNet19 are selected for computing the perceptual loss so as to better improve multi-view consistency, and this embodiment disables the influence of the viewing direction on the result, finally yielding the feature grid radiance field rich in high-level semantic information.
Step 2, the storage structure is optimized through vector-matrix (VM) decomposition as in formula (3) above. This embodiment sets the numbers of tensor components in the X, Y, and Z directions to be equal, with 192 tensor components in total; tensor decomposition reduces the memory complexity from $O(n^{3})$ to $O(n^{2})$, greatly improving storage efficiency.
Step 3, in the multi-level structure characterization enhancement network constructed in this embodiment, the content features are average-pooled at three different pooling scales so that the spatial resolutions of the multi-scale pooled output features are 1×1, 2×2, and 3×3; the pooled outputs are then fed into two sequentially connected 1×1 convolution layers, a Sigmoid function is applied as the nonlinear activation, and an adaptive channel attention map is output. The adaptive channel attention map is multiplied with the original input content features to achieve multi-level structure characterization enhancement in the channel dimension.
The multi-scale region perception enhancement network constructed in this embodiment compresses the channel dimension through max pooling and average pooling, performs feature fusion perception with three convolution kernels of different sizes, conv7×7, conv5×5, and conv3×3, adds the result to the global average pooling result, and applies a Sigmoid nonlinear activation, finally outputting an adaptive spatial attention map. The adaptive spatial attention map is multiplied with the input content features to achieve multi-scale region perception enhancement in the spatial dimension.
Step 4, a style feature map is first obtained through VGGNet19, and its mean and variance are computed; the style feature map is then fed into a convolution block formed by three sequentially connected Conv1d+ReLU modules, the feature covariance is computed, and finally a linear layer is used to learn the style transfer matrix $W$. The mean, variance, and style transfer matrix of the style feature map obtained in this way serve as complementary multi-level information representing the style of the artistic style image.
Step 5, the volume rendering process shown in formula (4) is performed on the refined feature grid after adaptive feature enhancement to obtain a content feature map. As in formula (5), the style transfer matrix $W$ is multiplied with the content feature map, the result is multiplied with the variance $\sigma_s$, and the product of the mean $\mu_s$ and the weight map $w$ is added, finally yielding the stylized feature map $F_{cs}$.
Step 6, a convolutional neural network comprising eight sequentially connected Conv2d+ReLU modules is constructed as the decoder, and the stylization model is trained with the global and local joint loss function shown in formula (9). The content loss weight $\lambda_c$ can be set to 1, the style loss weight $\lambda_s$ to 20, and the Laplacian loss weight $\lambda_{lap}$ to 100; the relu1_1, relu2_1, relu3_1, and relu4_1 layers of the pre-trained VGGNet19 participate in the loss computation. The number of training iterations can be set to 25k, and the training time on an RTX 3090 is about 4 hours.
After training is completed, this embodiment only needs the intrinsic and extrinsic camera matrices of the desired viewing angle as input to generate high-quality 3D stylized artistic content of the scene at any viewing angle by inference. The generated stylized artistic images not only inherit the style of the artistic image with high fidelity but also preserve the content detail structure of the complex scene, avoiding artifacts in complex scenes. Generating a 720p stylized artistic image at a single arbitrary viewing angle by inference takes about 4 s, and the training and inference speed exceeds that of most existing 3D style transfer methods.
The above embodiments are only preferred embodiments of the present invention, and are not intended to limit the present invention, and modifications, equivalent substitutions or improvements made within the technical scope of the present invention should be included in the scope of the present invention.

Claims (10)

1. A method for rapidly generating 3D artistic content based on arbitrary stylization, comprising:
Step S1, constructing a feature grid neural radiance field containing high-level semantic information from a plurality of input content images, wherein the high-level semantic information consists of abstract features obtained by a deep learning network;
Step S2, optimizing the storage structure of the feature grid neural radiance field through tensor decomposition;
Step S3, performing adaptive feature enhancement in the channel and spatial dimensions on the content features to obtain an enhanced feature grid, wherein the content features are feature maps of the high-level semantic information;
Step S4, learning complementary multi-level style information from an input artistic style image;
Step S5, performing style transfer on a content feature map obtained by volume rendering of the feature grid;
Step S6, jointly training and optimizing a decoder, the adaptive feature enhancement component, and the component that extracts the multi-level style information according to a global quality loss function and a local detail loss function, and then generating stylized artistic images of the content scene at arbitrary viewing angles.
2. The method for rapidly generating 3D artistic content based on arbitrary stylization according to claim 1, wherein said step S1 comprises:
Step S10, constructing an original voxel-grid-based neural radiance field from the input content images captured at a plurality of different viewpoints, wherein the features stored in each voxel comprise the volume density and the original scene features representing color;
Step S11, extracting the high-level semantic information of the content images through a pre-trained convolutional neural network to reconstruct the feature grid, yielding a neural radiance field in which each voxel contains the volume density and the high-level semantic information;
Step S12, applying volume-adaptive instance normalization to the features of each sampling point on the feature grid and continuously updating the mean and variance during the training stage, so as to eliminate the multi-view inconsistency caused by batch differences.
3. The method for rapidly generating 3D artistic content based on arbitrary stylization according to claim 2, wherein in said step S2, the feature information of the voxel grid neural radiance field is stored as vectors and matrices along the X, Y, and Z directions, and tensor decomposition is used to reduce the memory complexity.
4. The method for rapidly generating 3D artistic content based on arbitrary stylization according to claim 1, wherein said step S3 comprises:
Step S30, performing pyramid pooling on the content features, feeding the features of different levels into separate convolution layers to obtain adaptive channel attention maps at different scales, adding and fusing these maps, and multiplying the result with the original content features to achieve multi-level structure characterization enhancement in the channel dimension;
Step S31, compressing the adaptively enhanced content features along the channel dimension, computing a fused adaptive spatial attention map through convolution kernels of different sizes together with down- and up-sampling operations, and multiplying it with the content features to achieve multi-scale region perception enhancement.
5. The method for rapidly generating 3D artistic content based on arbitrary stylization according to claim 4, wherein said step S30 comprises:
Step S300, performing multi-scale pooling on the input content features to obtain regional features at different scales;
Step S301, feeding the multiple regional features into different convolution layers to obtain multiple attention-weighted regional features;
Step S302, summing the multiple attention-weighted regional features, applying a nonlinear activation, and outputting an adaptive channel attention map.
6. The method for rapidly generating 3D artistic content based on arbitrary stylization according to claim 4, wherein in said step S31:
the channel dimension represents the number of channels of the feature map, each channel corresponding to a filter or convolution kernel;
the spatial dimension is the spatial position of a point in the feature map along the height and width dimensions.
7. The method for rapidly generating 3D artistic content based on arbitrary stylization according to claim 4, wherein said step S31 comprises:
Step S310, channel compression: performing max pooling and average pooling on the input features along the channel dimension and concatenating the two results along the channel dimension;
Step S311, feature refinement: comprising a global average pooling branch and a feature fusion perception branch at different pyramid scales, wherein the global average pooling branch globally averages the output features of step S310 over the spatial dimensions, passes them through one convolution layer, and then up-samples them, and the feature fusion perception branch has a U-shaped network structure;
Step S312, attention output: the outputs of the two branches in step S311 are added, a nonlinear activation is applied, and an adaptive spatial attention map is output.
8. The method for rapidly generating 3D artistic content based on arbitrary stylization according to claim 1, wherein said step S4 comprises:
Step S40, obtaining a style feature map from the artistic style image through a pre-trained convolutional neural network, and then computing the mean and variance of the style feature map;
Step S41, serializing the style feature map, computing the feature covariance, and applying a linear transformation through a convolution layer to obtain a style transfer matrix;
the mean, variance, and style transfer matrix of the style feature map computed in these steps serve as complementary multi-level information and are used to represent the style of the artistic style image.
9. The method for rapidly generating 3D artistic content based on arbitrary stylization according to claim 1, wherein said step S5 comprises:
Step S50, performing volume rendering of the adaptively enhanced feature grid at an arbitrary viewing angle to obtain a content feature map;
Step S51, combining the style information extracted from the artistic style image with the content feature map through mathematical operations to obtain a stylized feature map.
10. The method for rapidly generating 3D artistic content based on arbitrary stylization according to claim 9, wherein said step S6 comprises:
Step S60, feeding the stylized feature map into a decoder based on a convolutional neural network to obtain a stylized image of the corresponding viewing angle in RGB space;
Step S61, optimizing and training the decoder, the adaptive feature enhancement component, and the component that extracts the multi-level style information through a designed global and local joint loss function;
the joint loss function comprises a global style loss, a global content loss, and a local detail preservation loss based on the Laplacian matrix, each part weighted by a corresponding coefficient.
Application CN202410503092.9A, priority and filing date 2024-04-25: 3D artistic content rapid generation method based on arbitrary stylization. Published as CN118096978A; status: Pending.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410503092.9A CN118096978A (en) 2024-04-25 2024-04-25 3D artistic content rapid generation method based on arbitrary stylization


Publications (1)

Publication Number Publication Date
CN118096978A true CN118096978A (en) 2024-05-28

Family

ID=91163294

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410503092.9A Pending CN118096978A (en) 2024-04-25 2024-04-25 3D artistic content rapid generation method based on arbitrary stylization

Country Status (1)

Country Link
CN (1) CN118096978A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220319055A1 (en) * 2021-03-31 2022-10-06 Sony Group Corporation Multiview neural human prediction using implicit differentiable renderer for facial expression, body pose shape and clothes performance capture
CN114240735A (en) * 2021-11-17 2022-03-25 西安电子科技大学 Method, system, storage medium, computer device and terminal for transferring any style
CN117237501A (en) * 2023-08-25 2023-12-15 三江学院 Hidden stylized new view angle synthesis method
CN116934936A (en) * 2023-09-19 2023-10-24 成都索贝数码科技股份有限公司 Three-dimensional scene style migration method, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhang Yue; Liu Caiyun; Xiong Jie: "Design and Analysis of an Image Style Transfer Algorithm Based on VGG-19", Information Technology and Informatization, no. 01, 10 February 2020 (2020-02-10) *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination