CN116402914A - Method, device and product for determining stylized image generation model - Google Patents

Method, device and product for determining stylized image generation model

Info

Publication number
CN116402914A
Authority
CN
China
Prior art keywords
stylized
image
radiation field
determining
guidance
Prior art date
Legal status
Granted
Application number
CN202310378089.4A
Other languages
Chinese (zh)
Other versions
CN116402914B (en)
Inventor
吴进波
刘星
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202310378089.4A
Publication of CN116402914A
Application granted
Publication of CN116402914B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00 2D [Two Dimensional] image generation
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/08 Learning methods
    • G06T 15/00 3D [Three Dimensional] image rendering
    • G06T 15/005 General purpose rendering architectures
    • G06T 17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T 19/00 Manipulating 3D models or images for computer graphics
    • G06T 19/20 Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
    • G06T 2219/00 Indexing scheme for manipulating 3D models or images for computer graphics
    • G06T 2219/20 Indexing scheme for editing of 3D models
    • G06T 2219/2024 Style variation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Graphics (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Geometry (AREA)
  • Computer Hardware Design (AREA)
  • Architecture (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides a method, an apparatus, an electronic device, a storage medium and a program product for determining a stylized image generation model, relating to the technical field of artificial intelligence, and in particular to the technical fields of computer vision, augmented reality, virtual reality, deep learning and the like. A specific implementation scheme is as follows: generating a first neural radiation field with fixed parameters and a second neural radiation field with updatable parameters based on a pre-trained neural radiation field; under the guidance of multi-modal stylized guidance data, generating guided images through the first neural radiation field and the second neural radiation field respectively, wherein the stylized guidance data characterizes the style information that the guided images are expected to reach; determining loss information according to the stylized guidance data and the guided images; and updating the parameters of the second neural radiation field according to the loss information to generate the stylized image generation model. The method and device enable the neural radiation field to be applicable to stylized guidance data of different modalities, and improve the stylized editing effect and efficiency.

Description

Method, device and product for determining stylized image generation model
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to the technical fields of computer vision, augmented reality, virtual reality, deep learning and the like, and specifically to a method and apparatus for determining a stylized image generation model, a method and apparatus for generating a stylized image, an electronic device, a storage medium and a computer program product, which can be used in autonomous driving and intelligent transportation scenarios.
Background
Technologies such as neural rendering and three-dimensional reconstruction mainly aim at recovering real scenes. For images generated by neural rendering, users often need to edit them; stylized editing is one such editing need. Currently, there is a lack of technical means for editing the scenes and objects generated by neural rendering using multi-modal data.
Disclosure of Invention
The present disclosure provides a method, apparatus, electronic device, storage medium and computer program product for determining a stylized image generation model and for generating a stylized image.
According to a first aspect, there is provided a method for determining a stylized image generation model, comprising: generating a first neural radiation field with fixed parameters and a second neural radiation field with updatable parameters based on a pre-trained neural radiation field; generating guided images through the first neural radiation field and the second neural radiation field respectively under the guidance of multi-modal stylized guidance data, wherein the stylized guidance data characterizes style information that the guided images are expected to reach; determining loss information according to the stylized guidance data and the guided images; and updating parameters of the second neural radiation field according to the loss information to generate the stylized image generation model.
According to a second aspect, there is provided a method for generating a stylized image, comprising: acquiring target stylized guidance data, wherein the target stylized guidance data characterizes style information that the stylized image to be generated is expected to reach; and generating a stylized image through a trained stylized image generation model under the guidance of the target stylized guidance data, wherein the stylized image generation model is obtained through training in any implementation manner of the first aspect.
According to a third aspect, there is provided an apparatus for determining a stylized image generation model, comprising: a first generation unit configured to generate a first neural radiation field with fixed parameters and a second neural radiation field with updatable parameters based on a pre-trained neural radiation field; a second generation unit configured to generate guided images through the first and second neural radiation fields, respectively, under the guidance of multi-modal stylized guidance data, wherein the stylized guidance data characterizes style information that the guided images are expected to reach; a determining unit configured to determine loss information from the stylized guidance data and the guided images; and an updating unit configured to update parameters of the second neural radiation field according to the loss information to generate the stylized image generation model.
According to a fourth aspect, there is provided an apparatus for generating a stylized image, comprising: an acquisition unit configured to acquire target stylized guidance data, wherein the target stylized guidance data characterizes style information that the stylized image to be generated is expected to reach; and a third generating unit configured to generate a stylized image through a trained stylized image generation model under the guidance of the target stylized guidance data, wherein the stylized image generation model is obtained through training in any implementation manner of the third aspect.
According to a fifth aspect, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described in any one of the implementations of the first and second aspects.
According to a sixth aspect, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform a method as described in any implementation of the first and second aspects.
According to a seventh aspect, there is provided a computer program product comprising: a computer program which, when executed by a processor, implements a method as described in any of the implementations of the first and second aspects.
According to the technology of the present disclosure, a method for determining a stylized image generation model is provided. For a trained neural radiation field, its output results are constrained by stylized guidance data of different modalities, so that the original neural radiation field is optimized to generate style-edited image results. This improves the applicability of the neural radiation field to stylized guidance data of different modalities, as well as the guiding effect and generation efficiency in the stylized image generation process.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is an exemplary system architecture diagram to which an embodiment according to the present disclosure may be applied;
FIG. 2 is a flow chart of one embodiment of a method for determining a stylized image generation model according to the present disclosure;
FIG. 3 is a schematic diagram of an application scenario of the method for determining a stylized image generation model according to the present embodiment;
FIG. 4 is a flow chart of yet another embodiment of a method for determining a stylized image generation model according to the present disclosure;
FIG. 5 is a flow chart of one embodiment of a method for generating a stylized image according to the present disclosure;
FIG. 6 is a block diagram of one embodiment of an apparatus for determining a stylized image generation model according to the present disclosure;
FIG. 7 is a block diagram of one embodiment of an apparatus for generating a stylized image according to the present disclosure;
FIG. 8 is a schematic diagram of a computer system suitable for use in implementing embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the technical scheme of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other processing of users' personal information comply with the provisions of relevant laws and regulations and do not violate public order and good customs.
FIG. 1 illustrates an exemplary architecture 100 in which the methods and apparatus for determining a stylized image generation model, methods and apparatus for generating a stylized image of the present disclosure may be applied.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The communication connections among the terminal devices 101, 102, 103 constitute a topology network, and the network 104 serves as the medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables.
The terminal devices 101, 102, 103 may be hardware devices or software supporting a network connection for data interaction and data processing with a server. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices supporting network connection, information acquisition, interaction, display, processing, etc., including, but not limited to, car-mounted computers, smart phones, tablet computers, electronic book readers, laptop portable computers, desktop computers, etc. When the terminal devices 101, 102, 103 are software, they can be installed in the above-listed electronic devices. It may be implemented as a plurality of software or software modules, for example, for providing distributed services, or as a single software or software module. The present invention is not particularly limited herein.
The server 105 may be a server providing various services, for example, a background processing server for constraining the image results output by the pre-trained neural radiation field through the stylized guidance data of different modalities provided by the terminal devices 101, 102, 103, and further optimizing the original neural radiation field to obtain a stylized image generation model. For another example, the background processing server generates the stylized image by generating the model from the trained stylized image under the guidance of the target stylized guidance data provided by the terminal devices 101, 102, 103. As an example, the server 105 may be a cloud server.
The server may be hardware or software. When the server is hardware, the server may be implemented as a distributed server cluster formed by a plurality of servers, or may be implemented as a single server. When the server is software, it may be implemented as a plurality of software or software modules (e.g., software or software modules for providing distributed services), or as a single software or software module. The present invention is not particularly limited herein.
It should also be noted that the method for determining the stylized image generation model and the method for generating the stylized image provided by the embodiments of the present disclosure may be performed by a server, by a terminal device, or by the server and the terminal device in cooperation with each other. Accordingly, the apparatus for determining the stylized image generation model and the apparatus for generating the stylized image may have all their parts (e.g., all units) arranged in the server, all arranged in the terminal device, or distributed between the server and the terminal device.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as required by the implementation. When the electronic device on which the method for determining the stylized image generation model or the method for generating the stylized image runs does not need to exchange data with other electronic devices, the system architecture may include only the electronic device (e.g., a server or a terminal device) on which that method runs.
Referring to fig. 2, fig. 2 is a flowchart of a method for determining a stylized image generation model according to an embodiment of the disclosure, wherein the flowchart 200 includes the steps of:
step 201, based on the pre-trained neural radiation field, generating a first neural radiation field with fixed parameters and a second neural radiation field with updatable parameters.
In this embodiment, an execution subject of the method for determining the stylized image generation model (for example, a terminal device or a server in fig. 1) may acquire a pre-trained neural radiation field from a remote location or locally through a wired or wireless network connection, and generate a first neural radiation field with fixed parameters and a second neural radiation field with updatable parameters based on the pre-trained neural radiation field.
NeRF (Neural Radiance Field) is a view synthesis method for images, used to perform novel view synthesis tasks. The novel view synthesis task refers to rendering the image corresponding to a target pose, given a source image, the corresponding source pose and the target pose. The source pose refers to the transformation matrix that converts camera coordinates into world coordinates.
NeRF introduces a fully connected neural network into the three-dimensional scene representation of an object. Using only multiple pictures of the same object from different angles as supervision, the neural network can implicitly model the three-dimensional scene of the object, and then render a two-dimensional image from a new viewing angle through rendering methods such as volume rendering. The method is widely applied in fields such as three-dimensional reconstruction, data augmentation, augmented reality and virtual reality.
Specifically, a set of three-dimensional points is first sampled along camera rays passing through the scene, each three-dimensional point having coordinates (x, y, z). Then, the sampled three-dimensional points and the associated two-dimensional viewing direction (θ, φ) are taken as inputs of a neural network (MLP, Multilayer Perceptron), which outputs the view-dependent color and the volume density corresponding to the sampled points. Finally, the output color (c = (r, g, b)) and volume density (σ) are rendered into a two-dimensional picture using classical volume rendering methods. The neural network is trained iteratively by minimizing the pixel differences between the known label picture and the rendered picture, so as to obtain the pre-trained neural radiation field.
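The sampling, MLP query and volume rendering steps above can be expressed as a minimal sketch, assuming PyTorch; the network structure, layer sizes and helper names are illustrative assumptions, not the concrete implementation of the present disclosure (positional encoding and hierarchical sampling are omitted for brevity).

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Minimal NeRF-style MLP: (x, y, z) plus viewing direction (theta, phi) in,
    view-dependent color (r, g, b) and volume density sigma out."""
    def __init__(self, hidden=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(5, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden, 1)   # volume density
        self.rgb_head = nn.Linear(hidden, 3)     # view-dependent color

    def forward(self, points, view_dirs):
        h = self.backbone(torch.cat([points, view_dirs], dim=-1))
        sigma = torch.relu(self.sigma_head(h))
        rgb = torch.sigmoid(self.rgb_head(h))
        return rgb, sigma


def volume_render(rgb, sigma, deltas):
    """Classical volume rendering along one ray: alpha-composite the sampled
    colors using transmittance accumulated from the volume densities."""
    alpha = 1.0 - torch.exp(-sigma.squeeze(-1) * deltas)               # (n_samples,)
    ones = torch.ones(1, device=alpha.device)
    trans = torch.cumprod(torch.cat([ones, 1.0 - alpha + 1e-10]), dim=0)[:-1]
    weights = alpha * trans                                            # (n_samples,)
    return (weights.unsqueeze(-1) * rgb).sum(dim=0)                    # pixel color (3,)
```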
In this embodiment, the executing body replicates the pre-trained neural radiation field to obtain two neural radiation fields, fixes the parameters of one of them to serve as the first neural radiation field, and sets the parameters of the other in an updatable state to serve as the second neural radiation field.
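A minimal sketch of this duplication step, assuming a PyTorch module such as the TinyNeRF sketched above; the variable names are illustrative.

```python
import copy

# pretrained_nerf is assumed to hold the pre-trained neural radiation field
first_field = copy.deepcopy(pretrained_nerf)
for p in first_field.parameters():       # first neural radiation field: fixed parameters
    p.requires_grad_(False)
first_field.eval()

second_field = copy.deepcopy(pretrained_nerf)
for p in second_field.parameters():      # second neural radiation field: updatable parameters
    p.requires_grad_(True)
```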
Step 202, under the guidance of the multi-modal stylized guidance data, generating guided images through the first neural radiation field and the second neural radiation field, respectively.
In this embodiment, the execution subject may generate guided images through the first neural radiation field and the second neural radiation field, respectively, under the guidance of the multi-modal stylized guidance data. The stylized guidance data characterizes the style information that the guided image is expected to reach.
The modalities include, but are not limited to, video, image, text, speech, and the like. For each modality, a plurality of training samples corresponding to that modality are prepared to form a training sample set. Taking the image modality as an example, the corresponding training sample set includes a plurality of images with different styles, including but not limited to fresh, literary, private, fashion, black-and-white, and the like. Taking the text modality as an example, the corresponding training sample set includes a plurality of texts characterizing different styles.
In this embodiment, when the stylized image generation model to be generated only needs to be guided by stylized guidance data of a single modality, guidance may be performed using only the stylized guidance data of the desired modality, and guided images are generated through the first neural radiation field and the second neural radiation field, respectively. When the stylized image generation model to be generated needs to be guided by stylized guidance data of multiple modalities, the model may be guided by the stylized guidance data of each of the multiple modalities, with guided images generated through the first neural radiation field and the second neural radiation field, respectively. For example, for the desired multiple modalities, a training order of the modalities on the neural radiation field is determined; training is then carried out in sequence according to that order, and after the training process corresponding to each modality is completed, a stylized image generation model supporting stylized guidance data of multiple modalities is obtained. The training process corresponding to each modality may be performed with reference to step 202 and the subsequent steps 203 to 204, and the implementation manners corresponding to each step.
In a specific image generation process, the first neural radiation field and the second neural radiation field each generate a guided image under the guidance of the same stylized guidance data, based on the same randomly generated camera pose. It should be noted that, since the parameters of the first neural radiation field are fixed and are not updated during training, the stylized guidance data does not actually exert a stylized guiding effect on the first neural radiation field.
Step 203, determining loss information according to the stylized guidance data and the guided images.
In this embodiment, the execution subject may determine the loss information based on the stylized guidance data and the guided images.
As an example, the execution subject may first perform feature extraction on the stylized guidance data and the guided images through a feature extraction network to obtain a guidance feature and guided features; then, the loss between the guidance feature and the guided feature corresponding to the guided image generated by the second neural radiation field is determined as the loss information.
As yet another example, the style features of the stylized guidance data and of the guided image corresponding to the second neural radiation field, together with the content features of the guided images corresponding to the first and second neural radiation fields, are first extracted through a feature extraction network; then the loss between the two style features and the loss between the two content features are determined, and the loss information is determined by combining these two losses. The style features characterize the style information of an image, and the content features characterize content information such as the objects and structures included in an image.
The feature extraction network may be a general feature extraction network supporting feature extraction for data of multiple modalities, so that the features corresponding to data of different modalities are determined through the same feature extraction network; it may also be a dedicated feature extraction network supporting feature extraction for single-modality data, so that for data of different modalities, the corresponding features are determined through the corresponding dedicated networks. Taking the general feature extraction network as an example, it may be a CLIP (Contrastive Language-Image Pre-training) network.
Step 204, updating parameters of the second neural radiation field according to the loss information to generate a stylized image generation model.
In this embodiment, the execution subject may update the parameters of the second neural radiation field according to the loss information to generate the stylized image generation model.
As an example, the above execution body may iteratively perform the following training operations: first, selecting unused stylized guidance data from the training sample set of a modality to be supported by the stylized image generation model; then, executing the above step 202, generating guided images through the first neural radiation field and the second neural radiation field, respectively, under the guidance of the selected stylized guidance data; then, executing the above step 203, determining loss information according to the stylized guidance data and the guided images; then, executing the above step 204, updating the parameters of the second neural radiation field according to the loss information.
In response to reaching a preset end condition, the final second neural radiation field is determined as the stylized image generation model. The preset end condition may be, for example, that the number of training iterations exceeds a preset threshold, that the training time exceeds a preset time threshold, or that the training loss tends to converge.
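Putting steps 201 to 204 together, the iterative training operation can be sketched as follows; the helper functions (sample_guidance, sample_random_camera_pose, render_image, compute_loss), the optimizer choice and the hyperparameters are illustrative assumptions rather than the scheme's required implementation.

```python
import torch

optimizer = torch.optim.Adam(second_field.parameters(), lr=1e-4)
max_steps = 2000                                      # preset end condition (assumed iteration budget)

for step in range(max_steps):
    guidance = sample_guidance(training_sample_set)   # unused stylized guidance data
    pose = sample_random_camera_pose()                # same pose for both fields

    with torch.no_grad():                             # parameters of the first field stay fixed
        image_fixed = render_image(first_field, pose)
    image_styled = render_image(second_field, pose)   # to be pushed toward the target style

    loss = compute_loss(guidance, image_fixed, image_styled)   # step 203
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                  # step 204: update the second field only

stylized_image_generation_model = second_field        # the final second field is the model
```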
With continued reference to fig. 3, fig. 3 is a schematic diagram 300 of an application scenario of the method for determining a stylized image generation model according to the present embodiment. In the application scenario of fig. 3, the server for determining the stylized image generation model first generates a first neural radiation field 302 with fixed parameters and a second neural radiation field 303 with updatable parameters based on the pre-trained neural radiation field 301. Then, the following training operations are performed iteratively until a preset end condition is reached, and the trained second neural radiation field 303 is determined as the stylized image generation model: first, unused stylized guidance data 304 is selected from the multi-modal training sample set required by the stylized image generation model; then, under the guidance of the selected stylized guidance data 304, guided images 305, 306 are generated through the first neural radiation field 302 and the second neural radiation field 303 obtained from the previous training operation, respectively, where the stylized guidance data characterizes the style information that the guided images are expected to reach; then, loss information 307 is determined from the stylized guidance data and the guided images; finally, the parameters of the second neural radiation field 303 are updated based on the loss information.
In this embodiment, a method for determining a stylized image generation model is provided. For a trained neural radiation field, its output results are constrained by stylized guidance data of different modalities, so that the original neural radiation field is optimized to generate style-edited image results. This improves the applicability of the neural radiation field to stylized guidance data of different modalities, as well as the guiding effect and generation efficiency in the stylized image generation process.
In some optional implementations of the present embodiment, the stylized guidance data is stylized guidance text. The stylized guidance text may be text expressed in natural language, e.g., "freshness style editing". Under the guidance of the stylized guidance text, guided images are generated through the first and second neural radiation fields, respectively.
In this implementation manner, the execution body may execute the step 203 as follows:
First, a first feature is determined according to the stylized guidance text and the original text corresponding to the guided image generated by the first neural radiation field.
As an example, for the guided image generated by the first neural radiation field, the execution subject may extract its style features and, based on these style features, determine an original text characterizing the style of that guided image; the difference between the stylized guidance text and the original text is then determined as the first feature. For the guided image generated by the first neural radiation field, high-level image features may be extracted through a deep feature extraction network and used as its style features.
As yet another example, the execution body may perform feature extraction on the guided image generated by the first neural radiation field and determine the style features corresponding to that guided image; features of the stylized guidance text are also extracted to determine the guidance style features corresponding to it; the difference between the style features and the guidance style features is then taken as the first feature.
Second, a second feature is determined from the guided images generated by the first and second neural radiation fields, respectively.
As an example, the execution subject may perform feature extraction on the two guided images, respectively, to obtain image features corresponding to the two guided images respectively; further, a difference between the image features corresponding to the two guided images is determined as the second feature.
Third, loss information between the first feature and the second feature is determined.
Specifically, the execution body may determine, as the loss information, the loss between the first feature and the second feature under a preset loss function type, for example a cosine loss or a cross-entropy loss.
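Under a cosine loss, for example, the constraint between the first feature (a direction in text feature space) and the second feature (a direction in image feature space) can be written as the short sketch below; the function name is illustrative.

```python
import torch.nn.functional as F

def directional_cosine_loss(first_feature, second_feature):
    """1 minus the cosine similarity between the text-space direction (first
    feature) and the image-space direction (second feature)."""
    return 1.0 - F.cosine_similarity(first_feature, second_feature, dim=-1).mean()
```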
In this implementation manner, the loss information is determined by making full use of the first feature, which relates the stylized guidance text to the original text corresponding to the guided image generated by the first neural radiation field, and the second feature, which relates the guided images generated by the first and second neural radiation fields. The second feature is thus constrained through the first feature, so that the guided image generated by the second neural radiation field finally has the style characterized by the stylized guidance text. This improves the accuracy of the loss information and the training efficiency of the second neural radiation field during training.
In some optional implementations of this embodiment, the executing body may execute the first step as follows: first, features of the stylized guidance text are extracted to obtain the guidance text features; then, features of the original text are extracted to obtain the original text features; finally, the direction vector between the guidance text feature and the original text feature is determined as the first feature.
The original text characterizes the guided image generated by the first neural radiation field. Since the parameters of the first neural radiation field are fixed, the direction vector between the original text feature and the guidance text feature corresponding to the stylized guidance text can serve as a reference for the second feature, i.e., the difference between the guided images generated by the first and second neural radiation fields, and thus guide the change of the second feature. Using the direction vector between the guidance text feature and the original text feature as the first feature improves the accuracy of the first feature and strengthens its guiding capability over the second feature.
In some optional implementations of this embodiment, the executing entity may determine the guidance text feature and the original text feature as follows: feature extraction is performed on the stylized guidance text through a first preset feature extraction network to obtain the guidance text features; and feature extraction is performed on the original text through the same first preset feature extraction network to obtain the original text features.
The first preset feature extraction network may be a neural network supporting a text feature extraction function, such as a CLIP network. In this implementation manner, the stylized guidance text and the original text are subjected to feature extraction through the same feature extraction network (the first preset feature extraction network), which avoids differences between the guidance text features and the original text features that would be introduced by using different feature extraction networks, and improves the accuracy of the determined first feature.
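As a concrete illustration of this step with the open-source CLIP package; the package choice and the example prompts are assumptions for the sketch, not requirements of the present disclosure.

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)

def text_direction(guidance_text, original_text):
    """First feature: direction vector from the original text feature to the
    guidance text feature, both extracted by the same CLIP text encoder."""
    tokens = clip.tokenize([guidance_text, original_text]).to(device)
    with torch.no_grad():
        feats = clip_model.encode_text(tokens).float()
        feats = feats / feats.norm(dim=-1, keepdim=True)
    return feats[0] - feats[1]

# Example prompts (assumed): the guidance text from the embodiment and a
# neutral description of the original, un-edited rendering.
first_feature = text_direction("freshness style editing", "a photo of the original scene")
```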
In some optional implementations of this embodiment, the executing body may execute the second step as follows: features of the guided images generated by the first neural radiation field and the second neural radiation field are extracted respectively, obtaining a first image feature and a second image feature; the direction vector between the first image feature and the second image feature is determined as the second feature.
Since the parameters of the first neural radiation field are fixed and the guided image it generates is not affected by the stylized guidance text, in the ideal state where the second neural radiation field has learned the style editing, the direction vector between the first image feature and the second image feature (the second feature) should be similar or identical to the first feature. Using this direction vector as the second feature improves the accuracy of the second feature and speeds up model training under the constraint of the first feature.
In some optional implementations of this embodiment, the executing entity may determine the first image feature and the second image feature as follows: features of the guided images generated by the first neural radiation field and the second neural radiation field are extracted respectively through the first preset feature extraction network, obtaining the first image feature and the second image feature.
In this implementation manner, the feature extraction network used to generate the first and second image features is the same as that used to generate the guidance text feature and the original text feature. This avoids differences between the first and second image features that would be introduced by different feature extraction networks, improving the accuracy of the second feature, and likewise avoids differences between the first feature and the second feature introduced by different networks, improving the accuracy of the loss information.
In some alternative implementations of the present embodiment, the stylized guidance data is stylized guidance speech. In this implementation manner, the execution body first converts the stylized guidance speech into the stylized guidance text, and then performs information processing according to the implementation manner corresponding to the stylized guidance text.
The conversion of the stylized guidance speech into stylized guidance text may be implemented based on speech-to-text technology. This implementation provides an information processing manner for the case where the stylized guidance data is stylized guidance speech, and improves the applicability of the stylized image generation model to speech.
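One possible realization of this conversion, assuming the open-source openai-whisper package (an assumption for illustration, not a requirement of the present disclosure):

```python
import whisper

asr_model = whisper.load_model("base")

def speech_to_guidance_text(audio_path):
    """Convert stylized guidance speech into stylized guidance text."""
    result = asr_model.transcribe(audio_path)
    return result["text"].strip()

# Hypothetical audio file name used only for the example.
guidance_text = speech_to_guidance_text("style_instruction.wav")
```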
In some optional implementations of the present embodiment, the stylized guidance data is a stylized guidance image. Stylized guidance images may have the same style or different styles, and the content of a stylized guidance image (the persons, objects and other elements it includes) is not limited. For example, different stylized guidance images may include different content but share the same style; conversely, the same content may appear in different stylized guidance images with different styles. Under the guidance of the stylized guidance image, guided images are generated through the first and second neural radiation fields, respectively.
In this implementation manner, the execution body may execute the step 203 as follows:
first, a first loss between the first and second nerve radiation fields is determined.
The parameters of the first nerve radiation field are fixed and the parameters of the second nerve radiation field are updatable. As the training operation is iterated, parameters of the second neural radiation field are gradually updated, resulting in a gradual difference between the first and second neural radiation fields.
As an example, the above-described execution subject may determine a difference between the parameter of the first nerve radiation field and the parameter of the second nerve radiation field as the first loss.
Second, a second loss between the stylized leading image and the leading image generated by the second neural radiation field is determined.
As an example, for the stylized leading image and the post-leading image generated by the second neural radiation field, the above-described execution subject extracts corresponding style features thereof, respectively; further, a difference between the stylized leading image and the corresponding style characteristics of the leading image generated by the second nerve radiation field is determined as a second loss.
Third, the loss information is determined in combination with the first loss and the second loss.
As an example, the above-described execution body may determine the loss information based on summing, weighted summing, or the like, in combination with the first loss and the second loss.
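A weighted-sum combination of the two losses, for example, can be sketched as follows; the weights are illustrative hyperparameters.

```python
def image_guided_loss(first_loss, second_loss, w_first=1.0, w_second=1.0):
    """Combine the content-preserving first loss and the style-matching second
    loss into the loss information used to update the second field."""
    return w_first * first_loss + w_second * second_loss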
This implementation provides a way of determining the loss information for the case where the stylized guidance data is a stylized guidance image, and improves the accuracy of the loss information.
In some optional implementations of this embodiment, the executing body may execute the first step by: first, determining low-level features of a guided image generated by each of the first and second neural radiation fields; the loss between the low-level features is then determined as a first loss.
The low-level features mainly characterize the content and structure of an image. In the feature extraction process for a guided image, several feature extraction operations are usually iterated, and the features obtained by the initial feature extraction operations are typically low-level features. Determining the loss between these low-level features, which characterize the content and structure of the image, as the first loss ensures that, even while the stylized guidance image influences the second neural radiation field, the guided image it generates remains consistent in content and structure with the guided image generated by the first neural radiation field, so that the content and structure of the generated guided image stay unchanged.
In some optional implementations of this embodiment, the executing entity may determine the low-level features of the guided images as follows: feature extraction is performed on the guided images generated by the first neural radiation field and the second neural radiation field through a second preset feature extraction network, and the features extracted by a low-level feature extraction layer of the second preset feature extraction network are taken as the low-level features.
The second preset feature extraction network may be a neural network supporting an image feature extraction function, such as a VGG (Visual Geometry Group) network. In this implementation manner, the same feature extraction network (the second preset feature extraction network) is used to extract features of the two guided images respectively, which avoids differences between the low-level features of the two guided images that would be introduced by different feature extraction networks, and improves the accuracy of the determined first loss. The second preset feature extraction network comprises a plurality of feature extraction layers; each feature extraction layer continues feature extraction based on the features obtained by the previous layer and sends its output features to the subsequent layer.
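With torchvision's VGG-16 as the second preset feature extraction network (assuming a recent torchvision), low-level and high-level features can be read from an early and a late convolutional block; the specific layer indices, helper names and the reuse of image_styled / image_fixed from the training sketch above are assumptions for illustration.

```python
import torch
from torchvision import models

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

LOW_LEVEL_LAYER = 8     # early block: content and structure (assumed index)
HIGH_LEVEL_LAYER = 29   # deep block: abstract style information (assumed index)

def vgg_feature(image, last_layer):
    """Run an ImageNet-normalized (1, 3, H, W) tensor through the VGG feature
    extractor up to last_layer and return that layer's activation."""
    h = image
    for idx, layer in enumerate(vgg):
        h = layer(h)
        if idx == last_layer:
            break
    return h

# First loss: distance between low-level features of the two guided images
# (image_styled from the second field, image_fixed from the first field).
first_loss = torch.nn.functional.mse_loss(
    vgg_feature(image_styled, LOW_LEVEL_LAYER),
    vgg_feature(image_fixed, LOW_LEVEL_LAYER),
)
```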
In some optional implementations of this embodiment, the executing body may execute the second step as follows: first, the corresponding high-level features of the stylized guidance image and of the guided image generated by the second neural radiation field are determined; the loss between these high-level features is then determined as the second loss.
Compared with the low-level features, the high-level features are more abstract and can characterize the respective style information of the stylized guidance image and the guided image. Determining the loss between the high-level features of the stylized guidance image and of the guided image as the second loss enables the second neural radiation field to attend to and learn the style information of the stylized guidance image, which helps improve the style editing effect of the trained second neural radiation field.
In some optional implementations of this embodiment, the executing entity may determine the high-level features of the images as follows: feature extraction is performed on the stylized guidance image and on the guided image generated by the second neural radiation field, respectively, through the second preset feature extraction network, and the features extracted by a high-level feature extraction layer of the second preset feature extraction network are taken as the high-level features.
In this implementation manner, the feature extraction network used to generate the low-level features is the same as that used to generate the high-level features. This avoids differences between the low-level features introduced by different feature extraction networks, improving the accuracy of the first loss, and likewise avoids differences between the high-level features introduced by different networks, improving the accuracy of the second loss.
In some optional implementations of the present embodiment, in order to enable the stylized image generation model to simultaneously support the editing function of stylized guidance data of multiple modalities, during the training of the stylized image generation model, guided images are generated through the first neural radiation field and the second neural radiation field respectively under the simultaneous guidance of stylized guidance data of at least two of the multiple modalities.
For example, the at least two modalities may be an image modality and a speech modality; an image modality and a text modality; or an image modality, a speech modality and a text modality.
In actual style editing, there may be cases where single-modality stylized guidance data cannot achieve a good style guidance effect; guiding the stylized image generation model simultaneously with multi-modal stylized guidance data then achieves a better guidance effect. For example, combining a stylized guidance image with stylized guidance speech may provide better guidance for the stylized editing process.
In this implementation manner, both the first neural radiation field and the second neural radiation field are guided by stylized guidance data of multiple modalities when generating the guided images, which improves the applicability of the stylized image generation model generated from the second neural radiation field to multi-modal stylized guidance data, and improves the style editing effect and efficiency of the stylized image generation model.
In some optional implementations of this embodiment, the executing body may execute the above step 203 as follows: the loss information is determined according to the stylized guidance data and the guided images corresponding to each of the at least two modalities.
For the stylized guidance data and the guided images corresponding to each modality, the execution subject may determine the loss information corresponding to that modality with reference to the examples corresponding to step 203, and then determine the final loss information by combining the losses corresponding to the modalities.
In this implementation manner, model updating during training is performed based on the loss information corresponding to the multi-modal guidance data, which improves the convergence speed of the training process while enabling the second neural radiation field to support multi-modal stylized guidance data.
In some optional implementations of this embodiment, the executing body may specifically execute the step of determining the loss information by:
first, for a text mode and/or a voice mode in at least two modes, determining loss information corresponding to the text mode and/or the voice mode according to the first characteristic and the second characteristic.
Second, combining the first loss and the second loss for the image modes in at least two modes to obtain loss information corresponding to the image modes.
Thirdly, combining the loss information corresponding to each of at least two modes to determine the loss information.
As an example, the executing entity may determine the loss information by summing, weighting, and the like, in combination with the loss information corresponding to each of the at least two modalities.
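The combination over modalities can again be a weighted sum; a sketch under that assumption, with illustrative weights:

```python
def multimodal_loss(text_loss=None, image_loss=None, w_text=1.0, w_image=1.0):
    """Combine the per-modality loss information present in this training step;
    speech guidance is assumed to have been converted to text beforehand."""
    total = 0.0
    if text_loss is not None:       # text and/or speech modality
        total = total + w_text * text_loss
    if image_loss is not None:      # image modality
        total = total + w_image * image_loss
    return total
```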
It should be noted that, for the text modality and the speech modality, the execution body may proceed with reference to the implementation manners of the loss information corresponding to the text modality and the speech modality described above; for the image modality, it may proceed with reference to the implementation manner of the loss information corresponding to the image modality, which is not repeated here.
The final loss information is determined based on the loss information corresponding to each modality, and the second neural radiation field is trained based on this final loss information to obtain the trained stylized image generation model, so that the stylized image generation model can simultaneously support stylized guidance data of multiple modalities, improving the stylized editing effect and efficiency.
With continued reference to fig. 4, there is shown a schematic flow 400 of yet another embodiment of the method for determining a stylized image generation model according to the present disclosure, comprising the steps of:
step 401, generating a first neural radiation field with fixed parameters and a second neural radiation field with updatable parameters based on the pre-trained neural radiation field.
Step 402, generating guided images through the first neural radiation field and the second neural radiation field respectively, under the guidance of stylized guidance data of at least two of the multiple modalities.
Wherein the stylized guidance data characterizes style information that the guided image is expected to reach.
Step 403, for the text modality and/or speech modality among the at least two modalities, performing feature extraction on the stylized guidance text through the first preset feature extraction network to obtain the guidance text features.
Step 404, performing feature extraction on the original text corresponding to the guided image generated by the first neural radiation field through the first preset feature extraction network to obtain the original text features.
Step 405, determining the direction vector between the guidance text feature and the original text feature as the first feature.
Step 406, performing feature extraction on the guided images generated by the first neural radiation field and the second neural radiation field through the first preset feature extraction network to obtain the first image feature and the second image feature.
In step 407, a direction vector between the first image feature and the second image feature is determined as the second feature.
At step 408, loss information between the first feature and the second feature is determined.
Step 409, for the image modality among the at least two modalities, performing feature extraction on the guided images generated by the first neural radiation field and the second neural radiation field through the second preset feature extraction network, and taking the features extracted by a low-level feature extraction layer of the second preset feature extraction network as the low-level features.
In step 410, the loss between low-level features is determined as a first loss.
In step 411, feature extraction is performed on the stylized guidance image and on the guided image generated by the second neural radiation field through the second preset feature extraction network, and the features extracted by a high-level feature extraction layer of the second preset feature extraction network are taken as the high-level features.
At step 412, the loss between the high-level features is determined to be a second loss.
Step 413, determining the loss information by combining the first loss and the second loss.
In step 414, loss information is determined in conjunction with the loss information corresponding to each of the at least two modalities.
In step 415, the parameters of the second neural radiation field are updated according to the loss information to generate the stylized image generation model.
As can be seen from this embodiment, compared with the embodiment corresponding to fig. 2, the flow 400 of the method for determining a stylized image generation model in this embodiment specifically illustrates the model training process under the simultaneous guidance of stylized guidance data of at least two modalities, which can further improve the stylized editing effect and efficiency of the model.
With continued reference to fig. 5, there is shown a schematic flow chart 500 of one embodiment of a method for generating a stylized image according to the present disclosure, including the steps of:
step 501, target stylized guidance data is acquired.
In the present embodiment, an execution subject of the method for generating a stylized image (for example, a terminal device or a server in fig. 1) may acquire target stylized guidance data from a remote location, or from a local location, through a wired network connection or a wireless network connection. Wherein the target stylized guidance data characterizes style information that a stylized image to be generated is expected to reach.
The target stylized guidance data may be single-modality stylized guidance data or multi-modal stylized guidance data, for example of the image, text or speech modality.
Step 502, generating a stylized image through the trained stylized image generation model under the guidance of the target stylized guidance data.
In this embodiment, the execution subject may generate the stylized image through the trained stylized image generation model under the guidance of the target stylized guidance data, wherein the stylized image generation model is obtained through training as described in the above embodiments 200 and 400.
As an example, the execution subject may generate the stylized image by the trained stylized image generation model based on the acquired camera pose under the guidance of the target stylized guidance data.
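At inference time, the trained model is rendered from the acquired camera pose; render_image is the same illustrative helper assumed in the training sketch, and this sketch assumes the target style is the one the model's parameters were optimized for.

```python
import torch

def generate_stylized_image(stylized_model, camera_pose):
    """Render a stylized image from the trained stylized image generation model
    (the optimized second neural radiation field) for the given camera pose."""
    with torch.no_grad():
        return render_image(stylized_model, camera_pose)
```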
In this embodiment, a method for generating a stylized image is provided, which generates the image based on the trained stylized image generation model, thereby improving the efficiency of the stylized editing process and the editing effect of the generated stylized image.
With continued reference to fig. 6, as an implementation of the method illustrated in the foregoing figures, the present disclosure provides an embodiment of an apparatus for determining a stylized image generation model, which corresponds to the method embodiment illustrated in fig. 2, and which may be particularly applicable in a variety of electronic devices.
As shown in fig. 6, an apparatus 600 for determining a stylized image generation model includes: a first generation unit 601 configured to generate a first neural radiation field with fixed parameters and a second neural radiation field with updatable parameters based on a pre-trained neural radiation field; a second generating unit 602 configured to generate guided images through the first and second neural radiation fields, respectively, under the guidance of multi-modal stylized guidance data, wherein the stylized guidance data characterizes the style information that the guided images are expected to reach; a determining unit 603 configured to determine loss information from the stylized guidance data and the guided images; and an updating unit 604 configured to update parameters of the second neural radiation field according to the loss information to generate the stylized image generation model.
In some optional implementations of the present embodiment, the stylized guidance data is a stylized guidance text, and the determining unit 603 is further configured to: determining a first feature according to the stylized guide text and an original text corresponding to the guided image generated by the first nerve radiation field; determining a second feature from the guided images generated by the first and second neural radiation fields, respectively; loss information between the first feature and the second feature is determined.
In some optional implementations of the present embodiment, the determining unit 603 is further configured to: extracting features of the stylized guide text to obtain guide text features; extracting the characteristics of the original text to obtain the characteristics of the original text; a direction vector between the leading text feature and the original text feature is determined as a first feature.
In some optional implementations of the present embodiment, the determining unit 603 is further configured to: performing feature extraction on the stylized guide text through a first preset feature extraction network to obtain guide text features; and extracting the characteristics of the original text through a first preset characteristic extraction network to obtain the characteristics of the original text.
In some optional implementations of the present embodiment, the determining unit 603 is further configured to: respectively extracting features of the guided images generated by the first nerve radiation field and the second nerve radiation field to obtain a first image feature and a second image feature; a direction vector between the first image feature and the second image feature is determined as the second feature.
In some optional implementations of the present embodiment, the determining unit 603 is further configured to: performing feature extraction on the guided images generated by the first neural radiation field and the second neural radiation field, respectively, through the first preset feature extraction network to obtain the first image feature and the second image feature.
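One common way to realize the first and second features and the loss between them is a directional loss in a joint feature space, sketched below; the 512-dimensional random vectors stand in for outputs of the first preset feature extraction network, whose concrete choice (for example a CLIP-style encoder) is an assumption and not prescribed here.

    import torch
    import torch.nn.functional as F

    def direction(feat_a, feat_b):
        """Normalized direction vector from feat_a to feat_b."""
        d = feat_b - feat_a
        return d / (d.norm(dim=-1, keepdim=True) + 1e-8)

    def directional_loss(orig_text_feat, guide_text_feat, fixed_img_feat, update_img_feat):
        text_dir = direction(orig_text_feat, guide_text_feat)   # first feature
        img_dir = direction(fixed_img_feat, update_img_feat)    # second feature
        return (1.0 - F.cosine_similarity(text_dir, img_dir, dim=-1)).mean()

    # Toy usage with random features standing in for encoder outputs:
    feats = [torch.randn(1, 512) for _ in range(4)]
    loss = directional_loss(*feats)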
In some optional implementations of the present embodiment, the stylized guidance data is stylized guidance speech, and the determining unit 603 is further configured to: converting the stylized guidance speech into the stylized guidance text.
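Any off-the-shelf speech-recognition model can perform this conversion; the snippet below uses openai-whisper purely as one possible choice (the model size and audio file name are assumptions), after which the text branch above applies unchanged.

    import whisper  # pip install openai-whisper

    asr = whisper.load_model("base")
    guidance_text = asr.transcribe("style_prompt.wav")["text"]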
In some optional implementations of the present embodiment, the stylized guidance data is a stylized guidance image, and the determining unit 603 is further configured to: determining a first loss between the first neural radiation field and the second neural radiation field; determining a second loss between the stylized guidance image and the guided image generated by the second neural radiation field; and determining the loss information by combining the first loss and the second loss.
In some optional implementations of the present embodiment, the determining unit 603 is further configured to: determining low-level features of the guided images generated by the first neural radiation field and the second neural radiation field, respectively; and determining the loss between the low-level features as the first loss.
In some optional implementations of the present embodiment, the determining unit 603 is further configured to: performing feature extraction on the guided images generated by the first neural radiation field and the second neural radiation field, respectively, through a second preset feature extraction network, and taking the features extracted by a low-level feature extraction layer in the second preset feature extraction network as the low-level features.
In some optional implementations of the present embodiment, the determining unit 603 is further configured to: determining high-level features corresponding to the stylized guidance image and the guided image generated by the second neural radiation field, respectively; and determining the loss between the high-level features as the second loss.
In some optional implementations of the present embodiment, the determining unit 603 is further configured to: performing feature extraction on the stylized guidance image and the guided image generated by the second neural radiation field, respectively, through the second preset feature extraction network, and taking the features extracted by a high-level feature extraction layer in the second preset feature extraction network as the high-level features.
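A sketch of the two image-modality losses is given below, using a VGG-16 backbone as a stand-in for the second preset feature extraction network; the choice of VGG-16, the layer cut points, and the plain MSE between features are assumptions for illustration (input normalization is omitted for brevity).

    import torch.nn.functional as F
    from torchvision.models import vgg16, VGG16_Weights

    backbone = vgg16(weights=VGG16_Weights.DEFAULT).features.eval()
    for p in backbone.parameters():
        p.requires_grad_(False)

    LOW_DEPTH, HIGH_DEPTH = 4, 23    # assumed cut points: shallow block vs. deep block

    def layer_features(img, depth):
        """img: (N, 3, H, W) tensor; returns activations after the first `depth` layers."""
        x = img
        for layer in backbone[:depth]:
            x = layer(x)
        return x

    def image_guidance_loss(img_fixed, img_update, style_image):
        # First loss: low-level features of the two fields' renderings stay close (content).
        first = F.mse_loss(layer_features(img_update, LOW_DEPTH),
                           layer_features(img_fixed, LOW_DEPTH))
        # Second loss: high-level features of the updatable field's rendering approach
        # those of the stylized guidance image (style).
        second = F.mse_loss(layer_features(img_update, HIGH_DEPTH),
                            layer_features(style_image, HIGH_DEPTH))
        return first + second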
In some optional implementations of the present embodiment, the second generating unit 602 is further configured to: generating the guided images through the first neural radiation field and the second neural radiation field, respectively, under the simultaneous guidance of stylized guidance data of at least two modalities among the multiple modalities.
In some optional implementations of the present embodiment, the determining unit 603 is further configured to: determining the loss information according to the stylized guidance data and the guided images corresponding to each of the at least two modalities.
In some optional implementations of the present embodiment, the determining unit 603 is further configured to: for a text modality and/or a speech modality among the at least two modalities, determining loss information corresponding to the text modality and/or the speech modality according to the first feature and the second feature; for an image modality among the at least two modalities, combining the first loss and the second loss to obtain loss information corresponding to the image modality; and determining the loss information by combining the loss information corresponding to each of the at least two modalities.
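Combining the per-modality losses can be as simple as a weighted sum; the weights below are hypothetical hyper-parameters, not values given in this disclosure.

    def combined_loss(losses, weights):
        """losses: dict of per-modality loss tensors, e.g. {"text": ..., "image": ...}."""
        return sum(weights.get(name, 1.0) * value for name, value in losses.items())

    # Example: total = combined_loss({"text": text_loss, "image": image_loss},
    #                                {"text": 1.0, "image": 0.5})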
In this embodiment, an apparatus for determining a stylized image generation model is provided. For a trained neural radiation field, the output of the field is constrained by stylized guidance data of different modalities, so that the original neural radiation field is optimized to generate a style-edited image result. This improves the applicability of the neural radiation field to stylized guidance data of different modalities, as well as the guidance effect and the generation efficiency of the stylized image generation process.
With continued reference to fig. 7, as an implementation of the method illustrated in the foregoing figures, the present disclosure provides an embodiment of an apparatus for generating a stylized image. The apparatus embodiment corresponds to the method embodiment illustrated in fig. 5, and the apparatus may be specifically applied to various electronic devices.
As shown in fig. 7, an apparatus 700 for generating a stylized image includes: an acquisition unit 701 configured to acquire target stylized guidance data, wherein the target stylized guidance data characterizes style information that the stylized image to be generated is expected to reach; and a third generating unit 702 configured to generate the stylized image through the trained stylized image generation model under the guidance of the target stylized guidance data, wherein the stylized image generation model is trained by the apparatus 600 of the above-described embodiment.
In this embodiment, an apparatus for generating a stylized image is provided. Because the image is generated by the trained stylized image generation model, the editing efficiency of the stylized editing process and the editing effect of the generated stylized image are improved.
According to an embodiment of the present disclosure, the present disclosure further provides an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to implement the method for determining a stylized image generation model or the method for generating a stylized image described in any of the embodiments above.
According to an embodiment of the present disclosure, there is also provided a readable storage medium storing computer instructions which, when executed, enable a computer to implement the method for determining a stylized image generation model or the method for generating a stylized image described in any of the embodiments above.
An embodiment of the present disclosure further provides a computer program product which, when executed by a processor, implements the method for determining a stylized image generation model or the method for generating a stylized image described in any of the embodiments above.
Fig. 8 illustrates a schematic block diagram of an example electronic device 800 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the apparatus 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Various components in device 800 are connected to I/O interface 805, including: an input unit 806 such as a keyboard, mouse, etc.; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, etc.; and a communication unit 809, such as a network card, modem, wireless communication transceiver, or the like. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the respective methods and processes described above, for example, a method for determining a stylized image generation model. For example, in some embodiments, the method for determining a stylized image generation model may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 800 via ROM 802 and/or communication unit 809. When the computer program is loaded into RAM 803 and executed by computing unit 801, one or more steps of the method for determining a stylized image generation model described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the method for determining the stylized image generation model by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described herein may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs, which may be executed and/or interpreted on a programmable system including at least one programmable processor; the programmable processor may be a special-purpose or general-purpose programmable processor that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system and overcomes the defects of high management difficulty and weak service scalability in traditional physical host and virtual private server (VPS) services; the server may also be a server of a distributed system or a server combined with a blockchain.
According to the technical solution of the embodiments of the present disclosure, a method for determining a stylized image generation model is provided. For a trained neural radiation field, the output of the field is constrained by stylized guidance data of different modalities, so that the original neural radiation field is optimized to generate a style-edited image result. This improves the applicability of the neural radiation field to stylized guidance data of different modalities, as well as the guidance effect and the generation efficiency of the stylized image generation process.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions provided by the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (35)

1. A method for determining a stylized image generation model, comprising:
generating a first neural radiation field with fixed parameters and a second neural radiation field with updatable parameters based on the pre-trained neural radiation field;
generating a guided image through each of the first neural radiation field and the second neural radiation field under the guidance of multi-modal stylized guidance data, wherein the stylized guidance data characterizes style information that the guided image is expected to reach;
determining loss information according to the stylized guidance data and the guided image;
and updating parameters of the second neural radiation field according to the loss information to generate a stylized image generation model.
2. The method of claim 1, wherein the stylized guidance data is stylized guidance text, and
the determining loss information according to the stylized guidance data and the guided image includes:
determining a first feature according to the stylized guidance text and an original text corresponding to the guided image generated by the first neural radiation field;
determining a second feature from the guided images generated by each of the first and second neural radiation fields;
determining the loss information between the first feature and the second feature.
3. The method of claim 2, wherein the determining a first feature according to the stylized guidance text and the original text corresponding to the guided image generated by the first neural radiation field comprises:
performing feature extraction on the stylized guidance text to obtain a guidance text feature;
performing feature extraction on the original text to obtain an original text feature;
determining a direction vector between the guidance text feature and the original text feature as the first feature.
4. The method of claim 3, wherein the performing feature extraction on the stylized guidance text to obtain a guidance text feature comprises:
performing feature extraction on the stylized guidance text through a first preset feature extraction network to obtain the guidance text feature; and
the performing feature extraction on the original text to obtain an original text feature comprises:
performing feature extraction on the original text through the first preset feature extraction network to obtain the original text feature.
5. The method of claim 2, wherein the determining a second feature according to the guided images generated by the first neural radiation field and the second neural radiation field, respectively, comprises:
performing feature extraction on the guided images generated by the first neural radiation field and the second neural radiation field, respectively, to obtain a first image feature and a second image feature;
determining a direction vector between the first image feature and the second image feature as the second feature.
6. The method of claim 5, wherein the performing feature extraction on the guided images generated by the first neural radiation field and the second neural radiation field, respectively, to obtain a first image feature and a second image feature comprises:
performing feature extraction on the guided images generated by the first neural radiation field and the second neural radiation field, respectively, through a first preset feature extraction network to obtain the first image feature and the second image feature.
7. The method of any of claims 2-6, wherein the stylized guidance data is stylized guidance speech, and
before the determining a first feature according to the stylized guidance text and the original text corresponding to the guided image generated by the first neural radiation field, the method further comprises:
converting the stylized guidance speech into the stylized guidance text.
8. The method of claim 1, wherein the stylized guidance data is a stylized guidance image, and
the determining loss information according to the stylized guidance data and the guided image includes:
determining a first loss between the first neural radiation field and the second neural radiation field;
determining a second loss between the stylized guidance image and a guided image generated by the second neural radiation field;
and determining the loss information by combining the first loss and the second loss.
9. The method of claim 8, wherein the determining a first loss between the first neural radiation field and the second neural radiation field comprises:
determining low-level features of a guided image generated by each of the first and second neural radiation fields;
and determining the loss between the low-level features as the first loss.
10. The method of claim 9, wherein the determining low-level features of the guided image generated by each of the first and second neural radiation fields comprises:
performing feature extraction on the guided images generated by the first neural radiation field and the second neural radiation field, respectively, through a second preset feature extraction network, and taking features extracted by a low-level feature extraction layer in the second preset feature extraction network as the low-level features.
11. The method of claim 8, wherein the determining a second loss between the stylized guidance image and the guided image generated by the second neural radiation field comprises:
determining high-level features corresponding to the stylized guidance image and the guided image generated by the second neural radiation field, respectively;
and determining the loss between the high-level features as the second loss.
12. The method of claim 11, wherein the determining the high-level features corresponding to the stylized guidance image and the guided image generated by the second neural radiation field, respectively, comprises:
performing feature extraction on the stylized guidance image and the guided image generated by the second neural radiation field, respectively, through a second preset feature extraction network, and taking the features extracted by a high-level feature extraction layer in the second preset feature extraction network as the high-level features.
13. The method of any of claims 1-12, wherein the generating guided images through the first neural radiation field and the second neural radiation field, respectively, under the guidance of multi-modal stylized guidance data comprises:
generating the guided images through the first neural radiation field and the second neural radiation field, respectively, under the simultaneous guidance of stylized guidance data of at least two modalities among the multiple modalities.
14. The method of claim 13, wherein the determining loss information according to the stylized guidance data and the guided image comprises:
determining the loss information according to the stylized guidance data and the guided images corresponding to each of the at least two modalities.
15. The method of claim 14, wherein the determining the loss information according to the stylized guidance data and the guided images corresponding to each of the at least two modalities comprises:
for a text modality and/or a speech modality among the at least two modalities, determining loss information corresponding to the text modality and/or the speech modality according to the first feature and the second feature;
for an image modality among the at least two modalities, combining the first loss and the second loss to obtain loss information corresponding to the image modality;
determining the loss information by combining the loss information corresponding to each of the at least two modalities.
16. A method for generating a stylized image, comprising:
acquiring target stylized guidance data, wherein the target stylized guidance data characterizes style information that a stylized image to be generated is expected to reach;
generating the stylized image through a trained stylized image generation model under the guidance of the target stylized guidance data, wherein the stylized image generation model is trained by the method of any one of claims 1-15.
17. An apparatus for determining a stylized image generation model, comprising:
a first generation unit configured to generate a first neural radiation field with fixed parameters and a second neural radiation field with updatable parameters based on the pre-trained neural radiation field;
a second generation unit configured to generate a guided image through each of the first neural radiation field and the second neural radiation field under the guidance of multi-modal stylized guidance data, wherein the stylized guidance data characterizes style information that the guided image is expected to reach;
a determining unit configured to determine loss information according to the stylized guidance data and the guided image;
An updating unit configured to update parameters of the second neural radiation field according to the loss information to generate a stylized image generation model.
18. The apparatus of claim 17, wherein the stylized guidance data is stylized guidance text, and
the determining unit is further configured to:
determining a first feature according to the stylized guidance text and an original text corresponding to the guided image generated by the first neural radiation field; determining a second feature according to the guided images generated by each of the first and second neural radiation fields; and determining the loss information between the first feature and the second feature.
19. The apparatus of claim 18, wherein the determination unit is further configured to:
performing feature extraction on the stylized guidance text to obtain a guidance text feature; performing feature extraction on the original text to obtain an original text feature; and determining a direction vector between the guidance text feature and the original text feature as the first feature.
20. The apparatus of claim 19, wherein the determination unit is further configured to:
performing feature extraction on the stylized guidance text through a first preset feature extraction network to obtain the guidance text feature; and performing feature extraction on the original text through the first preset feature extraction network to obtain the original text feature.
21. The apparatus of claim 18, wherein the determination unit is further configured to:
performing feature extraction on the guided images generated by the first neural radiation field and the second neural radiation field, respectively, to obtain a first image feature and a second image feature; and determining a direction vector between the first image feature and the second image feature as the second feature.
22. The apparatus of claim 21, wherein the determination unit is further configured to:
performing feature extraction on the guided images generated by the first neural radiation field and the second neural radiation field, respectively, through a first preset feature extraction network to obtain the first image feature and the second image feature.
23. The apparatus of any of claims 18-22, wherein the stylized guidance data is stylized guidance speech, and
The determining unit is further configured to:
converting the stylized guidance speech into the stylized guidance text.
24. The apparatus of claim 17, wherein the stylized guidance data is a stylized guidance image, and
the determining unit is further configured to:
determining a first loss between the first neural radiation field and the second neural radiation field; determining a second loss between the stylized guidance image and a guided image generated by the second neural radiation field; and determining the loss information by combining the first loss and the second loss.
25. The apparatus of claim 24, wherein the determination unit is further configured to:
determining low-level features of a guided image generated by each of the first and second neural radiation fields; and determining the loss between the low-level features as the first loss.
26. The apparatus of claim 25, wherein the determination unit is further configured to:
performing feature extraction on the guided images generated by the first neural radiation field and the second neural radiation field, respectively, through a second preset feature extraction network, and taking features extracted by a low-level feature extraction layer in the second preset feature extraction network as the low-level features.
27. The apparatus of claim 24, wherein the determination unit is further configured to:
determining high-level features corresponding to the stylized guidance image and the guided image generated by the second neural radiation field, respectively; and determining the loss between the high-level features as the second loss.
28. The apparatus of claim 27, wherein the determination unit is further configured to:
performing feature extraction on the stylized guidance image and the guided image generated by the second neural radiation field, respectively, through a second preset feature extraction network, and taking the features extracted by a high-level feature extraction layer in the second preset feature extraction network as the high-level features.
29. The apparatus of any of claims 17-28, wherein the second generation unit is further configured to:
generating the guided images through the first neural radiation field and the second neural radiation field, respectively, under the simultaneous guidance of stylized guidance data of at least two modalities among the multiple modalities.
30. The apparatus of claim 29, wherein the determination unit is further configured to:
determining the loss information according to the stylized guidance data and the guided images corresponding to each of the at least two modalities.
31. The apparatus of claim 30, wherein the determination unit is further configured to:
for a text modality and/or a speech modality among the at least two modalities, determining loss information corresponding to the text modality and/or the speech modality according to the first feature and the second feature; for an image modality among the at least two modalities, combining the first loss and the second loss to obtain loss information corresponding to the image modality; and determining the loss information by combining the loss information corresponding to each of the at least two modalities.
32. An apparatus for generating a stylized image, comprising:
an acquisition unit configured to acquire target stylized guidance data, wherein the target stylized guidance data characterizes style information that a stylized image to be generated is expected to reach;
a third generating unit configured to generate the stylized image through a trained stylized image generation model under the guidance of the target stylized guidance data, wherein the stylized image generation model is obtained by training with the apparatus of any one of claims 17-31.
33. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-16.
34. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-16.
35. A computer program product comprising: a computer program which, when executed by a processor, implements the method of any of claims 1-16.
CN202310378089.4A 2023-04-11 2023-04-11 Method, device and product for determining stylized image generation model Active CN116402914B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310378089.4A CN116402914B (en) 2023-04-11 2023-04-11 Method, device and product for determining stylized image generation model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310378089.4A CN116402914B (en) 2023-04-11 2023-04-11 Method, device and product for determining stylized image generation model

Publications (2)

Publication Number Publication Date
CN116402914A true CN116402914A (en) 2023-07-07
CN116402914B CN116402914B (en) 2024-07-05

Family

ID=87011945

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310378089.4A Active CN116402914B (en) 2023-04-11 2023-04-11 Method, device and product for determining stylized image generation model

Country Status (1)

Country Link
CN (1) CN116402914B (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180357800A1 (en) * 2017-06-09 2018-12-13 Adobe Systems Incorporated Multimodal style-transfer network for applying style features from multi-resolution style exemplars to input images
US20220180602A1 (en) * 2020-12-03 2022-06-09 Nvidia Corporation Generating images of virtual environments using one or more neural networks
CN113822969A (en) * 2021-09-15 2021-12-21 宿迁硅基智能科技有限公司 Method, device and server for training nerve radiation field model and face generation
CN115690324A (en) * 2022-11-15 2023-02-03 广州中思人工智能科技有限公司 Neural radiation field reconstruction optimization method and device based on point cloud
CN115797571A (en) * 2023-02-03 2023-03-14 天津大学 New visual angle synthesis method of 3D stylized scene

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LU CHONGSHAN等: ""A Large-Scale Outdoor Multi-modal Dataset and Benchmark for Novel View Synthesis and Implicit Scene Reconstruction"", ARXIV:2301.06782, 17 January 2023 (2023-01-17) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116580163A (en) * 2023-07-14 2023-08-11 深圳元戎启行科技有限公司 Three-dimensional scene reconstruction method, electronic equipment and storage medium
CN116580163B (en) * 2023-07-14 2023-12-22 深圳元戎启行科技有限公司 Three-dimensional scene reconstruction method, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN116402914B (en) 2024-07-05

Similar Documents

Publication Publication Date Title
WO2020006961A1 (en) Image extraction method and device
CN113033537B (en) Method, apparatus, device, medium and program product for training a model
CN115345980B (en) Generation method and device of personalized texture map
CN113379627B (en) Training method of image enhancement model and method for enhancing image
CN113591918B (en) Training method of image processing model, image processing method, device and equipment
CN113627536B (en) Model training, video classification method, device, equipment and storage medium
CN114020950B (en) Training method, device, equipment and storage medium for image retrieval model
CN111539897A (en) Method and apparatus for generating image conversion model
CN113393371B (en) Image processing method and device and electronic equipment
CN114092759A (en) Training method and device of image recognition model, electronic equipment and storage medium
CN116309983B (en) Training method and generating method and device of virtual character model and electronic equipment
CN114792355B (en) Virtual image generation method and device, electronic equipment and storage medium
CN116402914B (en) Method, device and product for determining stylized image generation model
CN114186681A (en) Method, apparatus and computer program product for generating model clusters
CN116756564A (en) Training method and using method of task solution-oriented generation type large language model
CN113379877A (en) Face video generation method and device, electronic equipment and storage medium
CN113850714A (en) Training of image style conversion model, image style conversion method and related device
CN114120413A (en) Model training method, image synthesis method, device, equipment and program product
CN114549904A (en) Visual processing and model training method, apparatus, storage medium, and program product
CN114049290A (en) Image processing method, device, equipment and storage medium
CN117315758A (en) Facial expression detection method and device, electronic equipment and storage medium
CN115393488B (en) Method and device for driving virtual character expression, electronic equipment and storage medium
CN116363429A (en) Training method of image recognition model, image recognition method, device and equipment
CN115147547B (en) Human body reconstruction method and device
CN114758130B (en) Image processing and model training method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant