CN117496099A - Three-dimensional image editing method, system, electronic device and storage medium - Google Patents

Three-dimensional image editing method, system, electronic device and storage medium Download PDF

Info

Publication number
CN117496099A
CN117496099A (application CN202311267121.8A)
Authority
CN
China
Prior art keywords
image
sample
text
editing
original image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311267121.8A
Other languages
Chinese (zh)
Inventor
李建民 (Li Jianmin)
李建辉 (Li Jianhui)
朱军 (Zhu Jun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202311267121.8A
Publication of CN117496099A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00: Manipulating 3D models or images for computer graphics
    • G06T19/20: Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/0475: Generative networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/094: Adversarial learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00: 3D [Three Dimensional] image rendering


Abstract

The invention relates to the technical field of three-dimensional modeling, and provides a three-dimensional image editing method, system, electronic device and storage medium. The method comprises: acquiring an original image and a text instruction; and inputting the original image and the text instruction into an image editing model to obtain edited images output by the image editing model, wherein the edited images comprise a plurality of images showing the target object in the original image from different angles, and the form of the target object in each edited image matches the description of the text instruction. The image editing model maps the original image into a hidden vector, removes the noise added to the hidden vector by means of a noise predictor under the guidance of the text instruction, and renders the edited images through a three-dimensional generative adversarial network based on the conditional denoising hidden vector obtained by removing the noise. The invention addresses the defect that existing three-dimensional image editing methods cannot simultaneously support natural language guidance and guarantee accurate text guidance.

Description

Three-dimensional image editing method, system, electronic device and storage medium
Technical Field
The present invention relates to the field of three-dimensional modeling technologies, and in particular, to a three-dimensional image editing method, system, electronic device, and storage medium.
Background
Three-dimensional modeling has a wide range of applications in the modern digital society, including the creation of movies, Virtual Reality (VR) content and digital human assets. In particular, the generation and editing of three-dimensional faces is one of the tasks attracting the most attention.
Existing three-dimensional face editing methods fall broadly into two categories. Three-dimensional image editing methods based on inversion of a three-dimensional generative adversarial network (GAN), such as 3D-GAN-Inv, PREIM3D (Precise Editing in the Inversion Manifold with 3D consistency), HFGI3D and E3DGE (Efficient Geometry-aware 3D Generative adversarial networks), perform an inversion operation on a single image through optimization, an encoder, or a combination of the two, mapping the single image to a hidden vector; the three-dimensional generative network then uses this hidden vector to generate multi-view images consistent with the input image, and semantic editing that is three-dimensionally consistent with the input image is achieved by editing the hidden vector. Image editing methods based on diffusion models, such as the Rodin diffusion model, use prompt engineering to obtain the change between a target text and a neutral text in the text embedding space of CLIP (Contrastive Language-Image Pre-Training) and, by assuming that the CLIP image embedding and text embedding directions are collinear, directly add the change of the text embedding to the image embedding to obtain a text-guided three-dimensional image editing result.
However, the three-dimensional image editing methods based on 3D GAN inversion all require training editing directions that represent binary attributes, with each editing direction representing one attribute such as wearing glasses; although they achieve impressive results, these methods belong to structured editing and do not support natural language guidance. In the diffusion-model-based three-dimensional image editing methods, directly adding the change of the text embedding to the image embedding sometimes leads to inaccurate editing results, and the training cost is high.
Disclosure of Invention
The invention provides a three-dimensional image editing method, system, electronic device and storage medium, which are used to overcome the defect that existing three-dimensional image editing methods cannot simultaneously support natural language guidance and guarantee accurate text guidance.
The invention provides a three-dimensional image editing method, which comprises the following steps:
acquiring an original image and a text instruction;
inputting the original image and the text instruction into an image editing model to obtain edited images output by the image editing model, wherein the edited images comprise a plurality of images showing the target object in the original image from different angles, and the form of the target object in each edited image matches the description of the text instruction;
the image editing model is configured to map the original image into a hidden vector, remove the noise added to the hidden vector through a noise predictor under the guidance of the text instruction, and render the edited images through a three-dimensional generative adversarial network based on the conditional denoising hidden vector obtained by removing the noise, wherein the noise predictor is trained based on an original image sample, an edited image sample and a text instruction sample;
the edited image sample is an image that shows the target object sample in the original image sample from the same angle, with the form of the target object sample matching the description of the text instruction sample; the text instruction sample semantically describes the change from the original image sample to the edited image sample.
According to the three-dimensional image editing method of the invention, the training process of the noise predictor comprises the following steps: mapping the original image sample and the edited image sample into an original image sample hidden vector and an edited image sample hidden vector respectively;
encoding the text instruction samples into text sample vectors;
training a noise predictor model based on the original image sample hidden vector, the edited image sample hidden vector, and the text sample vector until a loss function of the noise predictor model converges;
The noise predictor model whose loss function has converged is taken as the noise predictor.
According to the three-dimensional image editing method of the present invention, the training process of the noise predictor further comprises:
extracting a feature vector sample of the target object sample from the original image sample;
the feature vector samples are used for training of the noise predictor model.
According to the three-dimensional image editing method of the invention, the noise predictor model comprises a text-image bi-conditional model, an image mono-conditional model and an unconditional model.
According to the three-dimensional image editing method of the present invention, the inputting the original image and the text instruction into an image editing model, obtaining an edited image output by the image editing model, includes:
performing image mapping processing based on the input original image to obtain the hidden vector corresponding to the original image;
based on the hidden vector, noise adding processing is carried out to obtain a noise adding hidden vector corresponding to the hidden vector;
based on the input text instruction, performing text encoding processing to obtain a text vector corresponding to the text instruction;
Based on the input original image, carrying out feature extraction processing on the target object to obtain a feature vector corresponding to the target object; based on the hidden vector, the text vector and the feature vector, denoising the denoising hidden vector through the noise predictor to obtain a conditional denoising hidden vector corresponding to the denoising hidden vector;
and rendering the edited image through the three-dimensional generative adversarial network based on the conditional denoising hidden vector.
According to the three-dimensional image editing method of the present invention, the encoding the text instruction sample into a text sample vector includes:
converting the text instruction sample into a digital token;
converting the digital token according to a preset token conversion rule, wherein the preset token conversion rule is as follows: after converting the text instruction sample into the digital token with the preset length, randomly adjusting the token position corresponding to the text instruction sample to any position in the digital token;
and encoding the converted digital token based on a pre-trained text encoding model to obtain the text sample vector.
The invention also provides a three-dimensional image editing system, which comprises:
the data acquisition module is used for acquiring an original image and a text instruction;
the image processing module is used for inputting the original image and the text instruction into an image editing model to obtain edited images output by the image editing model, wherein the edited images comprise a plurality of images showing the target object in the original image from different angles, and the form of the target object in each edited image matches the description of the text instruction;
the image editing model is used for mapping the original image into a hidden vector, removing the noise added to the hidden vector through a noise predictor under the guidance of the text instruction, and rendering the edited images through a three-dimensional generative adversarial network based on the conditional denoising hidden vector obtained by removing the noise, the noise predictor being trained based on an original image sample, an edited image sample and a text instruction sample;
the edited image sample is an image that shows the target object sample in the original image sample from the same angle, with the form of the target object sample matching the description of the text instruction sample; the text instruction sample semantically describes the change from the original image sample to the edited image sample.
The three-dimensional image editing system according to the present invention further comprises:
the model training module is used for mapping the original image sample and the edited image sample into an original image sample hidden vector and an edited image sample hidden vector respectively; encoding the text instruction samples into text sample vectors; training a noise predictor model based on the original image sample hidden vector, the edited image sample hidden vector, and the text sample vector until a loss function of the noise predictor model converges; the noise predictor model that converges the loss function is taken as the noise predictor.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the three-dimensional image editing method as described above when executing the program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the three-dimensional image editing method as described above.
According to the three-dimensional image editing method, system, electronic device and storage medium, the original image and the text instruction are acquired and input into the image editing model. After mapping the original image into a hidden vector, the image editing model removes the noise added to the hidden vector by means of the noise predictor under the guidance of the text instruction to obtain the conditional denoising hidden vector, and finally a plurality of edited images with different viewing angles, in which the form of the target object matches the description of the text instruction, are rendered through the three-dimensional generative adversarial network based on the conditional denoising hidden vector. Text-guided three-dimensional editing of the original image is achieved by conditional diffusion in the hidden space of the three-dimensional generative adversarial network, so that image editing guided by natural language is realized while the accuracy of text guidance is guaranteed.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a three-dimensional image editing method provided by the invention;
FIG. 2 is a schematic diagram of a training process of an image editing model provided by the present invention;
FIG. 3 and FIG. 4 are comparison diagrams showing the effect of image editing performed on example pictures by applying the three-dimensional image editing method provided by the invention;
FIG. 5 is a schematic diagram of a three-dimensional image editing system according to the present invention;
fig. 6 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It will be appreciated that NeRF (Neural Radiance Field) has had a significant impact on three-dimensional modeling. At present, three-dimensional generative networks such as GIRAFFE and EG3D have been integrated into the generative adversarial network framework; they are trained on two-dimensional image sets within a specific domain and can generate high-resolution images of an object, such as faces, cars or cats, from different angles given a random hidden vector. This work paves the way for structured three-dimensional image editing methods such as 3D-GAN-Inv, PREIM3D, HFGI3D and E3DGE, which map a single real image to a hidden vector through an inversion operation; through this hidden vector, the three-dimensional generative network can generate multi-view images corresponding to the single input image, and by modifying the hidden vector obtained by inversion, three-dimensionally consistent semantic editing is achieved, such as changing hair color or wearing glasses.
In addition, the diffusion model borrows the idea of a physical diffusion process: data is modeled as being gradually turned into Gaussian noise through a forward noise-adding process, and data is then generated from Gaussian noise through a reverse denoising process. Conditional diffusion models implemented by information modulation or cross-attention mechanisms facilitate editing tasks guided by text, sketches, masks and the like.
However, the three-dimensional image editing methods based on 3D GAN inversion, such as 3D-GAN-Inv, PREIM3D, HFGI3D and E3DGE, all require training editing directions that represent binary attributes, each direction representing one attribute such as wearing glasses; therefore, although these editing methods achieve impressive results, they belong to structured editing and do not support natural language guidance. Diffusion-model-based three-dimensional image editing methods such as the Rodin diffusion model obtain a text-guided editing result by directly adding the change of the text embedding to the image embedding, under the assumption that the CLIP image and text embeddings are collinear. However, experiments show that the cosine similarity between these two directions is typically around 0.2 rather than 1.0, so directly adding the change of the text embedding to the image embedding sometimes produces inaccurate editing results; moreover, the computation required by the Rodin diffusion model to train the neural radiance field is very large, the adopted tri-plane feature lives in a high-dimensional space of 256 × 256 × 96, and the training cost of the diffusion model is also relatively high.
Based on the above, the embodiment of the invention provides a three-dimensional image editing method that performs a conditional diffusion process in the hidden space of a three-dimensional generative adversarial network, thereby effectively solving the problems of existing three-dimensional image editing methods, such as the lack of support for natural language guidance and the low accuracy of text guidance.
A three-dimensional image editing method according to the present invention is described below with reference to FIG. 1 to FIG. 4. The method may be performed by software and/or hardware in an electronic device such as a computer, a tablet or a mobile phone. As shown in FIG. 1, the method includes the following steps:
101. acquiring an original image and a text instruction;
in this embodiment, the original image is the image to be edited, and the text instruction is the basis for editing the original image, that is, a textual description of the desired change between the edited image and the original image. Taking a portrait as an example, the text instruction may be: wearing glasses, smiling, crying, and so on.
102. Inputting the original image and the text instruction into an image editing model to obtain edited images output by the image editing model, wherein the edited images comprise a plurality of images showing the target object in the original image from different angles, and the form of the target object in each edited image matches the description of the text instruction.
The image editing model is used for mapping the original image into a hidden vector, removing the noise added to the hidden vector through a noise predictor under the guidance of the text instruction, and rendering the edited images through a three-dimensional generative adversarial network based on the conditional denoising hidden vector obtained by removing the noise; the noise predictor is trained based on an original image sample, an edited image sample and a text instruction sample.
The edited image sample is an image that shows the target object sample in the original image sample from the same angle, with the form of the target object sample matching the description of the text instruction sample; the text instruction sample semantically describes the change from the original image sample to the edited image sample.
In this embodiment, the original image and the text instruction are input into the image editing model, which is trained based on original image samples, edited image samples and text instruction samples. After mapping the original image into a hidden vector, the image editing model removes the noise added to the hidden vector by means of a noise predictor conditioned on the text instruction, and the three-dimensional generative adversarial network then renders, based on the conditional denoising hidden vector obtained by removing the noise, a plurality of edited images that show the target object from different angles with a form matching the description of the text instruction. For example, when the original image is a portrait, the target object in the original image is the person, and the obtained edited images show the portrait from different angles in accordance with the description of the text instruction. Specifically, the input image is mapped to the W+ hidden space of a three-dimensional generative adversarial network (EG3D) by a pre-trained inversion encoder to obtain a hidden vector of preset dimensions (14 × 512); a diffusion model conditioned on text and images is trained in this W+ hidden space; at inference time, a hidden vector is obtained by sampling conditioned on the image to be edited and the text instruction, and the three-dimensional generative adversarial network renders edited images of different viewing angles from that hidden vector. In this way, text-guided three-dimensional editing of the original image is achieved by conditional diffusion in the hidden space of the three-dimensional generative adversarial network.
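As an illustrative sketch of this pipeline only (the class names, the stub modules and the simplified latent update below are assumptions made for this example; they stand in for the pre-trained inversion encoder, noise predictor and EG3D generator and do not reproduce the exact sampler of this disclosure), the end-to-end flow can be organized roughly as follows:

```python
import torch
import torch.nn as nn

# Tiny stand-in modules; in practice the pre-trained inversion encoder, the
# conditional noise predictor and the EG3D generator would be loaded from
# checkpoints instead of these placeholders.
class InversionEncoder(nn.Module):
    def forward(self, image):                    # image -> 14x512 W+ latent
        return torch.randn(image.shape[0], 14, 512)

class NoisePredictor(nn.Module):
    def forward(self, w_t, t, w_orig, c_text):   # noise prediction with image and text conditions
        return torch.zeros_like(w_t)

class Generator3D(nn.Module):
    def forward(self, w, camera):                # renders an image from a latent and a camera pose
        return torch.zeros(w.shape[0], 3, 512, 512)

def edit_image_3d(image, c_text, cameras, steps=50):
    encoder, eps_net, generator = InversionEncoder(), NoisePredictor(), Generator3D()
    w_orig = encoder(image)                      # map the original image into the W+ hidden space
    w = torch.randn_like(w_orig)                 # start the reverse diffusion from Gaussian noise
    for t in reversed(range(steps)):
        eps = eps_net(w, t, w_orig, c_text)      # noise predicted under image/text guidance
        w = w - eps / steps                      # placeholder update; a real DDIM step goes here
    return [generator(w, cam) for cam in cameras]  # edited renderings from several viewpoints
```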
In this embodiment, the noise predictor is trained based on original image samples, edited image samples and text instruction samples; that is, by training the image editing model with triplet data consisting of an original image sample, an edited image sample and a text instruction sample that semantically describes the change from the original image sample to the edited image sample, the image editing model is encouraged to learn the correlation between the change of the paired images and the corresponding text instruction, thereby achieving a more accurate editing effect. As described in the above embodiments, the image editing model needs the noise predictor to remove the noise added to the hidden vector under the guidance of the text instruction, and therefore the noise predictor needs to be trained.
Based on this, in one embodiment of the present invention, the three-dimensional image editing method further includes the training process of the noise predictor:
mapping the original image sample and the edited image sample into an original image sample hidden vector and an edited image sample hidden vector respectively;
encoding the text instruction samples into text sample vectors;
training a noise predictor model based on the original image sample hidden vector, the edited image sample hidden vector and the text sample vector until a loss function of the noise predictor model converges;
The noise predictor model whose loss function has converged is used as the noise predictor.
It may be appreciated that the noise predictor belongs to the image editing model provided by the embodiment of the present invention, and thus, the training process of the noise predictor is included in the training process of the image editing model.
Based on this, in the present embodiment, the training process of the noise predictor is specifically described from the training point of view of the image editing model.
The data used to train the noise predictor are triplets, each consisting of an original face image X_o, an edited two-dimensional face image X_e, and a text instruction T that semantically describes the change from the original face to the edited face.
Fig. 2 illustrates the training process of the image editing model provided by the embodiment of the present invention. As shown in Fig. 2, the image editing model uses a pre-trained three-dimensional generative adversarial network G, through which multi-view face images can be rendered, that is:
X=G(w,c) (1)
where w is a hidden vector, for example a real-valued tensor of dimension 14 × 512, c is the camera pose parameter, including intrinsic and extrinsic parameters and typically a 25-dimensional real vector, and X is the generated image.
In the training process, the original face image input into the image editing model provided by the embodiment of the invention is a single-view image, and multi-view face images can be rendered through G.
The inversion encoder E used may map the input face image into a hidden vector w, i.e.:
w=E(X) (2)
where X is the input face image, and the image rendered from the inverted hidden vector w is the reconstructed image obtained after inversion.
In the training process of the image editing model, the inversion encoder E maps the original face X_o and the edited two-dimensional face X_e in the triplet data to hidden vectors w_o and w_e respectively, and w_o and w_e are then converted from two-dimensional tensors into three-dimensional tensors, for example by zero-padding them from 14 × 512 to 16 × 512 and reshaping them to 4 × 4 × 512. After that, at the t-th diffusion step, noise is added to w_e to obtain the noised latent w_et, and the added noise is then removed by a noise predictor ε_θ conditioned on w_o and T, so that a text-guided edited image is obtained.
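A minimal sketch of this latent preprocessing and noising step is given below; the exact padding layout, the 4 × 4 × 512 reshaping and the noise schedule are assumptions for illustration (the noising is assumed to follow the usual DDPM form, which the text does not spell out):

```python
import torch
import torch.nn.functional as F

def prepare_latent(w):
    """Zero-pad a (B, 14, 512) W+ latent to (B, 16, 512) and reshape it to (B, 4, 4, 512)."""
    w = F.pad(w, (0, 0, 0, 16 - w.shape[-2]))       # zero-fill the style dimension up to 16
    return w.reshape(w.shape[0], 4, 4, 512)         # assumed 3-D layout for the diffusion network

def q_sample(w_e, t, alphas_cumprod):
    """Forward diffusion: add noise to the edited-image latent w_e at step t."""
    noise = torch.randn_like(w_e)
    a_bar = alphas_cumprod[t]                       # cumulative noise-schedule coefficient
    w_et = a_bar.sqrt() * w_e + (1.0 - a_bar).sqrt() * noise
    return w_et, noise
```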
Thus, the training process of the noise predictor ε_θ is specifically as follows:
Specifically, the text T is encoded into a text embedding c_T by CLIP, and the training loss function of the noise predictor, i.e. of the diffusion model, is then:
L(θ) = E[ ||ε - ε_θ(w_et, t, w_o, c_T)||² ] + λ_reg · L_reg (4)
where ε is the added noise, L_reg is the regularization loss between the predicted edited image and the original image during training (the predicted edited image being rendered by G, with the camera parameter c, from the conditional denoising hidden vector predicted during training), and λ_reg is the regularization weight.
Training proceeds continuously based on w_o, w_e and c_T until the training loss function shown in formula (4) converges, and the trained noise predictor is obtained.
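Under those definitions, one training iteration might look like the following sketch, reusing the q_sample helper above; the latent-space form of L_reg, the value of λ_reg and the sampling of t are assumptions for illustration, and only the overall structure of formula (4) is taken from the text:

```python
import torch

def training_step(eps_net, w_o, w_e, c_text, alphas_cumprod, lambda_reg=0.1):
    """One loss evaluation for the image/text-conditioned noise predictor, cf. formula (4)."""
    t = torch.randint(0, len(alphas_cumprod), (1,)).item()
    w_et, noise = q_sample(w_e, t, alphas_cumprod)           # noised edited-image latent
    eps_pred = eps_net(w_et, t, w_o, c_text)                 # predicted noise under both conditions
    diffusion_loss = torch.mean((noise - eps_pred) ** 2)     # match the injected noise

    # Regularisation: recover the predicted denoised latent and keep it close to the
    # original-image latent (a latent-space stand-in for the image-space L_reg).
    a_bar = alphas_cumprod[t]
    w_pred = (w_et - (1.0 - a_bar).sqrt() * eps_pred) / a_bar.sqrt()
    reg_loss = torch.mean((w_pred - w_o) ** 2)

    return diffusion_loss + lambda_reg * reg_loss
```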
Based on the foregoing embodiment, the training process of the noise predictor further includes:
extracting a feature vector sample of a target object sample from the original image sample;
the feature vector samples are used for training of the noise predictor model.
In this embodiment, the feature vector samples of the target object samples extracted from the original image samples are used for training the noise predictor model, so that when the trained noise predictor performs denoising, not only the text instruction serves as a guide but the features of the target object in the original image are also taken into account, thereby improving the fidelity of the target object in the edited image rendered from the conditional denoising hidden vector with respect to the target object in the original image.
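For illustration only, such a feature extractor could be wired in as an extra condition as sketched below; the backbone (a trivial convolutional stub here, whereas a real system might use, say, a pre-trained face-recognition network) and the way the feature is passed to the noise predictor are assumptions, not details fixed by this disclosure:

```python
import torch
import torch.nn as nn

class TargetFeatureExtractor(nn.Module):
    """Stand-in for the feature extractor applied to the target object in the original image."""
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim))

    def forward(self, image):
        return self.net(image)              # (B, dim) feature vector of the target object

# During training and inference this feature vector is handed to the noise predictor as an
# additional condition next to the original-image latent and the text embedding, e.g.
#   eps = eps_net(w_et, t, w_o, c_text, target_feature)
```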
Based on the content of the above embodiment, the noise predictor model includes a text-image bi-conditional model, an image mono-conditional model, and an unconditional model.
In this embodiment, the text-image bi-conditional model, the image mono-conditional model and the unconditional model are parameterized by training a single network, wherein with probability p_1 the conditions are set to null values so that the unconditional model is also trained. At inference time, a bi-conditional sampling method over the image and the text is used, and the final predicted noise for a latent w_t can be determined as:
ε̃_θ(w_t, w_o, c_T) = ε_θ(w_t, ∅, ∅) + S_I · [ε_θ(w_t, w_o, ∅) - ε_θ(w_t, ∅, ∅)] + S_T · [ε_θ(w_t, w_o, c_T) - ε_θ(w_t, w_o, ∅)] (5)
where S_I and S_T are the guidance weights controlling alignment with the input image and with the text instruction, respectively, and ∅ denotes the null condition.
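A sketch of this sampling-time combination is given below; the use of zero tensors as the null conditions and the default guidance weights are assumptions for illustration, while the three-term structure follows formula (5):

```python
import torch

def guided_noise(eps_net, w_t, t, w_o, c_text, s_img=1.5, s_txt=7.5):
    """Combine the unconditional, image-conditional and image+text-conditional predictions."""
    null_img = torch.zeros_like(w_o)         # assumed "empty" image condition
    null_txt = torch.zeros_like(c_text)      # assumed "empty" text condition
    eps_uncond = eps_net(w_t, t, null_img, null_txt)
    eps_img = eps_net(w_t, t, w_o, null_txt)
    eps_full = eps_net(w_t, t, w_o, c_text)
    return (eps_uncond
            + s_img * (eps_img - eps_uncond)     # S_I: pull towards the input image
            + s_txt * (eps_full - eps_img))      # S_T: pull towards the text instruction
```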
Based on the content of the above embodiment, inputting the original image and the text instruction into the image editing model, obtaining an edited image matching the description of the text instruction output by the image editing model, includes:
based on the input original image, performing image mapping processing to obtain a hidden vector corresponding to the original image;
based on the hidden vector, noise adding processing is carried out to obtain a noise adding hidden vector corresponding to the hidden vector;
based on the input text instruction, performing text encoding processing to obtain a text vector corresponding to the text instruction;
based on the input original image, carrying out feature extraction processing on the target object to obtain a feature vector corresponding to the target object;
based on the hidden vector, the text vector and the feature vector, denoising the denoising hidden vector through a noise predictor to obtain a conditional denoising hidden vector corresponding to the denoising hidden vector;
and rendering an edited image through the three-dimensional generative adversarial network based on the conditional denoising hidden vector.
In this embodiment, the original image and the text instruction are input into the image editing model. The image editing model maps the original image into a hidden vector, adds noise to the hidden vector to obtain a noise-added hidden vector, encodes the text instruction into a text vector, and extracts the feature vector of the target object from the original image. Based on the hidden vector, the text vector and the feature vector, the noise on the noise-added hidden vector is removed by the pre-trained noise predictor under the conditions of the original image, the features of the target object in the original image and the text instruction, so as to obtain the conditional denoising hidden vector, from which multi-angle edited images corresponding to the text instruction can be rendered. That is, by performing conditional diffusion in the hidden space of the generative adversarial network, the NeRF-based generative adversarial network is connected with the diffusion model, thereby realizing three-dimensional image editing with accurate text guidance.
Based on the content of the above embodiment, encoding the text instruction samples into text sample vectors includes:
converting the text instruction sample into a digital token;
converting the digital token according to a preset token conversion rule, wherein the preset token conversion rule is as follows: after converting the text instruction sample into the digital token with the preset length, randomly adjusting the token position corresponding to the text instruction sample to any position in the digital token;
And encoding the converted digital token based on the pre-trained text encoding model to obtain a text sample vector.
It will be appreciated that in natural language processing, text is typically tokenized into a fixed-length sequence of digital tokens, whereas for a single text instruction usually only the first few positions of the converted digital token are non-zero, which causes the cross-attention mechanism to focus mainly on the head of the instruction.
In this embodiment, after the text instruction sample has been converted into a digital token of the preset length, the token positions corresponding to the text instruction sample are randomly shifted to an arbitrary position within the digital token. For example, when the token length is set to 77, after the text instruction sample has been converted into a digital token of length 77, the token positions corresponding to the text instruction sample, i.e. the first few positions of the text tokens, may be adjusted so that they start from, say, the 22nd position of the digital token. In this way the cross-attention mechanism attends to more positions of the text instruction sample; training on a single-instruction dataset thus yields a text encoding model that can also handle combined instructions. That is, when the image editing model provided by the embodiment of the invention is applied, the original image can be edited based on a plurality of text instructions.
In a specific embodiment, the preset token conversion rule may be: set the token length to 77, randomly choose the starting token position of the text instruction sample in [0, 30] so that the last non-zero token position remains smaller than 77, convert the digital token according to this rule, and then encode it with CLIP to obtain the text sample vector.
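The following sketch illustrates this token-position randomization; the toy token ids and the use of zero as the padding token are assumptions for illustration, and in practice the CLIP tokenizer and text encoder would produce and consume the sequence:

```python
import random
import torch

def randomize_token_position(instruction_ids, max_len=77, max_start=30):
    """Place the instruction's non-zero tokens at a random offset within a fixed-length
    sequence, so cross-attention does not only attend to the head positions."""
    start = random.randint(0, max_start)
    assert start + len(instruction_ids) <= max_len, "instruction too long for this offset"
    tokens = torch.zeros(max_len, dtype=torch.long)
    tokens[start:start + len(instruction_ids)] = torch.tensor(instruction_ids)
    return tokens

# Example: an instruction tokenized to five ids is shifted to a random start in [0, 30];
# the last non-zero position therefore stays below 77, matching the rule described above.
shifted = randomize_token_position([320, 1125, 593, 7639, 269])
```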
The three-dimensional image editing method provided by the embodiment of the invention offers a new end-to-end text-guided three-dimensional image editing framework, and connects a NeRF-based generative adversarial network with a diffusion model by performing conditional diffusion in the hidden space of the generative adversarial network. Meanwhile, a text token position randomization training strategy is proposed, so that training is performed on a single-instruction dataset while combined instructions can be edited at inference time. In addition, by training the bi-conditional diffusion model on the triplet dataset, the correlation between the changes of paired images and the corresponding instructions can be learned, thereby achieving more accurate editing.
Further, FIG. 3 and FIG. 4 are effect diagrams obtained after editing original images with text guidance by using the three-dimensional image editing method provided by the embodiment of the present invention. In each figure, the left image is the original image, the text instruction is located below the original image, and the right images are the edited images obtained with the three-dimensional image editing method provided by the embodiment of the present invention. As can be seen from the results shown on the right, given an original image to be edited and a text instruction, the method can edit the input image according to the text instruction and output edited images with multiple views; moreover, the text instruction can be a single instruction or a plurality of instructions, which fully demonstrates the effectiveness of the three-dimensional image editing method provided by the embodiment of the present invention and the accuracy of its text guidance.
Based on the same general inventive concept, the present invention also protects a three-dimensional image editing system, which is described below, and the three-dimensional image editing system described below and the three-dimensional image editing method described above may be referred to correspondingly to each other.
Fig. 5 is a schematic structural diagram of the three-dimensional image editing system provided by the present invention. As shown in fig. 5, includes: a data acquisition module 510 and an image processing module 520; wherein,
the data acquisition module 510 is used for acquiring an original image and a text instruction;
the image processing module 520 is configured to input an original image and a text instruction into the image editing model, to obtain an edited image output by the image editing model, where the edited image includes a plurality of images showing target objects in the original image from different angles, and the morphology of the target objects in each edited image is matched with the description of the text instruction;
the image editing model is used for mapping the original image into a hidden vector, removing the noise added to the hidden vector through a noise predictor under the guidance of the text instruction, and rendering the edited images through a three-dimensional generative adversarial network based on the conditional denoising hidden vector obtained by removing the noise, wherein the noise predictor is trained based on an original image sample, an edited image sample and a text instruction sample; the edited image sample is an image that shows the target object sample in the original image sample from the same angle, with the form of the target object sample matching the description of the text instruction sample; the text instruction sample semantically describes the change from the original image sample to the edited image sample.
According to the three-dimensional image editing system provided by the embodiment of the invention, the original image and the text instruction are acquired and input into the image editing model. After mapping the original image into a hidden vector, the image editing model removes the noise added to the hidden vector by means of the noise predictor under the guidance of the text instruction to obtain the conditional denoising hidden vector, and finally a plurality of edited images with different viewing angles, in which the form of the target object matches the description of the text instruction, are rendered through the three-dimensional generative adversarial network based on the conditional denoising hidden vector. Text-guided three-dimensional editing of the original image is achieved by conditional diffusion in the hidden space of the three-dimensional generative adversarial network, so that image editing guided by natural language is realized while the accuracy of text guidance is guaranteed.
Based on the content of the above embodiment, the three-dimensional image editing system further includes:
the model training module is used for mapping the original image sample and the edited image sample into an original image sample hidden vector and an edited image sample hidden vector respectively; encoding the text instruction samples into text sample vectors; training a noise predictor model based on the original image sample hidden vector, the edited image sample hidden vector and the text sample vector until a loss function of the noise predictor model converges; the noise predictor model that converges the loss function is used as a noise predictor.
Optionally, the model training module is further configured to:
extracting a feature vector sample of a target object sample from the original image sample;
the feature vector samples are used for training of the noise predictor model.
Optionally, the noise predictor model includes a text-image bi-conditional model, an image mono-conditional model, and an unconditional model.
Optionally, the image processing module is specifically configured to:
based on the input original image, performing image mapping processing to obtain a hidden vector corresponding to the original image;
based on the hidden vector, noise adding processing is carried out to obtain a noise adding hidden vector corresponding to the hidden vector;
based on the input text instruction, performing text encoding processing to obtain a text vector corresponding to the text instruction;
based on the input original image, carrying out feature extraction processing on the target object to obtain a feature vector corresponding to the target object;
based on the hidden vector, the text vector and the feature vector, denoising the denoising hidden vector through a noise predictor to obtain a conditional denoising hidden vector corresponding to the denoising hidden vector;
and rendering an edited image through the three-dimensional generative adversarial network based on the conditional denoising hidden vector.
Optionally, the model training module is more specifically configured to:
Converting the text instruction sample into a digital token;
converting the digital token according to a preset token conversion rule, wherein the preset token conversion rule is as follows: after converting the text instruction sample into the digital token with the preset length, randomly adjusting the token position corresponding to the text instruction sample to any position in the digital token;
and encoding the converted digital token based on the pre-trained text encoding model to obtain a text sample vector.
Fig. 6 illustrates a physical schematic diagram of an electronic device, as shown in fig. 6, which may include: processor 610, communication interface (Communications Interface) 620, memory 630, and communication bus 640, wherein processor 610, communication interface 620, and memory 630 communicate with each other via communication bus 640. The processor 610 may invoke logic instructions in the memory 630 to perform a three-dimensional image editing method comprising: acquiring an original image and a text instruction; inputting an original image and a text instruction into an image editing model to obtain an editing image output by the image editing model, wherein the editing image comprises a plurality of target objects displayed in the original image from different angles, and the form of each target object in each editing image is matched with the description of the text instruction; the image editing model is used for mapping an original image into hidden vectors, removing noise added on the hidden vectors under the guidance of a text instruction through a noise predictor, removing the noise hidden vectors based on the conditions obtained by removing the noise, rendering an edited image through a three-dimensional generation countermeasure network, and training the noise predictor based on an original image sample, an edited image sample and a text instruction sample; the editing image sample is an image which shows a target object sample in the original image sample from the same angle, and the shape of the target object sample is matched with the description of the text instruction sample; the text instruction sample semantically represents a change from the original image sample to the edited image sample.
Further, the logic instructions in the memory 630 may be implemented in the form of software functional units and stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product, the computer program product comprising a computer program, the computer program being storable on a non-transitory computer readable storage medium, the computer program, when executed by a processor, being capable of performing the three-dimensional image editing method provided by the above methods, the method comprising: acquiring an original image and a text instruction; inputting an original image and a text instruction into an image editing model to obtain an editing image output by the image editing model, wherein the editing image comprises a plurality of target objects displayed in the original image from different angles, and the form of each target object in each editing image is matched with the description of the text instruction; the image editing model is used for mapping an original image into hidden vectors, removing noise added on the hidden vectors under the guidance of a text instruction through a noise predictor, removing the noise hidden vectors based on the conditions obtained by removing the noise, rendering an edited image through a three-dimensional generation countermeasure network, and training the noise predictor based on an original image sample, an edited image sample and a text instruction sample; the editing image sample is an image which shows a target object sample in the original image sample from the same angle, and the shape of the target object sample is matched with the description of the text instruction sample; the text instruction sample semantically represents a change from the original image sample to the edited image sample.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the three-dimensional image editing method provided by the above methods, the method comprising: acquiring an original image and a text instruction; inputting an original image and a text instruction into an image editing model to obtain an editing image output by the image editing model, wherein the editing image comprises a plurality of target objects displayed in the original image from different angles, and the form of each target object in each editing image is matched with the description of the text instruction; the image editing model is used for mapping an original image into hidden vectors, removing noise added on the hidden vectors under the guidance of a text instruction through a noise predictor, removing the noise hidden vectors based on the conditions obtained by removing the noise, rendering an edited image through a three-dimensional generation countermeasure network, and training the noise predictor based on an original image sample, an edited image sample and a text instruction sample; the editing image sample is an image which shows a target object sample in the original image sample from the same angle, and the shape of the target object sample is matched with the description of the text instruction sample; the text instruction sample semantically represents a change from the original image sample to the edited image sample.
The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A three-dimensional image editing method, comprising:
acquiring an original image and a text instruction;
inputting the original image and the text instruction into an image editing model to obtain edited images output by the image editing model, wherein the edited images comprise a plurality of images showing the target object in the original image from different angles, and the form of the target object in each edited image matches the description of the text instruction;
wherein the image editing model is configured to map the original image into a hidden vector, remove the noise added to the hidden vector through a noise predictor under the guidance of the text instruction, and render the edited images through a three-dimensional generative adversarial network based on the conditional denoising hidden vector obtained by removing the noise, the noise predictor being trained based on an original image sample, an edited image sample and a text instruction sample;
the edited image sample is an image that shows the target object sample in the original image sample from the same angle, with the form of the target object sample matching the description of the text instruction sample; and the text instruction sample semantically describes the change from the original image sample to the edited image sample.
2. The three-dimensional image editing method according to claim 1, wherein the training process of the noise predictor comprises:
mapping the original image sample and the edited image sample into an original image sample hidden vector and an edited image sample hidden vector respectively;
encoding the text instruction samples into text sample vectors;
training a noise predictor model based on the original image sample hidden vector, the edited image sample hidden vector, and the text sample vector until a loss function of the noise predictor model converges;
the noise predictor model that converges the loss function is taken as the noise predictor.
3. The three-dimensional image editing method according to claim 2, wherein the training process of the noise predictor further comprises:
Extracting a feature vector sample of the target object sample from the original image sample;
the feature vector samples are used for training of the noise predictor model.
4. A three-dimensional image editing method according to claim 3, wherein the noise predictor model comprises a text-image bi-conditional model, an image mono-conditional model and an unconditional model.
5. The three-dimensional image editing method according to claim 4, wherein said inputting the original image and the text instruction into an image editing model to obtain an edited image output by the image editing model, comprises:
performing image mapping processing based on the input original image to obtain the hidden vector corresponding to the original image;
based on the hidden vector, noise adding processing is carried out to obtain a noise adding hidden vector corresponding to the hidden vector;
based on the input text instruction, performing text encoding processing to obtain a text vector corresponding to the text instruction;
based on the input original image, carrying out feature extraction processing on the target object to obtain a feature vector corresponding to the target object;
Based on the hidden vector, the text vector and the feature vector, denoising the denoising hidden vector through the noise predictor to obtain a conditional denoising hidden vector corresponding to the denoising hidden vector;
and rendering the edited image through the three-dimensional generative adversarial network based on the conditional denoising hidden vector.
6. The three-dimensional image editing method according to claim 2, wherein the encoding the text instruction samples into text sample vectors comprises:
converting the text instruction sample into a digital token;
converting the digital token according to a preset token conversion rule, wherein the preset token conversion rule is as follows: after converting the text instruction sample into the digital token with the preset length, randomly adjusting the token position corresponding to the text instruction sample to any position in the digital token;
and encoding the converted digital token based on a pre-trained text encoding model to obtain the text sample vector.
7. A three-dimensional image editing system, comprising:
the data acquisition module is used for acquiring an original image and a text instruction;
the image processing module is used for inputting the original image and the text instruction into an image editing model to obtain edited images output by the image editing model, wherein the edited images comprise a plurality of images showing the target object in the original image from different angles, and the form of the target object in each edited image matches the description of the text instruction;
wherein the image editing model is used for mapping the original image into a hidden vector, removing the noise added to the hidden vector through a noise predictor under the guidance of the text instruction, and rendering the edited images through a three-dimensional generative adversarial network based on the conditional denoising hidden vector obtained by removing the noise, the noise predictor being trained based on an original image sample, an edited image sample and a text instruction sample;
the edited image sample is an image that shows the target object sample in the original image sample from the same angle, with the form of the target object sample matching the description of the text instruction sample; and the text instruction sample semantically describes the change from the original image sample to the edited image sample.
8. The three-dimensional image editing system of claim 7, further comprising:
the model training module is used for mapping the original image sample and the edited image sample into an original image sample hidden vector and an edited image sample hidden vector respectively; encoding the text instruction samples into text sample vectors; training a noise predictor model based on the original image sample hidden vector, the edited image sample hidden vector, and the text sample vector until a loss function of the noise predictor model converges; the noise predictor model that converges the loss function is taken as the noise predictor.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the three-dimensional image editing method according to any of claims 1 to 6 when executing the program.
10. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the three-dimensional image editing method according to any one of claims 1 to 6.
CN202311267121.8A 2023-09-27 2023-09-27 Three-dimensional image editing method, system, electronic device and storage medium Pending CN117496099A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311267121.8A CN117496099A (en) 2023-09-27 2023-09-27 Three-dimensional image editing method, system, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311267121.8A CN117496099A (en) 2023-09-27 2023-09-27 Three-dimensional image editing method, system, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN117496099A true CN117496099A (en) 2024-02-02

Family

ID=89671514

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311267121.8A Pending CN117496099A (en) 2023-09-27 2023-09-27 Three-dimensional image editing method, system, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN117496099A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117953180A (en) * 2024-03-26 2024-04-30 厦门大学 Text-to-three-dimensional object generation method based on dual-mode latent variable diffusion
CN118096944A (en) * 2024-04-24 2024-05-28 武汉人工智能研究院 Method, device, equipment, medium and product for constructing costume editing model

Similar Documents

Publication Publication Date Title
CN112734634B (en) Face changing method and device, electronic equipment and storage medium
CN117496099A (en) Three-dimensional image editing method, system, electronic device and storage medium
CN111243050B (en) Portrait simple drawing figure generation method and system and painting robot
CN111275057B (en) Image processing method, device and equipment
CN115018954B (en) Image generation method, device, electronic equipment and medium
CN116109733A (en) Text-driven image editing method, device, electronic equipment and storage medium
CN117218246A (en) Training method and device for image generation model, electronic equipment and storage medium
CN117635771A (en) Scene text editing method and device based on semi-supervised contrast learning
US11803950B2 (en) Universal style transfer using multi-scale feature transform and user controls
CN114783017A (en) Method and device for generating confrontation network optimization based on inverse mapping
CN117422823A (en) Three-dimensional point cloud characterization model construction method and device, electronic equipment and storage medium
CN116630549A (en) Face modeling method and device, readable storage medium and electronic equipment
JP7479507B2 (en) Image processing method and device, computer device, and computer program
CN115810215A (en) Face image generation method, device, equipment and storage medium
CN111145096A (en) Super-resolution image reconstruction method and system based on recursive extremely-deep network
CN112990123B (en) Image processing method, apparatus, computer device and medium
US20220101145A1 (en) Training energy-based variational autoencoders
CN115631285A (en) Face rendering method, device and equipment based on unified drive and storage medium
Dere et al. Conditional reiterative High-Fidelity GAN inversion for image editing
CN117351575B (en) Nonverbal behavior recognition method and nonverbal behavior recognition device based on text-generated graph data enhancement model
CN115115537B (en) Image restoration method based on mask training
CN116704588B (en) Face image replacing method, device, equipment and storage medium
Park et al. StyleBoost: A Study of Personalizing Text-to-Image Generation in Any Style using DreamBooth
CN117237542B (en) Three-dimensional human body model generation method and device based on text
CN116862803B (en) Reverse image reconstruction method, device, equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination