CN117541681A - Image editing method and device based on depth feature generation and electronic equipment - Google Patents

Image editing method and device based on depth feature generation and electronic equipment

Info

Publication number
CN117541681A
CN117541681A (application CN202311390946.9A)
Authority
CN
China
Prior art keywords
vector
image
generator
patch
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311390946.9A
Other languages
Chinese (zh)
Inventor
王金桥
蔡鹏祥
刘智威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sinovision Jurong Technology Co ltd
Original Assignee
Sinovision Jurong Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sinovision Jurong Technology Co ltd filed Critical Sinovision Jurong Technology Co ltd
Priority to CN202311390946.9A
Publication of CN117541681A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/00 - 2D [Two Dimensional] image generation
    • G06T11/60 - Editing figures and text; Combining figures or text
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/0464 - Convolutional networks [CNN, ConvNet]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/10 - Segmentation; Edge detection
    • G06T7/13 - Edge detection
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20081 - Training; Learning


Abstract

The invention provides an image editing method and device based on depth feature generation, and an electronic device. The method comprises: in response to an operation point and a target point determined by a user on an image to be edited, acquiring the patch neighborhood feature corresponding to the operation point and the patch neighborhood feature corresponding to the target point; respectively aggregating the patch neighborhood feature corresponding to the operation point and the patch neighborhood feature corresponding to the target point into an operation point feature vector and a target point feature vector with the same dimension as a first w vector; obtaining a second w vector with a Transformer model based on the operation point feature vector and the target point feature vector; and splicing the second w vector with the first w vector to obtain a third w vector, and inputting the third w vector into a pre-trained StyleGAN generator to obtain the edited image. In this way, more efficient, more accurate and more semantically meaningful interactive image editing based on "point dragging" is realized.

Description

Image editing method and device based on depth feature generation and electronic equipment
Technical Field
The present invention relates to the field of image editing technologies, and in particular, to an image editing method and apparatus based on depth feature generation, and an electronic device.
Background
Image editing technology uses a computer to make detailed modifications and edits to image information. Most image editing techniques are based on image generation technology, in which images are generated by a computer program; such technology can be used to generate many types of images, including digital art, animation, game scenes, product designs and the like.
At present, most image editing technologies either cannot achieve precise control, for example, direct interpolation on the intermediate hidden vector of StyleGAN can only change the style attributes of certain features, cannot achieve fully decoupled control, and is neither accurate nor reliable; or, like DragGAN, they rely on the strong discriminative power of the intermediate features of the StyleGAN generator, which are highly correlated with the generated image in both the semantic and the spatial dimension, optimize the w vector by supervising a loss between the target point neighborhood and the handle point neighborhood of the generator intermediate features, and finally pass the optimized w vector through the StyleGAN generator to obtain the drag-edited image. Although DragGAN achieves an accurate pixel-level drag editing effect, the optimization algorithm is time-consuming in practical application scenarios, and the result of such a locally optimized algorithm is not necessarily semantically meaningful: it cannot properly distinguish between editing motion and editing deformation, for example, a large-scale drag that should cause motion instead produces a large deformation.
Disclosure of Invention
Aiming at the problems existing in the prior art, the invention provides an image editing method and device based on depth feature generation and electronic equipment.
In a first aspect, the present invention provides an image editing method based on depth feature generation, including:
responding to an operation point and a target point determined by a user on an image to be edited, and acquiring a patch neighborhood feature corresponding to the operation point and a patch neighborhood feature corresponding to the target point;
based on a convolutional neural network, respectively aggregating the patch neighborhood feature corresponding to the operation point and the patch neighborhood feature corresponding to the target point into an operation point feature vector and a target point feature vector with the same dimension as a first w vector; the first w vector is used for generating the image to be edited;
based on the operation point feature vector and the target point feature vector, obtaining a second w vector by using a Transformer model;
the second w vector and the first w vector are spliced to obtain a third w vector, and then the third w vector is input into a pre-trained StyleGAN generator to obtain an edited image;
wherein the training data of the Transformer model comprises: sample images generated by the pre-trained StyleGAN generator, the w vectors corresponding to the sample images, the generator intermediate features corresponding to the sample images, and the real images used in training the pre-trained StyleGAN generator.
In some embodiments, the obtaining a second w vector using a Transformer model based on the operation point feature vector and the target point feature vector includes:
taking the first w vector as the input of a self-attention module of the Transformer model, passing the output of the self-attention module through an AdaIN layer and adding it to a high-dimensional vector obtained by processing the operation point feature vector with an MLP, and inputting the addition result into a cross-attention module of the Transformer model;
and taking the vector obtained by repeatedly copying the target point feature vector as the other input of the cross-attention module, and performing a cross-attention operation with the addition result to obtain the second w vector.
In some embodiments, the convolutional neural network is a network structure of three convolutional layers followed by a maximum pooling layer.
In some embodiments, the splicing the second w vector with the first w vector to obtain a third w vector includes:
splicing the second w vector with a part of the vectors in the first w vector to obtain the third w vector.
In some embodiments, the training process of the Transformer model comprises:
For any training sample used in training, first performing edge detection on the sample image in the training sample to obtain an edge information map of the sample image, then randomly sampling the high-response points in the edge information map to obtain a training sample operation point, and randomly sampling within a neighborhood of a designated range around the training sample operation point to obtain a training sample target point;
based on the patch neighborhood features corresponding to the training sample operation point and the patch neighborhood features corresponding to the training sample target point, obtaining an edited sample image and the generator intermediate features corresponding to the edited sample image by using the convolutional neural network, the Transformer model and the pre-trained StyleGAN generator;
training the Transformer model by using a first loss function and a second loss function based on the edited sample image and the generator intermediate features corresponding to the edited sample image;
the first loss function is used for pulling the edited patch neighborhood features at the training sample target point toward the original patch neighborhood features at the training sample operation point; the second loss function is used for pushing the edited image toward a real image.
In some embodiments, the first loss function is formulated as:
$L_{drag}=\lVert \mathrm{dragFeat}[\mathrm{target\ patch}]-\mathrm{generatorFeat}[\mathrm{handle\ patch}]\rVert_1$
wherein $L_{drag}$ represents the loss value of the first loss function; dragFeat represents the generator intermediate features corresponding to the edited sample image; generatorFeat represents the generator intermediate features corresponding to the sample image; generatorFeat[handle patch] represents the original patch neighborhood features at the training sample operation point; dragFeat[target patch] represents the patch neighborhood features at the training sample target point after editing.
In some embodiments, the second loss function is formulated as:
$L_{GAN}=\lVert D(\mathrm{dragImg})-1\rVert$
wherein $L_{GAN}$ represents the loss value of the second loss function; $D(\mathrm{dragImg})$ represents the discriminator's score for the edited sample image; 1 is the highest score of the discriminator, representing a real image.
In a second aspect, the present invention further provides an image editing apparatus based on depth feature generation, including:
the acquisition module is used for responding to the operation point and the target point determined by the user on the image to be edited, and acquiring the patch neighborhood characteristics corresponding to the operation point and the patch neighborhood characteristics corresponding to the target point;
the convolution module is used for respectively aggregating the patch neighborhood characteristics corresponding to the operation points and the patch neighborhood characteristics corresponding to the target points into operation point characteristic vectors and target point characteristic vectors which are equal to the first w vector dimension based on a convolution neural network; the first w vector is used for generating the image to be edited;
The attention module is used for obtaining a second w vector by using a Transformer model based on the operation point feature vector and the target point feature vector;
the generation module is used for splicing the second w vector and the first w vector to obtain a third w vector, and inputting the third w vector into a pre-trained StyleGAN generator to obtain an edited image;
wherein the training data of the Transformer model comprises: sample images generated by the pre-trained StyleGAN generator, the w vectors corresponding to the sample images, the generator intermediate features corresponding to the sample images, and the real images used in training the pre-trained StyleGAN generator.
In some embodiments, the obtaining a second w vector using a Transformer model based on the operation point feature vector and the target point feature vector includes:
taking the first w vector as the input of a self-attention module of the Transformer model, passing the output of the self-attention module through an AdaIN layer and adding it to a high-dimensional vector obtained by processing the operation point feature vector with an MLP, and inputting the addition result into a cross-attention module of the Transformer model;
and taking the vector obtained by repeatedly copying the target point feature vector as the other input of the cross-attention module, and performing a cross-attention operation with the addition result to obtain the second w vector.
In some embodiments, the convolutional neural network is a network structure of three convolutional layers followed by a maximum pooling layer.
In some embodiments, the splicing the second w vector with the first w vector to obtain a third w vector includes:
splicing the second w vector with a part of the vectors in the first w vector to obtain the third w vector.
In some embodiments, the training process of the Transformer model comprises:
for any training sample used in training, first performing edge detection on the sample image in the training sample to obtain an edge information map of the sample image, then randomly sampling the high-response points in the edge information map to obtain a training sample operation point, and randomly sampling within a neighborhood of a designated range around the training sample operation point to obtain a training sample target point;
based on the patch neighborhood features corresponding to the training sample operation point and the patch neighborhood features corresponding to the training sample target point, obtaining an edited sample image and the generator intermediate features corresponding to the edited sample image by using the convolutional neural network, the Transformer model and the pre-trained StyleGAN generator;
Training the Transformer model by using a first loss function and a second loss function based on the edited sample image and the generator intermediate features corresponding to the edited sample image;
the first loss function is used for pulling the edited patch neighborhood features at the training sample target point toward the original patch neighborhood features at the training sample operation point; the second loss function is used for pushing the edited image toward a real image.
In some embodiments, the first loss function is formulated as:
$L_{drag}=\lVert \mathrm{dragFeat}[\mathrm{target\ patch}]-\mathrm{generatorFeat}[\mathrm{handle\ patch}]\rVert_1$
wherein $L_{drag}$ represents the loss value of the first loss function; dragFeat represents the generator intermediate features corresponding to the edited sample image; generatorFeat represents the generator intermediate features corresponding to the sample image; generatorFeat[handle patch] represents the original patch neighborhood features at the training sample operation point; dragFeat[target patch] represents the patch neighborhood features at the training sample target point after editing.
In some embodiments, the second loss function is formulated as:
$L_{GAN}=\lVert D(\mathrm{dragImg})-1\rVert$
wherein $L_{GAN}$ represents the loss value of the second loss function; $D(\mathrm{dragImg})$ represents the discriminator's score for the edited sample image; 1 is the highest score of the discriminator, representing a real image.
In a third aspect, the present invention also provides an electronic device, including a memory, a processor, and a computer program stored on the memory and running on the processor, where the processor implements the image editing method based on depth feature generation according to the first aspect as described above when executing the program.
In a fourth aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the image editing method based on depth feature generation as described in the first aspect.
In a fifth aspect, the invention also provides a computer program product comprising a computer program which, when executed by a processor, implements an image editing method based on depth feature generation as described in any of the above.
According to the image editing method and device based on depth feature generation and the electronic device provided by the invention, more efficient, more accurate and more semantically meaningful interactive image editing based on "point dragging" is realized through the strong discriminative power of the generator intermediate features of the StyleGAN pre-trained model and the strong sequence modeling and prediction capability of the Transformer model structure.
Drawings
In order to more clearly illustrate the invention or the technical solutions in the related art, the following description will briefly explain the drawings used in the embodiments or the related art description, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for those skilled in the art.
FIG. 1 is a schematic flow chart of an image editing method based on depth feature generation provided by the invention;
FIG. 2 is an exemplary diagram of a Transformer model provided by the present invention;
FIG. 3 is a diagram illustrating an exemplary overall design of a model training provided by the present invention;
FIG. 4 is a schematic structural diagram of an image editing apparatus based on depth feature generation according to the present invention;
fig. 5 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
The term "and/or" in the present invention describes an association relationship of association objects, which means that three relationships may exist, for example, a and/or B may mean: a exists alone, A and B exist together, and B exists alone. The character "/" generally indicates that the context-dependent object is an "or" relationship.
The term "plurality" in the present invention means two or more, and other adjectives are similar thereto.
The terms "first," "second," and the like, herein, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances such that the embodiments of the invention are capable of operation in sequences other than those illustrated or otherwise described herein, and that the "first" and "second" distinguishing between objects generally are not limited in number to the extent that the first object may, for example, be one or more.
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In order to facilitate a clearer understanding of the technical solution of the present invention, some technical matters related to the present invention will be described first.
The most advanced image editing technologies at present are developed on the basis of StyleGAN image generation. The main idea is that the decoupling property of the StyleGAN depth features in the hidden space can be used to precisely control different attributes of an object: the hidden space of a pre-trained StyleGAN is found, the image is inverted into that hidden space, and the hidden vector is then projected onto different attribute directions to obtain hidden-vector directions representing different attributes, after which condition vectors are added to adjust the object attributes. Since the StyleGAN depth features do not achieve complete decoupling of object attributes, the hidden-vector direction obtained by projection does not necessarily represent a single attribute, so unexpected editing results are easily produced. Moreover, the hidden vector obtained by projection has poor interpretability and cannot achieve a precise editing effect, and many methods interpolate between features without being able to explain the editing effect.
DragGAN can precisely follow the instructions entered by the user with an intuitive point-based interactive image editing method while maintaining the diversity of the images. DragGAN is a feature-optimization-based method that realizes pixel-level image drag editing and can accurately edit any position of an image to achieve the effect the user desires. The feature optimization of DragGAN is also based on a StyleGAN generation model: DragGAN uses the intermediate features of the StyleGAN generator, where the w vector lies in the W space produced by the mapping network and the generator intermediate features are obtained by passing the w vector through the generator; these features are semantically and spatially associated with the generated image. DragGAN optimizes the generator intermediate features by using the neighborhood around the target point of the generator intermediate features as supervision for the neighborhood around the handle point, and thereby indirectly optimizes the w vector, achieving the effect of dragging the handle point to the target point.
However, current image editing technologies cannot achieve precise control. For example, direct interpolation on the intermediate hidden vector of StyleGAN can only change the style attributes of certain features, cannot achieve fully decoupled control, and is neither accurate nor reliable. Alternatively, DragGAN relies on the strong discriminative power of the intermediate features of the StyleGAN generator, which are highly correlated with the generated image in both the semantic and the spatial dimension, optimizes the w vector by supervising a loss between the target point neighborhood and the handle point neighborhood of the generator intermediate features, and finally passes the optimized w vector through the StyleGAN generator to obtain the drag-edited image. Although DragGAN achieves an accurate pixel-level drag editing effect, the optimization algorithm is time-consuming in practical application scenarios, and the result of such a locally optimized algorithm is not necessarily semantically meaningful: it cannot properly distinguish between editing motion and editing deformation, for example, a large-scale drag that should cause motion instead produces a large deformation.
In view of the difficulties and problems of current image editing methods, such as coarse editing granularity, low time efficiency of optimization and lack of semantics in the generated results, the invention provides an image editing algorithm based on depth feature generation. Relying on the strong discriminative power of the intermediate features of the pre-trained StyleGAN generator and the strong sequence modeling and prediction capability of the Transformer model structure, it realizes a neural network based on a Transformer encoder-Transformer decoder architecture and, by designing a suitable training strategy, implicitly learns the movement path of the StyleGAN generator intermediate features in the editing space, thereby realizing more efficient, more accurate and more semantically meaningful interactive image editing based on "point dragging".
Fig. 1 is a schematic flow chart of an image editing method based on depth feature generation, as shown in fig. 1, the method includes the following steps:
Step 100, in response to the operation point and the target point determined by the user on the image to be edited, acquiring the patch neighborhood feature corresponding to the operation point and the patch neighborhood feature corresponding to the target point.
Specifically, the main execution body of each step in the method may be an image editing device, and the device may be implemented by software and/or hardware, and the device may be integrated in an electronic device, where the electronic device may be a terminal device (such as a smart phone, a personal computer, etc.), or may be a server (such as a local server or a cloud server, or may be a server cluster, etc.), or may be a processor, or may be a chip, etc.
When a user edits an image, an operation point (handle point) and a target point need to be determined (for example, by clicking or other input) on the image to be edited. In response to the operation point and the target point determined by the user on the image to be edited, the image editing device can first obtain the patch neighborhood feature corresponding to the operation point and the patch neighborhood feature corresponding to the target point. A patch neighborhood feature refers to the generator intermediate features in a patch-sized neighborhood around the operation point or the target point, and the generator intermediate features refer to the features output by each layer of the generating network when the pre-trained StyleGAN generator generates the corresponding image; they may simply be called generator features.
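As a concrete illustration of what a patch neighborhood feature is, the following minimal sketch crops a square window of generator intermediate features around a point. The function name, the tensor layout and the default patch radius are illustrative assumptions, not values fixed by the method.

```python
import torch

def extract_patch_feature(feat_map: torch.Tensor, point, radius: int = 3) -> torch.Tensor:
    """Crop the generator intermediate features in a square neighborhood around a point.

    feat_map: generator intermediate features of shape (C, H, W)
    point:    (row, col) of the handle/target point in feature-map coordinates
    radius:   half-size of the window, so the patch is at most (2*radius+1) x (2*radius+1)
    """
    _, h, w = feat_map.shape
    r0, r1 = max(point[0] - radius, 0), min(point[0] + radius + 1, h)
    c0, c1 = max(point[1] - radius, 0), min(point[1] + radius + 1, w)
    return feat_map[:, r0:r1, c0:c1]  # (C, ph, pw) patch neighborhood feature
```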
In the present invention, the pre-trained StyleGAN generator may be any pre-trained StyleGAN-type image generator, for example a generator of StyleGAN, StyleGAN2, StyleGAN3, etc., which is not limited here.
The above step 100 follows a process similar to that of editing an image with DragGAN, and is not described in detail in the present invention.
Step 101, based on a convolutional neural network, respectively aggregating the patch neighborhood feature corresponding to the operation point and the patch neighborhood feature corresponding to the target point into an operation point feature vector and a target point feature vector with the same dimension as a first w vector; the first w vector is the w vector used for generating the image to be edited.
Specifically, after the patch neighborhood feature corresponding to the operation point and the patch neighborhood feature corresponding to the target point are obtained, the image editing device may use a convolutional neural network to aggregate them into an operation point feature vector and a target point feature vector with the same dimension as the first w vector, respectively.
For example, if the first w vector is an 18×512 w vector, the image editing apparatus may use the convolutional neural network to aggregate the patch neighborhood feature corresponding to the operation point and the patch neighborhood feature corresponding to the target point into an operation point feature vector and a target point feature vector of dimension 512, respectively; for example, the operation point feature vector and the target point feature vector may be 1×512 vectors.
In some embodiments, the convolutional neural network may be a network structure with three convolutional layers followed by a maximum pooling layer, i.e., the convolutional operation is performed on the patch neighborhood feature first, and then the maximum pooling operation is performed to obtain the feature vector. The specific structure of the convolutional neural network is not limited by the present invention, as long as the aim of aggregation can be achieved.
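A minimal PyTorch sketch of such an aggregation network is given below. The kernel sizes, activation functions and the use of global max pooling are assumptions, since the text only fixes "three convolutional layers followed by a max pooling layer" and an output dimension equal to one row of the w vector (512).

```python
import torch
import torch.nn as nn

class PatchConv(nn.Module):
    """Aggregate a patch of generator features into a 512-dim vector (one row of the w vector)."""

    def __init__(self, in_channels: int, dim: int = 512):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.pool = nn.AdaptiveMaxPool2d(1)  # max pooling over the whole patch

    def forward(self, patch_feat: torch.Tensor) -> torch.Tensor:
        # patch_feat: (B, C, ph, pw) -> (B, 512) operation/target point feature vector
        return self.pool(self.convs(patch_feat)).flatten(1)
```

The same network (or two networks with the same structure) would be applied to the handle patch and to the target patch to obtain the operation point feature vector and the target point feature vector.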
Step 102, obtaining a second w vector by using a Transformer model based on the operation point feature vector and the target point feature vector.
Specifically, after acquiring the operation point feature vector and the target point feature vector, the image editing apparatus may use a Transformer model to obtain a second w vector, where the Transformer model may follow an existing Transformer model structure that includes a self-attention module and a cross-attention module.
In some embodiments, deriving the second w vector using the Transformer model based on the operation point feature vector and the target point feature vector comprises:
taking the first w vector as the input of the self-attention module of the Transformer model, passing the output of the self-attention module through the AdaIN layer and adding it to a high-dimensional vector obtained by processing the operation point feature vector with an MLP, and inputting the addition result into the cross-attention module of the Transformer model;
and taking the vector obtained by repeatedly copying the target point feature vector as the other input of the cross-attention module, and performing a cross-attention operation with the addition result to obtain the second w vector.
Fig. 2 is an exemplary diagram of the Transformer model provided in the present invention. As shown in fig. 2, the operation point feature vector and the target point feature vector aggregated from the generator features have the same dimension as the original w vector but a rather different data distribution, so they are not suitable for a direct attention operation. Referring to fig. 2, the original w vector is taken as the input of the self-attention operation, and the self-attention output is added, through an AdaIN layer, to h_{1×512} (in this implementation, h_{1×512} is first mapped by an MLP (multi-layer perceptron) to a higher-dimensional vector, and this higher-dimensional vector is reshaped and then added to the self-attention output; h_{1×512} in this example is the 1×512 operation point feature vector). t_{6×512} (obtained by copying t_{1×512} six times, by way of example only; t_{1×512} in this example is the 1×512 target point feature vector) is taken as the input of the cross-attention operation, and a cross-attention operation is performed between t_{6×512} and the AdaIN-added output to obtain the final attention result, i.e. the second w vector.
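The sketch below mirrors the path just described (self-attention over the rows of the w vector, AdaIN-style modulation with the handle embedding, cross-attention against repeated copies of the target embedding). It assumes PyTorch; the head count, the MLP width, the exact AdaIN formulation and the 18/6/512 dimensions (taken from the example above) are assumptions rather than values fixed by the patent.

```python
import torch
import torch.nn as nn

class DragTransformer(nn.Module):
    """Sketch of the self-attention / AdaIN / cross-attention path described above."""

    def __init__(self, dim: int = 512, n_w: int = 18, n_out: int = 6, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # MLP lifting the 1x512 handle embedding to an n_w x 512 modulation tensor
        self.handle_mlp = nn.Sequential(nn.Linear(dim, dim * n_w), nn.ReLU())
        self.norm = nn.InstanceNorm1d(n_w, affine=False)  # normalisation part of the AdaIN-style layer
        self.n_w, self.n_out, self.dim = n_w, n_out, dim

    def forward(self, w: torch.Tensor, h: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # w: (B, 18, 512) first w vector; h: (B, 512) handle embedding; t: (B, 512) target embedding
        sa, _ = self.self_attn(w, w, w)                        # self-attention over the rows of w
        mod = self.handle_mlp(h).view(-1, self.n_w, self.dim)  # reshape the MLP output to (B, 18, 512)
        x = self.norm(sa) + mod                                # AdaIN-style modulation, then addition
        t_rep = t.unsqueeze(1).repeat(1, self.n_out, 1)        # copy the target embedding 6 times: (B, 6, 512)
        out, _ = self.cross_attn(t_rep, x, x)                  # cross-attention -> second w vector (B, 6, 512)
        return out

# usage: second_w = DragTransformer()(torch.randn(1, 18, 512), torch.randn(1, 512), torch.randn(1, 512))
```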
Step 103, performing splicing processing on the second w vector and the first w vector to obtain a third w vector, and inputting the third w vector into a pre-trained StyleGAN generator to obtain an edited image.
Specifically, after the second w vector is obtained, the image editing device may then splice the second w vector with the first w vector to obtain a third w vector, and then input the third w vector into the pre-trained StyleGAN generator, so as to obtain an edited image, that is, obtain an image editing result.
In some embodiments, the stitching the second w vector with the first w vector to obtain a third w vector includes:
and splicing the second w vector with part of vectors in the first w vector to obtain a third w vector.
For example, assuming that the first w vector is a 18×512 vector and the second w vector is a 6×512 vector, the first 6×512 vector of the first w vector may be replaced with the second w vector, that is, the second w vector is spliced with the last 12×512 vector of the first w vector, thereby obtaining a third w vector, and the third w vector is used to generate a final image editing result. Of course, this is merely exemplary, and not limiting, and there may be other ways to splice the second w vector with the first w vector according to knowledge of those skilled in the art, which will not be described herein.
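A minimal sketch of this splicing step, under the 18×512 / 6×512 dimensions of the example above (the row split is an illustrative assumption):

```python
import torch

def splice_w(first_w: torch.Tensor, second_w: torch.Tensor) -> torch.Tensor:
    """Replace the first rows of the original w vector with the rows predicted by the Transformer model.

    first_w:  (18, 512) w vector used to generate the image to be edited
    second_w: (6, 512)  w vector predicted from the operation/target point feature vectors
    returns:  (18, 512) third w vector to be fed to the pre-trained StyleGAN generator
    """
    k = second_w.shape[0]
    return torch.cat([second_w, first_w[k:]], dim=0)
```

One plausible reading of this design is that, since the early rows of a StyleGAN w vector mainly control coarser structure, replacing only those rows lets the edit move the object while the remaining rows of the original w vector preserve the rest of the image's appearance; the patent itself only states that the second w vector is spliced with part of the first w vector.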
Fig. 3 is a diagram illustrating an exemplary overall design of model training provided in the present invention. As shown in fig. 3, the training samples are mainly used to optimize the parameters of the Transformer model; the pre-trained StyleGAN generator may use an existing pre-trained model, whose weights are frozen and not trained while the parameters of the Transformer model are optimized.
The training data of the Transformer model comprises: sample images generated by the pre-trained StyleGAN generator, the w vectors corresponding to the sample images, the generator intermediate features corresponding to the sample images, and the real images used in training the pre-trained StyleGAN generator.
Specifically, the invention learns the movement path of the features in the "editing space" based on the editability offered by the strong discriminative power of the StyleGAN generator intermediate features. Therefore, before the Transformer model is trained, a pre-trained StyleGAN generation model is used to generate a batch of training data according to random number seeds. The training data comprise the sample image (fake) finally generated by the pre-trained StyleGAN generator, the w vector, the generator intermediate features (feat), and the real images (real) used to train the StyleGAN pre-trained model; these four items are used as the data of the training samples.
The training process of the model is basically the same as the model inference process, except that the handle point and the target point are input by the user during inference, while they are selected by sampling during training.
In some embodiments, the training process of the Transformer model includes:
for any training sample used in training, first performing edge detection on the sample image in the training sample to obtain an edge information map of the sample image, then randomly sampling the high-response points in the edge information map to obtain a training sample operation point, and randomly sampling within a neighborhood of a designated range around the training sample operation point to obtain a training sample target point;
based on the patch neighborhood features corresponding to the training sample operation point and the patch neighborhood features corresponding to the training sample target point, obtaining an edited sample image and the generator intermediate features corresponding to the edited sample image by using a convolutional neural network, a Transformer model and a pre-trained StyleGAN generator;
training the Transformer model by using a first loss function and a second loss function based on the edited sample image and the generator intermediate features corresponding to the edited sample image;
The first loss function is used for pulling the edited patch neighborhood features at the training sample target point toward the original patch neighborhood features at the training sample operation point; the second loss function is used for pushing the edited image toward a real image.
Specifically, taking the example process shown in fig. 3, a w vector generated from a random number seed is used as a training sample; the w vector passes through the pre-trained StyleGAN2 model to obtain the generator features and the generated sample image fake. In order to make the "drag" patterns learned by the network more common and more reasonable, the embodiment of the invention may sample the training sample points in an edge-sampling manner: first, edge detection (for example, with the Canny edge detection algorithm) can be performed on fake to obtain the edge information map of fake. Edge information is clearly more efficient and more usable information for "drag" editing of an object, because edges are naturally closer to the main object in the image, which reduces inefficient sampling of background information; edge information also has stronger semantics and better matches the kind of object that "dragging" is applied to.
After the edge information map of fake is obtained, the high-response points in the edge information map are randomly sampled to obtain the training sample operation point (simply called the operation point or handle point during model training), where a high-response point may be a pixel whose gray value in the edge information map is higher than a preset threshold; the training sample operation point is obtained by randomly sampling these high-response points.
Then, random sampling is performed in a neighborhood of a designated range around the training sample operation point to obtain the training sample target point (simply called the target point during model training). For example, the training sample target point may be randomly sampled within a neighborhood of the training sample operation point with a radius of 100 pixels.
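A minimal sketch of this sampling strategy is shown below, assuming OpenCV and NumPy. The Canny thresholds and the way the target point is drawn inside the disk are assumptions; the gray-value threshold for high-response points and the 100-pixel radius follow the example values above.

```python
import cv2
import numpy as np

def sample_handle_and_target(fake_img, gray_threshold=200, max_radius=100, rng=None):
    """Sample a training-sample operation (handle) point on the Canny edge map of a
    generated image, then a target point in a neighborhood around it.

    fake_img: H x W x 3 uint8 image generated by the pre-trained StyleGAN generator.
    """
    rng = rng if rng is not None else np.random.default_rng()
    gray = cv2.cvtColor(fake_img, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 100, 200)                  # edge information map of fake
    ys, xs = np.nonzero(edges > gray_threshold)        # high-response points
    assert len(ys) > 0, "no edge pixels found"
    i = int(rng.integers(len(ys)))
    handle = np.array([ys[i], xs[i]], dtype=float)
    # target point: uniform sample inside a disk of radius max_radius around the handle point
    angle = rng.uniform(0.0, 2.0 * np.pi)
    r = max_radius * np.sqrt(rng.uniform())
    target = handle + np.array([r * np.sin(angle), r * np.cos(angle)])
    h, w = edges.shape
    target = np.clip(np.round(target), 0, [h - 1, w - 1]).astype(int)
    return handle.astype(int), target
```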
Then, in the same way as in the model inference process, the patch neighborhood feature of the handle point and the patch neighborhood feature of the target point are respectively input into the Patch Conv network, where the Patch Conv network consists of three convolutional layers followed by one max pooling layer, and the generator feature patches are aggregated into feature vectors, the handle embedding and the target embedding, with the same dimension as the original w vector, as in the following formulas:
$h_{1\times512}=\mathrm{maxpool}(\mathrm{conv}(\mathrm{handle\ patch}))$
$t_{1\times512}=\mathrm{maxpool}(\mathrm{conv}(\mathrm{target\ patch}))$
In the formulas, handle patch and target patch respectively represent the patch neighborhood features of the handle point and of the target point; conv denotes the convolution operation and maxpool the max pooling operation; $h_{1\times512}$ and $t_{1\times512}$ represent the 1×512 handle embedding (operation point feature vector) and the 1×512 target embedding (target point feature vector), respectively.
The original w vector of the training sample is used as the input of the self-attention module; the self-attention output is passed through the AdaIN layer, added to the handle embedding, and input into the cross-attention module, where the cross-attention operation finally yields the predicted w vector (corresponding to the second w vector of the training sample). The predicted w vector is spliced with the original w vector, and the pre-trained StyleGAN2 model is used to obtain the edited sample image (dragImg) and the generator features (dragFeat) corresponding to the edited sample image.
Then, based on dragImg and dragFeat, the parameters of the Transformer model are optimized using the first loss function and the second loss function. The first loss function uses the handle patch of the original generator features as supervision for the target patch of the edited generator features, pulling the features at the target patch in the edited generator features toward the features at the handle patch in the original generator features.
In some embodiments, the first loss function may use an L1 loss constraint, the formula of which is:
$L_{drag}=\lVert \mathrm{dragFeat}[\mathrm{target\ patch}]-\mathrm{generatorFeat}[\mathrm{handle\ patch}]\rVert_1$
wherein $L_{drag}$ represents the loss value of the first loss function; dragFeat represents the generator intermediate features corresponding to the edited sample image; generatorFeat represents the generator intermediate features corresponding to the sample image; generatorFeat[handle patch] represents the original patch neighborhood features at the training sample operation point; dragFeat[target patch] represents the patch neighborhood features at the training sample target point after editing.
In some embodiments, the edited sample image may be adversarially trained against the real image real using a squared adversarial loss (GAN loss), with the discriminator of the pre-trained StyleGAN2 model used as the discriminator. The second loss function is formulated as:
$L_{GAN}=\lVert D(\mathrm{dragImg})-1\rVert$
wherein $L_{GAN}$ represents the loss value of the second loss function; $D(\mathrm{dragImg})$ represents the discriminator's score for the edited sample image; 1 is the highest score of the discriminator, representing a real image. The weights of the discriminator are frozen and not trained; the discriminator only serves as a supervisor that guides the network to produce a better editing result.
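A minimal sketch of these two supervision terms, assuming PyTorch feature tensors of shape (B, C, H, W) and a frozen discriminator passed in as a callable; the patch indexing convention and the 1:1 weighting of the two terms are illustrative assumptions.

```python
import torch.nn.functional as F

def drag_loss(drag_feat, generator_feat, target_patch, handle_patch):
    """L_drag: L1 distance between the edited features at the target patch and the
    original features at the handle patch. Patches are (r0, r1, c0, c1) index slices."""
    t = drag_feat[:, :, target_patch[0]:target_patch[1], target_patch[2]:target_patch[3]]
    h = generator_feat[:, :, handle_patch[0]:handle_patch[1], handle_patch[2]:handle_patch[3]]
    return F.l1_loss(t, h)

def gan_loss(discriminator, drag_img):
    """L_GAN: push the discriminator's score for the edited image toward 1 (the score of a
    real image). The discriminator is the frozen one of the pre-trained StyleGAN2 model."""
    score = discriminator(drag_img)
    return (score - 1.0).abs().mean()

# total training loss for the Transformer model (the equal weighting is an assumption)
# loss = drag_loss(drag_feat, generator_feat, tgt_patch, hdl_patch) + gan_loss(D, drag_img)
```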
According to the image editing method based on depth feature generation provided by the invention, more efficient, more accurate and more semantically meaningful interactive image editing based on "point dragging" is realized through the strong discriminative power of the generator intermediate features of the StyleGAN pre-trained model and the strong sequence modeling and prediction capability of the Transformer model structure.
The image editing apparatus based on depth feature generation provided by the present invention will be described below, and the image editing apparatus based on depth feature generation described below and the image editing method based on depth feature generation described above may be referred to correspondingly to each other.
Fig. 4 is a schematic structural diagram of an image editing apparatus based on depth feature generation according to the present invention, as shown in fig. 4, the apparatus includes:
the obtaining module 400 is configured to obtain a patch neighborhood feature corresponding to an operation point and a patch neighborhood feature corresponding to a target point in response to the operation point and the target point determined by the user on the image to be edited;
The convolution module 410 is configured to aggregate, based on a convolutional neural network, the patch neighborhood feature corresponding to the operation point and the patch neighborhood feature corresponding to the target point into an operation point feature vector and a target point feature vector with the same dimension as a first w vector, respectively; the first w vector is used for generating the image to be edited;
an attention module 420, configured to obtain a second w vector using a Transformer model based on the operation point feature vector and the target point feature vector;
the generating module 430 is configured to splice the second w vector with the first w vector to obtain a third w vector, and then input the third w vector into the pre-trained StyleGAN generator to obtain an edited image;
the training data of the transducer model comprises: sample images generated by the pre-trained StyleGAN generator, w vectors corresponding to the sample images, generator intermediate features corresponding to the sample images, and real images used in training the pre-trained StyleGAN generator.
In some embodiments, deriving the second w vector using the Transformer model based on the operation point feature vector and the target point feature vector comprises:
taking the first w vector as the input of the self-attention module of the Transformer model, passing the output of the self-attention module through the AdaIN layer and adding it to a high-dimensional vector obtained by processing the operation point feature vector with an MLP, and inputting the addition result into the cross-attention module of the Transformer model;
and taking the vector obtained by repeatedly copying the target point feature vector as the other input of the cross-attention module, and performing a cross-attention operation with the addition result to obtain the second w vector.
In some embodiments, the convolutional neural network is a network structure of three convolutional layers followed by a maximum pooling layer.
In some embodiments, the stitching the second w vector with the first w vector to obtain a third w vector includes:
and splicing the second w vector with part of vectors in the first w vector to obtain a third w vector.
In some embodiments, the training process of the Transformer model includes:
for any training sample used in training, first performing edge detection on the sample image in the training sample to obtain an edge information map of the sample image, then randomly sampling the high-response points in the edge information map to obtain a training sample operation point, and randomly sampling within a neighborhood of a designated range around the training sample operation point to obtain a training sample target point;
based on the patch neighborhood features corresponding to the training sample operation point and the patch neighborhood features corresponding to the training sample target point, obtaining an edited sample image and the generator intermediate features corresponding to the edited sample image by using a convolutional neural network, a Transformer model and a pre-trained StyleGAN generator;
Training the Transformer model by using a first loss function and a second loss function based on the edited sample image and the generator intermediate features corresponding to the edited sample image;
the first loss function is used for pulling the edited patch neighborhood features at the training sample target point toward the original patch neighborhood features at the training sample operation point; the second loss function is used for pushing the edited image toward a real image.
In some embodiments, the first loss function is formulated as:
$L_{drag}=\lVert \mathrm{dragFeat}[\mathrm{target\ patch}]-\mathrm{generatorFeat}[\mathrm{handle\ patch}]\rVert_1$
wherein $L_{drag}$ represents the loss value of the first loss function; dragFeat represents the generator intermediate features corresponding to the edited sample image; generatorFeat represents the generator intermediate features corresponding to the sample image; generatorFeat[handle patch] represents the original patch neighborhood features at the training sample operation point; dragFeat[target patch] represents the patch neighborhood features at the training sample target point after editing.
In some embodiments, the second loss function is formulated as:
$L_{GAN}=\lVert D(\mathrm{dragImg})-1\rVert$
wherein $L_{GAN}$ represents the loss value of the second loss function; $D(\mathrm{dragImg})$ represents the discriminator's score for the edited sample image; 1 is the highest score of the discriminator, representing a real image.
It should be noted that, the device provided by the present invention can implement all the method steps implemented by the method embodiment and achieve the same technical effects, and the parts and beneficial effects that are the same as those of the method embodiment in the present embodiment are not described in detail herein.
Fig. 5 is a schematic structural diagram of an electronic device provided by the present invention. As shown in fig. 5, the electronic device may include: a processor 510, a communication interface (Communications Interface) 520, a memory 530 and a communication bus 540, wherein the processor 510, the communication interface 520 and the memory 530 communicate with each other through the communication bus 540. The processor 510 may invoke logic instructions in the memory 530 to perform any of the image editing methods based on depth feature generation provided by the embodiments described above, such as: in response to an operation point and a target point determined by a user on an image to be edited, acquiring the patch neighborhood feature corresponding to the operation point and the patch neighborhood feature corresponding to the target point; based on a convolutional neural network, respectively aggregating the patch neighborhood feature corresponding to the operation point and the patch neighborhood feature corresponding to the target point into an operation point feature vector and a target point feature vector with the same dimension as a first w vector, the first w vector being used for generating the image to be edited; obtaining a second w vector with a Transformer model based on the operation point feature vector and the target point feature vector; splicing the second w vector with the first w vector to obtain a third w vector, and inputting the third w vector into a pre-trained StyleGAN generator to obtain an edited image; the training data of the Transformer model comprising: sample images generated by the pre-trained StyleGAN generator, the w vectors corresponding to the sample images, the generator intermediate features corresponding to the sample images, and the real images used in training the pre-trained StyleGAN generator.
Further, the logic instructions in the memory 530 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
It should be noted that, the electronic device provided by the present invention can implement all the method steps implemented by the method embodiments and achieve the same technical effects, and the details and beneficial effects of the same parts and advantages as those of the method embodiments in the present embodiment are not described in detail.
In another aspect, the present invention also provides a non-transitory computer readable storage medium on which a computer program is stored; when executed by a processor, the computer program can implement any of the image editing methods based on depth feature generation provided in the above embodiments, for example: in response to an operation point and a target point determined by a user on an image to be edited, acquiring the patch neighborhood feature corresponding to the operation point and the patch neighborhood feature corresponding to the target point; based on a convolutional neural network, respectively aggregating the patch neighborhood feature corresponding to the operation point and the patch neighborhood feature corresponding to the target point into an operation point feature vector and a target point feature vector with the same dimension as a first w vector, the first w vector being used for generating the image to be edited; obtaining a second w vector with a Transformer model based on the operation point feature vector and the target point feature vector; splicing the second w vector with the first w vector to obtain a third w vector, and inputting the third w vector into a pre-trained StyleGAN generator to obtain an edited image; the training data of the Transformer model comprising: sample images generated by the pre-trained StyleGAN generator, the w vectors corresponding to the sample images, the generator intermediate features corresponding to the sample images, and the real images used in training the pre-trained StyleGAN generator.
It should be noted that, the non-transitory computer readable storage medium provided by the present invention can implement all the method steps implemented by the method embodiments and achieve the same technical effects, and detailed descriptions of the same parts and beneficial effects as those of the method embodiments in this embodiment are omitted.
In yet another aspect, the present invention further provides a computer program product comprising a computer program; the computer program may be stored on a non-transitory computer readable storage medium, and when executed by a processor it is capable of performing any of the image editing methods based on depth feature generation provided in the above embodiments, for example: in response to an operation point and a target point determined by a user on an image to be edited, acquiring the patch neighborhood feature corresponding to the operation point and the patch neighborhood feature corresponding to the target point; based on a convolutional neural network, respectively aggregating the patch neighborhood feature corresponding to the operation point and the patch neighborhood feature corresponding to the target point into an operation point feature vector and a target point feature vector with the same dimension as a first w vector, the first w vector being used for generating the image to be edited; obtaining a second w vector with a Transformer model based on the operation point feature vector and the target point feature vector; splicing the second w vector with the first w vector to obtain a third w vector, and inputting the third w vector into a pre-trained StyleGAN generator to obtain an edited image; the training data of the Transformer model comprising: sample images generated by the pre-trained StyleGAN generator, the w vectors corresponding to the sample images, the generator intermediate features corresponding to the sample images, and the real images used in training the pre-trained StyleGAN generator.
It should be noted that, the computer program product provided by the present invention can implement all the method steps implemented by the method embodiments and achieve the same technical effects, and the details of the same parts and the advantages as those of the method embodiments in the present embodiment are not described herein.
The apparatus embodiments described above are merely illustrative, wherein the modules illustrated as separate components may or may not be physically separate, and the components shown as modules may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general hardware platform, or of course by means of hardware. Based on this understanding, the part of the foregoing technical solution that in essence contributes to the prior art may be embodied in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disk, and which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the method described in the respective embodiments or in some parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it; although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. An image editing method based on depth feature generation, comprising:
in response to an operation point and a target point determined by a user on an image to be edited, acquiring a patch neighborhood feature corresponding to the operation point and a patch neighborhood feature corresponding to the target point;
based on a convolutional neural network, aggregating the patch neighborhood feature corresponding to the operation point and the patch neighborhood feature corresponding to the target point respectively into an operation point feature vector and a target point feature vector whose dimensions are equal to that of a first w vector, wherein the first w vector is used for generating the image to be edited;
based on the operation point feature vector and the target point feature vector, obtaining a second w vector by using a Transformer model;
splicing the second w vector and the first w vector to obtain a third w vector, and inputting the third w vector into a pre-trained StyleGAN generator to obtain an edited image;
wherein the training data of the Transformer model comprises: sample images generated by the pre-trained StyleGAN generator, w vectors corresponding to the sample images, generator intermediate features corresponding to the sample images, and real images used in training the pre-trained StyleGAN generator.
2. The image editing method based on depth feature generation according to claim 1, wherein the obtaining a second w vector by using a Transformer model based on the operation point feature vector and the target point feature vector comprises:
taking the first w vector as the input of a self-attention module of the Transformer model, adding, through an AdaIN layer, the output of the self-attention module to a high-dimensional vector obtained by processing the operation point feature vector with an MLP, and inputting the addition result into a cross-attention module of the Transformer model;
and taking a vector obtained by repeatedly copying the target point feature vector as the input of the cross-attention module, and performing a cross-attention operation with the addition result to obtain the second w vector.
3. The image editing method based on depth feature generation according to claim 1, wherein the convolutional neural network is a network structure consisting of three convolutional layers followed by a max pooling layer.
4. The image editing method based on depth feature generation according to claim 1, wherein the splicing the second w vector and the first w vector to obtain a third w vector comprises:
splicing the second w vector with part of the vectors in the first w vector to obtain the third w vector.
5. The depth feature generation-based image editing method according to any one of claims 1 to 4, wherein the training process of the Transformer model comprises:
for any training sample used in training, firstly performing edge detection on a sample image in the training sample to obtain an edge information map of the sample image, then randomly sampling high-response points in the edge information map to obtain a training sample operation point, and randomly sampling within a neighborhood of a designated range around the training sample operation point to obtain a training sample target point;
based on the patch neighborhood features corresponding to the training sample operation point and the patch neighborhood features corresponding to the training sample target point, obtaining an edited sample image and generator intermediate features corresponding to the edited sample image by using the convolutional neural network, the Transformer model and the pre-trained StyleGAN generator;
training the Transformer model by using a first loss function and a second loss function based on the edited sample image and the generator intermediate features corresponding to the edited sample image;
wherein the first loss function is used for pulling the patch neighborhood features at the training sample target point in the edited image toward the original patch neighborhood features at the training sample operation point; and the second loss function is used for pulling the edited image toward a real image.
6. The depth feature generation-based image editing method of claim 5, wherein the first loss function is formulated as:
L_drag = ‖dragFeat[target patch] - generatorFeat[handle patch]‖_1
wherein L_drag represents the loss value of the first loss function; dragFeat represents the generator intermediate features corresponding to the edited sample image; generatorFeat represents the generator intermediate features corresponding to the sample image; generatorFeat[handle patch] represents the original patch neighborhood features at the training sample operation point; and dragFeat[target patch] represents the patch neighborhood features at the training sample target point after editing.
7. The depth feature generation-based image editing method of claim 5, wherein the second loss function is formulated as:
L_GAN = ‖D(dragImg) - 1‖
wherein L_GAN represents the loss value of the second loss function; D(dragImg) represents the score given to the edited sample image by the discriminator; and 1 is the highest score of the discriminator, representing a real image.
8. An image editing apparatus based on depth feature generation, comprising:
the acquisition module is used for acquiring, in response to an operation point and a target point determined by a user on an image to be edited, a patch neighborhood feature corresponding to the operation point and a patch neighborhood feature corresponding to the target point;
the convolution module is used for aggregating, based on a convolutional neural network, the patch neighborhood feature corresponding to the operation point and the patch neighborhood feature corresponding to the target point respectively into an operation point feature vector and a target point feature vector whose dimensions are equal to that of a first w vector, wherein the first w vector is used for generating the image to be edited;
the attention module is used for obtaining a second w vector by using a Transformer model based on the operation point feature vector and the target point feature vector;
the generation module is used for splicing the second w vector and the first w vector to obtain a third w vector, and inputting the third w vector into a pre-trained StyleGAN generator to obtain an edited image;
wherein the training data of the Transformer model comprises: sample images generated by the pre-trained StyleGAN generator, w vectors corresponding to the sample images, generator intermediate features corresponding to the sample images, and real images used in training the pre-trained StyleGAN generator.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the depth feature generation-based image editing method of any one of claims 1 to 7.
10. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the depth feature generation-based image editing method of any one of claims 1 to 7.
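To illustrate the training procedure and loss functions of claims 5 to 7 (the network modules of claims 1 to 4 are sketched after the description of the embodiments above), the following is an illustrative, non-limiting sketch in the same PyTorch-style Python. The use of cv2.Canny as the edge detector, the sampling radius, and the names sample_points, drag_loss and gan_loss are assumptions made purely for illustration.

```python
import cv2                      # assumption: OpenCV Canny as the edge detector
import numpy as np
import torch


def sample_points(gray_image: np.ndarray, radius: int = 16):
    """Sample a training operation (handle) point from high-response edge
    pixels and a target point from a neighborhood around it (claim 5)."""
    # gray_image: 8-bit grayscale sample image of shape (H, W)
    edges = cv2.Canny(gray_image, 100, 200)          # edge information map
    ys, xs = np.nonzero(edges)                        # high-response points
    assert len(ys) > 0, "no edge response found"
    i = np.random.randint(len(ys))
    handle = np.array([ys[i], xs[i]])
    offset = np.random.randint(-radius, radius + 1, size=2)
    target = np.clip(handle + offset, 0, np.array(gray_image.shape[:2]) - 1)
    return handle, target


def drag_loss(drag_feat_target_patch: torch.Tensor,
              generator_feat_handle_patch: torch.Tensor) -> torch.Tensor:
    """First loss (claim 6): L1 distance pulling the patch features at the
    target point of the edited image toward the original patch features at
    the operation point."""
    return (drag_feat_target_patch - generator_feat_handle_patch).abs().sum()


def gan_loss(discriminator: torch.nn.Module, drag_img: torch.Tensor) -> torch.Tensor:
    """Second loss (claim 7): pushes the discriminator score of the edited
    image toward 1, the score assigned to a real image."""
    return (discriminator(drag_img) - 1.0).abs().sum()
```

The total training objective would then combine the two terms, for example L = L_drag + λ · L_GAN, where the weight λ is a further assumption not specified in the claims.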
CN202311390946.9A 2023-10-25 2023-10-25 Image editing method and device based on depth feature generation and electronic equipment Pending CN117541681A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311390946.9A CN117541681A (en) 2023-10-25 2023-10-25 Image editing method and device based on depth feature generation and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311390946.9A CN117541681A (en) 2023-10-25 2023-10-25 Image editing method and device based on depth feature generation and electronic equipment

Publications (1)

Publication Number Publication Date
CN117541681A true CN117541681A (en) 2024-02-09

Family

ID=89790882

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311390946.9A Pending CN117541681A (en) 2023-10-25 2023-10-25 Image editing method and device based on depth feature generation and electronic equipment

Country Status (1)

Country Link
CN (1) CN117541681A (en)

Similar Documents

Publication Publication Date Title
Xu et al. Adversarially approximated autoencoder for image generation and manipulation
CN111758105A (en) Learning data enhancement strategy
Li et al. Fast a3rl: Aesthetics-aware adversarial reinforcement learning for image cropping
CN111046178B (en) Text sequence generation method and system
CN112800893B (en) Face attribute editing method based on reinforcement learning
Tan et al. Selective dependency aggregation for action classification
CN116910572B (en) Training method and device for three-dimensional content generation model based on pre-training language model
CN114240735B (en) Arbitrary style migration method, system, storage medium, computer equipment and terminal
CN111476771A (en) Domain self-adaptive method and system for generating network based on distance countermeasure
Mo et al. Reload: Using reinforcement learning to optimize asymmetric distortion for additive steganography
CN113033410B (en) Domain generalization pedestrian re-recognition method, system and medium based on automatic data enhancement
Du et al. Boosting dermatoscopic lesion segmentation via diffusion models with visual and textual prompts
Liu et al. Drag your noise: Interactive point-based editing via diffusion semantic propagation
CN110942463B (en) Video target segmentation method based on generation countermeasure network
Jin et al. Text2poster: Laying out stylized texts on retrieved images
CN117541681A (en) Image editing method and device based on depth feature generation and electronic equipment
Lu et al. Siamese graph attention networks for robust visual object tracking
CN115690428A (en) Passive data unsupervised field self-adaption method for semantic segmentation
Li et al. Active instance segmentation with fractional-order network and reinforcement learning
Cui et al. StableDrag: Stable Dragging for Point-based Image Editing
CN112529772A (en) Unsupervised image conversion method under zero sample setting
Xiao et al. Optimizing generative adversarial networks in Latent Space
WO2024078308A1 (en) Image optimization method and apparatus, electronic device, medium, and program product
CN117423108B (en) Image fine granularity description method and system for instruction fine adjustment multi-mode large model
Chen et al. Unsupervised Learning: Deep Generative Model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination