WO2024043109A1 - Image processing method, image processing device, and program - Google Patents

Image processing method, image processing device, and program

Info

Publication number
WO2024043109A1
WO2024043109A1 (PCT/JP2023/029203)
Authority
WO
WIPO (PCT)
Prior art keywords
image
latent variable
latent
image processing
machine learning
Prior art date
Application number
PCT/JP2023/029203
Other languages
French (fr)
Japanese (ja)
Inventor
雪乃 大野
Original Assignee
キヤノン株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by キヤノン株式会社 (Canon Inc.)
Publication of WO2024043109A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/20 Processor architectures; Processor configuration, e.g. pipelining
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning

Definitions

  • the present invention relates to an image processing method using a machine learning model.
  • An image processing method is known that edits arbitrary feature values in an image by manipulating latent variables acquired based on an image to be processed (target image) and inputting the manipulated data to a machine learning model.
  • a machine learning model is generated using a generative adversarial network (GAN).
  • Non-Patent Document 1 discloses a method of editing the target image using, among the latent variables corresponding to the target image, the latent variable located where the feature values of the images used for training the machine learning model are most densely distributed.
  • in the image processing method disclosed in Non-Patent Document 1, editing the feature values to a large extent makes harmful effects (artifacts) likely to occur in the edited image.
  • the present invention aims to generate images with fewer harmful effects using a machine learning model.
  • the image processing method of the present invention includes a step of obtaining a first latent variable based on a first image, and a step of obtaining, based on the first latent variable, a second latent variable different from the first latent variable. It further includes a step of generating a second image by inputting the second latent variable into a first machine learning model, and a step of obtaining, based on the second image, a third latent variable different from the second latent variable. It further includes a step of obtaining, based on the third latent variable, a fourth latent variable different from the third latent variable, and a step of generating a third image by inputting the fourth latent variable into the first machine learning model.
  • an image with fewer harmful effects can be generated using a machine learning model.
  • FIG. 1 is a schematic diagram showing a latent space.
  • FIG. 2 is a block diagram of an image processing system in Example 1.
  • FIG. 3 is an external view of the image processing system in Example 1.
  • FIG. 4 is a diagram showing transitions of latent variables in Example 1.
  • FIG. 5 is a flowchart regarding the estimation phase in Example 1.
  • FIG. 6 is a flowchart regarding the learning phase of the first machine learning model.
  • FIG. 7 is a diagram showing the flow of the learning phase of the first machine learning model.
  • FIG. 8 is a flowchart of generation of the first latent variable in Example 1.
  • FIG. 9 is a block diagram of an image processing system in Example 2.
  • FIG. 10 is a diagram showing transitions of latent variables in Example 2.
  • FIG. 11 is a flowchart regarding the learning phase of the second machine learning model.
  • FIG. 12 is a diagram showing the flow of the learning phase of the second machine learning model.
  • FIG. 13 is a block diagram of an image processing system in Example 3.
  • FIG. 14 is a flowchart regarding the estimation phase in Example 3.
  • an estimated image is generated by editing feature values (attributes) in an image using a machine learning model generated using a generative adversarial network (GAN).
  • a latent variable that is a multidimensional tensor is input to the GAN.
  • the space in which latent variables exist is called a latent space.
  • the process of acquiring a latent variable based on an image is referred to as embedding (inversion) into the latent space. In the latent space, there is a direction corresponding to a change in an arbitrary feature value, and a latent variable corresponding to an image whose feature value has been edited is acquired (calculated) using a vector in that direction; this process is referred to as manipulation of the latent variable.
  • a feature value is a value (amount) indicating a feature in an image.
  • the feature includes, for example, facial expression, facial orientation, age, gender, hairstyle, and the like.
  • An image can be generated by inputting the manipulated latent variables into a machine learning model.
  • the harmful effect in this embodiment is the appearance of artifacts (false structures). Artifacts are distortions that reduce the realism of an image and give a perceptually strange impression; they are, for example, stains unintentionally generated in the edited image.
  • the machine learning model (first machine learning model) according to this embodiment is generated by GAN.
  • a GAN includes a generator that generates an image based on a latent variable (noise), and a discriminator that identifies whether an image is a fake image created by the generator or a correct (real) image. The generator learns, based on the discriminator's classification results, to generate images that the discriminator misidentifies. The discriminator, in turn, learns to distinguish real images from fake images generated by the generator. For example, StyleGAN, StyleGAN2, or StyleGAN3 may be used to generate the machine learning model.
  • the GAN generator in this embodiment includes a mapping network and a synthesis network.
  • the mapping network generates mapped latent variables based on the initial latent variables by nonlinearly transforming the initial latent space into a mapped latent space. Details of the initial latent space and the mapping latent space will be described later.
  • the synthesis network generates an image based on the mapped latent variables that are duplicated in a number corresponding to the resolution of the image to be generated.
  • when generating an image with a resolution of 1024 px × 1024 px using the synthesis network, for example, a 512-dimensional mapping latent variable is duplicated into 18 copies, and the 18 mapping latent variables are input to the synthesis network to generate the image.
  • the initial latent variable is a tensor with an arbitrary number of dimensions; for example, a tensor sampled from a Gaussian distribution can be used as the initial latent variable.
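  • As an illustration of the tensor shapes just described, the following is a minimal shape-level sketch (PyTorch) of sampling an initial latent variable from a Gaussian distribution, mapping it, and duplicating the mapped latent variable into 18 copies; the MappingNetwork below is a simplified stand-in, not the actual StyleGAN implementation used in the patent.

```python
# Minimal shape-level sketch. MappingNetwork is a stand-in for the mapping network
# described above; a real synthesis network would turn w_plus into a 1024x1024 image.
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Nonlinearly maps an initial latent variable z to a mapped latent variable w."""
    def __init__(self, dim=512, layers=4):
        super().__init__()
        blocks = []
        for _ in range(layers):
            blocks += [nn.Linear(dim, dim), nn.LeakyReLU(0.2)]
        self.net = nn.Sequential(*blocks)

    def forward(self, z):
        return self.net(z)

mapping = MappingNetwork()

z = torch.randn(1, 512)                    # initial latent variable sampled from a Gaussian
w = mapping(z)                             # mapping latent variable (512-dimensional)
w_plus = w.unsqueeze(1).repeat(1, 18, 1)   # duplicated into 18 copies -> (1, 18, 512)

print(w.shape, w_plus.shape)               # torch.Size([1, 512]) torch.Size([1, 18, 512])
```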
  • the space in which the initial latent variables exist is the initial latent space, and the correlation between the dimensions included in the initial latent space and the feature values of the images used for learning the machine learning model is low.
  • the mapping latent space is an area in which mapping latent variables obtained based on images used for learning by the first machine learning model exist. The closer a latent variable is to the center of gravity defined by a plurality of mapping latent variables in the mapping latent space, the lower the probability that an adverse effect will occur in the edited image even if the feature value is significantly edited by manipulating the latent variables.
  • intermediate latent variables existing in the intermediate latent space or extended latent variables existing in the extended latent space may be used.
  • the intermediate latent space is an extension of the mapping latent space, and the extended latent space is a further extension of the intermediate latent space.
  • the intermediate latent variables and extended latent variables include latent variables that are far from the center of gravity defined by the plurality of mapping latent variables. The farther the latent variable used to generate an image is from this center of gravity, the higher the probability that an adverse effect will occur; on the other hand, such latent variables can produce images with feature values that are rarely contained in the images used for training the first machine learning model, which allows precise editing even when sufficient training images are not available.
  • the intermediate latent variable is a latent variable different from the mapping latent variable.
  • when an extended latent variable is used, the extended latent variable, which is a 512 × 18-dimensional tensor obtained from, for example, 18 distinct 512-dimensional tensors, is input to the synthesis network.
  • FIG. 1 is a graph schematically showing a part of the latent space.
  • any two of the 18 tensors constituting the extended latent variable, which have the same dimensions as the mapping latent variable and the intermediate latent variable, are defined as the first and second expanded latent variables.
  • in FIG. 1, X is the first expanded latent variable and Y is the second expanded latent variable.
  • a latent variable (first latent variable) is acquired based on the original image (first image), and a latent variable (second latent variable) is acquired by manipulating the first latent variable.
  • an image (second image) is generated by inputting the second latent variable to the machine learning model.
  • further, a latent variable (third latent variable) is acquired based on the second image, a latent variable (fourth latent variable) is acquired by manipulating the third latent variable, and an image (third image) is generated by inputting the fourth latent variable into the machine learning model.
  • the third latent variable exists nearer to the center of gravity defined by the plurality of mapping latent variables than the second latent variable.
  • in editing an image, the manipulation of the latent variable is thus divided into multiple rounds, and the step of manipulating the latent variable, the step of generating an image based on the manipulated latent variable, and the step of embedding the generated image into the latent space are performed in order.
  • FIG. 2 is a block diagram of the image processing system 100 in this embodiment.
  • FIG. 3 is an external view of the image processing system 100.
  • the image processing system 100 includes a learning device 101, an image processing device (image estimation device) 102, a display device 103, a recording medium 104, an output device 105, and a network 106.
  • the learning device 101 and the image processing device 102 can communicate with each other via the network 106.
  • the learning device (first learning device) 101 includes a storage section 101a, an acquisition section 101b, a generation section 101c, and an updating section 101d, and determines the weight of the first machine learning model.
  • the image processing device 102 includes a storage unit 102a, an acquisition unit 102b, a conversion unit 102c, a generation unit 102d, and an estimation unit 102e, and generates an estimated image (output image) using a first machine learning model.
  • the function of generating an estimated image by the image processing device 102 can be implemented by one or more processors (processing means) such as a CPU.
  • the output image is output to at least one of the display device 103, the recording medium 104, or the output device 105.
  • the display device 103 is a liquid crystal display, a projector, or the like. The user can edit the image while checking the image being processed through the display device 103.
  • the recording medium 104 is a semiconductor memory, a hard disk, a server on a network, or the like, and stores the output image.
  • the output device 105 is a printer or the like. Note that input devices such as a mouse and a keyboard are not shown.
  • FIG. 4 is a graph schematically showing a part of the latent space similar to FIG. 1, and shows the behavior of the latent variables in this example.
  • FIG. 5 is a flowchart regarding the estimation phase.
  • the acquisition unit 102b acquires an original image (first image).
  • the first image may be an image stored in advance in the storage unit 102a.
  • the first image may be preprocessed if necessary.
  • the preprocessing performed on the first image is correction of the feature values. For example, when the first image is an image of a person's face, the positions of major organs such as the eyes, mouth, and nose of the person's face are adjusted (corrected).
  • the age of a person's face will be used as the first feature value.
  • step S102 the conversion unit 102c converts the first image into a first latent variable.
  • the first latent variable is obtained by inverse analysis using the first machine learning model.
  • by the inverse analysis using the first machine learning model described later, a first latent variable is acquired based on the first image by embedding the first image at a position where the feature values of the images used for training the first machine learning model are densely distributed. With such a configuration, it is possible to obtain a first latent variable with a low probability of causing an adverse effect.
  • the first latent variable is an intermediate latent variable.
  • In step S103, the generation unit 102d generates a second latent variable based on the first latent variable.
  • the second latent variable is obtained by manipulating the first latent variable.
  • as shown in FIG. 4, the second latent variable obtained in this way lies at a position in the latent space where the feature values of the images used for training the first machine learning model are distributed less densely than at the first latent variable.
  • the second latent variable is an intermediate latent variable.
  • the second latent variable is calculated by transitioning the first feature value in the style (intermediate) latent space.
  • extended latent variables may be used.
  • the operation of the latent variable is calculated based on the transition of the latent variable in the normal direction of the separating hyperplane.
  • the separating hyperplane is a plane that divides the intermediate latent space according to labels of the first feature value, and the regions of the intermediate latent space divided by the separating hyperplane differ with respect to the first feature value. For example, latent variables whose feature values correspond to "young" are distributed in one region of the intermediate latent space divided by the separating hyperplane, and latent variables whose feature values correspond to "old" are distributed in the other region.
  • the separation hyperplane is estimated using, for example, SVM (Support Vector Machine).
  • the normal vector direction of the separating hyperplane is the direction corresponding to the change in the first feature value. Note that the method for determining the orientation corresponding to the first feature value is not limited to this.
  • a threshold is set for the amount by which the first latent variable is manipulated as needed, and the generation unit 102d sets the amount by which the first latent variable is manipulated to be less than this threshold.
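  • The following is a hedged sketch of this manipulation, assuming intermediate latent variables labeled with the first feature value (for example "young"/"old") are available: the separating hyperplane is estimated with a linear SVM, as mentioned above, and the latent variable is shifted along the unit normal with the shift amount clipped to a threshold. The function names and the toy data are illustrative assumptions.

```python
# Sketch of steps S103/S106: estimate an edit direction with a linear SVM and
# shift a latent variable along it, clipping the amount of manipulation.
import numpy as np
from sklearn.svm import LinearSVC

def edit_direction(latents, labels):
    """Estimate the unit normal of the separating hyperplane for one feature value."""
    svm = LinearSVC(max_iter=10000).fit(latents, labels)
    n = svm.coef_[0]
    return n / np.linalg.norm(n)

def manipulate(latent, direction, amount, threshold=3.0):
    """Shift a latent variable along the edit direction, keeping |amount| below the threshold."""
    amount = float(np.clip(amount, -threshold, threshold))
    return latent + amount * direction

# toy data: 512-dimensional intermediate latent variables with binary feature labels
rng = np.random.default_rng(0)
latents = rng.normal(size=(200, 512))
labels = (latents[:, 0] > 0).astype(int)

n = edit_direction(latents, labels)
w1 = rng.normal(size=512)            # first latent variable
w2 = manipulate(w1, n, amount=2.0)   # second latent variable
```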
  • step S104 the estimation unit 102e estimates (generates) a second image based on the second latent variable.
  • a second image is estimated based on the second latent variable using the first machine learning model. Note that the information on the weights of the first machine learning model has been learned in the learning device 101 and is stored in the storage unit 102a.
  • step S105 the conversion unit 102c generates a third latent variable based on the second latent variable.
  • the third latent variable can be obtained by converting the second latent variable using a method similar to step S102.
  • the third latent variable is a latent variable different from the second latent variable, and exists nearer to the center of gravity defined by the plurality of mapping latent variables than the second latent variable in the latent space.
  • the third latent variable may be set at a position where the feature values of the images used for training the first machine learning model are more densely distributed than at the second latent variable.
  • the third latent variable is an intermediate latent variable.
  • step S106 the generation unit 102d generates a fourth latent variable based on the third latent variable.
  • the fourth latent variable can be obtained using a method similar to step S103.
  • the fourth latent variable may be acquired by editing a second feature value that is different from the first feature value.
  • as shown in FIG. 4, the fourth latent variable obtained in this way lies at a position in the latent space where the feature values of the images used for training the first machine learning model are distributed less densely than at the third latent variable.
  • the fourth latent variable is an intermediate latent variable.
  • step S107 the estimation unit 102e estimates (generates) the third image based on the fourth latent variable.
  • the third image can be acquired using a method similar to step S104.
  • the third image is an image obtained by editing the first feature value (or second feature value) with respect to the second image. Note that, if necessary, the third image may be used as a new second image, and the steps from step S104 to step S107 may be repeated one or more times to generate a new third image.
  • in this way, in editing an image, the manipulation of the latent variable is divided into multiple rounds, and the step of manipulating the latent variable, the step of generating an image based on the manipulated latent variable, and the step of embedding the generated image into the latent space are performed in order.
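  • The following is a hedged end-to-end sketch of how steps S101 to S107 (and their optional repetition) could be strung together; invert, manipulate, and generator are placeholders for the embedding, the latent manipulation, and the first machine learning model described above, not the patent's actual implementation.

```python
# Sketch of the estimation phase: a large feature edit is split into several
# manipulate -> generate -> re-embed rounds so that each manipulated latent stays
# close to the densely populated part of the latent space.
def edit_image(first_image, direction, total_amount, step_amount,
               invert, manipulate, generator):
    image = first_image
    remaining = total_amount
    latent = invert(image)                              # S102: first latent variable
    while remaining > 0:
        amount = min(step_amount, remaining)
        latent = manipulate(latent, direction, amount)  # S103/S106: edited latent variable
        image = generator(latent)                       # S104/S107: generated image
        remaining -= amount
        if remaining > 0:
            latent = invert(image)                      # S105: re-embed near the centroid
    return image                                        # final third image
```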
  • FIG. 6 is a flowchart of updating (learning) the weights of the first machine learning model. Each step in FIG. 6 is mainly performed by the acquisition unit 101b, the generation unit 101c, or the update unit 101d.
  • FIG. 7 is a diagram showing the configuration of the GAN in this example.
  • the GAN in this embodiment includes a generator 10 that generates an image and a classifier 11 that identifies the generated image.
  • the acquisition unit 101b acquires the correct image 12 from the storage unit 101a.
  • the correct image 12 is a plurality of images, and may be a captured image acquired by an imaging device or a CG (Computer Graphics) image. Further, it is preferable that the images included in the correct image 12 include a plurality of images in which feature values of the images change in stages. For example, when editing age as a feature value of a person's face, it is preferable to use an image containing the face of a person in school age or boyhood, which is between infancy and adolescence, as the correct image 12. With such a configuration, it is possible to generate a machine learning model that can edit the feature values of an image with high precision.
  • since the correct image 12 is a real image for the discriminator 11, it has a correct label corresponding to real.
  • the correct image 12 may be preprocessed if necessary.
  • the preprocessing performed on the correct image 12 is adjustment of the feature values from the correct image 12. For example, when the correct image 12 is an image of a person's face, the positions of major organs such as the eyes, mouth, nose, etc. of the person's face are adjusted (corrected).
  • the generation unit 101c generates the training latent variable 13.
  • the training latent variable 13 is a 512-dimensional tensor, and for example, any tensor sampled based on a Gaussian distribution may be used as the training latent variable 13.
  • step S203 the generation unit 101c inputs the training latent variable 13 to the generator 10 to generate the estimated image 14. Note that since the estimated image 14 is a fake image in the classifier 11, it has a correct label corresponding to the fake image.
  • Step S202 and Step S203 may be executed without performing Step S201.
  • the first machine learning model can be learned without using the correct image 12.
  • step S201, step S202, and step S203 may be performed one by one at random.
  • the number of times that step S201, step S202, and step S203 are performed is not limited to the same number of times.
  • step S204 the updating unit 101d updates the weight of the discriminator 11.
  • the classifier 11 acquires the correct image 12 or the estimated image 14 and generates an identification label.
  • the updating unit 101d updates the weight of the classifier 11 based on the error between the identification label and the correct label. If necessary, the weights of the classifier 11 may also be updated by inputting to the classifier 11 images obtained by applying geometric transformations such as inversion, translation, or rotation to the estimated image 14 and the correct image 12. With this configuration, even when the number of correct images 12 is small or the characteristics of the correct images 12 are biased, the identification accuracy of the classifier 11 can be improved.
  • step S205 the updating unit 101d updates the weight of the generator 10 based on the identification label.
  • for example, the weights are updated using the sigmoid cross entropy of the identification label; however, the loss is not limited to this.
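  • A minimal sketch of the weight updates in steps S204 and S205 using the sigmoid cross entropy of the identification label is given below (PyTorch); the discriminator is assumed to output a single logit per image, which is an assumption not stated in the text.

```python
# Sketch of one GAN training step: the discriminator is trained to label real
# images 1 and generated images 0, then the generator is trained to be labeled 1.
import torch
import torch.nn.functional as F

def train_step(generator, discriminator, opt_g, opt_d, real_images, latent_dim=512):
    b = real_images.size(0)

    # S204: update the discriminator on real and fake images
    z = torch.randn(b, latent_dim)
    fake_images = generator(z).detach()
    d_loss = (F.binary_cross_entropy_with_logits(discriminator(real_images), torch.ones(b, 1))
              + F.binary_cross_entropy_with_logits(discriminator(fake_images), torch.zeros(b, 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # S205: update the generator so that its images are identified as real
    z = torch.randn(b, latent_dim)
    g_loss = F.binary_cross_entropy_with_logits(discriminator(generator(z)), torch.ones(b, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```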
  • In step S206, the updating unit 101d determines whether learning has been completed. Completion of learning can be determined based on, for example, whether the number of repetitions of weight updates has reached a predetermined number, whether the amount of change in the weights at the time of updating is smaller than a predetermined value, or whether the quality of the estimated image 14 is higher than a predetermined quality. Whether the quality of the estimated image 14 is higher than the predetermined quality can be evaluated using, for example, a metric such as the Frechet Inception Distance, which measures the distance between the distributions of the estimated images 14 and the correct images 12. If it is determined that weight learning is not completed, the process returns to step S201, and the acquisition unit 101b acquires a new correct image 12. On the other hand, if it is determined that weight learning is completed, the updating unit 101d ends the learning and stores the weight information in the storage unit 101a.
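  • For reference, the Frechet Inception Distance mentioned above can be computed from the means and covariances of feature vectors of the estimated and correct images; the sketch below assumes the Inception feature extraction has been done elsewhere.

```python
# FID = ||mu1 - mu2||^2 + Tr(C1 + C2 - 2 (C1 C2)^{1/2}), computed from two sets
# of feature vectors (rows are samples, columns are feature dimensions).
import numpy as np
from scipy import linalg

def frechet_distance(feat_real, feat_fake):
    mu1, mu2 = feat_real.mean(axis=0), feat_fake.mean(axis=0)
    c1 = np.cov(feat_real, rowvar=False)
    c2 = np.cov(feat_fake, rowvar=False)
    covmean = linalg.sqrtm(c1 @ c2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(c1 + c2 - 2.0 * covmean))
```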
  • FIG. 8 is a flowchart of the generation of the first latent variable.
  • the conversion from the first image to the first latent variable is performed by inverse analysis using the first machine learning model.
  • Each step in FIG. 8 is mainly performed by the converter 102c or the generator 102d.
  • the inverse analysis using the first machine learning model in this example compares the image generated by inputting an arbitrary latent variable (first input latent variable) into the first machine learning model with the first image, and updates (optimizes) the first input latent variable using the error. Note that the information on the weights of the first machine learning model is read out in advance from the storage unit 101a and stored in the storage unit 102a. By repeatedly inputting the first input latent variable into the first machine learning model and updating it, a second input latent variable capable of generating an image similar to the first image with the first machine learning model is obtained. Whether an image is similar to the first image can be determined, for example, from the difference in each pixel value.
  • the second input latent variable having the closest distance to the center of gravity determined by the plurality of mapping latent variables is set as the first latent variable.
  • by setting the second input latent variable as the first latent variable using the method described above, it is possible to obtain a first latent variable with a low probability of generating a false structure.
  • a second input latent variable that exists in a position where the feature values of the image used for learning the first machine learning model are more densely distributed may be set as the first latent variable.
  • In step S301, a first input latent variable is set.
  • Any latent variable in the latent space can be the first input latent variable.
  • any tensor sampled based on a Gaussian distribution or the like may be used as the first input latent variable.
  • In step S302, an image (embedded image) is generated based on the first input latent variable using the first machine learning model. Note that the first machine learning model has been trained in advance, and the weight information is stored in the storage unit 102a.
  • In step S303, the first input latent variable is updated based on a loss function between the first image and the embedded image. The loss function uses, for example, the Euclidean norm of the difference between the pixel values of the first image and the embedded image, or the Euclidean norm calculated for each element of feature maps converted based on the first image and the embedded image; however, the loss function is not limited to these. For example, an error backpropagation method may be used for the update.
  • In step S304, it is determined whether the update of the first input latent variable is completed. Completion of the update can be determined based on, for example, whether the number of repetitions of updating the first input latent variable has reached a predetermined number, whether the amount of change in the first input latent variable at the time of the update is smaller than a predetermined value, or whether the value of the loss function in step S303 is smaller than a predetermined value. If it is determined that the update of the first input latent variable is not completed, the process returns to step S302, and the updated first input latent variable is input to the first machine learning model to generate a new embedded image. On the other hand, if it is determined that the update of the first input latent variable is completed, the updated first input latent variable is set as the second input latent variable and the process proceeds to step S305.
  • step S305 it is determined whether the generation of the second input latent variable is completed. Completion of generation can be determined based on whether the number of generated second input latent variables has reached a predetermined number. If it is determined that the generation of the second input latent variable is not completed, the process returns to step S301 and the first input latent variable is set. At this time, the latent variable that has not been set in step S301 up to that point is set as the first input latent variable. On the other hand, if it is determined that the generation of the second input latent variable is completed, the generation of the second input latent variable is finished and the process advances to step S306.
  • In step S306, the first latent variable is generated. The distances from the position of the center of gravity determined by the plurality of mapping latent variables to the plurality of second input latent variables are respectively calculated, and the second input latent variable with the closest distance is stored in the storage unit 102a as the first latent variable.
  • the distance from the center of gravity determined by the plurality of mapping latent variables can be calculated using, for example, the Euclidean norm.
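  • The following is a hedged sketch of the inverse analysis of FIG. 8 (steps S301 to S306), assuming a differentiable synthesis network: several candidate latent variables are optimized against the first image by backpropagating a pixel-wise Euclidean norm, and the candidate closest to the center of gravity of the mapping latent variables is kept as the first latent variable. The function and parameter names are illustrative assumptions.

```python
# Sketch of optimization-based embedding (inversion). `synthesis` maps a latent
# variable to an image; `w_centroid` is the center of gravity of the mapping latents.
import torch

def invert(first_image, synthesis, w_centroid, n_candidates=4, steps=200, lr=0.01):
    best_w, best_dist = None, float("inf")
    for _ in range(n_candidates):
        w = torch.randn(1, 512, requires_grad=True)       # S301: first input latent variable
        opt = torch.optim.Adam([w], lr=lr)
        for _ in range(steps):                            # S302-S304: update by backpropagation
            embedded = synthesis(w)                       # embedded image
            loss = torch.norm(embedded - first_image)     # Euclidean norm of pixel differences
            opt.zero_grad(); loss.backward(); opt.step()
        dist = torch.norm(w.detach() - w_centroid)        # S306: distance to the centroid
        if dist < best_dist:
            best_w, best_dist = w.detach(), dist
    return best_w                                         # first latent variable
```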
  • FIG. 9 is a block diagram of the image processing system 200 in this embodiment.
  • the image processing system 200 differs from the first embodiment in that a learning device 201 includes a first learning section (first learning means) 211 and a second learning section (second learning means) 212.
  • the image processing system 200 includes a learning device 201, an image processing device (image estimation device) 202, a display device 203, a recording medium 204, an output device 205, and a network 206.
  • the learning device 201 and the image processing device 202 can communicate with each other via the network 206.
  • the learning device 201 has a first learning section 211 and a second learning section 212.
  • the first learning unit 211 includes a storage unit 211a, an acquisition unit 211b, a generation unit 211c, and an update unit 211d, and generates a first machine learning model.
  • the second learning unit 212 includes a storage unit 212a, an acquisition unit 212b, a generation unit 212c, and an update unit 212d, and determines the weight of the second machine learning model.
  • the second machine learning model can obtain latent variables based on images.
  • the learning device 201 can implement its functions using one or more processors (learning means) such as a CPU.
  • the learning device 201 may be a server. Further, the first and second learning sections may be separate devices.
  • the image processing device 202 is the same as the image processing device 102 in the first embodiment, so the description thereof will be omitted. Further, the output image is output to at least one of the display device 203, the recording medium 204, or the output device 205.
  • the display device 203, the recording medium 204, and the output device 205 are the same as the display device 103, the recording medium 104, and the output device 105 in the first embodiment.
  • FIG. 10 is a graph schematically showing a part of the latent space similar to FIG. 1, and shows the behavior of the latent variables in this example. Note that the first machine learning model in this example is trained in the same manner as in Example 1, and the weight information is stored in the storage unit 102a.
  • the conversion unit 102c converts the first image into a first latent variable using a second machine learning model to be described later.
  • the third latent variable is a latent variable different from the second latent variable, and exists nearer to the center of gravity defined by the plurality of mapping latent variables than the second latent variable in the latent space. Note that the third latent variable may be set at a position where the feature values of the images used for training the first machine learning model are more densely distributed than at the second latent variable.
  • the first to fourth latent variables in this example are extended latent variables. Further, the first latent variable exists at a position where the feature values of the images used for training the first machine learning model are more densely distributed than at the second latent variable. Similarly, the third latent variable exists at a position where those feature values are more densely distributed than at the fourth latent variable.
  • in this way, in editing an image, the manipulation of the latent variable is divided into multiple rounds, and the step of manipulating the latent variable, the step of generating an image based on the manipulated latent variable, and the step of embedding the generated image into the latent space are performed in order.
  • FIG. 11 is a flowchart of learning weights of the second machine learning model.
  • FIG. 12 is a diagram showing the learning flow of the second machine learning model.
  • a second machine learning model that generates latent variables based on images is generated using a GAN having a generator 20 and a classifier 21.
  • the second machine learning model in this example converts an image into a latent variable near the center of gravity defined by the plurality of mapping latent variables. Alternatively, a latent variable located where the feature values of the images used for training the first machine learning model are more densely distributed may be generated. Note that the second machine learning model may be trained so that the variance of the plurality of tensors including the first and second extended latent variables becomes small when generating (estimating) latent variables. If necessary, training may also be performed so that the latent variable estimated by the second machine learning model is determined to be a mapping latent variable by the GAN discriminator. By performing the above-described training, it is possible to generate a second machine learning model that can estimate latent variables with a low probability of generating a false structure.
  • the acquisition unit 212b acquires the correct image 22.
  • the correct image 22 is a plurality of images, and may be a captured image acquired by an imaging device or a CG (Computer Graphics) image. Further, the correct image 22 may include an image used for learning the first machine learning model (correct image 12) or an image generated by the first machine learning model (estimated image 14).
  • step S502 the generation unit 212c generates (estimates) the estimated latent variable 23.
  • the generation unit 212c generates the estimated latent variable 23 by inputting the correct image 22 to the generator 20. Note that the estimated latent variable 23 has a correct label corresponding to a fake in the classifier 21.
  • step S503 the generation unit 212c acquires the correct latent variable 25.
  • a 512-dimensional tensor (corresponding to an initial latent variable) is input to the mapping network of the first machine learning model to generate a correct latent variable 25 (corresponding to a mapping latent variable).
  • the correct latent variable 25 has a correct label corresponding to real in the discriminator 21. Note that either steps S501 and S502, or step S503, may be selected at random and performed. Furthermore, the numbers of times that step S501, step S502, and step S503 are executed need not be the same.
  • step S504 the updating unit 212d updates the weight of the discriminator 21.
  • the classifier 21 obtains the estimated latent variable 23 or the correct latent variable 25 and generates an identification label.
  • the classifier 21 is updated based on the error between the identification label and the correct label.
  • step S505 the generation unit 212c generates the estimated image 24.
  • the estimated image 24 is generated by inputting the estimated latent variable 23 into the first machine learning model. Note that the first machine learning model has been trained in advance, and the weight information is stored in the storage unit 212a.
  • In step S506, the updating unit 212d updates the weight of the generator 20.
  • the loss function may be, for example, a loss function related to the identification label of the classifier 21, a loss function related to the estimated latent variable 23, a loss function related to the correct image 22 and the estimated image 24, or the like.
  • the loss function related to the identification label of the classifier 21 updates the weight based on the sigmoid cross entropy of the identification label.
  • the loss function for the estimated latent variable 23 is the variance of a plurality of tensors having a specific dimension included in the estimated latent variable 23.
  • the loss function regarding the correct image 22 and the estimated image 24 is the Euclidean norm of the difference between the pixel values of each image, the Euclidean norm calculated for each element of the feature map converted based on each image, or the like.
  • step S507 the updating unit 212d determines whether learning has been completed. Completion of learning can be determined based on whether the number of repetitions of weight updating has reached a predetermined number of times, or whether the amount of weight editing at the time of updating is smaller than a predetermined value. If it is determined that weight learning is not completed, the process returns to step S501, and the acquisition unit 212b acquires a new correct image 22. On the other hand, if it is determined that the weight learning is completed, the updating unit 212d ends the learning and stores the weight information in the storage unit 212a.
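  • A minimal sketch of the generator-20 loss in step S506 is shown below, combining the three loss terms described above (the identification label of the classifier 21, the variance of the estimated latent variable 23, and the reconstruction error between the correct image 22 and the estimated image 24); the weighting factors and the assumed tensor shapes are illustrative assumptions, not values given in the text.

```python
# Sketch of the combined encoder loss. The encoder (generator 20) maps an image to
# a (B, 18, 512) extended latent variable; `classifier` is assumed to output one
# logit per latent; `synthesis` is the trained first machine learning model.
import torch
import torch.nn.functional as F

def encoder_loss(encoder, classifier, synthesis, correct_image,
                 w_adv=1.0, w_var=0.1, w_rec=1.0):
    est_latent = encoder(correct_image)                 # estimated latent variable 23
    estimated_image = synthesis(est_latent)             # estimated image 24

    adv = F.binary_cross_entropy_with_logits(
        classifier(est_latent), torch.ones(est_latent.size(0), 1))  # identification-label loss
    var = est_latent.var(dim=1).mean()                  # variance across the 18 tensors
    rec = torch.norm(estimated_image - correct_image)   # Euclidean norm of pixel differences
    return w_adv * adv + w_var * var + w_rec * rec
```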
  • the image processing system 300 of this embodiment differs from the first embodiment in that it includes a control device 303 that requests the image processing device 302 regarding image processing for the first image.
  • FIG. 13 is a block diagram of the image processing system 300 in this embodiment.
  • the image processing system 300 includes a learning device 301, an image processing device (image estimation device) 302, and a control device 303.
  • the learning device 301 and the image processing device 302 are servers.
  • the control device 303 is, for example, a user terminal such as a personal computer or a smartphone.
  • the control device 303 is connected to the image processing device 302 via a network 304.
  • the image processing device 302 is connected to the learning device 301 via a network 305. That is, the control device 303 and the image processing device 302 as well as the image processing device 302 and the learning device 301 are configured to be able to communicate with each other.
  • the learning device 301 in the image processing system 300 has the same configuration as the learning device 101, so a description thereof will be omitted.
  • the image processing device 302 differs from the image processing device 102 in that it includes a communication section (receiving means) 302f.
  • the control device 303 includes a communication section (transmission means) 303a, a display section (display means) 303b, an input section (input means) 303c, a processing section (processing means) 303d, and a recording section 303e.
  • the communication unit 303a can transmit a request to the image processing device 302 for causing the image processing device 302 to perform processing on the first image. Additionally, output images processed by the image processing device 302 can be received.
  • the display section 303b displays various information. Various information displayed by the display unit 303b is, for example, an input image to the image processing device 302 and an output image generated by the image processing device 302.
  • the input unit 303c allows the user to input an instruction to start image processing.
  • the processing unit 303d can perform arbitrary image processing on the output image received from the image processing device 302.
  • the recording unit 303e stores the output image received from the image processing device 302.
  • the method for transmitting the first image to be processed to the image processing apparatus 302 does not matter; for example, the first image may be uploaded to the image processing apparatus 302 at the same time as step S601, or may be uploaded to the image processing apparatus 302 before step S601. Further, the first image may be an image stored on a server different from the image processing apparatus 302.
  • FIG. 14 is a flowchart regarding the estimation phase in this embodiment.
  • the operation of the control device 303 will be explained.
  • the image processing in this embodiment is started by a user's instruction to start image processing via the control device 303.
  • step S601 the communication unit 303a transmits a first image processing request to the image processing device 302.
  • the control device 303 may transmit information regarding editing, an ID for authenticating the user, and the like together with a request for processing the first image.
  • the information regarding editing includes the feature value to be edited and the degree of editing of the feature value. For example, when the user generates an image in which the age of the subject in the first image is increased by 10 years, the information regarding editing includes information such as "age", "10 years", and "older" specified by the user.
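  • As an illustration, a request carrying such editing information might look like the following; the endpoint URL and the field names are hypothetical and are not defined by the patent.

```python
# Sketch of the processing request sent in step S601 from the control device 303
# to the image processing device 302.
import json
import urllib.request

request_body = {
    "user_id": "example-user",   # ID for authenticating the user
    "feature": "age",            # feature value to be edited
    "amount": "10 years",        # degree of editing
    "direction": "older",
}
req = urllib.request.Request(
    "http://image-processing-device.example/edit",    # hypothetical endpoint
    data=json.dumps(request_body).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# response = urllib.request.urlopen(req)   # the third image would be received in step S602
```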
  • step S602 the communication unit 303a receives the third image generated by the image processing device 302.
  • step S701 the communication unit 302f receives a request to process the first image transmitted from the communication unit 303a.
  • the image processing device 302 executes the processing from step S702 upon receiving the instruction to process the first image.
  • step S702 the acquisition unit 302b acquires information regarding editing and the first image.
  • the information regarding editing and the first image are transmitted from the control device 303.
  • steps S701 and step S702 may be performed simultaneously.
  • steps S702 to S708 are the same as steps S101 to S107, so their explanation will be omitted.
  • step S709 the communication unit 302f transmits the third image to the control device 303.
  • in this way, in editing an image, the manipulation of the latent variable is divided into multiple rounds, and the step of manipulating the latent variable, the step of generating an image based on the manipulated latent variable, and the step of embedding the generated image into the latent space are performed in order.
  • the control device 303 only requests processing for a specific image, and the actual image processing is performed by the image processing device 302. Therefore, by using the control device 303 as a user terminal, it is possible to reduce the processing load on the user terminal. Therefore, the user can obtain an output image with a low processing load.
  • the present invention can also be realized by supplying a program that implements one or more of the functions of the embodiments described above to a system or device via a network or a storage medium, and having one or more processors in a computer of the system or device read and execute the program. It can also be realized by a circuit (for example, an ASIC) that realizes one or more of the functions.
  • the image processing device according to the present invention may be any device having the image processing function according to the present invention, and may be realized in the form of a PC.
  • according to each embodiment, it is possible to greatly edit the feature values of the original image using the first machine learning model while generating an image with fewer harmful effects.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Image Analysis (AREA)

Abstract

This image processing method includes: a step for acquiring a first latent variable on the basis of a first image; a step for acquiring a second latent variable different from the first latent variable on the basis of the first latent variable; a step for generating a second image by inputting the second latent variable into a first machine learning model; a step for acquiring a third latent variable different from the second latent variable on the basis of the second image; a step for acquiring a fourth latent variable different from the third latent variable on the basis of the third latent variable; and a step for generating a third image by inputting the fourth latent variable into the first machine learning model.

Description

Image processing method, image processing device, and program
 The present invention relates to an image processing method using a machine learning model.
 An image processing method is known that edits arbitrary feature values in an image by manipulating latent variables acquired based on an image to be processed (target image) and inputting the manipulated latent variables into a machine learning model. The machine learning model is generated using a generative adversarial network (GAN).
 Non-Patent Document 1 discloses a method of editing the target image using, among the latent variables corresponding to the target image, the latent variable located where the feature values of the images used for training the machine learning model are most densely distributed.
 However, in the image processing method disclosed in Non-Patent Document 1, editing the feature values to a large extent makes harmful effects (artifacts) likely to occur in the edited image.
 Therefore, the present invention aims to generate images with fewer harmful effects using a machine learning model.
 The image processing method of the present invention includes a step of obtaining a first latent variable based on a first image, and a step of obtaining, based on the first latent variable, a second latent variable different from the first latent variable. It further includes a step of generating a second image by inputting the second latent variable into a first machine learning model, and a step of obtaining, based on the second image, a third latent variable different from the second latent variable. It further includes a step of obtaining, based on the third latent variable, a fourth latent variable different from the third latent variable, and a step of generating a third image by inputting the fourth latent variable into the first machine learning model.
 According to the present invention, an image with fewer harmful effects can be generated using a machine learning model.
 FIG. 1 is a schematic diagram showing a latent space. FIG. 2 is a block diagram of an image processing system in Example 1. FIG. 3 is an external view of the image processing system in Example 1. FIG. 4 is a diagram showing transitions of latent variables in Example 1. FIG. 5 is a flowchart regarding the estimation phase in Example 1. FIG. 6 is a flowchart regarding the learning phase of the first machine learning model. FIG. 7 is a diagram showing the flow of the learning phase of the first machine learning model. FIG. 8 is a flowchart of generation of the first latent variable in Example 1. FIG. 9 is a block diagram of an image processing system in Example 2. FIG. 10 is a diagram showing transitions of latent variables in Example 2. FIG. 11 is a flowchart regarding the learning phase of the second machine learning model. FIG. 12 is a diagram showing the flow of the learning phase of the second machine learning model. FIG. 13 is a block diagram of an image processing system in Example 3. FIG. 14 is a flowchart regarding the estimation phase in Example 3.
 以下、本発明の実施形態について、図面を参照しながら詳細に説明する。各図において、同一の部材については同一の参照番号を付し、重複する説明は省略する。 Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings. In each figure, the same reference numerals are given to the same members, and duplicate explanations will be omitted.
 まず、各実施例の具体的な説明を行う前に、本発明の実施形態の要旨を説明する。本実施形態における画像処理方法では、敵対的生成ネットワーク(GAN:Generative Adversarial Networks)を用いて生成された機械学習モデルを用いて画像における特徴値(属性)を編集することで推定画像を生成する。GANには、多次元のテンソルである潜在変数が入力される。また、潜在変数が存在する空間を潜在空間と称する。また、画像に基づいて潜在変数を取得する処理を潜在空間への埋め込み(反転)と称する。潜在空間において、任意の特徴値の変化に対応する向きが存在し、その向きのベクトルを用いて特徴値が編集された画像に対応する潜在変数を取得(算出)する。ベクトルを用いて特徴値が編集された画像に対応する潜在変数を取得する処理を潜在変数の操作と称する。特徴値は、画像における特徴を示す値(量)であり、例えば人の顔の画像を編集する場合、特徴は例えば表情、顔の向き、年齢、性別、髪型などである。操作された潜在変数を機械学習モデルに入力することで画像を生成することができる。本実施形態における弊害は、アーティファクト(偽構造)が出現することである。アーティファクトは画像の写実性を低減させ知覚的に違和感を与える歪みであり、例えば、編集後の画像に意図せず発生したしみである。 First, before giving a specific explanation of each example, the gist of the embodiment of the present invention will be explained. In the image processing method in this embodiment, an estimated image is generated by editing feature values (attributes) in an image using a machine learning model generated using a generative adversarial network (GAN). A latent variable that is a multidimensional tensor is input to the GAN. Further, the space in which latent variables exist is called a latent space. Furthermore, the process of acquiring latent variables based on an image is referred to as embedding (inversion) in a latent space. In the latent space, there is a direction corresponding to a change in an arbitrary feature value, and a vector of that direction is used to obtain (calculate) a latent variable corresponding to an image whose feature values have been edited. The process of acquiring latent variables corresponding to images whose feature values have been edited using vectors is called latent variable manipulation. A feature value is a value (amount) indicating a feature in an image. For example, when editing an image of a person's face, the feature includes, for example, facial expression, facial orientation, age, gender, hairstyle, and the like. An image can be generated by inputting the manipulated latent variables into a machine learning model. A disadvantage of this embodiment is that artifacts (false structures) appear. Artifacts are distortions that reduce the realism of an image and give a perceptually strange feeling, and are, for example, stains that are unintentionally generated in an edited image.
 本実施形態に係る機械学習モデル(第1の機械学習モデル)は、GANによって生成される。GANは、潜在変数(ノイズ)に基づいて画像を生成する生成器と、生成器が作成した画像(フェイク画像)か正解画像(リアル画像)かを識別する識別器とを有する。また、生成器は、識別器が識別を間違えるような画像を生成するために、識別器における識別の結果に基づいて学習する。また、識別器はリアル画像と生成器が生成したフェイク画像とを識別できるように学習する。例えば機械学習モデルを生成するためにStyleGANやStyleGAN2、StyleGAN3を用いてもよい。 The machine learning model (first machine learning model) according to this embodiment is generated by GAN. A GAN includes a generator that generates an image based on latent variables (noise), and a classifier that identifies whether the image created by the generator is a fake image (fake image) or a correct image (real image). Further, the generator learns based on the classification result of the classifier in order to generate an image that the classifier misidentifies. The classifier also learns to distinguish between real images and fake images generated by the generator. For example, StyleGAN, StyleGAN2, and StyleGAN3 may be used to generate a machine learning model.
 本実施形態におけるGANの生成器は、マッピングネットワーク(Mapping network)と、合成ネットワーク(Synthesis network)とを有する。マッピングネットワークは、初期潜在空間を写像潜在空間に非線形変換することで、初期潜在変数に基づいて写像潜在変数を生成する。初期潜在空間及び写像潜在空間の詳細は後述する。また、合成ネットワークは、生成する画像の解像度に応じた個数に複製された写像潜在変数に基づいて画像を生成する。合成ネットワークを用いて解像度が1024px×1024pxの画像を生成するとき、例えば512次元の写像潜在変数を複製することで18個とし、18個の写像潜在変数を合成ネットワークに入力することで画像を生成する。 The GAN generator in this embodiment includes a mapping network and a synthesis network. The mapping network generates mapped latent variables based on the initial latent variables by nonlinearly transforming the initial latent space into a mapped latent space. Details of the initial latent space and the mapping latent space will be described later. Further, the synthesis network generates an image based on the mapped latent variables that are duplicated in a number corresponding to the resolution of the image to be generated. When generating an image with a resolution of 1024px x 1024px using a synthesis network, for example, by duplicating a 512-dimensional mapping latent variable, it becomes 18, and by inputting the 18 mapping latent variables to the synthesis network, the image is generated. do.
 初期潜在変数は任意の次元数を有すテンソルであるため、例えばガウス分布などに基づいてサンプリングされるテンソルを初期潜在変数として用いることができる。初期潜在変数が存在する空間が初期潜在空間であり、初期潜在空間に含まれる次元と、機械学習モデルの学習に用いた画像の特徴値との間の相関は低い。一方で、マッピングネットワークによって得られた写像潜在空間に含まれる次元と、機械学習モデルの学習した画像の特徴値との間には、高い相関がある。写像潜在空間は、第1の機械学習モデルが学習に用いた画像に基づいて得られた写像潜在変数が存在する領域である。写像潜在空間において複数の写像潜在変数によって定められる重心に近い潜在変数ほど、潜在変数の操作によって大きく特徴値を編集した場合でも編集後の画像に弊害が発生する確率が小さい。 Since the initial latent variable is a tensor with an arbitrary number of dimensions, for example, a tensor sampled based on a Gaussian distribution can be used as the initial latent variable. The space in which the initial latent variables exist is the initial latent space, and the correlation between the dimensions included in the initial latent space and the feature values of the images used for learning the machine learning model is low. On the other hand, there is a high correlation between the dimensions included in the mapping latent space obtained by the mapping network and the feature values of the image learned by the machine learning model. The mapping latent space is an area in which mapping latent variables obtained based on images used for learning by the first machine learning model exist. The closer a latent variable is to the center of gravity defined by a plurality of mapping latent variables in the mapping latent space, the lower the probability that an adverse effect will occur in the edited image even if the feature value is significantly edited by manipulating the latent variables.
 また、潜在変数に基づいて画像を生成する際、中間潜在空間に存在する中間潜在変数又は拡張潜在空間に存在する拡張潜在変数を用いてもよい。写像潜在空間を拡張したものが中間潜在空間であり、中間潜在空間さらに拡張したものが拡張潜在空間である。中間潜在変数及び拡張潜在変数には、複数の写像潜在変数によって定められる重心から離れた潜在変数が含まれる。複数の写像潜在変数によって定められる重心から離れた潜在変数に基づいて生成される画像ほど、弊害が発生する確率が高い。一方で、複数の写像潜在変数によって定められる重心から離れた潜在変数に基づいて生成される画像ほど、第1の機械学習モデルの学習に用いた画像にあまり含まれない特徴値を有する画像を生成することができる。複数の写像潜在変数によって定められる重心から離れた潜在変数を用いることで、学習画像が十分に得られない(第1の機械学習モデルが学習に用いた画像の特徴値が低密度である)場合においても精度よく画像を編集することが可能である。 Furthermore, when generating an image based on latent variables, intermediate latent variables existing in the intermediate latent space or extended latent variables existing in the extended latent space may be used. An extension of the mapping latent space is an intermediate latent space, and an extension of the intermediate latent space is an extended latent space. The intermediate latent variables and extended latent variables include latent variables that are distant from the center of gravity defined by the plurality of mapped latent variables. The farther an image is generated based on a latent variable from the center of gravity defined by a plurality of mapping latent variables, the higher the probability that an adverse effect will occur. On the other hand, the farther an image is generated based on a latent variable from the center of gravity determined by a plurality of mapped latent variables, the more the image has feature values that are less included in the image used for learning the first machine learning model. can do. When a sufficient learning image cannot be obtained by using a latent variable located far from the center of gravity determined by multiple mapping latent variables (the feature values of the image used for learning by the first machine learning model are low density) It is also possible to edit images with high precision.
 具体的には合成ネットワークを用いて解像度が1024px×1024pxの画像を生成するとき、中間潜在変数を用いる場合、例えば512次元の中間潜在変数を複製することで18個とし、18個の中間潜在変数を合成ネットワークに入力することで画像を生成する。ただし、中間潜在変数は写像潜在変数とは異なる潜在変数である。また、拡張潜在変数を用いる場合、例えば18種類の512次元のテンソルに基づいて得られた512×18次元のテンソルである拡張潜在変数を合成ネットワークに入力する。 Specifically, when using intermediate latent variables to generate an image with a resolution of 1024px x 1024px using a synthesis network, for example, by duplicating a 512-dimensional intermediate latent variable to 18, An image is generated by inputting it into a synthesis network. However, the intermediate latent variable is a latent variable different from the mapping latent variable. Further, when using an extended latent variable, the extended latent variable, which is a 512×18-dimensional tensor obtained based on, for example, 18 types of 512-dimensional tensors, is input to the synthesis network.
 The latent space will now be described with reference to FIG. 1. FIG. 1 is a graph schematically showing a part of the latent space. Of the 18 tensors (the extended latent variable), any two tensors having the same dimensionality as the mapped and intermediate latent variables are taken as the first and second extended latent variables. In FIG. 1, X is the first extended latent variable and Y is the second extended latent variable. The black dots on the line Y = X indicate mapped latent variables, and the region on the line Y = X in which mapped latent variables exist is the mapped latent space. The contour lines indicate the density of the feature values of the images learned by the first machine learning model; darker regions indicate higher density. Accordingly, the more densely the training-image feature values are distributed at the position of the latent variable used to generate an image, the lower the probability that artifacts appear in the edited image even when the feature values are edited substantially. In other words, by generating an image from a latent variable located near the centroid defined by the plurality of mapped latent variables, an image with fewer artifacts can be obtained. Furthermore, the intermediate latent space is the line Y = X shown in FIG. 1.
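 As a minimal sketch of the quantities referred to above, the centroid of the mapped latent space and the distance of a candidate latent variable from it can be computed as follows; the random array stands in for actual mapped latent variables, which are an assumption here.

    import numpy as np

    mapped_latents = np.random.randn(10000, 512)   # placeholder for real mapped latent variables
    centroid = mapped_latents.mean(axis=0)         # centroid of the mapped latent space

    def distance_to_centroid(latent: np.ndarray) -> float:
        """Euclidean distance of a 512-dimensional latent variable from the centroid."""
        return float(np.linalg.norm(latent - centroid))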
 In this embodiment, a latent variable (first latent variable) is obtained based on an original image (first image), a latent variable (second latent variable) is obtained by manipulating the first latent variable, and an image (second image) is generated by inputting the second latent variable into the machine learning model. Furthermore, a latent variable (third latent variable) is obtained based on the second image, a latent variable (fourth latent variable) is obtained by manipulating the third latent variable, and an image (third image) is generated by inputting the fourth latent variable into the machine learning model. At this point, the third latent variable is located closer to the centroid defined by the plurality of mapped latent variables than the second latent variable.
 In this way, the manipulation of latent variables during image editing is divided into a plurality of stages, and the step of manipulating a latent variable, the step of generating an image from the manipulated latent variable, and the step of embedding the generated image back into the latent space are performed in order, as sketched below. With this configuration, the first machine learning model can be used to edit the feature values of the first image substantially while generating an image with fewer artifacts.
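 The overall flow can be summarized by the following sketch. The helpers invert() and generate() are hypothetical placeholders, not defined by this document: invert() would embed an image near the centroid of the mapped latent space, and generate() would run the first machine learning model; the edit itself is a bounded shift along a feature direction.

    import numpy as np

    def invert(image):                 # placeholder: would run latent-space embedding
        return np.zeros(512, dtype=np.float32)

    def generate(latent):              # placeholder: would run the first machine learning model
        return np.zeros((1024, 1024, 3), dtype=np.float32)

    def edit_image(first_image, direction, step, num_rounds=2):
        image = first_image
        for _ in range(num_rounds):
            latent = invert(image)                 # embed the image into the latent space
            edited = latent + step * direction     # manipulate the latent variable
            image = generate(edited)               # synthesize the edited image
        return image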
 [Example 1]
 The image processing system 100 according to Example 1 will be described with reference to FIGS. 2 and 3. FIG. 2 is a block diagram of the image processing system 100 in this example. FIG. 3 is an external view of the image processing system 100.
 The image processing system 100 includes a learning device 101, an image processing device (image estimation device) 102, a display device 103, a recording medium 104, an output device 105, and a network 106. The learning device 101 and the image processing device 102 can communicate with each other via the network 106.
 The learning device (first learning device) 101 includes a storage unit 101a, an acquisition unit 101b, a generation unit 101c, and an update unit 101d, and determines the weights of the first machine learning model.
 The image processing device 102 includes a storage unit 102a, an acquisition unit 102b, a conversion unit 102c, a generation unit 102d, and an estimation unit 102e, and generates an estimated image (output image) using the first machine learning model. The generation of the estimated image by the image processing device 102 can be implemented by one or more processors (processing means) such as CPUs.
 The output image is output to at least one of the display device 103, the recording medium 104, and the output device 105. The display device 103 is, for example, a liquid crystal display or a projector. Through the display device 103, the user can perform editing work while checking the image being processed. The recording medium 104 is, for example, a semiconductor memory, a hard disk, or a server on a network, and stores the output image. The output device 105 is, for example, a printer. Input devices such as a mouse and a keyboard are not shown.
 Next, the flow of the estimation phase of this example will be described with reference to FIGS. 4 and 5. FIG. 4 is a graph schematically showing a part of the latent space, similar to FIG. 1, and shows the behavior of the latent variables in this example. FIG. 5 is a flowchart of the estimation phase.
 First, in step S101, the acquisition unit 102b acquires an original image (first image). The first image may be an image stored in advance in the storage unit 102a. The first image may be preprocessed if necessary. The preprocessing applied to the first image is a correction related to feature values. For example, when the first image is an image of a person's face, it is an adjustment (correction) of the positions of major facial features such as the eyes, mouth, and nose. Hereinafter, in this example, the age of the face is used as the first feature value.
 In step S102, the conversion unit 102c converts the first image into a first latent variable. In this example, the first latent variable is obtained by inverse analysis using the first machine learning model, although the method is not limited to this. In step S102 of this example, the first latent variable is obtained from the first image by embedding it, through inverse analysis using the first machine learning model described later, at a position where the feature values of the images used to train the first machine learning model are densely distributed. With this configuration, a first latent variable with a low probability of causing artifacts can be obtained. In this example, the first latent variable is an intermediate latent variable.
 In step S103, the generation unit 102d generates a second latent variable based on the first latent variable. The second latent variable is obtained by manipulating the first latent variable. As shown in FIG. 4, the second latent variable obtained in this way is located at a position in the latent space where the feature values of the images used to train the first machine learning model are more sparsely distributed than at the first latent variable. In this example, the second latent variable is an intermediate latent variable, but it is not limited to this. For example, when a plurality of latent variables obtained by replicating an intermediate latent variable are input to the synthesis network, latent variables generated by applying mutually different affine transformations to them (style latent variables) may be used; in that case, a latent variable obtained by shifting the first feature value within the style latent space is computed. An extended latent variable may also be used.
 In this example, the manipulation of the latent variable is calculated based on shifting the latent variable in the direction normal to a separating hyperplane. The separating hyperplane is a plane that divides the intermediate latent space according to labels of the first feature value, and the two regions of the intermediate latent space divided by the separating hyperplane have different characteristics with respect to the first feature value. For example, latent variables with feature values corresponding to "young" are distributed in one region of the intermediate latent space divided by the separating hyperplane, and latent variables with feature values corresponding to "old" are distributed in the other region. The separating hyperplane is estimated using, for example, an SVM (Support Vector Machine). The direction of the normal vector of the separating hyperplane is the direction corresponding to a change in the first feature value. The method of determining the direction corresponding to the first feature value is not limited to this.
 The larger the amount by which the first latent variable is manipulated, the higher the probability that artifacts appear in the image obtained from the manipulated latent variable. Therefore, a threshold may be set, as needed, for the amount by which the first latent variable is manipulated, and the generation unit 102d sets the manipulation amount of the first latent variable to be less than this threshold.
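 A minimal sketch of one way to realize steps S103 and the threshold above is shown here, assuming a set of intermediate latent variables with binary labels for the first feature value (e.g., 0 = "young", 1 = "old"). The labeled arrays, the use of a linear SVM, and the threshold value are illustrative assumptions.

    import numpy as np
    from sklearn.svm import LinearSVC

    latents = np.random.randn(2000, 512)               # placeholder labeled latent variables
    labels = (np.random.rand(2000) > 0.5).astype(int)  # placeholder feature-value labels

    svm = LinearSVC().fit(latents, labels)
    normal = svm.coef_[0]
    direction = normal / np.linalg.norm(normal)        # unit normal of the separating hyperplane

    def manipulate(latent, step, threshold=3.0):
        """Shift a latent variable along the feature direction, capped by the threshold."""
        step = float(np.clip(step, -threshold, threshold))
        return latent + step * direction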
 In step S104, the estimation unit 102e estimates (generates) a second image based on the second latent variable. The second image is estimated from the second latent variable using the first machine learning model. The weight information of the first machine learning model has been learned by the learning device 101 and is stored in the storage unit 102a.
 In step S105, the conversion unit 102c generates a third latent variable based on the second image. The third latent variable can be obtained by converting the second image in the same manner as in step S102. The third latent variable is a latent variable different from the second latent variable, and is located in the latent space closer to the centroid defined by the plurality of mapped latent variables than the second latent variable. The third latent variable may also be set at a position where the feature values of the images used to train the first machine learning model are more densely distributed than at the second latent variable. In this example, the third latent variable is an intermediate latent variable.
 In step S106, the generation unit 102d generates a fourth latent variable based on the third latent variable. The fourth latent variable can be obtained in the same manner as in step S103. In step S106, the fourth latent variable may also be obtained by editing a second feature value different from the first feature value. As shown in FIG. 4, the fourth latent variable obtained in this way is located at a position in the extended latent space where the feature values of the images used to train the first machine learning model are more sparsely distributed than at the third latent variable. In this example, the fourth latent variable is an intermediate latent variable.
 In step S107, the estimation unit 102e estimates (generates) a third image based on the fourth latent variable. The third image can be obtained in the same manner as in step S104. The third image is an image in which the first feature value (or the second feature value) has been edited relative to the second image. If necessary, the third image may be treated as a new second image, and steps S104 to S107 may be repeated one or more times to generate a new third image.
 In this way, the manipulation of latent variables during image editing is divided into a plurality of stages, and the step of manipulating a latent variable, the step of generating an image from the manipulated latent variable, and the step of embedding the generated image back into the latent space are performed in order. With this configuration, the first machine learning model can be used to edit the feature values of the first image substantially while generating an image with fewer artifacts. Furthermore, by using intermediate latent variables for each latent variable, the image processing method can be performed with latent variables that have a lower probability of causing artifacts than when extended latent variables are used. In addition, an image processing method can be provided that enables accurate editing even of feature values that are sparsely distributed in the mapped latent space.
 Next, the learning phase of the first machine learning model (the method of producing a trained model) will be described with reference to FIGS. 6 and 7. FIG. 6 is a flowchart of updating (learning) the weights of the first machine learning model. Each step in FIG. 6 is mainly performed by the acquisition unit 101b, the generation unit 101c, or the update unit 101d. FIG. 7 is a diagram showing the configuration of the GAN in this example. The GAN in this example includes a generator 10 that generates images and a discriminator 11 that discriminates the generated images.
 First, in step S201, the acquisition unit 101b acquires correct images 12 from the storage unit 101a. The correct images 12 are a plurality of images, and may be captured images acquired by an image pickup apparatus or CG (Computer Graphics) images. The correct images 12 preferably include a plurality of images in which the feature values change in stages. For example, when age is to be edited as a feature value of a person's face, it is preferable that the correct images 12 include images of faces at school age or boyhood, between infancy and adolescence. With this configuration, a machine learning model that can edit the feature values of images with high accuracy can be generated. Because the correct images 12 are real images for the discriminator 11, they have a correct label corresponding to "real". The correct images 12 may be preprocessed if necessary. The preprocessing applied to the correct images 12 is an adjustment related to feature values. For example, when a correct image 12 is an image of a person's face, it is an adjustment (correction) of the positions of major facial features such as the eyes, mouth, and nose.
 In step S202, the generation unit 101c generates a training latent variable 13. In this example, the training latent variable 13 is a 512-dimensional tensor; for example, an arbitrary tensor sampled from a Gaussian distribution may be used as the training latent variable 13.
 In step S203, the generation unit 101c inputs the training latent variable 13 into the generator 10 to generate an estimated image 14. Because the estimated image 14 is a fake image for the discriminator 11, it has a correct label corresponding to "fake".
 Steps S202 and S203 may also be executed without performing step S201. In this case, the first machine learning model can be trained without using the correct images 12. Alternatively, step S201 and the pair of steps S202 and S203 may each be performed at random, one or the other at a time. Furthermore, the number of times step S201 is executed and the number of times steps S202 and S203 are executed need not be the same.
 In step S204, the update unit 101d updates the weights of the discriminator 11. The discriminator 11 receives a correct image 12 or an estimated image 14 and generates a discrimination label. The update unit 101d updates the weights of the discriminator 11 based on the error between the discrimination label and the correct label. If necessary, the weights of the discriminator 11 may be updated by inputting into the discriminator 11 images obtained by applying geometric transformations such as flipping, translation, or rotation to the estimated images 14 and the correct images 12. With this configuration, the discrimination accuracy of the discriminator 11 can be improved even when the number of correct images 12 is small or their characteristics are biased.
 In step S205, the update unit 101d updates the weights of the generator 10 based on the discrimination label. In this example, the weights are updated using the sigmoid cross entropy of the discrimination label, although the method is not limited to this.
 In step S206, the update unit 101d determines whether learning is complete. Completion of learning can be determined by, for example, whether the number of weight-update iterations has reached a predetermined number, the amount of change in the weights at the time of updating, or whether the quality of the estimated images 14 is higher than a predetermined quality. Whether the quality of the estimated images 14 is higher than the predetermined quality can be computed using a metric such as the Frechet Inception Distance, which measures the distance between the distributions of the estimated images 14 and the correct images 12. If it is determined that weight learning is not complete, the process returns to step S201 and the acquisition unit 101b acquires new correct images 12. If it is determined that weight learning is complete, the update unit 101d ends the learning and stores the weight information in the storage unit 101a.
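 A minimal sketch of one adversarial update (corresponding to steps S202 to S205) is shown below. The linear modules stand in for the real generator 10 and discriminator 11, which are not reproduced here; the input batch is assumed to be flattened, and BCEWithLogitsLoss corresponds to the sigmoid cross entropy mentioned above.

    import torch
    import torch.nn as nn

    G = nn.Sequential(nn.Linear(512, 3 * 32 * 32))   # placeholder generator 10
    D = nn.Sequential(nn.Linear(3 * 32 * 32, 1))     # placeholder discriminator 11
    opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
    opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
    bce = nn.BCEWithLogitsLoss()                      # sigmoid cross entropy

    def train_step(real_images):
        z = torch.randn(real_images.size(0), 512)     # training latent variable 13
        fake_images = G(z)                            # estimated image 14

        # Step S204: update the discriminator with "real"/"fake" correct labels.
        d_loss = bce(D(real_images), torch.ones(real_images.size(0), 1)) + \
                 bce(D(fake_images.detach()), torch.zeros(real_images.size(0), 1))
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

        # Step S205: update the generator so its outputs are judged "real".
        g_loss = bce(D(fake_images), torch.ones(real_images.size(0), 1))
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()
        return d_loss.item(), g_loss.item()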
 Next, the conversion of the first image into the first latent variable in step S102 will be described with reference to FIG. 8. FIG. 8 is a flowchart of the generation of the first latent variable. In this example, the conversion from the first image to the first latent variable is performed by inverse analysis using the first machine learning model. Each step in FIG. 8 is mainly performed by the conversion unit 102c or the generation unit 102d.
 In the inverse analysis using the first machine learning model in this example, the image generated by inputting an arbitrary latent variable (first input latent variable) into the first machine learning model is compared with the first image, and the first input latent variable is updated (optimized) using the error. The weight information of the first machine learning model is read in advance from the storage unit 101a and stored in the storage unit 102a. By repeating the input of the first input latent variable into the first machine learning model and the updating of the first input latent variable, a second input latent variable is generated from which an image similar to the first image can be obtained using the first machine learning model. Whether an image is similar to the first image can be determined, for example, from the differences in pixel values. In this example, among the plurality of generated second input latent variables, the second input latent variable closest to the centroid defined by the plurality of mapped latent variables is taken as the first latent variable. By setting the second input latent variable as the first latent variable in the manner described above, a first latent variable with a low probability of generating false structures can be obtained. Alternatively, a second input latent variable located at a position where the feature values of the images used to train the first machine learning model are more densely distributed may be set as the first latent variable.
 In step S301, a first input latent variable is set. Any latent variable in the latent space can be used as the first input latent variable. For example, an arbitrary tensor sampled from a Gaussian distribution may be used as the first input latent variable.
 In step S302, an estimated image 14 is generated based on the first input latent variable using the first machine learning model. The first machine learning model has been trained in advance, and the weight information is stored in the storage unit 102a.
 In step S303, the first input latent variable is updated based on the error between the first image and the estimated image 14. The loss function used here is the Euclidean norm of the difference between the pixel values of the first image and the embedded image, or the Euclidean norm computed for each element of feature maps converted from the first image and the embedded image. However, the loss function is not limited to these. The update may be performed using, for example, backpropagation.
 In step S304, it is determined whether the update of the first input latent variable is complete. Completion of the update can be determined by, for example, whether the number of update iterations of the first input latent variable has reached a predetermined number, whether the amount of change in the first input latent variable at the time of updating is smaller than a predetermined value, or whether the value of the loss function computed in step S303 is smaller than a predetermined value. If it is determined that the update of the first input latent variable is not complete, the process returns to step S302, and the updated first input latent variable is applied to the first machine learning model to generate a new embedded image. If it is determined that the update of the first input latent variable is complete, the updated first input latent variable is taken as a second input latent variable and the process proceeds to step S305.
 In step S305, it is determined whether the generation of second input latent variables is complete. Completion of generation can be determined by, for example, whether the number of generated second input latent variables has reached a predetermined number. If it is determined that the generation of second input latent variables is not complete, the process returns to step S301 and a first input latent variable is set; at this time, a latent variable that has not been set in any previous iteration of step S301 is set as the first input latent variable. If it is determined that the generation of second input latent variables is complete, the generation of second input latent variables ends and the process proceeds to step S306.
 In step S306, the first latent variable is generated. The distances from the position of the centroid defined by the plurality of mapped latent variables to the plurality of second input latent variables are computed, and the second input latent variable with the smallest distance is stored in the storage unit 102a as the first latent variable. The distance from the position of the centroid defined by the plurality of mapped latent variables can be computed using, for example, the Euclidean norm.
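 A minimal sketch of the inversion loop of FIG. 8 is given below, assuming a hypothetical differentiable generator G(w) and a target image of matching shape. The pixel-wise Euclidean norm stands in for the loss of step S303, and the candidate closest to the centroid is selected as in step S306; the iteration counts and learning rate are illustrative assumptions.

    import torch

    def invert(G, target, centroid, num_candidates=4, num_iters=200, lr=0.01):
        candidates = []
        for _ in range(num_candidates):                    # steps S301 to S305
            w = torch.randn(512, requires_grad=True)       # first input latent variable
            opt = torch.optim.Adam([w], lr=lr)
            for _ in range(num_iters):                     # steps S302 to S304
                loss = torch.norm(G(w) - target)           # pixel-value Euclidean norm
                opt.zero_grad(); loss.backward(); opt.step()
            candidates.append(w.detach())                  # second input latent variable
        # Step S306: pick the candidate closest to the centroid of the mapped latents.
        return min(candidates, key=lambda w: torch.norm(w - centroid).item())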
 [Example 2]
 Next, the image processing system 200 according to Example 2 will be described with reference to FIG. 9. FIG. 9 is a block diagram of the image processing system 200 in this example. The image processing system 200 differs from Example 1 in that the learning device 201 includes a first learning unit (first learning means) 211 and a second learning unit (second learning means) 212.
 The image processing system 200 includes a learning device 201, an image processing device (image estimation device) 202, a display device 203, a recording medium 204, an output device 205, and a network 206. The learning device 201 and the image processing device 202 can communicate with each other via the network 206.
 The learning device 201 includes the first learning unit 211 and the second learning unit 212. The first learning unit 211 includes a storage unit 211a, an acquisition unit 211b, a generation unit 211c, and an update unit 211d, and generates the first machine learning model. The first learning unit 211 corresponds to the learning device 101 in Example 1. The second learning unit 212 includes a storage unit 212a, an acquisition unit 212b, a generation unit 212c, and an update unit 212d, and determines the weights of a second machine learning model. The second machine learning model can obtain a latent variable based on an image. The functions of the learning device 201 can be implemented by one or more processors (learning means) such as CPUs. The learning device 201 may be a server. The first and second learning units may also be separate devices.
 The image processing device 202 is the same as the image processing device 102 in Example 1, so its description is omitted. The output image is output to at least one of the display device 203, the recording medium 204, and the output device 205. The display device 203, the recording medium 204, and the output device 205 are the same as the display device 103, the recording medium 104, and the output device 105 in Example 1.
 Next, the flow of the estimation phase of this example will be described with reference to FIGS. 5 and 10. FIG. 10 is a graph schematically showing a part of the latent space, similar to FIG. 1, and shows the behavior of the latent variables in this example. The first machine learning model in this example is trained in the same manner as in Example 1, and the weight information is stored in the storage unit 102a.
 This example differs from Example 1 in that, in step S102, the conversion unit 102c converts the first image into the first latent variable using a second machine learning model described later. Also, in this example, the third latent variable is a latent variable different from the second latent variable, and is located in the latent space closer to the centroid defined by the plurality of mapped latent variables than the second latent variable. The third latent variable may also be set at a position where the feature values of the images used to train the first machine learning model are more densely distributed than at the second latent variable.
 As shown in FIG. 10, the first to fourth latent variables in this example are extended latent variables. The first latent variable is located at a position where the feature values of the images used to train the first machine learning model are more densely distributed than at the second latent variable. Similarly, the third latent variable is located at a position where those feature values are more densely distributed than at the fourth latent variable.
 In this way, the manipulation of latent variables during image editing is divided into a plurality of stages, and the step of manipulating a latent variable, the step of generating an image from the manipulated latent variable, and the step of embedding the generated image back into the latent space are performed in order. With this configuration, the first machine learning model can be used to edit the feature values of the first image substantially while generating an image with fewer artifacts. Furthermore, by using extended latent variables for each latent variable, images having feature values different from those of the images used to train the first machine learning model can be generated with higher accuracy than when intermediate latent variables are used.
 Next, the learning of the weights of the second machine learning model will be described with reference to FIGS. 11 and 12. Each step in FIG. 11 is mainly performed by the acquisition unit 212b, the generation unit 212c, or the update unit 212d. FIG. 11 is a flowchart of learning the weights of the second machine learning model. FIG. 12 is a diagram showing the learning flow of the second machine learning model. In this example, a GAN having a generator 20 and a discriminator 21 is used to generate the second machine learning model, which generates latent variables based on images.
 The second machine learning model in this example converts an image into a latent variable near the centroid defined by the plurality of mapped latent variables. A second input latent variable located at a position where the feature values of the images used to train the first machine learning model are more densely distributed may also be set as the first latent variable. The second machine learning model may also be trained so that, when generating (estimating) a latent variable, the variance of the plurality of tensors including the first and second extended latent variables becomes small. If necessary, training may be performed so that the latent variable estimated by the second machine learning model is judged to be a mapped latent variable by the GAN discriminator. By performing training as described above, a second machine learning model capable of estimating latent variables with a low probability of generating false structures can be generated.
 First, in step S501, the acquisition unit 212b acquires correct images 22. The correct images 22 are a plurality of images, and may be captured images acquired by an image pickup apparatus or CG (Computer Graphics) images. The correct images 22 may also include the images used to train the first machine learning model (correct images 12) or images generated by the first machine learning model (estimated images 14).
 In step S502, the generation unit 212c generates (estimates) an estimated latent variable 23. The generation unit 212c generates the estimated latent variable 23 by inputting a correct image 22 into the generator 20. The estimated latent variable 23 has a correct label corresponding to "fake" for the discriminator 21.
 In step S503, the generation unit 212c acquires a correct latent variable 25. In this example, a 512-dimensional tensor (corresponding to an initial latent variable) is input into the mapping network of the first machine learning model to generate the correct latent variable 25 (corresponding to a mapped latent variable). The correct latent variable 25 has a correct label corresponding to "real" for the discriminator 21. The pair of steps S501 and S502 and step S503 may each be performed at random, one or the other at a time. Furthermore, the number of times steps S501 and S502 are executed and the number of times step S503 is executed need not be the same.
 In step S504, the update unit 212d updates the weights of the discriminator 21. The discriminator 21 receives the estimated latent variable 23 or the correct latent variable 25 and generates a discrimination label. The discriminator 21 is updated based on the error between the discrimination label and the correct label.
 In step S505, the generation unit 212c generates an estimated image 24. The estimated image 24 is generated by inputting the estimated latent variable 23 into the first machine learning model. The first machine learning model has been trained in advance, and the weight information is stored in the storage unit 212a.
 In step S506, the update unit 212d updates the weights of the generator 20. In this example, the loss functions that can be used include, for example, a loss function related to the discrimination label of the discriminator 21, a loss function related to the estimated latent variable 23, and a loss function related to the correct image 22 and the estimated image 24. The loss function related to the discrimination label of the discriminator 21 updates the weights based on the sigmoid cross entropy of the discrimination label. The loss function related to the estimated latent variable 23 is the variance of the plurality of tensors having a specific dimensionality included in the estimated latent variable 23. The loss function related to the correct image 22 and the estimated image 24 is, for example, the Euclidean norm of the difference between the pixel values of the images, or the Euclidean norm computed for each element of feature maps converted from the images.
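 A minimal sketch of how these three loss terms might be combined is shown below. The modules encoder (generator 20), latent_disc (discriminator 21), and the frozen first machine learning model g1 are hypothetical stand-ins with the interfaces indicated in the comments, and the weighting factors are illustrative assumptions.

    import torch
    import torch.nn as nn

    bce = nn.BCEWithLogitsLoss()   # sigmoid cross entropy for the discrimination label

    def generator20_loss(encoder, latent_disc, g1, correct_image):
        w_plus = encoder(correct_image)            # estimated latent variable 23, shape (B, 18, 512)

        # Adversarial term: the estimated latent should be judged a mapped latent ("real").
        adv = bce(latent_disc(w_plus), torch.ones(w_plus.size(0), 1))

        # Variance term: keep the 18 per-layer tensors of each latent close together.
        var = w_plus.var(dim=1).mean()

        # Reconstruction term: pixel-wise Euclidean norm between images 22 and 24.
        rec = torch.norm(g1(w_plus) - correct_image)

        return adv + 0.1 * var + 1.0 * rec         # illustrative weighting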
 In step S507, the update unit 212d determines whether learning is complete. Completion of learning can be determined by, for example, whether the number of weight-update iterations has reached a predetermined number or whether the amount of change in the weights at the time of updating is smaller than a predetermined value. If it is determined that weight learning is not complete, the process returns to step S501 and the acquisition unit 212b acquires new correct images 22. If it is determined that weight learning is complete, the update unit 212d ends the learning and stores the weight information in the storage unit 212a.
 [Example 3]
 Next, the image processing system 300 according to Example 3 will be described with reference to FIGS. 13 and 14. The image processing system 300 of this example differs from Example 1 in that it includes a control device 303 that issues, to the image processing device 302, a request regarding image processing of the first image.
 FIG. 13 is a block diagram of the image processing system 300 in this example. The image processing system 300 includes a learning device 301, an image processing device (image estimation device) 302, and a control device 303. In this example, the learning device 301 and the image processing device 302 are servers. The control device 303 is a user terminal such as a personal computer or a smartphone. The control device 303 is connected to the image processing device 302 via a network 304, and the image processing device 302 is connected to the learning device 301 via a network 305. That is, the control device 303 and the image processing device 302, as well as the image processing device 302 and the learning device 301, are configured to be able to communicate with each other.
 The learning device 301 in the image processing system 300 has the same configuration as the learning device 101, so its description is omitted.
 The image processing device 302 differs from the image processing device 102 in that it includes a communication unit (receiving means) 302f.
 The control device 303 includes a communication unit (transmitting means) 303a, a display unit (display means) 303b, an input unit (input means) 303c, a processing unit (processing means) 303d, and a recording unit 303e. The communication unit 303a can transmit to the image processing device 302 a request for causing the image processing device 302 to execute processing on the first image, and can receive the output image processed by the image processing device 302. The display unit 303b displays various kinds of information, such as the input image to the image processing device 302 and the output image generated by the image processing device 302. The input unit 303c accepts inputs from the user, such as an instruction to start image processing. The processing unit 303d can apply arbitrary image processing to the output image received from the image processing device 302. The recording unit 303e stores the output image received from the image processing device 302.
 The method of transmitting the first image, which is the processing target, to the image processing device 302 is not limited; for example, the first image may be uploaded to the image processing device 302 at the same time as step S601 or before step S601. The first image may also be an image stored on a server different from the image processing device 302.
 Next, the generation of the output image (third image) in this example will be described. FIG. 14 is a flowchart of the estimation phase in this example.
 The operation of the control device 303 will now be described. In this example, image processing is started when the user issues an instruction to start image processing via the control device 303.
 In step S601 (first transmission step), the communication unit 303a transmits a request for processing of the first image to the image processing device 302. In step S601, the control device 303 may also transmit, together with the request for processing of the first image, information about the editing, an ID for authenticating the user, and the like. The information about the editing includes the feature value to be edited and the degree to which that feature value is to be edited. For example, when the user wants to generate an image in which the age of the subject of the first image is increased by ten years, the information about the editing includes information specified by the user such as "age", "10 years", and "older".
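 As a minimal sketch, the editing information sent in step S601 might be expressed as follows. The field names and values below are purely illustrative assumptions; the document does not define a concrete request format.

    # Hypothetical editing-information payload sent with the processing request.
    edit_request = {
        "user_id": "user-0001",          # ID used to authenticate the user
        "image_id": "first_image.png",   # first image, uploaded separately
        "edit": {
            "feature": "age",            # feature value to be edited
            "amount": 10,                # degree of editing (e.g., +10 years)
            "direction": "older",
        },
    }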
 In step S602 (first reception step), the communication unit 303a receives the third image generated by the image processing device 302.
 Next, the operation of the image processing device 302 will be described.
 First, in step S701 (second reception step), the communication unit 302f receives the request for processing of the first image transmitted from the communication unit 303a. Upon receiving the instruction for processing of the first image, the image processing device 302 executes the processing from step S702 onward.
 In step S702, the acquisition unit 302b acquires the information about the editing and the first image. In this example, the information about the editing and the first image are transmitted from the control device 303. The processing of steps S701 and S702 may be performed simultaneously. Steps S702 to S708 are the same as steps S101 to S107, so their description is omitted.
 In step S709 (second transmission step), the communication unit 302f transmits the third image to the control device 303.
 In this way, the manipulation of latent variables during image editing is divided into a plurality of stages, and the step of manipulating a latent variable, the step of generating an image from the manipulated latent variable, and the step of embedding the generated image back into the latent space are performed in order. With this configuration, the first machine learning model can be used to edit the feature values of the first image substantially while generating an image with fewer artifacts. Furthermore, in this example, the control device 303 only requests processing of a specific image, and the actual image processing is performed by the image processing device 302. Therefore, if the control device 303 is a user terminal, the processing load on the user terminal can be reduced, and the user can obtain the output image with a low processing load.
 (Other embodiments)
 The present invention can also be realized by a process in which a program that implements one or more functions of the above-described examples is supplied to a system or an apparatus via a network or a storage medium, and one or more processors in a computer of the system or apparatus read and execute the program. It can also be realized by a circuit (for example, an ASIC) that implements one or more functions. The image processing device of the present invention may be any device having the image processing function of the present invention, and may be realized in the form of a PC.
 According to each example, the first machine learning model can be used to edit the feature values of an original image substantially while generating an image with fewer artifacts.
 Although preferred embodiments and examples of the present invention have been described above, the present invention is not limited to these embodiments and examples, and various combinations, modifications, and changes are possible within the scope of the gist of the invention. Accordingly, the following claims are appended to make public the scope of the present invention.
 This application claims priority based on Japanese Patent Application No. 2022-135296 filed on August 26, 2022, the entire contents of which are incorporated herein by reference.

Claims (15)

  1.  An image processing method comprising the steps of:
     obtaining a first latent variable based on a first image;
     obtaining a second latent variable different from the first latent variable based on the first latent variable;
     generating a second image by inputting the second latent variable into a first machine learning model;
     obtaining a third latent variable different from the second latent variable based on the second image;
     obtaining a fourth latent variable different from the third latent variable based on the third latent variable; and
     generating a third image by inputting the fourth latent variable into the first machine learning model.
  2.  The image processing method according to claim 1, wherein at least one of the first latent variable and the third latent variable is obtained by inverse analysis using the first machine learning model.
  3.  The image processing method according to claim 1, wherein at least one of the first latent variable and the third latent variable is obtained using a second machine learning model.
  4.  The image processing method according to claim 3, wherein the second machine learning model generates a latent variable based on an input image.
  5.  The image processing method according to any one of claims 1 to 4, wherein the subject of the first image is a human face.
  6.  The image processing method according to any one of claims 1 to 5, wherein the step of generating a new third image is repeated one or more times by using the third image as a new second image.
  7.  The image processing method according to any one of claims 1 to 6, wherein, in a latent space that is a distribution of a plurality of latent variables obtained based on the images used for training the first machine learning model, the third latent variable is located closer than the second latent variable to a centroid defined by the plurality of latent variables.
  8.  The image processing method according to any one of claims 1 to 7, wherein the second latent variable is located at a position where feature values of the images used for training the first machine learning model are more sparsely distributed than at the first latent variable.
  9.  前記第4の潜在変数は、前記第3の潜在変数よりも前記第1の機械学習モデルの学習に用いられた画像の特徴値が低密度に分布する位置に存在することを特徴とする請求項1乃至8の何れか一項に記載の画像処理方法。 The fourth latent variable is located at a position where the feature values of the image used for learning the first machine learning model are distributed at a lower density than the third latent variable. 9. The image processing method according to any one of 1 to 8.
  10.  A program that causes a computer to execute the image processing method according to any one of claims 1 to 9.
  11.  A storage medium storing the program according to claim 10.
  12.  An image processing device comprising processing means capable of executing the image processing method according to any one of claims 1 to 9.
  13.  An image processing system comprising the image processing device according to claim 12 and a control device capable of communicating with the image processing device,
     wherein the control device includes means for transmitting, to the image processing device, a request relating to execution of processing on the first image.
  14.  A method of manufacturing a trained model, comprising:
     obtaining a training latent variable;
     generating, in a generator, an estimated image based on the training latent variable;
     identifying, in a discriminator, whether an input image is the estimated image; and
     training the generator based on a result of the identification by the discriminator.
  15.  A learning device comprising first learning means capable of executing the method of manufacturing a trained model according to claim 14.
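
By way of illustration, the inverse analysis of claim 2 and the encoder of claims 3 and 4 can both be sketched in a few lines. The following is a minimal sketch, assuming a pre-trained, differentiable PyTorch generator `generator` (standing in for the first machine learning model) that maps a latent tensor to an image tensor, a target image tensor `target`, and a starting latent `w_init`; the names and the pixel-wise loss are illustrative choices, not ones prescribed by the disclosure.

```python
import torch
import torch.nn.functional as F

def invert_image(generator, target, w_init, steps=500, lr=0.05):
    """Obtain a latent variable for `target` by optimization-based inverse
    analysis: start from `w_init` and minimize a pixel-wise reconstruction
    loss between the generated image and the target image."""
    w = w_init.clone().detach().requires_grad_(True)
    optimizer = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        reconstruction = generator(w)        # the first machine learning model
        loss = F.mse_loss(reconstruction, target)
        loss.backward()
        optimizer.step()
    return w.detach()

# Encoder variant (claims 3 and 4): a second machine learning model that maps
# an image directly to a latent variable, e.g.  w = encoder(target)
```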
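Claims 1 and 6 describe an edit-and-reproject cycle: a latent is edited, an image is generated from the edited latent, and a fresh latent is obtained from that image before the next edit. A minimal sketch under the same assumptions, where `invert` is any image-to-latent routine (for example an encoder, or `invert_image` above with its extra arguments bound) and `direction`/`alpha` are a hypothetical editing direction and strength:

```python
def iterative_edit(generator, invert, first_image, direction, alpha, repeats=0):
    """Edit-and-reproject cycle sketched from claims 1 and 6. Each pass edits
    the current latent, generates an image from the edited latent, and
    obtains a new latent from that image for the next pass."""
    w = invert(first_image)                  # first latent variable
    w_edited = w + alpha * direction         # second latent variable (edited)
    image = generator(w_edited)              # second image
    for _ in range(1 + repeats):             # repeats > 0 corresponds to claim 6
        w = invert(image)                    # third latent variable (re-projection)
        w_edited = w + alpha * direction     # fourth latent variable (edited again)
        image = generator(w_edited)          # third image (treated as a new second image)
    return image
```

Splitting one large edit into several smaller edits, each followed by re-projection, keeps the intermediate latents close to the region covered by the training data, which is consistent with the distance relation stated in claim 7.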
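Claims 7 to 9 characterize these latents geometrically: re-projection moves the latent back toward the centroid of the latents obtained from the training images, while each edit moves it to a lower-density position. A small helper for the centroid comparison, assuming `training_latents` is a hypothetical (N, dim) tensor of latents precomputed from the training images:

```python
import torch

def distance_to_centroid(w, training_latents):
    """Distance from a latent variable to the centroid defined by the latents
    obtained from the training images (the reference point of claim 7).
    `training_latents` is assumed to have shape (N, dim)."""
    centroid = training_latents.mean(dim=0)
    return torch.norm(w - centroid)

# Relations described by the claims (illustrative, not enforced by this code):
#   claim 7  : distance_to_centroid(w3, training_latents)
#              < distance_to_centroid(w2, training_latents)
#   claims 8-9: each edit (w1 -> w2, w3 -> w4) moves the latent to a position
#              where the training feature values are less densely distributed
```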
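Claim 14 corresponds to an ordinary adversarial training procedure: sample a training latent variable, generate an estimated image with the generator, let the discriminator judge whether its input is an estimated image, and update the generator from that judgment. A minimal sketch with placeholder networks, assuming the discriminator returns one logit per image; the non-saturating BCE loss shown here is one common choice and not necessarily the loss used in the disclosure:

```python
import torch
import torch.nn.functional as F

def train_step(generator, discriminator, real_images, opt_g, opt_d, latent_dim):
    """One adversarial update in the spirit of claim 14: training latents are
    sampled, estimated images are generated, the discriminator judges real
    versus estimated, and both networks are updated."""
    batch = real_images.size(0)
    ones = torch.ones(batch, 1)
    zeros = torch.zeros(batch, 1)

    z = torch.randn(batch, latent_dim)       # training latent variables
    fake = generator(z)                      # estimated images

    # Discriminator step: identify whether the input is an estimated image.
    d_loss = (F.binary_cross_entropy_with_logits(discriminator(real_images), ones)
              + F.binary_cross_entropy_with_logits(discriminator(fake.detach()), zeros))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: learn from the discriminator's judgment.
    g_loss = F.binary_cross_entropy_with_logits(discriminator(fake), ones)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```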
PCT/JP2023/029203 2022-08-26 2023-08-10 Image processing method, image processing device, and program WO2024043109A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2022135296A JP2024031629A (en) 2022-08-26 2022-08-26 Image processing method, image processing apparatus, and program
JP2022-135296 2022-08-26

Publications (1)

Publication Number Publication Date
WO2024043109A1 true WO2024043109A1 (en) 2024-02-29

Family

ID=90013182

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2023/029203 WO2024043109A1 (en) 2022-08-26 2023-08-10 Image processing method, image processing device, and program

Country Status (2)

Country Link
JP (1) JP2024031629A (en)
WO (1) WO2024043109A1 (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2021082118A (en) * 2019-11-21 2021-05-27 キヤノン株式会社 Learning method, program, learning device, and method for manufacturing learned weight
US20220121876A1 (en) * 2020-10-16 2022-04-21 Adobe Inc. Non-linear latent filter techniques for image editing
CN113297933A (en) * 2021-05-11 2021-08-24 广州虎牙科技有限公司 Image generation method and related device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ALALUF YUVAL; PATASHNIK OR; COHEN-OR DANIEL: "ReStyle: A Residual-Based StyleGAN Encoder via Iterative Refinement", 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), IEEE, 10 October 2021 (2021-10-10), pages 6691 - 6700, XP034093009, DOI: 10.1109/ICCV48922.2021.00664 *
WEI TIANYI, CHEN DONGDONG, ZHOU WENBO, LIAO JING, ZHANG WEIMING, YUAN LU, HUA GANG, YU NENGHAI: "E2Style: Improve the Efficiency and Effectiveness of StyleGAN Inversion", IEEE TRANSACTIONS ON IMAGE PROCESSING, IEEE, USA, vol. 31, 1 January 2022 (2022-01-01), USA, pages 3267 - 3280, XP093143329, ISSN: 1057-7149, DOI: 10.1109/TIP.2022.3167305 *

Also Published As

Publication number Publication date
JP2024031629A (en) 2024-03-07

Similar Documents

Publication Publication Date Title
JP7373554B2 (en) Cross-domain image transformation
WO2021254499A1 (en) Editing model generation method and apparatus, face image editing method and apparatus, device, and medium
CN110084193B (en) Data processing method, apparatus, and medium for face image generation
Suetens et al. Statistically deformable face models for cranio-facial reconstruction
CN114925748B (en) Model training and modal information prediction method, related device, equipment and medium
CN113039816B (en) Information processing device, information processing method, and information processing program
CN111226258A (en) Signal conversion system and signal conversion method
KR20210147507A (en) Image generation system and image generation method using the system
CN111524216A (en) Method and device for generating three-dimensional face data
JP2020181240A (en) Data generation device, data generation method and program
CN117292041B (en) Semantic perception multi-view three-dimensional human body reconstruction method, device and medium
CN110546687A (en) Image processing device and two-dimensional image generation program
WO2024043109A1 (en) Image processing method, image processing device, and program
WO2021171384A1 (en) Clustering device, clustering method, and clustering program
JP7148078B2 (en) Attribute estimation device, attribute estimation method, attribute estimator learning device, and program
JP2019082847A (en) Data estimation device, date estimation method, and program
JP7437918B2 (en) Information processing device, information processing method, and program
CN112837318A (en) Method for generating ultrasound image generation model, method for synthesizing ultrasound image generation model, medium, and terminal
EP4111420A1 (en) Face mesh deformation with detailed wrinkles
US20200175376A1 (en) Learning Method, Learning Device, Program, and Recording Medium
CN113936320B (en) Face image quality evaluation method, electronic device and storage medium
KR102678473B1 (en) Automatic Caricature Generating Method and Apparatus
JP2004062721A (en) Image identifying device
JP2024071870A (en) Learning device, learning method, and program
JP2024040662A (en) Learning data generation device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23857222

Country of ref document: EP

Kind code of ref document: A1