CN115909002A - Image translation method based on contrast learning

Image translation method based on contrast learning

Info

Publication number
CN115909002A
Authority
CN
China
Prior art keywords
image
network
loss
output
contrast
Prior art date
Legal status
Pending
Application number
CN202211232833.1A
Other languages
Chinese (zh)
Inventor
邢志强
董小舒
郭博
辛付豪
余思尧
张典
王杨红
吴欢
Current Assignee
Nanjing Laisi Electronic Equipment Co ltd
Original Assignee
Nanjing Laisi Electronic Equipment Co ltd
Priority date
Filing date
Publication date
Application filed by Nanjing Laisi Electronic Equipment Co ltd filed Critical Nanjing Laisi Electronic Equipment Co ltd
Priority to CN202211232833.1A
Publication of CN115909002A
Legal status: Pending


Landscapes

  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses an image translation method based on contrast learning, which comprises the following steps: inputting an input image into a generator; inputting the image generated by the generator and a real image of the target domain into a discriminator; calculating the generative adversarial network loss; re-inputting the input image and the output image of the generator into the encoder within the generator, feeding the encoding vectors of the input and output images into a mapping network to obtain their feature vectors in the same feature space, and calculating the contrast loss between the feature vectors of the input and output images; optimizing the contrast loss using the focal loss; and back-propagating the adversarial loss and the optimized contrast loss to optimize the network. By training the model with the contrast loss, the invention greatly reduces memory occupation and training time while achieving an image conversion effect with more pronounced detail than unidirectional and bidirectional image conversion.

Description

Image translation method based on contrast learning
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to an image translation method based on contrast learning.
Background
Image translation is a widely applied technology in computer vision; it aims to learn a mapping relation that realizes the conversion from a source-domain image to a target-domain image. Generative adversarial networks, by virtue of the strong expressive power of neural networks, have strong image generation capability and are the mainstream technology for image translation.
Nowadays, with the increasing popularization of the internet, application scenarios based on GAN-based image translation technology are becoming more common, including image colorization, image super-resolution conversion, and image editing. In the field of automatic driving, a high-definition city scene image is converted into a semantic label map, which is then fed into a recognition system for further analysis. In short-video applications, various kinds of video conversion effects need to be added to the video, which requires the technical support of image translation. Meanwhile, artistic-style images generated from real photos also provide creative references for designers. Together these represent the wide application value and huge commercial value of image translation technology.
Image translation often employs a cycle-consistency loss or a predefined content-aware loss to ensure correlation between domains. However, the cycle-consistency loss requires an additional symmetric network, so the model is large and difficult to train; the content-aware loss must be predefined and its measurements are biased, which limits the generation capability of the generator.
Disclosure of Invention
The invention aims to: in order to overcome the defects in the prior art, an image translation method based on contrast learning is provided which can effectively measure the content relevance of an image before and after conversion, solves the problems that image translation models are large and that measured content consistency suffers from large deviations, and achieves a better generation effect than traditional models.
The technical scheme is as follows: in order to achieve the above object, the present invention provides an image translation method based on contrast learning, comprising the steps of:
s1: inputting an input image into a generator, the generator comprising an encoder and a decoder; the encoder is mainly used for encoding the characteristics of an input image into a characteristic vector, and the decoder is mainly used for decoding the characteristic vector into an image in a target domain;
s2: inputting the image generated by the generator and the real image of the target domain into a discriminator to obtain the output of the discriminator, namely generating an image prediction probability and a real image prediction probability;
s3: calculating the generative adversarial network loss according to the generated-image prediction probability and the real-image prediction probability;
s4: re-inputting the input image and the output image of the generator into an encoder in the generator to obtain the encoded output of the input image and the output image, inputting the encoded vectors of the input image and the output image into a mapping network to obtain the feature vectors of the input image and the output image in the same feature space, defining a contrast learning method in image translation according to a definition concept of contrast learning, sampling image blocks on the input image and the output image, dividing positive and negative samples, proposing contrast loss, and calculating the contrast loss between the feature vectors of the input image and the output image;
s5: optimizing the contrast loss by using the focal loss, to alleviate the problem of unevenly sampled positive and negative samples;
s6: back-propagating the adversarial loss and the optimized contrast loss, optimizing the network, and using the optimized network to realize image translation.
Further, the structure of the generator in step S1 is as follows: a U-Net network structure is adopted, comprising an encoder $G_{enc}$ and a decoder $G_{dec}$, wherein the encoder $G_{enc}$ consists of 3 convolutional layers; a conversion module (Transform) is arranged between the encoder and the decoder for image conversion between the two domains; the decoder consists of n upsampling layers; skip connections link the corresponding convolutional layers of the encoder and the decoder, which effectively prevents the feature information of the input image from being lost after convolution and improves the efficiency of information transmission.
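A minimal sketch of such an encoder-decoder generator with skip connections is given below (an illustrative PyTorch sketch, not the patented implementation; the channel widths, kernel sizes, and the contents of the Transform module are assumptions):

```python
import torch
import torch.nn as nn

class SketchGenerator(nn.Module):
    """Illustrative U-Net-style generator: 3-layer convolutional encoder, a conversion
    module between encoder and decoder, and an upsampling decoder with skip connections.
    Layer counts and channel widths are assumed, not taken from the patent."""
    def __init__(self, in_ch=3, out_ch=3, base=64):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, base, 7, 1, 3), nn.InstanceNorm2d(base), nn.ReLU(True))
        self.enc2 = nn.Sequential(nn.Conv2d(base, base * 2, 3, 2, 1), nn.InstanceNorm2d(base * 2), nn.ReLU(True))
        self.enc3 = nn.Sequential(nn.Conv2d(base * 2, base * 4, 3, 2, 1), nn.InstanceNorm2d(base * 4), nn.ReLU(True))
        # "Transform" module between encoder and decoder (here: two plain conv blocks)
        self.transform = nn.Sequential(
            nn.Conv2d(base * 4, base * 4, 3, 1, 1), nn.InstanceNorm2d(base * 4), nn.ReLU(True),
            nn.Conv2d(base * 4, base * 4, 3, 1, 1), nn.InstanceNorm2d(base * 4), nn.ReLU(True),
        )
        self.dec1 = nn.Sequential(nn.ConvTranspose2d(base * 4, base * 2, 3, 2, 1, output_padding=1),
                                  nn.InstanceNorm2d(base * 2), nn.ReLU(True))
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(base * 4, base, 3, 2, 1, output_padding=1),
                                  nn.InstanceNorm2d(base), nn.ReLU(True))
        self.out = nn.Sequential(nn.Conv2d(base * 2, out_ch, 7, 1, 3), nn.Tanh())

    def forward(self, x):
        e1 = self.enc1(x)                        # encoder features kept for skip connections
        e2 = self.enc2(e1)
        e3 = self.enc3(e2)
        t = self.transform(e3)
        d1 = self.dec1(t)
        d2 = self.dec2(torch.cat([d1, e2], 1))   # skip connection from encoder level 2
        return self.out(torch.cat([d2, e1], 1))  # skip connection from encoder level 1
```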
Further, the network structure of the discriminator in step S2 is as follows: a discriminator network structure with an attention module is constructed on the basis of the CycleGAN discriminator. In the improved generative network the expressive capability of the generator is enhanced, so the original discriminator becomes relatively too weak and the quality of the generated images suffers. Therefore, in order to achieve a better balance between the two, the PatchGAN discriminator network in the original CycleGAN is improved: while keeping the receptive-field size of the original network, dense residual blocks and an attention mechanism are added, which significantly strengthens the judgment capability of the discriminator network.
Further, in step S4 of the present invention: first, the relevant definitions are introduced. The optimization objective is to make an image $x$ from the input domain $\mathcal{X}$, after conversion by the model, appear similar to an image from the target domain $\mathcal{Y}$. Given a dataset of unpaired instances $X=\{x\in\mathcal{X}\}$ and $Y=\{y\in\mathcal{Y}\}$, the aim is to learn an image translation model $G$ that realizes the conversion from an X-domain image to a Y-domain image, producing the output $\hat{y}=G(x)$.
The idea of contrast learning is to associate two signals, a query and its positive example, in contrast to other points in the dataset (called negative examples). The query, the positive example and the N negative examples are mapped to K-dimensional vectors $v$, $v^{+}\in\mathbb{R}^{K}$ and $v_{n}^{-}\in\mathbb{R}^{K}$, where $v_{n}^{-}$ denotes the n-th negative example. This sets up an (N+1)-way classification problem in which the scaled dot products between the query and the other examples are used as logits, and the cross-entropy loss expresses the probability that the positive example is selected:
$$\ell\big(v,v^{+},v^{-}\big)=-\log\!\left[\frac{\exp\big(v\cdot v^{+}/\tau\big)}{\exp\big(v\cdot v^{+}/\tau\big)+\sum_{n=1}^{N}\exp\big(v\cdot v_{n}^{-}/\tau\big)}\right]$$
where $v$, $v^{+}$ and $v_{n}^{-}$ denote the query vector, the positive example vector and the negative example vectors respectively, and $\tau$ is a temperature hyper-parameter.
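As a hedged illustration only, this (N+1)-way classification reading of the loss can be written as a cross-entropy over scaled dot products; the temperature tau and the feature normalization are assumptions borrowed from common contrastive-learning practice, not values prescribed here:

```python
import torch
import torch.nn.functional as F

def info_nce(v, v_pos, v_neg, tau=0.07):
    """Contrastive cross-entropy for one query.
    v: (K,) query vector; v_pos: (K,) positive vector; v_neg: (N, K) negatives.
    The positive is treated as class 0 of an (N+1)-way classification problem."""
    v, v_pos = v / v.norm(), v_pos / v_pos.norm()              # assumed: unit-normalized features
    v_neg = F.normalize(v_neg, dim=1)
    logits = torch.cat([v @ v_pos.view(-1, 1),                 # similarity to the positive
                        v_neg @ v], dim=0).view(1, -1) / tau   # similarities to the N negatives
    target = torch.zeros(1, dtype=torch.long)                  # index 0 = positive example
    return F.cross_entropy(logits, target)
```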
The goal is to correlate input and output data. In contrast learning, a query refers to an output image block. Positive and negative examples are corresponding and non-corresponding input image blocks.
Contrast loss is introduced in image translation, and in an unsupervised learning environment, the whole image should share content, and corresponding characteristics should exist between input and output image blocks. Given an image block displaying the output zebra head, it should be possible to associate it more strongly with an image block of the input horse head than with other image blocks of the horse image. Even at the pixel level, the color of the zebra body correlates more closely with the horse body color than with the background tone of grass. Thus for the input and output images the image blocks correspond to each other at the same location, which is a positive example. And image blocks in other positions are non-corresponding negative examples. Meanwhile, the image block corresponds to a certain point on the feature map, and the smaller the feature map is, the larger the size of the image block is. Sampling positive and negative samples of the multi-scale image block, and enabling a learning target to be based on the image block and the multilayer characteristic diagram.
The specific process of the encoding output in step S4 is as follows:
The encoder $G_{enc}$ in the generator G is used to extract high-order semantic information from the image; each spatial position on an intermediate feature map of $G_{enc}$ represents an image block of the input image, with deeper layers corresponding to larger image blocks. Following the SimCLR model (Chen T, Kornblith S, Norouzi M, et al. A simple framework for contrastive learning of visual representations[C]. PMLR, 2020), L intermediate layers are selected and their feature maps are passed through a small two-layer MLP network $H_l$, producing a stack of features
$$\{z_l\}_L=\{H_l(G_{enc}^{\,l}(x))\}_L$$
where $G_{enc}^{\,l}$ denotes the output of the l-th selected layer, $l\in\{1,2,\dots,L\}$ and $s\in\{1,2,\dots,S_l\}$, with $S_l$ the number of spatial positions in each layer; the corresponding feature (positive example) is denoted $z_l^{\,s}\in\mathbb{R}^{C_l}$ and the other, non-corresponding features (negative examples) are denoted $z_l^{\,S\setminus s}\in\mathbb{R}^{(S_l-1)\times C_l}$, where $C_l$ is the number of channels of each layer.
In the same way, the output image $\hat{y}=G(x)$ is encoded as $\{\hat{z}_l\}_L=\{H_l(G_{enc}^{\,l}(\hat{y}))\}_L$.
The calculation of the contrast loss in the step S4 is specifically:
the optimization goal is to match the corresponding input-output image blocks at specific positions; the other Image blocks in the same input were named negative samples and are expressed as NCEIT Loss (NCE Loss for Image transformation), and the contrast Loss is expressed as follows:
$$\mathcal{L}_{NCEIT}(G,H,X)=\mathbb{E}_{x\sim X}\sum_{l=1}^{L}\sum_{s=1}^{S_l}\ell\big(\hat{z}_l^{\,s},\,z_l^{\,s},\,z_l^{\,S\setminus s}\big)$$
where H is the two-layer MLP network, G is the generator, X is the source domain, $S_l$ denotes the number of feature points on a given feature map, corresponding to the number of image blocks, and L denotes the number of intermediate layers.
It is noted that the invention may also utilize image blocks of other images in the dataset as negative examples; encoding a random negative image in a data set x into
$$\{\tilde{z}_l\}_L=\{H_l(G_{enc}^{\,l}(\tilde{x}))\}_L$$
and the following external encoding is used, in which an auxiliary moving-average encoder maintains a large, consistent dictionary of negative samples, as in MoCo (He K, Fan H, Wu Y, et al. Momentum contrast for unsupervised visual representation learning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 9729-9738); negative samples can be drawn from a longer history, which is more efficient than end-to-end updates and memory banks. The contrast loss is expressed as follows:
$$\mathcal{L}_{external}(G,H,X)=\mathbb{E}_{x\sim X,\,\tilde{z}^{-}\sim Z^{-}}\sum_{l=1}^{L}\sum_{s=1}^{S_l}\ell\big(\hat{z}_l^{\,s},\,z_l^{\,s},\,\tilde{z}_l^{-}\big)$$
where the dataset negatives $\tilde{z}^{-}$ are drawn from an external dictionary $Z^{-}$ of features computed on source-domain images, using a moving-average encoder $\hat{G}_{enc}$ and a moving-average MLP $\hat{H}$. For ease of calculation, image blocks on the same input are taken as negative examples.
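A compact sketch of how the patch-wise loss over several selected layers could be computed is shown below (illustrative PyTorch; the feature shapes, the sampling of spatial positions, and the per-layer heads H_l are assumed to be prepared by the caller):

```python
import torch
import torch.nn.functional as F

def nceit_loss(feats_in, feats_out, tau=0.07):
    """feats_in / feats_out: lists over L layers of (S_l, C_l) feature tensors,
    produced by the shared encoder plus per-layer MLP heads H_l for the input image x
    and the output image G(x). Patches at the same index are positives; all other
    patches of the same layer serve as negatives."""
    total = 0.0
    for z, z_hat in zip(feats_in, feats_out):
        z = F.normalize(z, dim=1)
        z_hat = F.normalize(z_hat, dim=1)
        logits = z_hat @ z.t() / tau              # (S_l, S_l): query patches vs all input patches
        target = torch.arange(z.size(0))          # diagonal entries are the corresponding patches
        total = total + F.cross_entropy(logits, target)
    return total
```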
Further, in step S5: the problem of unbalanced sampling usually exists during sampling in contrast learning, so that the capability of a mapping network H for distinguishing positive and negative samples is reduced, and the characteristics of the positive samples are difficult to learn by the H due to a large number of negative samples, and the training of a generator and a discriminator is not facilitated. To alleviate this problem of negative and positive sample maldistribution. The target loss is optimized in the method of the invention by using the focus loss.
Focal Loss (FL) is an improved version of Cross-Entropy Loss (CE) that addresses the class imbalance problem by assigning more weight to the hard-to-classify or easy-to-misclassify instance (i.e., background with noise texture or partial object or object of interest), and down-weights the simple instance (i.e., background object).
The known optimization goal is to match the corresponding input-output image block at a specific location. The other Image blocks in the same input are taken as negative samples and named NCEIT Loss (NCE Loss for Image transformation).
$$\mathcal{L}_{NCEIT}(G,H,X)=\mathbb{E}_{x\sim X}\sum_{l=1}^{L}\sum_{s=1}^{S_l}\ell\big(\hat{z}_l^{\,s},\,z_l^{\,s},\,z_l^{\,S\setminus s}\big)$$
where $S_l$ denotes the number of feature points on a given feature map, corresponding to the number of image blocks, and L denotes the number of intermediate layers. The NCEIT Loss is computed over multiple feature maps: because different feature maps represent different semantic information and correspond to image blocks of different sizes in the input image, computing the noise-contrastive estimation loss over multiple feature maps helps the H network learn more information. The H network maps input and output image blocks into the same embedding space, mapping related image blocks to nearby positions in the feature space and unrelated image blocks to positions that are farther apart. For each output image block, only the input image block at the same position is a relevant (positive) signal, while image blocks at other positions are negative signals; for feature maps containing tens of image blocks, the number of positive samples is therefore far smaller than the number of negative samples, and the gradient information of the large number of negatives drowns out the distinctive gradient information of the positives, so the focal loss is introduced to address this problem. The specific steps are as follows:
Let
$$p=\frac{\exp\big(v\cdot v^{+}/\tau\big)}{\exp\big(v\cdot v^{+}/\tau\big)+\sum_{n=1}^{N}\exp\big(v\cdot v_{n}^{-}/\tau\big)}$$
then
$$\ell_{FL}\big(v,v^{+},v^{-}\big)=-(1-p)^{\gamma}\log(p)$$
where $\gamma$ is the decay rate for the weight of easy samples.
The resulting contrast loss (focal-weighted NCEIT Loss) is:
$$\mathcal{L}_{NCEIT}(G,H,X)=\mathbb{E}_{x\sim X}\sum_{l=1}^{L}\sum_{s=1}^{S_l}\ell_{FL}\big(\hat{z}_l^{\,s},\,z_l^{\,s},\,z_l^{\,S\setminus s}\big)$$
where G denotes the generator, H the mapping network, X the source domain, L the number of intermediate layers and $S_l$ the number of feature points in each layer; $\hat{z}_l^{\,s}$ denotes the feature vector of an output image block after the encoder network and the mapping network, $z_l^{\,s}$ denotes the feature vector of the input image block related to (at the same position as) that output image block after the encoder network and the mapping network, and $z_l^{\,S\setminus s}$ denotes the feature vectors of the input image blocks unrelated to (at different positions from) that output image block after the encoder network and the mapping network.
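One possible realization of this focal re-weighting of the contrast loss, sketched under the assumption that p is the softmax probability assigned to the corresponding (positive) patch; the names and the gamma value are illustrative:

```python
import torch
import torch.nn.functional as F

def focal_nceit_loss(feats_in, feats_out, gamma=2.0, tau=0.07):
    """Focal-weighted variant: easy queries (p close to 1) are down-weighted by
    (1 - p)**gamma so that the few positives are not drowned out by the negatives.
    gamma is the decay rate for easy samples; its value here is an assumption."""
    total = 0.0
    for z, z_hat in zip(feats_in, feats_out):
        z = F.normalize(z, dim=1)
        z_hat = F.normalize(z_hat, dim=1)
        logits = z_hat @ z.t() / tau
        log_p = F.log_softmax(logits, dim=1).diagonal()   # log-probability of the positive patch
        p = log_p.exp()
        total = total + (-(1.0 - p) ** gamma * log_p).mean()
    return total
```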
Further, in step S6:
back Propagation (BP) is a short term for "error back propagation" and is a common method used in conjunction with optimization methods (such as gradient descent) to train artificial neural networks. The method calculates the gradient of the loss function for all weights in the network. This gradient is fed back to the optimization method for updating the weights to minimize the loss function.
Back propagation requires a known target output for each input value in order to compute the gradient of the loss function. It is therefore generally considered a supervised learning approach, although it is also used in some unsupervised networks (e.g., auto-encoders). It is a generalization of the Delta rule to multi-layer feed-forward networks, and the gradient of each layer can be computed iteratively using the chain rule. Back propagation requires that the excitation function of the artificial neuron (or "node") be differentiable.
The back propagation algorithm (BP algorithm) is mainly composed of two phases: excitation propagation and weight updating.
Stage 1: propagation of excitation
The propagation link in each iteration comprises two steps: (forward propagation phase) sending training inputs into the network to obtain an excitation response; and (in a back propagation stage), the difference between the excitation response and the target output corresponding to the training input is obtained, so that the response errors of the hidden layer and the output layer are obtained.
And 2, stage: weight update
For each synaptic weight, the update proceeds as follows: the input excitation is multiplied by the response error to obtain the gradient of the weight; this gradient is then multiplied by a proportion, negated, and added to the weight.
This proportion (percentage) affects the speed and effectiveness of the training process and is therefore called the "training factor". The gradient points in the direction in which the error increases, so it must be negated when updating the weights in order to reduce the error caused by the weights.
Stages 1 and 2 are repeated iteratively until the network's response to the input reaches a satisfactory predetermined target range.
In image translation, contrast learning samples multiple feature maps and multiple feature points on those feature maps. Since each sampled positive is accompanied by a large number of negatives, most of the samples are simple samples. Simple samples have no influence on the model during parameter updating, so keeping them in memory greatly increases the memory occupation of model training; at the same time, their losses are still computed during back propagation, which further increases the burden of model training. Therefore, in order to accelerate training, the mapping network H is updated with an OHEM-like architecture;
the feature vector of a point in an intermediate-layer feature map of the encoder is obtained; for a given point on a feature map of the output image after it passes through the encoder, suppose that sampling yields n negative samples and 1 positive sample; these n+1 samples are passed as one batch into the mapping network H, and a forward pass yields n+1 losses; the n+1 losses are then sorted from large to small; the (n+1)/γ samples with the largest losses are selected and fed into a copy H_copy of the mapping network for forward and backward propagation, and the gradients of H_copy are copied to the mapping network H; finally the mapping network H updates its parameters. To reduce oscillation during training, H_copy is updated N times with gradient accumulation, and the accumulated gradient is divided by N before being passed to the network H.
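A rough sketch of this OHEM-style update of the mapping network H follows: all n+1 samples are scored without gradients, only the largest losses are back-propagated through a working copy, and gradients are accumulated over several steps. Names such as h_copy, the selection fraction, and the helper signatures are assumptions, not part of the patented method:

```python
import torch

def ohem_update_step(h, h_copy, per_sample_loss_fn, batch, gamma=4, accum_steps=1):
    """batch: tensor of n+1 samples (n negatives + 1 positive) for one feature-map point.
    per_sample_loss_fn(model, batch) must return one loss value per sample."""
    with torch.no_grad():                                   # read-only forward pass over all samples
        losses = per_sample_loss_fn(h, batch)
    k = max(1, batch.size(0) // gamma)                      # keep the (n+1)/gamma hardest samples
    hard_idx = torch.topk(losses, k).indices
    hard_loss = per_sample_loss_fn(h_copy, batch[hard_idx]).mean() / accum_steps
    hard_loss.backward()                                    # gradients accumulate in h_copy
    return hard_loss.detach()

def sync_gradients(h, h_copy):
    """After the accumulation steps, copy the accumulated gradients from H_copy to H."""
    for p, p_copy in zip(h.parameters(), h_copy.parameters()):
        p.grad = None if p_copy.grad is None else p_copy.grad.clone()
        if p_copy.grad is not None:
            p_copy.grad.zero_()
```

In this sketch, h_copy would be created as a copy of H (for example with copy.deepcopy) and re-synchronized with H after each parameter update; these details are assumptions rather than prescriptions of the method.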
The back propagation process of optimizing the contrast loss uses an online difficult sample mining method, which specifically comprises the following steps:
two copies of the ROI network are stored in memory, including a read-only ROI network that allocates memory only for forward transfers of all ROIs and a standard ROI network that allocates memory for forward and backward transfers. For SGD iteration, given a certain convolution feature map, the read-only ROI network performs forward transfer and calculates the loss of all input ROIs; then, the ROI sampling module sequences the ROIs according to the loss values, selects the ROI samples with the first K loss values being the largest, and inputs the ROI samples into a conventional ROI network; the network provides only forward and backward delivery of the selected ROI samples, the gradient values generated by the network are delivered to the read-only ROI network, and finally the read-only ROI network performs parameter update according to the gradient values, and all the ROIs of the N images are recorded as R, so that the effective batch size of the read-only ROI network is R, while the effective batch size of the conventional ROI network is K.
Firstly, contrast learning is introduced into image translation: image blocks at the same or different positions in the source-domain and target-domain images are selected as the positive and negative samples for contrast learning, the encoder in the generator is used to extract high-order semantic features, an auxiliary mapping network maps the feature vectors into the same projection space, and the contrast loss is then computed in that projection space. The proposed model needs no dual generator and dual discriminator, so the memory occupation and training time of model training are greatly reduced. Meanwhile, using the auxiliary network to measure the degree of similarity between source-domain and target-domain images helps the generator learn more general cross-domain information during training. Secondly, contrast learning is optimized with online difficult sample mining and the focal loss. The positive and negative samples for contrast learning are sampled on multi-layer feature maps; because the feature maps are large, sampling is uneven, which limits the representational capability of the mapping network. To improve this uneven sampling, the method first uses the online difficult sample mining technique to back-propagate difficult samples preferentially; in addition, the loss is improved with the focal loss so that the loss of the fewer positive samples receives a larger weight. Finally, a CycleGAN network structure based on an attention mechanism is provided: a spatial attention module is added to the generator to learn class weights between feature maps, and when a spatial attention module is added to the discriminator, dense residual blocks are introduced with skip connections to improve the transmission efficiency of input-image features.
Has the advantages that: compared with the prior art, the method introduces the specific steps of contrast learning into image translation and specifies how its positive and negative samples are determined; it improves on CycleGAN by adding an MLP mapping network H that maps related examples to nearby positions in the feature space and unrelated examples to distant positions, so the intrinsic relevance of the source domain and the target domain can be effectively measured. Second, the proposed contrast loss is optimized: when the feature map is large during image generation, the number of positive samples is far smaller than the number of negative samples, so the contrast loss is improved with the online difficult sample mining and focal loss techniques. Difficult sample mining back-propagates the losses of difficult samples preferentially, and the focal loss enlarges the loss weight of difficult samples by re-weighting easy and difficult sample losses. The model trained with the contrast loss greatly reduces training memory occupation and training time while achieving an image conversion effect with more pronounced detail than unidirectional and bidirectional image conversion, and it addresses the problems that existing image translation models are large, difficult to train, and use inaccurate domain-correlation metrics.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention;
FIG. 2 is a schematic diagram of a generator structure in the present invention;
FIG. 3 is a schematic diagram of the structure of the discriminator in the present invention;
FIG. 4 is a schematic diagram of the generative adversarial network structure in the present invention;
FIG. 5 is a back propagation schematic of the on-line difficult sample mining of the present invention.
Detailed Description
The present invention is further illustrated by the following figures and specific examples, which are to be understood as illustrative only and not as limiting the scope of the invention, which is to be given the full breadth of the appended claims and any and all equivalent modifications thereof which may occur to those skilled in the art upon reading the present specification.
The invention provides an image translation method based on contrast learning, which comprises the following steps as shown in figure 1:
s1: inputting an input image into a generator, the generator comprising an encoder and a decoder; the encoder is mainly used for encoding the characteristics of an input image into a characteristic vector, and the decoder is mainly used for decoding the characteristic vector into an image in a target domain;
s2: inputting the image generated by the generator and the real image of the target domain into a discriminator to obtain the output of the discriminator, namely generating an image prediction probability and a real image prediction probability;
s3: calculating the generative adversarial network loss according to the generated-image prediction probability and the real-image prediction probability;
s4: re-inputting the input image and the output image of the generator into an encoder in the generator to obtain the encoded output of the input image and the output image, inputting the encoded vectors of the input image and the output image into a mapping network to obtain the feature vectors of the input image and the output image in the same feature space, defining a contrast learning method in image translation according to a definition concept of contrast learning, sampling image blocks on the input image and the output image, dividing positive and negative samples, proposing contrast loss, and calculating the contrast loss between the feature vectors of the input image and the output image;
s5: optimizing the contrast loss by using the focal loss, to alleviate the problem of unevenly sampled positive and negative samples;
s6: back-propagating the adversarial loss and the optimized contrast loss, optimizing the network, and using the optimized network to realize image translation.
As shown in FIG. 2, the structure of the generator in step S1 is as follows: a U-Net network structure is adopted, comprising an encoder $G_{enc}$ and a decoder $G_{dec}$, wherein the encoder $G_{enc}$ consists of 3 convolutional layers; a conversion module (Transform) is arranged between the encoder and the decoder for image conversion between the two domains; the decoder consists of n upsampling layers; skip connections link the corresponding convolutional layers of the encoder and the decoder, which effectively prevents the feature information of the input image from being lost after convolution and improves the efficiency of information transmission.
As shown in FIG. 3, the network structure of the discriminator in step S2 is as follows: a discriminator network structure with an attention module is constructed on the basis of the CycleGAN discriminator. In the improved generative network the expressive capability of the generator is enhanced, so the original discriminator becomes relatively too weak and the quality of the generated images suffers. Therefore, in order to achieve a better balance between the two networks, the PatchGAN discriminator in the original CycleGAN is improved, and the judgment capability of the discriminator network is significantly strengthened by adding dense residual blocks and an attention mechanism while keeping the receptive-field size of the original network.
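A hedged sketch of a PatchGAN-style discriminator with a simple spatial attention gate is given below; the dense-residual-block design described above is not reproduced, and all layer widths and the receptive-field layout are assumptions:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Simple spatial attention gate: per-pixel weights from a 1x1 convolution."""
    def __init__(self, ch):
        super().__init__()
        self.mask = nn.Sequential(nn.Conv2d(ch, 1, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.mask(x)

class AttentionPatchDiscriminator(nn.Module):
    """PatchGAN-style discriminator (the 70x70-receptive-field layout is assumed)
    with an attention gate before the final prediction layer; outputs raw scores."""
    def __init__(self, in_ch=3, base=64):
        super().__init__()
        layers, ch = [], in_ch
        for mult, stride in [(1, 2), (2, 2), (4, 2), (8, 1)]:
            layers += [nn.Conv2d(ch, base * mult, 4, stride, 1),
                       nn.InstanceNorm2d(base * mult),
                       nn.LeakyReLU(0.2, True)]
            ch = base * mult
        self.features = nn.Sequential(*layers)
        self.attention = SpatialAttention(ch)
        self.pred = nn.Conv2d(ch, 1, 4, 1, 1)   # one realness score per patch

    def forward(self, x):
        return self.pred(self.attention(self.features(x)))
```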
In step S3, the generative adversarial network loss is calculated according to the generated-image prediction probability and the real-image prediction probability; the adversarial loss is the minimax loss, with the formula:
$$\mathcal{L}_{GAN}(G,D,X,Y)=\mathbb{E}_{y\sim Y}[\log D(y)]+\mathbb{E}_{x\sim X}[\log(1-D(G(x)))]$$
where D is the discriminator, G is the generator, X denotes the source domain and Y the target domain; x denotes a source-domain image and y a target-domain image; D(y) is the prediction probability for the real image, and D(G(x)) is the prediction probability for the generated image G(x). This loss is expressed as the cross-entropy loss between the predicted probability and the true probability.
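The same objective can be sketched in code as follows (an illustration, not the patented implementation; the non-saturating generator term is a common practical substitute for minimizing log(1 - D(G(x))) directly):

```python
import torch
import torch.nn.functional as F

def gan_losses(G, D, x, y):
    """Adversarial losses for one batch, written as cross-entropy on the discriminator's
    raw scores (the sigmoid is applied inside binary_cross_entropy_with_logits,
    matching the log D terms above). x: source-domain images, y: real target-domain images."""
    fake = G(x)
    pred_real = D(y)
    pred_fake = D(fake.detach())
    d_loss = F.binary_cross_entropy_with_logits(pred_real, torch.ones_like(pred_real)) + \
             F.binary_cross_entropy_with_logits(pred_fake, torch.zeros_like(pred_fake))
    pred_fake_g = D(fake)
    # Non-saturating generator objective: push D(G(x)) towards "real".
    g_loss = F.binary_cross_entropy_with_logits(pred_fake_g, torch.ones_like(pred_fake_g))
    return d_loss, g_loss
```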
In step S4: first, the correlation definition is introduced, with the optimization objective being to make the image in the input domain
$x\in\mathcal{X}$, after conversion by the model, appear similar to an image from the target domain $y\in\mathcal{Y}$. Given a dataset of unpaired instances $X=\{x\in\mathcal{X}\}$ and $Y=\{y\in\mathcal{Y}\}$, the aim is to learn an image translation model $G$ that realizes the conversion from an X-domain image to a Y-domain image, producing the output $\hat{y}=G(x)$.
The idea of contrast learning is to associate two signals, a query and its positive example, in contrast to other points in the dataset (called negative examples). The query, the positive example and the N negative examples are mapped to K-dimensional vectors $v$, $v^{+}\in\mathbb{R}^{K}$ and $v_{n}^{-}\in\mathbb{R}^{K}$, where $v_{n}^{-}$ denotes the n-th negative example. This sets up an (N+1)-way classification problem in which the scaled dot products between the query and the other examples are used as logits, and the cross-entropy loss expresses the probability that the positive example is selected:
$$\ell\big(v,v^{+},v^{-}\big)=-\log\!\left[\frac{\exp\big(v\cdot v^{+}/\tau\big)}{\exp\big(v\cdot v^{+}/\tau\big)+\sum_{n=1}^{N}\exp\big(v\cdot v_{n}^{-}/\tau\big)}\right]$$
where $v$, $v^{+}$ and $v_{n}^{-}$ denote the query vector, the positive example vector and the negative example vectors respectively, and $\tau$ is a temperature hyper-parameter.
The goal is to correlate input and output data. In contrast learning, a query refers to an output image block. Positive and negative examples are corresponding and non-corresponding input image blocks.
The model uses the encoder $G_{enc}$ in the generator G to extract high-order semantic information from the image; each spatial position on an intermediate feature map of $G_{enc}$ represents an image block of the input image, with deeper layers corresponding to larger image blocks. Following the SimCLR model (Chen T, Kornblith S, Norouzi M, et al. A simple framework for contrastive learning of visual representations[C]. PMLR, 2020), L intermediate layers are selected and their feature maps are passed through a small two-layer MLP network $H_l$, producing a stack of features
$$\{z_l\}_L=\{H_l(G_{enc}^{\,l}(x))\}_L$$
where $G_{enc}^{\,l}$ denotes the output of the l-th selected layer, $l\in\{1,2,\dots,L\}$ and $s\in\{1,2,\dots,S_l\}$, with $S_l$ the number of spatial positions in each layer; the corresponding feature (positive example) is denoted $z_l^{\,s}\in\mathbb{R}^{C_l}$ and the other, non-corresponding features (negative examples) are denoted $z_l^{\,S\setminus s}\in\mathbb{R}^{(S_l-1)\times C_l}$, where $C_l$ is the number of channels of each layer.
In the same way, the output image $\hat{y}=G(x)$ is encoded as $\{\hat{z}_l\}_L=\{H_l(G_{enc}^{\,l}(\hat{y}))\}_L$.
The optimization goal is to match the corresponding input-output image blocks at specific positions; other Image blocks in the same input are taken as negative samples and named as NCEIT Loss (NCE Loss for Image transformation), and the expression of the contrast Loss is as follows:
$$\mathcal{L}_{NCEIT}(G,H,X)=\mathbb{E}_{x\sim X}\sum_{l=1}^{L}\sum_{s=1}^{S_l}\ell\big(\hat{z}_l^{\,s},\,z_l^{\,s},\,z_l^{\,S\setminus s}\big)$$
where H is the two-layer MLP network, G is the generator, X is the source domain, $S_l$ denotes the number of feature points on a given feature map, corresponding to the number of image blocks, and L denotes the number of intermediate layers.
It is noted that the invention may also utilize image blocks of other images in the dataset as negative examples; encoding a random negative image in a data set x as
$$\{\tilde{z}_l\}_L=\{H_l(G_{enc}^{\,l}(\tilde{x}))\}_L$$
and the following external encoding is used, in which an auxiliary moving-average encoder maintains a large, consistent dictionary of negative samples, as in MoCo (He K, Fan H, Wu Y, et al. Momentum contrast for unsupervised visual representation learning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 9729-9738); negative samples can be drawn from a longer history, which is more efficient than end-to-end updates and memory banks. The expression of the contrast loss is as follows:
$$\mathcal{L}_{external}(G,H,X)=\mathbb{E}_{x\sim X,\,\tilde{z}^{-}\sim Z^{-}}\sum_{l=1}^{L}\sum_{s=1}^{S_l}\ell\big(\hat{z}_l^{\,s},\,z_l^{\,s},\,\tilde{z}_l^{-}\big)$$
where the dataset negatives $\tilde{z}^{-}$ are drawn from an external dictionary $Z^{-}$ of features computed on source-domain images, using a moving-average encoder $\hat{G}_{enc}$ and a moving-average MLP $\hat{H}$. For simplicity of calculation, image blocks on the same input are taken as negative examples.
In step S5:
the known optimization goal is to match the corresponding input-output image block at a specific location. The other Image blocks in the same input are used as negative samples and named NCEIT Loss (NCE Loss for Image transformation).
$$\mathcal{L}_{NCEIT}(G,H,X)=\mathbb{E}_{x\sim X}\sum_{l=1}^{L}\sum_{s=1}^{S_l}\ell\big(\hat{z}_l^{\,s},\,z_l^{\,s},\,z_l^{\,S\setminus s}\big)$$
where $S_l$ denotes the number of feature points on a given feature map, corresponding to the number of image blocks, and L denotes the number of intermediate layers. The NCEIT Loss is computed over multiple feature maps: because different feature maps represent different semantic information and correspond to image blocks of different sizes in the input image, computing the noise-contrastive estimation loss over multiple feature maps helps the H network learn more information. The H network maps input and output image blocks into the same embedding space, mapping related image blocks to nearby positions in the feature space and unrelated image blocks to positions that are farther apart. For each output image block, only the input image block at the same position is a relevant (positive) signal, while image blocks at other positions are negative signals; for feature maps containing tens of image blocks, the number of positive samples is therefore far smaller than the number of negative samples, and the gradient information of the large number of negatives drowns out the distinctive gradient information of the positives, so the focal loss is introduced to address this problem. The specific steps are as follows:
Let
$$p=\frac{\exp\big(v\cdot v^{+}/\tau\big)}{\exp\big(v\cdot v^{+}/\tau\big)+\sum_{n=1}^{N}\exp\big(v\cdot v_{n}^{-}/\tau\big)}$$
then
$$\ell_{FL}\big(v,v^{+},v^{-}\big)=-(1-p)^{\gamma}\log(p)$$
where $\gamma$ is the decay rate for the weight of easy samples.
The resulting contrast loss (focal-weighted NCEIT Loss) is:
$$\mathcal{L}_{NCEIT}(G,H,X)=\mathbb{E}_{x\sim X}\sum_{l=1}^{L}\sum_{s=1}^{S_l}\ell_{FL}\big(\hat{z}_l^{\,s},\,z_l^{\,s},\,z_l^{\,S\setminus s}\big)$$
where G denotes the generator, H the mapping network, X the source domain, L the number of intermediate layers and $S_l$ the number of feature points in each layer; $\hat{z}_l^{\,s}$ denotes the feature vector of an output image block after the encoder network and the mapping network, $z_l^{\,s}$ denotes the feature vector of the input image block related to (at the same position as) that output image block after the encoder network and the mapping network, and $z_l^{\,S\setminus s}$ denotes the feature vectors of the input image blocks unrelated to (at different positions from) that output image block after the encoder network and the mapping network.
In step S6:
In image translation, contrast learning samples multiple feature maps and multiple feature points on those feature maps. Since each sampled positive is accompanied by a large number of negatives, most of the samples are simple samples. Simple samples have no influence on the model during parameter updating, so keeping them in memory greatly increases the memory occupation of model training; at the same time, their losses are still computed during back propagation, which further increases the burden of model training. Therefore, in order to accelerate training, the mapping network H is updated with an OHEM-like architecture;
the feature vector of a point in an intermediate-layer feature map of the encoder is obtained; for a given point on a feature map of the output image after it passes through the encoder, suppose that sampling yields n negative samples and 1 positive sample; these n+1 samples are passed as one batch into the mapping network H, and a forward pass yields n+1 losses; the n+1 losses are then sorted from large to small; the (n+1)/γ samples with the largest losses are selected and fed into a copy H_copy of the mapping network for forward and backward propagation, and the gradients of H_copy are copied to the mapping network H; finally the mapping network H updates its parameters. To reduce oscillation during training, H_copy is updated N times with gradient accumulation, and the accumulated gradient is divided by N before being passed to the network H.
Based on the above, the present invention performs example application and analysis on the above scheme, specifically as follows:
1) Instance normalization and residual structures in the network
Instance normalization is a normalization scheme commonly used in image style conversion. Specifically, on the basis of BN the image is normalized at the channel level, and the normalized image is then "de-normalized" using the mean and standard deviation of the corresponding channels of the target-style picture, so that it takes on the style of the target picture.
The residual block is used to improve the learning ability of the network. A hopping connection is used between the input and the output. In mathematical statistics, residual refers to the difference between the actual observed value and the estimated value (fitted value). "residual" implies important information about the basic assumptions of the model. The residual can be considered as an observed value of error if the regression model is correct. It should conform to the assumptions of the model and have some properties of error. The residual analysis refers to the process of using the information provided by the residual to investigate the reasonableness of model assumptions and the reliability of data.
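A minimal sketch of a residual block with instance normalization, of the kind commonly used in style-transfer generators (layer sizes are assumptions):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual block with instance normalization: the input is added back to the
    convolutional output via a skip connection between input and output."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch), nn.ReLU(True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch),
        )

    def forward(self, x):
        return x + self.body(x)   # skip connection
```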
2) Generative adversarial network principle and loss function definition
The generative adversarial network is based on game theory, and the two players of the game are the generator and the discriminator. A GAN comprises two modules, a Generator (G) and a Discriminator (D). The task of G is to randomly generate synthetic data that can pass for real. To provide randomness, a set of random numbers is usually used as the input of G, for example 100 random numbers sampled from a standard normal distribution, denoted as the random noise z; the output is a picture with the same resolution as the real pictures. The task of D is to distinguish real from fake, i.e., to judge whether a picture is a real picture or a fake picture synthesized by G. Thus, the input of D is a picture and the output is a score; a higher score indicates a more realistic input picture. During training, the generator tries to generate fake samples that confuse the discriminator's judgment, while the discriminator tries to judge correctly, outputting high scores for all real pictures and low scores for all fake pictures. As this is repeated, the discrimination capability of D and the generation capability of G both become stronger; the two play against each other and progress together. In the ideal case, G can eventually generate fake pictures that are indistinguishable from real pictures.
The principle of the generative adversarial network is shown in FIG. 4, in which the Generator and the Discriminator are both constructed from multi-layer neural networks and are differentiable functions. The generator maps a random noise vector to an image and tries to make the generated image as similar as possible to a real image, so that the discriminator's judgment fails, i.e., for input G(z) the output of the discriminator D is "real". The optimization goal of the discriminator is to judge its inputs as correctly as possible, i.e., real image inputs are judged as real and generated image inputs are judged as fake. The two continuously compete during training and improve together, so that the discriminator learns the essential characteristics of real images and the generator generates fake samples that are almost identical to real samples.
The optimization process for generating the countermeasure network is as follows: in each iteration, a batch of real pictures x is randomly selected, a batch of z is randomly generated, and then z is input into G to obtain a batch of dummy pictures x' = G (z). The loss function of D includes two aspects: the first is that the fraction D (x) for x should be relatively high, i.e. as close as possible to 1; secondly, the fraction D (x ') corresponding to x' should be relatively low, i.e. as close as possible to 0. And adjusting each parameter of the D according to the direction of the reduction of the loss function, and completing the optimization of the D. After optimization, the score difference of D output for real or false pictures becomes large. With respect to G, the goal is to let D misinterpret x ' as a true picture, so the penalty function for G can be the difference between D (x ') and 1, the smaller this difference, indicating that x ' is more true when viewed from D. And adjusting each parameter of G according to the direction of the reduction of the loss function, and completing the optimization of G. After optimization, the false picture synthesized by G is more real in the generated image under the judgment of D.
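The alternating optimization described above can be sketched as one training step (illustrative only; networks, optimizers and data loading are assumed to exist elsewhere, and D is assumed to return raw pre-sigmoid scores):

```python
import torch
import torch.nn.functional as F

def gan_train_step(G, D, g_opt, d_opt, real, z):
    """One alternating optimization step of the GAN game described above.
    real: a batch of real pictures x; z: a batch of random noise vectors."""
    # 1) Update D: push D(x) towards 1 and D(G(z)) towards 0
    fake = G(z).detach()
    pred_real, pred_fake = D(real), D(fake)
    d_loss = F.binary_cross_entropy_with_logits(pred_real, torch.ones_like(pred_real)) + \
             F.binary_cross_entropy_with_logits(pred_fake, torch.zeros_like(pred_fake))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # 2) Update G: push D(G(z)) towards 1 so the synthesized picture is judged "real"
    pred_fake = D(G(z))
    g_loss = F.binary_cross_entropy_with_logits(pred_fake, torch.ones_like(pred_fake))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```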
The adversarial game of GAN can be expressed as a minimax objective over the discriminator function D and the generator function G. The generator G converts random samples z drawn from a prior distribution $p_z(z)$ into generated samples G(z); the discriminator D tries to distinguish them from training samples drawn from the real data distribution $p_{data}(x)$, while G tries to make the distribution of the generated samples resemble that of the training samples:
$$\min_{G}\max_{D}V(D,G)=\mathbb{E}_{x\sim p_{data}(x)}[\log D(x)]+\mathbb{E}_{z\sim p_{z}(z)}[\log(1-D(G(z)))]$$
where x denotes real sample data and G(z) denotes generated sample data. Intuitively, for a given generator G, the discriminator D is optimized to distinguish the generated samples G(z) by assigning higher values to samples from the distribution $p_{data}(x)$ and lower values to the generated samples G(z). Conversely, for a given discriminator D, G is optimized so that the discriminator misclassifies, assigning higher values to the generated samples G(z).
3) Contrast learning and loss function
Self-supervised learning is a machine learning approach that trains a network by mining the inherent structure of the training data, without relying on manually annotated labels; it aims to learn a general feature representation for use in downstream tasks. Contrast learning is a typical form of self-supervised learning. Its basic idea is to learn a feature representation (a mapping network) by constructing positive and negative sample pairs such that positive pairs are relatively close in the projection space and negative pairs are as far apart as possible, which can be expressed by the following formula:
distance(f(x), f(x+)) << distance(f(x), f(x-))
where x+ is a positive sample, x- is a negative sample, and f is the encoder network to be learned.
In order to optimize the encoder network, a softmax-based cross entropy is constructed as the loss function, with the formula:
$$\mathcal{L}=-\log\frac{\exp\big(f(x)\cdot f(x^{+})/\tau\big)}{\exp\big(f(x)\cdot f(x^{+})/\tau\big)+\sum_{j=1}^{K}\exp\big(f(x)\cdot f(x_{j}^{-})/\tau\big)}$$
This loss function is also called InfoNCE (information noise-contrastive estimation); with one positive example and K negative examples per sample, it can essentially be regarded as a (K+1)-way classification problem.
4) Focal loss and its definition
The focal loss was proposed to address the imbalance of positive and negative samples in single-stage object detection, where there is an extreme imbalance between the foreground and background classes during training, with foreground examples vastly outnumbered by background examples.
First, the cross-entropy (CE) loss for binary classification is defined:
$$CE(p,y)=\begin{cases}-\log(p), & \text{if } y=+1\\ -\log(1-p), & \text{otherwise}\end{cases}$$
where $y\in\{\pm1\}$ specifies the ground-truth class, with $-1$ denoting the negative class label and $+1$ the positive class label, and $p\in[0,1]$ is the model's estimated probability that the input belongs to the positive class. Defining
$$p_t=\begin{cases}p, & \text{if } y=+1\\ 1-p, & \text{otherwise}\end{cases}$$
where $p_t$ is the predicted probability of the correct class (for an input with label $y=+1$, $p_t$ is close to 1 when $p$ is close to 1, and vice versa), the cross-entropy loss can be written as
$$CE(p,y)=CE(p_t)=-\log(p_t)$$
Even for easily classified samples (i.e., those with $p_t$ close to 1), the CE loss is non-negligible, and the accumulated gradient of a large number of simple samples can exceed the gradient of a complex sample.
To solve the above problem, the focal loss (Lin T Y, Goyal P, Girshick R, et al. Focal loss for dense object detection[C]//Proceedings of the IEEE International Conference on Computer Vision. 2017: 2980-2988) adds a modulating factor $(1-p_t)^{\gamma}$ to the cross-entropy loss, where $\gamma$ is an adjustable hyper-parameter. The focal loss is defined as:
$$FL(p_t)=-(1-p_t)^{\gamma}\log(p_t)$$
The focal loss has two properties: (1) when a sample is misclassified, i.e., $y=+1$ and $p_t$ is very small, the modulating factor is close to 1 and the loss is unaffected; as $p_t$ approaches 1, the factor $(1-p_t)^{\gamma}$ goes to 0 and the loss of well-classified examples is down-weighted; (2) the hyper-parameter $\gamma$ smoothly adjusts the rate at which easy samples are down-weighted: FL is equivalent to CE when $\gamma=0$, and as $\gamma$ increases, the influence of the modulating factor $(1-p_t)^{\gamma}$ also increases.
5) On-line difficult sample mining (OHEM) and network architecture thereof
An online difficult sample mining algorithm (OHEM), based on the back-propagation algorithm, is commonly used in the object detection field to alleviate the imbalance between positive and negative samples; it evolved from hard example mining. The algorithm proceeds as follows: for the input image of the t-th iteration of the SGD (stochastic gradient descent) optimization method, a convolutional feature map is first computed with a convolutional network. The ROI network then uses this feature map and all input ROIs (regions of interest) for a forward pass and backward update. This step involves only ROI pooling, several fc layers, and a per-ROI loss computation. The loss reflects how well the current network performs on each ROI: the larger the loss, the worse the performance. The input ROIs are then sorted by loss, and the K ROIs on which the current network performs worst are chosen as difficult examples. Most of the forward computation is shared between ROIs through the convolutional feature map, so the extra computation needed to score all ROIs is relatively small. Furthermore, since only a few ROIs are selected to update the model rather than back-propagating all ROI losses, the cost of the backward pass is no higher than before.
In the Fast R-CNN family of object detectors, there are many ways to implement OHEM, such as modifying the loss layer: the loss layer computes the losses of all ROIs, sorts them, selects the difficult ROIs (those with larger losses), and finally sets the losses of the non-difficult ROIs to 0. This method is simple but inefficient, because even when most ROIs have their loss set to 0, the ROI network still allocates memory for all ROIs and performs backward passes for them, which seriously affects the training efficiency of the model.
To overcome this problem, OHEM (Shrivastava A, Gupta A, Girshick R. Training region-based object detectors with online hard example mining[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 761-769) proposes the architecture shown in FIG. 5. Two copies of the ROI network are maintained in memory: a read-only ROI network that allocates memory only for forward passes over all ROIs, and a standard ROI network that allocates memory for both forward and backward passes. In an SGD iteration, given a convolutional feature map, the read-only ROI network performs a forward pass and computes the losses of all input ROIs (indicated by the green arrows in the figure). The ROI sampling module then sorts the ROIs by loss value, selects the K ROI samples with the largest losses, and feeds them into the standard ROI network (red arrows in the figure). That network performs forward and backward passes only for the selected ROI samples, and the gradients it produces are delivered to the read-only ROI network (grey block in the figure). Finally, the read-only ROI network updates its parameters according to these gradients. Denoting all the ROIs of the N images as R, the effective batch size of the read-only ROI network is R, while that of the standard ROI network is K.
In order to verify the actual effect of the image translation model provided by the invention, it is compared with existing image translation models; the specific data are shown in Table 1:
TABLE 1: Relevant training indices
According to the experimental results, with the same batch size, the contrast loss occupies less memory than the cycle-consistency loss, and the training time is only 1/3 of that of the cycle-consistency loss. This is because the model proposed in the invention has only the generator and discriminator of the target domain, and the added auxiliary mapping network requires little computation.
The results also show that the Focal Loss increases training time while its memory occupation and model parameter count remain similar to those of the original model, and that OHEM can effectively reduce the memory occupation and training time of the model, indicating that OHEM improves the convergence rate of the model during training without degrading the generation quality of the original model.

Claims (10)

1. An image translation method based on contrast learning is characterized by comprising the following steps:
s1: inputting an input image into a generator, the generator comprising an encoder and a decoder;
s2: inputting the image generated by the generator and the real image of the target domain into a discriminator to obtain the output of the discriminator, namely generating an image prediction probability and a real image prediction probability;
s3: calculating the generative adversarial network loss according to the generated-image prediction probability and the real-image prediction probability;
s4: re-inputting the input image and the output image of the generator into an encoder in the generator to obtain the encoding output of the input image and the output image, inputting the encoding vectors of the input image and the output image into a mapping network to obtain the characteristic vectors of the input image and the output image in the same characteristic space, and calculating the contrast loss between the characteristic vectors of the input image and the output image;
s5: optimizing the contrast loss using the focal loss;
s6: back-propagating the adversarial loss and the optimized contrast loss, optimizing the network, and using the optimized network to realize image translation.
2. The image translation method based on contrast learning according to claim 1, wherein the generator in step S1 is configured as follows: a U-Net network structure is adopted, comprising an encoder $G_{enc}$ and a decoder $G_{dec}$, wherein the encoder $G_{enc}$ consists of 3 convolutional layers; a conversion module (Transform) is arranged between the encoder and the decoder for image conversion between the two domains; the decoder consists of n upsampling layers; and the encoder and the decoder are connected by skip connections between corresponding convolutional layers.
3. The image translation method based on contrast learning according to claim 1, wherein the network structure of the discriminator in step S2 is as follows: a discrimination network with an attention module is constructed on the basis of the CycleGAN discriminator; the discrimination network PatchGAN of the original CycleGAN is improved by adding dense residual blocks and an attention mechanism while keeping the receptive-field size of the original network.
4. The image translation method based on contrast learning according to claim 1, wherein the loss of the generative adversarial network in step S3 is calculated as follows:
the adversarial loss is formulated as a minimax objective:
$$\mathcal{L}_{GAN}(G, D, X, Y) = \mathbb{E}_{y \sim Y}\left[\log D(y)\right] + \mathbb{E}_{x \sim X}\left[\log\left(1 - D(G(x))\right)\right]$$
wherein D is the discriminator, G is the generator, X denotes the source domain and Y denotes the target domain; x denotes a source-domain image and y denotes a target-domain image; D(y) is the prediction probability for the real image and D(G(x)) is the prediction probability for the generated image G(x); the loss corresponds to the cross-entropy between the predicted probabilities and the true labels.
5. The image translation method based on contrast learning according to claim 1, wherein the encoding output in step S4 is obtained as follows:
the encoder $G_{enc}$ of the generator G is used to extract high-order semantic information of the image; each spatial position on an intermediate feature map of $G_{enc}$ represents an image block of the input image, with deeper layers corresponding to larger image blocks; drawing on the SimCLR model, L intermediate layers are selected and their feature maps are passed through a small two-layer MLP network $H_l$, producing a stack of features
$$\{z_l\}_L = \{H_l(G_{enc}^{\,l}(x))\}_L$$
wherein $G_{enc}^{\,l}$ denotes the output of the l-th selected layer, $l \in \{1, 2, \ldots, L\}$, and $s \in \{1, 2, \ldots, S_l\}$ indexes the spatial positions, where $S_l$ is the number of spatial positions in each layer; the feature at the corresponding position is denoted $z_l^{s} \in \mathbb{R}^{C_l}$, and the other, non-corresponding features are denoted $z_l^{S \setminus s} \in \mathbb{R}^{(S_l - 1) \times C_l}$, wherein $C_l$ is the number of channels in each layer;
in the same way, the output image $\hat{y}$ is encoded as $\{\hat{z}_l\}_L = \{H_l(G_{enc}^{\,l}(\hat{y}))\}_L$.
6. The image translation method based on contrast learning according to claim 5, wherein the contrast loss in step S4 is calculated as follows:
the optimization goal is to match corresponding input and output image blocks at specific positions, taking the other image blocks within the same input as negative samples; the resulting loss is named the NCEIT Loss, and the contrast loss is expressed as follows:
$$\mathcal{L}_{NCEIT}(G, H, X) = \mathbb{E}_{x \sim X} \sum_{l=1}^{L} \sum_{s=1}^{S_l} \ell\left(\hat{z}_l^{s}, z_l^{s}, z_l^{S \setminus s}\right)$$
wherein H is the two-layer MLP network, G is the generator, X is the source domain, $S_l$ denotes the number of feature points on the l-th feature map and corresponds to the number of image blocks, L denotes the number of selected intermediate layers, and $\ell$ denotes the block-wise contrastive cross-entropy between an output feature, its corresponding (positive) input feature and the non-corresponding (negative) input features.
7. The image translation method based on contrast learning according to claim 5, wherein the contrast loss in step S4 is calculated as follows:
image blocks of other images in the data set are used as negative samples; a random negative image in the data set X is encoded as $\tilde{z}$, and the following external encoding is used: in this variant, a large and consistent dictionary of negative samples is maintained with an auxiliary moving-average encoder, and the contrast loss is expressed as follows:
$$\mathcal{L}_{external}(G, H, X) = \mathbb{E}_{x \sim X,\; \tilde{z}^{-} \sim Z^{-}} \sum_{l=1}^{L} \sum_{s=1}^{S_l} \ell\left(\hat{z}_l^{s}, z_l^{s}, \tilde{z}_l^{-}\right)$$
wherein the data-set negatives $\tilde{z}^{-}$ are sampled from an external dictionary $Z^{-}$ built from the source domain, and are computed with a moving-average encoder $\hat{G}_{enc}$ and a moving-average MLP $\hat{H}_l$.
8. The image translation method based on contrast learning according to claim 6, wherein step S5 specifically comprises:
denote by p the softmax probability that an output-image block feature $\hat{z}$ is matched to its corresponding (positive) input block feature rather than to the non-corresponding (negative) features; the focal-weighted contrastive term is then
$$\ell_{focal}\left(\hat{z}, z^{+}, z^{-}\right) = -(1 - p)^{\gamma} \log p$$
wherein γ is the decay rate that down-weights easy samples;
the resulting contrast loss, the NCEIT Loss, is given by:
$$\mathcal{L}_{NCEIT}(G, H, X) = \mathbb{E}_{x \sim X} \sum_{l=1}^{L} \sum_{s=1}^{S_l} \ell_{focal}\left(\hat{z}_l^{s}, z_l^{s}, z_l^{S \setminus s}\right)$$
wherein G denotes the generator, H denotes the mapping network, X denotes the source domain, L denotes the number of selected intermediate layers, $S_l$ denotes the number of feature points in each layer, $\hat{z}_l^{s}$ denotes the feature vector of an output image block after passing through the encoder network and the mapping network, $z_l^{s}$ denotes the feature vector of the input image block corresponding to the output image block after passing through the encoder network and the mapping network, and $z_l^{S \setminus s}$ denotes the feature vectors of the input image blocks not corresponding to the output image block after passing through the encoder network and the mapping network.
9. The image translation method based on contrast learning according to claim 1, wherein the step S6 specifically comprises:
obtaining the feature vector of each point in a feature map of an intermediate layer of the encoder; for a certain point on a certain feature map of the output image after passing through the encoder, assuming that n negative samples and 1 positive sample are obtained after sampling, the n negative samples and the 1 positive sample are taken as one batch and fed into the mapping network H, and the losses of the n + 1 samples are obtained by forward propagation; the n + 1 losses are then sorted from large to small; next, the (n + 1)/γ samples with the largest losses are selected and fed into a copy H_copy of the mapping network for forward and backward propagation, and the gradients of H_copy are copied to the mapping network H; finally, the mapping network H updates its parameters; in order to reduce oscillation during training, H_copy is updated N times in a gradient-accumulation manner, and the accumulated gradient is divided by N before being passed to the network H.
10. The image translation method based on contrast learning according to claim 1, wherein the back-propagation process for the optimized contrast loss in step S6 uses an online hard example mining (OHEM) method, which is as follows:
two copies of the ROI network are kept in memory: a read-only ROI network that allocates memory only for the forward pass of all ROIs, and a standard ROI network that allocates memory for both the forward and backward passes; in each SGD iteration, given a convolutional feature map, the read-only ROI network performs a forward pass and computes the loss of all input ROIs; the ROI sampling module then sorts the ROIs by loss value, selects the first K ROI samples with the largest losses and feeds them into the standard ROI network; this network performs the forward and backward passes only for the selected ROI samples, and the gradients it produces are passed to the read-only ROI network, which finally updates its parameters according to these gradient values; denoting all ROIs of the N images as R, the effective batch size of the read-only ROI network is R, while the effective batch size of the standard ROI network is K.
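By way of a hedged illustration of claims 5, 6 and 8 (not a definitive implementation of the claimed method), the following PyTorch-style sketch computes a focal-weighted block-wise contrastive loss for one selected encoder layer. The cosine-similarity logits, the temperature tau, and the names focal_patch_nce, z_out and z_in are assumptions introduced for the sketch.

```python
import torch
import torch.nn.functional as F

def focal_patch_nce(z_out, z_in, gamma=2.0, tau=0.07):
    """Focal-weighted block-wise contrastive loss for one layer.

    z_out: (S, C) features of S output-image blocks after encoder + MLP head
    z_in:  (S, C) features of the S corresponding input-image blocks
    The block at the same index is the positive; all other blocks are negatives.
    """
    z_out = F.normalize(z_out, dim=1)
    z_in = F.normalize(z_in, dim=1)

    # (S, S) similarity matrix; the diagonal holds the positive pairs.
    logits = z_out @ z_in.t() / tau
    targets = torch.arange(z_out.size(0), device=z_out.device)

    # Probability assigned to the positive block for each output block.
    p_pos = F.softmax(logits, dim=1).gather(1, targets[:, None]).squeeze(1)

    # Focal weighting: easy samples (p_pos near 1) are down-weighted by (1 - p)^gamma.
    loss = -((1.0 - p_pos) ** gamma) * torch.log(p_pos.clamp_min(1e-8))
    return loss.mean()
```

Summing this term over the L selected layers and averaging over a batch of source images would correspond to the NCEIT Loss of claim 8; setting gamma to 0 recovers the plain contrastive loss of claim 6.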
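Similarly, the following sketch is one possible reading of the hard-sample procedure of claims 9 and 10 applied to the mapping network: per-sample losses are computed without gradients, only the hardest fraction is used for forward and backward propagation, and gradients are accumulated over N batches before being averaged and applied. The names mapping_net, sample_losses_fn and ratio, and the modelling of H_copy as a weight-sharing module, are assumptions rather than the claimed procedure itself.

```python
import torch

def ohem_mapping_update(mapping_net, sample_losses_fn, batches, ratio, optimizer):
    """Hard-sample update of the mapping network H (sketch of claim 9).

    mapping_net:      the mapping network H (H_copy is modelled as the same
                      weight-sharing module, so its gradients land on H directly)
    sample_losses_fn: callable returning a per-sample loss tensor for a batch
                      of n negatives + 1 positive passed through mapping_net
    batches:          N batches; gradients are accumulated across all of them
    ratio:            selection divisor; roughly (n + 1) / ratio samples are kept
    """
    N = len(batches)
    optimizer.zero_grad()
    for batch in batches:
        # Score every sample without building a backward graph.
        with torch.no_grad():
            losses = sample_losses_fn(mapping_net, batch)
        k = max(1, int(losses.numel() / ratio))
        hard_idx = torch.topk(losses, k).indices      # largest-loss samples

        # Forward + backward only on the hard samples; divide by N so that the
        # accumulated gradient equals the average over the N batches.
        hard_loss = sample_losses_fn(mapping_net, batch[hard_idx]).mean()
        (hard_loss / N).backward()
    optimizer.step()                                  # apply the averaged gradient to H
```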
CN202211232833.1A 2022-10-10 2022-10-10 Image translation method based on contrast learning Pending CN115909002A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211232833.1A CN115909002A (en) 2022-10-10 2022-10-10 Image translation method based on contrast learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211232833.1A CN115909002A (en) 2022-10-10 2022-10-10 Image translation method based on contrast learning

Publications (1)

Publication Number Publication Date
CN115909002A true CN115909002A (en) 2023-04-04

Family

ID=86488681

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211232833.1A Pending CN115909002A (en) 2022-10-10 2022-10-10 Image translation method based on contrast learning

Country Status (1)

Country Link
CN (1) CN115909002A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116152901A (en) * 2023-04-24 2023-05-23 广州趣丸网络科技有限公司 Training method of image generation model and stylized image generation method
CN116631566A (en) * 2023-05-23 2023-08-22 重庆邮电大学 Medical image report intelligent generation method based on big data
CN116631566B (en) * 2023-05-23 2024-05-24 广州合昊医疗科技有限公司 Medical image report intelligent generation method based on big data
CN116738911A (en) * 2023-07-10 2023-09-12 苏州异格技术有限公司 Wiring congestion prediction method and device and computer equipment
CN116738911B (en) * 2023-07-10 2024-04-30 苏州异格技术有限公司 Wiring congestion prediction method and device and computer equipment

Similar Documents

Publication Publication Date Title
Gu et al. Stack-captioning: Coarse-to-fine learning for image captioning
CN115909002A (en) Image translation method based on contrast learning
CN111105032B (en) Chromosome structure abnormality detection method, system and storage medium based on GAN
CN111723780B (en) Directional migration method and system of cross-domain data based on high-resolution remote sensing image
CN114332578A (en) Image anomaly detection model training method, image anomaly detection method and device
CN112116593B (en) Domain self-adaptive semantic segmentation method based on base index
CN113469186B (en) Cross-domain migration image segmentation method based on small number of point labels
CN110689599A (en) 3D visual saliency prediction method for generating countermeasure network based on non-local enhancement
CN111476285B (en) Training method of image classification model, image classification method and storage medium
CN112990078B (en) Facial expression generation method based on generation type confrontation network
CN110599443A (en) Visual saliency detection method using bidirectional long-term and short-term memory network
CN114419323A (en) Cross-modal learning and domain self-adaptive RGBD image semantic segmentation method
CN115761240B (en) Image semantic segmentation method and device for chaotic back propagation graph neural network
CN114692732A (en) Method, system, device and storage medium for updating online label
CN114819091B (en) Multi-task network model training method and system based on self-adaptive task weight
CN117217368A (en) Training method, device, equipment, medium and program product of prediction model
CN113592008A (en) System, method, equipment and storage medium for solving small sample image classification based on graph neural network mechanism of self-encoder
CN113822144A (en) Target detection method and device, computer equipment and storage medium
CN107256554B (en) Single-layer pulse neural network structure for image segmentation
CN116452472A (en) Low-illumination image enhancement method based on semantic knowledge guidance
CN115861664A (en) Feature matching method and system based on local feature fusion and self-attention mechanism
CN117011539A (en) Target detection method, training method, device and equipment of target detection model
CN114139629A (en) Self-guided mixed data representation learning method and system based on metric learning
CN113806561A (en) Knowledge graph fact complementing method based on entity attributes
Yin et al. Object detection based on multiple trick feature pyramid networks and dynamic balanced L1 loss

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination