CN115170399A - Multi-target scene image resolution improving method, device, equipment and medium


Info

Publication number: CN115170399A
Application number: CN202211092795.4A
Authority: CN (China)
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 郭金林, 老松杨, 汤俊, 李欣炜
Applicant/Assignee: National University of Defense Technology

Classifications

    • G06T3/4053: Scaling of whole images or parts thereof, e.g. expanding or contracting, based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G06T3/4046: Scaling of whole images or parts thereof, e.g. expanding or contracting, using neural networks
    • G06V10/7715: Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; mappings, e.g. subspace methods
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks


Abstract

The application relates to a method, an apparatus, a device and a medium for improving the resolution of multi-target scene images. The method comprises the following steps: acquiring a low-resolution multi-target scene image; calling a super-resolution network model based on the DRN architecture; inputting the low-resolution multi-target scene image into the trained super-resolution network model, and generating a high-resolution multi-target scene image by updating the forward mapping network in the super-resolution network model; inputting the low-resolution and high-resolution multi-target scene images into a truncated VGG19 network for feature extraction to obtain their feature maps; substituting the feature maps into the loss function of the super-resolution network model to calculate a mean square error; and outputting the high-resolution multi-target scene image after the mean square error calculation. The resolution of the generated multi-target scene images is greatly improved.

Description

Multi-target scene image resolution improving method, device, equipment and medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a method, an apparatus, a device, and a medium for improving a resolution of a multi-target scene image.
Background
Among the generation problems of multi-target scene images, the Conditional SinGAN, a conditional generative adversarial network improved from SinGAN, has successfully solved the controllable generation problem of multi-target scene images: it can be trained on a given multi-target scene image and, under the guidance of control conditions, generate pseudo multi-target scene images in the direction desired by the user, with the number, distribution, etc. of targets in the pseudo images being controllable. Such pseudo multi-target scene images better match human visual cognition, have rich application scenarios, and are of particular application value in the fields of news, public opinion and intelligence.
However, in the process of implementing the present invention, the inventors found that in the foregoing conventional multi-target scene image generation method, the Conditional SinGAN generative adversarial network model is trained in a single-image block-wise manner, so that the generated images are small in size and low in resolution, with coarse details that impair the visual experience and reduce the fidelity of the forged image; the generated images therefore suffer from the technical problem of insufficient resolution.
Disclosure of Invention
Accordingly, it is necessary to provide a method, an apparatus, a computer device and a medium for improving the resolution of multi-target scene images, which can greatly improve the resolution of the generated multi-target scene images.
In order to achieve the above purpose, the embodiment of the invention adopts the following technical scheme:
in one aspect, an embodiment of the present invention provides a method for improving resolution of a multi-target scene image, including:
acquiring a multi-target scene image with low resolution;
calling a super-resolution network model based on a DRN architecture; the super-resolution network model comprises a low-resolution to high-resolution forward mapping network and a high-resolution to low-resolution reverse mapping network;
inputting the multi-target scene image with low resolution into the trained super-resolution network model, and generating the multi-target scene image with high resolution by updating a forward mapping network in the super-resolution network model;
inputting the low-resolution multi-target scene images and the high-resolution multi-target scene images into a truncated VGG19 network for feature extraction to obtain feature maps of the low-resolution multi-target scene images and the high-resolution multi-target scene images;
substituting the characteristic diagram into a loss function of the super-resolution network model to calculate a mean square error;
and outputting the high-resolution multi-target scene image after the mean square error.
On the other hand, a multi-target scene image resolution improving device is also provided, which includes:
the image acquisition module is used for acquiring a multi-target scene image with low resolution;
the model calling module is used for calling a super-resolution network model based on the DRN architecture; the super-resolution network model comprises a forward mapping network from low resolution to high resolution and an inverse mapping network from high resolution to low resolution;
the super-generation module is used for inputting the multi-target scene images with low resolution into the trained super-resolution network model and generating the multi-target scene images with high resolution by updating the forward mapping network in the super-resolution network model;
the characteristic extraction module is used for inputting the low-resolution multi-target scene images and the high-resolution multi-target scene images into the truncated VGG19 network for characteristic extraction to obtain characteristic graphs of the low-resolution multi-target scene images and the high-resolution multi-target scene images;
the loss calculation module is used for substituting the characteristic diagram into a loss function of the super-resolution network model to calculate a mean square error;
and the image output module is used for outputting the high-resolution multi-target scene image after the mean square error.
In another aspect, a computer device is further provided, which includes a memory and a processor, where the memory stores a computer program, and the processor implements any one of the steps of the above multi-target scene image resolution enhancement method when executing the computer program.
In another aspect, a computer-readable storage medium is provided, on which a computer program is stored, and when the computer program is executed by a processor, the method for increasing the resolution of a multi-target scene image as described above is implemented.
One of the above technical solutions has the following advantages and beneficial effects:
according to the method, the device, the equipment and the medium for improving the resolution of the multi-target scene image, after the low-resolution multi-target scene image to be reconstructed is obtained, the trained super-resolution network model based on the DRN architecture is called, the low-resolution multi-target scene image is input into the super-resolution network model for processing, and the super-resolution network model generates the high-resolution multi-target scene image (namely, the reconstructed image) through updating the forward mapping. And then, extracting the characteristics of the low-resolution multi-target scene images and the high-resolution multi-target scene images by using a truncated VGG19 network to obtain corresponding characteristic graphs, and substituting the characteristic graphs into a loss function of the model to calculate the mean square error. Therefore, in both the forward mapping and the reverse mapping of the model, the feature extraction plays a role, so that the model pays more attention to the acquisition of the inherent information of the image in the training process, namely, the inherent features of the reconstructed image (the high-resolution multi-target scene image) and the real image (the low-resolution multi-target scene image) are calculated by the loss function, the semantic modification of the reconstructed image to the real image is reduced, the quality of the reconstructed image is improved, the purpose of super-resolution reconstruction of the multi-target scene image is really achieved, the effect of greatly improving the resolution of the generated multi-target scene image is realized, and the fidelity of the generated image is obviously improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments or the conventional technologies of the present application, the drawings used in the descriptions of the embodiments or the conventional technologies will be briefly introduced below, it is obvious that the drawings in the following descriptions are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flowchart illustrating a method for resolution enhancement of multi-target scene images in an embodiment;
FIG. 2 is a diagram of a DRN-based network architecture in one embodiment;
FIG. 3 is a schematic diagram of the steps of closed loop training in one embodiment;
FIG. 4 is a schematic diagram illustrating an exemplary process flow for feature map difference loss based processing;
FIG. 5 is a diagram illustrating sample reconstructed images obtained by different VGG truncation schemes in an embodiment; wherein, (a) is an original forged image, (b) is an image obtained in scheme 1, and (c) is an image obtained in scheme 2;
FIG. 6 is a diagram illustrating details of output results of super-resolution reconstruction methods according to an embodiment; wherein, (a) is an original image, (b) is an image of Bicubic, (c) is an image of SRResNet, (d) is an image of SRGAN, and (e) is an image of a network based on a feature map difference value;
fig. 7 is a schematic block diagram of an embodiment of a multi-target scene image resolution enhancement apparatus.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more clearly understood, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of and not restrictive on the broad application.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
In addition, the technical solutions in the embodiments of the present invention may be combined with each other, but it must be based on the realization of the technical solutions by those skilled in the art, and when the technical solutions are contradictory to each other or cannot be realized, the combination of the technical solutions should be considered to be absent and not to be within the protection scope of the present invention.
In the application, high-quality super-resolution reconstruction of multi-target scene images with insufficient resolution is realized; the multi-target scene images to be reconstructed are, for example and without limitation, pseudo multi-target scene images generated by a conventional SinGAN, or other low-resolution multi-target scene images in the military field. By performing super-resolution reconstruction, the resolution of the generated pseudo multi-target scene image can be improved, making it more deceptive and improving its practical effect in the intelligence and public-opinion fields.
Although the existing super-resolution networks SRGAN and SRResNet can be used to improve image resolution, both models require a large number of image pairs during training; image intelligence of the kind targeted in this application is usually difficult to obtain, and the pseudo multi-target scene images produced by multi-target scene image generation techniques exist at only a single resolution. If a predefined image degradation method is used to downsample a high-resolution image to obtain an image pair, the actual image degradation mode remains unknown, which reduces the robustness of the model.
The Dual Regression Network (DRN) can learn the unknown image degradation and can be trained whether image pairs are complete or missing. Its drawback is that the reconstructed super-resolution images are of low quality: the inventors found that its loss function directly uses the mean square error of the pixel differences between the original image and the reconstructed image, which loses the high-frequency details of the reconstructed image, blurs target contours, and degrades the visual experience. SRGAN, by contrast, uses a truncated VGG19 to extract features from the real image and the reconstructed image and then substitutes the extracted feature maps into the loss function, which largely alleviates the loss of high-frequency image information and improves generation quality. The two models can therefore be fused for reference, reducing the model's demand for data pairs while improving the quality of the reconstructed image. The super-resolution reconstruction technique based on the feature-map difference network rests on this theoretical basis.
When preparing training material, one can look for image pairs that exist in the real environment, i.e. low-resolution and high-resolution versions of the same picture, but this approach is hard to implement because such pairs are difficult to find in some fields, such as the military field. Alternatively, a low-resolution image can be obtained from a real image by downsampling; this is simple to implement, but since the actual image degradation method is unknown, obtaining low-resolution images by a predefined downsampling mode affects the training of the model, and the trained model then tests well only on a small fraction of images and lacks generality.
Referring to fig. 1, in one aspect, the present invention provides a method for increasing a resolution of a multi-target scene image, including the following steps S12 to S22.
S12, acquiring a low-resolution multi-target scene image;
S14, calling a super-resolution network model based on the DRN architecture; the super-resolution network model comprises a low-resolution to high-resolution forward mapping network and a high-resolution to low-resolution inverse mapping network;
S16, inputting the low-resolution multi-target scene image into the trained super-resolution network model, and generating the high-resolution multi-target scene image by updating the forward mapping network in the super-resolution network model;
S18, inputting the low-resolution and high-resolution multi-target scene images into a truncated VGG19 network for feature extraction to obtain feature maps of the low-resolution and high-resolution multi-target scene images;
S20, substituting the feature maps into the loss function of the super-resolution network model to calculate a mean square error;
S22, outputting the high-resolution multi-target scene image after the mean square error calculation.
It can be understood that, to make the method of the present application easier to follow, this embodiment first introduces the network model of the present application. The new model provided by the present application is built on the DRN architecture, and its structure is shown in fig. 2.
As can be seen from FIG. 2, the network model is based on a U-Net structure and is mainly divided into two parts: a forward mapping from low resolution to high resolution and an inverse mapping from high resolution to low resolution. The upsampling and downsampling modules in the network operate on the image as follows. In the forward mapping of the model, the low-resolution image LR input to the network is first enlarged to the size of the high-resolution image HR by bicubic interpolation, and a feature map is extracted by convolution; two stride-2 convolutions then reduce the feature map by a factor of 4, giving a feature map of 1/4 HR size; the reduced feature map is upsampled back by a factor of 4 through pixel shuffle and RCA (residual channel attention) modules in sequence, and the finally obtained high-resolution image is compared with the real high-resolution image so as to update the forward mapping F. This completes the work of the forward mapping network. Here RCAB denotes a residual channel attention block.
In the inverse mapping process, the high-resolution image is reduced by a factor of 4 through the convolution modules to obtain a reduced image, which is processed by the dual network and then compared with the original low-resolution image, thereby updating the inverse mapping R. Each convolution module consists of a conv(stride = 2)-LeakyReLU-conv group; these modules, denoted CB, halve the image size. The RCA module is more complex and comes from the classical RCAN network; based on an attention mechanism, it adaptively adjusts channel features and enhances the representation of the image's intrinsic information.
The series of operations performed in the RCA stage is: further feature extraction through two RCAB blocks; enlargement of the image size through two pixel-shuffle steps; global average pooling of the obtained feature maps to produce a channel descriptor containing coarse information; division of the channels by a certain ratio, i.e. channel downsampling, followed by channel upsampling to obtain a weight coefficient for each channel. The weights are finally multiplied with the original features carried by the residual connection, producing new features with the channel weights redistributed. These new features are added to the original 1/4-size feature map, giving the output of the RCA network. The purpose is to adaptively adjust channel features through a channel attention mechanism and thereby improve the network's ability to represent low-frequency and high-frequency information.
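To make the RCA description concrete, the following is a minimal PyTorch sketch of a residual channel attention block of the kind outlined above; the channel count, the reduction ratio and all module names are illustrative assumptions rather than the exact configuration of the patent.

```python
import torch
import torch.nn as nn

class RCAB(nn.Module):
    """Residual channel attention block (sketch). Convolutional features are
    re-weighted per channel via global average pooling followed by a channel
    "downsample"/"upsample" pair, then added back through a residual path."""
    def __init__(self, channels: int = 64, reduction: int = 16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.attention = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # channel descriptor
            nn.Conv2d(channels, channels // reduction, 1),  # divide channels by a ratio
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),  # restore the channels
            nn.Sigmoid(),                                   # weight coefficient per channel
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = self.body(x)
        return x + feat * self.attention(feat)  # redistributed channel weights + residual
```

An upsampling stage of the forward mapping would then interleave such blocks with `nn.PixelShuffle(2)` steps, each doubling the spatial size, to realise the 4x enlargement described above.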
The network model of the application adopts a closed-loop training mode; the training process is shown in fig. 3. In each training step, training from left to right is the forward mapping F, used for conventional training on paired data; training from right to left is the inverse mapping R, which together with the forward mapping F forms a closed-loop training mode.
In some embodiments, the training process of the super-resolution network model includes:
acquiring paired image pairs and inputting the paired image pairs into a super-resolution network model; the image pair comprises a low resolution image and a high resolution image;
improving the resolution of the low-resolution image through a forward mapping network of a super-resolution network model to obtain a first pseudo high-resolution image with the same size as the high-resolution image;
comparing the first pseudo high-resolution image with the high-resolution image, and improving the similarity of the first pseudo high-resolution image to the high-resolution image;
and updating model parameters of the forward mapping network by a gradient descent method according to the high-resolution image and the first pseudo high-resolution image with the improved similarity degree, and completing pre-training of the super-resolution network model.
Specifically, let $x_{LR}$ denote the original low-resolution image LR, $x'_{LR}$ the pseudo LR, $y_{HR}$ the true high-resolution image HR, and $y'_{HR}$ the pseudo HR. $L_P$ denotes the primal (original) loss, $L_D$ the dual loss, and $L_i$ the loss of step $i$. $F$ denotes the forward mapping function and $\theta_F$ its updatable parameters; $R$ denotes the inverse mapping function and $\theta_R$ its updatable parameters; an apostrophe marks a parameter after updating. The training mode differs for different data, and (1), (2) and (3) in each step of the figure indicate the order in which the mappings are executed. The specific training steps (1) and (2) are as follows:

(1) When training on paired images, the basic super-resolution reconstruction process is followed: the forward mapping $F$ first enhances the low-resolution image $x_{LR}$, i.e. upsamples it to the same size as the original high-resolution image $y_{HR}$, yielding the high-resolution image $y'_{HR}$; $y'_{HR}$ is compared with $y_{HR}$ so as to make them more similar, and gradient descent is then used to adjust the model parameters of the forward mapping $F$, so that the forward mapping learns an upsampling behaviour. The mathematical expression of this step is as follows:

$$y'_{HR} = F(x_{LR}; \theta_F) \qquad (1)$$

$$L_1 = \mathrm{MSE}(y'_{HR}, y_{HR}) \qquad (2)$$

$$\theta'_F = \arg\min_{\theta_F} L_1 \quad \text{(update the forward mapping function } F \text{ by minimising } L_1\text{)}$$

The principle of this step is shown as the first step in fig. 3.
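As an illustration, step (1) can be sketched as a single paired training iteration in PyTorch; `forward_net` (standing for the forward mapping F) and the optimizer setup are assumptions for the sake of the example, not code published with the patent.

```python
import torch
from torch.nn.functional import mse_loss

def pretrain_step(forward_net, optimizer, x_lr, y_hr):
    """One paired pre-training step: update the forward mapping F by
    minimising L1 = MSE(F(x_LR), y_HR), as in equations (1)-(2)."""
    y_hr_fake = forward_net(x_lr)      # pseudo HR, upsampled to the size of y_hr
    loss = mse_loss(y_hr_fake, y_hr)   # L1
    optimizer.zero_grad()
    loss.backward()                    # gradient descent on the parameters of F
    optimizer.step()
    return loss.item()
```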
In some embodiments, the training process of the super-resolution network model further includes:
improving the resolution of the low-resolution image by using a forward mapping network of the pre-trained super-resolution network model to obtain a second pseudo high-resolution image;
reducing the resolution of the second pseudo high-resolution image by adopting an inverse mapping network of the super-resolution network model to obtain a pseudo low-resolution image with the same size as the low-resolution image;
and comparing the pseudo low-resolution image with the low-resolution image, and updating model parameters of an inverse mapping network and a forward mapping network of the super-resolution network model to obtain the trained super-resolution network model.
Specifically: (2) When training on unpaired data, the forward mapping $F$ trained in step (1) is first applied to the original low-resolution image $x_{LR}$ to obtain a high-resolution image $y'_{HR}$. However, since the images are unpaired, there is no reference high-resolution image to compare against and the model parameters cannot be corrected; moreover, the semantics of $y'_{HR}$ differ from those of the original image $x_{LR}$, i.e. the output of the mapping is not constrained to lie within the domain of $y_{HR}$. Next, the inverse mapping $R$ reduces the resolution of the high-resolution image $y'_{HR}$, i.e. downsamples it to the same size as the low-resolution image $x_{LR}$, producing the low-resolution image $x'_{LR}$; $x'_{LR}$ is then compared with the original low-resolution image $x_{LR}$, and the model parameters of both the inverse mapping $R$ and the forward mapping $F$ are adjusted so that $x'_{LR}$ and $x_{LR}$ become more similar. The inverse mapping is thereby turned into a learned downsampling, the parameters of the forward mapping are updated accordingly, and inputting the original low-resolution image $x_{LR}$ into the updated forward mapping yields a "native" high-resolution image $y'_{HR}$ of better quality whose semantics are closer to the original low-resolution image ($y'_{HR}$ is constrained to the domain of $y_{HR}$); the low-resolution image $x_{LR}$ can then form an image pair with $y'_{HR}$. The mathematical expression of this step is as follows:

$$y'_{HR} = F(x_{LR}; \theta_F) \qquad (3)$$

$$x'_{LR} = R(y'_{HR}; \theta_R) \qquad (4)$$

$$L_2 = \mathrm{MSE}(x'_{LR}, x_{LR}) \qquad (5)$$

$$(\theta'_F, \theta'_R) = \arg\min_{\theta_F, \theta_R} L_2 \quad \text{(update the forward mapping } F \text{ and the inverse mapping } R \text{ by minimising } L_2\text{)} \qquad (6)$$

The principle of this step is shown as the second step in fig. 3. The $y'_{HR}$ obtained in step (2) forms an image pair with $x_{LR}$; $y'_{HR}$ is now a high-resolution image with the same semantics as $x_{LR}$ but higher resolution. The forward mapping at this point therefore has the ability to convert a low-resolution image into a high-resolution image.
Step (1) is equivalent to pre-training the model, and the training material may be an easily obtained, paired public data set: because the model is pre-trained in step (1), it only needs to learn the mapping relation from low resolution to high resolution, without learning data from a special field.
Step (2) is equivalent to manufacturing low-resolution data, except that the manufacturing method differs from the conventional one: the conventional method obtains paired data by predefined downsampling, while step (2) obtains "native" low-resolution data by learning the corresponding inverse mapping. In practical application scenarios, step (2) is the most commonly used: a high-resolution image is generated by updating the forward mapping, and it constitutes an image pair with the "native" low-resolution image. The generation process of the high-resolution multi-target scene image in step S16 above can therefore be understood with reference to step (2), and a minimal sketch of such a step is given below.
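A minimal sketch of the closed-loop step (2), under the assumption that `forward_net` and `inverse_net` stand for the mappings F and R, and that `optimizer` covers the parameters of both networks (e.g. built with `torch.optim.Adam` over the chained parameter lists):

```python
from torch.nn.functional import mse_loss

def closed_loop_step(forward_net, inverse_net, optimizer, x_lr):
    """One unpaired training step: y'_HR = F(x_LR), x'_LR = R(y'_HR); both
    F and R are updated by minimising L2 = MSE(x'_LR, x_LR), eqs. (3)-(6)."""
    y_hr_fake = forward_net(x_lr)        # "native" high-resolution candidate
    x_lr_fake = inverse_net(y_hr_fake)   # learned downsampling back to LR size
    loss = mse_loss(x_lr_fake, x_lr)     # compare against the original LR image
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # (x_lr, y_hr_fake) can now serve as an image pair for further training
    return y_hr_fake.detach(), loss.item()
```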
By adopting this closed-loop-mapping training mode, the dilemma of scarce real data can be overcome; this is of particular practical significance for military multi-target scenes, where images are rare and resolution is generally low. Only low-resolution images need to be provided and input to the trained model to obtain the corresponding high-resolution images.
The closed-loop training mode can thus address scarce image data and low native resolution, as in the military field and some public-opinion fields. However, because the loss function of the DRN performs no further feature extraction on the reconstructed image and the real image, it cannot express well the intrinsic features of the reconstructed image and the original image (the real image, or the initially provided low-resolution pseudo image); the reconstructed image therefore has low quality and a poor visual effect, and fails to meet the required super-resolution reconstruction standard.
Loss function based on feature-map difference: regarding the loss function of the super-resolution network model, the processing idea of the loss function of the conventional SRGAN network is adopted and improved upon, so that the loss function computes over the intrinsic features of the reconstructed image and the real image, improving the quality of the reconstructed image and realising super-resolution reconstruction of real multi-target scene images.
Specifically, when a traditional learning-based super-resolution reconstruction model, DRN included, is trained, the mean square error of the pixel differences between the real image and the reconstructed image is commonly used as the loss function to update the model parameters. However, such a loss focuses on comparing pixel information and ignores semantic differences between the reconstructed and real images; high-frequency information such as target edges and boundaries is often lost under such a loss function. If edge information is lost during reconstruction, the boundaries of targets in the reconstructed super-resolution image become blurred, greatly reducing its quality. To make the model pay more attention to the difference between the reconstructed image and the real image during training, the loss function is improved on the basis of the overall DRN framework by drawing on the idea of the SRGAN loss function, so that the intrinsic features of the reconstructed image and the real image enter the computation and the quality of the reconstructed image is improved.
To extract the intrinsic features of the reconstructed image and the real image during training, VGG19, a network specialised in extracting image features, is used for feature extraction on both images, and the extraction results are then substituted into the loss function. It should be noted that the network model of the present application does not employ the entire layer stack of VGG19 but a truncated VGG19: the VGG network is cut off before a certain pooling layer or convolutional layer. The deeper the truncation, the greater the computational overhead and the stronger the feature-extraction capability, so in practice the specific truncation of VGG19 can be chosen according to the limits of computing resources and overhead.
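For illustration, a truncated VGG19 can be sliced from a pretrained torchvision model as below. The slice indices are an assumption based on the standard torchvision layer layout (chosen to match the two truncation schemes used in the later experiments); the patent itself does not specify framework-level indices.

```python
import torch.nn as nn
from torchvision.models import vgg19

def truncated_vgg19(n_layers: int) -> nn.Module:
    """Return the first n_layers of the VGG19 feature stack, frozen so it
    acts purely as a fixed feature extractor."""
    features = vgg19(pretrained=True).features[:n_layers]
    for p in features.parameters():
        p.requires_grad = False
    return features.eval()

# Assumed slice points under the standard torchvision layout:
# scheme 1: first 8 conv layers + first 3 pooling layers -> features[:19]
# scheme 2: first 16 conv layers + all 5 pooling layers  -> features[:37]
phi_scheme1 = truncated_vgg19(19)
phi_scheme2 = truncated_vgg19(37)
```

Newer torchvision releases replace `pretrained=True` with a `weights=` argument; either form yields the ImageNet-pretrained feature stack.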
Fig. 4 is a schematic diagram of the processing flow based on the feature-map difference loss newly proposed in the present application. As can be seen from fig. 4, the reconstructed image (the high-resolution multi-target scene image generated in step S16) and the real image (the low-resolution multi-target scene image of step S12) processed by the DRN network are not substituted directly into the loss function to calculate the mean square error; instead, both are input to the truncated VGG19 to obtain feature maps, and the feature maps of the reconstructed and real images are then substituted into the loss function, the loss functions all being mean square errors. This is the greatest innovation of the network of the present application. In this process, the outputs of both the forward and inverse mappings of the closed-loop training, as well as the real low-resolution and high-resolution images, all pass through the truncated VGG19 for feature extraction.
In some embodiments, using the same variable definitions as before, the loss function of the super-resolution network model may include two parts, $L_P$ and $L_D$. Abstracting the feature-extraction process of the truncated VGG19 network as a function $\phi$, the processing flow is expressed as follows:

$$y'_{HR} = F(x_{LR}; \theta_F) \qquad (7)$$

$$L_P = \mathrm{MSE}(\phi(y'_{HR}), \phi(y_{HR})) \qquad (8)$$

$$\theta'_F = \arg\min_{\theta_F} L_P \quad \text{(update the forward mapping function } F \text{ by minimising } L_P\text{)}$$

$$y'_{HR} = F(x_{LR}; \theta_F) \qquad (9)$$

$$x'_{LR} = R(y'_{HR}; \theta_R) \qquad (10)$$

$$L_D = \mathrm{MSE}(\phi(x'_{LR}), \phi(x_{LR})) \qquad (11)$$

$$(\theta'_F, \theta'_R) = \arg\min_{\theta_F, \theta_R} L_D \quad \text{(update the forward mapping } F \text{ and the inverse mapping } R \text{ by minimising } L_D\text{)}$$

where $L_P$ denotes the primal (original) loss, $L_D$ the dual loss, $\phi$ the function abstracting feature extraction with the truncated VGG19 network, $y_{HR}$ a real high-resolution image, $y'_{HR}$ a pseudo high-resolution image, $x_{LR}$ a true low-resolution image, and $x'_{LR}$ a pseudo low-resolution image.
As can be seen from equations (8) and (11) above, the greatest improvement of the new loss function over the conventional one is that the argument of the loss changes from the mean square error between the pixel points of the two images to the mean square error between their feature maps. Feature extraction operates in both the forward and inverse mappings. This improvement lets the model pay more attention to acquiring the intrinsic information of the image during training, reduces the model's modification of the image's original semantics, and improves the visual effect of the reconstructed image.
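Under the same assumptions as the earlier sketches, the feature-map difference loss of equations (8) and (11) reduces to one helper shared by the primal and dual branches; `phi` is a truncated VGG19 such as the one sketched above:

```python
from torch.nn.functional import mse_loss

def feature_map_loss(phi, img_fake, img_real):
    """MSE between truncated-VGG19 feature maps instead of raw pixels."""
    return mse_loss(phi(img_fake), phi(img_real))

# primal loss: L_P = feature_map_loss(phi, y_hr_fake, y_hr)   (eq. 8)
# dual loss:   L_D = feature_map_loss(phi, x_lr_fake, x_lr)   (eq. 11)
```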
In conclusion, the newly proposed network model can be trained on unpaired data through closed-loop mapping, and the feature-map difference loss function improves super-resolution reconstruction quality, thereby improving the fidelity of the generated images.
According to the multi-target scene image resolution improvement method, after the low-resolution multi-target scene image to be reconstructed is obtained, the trained super-resolution network model based on the DRN architecture is called, and the low-resolution multi-target scene image is input into the model for processing; the super-resolution network model generates the high-resolution multi-target scene image (namely, the reconstructed image) by updating the forward mapping. The truncated VGG19 network then extracts features from the low-resolution and high-resolution multi-target scene images to obtain the corresponding feature maps, which are substituted into the model's loss function to calculate the mean square error. Feature extraction thus operates in both the forward and inverse mappings of the model, making the model pay more attention to acquiring the intrinsic information of the image during training: the loss function is computed over the intrinsic features of the reconstructed image (the high-resolution multi-target scene image) and the real image (the low-resolution multi-target scene image), the semantic modification of the real image by the reconstruction is reduced, and the quality of the reconstructed image is improved. The purpose of super-resolution reconstruction of multi-target scene images is genuinely achieved, the resolution of the generated multi-target scene images is greatly improved, and the fidelity of the generated images is markedly improved.
In one embodiment, in order to more intuitively and fully describe the multi-target scene image resolution enhancement method, the following is an experimental example of the method: the experiment was trained on public data sets and self-constructed data.
The public data sets may include Set5, set14, BSD100, and COCO2014, which collectively contain a total of 82902 paired images, where the low resolution images are obtained by predefined downsampling of the true high resolution images. Military scene image data in a self-built data set are collected from the Internet, and comprise 5653 images of four major categories, namely airplanes, ships, tanks and missiles, and no paired images exist in the data set.
Because paired "low-resolution-high-resolution" datasets for the military field are lacking on the Internet, this experiment adopted the closed-loop training mode, executing steps (1) to (2) in sequence to complete the experiment. During training, the model is first trained with the paired public data sets and then trained on the self-built military scene image data.
The number of layers of the truncated VGG19 can be adjusted during training. To analyse the influence of VGG networks of different depths on the reconstruction effect, two truncation schemes are compared in the experiment: scheme 1 performs feature extraction with the first 8 convolutional layers and the first 3 pooling layers as the truncated VGG19, and scheme 2 performs feature extraction with the first 16 convolutional layers and the first 5 pooling layers as the truncated VGG19. The networks of the two truncation schemes are embedded into the DRN model respectively, and their generation results are compared. The structures of the feature extraction networks adopted by the two schemes are shown in table 1.
TABLE 1
[Table 1, rendered as an image in the original, lists the layer configurations of the feature extraction networks used by the two truncation schemes.]
During training, the batch size of the network is set to 400, the number of epochs is set to 130, and the images are uniformly upscaled to 4 times the resolution of the forged image.
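Collected as a sketch (the dict form and key names are illustrative assumptions; only the values are stated in the text):

```python
train_config = {
    "batch_size": 400,    # batch size stated for the network
    "epochs": 130,        # number of training epochs
    "scale_factor": 4,    # output is 4x the resolution of the forged image
}
```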
To highlight the superiority of the network based on the feature-map difference, the experiment performs comparisons against 4 representative image super-resolution reconstruction methods. Set5, Set14, BSD100, COCO2014 and the self-built military scene image dataset were used to train the conventional SRGAN and SRResNet as well as the new model of the present application.
In the testing stage, to compare model performance, low-resolution forged images generated by Conditional SinGAN were used as test material, and the trained SRGAN and SRResNet, the new network model based on the feature-map difference loss, and the traditional Bicubic method (which requires no training) were tested. After obtaining the test results, the super-resolution reconstruction effects of the 4 methods were compared. Besides qualitatively evaluating and comparing the reconstruction effects through human visual perception, the images were also quantitatively evaluated and compared by peak signal-to-noise ratio (PSNR) and structural similarity (SSIM), both common image quality evaluation indices.
PSNR is defined as follows: given two $m \times n$ images, one the original high-resolution image HR (denoted $I$) and the other the high-resolution image SR obtained after super-resolution processing (denoted $K$), the mean square error between the two images is defined as:

$$\mathrm{MSE} = \frac{1}{mn}\sum_{i=0}^{m-1}\sum_{j=0}^{n-1}\left[I(i,j) - K(i,j)\right]^2 \qquad (12)$$

PSNR is then defined as:

$$\mathrm{PSNR} = 10 \cdot \log_{10}\!\left(\frac{MAX_I^2}{\mathrm{MSE}}\right) \qquad (13)$$

where $MAX_I$ is the maximum possible pixel value of the original high-resolution image; provided that each pixel is represented by $B$ binary bits, then $MAX_I = 2^B - 1$. MSE denotes the mean square error.
The unit of PSNR is dB, and a larger value indicates less distortion. PSNR is the most common and most widely used image quality evaluation index; its calculation is based on the error between corresponding pixel points, i.e. it is an error-sensitive image quality evaluation method. Because PSNR does not take the visual characteristics of the human eye into account (the human eye is more sensitive to contrast differences at lower spatial frequencies, more sensitive to luminance differences than to chroma differences, and its perception of a region is affected by the surrounding regions), its evaluation results often disagree with subjective human judgement. Empirical studies have shown a correspondence between the PSNR value and the quality of the reconstructed image, as shown in Table 2:
TABLE 2
[Table 2, rendered as an image in the original, maps PSNR value ranges to subjective reconstructed-image quality levels.]
Therefore, PSNR can be used to evaluate the quality of super-resolution reconstructed images, reflecting to a certain degree how close the reconstructed image is to the original and how distorted it is.
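A direct NumPy rendering of equations (12)-(13), as one would implement it; the function name and the 8-bit default are assumptions:

```python
import numpy as np

def psnr(hr: np.ndarray, sr: np.ndarray, max_val: float = 255.0) -> float:
    """PSNR in dB between the original HR image and the reconstruction SR."""
    mse = np.mean((hr.astype(np.float64) - sr.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")   # identical images: zero distortion
    return 10.0 * np.log10(max_val ** 2 / mse)
```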
SSIM is defined as follows: given two $m \times n$ images, one the original image $I$ and the other the high-resolution image $K$ obtained by super-resolution processing, let the image luminance comparison function be $l(I, K)$, the contrast comparison function be $c(I, K)$, and the structure comparison function be $s(I, K)$; then:

$$l(I,K) = \frac{2\mu_I \mu_K + C_1}{\mu_I^2 + \mu_K^2 + C_1} \qquad (14)$$

$$c(I,K) = \frac{2\sigma_I \sigma_K + C_2}{\sigma_I^2 + \sigma_K^2 + C_2} \qquad (15)$$

$$s(I,K) = \frac{\sigma_{IK} + C_3}{\sigma_I \sigma_K + C_3} \qquad (16)$$

$$C_1 = (k_1 L)^2, \quad C_2 = (k_2 L)^2 \qquad (17)$$

$$C_3 = C_2 / 2 \qquad (18)$$

where $C_1$, $C_2$ and $C_3$ are constants used for stabilisation, to avoid division by 0 ($L$ being the dynamic range of the pixel values); $\mu_I$ and $\mu_K$ denote the means of $I$ and $K$ respectively; $\sigma_I$ and $\sigma_K$ denote the standard deviations of $I$ and $K$ respectively; and $\sigma_{IK}$ denotes the covariance of $I$ and $K$. SSIM is therefore expressed as a combination of the three:

$$\mathrm{SSIM}(I,K) = \left[l(I,K)\right]^{\alpha}\left[c(I,K)\right]^{\beta}\left[s(I,K)\right]^{\gamma} \qquad (19)$$

When $\alpha$, $\beta$ and $\gamma$ are 1, SSIM is expressed as:

$$\mathrm{SSIM}(I,K) = \frac{(2\mu_I\mu_K + C_1)(2\sigma_{IK} + C_2)}{(\mu_I^2 + \mu_K^2 + C_1)(\sigma_I^2 + \sigma_K^2 + C_2)} \qquad (20)$$

The value range is 0 to 1; when $\mathrm{SSIM} = 1$, the two images are identical.
The basic idea of structural similarity is that images are highly structured: there is strong correlation between adjacent pixels, and this correlation carries the structural information of the objects in the image. The human visual system is accustomed to extracting such structural information when viewing images. The structural similarity measure combines the luminance, contrast and structure indices of the two images, which reflect the structural attributes of the objects in the images, to analyse and compute image quality, emphasising structural rather than pixel information. It can therefore be used to evaluate the quality of multi-target scene images obtained by super-resolution reconstruction.
In summary, PSNR emphasises measuring the similarity between the pixels of two images, while SSIM emphasises measuring the similarity between their structures; combining the two indices when analysing images obtained by super-resolution reconstruction allows the images to be analysed both microscopically and macroscopically.
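For completeness, a global (whole-image, single-window) rendering of equation (20); production SSIM implementations usually aggregate the statistics over local sliding windows instead, and the constants follow the common k1 = 0.01, k2 = 0.03 convention, which the patent does not state:

```python
import numpy as np

def ssim(i: np.ndarray, k: np.ndarray, max_val: float = 255.0) -> float:
    """Single-window SSIM with alpha = beta = gamma = 1 (eq. (20))."""
    c1 = (0.01 * max_val) ** 2
    c2 = (0.03 * max_val) ** 2
    i = i.astype(np.float64)
    k = k.astype(np.float64)
    mu_i, mu_k = i.mean(), k.mean()
    var_i, var_k = i.var(), k.var()
    cov_ik = ((i - mu_i) * (k - mu_k)).mean()
    return ((2 * mu_i * mu_k + c1) * (2 * cov_ik + c2)) / \
           ((mu_i ** 2 + mu_k ** 2 + c1) * (var_i + var_k + c2))
```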
To facilitate comparison of reconstruction effects, both the horizontal and vertical pixel counts of the reconstructed images are uniformly magnified to 4 times those of the low-resolution image. When computing PSNR and SSIM, since no original high-resolution image exists, the image obtained by enlarging the corresponding real image (the top-level input of the Conditional SinGAN) via bicubic interpolation is compared with the reconstructed image for the calculation.
And (3) analysis of experimental results:
the experiments were performed according to the experimental setup described above. FIG. 5 shows an example of the results of tests with different schemes of truncated VGG19 in the new model.
It can be seen from the test samples that, for the same image, the super-resolution reconstruction effect of truncation scheme 2 is better than that of scheme 1. Target edges in some reconstructed images obtained with scheme 1 are unclear, which becomes more obvious when the pixels are magnified. Images obtained with scheme 2 have clear edges and give a good visual experience. In terms of detail reconstruction, when the details of the tank tracks are reconstructed in fig. 5, scheme 2 is clearly superior to scheme 1: the track contour in scheme 1 is blurred and the track lines cannot be seen, whereas in scheme 2 the contour is clear and the track lines are visible. This is because scheme 2 uses a deeper network for feature extraction and extracts richer image information.
From experimental results, the visual effect of the image reconstructed based on the Bicubic method is poor, the phenomena of image blurring, target contour and boundary unsharpness and the like exist, and a hazy visual feeling is presented in the whole image. The method only mechanically uses interpolation to increase the pixel points of the low-resolution image, and the increased pixel points are only an average value of the surrounding pixel points.
The image reconstruction effect of the SRResNet- and SRGAN-based methods is superior to that of the Bicubic-based method: image blurring is reduced and target contours become clearer. SRResNet is a learning-based approach that uses a deep residual network to learn the low-to-high-resolution mapping from a large number of data pairs. SRGAN applies the idea of adversarial generation, and by continuously improving the generator and discriminator during training, the reconstruction results become better and better. Such models have stronger reconstruction performance than interpolation methods based on the averaging idea. The super-resolution reconstruction results of SRResNet and SRGAN differ little in visual perception.
The reconstruction method based on the feature-map difference loss network gives the best image reconstruction effect. Its comparative advantage in overall image clarity is evident: target contours in the image are sharp and boundaries are distinct, and the fineness of the image is clearly superior to the other three methods; although the specific details of the targets cannot be fully restored, the cluttered noise around the targets is removed. This method presents the best visual perception and the highest fidelity among the 4 methods.
To show clearer details, one image was selected in the experiment, and the original image and its reconstruction results are shown in fig. 6 at the size of the reconstructed original image. The displayed details of the reconstructed images make it even easier to see the superiority of the image resolution improvement method based on the feature-map difference network, which is clearly superior to the other methods in target contour sharpness, scene clarity, human visual perception and subjective aesthetics. The other methods all suffer noise interference to different degrees.
The above analyses are qualitative assessments and comparisons based on human perception. To make the evaluation of the experimental results more convincing, PSNR and SSIM were used to quantitatively evaluate the experimental results of the 4 methods; for the evaluation, 50 pictures per method were randomly selected as evaluation samples. The results are shown in table 3; the PSNR and SSIM values are the averages over the 50 randomly selected images obtained by each method. Let the PSNR of the $i$-th image be $P_i$ and its SSIM be $S_i$; the averages are then computed as in equations (21) and (22):

$$\overline{P} = \frac{1}{50}\sum_{i=1}^{50} P_i \qquad (21)$$

$$\overline{S} = \frac{1}{50}\sum_{i=1}^{50} S_i \qquad (22)$$
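The averaging itself is straightforward; a short sketch over the 50 sampled scores per method:

```python
import numpy as np

def average_scores(psnr_values, ssim_values):
    """Average PSNR and SSIM over the 50 evaluation samples (eqs. (21)-(22))."""
    return float(np.mean(psnr_values)), float(np.mean(ssim_values))
```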
TABLE 3
[Table 3, rendered as an image in the original, lists the average PSNR and SSIM of the four methods over the 50 evaluation samples.]
As can be seen from Table 3, among the 4 methods, the reconstruction results of the Bicubic-based method have the lowest PSNR and SSIM, with a PSNR value of only 26.58, which falls in a lower quality interval of Table 2. The PSNR of the above method of the present application is the highest, exceeding 30 and falling in the interval of good image quality; its SSIM value also reaches 0.84, indicating that the reconstructed image has high structural similarity to the original image: the image structures are close and the image semantics are close. In the quantitative evaluation, the score of the new method (i.e. the method described above in the present application) is also the best overall, which is essentially consistent with the results of the human qualitative assessment. The high-resolution images obtained by the new method are therefore superior to those of the other methods in both fineness of detail and structural similarity to the original image.
In summary, the advantages of the new model are specifically as follows:
(1) The ability to produce "natural" data pairs. The new model adopts a DRN-based closed-loop mapping training mechanism and can use a deep network to learn the downsampling mapping and obtain "native" low-resolution images, thereby obtaining an updated upsampling mapping and high-resolution images confined to a fixed domain; the low-resolution image and the high-resolution image are thus paired. This training mechanism requires only low-resolution images, so compared with traditional super-resolution models, which need a large number of artificially manufactured data pairs during training, the new model is better suited to real application scenarios.
(2) Strong super-resolution reconstruction performance. Before the loss function is calculated, the method improves on the traditional practice in which the super-resolution loss directly computes the mean square error between the pixels of the reconstructed image and the original image: a feature extraction network extracts features from both images, and the mean square error is calculated on the extracted feature maps. In this way the model can pay more attention to the intrinsic information of the image during learning rather than only to pixel differences, further improving the model's reconstruction capability.
In addition, experimental results also show that the effect of reconstructing the multi-target scene image generated by the Conditional SinGAN by using the new method is superior to that of the traditional method and the mainstream learning-based method.
It should be understood that although the various steps in the flowcharts of figs. 1-4 are shown in sequence as indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and they may be performed in other orders. Moreover, at least some of the steps in figs. 1-4 may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and their order of execution is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Referring to fig. 7, in an embodiment, an apparatus 100 for improving multi-target scene image resolution is further provided, comprising an image acquiring module 11, a model invoking module 13, a super generating module 15, a feature extracting module 17, a loss calculating module 19, and an image output module 21. The image acquiring module 11 is configured to acquire a low-resolution multi-target scene image. The model invoking module 13 is configured to invoke a super-resolution network model based on the DRN architecture; the super-resolution network model comprises a forward mapping network from low resolution to high resolution and an inverse mapping network from high resolution to low resolution. The super generating module 15 is configured to input the low-resolution multi-target scene image into the trained super-resolution network model and to generate the high-resolution multi-target scene image by updating the forward mapping network in the super-resolution network model.

The feature extracting module 17 is configured to input the low-resolution and high-resolution multi-target scene images into the truncated VGG19 network for feature extraction, so as to obtain the feature maps of the low-resolution and high-resolution multi-target scene images. The loss calculating module 19 is configured to substitute the feature maps into the loss function of the super-resolution network model to calculate the mean square error. The image output module 21 is configured to output the high-resolution multi-target scene image after the mean square error calculation.
After the multi-target scene image resolution improving apparatus 100 obtains, through the cooperation of its modules, the low-resolution multi-target scene image to be reconstructed, it invokes the trained super-resolution network model based on the DRN architecture and inputs the low-resolution multi-target scene image into the model for processing; the super-resolution network model generates the high-resolution multi-target scene image (i.e., the reconstructed image) by updating the forward mapping network. The truncated VGG19 network then extracts features from the low-resolution and high-resolution multi-target scene images to obtain the corresponding feature maps, which are substituted into the loss function of the model to calculate the mean square error. Feature extraction thus participates in both the forward mapping and the inverse mapping of the model, so that the model attends more to the intrinsic information of the images during training: the loss function operates on the intrinsic features of the reconstructed image (the high-resolution multi-target scene image) and the real image (the low-resolution multi-target scene image), which reduces semantic alteration of the real image by the reconstruction, improves the quality of the reconstructed image, genuinely achieves super-resolution reconstruction of multi-target scene images, greatly improves the resolution of the generated multi-target scene images, and significantly improves the fidelity of the generated images.
In one embodiment, the training process of the super-resolution network model comprises the following steps:
acquiring paired images and inputting them into the super-resolution network model; each image pair comprises a low-resolution image and a high-resolution image;
improving the resolution of the low-resolution image through a forward mapping network of a super-resolution network model to obtain a first pseudo high-resolution image with the same size as the high-resolution image;
comparing the first pseudo high-resolution image with the high-resolution image to improve the similarity of the first pseudo high-resolution image to the high-resolution image;
and updating model parameters of the forward mapping network by gradient descent according to the high-resolution image and the first pseudo high-resolution image with improved similarity, thereby completing pre-training of the super-resolution network model (a sketch of this stage follows these steps).
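The pre-training stage above might look like the following sketch, reusing the TruncatedVGG19Loss module sketched earlier. The optimizer choice, learning rate, and epoch count are assumptions, and `forward_net` and `loader` are hypothetical names for the forward mapping network and the paired-image data loader.

```python
# Hedged sketch of the pre-training stage: update only the forward (LR -> HR)
# mapping against real HR images. All hyperparameters are assumptions.
import torch

def pretrain_forward(forward_net, loader, perceptual_loss,
                     epochs=10, lr=1e-4):
    opt = torch.optim.Adam(forward_net.parameters(), lr=lr)
    for _ in range(epochs):
        for lr_img, hr_img in loader:
            pseudo_hr = forward_net(lr_img)            # first pseudo HR image
            loss = perceptual_loss(pseudo_hr, hr_img)  # compare with real HR
            opt.zero_grad()
            loss.backward()                            # gradient descent update
            opt.step()
    return forward_net
```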
In one embodiment, the training process of the super-resolution network model further includes:
improving the resolution of the low-resolution image by using a pre-trained forward mapping network of the super-resolution network model to obtain a second pseudo high-resolution image;
reducing the resolution of the second pseudo high-resolution image by adopting an inverse mapping network of a super-resolution network model to obtain a pseudo low-resolution image with the same size as the low-resolution image;
and comparing the pseudo low-resolution image with the low-resolution image, and updating the model parameters of the inverse mapping network and the forward mapping network of the super-resolution network model to obtain the trained super-resolution network model (a sketch of this stage follows these steps).
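The closed-loop stage might then be sketched as follows, under the same assumptions as the previous sketch; the coefficient `lam` balancing the primal and dual terms is an assumption and is not fixed by this embodiment.

```python
# Hedged sketch of the closed-loop (dual regression) stage: the pre-trained
# forward net produces a second pseudo HR image, the inverse net maps it back
# to a pseudo LR image, and both mappings are updated together.
import torch

def train_closed_loop(forward_net, inverse_net, loader, perceptual_loss,
                      epochs=10, lr=1e-4, lam=0.1):
    params = list(forward_net.parameters()) + list(inverse_net.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        for lr_img, hr_img in loader:
            pseudo_hr = forward_net(lr_img)              # second pseudo HR image
            pseudo_lr = inverse_net(pseudo_hr)           # back to LR size
            primal = perceptual_loss(pseudo_hr, hr_img)  # forward-mapping term
            dual = perceptual_loss(pseudo_lr, lr_img)    # inverse-mapping term
            loss = primal + lam * dual
            opt.zero_grad()
            loss.backward()
            opt.step()
    return forward_net, inverse_net
```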
In one embodiment, the loss function of the super-resolution network model comprises a primal loss $L_P$ and a dual loss $L_D$:

$$L_P = \mathrm{MSE}\big(\varphi(y),\ \varphi(\hat{y})\big)$$

$$L_D = \mathrm{MSE}\big(\varphi(x),\ \varphi(\hat{x})\big)$$

wherein $L_P$ represents the original loss, $L_D$ represents the dual loss, $\varphi(\cdot)$ represents the function abstracting the feature-extraction process of the truncated VGG19 network, $y$ represents the real high-resolution image, $\hat{y}$ represents the pseudo high-resolution image, $x$ represents the real low-resolution image, and $\hat{x}$ represents the pseudo low-resolution image.
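These two terms are typically combined into a single training objective. The form below is an assumption rather than something fixed by this embodiment; DRN-style dual regression commonly weights the dual term with a small coefficient such as $\lambda = 0.1$ so that the primal term dominates:

$$L_{\mathrm{total}} = L_P + \lambda\, L_D$$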
For specific limitations of the multi-target scene image resolution improving apparatus 100, reference may be made to the corresponding limitations of the foregoing multi-target scene image resolution improving method, which are not repeated here. All or part of the modules in the multi-target scene image resolution improving apparatus 100 may be implemented by software, by hardware, or by a combination thereof. The modules may be embedded, in hardware form, in or independent of a processor of a device having the corresponding data processing functions, or may be stored, in software form, in a memory of the device, so that the processor can invoke and execute the operations corresponding to the modules; the device may be, but is not limited to, any of the various computer devices known in the art.
In still another aspect, a computer device is provided, comprising a memory and a processor, the memory storing a computer program; when the processor executes the computer program, the following steps are implemented: acquiring a multi-target scene image with low resolution; calling a super-resolution network model based on a DRN architecture, the super-resolution network model comprising a forward mapping network from low resolution to high resolution and an inverse mapping network from high resolution to low resolution; inputting the multi-target scene image with low resolution into the trained super-resolution network model, and generating the multi-target scene image with high resolution by updating the forward mapping network in the super-resolution network model; inputting the low-resolution and high-resolution multi-target scene images into a truncated VGG19 network for feature extraction to obtain feature maps of the low-resolution and high-resolution multi-target scene images; substituting the feature maps into a loss function of the super-resolution network model to calculate a mean square error; and outputting the high-resolution multi-target scene image after the mean square error calculation.
In an embodiment, the processor, when executing the computer program, may further implement the additional steps or sub-steps in the embodiments of the multi-target scene image resolution enhancement method.
In yet another aspect, there is also provided a computer-readable storage medium having a computer program stored thereon; when the computer program is executed by a processor, the following steps are implemented: acquiring a multi-target scene image with low resolution; calling a super-resolution network model based on a DRN architecture, the super-resolution network model comprising a forward mapping network from low resolution to high resolution and an inverse mapping network from high resolution to low resolution; inputting the multi-target scene image with low resolution into the trained super-resolution network model, and generating the multi-target scene image with high resolution by updating the forward mapping network in the super-resolution network model; inputting the low-resolution and high-resolution multi-target scene images into a truncated VGG19 network for feature extraction to obtain feature maps of the low-resolution and high-resolution multi-target scene images; substituting the feature maps into a loss function of the super-resolution network model to calculate a mean square error; and outputting the high-resolution multi-target scene image after the mean square error calculation.
In one embodiment, when being executed by a processor, the computer program may further implement the additional steps or sub-steps in the embodiments of the multi-target scene image resolution enhancement method.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program instructing related hardware; the computer program may be stored in a non-volatile computer-readable storage medium, and when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), SyncLink DRAM (SLDRAM), Rambus DRAM (RDRAM), and direct Rambus DRAM (DRDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as such combinations of technical features are not contradictory, they should be considered to fall within the scope of this specification.
The above examples express only several embodiments of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that, for those of ordinary skill in the art, various changes and improvements can be made without departing from the concept of the present application, and all of these fall within the protection scope of the present application. Therefore, the protection scope of this patent should be subject to the appended claims.

Claims (10)

1. A multi-target scene image resolution improving method is characterized by comprising the following steps:
acquiring a multi-target scene image with low resolution;
calling a super-resolution network model based on a DRN architecture; the super-resolution network model comprises a low-resolution to high-resolution forward mapping network and a high-resolution to low-resolution reverse mapping network;
inputting the multi-target scene image with low resolution into the trained super-resolution network model, and generating the multi-target scene image with high resolution by updating a forward mapping network in the super-resolution network model;
inputting the low-resolution multi-target scene images and the high-resolution multi-target scene images into a truncated VGG19 network for feature extraction to obtain feature maps of the low-resolution multi-target scene images and the high-resolution multi-target scene images;
substituting the characteristic diagram into a loss function of the super-resolution network model to calculate a mean square error;
and outputting the multi-target scene image with high resolution after the mean square error calculation.
2. The multi-target scene image resolution enhancement method according to claim 1, wherein the training process of the super-resolution network model comprises:
acquiring paired images and inputting them into the super-resolution network model; each image pair comprises a low-resolution image and a high-resolution image;
improving the resolution of the low-resolution image through a forward mapping network of the super-resolution network model to obtain a first pseudo high-resolution image with the same size as the high-resolution image;
comparing the first pseudo high-resolution image with the high-resolution image to improve the similarity degree of the first pseudo high-resolution image to the high-resolution image;
and updating model parameters of the forward mapping network by a gradient descent method according to the high-resolution image and the first pseudo high-resolution image with the improved similarity degree, and completing pre-training of the super-resolution network model.
3. The multi-target scene image resolution enhancement method according to claim 2, wherein the training process of the super-resolution network model further comprises:
improving the resolution of the low-resolution image by using a pre-trained forward mapping network of the super-resolution network model to obtain a second pseudo high-resolution image;
reducing the resolution of the second pseudo high-resolution image by adopting an inverse mapping network of the super-resolution network model to obtain a pseudo low-resolution image with the same size as the low-resolution image;
and comparing the pseudo low-resolution image with the low-resolution image, and updating model parameters of an inverse mapping network and a forward mapping network of the super-resolution network model to obtain the trained super-resolution network model.
4. The multi-target scene image resolution enhancement method according to any one of claims 1 to 3, wherein the loss function of the super-resolution network model comprises a primal loss $L_P$ and a dual loss $L_D$:

$$L_P = \mathrm{MSE}\big(\varphi(y),\ \varphi(\hat{y})\big)$$

$$L_D = \mathrm{MSE}\big(\varphi(x),\ \varphi(\hat{x})\big)$$

wherein $L_P$ represents the original loss, $L_D$ represents the dual loss, $\varphi(\cdot)$ represents the function abstracting the feature-extraction process of the truncated VGG19 network, $y$ represents the real high-resolution image, $\hat{y}$ represents the pseudo high-resolution image, $x$ represents the real low-resolution image, and $\hat{x}$ represents the pseudo low-resolution image.
5. A multi-target scene image resolution improving device is characterized by comprising:
the image acquisition module is used for acquiring a multi-target scene image with low resolution;
the model calling module is used for calling a super-resolution network model based on the DRN architecture; the super-resolution network model comprises a low-resolution to high-resolution forward mapping network and a high-resolution to low-resolution reverse mapping network;
the super-generation module is used for inputting the multi-target scene images with low resolution into the trained super-resolution network model and generating the multi-target scene images with high resolution by updating a forward mapping network in the super-resolution network model;
the feature extraction module is used for inputting the low-resolution multi-target scene images and the high-resolution multi-target scene images into a truncated VGG19 network for feature extraction to obtain feature maps of the low-resolution multi-target scene images and the high-resolution multi-target scene images;
the loss calculation module is used for substituting the characteristic diagram into a loss function of the super-resolution network model to calculate a mean square error;
and the image output module is used for outputting the multi-target scene image with high resolution after the mean square error calculation.
6. The multi-target scene image resolution enhancement device according to claim 5, wherein the training process of the super-resolution network model comprises:
acquiring paired images and inputting them into the super-resolution network model; each image pair comprises a low-resolution image and a high-resolution image;
improving the resolution of the low-resolution image through a forward mapping network of the super-resolution network model to obtain a first pseudo high-resolution image with the same size as the high-resolution image;
comparing the first pseudo high-resolution image with the high-resolution image to improve the similarity degree of the first pseudo high-resolution image to the high-resolution image;
and updating model parameters of the forward mapping network by a gradient descent method according to the high-resolution image and the first pseudo high-resolution image with the improved similarity degree, and completing pre-training of the super-resolution network model.
7. The multi-target scene image resolution enhancement apparatus according to claim 6, wherein the training process of the super-resolution network model further comprises:
improving the resolution of the low-resolution image by using a pre-trained forward mapping network of the super-resolution network model to obtain a second pseudo high-resolution image;
reducing the resolution of the second pseudo high-resolution image by adopting an inverse mapping network of the super-resolution network model to obtain a pseudo low-resolution image with the same size as the low-resolution image;
and comparing the pseudo low-resolution image with the low-resolution image, and updating model parameters of an inverse mapping network and a forward mapping network of the super-resolution network model to obtain the trained super-resolution network model.
8. The multi-target scene image resolution enhancement device according to any one of claims 5 to 7, wherein the loss function of the super-resolution network model comprises a primal loss $L_P$ and a dual loss $L_D$:

$$L_P = \mathrm{MSE}\big(\varphi(y),\ \varphi(\hat{y})\big)$$

$$L_D = \mathrm{MSE}\big(\varphi(x),\ \varphi(\hat{x})\big)$$

wherein $L_P$ represents the original loss, $L_D$ represents the dual loss, $\varphi(\cdot)$ represents the function abstracting the feature-extraction process of the truncated VGG19 network, $y$ represents the real high-resolution image, $\hat{y}$ represents the pseudo high-resolution image, $x$ represents the real low-resolution image, and $\hat{x}$ represents the pseudo low-resolution image.
9. A computer device comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of the multi-target scene image resolution enhancement method according to any one of claims 1 to 4 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when being executed by a processor, implements the steps of the multi-target scene image resolution enhancement method according to any one of claims 1 to 4.
CN202211092795.4A 2022-09-08 2022-09-08 Multi-target scene image resolution improving method, device, equipment and medium Pending CN115170399A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211092795.4A CN115170399A (en) 2022-09-08 2022-09-08 Multi-target scene image resolution improving method, device, equipment and medium


Publications (1)

Publication Number Publication Date
CN115170399A true CN115170399A (en) 2022-10-11

Family

ID=83480984

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211092795.4A Pending CN115170399A (en) 2022-09-08 2022-09-08 Multi-target scene image resolution improving method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN115170399A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180075581A1 (en) * 2016-09-15 2018-03-15 Twitter, Inc. Super resolution using a generative adversarial network
US20200357096A1 (en) * 2018-01-25 2020-11-12 King Abdullah University Of Science And Technology Deep-learning based structure reconstruction method and apparatus
CN111583109A (en) * 2020-04-23 2020-08-25 华南理工大学 Image super-resolution method based on generation countermeasure network
CN113643183A (en) * 2021-10-14 2021-11-12 湖南大学 Non-matching remote sensing image weak supervised learning super-resolution reconstruction method and system
CN114862679A (en) * 2022-05-09 2022-08-05 南京航空航天大学 Single-image super-resolution reconstruction method based on residual error generation countermeasure network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
AI科技大本营: "用于单图像超分辨率的对偶回归网络,达到最新SOTA | CVPR 2020", 《HTTPS://BLOG.CSDN.NET/DQCFKYQDXYM3F8RB0/ARTICLE/DETAILS/105259338》 *
机器学习AI算法工程: "图像超分辨率重建算法,让模糊图像变清晰(附数据和代码)", 《HTTPS://CLOUD.TENCENT.COM/DEVELOPER/ARTICLE/1771908》 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20221011