CN115861343B - Arbitrary scale image representation method and system based on dynamic implicit image function - Google Patents


Info

Publication number
CN115861343B
Authority
CN
China
Prior art keywords: coordinate, slice, image, processing, feature map
Prior art date
Legal status
Active
Application number
CN202211590183.8A
Other languages
Chinese (zh)
Other versions
CN115861343A (en)
Inventor
金枝 (Zhi Jin)
何宗耀 (Zongyao He)
Current Assignee
Sun Yat Sen University
Sun Yat Sen University Shenzhen Campus
Original Assignee
Sun Yat Sen University
Sun Yat Sen University Shenzhen Campus
Priority date
Filing date
Publication date
Application filed by Sun Yat Sen University, Sun Yat Sen University Shenzhen Campus filed Critical Sun Yat Sen University
Priority to CN202211590183.8A
Publication of CN115861343A
Application granted
Publication of CN115861343B


Landscapes

  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method and a system for arbitrary-scale image representation based on a dynamic implicit image function. The method comprises: acquiring an image to be processed; performing implicit encoding on the image through a pre-trained encoder to obtain a two-dimensional feature map; and inputting the two-dimensional feature map into a dynamic implicit image network, performing dynamic coordinate slicing on the feature map, and performing pixel value prediction through a two-stage multilayer perceptron to obtain image pixel values. Embodiments of the invention reduce the computational cost of continuous image representation, improve processing performance, and can be widely applied in the technical field of artificial intelligence.

Description

Arbitrary scale image representation method and system based on dynamic implicit image function
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a method and a system for arbitrary-scale image representation based on a dynamic implicit image function.
Background
Digital images are two-dimensional representations of the real world in the digital domain: the continuous physical world is quantized by a sensor and stored in a computer as a discrete matrix of pixels. If images could instead be expressed in a continuous form, an image of any resolution could be obtained from continuous space, preserving the fidelity of the depicted scene. Although continuous image representation methods in the related art perform well, their computational cost grows quadratically with the image magnification, making arbitrary-scale super-resolution reconstruction extremely time-consuming. In view of the foregoing, the technical problems in the related art need to be solved.
Disclosure of Invention
In view of this, embodiments of the invention provide a method and a system for arbitrary-scale image representation based on a dynamic implicit image function, so as to reduce computational cost and improve processing performance.
In one aspect, the invention provides a method for arbitrary-scale image representation based on a dynamic implicit image function, including:
acquiring an image to be processed;
Performing implicit coding processing on the image to be processed through a pre-trained encoder to obtain a two-dimensional feature map;
inputting the two-dimensional feature map into a dynamic implicit image network, performing dynamic coordinate slicing on the two-dimensional feature map, and performing pixel value prediction through a two-stage multilayer perceptron to obtain image pixel values.
Optionally, the dynamic coordinate slicing of the two-dimensional feature map includes:
inputting the image magnification;
acquiring a feature vector from the two-dimensional feature map, determining the feature vector as a hidden code, and grouping coordinates in the two-dimensional feature map according to the hidden code to obtain a feature coordinate set;
and slicing the feature coordinate set according to the image magnification to obtain a coordinate slice.
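As a concrete illustration of the grouping and slicing steps above, the following Python sketch groups the high-resolution coordinates of an h×w feature map by their nearest hidden code and then cuts each group into coordinate slices. The helper names are ours, and the assumption that each hidden code owns an r×r block of high-resolution coordinates at magnification r is an interpretation of the description, not a specification from the patent.

```python
# Sketch of coordinate grouping and slicing (hypothetical helper names).
# At magnification r, each latent code in the h x w feature map is taken
# to own an r x r group of high-resolution coordinates; each group is
# then cut into coordinate slices.

def coordinate_groups(h, w, r):
    """Map every HR pixel coordinate to its nearest latent code (i, j).

    Returns a dict {(i, j): [list of HR (y, x) coordinates]}.
    """
    groups = {}
    for y in range(h * r):
        for x in range(w * r):
            key = (y // r, x // r)  # index of the owning latent code
            groups.setdefault(key, []).append((y, x))
    return groups

def slice_group(coords, interval):
    """Cut one coordinate group into slices of length `interval`;
    all coordinates in a slice will share the same hidden code."""
    return [coords[i:i + interval] for i in range(0, len(coords), interval)]

groups = coordinate_groups(h=2, w=2, r=4)          # 4x upscaling
assert len(groups) == 4                            # one group per latent code
assert all(len(g) == 16 for g in groups.values())  # r*r coords per group
slices = slice_group(groups[(0, 0)], interval=4)   # interval = r (linear order)
assert len(slices) == 4 and len(slices[0]) == 4
```

With interval = r (here 4), the decoder is invoked once per slice instead of once per coordinate, which is the source of the cost savings described above.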
Optionally, slicing the feature coordinate set according to the image magnification to obtain a coordinate slice includes:
determining a slice interval according to the image magnification;
and dividing the feature coordinate set according to the slice interval to obtain coordinate slices, where all coordinates in a slice share the same hidden code.
Optionally, the pixel value prediction by the two-stage multilayer perceptron includes:
inputting a coordinate slice and a slice hidden code;
performing first-stage processing on the coordinate slice and the slice hidden code to obtain a slice hidden vector;
obtaining a coordinate to be predicted, where the coordinate to be predicted is any coordinate in the coordinate slice;
and performing second-stage processing on the slice hidden vector according to the coordinate to be predicted to obtain the pixel value of the coordinate to be predicted.
Optionally, the two-stage multilayer perceptron comprises hidden layers each consisting of a linear layer followed by an activation function.
Optionally, before the pre-trained encoder performs implicit encoding on the image to be processed to obtain the two-dimensional feature map, the method further includes pre-training the encoder and the dynamic implicit image network, specifically including:
acquiring a training image;
performing pixel prediction on the training image through the encoder and the dynamic implicit image network to obtain predicted pixel values;
determining a pixel loss value according to the pixel values of the training image and the predicted pixel values;
and updating the weight parameters of the encoder and the dynamic implicit image network according to the pixel loss value to obtain the trained encoder and dynamic implicit image network.
In another aspect, an embodiment of the invention further provides a system, including:
a first module for acquiring an image to be processed;
a second module for performing implicit encoding on the image to be processed through a pre-trained encoder to obtain a two-dimensional feature map;
and a third module for inputting the two-dimensional feature map into a dynamic implicit image network, performing dynamic coordinate slicing on the two-dimensional feature map, and performing pixel value prediction through a two-stage multilayer perceptron to obtain image pixel values.
Optionally, the third module includes:
a first sub-module for performing dynamic coordinate slicing on the two-dimensional feature map;
and a second sub-module for performing pixel value prediction through the two-stage multilayer perceptron.
Optionally, the first sub-module includes:
a first unit for inputting an image magnification;
a second unit for acquiring a feature vector from the two-dimensional feature map, determining the feature vector as a hidden code, and grouping coordinates in the two-dimensional feature map according to the hidden code to obtain a feature coordinate set;
and a third unit for slicing the feature coordinate set according to the image magnification to obtain a coordinate slice.
Optionally, the second sub-module includes:
a fourth unit for inputting a coordinate slice and a slice hidden code;
a fifth unit for performing first-stage processing on the coordinate slice and the slice hidden code to obtain a slice hidden vector;
a sixth unit for obtaining a coordinate to be predicted, where the coordinate to be predicted is any coordinate in the coordinate slice;
and a seventh unit for performing second-stage processing on the slice hidden vector according to the coordinate to be predicted to obtain the pixel value of the coordinate to be predicted.
Compared with the prior art, the technical solution provided by the invention has the following technical effects. By inputting the two-dimensional feature map into a dynamic implicit image network and performing dynamic coordinate slicing on it, the neural network can execute a many-to-many mapping from a coordinate slice to a pixel value slice, so the decoder needs to use a hidden code only once to predict all pixel values corresponding to the coordinate slice, which reduces computational cost. Moreover, predicting pixel values with the two-stage multilayer perceptron allows the decoder to take a non-fixed number of coordinates as input, which reduces the number of hidden layers and improves processing performance.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for describing the embodiments are briefly introduced below. It is apparent that the drawings described below show only some embodiments of the present application; other drawings may be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a flow chart of an arbitrary scale image representation method based on a dynamic implicit image function provided by an embodiment of the present application;
FIG. 2 is an overall frame diagram of a dynamic implicit image function provided by an embodiment of the present application;
FIG. 3 is a diagram illustrating an example of a coordinate slice provided by an embodiment of the present application;
Fig. 4 is a block diagram of a two-stage multilayer perceptron according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The embodiments of the present application provide a method and a system for arbitrary-scale image representation based on a dynamic implicit image function, which mainly relate to artificial intelligence technology. Artificial Intelligence (AI) is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. Artificial intelligence is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic AI technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics; AI software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Specifically, the method and system for arbitrary-scale image representation based on a dynamic implicit image function provided by the embodiments of the present application may employ computer vision and machine learning/deep learning techniques from the artificial intelligence field to analyze and process images and obtain a continuous image representation. It can be understood that, for different tasks, the method provided in the embodiments of the present application may be executed in the application scenario of the corresponding artificial intelligence system; moreover, it may be executed at any point in the operation flow of that system.
Implicit neural representation: compared with explicit representations, implicit neural representations can capture the details of an object with a small number of parameters, and their differentiable nature allows back-propagation through a neural rendering model. However, when applied to two-dimensional vision tasks, implicit neural representations typically require an independent prediction for each pixel, incurring significant computational cost and long run times.
The Local Implicit Image Function (LIIF) is a novel implicit representation of an image that uses a multilayer perceptron to infer the pixel value at each coordinate.
In the related art, although LIIF provides stable performance in arbitrary-scale super-resolution tasks of up to 30x, its computational cost grows rapidly as the magnification increases.
In view of this, referring to fig. 1, an embodiment of the present invention provides a method for arbitrary-scale image representation based on a dynamic implicit image function, including:
s101, acquiring an image to be processed;
S102, performing implicit coding processing on the image to be processed through a pre-trained encoder to obtain a two-dimensional feature map;
S103, inputting the two-dimensional feature map into a dynamic implicit image network, performing dynamic coordinate slicing on the two-dimensional feature map, and performing pixel value prediction through a two-stage multilayer perceptron to obtain image pixel values.
In the embodiment of the invention, a Dynamic Implicit Image Function (DIIF) is provided, which is a fast and effective arbitrary-scale image representation method. Referring to fig. 2, I_in denotes the input image, and an encoder maps the input image to a two-dimensional feature map as its DIIF representation. Given the resolution of the real image, the hidden code z* and the coordinate slice around it, X* = [x_1st, …, x_last], can be obtained from the two-dimensional feature map, where x_1st denotes the first coordinate of the coordinate slice and x_last denotes the last coordinate. The decoding function then uses this information to predict all pixel values of the coordinate slice; that is, pixel value prediction is performed by the two-stage multilayer perceptron (also called the coarse-to-fine multilayer perceptron): the slice hidden vector H* is predicted by the first (coarse) stage and is then fed, together with a coordinate to be predicted x_i, into the second (fine) stage, which outputs the pixel value I_out-i at that coordinate. In the training stage, the embodiment of the invention computes a loss function using the predicted pixel value I_out-i and the real image's pixel value I_gt-i; the encoder and the decoding function are jointly trained in a self-supervised super-resolution task, and the learned network parameters are shared by all images. By using image coordinate grouping and slicing strategies, embodiments of the invention enable the neural network to perform a many-to-many mapping from coordinate slices to pixel value slices, instead of predicting the pixel value of one given coordinate at a time.
The embodiment of the invention further provides a two-stage multilayer perceptron (Coarse-to-Fine Multilayer Perceptron, C2F-MLP) that performs image decoding based on a dynamic coordinate slicing strategy, so that the number of coordinates in each slice changes with the magnification; DIIF with dynamic coordinate slicing can thus markedly reduce the computational cost of large-scale super-resolution. Experimental results show that, compared with existing arbitrary-scale super-resolution methods, DIIF achieves the best computational efficiency and super-resolution performance.
Further as a preferred embodiment, the dynamic coordinate slicing of the two-dimensional feature map includes:
inputting the image magnification;
acquiring a feature vector from the two-dimensional feature map, determining the feature vector as a hidden code, and grouping coordinates in the two-dimensional feature map according to the hidden code to obtain a feature coordinate set;
and slicing the feature coordinate set according to the image magnification to obtain a coordinate slice.
In the embodiment of the invention, a vector is selected from the two-dimensional feature map as the hidden code, and the coordinates in the two-dimensional feature map that are closer to this hidden code than to any other hidden code are grouped together, yielding a feature coordinate set. The hidden code is shared within one coordinate set, so the decoder can use the hidden code only once to predict all pixel values corresponding to that set. The number of coordinates in a coordinate set grows with the magnification, so the larger the magnification, the more computational cost is saved. However, coordinate grouping requires the decoder to predict all pixel values of the group at the same time, which places a heavy burden on the decoder for large-scale super-resolution. The embodiment of the invention therefore slices the feature coordinate set according to the image magnification: one coordinate set is divided into several coordinate slices, and the hidden code input is shared only within a coordinate slice rather than across the whole coordinate set.
Further as a preferred embodiment, slicing the feature coordinate set according to the image magnification to obtain a coordinate slice includes:
determining a slice interval according to the image magnification;
and dividing the feature coordinate set according to the slice interval to obtain coordinate slices, where all coordinates in a slice share the same hidden code.
The simplest way to set the slice interval is fixed coordinate slicing, which uses the same slice interval regardless of magnification. With this strategy, however, the computational cost still grows quadratically as the magnification increases, and it also suffers from two major problems: spatial discontinuities and redundant coordinates within a coordinate slice. To address these problems, embodiments of the invention propose dynamic coordinate slicing, which adjusts the slice interval as the magnification changes. The first strategy that may be adopted is linear-order coordinate slicing, which sets the slice interval to the magnification; the computational cost of DIIF then increases linearly with the magnification. Another strategy is to set the slice interval to the square of the magnification, called constant-order coordinate slicing; the computational cost of DIIF is then determined only by the resolution of the input image and remains unchanged as the magnification increases. In the embodiment of the invention, the feature coordinate set is divided according to the slice interval to obtain coordinate slices, and all coordinates in a slice share the same hidden code. Referring to fig. 3, fig. 3 shows a 4x-magnification coordinate group sliced with a slice interval of 4, where z* denotes the hidden code, x_1st denotes the first coordinate of a coordinate slice, and x_last denotes the last coordinate.
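The three slice-interval strategies described above can be compared with a small Python sketch. The strategy names follow the text, but the helper functions and the assumption that each hidden code owns r² high-resolution coordinates at magnification r (for a 2D image) are ours:

```python
def slice_interval(r, strategy):
    """Slice interval for magnification r under the three strategies
    discussed above (an illustrative sketch, not the patent's code)."""
    if strategy == "fixed":
        return 4          # any constant; cost still grows quadratically
    if strategy == "linear":
        return r          # linear-order: cost grows linearly with r
    if strategy == "constant":
        return r * r      # constant-order: cost set by input resolution
    raise ValueError(strategy)

def num_slices(h, w, r, strategy):
    """Number of slice-level decoder calls for an h x w feature map,
    assuming each hidden code owns an r*r coordinate group."""
    group_size = r * r
    interval = slice_interval(r, strategy)
    per_group = -(-group_size // interval)   # ceiling division
    return h * w * per_group

# For a single hidden code (h = w = 1):
assert num_slices(1, 1, 4, "constant") == 1   # one call, any magnification
assert num_slices(1, 1, 4, "linear") == 4     # grows linearly in r
assert num_slices(1, 1, 8, "linear") == 8
assert num_slices(1, 1, 8, "fixed") == 16     # grows quadratically in r
```

The assertions make the scaling behavior concrete: under constant-order slicing the number of decoder calls per hidden code stays at one, while fixed slicing inherits the quadratic growth.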
Further as a preferred embodiment, the pixel value prediction by the two-stage multilayer perceptron includes:
inputting a coordinate slice and a slice hidden code;
performing first-stage processing on the coordinate slice and the slice hidden code to obtain a slice hidden vector;
obtaining a coordinate to be predicted, where the coordinate to be predicted is any coordinate in the coordinate slice;
and performing second-stage processing on the slice hidden vector according to the coordinate to be predicted to obtain the pixel value of the coordinate to be predicted.
To execute the dynamic coordinate slicing strategy, the decoder needs the scalability to take a non-fixed number of coordinates as input and output the corresponding pixel values. However, an ordinary MLP only accepts a fixed-length vector as input. To solve this problem, the embodiment of the invention proposes a two-stage multilayer perceptron (C2F-MLP) as the decoder, divided into a first (coarse) stage for predicting slice hidden vectors and a second (fine) stage for predicting pixel values. In the embodiment of the invention, the hidden layers of the coarse stage take the boundary coordinates of the coordinate slice and the corresponding hidden code as input and generate the slice hidden vector. The slice hidden vector contains the information of all pixel values in the slice and serves as input to the fine stage. The computational cost of the coarse stage is determined by the number of coordinate slices, which, thanks to the dynamic coordinate slicing strategy, is much smaller than the number of output coordinates. The coarse stage also allows the decoding function to exploit spatial relationships within the slice, making its pixel value predictions more accurate. The hidden layers of the fine stage take the slice hidden vector output by the coarse stage and any coordinate in the given coordinate slice as input to predict the pixel value at that coordinate; the fine stage is designed to predict the pixel value at each coordinate independently. The decoding function can be expressed as:
I(X*) = f_θ(z*, [x_tl − v*, …, x_rb − v*]),
where I is the pixel value, X* = [x_tl, …, x_rb] is the given coordinate slice, f_θ is the decoder, z* is the hidden code corresponding to the coordinate slice, v* is the coordinate of the hidden code, and x_tl and x_rb are the first and last coordinates of the coordinate slice, respectively.
Since the slice hidden vector is shorter than the hidden code and the fine stage has fewer hidden layers, the computational cost of DIIF's fine stage is significantly lower than that of LIIF's decoder.
Further as a preferred embodiment, the two-stage multilayer perceptron comprises hidden layers each consisting of a linear layer followed by an activation function.
Referring to fig. 4, the C2F-MLP divides the decoder into a coarse stage for predicting slice hidden vectors and a fine stage for predicting pixel values. Each hidden layer of the C2F-MLP consists of a linear layer of dimension 256 followed by a ReLU activation function. The coarse stage takes the hidden code z*, the first coordinate x_1st of the coordinate slice, the last coordinate x_last of the coordinate slice, and the pixel area a under the current magnification as input, and outputs the slice hidden vector H_tl~rb. The fine stage takes the slice hidden vector and the coordinate to be predicted x_i as input, and outputs I_i. To predict RGB values, the fine stage ends with an output linear layer of dimension 3.
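A minimal numpy sketch of the C2F-MLP structure described above may help fix the data flow. The 256-dimensional hidden layers, the 3-dimensional RGB output, and the coarse-stage inputs (z*, x_1st, x_last, area a) follow the description; the latent dimension, a single hidden layer per stage, and the random weight initialization are simplifying assumptions of ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(x, w, b):
    return x @ w + b

class CoarseToFineMLP:
    """Untrained sketch of the two-stage (coarse-to-fine) MLP decoder.
    The coarse stage runs once per coordinate slice; the fine stage
    runs once per coordinate to be predicted."""
    def __init__(self, latent_dim=64, coord_dim=2, hidden=256):
        in_coarse = latent_dim + 2 * coord_dim + 1     # z*, x_1st, x_last, area a
        self.w1 = rng.standard_normal((in_coarse, hidden)) * 0.02
        self.b1 = np.zeros(hidden)
        in_fine = hidden + coord_dim                   # slice hidden vector, x_i
        self.w2 = rng.standard_normal((in_fine, hidden)) * 0.02
        self.b2 = np.zeros(hidden)
        self.w3 = rng.standard_normal((hidden, 3)) * 0.02   # RGB output layer
        self.b3 = np.zeros(3)

    def coarse(self, z, x_first, x_last, area):
        """Predict the slice hidden vector from slice boundaries and hidden code."""
        inp = np.concatenate([z, x_first, x_last, [area]])
        return np.maximum(linear(inp, self.w1, self.b1), 0.0)   # ReLU

    def fine(self, h_slice, x_i):
        """Predict the RGB value at one coordinate of the slice."""
        inp = np.concatenate([h_slice, x_i])
        h = np.maximum(linear(inp, self.w2, self.b2), 0.0)      # ReLU
        return linear(h, self.w3, self.b3)

mlp = CoarseToFineMLP()
h_slice = mlp.coarse(z=np.zeros(64), x_first=np.array([0.0, 0.0]),
                     x_last=np.array([0.75, 0.75]), area=1 / 16)
rgb = mlp.fine(h_slice, np.array([0.25, 0.5]))
assert h_slice.shape == (256,)
assert rgb.shape == (3,)
```

Note how the expensive coarse pass is amortized: `coarse` is called once per slice, after which `fine` is called cheaply for every coordinate in that slice.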
Further as a preferred embodiment, before the pre-trained encoder performs implicit encoding on the image to be processed to obtain the two-dimensional feature map, the method further includes pre-training the encoder and the dynamic implicit image network, specifically including:
acquiring a training image;
performing pixel prediction on the training image through the encoder and the dynamic implicit image network to obtain predicted pixel values;
determining a pixel loss value according to the pixel values of the training image and the predicted pixel values;
and updating the weight parameters of the encoder and the dynamic implicit image network according to the pixel loss value to obtain the trained encoder and dynamic implicit image network.
In the embodiment of the invention, the training stage computes a pixel-level loss using the predicted pixel values and the pixel values of the real image. The encoder and the decoding function are trained jointly in a self-supervised super-resolution task, and the learned network parameters are shared by all images.
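The pixel-level loss of the training stage can be sketched as follows. The patent text does not name a specific loss; L1 (mean absolute error) is assumed here, as is common for LIIF-style super-resolution training:

```python
def l1_pixel_loss(pred, target):
    """Mean absolute error between predicted pixel values (I_out) and
    ground-truth pixel values (I_gt). L1 is an assumed choice; the
    patent only says a pixel-level loss is computed."""
    assert len(pred) == len(target)
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

assert l1_pixel_loss([0.0, 1.0], [1.0, 1.0]) == 0.5
assert l1_pixel_loss([0.5, 0.5], [0.5, 0.5]) == 0.0
```

During training, this scalar would be back-propagated through both the decoding function and the encoder, consistent with the joint self-supervised training described above.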
In another aspect, an embodiment of the invention further provides a system, including:
a first module for acquiring an image to be processed;
a second module for performing implicit encoding on the image to be processed through a pre-trained encoder to obtain a two-dimensional feature map;
and a third module for inputting the two-dimensional feature map into a dynamic implicit image network, performing dynamic coordinate slicing on the two-dimensional feature map, and performing pixel value prediction through a two-stage multilayer perceptron to obtain image pixel values.
Optionally, the third module includes:
a first sub-module for performing dynamic coordinate slicing on the two-dimensional feature map;
and a second sub-module for performing pixel value prediction through the two-stage multilayer perceptron.
Optionally, the first sub-module includes:
a first unit for inputting an image magnification;
a second unit for acquiring a feature vector from the two-dimensional feature map, determining the feature vector as a hidden code, and grouping coordinates in the two-dimensional feature map according to the hidden code to obtain a feature coordinate set;
and a third unit for slicing the feature coordinate set according to the image magnification to obtain a coordinate slice.
Optionally, the second sub-module includes:
a fourth unit for inputting a coordinate slice and a slice hidden code;
a fifth unit for performing first-stage processing on the coordinate slice and the slice hidden code to obtain a slice hidden vector;
a sixth unit for obtaining a coordinate to be predicted, where the coordinate to be predicted is any coordinate in the coordinate slice;
and a seventh unit for performing second-stage processing on the slice hidden vector according to the coordinate to be predicted to obtain the pixel value of the coordinate to be predicted.
The invention provides a method and a system for arbitrary-scale image representation based on a dynamic implicit image function, enabling fast and effective arbitrary-scale image representation. In DIIF, a pixel-based image is represented as a two-dimensional feature map, and the decoding function takes coordinate slices and local feature vectors as input to predict the corresponding sets of pixel values. By sharing local feature vectors inside coordinate slices, DIIF can perform large-scale super-resolution reconstruction at very low computational cost. Experimental results show that DIIF outperforms existing arbitrary-scale super-resolution methods in both super-resolution performance and computational efficiency at all scaling factors; compared with LIIF, DIIF saves up to 87% of the computational cost while consistently achieving better PSNR. DIIF can be efficiently applied to scenarios where images need to be presented in real time at any resolution: it can implement arbitrary zooming in image viewing/editing software, upscale and restore low-resolution images, and compress high-resolution images for storage.
In some alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flowcharts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed, and in which sub-operations described as part of a larger operation are performed independently.
Furthermore, while the invention is described in the context of functional modules, it should be appreciated that, unless otherwise indicated, one or more of the described functions and/or features may be integrated in a single physical device and/or software module or one or more functions and/or features may be implemented in separate physical devices or software modules. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary to an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be apparent to those skilled in the art from consideration of their attributes, functions and internal relationships. Accordingly, one of ordinary skill in the art can implement the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative and are not intended to be limiting upon the scope of the invention, which is to be defined in the appended claims and their full scope of equivalents.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute them. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiment of the present application has been described in detail, the present application is not limited to the embodiments described above, and those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of the present application, and these equivalent modifications or substitutions are included in the scope of the present application as defined in the appended claims.

Claims (4)

1. An arbitrary scale image representation method based on a dynamic implicit image function, the method comprising:
acquiring an image to be processed;
performing implicit coding processing on the image to be processed through a pre-trained encoder to obtain a two-dimensional feature map;
inputting the two-dimensional feature map into a dynamic implicit image network, performing dynamic coordinate slicing processing on the two-dimensional feature map, and performing pixel value prediction processing through a dual-stage multi-layer perceptron to obtain an image pixel value;
wherein the dynamic coordinate slicing processing on the two-dimensional feature map comprises the following steps:
inputting an image magnification factor;
acquiring a feature vector from the two-dimensional feature map, determining the feature vector as a hidden code, and grouping coordinates in the two-dimensional feature map according to the hidden code to obtain a feature coordinate set;
slicing the feature coordinate set according to the image magnification factor to obtain a coordinate slice;
wherein the slicing of the feature coordinate set according to the image magnification factor to obtain a coordinate slice comprises the following steps:
determining a slice interval according to the image magnification factor;
dividing the feature coordinate set according to the slice interval to obtain a coordinate slice, wherein all coordinates in the coordinate slice share the same hidden code;
wherein the pixel value prediction processing through the dual-stage multi-layer perceptron comprises the following steps:
inputting a coordinate slice and a slice hidden code;
performing first-stage processing on the coordinate slice and the slice hidden code to obtain a slice hidden vector;
acquiring a coordinate to be predicted, wherein the coordinate to be predicted is any coordinate in the coordinate slice;
and performing second-stage processing on the slice hidden vector according to the coordinate to be predicted to obtain a pixel value of the coordinate to be predicted.
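The slicing and two-stage decoding of claim 1 can be sketched as follows. This is a minimal illustrative NumPy sketch, not the patented implementation: the function names (`slice_coordinates`, `dual_stage_mlp`), the choice of slice interval (magnification squared), and the random stand-in weights are all assumptions made for illustration; the claim only states that the slice interval is determined from the magnification factor.

```python
import numpy as np

def slice_coordinates(coords, magnification, slice_len=None):
    """Dynamic coordinate slicing: group query coordinates into slices,
    where every coordinate in a slice shares one hidden code.
    The interval magnification**2 is an illustrative assumption."""
    if slice_len is None:
        slice_len = int(magnification) ** 2
    return [coords[i:i + slice_len] for i in range(0, len(coords), slice_len)]

def dual_stage_mlp(coord_slice, hidden_code, rng):
    """Dual-stage decoding. Stage 1 maps the shared slice hidden code
    to a single slice hidden vector (computed once per slice); stage 2
    decodes one RGB value per query coordinate. Random weights stand
    in for trained parameters."""
    w1 = rng.standard_normal((hidden_code.size, 16))
    slice_hidden = np.tanh(hidden_code @ w1)          # stage 1: once per slice
    w2 = rng.standard_normal((slice_hidden.size + 2, 3))
    pixels = []
    for xy in coord_slice:                            # stage 2: per coordinate
        pixels.append(np.tanh(np.concatenate([slice_hidden, xy]) @ w2))
    return np.stack(pixels)                           # (len(slice), 3) RGB
```

Because the first stage runs once per slice rather than once per coordinate, the per-pixel cost falls as the slice interval grows, which is the claimed source of the computational saving.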
2. The method of claim 1, wherein the dual-stage multi-layer perceptron comprises a hidden layer, the hidden layer consisting of a linear layer and an activation function.
3. The method according to claim 1 or 2, wherein before the implicit coding processing is performed on the image to be processed through the pre-trained encoder to obtain the two-dimensional feature map, the method further comprises pre-training the encoder and the dynamic implicit image network, which specifically comprises:
acquiring a training image;
performing pixel prediction processing on the training image through the encoder and the dynamic implicit image network to obtain a predicted pixel value;
determining a pixel loss value according to a pixel value of the training image and the predicted pixel value;
and updating weight parameters of the encoder and the dynamic implicit image network according to the pixel loss value to obtain a trained encoder and a trained dynamic implicit image network.
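The pre-training loop of claim 3 reduces to: predict pixels, compare them against the training image, and update the weights from the resulting loss. The sketch below makes heavy assumptions for illustration: a single linear map `w` stands in for the encoder plus dynamic implicit image network, and an L1 pixel loss with subgradient descent stands in for the loss and optimizer, which the claim leaves unspecified.

```python
import numpy as np

def l1_pixel_loss(pred, target):
    """Mean absolute error between predicted and ground-truth pixel values."""
    return np.abs(pred - target).mean()

def train_step(w, feats, target, lr=0.05):
    """One pre-training update on the stand-in model pred = feats @ w.
    Returns the updated weights and the loss before the update; the
    real network would be updated by backpropagation instead of this
    closed-form L1 subgradient."""
    pred = feats @ w
    grad = feats.T @ np.sign(pred - target) / len(feats)
    return w - lr * grad, l1_pixel_loss(pred, target)
```

Iterating `train_step` drives the pixel loss down, mirroring the claimed loop of computing a pixel loss value and updating the weight parameters until a trained model is obtained.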
4. An arbitrary scale image representation system based on a dynamic implicit image function, the system comprising:
a first module, configured to acquire an image to be processed;
a second module, configured to perform implicit coding processing on the image to be processed through a pre-trained encoder to obtain a two-dimensional feature map;
a third module, configured to input the two-dimensional feature map into a dynamic implicit image network, perform dynamic coordinate slicing processing on the two-dimensional feature map, and perform pixel value prediction processing through a dual-stage multi-layer perceptron to obtain an image pixel value;
wherein the third module comprises:
a first sub-module, configured to perform the dynamic coordinate slicing processing on the two-dimensional feature map;
a second sub-module, configured to perform the pixel value prediction processing through the dual-stage multi-layer perceptron;
wherein the first sub-module comprises:
a first unit, configured to input an image magnification factor;
a second unit, configured to acquire a feature vector from the two-dimensional feature map, determine the feature vector as a hidden code, and group coordinates in the two-dimensional feature map according to the hidden code to obtain a feature coordinate set;
a third unit, configured to slice the feature coordinate set according to the image magnification factor to obtain a coordinate slice;
wherein the slicing performed by the third unit comprises:
determining a slice interval according to the image magnification factor;
dividing the feature coordinate set according to the slice interval to obtain a coordinate slice, wherein all coordinates in the coordinate slice share the same hidden code;
and wherein the second sub-module comprises:
a fourth unit, configured to input a coordinate slice and a slice hidden code;
a fifth unit, configured to perform first-stage processing on the coordinate slice and the slice hidden code to obtain a slice hidden vector;
a sixth unit, configured to acquire a coordinate to be predicted, wherein the coordinate to be predicted is any coordinate in the coordinate slice;
and a seventh unit, configured to perform second-stage processing on the slice hidden vector according to the coordinate to be predicted to obtain a pixel value of the coordinate to be predicted.
CN202211590183.8A 2022-12-12 2022-12-12 Arbitrary scale image representation method and system based on dynamic implicit image function Active CN115861343B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211590183.8A CN115861343B (en) 2022-12-12 2022-12-12 Arbitrary scale image representation method and system based on dynamic implicit image function

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211590183.8A CN115861343B (en) 2022-12-12 2022-12-12 Arbitrary scale image representation method and system based on dynamic implicit image function

Publications (2)

Publication Number Publication Date
CN115861343A CN115861343A (en) 2023-03-28
CN115861343B 2024-06-04

Family

ID=85672081

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211590183.8A Active CN115861343B (en) 2022-12-12 2022-12-12 Arbitrary scale image representation method and system based on dynamic implicit image function

Country Status (1)

Country Link
CN (1) CN115861343B (en)

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014197994A1 (en) * 2013-06-12 2014-12-18 University Health Network Method and system for automated quality assurance and automated treatment planning in radiation therapy
CN111784570A (en) * 2019-04-04 2020-10-16 Tcl集团股份有限公司 Video image super-resolution reconstruction method and device
KR102193108B1 (en) * 2019-10-10 2020-12-18 서울대학교산학협력단 Observation method for two-dimensional river mixing using RGB image acquired by the unmanned aerial vehicle
CN112163655A (en) * 2020-09-30 2021-01-01 上海麦广互娱文化传媒股份有限公司 Dynamic implicit two-dimensional code and generation and detection method and device thereof
CN112419150A (en) * 2020-11-06 2021-02-26 中国科学技术大学 Random multiple image super-resolution reconstruction method based on bilateral up-sampling network
CN112446489A (en) * 2020-11-25 2021-03-05 天津大学 Dynamic network embedded link prediction method based on variational self-encoder
WO2021122850A1 (en) * 2019-12-17 2021-06-24 Canon Kabushiki Kaisha Method, device, and computer program for improving encapsulation of media content
WO2021183336A1 (en) * 2020-03-09 2021-09-16 Schlumberger Technology Corporation Fast front tracking in eor flooding simulation on coarse grids
WO2021216747A1 (en) * 2020-04-21 2021-10-28 Massachusetts Institute Of Technology Real-Time Photorealistic 3D Holography with Deep Neural Networks
EP3907695A1 (en) * 2019-08-14 2021-11-10 Samsung Electronics Co., Ltd. Electronic apparatus and control method thereof
CN113689539A (en) * 2021-07-06 2021-11-23 清华大学 Dynamic scene real-time three-dimensional reconstruction method and device based on implicit optical flow field
CN113947521A (en) * 2021-10-14 2022-01-18 展讯通信(上海)有限公司 Image resolution conversion method and device based on deep neural network and terminal equipment
US11308657B1 (en) * 2021-08-11 2022-04-19 Neon Evolution Inc. Methods and systems for image processing using a learning engine
CN114897912A (en) * 2022-04-24 2022-08-12 广东工业大学 Three-dimensional point cloud segmentation method and system based on enhanced cyclic slicing network
CN115049556A (en) * 2022-06-27 2022-09-13 安徽大学 StyleGAN-based face image restoration method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10185584B2 (en) * 2013-08-20 2019-01-22 Teleputers, Llc System and method for self-protecting data
US10403007B2 (en) * 2017-03-07 2019-09-03 Children's Medical Center Corporation Registration-based motion tracking for motion-robust imaging
US11847560B2 (en) * 2020-07-27 2023-12-19 Robert Bosch Gmbh Hardware compute fabrics for deep equilibrium models


Non-Patent Citations (11)

* Cited by examiner, † Cited by third party
Title
"3D场景表征—神经辐射场(NeRF)近期成果综述";朱方;《 中国传媒大学学报(自然科学版) 》;20221020;全文 *
"Meta-sr: A magnification-arbitrary network for super-resolution";Xuecai Hu, Haoyuan Mu, Xiangyu Zhang, Zilei Wang, Tieniu Tan, and Jian Sun.;《https://doi.org/10.48550/arXiv.1903.00875》;20190303;全文 *
Huanrong Zhang ; Jie Xiao ; Zhi Jin."Multi-scale Image Super-Resolution via A Single Extendable Deep Network" .《 IEEE Journal of Selected Topics in Signal Processing》.2020,全文. *
Luke Lozenski ; Mark A. Anastasio ; Umberto Villa."A Memory-Efficient Self-Supervised Dynamic Image Reconstruction Method Using Neural Fields".《IEEE Transactions on Computational Imaging》.2022,全文. *
Ning Ni ; Hanlin Wu ; Libao Zhang."A Memory-Efficient Self-Supervised Dynamic Image Reconstruction Method Using Neural Fields".《2022 IEEE International Conference on Image Processing (ICIP)》 .2022,全文. *
Xin Huang ; Qi Zhang ; Ying Feng ; Hongdong Li ; Xuan Wang ; Qing Wang ."HDR-NeRF: High Dynamic Range Neural Radiance Fields".《2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)》.2022,全文. *
Yinbo Chen ; Sifei Liu ; Xiaolong Wang."Learning Continuous Image Representation with Local Implicit Image Function".《2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)》.2021,全文. *
基于Softplus+HKELM的彩色图像超分辨率算法;王亚刚;王萌;;计算机与数字工程;20200120(第01期);全文 *
李哲远 ; 陈翔宇 ; 乔宇 ; 董超 ; 井焜."注意力机制在单图像超分辨率中的分析研究".《集成技术》.2022,全文. *
李征 ; 金迪 ; 黄雪原 ; 袁科."基于隐式反馈的推荐研究综述".《河南大学学报(自然科学版)》.2022,全文. *
边缘修正的多尺度卷积神经网络重建算法;程德强;蔡迎春;陈亮亮;宋玉龙;;激光与光电子学进展;20180328(第09期);全文 *

Also Published As

Publication number Publication date
CN115861343A (en) 2023-03-28

Similar Documents

Publication Publication Date Title
CN110782490B (en) Video depth map estimation method and device with space-time consistency
Ye et al. Inverted pyramid multi-task transformer for dense scene understanding
CN113034380B (en) Video space-time super-resolution method and device based on improved deformable convolution correction
Endo et al. Animating landscape: self-supervised learning of decoupled motion and appearance for single-image video synthesis
Li et al. Hst: Hierarchical swin transformer for compressed image super-resolution
CA3137297C (en) Adaptive convolutions in neural networks
Grant et al. Deep disentangled representations for volumetric reconstruction
Zhai et al. Optical flow estimation using channel attention mechanism and dilated convolutional neural networks
CN115294282A (en) Monocular depth estimation system and method for enhancing feature fusion in three-dimensional scene reconstruction
CN113095254A (en) Method and system for positioning key points of human body part
CN115205150A (en) Image deblurring method, device, equipment, medium and computer program product
CN115272437A (en) Image depth estimation method and device based on global and local features
CN113066018A (en) Image enhancement method and related device
CN114283347A (en) Target detection method, system, intelligent terminal and computer readable storage medium
Gao et al. Augmented weighted bidirectional feature pyramid network for marine object detection
CN113436224B (en) Intelligent image clipping method and device based on explicit composition rule modeling
Kim et al. Latent transformations neural network for object view synthesis
Suzuki et al. Residual learning of video frame interpolation using convolutional LSTM
CN115861343B (en) Arbitrary scale image representation method and system based on dynamic implicit image function
WO2023170069A1 (en) Generating compressed representations of video for efficient learning of video tasks
Chen et al. Adaptive hybrid composition based super-resolution network via fine-grained channel pruning
US20220172421A1 (en) Enhancement of Three-Dimensional Facial Scans
Xiang et al. InvFlow: Involution and multi-scale interaction for unsupervised learning of optical flow
CN113902985A (en) Training method and device of video frame optimization model and computer equipment
WO2023051408A1 (en) Feature map processing method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant