GB2604898A - Imaging processing using machine learning - Google Patents

Imaging processing using machine learning

Info

Publication number
GB2604898A
GB2604898A GB2103715.5A GB202103715A GB2604898A GB 2604898 A GB2604898 A GB 2604898A GB 202103715 A GB202103715 A GB 202103715A GB 2604898 A GB2604898 A GB 2604898A
Authority
GB
United Kingdom
Prior art keywords
images
output
attention
features
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
GB2103715.5A
Other versions
GB202103715D0 (en)
Inventor
Mrak Marta
Gorriz Blanch Marc
E O'connor Noel
F Smeaton Alan
Khalifeh Issa
Izquierdo Ebroul
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
British Broadcasting Corp
Original Assignee
British Broadcasting Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by British Broadcasting Corp filed Critical British Broadcasting Corp
Priority to GB2103715.5A priority Critical patent/GB2604898A/en
Publication of GB202103715D0 publication Critical patent/GB202103715D0/en
Priority to PCT/GB2022/050680 priority patent/WO2022195285A1/en
Publication of GB2604898A publication Critical patent/GB2604898A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/77Retouching; Inpainting; Scratch removal
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/002D [Two Dimensional] image generation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/50Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/002D [Two Dimensional] image generation
    • G06T11/001Texturing; Colouring; Generation of texture or colour
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]


Abstract

A method of processing image data comprises receiving one or more target images and at least one reference source. The method processes the at least one reference source and the one or more target images to extract features using a convolutional neural network encoder. In addition, the method processes the features of the one or more target images and the at least one reference source using a transformer network to provide an attention output. The attention output is provided as an input to a convolutional neural network decoder. In addition, skip connections are provided from the convolutional neural network encoder to provide features of the one or more target images to the convolutional neural network decoder at one or more decoder layers. Finally, the extracted features of the one or more target images and the attention output are processed using the convolutional neural network decoder to produce one or more output images. The output images take the style of the reference source. In one example, this style is obtained by taking colourisation from the reference source. In another example, interpolation is provided between two images and the reference source itself comprises two images.

Description

Image Processing Using Machine Learning
BACKGROUND OF THE INVENTION
This invention relates to a system and method for image enhancement using machine learning.
The concept of image enhancement encompasses a variety of techniques by which pixels within images can be altered or additional pixels created. Some examples of image enhancement include colouring of black-and-white images, alteration of colours of existing coloured images and interpolation to create additional pixels. Interpolation can be either spatial, to increase resolution, or temporal, to increase frame rate. Image enhancement thus includes both still images and video sequences of frames.
Various techniques exist to make use of machine learning for the purpose of image enhancement. For example, in the field of colourisation (providing colours to an image previously rendered in greyscale) there are various attempts at using a deep learning framework that uses previously computed similarity maps to perform exemplar-based colourisation. Similarly, in the field of video frame interpolation, convolutional neural networks have been used as part of a traditional two-step process of motion estimation and motion-based frame warping.
We will first discuss the existing known techniques for the two examples of image enhancement: colourisation and interpolation.
Colourisation refers to the process of adding colours to greyscale or other monochrome content such that the colourised results are perceptually meaningful and visually appealing. Mapping colours from a grayscale input is a complex and ambiguous task due to the large number of degrees of freedom involved in arriving at a unique solution. Thus, in order to overcome the ambiguity challenge, more conservative solutions propose the involvement of human interaction during the colour assignment process, introducing methodologies such as exemplar-based colourisation.
However, existing exemplar-based methods are either highly sensitive to the selection of references (needing similar content, position and size of related objects) or extremely complex and time consuming. For instance, most approaches require a style transfer or similar method to compute the semantic correspondences between the target and the reference before starting the colourisation process. This usually increases system complexity by requiring twofold pipelines with separate and even independent style transfer and colourisation systems.
Modern digital colourisation algorithms can be roughly grouped into three main paradigms: automatic learning-based, scribble-based and exemplar-based colourisation.
Automatic learning-based methods can perform colourisation with end-to-end architectures which learn the direct mapping of every grayscale pixel to the colour space. Such approaches require huge image datasets to train the network parameters without user intervention. However, in most cases they produce desaturated results due to treating the colourisation process as a regression problem. As noted in the literature, well-designed loss functions such as adversarial loss, classification loss, perceptual loss or a combination of them with regularisation are needed to better capture the colour distribution of the input content and encourage more colourful results. Moreover, the colour prediction is uncontrollable and highly sensitive to input noise, especially when applied to the restoration of historical content.
Scribble-based colourisation interactively propagates initial strokes or colour points annotated by the user to the whole grayscale image. An optimisation approach has been proposed to propagate the user hints by using adaptive clustering in the high dimensional affinity space. Alternatively, a Markov Random Field for propagating the scribbles has been proposed under the rationale that adjacent pixels with similar intensity should have similar colours. Finally, a deep learning approach fuses low-level cues along with high-level semantic information to propagate the user hints.
Exemplar-based colourisation uses a colour reference to condition the prediction process. An early approach proposed the matching of global colour statistics, but yielded unsatisfactory results since it ignored local spatial information. More accurate approaches considered the extraction of correspondences at different levels, such as pixels, super-pixels, segmented regions or deep features. Based on the extraction of deep image analogies from a pre-trained VGG-19 network, a deep learning framework uses previously computed similarity maps to perform exemplar-based colourisation.
Video frame interpolation is the process of generating new intermediate frames from existing ones. There are many applications in industry, which include slow-motion generation, the adaptation of old TV content to play better on newer high frame rate TVs, gaming, video stabilisation, etc. Traditionally, optical flow-based methods have been used for video frame interpolation using a two-step process. First, the motion between the two input frames would be computed and, second, the pixels of the input images warped to generate the new frame. In such a scenario, there are two distinct steps to frame interpolation: motion estimation and motion-based frame warping. Some methods have post-processing steps, but this tends to be the general model.
Four predominant models have emerged in the video frame interpolation field. Flow-based methods build on the concept of using optical flow to warp the input images to obtain the interpolated frame. Then there is the kernel-based approach, which performs the interpolation task without any explicit flow estimation. Kernel-based methods are limited in the maximum motion they can compute and flow-based methods do not consider the area around a pixel, so the approach of combining both together, as seen in DAIN and MEMO-NET, constitutes the third method of interpolation. Finally, there is the newer concept of using a flexible kernel, so that the kernel block does not have to be square and can refer to any point in the frame.
With the growth of CNNs, some methods conducted frame interpolation using these CNNs. They still tend to follow this two-step process. Examples include Super SloMo, MEMO-NET, DAIN etc. Long et al were the first to experiment with introducing a direct synthesis network with no optical flow dependency; however, the results were sub-optimal. Niklaus et al introduced and popularised the idea of using kernel-based methods and removing the need for explicit motion-based interpolation. This meant that optical flow values were not necessary for frame interpolation (at least not explicitly). The advantage of this method was that videos taken in the wild could be used for training and thus a much wider range of scenarios could be accounted for in the dataset. This would be especially helpful in reducing the domain gap, as optical flow CNN models are trained using synthetic datasets such as KITTI, where the optical flow values are present and can be used as ground truth. When optical flow methods are used on real-world data, performance is worse. Many newer methods have refined the initial model introduced by Niklaus et al, such as SepConv, EDSC and AdaCoF (adaptive collaboration of flows for video frame interpolation). Many models have tried combining the advantages of both methods, such as DAIN (which combines flow-based and kernel-based methods) and Y. However, the limitations of both methods remain. With optical flow-based methods, the limitation still lies with the problem of an inaccurate flow computation, which results in artefacts in the generated frame. For kernel-based methods, the main problem, as noted by others, is memory usage and the inefficient usage of kernels. For example, with SepConv, a kernel size of 51 is used. This means that, even if the motion between the two input images was 5 pixels or 40 pixels, the same kernel size of 51 would be allocated. The AdaCoF method tries to reduce this problem by making the kernel-based model more efficient, but the problem still exists, albeit on a smaller scale.
SUMMARY OF THE INVENTION
We have appreciated the need to provide improved methods and apparatus for image enhancement. In particular, we have appreciated the need to provide improved image enhancement techniques using machine learning.
In broad terms, the invention provides improved image enhancement techniques using a context derived from different references (or differently processed references) such as using a Transformer encoder-decoder network. In particular, an "attention" mechanism is provided that derives a concept of attention from images and uses this attention in combination with a neural network such as a convolutional neural network backbone.
The concept of "attention" may be the same as used in the paper on the transformer "Attention is all you need" of Vaswani et al 31 Conference on Neural Information Processing Systems 2017. This paper proposes using the concept of attention with an input sequence to decide which parts of the sequence are important. The preferred implementation uses a transformer network but, in addition to transformers, one can use other related mechanisms which have been recently introduced, such as Lambda Networks or axial-attention.
We have appreciated, therefore, that the concept of attention can be used with image enhancement techniques in a manner not seen before. In particular, the invention provides the extraction of features from images which may be used as an input to an attention mechanism. The input may be provided by concatenating features into a channel dimension.
The invention may be applied to a variety of image enhancement techniques including, but not limited to, colourisation and interpolation.
The invention may be embodied in a method for processing images to produce one or more output images, a system for processing such images and devices for processing images including, but not limited to, mobile devices, broadcast devices, studio equipment and other hardware arranged to receive one or more images and produce an image output as individual images or a video stream.
Common to the embodiments is an arrangement for receiving a target source.
The target source may be a single image or a sequence of images as part of a video clip. The target source comprises images that are to be enhanced in some way, such as by colourisation or interpolation or by other enhancement. A separate context source is arranged to provide a second source of information which is used to enhance the target. The context source may be a single image from which some "style" may be provided to the target. A style is some aspect of the image such as colour, fidelity and so on. The context source may also be a plurality of images, such as a pair of images, used to provide an input by which interpolation may be performed.
The target images and context source are provided via encoders to an attention mechanism implemented by a transformer. This provides an attention output to a decoder which may be used to then produce output images in various ways that will be described.
BRIEF DESCRIPTION OF THE DRAWINGS
An embodiment of the invention will now be described in more detail by way of example with reference to the drawings, in which: Figure 1: is a schematic diagram showing a generalised example neural network arrangement that may be used for a variety of image processing purposes; Figure 2: is a diagram of an example neural network arrangement for colourisation; Figure 3: is a diagram showing how an axial attention module is used in the arrangement of figure 2 for colourisation; Figure 4: is a generalised diagram of an example neural network arrangement for interpolation; Figure 5: is a diagram of the first example neural network in which an output of a network is used as an input in the channel dimension for interpolation; Figure 6: is a more detailed diagram of the arrangement shown in figure 5; Figure 7: is a diagram of a second example neural network in which an output of a network is used for a context interpolation network for interpolation; Figure 8: is a diagram of a third example neural network in which an output of the network is used via another convolutional neural network to a transformer decoder for interpolation, and Figure 9: is a schematic diagram of a system embodying the invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
The invention may be embodied in a method and system for processing images, including still images or video sequences of images. In particular, the invention provides for image enhancement using machine learning techniques based on a Transformer using attention.
The invention may be implemented in a variety of image processing systems and methods, but the main examples that will be given are colourisation and interpolation. For simplicity of description, these will be described as separate embodiments. However, for the avoidance of doubt, the techniques may be used in combination to provide both colour manipulation and spatial or temporal interpolation within one system.
Figure 1 is a schematic view of the main components of an arrangement embodying the invention that may be used for a variety of image processing techniques. The components shown in figure 1 and the other figures are functional components which may be implemented by dedicated hardware or by process steps in software. Each block may therefore comprise dedicated hardware or may be an instance of a process.
The arrangement comprises an input 2 for receiving a target source. The target source comprises a single image frame or a sequence of image frames that are to be augmented, altered or enhanced in some way. The context source input 4 is arranged to receive a context source which may also be a single image or multiple images providing some aspect that is to be used to provide enhancement to the target source. The images of the context source may be from the same sequence as the target source and, in some arrangements, may comprise the same images as the target source, particularly in an interpolation arrangement.
The target frame or frames are processed by an encoder 10 which forms part of the convolutional neural network backbone to produce feature outputs. The context source is processed by an encoder 12 which may also be part of the convolutional neural network backbone or may be a separate encoder. In one arrangement, the encoder 10 and encoder 12 are instances of the same encoder used to process the target and context separately.
The outputs of the encoder 10 and encoder 12 are provided to an attention mechanism 14 in the form of a transformer network. This provides a concept of attention that is subsequently used by the decoder 16 that is also part of the convolutional neural network backbone. The decoder 16 incorporates functionality to produce output images on an image output 6 either directly or by incorporating additional image processing functionality.
A key aspect of the arrangement embodying the invention is the use of an attention mechanism, in particular a transformer, inserted as part of an encoder-decoder convolutional neural network arrangement. By deriving attention from features derived from a target and context source and using these to derive an attention output, the concept of attention is thereby applied to image processing.
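By way of illustration only, the following is a minimal sketch of how the arrangement of figure 1 might be wired up, assuming PyTorch; the class name, channel sizes, the use of a single shared encoder for both sources and the choice of a standard transformer encoder are assumptions of this sketch rather than details specified by the embodiment.

```python
# Minimal sketch of the Figure 1 arrangement, assuming PyTorch.
import torch
import torch.nn as nn

class ImageEnhancer(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        # Encoders 10 / 12: here a single shared CNN encoder applied to both sources.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Attention mechanism 14: a standard transformer over flattened feature tokens.
        self.attention = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=channels, nhead=4, batch_first=True),
            num_layers=2,
        )
        # Decoder 16: upsample the fused features back to an output image.
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(channels, 3, 3, padding=1),
        )

    def forward(self, target, context):
        f_t = self.encoder(target)    # target features
        f_c = self.encoder(context)   # context features
        b, c, h, w = f_t.shape
        # Concatenate target and context tokens so attention can relate the two sources.
        tokens = torch.cat([f_t, f_c], dim=2).flatten(2).transpose(1, 2)  # B x (2HW) x C
        fused = self.attention(tokens)[:, : h * w].transpose(1, 2).view(b, c, h, w)
        return self.decoder(fused)    # output image

# Usage: out = ImageEnhancer()(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))
```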
COLOURISATION
Overview
The colourisation embodiment integrates attention modules that learn how to extract and transfer automatically selected style features from one or more reference images to one or more target images in an unsupervised way during the colourisation process. In particular, in colourisation we specialise and train the architecture to perform colourisation, although different specialisation and training can be used to apply the same concept to style transfer for other aspects of images.
The proposed architecture uses a backbone which can be pre-trained to extract semantic and style features at different scales from the grayscale target image and colour reference image. Then, attention modules at different resolutions extract analogies between both feature sources and automatically yield output feature maps that fuse the style of the reference to the content of the target.
Finally, a multi-scale pyramid decoder generates colour predictions at multiple resolutions, enabling the representation of higher-level semantics and robustness on the variance of scale and size of the local areas of content.
The main advantage of such an end-to-end solution is that the attention modules learn how to perform style transfer based on the needs of the colourisation decoder in order to encourage high quality and realistic predictions, even if the reference mismatches the target content.
Moreover, the approach generalises the similarity computation of known image analogy approaches by not constraining with a specific local patch search (attention modules can be interpreted as a set of long-term deformable kernels) or with specific similarity metrics.
The proposed architecture introduces a novel design of the conventional transformer for style transfer and image analogy tasks, enabling a modular combination of multi-head attention layers at different resolutions.
Specific applications of the proposed exemplar-based colourisation method include the improvement of image and video colourisation pipelines based on user/producer interaction, wherein a set of colour references can be either proposed by the user or automatically retrieved, in order to automatically colourise a set of target frames (which can be the next frames for a sequence of similar content).
The proposed end-to-end neural network can be easily extended to other conditioned source-to-source translation tasks such as image-to-image translation based on a reference, image-to-audio translation, etc. Easy domain adaptation can be performed by selecting a domain-specific pre-trained backbone which can produce meaningful features to lead the fusion and decoding process.
Architecture
The overall architecture of an arrangement embodying the invention will first be described with respect to figure 2. This architecture may be used for both training and inference, but the arrangement for inference is simpler and involves a single output rather than the multiple outputs shown in figure 2, as will be described.
The arrangement shown in figure 2 comprises an encoder 100 and a decoder 200. The attention mechanism from figure 1 is provided by the attention modules 133, 134 and 135 in figure 2. The two encoders from figure 1, encoder 10 and encoder 12, are provided by operating two instances of the encoder 100 in figure 2 to process the respective reference and target sources in parallel.
The encoder 100 comprises a reference input 101 for receiving one or more reference images and a target input 102 for receiving one or more target images.
The images are conventionally arranged, comprising an array of pixels each with either greyscale values or colour values in a chosen format, for example the CIE Lab colour space. The encoder 100 comprises a convolutional neural network backbone comprising a sequence of blocks 111 to 115, each block being arranged to receive the output of the preceding block. Each block is arranged to derive "features" by processing the pixels of the input image to produce a value at each pixel position. That value will depend upon the surrounding pixels and also depends upon the mathematical process used to derive the features. The preferred choice of processing step in each block to produce the features is a 3 x 3 convolution with filters. The choice of filter will impact the nature of the features extracted and, in consequence, will influence the features that are then given most attention in subsequent processing steps. The filters are derived according to a training process described later in which a source of "truth" is used to compute a loss function between the predicted image and the truth image. The filters are then updated in a known manner until that loss function is satisfied, for example by minimising a loss.

The blocks 111 to 115 will be described in turn. Block one 111 is arranged to receive a full resolution reference image and, separately, a full resolution target image. The processing at block one is undertaken at the full resolution (HxW) of these input images. The target and reference images are processed separately by convolution with filters to produce an output value at each pixel position. The output values after this block therefore comprise a number of feature maps, which are arrays of values for each of the reference and target images having the same resolution as the respective input but with features enhanced. These can be considered as "images" in the sense that they can be represented visually, but "feature map" is a better description. A number of arrays will be produced, one for each filter in the first convolution layer.
In addition, block one 111 has an output 121 via a 1x1 convolution and rectified linear unit (ReLU) of the feature map of the target image that is provided to the final stage of the decoder described later. It is to be noted that the data map of values at this point has the same resolution as the input target image and the function 1x1@h provides an output at each pixel position that is a positive h dimensional vector. The output 121 accordingly has the particular purpose of providing the feature map of the target image representation at the original resolution to the final decoder stage.
Block two 112 is arranged to receive the output of block one and apply further filters but at half the resolution. Accordingly, the output of this block may also be considered to be feature maps, but with half the pixel positions, of the respective reference and target images, provided by processing to give emphasis to features using filters, with each feature map obtained by a specific filter. This block also has an output 122 for the further processed target image, which provides an input to the second from last stage of the decoder and is also processed by the function 1x1@h, providing a positive h dimensional vector for each pixel position of the target image.
Blocks 113, 114 and 115 are arranged to respectively receive the output from each preceding block and process by further filtration at successively lower resolutions of 1/4, 1/8 and 1/16 of the original resolution for each of the target and reference images. Accordingly, each block will produce a data map for the target and reference images at a lower resolution, highlighting further features depending upon the filter chosen. As before, this is performed respectively for the target image and reference image. Blocks 113, 114 and 115 have respective outputs 123, 124 and 125, also processed by the function 1x1@h providing a positive h dimensional vector for each pixel position of the target image at the respective resolutions, and provided respectively to the third from last stage, the fourth from last stage and the output from the encoder block.
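As an illustration of the backbone pattern just described, the following is a minimal sketch, assuming PyTorch, of a 3 x 3 convolutional block applied at full and half resolution together with the 1x1@h projection used on the skip outputs; the filter counts and the value of h are placeholders rather than values taken from the embodiment, and a real arrangement may instead reuse a pre-trained backbone such as VGG-19.

```python
# Sketch of two backbone blocks and a 1x1@h skip projection, assuming PyTorch.
import torch
import torch.nn as nn

def backbone_block(in_ch, out_ch, downsample):
    # 3 x 3 convolution with filters, optionally halving the resolution.
    stride = 2 if downsample else 1
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
        nn.ReLU(inplace=True),
    )

block1 = backbone_block(1, 64, downsample=False)   # full resolution (H x W)
block2 = backbone_block(64, 128, downsample=True)  # half resolution (H/2 x W/2)

h = 64  # projected dimensionality (illustrative)
proj_skip = nn.Sequential(nn.Conv2d(64, h, kernel_size=1), nn.ReLU(inplace=True))

target = torch.rand(1, 1, 128, 128)   # greyscale target image
f1 = block1(target)                   # feature maps at the input resolution
f2 = block2(f1)                       # feature maps at half resolution
skip_full_res = proj_skip(f1)         # h-dimensional vector per pixel, fed to the decoder
```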
The blocks 1 to 5 described so far may be conventional feature extraction blocks.
However, the use of attention in combination with a multiscale approach is a novel arrangement as will now be described.
An output of each of block three 113, block four 114 and block five 115 is provided to respective attention modules 133, 134 and 135. The input to the attention modules comprises the feature map at the respective resolution for each of the reference and target images. The attention modules 133, 134 and 135 are arranged to compute a respective attention mask describing the correspondence between the sources. The output may be considered to be a fused feature map. In order to reduce complexity, the attention modules compute axial attention, namely attention values along rows and columns, as described later. The respective attention modules 133, 134 and 135 provide respective outputs 143, 144 and 145 which are provided to the decoder.
The decoder 200 comprises a multiscale pyramid decoder comprising a series of stacked decoders each operable on an input at a different resolution to produce an output. Decoder 211 is arranged to receive output 121 from block 111 of the encoder which is at the full resolution of the input image. Decoder 212 is arranged to receive output 122 from block 112 at half the resolution of the input image. Decoder 213 is arranged to receive output 123 from block 113 at one quarter of the input image resolution. Decoder 214 is arranged to receive output 124 from block 114 at 1/8 of the input resolution. Finally, decoder 214 is also arranged to receive the final output 125 of block 115 of the encoder which is at 1/16 resolution.
The decoders 211 to 214 each comprise a respective 3x3 convolution, up sampling and 3x3 convolution. The upsampling function is provided to up sample the lower resolution from each preceding block. Taking the final decoder 211 as an example, this receives an output 121 from block 111 at full resolution on input just after the upsampler. The upsampler is the fourth such upsampler in the sequence of decoding blocks and so upsamples to revert the processed encoded-decoded features back to the same original resolution as the skip connection on output 121.
Accordingly, each decoder block receives target features on respective skip lines at the same resolution as provided by the output of the upsampler within each decoder block.
A series of prediction heads 221 to 224 each comprise a 3 x 3 convolution + ReLU and a 1x1 convolution and Tanh function and provide respective outputs P1 231, P2 232, P3 233 and P4 234. The outputs are respectively at full resolution, half resolution, 1/4 resolution and 1/8 resolution of the original images.
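A hedged sketch of one such prediction head, assuming PyTorch, is given below; the intermediate channel width e and the projected dimension h are illustrative values rather than values specified by the embodiment.

```python
# Minimal sketch of one prediction head, assuming PyTorch.
import torch.nn as nn

def prediction_head(h=64, e=64):
    return nn.Sequential(
        nn.Conv2d(h, e, kernel_size=3, padding=1),  # 3 x 3 convolution
        nn.ReLU(inplace=True),
        nn.Conv2d(e, 2, kernel_size=1),             # 1 x 1 convolution to the ab colour channels
        nn.Tanh(),                                  # outputs in [-1, 1], matching the normalised Lab range
    )
```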
When in training mode, the decoder 200 comprises all of the modules shown in figure 2. However, when in inference mode, the decoder may comprise a single decoder and single prediction head, for example prediction head P1 221, to provide a single output at full resolution. The use of a series of prediction heads is particularly useful for the training mode by training using attention at different resolutions of features.
Considering the connectivity from encoder to decoder of each encoder block to each decoder in turn, we can see that the final block five 115 of the encoder produces an output 125 of the target image this being the feature map at 1/16 resolution. This feature map of the target image is summed in summation unit 245 with the fused features from attention module 135 on output 145. The output of summation unit 245 is then processed by decoder 214 and provided on output P4 via prediction head 224. This is a pixel by pixel output at the resolution 1/16 but upscaled by bilinear up sampling in decoder 214. The subsequent decoder blocks in turn perform the same process.
We can thus see that the overall architecture comprises the feature extractor backbone blocks 1 to 5 which extract features at different resolutions, the attention modules that produce fused feature maps at respective resolutions, summation units to sum those fused feature maps with the target feature maps at respective resolutions, decoders to convolve and up sample the resulting summation and prediction heads to produce pixel by pixel outputs at respective resolutions.
Training and Inference Process
The training and inference process will be described first, followed by aspects specific to the training process. It is noted, for the avoidance of doubt, that much of the description of the training process is the same as for the inference process.
Overall, the training stage is a multi-loss training strategy that takes advantage of the multi-scale pyramid decoder described above. Given a pair of target and reference images, the proposed neural network decodes the target colours at different resolutions. The network is trained to minimise the "distance" between the predicted and reference colours while conserving the semantic faithfulness of the target content (or, in other words, how realistic the target image is when colourised with the reference colours). Such "distance" is computed by the summation of 4 metrics: (1) the Huber loss between the real and predicted target colour channels (to encourage the learning of prior colours from similar content in the training data), (2) the histogram loss between the reference colours and the predicted colours (measuring how close the predicted colours are to those in the reference), (3) the total variance regularisation loss between neighbouring pixels of the prediction (encouraging neighbouring pixels in the prediction to have similar colours) and (4) the adversarial loss between the real target image and the prediction (using the Generative Adversarial Network (GAN) framework to measure automatically how realistic the predicted images look compared with the real ones, or in other words whether the predicted images look realistic enough to fool a discriminator neural network trained to classify whether an image is real or fake).
The process described adopts axial attention to reduce the complexity of the overall system. As introduced in the axial transformer paper noted above, attention is performed along a single axis, reducing the effective dimensionality of the attention maps and hence the complexity of the overall transformer. Such an approach approximates conventional attention (as used in other fields) by focusing sequentially on each of the dimensions of the input feature maps (instead of processing all the dimensions at once as normal attention does). In other words, given an input feature map of 2 dimensions, for a certain position (x, y) axial attention would focus first on the line of pixels along the horizontal axis and then on the vertical axis (2 lines of pixels for each position), while normal attention would focus on all the pixels in the map for each position. In operating this way, complexity is reduced.
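The following sketch illustrates the idea of axial attention, assuming a PyTorch implementation; the class name AxialAttention2d, the use of nn.MultiheadAttention, the head count and the omission of the positional encodings described later are choices of this sketch rather than details taken from the embodiment.

```python
# Illustrative sketch of axial attention: attention is applied along the width axis
# and then along the height axis, instead of over all H*W positions at once.
import torch.nn as nn

class AxialAttention2d(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.row_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, target_feats, reference_feats):
        # target_feats, reference_feats: B x C x H x W
        b, c, h, w = target_feats.shape
        q = target_feats.permute(0, 2, 3, 1)       # B x H x W x C (queries from the target)
        kv = reference_feats.permute(0, 2, 3, 1)   # keys/values from the reference

        # Width axis: each row of W positions attends within that row.
        out, _ = self.row_attn(q.reshape(b * h, w, c),
                               kv.reshape(b * h, w, c),
                               kv.reshape(b * h, w, c))
        out = out.reshape(b, h, w, c)

        # Height axis: each column of H positions attends within that column.
        q_cols = out.permute(0, 2, 1, 3).reshape(b * w, h, c)
        kv_cols = kv.permute(0, 2, 1, 3).reshape(b * w, h, c)
        out, _ = self.col_attn(q_cols, kv_cols, kv_cols)
        return out.reshape(b, w, h, c).permute(0, 3, 2, 1)  # back to B x C x H x W

# Per position, the cost scales with (w + h) rather than h * w for full 2D attention.
```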
The image processing arrangement involves the following key steps: a reference image is provided which provides an exemplar of the colours desired for a target image; a target image is provided (which may be greyscale) which is to adopt the colour scheme of the reference image; a backbone network, pre-trained for classification or similar tasks, processes the target and reference images and yields feature maps at different resolutions (obtained by the successive application of convolution operations and non-linear activations). Such features capture both the spatial patterns of the input content (which help to understand and recognise the objects within the images) and the inter-channel correlations modelling the style and colours.
An attention operation that fuses the style of the reference features with the spatial patterns (content) of the target features, followed by a decoder head that maps the features to the colour channels at a certain resolution. Performing this operation of attention + colourisation decoder at different resolutions enables the design of the pyramid multi-scale architecture that decodes the same prediction at multiple resolutions.
Overall, this process can be visualised with reference to the architecture of figure 2, with a more detailed example of the process described in relation to figures 2 and 3 together.
The goal of this method is to enable the colourisation of a grayscale target T_L based on the colour of a reference R_Lab, where HxW is the image dimension in pixels, and represented in the CIE Lab colour space. To achieve this, an exemplar-based colourisation network is trained to model the mapping T̂_ab = T(T_L, R_Lab) to the target ab colour channels, conditioned on the reference R_Lab channels. The CIE Lab colour space is chosen as it is designed to maintain perceptual uniformity and is more perceptually linear than other colour spaces [cite]. This work assumes a normalised range of values of [-1, 1] for each of the Lab channels.
It is important to note that the features concept in the mathematical part always refers to the actual images. Overall, an image (e.g. T = target image and R = reference image) is described as a group of L feature volumes (2D maps with N channels) at different resolutions (the outputs of the intermediate convolutional blocks of the pre-trained backbone network). So, after processing the input images T and R with the backbone we will generate L features F^l, l = 1...L levels/resolutions.
Using, for example, a backbone for image classification, if we input an image of a dog, the network will apply several convolutional blocks at different resolutions in order to generate features that describe the patterns of a dog, and use them to classify the class of the image (dog).
As shown in Figure 2, and previously described, the proposed architecture is composed of four parts: the feature extractor backbone, the axial attention modules, the multi-scale pyramid decoder and the prediction heads.
First, both the target T_L and the reference R_Lab images are fed into a pre-trained feature extractor backbone to obtain L multi-scale activated feature maps F_T^l, F_R^l, where l = 1...L, plus the last activated feature map for the target input only, F_T^B, where B is the number of backbone layers. Note that the features have progressively coarser resolution with increasing levels. Without loss of generality, the example arrangement uses a VGG-19 network pre-trained on ImageNet, extracting features at the first activation of every convolutional block, with L = 5 and F_T^B at relu4_3.
Then, all F_T^l, F_R^l pairs and F_T^B are projected into an h-dimensional space by means of a 1 x 1 convolution plus Rectified Linear Unit (ReLU) activation, to obtain F_T^lh, F_R^lh and F_T^Bh respectively. The feature maps from the backbone are thereby projected into a reduced dimensionality h (using the 1 x 1 convolutions, shown as 1x1@h blocks in figure 2) such that each pixel in the 2D map is represented by a vector of h dimensions. Next, each F_T^lh, F_R^lh pair of features is fed into N axial attention modules to compute a multi-head attention mask describing the deep correspondences between both sources. Then, the style of the reference source is transferred into the content of the target source by matrix multiplying the attention mask with the reference source. The axial attention module is described below and provides more information about the logic behind style transfer via attention. This process yields h-dimensional fused feature maps F_TR^lh.
The concept of "style transfer" refers to the transfer of visual attributes represented in the feature maps (such as colour, tone, texture, artistic style, etc). The style transfer process measures semantically-meaningful correspondences between the spatial patterns of both feature maps at a certain resolution (target and reference) and then transfers the aforementioned visual attributes (from the reference features to the target features). This operation is result of the operations performed with the attention module in combination with the encoder and decoder of the transformer.
After generating the multi-scale fused features, a multi-scale pyramid decoder composed of L - 1 stacked decoders and prediction heads is employed to map F_T^Bh into L colour predictions at different scales using the corresponding fused features F_TR^lh. Thus, starting with O^5 = F_T^Bh, each decoder l = {4, 3, 2, 1} performs a fivefold operation: (1) adds F_TR^lh to the output of the previous decoder O^(l+1), (2) applies a 3 x 3 convolution plus ReLU activation, (3) upsamples the resultant feature map by a factor of 2, (4) similarly to the U-Net architecture, concatenates the resultant upsampled map with the projected target feature map F_T^lh as a skip connection and (5) refines the resultant map with another 3 x 3 convolution plus ReLU activation which projects the concatenated volume of 2h dimensions back into the initial h dimensions, yielding an output feature volume O^l.
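A hedged sketch of one such decoder step, assuming PyTorch, is given below; the module name PyramidDecoderStep and the channel width h are illustrative, and the feature shapes are assumed to be compatible as in the description above.

```python
# Sketch of one pyramid decoder step (the fivefold operation), assuming PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidDecoderStep(nn.Module):
    def __init__(self, h=64):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(h, h, 3, padding=1), nn.ReLU(inplace=True))
        # The concatenation with the skip connection doubles the channels (2h -> h).
        self.conv2 = nn.Sequential(nn.Conv2d(2 * h, h, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, prev, fused, skip):
        x = prev + fused                                        # (1) add the fused features
        x = self.conv1(x)                                       # (2) 3 x 3 convolution + ReLU
        x = F.interpolate(x, scale_factor=2, mode="bilinear",
                          align_corners=False)                  # (3) upsample by a factor of 2
        x = torch.cat([x, skip], dim=1)                         # (4) concatenate the skip connection
        return self.conv2(x)                                    # (5) project 2h back to h dimensions
```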
Finally, the prediction heads map the decoded feature volumes O^l into the output colour channels T̂_ab^l. Each prediction head is composed of an e-dimensional 3 x 3 convolution plus ReLU activation and a 1 x 1 convolution plus hyperbolic tangent (Tanh) activation to generate the ab colour channels.
Given two projected sources of features F_T^lh, F_R^lh, relative to the target and reference respectively, the goal of the axial attention module is to combine them in such a way that the style codified in the reference features is transferred into similar content areas within the target features.
The style transfer from reference source to target source using attention will now be described. The proposed arrangement makes use of attention to perform the process faster than known arrangements and in an unsupervised way. The attention mechanism improves upon prior arrangements and solves the semantic analogy problem in style transfer by focusing on the most relevant areas of the style source when decoding each voxel in the content source.
Following the original definition of attention, given projected target and reference feature maps F_T^lh, F_R^lh, the fused feature map at position o = (i, j), F_TR,o^lh, is computed as follows (equation 1):

F_TR,o^lh = Σ_{p ∈ M} softmax_p(q_T,o^T k_R,p) v_R,p

where M is the whole 2D location lattice, the queries to the target source q_T,o = W_q F_T,o^lh and the keys and values from the reference source k_R,p = W_k F_R,p^lh, v_R,p = W_v F_R,p^lh are all linear projections of the projected target and reference sources F_T,o^lh, F_R,p^lh for all o, p ∈ M, and W_q, W_k, W_v ∈ R^(h x h) are all learnable parameters. The softmax_p denotes a softmax operation applied to all possible p positions within the 2D lattice M. As is known to the skilled person, the softmax operation is a form of logistic regression that normalises an input value into a vector of values that follows a probability distribution whose total sums to 1. The output values are in the range [0, 1].
Next, a position-sensitive learned positional encoding is adopted to encourage the attention modules to dynamically model a prior on where to look within the receptive field of the reference source (an m x m region within M). Positional encodings have proven to be beneficial in computer vision tasks to exploit spatial information and capture shapes and structures within the sources of input features. Moreover, key-, query- and value-dependent positional encodings are applied to equation 1 as follows (equation 2):

F_TR,o^lh = Σ_{p ∈ M_mxm(o)} softmax_p(q_T,o^T k_R,p + q_T,o^T r^q_p-o + k_R,p^T r^k_p-o)(v_R,p + r^v_p-o)

where M_mxm(o) is the local m x m region centred around location o = (i, j) and r^q_p-o, r^k_p-o, r^v_p-o are the learned relative positional encodings for queries, keys and values respectively. The inner products q_T,o^T r^q_p-o and k_R,p^T r^k_p-o measure the compatibilities from location p to o within the queries and keys space, and r^v_p-o guides the output F_TR,o^lh to retrieve content within the values space.
Finally, axial attention is adopted to reduce the complexity of the original formulation from O(hwm^2) to O(hwm) by instead computing the attention operations along a 1-dimensional axial lattice 1 x m. Following the formulation in standalone axial-DeepLab, the global attention operation is simplified by defining an axial-attention layer that propagates the information along the width axis followed by another one along the height axis. The equation is modified as follows to incorporate axial attention (equation 3):

F_TR,o^lh = Σ_{p ∈ M_1xm(o)} softmax_p(q_T,o^T k_R,p + q_T,o^T r^q_p-o + k_R,p^T r^k_p-o)(v_R,p + r^v_p-o)

In this embodiment, we set a span m (m = h, w) equal to the input image resolution (O(hw(h + w))), but such a value can be reduced for high resolution inputs. Finally, multi-head attention can be performed by applying N single axial attention heads with head-dependent projections W_q^n, W_k^n, W_v^n, subsequently concatenating the results of each head and projecting the final output maps by means of an output 1 x 1 convolution.
As shown in Figure 3, a succession of multi-head width-height axial attention layers is integrated to design the axial attention module for unsupervised style transfer. Given F_T^lh, F_R^lh inputs, the module performs a three-fold operation: (1) normalises the target and reference projected sources by means of batch normalisation plus ReLU activation, (2) fuses the normalised sources by means of the multi-head width-height axial attention layers and (3) adds the resulting features to the target source identity F_T^lh and activates the output with a ReLU activation.
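The three-fold operation can be sketched as follows, assuming PyTorch and an axial attention layer such as the AxialAttention2d sketched earlier; the module and parameter names are illustrative.

```python
# Sketch of the axial attention module's three-fold operation, assuming PyTorch
# and the AxialAttention2d sketch given earlier.
import torch.nn as nn

class AxialAttentionModule(nn.Module):
    def __init__(self, h_dim, heads=4):
        super().__init__()
        self.norm_t = nn.Sequential(nn.BatchNorm2d(h_dim), nn.ReLU(inplace=True))
        self.norm_r = nn.Sequential(nn.BatchNorm2d(h_dim), nn.ReLU(inplace=True))
        self.attn = AxialAttention2d(h_dim, heads)   # width-height axial attention layers
        self.act = nn.ReLU(inplace=True)

    def forward(self, f_target, f_reference):
        t = self.norm_t(f_target)            # (1) normalise both projected sources
        r = self.norm_r(f_reference)
        fused = self.attn(t, r)              # (2) fuse via multi-head width-height axial attention
        return self.act(fused + f_target)    # (3) add the target identity and activate
```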
In summary, the fused feature map between target and reference, as in equation 3, provides an attention output using the concept of axial attention (rows and columns) and thereby simplifies the use of attention for this image processing task. The fused feature map thereby provides relevant characteristics of a reference image (in this example, colour) to be transferred to a target image.
Training Process
The process described above applies both to training and inference. We will now describe aspects specific to the training process.
Usually, the objective of colourisation is to encourage the predicted T̂_ab colour channels to be as close as possible to the ground truth T_ab ones in the original training dataset. However, this does not apply in exemplar-based colourisation, where T̂_ab should be customised by the colour reference R_Lab while conserving the content of the grayscale target T_L. Therefore, the definition of the training strategy is not straightforward, as comparing T̂_ab and T_ab is not an accurate way of determining accuracy if the exemplar reference image is not the same image as the target image. Nonetheless, we have appreciated that a training process can be operated using exemplar images that are not the same as the target images.
The goal will therefore be to encourage the reliable transfer of reference colours to the target content towards obtaining a colour prediction faithful to the reference. This embodiment takes advantage of the pyramidal decoder to combine state-of-the-art exemplar-based losses with adversarial training at multiple resolutions. Hence, a multi-loss training strategy is proposed to combine a smooth-L1 loss, a colour histogram loss and a total variance regularisation with a multi-scale adversarial loss by means of multiple patch-based discriminators. In order to handle multi-scale loss functions, average pooling with a factor of 2 is applied to both target and reference to successively generate the multi-scale ground truths T_ab^l and R_ab^l.

Smooth-L1 loss: in order to induce dataset priors in cases where the content of the reference highly mismatches the target, the Huber loss (also known as smooth-L1) is proposed to encourage realistic predictions. Huber loss is widely used in colourisation as a substitute for the standard L1 loss in order to avoid the averaging solution in the ambiguous colourisation problem. As noted in the Fast R-CNN paper, the smooth-L1 loss is less sensitive to outliers than the L2 loss and in some cases prevents exploding gradients. The pixel loss L_pixel^l can then be summarised as follows (equation 4):

L_pixel^l(T_ab^l, T̂_ab^l) = Σ_{i,j} 1/2 (T_ab^l(i,j) - T̂_ab^l(i,j))^2, if |T_ab^l(i,j) - T̂_ab^l(i,j)| < 1
L_pixel^l(T_ab^l, T̂_ab^l) = Σ_{i,j} |T_ab^l(i,j) - T̂_ab^l(i,j)| - 1/2, otherwise

Colour histogram loss: in order to fully capture the global colour distribution of the reference image R_ab and penalise the differences with the predicted colour distribution, a colour histogram loss is considered. The computation of the colour histogram is not a differentiable process which can be integrated within the training loop. Therefore, the aforementioned exemplar-based colourisation approach approximates the histogram by means of a function similar to a bilinear interpolation. Without loss of generality, the following describes how to approximate the target histogram, but the same formulation can be applied to R. First, the ab colour space is quantised using a step d, obtaining T̂_z^Q, with z = {a, b}. Then, the histogram H_Z can be approximated as follows (equation 5):

H_Z = 1/(HW) Σ_{i,j} (1/d) max(0, d - |T̂_z^Q - T̂_z^l(i,j)|)

where Z = {A, B} is the per-channel histogram of the z = {a, b} colour channel. Finally, the histogram loss L_hist^l can be defined as a symmetric chi-squared distance as follows (equation 6):

L_hist^l = (2/Q) Σ_Z (H_T,Z^l - H_R,Z)^2 / (H_T,Z^l + H_R,Z + ε)

where ε prevents infinity overflows and Q is the number of histogram bins, dependent on d. In this work, ε = 10^-5, d = 0.1 and hence Q = 441.
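For illustration, the pixel and histogram losses might be approximated as follows, assuming PyTorch; the soft-binning implementation, the per-channel histogram layout and the function names are assumptions of this sketch rather than the exact formulation above.

```python
# Illustrative sketch of the Huber (smooth-L1) pixel loss and a symmetric
# chi-squared histogram loss with a differentiable soft-binned histogram.
import torch
import torch.nn.functional as F

def pixel_loss(pred_ab, target_ab):
    # Huber / smooth-L1 loss between predicted and ground-truth ab channels.
    return F.smooth_l1_loss(pred_ab, target_ab)

def soft_histogram(channel, d=0.1, vmin=-1.0, vmax=1.0):
    # Each pixel contributes to nearby bin centres with a triangular weight,
    # similar in spirit to a bilinear interpolation.
    centres = torch.arange(vmin, vmax + d, d, device=channel.device)
    diff = (channel.reshape(-1, 1) - centres.reshape(1, -1)).abs()
    weights = torch.clamp(1.0 - diff / d, min=0.0)
    hist = weights.sum(dim=0)
    return hist / hist.sum()

def histogram_loss(pred_ab, ref_ab, eps=1e-5):
    # Symmetric chi-squared distance between predicted and reference colour histograms.
    loss = 0.0
    for ch in range(2):  # a and b channels
        hp = soft_histogram(pred_ab[:, ch])
        hr = soft_histogram(ref_ab[:, ch])
        loss = loss + 2.0 * ((hp - hr) ** 2 / (hp + hr + eps)).sum()
    return loss
```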
In order to encourage low variance among neighbouring pixels of the predicted colour channels T̂_ab^l, a total variance regularisation, widely used in the style transfer literature, is adopted. The total variance loss L_TV^l is computed as follows (equation 7):

L_TV^l = Σ_{i,j} |T̂_ab^l(i+1, j) - T̂_ab^l(i, j)| + |T̂_ab^l(i, j+1) - T̂_ab^l(i, j)|

Although the histogram loss encourages the prediction to contain reference colours, it does not consider spatial information nor discriminate between how realistically different object instances are colourised.
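A minimal sketch of this regularisation, assuming PyTorch tensors of shape B x 2 x H x W for the predicted ab channels:

```python
def tv_loss(pred_ab):
    # Total variance regularisation: sum of absolute differences between neighbouring pixels.
    dh = (pred_ab[:, :, 1:, :] - pred_ab[:, :, :-1, :]).abs().sum()
    dw = (pred_ab[:, :, :, 1:] - pred_ab[:, :, :, :-1]).abs().sum()
    return dh + dw
```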
With the aim of guiding the previous losses towards realistic decisions, an adversarial strategy based on LS-GAN is proposed, using the ground truth colour targets T_Lab^l as real sources and a patch-based discriminator D^l. Note that within the GAN framework, the proposed exemplar-based colourisation network would be the generator. The generator loss L_G^l and discriminator loss L_D^l are computed as follows (equation 8):

L_D^l = 1/2 E[(D^l(T_Lab^l) - 1)^2] + 1/2 E[(D^l(T̂_Lab^l))^2]
L_G^l = 1/2 E[(D^l(T̂_Lab^l) - 1)^2]

The total discriminator loss L_D is computed by adding the L individual multi-scale losses as follows (equation 9):

L_D = Σ_{l=1}^{L} L_D^l

Finally, the total multi-scale loss is computed as follows (equation 10):

L_total = Σ_{l=1}^{L} (λ_pixel L_pixel^l + λ_hist L_hist^l + λ_TV L_TV^l + λ_G L_G^l)

where λ_pixel, λ_hist, λ_TV and λ_G are the multi-loss weights which specify the contribution of each individual loss.
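The adversarial terms and the total multi-scale combination might be sketched as follows, assuming PyTorch, Lab tensors with the L channel first, and the loss functions sketched above; the discriminator interface and the lambda values are placeholders rather than values specified by the embodiment.

```python
# Sketch of the LS-GAN terms and the multi-scale total loss, assuming PyTorch.
import torch

def lsgan_d_loss(disc, real_lab, fake_lab):
    # Discriminator loss for one scale (equation 8, first line).
    return 0.5 * ((disc(real_lab) - 1) ** 2).mean() + 0.5 * (disc(fake_lab.detach()) ** 2).mean()

def lsgan_g_loss(disc, fake_lab):
    # Generator loss for one scale (equation 8, second line).
    return 0.5 * ((disc(fake_lab) - 1) ** 2).mean()

def total_loss(preds, targets, refs, discriminators,
               lam_pixel=1.0, lam_hist=1.0, lam_tv=1.0, lam_g=1.0):
    # Sum of the per-scale losses (equation 10); preds/targets/refs are lists indexed
    # by scale, with targets/refs as B x 3 x H x W Lab tensors (channel 0 = L).
    loss = 0.0
    for pred, tgt, ref, disc in zip(preds, targets, refs, discriminators):
        fake_lab = torch.cat([tgt[:, :1], pred], dim=1)  # ground-truth L with predicted ab
        loss = loss + (lam_pixel * pixel_loss(pred, tgt[:, 1:])
                       + lam_hist * histogram_loss(pred, ref[:, 1:])
                       + lam_tv * tv_loss(pred)
                       + lam_g * lsgan_g_loss(disc, fake_lab))
    return loss
```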
The loss functions described above can be used to alter the filters applied throughout the proposed network, or only in selected blocks (e.g. all blocks except in the pre-trained encoder backbone). In particular, by successively altering the filters and determining the loss, the filters that jointly minimise various considered losses may be determined. These filters are then set to provide a trained network to be used for inference.
The preferred implementation of adapting the network using the loss function is a backpropagation operation. Backpropagation in neural networks is an operation in which we first feed the input to the network, compute the loss function and then the gradients (the derivatives of the loss function with respect to each network parameter or kernel), and then apply the optimiser that finds the next values of the network parameters using the computed gradients. The backpropagation process is a well-known concept in deep learning / machine learning; it is the way a network is trained in general.
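For illustration, one backpropagation training step might look as follows, assuming PyTorch; the model (here the ImageEnhancer sketched earlier), the choice of optimiser and the stand-in L1 loss are assumptions of this sketch rather than the losses described above.

```python
# Minimal sketch of one backpropagation training step, assuming PyTorch.
import torch

model = ImageEnhancer()                                  # any of the networks sketched above
optimiser = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(target, reference, ground_truth):
    optimiser.zero_grad()
    prediction = model(target, reference)                # forward pass
    loss = torch.nn.functional.l1_loss(prediction, ground_truth)  # stand-in loss function
    loss.backward()                                      # compute gradients of the loss
    optimiser.step()                                     # update the network parameters (filters)
    return loss.item()
```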
INTERPOLATION
Overview
The interpolation embodiment will now be described with reference to figures 4 onwards. As with the colourisation embodiment, transformers are used.
Transformers have been predominantly used for natural language processing algorithms. In this embodiment, we propose to adapt the methods proposed by others for the task of video frame interpolation. We also investigate the impact of positional embeddings on the performance of the network and introduce the concept of creating a semi-siamese network using transformers by duplicating the encoder and using that as an input to the transformer decoder. We also evaluate the impact of a transformer on the original AdaCoF (adaptive collaboration of flows for video frame interpolation) network without any of the proposed modifications.
Architecture
The overall architecture of the arrangement embodying this aspect of the invention is a convolutional neural network backbone in which an attention mechanism is inserted using a Transformer. The attention mechanism effectively provides additional context to interpret pixels, providing advantages in various scenarios such as the handling of occlusions and other complex interpolation requirements. The concept of attention gives context from the images separately from the convolutional neural network backbone and allows a variety of types of processing to provide the attention input, thereby varying the responsiveness of the system to different types of scenario.
The arrangement shown in Figure 4 comprises an encoder A 301 arranged to receive first and second images, denoted image 1 and image 2, which are a target and a reference image. The encoder A 301 and decoder A 302 together form a convolutional neural network backbone. An additional feature in contrast to conventional neural network backbones is an attention mechanism 306. This is arranged to provide attention as part of a transformer network in a similar manner to the concept of attention as used in natural language processing mentioned above.
Image 1 and image 2 are two images in an input sequence of video images. When used in a training mode, images 1 and 2 could be images in a sequence of video frames for which an intervening video frame exists. The arrangement will then make a prediction of that intervening video frame and a comparison made, for example using a loss function as described later. In this way, training of the system is effectively achieved using a given sequence of video frames. In an inference mode, the trained network can take a sequence of video frames and produce an interpolated frame between each of the input image frames 1 and 2 in the sequence.
Encoder B 303, 304 is an additional network, such as a context network, though other sorts of networks, such as a depth network, could also be applied. The purpose of this additional network is to provide an input to the attention mechanism 306. Two instances of encoder B are shown in the figure, encoder B 303 and encoder B 304, respectively receiving the same two images, image 1 and image 2, on respective inputs. The encoders are shown as separate encoders, but may simply be two instances of the same encoder arranged to process image 1 and image 2 separately. In such an arrangement, there would be two forward passes, one for each input image, image 1 and image 2, in which contextual features are obtained; these are later concatenated in a concatenation unit 305 in the channel dimension. This concatenated output in the channel dimension is then used as input to the attention mechanism 306.
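For illustration, the context branch might be sketched as follows, assuming PyTorch and a recent torchvision; using the torchvision ResNet18 trunk as encoder B is an assumption of this sketch (the ResNet18 encoder is described in more detail later).

```python
# Sketch of the context branch: two forward passes through the same encoder B,
# with the contextual features concatenated in the channel dimension.
import torch
import torchvision

encoder_b = torch.nn.Sequential(
    *list(torchvision.models.resnet18(weights=None).children())[:-2]  # drop avgpool and fc
)

def context_features(image1, image2):
    f1 = encoder_b(image1)              # contextual features for image 1
    f2 = encoder_b(image2)              # contextual features for image 2
    return torch.cat([f1, f2], dim=1)   # concatenation in the channel dimension
```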
Figure 5 provides a more detailed view of the overall arrangement shown in figure 4 using like reference numerals for like items. A more specific example is given for the convolutional neural network backbone using an AdaCoF encoder 301 and AdaCoF decoder 302. The output to the attention mechanism is provided by a ResNet18 encoder 303, 304. The choice of convolutional neural network backbone and encoder input for the attention mechanism may be varied, but AdaCoF and ResNet18 will be described in greater detail by way of example.
The proposed network takes the AdaCoF model and inserts a transformer encoder-decoder network 306 in between the CNN encoder-decoder. The arrangement uses a VGG CNN. The output from this VGG encoder 301 is then used as an input to the transformer encoder 310. The transformer decoder 311 receives the input from the transformer encoder 310. The other input to the transformer decoder 311 here is object queries; this is a modification to the transformer, and these queries are learnt over time. The proposed model shows the effectiveness of a standard transformer on a CNN encoder-decoder kernel-based interpolation model. The standard transformer is manipulated by varying the inputs, which in both cases come from CNNs. Instead of using queries for the transformer decoder, we opt to use the standard transformer from the "Attention is all you need" paper, meaning that our results are easily reproducible.
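A hedged sketch of inserting a standard transformer encoder-decoder between a CNN encoder and decoder is given below, assuming PyTorch; the token layout, d_model, layer counts and the use of the projected context features in place of learned object queries are assumptions of this sketch rather than details taken from the embodiment.

```python
# Sketch of a standard transformer inserted between the CNN encoder and decoder.
import torch.nn as nn

d_model = 256
transformer = nn.Transformer(d_model=d_model, nhead=8, num_encoder_layers=2,
                             num_decoder_layers=2, batch_first=True)

def apply_transformer(cnn_feats, context_feats):
    # cnn_feats: B x d_model x H x W from the CNN encoder (transformer encoder input).
    # context_feats: B x d_model x H x W from the projected context encoder output
    # (transformer decoder input, standing in for object queries).
    b, c, h, w = cnn_feats.shape
    src = cnn_feats.flatten(2).transpose(1, 2)        # B x HW x d_model
    tgt = context_feats.flatten(2).transpose(1, 2)    # B x HW x d_model
    out = transformer(src, tgt)                       # attention output tokens
    return out.transpose(1, 2).view(b, c, h, w)       # reshaped for the CNN decoder
```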
The AdaCoF encoder-decoder arrangement 301, 302 will now be described in more detail. The AdaCoF encoder is a kernel-based encoder which refines the initial model introduced by Niklaus et al. We will first describe this general kernel-based approach.
With the aim of interpolating a frame temporally between two input video frames I1 and I2, a convolution-based interpolation method takes a pair of 2D convolution kernels K1(x, y) and K2(x, y) and uses them to convolve with I1 and I2 to compute the value of an output pixel as follows (equation 11):

\hat{I}(x, y) = K_1(x, y) * P_1(x, y) + K_2(x, y) * P_2(x, y)

where P1(x, y) and P2(x, y) are the patches centred at (x, y) in I1 and I2. The pixel-dependent kernels K1 and K2 capture both the motion and the resampling information required for interpolation. The size of the kernel is chosen as a balance between accuracy and computation.
The above equation describes the pixel-by-pixel approach that may be applied to each colour channel. If an image is represented in RGB format, the above equation is applicable to each of the colour channels.
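The following sketch illustrates equation 11 in code, assuming PyTorch and assuming that the per-pixel kernels K1 and K2 have already been predicted by a network; the function name and the uniform test kernels are illustrative only.

import torch
import torch.nn.functional as F

def kernel_interpolate(I1, I2, K1, K2, kernel_size=5):
    """I1, I2: (B, C, H, W) input frames.
    K1, K2: (B, kernel_size**2, H, W) per-pixel kernels, one kernel per output pixel.
    Returns the interpolated frame of shape (B, C, H, W)."""
    B, C, H, W = I1.shape
    pad = kernel_size // 2

    def apply_kernel(I, K):
        # Extract the kernel_size x kernel_size patch P centred at every pixel.
        patches = F.unfold(I, kernel_size, padding=pad)        # (B, C*k*k, H*W)
        patches = patches.view(B, C, kernel_size ** 2, H, W)   # (B, C, k*k, H, W)
        K = K.unsqueeze(1)                                     # (B, 1, k*k, H, W)
        return (patches * K).sum(dim=2)                        # (B, C, H, W)

    return apply_kernel(I1, K1) + apply_kernel(I2, K2)

# Example: uniform averaging kernels reproduce a blurred blend of the two frames.
B, C, H, W, k = 1, 3, 64, 64, 5
I1, I2 = torch.rand(B, C, H, W), torch.rand(B, C, H, W)
K1 = torch.full((B, k * k, H, W), 0.5 / (k * k))
K2 = torch.full((B, k * k, H, W), 0.5 / (k * k))
print(kernel_interpolate(I1, I2, K1, K2, k).shape)   # torch.Size([1, 3, 64, 64])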
The kernel-based approach may be extended and implemented as a convolutional neural network with a contracting component, in which each layer applies a convolution to extract features, and an expanding part that incorporates upsampling of layers.
The preferred final stage, which convolves the input images I1 and I2 to produce the final output image, uses the AdaCoF (adaptive collaboration of flows for video frame interpolation) technique. This technique takes a further step beyond the arrangement summarised by equation 11 and convolves the input images with both adaptive kernels and offset vectors for each pixel. Unlike previous arrangements, this technique does not share kernel weights over different pixels, and so the prediction from an input image I is represented as follows (equation 12):

\hat{I}(x, y) = \sum_{k=0}^{F-1} \sum_{l=0}^{F-1} W_{k,l}(x, y)\, I\big(x + k + \alpha_{k,l}(x, y),\ y + l + \beta_{k,l}(x, y)\big)

where \alpha_{k,l} and \beta_{k,l} are offset vectors, W_{k,l} are the kernel weights for predicting an output image from the input image I, and F is the kernel size.
The AdaCoF technique thus provides a kernel-based approach in which kernels are derived using the convolutional neural network backbone, but with the addition of offset vectors and per-pixel kernel weights. Referring back to figure 5, the AdaCoF kernel output unit 307 provides the functionality to combine images I1 and I2 using the kernels derived by the AdaCoF encoder-decoder 301, 302, using a summation operation over the predictions for images I1 and I2. Whilst AdaCoF is the proposed final stage, other networks may equally be used in a similar manner.
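The following is a simplified, illustrative re-implementation of an AdaCoF-style output stage corresponding to equation 12, assuming PyTorch; the per-pixel weights W and offsets alpha, beta are assumed to have been predicted by the backbone, and bilinear sampling via grid_sample stands in for the optimised operator. The full interpolation would sum the results of this function for I1 and I2, as described above.

import torch
import torch.nn.functional as F

def adacof_like(I, W, alpha, beta, kernel_size=5):
    """I: (B, C, H, W) input frame.
    W, alpha, beta: (B, kernel_size**2, H, W) per-pixel weights and x/y offsets (in pixels).
    Returns the warped and filtered prediction of shape (B, C, H, W)."""
    B, C, H, W_ = I.shape
    device = I.device
    # Base sampling grid: the integer coordinates of every output pixel.
    ys, xs = torch.meshgrid(torch.arange(H, device=device),
                            torch.arange(W_, device=device), indexing="ij")
    out = torch.zeros_like(I)
    half = kernel_size // 2
    t = 0
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            # Displaced sampling positions for this kernel tap.
            x = xs + dx + alpha[:, t]                       # (B, H, W)
            y = ys + dy + beta[:, t]
            # Normalise to [-1, 1] as required by grid_sample.
            grid = torch.stack([2 * x / (W_ - 1) - 1,
                                2 * y / (H - 1) - 1], dim=-1)   # (B, H, W, 2)
            sampled = F.grid_sample(I, grid, mode="bilinear",
                                    padding_mode="border", align_corners=True)
            out = out + W[:, t:t + 1] * sampled             # per-pixel weight, broadcast over channels
            t += 1
    return out

# Illustrative call with zero offsets and normalised random weights.
B, C, H, W, k = 1, 3, 64, 64, 5
I1 = torch.rand(B, C, H, W)
W1 = torch.softmax(torch.rand(B, k * k, H, W), dim=1)
alpha = torch.zeros(B, k * k, H, W)
beta = torch.zeros(B, k * k, H, W)
print(adacof_like(I1, W1, alpha, beta, k).shape)            # torch.Size([1, 3, 64, 64])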
The ResNet18 encoder arrangement 303, 304 shown in figure 5 will now be described in more detail. ResNet18 is a convolutional neural network encoder of 18 layers. Each layer comprises convolution, batch normalisation and max pooling operations. ResNet18 is a residual network which operates by explicitly allowing layers to fit a residual mapping using shortcut connections, that is, connections skipping one or more layers. The shortcut connections perform identity mapping and their outputs are added to the outputs of the stacked layers. This arrangement is known to the skilled person.
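For illustration, a minimal residual block with an identity shortcut of the kind used in ResNet18 may be sketched as follows (PyTorch assumed; torchvision provides the actual ResNet18 blocks, so this is a generic sketch rather than the library implementation).

import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                          # shortcut connection (identity mapping)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity                  # added to the output of the stacked layers
        return self.relu(out)

block = BasicResidualBlock(64)
print(block(torch.randn(1, 64, 56, 56)).shape)   # torch.Size([1, 64, 56, 56])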
As shown in figure 5, the ResNet18 encoder is used as two instances, encoder 303 and encoder 304, operating respectively on images I1 and I2. The repeated process of convolution, batch normalisation and max pooling in each layer emphasises features of the input images. The concatenation operation 305 shown in figure 5 is shown in greater detail in figure 6, which shows how, for each layer, the output for I1 is concatenated with the output for I2.
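As an illustrative sketch only, a torchvision ResNet18 (recent torchvision assumed) may stand in for the two context-encoder instances, with its per-level outputs for image 1 and image 2 concatenated channel-wise; the choice of exposed levels and the feature-extraction utility are assumptions rather than the embodiment's exact implementation.

import torch
import torchvision
from torchvision.models.feature_extraction import create_feature_extractor

levels = {"layer1": "l1", "layer2": "l2", "layer3": "l3", "layer4": "l4"}
resnet = torchvision.models.resnet18(weights=None)
extractor = create_feature_extractor(resnet, return_nodes=levels)

image1 = torch.randn(1, 3, 256, 256)
image2 = torch.randn(1, 3, 256, 256)

feats1 = extractor(image1)   # dict of feature maps at four scales
feats2 = extractor(image2)

# Channel-wise concatenation per level, giving one attention input per scale.
attention_inputs = {name: torch.cat([feats1[name], feats2[name]], dim=1)
                    for name in feats1}
for name, f in attention_inputs.items():
    print(name, tuple(f.shape))
# l1 (1, 128, 64, 64), l2 (1, 256, 32, 32), l3 (1, 512, 16, 16), l4 (1, 1024, 8, 8)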
The attention mechanism 306 will now be described in more detail. As shown in figure 5, the attention mechanism 306 comprises a transformer encoder 310 and a transformer decoder 311. The concatenated output of the ResNet18 encoder is provided via a conv projection unit 312 to the transformer decoder 311 as an "attention" input. This is expanded further in figure 6, which shows how the concatenated output from each layer is provided as an attention input to the decoder.
In summary, the latent vector from the CNN encoder 301 undergoes reshaping (to meet the requirements of the transformer) and positional embedding. The modified output is then used as an input to the transformer encoder 310. The output from the transformer encoder 310 is then used as an input to the transformer decoder 311.
The other input to the transformer decoder 311 is the output from the other CNN network 303, 304.
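The following sketch illustrates this data flow with a standard PyTorch nn.Transformer: the CNN latent is projected, flattened to a token sequence and given a positional embedding before the transformer encoder, while the projected context features form the other input to the transformer decoder. The dimensions, the learned positional embedding and the 1x1 projections are illustrative assumptions, not the embodiment's exact configuration.

import torch
import torch.nn as nn

hidden_dim = 256
B, C_latent, H, W = 1, 512, 8, 8           # latent from the CNN encoder 301 (assumed size)
C_ctx = 512                                 # concatenated context features from 303/304 via 305

latent = torch.randn(B, C_latent, H, W)
context = torch.randn(B, C_ctx, H, W)

proj_latent = nn.Conv2d(C_latent, hidden_dim, kernel_size=1)    # reduce channels to 256
proj_context = nn.Conv2d(C_ctx, hidden_dim, kernel_size=1)
pos_embed = nn.Parameter(torch.zeros(H * W, 1, hidden_dim))     # learned positional embedding

def to_tokens(x):
    # (B, C, H, W) -> (H*W, B, C): one token per spatial position.
    return x.flatten(2).permute(2, 0, 1)

src = to_tokens(proj_latent(latent)) + pos_embed        # transformer encoder input
tgt = to_tokens(proj_context(context)) + pos_embed      # transformer decoder "attention" input

transformer = nn.Transformer(d_model=hidden_dim, nhead=8,
                             num_encoder_layers=6, num_decoder_layers=6)
decoded = transformer(src, tgt)                         # (H*W, B, hidden_dim)

# Back to a spatial map for the CNN decoder 302.
decoder_input = decoded.permute(1, 2, 0).reshape(B, hidden_dim, H, W)
print(decoder_input.shape)                              # torch.Size([1, 256, 8, 8])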
Figure 6 shows a more detailed explanation of the arrangement summarised in figure 5, with like reference numerals used for like features. The arrangement is an expanded view showing the connections between the layers of the encoder and decoder and indicating the functions performed within each layer, such as average pooling, concatenation, upsampling and so on. As before, an encoder convolutional neural network 301 is arranged to receive two images of a sequence, I1 and I2, and provides a succession of layers with respective filtering and average pooling operations.
Three encoders labelled A, B and C are shown, each comprising four blocks. In addition to concatenated outputs to each successive block in turn, the encoder also comprises skip connections to the decoder 302, which includes decoding layers and upsampling operations. Inserted in the network is the transformer arrangement comprising the transformer encoder and decoder. The input to the transformer encoder 310 is from the encoder 301. Separately, a context input is provided to the transformer decoder, in this example being an input from an AdaCoF encoder. The final stage is the output unit which, in this case, performs AdaCoF operations on image frames provided separately to that output unit, using kernels derived by the encoder-transformer-decoder arrangement.

Figure 7 shows an alternative arrangement having similar functionality to the arrangement shown in figure 5, but incorporating a modified AdaCoF encoder as a way of providing features for the attention input.
Figure 8 shows a further alternative arrangement in which an AdaCoF encoder is used in place of the ResNet18 encoder to provide features for the attention input.
Training and Inference Process

The training and inference processes are similar to one another, in that consecutive images of a sequence are provided as input image 1 and input image 2 and an output image is derived. The training process differs in that the derived output image may be compared to a ground truth image, which may itself be an image in the video sequence between image 1 and image 2.
Three different variations of this embodiment are now described for each of the respective architectures of figures 5, 7 and 8.
The model variants tested as the transformer decoder input are the following:

A. Using a set layer from the ResNet18 network to obtain the context maps for I1 and I2 (the input images). The output is then concatenated in the channel dimension. (First option, as shown in figure 5.)

B. Using a layer from ResNet18 but introducing a context interpolation network. (Second option, as shown in figure 7.)

C. Adding the transformer encoder output and the output from a CNN (e.g. ResNet18). This sum is then used as the primary input to the transformer decoder; the secondary input is the standard transformer encoder output.

D. Not using any context network, instead using the output of an identical AdaCoF encoder as the input to the transformer decoder. Different scenarios, such as the encoders having shared or different weights, are tested. (Third option, as shown in figure 8.)

E. Testing the networks with ResNet18 but fixing the weights for the entire training session.

Two different types of positional embedding are used. During the experiments, different combinations of the sine and learned embeddings are used.
The sine embedding stays fixed throughout the entire training session. The learned embedding, as the name suggests, is modified as training proceeds.
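The two embedding variants may be sketched as follows (PyTorch assumed); the token shapes and the sin/cos interleaving convention follow the "Attention is all you need" formulation and are otherwise illustrative.

import math
import torch
import torch.nn as nn

def sine_positional_embedding(num_tokens, d_model):
    """Fixed sin/cos embedding; never updated during training."""
    position = torch.arange(num_tokens).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-math.log(10000.0) / d_model))
    pe = torch.zeros(num_tokens, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe                                   # (num_tokens, d_model)

class LearnedPositionalEmbedding(nn.Module):
    """Embedding stored as a parameter and updated by backpropagation."""
    def __init__(self, num_tokens, d_model):
        super().__init__()
        self.pe = nn.Parameter(torch.zeros(num_tokens, d_model))
        nn.init.normal_(self.pe, std=0.02)

    def forward(self, tokens):                  # tokens: (num_tokens, B, d_model)
        return tokens + self.pe.unsqueeze(1)

tokens = torch.randn(64, 2, 256)                # 8x8 spatial positions, batch of 2
fixed = tokens + sine_positional_embedding(64, 256).unsqueeze(1)
learned = LearnedPositionalEmbedding(64, 256)(tokens)
print(fixed.shape, learned.shape)               # torch.Size([64, 2, 256]) for both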
The rationale behind using a ResNet18 as the contextual network comes from prior work; however, provisional tests on VGG16 networks suggest that other context networks are just as effective. All the main experiments use a hidden dimension of 256 channels. This takes the lead from the DETR paper, though different hidden dimension values are tested in the ablation study. Thus, as the input in all cases has 512 channels, a convolution is used to reduce the number of channels to 256. Different experiments have been conducted to see whether using shared or different weights for the projections of the transformer encoder and decoder inputs is most efficient.
The three distinct approaches are shown in Figures 5, 7 and 8.
The first approach is to use a ResNet18 network and concatenate the features at each respective level.

The second approach is to take the output of a certain ResNet layer, e.g. the 2nd layer, and concatenate these outputs. These outputs are then used as the input to a new context interpolation network, which takes an input size of 64 and outputs a feature map of size 256. A similar architecture to the AdaCoF encoder is used here.

The third approach is to use two AdaCoF encoders without ResNet18.

For all three approaches, the output might go through a 1x1 convolution to reduce the dimensionality from 512 to 256 (the benefits of doing so still need to be examined). An embedding (sine, cosine or learned) is applied to the transformer encoder and decoder inputs.
Training Process

The key steps of the training process, which uses the architecture already described, will now be set out.
- The network is trained for 100 epochs.
- The learning rate is initially set to 0.001 and is halved every 20 epochs.
- The Adamax optimiser is used, with hyperparameters set to the default values Beta1 = 0.9, Beta2 = 0.999.
- The batch size is 4.
- Crops of 256x256 are taken and augmented.
- Augmentation is one of the following: horizontal flipping, vertical flipping, or swapping the frame order. The probability for swapping the frames is set to 0.5.
- Most of this is the same training procedure as AdaCoF; a different number of epochs and a different loss function are used.
- The L1 loss function is used for training the entirety of the network.

As previously explained, the training data may be a sequence of video frames in which two video frames separated by an intermediate frame are provided as the inputs, with the intermediate frame providing the ground truth. The L1 loss function is then computed by comparing the derived output frame and the intermediate frame. A sketch of a training loop with these settings follows this list.
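The following is a minimal, illustrative training loop using the settings listed above (PyTorch assumed). The tiny stand-in model and the synthetic dataset are placeholders for the encoder-transformer-decoder network and for real triplets of video frames.

import random
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class TinyInterpolator(nn.Module):
    """Stand-in for the full network: blends the two frames with a 1x1 convolution."""
    def __init__(self):
        super().__init__()
        self.blend = nn.Conv2d(6, 3, kernel_size=1)

    def forward(self, frame1, frame2):
        return self.blend(torch.cat([frame1, frame2], dim=1))

def augment(frame1, gt, frame2):
    # Horizontal flip, vertical flip and frame-order swap, each with probability 0.5.
    if random.random() < 0.5:
        frame1, gt, frame2 = (torch.flip(f, dims=[-1]) for f in (frame1, gt, frame2))
    if random.random() < 0.5:
        frame1, gt, frame2 = (torch.flip(f, dims=[-2]) for f in (frame1, gt, frame2))
    if random.random() < 0.5:
        frame1, frame2 = frame2, frame1
    return frame1, gt, frame2

# Synthetic stand-in for triplets of 256x256 crops: (frame 1, ground-truth middle frame, frame 2).
data = TensorDataset(torch.rand(8, 3, 256, 256),
                     torch.rand(8, 3, 256, 256),
                     torch.rand(8, 3, 256, 256))
loader = DataLoader(data, batch_size=4, shuffle=True)

model = TinyInterpolator()
optimiser = torch.optim.Adamax(model.parameters(), lr=0.001, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.StepLR(optimiser, step_size=20, gamma=0.5)   # halve every 20 epochs
criterion = nn.L1Loss()

for epoch in range(100):
    for frame1, gt, frame2 in loader:
        frame1, gt, frame2 = augment(frame1, gt, frame2)
        prediction = model(frame1, frame2)       # interpolated intermediate frame
        loss = criterion(prediction, gt)         # L1 loss against the ground-truth frame
        optimiser.zero_grad()
        loss.backward()
        optimiser.step()
    scheduler.step()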
As a result of the training, the filters and the encoder of the convolutional neural network backbone are varied until the chosen loss function is satisfied, for example by minimising the loss function. At that point, the selected filters are stored within the convolutional neural network thereby creating a pre-trained network to be used for image processing.
Figure 10 shows an apparatus that may embody the invention in its various aspects, comprising a repository of images as an image source 22, an input arranged to receive and provide those images in an appropriate format to a processor 26 and memory 20, which in turn provide one or more output images on an output 28.
The processor 26 and memory 20 together may provide the functionality of each of the modules described in the embodiments.
The apparatus shown in figure 10 may be a mobile device, a device as part of a studio processing chain, a broadcast arrangement or other device arranged to receive and process still images, video images or audio video content.

Claims (52)

  1. A method of processing image data, comprising: - receiving one or more target images; - receiving at least one reference source; - processing the at least one reference source and the one or more target images to extract features using a neural network; - processing the features of the one or more target images and the at least one reference source using an attention mechanism to provide an attention output; - providing the attention output and the extracted features as an input to a neural network decoder; - processing the extracted features of the one or more target images and the attention output using the neural network decoder to produce one or more output images.
  2. The method of claim 1, wherein the attention mechanism comprises a transformer network.
  3. The method of claim 2, wherein the transformer is an axial transformer.
  4. The method of claim 1, 2 or 3, wherein the neural network includes a convolutional neural network.
  5. The method of claim 1, 2 or 3, wherein the extracted features are provided to the neural network decoder at a lowest scale from the encoder and also at respective higher scales to respective higher layers of the neural network decoder via skip connections.
  6. The method of any preceding claim, wherein the input to the neural network decoder comprises multiple inputs.
  7. The method of claim 5, wherein the multiple inputs comprise inputs at respective scales to respective layers of the decoder for the extracted features of the one or more target images.
  8. The method of any preceding claim, wherein the attention output is provided at respective scales from the neural network encoder to respective layers of the neural network decoder.
  9. The method of claim 7, wherein the attention output is provided at some but not all encoder levels.
  10. The method of any preceding claim, wherein processing the at least one reference source and the one or more target images to extract features using a neural network comprises processing with multiple neural networks.
  11. The method of any preceding claim, wherein processing the extracted features of the one or more target images and the attention output using the neural network decoder to produce one or more output images includes receiving the one or more target images and processing to provide output images.
  12. The method of any preceding claim, wherein the reference source comprises at least one reference colour image and producing the output image comprises taking each target image and adding colour channels derived by the features and attention.
  13. The method of claim 12, wherein the extracting features comprises producing feature maps of the one or more target images and the at least one reference image using filters.
  14. The method of claim 13, wherein producing feature maps comprises producing feature maps using a series of convolutional neural network encoders.
  15. The method of claim 14, wherein producing the feature map in each encoder comprises producing features at successively lower resolutions, preferably at half of the spatial resolution of the previous encoder.
  16. The method of claim 15, wherein the feature maps from both the target and reference images are projected at a common hidden dimensionality by means of a 1x1 convolution and activated with a Rectified Linear Unit (ReLU).
  17. The method of any preceding claim, wherein processing the features to derive an attention output comprises deriving attention outputs at each of multiple resolutions.
  18. The method of claim 17, wherein deriving the attention outputs comprises controlling the contribution of each vector of features of the reference source to the computation of the vector of fused features for a spatial location of the target source, by normalising the target and reference projected sources of features, and applying a multi-head attention operation to the normalised features to obtain fused features.
  19. The method of claim 18, further comprising adding the obtained fused features to the processed one or more target images at each respective resolution.
  20. The method of any preceding claim, wherein deriving the attention output at a given resolution is applied a plurality of N times, thereby improving prediction capacity.
  21. The method of any preceding claim, wherein deriving the attention output comprises deriving an axial attention along rows and/or columns of the features derived from the target and reference images.
  22. The method of any preceding claim, wherein the neural network decoder comprises multiple decoder layers.
  23. The method of claim 22, wherein the processing using the neural network decoder comprises processing with a pyramid-based decoder to apply multi-scale fused features from the attention output to reconstruct a colour prediction.
  24. The method of claim 23, wherein the input to the pyramid-based decoder is projected to a common hidden dimensionality by means of a 1x1 convolution and activated with a Rectified Linear Unit (ReLU).
  25. The method of claim 24, wherein each of the multiple decoders is arranged to sum the features from the previous decoder with fused features at the same resolution.
  26. The method of claim 24 or 25, wherein each of the multiple decoders is arranged to perform a 3x3 convolution plus ReLU activation and to upsample by a factor of 2 by means of a 2D bilinear interpolation.
  27. The method of claim 24, 25 or 26, wherein each of the multiple decoders is arranged to concatenate upsampled features with target projected features at the same resolution, increasing the hidden dimensionality by a factor of two.
  28. The method of claim 27, wherein each of the multiple decoders provides an output via a further 3x3 convolution plus ReLU activation to generate output features for the next decoder, reduced back from the increased dimensionality due to concatenation.
  29. The method of any of claims 25 to 28, wherein mapping to two colour channels is provided by each of the multiple decoders providing an output to a 3x3 convolution plus ReLU which yields features at a dimensionality less than the hidden dimensionality.
  30. The method of claim 29, wherein the mapping to two colour channels further comprises a 1x1 convolution plus hyperbolic tangent (Tanh) activation.
  31. The method of claim 29, wherein a colour image at each respective resolution is generated by stacking the resultant colour channels with the collocated luminance, the result of downsampling the original luminance N times by a factor of 2 by means of average pooling.
  32. A method of training a neural network to process images, comprising: - receiving one or more target images; - receiving at least one reference source; - processing the one or more target images and the at least one reference source according to the processing of any preceding claim to produce one or more output images; - computing a loss function; and - updating filters of the neural network based on the computed loss function.
  33. The method of claim 32, wherein the at least one reference source comprises a reference image.
  34. The method of claim 33, wherein the loss function comprises a multi-scale multi-loss function, derived by computing a total loss by adding the individual losses at different resolutions.
  35. The method of claim 33 or 34, wherein the computed loss function is computed at multiple scales as a result of using a multiscale decoder, so that the learning of the decoder guides the learning of the attention modules at multiple scales.
  36. The method of claim 35, wherein the multiscale loss function leads the neural network predictions towards size invariance, encouraging the attention modules to learn multi-scale patterns and gain precision in local areas.
  37. The method of any of claims 33 to 36, wherein the learning modules of the neural network distil knowledge from the pre-trained backbone, being able to process meaningful features in accordance with the semantic characteristics of both target and reference inputs.
  38. The method of any preceding claim, wherein processing using the neural network decoder to produce one or more output images comprises transforming the colour of pixels of the target images to produce the output images.
  39. The method of claim 38, wherein transforming the colour of pixels comprises transforming greyscale to colour, using the colours of the reference images.
  40. The method of any preceding claim, wherein the reference source comprises first and second images of a sequence of images and producing the one or more output images comprises interpolating between at least two target images to produce an interpolated output image.
  41. The method of claim 40, wherein the first and second images are the same as the at least two target images.
  42. The method of claim 40 or 41, wherein producing the interpolated output image comprises interpolating between the at least two target images using kernels and offsets produced from the features and attention.
  43. The method of claim 40, wherein the first encoder decoder is an AdaCoF encoder decoder.
  44. The method of any of claims 40 to 43, wherein the second encoder is a ResNet18 encoder.
  45. The method of any of claims 40 to 44, wherein the attention output is provided via a modified AdaCoF encoder.
  46. The method of any of claims 40 to 45, wherein the second encoder is a second AdaCoF encoder.
  47. The method of any of claims 40 to 46, wherein processing the first and second images using the neural network encoder comprises processing the first and second images separately and concatenating the output to provide the output to the transformer network to provide the attention output.
  48. The method of claim 47, wherein the attention output is provided at each layer of the second convolutional neural network encoder to a respective layer of the first convolutional neural network decoder.
  49. A method of training a neural network to process images, comprising: - receiving first and second images of a sequence of images and an intermediate image between the first and second images of the sequence; - processing the first and second images according to the processing of any of claims 40 to 48 to produce an interpolated image; - computing a loss function between the interpolated image and the intermediate image; and - updating filters of the convolutional neural network based on the computed loss function.
  50. Apparatus for processing image data, comprising: - means for receiving one or more target images; - means for receiving at least one reference source; - means for processing the at least one reference source and the one or more target images to extract features using a neural network; - means for processing the features of the one or more target images and the at least one reference source using an attention mechanism to provide an attention output; - means for providing the attention output and the extracted features as an input to a neural network decoder; - means for processing the extracted features of the one or more target images and the attention output using the neural network decoder to produce one or more output images.
  51. The apparatus of claim 50, wherein the attention mechanism comprises a transformer network.
  52. An image processing arrangement comprising an input, output, processor and memory arranged to undertake the method of any of claims 1 to 49.
GB2103715.5A 2021-03-17 2021-03-17 Imaging processing using machine learning Pending GB2604898A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
GB2103715.5A GB2604898A (en) 2021-03-17 2021-03-17 Imaging processing using machine learning
PCT/GB2022/050680 WO2022195285A1 (en) 2021-03-17 2022-03-17 Image processing using machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB2103715.5A GB2604898A (en) 2021-03-17 2021-03-17 Imaging processing using machine learning

Publications (2)

Publication Number Publication Date
GB202103715D0 GB202103715D0 (en) 2021-04-28
GB2604898A true GB2604898A (en) 2022-09-21

Family

ID=75623164

Family Applications (1)

Application Number Title Priority Date Filing Date
GB2103715.5A Pending GB2604898A (en) 2021-03-17 2021-03-17 Imaging processing using machine learning

Country Status (2)

Country Link
GB (1) GB2604898A (en)
WO (1) WO2022195285A1 (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113538654B (en) * 2021-06-11 2024-04-02 五邑大学 Skull implant image generation method, device and computer readable storage medium
CN113762278B (en) * 2021-09-13 2023-11-17 中冶路桥建设有限公司 Asphalt pavement damage identification method based on target detection
CN114119348A (en) * 2021-09-30 2022-03-01 阿里巴巴云计算(北京)有限公司 Image generation method, apparatus and storage medium
US20230144637A1 (en) * 2021-11-10 2023-05-11 Adobe Inc. Multi-stage attention model for texture synthesis
CN114692733A (en) * 2022-03-11 2022-07-01 华南理工大学 End-to-end video style migration method, system and storage medium for inhibiting time domain noise amplification
CN114693670B (en) * 2022-04-24 2023-05-23 西京学院 Ultrasonic detection method for weld defects of longitudinal submerged arc welded pipe based on multi-scale U-Net
CN114821510B (en) * 2022-05-26 2024-06-14 重庆长安汽车股份有限公司 Lane line detection method and device based on improved U-Net network
CN115375601B (en) * 2022-10-25 2023-02-28 四川大学 Decoupling expression traditional Chinese painting generation method based on attention mechanism
CN115511767B (en) * 2022-11-07 2023-04-07 中国科学技术大学 Self-supervised learning multi-modal image fusion method and application thereof
CN116029947B (en) * 2023-03-30 2023-06-23 之江实验室 Complex optical image enhancement method, device and medium for severe environment
CN116563445B (en) * 2023-04-14 2024-03-19 深圳崇德动漫股份有限公司 Cartoon scene rendering method and device based on virtual reality
CN117746045B (en) * 2024-02-08 2024-05-28 江西师范大学 Method and system for segmenting medical image by fusion of transducer and convolution
CN117765378B (en) * 2024-02-22 2024-04-26 成都信息工程大学 Method and device for detecting forbidden articles in complex environment with multi-scale feature fusion

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
HANTING CHEN ET AL: "Pre-Trained Image Processing Transformer", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 1 December 2020 (2020-12-01), XP081827380 *
MANOJ KUMAR ET AL: "Colorization Transformer", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 7 March 2021 (2021-03-07), XP081902447 *
MARC GORRIZ BLANCH ET AL: "Attention-based Stylisation for Exemplar Image Colourisation", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 4 May 2021 (2021-05-04), XP081958284 *
SALMAN KHAN ET AL: "Transformers in Vision: A Survey", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 22 February 2021 (2021-02-22), XP081883407 *
VASWANI ET AL.: "Attention is all you need", 31ST CONFERENCE ON NEURAL INFORMATION PROCESSING SYSTEMS, 2017
YANG FUZHI ET AL: "Learning Texture Transformer Network for Image Super-Resolution", 2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), IEEE, 13 June 2020 (2020-06-13), pages 5790 - 5799, XP033805330, DOI: 10.1109/CVPR42600.2020.00583 *

Also Published As

Publication number Publication date
WO2022195285A1 (en) 2022-09-22
GB202103715D0 (en) 2021-04-28

Similar Documents

Publication Publication Date Title
GB2604898A (en) Imaging processing using machine learning
Zheng et al. Ultra-high-definition image dehazing via multi-guided bilateral learning
Alsaiari et al. Image denoising using a generative adversarial network
KR20210016587A (en) Method and apparatus for generating displacement maps of pairs of input data sets
KR20220066945A (en) Image processing method, apparatus, electronic device and computer readable storage medium
US20230051960A1 (en) Coding scheme for video data using down-sampling/up-sampling and non-linear filter for depth map
CN112771578B (en) Image generation using subdivision scaling and depth scaling
CN110689599A (en) 3D visual saliency prediction method for generating countermeasure network based on non-local enhancement
CN114119975A (en) Language-guided cross-modal instance segmentation method
CN115330912B (en) Training method for generating human face speaking video based on audio and image driving
CN116309648A (en) Medical image segmentation model construction method based on multi-attention fusion
Salmona et al. Deoldify: A review and implementation of an automatic colorization method
WO2024109336A1 (en) Image repair method and apparatus, and device and medium
CN115578631B (en) Image tampering detection method based on multi-scale interaction and cross-feature contrast learning
CN116168329A (en) Video motion detection method, equipment and medium based on key frame screening pixel block
CN116485867A (en) Structured scene depth estimation method for automatic driving
CN114549387A (en) Face image highlight removal method based on pseudo label
CN114240811A (en) Method for generating new image based on multiple images
Deck et al. Easing color shifts in score-based diffusion models
CN116612416A (en) Method, device and equipment for dividing video target and readable storage medium
CN114882405B (en) Video saliency detection method based on space-time double-flow pyramid network architecture
Nguyen-Truong et al. Srgan with total variation loss in face super-resolution
CN115170825A (en) Image generation system training method, image generation method, and image generation system
KR102296644B1 (en) Apparatus and method for generating noise-free image by removing preset attributes on a single image
Dixit et al. A Review of Single Image Super Resolution Techniques using Convolutional Neural Networks