CN115375536A - Image processing method and apparatus - Google Patents

Image processing method and apparatus

Info

Publication number
CN115375536A
Authority
CN
China
Prior art keywords
image
sample image
depth map
sample
deep learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110560593.7A
Other languages
Chinese (zh)
Inventor
张涛
赵航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zitiao Network Technology Co Ltd
Original Assignee
Beijing Zitiao Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zitiao Network Technology Co Ltd filed Critical Beijing Zitiao Network Technology Co Ltd
Priority to CN202110560593.7A priority Critical patent/CN115375536A/en
Publication of CN115375536A publication Critical patent/CN115375536A/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4053Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4046Scaling of whole images or parts thereof, e.g. expanding or contracting using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/90Dynamic range modification of images or parts thereof
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20048Transform domain processing
    • G06T2207/20056Discrete and fast Fourier transform, [DFT, FFT]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

Embodiments of the present disclosure provide an image processing method and apparatus. The image processing method includes: acquiring an image to be processed; and processing the image through a deep learning model to generate a target image, wherein the deep learning model is trained based on a loss function that includes a spectral loss between a sample image and a reference image. The image processing is depth estimation and the reference image is an image obtained by performing data enhancement processing on the sample image, or the image processing is super-resolution reconstruction and the reference image is an image that has the same content as the sample image and a resolution greater than that of the sample image. By constraining the training of the deep learning model with a loss function that reflects the frequency-domain loss between images, the training effect of the deep learning model is improved, and the quality of the images processed by the deep learning model is improved accordingly.

Description

Image processing method and apparatus
Technical Field
Embodiments of the present disclosure relate to the field of computer technology, and in particular to an image processing method and apparatus.
Background
With the development of deep learning, models trained by self-supervised or strongly supervised methods are widely used in image processing. A deep learning model trained by a self-supervised method is generally sensitive to camera intrinsic parameters, whereas a model trained by a strongly supervised method is less affected by camera intrinsics and is better suited to processing video data.
When a deep learning model is trained in a strongly supervised manner, the training is usually constrained by the difference between the output of the deep learning model and a preset label corresponding to the input image. However, a deep learning model trained only under this constraint performs poorly, resulting in poor quality of the images it processes.
Disclosure of Invention
Embodiments of the present disclosure provide an image processing method and apparatus to improve the training effect of a deep learning model and thereby improve the quality of images processed by the deep learning model.
In a first aspect, an embodiment of the present disclosure provides an image processing method, including:
acquiring an image to be processed;
performing image processing on the image through a deep learning model to generate a target image, wherein the deep learning model is obtained by training based on a loss function comprising the spectral loss between a sample image and a reference image;
wherein the image processing is depth estimation and the reference image is an image obtained by performing data enhancement processing on the sample image, or the image processing is super-resolution reconstruction and the reference image is an image that has the same content as the sample image and a resolution greater than that of the sample image.
In a second aspect, an embodiment of the present disclosure provides an image processing apparatus, including:
an acquisition unit configured to acquire an image to be processed;
a processing unit configured to process the image through a deep learning model to generate a target image, wherein the deep learning model is trained based on a loss function that includes a spectral loss between a sample image and a reference image;
wherein the image processing is depth estimation and the reference image is an image obtained by performing data enhancement processing on the sample image, or the image processing is super-resolution reconstruction and the reference image is an image that has the same content as the sample image and a resolution greater than that of the sample image.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: at least one processor and a memory;
the memory stores computer-executable instructions;
the at least one processor executes computer-executable instructions stored by the memory to cause the at least one processor to perform the image processing method as set forth in various possible designs of the first aspect above.
In a fourth aspect, the embodiments of the present disclosure provide a computer-readable storage medium, in which computer-executable instructions are stored, and when a processor executes the computer-executable instructions, the image processing method according to the above possible designs of the first aspect is implemented.
In the image processing method and apparatus provided by this embodiment, an image to be processed is processed through a deep learning model trained under the constraint of a loss function, so as to generate a target image. The loss function includes a spectral loss between a sample image and a reference image. The image processing is depth estimation and the reference image is an image obtained by performing data enhancement processing on the sample image, or the image processing is super-resolution reconstruction and the reference image is an image that has the same content as the sample image and a resolution greater than that of the sample image.
Therefore, when the deep learning model is trained, the spectral loss fully reflects the loss between images in the frequency domain, which effectively enriches the constraints on model training, improves the training effect of the deep learning model, improves the accuracy of the deep learning model in depth estimation or super-resolution reconstruction, and thereby improves the quality of the images processed by the deep learning model.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present disclosure, and for those skilled in the art, other drawings can be obtained according to the drawings without inventive exercise.
FIG. 1a is a diagram illustrating an example of a process of depth estimation of an image by a deep learning model;
fig. 1b is a schematic view of an application scenario provided by the embodiment of the present disclosure;
fig. 2 is a schematic flowchart of an image processing method according to an embodiment of the disclosure;
fig. 3 is a schematic flowchart of an image processing method according to another embodiment of the disclosure;
fig. 4 is a schematic flow chart of a single training process of a deep learning model in an image processing method according to an embodiment of the present disclosure;
FIG. 5 is a diagram illustrating a process of super-resolution reconstruction of an image by a deep learning model;
FIG. 6 is a diagram illustrating a process of super-resolution reconstruction of an image by a deep learning model;
fig. 7 is a schematic flowchart of an image processing method according to another embodiment of the disclosure;
fig. 8 is a schematic flowchart of a single training of a deep learning model in an image processing method according to another embodiment of the present disclosure;
fig. 9 is a block diagram of an image processing apparatus according to an embodiment of the present disclosure;
fig. 10 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present disclosure.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions in the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, but not all embodiments of the present disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
The following explains terms related to the present disclosure:
monocular video: video captured by a single camera.
Depth estimation of monocular video: performing depth estimation on multiple frames of images in a video shot by a single camera to obtain the depth maps corresponding to those frames. The depth map corresponding to an image contains the scene depth of each pixel in the image, where scene depth refers to the distance between the object captured by the camera and the imaging plane of the camera.
The strongly supervised monocular video depth estimation method: when training a deep learning model for depth estimation, each sample image used as training data must carry a ground-truth depth map; the sample image serves as the input of the deep learning model, and its ground-truth depth map serves as the supervision signal during training.
The self-supervised monocular video depth estimation method: when training a deep learning model for depth estimation, sample images used as training data do not need to carry ground-truth depth maps; instead, physical constraints between different frames of a video, established through a camera projection model, are used as the supervision signal.
Super-resolution reconstruction of images: increasing the resolution of a low-resolution image to obtain a reconstructed image, i.e., an image whose resolution has been improved relative to the low-resolution image.
Optical flow: the magnitude and direction of pixel motion of a spatially moving object on the imaging plane.
Fig. 1a is a diagram illustrating a process of depth estimation of an image by a deep learning model. As shown in fig. 1a, an RGB image (i.e., a color image in which a pixel value of each pixel is represented by three channels of red, green, and blue) is input to a deep learning model, which includes an encoder and a decoder. And the deep learning model outputs a depth map corresponding to the RGB image after the RGB image is processed by the encoder and the decoder. The depth map of the RGB image comprises depths corresponding to all pixels on the RGB image.
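For illustration, the following is a minimal sketch of such an encoder-decoder depth estimation network, assuming PyTorch; the layer sizes and structure are illustrative assumptions, not the specific architecture used in this disclosure.

```python
# Illustrative only: a minimal encoder-decoder depth network; the layer sizes are assumptions.
import torch
import torch.nn as nn

class TinyDepthNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: downsample the RGB image and extract features.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Decoder: upsample back to the input resolution and predict one depth channel.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, rgb):
        # rgb: (N, 3, H, W) -> depth map: (N, 1, H, W)
        return self.decoder(self.encoder(rgb))
```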
When training a deep learning model for monocular video depth estimation, the self-supervised method requires a large number of diverse videos covering various scenes; otherwise, the trained deep learning model is very sensitive to camera intrinsic parameters and is not suitable for video depth estimation. The strongly supervised method is less affected by camera intrinsics and, compared with the self-supervised method, is better suited to video depth estimation. Considering that acquiring a large number of videos with ground-truth depth is very difficult and costly, the deep learning model is usually trained on an image data set, and the trained model is then used for video depth estimation.
However, a deep learning model trained on an image data set does not learn the temporal semantic information between different frames of a video and is sensitive to changes between adjacent frames. For example, as the video advances in time, the scene may undergo various disturbances such as illumination changes and noise, to which the model is sensitive. Therefore, when a deep learning model trained in this way is applied to video depth estimation, the output depth estimation results are prone to jitter, their quality is poor, and downstream applications are negatively affected.
In one approach, to reduce the jitter of the video depth estimation results produced when a deep learning model trained on an image data set is applied to monocular video depth estimation, optical flow is introduced during inference, and the depth estimation results are smoothed with the optical flow. However, this approach reduces the quality of the single-frame depth estimation results, increases the running time of the algorithm, and affects the real-time performance of video depth estimation.
In addressing the jitter of the monocular video depth estimation results output by a depth estimation model trained on an image data set, the present disclosure considers the following two aspects:
(1) When a depth estimation model is trained on an image data set, the training of the deep learning model is generally constrained (or supervised) by the difference between the depth map output by the model and the actual depth map of the sample image, and this difference is typically computed with an L1 loss function. The L1 loss considers only the low-frequency part of the image, i.e., the texture-free regions that occupy most of the pixels (the smoother regions of the image; in signal processing, such regions are commonly called the "low-frequency" part of the image), and does not consider the high-frequency part of the image (e.g., edge regions and regions with significant texture variation, commonly called the "high-frequency" part). Because the loss function used for model training constrains only the low-frequency part of the image and not the high-frequency part, the trained depth estimation model is sensitive to slight changes in the high-frequency part of the image.
(2) In video depth estimation, because the scene is affected by factors such as illumination and noise while the video is being shot, slight differences exist between adjacent frames of the video. These slight differences lie in the high-frequency parts of the two adjacent frames, while their low-frequency parts are essentially the same. If the loss function constrains only the low-frequency part of the image and neglects the high-frequency part during training, the trained deep learning model is very sensitive to slight changes in the high-frequency part of the image, the depth maps of two adjacent frames differ greatly, the video depth estimation results jitter, and their quality is poor.
In addition, when a deep learning model is applied to super-resolution reconstruction of an image, training is generally constrained (or supervised) by the difference between the reconstructed image output by the model and the high-resolution image corresponding to the sample image. Similarly, only the low-frequency part of the image is considered and the high-frequency part is ignored, so the quality of the reconstructed image obtained by the deep learning model is poor.
Therefore, embodiments of the present disclosure provide an image processing method in which the deep learning model used for image processing is trained under the constraint of a loss function that includes a spectral loss between a sample image and a reference image, so that the loss function also constrains the difference between the sample image and the reference image in the high-frequency part of the image, which helps improve the image processing effect of the deep learning model.
When the image processing is depth estimation, the reference image is obtained by performing data enhancement processing on the sample image, so the reference image simulates the sample image after slight changes. This helps guide the deep learning model to understand the disturbance factors present in a video scene during training, enhances the noise resistance of the model, and reduces its sensitivity to slight changes in the video scene. When the image processing is super-resolution reconstruction, the reference image has the same content as the sample image but a higher resolution, i.e., it is the high-resolution image corresponding to the sample image; the spectral loss then constrains the difference between the reconstructed image output by the deep learning model and the high-resolution image corresponding to the sample image in the high-frequency part of the image, which improves the model effect of the deep learning model and the quality of the images it outputs.
Here, the spectral loss between images refers to their loss in the frequency domain, and the difference between images refers to their loss in the spatial domain. Compared with the spatial-domain loss, the frequency-domain loss between images can represent the difference between their high-frequency parts.
Referring to fig. 1b, fig. 1b is a schematic view of an application scenario provided by an embodiment of the present disclosure. As shown in fig. 1b, the application scenario includes a server 101 and/or a terminal 102; when the application scenario includes both the server 101 and the terminal 102, they communicate through a network. In this application scenario, the deep learning model may be applied on the server 101, for example to depth estimation of a video, and/or on the terminal 102. Similarly, the training of the deep learning model may be performed on the server 101 or on the terminal 102.
For example, the execution subject of the image processing method provided by the embodiments of the present disclosure may be a terminal device or a server. When the method is executed on the terminal device, it can perform real-time depth estimation of a video, for example estimating depth in real time while the terminal device plays the video, or non-real-time depth estimation, for example on a video stored in the terminal device. When the method is executed on a server, it can likewise perform real-time depth estimation, for example while the video is being transmitted to the terminal device for playback, or non-real-time depth estimation, for example on a video stored on the server.
The terminal device may be a Personal Digital Assistant (PDA) device, a handheld device with a wireless communication function (e.g., a smart phone or a tablet), a computing device (e.g., a Personal Computer (PC)), an in-vehicle device, a wearable device (e.g., a smart watch or a smart band), a smart home device (e.g., a smart display device), and the like. The server can be a centralized server, a distributed server or a cloud server.
Referring to fig. 2, fig. 2 is a schematic flowchart of an image processing method according to an embodiment of the disclosure. As shown in fig. 2, the image processing method includes:
and S20l, acquiring an image to be processed.
The image to be processed is an image waiting to be processed in the current execution device, for example, an image waiting for depth estimation or an image waiting for super-resolution reconstruction.
In one example, the image to be processed may be acquired in a plurality of images stored in advance.
In yet another example, an image input by a user may be acquired, and the image input by the user may be determined as an image to be processed. For example, when the current execution device is a terminal, an image input by the user in the input box may be acquired. For another example, when the current execution device is a server, the image sent by the user terminal may be acquired.
In another example, an image displayed on a currently executing device is acquired.
S202, carrying out image processing on the image through a deep learning model to generate a target image, wherein the deep learning model is obtained through training based on a loss function including the spectrum loss between the sample image and the reference image.
The image processing is depth estimation, the reference image is an image obtained by performing data enhancement processing on the sample image, or the image processing is super-resolution reconstruction, and the reference image is an image which has the same image content as the sample image and has a resolution greater than that of the sample image.
Specifically, when the image processing is depth estimation, the image to be processed is input into the deep learning model, which performs depth estimation on the image to obtain a depth map of the image, i.e., the target image. When the image processing is super-resolution reconstruction, the image to be processed is input into the deep learning model, which performs super-resolution reconstruction on the image to obtain a reconstructed image, i.e., the target image. For clarity, the cases of depth estimation and super-resolution reconstruction are described separately in the following embodiments.
First, a case of depth estimation is described.
Referring to fig. 3, fig. 3 is a schematic flowchart of an image processing method according to another embodiment of the disclosure. As shown in fig. 3, the image processing method includes:
s301, acquiring a plurality of frame images in the video to be processed.
Optionally, the image processing method provided by the embodiment of the present disclosure is applied to depth estimation of a monocular video.
The video to be processed is the video waiting for depth estimation in the current execution equipment.
In one example, a video to be processed may be obtained from a plurality of videos stored in advance. For example, according to a preset processing sequence of a plurality of videos, a video to be processed is sequentially acquired from the plurality of videos. As another example, a video selected by the user is acquired from a plurality of videos, and the video selected by the user is determined as a video to be processed.
In yet another example, a video input by a user may be obtained and determined as the video to be processed. For example, when the current execution device is a terminal, a video input by the user in the input box may be acquired. For another example, when the current execution device is a server, a video sent by the user terminal may be acquired.
In another example, a video currently in a playing state is acquired and determined to be a video to be processed, or the video to be played is acquired and determined to be the video to be processed, so that real-time depth estimation of the video is realized.
S302, performing depth estimation on the multi-frame images through a deep learning model to generate an estimated depth map of the multi-frame images, wherein the deep learning model is a depth estimation model obtained based on loss function training, the loss function comprises frequency spectrum loss between a sample image and a reference image, and the reference image is an image obtained by performing data enhancement processing on the sample image.
Wherein the image dataset used to train the deep learning model comprises a plurality of sample images.
In the training process of the deep learning model, data enhancement processing is performed on the sample image, and the processed sample image is determined to be the reference image corresponding to the sample image. The subtle changes (e.g., brightness changes, contrast changes) between the reference image obtained through data enhancement and the sample image simulate the disturbance factors that appear in a video scene as time advances, which helps guide the deep learning model to "understand" such disturbance factors during training. After the reference image corresponding to the sample image is obtained, the model parameters of the deep learning model are adjusted based on a loss function that includes the spectral loss between the sample image and the reference image, until the loss function converges, yielding the trained deep learning model. In this way, the noise resistance of the deep learning model is enhanced during training, and its sensitivity to slight changes in a video scene is reduced.
After the deep learning model is obtained through training, the deep learning model is deployed on the current execution equipment in advance.
In the step, the multi-frame images in the video are input into the trained deep learning model, and the depth estimation is performed on the multi-frame images in the video through the deep learning model to obtain depth maps corresponding to the multi-frame images in the video. For the convenience of differential description, the depth map obtained by depth estimation will be referred to as an estimated depth map.
S303, generating a target video according to the estimated depth maps of the multi-frame images.
And obtaining the target video according to the estimated depth maps corresponding to the multi-frame images in the video.
The target video is a depth map video corresponding to the video to be processed, namely a video depth estimation result output by the deep learning model. And each frame of image in the target video is an estimated depth map corresponding to the image of the corresponding frame in the video to be processed. For example, the first frame image of the target video is the estimated depth map corresponding to the first frame image in the video to be processed, the second frame image of the target video is the estimated depth map corresponding to the second frame image in the video to be processed, … …, and so on.
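A minimal sketch of this per-frame inference, assuming PyTorch and a trained depth model with the interface of the sketch above; frame loading and the assembly of the depth-map video are only hinted at.

```python
# Sketch of S301-S303: per-frame depth estimation and assembly of the target (depth-map) video.
import torch

@torch.no_grad()
def estimate_video_depth(model, frames):
    """frames: iterable of (3, H, W) float tensors; returns a list of (1, H, W) estimated depth maps."""
    model.eval()
    target_video = []
    for frame in frames:
        depth = model(frame.unsqueeze(0))[0]  # estimated depth map of this frame
        target_video.append(depth)            # frame k of the target video
    return target_video
```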
In the embodiment of the disclosure, the deep learning model is obtained by training under the constraint of a loss function including the spectral loss between the sample image and the reference image, so that the anti-noise capability of the deep learning model is improved, and the sensitivity of the deep learning model to slight changes in a video scene is reduced. In the application process, the trained deep learning model is adopted to carry out deep estimation on multiple frames of images in the video to be processed to obtain the target video, so that the method is beneficial to inhibiting the jitter in the target video and improving the quality of the target video.
In order to better understand the anti-noise capability of the deep learning model in the training process and reduce the sensitivity of the deep learning model to slight changes in a video scene, the subsequent training process of the deep learning model is described in detail. The application process of the deep learning model and the training process of the deep learning model can be executed on the same device or different devices.
The deep learning model is trained as follows: acquiring a sample image from the image data set, and performing data enhancement on the sample image to obtain a reference image corresponding to the sample image; adjusting the deep learning model according to the sample image, the reference image corresponding to the sample image and the loss function to obtain an adjusted deep learning model; determining whether the loss function converges; and if the loss function is determined to be converged, determining that the deep learning model training is finished, otherwise, circularly executing the training process based on the adjusted deep learning model.
For example, data enhancement may be performed on a plurality of sample images in advance to obtain the reference image corresponding to each sample image; alternatively, one sample image may be acquired in each round of training, data enhancement performed on that sample image, and the deep learning model trained once based on the sample image and its corresponding reference image, with the next sample image acquired and enhanced in the next round of training, and so on.
For example, it may be determined whether a function value of the loss function obtained in the training process is smaller than or equal to a preset threshold, if so, determining that the loss function is converged, otherwise, determining that the loss function is not converged.
For example, after determining the function value of the loss function, before adjusting the deep learning model, it may be determined whether the loss function converges, if so, it is determined that the deep learning model training is finished without adjusting the deep learning model, otherwise, the deep learning model is continuously adjusted, and the next cycle is entered. Or, after the deep learning model is adjusted, whether the loss function is converged or not is determined, if yes, the deep learning model training is determined to be finished, and if not, the next cycle is entered.
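The outer training loop described above might look like the following minimal sketch, assuming PyTorch; `augment`, `loss_fn`, the dataset interface, and the convergence threshold are placeholders.

```python
# Sketch of the cyclic training process: augment each sample, compute the loss, adjust the model,
# and stop once the loss function is considered converged (threshold `tol` is a placeholder).
import torch

def train(model, dataset, augment, loss_fn, epochs=10, lr=1e-4, tol=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for sample, gt_depth in dataset:           # sample image and (optional) actual depth map
            reference = augment(sample)            # reference image via data enhancement
            loss = loss_fn(model, sample, reference, gt_depth)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if loss.item() <= tol:                 # loss function converged
                return model
    return model
```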
In view of the fact that the training process of the deep learning model is a cyclic process, the single training process of the deep learning model is described by the embodiment. Referring to fig. 4, fig. 4 is a schematic flowchart illustrating a single training process of a deep learning model in an image processing method according to an embodiment of the present disclosure. As shown in fig. 4, the single training process includes:
s401, carrying out depth estimation on the sample image through a deep learning model to obtain an estimated depth map of the sample image.
In this step, the sample image is input into the deep learning model, and the depth estimation is performed on the sample image through the deep learning model, so as to obtain an estimated depth map output by the deep learning model, that is, an estimated depth map of the sample image.
S402, carrying out depth estimation on the reference image through a deep learning model to obtain an estimated depth map of the reference image.
In this step, the reference image is input into the deep learning model, and depth estimation is performed on the reference image through the deep learning model to obtain the estimated depth map output by the model, i.e., the estimated depth map of the reference image.
S403, determining the spectral loss between the estimated depth map of the sample image and the estimated depth map of the reference image.
In this step, the estimated depth map of the sample image may be subjected to spectral transformation, the estimated depth map of the reference image may be subjected to spectral transformation, and the estimated depth map of the sample image after spectral transformation and the estimated depth map of the reference image after spectral transformation are compared to obtain a spectral loss between the estimated depth map of the sample image and the estimated depth map of the reference image.
In some embodiments, the spectral transform is a fourier transform to improve the accuracy of the computation of the spectral loss between the estimated depth map of the sample image and the estimated depth map of the reference image by fourier transform. At this time, one possible implementation manner of S403 includes: carrying out Fourier transform on the estimated depth map of the sample image to obtain a first frequency spectrogram; carrying out Fourier transform on the estimated depth map of the reference image to obtain a second frequency spectrogram; and obtaining the spectral loss between the estimated depth map of the sample image and the estimated depth map of the reference image according to the difference between the first spectrogram and the second spectrogram.
The Fourier transform converts an image from the spatial domain to the frequency domain, representing the image as a combination of sinusoidal plane waves used as basis signals, so that the high-frequency and low-frequency parts of the image are decoupled. The spectral loss between the estimated depth map of the sample image and the estimated depth map of the reference image determined via the Fourier transform therefore makes it possible to impose a consistency constraint on the two estimated depth maps in the frequency domain, i.e., to constrain their consistency in both the high-frequency and the low-frequency parts of the image.
Specifically, the estimated depth map of the sample image is converted from a spatial domain to a frequency domain through Fourier transform, and a first spectrogram is obtained. Similarly, the estimated depth map of the reference image is converted from a spatial domain to a frequency domain through Fourier transform, and a second spectrogram is obtained. And calculating the difference between the first spectrogram and the second spectrogram to obtain the spectral loss between the estimated depth map of the sample image and the estimated depth map of the reference image.
Further, the spectral loss between the estimated depth map of the sample image and the estimated depth map of the reference image can be expressed as rfft(F_1(I)) - rfft(F_1(I_aug)), where I denotes the sample image, I_aug denotes the reference image, F_1 denotes the deep learning model, rfft denotes the Fourier transform, F_1(I) is the estimated depth map of the sample image, F_1(I_aug) is the estimated depth map of the reference image, rfft(F_1(I)) is the first spectrogram, and rfft(F_1(I_aug)) is the second spectrogram.
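A minimal sketch of this spectral loss, assuming PyTorch; torch.fft.rfft2 is used here as the real two-dimensional Fourier transform of each estimated depth map, and taking the mean absolute value of the spectral difference is an illustrative choice.

```python
# Spectral loss between the two estimated depth maps, computed in the frequency domain.
import torch

def spectral_loss(depth_sample, depth_reference):
    """depth_*: (N, 1, H, W) estimated depth maps; returns a scalar frequency-domain loss."""
    spec_sample = torch.fft.rfft2(depth_sample)        # first spectrogram, rfft(F_1(I))
    spec_reference = torch.fft.rfft2(depth_reference)  # second spectrogram, rfft(F_1(I_aug))
    # Mean absolute value of the (complex) spectral difference.
    return (spec_sample - spec_reference).abs().mean()
```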
S404, determining a function value of the loss function according to the spectral loss.
In this step, the loss function includes a spectral loss between the sample image and the reference image, and the spectral loss between the sample image and the reference image is a spectral loss between the estimated depth map of the sample image and the estimated depth map of the reference image, so that after obtaining the spectral loss between the estimated depth map of the sample image and the estimated depth map of the reference image, a function value of the loss function can be determined according to the spectral loss.
In some embodiments, one possible implementation of S404 includes: and determining the function value of the loss function as the absolute value of the spectral loss between the estimated depth map of the sample image and the estimated depth map of the reference image, so that the training process of the deep learning model is constrained by the loss function formed by the spectral loss between the estimated depth map of the sample image and the estimated depth map of the reference image, and the difference between the high-frequency part of the estimated depth map of the sample image and the high-frequency part of the estimated depth map of the reference image is fully considered.
In some embodiments, the loss function further comprises a difference between the estimated depth map of the sample image and the estimated depth map of the reference image. At this time, one possible implementation manner of S404 includes: determining a loss value of the loss function according to a difference between the estimated depth map of the sample image and the estimated depth map of the reference image and a spectral loss between the estimated depth map of the sample image and the estimated depth map of the reference image.
Wherein, the difference between the estimated depth map of the sample image and the estimated depth map of the reference image refers to the difference between the estimated depth map of the sample image and the estimated depth map of the reference image in the spatial domain.
Specifically, in the training process of the deep learning model, the loss value of the loss function is determined according to the difference between the estimated depth map of the sample image and the estimated depth map of the reference image in the spatial domain and the spectral loss between them in the frequency domain. The loss value reflects the training effect of the deep learning model: the smaller the loss value, the better the training effect, and the smaller the spatial-domain difference and the frequency-domain spectral loss between the two estimated depth maps; in other words, the higher the consistency between the estimated depth map of the sample image and that of the reference image in both the low-frequency and high-frequency parts. Therefore, during training, the loss function constrains the consistency of the two estimated depth maps in the spatial and frequency domains, improves the consistency between the depth map the model predicts for the sample image and the one it predicts for the reference image, reduces the sensitivity of the deep learning model to subtle differences between the sample image and the reference image, and improves the noise resistance of the model.
Further, the difference between the estimated depth map of the sample image and the estimated depth map of the reference image is taken as the difference value between the two estimated depth maps.
Further, the function value of the loss function is determined as the sum of the difference value between the estimated depth map of the sample image and the estimated depth map of the reference image and the absolute value of the spectral loss between them, so that both the difference and the spectral loss between the two estimated depth maps are fully reflected in the loss function.
In some embodiments, the actual depth map of the sample image, i.e. the true depth map of the sample image, is also included in the image dataset. At this time, one possible implementation manner of S404 includes: determining a function value of the loss function according to a difference between the estimated depth map of the sample image and the actual depth map of the sample image, a difference between the estimated depth map of the sample image and the estimated depth map of the reference image, and a spectral loss between the estimated depth map of the sample image and the estimated depth map of the reference image.
The difference between the estimated depth map of the sample image and the actual depth map of the sample image is the difference between the estimated depth map of the sample image and the actual depth map of the sample image in a spatial domain, and the difference between the estimated depth map of the sample image and the estimated depth map of the reference image is the difference between the estimated depth map of the sample image and the estimated depth map of the reference image in the spatial domain.
Therefore, in the loss function, not only the spatial difference between the estimated depth map of the sample image and the estimated depth map of the reference image and the spectral loss between the estimated depth map of the sample image and the estimated depth map of the reference image, but also the spatial difference between the estimated depth map of the sample image and the actual depth map of the sample image are considered. As can be understood, the difference between the estimated depth map of the sample image and the estimated depth map of the reference image in the spatial domain and the spectral loss between the estimated depth map of the sample image and the estimated depth map of the reference image are used to guide the deep learning model to "understand" the difference between the high-frequency part of the sample image and the high-frequency part of the reference image and the difference between the low-frequency part of the sample image and the low-frequency part of the reference image in the training process of the deep learning model, and the sensitivity of the deep learning model to the subtle difference between the images in the depth estimation is reduced; and in the training process of the deep learning model, the difference between the estimated depth map of the sample image and the actual depth map of the sample image is used for reflecting the depth estimation accuracy of the deep learning model so as to restrict the estimated depth map predicted by the deep learning model to be as close to the real depth map of the sample image as possible and improve the depth estimation accuracy of the deep learning model.
Further, the difference between the estimated depth map of the sample image and the actual depth map of the sample image is taken as the difference value between these two depth maps, and the difference between the estimated depth map of the sample image and the estimated depth map of the reference image is taken as the difference value between those two depth maps. At this time, one possible implementation manner of S404 includes:
determining a difference between the estimated depth map of the sample image and the actual depth map of the sample image; determining a difference between the estimated depth map of the sample image and the estimated depth map of the reference image; and determining a function value of the loss function according to the difference value between the estimated depth map of the sample image and the actual depth map of the sample image, the difference value between the estimated depth map of the sample image and the estimated depth map of the reference image, and the spectral loss between the estimated depth map of the sample image and the estimated depth map of the reference image.
In an example, the function value of the loss function is determined as a sum of an absolute value of a difference between the estimated depth map of the sample image and the actual depth map of the sample image, an absolute value of a difference between the estimated depth map of the sample image and the estimated depth map of the reference image, and an absolute value of a spectral loss between the estimated depth map of the sample image and the estimated depth map of the reference image.
Optionally, the absolute value of the difference between the estimated depth map of the sample image and the actual depth map of the sample image is determined through an L1 loss function, that is, the absolute value of the difference between each pixel in the estimated depth map of the sample image and each pixel in the actual depth map of the sample image is determined through the L1 loss function, and then the average value of the absolute values is determined. And determining the absolute value of the difference between the estimated depth map of the sample image and the estimated depth map of the reference image through the L1 loss function, namely determining the absolute value of the difference between each pixel in the estimated depth map of the sample image and each pixel in the estimated depth map of the reference image through the L1 loss function, and then determining the mean value of the absolute values.
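A minimal sketch combining the three terms above, assuming PyTorch and the spectral_loss helper sketched earlier; the equal weighting of the terms is an assumption.

```python
# Loss = mean|D(I) - D_gt| + mean|D(I) - D(I_aug)| + |spectral loss|.
import torch.nn.functional as F

def depth_training_loss(model, sample, reference, gt_depth):
    d_sample = model(sample)        # estimated depth map of the sample image
    d_reference = model(reference)  # estimated depth map of the reference image
    supervised = F.l1_loss(d_sample, gt_depth)      # spatial difference to the actual depth map
    consistency = F.l1_loss(d_sample, d_reference)  # spatial difference between the two estimates
    spectral = spectral_loss(d_sample, d_reference) # frequency-domain consistency term
    return supervised + consistency + spectral
```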
S405, adjusting the model parameters of the deep learning model according to the function value of the loss function.
In this step, after the function value of the loss function is obtained, the model parameters of the deep learning model are adjusted based on the function value of the loss function, so as to realize one-time training of the deep learning model. Here, the specific model architecture and model parameters of the deep learning model are not limited.
In the embodiment of the disclosure, based on a loss function including a spectral loss between an estimated depth map of a sample image and an estimated depth map of a reference image, training of a deep learning model is constrained, so that constraint on a high-frequency part of the image is realized through the loss function, the deep learning model is guided to 'understand' disturbance factors existing in a video scene in a training process, the anti-noise capability of the deep learning model is enhanced, and the sensitivity of the deep learning model to subtle changes in the video scene is reduced.
In some embodiments, the data enhancement processing algorithm is used to perform data enhancement processing on the sample image to obtain a reference image corresponding to the sample image, and the data enhancement processing algorithm includes at least one of the following: noise transformation, brightness transformation, contrast transformation, motion blur transformation, and optical flow transformation. Therefore, disturbance factors in a video scene are simulated by adding at least one of noise transformation, brightness transformation, contrast transformation, motion blur transformation and optical flow transformation, and the sample image is interfered to obtain a reference image.
The data enhancement processing algorithm does not affect the scene depth of the sample image; in other words, the scene depth of the sample image before processing by the data enhancement algorithm is consistent with that after processing. Assuming that I denotes the sample image, D(I) denotes the scene depth of the sample image, and T denotes the data enhancement processing algorithm applied to the sample image, then D(I) = D(T(I)). Therefore, based on the sample image and the reference image obtained after data enhancement, the deep learning model can be effectively constrained to output consistent estimated depth maps for the images before and after data enhancement, improving the noise resistance of the deep learning model.
Various data enhancement processing algorithms are described below:
(1) Noise adding transformation
The noise adding transformation refers to adding noise on an image. When the data enhancement processing is performed on the sample image by using the noise adding transformation, one possible implementation manner includes: gaussian noise is added on the sample image, so that the sample image is subjected to noise adding transformation through the Gaussian noise which is closer to the noise in the video scene, and the noise adding transformation effect of the sample image is improved. For example, adding gaussian noise to the sample image can be expressed as:
T_g(I) = I + G(I), where G(I) denotes the Gaussian noise generated for the sample image I, and T_g(I) denotes the image obtained by applying the noise-adding transformation to the sample image I.
Gaussian noise is noise whose probability density function follows a Gaussian distribution. Based on this probability density function, a random number drawn from the Gaussian distribution is added to each pixel of the sample image, yielding the pixel with Gaussian noise added and, in turn, the sample image with Gaussian noise added.
Alternatively, other noise, such as salt and pepper noise, may be added to the sample image in addition to gaussian noise.
Optionally, one or more kinds of noise may be added to the sample image, for example, only gaussian noise may be added to the sample image, and gaussian noise and salt-and-pepper noise may be added to the sample image.
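A minimal sketch of the noise-adding transformation T_g(I) = I + G(I), assuming images are float tensors in [0, 1]; the noise standard deviation is an illustrative choice.

```python
# Noise-adding transformation: add per-pixel Gaussian random numbers to the sample image.
import torch

def add_gaussian_noise(image, sigma=0.05):
    noise = torch.randn_like(image) * sigma   # G(I): Gaussian noise with standard deviation sigma
    return (image + noise).clamp(0.0, 1.0)    # T_g(I) = I + G(I), kept in the [0, 1] pixel range
```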
(2) Luminance transformation
Luminance transformation refers to changing the luminance of a scene in an image. When the data enhancement processing is performed on the sample image by adopting the luminance transformation, one possible implementation manner includes: randomly generating a brightness conversion factor; and performing brightness transformation on the sample image based on the brightness transformation factor. Therefore, the diversity of the brightness conversion of the sample image is improved through the random brightness conversion factor, and the training effect of the deep learning model is further improved.
For example, the luminance transformation of the sample image may be expressed as:
T_l(I) = I × α + θ, where α and θ denote randomly generated luminance transformation factors, and T_l(I) denotes the image obtained by applying the luminance transformation to the sample image I.
Further, in order to avoid serious distortion of the image due to an excessively large luminance transformation factor, the luminance transformation factor may be randomly generated within a preset value range of the luminance transformation factor when the luminance transformation factor is randomly generated. For example, α is randomly generated in a preset value range of α, and θ is generated in a preset value range of θ.
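A minimal sketch of the luminance transformation T_l(I) = I × α + θ with randomly generated factors; the value ranges for α and θ are illustrative assumptions.

```python
# Luminance transformation with randomly generated factors drawn from preset value ranges.
import torch

def random_brightness(image, alpha_range=(0.8, 1.2), theta_range=(-0.05, 0.05)):
    alpha = torch.empty(1).uniform_(*alpha_range).item()  # random scale factor
    theta = torch.empty(1).uniform_(*theta_range).item()  # random offset factor
    return (image * alpha + theta).clamp(0.0, 1.0)        # T_l(I) = I * alpha + theta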
(3) Contrast transformation
Contrast transformation refers to changing the contrast of an image. When the contrast transformation is used to perform data enhancement processing on the sample image, one possible implementation manner includes: the image is gamma transformed. Among them, gamma (Gamma) transformation is a way of enhancing image data, and can be used to change the contrast of an image. Therefore, the contrast transformation effect of the sample image is improved by the gamma transformation.
Optionally, when the sample image is subjected to gamma conversion, a gamma value (γ value) in the gamma conversion may be randomly generated, and the sample image is subjected to gamma conversion based on the randomly generated γ value, so that the diversity of the gamma conversion of the sample image is improved, and the training effect of the deep learning model is further improved.
Further, in order to avoid serious distortion of the image due to an excessively large γ value, when the γ value is randomly generated, the γ value may be randomly generated within a preset value range of the γ value.
For example, the image obtained by applying the contrast transformation to the sample image I can be represented as T_γ(I).
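A minimal sketch of the contrast transformation via gamma correction with a randomly generated γ value; the range of γ and the [0, 1] pixel range are illustrative assumptions.

```python
# Contrast transformation via gamma correction with a randomly generated gamma value.
import torch

def random_gamma(image, gamma_range=(0.7, 1.5)):
    gamma = torch.empty(1).uniform_(*gamma_range).item()  # random gamma within a preset range
    return image.clamp(min=1e-6).pow(gamma)               # T_gamma(I) = I ** gamma
```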
(4) Motion blur transformation
Motion blur transformation refers to simulating the motion blur in an image caused by camera motion or by the motion of objects within the scene. A single still image usually shows no obvious motion blur, but in a video, because of camera motion or object motion in the scene, individual frames generally exhibit motion blur of varying degrees. Motion blur causes details in the image to be lost, which increases the difficulty of predicting the scene depth of the image with a deep learning model. If the deep learning model does not learn specifically how to estimate depth for motion-blurred images during training, its performance is unstable when applied to video depth estimation, and the depth estimation accuracy is low.
Therefore, performing data enhancement on the sample image through the motion blur transformation helps the deep learning model to perform targeted learning for depth estimation of motion-blurred images, which improves the stability and accuracy of the deep learning model when applied to video depth estimation. After motion-blur-transformed sample images are added to the training process of the deep learning model, when the resulting deep learning model is used for depth estimation of a video captured with camera motion or object motion, the output video depth estimation result shows no obvious jitter or blur.
One possible implementation of performing data enhancement processing on the sample image using the motion blur transformation includes: performing a convolution operation in the spatial domain and/or the frequency domain between the sample image and a random rectangular wave.
Specifically, when the sample image and a random rectangular wave are subjected to a convolution operation in the spatial domain and/or the frequency domain, the rectangular wave is simulated by a two-dimensional convolution kernel, and the convolution kernel is convolved with the sample image in the spatial domain and/or the frequency domain; alternatively, a one-dimensional convolution kernel is convolved with each row and each column of the sample image. The sample image obtained after the convolution operation is the motion-blur-transformed sample image.
For example, the image obtained by motion blur transformation of the sample image I can be represented as T_m(I).
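For illustration only, a possible spatial-domain realization of the motion blur transformation T_m(I) is sketched below: the random rectangular wave is approximated by a line-shaped convolution kernel of random length and orientation, which is convolved with the sample image. The kernel-size range and the use of OpenCV are implementation assumptions:

import cv2
import numpy as np

def random_motion_blur(image, max_kernel_size=15):
    # Random kernel length and orientation.
    size = np.random.randint(3, max_kernel_size + 1)
    angle = np.random.uniform(0.0, 180.0)
    # Build a horizontal line kernel, then rotate it to the random orientation.
    kernel = np.zeros((size, size), dtype=np.float32)
    kernel[size // 2, :] = 1.0
    rot = cv2.getRotationMatrix2D((size / 2 - 0.5, size / 2 - 0.5), angle, 1.0)
    kernel = cv2.warpAffine(kernel, rot, (size, size))
    kernel /= max(kernel.sum(), 1e-6)
    # Spatial-domain convolution of the kernel with the sample image.
    return cv2.filter2D(image, -1, kernel)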
(5) Optical flow transformation
In a video, the image content of adjacent frames is approximately the same, with only slight pixel displacements between them. Performing an optical flow transformation on a sample image means simulating, on the sample image, these slight pixel displacements between adjacent frames of a video. Applying optical-flow-transformed sample images in the training process of the deep learning model helps to reduce the sensitivity of the deep learning model to slight pixel displacements between adjacent frames, prevents the deep learning model from outputting widely differing estimated depth maps for adjacent frames, suppresses jitter in the video depth estimation result generated by the deep learning model, and improves the quality of the target video.
One possible implementation of performing data enhancement processing on the sample image using the optical flow transformation includes: randomly generating the optical flow magnitude and the optical flow angle of a target optical flow, where the optical flow magnitude satisfies a preset value range; and performing image mapping on the sample image according to the optical flow magnitude, the optical flow angle and the actual depth map of the sample image to obtain the optical-flow-transformed sample image. The randomly generated optical flow magnitude and angle increase the diversity of the target optical flow, and thus the diversity of the optical flow transformations applied to the sample image; limiting the optical flow magnitude to a preset value range ensures that the generated target optical flow is small and approximates the fine pixel displacement between adjacent frames.
Specifically, when the sample image is image-mapped according to the optical flow magnitude, the optical flow angle and the actual depth map of the sample image, the optical flow magnitude and the optical flow angle of the target optical flow can be adjusted according to the actual depth map of the sample image, so that when the adjusted target optical flow acts on the sample image, a smaller pixel displacement is simulated at pixels with larger scene depth and a larger pixel displacement is simulated at pixels with smaller scene depth, which fully respects the physical correlation between optical flow and scene depth. This physical correlation means that, between adjacent frames of a video, the displacement of a pixel with large scene depth (at imaging time, the object corresponding to the pixel is farther from the imaging plane of the camera) is relatively small, and the displacement of a pixel with small scene depth (at imaging time, the object corresponding to the pixel is closer to the imaging plane of the camera) is relatively large. In this way, the optical flow transformation of the sample image better matches the real situation, rather than simply applying a uniform translation to the pixels of the sample image.
Optionally, when the image mapping is performed on the sample image according to the optical flow magnitude, the optical flow angle and the actual depth map of the sample image, the target optical flow formed by the optical flow magnitude and the optical flow angle is multiplied by the reciprocal of each pixel value in the actual depth map of the sample image to obtain the adjusted target optical flow, so that when the adjusted target optical flow acts on the sample image, a smaller pixel displacement is simulated at pixels with larger scene depth and a larger pixel displacement is simulated at pixels with smaller scene depth. Then, the sample image is image-mapped according to the adjusted target optical flow, that is, each pixel of the sample image is mapped from its current position to another position according to the adjusted target optical flow, simulating the pixel motion between adjacent frames of a video, to obtain the optical-flow-transformed sample image.
For example, performing an optical-flow transformation on a sample image may be represented as:
f = [mag*sin(angle), mag*cos(angle)] * D(I)^-1, T_f(I) = remap(I, f). Here, f = [mag*sin(angle), mag*cos(angle)] * D(I)^-1 is the formula for adjusting the target optical flow, angle represents the optical flow angle of the target optical flow, mag represents the optical flow magnitude of the target optical flow, D(I)^-1 represents the reciprocal of the actual depth map D(I) of the sample image I, f represents the adjusted target optical flow, remap(I, f) represents image mapping of the sample image based on the adjusted target optical flow, remap() denotes the image mapping operation, and T_f(I) represents the image obtained by performing optical flow transformation on the sample image I.
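For illustration only, a possible realization of this optical flow transformation is sketched below: a random target optical flow is scaled by the reciprocal of the actual depth map and then applied through an image remapping. The value range of the optical flow magnitude, the guard against zero depth values and the use of backward mapping with OpenCV's remap are implementation assumptions:

import cv2
import numpy as np

def random_optical_flow_transform(image, depth, max_magnitude=3.0):
    h, w = depth.shape
    # Randomly generate the optical flow magnitude (within a preset range) and angle.
    mag = np.random.uniform(0.0, max_magnitude)
    angle = np.random.uniform(0.0, 2.0 * np.pi)
    # f = [mag*sin(angle), mag*cos(angle)] * D(I)^-1: smaller displacement where depth is large.
    inv_depth = 1.0 / np.maximum(depth.astype(np.float32), 1e-6)
    flow_y = mag * np.sin(angle) * inv_depth
    flow_x = mag * np.cos(angle) * inv_depth
    # remap(I, f): each output pixel is sampled from its displaced position in the sample image.
    grid_y, grid_x = np.mgrid[0:h, 0:w].astype(np.float32)
    map_x = grid_x + flow_x
    map_y = grid_y + flow_y
    return cv2.remap(image, map_x, map_y, cv2.INTER_LINEAR)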
In some embodiments, when there are multiple data enhancement processing algorithms, the data enhancement processing algorithms are executed on the sample image according to the execution probability and execution order of each data enhancement processing algorithm. As a result, when the sample image undergoes data enhancement, the obtained reference image may be a sample image to which no data enhancement processing algorithm was applied, a sample image to which only one data enhancement processing algorithm was applied, or a sample image to which several data enhancement processing algorithms were applied in succession. In other words, executing the data enhancement processing algorithms according to the execution probability and execution order of each algorithm realizes many combinations of the various data enhancement processing algorithms, which improves the diversity and complexity of the data enhancement processing, so that the data enhancement of the sample image more fully simulates the influence of disturbance factors in a video scene on the images of the video, further improving the training effect of the deep learning model.
Illustratively, referring to fig. 5 (fig. 5 is an exemplary diagram of the order of execution of the plurality of data enhancement processing algorithms), the ordering of the plurality of data enhancement processing algorithms is: a first data enhancement processing algorithm, a second data enhancement processing algorithm, a third data enhancement processing algorithm, … …, and an nth data enhancement processing algorithm.
As shown in fig. 5, the sample image serves as the input image. First, according to the execution probability of the first data enhancement processing algorithm, it is determined whether the first data enhancement processing algorithm is applied to the input image; if so, the first data enhancement processing algorithm is executed on the input image and image a1 is the processed image, otherwise the first data enhancement processing algorithm is not executed and image a1 is the image not processed by the first data enhancement processing algorithm. Then, according to the execution probability of the second data enhancement processing algorithm, it is determined whether image a1 is processed by the second data enhancement processing algorithm; if so, the second data enhancement processing algorithm is executed on image a1 and image a2 is the processed image, otherwise the second data enhancement processing algorithm is not executed and image a2 is the image not processed by the second data enhancement processing algorithm. Each subsequent data enhancement processing algorithm is judged and executed in the same manner until the nth data enhancement processing algorithm, yielding the final reference image, where n is the total number of data enhancement processing algorithms.
Therefore, in the processing process of the sample image, the combined processing process of a plurality of data enhancement processing algorithms is realized, and finally the reference image corresponding to the sample image is obtained.
For example, if the data enhancement processing algorithms include the noise transformation, the luminance transformation, the contrast transformation, the motion blur transformation and the optical flow transformation, and the execution order of the algorithms is, for example, optical flow transformation, motion blur transformation, contrast transformation, luminance transformation and noise transformation, then, when the data enhancement processing algorithms are executed on the sample image according to the execution probability and execution order of each algorithm, the obtained reference image may be represented as: I_aug = T_g(T_l(T_γ(T_m(T_f(I))))).
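For illustration only, the combined processing of fig. 5 may be sketched as follows; the callables and probabilities are placeholders chosen for this example rather than values given in the disclosure:

import numpy as np

def augment(sample_image, transforms, probabilities):
    # transforms: data enhancement processing algorithms in their execution order,
    # e.g. [optical_flow, motion_blur, gamma, luminance, noise] (each a callable).
    # probabilities: the execution probability of each algorithm.
    image = sample_image
    for transform, p in zip(transforms, probabilities):
        # Each algorithm is executed only if its execution probability is met.
        if np.random.rand() < p:
            image = transform(image)
    return image  # the reference image I_aug corresponding to the sample image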
Therefore, in the embodiment of the disclosure, the reference image of the sample image is generated through the data enhancement processing algorithms and the data enhancement process, so that the sample image and the reference image together simulate the influence of disturbance factors in a video scene on the images of the video. During training, the consistency of the depth estimation of the deep learning model on the sample image and the reference image is constrained, which improves the anti-noise capability of the deep learning model, reduces its sensitivity to slight differences between images, effectively suppresses the jitter in the target video generated when the deep learning model is applied to depth estimation of the video to be processed, and improves the quality of the target video.
In some embodiments, prior to data enhancement of the sample image, the sample image may be image pre-processed to conform the sample image to an image form required by the model input of the deep learning model.
Optionally, the pre-processing the sample image comprises resizing the sample image. The sample image is scaled (enlarged or reduced) and cropped to adjust the size of the sample image to the image size required for model input of the deep learning model.
Optionally, the preprocessing the sample image includes adjusting an image channel order of the sample image. For example, the image channel order of the sample image is set to the RGB order (channel order of red, green, and blue).
Optionally, preprocessing the sample image includes setting the channel of the sample image as the first dimension of the image data of the sample image. For example, the image data of the sample image may generally be represented as four-dimensional data N × C × H × W, where N denotes the number of sample images, C denotes the number of image channels of the sample image, H denotes the height of the sample image and W denotes the width of the sample image; setting the channel of the sample image as the first dimension of the image data means adjusting the image data of the sample image to C × N × H × W.
Optionally, preprocessing the sample image includes normalizing the pixel values of the sample image, the value range of the normalized pixel values being -1 to 1.
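For illustration only, a possible preprocessing sketch covering the above steps is given below; the target size and the assumption that the input image uses BGR channel order (as produced by OpenCV) are illustrative:

import cv2
import numpy as np

def preprocess(image_bgr, target_size=(384, 384)):
    # Scale the sample image to the image size required by the model input.
    image = cv2.resize(image_bgr, target_size)
    # Set the image channel order of the sample image to RGB.
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    # Normalize pixel values to the range [-1, 1].
    image = image.astype(np.float32) / 127.5 - 1.0
    # H x W x C -> C x H x W, then insert the image-count axis N,
    # giving C x N x H x W with the channel as the first dimension.
    image = np.transpose(image, (2, 0, 1))
    return image[:, np.newaxis, :, :]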
Therefore, in the embodiment of the present disclosure, before the deep learning model is trained, the image preprocessing process may be performed on the sample image, so that the sample image meets the model input requirement of the deep learning model, which is beneficial to improving the training effect of the deep learning model and improving the accuracy of the deep learning model obtained by training for depth estimation.
Hereinafter, a case where the image processing is super-resolution reconstruction of an image will be described by way of an embodiment.
Fig. 6 is a flowchart illustrating super-resolution reconstruction of an image by a deep learning model. As shown in fig. 6, the low resolution image is input to a deep learning model, which includes an encoder and a decoder. The deep learning model extracts the features of the low-resolution image through the encoder, performs image reconstruction based on the extracted features through the decoder, and finally generates a reconstructed image.
In the related art, when training a deep learning model for super-resolution reconstruction of images, the training process is usually constrained based on the difference between the reconstructed image of the sample image output by the deep learning model and the reference image corresponding to the sample image. This constraint only enforces consistency between the low-frequency part of the reconstructed image and the low-frequency part of the reference image, and ignores consistency between the high-frequency part of the reconstructed image and the high-frequency part of the reference image, so the quality of the reconstructed image output by the deep learning model is poor.
In order to solve the above problem, an embodiment of the present disclosure provides an image processing method in which super-resolution reconstruction is performed on an image through a deep learning model to obtain a target image, where, during training of the deep learning model, the training is constrained by a loss function that includes the spectral loss between the sample image and the reference image. In this way, the loss function fully accounts for differences in the high-frequency part of the image, the consistency of the output of the deep learning model is constrained based on those high-frequency differences, and the accuracy of the reconstructed image output when the deep learning model performs super-resolution reconstruction is improved.
For example, the execution subject of the image processing method provided by the embodiment of the present disclosure may be a terminal device or a server.
The terminal device may be a Personal Digital Assistant (PDA) device, a handheld device with a wireless communication function (e.g., a smart phone or a tablet), a computing device (e.g., a Personal Computer (PC)), an in-vehicle device, a wearable device (e.g., a smart watch or a smart band), a smart home device (e.g., a smart display device), and the like. The server can be a centralized server, a distributed server or a cloud server.
Referring to fig. 7, fig. 7 is a flowchart illustrating an image processing method according to another embodiment of the disclosure. As shown in fig. 7, the image processing method includes:
and S701, acquiring an image to be processed.
The image to be processed is an image waiting for super-resolution reconstruction in the current execution equipment.
In one example, the image to be processed may be acquired from a plurality of images stored in advance.
In yet another example, an image input by a user may be acquired, and the image input by the user may be determined as an image to be processed. For example, when the current execution device is a terminal, an image input by the user in the input box may be acquired. For another example, when the current execution device is a server, the image sent by the user terminal may be acquired.
In another example, an image displayed on a currently executing device is acquired, the resolution of the image is detected, and if the resolution of the image is smaller than a preset threshold, the image is determined to be an image to be processed.
And S702, performing super-resolution reconstruction on the image through the deep learning model to obtain a target image. The deep learning model is a super-resolution reconstruction model obtained based on loss function training, the loss function comprises spectrum loss between a sample image and a reference image, and the reference image is an image which has the same image content as the sample image and has a resolution greater than that of the sample image.
The super-resolution reconstruction model is a deep learning model for image super-resolution reconstruction.
The image data set used for training the deep learning model includes a plurality of sample images and the reference images corresponding to the sample images. During training of the deep learning model, the model parameters of the second model (i.e., the deep learning model) are adjusted based on a loss function including the spectral loss between the sample image and the reference image, until the loss function converges, yielding the trained deep learning model. In this way, during training, the consistency between the high-frequency part of the reconstructed image of the sample image output by the deep learning model and the high-frequency part of the reference image can be constrained, improving the accuracy of the super-resolution reconstruction performed by the deep learning model.
After the deep learning model is obtained through training, the deep learning model is deployed on the current execution equipment in advance.
In this step, the image to be processed is input into the trained deep learning model, and super-resolution reconstruction is performed on the image to be processed through the deep learning model to obtain a reconstructed image of the image to be processed, namely the target image.
In the embodiment of the disclosure, the deep learning model is obtained by training under the constraint of the loss function including the spectrum loss between the sample image and the reference image, and the accuracy of the deep learning model for performing the image super-resolution reconstruction is improved. In the application process, the trained deep learning model is adopted to carry out super-resolution reconstruction on the image to be processed to obtain a target image, so that the quality of the target image is improved.
In order to better understand how the deep learning model achieves accurate super-resolution reconstruction, the training process of the deep learning model is described in detail below. The application process of the deep learning model and the training process of the deep learning model may be executed on the same device or on different devices.
The training process of the deep learning model is as follows: acquiring a sample image and the reference image corresponding to the sample image from the image data set; adjusting the deep learning model according to the sample image, the reference image corresponding to the sample image and the loss function to obtain an adjusted deep learning model; determining whether the loss function converges; and if the loss function is determined to have converged, determining that training of the deep learning model is finished, otherwise cyclically executing the training process based on the adjusted deep learning model.
For example, it may be determined whether a function value of the loss function obtained in the training process is smaller than or equal to a preset threshold, if yes, it is determined that the loss function is converged, otherwise, it is determined that the loss function is not converged.
For example, after determining the function value of the loss function, before adjusting the deep learning model, it may be determined whether the loss function converges, if so, it is determined that the deep learning model training is finished without adjusting the deep learning model, otherwise, the deep learning model is continuously adjusted, and the next cycle is entered. Or, after the deep learning model is adjusted, whether the loss function is converged or not is determined, if yes, the deep learning model training is determined to be finished, and if not, the next cycle is entered.
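For illustration only, the cyclic training process described above may be sketched as follows in PyTorch; the optimizer, the learning rate, the convergence threshold and the structure of the data loader are assumptions, not part of the disclosure:

import torch

def train(model, dataloader, loss_fn, threshold=1e-3, max_epochs=100):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    for _ in range(max_epochs):
        for sample_image, reference_image in dataloader:
            # Super-resolution reconstruction of the sample image by the deep learning model.
            reconstructed = model(sample_image)
            # Loss function including the spectral loss between the sample image and the reference image.
            loss = loss_fn(reconstructed, reference_image)
            # Adjust the model parameters of the deep learning model.
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            # Training ends once the loss function has converged.
            if loss.item() <= threshold:
                return model
    return model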
In view of the fact that the training process of the deep learning model is a cyclic process, the single training process of the deep learning model is described through the embodiment. Referring to fig. 8, fig. 8 is a schematic flowchart of a single training of a deep learning model in an image processing method according to another embodiment of the present disclosure. As shown in fig. 8, the single training process includes:
s801, performing super-resolution reconstruction on the sample image through the deep learning model to obtain a reconstructed image of the sample image.
In this step, the sample image is input into the deep learning model, and the sample image is subjected to super-resolution reconstruction through the deep learning model to obtain the reconstructed image output by the deep learning model, namely the reconstructed image of the sample image. The resolution of the reconstructed image of the sample image is greater than the resolution of the sample image.
And S802, determining the spectrum loss between the reconstructed image of the sample image and the reference image.
In this step, the reconstructed image of the sample image may be subjected to spectrum transformation, the reference image may be subjected to spectrum transformation, and the reconstructed image after the spectrum transformation and the reference image after the spectrum transformation are compared to obtain a spectrum loss between the reconstructed image after the spectrum transformation and the reference image after the spectrum transformation.
In some embodiments, the spectral transform is a Fourier transform, so as to improve the computational accuracy of the spectral loss between the reconstructed image of the sample image and the reference image. In this case, one possible implementation of S802 includes: performing Fourier transform on the reconstructed image of the sample image to obtain a third spectrogram; performing Fourier transform on the reference image to obtain a fourth spectrogram; and obtaining the spectral loss between the reconstructed image of the sample image and the reference image according to the difference between the third spectrogram and the fourth spectrogram.
The Fourier transform converts an image from the spatial domain to the frequency domain, representing the image with a series of sinusoidal plane signals as basis signals, so that the high-frequency part and the low-frequency part of the image are decoupled. Determining the spectral loss between the reconstructed image of the sample image and the reference image based on the Fourier transform therefore facilitates a consistency constraint between the reconstructed image and the reference image in the frequency domain; that is, during training of the deep learning model, the consistency between the high-frequency part of the reconstructed image of the sample image and the high-frequency part of the reference image is constrained, and the consistency between their low-frequency parts is constrained as well, improving the accuracy of the deep learning model.
Specifically, the reconstructed image of the sample image is converted from a spatial domain to a frequency domain through Fourier transform, so that a third spectrogram is obtained. Similarly, the reference image is transformed from the spatial domain to the frequency domain by fourier transform, resulting in a fourth spectrogram. And calculating a difference value between the third spectrogram and the fourth spectrogram to obtain the spectral loss between the reconstructed image of the sample image and the reference image.
Further, the formula for calculating the spectral loss between the reconstructed image of the sample image and the reference image can be expressed as: Rfft(F_2(I)) - Rfft(HR(I)), where I denotes the sample image, HR(I) denotes the reference image, F_2 represents the deep learning model, Rfft represents the Fourier transform, F_2(I) is the reconstructed image of the sample image, Rfft(F_2(I)) is the third spectrogram, and Rfft(HR(I)) is the fourth spectrogram.
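For illustration only, the spectral loss above may be computed as follows in PyTorch; taking the mean absolute difference of the two complex spectrograms is an assumption about how the difference is reduced to a single value:

import torch

def spectral_loss(reconstructed, reference):
    # Third spectrogram: Fourier transform (real FFT) of the reconstructed image F_2(I).
    spec_rec = torch.fft.rfft2(reconstructed)
    # Fourth spectrogram: Fourier transform (real FFT) of the reference image HR(I).
    spec_ref = torch.fft.rfft2(reference)
    # Spectral loss: difference between the third spectrogram and the fourth spectrogram.
    return (spec_rec - spec_ref).abs().mean()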
And S803, determining a function value of the loss function according to the spectrum loss.
In this step, the loss function includes a spectral loss between the sample image and the reference image, and the spectral loss between the sample image and the reference image is a spectral loss between the reconstructed image of the sample image and the reference image.
In some embodiments, one possible implementation of S803 includes: determining the function value of the loss function to be the absolute value of the spectral loss between the reconstructed image of the sample image and the reference image. In this way, the training process of the deep learning model is constrained by a loss function formed from the spectral loss between the reconstructed image of the sample image and the reference image, which constrains the consistency between the high-frequency part of the reconstructed image of the sample image and the high-frequency part of the reference image.
In some embodiments, the loss function further comprises a difference between a reconstructed image of the sample image and the reference image. At this time, one possible implementation manner of S803 includes: and determining a function value of the loss function according to the difference between the reconstructed image of the sample image and the reference image and the spectral loss between the reconstructed image of the sample image and the reference image. The difference between the reconstructed image of the sample image and the reference image refers to the difference between the reconstructed image of the sample image and the reference image in the spatial domain.
Specifically, during training of the deep learning model, the function value of the loss function is determined according to the difference between the reconstructed image of the sample image and the reference image in the spatial domain and the spectral loss between them in the frequency domain. A smaller function value of the loss function reflects a better model effect of the deep learning model, and a smaller difference between the reconstructed image of the sample image and the reference image in the spatial domain together with a smaller spectral loss in the frequency domain. Therefore, during training, the deep learning model is continuously optimized to reduce the value of the loss function, realizing the consistency constraint between the reconstructed image of the sample image and the reference image in both the spatial domain and the frequency domain.
Further, the difference between the reconstructed image of the sample image and the reference image refers to the difference value between the reconstructed image of the sample image and the reference image.
Further, the function value of the loss function is determined to be the sum of the absolute value of the difference between the reconstructed image of the sample image and the reference image and the absolute value of the spectral loss between the reconstructed image of the sample image and the reference image, so that the difference between the reconstructed image of the sample image and the reference image and the spectral loss between the reconstructed image of the sample image and the reference image are fully embodied in the loss function.
Optionally, the absolute value of the difference between the reconstructed image of the sample image and the reference image is determined through an L1 loss function; that is, the absolute value of the difference between each pixel of the reconstructed image of the sample image and the corresponding pixel of the reference image is determined through the L1 loss function, and the mean of these absolute values is then taken. The function value of the loss function is determined as the sum of this mean and the absolute value of the spectral loss between the reconstructed image of the sample image and the reference image.
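For illustration only, the resulting loss function may be sketched as follows; the equal weighting of the spatial-domain term and the frequency-domain term is an assumption:

import torch
import torch.nn.functional as F

def total_loss(reconstructed, reference):
    # Spatial-domain term: mean absolute (L1) difference between the reconstructed image and the reference image.
    l1_term = F.l1_loss(reconstructed, reference)
    # Frequency-domain term: absolute value of the spectral loss between the two spectrograms.
    spectral_term = (torch.fft.rfft2(reconstructed) - torch.fft.rfft2(reference)).abs().mean()
    return l1_term + spectral_term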
And S804, adjusting the model parameters of the deep learning model according to the function value of the loss function.
In this step, after the function value of the loss function is obtained, the model parameters of the deep learning model are adjusted based on the function value of the loss function, so as to realize one-time training of the deep learning model. Here, the specific model architecture and model parameters of the deep learning model are not limited.
In the embodiment of the disclosure, training of the deep learning model is constrained based on a loss function including spectral loss between a reconstructed image of a sample image and a reference image, so that consistency constraint of the reconstructed image of the sample image and the reference image in a high-frequency part and a low-frequency part of the image is realized through the loss function, and the quality of the reconstructed image output by the deep learning model is improved.
In some embodiments, the sample images may be pre-processed before being input into the deep learning model, such that the sample images conform to the image form required for model input of the deep learning model. The preprocessing of the sample image may refer to the preprocessing of the sample image in the training process of the deep learning model, and is not described herein again.
Fig. 9 is a block diagram of an image processing apparatus according to an embodiment of the present disclosure, corresponding to the image processing method of the foregoing embodiment. For ease of illustration, only portions that are relevant to embodiments of the present disclosure are shown. Referring to fig. 9, the image processing apparatus 90 includes:
an acquiring unit 901 configured to acquire an image to be processed;
a processing unit 902, configured to process the image through a deep learning model, and generate a target image, where the deep learning model is obtained based on a loss function training including a spectral loss between the sample image and the reference image.
The image processing is depth estimation, the reference image is an image obtained by performing data enhancement processing on the sample image, or the image processing is super-resolution reconstruction, and the reference image is an image which has the same image content as the sample image and has a resolution greater than that of the sample image.
In a possible implementation manner, when the image is processed as a depth estimation, the obtaining unit is further configured to: acquiring a plurality of frame images in a video to be processed. The processing unit is further configured to: carrying out depth estimation on the multi-frame images through a deep learning model to generate an estimated depth map of the multi-frame images; and generating a target video according to the estimated depth map of the multi-frame image.
Optionally, the image processing apparatus further comprises a training unit 903 for training the deep learning model.
In one possible implementation, the training unit 903 is configured to: carrying out depth estimation on the sample image through a deep learning model to obtain an estimated depth map of the sample image; carrying out depth estimation on the reference image through a deep learning model to obtain an estimated depth map of the reference image; determining a spectral loss between the estimated depth map of the sample image and the estimated depth map of the reference image; determining a function value of a loss function according to the spectrum loss; and adjusting the model parameters of the deep learning model according to the function values of the loss function.
In one possible implementation, the training unit 903 is further configured to: carrying out Fourier transform on the estimated depth map of the sample image to obtain a first frequency spectrogram; carrying out Fourier transform on the estimated depth map of the reference image to obtain a second frequency spectrogram; and obtaining the spectral loss between the estimated depth map of the sample image and the estimated depth map of the reference image according to the difference between the first spectrogram and the second spectrogram.
In one possible implementation, the training unit 903 is further configured to: determining a function value of the loss function according to a difference between the estimated depth map of the sample image and the actual depth map of the sample image, a difference between the estimated depth map of the sample image and the estimated depth map of the reference image, and a spectral loss between the estimated depth map of the sample image and the estimated depth map of the reference image.
In one possible implementation, the training unit 903 is further configured to: determining a difference between the estimated depth map of the sample image and the actual depth map of the sample image; determining a difference between the estimated depth map of the sample image and the estimated depth map of the reference image; and determining a function value of the loss function according to the difference value between the estimated depth map of the sample image and the actual depth map of the sample image, the difference value between the estimated depth map of the sample image and the estimated depth map of the reference image, and the spectral loss between the estimated depth map of the sample image and the estimated depth map of the reference image.
In one possible implementation, the training unit 903 is further configured to: determining a function value of the loss function as a sum of an absolute value of a difference between the estimated depth map of the sample image and the actual depth map of the sample image, an absolute value of a difference between the estimated depth map of the sample image and the estimated depth map of the reference image, and an absolute value of a difference between a spectral map of the estimated depth map of the sample image and a spectral map of the estimated depth map of the reference image.
In one possible implementation, the training unit 903 is further configured to: performing data enhancement processing on the sample image by adopting a data enhancement processing algorithm to obtain a reference image corresponding to the sample image, wherein the data enhancement processing algorithm comprises at least one of the following algorithms: noise transformation, brightness transformation, contrast transformation, motion blur transformation and optical flow transformation.
In a possible implementation manner, the number of data enhancement processing algorithms is multiple, and the training unit 903 is specifically configured to: and performing data enhancement processing on the sample image by using the data enhancement processing algorithm according to the execution probability and the execution sequence of each data enhancement processing algorithm.
In one possible implementation, the training unit 903 is further configured to: randomly generating the light stream size of the target light stream and the light stream angle of the target light stream, wherein the light stream size meets a preset value range; and performing image mapping on the sample image according to the size of the optical flow, the angle of the optical flow and the actual depth map of the sample image to obtain the sample image after optical flow transformation.
In one possible implementation, the training unit 903 is further configured to: multiply the target optical flow formed by the optical flow magnitude and the optical flow angle by the reciprocal of each pixel value in the actual depth map of the sample image to obtain an adjusted target optical flow; and perform image mapping on the sample image according to the adjusted target optical flow to obtain the optical-flow-transformed sample image. The adjustment formula of the target optical flow formed by the optical flow magnitude and the optical flow angle is expressed as: f = [mag*sin(angle), mag*cos(angle)] * D(I)^-1, where I denotes the sample image, angle represents the optical flow angle, mag represents the optical flow magnitude, D(I)^-1 represents the reciprocal of the actual depth map D(I) of the sample image, and f represents the adjusted target optical flow. The formula for image mapping of the sample image is expressed as: T_f(I) = remap(I, f), where remap() represents image mapping and T_f(I) represents the image obtained by performing optical flow transformation on the sample image.
In one possible implementation, the training unit 903 is further configured to: and performing convolution operation of a space domain and/or a frequency domain on the sample image and the random rectangular wave.
In a possible implementation manner, when the image processing is super-resolution reconstruction, the processing unit 902 is further configured to: and performing super-resolution reconstruction on the image through a deep learning model to obtain a target image.
In one possible implementation, the training unit 903 is configured to: performing super-resolution reconstruction on the sample image through a deep learning model to obtain a reconstructed image of the sample image; determining a spectral loss between a reconstructed image of the sample image and the reference image; determining a function value of a loss function according to the spectrum loss; and adjusting the model parameters of the deep learning model according to the function values of the loss function.
In one possible implementation, the training unit 903 is further configured to: fourier transform is carried out on the reconstructed image of the sample image to obtain a third spectrogram; performing Fourier transform on the reference image to obtain a fourth spectrogram; and obtaining the spectral loss between the reconstructed image of the sample image and the reference image according to the difference between the third spectrogram and the fourth spectrogram.
In one possible implementation, the training unit 903 is further configured to: and determining a function value of the loss function according to the difference between the reconstructed image of the sample image and the reference image and the spectral loss between the reconstructed image of the sample image and the reference image.
In one possible implementation, the training unit 903 is further configured to: determining a function value of the loss function as a sum of an absolute value of a difference between a reconstructed image of the sample image and the reference image and an absolute value of a difference between a spectrogram of the reconstructed image of the sample image and a spectrogram of the reference image.
The image processing apparatus provided in this embodiment may be configured to execute the technical solution related to the communication processing method in the foregoing method embodiments, and the implementation principle and the technical effect are similar, which are not described herein again.
The training unit 903 may also be deployed on a separate model determining device, and the model determining device may be configured to execute a technical scheme related to a deep learning model training process in the image processing method in the foregoing method embodiment, and the implementation principle and the technical effect of the method are similar, which is not described herein again.
Referring to fig. 10, a schematic structural diagram of an electronic device 1000 suitable for implementing the embodiment of the present disclosure is shown, where the electronic device 1000 may be a terminal device or a server. Among them, the terminal Device may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a Digital broadcast receiver, a Personal Digital Assistant (PDA), a tablet computer (PAD), a Portable Multimedia Player (PMP), a car terminal (e.g., car navigation terminal), etc., and a fixed terminal such as a Digital TV, a desktop computer, etc. The electronic device shown in fig. 10 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 10, the electronic device 1000 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 1001, which may perform various suitable actions and processes according to a program stored in a Read Only Memory (ROM) 1002 or a program loaded from a storage device 1008 into a Random Access Memory (RAM) 1003. In the RAM 1003, various programs and data necessary for the operation of the electronic apparatus 1000 are also stored. The processing device 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to bus 1004.
Generally, the following devices may be connected to the I/O interface 1005: input devices 1006 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 1007 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage devices 1008 including, for example, magnetic tape, hard disk, and the like; and a communication device 1009. The communication device 1009 may allow the electronic device 1000 to communicate with other devices wirelessly or by wire to exchange data. While fig. 10 illustrates an electronic device 1000 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may be alternatively implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication means 1009, or installed from the storage means 1008, or installed from the ROM 1002. The computer program, when executed by the processing device 1001, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the method shown in the above embodiments.
Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. Where the name of a unit does not in some cases constitute a limitation of the unit itself, for example, the first retrieving unit may also be described as a "unit for retrieving at least two internet protocol addresses".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), application Specific Integrated Circuits (ASICs), application Specific Standard Products (ASSPs), systems on a chip (SOCs), complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In a first aspect, according to one or more embodiments of the present disclosure, there is provided an image processing method including:
acquiring an image to be processed;
performing image processing on the image through a deep learning model to generate a target image, wherein the deep learning model is obtained based on loss function training including spectral loss between a sample image and a reference image;
the image processing is depth estimation, the reference image is an image obtained by performing data enhancement processing on the sample image, or the image processing is super-resolution reconstruction, and the reference image is an image which has the same image content as the sample image and has a resolution greater than that of the sample image.
According to one or more embodiments of the present disclosure, the image processing is depth estimation, and the acquiring the image to be processed includes: acquiring a plurality of frame images in a video to be processed. The processing the image through the deep learning model to generate a target image comprises: and performing depth estimation on the multiple frames of images through the deep learning model to generate an estimated depth map of the multiple frames of images. The image processing method further includes: and generating a target video according to the estimated depth map of the multi-frame image.
In accordance with one or more embodiments of the present disclosure, training the deep learning model based on a loss function including spectral loss between the sample image and the reference image includes: carrying out depth estimation on the sample image through the deep learning model to obtain an estimated depth map of the sample image; carrying out depth estimation on the reference image through the deep learning model to obtain an estimated depth map of the reference image; determining a spectral loss between the estimated depth map of the sample image and the estimated depth map of the reference image; determining a function value of the loss function according to the spectrum loss; and adjusting the model parameters of the deep learning model according to the function values of the loss function.
According to one or more embodiments of the present disclosure, the determining a spectral loss between the estimated depth map of the sample image and the estimated depth map of the reference image comprises: carrying out Fourier transform on the estimated depth map of the sample image to obtain a first frequency spectrogram; carrying out Fourier transform on the estimated depth map of the reference image to obtain a second frequency spectrogram; and obtaining the spectral loss between the estimated depth map of the sample image and the estimated depth map of the reference image according to the difference between the first spectrogram and the second spectrogram.
According to one or more embodiments of the present disclosure, the determining a function value of the loss function according to the spectral loss includes: determining a function value of the loss function from a difference between the estimated depth map of the sample image and the actual depth map of the sample image, a difference between the estimated depth map of the sample image and the estimated depth map of the reference image, and a spectral loss between the estimated depth map of the sample image and the estimated depth map of the reference image.
According to one or more embodiments of the present disclosure, the determining a function value of the loss function according to a difference between the estimated depth map of the sample image and the actual depth map of the sample image, a difference between the estimated depth map of the sample image and the estimated depth map of the reference image, and a spectral loss between the estimated depth map of the sample image and the estimated depth map of the reference image comprises: determining a difference between the estimated depth map of the sample image and the actual depth map of the sample image; determining a difference between the estimated depth map of the sample image and the estimated depth map of the reference image; determining a function value of the loss function based on a difference between the estimated depth map of the sample image and the actual depth map of the sample image, a difference between the estimated depth map of the sample image and the estimated depth map of the reference image, and a spectral loss between the estimated depth map of the sample image and the estimated depth map of the reference image.
According to one or more embodiments of the present disclosure, the determining a function value of the loss function according to a difference between the estimated depth map of the sample image and the actual depth map of the sample image, a difference between the estimated depth map of the sample image and the estimated depth map of the reference image, and a spectral loss between the estimated depth map of the sample image and the estimated depth map of the reference image comprises: determining a function value of the loss function as a sum of an absolute value of a difference between the estimated depth map of the sample image and the actual depth map of the sample image, an absolute value of a difference between the estimated depth map of the sample image and the estimated depth map of the reference image, and an absolute value of a difference between the spectrogram of the estimated depth map of the sample image and the spectrogram of the estimated depth map of the reference image.
According to one or more embodiments of the present disclosure, the performing data enhancement processing on the sample image includes: performing data enhancement processing on the sample image by adopting a data enhancement processing algorithm to obtain a reference image corresponding to the sample image, wherein the data enhancement processing algorithm comprises at least one of the following steps: noise transformation, brightness transformation, contrast transformation, motion blur transformation and optical flow transformation.
According to one or more embodiments of the present disclosure, the data enhancement processing algorithm includes a plurality of data enhancement processing algorithms, and the performing data enhancement processing on the sample image by using the data enhancement processing algorithms to obtain a reference image corresponding to the sample image includes: executing the data enhancement processing algorithms to perform data enhancement processing on the sample image according to the execution probability and the execution sequence of each data enhancement processing algorithm.
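A minimal sketch of such a pipeline is given below; the individual transforms, their execution probabilities, and their execution sequence are illustrative assumptions, not values taken from the disclosure.

import random
import numpy as np

def add_noise(img):
    return img + np.random.normal(0.0, 5.0, img.shape)

def change_brightness(img):
    return img * random.uniform(0.8, 1.2)

def change_contrast(img):
    mean = img.mean()
    return (img - mean) * random.uniform(0.8, 1.2) + mean

# (transform, execution probability) pairs, applied in this execution sequence
PIPELINE = [(add_noise, 0.5), (change_brightness, 0.5), (change_contrast, 0.5)]

def augment(sample_image: np.ndarray) -> np.ndarray:
    reference_image = sample_image.astype(np.float64).copy()
    for transform, probability in PIPELINE:
        if random.random() < probability:
            reference_image = transform(reference_image)
    return reference_image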
According to one or more embodiments of the present disclosure, the data enhancement processing of the sample image by using optical flow transformation includes: randomly generating an optical flow magnitude of a target optical flow and an optical flow angle of the target optical flow, wherein the optical flow magnitude falls within a preset value range; and performing image mapping on the sample image according to the optical flow magnitude, the optical flow angle and the actual depth map of the sample image to obtain the sample image after the optical flow transformation.
According to one or more embodiments of the present disclosure, the image mapping of the sample image according to the optical flow magnitude, the optical flow angle and the actual depth map of the sample image to obtain an optical-flow-transformed sample image includes: multiplying a target optical flow formed by the optical flow magnitude and the optical flow angle by the reciprocal of each pixel value in the actual depth map of the sample image to obtain an adjusted target optical flow; and performing image mapping on the sample image according to the adjusted target optical flow to obtain the sample image after the optical flow transformation. The adjustment formula of the target optical flow formed by the optical flow magnitude and the optical flow angle is expressed as: f = [mag*sin(angle), mag*cos(angle)] * D(I)^(-1), where I represents the sample image, angle represents the optical flow angle, mag represents the optical flow magnitude, D(I)^(-1) represents the reciprocal of the actual depth map D(I) of the sample image, and f represents the adjusted target optical flow. The formula for the image mapping of the sample image is expressed as: T_f(I) = remap(I, f), where remap() represents image mapping and T_f(I) represents the image obtained by performing the optical flow transformation on the sample image.
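A sketch of this optical flow transformation using OpenCV's remap is shown below; it assumes the actual depth map contains no zero values and uses bilinear interpolation, neither of which is specified in the disclosure.

import numpy as np
import cv2

def optical_flow_transform(image: np.ndarray, depth: np.ndarray,
                           mag_range=(1.0, 10.0)) -> np.ndarray:
    h, w = depth.shape
    # randomly generated optical flow magnitude (within a preset range) and angle
    mag = np.random.uniform(*mag_range)
    angle = np.random.uniform(0.0, 2.0 * np.pi)
    # f = [mag*sin(angle), mag*cos(angle)] * D(I)^(-1): closer pixels are shifted more
    flow_x = mag * np.sin(angle) / depth
    flow_y = mag * np.cos(angle) / depth
    # T_f(I) = remap(I, f): resample the image at the flow-shifted coordinates
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow_x).astype(np.float32)
    map_y = (grid_y + flow_y).astype(np.float32)
    return cv2.remap(image, map_x, map_y, interpolation=cv2.INTER_LINEAR)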
According to one or more embodiments of the present disclosure, performing data enhancement processing on the sample image by using motion blur transformation includes: performing a convolution operation in the spatial domain and/or the frequency domain between the sample image and a random rectangular wave.
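As a hedged illustration, the random rectangular wave can be realized as a box kernel of random length and orientation convolved with the sample image in the frequency domain; the length range, the axis-aligned orientation, and the single-channel (grayscale) input are assumptions.

import numpy as np
from scipy.signal import fftconvolve

def motion_blur(image: np.ndarray, max_length: int = 15) -> np.ndarray:
    # random rectangular (box) kernel, i.e. a rectangular wave of random length
    length = np.random.randint(3, max_length + 1)
    kernel = np.ones((1, length), dtype=np.float64) / length
    if np.random.rand() < 0.5:
        kernel = kernel.T  # vertical instead of horizontal blur direction
    # FFT-based convolution; equivalent to the corresponding spatial-domain convolution
    return fftconvolve(image, kernel, mode='same')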
According to one or more embodiments of the present disclosure, the image processing is super-resolution reconstruction, and the processing of the image through a deep learning model to generate a target image includes: performing super-resolution reconstruction on the image through the deep learning model to obtain the target image.
In accordance with one or more embodiments of the present disclosure, training the deep learning model based on a loss function including spectral loss between the sample image and the reference image includes: performing super-resolution reconstruction on the sample image through the deep learning model to obtain a reconstructed image of the sample image; determining a spectral loss between a reconstructed image of the sample image and the reference image; determining a function value of the loss function according to the spectrum loss; and adjusting the model parameters of the deep learning model according to the function values of the loss function.
According to one or more embodiments of the present disclosure, the determining a spectral loss between a reconstructed image of the sample image and the reference image comprises: performing Fourier transform on the reconstructed image of the sample image to obtain a third spectrogram; performing Fourier transform on the reference image to obtain a fourth spectrogram; and obtaining the spectral loss between the reconstructed image of the sample image and the reference image according to the difference between the third spectrogram and the fourth spectrogram.
According to one or more embodiments of the present disclosure, the determining a function value of the loss function according to the spectral loss includes: determining a function value of the loss function according to a difference between the reconstructed image of the sample image and the reference image and a spectral loss between the reconstructed image of the sample image and the reference image.
According to one or more embodiments of the present disclosure, the determining a function value of the loss function according to a difference between the reconstructed image of the sample image and the reference image and a spectral loss between the reconstructed image of the sample image and the reference image includes: determining a function value of the loss function as a sum of an absolute value of a difference between the reconstructed image of the sample image and the reference image and an absolute value of a difference between the spectrogram of the reconstructed image of the sample image and the spectrogram of the reference image.
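A minimal PyTorch-style training step reflecting this loss is sketched below; the equal weighting of the pixel-domain term and the spectral term, and the comparison of magnitude spectra, are assumptions rather than requirements of the disclosure.

import torch
import torch.nn.functional as F

def sr_training_step(model, optimizer, sample_lr, reference_hr):
    # super-resolution reconstruction of the sample image by the deep learning model
    reconstructed = model(sample_lr)
    # |reconstructed image of the sample image - reference image|
    pixel_term = F.l1_loss(reconstructed, reference_hr)
    # |spectrogram of the reconstructed image - spectrogram of the reference image|
    spec_rec = torch.fft.fft2(reconstructed)
    spec_ref = torch.fft.fft2(reference_hr)
    spectral_term = torch.mean(torch.abs(torch.abs(spec_rec) - torch.abs(spec_ref)))
    loss = pixel_term + spectral_term
    # adjust the model parameters of the deep learning model according to the loss value
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()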
In a second aspect, according to one or more embodiments of the present disclosure, there is provided an image processing apparatus including:
an acquisition unit configured to acquire an image to be processed;
a processing unit configured to perform image processing on the image through a deep learning model to generate a target image, wherein the deep learning model is trained based on a loss function including a spectral loss between a sample image and a reference image;
the image processing is depth estimation, the reference image is an image obtained by performing data enhancement processing on the sample image, or the image processing is super-resolution reconstruction, and the reference image is an image which has the same image content as the sample image and has a resolution greater than that of the sample image.
According to one or more embodiments of the present disclosure, the image processing is depth estimation, and the acquisition unit is further configured to: acquiring multiple frames of images in a video to be processed; the processing unit is further configured to: performing depth estimation on the multiple frames of images through the deep learning model to generate an estimated depth map of the multiple frames of images; and generating a target video according to the estimated depth map of the multiple frames of images.
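As a simple illustration of the video use case, each frame can be passed through a (hypothetical) depth model and the per-frame estimated depth maps stacked into the target video; the model interface is an assumption.

import numpy as np

def video_depth_estimation(frames, depth_model):
    # depth_model is assumed to map one RGB frame to one estimated depth map
    estimated_depth_maps = [depth_model(frame) for frame in frames]
    # the target video is formed from the per-frame estimated depth maps
    return np.stack(estimated_depth_maps, axis=0)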
According to one or more embodiments of the present disclosure, the image processing apparatus further comprises a training unit, wherein the training unit is configured to: carrying out depth estimation on the sample image through the deep learning model to obtain an estimated depth map of the sample image; carrying out depth estimation on the reference image through the deep learning model to obtain an estimated depth map of the reference image; determining a spectral loss between the estimated depth map of the sample image and the estimated depth map of the reference image; determining a function value of the loss function according to the spectrum loss; and adjusting the model parameters of the deep learning model according to the function value of the loss function.
In accordance with one or more embodiments of the present disclosure, the training unit is further configured to: carrying out Fourier transform on the estimated depth map of the sample image to obtain a first frequency spectrogram; carrying out Fourier transform on the estimated depth map of the reference image to obtain a second frequency spectrogram; and obtaining the spectral loss between the estimated depth map of the sample image and the estimated depth map of the reference image according to the difference between the first spectrogram and the second spectrogram.
In accordance with one or more embodiments of the present disclosure, the training unit is further configured to: determining a function value of the loss function according to a difference between the estimated depth map of the sample image and the actual depth map of the sample image, a difference between the estimated depth map of the sample image and the estimated depth map of the reference image, and a spectral loss between the estimated depth map of the sample image and the estimated depth map of the reference image.
In accordance with one or more embodiments of the present disclosure, the training unit is further configured to: determining a difference between the estimated depth map of the sample image and the actual depth map of the sample image; determining a difference between the estimated depth map of the sample image and the estimated depth map of the reference image; determining a function value of the loss function according to a difference between the estimated depth map of the sample image and the actual depth map of the sample image, a difference between the estimated depth map of the sample image and the estimated depth map of the reference image, and a spectral loss between the estimated depth map of the sample image and the estimated depth map of the reference image.
In accordance with one or more embodiments of the present disclosure, the training unit is further configured to: determining a function value of the loss function as a sum of an absolute value of a difference between the estimated depth map of the sample image and the actual depth map of the sample image, an absolute value of a difference between the estimated depth map of the sample image and the estimated depth map of the reference image, and an absolute value of a difference between the spectrogram of the estimated depth map of the sample image and the spectrogram of the estimated depth map of the reference image.
In accordance with one or more embodiments of the present disclosure, the training unit is further configured to: performing data enhancement processing on the sample image by adopting a data enhancement processing algorithm to obtain a reference image corresponding to the sample image, wherein the data enhancement processing algorithm comprises at least one of the following steps: noise transformation, brightness transformation, contrast transformation, motion blur transformation, and optical flow transformation.
According to one or more embodiments of the present disclosure, there are a plurality of data enhancement processing algorithms, and the training unit is further configured to: executing the data enhancement processing algorithms to perform data enhancement processing on the sample image according to the execution probability and the execution sequence of each data enhancement processing algorithm.
In accordance with one or more embodiments of the present disclosure, the training unit is further configured to: randomly generating an optical flow magnitude of a target optical flow and an optical flow angle of the target optical flow, wherein the optical flow magnitude falls within a preset value range; and performing image mapping on the sample image according to the optical flow magnitude, the optical flow angle and the actual depth map of the sample image to obtain the sample image after the optical flow transformation.
In accordance with one or more embodiments of the present disclosure, the training unit is further configured to: multiplying a target optical flow formed by the optical flow magnitude and the optical flow angle by the reciprocal of each pixel value in the actual depth map of the sample image to obtain an adjusted target optical flow; and performing image mapping on the sample image according to the adjusted target optical flow to obtain the sample image after the optical flow transformation; wherein the adjustment formula of the target optical flow formed by the optical flow magnitude and the optical flow angle is expressed as: f = [mag*sin(angle), mag*cos(angle)] * D(I)^(-1), where I represents the sample image, angle represents the optical flow angle, mag represents the optical flow magnitude, D(I)^(-1) represents the reciprocal of the actual depth map D(I) of the sample image, and f represents the adjusted target optical flow; and wherein the formula for the image mapping of the sample image is expressed as: T_f(I) = remap(I, f), where remap() represents image mapping and T_f(I) represents the image obtained by performing the optical flow transformation on the sample image.
In accordance with one or more embodiments of the present disclosure, the training unit is further configured to: and performing convolution operation of a space domain and/or a frequency domain on the sample image and a random rectangular wave.
In accordance with one or more embodiments of the present disclosure, the processing unit is further configured to: and performing super-resolution reconstruction on the image through a deep learning model to obtain a target image.
According to one or more embodiments of the present disclosure, the image processing apparatus further comprises a training unit, wherein the training unit is configured to: performing super-resolution reconstruction on the sample image through the deep learning model to obtain a reconstructed image of the sample image; determining a spectral loss between a reconstructed image of the sample image and the reference image; determining a function value of the loss function according to the spectrum loss; and adjusting the model parameters of the deep learning model according to the function values of the loss function.
In accordance with one or more embodiments of the present disclosure, the training unit is further configured to: performing Fourier transform on the reconstructed image of the sample image to obtain a third spectrogram; performing Fourier transform on the reference image to obtain a fourth spectrogram; and obtaining the spectral loss between the reconstructed image of the sample image and the reference image according to the difference between the third spectrogram and the fourth spectrogram.
In accordance with one or more embodiments of the present disclosure, the training unit is further configured to: determining a function value of the loss function according to a difference between the reconstructed image of the sample image and the reference image and a spectral loss between the reconstructed image of the sample image and the reference image.
In accordance with one or more embodiments of the present disclosure, the training unit is further configured to: determining a function value of the loss function as a sum of an absolute value of a difference between the reconstructed image of the sample image and the reference image and an absolute value of a difference between the spectrogram of the reconstructed image of the sample image and the spectrogram of the reference image.
In a third aspect, according to one or more embodiments of the present disclosure, there is provided an electronic device including: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executes computer-executable instructions stored by the memory to cause the at least one processor to perform the image processing method as set forth in the first aspect above and in various possible designs of the first aspect.
In a fourth aspect, according to one or more embodiments of the present disclosure, there is provided a computer-readable storage medium having stored therein computer-executable instructions that, when executed by a processor, implement the image processing method as described in the first aspect above and in various possible designs of the first aspect.
The foregoing description is merely an explanation of the preferred embodiments of the present disclosure and of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure is not limited to technical solutions formed by the particular combinations of features described above, and also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example, technical solutions formed by replacing the above features with features having similar functions disclosed in (but not limited to) the present disclosure.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims (20)

1. An image processing method comprising:
acquiring an image to be processed;
performing image processing on the image through a deep learning model to generate a target image, wherein the deep learning model is obtained based on loss function training including spectral loss between a sample image and a reference image;
the image processing is depth estimation, the reference image is an image obtained by performing data enhancement processing on the sample image, or the image processing is super-resolution reconstruction, and the reference image is an image which has the same image content as the sample image and has a resolution greater than that of the sample image.
2. The image processing method of claim 1, the image processing being depth estimation, the acquiring an image to be processed comprising:
acquiring a multi-frame image in a video to be processed;
the processing the image through the deep learning model to generate a target image comprises:
performing depth estimation on the multiple frames of images through the deep learning model to generate an estimated depth map of the multiple frames of images;
the image processing method further includes:
and generating a target video according to the estimated depth map of the multi-frame image.
3. The image processing method of claim 2, training the deep learning model based on a loss function comprising spectral loss between the sample image and the reference image, comprising:
carrying out depth estimation on the sample image through the deep learning model to obtain an estimated depth map of the sample image;
carrying out depth estimation on the reference image through the deep learning model to obtain an estimated depth map of the reference image;
determining a spectral loss between the estimated depth map of the sample image and the estimated depth map of the reference image;
determining a function value of the loss function according to the spectrum loss;
and adjusting the model parameters of the deep learning model according to the function values of the loss function.
4. The method of image processing according to claim 3, said determining a spectral loss between an estimated depth map of the sample image and an estimated depth map of the reference image, comprising:
carrying out Fourier transform on the estimated depth map of the sample image to obtain a first frequency spectrogram;
carrying out Fourier transform on the estimated depth map of the reference image to obtain a second frequency spectrogram;
and obtaining the spectral loss between the estimated depth map of the sample image and the estimated depth map of the reference image according to the difference between the first spectrogram and the second spectrogram.
5. The image processing method of claim 3 or 4, the determining a function value of the loss function from the spectral loss, comprising:
determining a function value of the loss function according to a difference between the estimated depth map of the sample image and the actual depth map of the sample image, a difference between the estimated depth map of the sample image and the estimated depth map of the reference image, and a spectral loss between the estimated depth map of the sample image and the estimated depth map of the reference image.
6. The method of image processing according to claim 5, said determining a function value of the loss function from a difference between the estimated depth map of the sample image and an actual depth map of the sample image, a difference between the estimated depth map of the sample image and an estimated depth map of the reference image, and a spectral loss between the estimated depth map of the sample image and the estimated depth map of the reference image, comprising:
determining a difference between the estimated depth map of the sample image and the actual depth map of the sample image;
determining a difference between the estimated depth map of the sample image and the estimated depth map of the reference image;
determining a function value of the loss function according to a difference between the estimated depth map of the sample image and the actual depth map of the sample image, a difference between the estimated depth map of the sample image and the estimated depth map of the reference image, and a spectral loss between the estimated depth map of the sample image and the estimated depth map of the reference image.
7. The method of image processing according to claim 6, said determining a function value of the loss function from a difference between the estimated depth map of the sample image and the actual depth map of the sample image, a difference between the estimated depth map of the sample image and the estimated depth map of the reference image, and a spectral loss between the estimated depth map of the sample image and the estimated depth map of the reference image, comprising:
determining a function value of the loss function as a sum of an absolute value of a difference between the estimated depth map of the sample image and the actual depth map of the sample image, an absolute value of a difference between the estimated depth map of the sample image and the estimated depth map of the reference image, and an absolute value of a difference between the spectrogram of the estimated depth map of the sample image and the spectrogram of the estimated depth map of the reference image.
8. The method of image processing according to any one of claims 2 to 4, 6, and 7, the performing data enhancement processing on the sample image comprising:
performing data enhancement processing on the sample image by adopting a data enhancement processing algorithm to obtain a reference image corresponding to the sample image, wherein the data enhancement processing algorithm comprises at least one of the following algorithms: noise transformation, brightness transformation, contrast transformation, motion blur transformation, and optical flow transformation.
9. The image processing method according to claim 8, wherein there are a plurality of data enhancement processing algorithms, and the performing data enhancement processing on the sample image by using the data enhancement processing algorithms to obtain the reference image corresponding to the sample image comprises:
and executing the data enhancement processing algorithm to perform data enhancement processing on the sample image according to the execution probability and the execution sequence of each data enhancement processing algorithm.
10. The image processing method according to claim 9, wherein performing data enhancement processing on the sample image by using optical flow transformation comprises:
randomly generating an optical flow magnitude of a target optical flow and an optical flow angle of the target optical flow, wherein the optical flow magnitude falls within a preset value range;
and carrying out image mapping on the sample image according to the optical flow size, the optical flow angle and the actual depth map of the sample image to obtain the sample image after optical flow transformation.
11. The image processing method according to claim 10, wherein said image mapping the sample image according to the optical flow magnitude, the optical flow angle and the actual depth map of the sample image to obtain an optical flow transformed sample image comprises:
multiplying a target optical flow formed by the optical flow size and the optical flow angle by the reciprocal of each pixel value in the actual depth map of the sample image to obtain an adjusted target optical flow;
carrying out image mapping on the sample image according to the adjusted target optical flow to obtain a sample image after optical flow transformation;
wherein the adjustment formula of the target optical flow formed by the optical flow magnitude and the optical flow angle is expressed as:
f = [mag*sin(angle), mag*cos(angle)] * D(I)^(-1), wherein I represents the sample image, angle represents the optical flow angle, mag represents the optical flow magnitude, D(I)^(-1) represents the reciprocal of the actual depth map D(I) of the sample image, and f represents the adjusted target optical flow;
wherein a formula for image mapping the sample image is represented as:
T_f(I) = remap(I, f), wherein remap() represents image mapping, and T_f(I) represents the image obtained by performing the optical flow transformation on the sample image.
12. The image processing method according to claim 8, wherein performing data enhancement processing on the sample image by using motion blur transformation comprises:
and performing convolution operation of a space domain and/or a frequency domain on the sample image and a random rectangular wave.
13. The image processing method according to claim 1, the image processing being super-resolution reconstruction, the processing of the image by a deep learning model to generate a target image comprising:
and performing super-resolution reconstruction on the image through a deep learning model to obtain a target image.
14. The image processing method of claim 13, training the deep learning model based on a loss function comprising spectral loss between the sample image and the reference image, comprising:
performing super-resolution reconstruction on the sample image through the deep learning model to obtain a reconstructed image of the sample image;
determining a spectral loss between a reconstructed image of the sample image and the reference image;
determining a function value of the loss function according to the spectrum loss;
and adjusting the model parameters of the deep learning model according to the function values of the loss function.
15. The method of image processing according to claim 14, said determining spectral loss between a reconstructed image of the sample image and the reference image, comprising:
performing Fourier transform on the reconstructed image of the sample image to obtain a third spectrogram;
performing Fourier transform on the reference image to obtain a fourth spectrogram;
and obtaining the spectral loss between the reconstructed image of the sample image and the reference image according to the difference between the third spectrogram and the fourth spectrogram.
16. The image processing method of claim 14 or 15, the determining a function value of the loss function from the spectral loss, comprising:
determining a function value of the loss function according to a difference between the reconstructed image of the sample image and the reference image and a spectral loss between the reconstructed image of the sample image and the reference image.
17. The image processing method of claim 16, the determining a function value of the loss function from a difference between the reconstructed image of the sample image and the reference image and a spectral loss between the reconstructed image of the sample image and the reference image, comprising:
determining a function value of the loss function as a sum of an absolute value of a difference between the reconstructed image of the sample image and the reference image and an absolute value of a difference between the spectrogram of the reconstructed image of the sample image and the spectrogram of the reference image.
18. An image processing apparatus comprising:
an acquisition unit configured to acquire an image to be processed;
a processing unit configured to perform image processing on the image through a deep learning model to generate a target image, wherein the deep learning model is trained based on a loss function including a spectral loss between a sample image and a reference image;
the image processing is depth estimation, the reference image is an image obtained by performing data enhancement processing on the sample image, or the image processing is super-resolution reconstruction, and the reference image is an image which has the same image content as the sample image and has a resolution greater than that of the sample image.
19. An electronic device, comprising: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executing the computer-executable instructions stored by the memory causes the at least one processor to perform the image processing method of any of claims 1 to 17.
20. A computer-readable storage medium having stored therein computer-executable instructions that, when executed by a processor, implement the image processing method of any one of claims 1 to 17.
CN202110560593.7A 2021-05-21 2021-05-21 Image processing method and apparatus Pending CN115375536A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110560593.7A CN115375536A (en) 2021-05-21 2021-05-21 Image processing method and apparatus

Publications (1)

Publication Number Publication Date
CN115375536A (en) 2022-11-22

Family

ID=84058581

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110560593.7A Pending CN115375536A (en) 2021-05-21 2021-05-21 Image processing method and apparatus

Country Status (1)

Country Link
CN (1) CN115375536A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230162480A1 (en) * 2021-11-24 2023-05-25 GM Global Technology Operations LLC Frequency-based feature constraint for a neural network
CN116071268A (en) * 2023-03-01 2023-05-05 中国民用航空飞行学院 Image illumination removal model based on contrast learning and training method thereof
CN116962657A (en) * 2023-09-21 2023-10-27 中国科学院深圳先进技术研究院 Color video generation method, device, electronic equipment and storage medium
CN116962657B (en) * 2023-09-21 2024-02-27 中国科学院深圳先进技术研究院 Color video generation method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination