CN118135083A - Image processing method, model training method and device - Google Patents

Image processing method, model training method and device

Info

Publication number
CN118135083A
CN118135083A (application CN202211535932.7A)
Authority
CN
China
Prior art keywords
image
light source
scene
source color
encoding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211535932.7A
Other languages
Chinese (zh)
Inventor
汪昊
李炜明
河仁友
王强
成映勋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Samsung Electronics Co Ltd filed Critical Samsung Electronics Co Ltd
Priority to CN202211535932.7A (CN118135083A)
Priority to KR1020230153731A (KR20240082198A)
Priority to US18/524,153 (US20240187559A1)
Priority to EP23213638.2A (EP4401041A1)
Publication of CN118135083A

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Image Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Graphics (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides an image processing method, a model training method, and a corresponding device. An image processing method may include: acquiring an input image; predicting, by using an image processing model, light source color information in a scene corresponding to the input image and a panoramic image corresponding to the input image; and performing image rendering based on the light source color information and/or the panoramic image to obtain a rendered image. The above-described image processing method performed by the electronic device may be performed using an artificial intelligence model.

Description

Image processing method, model training method and device
Technical Field
The present disclosure relates to the field of image processing and artificial intelligence, and more particularly, to an image processing method, a model training method, an image processing apparatus, and a model training apparatus.
Background
Mixed reality technology provides a realistic information experience for a user by adding virtual content to the real scene in front of the user. When virtual content and a real scene are fused, a mixed reality system needs to process and understand the illumination conditions of the real scene in real time and with high precision, so as to achieve the high-quality virtual-real fusion effect presented to the user. However, existing devices cannot predict the light source color at the time a scene is captured, resulting in color deviation between the rendered image presented by the mixed reality system and the real scene image.
Disclosure of Invention
Exemplary embodiments of the present disclosure provide an image processing method, a model training method, an image processing apparatus, and a model training apparatus that solve at least the above technical problems and other technical problems not mentioned above, and provide the following advantageous effects.
According to a first aspect of embodiments of the present disclosure, there is provided an image processing method, the method including: acquiring an input image; predicting, by using an image processing model, light source color information in a scene corresponding to the input image and a panoramic image corresponding to the input image; and performing image rendering based on the light source color information and/or the panoramic image to obtain a rendered image.
Optionally, predicting the light source color information in the scene corresponding to the input image and the panoramic image corresponding to the input image by using an image processing model includes: encoding the input image by utilizing an encoding network of the image processing model to obtain encoding characteristics of the input image; the light source color information and the panoramic image are predicted based on the encoding features using a first prediction network of the image processing model.
Optionally, predicting the light source color information in the scene corresponding to the input image and the panoramic image corresponding to the input image by using an image processing model includes: encoding the input image by utilizing an encoding network of the image processing model to obtain encoding characteristics of the input image; predicting the light source color information based on the coding features using a second prediction network of the image processing model; and decoding the coding features by using a decoding network of the image processing model, and predicting the panoramic image.
Optionally, encoding the input image by using an encoding network of the image processing model to obtain an encoding feature of the input image, including: dividing the input image to obtain a plurality of image blocks; encoding the plurality of image blocks respectively to obtain encoding characteristics of the plurality of image blocks; and obtaining the coding characteristics of the input image based on the coding characteristics of the image blocks.
Optionally, encoding the plurality of image blocks respectively to obtain encoding features of the plurality of image blocks, including: respectively encoding the plurality of image blocks to obtain initial encoding characteristics of each image block; and carrying out feature enhancement on the initial coding feature of each image block by utilizing the first attention network of the image processing model to obtain the enhanced coding feature of each image block, wherein the enhanced coding feature is used as the coding feature of the plurality of image blocks.
Optionally, predicting the light source color information includes: for each image block, predicting light source color information in a scene corresponding to the image block based on coding features of the image block and determining a confidence of the image block; and pooling the light source color information corresponding to each image block based on the confidence of each image block to obtain the light source color information.
Optionally, predicting the panoramic image includes: acquiring the confidence coefficient of each image block; decoding the coding features of each image block respectively to obtain a high dynamic range image corresponding to each image block; and pooling the high dynamic range image corresponding to each image block based on the confidence of each image block to obtain the panoramic image.
Optionally, encoding the input image by using an encoding network of the image processing model to obtain an encoding feature of the input image, including: encoding the input image by utilizing the encoding network to obtain global characteristics of the input image; performing feature decoupling on the global features by using a decoupling network of the image processing model to obtain illumination features and scene features; and utilizing a second attention network of the image processing model to perform feature enhancement on the illumination feature and the scene feature to obtain a scene-based illumination feature serving as a coding feature of the input image.
Optionally, performing image rendering based on the light source color information and/or the panoramic image to obtain a rendered image, including: performing image rendering based on the input image, the panoramic image, and information related to the virtual object to obtain a rendered image; and outputting the rendered image by using a video see-through device.
Optionally, performing image rendering based on the light source color information and/or the panoramic image to obtain a rendered image, including: performing color correction on the panoramic image based on the light source color information by using a white balance method to obtain a panoramic image corresponding to a white light source; performing image rendering based on the panoramic image of the white light source and information related to the virtual object to obtain a first rendered image; mapping the light source color to obtain the real light source color of the scene; performing color adjustment on the first rendered image by utilizing the real light source color to obtain a second rendered image; and outputting a target image by using an optical see-through device, wherein the target image is formed by superposing the input image and the second rendered image.
Optionally, the method further comprises: and predicting camera parameters related to the input image shooting by using the image processing model.
Optionally, the input image is a limited field of view image.
According to a second aspect of embodiments of the present disclosure, there is provided a training method of an image processing model, the training method including: acquiring training data, wherein the training data includes limited field of view images with different field angles in the same scene, a first panoramic image, and first light source color information at the time the limited field of view images and the first panoramic image were captured; predicting second light source color information in a scene corresponding to the limited field of view image and a second panoramic image corresponding to the limited field of view image by using the image processing model; and adjusting network parameters of the image processing model based on the first light source color information, the second light source color information, the first panoramic image, and the second panoramic image.
Optionally, predicting, using the image processing model, second light source color information in a scene corresponding to the limited field of view image and a second panoramic image corresponding to the limited field of view image, including: encoding the limited field of view image by utilizing an encoding network of the image processing model to obtain encoding features of the limited field of view image; and predicting the second light source color information and the second panoramic image based on the encoding features using a first prediction network of the image processing model.
Optionally, predicting, using the image processing model, second light source color information in a scene corresponding to the limited field of view image and a second panoramic image corresponding to the limited field of view image, including: encoding the limited field of view image by utilizing an encoding network of the image processing model to obtain encoding features of the limited field of view image; predicting the second light source color information based on the encoding features using a second prediction network of the image processing model; and decoding the encoding features by using a decoding network of the image processing model to predict the second panoramic image.
Optionally, encoding the limited field of view image using an encoding network of the image processing model to obtain an encoding feature of the limited field of view image, including: dividing the limited field of view image to obtain a plurality of image blocks; encoding the plurality of image blocks respectively to obtain encoding features of the plurality of image blocks; and obtaining the encoding features of the limited field of view image based on the encoding features of the image blocks.
Optionally, encoding the plurality of image blocks respectively to obtain encoding features of the plurality of image blocks, including: respectively encoding the plurality of image blocks to obtain initial encoding characteristics of each image block; and carrying out feature enhancement on the initial coding feature of each image block by utilizing the first attention network of the image processing model to obtain the enhanced coding feature of each image block, wherein the enhanced coding feature is used as the coding feature of the plurality of image blocks.
Optionally, predicting the second light source color information includes: for each image block, predicting light source color information in a scene corresponding to the image block based on coding features of the image block and determining a confidence of the image block; and pooling the light source color information corresponding to each image block based on the confidence of each image block to obtain the second light source color information.
Optionally, predicting the second panoramic image includes: acquiring the confidence coefficient of each image block; decoding the coding features of each image block respectively to obtain a high dynamic range image corresponding to each image block; and pooling the high dynamic range image corresponding to each image block based on the confidence of each image block to obtain the second panoramic image.
Optionally, encoding the limited field of view image using an encoding network of the image processing model to obtain an encoding feature of the limited field of view image, including: encoding the limited field of view image by using the encoding network to obtain global features of the limited field of view image; performing feature decoupling on the global features by using a decoupling network of the image processing model to obtain illumination features and scene features; and utilizing a second attention network of the image processing model to perform feature enhancement on the illumination feature and the scene feature to obtain a scene-based illumination feature serving as an encoding feature of the limited field of view image.
Optionally, the method further comprises: acquiring first camera parameters related to the limited field image shooting; predicting second camera parameters associated with the limited field of view image capture using the image processing model; network parameters of the image processing model are adjusted based on the first light source color information, the second light source color information, the first panoramic image, the second panoramic image, the first camera parameters, and the second camera parameters.
According to a third aspect of embodiments of the present disclosure, there is provided an electronic device, which may include: at least one processor; at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the image processing method and the model training method as described above.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform the image processing method and model training method as described above.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product, instructions in which are executed by at least one processor in an electronic device to perform the image processing method and the model training method as described above.
The present disclosure proposes devices and methods for predicting real light source colors and HDR panoramic images suitable for various scenes, improving robustness and adaptability in mixed reality applications.
The present disclosure proposes a real-time image processing model that is capable of predicting both HDR panoramic images and illuminant colors.
The present disclosure designs a virtual-real fusion technique that can perform highly realistic rendering on both VST MR devices and OST MR devices using the illumination results predicted by a single image processing model, so that the virtual content and the real scene have illumination consistency.
The present disclosure provides an image-block-based image processing model, which explores reliable image blocks capable of capturing local context cues and improves the accuracy of scene illumination estimation.
The present disclosure proposes an image processing model that can decouple illumination features and scene features, which can robustly predict HDR panoramic images with real details for various scenes.
Drawings
These and/or other aspects and advantages of the present disclosure will become apparent from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flowchart of an image processing method according to an exemplary embodiment of the present disclosure;
FIG. 2 is a schematic view of an image processing flow according to a first exemplary embodiment of the present disclosure;
FIG. 3 is a schematic view of an image processing flow according to a second exemplary embodiment of the present disclosure;
FIG. 4 is a schematic view of an image processing flow according to a third exemplary embodiment of the present disclosure;
FIG. 5 is a schematic view of an image processing flow according to a fourth exemplary embodiment of the present disclosure;
FIG. 6 is a schematic view of an image processing flow according to a fifth exemplary embodiment of the present disclosure;
FIG. 7 is a schematic view of an image processing flow according to a sixth exemplary embodiment of the present disclosure;
FIG. 8 is a flow diagram of rendering an image according to a first exemplary embodiment of the present disclosure;
FIG. 9 is a flow diagram of rendering an image according to a second exemplary embodiment of the present disclosure;
FIG. 10 is a schematic diagram of a VST MR device and an OST MR device rendering images in accordance with an exemplary embodiment of the present disclosure;
FIG. 11 is a flow diagram of rendering an image according to a third exemplary embodiment of the present disclosure;
FIG. 12 is a flowchart of a training method of an image processing model according to an exemplary embodiment of the present disclosure;
FIG. 13 is a schematic diagram of sample data according to an exemplary embodiment of the present disclosure;
FIG. 14 is a flow diagram of a training method of an image processing model according to an exemplary embodiment of the present disclosure;
FIG. 15 is a flow diagram of a training method of a scene decoder and scene classifier according to an exemplary embodiment of the present disclosure;
FIG. 16 is a schematic structural view of an image processing apparatus of a hardware running environment of an embodiment of the present disclosure;
FIG. 17 is a block diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of the embodiments of the disclosure defined by the claims and their equivalents. Various specific details are included to aid understanding, but are merely to be considered exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. In addition, descriptions of well-known functions and constructions are omitted for clarity and conciseness.
The terms and words used in the following description and claims are not limited to written meanings, but are used only by the inventors to achieve a clear and consistent understanding of the disclosure. Accordingly, it should be apparent to those skilled in the art that the following descriptions of the various embodiments of the present disclosure are provided for illustration only and not for the purpose of limiting the disclosure as defined by the claims and their equivalents.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
Currently, for High Dynamic Range (HDR) image prediction techniques, an autoencoder-based network ENVMAPNET has been proposed for estimating HDR environment images in real time from limited field of view camera images. The network proposes two new losses for end-to-end training, where ProjectionLoss is used for image generation and ClusterLoss is used for adversarial training. Furthermore, a mobile application has been implemented on this basis, which is able to run the network model on a mobile device in 9 milliseconds and to render visually coherent virtual objects in real time in an unseen real-world environment. However, this network model treats all image blocks equally and can only handle indoor scenes. In view of this, the present disclosure proposes an image-block-based image processing model, which is capable of processing semantic cues from various parts of an image while attending to the contributions of the different parts, and of handling indoor and outdoor scenes simultaneously.
One solution to the problem that directly compressing an outdoor panoramic image into low-dimensional hidden features causes the HDR intensity of the sun to interfere with the complex texture of the surrounding sky and lacks fine-grained control has been to design a hierarchically decoupled sky model (HDSky). Using this model, any outdoor panoramic image can be decomposed hierarchically into several factors based on three well-designed automatic encoders. The first automatic encoder compresses each sunny panoramic image into a sky vector and a sun vector with certain constraints. The second and third automatic encoders further separate the sun intensity and the sky intensity from the sun vector and the sky vector, respectively, with several customized loss functions. In addition, a unified framework is designed to predict all-weather sky information from a single outdoor image. However, this method is only applicable to illumination estimation of outdoor scenes. In this regard, the present disclosure enables one network to address illumination estimation for multiple classes of scenes by decoupling hidden features into hidden features that express illumination and hidden features that express scene content, so that both indoor and outdoor scenes can be processed.
A learning-based approach has also been proposed to infer plausible HDR, omnidirectional illumination from unconstrained, Low Dynamic Range (LDR) images with a limited field of view. For training data, videos of various reflective spheres placed within the camera's field of view are collected so that most of the background is not occluded, and different illumination cues are captured in a single exposure using materials with different reflectance functions. A deep neural network is then trained to regress HDR illumination from the LDR background image by matching LDR ground-truth sphere images with images rendered with the predicted illumination using image-based relighting. Running at an interactive frame rate on a mobile device, virtual objects can be realistically rendered into real scenes for mobile Mixed Reality (MR). However, the image resolution obtained by this method (a low-resolution 32×32 illumination map recovered for an unconstrained scene) may not meet MR rendering requirements. In this regard, the present disclosure contemplates a method capable of generating higher-resolution illumination maps, because high-resolution illumination maps reflect the real details that are important for rendering virtual objects with specular surfaces.
Furthermore, the inventors have found that existing illumination estimation models do not predict the light source color, whereas the real light source color needs to be used in an Optical See-Through (OST) MR device to eliminate the illumination color bias between the camera image and the real scene. The rendered images in a Video See-Through (VST) MR device are synthesized using the illumination model estimated from the camera images and therefore have the same color bias when combined with the camera images. However, in an OST MR device, the rendered image is projected onto a transparent optical combiner to fuse with the real image perceived by the human eye. When the camera has color deviation, the virtual content rendered with the biased illumination model has color deviation from the real image seen by human eyes. Therefore, in an illumination estimation method for OST MR, an illumination model with light source color prediction needs to be estimated so that a Computer Graphics (CG) rendering engine can render an image with consistent virtual and real illumination.
In addition, the inventors have found that prior methods using encoder-decoder networks directly encode an input image as a global illumination feature, lacking fine-grained control over independent illumination factors. In fact, not all local areas of the image contribute equally to determining the light source. For example, some local areas provide more informative semantic cues for inferring the direction of illumination, such as shadows of objects (e.g., vases, boxes, sofas) and reflection information from floors and mirror surfaces. Meanwhile, estimating scene illumination from the context information of local areas makes the knowledge learned by the network more interpretable.
Furthermore, the inventors have found that methods such as ENVMAPNET for HDR illumination image estimation directly predict a complete high-resolution panoramic illumination image using a generative model. Due to the lack of a large-scale database of high-quality panoramic HDR images, such methods are typically trained using a very limited number of images, and it is very difficult to infer a complete HDR panorama when the input is recorded as an LDR image with a limited field of view. In addition, the shooting environments of scenes are diverse, including indoor scenes (e.g., schools, apartments, museums, laboratories, factories, sports facilities, restaurants) and outdoor scenes (e.g., parks, streets). In these different types of scenes/environments, the features, structures, textures, and lighting conditions of the objects differ greatly. Thus, how to infer reasonable scene details from the incomplete appearance cues of the input and generalize the model to various types of scenes remains an open problem.
Based at least on the above problems, the present disclosure proposes a method and apparatus for predicting real light source colors and HDR panoramic images suitable for various scenes, for improving robustness and adaptability in mixed reality applications; provides a real-time network model capable of simultaneously predicting the HDR panoramic image and the light source color; designs a virtual-real fusion technique that can use the results predicted by the network model to perform highly realistic image rendering on VST MR and OST MR devices, so that the virtual content and the real scene have illumination consistency; further provides an image-block-based image processing model that explores reliable image blocks capable of capturing local context cues and improves the accuracy of scene illumination estimation; and further provides a network model capable of decoupling illumination features and scene features, which can robustly predict illumination images with real details for various scenes.
Hereinafter, the methods and apparatuses of the present disclosure will be described in detail with reference to the accompanying drawings, according to various embodiments of the present disclosure.
Fig. 1 is a flowchart of an image processing method according to an exemplary embodiment of the present disclosure. The image processing method of the present disclosure may be an illumination estimation method. The method shown in Fig. 1 may be implemented by the image processing model of the present disclosure (which may be referred to as a light estimation model). The various modules in the image processing model may be implemented by a neural network. The image processing method may be performed in an electronic device having an image processing function. The electronic device may be, for example, a smartphone, a tablet computer, a portable computer, a desktop computer, and the like.
Referring to fig. 1, in step S101, an input image is acquired.
For example, an image captured by a camera may be used as the input image. In general, when a camera is used to photograph an indoor or outdoor scene, the camera may not be able to capture a panoramic image due to its limited field of view. As examples, the input image may be a limited field of view image, a limited field of view color image, or a scene-incomplete color image. As an example, the input image may be an LDR image with a limited field of view.
In step S102, light source color information in a scene corresponding to an input image and a panoramic image corresponding to the input image are predicted using an image processing model.
According to an embodiment of the present disclosure, one image processing model may be used to simultaneously predict the light source color information in the scene corresponding to the input image and the panoramic image corresponding to the input image. The image processing model may be implemented by a neural network.
As an example, the image processing model may include an encoding network (encoder) and a first prediction network (first predictor). The input image is encoded using the encoding network of the image processing model to obtain the encoding features of the input image, and then the corresponding light source color information and panoramic image are predicted based on the encoding features using the first prediction network of the image processing model.
As an example, the encoding network of the present disclosure may be implemented by an automatic encoder. The auto-encoder may compress the original spatial features of the input image into one potential spatial feature (i.e., an encoded feature) through the encoding map for use in a subsequent decoding operation. Furthermore, the coding network may also be implemented by other neural networks. The above examples are merely exemplary, and the present disclosure is not limited thereto.
After obtaining the coding feature of the input image, the coding feature may be input to a first prediction network implemented by a neural network, and the first prediction network may output the light source color information corresponding to the input image and the panoramic image.
As another example, the image processing model may include an encoding network (encoder), a second prediction network (second predictor), and a decoding network (decoder). After the coding network outputs the coding features, a second prediction network of the image processing model can be utilized to predict the light source color information corresponding to the input image based on the coding features; and decoding the encoded features using a decoding network of the image processing model to predict a panoramic image corresponding to the input image.
In the present disclosure, the second prediction network may be referred to as a light source color predictor for predicting an illumination color in a scene to which the captured image corresponds.
For example, the second prediction network may be formed from a 2-layer or 3-layer convolutional network. The encoding features of the input image are passed through the convolutional network to output a three-dimensional vector representing the color of the light source in the scene corresponding to the image. The above examples are merely exemplary, and the present disclosure is not limited thereto.
As another example, the decoding network may obtain a high dynamic range HDR panoramic image corresponding to the input image by decoding the encoded features. In the present disclosure, an HDR panoramic image may be referred to as an HDR illumination panorama, illumination map, etc., without limitation of the present disclosure.
As an example, the decoding network of the present disclosure may be implemented by an auto decoder corresponding to an auto encoder that obtains an HDR panoramic image by reconstructing potential spatial features. In the present disclosure, a decoder for predicting a panoramic image may be referred to as an illumination decoder.
For example, the above-described neural network ENVMAPNET may be employed as the encoding and decoding networks of the present disclosure to predict an HDR panoramic image, and a convolutional network may be employed as the second prediction network of the present disclosure to predict the illumination color corresponding to the input image. In the present disclosure, an encoding network for encoding may also be referred to as an encoder, a decoding network for decoding encoding features may be referred to as a decoder, and a prediction network for predicting light source colors may be referred to as a predictor.
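To make the wiring of encoder, light source color predictor, and decoder concrete, the following is a minimal PyTorch-style sketch. The module sizes, layer counts, and the name LightEstimationModel are illustrative assumptions, not the reference implementation of this disclosure (which may, for example, reuse the ENVMAPNET encoder and decoder).

```python
import torch
import torch.nn as nn

class LightEstimationModel(nn.Module):
    """Minimal sketch: encoder -> (light source color predictor, HDR panorama decoder)."""
    def __init__(self, feat_ch=256):
        super().__init__()
        # Encoder: compresses the limited-field-of-view LDR image into latent features.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, feat_ch, 4, stride=2, padding=1), nn.ReLU(),
        )
        # Light source color predictor: a small (here 3-layer) convolutional head ending in a 3-vector.
        self.color_head = nn.Sequential(
            nn.Conv2d(feat_ch, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 3),  # RGB light source color
        )
        # Decoder: reconstructs an HDR panorama from the latent features.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(feat_ch, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Softplus(),  # HDR values >= 0
        )

    def forward(self, image):
        feat = self.encoder(image)
        light_color = self.color_head(feat)    # (B, 3)
        hdr_panorama = self.decoder(feat)      # (B, 3, H, W)
        return light_color, hdr_panorama

# Usage: predict from a 256x256 limited-field-of-view image.
model = LightEstimationModel()
color, panorama = model(torch.rand(1, 3, 256, 256))
```

In practice the decoded panorama would typically use an equirectangular aspect ratio rather than the square output shown here; the sketch only illustrates how one encoder feeds both prediction branches.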
According to another embodiment of the present disclosure, the second prediction network for predicting the light source color may be an image-block-based prediction network, i.e., the encoding network outputs encoding features of a plurality of image blocks divided from the input image, and the prediction network may predict the light source color in the scene corresponding to each image block. Fig. 2 is a schematic diagram of an image processing flow according to a first exemplary embodiment of the present disclosure.
Referring to fig. 2, the image processing model may be composed of an encoder, a decoder, and a light source color predictor.
The input image passes through an encoder to obtain global coding features of the input image, and the global coding features pass through a decoder to obtain an HDR panoramic image. Here, the encoder and decoder may be implemented as the encoder and decoder in ENVMAPNET.
On this basis, an image-block-based light source color prediction branch is added to obtain the light source color information of the scene corresponding to each image block. For example, the light source color predictor is implemented using a 2-layer to 3-layer convolutional network. The encoder may divide the input image according to a preset image block division method to obtain a plurality of image blocks (such as N×M image blocks, where N and M are positive integers greater than zero), and then encode each image block separately to obtain the encoding features of each image block. For example, if an input image is divided into 3×4=12 image blocks, the spatial dimension of the encoding features output by the encoder is 3×4. The input to the light source color predictor is the encoding features with N×M spatial dimensions, and the light source color predictor estimates a confidence and a light source color for the encoding feature of each image block. Here, the confidence may represent the proportion/weight, such as 95%, of the corresponding image block in estimating the light source color of the entire image. The light source colors of the image blocks are weighted and pooled according to the confidence of each image block to obtain the light source color information of the scene corresponding to the input image.
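A minimal sketch of the confidence-based weighted pooling described above follows; the softmax normalization of the confidences is an assumption, since the disclosure only specifies weighted pooling without fixing a normalization.

```python
import torch
import torch.nn.functional as F

def pool_block_predictions(block_colors, block_confidences):
    """
    Confidence-weighted pooling of per-image-block light source colors.

    block_colors:      (B, N*M, 3)  per-block RGB light source predictions
    block_confidences: (B, N*M)     per-block confidence scores (unnormalized)
    returns:           (B, 3)       pooled light source color for the whole image
    """
    weights = F.softmax(block_confidences, dim=1)            # normalize confidences to weights
    return (weights.unsqueeze(-1) * block_colors).sum(dim=1)

# Example with an image divided into 3x4 = 12 blocks.
colors = torch.rand(2, 12, 3)
conf = torch.rand(2, 12)
image_color = pool_block_predictions(colors, conf)           # (2, 3)
```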
The image processing model shown in fig. 2 is capable of outputting a high resolution HDR panoramic image and light source color information of a three-dimensional vector. The HDR panoramic image reflects information such as the omnidirectional illumination direction, distribution, intensity, and the like of the scene, and the light source color information reflects the illumination color of the scene, for example, the light source color information is information related to the real illumination color of the photographed real scene. The light source color information may be represented by RGB values, but is not limited thereto.
Fig. 3 is a schematic view of an image processing flow according to a second exemplary embodiment of the present disclosure. In addition to predicting light source colors as shown in fig. 2, the present disclosure can also be extended to estimating camera parameters of a captured image, such as exposure and sensitivity, etc.
Referring to fig. 3, an input image passes through an encoder to obtain global encoding features of the input image, and the global encoding features pass through a decoder to obtain an HDR panoramic image. Here, the encoder and decoder may be implemented as the encoder and decoder in ENVMAPNET.
After obtaining the encoding features of the respective image blocks of the input image from the encoder, the predictor may predict camera parameters (such as exposure, sensitivity) of the respective image blocks, light source colors of the scene, and confidence, based on the encoding features of the respective image blocks, and then obtain the light source colors, the camera parameters of the input image, respectively, using a confidence-based weighted pooling operation.
The predictors illustrated in fig. 3 may include the light source color predictor of fig. 2, and may also include a corresponding predictor for predicting camera parameters. In addition, prediction of camera parameters, light source color, and confidence may also be implemented by one predictor.
By predicting more camera parameters, Augmented Reality (AR) and MR systems are allowed to develop more intelligent technologies and to realize virtual-real fusion in a highly realistic manner.
Fig. 4 is a schematic view of an image processing flow according to a third exemplary embodiment of the present disclosure. The method shown in fig. 4 can predict an accurate HDR panoramic image from the local features of different image blocks.
Referring to fig. 4, the input image is a limited field of view color image or an incomplete color image. The input image is passed through an encoder to obtain the encoding features of each image block of the input image. For example, the spatial dimension of the encoding features output by an encoder implemented by a convolutional neural network is N×M, i.e., the input image is divided into N×M image blocks.
The image block-based decoder may decode the encoded features of the respective image blocks separately, resulting in a High Dynamic Range (HDR) image corresponding to each image block. For each of the individual image blocks of the input image, the light source color predictor may predict a light source color in a scene corresponding to the image block based on the coding features of the image block and determine a confidence of the image block.
The image block-based decoder may pool the high dynamic range image corresponding to each image block based on the confidence of the respective image block, e.g., using adaptive weighted pooling, resulting in an HDR panoramic image corresponding to the input image. The encoding features of the image blocks are used for outputting the HDR images of the image blocks through the image block-based decoder, and the HDR images of the image blocks are weighted and pooled according to the confidence of the image blocks to obtain the HDR panoramic image. The coding features of the image blocks pass through a light source color predictor, the light source colors of the image blocks and the confidence coefficient of the image blocks are output, and the light source colors of the image blocks are weighted and pooled according to the confidence coefficient of the image blocks to obtain the light source color information in the scene corresponding to the input image.
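The per-block decoding and pooling described above can be sketched as follows; the tiny linear decoder and the softmax weighting are placeholder assumptions standing in for the image-block-based decoder of this disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlockHDRDecoder(nn.Module):
    """Sketch of the image-block-based decoder: decode each block's encoding
    feature into an HDR map, then pool the maps with the block confidences."""
    def __init__(self, feat_dim=256, out_hw=(64, 128)):
        super().__init__()
        self.out_hw = out_hw
        # A deliberately tiny per-block decoder (an assumption, not the patent's network).
        self.decode = nn.Sequential(
            nn.Linear(feat_dim, 3 * out_hw[0] * out_hw[1]), nn.Softplus())

    def forward(self, block_features, block_confidences):
        # block_features: (B, K, feat_dim), block_confidences: (B, K)
        B, K, _ = block_features.shape
        hdr = self.decode(block_features).view(B, K, 3, *self.out_hw)  # per-block HDR maps
        w = F.softmax(block_confidences, dim=1).view(B, K, 1, 1, 1)    # pooling weights
        return (w * hdr).sum(dim=1)                                    # (B, 3, H, W) panorama

# Example: 12 blocks with 256-dim features.
decoder = BlockHDRDecoder()
panorama = decoder(torch.rand(1, 12, 256), torch.rand(1, 12))
```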
Because there are certain relations among image blocks that reflect illumination information in the image (for example, the shadow of an object may cover multiple image blocks, or shadow information may exist on objects in multiple image blocks), the relation of information between image blocks is very important for accurately predicting illumination details. By designing an attention mechanism, the operation of focused learning and selection on the content relevant to the illumination information can be accomplished.
Fig. 5 is a schematic view of an image processing flow according to a fourth exemplary embodiment of the present disclosure.
Referring to fig. 5, the encoder encodes the divided image blocks respectively to obtain the initial encoding feature of each image block. Feature enhancement is performed on the initial encoding feature of each image block by utilizing the first attention network of the image processing model to obtain the enhanced encoding feature of each image block. As an example, the first attention network may be implemented by a Transformer network. An attention mechanism may be applied to the individual encoding features input to the first attention network.
For example, the encoding features of the individual image blocks obtained by the encoder are passed through an image block attention network to output the enhanced encoding features of the image blocks. The encoding feature of each image block can be treated as a basic object (token) of the attention operation, and a self-attention mechanism (self-attention) is used to operate on the initial encoding features of the image blocks (i.e., the encoding features of each image block output by the encoder). The enhanced image block encoding features are respectively passed through a decoder and a light source color predictor to output the HDR panoramic image and the light source color information of the photographed scene.
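A minimal sketch of the first attention network, treating each image block's encoding feature as a token and applying self-attention (here via nn.TransformerEncoder; the layer count and head count are assumptions):

```python
import torch
import torch.nn as nn

class BlockAttention(nn.Module):
    """Sketch of the first attention network: self-attention over image-block tokens."""
    def __init__(self, feat_dim=256, num_heads=4, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, block_features):
        # block_features: (B, N*M, feat_dim) initial encoding features, one token per block
        return self.encoder(block_features)    # enhanced encoding features, same shape

# Example: enhance the features of 12 image blocks.
attn = BlockAttention()
enhanced = attn(torch.rand(1, 12, 256))
```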
The above embodiments all operate on the global features of the input image. According to the embodiments of the present disclosure, the illumination features (which may also be referred to as illumination hidden features, hidden features expressing illumination, etc.) and the scene features (which may also be referred to as scene hidden features, hidden features expressing scene content, etc.) may be decoupled from the global features output from the encoder to ensure the accuracy of illumination estimation and the rationality of scene generation. Here, the illumination features may represent relevant features reflecting the illumination color of the scene, and the scene features may represent relevant features of the scene of the captured image. For example, scene features may include scene structural features, texture features, and the like.
Fig. 6 is a schematic view of an image processing flow according to a fifth exemplary embodiment of the present disclosure.
The method comprises the steps of encoding an input image by using an encoder of an image processing model to obtain global features of the input image, performing feature decoupling on the global features by using a decoupling network of the image processing model to obtain illumination features and scene features, and performing feature enhancement on the illumination features and the scene features by using a second attention network of the image processing model to obtain illumination features based on scenes. And predicting light source color information in a scene corresponding to the input image by using illumination characteristics based on the scene, and decoding the HDR panoramic image.
Referring to fig. 6, the input image is passed through an encoder to obtain global image features. The global image features are subjected to a feature decoupling operation to obtain illumination hidden features and scene hidden features, respectively. For example, the decoupling operation may be a simple mapping operation that converts the global feature into two independent feature items: the illumination hidden feature and the scene hidden feature.
The scene hidden features and the illumination hidden features expressing the scene content pass through a scene-aware attention network, and the scene-aware attention network can apply an attention mechanism to the scene hidden features and the illumination hidden features to obtain illumination features of scene guidance. The decoder outputs an HDR panoramic image based on the scene-directed illumination characteristics.
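One possible realization of the decoupling operation and the scene-aware attention network is sketched below, using simple linear mappings for decoupling and cross-attention in which the illumination hidden features act as queries over the scene hidden features; this query/key assignment and the module name SceneAwareIllumination are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SceneAwareIllumination(nn.Module):
    """Sketch: decouple global features into illumination/scene features,
    then let illumination features attend to scene features (scene-aware attention)."""
    def __init__(self, feat_dim=256, num_heads=4):
        super().__init__()
        # Decoupling: two simple mappings from the global feature.
        self.to_illum = nn.Linear(feat_dim, feat_dim)
        self.to_scene = nn.Linear(feat_dim, feat_dim)
        # Scene-aware attention: illumination features as queries, scene features as keys/values.
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)

    def forward(self, global_feat):
        # global_feat: (B, T, feat_dim) global features from the encoder (T tokens)
        illum = self.to_illum(global_feat)
        scene = self.to_scene(global_feat)
        scene_guided_illum, _ = self.attn(query=illum, key=scene, value=scene)
        return scene_guided_illum    # consumed by the decoder / light source color predictor

# Example
module = SceneAwareIllumination()
out = module(torch.rand(1, 12, 256))    # (1, 12, 256)
```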
Fig. 7 is a schematic view of an image processing flow according to a sixth exemplary embodiment of the present disclosure.
Referring to fig. 7, an input image passes through an encoder to obtain global image features of respective image blocks. For each image block, the global image features of the image block are subjected to feature decoupling operation to respectively obtain illumination hidden features and scene hidden features for the image block. The scene hidden features and the illumination hidden features pass through a scene-aware attention network, which can apply an attention mechanism to the scene hidden features and the illumination hidden features to obtain illumination features of scene guidance for the image block. Here, the scene-guided illumination feature may represent that the illumination feature includes some feature expressing illumination extracted from the scene information. An image block-based decoder (which may also be referred to as a local image block-based illumination decoder) derives an HDR image for each image block based on scene-directed illumination characteristics, and outputs an HDR panoramic image through a confidence-based weighted pooling operation. The light source color predictor based on the image blocks outputs light source color information in the scenes corresponding to the image blocks based on illumination characteristics of scene guidance, and outputs light source colors in the scenes corresponding to the input images through a weighted pooling operation based on confidence.
In step S103, image rendering is performed based on the predicted light source color information and/or the panoramic image, resulting in a rendered image.
According to embodiments of the present disclosure, a single image processing model may be utilized to handle the image rendering of both VST devices and OST devices.
For a VST device, image rendering may be performed based on the input image, the predicted HDR panoramic image, and information related to the virtual object, resulting in a rendered image, which is output by the VST device.
For OST equipment, a white balance method can be utilized to carry out color correction on the predicted panoramic image based on the predicted light source color information, so as to obtain a panoramic image of a white light source; performing image rendering based on the panoramic image of the white light source and information related to the virtual object to obtain a first rendered image; mapping the predicted light source color information to obtain the real light source color information of the real scene; performing color adjustment on the first rendered image by using the real light source color information to obtain a second rendered image; and outputting, by the OST device, a target image formed by superimposing the input image and the second rendered image. As an example, the mapping operation may be implemented through a lightweight neural network.
Fig. 8 is a flow diagram of rendering an image according to a first exemplary embodiment of the present disclosure.
Referring to fig. 8, a camera image is input, and an HDR panoramic image and light source color information are obtained using the image processing model described above.
For VST devices, a CG rendering engine (CG renderer) directly renders a rendered image using the predicted HDR panoramic image, virtual objects and related information (such as position, pose, orientation, etc.), the real image captured by the camera. The rendered image is output as a display of the VST device.
For the OST device, color correction is performed on the predicted HDR panoramic image based on the predicted light source color information by using a white balance operation, to obtain a corrected HDR panoramic image corresponding to a white light source. The CG rendering engine obtains a rendered image containing only virtual content using the corrected HDR panoramic image of the white light source, the virtual object, and its related information.
In order to compensate for the color deviation between the image captured by the camera and the real image perceived by the human eye, a simple mapping operation may be used to map the light source color corresponding to the camera image to the real scene light source color. For example, the predicted light source color is mapped to the real light source color of the scene via a mapping operation. The mapping operation may be implemented using a 2-layer or 3-layer multi-layer perceptron.
Color adjustment is then performed on the rendered image containing only the virtual content using the mapped real light source color of the scene, to obtain a corrected rendered image, and the corrected rendered image is superimposed on the display of the OST device.
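The OST-side steps (white balance correction of the panorama, mapping the camera light source color to the real scene light source color with a small MLP, and tinting the rendered virtual content) might look like the following sketch; the von Kries-style channel gains and the per-channel tinting are assumed concrete forms of the white balance and color adjustment operations, and the CG rendering step itself is only represented by a placeholder tensor.

```python
import torch
import torch.nn as nn

def white_balance(hdr_panorama, light_color):
    """Von Kries-style correction (an assumption): scale channels so the
    estimated light source becomes a white (neutral) light source."""
    # hdr_panorama: (B, 3, H, W), light_color: (B, 3) predicted RGB of the light source
    gain = light_color.mean(dim=1, keepdim=True) / light_color.clamp(min=1e-6)
    return hdr_panorama * gain.view(-1, 3, 1, 1)

class ColorMapping(nn.Module):
    """Sketch of the 2- to 3-layer MLP mapping the camera-image light source color
    to the real scene light source color (hidden size is an assumption)."""
    def __init__(self, hidden=32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, camera_light_color):
        return self.mlp(camera_light_color)

# OST-style flow: correct panorama -> render virtual content (CG engine, not shown)
# -> tint the rendered image with the mapped real light source color.
panorama = torch.rand(1, 3, 64, 128)
pred_color = torch.tensor([[0.9, 0.7, 0.5]])
white_panorama = white_balance(panorama, pred_color)
real_color = ColorMapping()(pred_color)
rendered = torch.rand(1, 3, 64, 128)                  # stand-in for the CG-rendered image
adjusted = rendered * real_color.view(-1, 3, 1, 1)    # simple per-channel color adjustment
```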
Methods according to embodiments of the present disclosure may be used in augmented reality and mixed reality systems to estimate illumination information of a user's surroundings (i.e., information reflecting the light source distribution and intensity in the environment, such as an illumination map and/or light source color information) in real time, providing a basis for drawing and rendering virtual objects, so that virtual objects are more naturally combined with the real environment, and real-time inference can support a fluent augmented reality experience.
Fig. 9 is a flow diagram of rendering an image according to a second exemplary embodiment of the present disclosure.
Referring to fig. 9, an image captured by an AR camera may be passed through one of the image processing models of the present disclosure to obtain an HDR panoramic image and the light source colors in the scene of the captured image. The high-resolution, high-dynamic-range illumination panoramic image has rich details and can support highly realistic rendering of virtual content with glossy surfaces. The CG rendering engine of a VST device or an OST device may utilize the predicted HDR panoramic image, the virtual object and its related information, and/or the light source color information to obtain a rendered image.
The virtual object may be rendered into the real scene with realistic illumination; in particular, the HDR panoramic image enables real reflection details on specular virtual object surfaces, and in an OST head-mounted device, the light source color information is used to ensure that the virtual object has an illumination color consistent with the real scene.
The present disclosure can estimate illumination using one illumination model and achieve the effect of rendering specular objects, and the illumination estimation method is adaptable to various types of scenes (such as living rooms, offices, stores, restaurants, streets, parks, etc.), so it can be applied to actual, complex application scenarios. In addition, the illumination for different devices (OST and VST) can be processed using one illumination model, and virtual content and real-world scenes with consistent illumination can be rendered.
FIG. 10 is a schematic diagram of a VST MR device and an OST MR device rendering images according to an exemplary embodiment of the present disclosure. Fig. 10 illustrates the difference between the virtual-real fusion processes of an OST device and a VST device. The illumination map in fig. 10 may represent an HDR panoramic image.
Referring to fig. 10, in a VST MR device, the real image of the world captured by the camera and the rendered image displayed on the screen are combined. When the camera image has color deviation, human eyes cannot perceive an obvious color inconsistency between the real image and the rendered image. This is because the rendered image is synthesized using the illumination model estimated from the camera image and therefore has the same color deviation.
In an OST MR device, the rendered image is projected onto a transparent optical combiner and fused with the real image perceived by the human eye. When the camera image has color bias, virtual objects rendered with the biased illumination may have color deviations from the real image seen by the human eye. Therefore, the OST MR device needs to provide the CG rendering engine with an illumination estimate that contains the real-world light source colors, so as to render images in which the virtual objects and the real scene have consistent real illumination.
Fig. 11 is a flow chart of rendering an image according to a third exemplary embodiment of the present disclosure.
Referring to fig. 11, using the image processing model of the present disclosure, the HDR panoramic image of the camera image and the light source colors in the scene reflected by the camera image are predicted. Image rendering is performed on different devices (such as VST devices and OST devices) using the rendering flow shown in fig. 8, so that the illumination of the virtual object and the real scene can be kept consistent on the different devices.
By estimating the illumination of the scene environment, the illumination of the virtual content rendered in the augmented reality system can be controlled, so that the virtual object is rendered with high quality and the interaction between the virtual object and the real scene is realized in real time.
Fig. 12 is a flowchart of a training method of an image processing model according to an exemplary embodiment of the present disclosure. The image processing model of the present disclosure may be trained on an electronic device or server.
Referring to fig. 12, in step S1201, training data is acquired. The training data may include limited field of view images with different field angles in the same scene, a first panoramic image, and first light source color information at the time the limited field of view images and the first panoramic image were captured.
Since the number of complete panoramic images is limited, LDR color images with different angles of view can be collected. Further, when the above-described real images are acquired, the light source color at the time the real images were captured is acquired at the same time. In order to obtain sufficiently rich prior information about the scene content, images with different angles of view may be captured in different scenes.
Fig. 13 is a schematic diagram of sample data according to an exemplary embodiment of the present disclosure.
An image having a specific angle of view may be used as the input of the image processing model, and a real panoramic image or a relatively complete panoramic image that contains the image with the specific angle of view may be used as the comparison target for the predicted panoramic image. As shown in FIG. 13, when training the model, multiple image pairs, such as (scene-incomplete image, scene-complete image) and (scene-incomplete image, relatively complete scene image), can be obtained; in this way, a sufficient number of training samples can be constructed. The images in each pair are images of the same scene.
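As an illustration of how such training pairs could be organized for the training loop, a minimal PyTorch Dataset is sketched below; the field names, tensor shapes, and the class name IlluminationPairs are assumptions, not part of this disclosure.

```python
import torch
from torch.utils.data import Dataset

class IlluminationPairs(Dataset):
    """One possible organization of the training pairs described above:
    each sample pairs a limited-field-of-view image with the (relatively)
    complete panorama of the same scene and the recorded light source color."""
    def __init__(self, samples):
        # samples: list of dicts with keys
        #   'view'     : (3, H, W) tensor, limited-field-of-view LDR image
        #   'panorama' : (3, Hp, Wp) tensor, first (ground-truth) panoramic image
        #   'light_rgb': (3,) tensor, first (ground-truth) light source color
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        s = self.samples[idx]
        return s['view'], s['panorama'], s['light_rgb']

# Example: two views of the same scene paired with the same panorama and light source color.
pano = torch.rand(3, 128, 256)
color = torch.tensor([0.8, 0.75, 0.6])
data = IlluminationPairs([
    {'view': torch.rand(3, 256, 256), 'panorama': pano, 'light_rgb': color},
    {'view': torch.rand(3, 256, 256), 'panorama': pano, 'light_rgb': color},
])
```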
In step S1202, second light source color information in a scene corresponding to the limited field of view image and a second panoramic image corresponding to the limited field of view image are predicted using the image processing model.
For example, in the case where the image processing model includes an encoding network and a first prediction network, the encoding network of the image processing model may be utilized to encode the limited field of view image resulting in encoded features of the limited field of view image; the second light source color information and the second panoramic image are predicted based on the encoding features using a first prediction network of the image processing model. The first predictive network may be implemented by a neural network.
For another example, in the case where the image processing model includes an encoding network, a first prediction network, and a decoding network, the limited field of view image may be encoded using the encoding network of the image processing model to obtain an encoded feature of the limited field of view image; predicting second light source color information based on the coding features using a second prediction network of the image processing model; and decoding the coding features by using a decoding network of the image processing model, and predicting the second panoramic image.
The encoding network of the present disclosure may be implemented by an autoencoder. The autoencoder compresses the original spatial features of the limited field-of-view image into a latent spatial feature (i.e., the encoding feature) through an encoding mapping, for use in the subsequent decoding operation. The above examples are merely exemplary, and the present disclosure is not limited thereto.
The second prediction network of the present disclosure may be formed by a 2-layer or 3-layer convolutional network. The convolutional network maps the encoding features of the limited field-of-view image to a three-dimensional vector representing the light source color in the scene corresponding to the image. The above examples are merely exemplary, and the present disclosure is not limited thereto.
The decoding network of the present disclosure may be implemented by a decoder corresponding to the autoencoder, which obtains an HDR panoramic image by reconstructing the latent spatial feature. For example, the above-described neural network ENVMAPNET may be employed as the encoding and decoding networks of the present disclosure to predict the HDR panoramic image, and a convolutional network may be employed as the second prediction network of the present disclosure to predict the light source color information corresponding to the limited field-of-view image.
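The following is a minimal, hedged sketch of how the encoding network, the second prediction network, and the decoding network could be wired together. Layer counts, channel sizes, and the Softplus output activation are illustrative assumptions, not the disclosure's actual EnvMapNet-based networks.

```python
# A minimal stand-in for the encoder / light-source head / decoder split described above.
import torch
import torch.nn as nn

class Encoder(nn.Module):                      # encoding network: image -> latent (encoding) features
    def __init__(self, latent_ch=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, latent_ch, 4, stride=2, padding=1), nn.ReLU(),
        )
    def forward(self, x):                      # x: (B, 3, H, W) limited field-of-view image
        return self.net(x)

class LightColorHead(nn.Module):               # "second prediction network": a few conv layers -> RGB vector
    def __init__(self, latent_ch=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(latent_ch, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1),
        )
    def forward(self, z):
        return self.conv(z).mean(dim=(2, 3))   # (B, 3) predicted light source color

class PanoDecoder(nn.Module):                  # decoding network: latent features -> HDR panorama
    def __init__(self, latent_ch=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(latent_ch, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Softplus(),  # HDR values >= 0
        )
    def forward(self, z):                      # panorama aspect-ratio handling omitted in this sketch
        return self.net(z)
```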
In step S1203, network parameters of the image processing model are adjusted based on the first light source color information, the second light source color information, the first panoramic image, and the second panoramic image.
For example, a first loss function may be constructed based on the first light source color information and the second light source color information, and a second loss function may be constructed based on the first panoramic image and the second panoramic image, network parameters of the image processing model being adjusted by minimizing losses calculated by the first loss function and the second loss function. For another example, the network parameters may be adjusted by supervised learning of the predicted HDR panoramic image and the real panoramic image. In addition, the network parameters can be further adjusted by determining whether the output HDR panoramic image is a real image or a false image through the discriminator, so that the predicted HDR panoramic image is more vivid.
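A minimal sketch of one training step under these assumptions is given below: an L1 loss is used for both the first and second loss functions, the adversarial term is omitted, and the module names are taken from the sketch above. The loss weights are hypothetical.

```python
# Hedged sketch of step S1203: joint minimization of a light-source-color loss and a
# panorama reconstruction loss.
import torch.nn.functional as F

def training_step(encoder, light_head, pano_decoder, optimizer, batch, w_light=1.0, w_pano=1.0):
    view, gt_pano, gt_light = batch                   # limited-FOV image, first panorama, first light color
    z = encoder(view)
    pred_light = light_head(z)                        # second light source color information
    pred_pano = pano_decoder(z)                       # second panoramic image
    # resize prediction to the ground-truth panorama resolution (an assumption of this sketch)
    pred_pano = F.interpolate(pred_pano, size=gt_pano.shape[-2:], mode="bilinear", align_corners=False)
    loss_light = F.l1_loss(pred_light, gt_light)      # first loss function
    loss_pano = F.l1_loss(pred_pano, gt_pano)         # second loss function
    loss = w_light * loss_light + w_pano * loss_pano
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```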
According to another embodiment of the present disclosure, the image processing model may also predict relevant camera parameters of the captured image. For example, the panoramic image, the light source color information, and the camera parameters corresponding to the limited field of view image may be predicted and output simultaneously by the first prediction network. Or the second prediction network can output the light source color information and the camera parameters corresponding to the limited field image at the same time. Or additional neural networks (such as a layer 2 or layer 3 convolutional network) may be added to the image processing model for predicting relevant camera parameters.
In case of predicting camera parameters, a first camera parameter of a captured image may be acquired; and predicting a second camera parameter for capturing the limited field of view image by using the image processing model. During model training, network parameters of the image processing model are adjusted based on the first camera parameters, the second camera parameters, the first light source color information, the second light source color information, the first panoramic image and the second panoramic image. For example, a first loss function may be constructed based on the first light source color information and the second light source color information, a second loss function may be constructed based on the first panoramic image and the second panoramic image, a third loss function may be constructed based on the first camera parameters and the second camera parameters, and network parameters of the image processing model may be adjusted by minimizing losses calculated by the first loss function, the second loss function, and the third loss function.
In accordance with an embodiment of the present disclosure, in the case where the encoding network outputs image block-based encoding features, the decoding network and the second prediction network may be a block-based decoding network and a block-based prediction network. The coding network can divide the limited view field image according to a preset image block dividing method to obtain a plurality of image blocks; and respectively encoding the plurality of image blocks to obtain encoding characteristics of the plurality of image blocks as encoding characteristics of the limited field-of-view image. For each image block of the plurality of image blocks, the block-based prediction network may predict light source color information in a scene corresponding to the image block based on coding features of the image block and determine a confidence of the image block; and pooling the light source color information corresponding to the image blocks based on the confidence of each image block to obtain the light source color information in the scene corresponding to the limited field image. The decoding network can respectively decode the coding features of the image blocks to obtain a high dynamic range image corresponding to each image block; and pooling the high dynamic range image corresponding to the image block based on the confidence of each image block to obtain a second panoramic image with high dynamic range corresponding to the limited field image. In this way, network parameters of the encoding network, the block-based decoding network, and the block-based prediction network are adjusted by minimizing the losses calculated by the first and second loss functions.
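The confidence-based pooling described above could be implemented roughly as follows; the tensor shapes and the softmax normalization of the confidences are assumptions for illustration.

```python
# Sketch of confidence-weighted pooling over image blocks: each block yields a light
# source color, an HDR image, and a confidence; per-block outputs are pooled with
# normalized confidence weights.
import torch

def pool_block_predictions(block_colors, block_hdr, block_conf):
    """
    block_colors: (B, N, 3)        per-block light source color
    block_hdr:    (B, N, 3, H, W)  per-block high dynamic range image
    block_conf:   (B, N)           per-block confidence score
    """
    w = torch.softmax(block_conf, dim=1)                        # normalize confidences over the N blocks
    color = (w.unsqueeze(-1) * block_colors).sum(dim=1)         # (B, 3) pooled light source color
    hdr = (w.view(*w.shape, 1, 1, 1) * block_hdr).sum(dim=1)    # (B, 3, H, W) pooled HDR panorama
    return color, hdr
```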
According to a further embodiment of the present disclosure, the image processing model may further comprise a first attention network, and the encoding network may encode the plurality of image blocks, respectively, resulting in an initial encoding feature for each image block. The first attention network performs feature enhancement on the initial coding feature of each image block to obtain enhanced coding features of each image block, wherein the enhanced coding features are used as coding features of a plurality of image blocks. The block-based prediction network may predict light source color information in a scene corresponding to the image block based on the enhanced coding features of the image block and determine a confidence of the image block; and pooling the light source color information corresponding to the image blocks based on the confidence of each image block to obtain the light source color information in the scene corresponding to the limited field image. The decoding network can respectively decode the enhancement coding features of the image blocks to obtain a high dynamic range image corresponding to each image block; and pooling the high dynamic range image corresponding to the image blocks based on the confidence of each image block to obtain the HDR panoramic image. In this case, the network parameters of the encoding network, the block-based decoding network, the block-based prediction network, and the first attention network are adjusted by minimizing the losses calculated by the first and second loss functions.
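A minimal interpretation of the first attention network is a self-attention layer over the per-block tokens, as sketched below; the embedding dimension and number of heads are assumptions.

```python
# Per-block initial encoding features are treated as tokens and enhanced with multi-head
# self-attention so that each block can attend to the others.
import torch.nn as nn

class BlockAttention(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, block_feats):                  # (B, N, dim) initial per-block encoding features
        enhanced, _ = self.attn(block_feats, block_feats, block_feats)
        return self.norm(block_feats + enhanced)     # enhanced coding features, one per block
```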
According to a further embodiment of the present disclosure, the image processing model may further comprise a decoupling network and a second attention network. The encoding network can encode the limited field image to obtain the global feature of the limited field image. The decoupling network can perform feature decoupling on the global features to obtain illumination features and scene features. The second attention network may perform feature enhancement on the illumination features and scene features to obtain scene-based illumination features as encoding features for the limited field of view image. The decoding network utilizes the illumination characteristics based on the scene to obtain an HDR panoramic image, and the light source color prediction network utilizes the illumination characteristics based on the scene to obtain the light source color information of the shot image scene. In this case, the network parameters of the encoding network, decoding network, light source color prediction network, decoupling network and second attention network are adjusted by minimizing the losses calculated by the first and second loss functions.
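One possible reading of the decoupling network and the second attention network is sketched below: two projections split the global feature into an illumination branch and a scene branch, and cross-attention lets the illumination feature attend to the scene feature. The linear-projection decoupling, dimensions, and residual combination are assumptions.

```python
# Sketch of feature decoupling followed by scene-aware enhancement of the illumination feature.
import torch.nn as nn

class DecoupleAndAttend(nn.Module):
    def __init__(self, global_dim=512, feat_dim=256, heads=4):
        super().__init__()
        self.to_light = nn.Linear(global_dim, feat_dim)    # decoupling: illumination branch
        self.to_scene = nn.Linear(global_dim, feat_dim)    # decoupling: scene branch
        self.cross_attn = nn.MultiheadAttention(feat_dim, heads, batch_first=True)

    def forward(self, global_feat):                        # (B, T, global_dim) global feature tokens
        light = self.to_light(global_feat)                 # illumination features
        scene = self.to_scene(global_feat)                 # scene features
        fused, _ = self.cross_attn(light, scene, scene)    # illumination attends to scene content
        return light + fused, scene                        # scene-based illumination feature, scene feature
```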
In addition, training the model with illumination information alone cannot completely decouple the scene features from the illumination features: the scene hidden features cannot be guaranteed to express scene-related information, and the illumination hidden features cannot be guaranteed to express illumination-related information. In order to mine the expressive power of the scene hidden features for various scene contents (such as scene attributes, structures, and textures), the present disclosure proposes learning the expression of scene content through supervised training on two tasks: scene generation and scene classification.
Fig. 14 is a flow diagram of a training method of an image processing model according to an exemplary embodiment of the present disclosure. During the training phase of the image processing model, a neural network based scene decoder (which may also be referred to as a scene decoding network) and a scene classifier (which may also be referred to as a scene classification network) may be introduced.
As an example, a first scene type and a third panoramic image corresponding to the limited field of view image may be acquired, a second scene type of the limited field of view image is predicted based on the decoupled scene features using a scene classifier, the scene features are decoded using a scene decoder to obtain a fourth panoramic image of the limited field of view image, and network parameters of the image processing model are adjusted based on the first scene type, the second scene type, the third panoramic image and the fourth panoramic image, the first light source color information, the second light source color information, the first panoramic image and the second panoramic image.
Referring to fig. 14, the decoupled scene hidden features output scene types through a scene classifier, and a complete scene panorama image (i.e., a fourth panorama image, which may represent an image for presenting scene information) is generated through a scene decoder. The scene type may be an indoor or outdoor scene, or a scene further divided into libraries, offices, parks, streets, etc.
As an example, the scene decoder and scene classifier are used only to learn the expressive power of the scene content; they are therefore needed only in the model training phase and not at inference time, and their network parameters can be fixed while the model is trained so that the scene hidden features are optimally learned. As shown in fig. 14, in the training phase the scene classification task is trained in a fully supervised manner, while the scene generation task guides the learning of the scene hidden features in a self-supervised manner. Specifically, in the self-supervised training process, the generated scene panoramic image is passed through the encoding operation and the feature decoupling operation to produce illumination hidden features and scene hidden features, and the scene hidden features derived from the generated scene panoramic image are constrained to be as similar as possible to the scene hidden features derived from the input image.
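The self-supervised constraint on the scene hidden features could be expressed as a simple feature-consistency loss, as in the hedged sketch below; `scene_feature_fn` is a hypothetical helper standing in for the encoder followed by feature decoupling.

```python
# Sketch of the self-supervised constraint: the generated scene panorama is encoded and
# decoupled again, and its scene hidden feature is pushed toward that of the input image.
import torch.nn.functional as F

def scene_consistency_loss(scene_feature_fn, input_image, generated_scene_pano):
    # scene_feature_fn: encoder + feature decoupling, returning the scene hidden feature
    scene_from_input = scene_feature_fn(input_image)
    scene_from_pano = scene_feature_fn(generated_scene_pano)
    # constrain the two scene hidden features to be as similar as possible
    return F.l1_loss(scene_from_pano, scene_from_input.detach())
```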
Since the main role of the scene classification and scene generation tasks is to learn scene content, it must be ensured that the scene classifier can distinguish different types of scenes and that the scene decoder can reconstruct the input features. The present disclosure therefore proposes a method of training the scene decoder and scene classifier so that, with their parameters fixed, they can be used in the training stage of the image processing model.
Fig. 15 is a flow diagram of a training method of a scene decoder and scene classifier according to an exemplary embodiment of the present disclosure.
Referring to fig. 15, the input image is an incomplete color image of a scene. The input image first undergoes an illumination augmentation operation to obtain an enhanced image, which helps ensure that the model is invariant to different illumination. The enhanced image is passed through an encoder to obtain scene features, the scene features are passed through the scene decoder to generate a scene panoramic image, and the scene type is obtained through the scene classifier.
Learning is performed in a semi-supervised manner based on the generated scene panoramic image and the real panoramic image, and in a fully supervised manner based on the predicted scene type and the real scene type.
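A hedged sketch of one such pre-training step is given below; the color-jitter illumination augmentation, the loss weights, and resizing the decoder output to the ground-truth panorama resolution are illustrative choices, not the disclosure's exact procedure.

```python
# Sketch of one pre-training step for the scene decoder and scene classifier (fig. 15).
import torch.nn.functional as F
from torchvision.transforms import ColorJitter

augment = ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4)  # illumination augmentation (assumed)

def pretrain_step(encoder, scene_decoder, scene_classifier, optimizer, batch, w_rec=1.0, w_cls=1.0):
    image, gt_pano, gt_scene_type = batch              # input image, real panorama, real scene type (class index)
    feat = encoder(augment(image))                     # scene features from the enhanced image
    pred_pano = F.interpolate(scene_decoder(feat), size=gt_pano.shape[-2:],
                              mode="bilinear", align_corners=False)
    pred_type = scene_classifier(feat)                 # logits over scene types
    loss = w_rec * F.l1_loss(pred_pano, gt_pano) + w_cls * F.cross_entropy(pred_type, gt_scene_type)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```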
After the trained scene classifier and scene decoder are obtained, they are applied to the model training of FIG. 14. The network parameters of the scene classifier and the scene decoder are fixed in the training process of the image processing model so as to better perform optimization learning on the scene hidden features.
Furthermore, in the training of the image processing model, the network parameters may be adjusted by supervised learning of the predicted HDR panoramic image and the real panoramic image (i.e. GT illumination panoramic image). The network parameters can be further adjusted by determining whether the output HDR panoramic image is a real image or a false image through the discriminator, so that the predicted HDR panoramic image is more vivid.
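The discriminator-based refinement could take the form of a small patch discriminator with a binary cross-entropy GAN loss, as in the sketch below; the architecture and loss formulation are assumptions.

```python
# Hedged sketch: a patch discriminator judges whether an HDR panorama is real or generated,
# and its adversarial loss is added to the generator's objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PanoDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 1, 4, stride=1, padding=1),   # per-patch real/fake logits
        )
    def forward(self, pano):
        return self.net(pano)

def adversarial_losses(disc, real_pano, fake_pano):
    real_logits = disc(real_pano)
    fake_logits_d = disc(fake_pano.detach())             # discriminator update: generator gradient blocked
    fake_logits_g = disc(fake_pano)                      # generator update
    d_loss = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits)) + \
             F.binary_cross_entropy_with_logits(fake_logits_d, torch.zeros_like(fake_logits_d))
    g_loss = F.binary_cross_entropy_with_logits(fake_logits_g, torch.ones_like(fake_logits_g))
    return d_loss, g_loss
```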
Training with the discriminator enables the decoder to produce an HDR panoramic image that looks real. Training with the scene decoder and scene classifier enables the scene hidden features to carry rich scene content, which guides the encoder to generalize its illumination estimation to various scene types.
Fig. 16 is a schematic structural diagram of an image processing apparatus of a hardware running environment of an embodiment of the present disclosure. Here, the image processing apparatus 1600 may implement the image processing function and the model training function described above.
As shown in fig. 16, the image processing apparatus 1600 may include: a processing component 1601, a communication bus 1602, a network interface 1603, an input-output interface 1604, a memory 1605, and a power supply component 1606. Wherein a communication bus 1602 is used to enable connection communication between the components. The input output interface 1604 may include a video display (such as a liquid crystal display), a microphone and speaker, and a user interaction interface (such as a keyboard, mouse, touch input device, etc.), and the input output interface 1604 may optionally also include a standard wired interface, a wireless interface. Network interface 1603 may optionally include a standard wired interface, a wireless interface (e.g., a wireless fidelity interface). Memory 1605 may be a high-speed random access memory or a stable nonvolatile memory. Memory 1605 may also optionally be a storage device separate from the aforementioned processing component 1601.
Those skilled in the art will appreciate that the structure shown in fig. 16 does not constitute a limitation of the image processing apparatus 1600, and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components.
As shown in fig. 16, an operating system (such as a MAC operating system), a data storage module, a network communication module, a user interface module, programs related to the above-described image processing method and model training method, and a database may be included in the memory 1605 as one storage medium.
In the image processing apparatus 1600 shown in fig. 16, the network interface 1603 is mainly used for data communication with external devices/terminals, and the input/output interface 1604 is mainly used for data interaction with a user. The processing component 1601 and the memory 1605 may be provided in the image processing apparatus 1600; the apparatus executes the image processing method, the model training method, and the like provided by the embodiments of the present disclosure by having the processing component 1601 call a program stored in the memory 1605, together with the various APIs provided by the operating system.
The processing component 1601 may include at least one processor, with a set of computer-executable instructions stored in a memory 1605 that, when executed by the at least one processor, perform an image processing method and a model training method according to embodiments of the present disclosure. Further, the processing component 1601 may perform an image processing process or a model training process, etc., as described above. However, the above examples are merely exemplary, and the present disclosure is not limited thereto.
By way of example, image processing device 1600 may be a PC computer, tablet device, personal digital assistant, smart phone, or other device capable of executing the above-described set of instructions. Here, the image processing apparatus 1600 need not be a single electronic apparatus, but may be any apparatus or a collection of circuits capable of executing the above-described instructions (or instruction sets) individually or in combination. The image processing device 1600 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces with either locally or remotely (e.g., via wireless transmission).
In image processing apparatus 1600, processing component 1601 may include a Central Processing Unit (CPU), a Graphics Processor (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processing component 1601 may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, and the like.
The processing component 1601 may execute instructions or code stored in a memory, where the memory 1605 may also store data. Instructions and data may also be transmitted and received over a network via network interface 1603, wherein network interface 1603 may employ any known transmission protocol.
Memory 1605 may be integrated with the processor, for example, RAM or flash memory disposed within an integrated circuit microprocessor or the like. In addition, memory 1605 may include a separate device, such as an external disk drive, storage array, or other storage device that any database system may use. The memory and the processor may be operatively coupled or may communicate with each other, for example, through an I/O port, a network connection, etc., such that the processor is able to read files stored in the memory.
According to embodiments of the present disclosure, an electronic device may be provided. Fig. 17 is a block diagram of an electronic device according to an embodiment of the present disclosure. The electronic device 1700 may include at least one memory 1702 and at least one processor 1701, the at least one memory 1702 storing a set of computer-executable instructions that, when executed by the at least one processor 1701, perform an image processing method and a model training method according to an embodiment of the present disclosure.
The processor 1701 may include a Central Processing Unit (CPU), an audio-visual processor, a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, the processor 1701 may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, and the like.
Memory 1702, which is one type of storage medium, may include an operating system (e.g., a MAC operating system), a data storage module, a network communication module, a user interface module, a recommendation module, and a database.
The memory 1702 may be integral to the processor 1701, e.g., RAM or flash memory may be disposed within an integrated circuit microprocessor or the like. In addition, the memory 1702 may include a stand-alone device, such as an external disk drive, a storage array, or any other storage device usable by a database system. The memory 1702 and the processor 1701 may be operatively coupled or may communicate with each other, such as through an I/O port, network connection, or the like, such that the processor 1701 is capable of reading files stored in the memory 1702.
In addition, the electronic device 1700 may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of electronic device 1700 may be connected to each other via buses and/or networks.
By way of example, the electronic device 1700 may be a PC computer, tablet device, personal digital assistant, smart phone, or other device capable of executing the above-described set of instructions. Here, the electronic device 1700 is not necessarily a single electronic device, but may be any apparatus or a collection of circuits capable of executing the above-described instructions (or instruction set) singly or in combination. The electronic device 1700 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces with either locally or remotely (e.g., via wireless transmission).
It will be appreciated by those skilled in the art that the structure shown in fig. 17 is not limiting and may include more or fewer components than shown, or certain components may be combined, or a different arrangement of components.
At least one of the above modules may be implemented by an AI model. The functions associated with the AI may be performed by a non-volatile memory, a volatile memory, and a processor.
The processor may include one or more processors. At this time, the one or more processors may be general-purpose processors such as a central processing unit (CPU) or an application processor (AP), graphics-only processors such as a graphics processing unit (GPU) or a vision processing unit (VPU), and/or AI-dedicated processors such as a neural processing unit (NPU).
The one or more processors control the processing of the input data according to predefined operating rules or Artificial Intelligence (AI) models stored in the non-volatile memory and the volatile memory. Predefined operational rules or artificial intelligence models may be provided through training or learning. Here, providing by learning means that a predefined operation rule or AI model having a desired characteristic is formed by applying a learning algorithm to a plurality of learning data. Learning may be performed in the device itself performing AI according to an embodiment and/or may be implemented by a separate server/device/system.
A learning algorithm is a method of training a predetermined target device (e.g., a robot) using a plurality of learning data to cause, allow, or control the target device to make a determination or prediction. Examples of learning algorithms include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.
According to the present invention, in the image processing method performed by the electronic device, the output image after processing the target area can be obtained by taking the input image as the input data of the artificial intelligence model.
The artificial intelligence model may be obtained through training. Herein, "obtaining by training" refers to training a basic artificial intelligence model having a plurality of training data by a training algorithm to obtain predefined operational rules or artificial intelligence models configured to perform a desired feature (or purpose).
As an example, the artificial intelligence model may include a plurality of neural network layers. Each of the plurality of neural network layers includes a plurality of weight values, and the neural network calculation is performed through calculation between the calculation result of the previous layer and the plurality of weight values. Examples of neural networks include, but are not limited to, convolutional neural networks (CNNs), deep neural networks (DNNs), recurrent neural networks (RNNs), restricted Boltzmann machines (RBMs), deep belief networks (DBNs), bidirectional recurrent deep neural networks (BRDNNs), generative adversarial networks (GANs), and deep Q-networks.
According to an embodiment of the present disclosure, there may also be provided a computer-readable storage medium storing instructions, wherein the instructions, when executed by at least one processor, cause the at least one processor to perform an image processing method and a model training method according to the present disclosure. Examples of the computer readable storage medium herein include: read-only memory (ROM), random-access programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, nonvolatile memory, CD-ROM, CD-R, CD + R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD + R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, blu-ray or optical disk storage, hard Disk Drives (HDD), solid State Disks (SSD), card-type memories (such as multimedia cards, secure Digital (SD) cards or ultra-fast digital (XD) cards), magnetic tapes, floppy disks, magneto-optical data storage devices, hard disks, solid state disks, and any other devices configured to store computer programs and any associated data, data files and data structures in a non-transitory manner and to provide the computer programs and any associated data, data files and data structures to a processor or computer to enable the processor or computer to execute the programs. The computer programs in the computer readable storage media described above can be run in an environment deployed in a computer device, such as a client, host, proxy device, server, etc., and further, in one example, the computer programs and any associated data, data files, and data structures are distributed across networked computer systems such that the computer programs and any associated data, data files, and data structures are stored, accessed, and executed in a distributed fashion by one or more processors or computers.
In accordance with embodiments of the present disclosure, a computer program product may also be provided, instructions in which are executable by a processor of a computer device to perform the above-described image processing method and model training method.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (15)

1. An image processing method, comprising:
acquiring an input image;
Predicting light source color information in a scene corresponding to the input image by using an image processing model, and a panoramic image corresponding to the input image;
and performing image rendering based on the light source color information and/or the panoramic image to obtain a rendered image.
2. The method of claim 1, wherein predicting light source color information in a scene corresponding to the input image and a panoramic image corresponding to the input image using an image processing model comprises:
encoding the input image by utilizing an encoding network of the image processing model to obtain encoding characteristics of the input image;
the light source color information and the panoramic image are predicted based on the encoding features using a first prediction network of the image processing model.
3. The method of claim 1, wherein predicting light source color information in a scene corresponding to the input image and a panoramic image corresponding to the input image using an image processing model comprises:
encoding the input image by utilizing an encoding network of the image processing model to obtain encoding characteristics of the input image;
Predicting the light source color information based on the coding features using a second prediction network of the image processing model;
And decoding the coding features by using a decoding network of the image processing model, and predicting the panoramic image.
4. A method according to claim 2 or 3, wherein encoding the input image using the encoding network of the image processing model results in encoded features of the input image, comprising:
dividing the input image to obtain a plurality of image blocks;
encoding the plurality of image blocks respectively to obtain encoding characteristics of the plurality of image blocks;
and obtaining the coding characteristics of the input image based on the coding characteristics of the image blocks.
5. The method of claim 4, wherein encoding the plurality of tiles, respectively, results in encoding characteristics of the plurality of tiles, comprising:
respectively encoding the plurality of image blocks to obtain initial encoding characteristics of each image block;
and carrying out feature enhancement on the initial coding feature of each image block by utilizing the first attention network of the image processing model to obtain the enhanced coding feature of each image block, wherein the enhanced coding feature is used as the coding feature of the plurality of image blocks.
6. The method of claim 4, wherein predicting the light source color information comprises:
For each image block, predicting light source color information in a scene corresponding to the image block based on coding features of the image block and determining a confidence of the image block;
and pooling the light source color information corresponding to each image block based on the confidence of each image block to obtain the light source color information.
7. The method of claim 4, wherein predicting the panoramic image comprises:
acquiring the confidence coefficient of each image block;
Decoding the coding features of each image block respectively to obtain a high dynamic range image corresponding to each image block;
And pooling the high dynamic range image corresponding to each image block based on the confidence of each image block to obtain the panoramic image.
8. A method according to claim 2 or 3, wherein encoding the input image using the encoding network of the image processing model results in encoded features of the input image, comprising:
encoding the input image by utilizing the encoding network to obtain global characteristics of the input image;
performing feature decoupling on the global features by using a decoupling network of the image processing model to obtain illumination features and scene features;
And utilizing a second attention network of the image processing model to perform feature enhancement on the illumination feature and the scene feature to obtain a scene-based illumination feature serving as a coding feature of the input image.
9. The method of claim 1, wherein performing image rendering based on the light source color information and/or the panoramic image to obtain a rendered image comprises:
performing image rendering based on the input image, the panoramic image and information related to the virtual object to obtain a rendered image;
and outputting the rendered image by using a video perspective device.
10. The method of claim 1, wherein performing image rendering based on the light source color information and/or the panoramic image to obtain a rendered image comprises:
performing color correction on the panoramic image based on the light source color information by using a white balance method to obtain a panoramic image corresponding to a white light source;
Performing image rendering based on the panoramic image of the white light source and information related to the virtual object to obtain a first rendered image;
mapping the light source color information to obtain the real light source color information of the scene;
Performing color adjustment on the first rendered image by utilizing the real light source color information to obtain a second rendered image;
And outputting a target image by using an optical perspective device, wherein the target image is formed by superimposing the input image and the second rendered image.
11. The method of claim 1, further comprising:
And predicting camera parameters related to the input image shooting by using the image processing model.
12. The method of claim 1, wherein the input image is a limited field of view image.
13. A method of training an image processing model, the method comprising:
Acquiring training data, wherein the training data comprises limited view field images with different view angles in the same scene, a first panoramic image and first light source color information when the limited view field images and the first panoramic image are shot;
predicting second light source color information in a scene corresponding to the limited field of view image and a second panoramic image corresponding to the limited field of view image by using the image processing model;
Network parameters of the image processing model are adjusted based on the first light source color information, the second light source color information, the first panoramic image, and the second panoramic image.
14. An electronic device, comprising:
At least one processor;
at least one memory storing computer-executable instructions,
Wherein the computer executable instructions, when executed by the at least one processor, cause the at least one processor to perform the method of any of claims 1 to 13.
15. A computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform the method of any one of claims 1 to 13.
CN202211535932.7A 2022-12-01 2022-12-01 Image processing method, model training method and device Pending CN118135083A (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
CN202211535932.7A CN118135083A (en) 2022-12-01 2022-12-01 Image processing method, model training method and device
KR1020230153731A KR20240082198A (en) 2022-12-01 2023-11-08 Apparatus and method for processing image
US18/524,153 US20240187559A1 (en) 2022-12-01 2023-11-30 Apparatus and method with image processing
EP23213638.2A EP4401041A1 (en) 2022-12-01 2023-12-01 Apparatus and method with image processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211535932.7A CN118135083A (en) 2022-12-01 2022-12-01 Image processing method, model training method and device

Publications (1)

Publication Number Publication Date
CN118135083A true CN118135083A (en) 2024-06-04

Family

ID=91239519

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211535932.7A Pending CN118135083A (en) 2022-12-01 2022-12-01 Image processing method, model training method and device

Country Status (2)

Country Link
KR (1) KR20240082198A (en)
CN (1) CN118135083A (en)

Also Published As

Publication number Publication date
KR20240082198A (en) 2024-06-10

Similar Documents

Publication Publication Date Title
CN114119849B (en) Three-dimensional scene rendering method, device and storage medium
WO2021103137A1 (en) Indoor scene illumination estimation model, method and device, and storage medium and rendering method
Bi et al. Deep reflectance volumes: Relightable reconstructions from multi-view photometric images
Nalbach et al. Deep shading: convolutional neural networks for screen space shading
Zhang et al. All-weather deep outdoor lighting estimation
CN108537864B (en) Editing digital images using neural networks with network rendering layers
US11288857B2 (en) Neural rerendering from 3D models
US10936909B2 (en) Learning to estimate high-dynamic range outdoor lighting parameters
Yao et al. Neilf: Neural incident light field for physically-based material estimation
Meilland et al. 3d high dynamic range dense visual slam and its application to real-time object re-lighting
US10957026B1 (en) Learning from estimated high-dynamic range all weather lighting parameters
WO2022228383A1 (en) Graphics rendering method and apparatus
CN115115688B (en) Image processing method and electronic equipment
Wei et al. Object-based illumination estimation with rendering-aware neural networks
TW202336694A (en) Integrated machine learning algorithms for image filters
CN117557714A (en) Three-dimensional reconstruction method, electronic device and readable storage medium
CN113110731B (en) Method and device for generating media content
Lu et al. 3D real-time human reconstruction with a single RGBD camera
CN114820292A (en) Image synthesis method, device, equipment and storage medium
US20210209833A1 (en) Unsupervised learning of three dimensional visual alphabet
CN116977431A (en) Three-dimensional scene geometry, material and illumination decoupling and editing system
CN116958000A (en) Remote sensing ship target image generation method based on UE5 and application thereof
Kang et al. View-dependent scene appearance synthesis using inverse rendering from light fields
CN118135083A (en) Image processing method, model training method and device
CN115439595A (en) AR-oriented indoor scene dynamic illumination online estimation method and device

Legal Events

Date Code Title Description
PB01 Publication