CN117044215A - Method and system for low-light media enhancement

Method and system for low-light media enhancement

Info

Publication number
CN117044215A
Authority
CN
China
Prior art keywords
frames
neural network
media stream
input
frame
Prior art date
Legal status
Pending
Application number
CN202280018046.9A
Other languages
Chinese (zh)
Inventor
S 格林·罗什·K
尼克希尔·克里施南
亚什·哈布哈彊卡
博德希萨特娃·曼达尔
Current Assignee
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics Co Ltd
Priority date
Filing date
Publication date
Application filed by Samsung Electronics Co Ltd
Priority claimed from PCT/KR2022/008294 (published as WO2022265321A1)
Publication of CN117044215A

Classifications

    • G06N3/045 Combinations of networks
    • G06T5/80 Geometric correction
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/09 Supervised learning
    • G06T5/60 Image enhancement or restoration using machine learning, e.g. neural networks
    • G06T5/70 Denoising; Smoothing
    • G06T7/13 Edge detection
    • G06V10/7747 Organisation of the process, e.g. bagging or boosting
    • G06V10/776 Validation; Performance evaluation
    • G06V10/82 Image or video recognition or understanding using neural networks
    • H04N23/71 Circuitry for evaluating the brightness variation
    • H04N23/745 Detection of flicker frequency or suppression of flicker wherein the flicker is caused by illumination, e.g. due to fluorescent tube illumination or pulsed LED illumination
    • H04N9/646 Circuits for processing colour signals for image enhancement, e.g. vertical detail restoration, cross-colour elimination, contour correction, chrominance trapping filters
    • G06T2207/10016 Video; Image sequence
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06T2207/30168 Image quality inspection
    • G06V2201/10 Recognition assisted with metadata


Abstract

A method for enhancing media includes: receiving, by an electronic device, a media stream; performing, by the electronic device, alignment of a plurality of frames of the media stream; correcting, by the electronic device, brightness of the plurality of frames; selecting, by the electronic device, one of a first neural network, a second neural network, or a third neural network by analyzing parameters of the plurality of frames with corrected brightness, wherein the parameters include at least one of shot boundary detection and artificial light flicker; and generating, by the electronic device, an output media stream by processing the plurality of frames of the media stream using a selected one of the first neural network, the second neural network, or the third neural network.

Description

Method and system for low-light media enhancement
Technical Field
The present disclosure relates to the field of media processing, and more particularly to low-light media enhancement.
Background
Video captured in low light conditions or captured using low quality sensors may suffer from various problems:
High noise: the maximum exposure time per frame may be limited by the desired number of Frames Per Second (FPS), which leads to high noise in low light conditions;
Low brightness: in low light conditions, the lack of sufficient ambient light results in dark video;
Color artifacts: as the number of captured photons decreases, the sensor's ability to capture accurate color decreases, resulting in a loss of color accuracy;
Difficulty obtaining good output quality with low complexity Artificial Intelligence (AI) video processing (full HD at 30 FPS);
Processing power and memory constraints for long duration video capture;
Flicker due to temporal consistency issues; and
A lack of real-world data sets for training.
In related art methods, spatial or temporal filters may be used to denoise/enhance video captured in low light conditions. However, such filters may not effectively remove noise when the video is captured in low light conditions or using low quality sensors.
In some related art methods, a deep Convolutional Neural Network (CNN) may be used to enhance video. However, the deep CNNs used in the related art methods may be too computationally and memory intensive to be deployed on an electronic device/mobile phone in real time. Video enhanced using a deep CNN may also suffer from flicker due to inconsistent denoising of consecutive video frames.
Disclosure of Invention
Technical problem
Methods and systems are provided for enhancing media captured in low light conditions and using inferior sensors.
Another aspect of embodiments herein is to provide a method and system for enhancing media by analyzing parameters of multiple frames of video to switch between a first neural network, a second neural network, and a third neural network, wherein the parameters include shot boundary detection and artificial light flicker, wherein the first neural network is a high complexity neural network (HCN) with one input frame, the second neural network is a time-guided lower complexity neural network (TG-LCN) for joint de-flickering or joint denoising with a 'q' number of input frames and previous output frames, and the third neural network is a neural network for denoising with a 'p' number of input frames and previous output frames, wherein 'p' is less than 'q'.
It is another aspect of embodiments herein to provide methods and systems for training a first/second/third neural network using a multi-frame conjoined (Siamese) training method.
Solution to the problem
According to one aspect of the disclosure, a method for enhancing media includes: receiving, by an electronic device, a media stream; performing, by the electronic device, alignment of a plurality of frames of a media stream; correcting, by the electronic device, brightness of the plurality of frames; selecting, by the electronic device, one of a first neural network, a second neural network, or a third neural network by analyzing parameters of the plurality of frames with corrected brightness, wherein the parameters include at least one of shot boundary detection and artificial light flicker; and generating, by the electronic device, an output media stream by processing the plurality of frames of the media stream using a selected one of the first neural network, the second neural network, or the third neural network.
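For illustration only, the overall flow recited above can be sketched as a simple dispatch loop in Python. In the sketch below, hcn, tg_lcn_q, and tg_lcn_p stand in for the three trained networks, q=5 and p=3 are example values, and the frame-difference and brightness-oscillation cues (with their thresholds) are placeholder heuristics rather than the claimed detection methods.

    import numpy as np

    def mean_abs_diff(a, b):
        return float(np.mean(np.abs(a.astype(np.float32) - b.astype(np.float32))))

    def enhance_stream(frames, hcn, tg_lcn_q, tg_lcn_p,
                       shot_thresh=30.0, flicker_thresh=8.0, q=5, p=3):
        """frames: list of HxWx3 uint8 arrays; hcn/tg_lcn_q/tg_lcn_p are callables."""
        outputs, prev_out = [], None
        for t, frame in enumerate(frames):
            # shot boundary: no temporal similarity with the previous frame
            if t == 0 or mean_abs_diff(frame, frames[t - 1]) > shot_thresh:
                out = hcn(frame)                        # single-frame, high-complexity net
            else:
                window = frames[max(0, t - q + 1): t + 1]
                means = [f.mean() for f in window]      # crude artificial-flicker cue
                flicker = len(means) > 2 and np.std(np.diff(means)) > flicker_thresh
                if flicker:
                    out = tg_lcn_q(window, prev_out)    # q-frame joint de-flicker/denoise
                else:
                    out = tg_lcn_p(frames[max(0, t - p + 1): t + 1], prev_out)  # p < q frames
            outputs.append(out)
            prev_out = out
        return outputs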
The media stream may be captured in low light conditions and the media stream may include at least one of noise, low brightness, artificial flicker, and color artifacts.
The output media stream may be a de-noised media stream with enhanced brightness and zero flicker.
Correcting the brightness of the plurality of frames of the media stream may include: identifying a single frame or the plurality of frames of the media stream as an input frame; linearizing the input frame using an Inverse Camera Response Function (ICRF); selecting a brightness multiplication factor for correcting the brightness of the input frame using the future time guide; applying a linear boost to the input frame based on the brightness multiplication factor; and applying a Camera Response Function (CRF) to the input frame to correct the brightness of the input frame, wherein the CRF is a function of the sensor type and metadata, wherein the metadata includes an exposure value and an ISO value, and the CRF and ICRF are stored as look-up tables (LUTs).
Selecting the brightness multiplication factor may include: analyzing the brightness of the input frame; identifying a maximum constant boost value as the brightness multiplication factor based on the brightness of the input frame being less than a threshold and the brightness of all frames in the future time buffer being less than the threshold; identifying a boost value of a monotonically decreasing function between the maximum constant boost value and 1 as the brightness multiplication factor based on the brightness of the input frame being less than the threshold and the brightness of all frames in the future time buffer being greater than the threshold; identifying a unity gain boost value as the brightness multiplication factor based on the brightness of the input frame being greater than the threshold and the brightness of all frames in the future time buffer being greater than the threshold; and identifying a boost value of a monotonically increasing function between 1 and the maximum constant boost value as the brightness multiplication factor based on the brightness of the input frame being greater than the threshold and the brightness of frames in the future time buffer being less than the threshold.
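A minimal sketch of the linearize-boost-reapply step described above is given below, assuming 8-bit frames and using a simple gamma curve to stand in for the sensor-characterized CRF/ICRF look-up tables; in practice the LUTs would be derived from the sensor type and the exposure/ISO metadata, and the boost factor would come from the selection rules above.

    import numpy as np

    # Gamma-curve stand-ins for the sensor-characterized CRF/ICRF, stored as 256-entry LUTs.
    _x = np.linspace(0.0, 1.0, 256)
    CRF_LUT = (_x ** (1 / 2.2) * 255.0).astype(np.float32)   # linear -> encoded
    ICRF_LUT = (_x ** 2.2).astype(np.float32)                # encoded -> linear

    def correct_frame_brightness(frame_u8, boost_factor):
        """Linearize with the ICRF, apply the linear brightness boost, re-apply the CRF."""
        linear = ICRF_LUT[frame_u8]                            # LUT lookup, values in [0, 1]
        boosted = np.clip(linear * boost_factor, 0.0, 1.0)     # linear boost
        idx = np.round(boosted * 255.0).astype(np.uint8)
        return CRF_LUT[idx].astype(np.uint8)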
Selecting, by the electronic device, one of the first neural network, the second neural network, or the third neural network may include: analyzing each frame relative to an earlier frame to determine whether shot boundary detection is associated with each frame of the plurality of frames; selecting a first neural network to generate an output media stream by processing the plurality of frames of the media stream based on shot boundary detection associated with the plurality of frames; analyzing the plurality of frames for the presence of artificial light flicker based on shot boundary detection not associated with the plurality of frames; selecting a second neural network to generate an output media stream by processing the plurality of frames of the media stream based on the presence of artificial light flicker in the plurality of frames; and selecting a third neural network to generate an output media stream by processing the plurality of frames of the media stream based on an absence of artificial light flicker in the plurality of frames.
The first neural network may be a high complexity neural network with one input frame, the second neural network may be a time-guided lower complexity neural network for joint de-flickering or joint denoising with 'q' number of input frames and previous output frames, and
the third neural network may be a neural network for denoising having a 'p' number of input frames and previous output frames, where 'p' is smaller than 'q'.
The first neural network may include a plurality of residual blocks at a lowest level to enhance noise removal capability, and the second neural network may include at least one convolution operation with fewer feature maps and a previous output frame as a guide to process a plurality of input frames.
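For illustration only, the contrast between the two network styles might look as follows in PyTorch; the layer counts, channel widths, input shapes, and class names (ResidualBlock, TinyHCN, TinyTGLCN) are arbitrary assumptions and are not the claimed architectures.

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """Plain residual block, as might sit at the lowest level of the HCN."""
        def __init__(self, ch=64):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(ch, ch, 3, padding=1))
        def forward(self, x):
            return x + self.body(x)

    class TinyHCN(nn.Module):
        """Single-frame, higher-capacity net: several residual blocks at the bottom."""
        def __init__(self, ch=64, n_blocks=4):
            super().__init__()
            self.head = nn.Conv2d(3, ch, 3, padding=1)
            self.blocks = nn.Sequential(*[ResidualBlock(ch) for _ in range(n_blocks)])
            self.tail = nn.Conv2d(ch, 3, 3, padding=1)
        def forward(self, frame):          # frame: (B, 3, H, W)
            return self.tail(self.blocks(self.head(frame)))

    class TinyTGLCN(nn.Module):
        """Multi-frame, lower-capacity net: q input frames concatenated with the
        previous output frame as a temporal guide, and fewer feature maps."""
        def __init__(self, q=5, ch=16):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(3 * (q + 1), ch, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(ch, 3, 3, padding=1))
        def forward(self, frames, prev_out):
            # frames: (B, q, 3, H, W); prev_out: (B, 3, H, W)
            b, q, c, h, w = frames.shape
            x = torch.cat([frames.reshape(b, q * c, h, w), prev_out], dim=1)
            return self.net(x)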
The first, second, and third neural networks may be trained using a multi-frame conjoined training method to generate an output media stream by processing a plurality of frames of the media stream.
The method may further comprise training at least one of the first, second, and third neural networks by: creating a dataset for training the neural network, wherein the dataset comprises one of a local dataset and a global dataset; selecting at least two sets of frames from the created dataset, wherein each set comprises at least three frames; adding synthesized motion to the selected at least two sets of frames, wherein the at least two sets of frames to which the synthesized motion is added include different noise realizations; and performing conjoined training of the neural network using ground-truth media and the at least two sets of frames to which the synthesized motion is added.
Creating the data set may include: capturing a burst data set, wherein the burst data set comprises one of low-light static media with noisy input and a clean ground-truth frame; simulating global and local motions for each burst dataset using synthetic trajectory generation and synthetic stop motion, respectively; removing at least one burst data set having a structural and brightness mismatch between the clean ground-truth frame and the low-light static media; and creating the data set by including the at least one burst data set that does not include a structural and brightness mismatch between the clean ground-truth frame and the low-light static media.
Simulating global motion for each burst data set may include: estimating a polynomial coefficient range based on parameters including maximum translation and maximum rotation; generating a third order polynomial trajectory using the estimated polynomial coefficient range; approximating a third order trajectory using the maximum depth and the generated third order polynomial trajectory; generating uniform sampling points based on the predefined sampling rate and the approximated 3D trajectory; generating 'n' affine transformations based on the generated uniform sampling points; and applying the generated n affine transforms to each burst data set.
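A toy version of the global-motion simulation listed above is sketched below; the random coefficient sampling, the normalization enforcing the maximum translation/rotation, and the use of OpenCV affine warps are illustrative assumptions rather than the claimed procedure (which also accounts for the maximum scene depth).

    import numpy as np
    import cv2  # OpenCV, used only for the affine warps

    def synth_global_motion(burst, max_trans=12.0, max_rot_deg=2.0, seed=0):
        """Apply a smooth synthetic camera trajectory to a static burst.

        burst: list of HxWx3 uint8 frames of the same static scene. A cubic
        polynomial is sampled per motion component (tx, ty, rotation), scaled so
        the trajectory stays within max_trans pixels / max_rot_deg degrees, then
        evaluated at uniformly spaced time samples (one per frame) and converted
        to affine warps."""
        rng = np.random.default_rng(seed)
        n = len(burst)
        t = np.linspace(0.0, 1.0, n)                      # uniform sampling points

        def cubic(max_abs):
            coeffs = rng.uniform(-1.0, 1.0, size=4)       # c0..c3 of a cubic polynomial
            traj = np.polyval(coeffs, t)
            traj -= traj[0]                               # start from zero motion
            peak = np.max(np.abs(traj)) + 1e-8
            return traj / peak * max_abs                  # respect the motion bound

        tx, ty, rot = cubic(max_trans), cubic(max_trans), cubic(max_rot_deg)

        h, w = burst[0].shape[:2]
        warped = []
        for i, frame in enumerate(burst):
            m = cv2.getRotationMatrix2D((w / 2, h / 2), float(rot[i]), 1.0)
            m[:, 2] += (tx[i], ty[i])                     # add the translation
            warped.append(cv2.warpAffine(frame, m, (w, h)))
        return warped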
Simulating the local motion of each burst data set includes: capturing local object motion from each burst data set in a static scene using the synthetic stop motion, the capturing of the local object motion comprising: capturing an input with a background scene and a ground-truth scene; capturing an input with foreground objects and a ground-truth scene; cutting out a foreground object; and creating a composite scene by placing the foreground object at different locations of the background scene; and simulating motion blur for each local object motion by averaging a predefined number of frames of the burst data set.
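Similarly, a toy version of the synthetic stop-motion compositing and motion-blur averaging might look as follows; the integer pixel step, the use of np.roll for shifting, and the three-frame averaging window are assumptions made purely for illustration.

    import numpy as np

    def synth_local_motion(background, foreground, mask, n_frames=5, step=(6, 0)):
        """Synthetic stop motion: paste a cut-out foreground object at shifting
        positions over a static background, then average consecutive composites
        to mimic motion blur. mask is a boolean HxW array selecting the object;
        step = (dx_per_frame, dy_per_frame) in pixels."""
        composites = []
        for i in range(n_frames):
            canvas = background.copy()
            dy, dx = i * step[1], i * step[0]
            shifted_mask = np.roll(mask, (dy, dx), axis=(0, 1))
            shifted_fg = np.roll(foreground, (dy, dx), axis=(0, 1))
            canvas[shifted_mask] = shifted_fg[shifted_mask]
            composites.append(canvas.astype(np.float32))
        # motion blur: average a small window of consecutive composites
        blurred = [np.mean(composites[max(0, i - 1): i + 2], axis=0).astype(np.uint8)
                   for i in range(n_frames)]
        return blurred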
Performing the conjoined training of the neural network may comprise: delivering the at least two sets of frames with different noise realizations to a neural network to generate at least two sets of output frames; calculating a conjoined loss by calculating a loss between the at least two sets of output frames; calculating a pixel loss between the average of the at least two sets of output frames and the ground truth; calculating a total loss using the conjoined loss and the pixel loss; and training the neural network using the calculated total loss.
According to an aspect of the disclosure, an electronic device includes: a memory; and a processor coupled to the memory and configured to: receiving a media stream; performing alignment of a plurality of frames of the media stream; correcting the brightness of the plurality of frames; selecting one of a first neural network, a second neural network, or a third neural network by analyzing parameters of the plurality of frames having corrected brightness, wherein the parameters include at least one of shot boundary detection and artificial light flicker; and generating an output media stream by processing the plurality of frames of the media stream using a selected one of the first neural network, the second neural network, or the third neural network.
Drawings
The foregoing and other aspects, features, and advantages of certain embodiments of the disclosure will become more apparent from the following description taken in conjunction with the accompanying drawings in which:
FIG. 1 illustrates an electronic device for enhancing media according to an embodiment of the present disclosure;
FIG. 2 illustrates a media enhancer for enhancing media executable in an electronic device according to an embodiment of the disclosure;
FIG. 3 is an example conceptual diagram depicting enhancement of video according to an embodiment of the present disclosure;
FIG. 4 illustrates an example Image Signal Processing (ISP) inference pipeline for enhancing video captured under low light conditions and/or using low quality sensors, according to an embodiment of the disclosure;
FIGS. 5 and 6 are exemplary diagrams depicting brightness correction performed on video while enhancing the video according to embodiments of the present disclosure;
FIG. 7 illustrates a High Complexity Network (HCN) for processing frames of video in the event that shot boundary detection is associated with the frames of video in accordance with an embodiment of the disclosure;
FIG. 8 illustrates a time-guided low complexity network (TG-LCN) for processing multiple frames of video in the presence of artificial light flicker in the multiple frames, according to an embodiment of the present disclosure;
FIG. 9 is an example diagram depicting a multi-scale pyramid method of generating output video by processing frames of video in accordance with embodiments disclosed herein;
FIG. 10 is an example diagram depicting training of a first/second/third neural network for enhancing video/media streams, according to an embodiment of the present disclosure;
FIG. 11 is an exemplary diagram depicting training of a first/second/third neural network using a multi-frame conjoined training method, according to an embodiment of the present disclosure;
FIG. 12 is an example diagram depicting creation of a dataset for training a first/second/third neural network, according to an embodiment of the present disclosure;
FIGS. 13A and 13B are example diagrams depicting simulations of global motion and local motion on a burst data set according to an embodiment of the present disclosure;
FIG. 14 is an example diagram depicting conjoined training of a first/second/third neural network, according to an embodiment of the present disclosure;
FIGS. 15A and 15B are example diagrams depicting use case scenarios to enhance low Frame Per Second (FPS) video captured under low light conditions, according to embodiments of the disclosure;
FIG. 16 is an example diagram depicting a use case scenario of enhancing indoor slow motion video in accordance with an embodiment of the present disclosure;
FIG. 17 is an example diagram depicting a use case scenario of enhanced real-time High Dynamic Range (HDR) video in accordance with an embodiment of the disclosure; and
FIG. 18 is a flowchart depicting a method for enhancing a media stream according to an embodiment of the present disclosure.
Detailed Description
Example embodiments and various aspects, features and advantageous details thereof are explained more fully with reference to the figures in the following description. Descriptions of well-known components and processing techniques are omitted so as to not unnecessarily obscure the embodiments herein. The description herein is merely intended to facilitate an understanding of ways in which the example embodiments herein may be practiced and to further enable those of skill in the art to practice the example embodiments herein. Accordingly, the present disclosure should not be construed as limiting the scope of the embodiments.
Embodiments of the present disclosure provide methods and systems for enhancing media/video in real-time using time-guided adaptive Convolutional Neural Network (CNN) switching, where media may be captured under very low light conditions, under high noise conditions, and/or captured using poor quality/low quality sensors.
Further, embodiments of the present disclosure provide methods and systems for enhancing media while minimizing noise and flicker artifacts using a deep learning based pipeline.
Further, embodiments of the present disclosure provide methods and systems for selecting between high complexity networks and low complexity networks by analyzing the temporal consistency of input frames of the media, thereby reducing the average time and power required to process the media.
Further, embodiments of the present disclosure provide methods and systems for reducing flicker using a conjoined training method.
Embodiments of the present disclosure will now be described with reference to the drawings, wherein like reference characters denote like features consistently throughout the figures.
Fig. 1 illustrates an electronic device 100 for enhancing media according to an embodiment of the present disclosure. The electronic device 100 referred to herein may be configured to enhance media.
Examples of electronic device 100 may be, but are not limited to, a cloud computing device (which may be part of a public or private cloud), a server, a database, a computing device, and the like. The server may be at least one of a stand-alone server, a server on a cloud, and the like. The computing device may be, but is not limited to, a personal computer, notebook, tablet, desktop computer, laptop, handheld device, mobile device, camera, internet of things (IoT) device, augmented Reality (AR)/Virtual Reality (VR) device, or the like. Further, the electronic device 100 may be at least one of a microcontroller, a processor, a system on a chip (SoC), an Integrated Chip (IC), a microprocessor-based programmable consumer electronics, and the like.
Examples of media/media streams may be, but are not limited to, video, animated images, Graphics Interchange Format (GIF) files, a collection of moving images, and the like. In examples, the video may include low Frames Per Second (FPS) video, indoor slow motion video, High Dynamic Range (HDR) video, and so forth. In an example, media may be captured in low light conditions. In another example, media may be captured using poor quality/low quality sensors. In an example, the media may include, but is not limited to, at least one of noise, low brightness, artificial light flicker, color artifacts, and the like. Embodiments herein may use terms such as "media," "video," "media stream," "video stream," "image frame," and the like interchangeably throughout this disclosure.
The electronic device 100 may enhance media/media streams stored in memory or received from at least one external device. Alternatively, the electronic device 100 may enhance the media captured in real-time. Enhancing the media refers to denoising the media and removing different artifacts (such as artificial light flicker, color artifacts, etc.) from the media.
The electronic device 100 includes a memory 102, a communication interface 104, a camera (camera sensor) 106, a display 108, and a controller (processor) 110. The electronic device 100 may also communicate with one or more external devices using a communication network to receive media for enhancement. Examples of external devices may be, but are not limited to, servers, databases, and the like. The communication network may include, but is not limited to, at least one of a wired network, a value added network, a wireless network, a satellite network, or a combination thereof. Examples of wired networks may be, but are not limited to, a Local Area Network (LAN), a Wide Area Network (WAN), an ethernet, and the like. Examples of wireless networks may be, but are not limited to, cellular networks, wireless LAN (Wi-Fi), bluetooth low energy, zigbee, wi-Fi direct (WFD), ultra Wideband (UWB), infrared data association (IrDA), near Field Communication (NFC), and so forth.
The memory 102 may include at least one type of storage medium among a flash memory type storage medium, a hard disk type storage medium, a multimedia card micro storage medium, a card type memory (e.g., SD or XD memory), a Random Access Memory (RAM), a Static RAM (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable ROM (EEPROM), a Programmable ROM (PROM), a magnetic memory, a magnetic disk, and/or an optical disk.
The memory 102 may store at least one of media, receive an input media stream for enhancement, output media stream (i.e., enhanced media stream), and the like.
The memory 102 may also store a first neural network 202a, a second neural network 202b, and a third neural network 202c that may be used to generate an output media stream by processing an input media stream. In an embodiment, the first neural network 202a may be a high complexity neural network (HCN) with one input frame of media. In an embodiment, the second neural network 202b may be a time-guided, lower complexity neural network (TG-LCN) that uses a 'q' number of input frames and previous output frames for joint de-flickering or joint denoising. In an embodiment, the third neural network 202c may be a neural network that uses a 'p' number of input frames and previous output frames for denoising, where 'p' is less than 'q'. Each neural network is described later.
Examples of the first, second, and third neural networks (202a, 202b, and 202c) may be, but are not limited to, deep neural networks (DNNs), artificial intelligence (AI) models, machine learning (ML) models, multi-class support vector machine (SVM) models, convolutional neural network (CNN) models, recurrent neural networks (RNNs), stacked hourglass networks, restricted Boltzmann machines (RBMs), deep belief networks (DBNs), bidirectional recurrent deep neural networks (BRDNNs), generative adversarial networks (GANs), regression-based neural networks, deep reinforcement models (with ReLU activation), deep Q-networks, residual networks, conditional generative adversarial networks (CGANs), and the like.
The first, second, and third neural networks (202a, 202b, and 202c) may include a plurality of nodes that may be arranged in layers. Examples of layers may be, but are not limited to, convolutional layers, activation layers, average pooling layers, max pooling layers, concatenation layers, dropout layers, fully connected (FC) layers, softmax layers, and the like. Each layer has a plurality of weight values, and a layer operation is performed based on the calculation result of a previous layer and the plurality of weights/coefficients. The topology of the layers of the first, second, and third neural networks (202a, 202b, and 202c) may vary based on the type of the respective network. In an example, the first, second, and third neural networks (202a, 202b, and 202c) may include an input layer, an output layer, and a hidden layer. The input layer receives layer input and forwards the received layer input to the hidden layer. The hidden layer transforms layer inputs received from the input layer into a representation that can be used to generate an output in the output layer. The hidden layers extract useful low-level features from the input, introduce nonlinearities into the network, and reduce feature dimensions to make the features robust to scaling and translation. Nodes of a layer may be fully connected to nodes in an adjacent layer via edges. Inputs received at nodes of the input layer may be propagated to nodes of the output layer via activation functions that calculate the states of the nodes of each successive layer in the network based on the coefficients/weights associated with each of the edges connecting the layers.
The first, second, and third neural networks (202 a, 202b, and 202 c) may be trained using at least one learning method to perform at least one desired function. Examples of learning methods may be, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, reinforcement learning, regression-based learning, and the like. The trained first, second, and third neural networks (202 a, 202b, and 202 c) may be neural network models in which the number of layers, the sequence for processing the layers, and parameters associated with each layer may be known and fixed to perform at least one desired function. Examples of parameters associated with each layer may be, but are not limited to, activation functions, offsets, input weights, output weights, etc. associated with the layers of the first, second, and third neural networks (202 a, 202b, and 202 c). The functions associated with the learning method may be performed by the non-volatile memory, the volatile memory, and the controller 110. The controller 110 may include one or more processors. At this time, the one or more processors may be general-purpose processors such as a Central Processing Unit (CPU), an Application Processor (AP), etc., graphics processing units only such as a Graphics Processing Unit (GPU), a Visual Processing Unit (VPU), and/or Artificial Intelligence (AI) specific processors such as a Neural Processing Unit (NPU).
The one or more processors can perform at least one desired function according to predefined operating rules of the first, second, and third neural networks (202 a, 202b, and 202 c) stored in the non-volatile memory and the volatile memory. Predefined operating rules for the first, second and third neural networks (202 a, 202b and 202 c) are provided by training the modules using a learning method.
Here, being provided by learning means that predefined operating rules or desired characteristics of the first, second, and third neural networks (202a, 202b, and 202c) are formed by applying a learning method to a plurality of pieces of learning data. The intended functions of the first, second, and third neural networks (202a, 202b, and 202c) may be performed in the electronic device 100 itself, in which the learning according to embodiments is performed, and/or may be implemented by a separate server/system.
The communication interface 104 may be configured to communicate with one or more external devices using communication methods already supported by the communication network. The communication interface 104 may include components such as wired communicators, short-range communicators, mobile/wireless communicators, and broadcast receivers. The wired communicator may enable the electronic device 100 to communicate with external devices using a communication method such as, but not limited to, a wired LAN, ethernet, or the like. The short-range communicator may enable the electronic device 100 to communicate with external devices using communication methods such as, but not limited to, bluetooth Low Energy (BLE), near Field Communicator (NFC), WLAN (or Wi-Fi), zigbee, infrared data association (IrDA), wi-Fi direct (WFD), UWB communication, ant+ (interoperable wireless transmission capability) communication, shared Wireless Access Protocol (SWAP), wireless broadband internet (Wibro), wireless gigabit alliance (WiGiG), and the like. The mobile communicator may transceive wireless signals with at least one of a base station, an external terminal, or a server on the mobile communication network/cellular network. In an example, the wireless signal may include a voice call signal, a video phone call signal, or various types of data according to the transceiving of text/multimedia messages. The broadcast receiver may receive a broadcast signal and/or broadcast-related information from the outside through a broadcast channel. Broadcast channels may include satellite channels and ground wave channels. In an embodiment, the electronic device 100 may or may not include a broadcast receiver.
The camera sensor 106 may be configured to capture media.
The display 108 may be configured to enable a user to interact with the electronic device 100. The display 108 may also be configured to display the output media stream to a user.
The controller 110 may be configured to enhance media/media streams in real time. In an embodiment, the controller 110 may enhance the media using time-guided adaptive neural network/CNN switching. Time-guided adaptive neural network switching refers to switching between the first neural network, the second neural network, and the third neural network (202a-202c) to enhance the media.
To enhance the media stream, the controller 110 receives the media stream. In an example, the controller 110 may receive a media stream from the memory 102. In another example, the controller 110 may receive a media stream from an external device. In another example, the controller 110 may receive a media stream from the camera 106. Embodiments herein interchangeably use terms such as "media," "media stream," "input video frame," "input video sequence," etc., to refer to media captured under low light conditions or using low quality sensors.
When the media stream is received, the controller 110 performs alignment of a plurality of frames of the media stream. In an example, the plurality of frames may correspond to a plurality of image frames.
After aligning the plurality of frames of the media stream, the controller 110 corrects the brightness of the plurality of frames. To correct the plurality of frames, the controller 110 identifies a single frame or a plurality of frames of the media stream as an input frame. The controller 110 linearizes the input frame using an Inverse Camera Response Function (ICRF). After linearizing the input frame, the controller 110 uses the future time guide to select a brightness multiplication factor for correcting the brightness of the input frame. To select the brightness multiplication factor according to the future time guide, the controller 110 analyzes the brightness of the input frame and of the future time buffer. The future time buffer holds the next n frames after the input frame; for example, if the buffer contains frames up to frame t, the current input frame is frame (t-n), and frames (t-n) through t make up the future time buffer. There may therefore be a delay of n frames between the camera stream and the output. The controller 110 selects the maximum constant boost value as the brightness multiplication factor when the brightness of the input frame is less than a threshold and the brightness of all frames in the future time buffer is less than the threshold. In an embodiment, the threshold may be set empirically after experimentation. The controller 110 selects a boost value of a monotonically decreasing function as the brightness multiplication factor when the brightness of the input frame is less than the threshold and the brightness of all frames in the future time buffer is greater than the threshold. The controller 110 selects a unity gain boost value as the brightness multiplication factor when the brightness of the input frame is greater than the threshold and the brightness of all frames in the future time buffer is greater than the threshold; that is, the controller 110 does not boost the brightness of the input frame in this case. The controller 110 selects a boost value of a monotonically increasing function as the brightness multiplication factor when the brightness of the input frame is greater than the threshold and the brightness of frames in the future time buffer is less than the threshold. After selecting the brightness multiplication factor, the controller 110 applies a linear boost to the input frame based on the selected factor. The controller 110 then applies a Camera Response Function (CRF) to the input frame to correct the brightness of the input frame. The CRF may be a function of the type of the camera 106 used to capture the media stream (hereinafter referred to as the sensor type) and its metadata. The metadata includes an exposure value and an ISO value. The CRF and ICRF may be characterized and stored as look-up tables (LUTs).
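Purely as an illustration of the four cases described above, the following sketch selects the brightness multiplication factor from the brightness of the current frame and of the future time buffer; the threshold, the maximum boost value, and the linear ramp driven by transition_progress are assumed values, since the disclosure leaves the exact functions and thresholds to be set empirically.

    def select_boost_factor(cur_brightness, future_brightness, transition_progress,
                            threshold=0.25, max_boost=4.0):
        """cur_brightness: mean brightness of the current frame (normalized to [0, 1]);
        future_brightness: mean brightness of each frame in the future time buffer;
        transition_progress: position in [0, 1] along the current brightness transition."""
        cur_dark = cur_brightness < threshold
        future_dark = [b < threshold for b in future_brightness]
        if cur_dark and all(future_dark):
            return max_boost                           # dark now and ahead: maximum constant boost
        if not cur_dark and not any(future_dark):
            return 1.0                                 # bright now and ahead: unity gain, no boost
        t = min(max(transition_progress, 0.0), 1.0)
        if cur_dark:                                   # dark now, scene brightens soon:
            return max_boost + (1.0 - max_boost) * t   # monotonically decreasing towards 1
        return 1.0 + (max_boost - 1.0) * t             # bright now, darkens soon: increasing towards max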
After correcting the brightness of the plurality of frames of the media stream, the controller 110 selects one of the first, second, and third neural networks 202a, 202b, and 202c to process the media stream. The controller 110 selects one of the three neural networks (202 a, 202b, and 202 c) by analyzing parameters of a plurality of frames of the media stream. Examples of parameters may be, but are not limited to, shot boundary detection and artificial light flicker.
To select one of the three neural networks (202a, 202b, and 202c) to process multiple frames of the media stream, the controller 110 analyzes each frame with respect to an earlier frame to check whether shot boundary detection is associated with each of the multiple frames. Shot boundary detection may be checked by analyzing temporal similarities between frames. The controller 110 may determine that shot boundary detection is associated with each of the plurality of frames based on the absence of temporal similarity between the plurality of frames. If shot boundary detection is associated with the plurality of frames, the controller 110 selects the first neural network 202a to process the plurality of frames of the media stream. If shot boundary detection is not associated with the plurality of frames, the controller 110 analyzes the plurality of frames for the presence of artificial light flicker. If there is artificial light flicker in the plurality of frames, the controller 110 selects the second neural network 202b to process the plurality of frames of the media stream. If there is no artificial light flicker in the plurality of frames, the controller 110 selects the third neural network 202c to process the plurality of frames of the media stream.
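For illustration, the temporal-similarity and flicker cues mentioned above could be computed from simple frame statistics, for example as below; the histogram-intersection measure, the bin count, and the thresholds are assumptions and not the claimed detection method.

    import numpy as np

    def temporal_similarity(frame_a, frame_b, bins=32):
        """Histogram-intersection similarity in [0, 1] between two frames."""
        ha, _ = np.histogram(frame_a, bins=bins, range=(0, 256))
        hb, _ = np.histogram(frame_b, bins=bins, range=(0, 256))
        ha = ha / max(ha.sum(), 1)
        hb = hb / max(hb.sum(), 1)
        return float(np.minimum(ha, hb).sum())

    def is_shot_boundary(cur_frame, prev_frame, sim_thresh=0.6):
        """Declare a shot boundary when temporal similarity drops below a threshold."""
        return temporal_similarity(cur_frame, prev_frame) < sim_thresh

    def has_artificial_flicker(window, osc_thresh=2.0):
        """Flag artificial-light flicker when mean brightness alternates frame to frame."""
        if len(window) < 4:
            return False
        diffs = np.diff([float(f.mean()) for f in window])
        sign_flips = int(np.sum(np.sign(diffs[:-1]) * np.sign(diffs[1:]) < 0))
        return sign_flips >= len(diffs) - 1 and float(np.mean(np.abs(diffs))) > osc_thresh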
In an embodiment, the first neural network 202a may be a High Complexity Network (HCN) with one input frame (current frame) of media. The first neural network 202a includes a plurality of residual blocks at a lowest level to enhance noise removal capability. Embodiments herein may use terms such as "first neural network," "HCN," "high complexity CNN," etc., interchangeably throughout this disclosure.
In an embodiment, the second neural network 202b may be a time-guided, lower complexity neural network (TG-LCN) with a 'q' number of input frames and previous output frames for joint de-flicker or joint de-noising. The second neural network 202b includes at least one convolution operation with fewer feature maps and previous output frames as guides to process the plurality of input frames. Embodiments herein use terms such as "second neural network", "TG-LCN (n=q)", "TG-LCN", "q' frame flicker reduction denoiser", and the like interchangeably throughout the disclosure.
In an embodiment, the third neural network 202c may be a neural network for denoising having a 'p' number of input frames and previous output frames, where 'p' is less than 'q' (i.e., the number of frames of the media stream). In an example, consider that a media stream may include 5 frames (i.e., 'q'=5). In this scenario, 'p' may be equal to 3 frames (i.e., 'p'=3). The embodiments herein use terms such as "third neural network", "TG-LCN (n=p, p < q)", "TG-LCN (n=p)" and the like interchangeably throughout the disclosure.
The first, second, and third neural networks (202 a, 202b, and 202 c) may be trained neural networks. In an embodiment, the controller 110 may train the first, second, and third neural networks (202 a, 202b, and 202 c) using a multi-frame conjoined training method.
To train the first, second, and third neural networks (202 a, 202b, and 202 c), the controller 110 creates a data set. The dataset comprises one of a local dataset and a global dataset.
To create the data set, the controller 110 captures the burst data set, or alternatively, the controller 110 may receive the burst data set from an external device. The burst data set includes, but is not limited to, one of low-light static media with noisy input and a clean ground-truth frame, and the like. A clean ground-truth frame is a reference image without noise, which can be obtained by averaging the individual frames in the burst. After capturing the burst data sets, the controller 110 simulates global and local motion for each burst data set using synthetic trajectory generation and synthetic stop motion, respectively. To simulate the global motion of each burst data set, the controller 110 estimates a polynomial coefficient range based on parameters of the burst data set including maximum translation and maximum rotation. The maximum translation and rotation represent the maximum motion that the camera may experience during the capture session; they are used to create the synthetic motion and can be set empirically after experimentation. The controller 110 generates a third order polynomial trajectory using the estimated polynomial coefficient range and approximates the third order trajectory using the maximum depth. For an approximately planar scene, the maximum depth determines the distance of the scene from the camera and may be set empirically after experimentation. In examples herein, the third order polynomial trajectory may be a trajectory used by the camera 106 to capture the burst dataset. The controller 110 generates uniform sampling points based on the predefined sampling rate and the approximated 3D trajectory. The predefined sampling rate may be a sampling rate that controls the smoothness between frames of each burst data set. The controller 110 generates 'n' affine transforms based on the generated uniform sampling points and applies the generated 'n' affine transforms to each burst data set. Thus, a global dataset is created by simulating the global motion of each burst dataset. To simulate the local motion of each burst data set, the controller 110 captures local object motion from each burst data set in a static scene using the synthetic stop motion. To capture the local object motion, the controller 110 captures an input with a background scene and a ground-truth scene from each burst data set. The controller 110 also captures an input with foreground objects and a ground-truth scene from each burst data set. The controller 110 crops the foreground objects and creates a composite scene by placing the foreground objects at different locations of the background scene. When capturing local object motion, the controller 110 simulates the motion blur for each local object motion by averaging a predefined number of frames of the burst data set. Thus, a local data set is created by simulating the local motion of each burst data set. After simulating global and local motion for each burst data set, the controller 110 removes one or more burst data sets having a structure and brightness mismatch between the clean ground-truth frame and the low-light static media. The controller 110 creates the data set by including the one or more burst data sets that do not contain a structural and brightness mismatch between the clean ground-truth frame and the low-light static media.
After creating the dataset, the controller 110 selects at least two sets of frames from the created dataset, each set comprising at least three frames. The controller 110 adds the synthesized motion to the selected at least two sets of frames. The at least two sets of frames to which the synthesized motion is added include different noise realizations. The controller 110 performs a conjoined training of the first, second, and third neural networks (202a, 202b, and 202c) using the ground truth and the at least two sets of frames to which the synthesized motion is added. The ground truth is used for the loss calculation when training the neural networks. To perform the conjoined training of the first, second, and third neural networks (202a, 202b, and 202c), the controller 110 passes the at least two sets of frames having different noise realizations to at least two of the first, second, and third neural networks (202a, 202b, and 202c) to generate at least two sets of output frames. The controller 110 calculates the conjoined loss by calculating the L2 loss between the at least two sets of output frames. The controller 110 calculates the pixel loss between the average of the at least two sets of output frames and the corresponding ground truth. The controller 110 calculates a total loss using the conjoined loss and the pixel loss, and trains the first, second, and third neural networks (202a, 202b, and 202c) using the calculated total loss.
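For illustration, one conjoined (Siamese) training step consistent with the losses described above could be written as below (PyTorch), reading the scheme as passing both noise realizations through the same weight-shared network; the L1 choice for the pixel term and the relative weighting of the two terms are assumptions, since the disclosure only specifies an L2 loss between the two outputs and a pixel loss against the ground truth.

    import torch.nn.functional as F

    def conjoined_training_step(net, noisy_a, noisy_b, ground_truth, siamese_weight=0.5):
        """Two noise realizations of the same frames pass through the same (weight-shared)
        network; the outputs are pulled towards each other (conjoined/Siamese term, which
        discourages flicker) and their mean is pulled towards the clean ground truth."""
        out_a = net(noisy_a)
        out_b = net(noisy_b)
        conjoined_loss = F.mse_loss(out_a, out_b)                   # L2 between the two outputs
        pixel_loss = F.l1_loss((out_a + out_b) / 2, ground_truth)   # mean output vs ground truth
        total_loss = pixel_loss + siamese_weight * conjoined_loss
        return total_loss                                           # followed by total_loss.backward()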
After selecting the neural network of the first, second, and third neural networks (202 a, 202b, and 202 c), the controller 110 generates an output media stream by processing a plurality of frames of the media stream using the selected neural network (202 a, 202b, or 202 c). The output media stream may be a de-noised media stream with enhanced brightness and zero flicker. Embodiments herein interchangeably use terms such as "output media stream", "output video stream", "output video frame", "output image frame", "denoising media/video", "enhancement media/video", etc. to refer to media that includes zero noise and zero artificial flicker (i.e., includes zero artifacts) and correct brightness.
To generate an output media stream, the controller 110 selects a single frame or multiple frames of the media stream as input processing frames. The controller 110 downsamples the input processing frames over multiple scales to generate a low resolution input. In the example herein, the controller 110 downsamples the input processing frames by a factor of 2 at each scale. After generating the low resolution input, the controller 110 processes the low resolution input at the lower resolution using a selected one of the first neural network 202a, the second neural network 202b, or the third neural network 202c to generate a low resolution output. The controller 110 then upscales the processed low resolution output over multiple scales to generate the output media stream. For example, the controller 110 upscales the low resolution output by a factor of 2 at each scale. The number of scales used for downsampling may be equal to the number of scales used for upscaling.
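A rough sketch of the multi-scale processing described above follows; it assumes frame dimensions divisible by 2**n_scales and uses a simple detail-transfer upscale as a crude stand-in for the convolution-guided-filter upscaling, with process_fn standing in for the selected neural network.

    import numpy as np

    def downsample2(img):
        """2x box downsample (assumes even height/width)."""
        x = img.astype(np.float32)
        return (x[0::2, 0::2] + x[1::2, 0::2] + x[0::2, 1::2] + x[1::2, 1::2]) / 4.0

    def upsample2(img):
        """Nearest-neighbour 2x upsample."""
        return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)

    def multiscale_process(frame, process_fn, n_scales=2):
        """Downsample n_scales times, run process_fn at the lowest resolution,
        then upscale while restoring detail from the higher-resolution frames."""
        pyramid = [frame.astype(np.float32)]
        for _ in range(n_scales):
            pyramid.append(downsample2(pyramid[-1]))
        out = process_fn(pyramid[-1])                        # low-resolution output (same shape as its input)
        for level in range(n_scales - 1, -1, -1):
            guide = pyramid[level]
            detail = guide - upsample2(pyramid[level + 1])   # high-frequency detail of the guide
            out = upsample2(out) + detail
        return np.clip(out, 0, 255).astype(np.uint8)

In such a scheme, changing n_scales is one simple way to trade off the effective complexity of the selected network, in the spirit of the next paragraph.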
The controller 110 may also be configured to dynamically change the complexity of a selected one of the first neural network 202a, the second neural network 202b, or the third neural network 202c by changing the number of scales used to process the low resolution input. The complexity of a selected one of the first neural network 202a, the second neural network 202b, or the third neural network 202c may be varied in an inverse relationship with respect to the number of frames of the media stream.
The controller 110 saves/stores the generated output media stream in the memory 102.
Fig. 2 illustrates a media enhancer 200 for enhancing media that may be implemented in the electronic device 100 according to an embodiment of the present disclosure. The media enhancer 200 may be stored in the memory 102 and processed/executed by the controller 110 of the electronic device 100 to enhance media/media streams. The media enhancer 200 includes a receiving and aligner module 204, a luminance correction module 206, a neural network selection module 208, an output generation module 210, and a training module 212.
The receive and aligner module 204 may be configured to receive a media stream/input media to enhance and perform alignment of multiple frames of the media stream.
The luminance correction module 206 may be configured to correct the brightness of a plurality of frames of the media stream. The luminance correction module 206 identifies a frame or frames of the media stream as input frames. The luminance correction module 206 uses the ICRF to linearize the input frame and uses the future time guide to select a brightness multiplication factor for correcting the brightness of the input frame. After selecting the brightness multiplication factor, the luminance correction module 206 applies a linear boost to the input frame based on the selected factor. The luminance correction module 206 then applies the CRF to the input frame to correct the brightness of the input frame.
The neural network selection module 208 may be configured to select one of the first, second, and third neural networks (202 a, 202b, and 202 c) to process a plurality of frames of the media stream. To select the neural network (202 a, 202b, or 202 c), the neural network selection module 208 analyzes each frame relative to an earlier frame to check whether shot boundary detection is associated with each of the plurality of frames. If shot boundary detection is associated with a plurality of frames, the controller 110 selects the first neural network 202a to process the plurality of frames of the media stream. If shot boundary detection is not associated with multiple frames, the controller 110 analyzes the presence of artificial light flicker in the multiple frames of the media stream. If there is artificial light flicker in the plurality of frames, the controller 110 selects the second neural network 202b to process the plurality of frames of the media stream. If there is no artificial light flicker in the plurality of frames, the controller 110 selects the third neural network 202c to process the plurality of frames of the media stream.
The output generation module 210 may be configured to generate an output media stream by processing a plurality of frames of the media stream using the selected first neural network 202a, second neural network 202b, or third neural network 202c. The output generation module 210 selects a single frame or multiple frames of the media stream as input processing frames. The output generation module 210 downsamples the input processing frames over multiple scales to generate a low resolution input. The output generation module 210 processes the low resolution input using the selected first neural network 202a, second neural network 202b, or third neural network 202c to generate a low resolution output. The output generation module 210 upscales the low resolution output over multiple scales, using the higher resolution frames as a guide, to generate an output media stream.
The training module 212 may be configured to train the first, second, and third neural networks (202a, 202b, and 202c) using a multi-frame conjoined training method. To train the first, second, and third neural networks (202a, 202b, and 202c), the training module 212 creates a data set for training the first, second, and third neural networks (202a, 202b, and 202c). The dataset comprises one of a local dataset and a global dataset. The training module 212 selects at least two sets of frames from the created dataset and adds the synthesized motion to the selected at least two sets of frames. Upon adding the synthesized motion to the selected at least two sets of frames, the training module 212 performs a conjoined training of the first, second, and third neural networks (202a, 202b, and 202c) using the ground-truth media and the at least two sets of frames to which the synthesized motion is added.
Fig. 1 and 2 illustrate exemplary blocks of an electronic device 100, but it should be understood that other embodiments are not limited thereto. In other embodiments, the electronic device 100 may include a fewer or greater number of blocks. Moreover, the labels or names of the blocks are used for illustration purposes only and do not limit the scope of the embodiments herein. One or more blocks may be combined together to perform the same or substantially similar functions in electronic device 100.
The embodiments herein further describe the enhancement of media by considering the media as, for example, video, but it will be apparent to those skilled in the art that any other type of media may be considered.
Fig. 3 is an example conceptual diagram depicting enhancement of video according to an embodiment of the disclosure. Embodiments herein enable the electronic device 100 to effectively switch between the high complexity neural network (202a) and the low complexity neural networks (202b and 202c) for denoising and deflickering based on shot boundary detection and artificial light flicker. Thus, the average run time of neural-network-based video enhancement is improved.
The electronic device 100 identifies key frames of the video by computing temporal similarities between frames of the video. The key frames may refer to frames of the video that are associated with shot boundary detection. The electronic device 100 uses the HCN 202a to denoise key frames of the video. The electronic device 100 uses the TG-LCN to denoise non-key frames of the video with temporal output guidance. The non-key frames of the video may be frames of the video that include artificial light flicker. The temporal output guide may refer to a previous output frame that is used as a guide. Both the HCN 202a and the TG-LCN 202b may include multi-scale inputs as well as convolution guided filters for fast processing and reduced memory. The electronic device 100 may use the third neural network 202c to denoise frames of the video that do not include artificial light flicker and are not associated with shot boundary detection (i.e., that have temporal similarity to other frames).
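As a hedged illustration of how temporal similarity might be computed for key-frame identification, the sketch below uses a mean-absolute-difference metric and a fixed threshold; the disclosure does not fix the exact similarity measure, so both the metric and the threshold value are assumptions.

```python
# Minimal sketch of key-frame detection by temporal similarity; the MAD metric
# and the 0.08 threshold are illustrative assumptions only.
import numpy as np

def is_key_frame(curr: np.ndarray, prev: np.ndarray, threshold: float = 0.08) -> bool:
    """Treat the current frame as a key frame (shot boundary) when it is
    sufficiently dissimilar from the previous frame."""
    curr_f = curr.astype(np.float32) / 255.0
    prev_f = prev.astype(np.float32) / 255.0
    mad = float(np.mean(np.abs(curr_f - prev_f)))   # mean absolute difference
    return mad > threshold

# Example with a synthetic frame: identical frames are not key frames.
frame = np.random.randint(0, 256, (1080, 1920), dtype=np.uint8)
print(is_key_frame(frame, frame))   # False
```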
Fig. 4 illustrates an example Image Signal Processing (ISP) inference pipeline for enhancing video captured under low light conditions and/or using low quality sensors, according to an embodiment of this disclosure.
The electronic device 100 receives video for enhancement, wherein the received video may be captured in low light conditions or using low quality sensors. Upon receiving the video, the electronic device 100 may, as an optional step, run a visual defect recognition system (VDIS) pass on the video to detect and correct any defects in the video.
After performing VDIS, the electronic device 100 aligns the frames of the video (using any suitable existing method). In the examples herein, it is contemplated that the received video may include five consecutive frames I_(t-2), I_(t-1), I_t, I_(t+1), I_(t+2) (i.e., q = 5), referred to as the input frames. After aligning the input frames of the video, the electronic device 100 performs brightness correction on the input frames of the video.
After performing the brightness correction, the electronic device 100 checks whether shot boundary detection is associated with the input frames by checking the temporal similarity between the input frames of the video. If the input frames are dissimilar (i.e., shot boundary detection is associated with the input frames), the electronic device 100 uses the HCN to generate output frames for the dissimilar input frames. In the examples herein, consider that an input frame I_t is associated with shot boundary detection. In this scenario, the electronic device 100 uses the HCN to denoise the input frame I_t and generate an output frame O_t.
If the input frames are similar (i.e., there is no shot boundary detection), the electronic device 100 examines the input frames (the 'q' frames) to detect whether artificial light flicker due to artificial lighting is present. If artificial light flicker is present in the 'q' (i.e., 5) input frames, the electronic device 100 uses the TG-LCN/'q'-frame flicker reduction denoiser 202b (n = q) with the 'q' input frames and the previous output frame O_(t-1) to generate an output frame O_t. The 'q'-frame flicker reduction denoiser (TG-LCN) (n = q) performs denoising and flicker elimination on the 'q' input frames. If there is no artificial light flicker in the 'q' input frames, the electronic device 100 uses the third neural network/TG-LCN (n = p) 202c with 'p' input frames (e.g., p = 3 in the example shown in fig. 4: I_(t-1), I_t, I_(t+1)) and the previous output frame O_(t-1) to generate an output frame O_t. Using O_(t-1) as a guide allows the second neural network/TG-LCN 202b and the third neural network/TG-LCN (n = p) 202c to have much lower complexity. In the video sequence shown in fig. 4, most frames are temporally similar, so the lower complexity networks are deployed most of the time, reducing the average processing time and power.
Fig. 5 and 6 are exemplary diagrams depicting brightness correction performed on video while enhancing the video according to embodiments of the present disclosure. Embodiments herein enable the electronic device 100 to perform brightness correction for correcting the brightness of a video using a LUT. The LUT may be selected by the electronic device 100 based on histogram statistics of the video. The LUT/set of LUTs may be predefined by tuning. The CRF and ICRF may be characterized in the LUT. The LUT may include a CRF LUT group for storing the CRF and an ICRF LUT group for storing the ICRF.
The electronic device 100 receives as input a single frame or multiple frames of a video/video sequence. The electronic device 100 uses the ICRF to linearize the input frame. The electronic device 100 then uses the future time guide to select the luminance value. The selection of the luminance value is shown in fig. 6.
As shown in fig. 6, to select a luminance value, the electronic device 100 analyzes the luminance of the input frame (i.e., the current frame) and of the frames in a future time buffer of 'b' frames. When the luminance of the input frame is less than the threshold (t) and the luminance of all frames in the future time buffer of size 'b' (i.e., the future time buffer of 'b' frames) is less than the threshold (t), the electronic device 100 selects a constant boost value as the luminance value. When the luminance of the input frame is less than the threshold (t) and the luminance of all frames in the future time buffer of size 'b' is greater than the threshold (t), the electronic device 100 selects the boost value 'k' of a monotonically decreasing function 'f' as the luminance value. The electronic device 100 does not apply any boost/luminance value when the luminance of the input frame is greater than the threshold (t) and the luminance of all frames in the future time buffer of size 'b' is greater than the threshold (t). When the luminance of the input frame is greater than the threshold (t) and the luminance of any frame in the future time buffer of size 'b' is less than the threshold (t), the electronic device 100 selects the boost value 'k' of a monotonically increasing function 'g' as the luminance value. Thus, a temporally linearly varying boost is applied so that the luminance transitions smoothly. In an example, the functions 'f' and 'g' may be empirically selected by tuning, where 'n' indicates the number of frames of the video.
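The explicit expressions for 'f' and 'g' are not recoverable from this text. Purely as an illustration consistent with the constraints above (f monotonically decreasing between the maximum constant boost and 1, g monotonically increasing between 1 and the maximum constant boost), one plausible linear form is sketched below; k_max, the frame index n, and the window length b are assumed symbols, not the tuned functions of the disclosure.

```latex
% Illustrative linear ramps only; k_{\max} (maximum constant boost), the frame
% index n, and the transition window length b are assumptions.
f(n) = k_{\max} - \left(k_{\max} - 1\right)\frac{n}{b}, \qquad
g(n) = 1 + \left(k_{\max} - 1\right)\frac{n}{b}, \qquad 0 \le n \le b
```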
After selecting the luminance value, the electronic device 100 applies a linear boost to the input frame based on the selected luminance value. The electronic device 100 applies CRF to the input frame to correct the brightness of the input frame. CRF is a function of sensor type and metadata.
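A minimal sketch of the LUT-based correction path (linearize with the ICRF LUT, apply the selected linear boost, re-apply the CRF LUT) is given below, assuming 256-entry LUTs and a pre-selected boost value; the function and variable names are illustrative, and the identity LUTs in the example stand in for tuned, sensor-specific tables.

```python
# Hedged sketch of LUT-based luminance correction; 256-entry LUTs and all
# names here are illustrative assumptions.
import numpy as np

def correct_luminance(frame: np.ndarray, icrf_lut: np.ndarray,
                      crf_lut: np.ndarray, boost: float) -> np.ndarray:
    """frame: uint8 image; icrf_lut/crf_lut: 256-entry float LUTs in [0, 1]."""
    linear = icrf_lut[frame]                        # linearize via the inverse CRF
    boosted = np.clip(linear * boost, 0.0, 1.0)     # linear boost by the selected value
    indices = np.round(boosted * 255).astype(np.uint8)
    return (crf_lut[indices] * 255).astype(np.uint8)  # re-apply the CRF

# Example with identity LUTs and a 2x boost on a dark patch.
identity = np.linspace(0.0, 1.0, 256, dtype=np.float32)
dark = np.full((4, 4), 40, dtype=np.uint8)
print(correct_luminance(dark, identity, identity, boost=2.0))  # values around 80
```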
Fig. 7 illustrates HCN 202a for processing frames of video in the event that shot boundary detection is associated with the frames of video in accordance with an embodiment of the present disclosure.
The HCN 202a may be a single frame denoising network. The HCN 202a includes a plurality of residual blocks at the lowest level to improve noise removal capability. The HCN 202a may process input frames of video that do not have temporal similarity (i.e., are associated with shot boundary detection) to generate output video, which is de-noised video.
Fig. 8 illustrates a TG-LCN 202b for processing multiple frames of video in the presence of artificial light flicker in the multiple frames, according to an embodiment of the present disclosure.
The TG-LCN/TG-LCN(n) may be a multi-frame denoising network, where 'n' denotes the number of input frames of the video. The TG-LCN 202b processes the input frames of the video using the previous output frame as a guide to generate the output video, which allows the TG-LCN to have a much lower complexity than the HCN 202a. The TG-LCN does not use residual blocks. The convolution operations involved in the TG-LCN may contain fewer feature maps to reduce computation.
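A minimal PyTorch-style sketch of a temporally guided low-complexity network of this kind is shown below. It only illustrates the idea of concatenating the 'n' input frames with the previous output frame as a guide and using a few plain convolutions with few feature maps; the layer count and widths are assumptions, not the disclosed architecture.

```python
# Hedged sketch of a temporally guided low-complexity denoiser; layer widths,
# depth, and the class name TGLCN are illustrative assumptions.
import torch
import torch.nn as nn

class TGLCN(nn.Module):
    def __init__(self, n_frames: int, channels: int = 16):
        super().__init__()
        in_ch = 3 * (n_frames + 1)          # n noisy frames + previous output as guide
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 3, 3, padding=1),   # no residual blocks, few feature maps
        )

    def forward(self, frames: torch.Tensor, prev_out: torch.Tensor) -> torch.Tensor:
        # frames: (B, n_frames, 3, H, W); prev_out: (B, 3, H, W)
        b, n, c, h, w = frames.shape
        x = torch.cat([frames.reshape(b, n * c, h, w), prev_out], dim=1)
        return self.body(x)

# Example: 3-frame variant (n = p = 3).
net = TGLCN(n_frames=3)
out = net(torch.rand(1, 3, 3, 64, 64), torch.rand(1, 3, 64, 64))
print(out.shape)   # torch.Size([1, 3, 64, 64])
```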
Fig. 9 is an example diagram depicting a multi-scale pyramid method of generating output video by processing frames of video according to an embodiment of the present disclosure. Embodiments may employ a multi-scale pyramid approach to process frames of video to manage execution time for both HCN 202a and TG-LCN (n) 202b, where 'n' is the number of input frames.
The electronic device 100 receives a frame or frames of the video as the input processing frame(s). The electronic device 100 downsamples the input processing frame at lower resolutions over multiple scales to generate a low resolution input. The electronic device 100 uses the selected HCN 202a, TG-LCN (n = q) 202b, or third neural network 202c to process the low resolution input and generate a low resolution output. The electronic device 100 uses a Convolutional Guided Filter (CGF) to amplify/upsample each lower level low resolution output over multiple scales to generate the output video. The CGF accepts a higher resolution input set, a low resolution input, and a low resolution output to generate an output video having a higher resolution output image.
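The multi-scale flow can be sketched as follows, with a placeholder guided_upsample() standing in for the convolutional guided filter and an arbitrary denoise() callable standing in for the selected network; both placeholders and the 2x decimation are assumptions made only to keep the sketch runnable.

```python
# Hedged sketch of the multi-scale pyramid flow; downsample(), guided_upsample()
# and the 2x factor are illustrative stand-ins, not the patented operators.
import numpy as np

def downsample(img: np.ndarray) -> np.ndarray:
    return img[::2, ::2]                        # naive 2x decimation for illustration

def guided_upsample(low_res_out: np.ndarray, high_res_guide: np.ndarray) -> np.ndarray:
    # Placeholder for the CGF: nearest-neighbour upsample blended with the guide.
    up = np.kron(low_res_out, np.ones((2, 2), dtype=low_res_out.dtype))
    up = up[:high_res_guide.shape[0], :high_res_guide.shape[1]]
    return 0.5 * up + 0.5 * high_res_guide

def pyramid_process(frame: np.ndarray, denoise, levels: int = 2) -> np.ndarray:
    guides = [frame]
    for _ in range(levels):                     # build the multi-scale input pyramid
        guides.append(downsample(guides[-1]))
    out = denoise(guides[-1])                   # run the network at the lowest resolution
    for guide in reversed(guides[:-1]):         # guided upsampling back to full resolution
        out = guided_upsample(out, guide)
    return out

# Example with an identity "network".
result = pyramid_process(np.random.rand(64, 64).astype(np.float32), denoise=lambda x: x)
print(result.shape)   # (64, 64)
```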
In embodiments herein, the network to which the multi-scale pyramid method is applied to HCN 202a may be represented by HCN ', and the network to which the multi-scale pyramid method is applied to TG-LCN 202b is represented by TG-LCN'. The electronic device 100 dynamically changes the complexity of the HCN 202a, TG-LCN 202b, or third neural network 202c in an inverse relationship with respect to the number of frames of the video.
Fig. 10 is an example diagram depicting training of first, second, and third neural networks (202 a, 202b, and 202 c) for enhancing video/media streams in accordance with an embodiment of the present disclosure.
To train the first, second, and third neural networks (202a, 202b, and 202c), the electronic device 100 creates a dataset using low exposure burst shots and higher exposure burst shots, and uses a self-supervised method to refine the dataset. The electronic device 100 corrects the brightness of the created data set. Then, for temporal consistency, the electronic device 100 trains the first, second, and third neural networks (202a, 202b, and 202c) using a multi-frame conjoined training method and a self-similarity loss.
Fig. 11 is an example diagram depicting training of a first neural network, a second neural network, and a third neural network (202 a, 202b, and 202 c) using a multi-frame conjoined training method, according to an embodiment of the disclosure. The video may include similar frames with different noise realizations, which may lead to temporal inconsistencies in the final output video. Accordingly, conjoined training may be used to train the first, second, and third neural networks (202 a, 202b, and 202 c). The electronic device 100 trains the first, second, and third neural networks (202 a, 202b, and 202 c) over multiple iterations/scales.
The electronic device 100 first creates a data set for training the first, second, and third neural networks (202a, 202b, and 202c). Creating the dataset is shown in fig. 12. As shown in fig. 12, the electronic device 100 captures burst data sets of low-light scenes using the camera 106. In the examples herein, each capture may consist of 15 noise inputs and 1 clean reference real frame. In the examples herein, a burst data set may be captured with auto exposure ET and an input factor k > 1. In the examples herein, each capture of a burst data set may consist of a set of 5 x j noise inputs and k ≤ j clean reference real frames. In the examples herein, electronic device 100 may use any application, such as a custom dump application, to capture a burst data set. After capturing the data sets, the electronic device 100 simulates global and local motion for each burst data set using synthetic trajectory generation and synthetic stop motion, respectively, thereby creating a local data set and a global data set. The simulated global and local motions are shown in figs. 13A and 13B. In an example, the global motion and the local motion may be simulated on the burst data sets captured with auto exposure, with data augmentation in 5 stages relative to the multiplication factor MF ∈ {3, 2, 1 (EV0), 0.5, 0.33}. In an example, a burst data set may be captured with an auto-exposure ET for EV0 limited to < 33 milliseconds, i.e., 30 frames per second (FPS). In the examples herein, clean reference real frames may be captured at ISO = 50.
As shown in fig. 13A, to simulate global motion, the electronic device 100 estimates a polynomial coefficient range based on parameters including the maximum translation and the maximum rotation, which control the maximum displacement. The electronic device 100 then generates a third order polynomial trajectory using the estimated polynomial coefficient range and approximates the third order trajectory using the maximum depth. The third order trajectory may be the trajectory followed by the camera 106 while capturing the burst data set. The electronic device 100 generates 'n' affine transformations based on the generated uniform sampling points. In the examples herein, the uniform sampling points may be generated using a sampling rate that controls the smoothness between frames of each burst data set. The electronic device 100 applies the generated 'n' affine transforms to each burst data set. Thus, a global dataset is created by simulating global motion on each burst dataset.
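A hedged sketch of the global-motion synthesis is shown below: a third order polynomial trajectory for translation and rotation is sampled at uniform points and turned into per-frame affine transforms. The coefficient ranges and the (tx, ty, rotation) pose parameterization are illustrative assumptions.

```python
# Hedged sketch of synthetic global-motion generation; coefficient ranges and
# the 3-parameter pose (tx, ty, theta) are illustrative assumptions.
import numpy as np

def cubic_trajectory(coeffs: np.ndarray, t: np.ndarray) -> np.ndarray:
    """coeffs: (3, 4) polynomial coefficients for (tx, ty, rotation)."""
    powers = np.stack([t**3, t**2, t, np.ones_like(t)])   # (4, n)
    return coeffs @ powers                                  # (3, n)

def affine_from_pose(tx: float, ty: float, theta: float) -> np.ndarray:
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, tx],
                     [s,  c, ty]])

rng = np.random.default_rng(0)
n_frames = 5
t = np.linspace(0.0, 1.0, n_frames)                 # uniform sampling points
coeffs = rng.uniform(-1.0, 1.0, size=(3, 4))        # bounded by max translation/rotation
poses = cubic_trajectory(coeffs, t)
transforms = [affine_from_pose(*poses[:, i]) for i in range(n_frames)]
print(len(transforms), transforms[0].shape)         # 5 (2, 3)
```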
As shown in fig. 13B, to simulate local motion, the electronic device 100 uses two local motion characteristics: local object motion (object motion) and motion blur. The electronic device 100 captures local object motion (i.e., synthetic stop motion) by locally moving a prop in a static scene. In an embodiment, capturing local object motion includes capturing the input and reference real scenes with only the background and capturing the input and reference real scenes with only the foreground objects. The electronic device 100 crops the foreground objects and creates a composite scene by placing the foreground objects at different locations of the background scene. The electronic device 100 selects the 3 sets of input frames (t-1, t, t+1) required for each training pair from the captured inputs. The electronic device 100 simulates motion blur for each stop motion by averaging 3 to 5 frames (e.g., (x-delta, x, x+delta)) of the prop. The electronic device 100 may use two captures of each static scene for the conjoined training of the first, second, and third neural networks (202a, 202b, and 202c). In the examples herein, the minimum number of input frames per training pair is j = 3 x 3 x 2 = 18 frames. The reference reality may be captured and aligned with the 't' frame. In the examples herein, a burst data set including 1000 training pairs (> 500 captures) may be captured to create the data set for training the first, second, and third neural networks (202a, 202b, and 202c).
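The local-motion idea can be illustrated with a toy composite: a cropped foreground patch is pasted at shifted positions over the background, and the composites are averaged to imitate motion blur. Sizes, offsets, and the grayscale representation are assumptions for illustration only.

```python
# Toy illustration of synthetic stop motion and motion blur; all sizes and
# offsets are illustrative assumptions.
import numpy as np

def composite(background: np.ndarray, patch: np.ndarray, x: int, y: int) -> np.ndarray:
    out = background.copy()
    h, w = patch.shape[:2]
    out[y:y + h, x:x + w] = patch                   # place the foreground object
    return out

background = np.zeros((64, 64), dtype=np.float32)
patch = np.ones((8, 8), dtype=np.float32)
positions = [20, 22, 24]                            # object moving left-to-right
frames = [composite(background, patch, x, 30) for x in positions]
blurred_t = np.mean(frames, axis=0)                 # average -> synthetic motion blur
print(blurred_t.max())                              # 1.0 where all composites overlap
```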
As shown in fig. 12, in simulating global motion and local motion, the electronic device 100 removes at least one burst data set having structure and brightness mismatch between a clean reference real frame and a low-light static medium. The electronic device 100 creates a data set by including at least one burst data set that does not include a structural and brightness mismatch between the clean reference real frame and the low-light static media. The electronic device 100 further boosts the brightness of the created data set and the clean reference real frame and saves the created data set and the clean reference real frame in the memory 102.
Once the dataset is created, as shown in fig. 11, the electronic device 100 uses synthetic trajectories to add synthetic motion to at least two sets of frames of the created dataset, in order to account for motion encountered during inference. The synthetic modeling includes trajectory modeling for 3 rotational and translational degrees of freedom using a third order polynomial. The trajectory is sampled uniformly from the interval [0, t] to generate uniform sampling points, where t represents the simulated capture duration. Synthetic frames may be generated by applying homographies, corresponding to the selected uniform sampling points, to each selected set of frames. After adding the synthesized motion to each of the selected at least two sets of frames, the electronic device 100 performs the conjoined training of the first, second, and third neural networks (202a, 202b, and 202c). The conjoined training of the first, second, and third neural networks (202a, 202b, and 202c) is shown in fig. 14.
As shown in fig. 14, two of the first, second, and third networks (202a, 202b, and 202c) may be used in training. The electronic device 100 passes a first input set (input set 1) from the created data set to a first group of neural networks (which may include the first neural network (202a) and/or the second neural network (202b) and/or the third neural network (202c)). The electronic device 100 passes a second input set (input set 2) from the created data set to a second group of neural networks (which may include the first neural network (202a) and/or the second neural network (202b) and/or the third neural network (202c)). The first group of neural networks and the second group of neural networks share the same weights. The first group of neural networks generates a first output (output 1) by processing the first input set. The second group of neural networks generates a second output (output 2) by processing the second input set. The first output/second output may be a video/media comprising denoised frames with zero artificial light flicker.
The electronic device 100 calculates the conjoined loss by calculating the L2 loss between output 1 and output 2. The electronic device 100 also calculates the pixel loss by calculating the loss between the average of outputs 1 and 2 and the reference reality. The electronic device 100 uses the conjoined loss and the pixel loss to calculate the total loss. The electronic device 100 trains the first, second, and third neural networks (202a, 202b, and 202c) using the calculated total loss.
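A hedged sketch of the loss computation is given below: an L2 conjoined (consistency) term between the two weight-shared outputs plus a pixel term between their average and the reference ground truth. The relative weighting of the two terms is an assumption.

```python
# Hedged sketch of the conjoined + pixel loss; the consistency_weight parameter
# is an illustrative assumption, not a value from the disclosure.
import torch

def conjoined_total_loss(out1: torch.Tensor, out2: torch.Tensor,
                         ground_truth: torch.Tensor,
                         consistency_weight: float = 1.0) -> torch.Tensor:
    conjoined = torch.mean((out1 - out2) ** 2)                      # consistency term
    pixel = torch.mean((0.5 * (out1 + out2) - ground_truth) ** 2)   # fidelity term
    return pixel + consistency_weight * conjoined

# Example with random tensors standing in for the two network outputs.
o1, o2, gt = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
print(float(conjoined_total_loss(o1, o2, gt)))
```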
Fig. 15A and 15B are example diagrams depicting use case scenarios to enhance low FPS video captured in low light conditions, according to embodiments of the present disclosure.
Consider an example scenario in which the electronic device 100 receives a low FPS video to be enhanced captured under low light conditions, where the low FPS video refers to a video with an FPS of up to 60. In such a scenario, the electronic device performs VDIS and aligns the input frames of the video. The electronic device 100 performs luminance correction on input frames of video and appends aligned input frames of video to form an input video sequence.
The electronic device 100 checks whether shot boundary detection is associated with an input frame of the input video sequence by analyzing the temporal similarity between the input frames. In the examples herein, consider that an input frame I_t has no temporal similarity with the other input frames. In this scenario, the electronic device 100 selects the HCN 202a to process the input frame I_t, and selects the third neural network/TG-LCN (n = p) 202c to process the remaining input frames, to generate the output video O_t.
In an embodiment, the electronic device 100 does not check for the presence of artificial light flicker in the input frames of the input video sequence, since artificial light flicker is minimal in the low FPS video.
Fig. 16 is an example diagram depicting a use case scenario of enhancing indoor slow motion video in accordance with an embodiment of the present disclosure.
Consider an example scenario in which the electronic device 100 receives indoor slow motion video captured at a high frame rate (240/960 FPS), resulting in noisy video frames. In such a scenario, the electronic device 100 enhances the indoor slow motion video by denoising and removing artificial light flicker from the slow motion video. The electronic device examines the input frames of the slow motion video for temporal similarity. If the input frames are dissimilar (i.e., there is shot boundary detection), the electronic device 100 generates an output frame O_t using the HCN' 202a, where the current frame I_t is used as the input to the HCN' 202a and the HCN' 202a denoises the input frame I_t. If the input frames of the slow motion video are similar (i.e., there is no shot boundary detection), the electronic device 100 checks whether artificial light flicker due to artificial lighting is present in the input frames of the slow motion video (the 'q' input frames). If artificial light flicker is present in the 'q' input video frames, the electronic device 100 selects the second neural network/'q'-frame flicker reduction denoiser (TG-LCN') (n = q) 202b to generate an output video frame O_t using the 'q' input frames and the previous output frame O_(t-1). The 'q'-frame flicker reduction denoiser (TG-LCN') (n = q) 202b performs denoising and flicker removal on the 'q' input video frames. If there is no artificial light flicker in the 'q' input video frames, the electronic device 100 uses the third neural network/TG-LCN' (n = p) 202c to generate an output video frame O_t using 'p' input frames (in the illustrated example, p = 3: I_(t-1), I_t, I_(t+1)) and the previous output frame O_(t-1). Using O_(t-1) as a guide allows the use of a much lower complexity network and also helps to remove artificial light flicker.
Fig. 17 is an example diagram depicting a use case scenario of enhancing real-time High Dynamic Range (HDR) video in accordance with an embodiment of the disclosure. HDR video may be generated using alternating exposures. Each set of three successive frames of the HDR video forms an input set. In an example scenario, as shown in fig. 17, output frame 1 (t) may be obtained using the low (t-1), medium (t), and high (t+1) frames. Output frame 2 (t+1) may be obtained using the medium (t), high (t+1), and low (t+2) frames, and so on. Temporal similarity may be measured between the previous output frame and the current input frame.
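A small illustration of the sliding three-frame grouping of alternating exposures is shown below; the exposure labels only mirror the example above and are otherwise assumptions about the capture order.

```python
# Toy illustration of grouping alternating-exposure HDR frames into sliding
# 3-frame input sets; the labels are illustrative.
exposures = ["low(t-1)", "mid(t)", "high(t+1)", "low(t+2)", "mid(t+3)"]
input_sets = [exposures[i:i + 3] for i in range(len(exposures) - 2)]
for i, s in enumerate(input_sets, start=1):
    print(f"output frame {i}: {s}")
# output frame 1: ['low(t-1)', 'mid(t)', 'high(t+1)']
# output frame 2: ['mid(t)', 'high(t+1)', 'low(t+2)']
```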
Fig. 18 is a flow chart 1800 depicting a method for enhancing a media stream in accordance with an embodiment of the disclosure.
At step 1802, the method includes receiving, by the electronic device 100, a media stream. At step 1804, the method includes performing, by the electronic device 100, an alignment of a plurality of frames of the media stream. At step 1806, the method includes correcting, by the electronic device 100, the brightness of the plurality of frames.
At step 1808, the method includes: after correcting the brightness of the plurality of frames, selecting, by the electronic device 100, one of the first neural network 202a, the second neural network 202b, or the third neural network 202c by analyzing the parameters of the plurality of frames. The parameters include at least one of shot boundary detection and artificial light flicker.
At step 1810, the method includes generating, by the electronic device 100, an output media stream by processing a plurality of frames of the media stream using a selected one of the first neural network 202a, the second neural network 202b, or the third neural network 202 c. The various actions in method 1800 may be performed in the order presented, in a different order, or simultaneously. Further, in some embodiments, some actions listed in fig. 18 may be omitted.
Embodiments herein provide methods and systems for enhancing video in real-time using time-directed adaptive CNN switching, where video has been captured with very low light, under high noise conditions, and/or captured using poor quality sensors. Embodiments herein provide a deep learning based pipeline to enable real-time video enhancement while minimizing noise and flicker artifacts. Embodiments herein provide methods and systems for selecting between high complexity and low complexity networks by analyzing the temporal consistency of input frames, thereby reducing the average time and power required to process video. Embodiments herein provide for the use of conjoined training to reduce flicker.
Embodiments herein provide a method for low-light video enhancement. The method includes receiving an input video stream corrupted by noise, low luminance, or color artifacts. The brightness is raised to a desired level using a pre-adjusted look-up table. The temporal similarity of successive frames is analyzed. If dissimilar frames exist (based on analysis of successive frames), a high complexity single frame DNN model is deployed. If there are similar frames (based on analysis of consecutive frames), a lower complexity multi-frame (p) DNN model (e.g., a 3-frame DNN model) guided by the previous output is deployed. Upon detecting artificial light flicker in an input video stream, an input comprising a plurality of frames (q, where q > p) (e.g., the input comprises five frames) is used to perform flicker removal along with noise reduction. The output from one of the paths is saved to the output video stream.
Embodiments herein provide a method for fast video denoising. The method includes receiving as input a single or multiple frames. The frame is downsampled to a lower resolution using multiple scales. The video frames are processed at a lower resolution, generating a low resolution output. Using the higher resolution frames as a guide, the low resolution output is amplified at multiple levels. For time consistency, a conjoined training method may be used to train a low resolution network for downsampling and upsampling.
Embodiments herein provide a deep learning based pipeline that enables real-time video enhancement while minimizing noise and flicker artifacts.
Embodiments herein provide a method of selecting between a high complexity network and a low complexity network by analyzing the temporal consistency of input frames, thereby reducing the average time required to process video/media.
Embodiments herein provide a method to dynamically change network complexity at the time of reasoning.
Embodiments herein provide a method for reducing flicker with conjoined training.
Embodiments herein may be implemented by at least one software program running on at least one hardware device and executing network management functions to control elements. The elements shown in fig. 1 and 2 may be at least one of hardware devices or a combination of hardware devices and software modules.
Embodiments of the present disclosure provide methods and systems for low-light media enhancement. It should therefore be understood that the scope of protection extends to such a program, and that such computer readable storage means comprise, in addition to the computer readable means having the message therein, program code means for carrying out one or more steps of the method, when the program is run on a server or mobile device or any suitable programmable means. The method is implemented in an embodiment by or in conjunction with a software program written in, for example, a very high speed integrated circuit hardware description language (VHDL) or another programming language, or by one or more VHDL or several software modules executed on at least one hardware device. The hardware device may be any kind of portable device that can be programmed. The apparatus may also include an ASIC, or a combination of hardware and software means, such as an ASIC and FPGA, or at least one microprocessor and at least one memory having software modules located therein. The method embodiments described herein may be implemented partly in hardware and partly in software. Alternatively, the invention may be implemented on different hardware devices, e.g. using multiple CPUs.
The foregoing description of the specific embodiments will so fully reveal the general nature of the embodiments herein that others can, by applying current knowledge, readily modify and/or adapt for various applications such specific embodiments without departing from the generic concept, and, therefore, such adaptations and modifications should and are intended to be comprehended within the meaning and range of equivalents of the disclosed embodiments. It is to be understood that the phraseology or terminology employed herein is for the purpose of description and not of limitation. Thus, although embodiments have been described, those skilled in the art will recognize that the embodiments herein can be practiced with modification within the spirit and scope of the embodiments as described herein.

Claims (15)

1. A method for enhancing media, the method comprising:
receiving, by an electronic device, a media stream;
performing, by the electronic device, alignment of a plurality of frames of a media stream;
correcting, by the electronic device, brightness of the plurality of frames;
selecting, by the electronic device, one of a first neural network, a second neural network, or a third neural network by analyzing parameters of the plurality of frames with corrected brightness, wherein the parameters include at least one of shot boundary detection and artificial light flicker; and
generating, by the electronic device, an output media stream by processing the plurality of frames of the media stream using a selected one of the first neural network, the second neural network, or the third neural network.
2. The method of claim 1, wherein the media stream is captured in low light conditions and
wherein the media stream includes at least one of noise, low brightness, artificial flicker, and color artifacts.
3. The method of claim 1, wherein the output media stream is a de-noised media stream with enhanced brightness and zero flicker.
4. The method of claim 1, wherein correcting the brightness of the plurality of frames of the media stream comprises:
identifying a single frame or the plurality of frames of the media stream as an input frame;
linearizing the input frame using an inverse camera response function ICRF;
selecting a luminance multiplication factor for correcting the luminance of the input frame using the future time guide;
applying a linear boost to the input frame based on the brightness multiplication factor; and
a camera response function CRF is applied to the input frame to correct the brightness of the input frame,
wherein CRF is a function of sensor type and metadata,
wherein the metadata includes an exposure value and an International Organization for Standardization (ISO) value, and
wherein the CRF and the ICRF are stored as look-up tables (LUTs).
5. The method of claim 4, wherein selecting a brightness multiplication factor comprises:
analyzing the brightness of the input frame;
identifying a maximum constant boost value as a luminance multiplication factor based on the luminance of the input frame being less than a threshold and the luminance of all frames in the future time buffer being less than the threshold;
identifying a lifting value of a monotonically decreasing function between a maximum constant lifting value and 1 as a brightness multiplication factor based on the brightness of the input frame being less than the threshold and the brightness of all frames in the future time buffer being greater than the threshold;
identifying a unity gain boost value as a brightness multiplication factor based on the brightness of the input frame being greater than the threshold and the brightness of all frames in the future time buffer being greater than the threshold; and
based on the luminance of the input frame being greater than the threshold and the luminance of any frame in the future time buffer being less than the threshold, a lifting value of a monotonically increasing function between 1 and a maximum constant lifting value is identified as a luminance multiplication factor.
6. The method of claim 1, wherein selecting, by the electronic device, one of a first neural network, a second neural network, or a third neural network comprises:
Analyzing each frame relative to an earlier frame to determine whether shot boundary detection is associated with each of the plurality of frames;
selecting a first neural network to generate an output media stream by processing the plurality of frames of the media stream based on shot boundary detection associated with the plurality of frames;
analyzing the plurality of frames for the presence of artificial light flicker based on shot boundary detection not associated with the plurality of frames;
selecting a second neural network to generate an output media stream by processing the plurality of frames of the media stream based on the presence of artificial light flicker in the plurality of frames; and
a third neural network is selected to generate an output media stream by processing the plurality of frames of the media stream based on an absence of artificial light flicker in the plurality of frames.
7. The method of claim 6, wherein the first neural network is a high complexity neural network having one input frame,
wherein the second neural network is a time-guided, lower complexity neural network for joint de-flicker or joint de-noising with q number of input frames and previous output frames, and
wherein the third neural network is a neural network for denoising having p number of input frames and previous output frames, wherein p is smaller than q.
8. The method of claim 7, wherein the first neural network includes a plurality of residual blocks at a lowest level to enhance noise removal capability, and
wherein the second neural network includes at least one convolution operation with fewer feature maps and a previous output frame as a guide to process the plurality of input frames.
9. The method of claim 6, wherein the first, second, and third neural networks are trained using a multi-frame conjoined training method to generate an output media stream by processing the plurality of frames of the media stream.
10. The method of claim 9, further comprising: training a neural network of at least one of the first, second, and third neural networks by:
creating a dataset for training a neural network, wherein the dataset comprises one of a local dataset and a global dataset;
selecting at least two groups of frames from the created dataset, wherein each group comprises at least three frames;
adding the synthesized motion to the selected at least two sets of frames, wherein the at least two sets of frames to which the synthesized motion is added include different noise realizations; and
The conjoined training of the neural network is performed using the reference real media and the at least two sets of frames added with the synthesized motion.
11. The method of claim 10, wherein creating a dataset comprises:
capturing a burst data set, wherein the burst data set comprises one of a low-light static media with noise input and a clean reference real frame;
simulating global and local motions for each burst dataset using the synthetic trajectory generation and the synthetic stop motion, respectively;
removing at least one burst data set having a structural and luminance mismatch between the clean reference real frame and the low-light static media; and
a data set is created by including the at least one burst data set that does not include a structural and brightness mismatch between the clean reference real frame and the low-light static media.
12. The method of claim 11, wherein simulating global motion for each burst data set comprises:
estimating a polynomial coefficient range based on parameters including maximum translation and maximum rotation;
generating a third order polynomial trajectory using the estimated polynomial coefficient range;
approximating a third order trajectory using the maximum depth and the generated third order polynomial trajectory;
generating uniform sampling points based on the predefined sampling rate and the approximated 3D trajectory;
generating n affine transformations based on the generated uniform sampling points; and
the generated n affine transforms are applied to each burst data set.
13. The method of claim 11, wherein simulating the local motion of each burst data set comprises:
capturing local object motion from each burst data set in a static scene using a composite stop motion, the capturing local object motion comprising:
capturing an input with a background scene and a reference real scene;
capturing an input with foreground objects and a reference real scene;
cutting out a foreground object; and
creating a composite scene by placing foreground objects at different locations of a background scene; and
motion blur for each local object motion is simulated by averaging a predefined number of frames of the burst data set.
14. The method of claim 10, wherein performing conjoined training of a neural network comprises:
delivering the at least two sets of frames with different noise realizations to a neural network to generate at least two sets of output frames;
calculating a conjoined loss by calculating a loss between the at least two sets of output frames;
calculating a pixel loss between an average of the at least two sets of output frames and the reference reality;
calculating a total loss using the conjoined loss and the pixel loss; and
training the neural network using the calculated total loss.
15. An electronic device, comprising:
a memory; and
a processor coupled to the memory and configured to:
receiving a media stream;
performing alignment of a plurality of frames of the media stream;
correcting the brightness of the plurality of frames;
selecting one of a first neural network, a second neural network, or a third neural network by analyzing parameters of the plurality of frames having corrected brightness, wherein the parameters include at least one of shot boundary detection and artificial light flicker; and
generating an output media stream by processing the plurality of frames of the media stream using a selected one of the first, second, or third neural networks.
CN202280018046.9A 2021-06-15 2022-06-13 Method and system for low-light media enhancement Pending CN117044215A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
IN202141026673 2021-06-15
IN202141026673 2021-06-15
PCT/KR2022/008294 WO2022265321A1 (en) 2021-06-15 2022-06-13 Methods and systems for low light media enhancement

Publications (1)

Publication Number Publication Date
CN117044215A true CN117044215A (en) 2023-11-10

Family

ID=84391301

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280018046.9A Pending CN117044215A (en) 2021-06-15 2022-06-13 Method and system for low-light media enhancement

Country Status (3)

Country Link
US (1) US20220398700A1 (en)
EP (1) EP4248657A4 (en)
CN (1) CN117044215A (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116797485A (en) * 2023-06-30 2023-09-22 中国人民解放军军事科学院***工程研究院 Low-illumination image enhancement method and device based on data synthesis

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101313637B1 (en) * 2006-06-09 2013-10-02 서강대학교산학협력단 Image processing apparatus and method for contrast enhancement
US8823808B2 (en) * 2009-10-27 2014-09-02 Intellectual Ventures Fund 83 Llc Method for improved digital video image quality
CN105893920B (en) * 2015-01-26 2019-12-27 阿里巴巴集团控股有限公司 Face living body detection method and device
JP7133979B2 (en) * 2018-05-24 2022-09-09 三菱電機株式会社 Image processing device, image processing method, image processing program, and storage medium
KR20190117416A (en) * 2019-09-26 2019-10-16 엘지전자 주식회사 Method and apparatus for enhancing video frame resolution
KR20210052040A (en) * 2019-10-31 2021-05-10 엘지전자 주식회사 Video data quality improving method and apparatus
CN111311524B (en) * 2020-03-27 2023-04-18 电子科技大学 MSR-based high dynamic range video generation method

Also Published As

Publication number Publication date
EP4248657A1 (en) 2023-09-27
EP4248657A4 (en) 2024-01-03
US20220398700A1 (en) 2022-12-15


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination