CN117911295A - Video image reconstruction method and device and electronic equipment

Video image reconstruction method and device and electronic equipment

Info

Publication number
CN117911295A
CN117911295A
Authority
CN
China
Prior art keywords
image
video image
sample
hdr
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211269032.2A
Other languages
Chinese (zh)
Inventor
俞碧婷
杨敬钰
彭昱博
岳焕景
周振宇
尹玄武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xiaomi Mobile Software Co Ltd
Original Assignee
Beijing Xiaomi Mobile Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xiaomi Mobile Software Co Ltd filed Critical Beijing Xiaomi Mobile Software Co Ltd
Priority to CN202211269032.2A
Publication of CN117911295A
Legal status: Pending

Landscapes

  • Studio Devices (AREA)
  • Image Processing (AREA)

Abstract

The disclosure provides a video image reconstruction method, a video image reconstruction device and electronic equipment, relates to the field of image processing, and can reconstruct more realistic HDR video images. The method comprises the following steps: constructing a video data set, wherein the video data set comprises a data pair formed, for each frame, by a sample low dynamic range (LDR) video image and a sample high dynamic range (HDR) video image, the sample LDR video image being one of a first exposure image and a second exposure image, and the sample HDR video image being an HDR video image obtained by pixel fusion of the first exposure image and the second exposure image; training an HDR image reconstruction model with the video data set so that the HDR image reconstruction model meets a preset training standard; and determining a target LDR video image sequence corresponding to an HDR video image reconstruction task, and inputting the target LDR video image sequence into the HDR image reconstruction model meeting the preset training standard to obtain a reconstructed HDR video image.

Description

Video image reconstruction method and device and electronic equipment
Technical Field
The disclosure relates to the field of image processing, and in particular relates to a video image reconstruction method, a video image reconstruction device and electronic equipment.
Background
Limited by camera hardware and exposure parameters, the dynamic range of the photographs we take daily is often far below that of natural scenes. High dynamic range (High Dynamic Range, HDR) technology improves the dynamic range of a picture by taking multiple low dynamic range (Low Dynamic Range, LDR) pictures with different exposures, obtaining an image with better visual quality. However, while it is easy to take multiple differently exposed images to reconstruct HDR in a static scene, difficulties arise in dynamic scenes. The most important problem is misalignment caused by the movement of objects in the scene, which often produces motion ghosts in the reconstructed HDR result; in particular, the exposure difference between images makes it hard for traditional alignment methods to work as intended. Compared with HDR image reconstruction, HDR video reconstruction shoots a video sequence with alternating exposures (such as -2EV, +2EV, -2EV, ...), inputs three or five adjacent LDR frames with different exposures at a time using a sliding-window method, and reconstructs the HDR result of the intermediate frame through alignment and fusion.
At present, in existing HDR video reconstruction approaches, the reconstructed HDR data set is distorted and cannot serve as a reference data set for training and evaluating real-video HDR methods, so the processing effect on real-scene video tends to be poor.
Disclosure of Invention
The disclosure provides a video image reconstruction method, a video image reconstruction device and electronic equipment, which can solve the technical problems that, in existing HDR video reconstruction approaches, the reconstructed HDR data set is distorted, no reference data set is available for training and evaluating real-video HDR methods, and the processing effect on real-scene video tends to be poor.
An embodiment of a first aspect of the present disclosure provides a video image reconstruction method, including:
Constructing a video data set, wherein the video data set comprises a data pair formed by a sample low dynamic range LDR video image and a sample high dynamic range HDR video image under each frame, the sample LDR video image comprises one of a first exposure image and a second exposure image, and the sample HDR video image is an HDR video image obtained by carrying out pixel fusion on the first exposure image and the second exposure image;
Training an HDR image reconstruction model by using the video data set so that the HDR image reconstruction model accords with a preset training standard;
Determining a target LDR video image sequence corresponding to an HDR video image reconstruction task, inputting the target LDR video image sequence into an HDR image reconstruction model conforming to the preset training standard, and obtaining a reconstructed HDR video image.
In some embodiments of the present disclosure, the constructing a video dataset includes:
outputting a first exposure image and a second exposure image corresponding to each frame by using a sensor, alternately selecting one of the first exposure image and the second exposure image corresponding to each frame according to a video frame sequence, wherein the first exposure image is a video image with exposure time being greater than a first preset threshold value, the second exposure image is a video image with exposure time being less than a second preset threshold value, and the first preset threshold value is greater than the second preset threshold value;
and carrying out pixel fusion processing on the first exposure image and the second exposure image under each frame to obtain a sample HDR video image under each frame.
In some embodiments of the present disclosure, the video data set is a Raw domain data set, and the performing pixel fusion processing on the first exposure image and the second exposure image under each frame to obtain a sample HDR video image under each frame includes:
Preprocessing a first exposure image and a second exposure image under each frame;
calculating weight values respectively corresponding to the preprocessed first exposure image and the preprocessed second exposure image according to the designated channel;
and carrying out pixel fusion on the first exposure image and the second exposure image under each frame based on the weight value to obtain a sample HDR video image under each frame.
In some embodiments of the disclosure, the training the HDR image reconstruction model using the video data set to conform the HDR image reconstruction model to a preset training standard includes:
Constructing an alternately exposed sample LDR video image sequence by utilizing sample LDR video images corresponding to frames in the video data set;
Taking the sample LDR video image sequence as an input feature, taking a sample HDR video image under each frame as a feature tag, training an HDR image reconstruction model, and outputting a reconstructed HDR video image;
Calculating a loss function of the HDR image reconstruction model using the reconstructed HDR video image and a corresponding sample HDR video image;
And if the loss function is smaller than a preset threshold value, judging that the HDR image reconstruction model accords with a preset training standard.
In some embodiments of the present disclosure, the training an HDR image reconstruction model with the sample LDR video image sequence as an input feature, the sample HDR video image under each frame as a feature tag, and outputting a reconstructed HDR video image, comprises:
Inputting the sample LDR video image sequence into the HDR image reconstruction model, and correcting continuous multi-frame sample LDR video images in the sample LDR video image sequence by using an exposure coefficient;
extracting image characteristics of the multi-frame sample LDR video image after exposure correction;
performing feature alignment processing on the image features of the multi-frame sample LDR video image, and determining the alignment features of the multi-frame sample LDR video image;
Determining weight values of the alignment features, and carrying out weighted fusion on the alignment features of the multi-frame sample LDR video image based on the weight values to obtain HDR video image features;
and performing tone mapping processing on the HDR video image characteristics to obtain a reconstructed HDR video image.
In some embodiments of the present disclosure, inputting the sequence of sample LDR video images into the HDR image reconstruction model, correcting successive multi-frame sample LDR video images in the sequence of sample LDR video images with exposure coefficients, comprising:
Determining a sample LDR video image sequence corresponding to each frame and exposure coefficients of continuous multi-frame sample LDR video images in the sample LDR video image sequence;
dividing the continuous multi-frame sample LDR video image by the corresponding exposure coefficient to obtain the multi-frame sample LDR video image after exposure correction processing.
In some embodiments of the present disclosure, the performing feature alignment processing on the image features of the multi-frame sample LDR video image, determining the alignment features of the multi-frame sample LDR video image includes:
Inputting the image characteristics of the multi-frame sample LDR video image into an optical flow network to obtain optical flows under multiple scales;
Calculating the optical flow offset of each frame of sample LDR video image under each scale and the middle frame of sample LDR video image in the sample LDR video image sequence;
and carrying out characteristic alignment processing on the image characteristics of the multi-frame sample LDR video image based on the optical flow offset to obtain alignment characteristics respectively corresponding to the multi-frame sample LDR video image.
An embodiment of a second aspect of the present disclosure proposes a video image reconstruction apparatus including:
The construction module is used for constructing a video data set, wherein the video data set comprises a data pair formed by a sample low dynamic range LDR video image and a sample high dynamic range HDR video image under each frame, the sample LDR video image comprises one of a first exposure image and a second exposure image, and the sample HDR video image is an HDR video image obtained by carrying out pixel fusion on the first exposure image and the second exposure image;
The training module is used for training an HDR image reconstruction model by utilizing the video data set so that the HDR image reconstruction model accords with a preset training standard;
The input module is used for determining a target LDR video image sequence corresponding to an HDR video image reconstruction task, inputting the target LDR video image sequence into an HDR image reconstruction model conforming to the preset training standard, and obtaining a reconstructed HDR video image.
An embodiment of a third aspect of the present disclosure proposes an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method described in the embodiments of the first aspect of the present disclosure.
An embodiment of a fourth aspect of the present disclosure proposes a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method described in the embodiment of the first aspect of the present disclosure.
A fifth aspect embodiment of the present disclosure proposes a computer program product comprising a computer program which, when executed by a processor, implements the method described in the first aspect embodiment of the present disclosure.
A sixth aspect of the present disclosure provides a chip comprising one or more interface circuits and one or more processors; the interface circuit is for receiving a signal from a memory of the electronic device and sending the signal to the processor, the signal comprising computer instructions stored in the memory, which when executed by the processor, cause the electronic device to perform the method described in the embodiments of the first aspect of the disclosure.
In summary, according to the video image reconstruction method, apparatus and electronic device proposed in the present disclosure, a real-world LDR-HDR video data set with continuous reference images may first be constructed to provide a reference data set for training and evaluating real-video HDR methods. The HDR image reconstruction model can then be trained on this video data set, undergoing refined prediction training through an exposure correction module, a feature extraction module, an optical flow guided alignment module and an inter-frame attention fusion module in sequence, so that the HDR image reconstruction model meets the preset training standard. When a specific HDR video image reconstruction task arises, the target LDR video image sequence corresponding to the task can be input directly into the HDR image reconstruction model that meets the preset training standard to obtain the reconstructed HDR video image. In this technical scheme, the HDR video image is obtained by pixel fusion of real LDR video images captured in the Raw domain, which ensures the authenticity and accuracy of the LDR-HDR video data set. Furthermore, training the HDR image reconstruction model on this video data set ensures the accuracy of the model, so that HDR video images of higher authenticity can be reconstructed and the processing effect on real-scene video is guaranteed.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure and do not constitute an undue limitation on the disclosure.
Fig. 1 is a schematic flow chart of a video image reconstruction method according to an embodiment of the disclosure;
Fig. 2 is a schematic flow chart of a video image reconstruction method according to an embodiment of the disclosure;
fig. 3 is a schematic structural diagram of a video image reconstruction device according to an embodiment of the present disclosure;
Fig. 4 is a schematic structural diagram of a video image reconstruction device according to an embodiment of the present disclosure;
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals identify the same or similar elements or elements having the same or similar functions throughout. The embodiments described below by referring to the drawings are exemplary and intended for the purpose of explaining the present disclosure and are not to be construed as limiting the present disclosure.
With the advent of deep learning, many works have proposed deep-learning-based alignment and fusion methods to reconstruct high-quality HDR images and videos. However, deep-learning-based HDR image and video processing methods often require a large-scale dataset as the training basis for the model. Existing methods mainly generate LDR-HDR data with synthetic datasets: a synthetic dataset obtains the HDR image by inverting an assumed camera response function (Camera Response Function, CRF) and then selects exposure parameters to synthesize the corresponding LDR. However, because the simulated CRF curves deviate from real ones, the obtained HDR differs to some extent from the real one, and the real noise situation cannot be modeled. Therefore, in such HDR video reconstruction approaches, the reconstructed HDR data set is distorted and cannot serve as a reference data set for training and evaluating real-video HDR methods, so the processing effect on real-scene video is poor.
In view of this, the present disclosure provides a video image reconstruction method, apparatus and electronic device, which can solve the technical problems that, in current HDR video reconstruction approaches, the reconstructed HDR data set is distorted, no reference data set is available for training and evaluating real-video HDR methods, and the processing effect on real-scene video is poor.
Fig. 1 is a flowchart illustrating a video image reconstruction method according to an exemplary embodiment, as shown in fig. 1, including the following steps.
Step 101, constructing a video data set, wherein the video data set comprises a data pair formed by a sample low dynamic range LDR video image and a sample high dynamic range HDR video image under each frame, the sample LDR video image comprises one of a first exposure image and a second exposure image, and the sample HDR video image is an HDR video image obtained by carrying out pixel fusion on the first exposure image and the second exposure image.
Compared with the existing mechanism, in which an HDR image is obtained by inverting an assumed camera response function (Camera Response Function, CRF), an exposure parameter is then selected to synthesize a corresponding LDR image, and a synthetic data set is constructed from the HDR image and the synthesized LDR image, the embodiment of the disclosure first determines a sample LDR video image from a first exposure image and a second exposure image obtained by real shooting, and then obtains the HDR video image by pixel fusion of the real first exposure image and second exposure image captured in the Raw domain; this ensures the authenticity and accuracy of the LDR-HDR video data set and provides a reference data set for training and evaluating real-video HDR methods. The first exposure image may be a long-exposure video image whose exposure time is greater than a first preset threshold, and the second exposure image a short-exposure video image whose exposure time is less than a second preset threshold, the first preset threshold being greater than the second preset threshold; specific values may be set according to the actual application scenario and are not limited herein.
Step 102, training the HDR image reconstruction model by using the video data set, so that the HDR image reconstruction model meets a preset training standard.
For the disclosed embodiment, the video data set constructed in step 101 may be divided into a training set and a test set; a real-world Raw video HDR reconstruction algorithm is designed and trained on the training set, a model is built based on this algorithm and trained on the deep learning framework PyTorch, and the training result is verified on the test set. Specifically, the model input may be 3 frames, the input video is cropped into 256 by 256 blocks, and each batch contains 16 sets of sample data. An Adam optimizer is selected with the initial learning rate set to 0.0001. To enhance data diversity, flipping, rotation and cropping are used for augmentation during training. The model iterates for 30 epochs over the whole dataset, after which the learning rate is reduced to 0.00001 and iteration continues until the loss converges, yielding a final model that meets the preset training standard. The preset training standard may be that the loss function of the HDR image reconstruction model is smaller than a preset threshold; the preset threshold is a value greater than 0 and smaller than 1, and the closer it is to 0, the higher the accuracy of the resulting HDR image reconstruction model. Its specific value may be set according to the actual application scenario and is not limited herein.
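Purely as an illustration of the schedule above, the following PyTorch sketch wires these hyperparameters together; the model and the data loader are assumed to exist and are not specified by the disclosure, and the loss shown is the μ-law tone-mapped L1 distance described under step 205 below.

```python
import math
import torch
import torch.nn.functional as F

def mu_law_l1_loss(pred: torch.Tensor, gt: torch.Tensor, mu: float = 5000.0) -> torch.Tensor:
    # L1 distance between mu-law tone-mapped images (see step 205)
    tm = lambda x: torch.log1p(mu * x) / math.log1p(mu)
    return F.l1_loss(tm(pred), tm(gt))

def train(model: torch.nn.Module, loader, device: str = "cuda") -> None:
    """Adam, lr 1e-4 for 30 epochs over the whole dataset, then 1e-5 until the
    loss converges; `loader` is assumed to yield batches of 16 samples, each a
    3-frame alternating-exposure 256x256 crop with its exposure coefficients
    and Raw-domain HDR ground truth."""
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    for epoch in range(60):                  # 30 epochs, then continued iteration
        if epoch == 30:
            for group in optimizer.param_groups:
                group["lr"] = 1e-5           # reduce learning rate to 0.00001
        for ldr_seq, exposures, hdr_gt in loader:
            pred = model(ldr_seq.to(device), exposures.to(device))
            loss = mu_law_l1_loss(pred, hdr_gt.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```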
Step 103, determining a target LDR video image sequence corresponding to the HDR video image reconstruction task, inputting the target LDR video image sequence into an HDR image reconstruction model which accords with a preset training standard, and obtaining a reconstructed HDR video image.
The HDR video image reconstruction task is to reconstruct the HDR video image of the current frame. The target LDR video image sequence takes the current-frame LDR video image as the intermediate frame of the sequence and is formed, in video frame order, together with the N LDR video images preceding and the N LDR video images following the current-frame LDR video image. For example, N may be 1, in which case the target LDR video image sequence may be [previous-frame LDR video image; current-frame LDR video image; next-frame LDR video image]. Correspondingly, when performing the HDR video image reconstruction task, the first exposure image and the second exposure image captured for the current frame, the frame before it and the frame after it may first be obtained, and one of the first and second exposure images of each frame is alternately selected to construct the target LDR video image sequence, for example [first exposure image of the previous frame, second exposure image of the current frame, first exposure image of the next frame], or [second exposure image of the previous frame, first exposure image of the current frame, second exposure image of the next frame]; the target LDR video image sequence is then input into an HDR image reconstruction model that meets the preset training standard to obtain the reconstructed HDR video image.
In summary, according to the video image reconstruction method, apparatus and electronic device proposed in the present disclosure, a real-world LDR-HDR video data set with continuous reference images may first be constructed to provide a reference data set for training and evaluating real-video HDR methods. The HDR image reconstruction model can then be trained on this video data set, undergoing refined prediction training through an exposure correction module, a feature extraction module, an optical flow guided alignment module and an inter-frame attention fusion module in sequence, so that the HDR image reconstruction model meets the preset training standard. When a specific HDR video image reconstruction task arises, the target LDR video image sequence corresponding to the task can be input directly into the HDR image reconstruction model that meets the preset training standard to obtain the reconstructed HDR video image. In this technical scheme, the HDR video image is obtained by pixel fusion of real LDR video images captured in the Raw domain, which ensures the authenticity and accuracy of the LDR-HDR video data set. Furthermore, training the HDR image reconstruction model on this video data set ensures the accuracy of the model, so that HDR video images of higher authenticity can be reconstructed and the processing effect on real-scene video is guaranteed.
Based on the embodiment shown in fig. 1, as a refinement and extension of the above embodiment, in order to fully describe the specific implementation procedure of the method of this embodiment, this embodiment provides a specific method as shown in fig. 2. Fig. 2 further defines steps 101, 102 based on the embodiment shown in fig. 1. In the embodiment shown in fig. 2, step 101 comprises steps 201 and 202, and step 102 comprises steps 203 to 206.
As shown in fig. 2, the method comprises the steps of:
Step 201, outputting a first exposure image and a second exposure image corresponding to each frame by using a sensor, and alternately selecting one of the first exposure image and the second exposure image corresponding to each frame according to the video frame sequence as a sample LDR video image under the corresponding video frame.
The first exposure image is a video image with exposure time being larger than a first preset threshold value, and the second exposure image is a video image with exposure time being smaller than a second preset threshold value, and the first preset threshold value is larger than the second preset threshold value.
Step 202, performing pixel fusion processing on the first exposure image and the second exposure image under each frame to obtain a sample HDR video image under each frame.
As one possible implementation, the video data set to be created may be a Raw domain data set. Accordingly, for embodiments of the present disclosure, after the first exposure image and the second exposure image are simultaneously output within one frame time using a staggered sensor, the output first-exposure and second-exposure image pairs may be pixel-weighted to synthesize the corresponding HDR result. Specifically, exposure fusion may be performed on the Raw frames using a pixel weighting strategy, comprising the following steps: (1) downsampling the first exposure image and the second exposure image into four-channel images at half resolution in each spatial dimension according to the Bayer pattern; (2) performing white balance correction on the first exposure image and the second exposure image per color channel, where the white balance parameters are obtained from the camera metadata; (3) calculating the weights corresponding to the overexposed and underexposed regions from the green channel, the green channel being the channel storing green color information in the RGB color mode; (4) multiplying the weights with the images and fusing to obtain the corresponding HDR frame, finally generating aligned LDR-HDR frames in the Raw domain. Accordingly, step 202 may specifically include: preprocessing the first exposure image and the second exposure image of each frame; calculating, from a designated channel, the weight values corresponding to the preprocessed first exposure image and second exposure image respectively; and performing pixel fusion on the first exposure image and the second exposure image of each frame based on the weight values to obtain the sample HDR video image of each frame. The preprocessing includes downsampling into half-resolution four-channel images and performing white balance correction per color channel.
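As a concrete illustration of steps (1)-(4), the following numpy sketch fuses one long/short Raw pair; the RGGB channel order, the [0, 1] value range and the soft overexposure threshold are assumptions not fixed by the disclosure.

```python
import numpy as np

def bayer_to_channels(raw: np.ndarray) -> np.ndarray:
    """(1) Pack an RGGB Bayer mosaic (H, W) into four half-resolution channels."""
    return np.stack([raw[0::2, 0::2],   # R
                     raw[0::2, 1::2],   # G1
                     raw[1::2, 0::2],   # G2
                     raw[1::2, 1::2]],  # B
                    axis=-1)

def fuse_raw_pair(long_raw, short_raw, wb_gains, t_long, t_short):
    """Weighted Raw-domain fusion of a long/short exposure pair into an HDR frame.

    Values are assumed normalized to [0, 1]; wb_gains holds the four per-channel
    white balance gains from the camera metadata (RGGB order); t_long and
    t_short are the exposure coefficients mapping both frames to a common
    radiance scale.
    """
    long_c = bayer_to_channels(long_raw) * wb_gains    # (2) white balance first
    short_c = bayer_to_channels(short_raw) * wb_gains

    # (3) weights from the green channel (mean of G1/G2): trust the short
    # exposure where the long frame is overexposed, the long frame elsewhere.
    green = 0.5 * (long_c[..., 1] + long_c[..., 2])
    w_long = np.clip((0.95 - green) / 0.1, 0.0, 1.0)[..., None]  # assumed soft threshold
    w_short = 1.0 - w_long

    # (4) fuse in linear radiance (divide by exposure coefficient), yielding
    # an HDR frame aligned with the LDR inputs in the Raw domain.
    return w_long * long_c / t_long + w_short * short_c / t_short
```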
As another possible implementation, the video data set to be created may also be an sRGB data set. Correspondingly, for the embodiment of the present disclosure, the Raw-domain long and short exposure images may be output through the sensor, and specifically: an LDR image in the sRGB domain is obtained through black-level and white-level correction, demosaicing and white balance; the HDR result in the sRGB domain is then obtained with the weighted fusion method proposed by Debevec, and the HDR result and the LDR image are used as the sRGB-domain training data pair. Different weighted fusion methods are adopted for Raw-domain data and sRGB data: first, for Raw data the weights are calculated from the green channel after downsampling into four channels, otherwise the division into overexposed and underexposed regions would be affected; second, white balance correction is performed before the calculation, otherwise a color shift would occur during the weighted fusion.
Step 203, using the sample LDR video image corresponding to each frame in the video dataset to construct an alternately exposed sample LDR video image sequence.
For this embodiment, each sample HDR video image in the video dataset may be taken in turn, in video frame order, as the HDR video image of the current frame; the current-frame sample LDR video image corresponding to that sample HDR video image is then used as the intermediate frame of a sequence formed, in video frame order, with the N sample LDR video images preceding and the N sample LDR video images following it; this image sequence is the sample LDR video image sequence for the current frame. For example, N may be 1, in which case the sample LDR video image sequence may be [previous-frame sample LDR video image; current-frame sample LDR video image; next-frame sample LDR video image]. Given that the sample LDR video image of each frame is alternately selected as one of the first exposure image and the second exposure image, the sample LDR video image sequence is an LDR video sequence simulating alternating exposure, such as [first exposure image of the previous frame, second exposure image of the current frame, first exposure image of the next frame], or [second exposure image of the previous frame, first exposure image of the current frame, second exposure image of the next frame].
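A minimal sketch of this sequence construction is shown below; the convention that even-indexed frames take the first (long) exposure is an assumption, since the disclosure only requires the selection to alternate along the frame order.

```python
from typing import List, Tuple

def build_alternating_sequences(
    long_frames: List, short_frames: List, n: int = 1
) -> List[Tuple[list, int]]:
    """Build alternating-exposure sample LDR sequences of length 2n+1 per frame.

    Even frame indices take the long (first) exposure, odd indices the short
    (second) exposure; this parity convention is assumed for illustration.
    """
    ldr = [long_frames[i] if i % 2 == 0 else short_frames[i]
           for i in range(len(long_frames))]
    sequences = []
    for i in range(n, len(ldr) - n):
        window = ldr[i - n:i + n + 1]   # [prev n frames, current frame, next n frames]
        sequences.append((window, i))   # i indexes the sample HDR ground truth
    return sequences
```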
Step 204, training an HDR image reconstruction model by using the sample LDR video image sequence as an input feature and the sample HDR video image under each frame as a feature tag, and outputting a reconstructed HDR video image.
The HDR image reconstruction model includes an exposure correction module, a feature extraction module, an optical flow guiding alignment module, and an inter-frame attention fusion module, and for the embodiment of the present disclosure, the exposure correction module and the feature extraction module may be used to perform exposure correction and feature extraction processing on a multi-frame sample LDR video image in a sample LDR video image sequence first:
Taking as an example a sample LDR video image sequence that each time comprises three consecutive Raw-domain sample LDR video image frames $\{I_{i-1}, I_{i}, I_{i+1}\}$, the exposure coefficients $t_{i-1}, t_{i}, t_{i+1}$ of the three Raw-domain sample LDR video images may first be determined, and the exposure coefficients are then used to perform exposure correction on the input sample LDR video images, with the correction formula:

$$\hat{I}_{j} = \frac{I_{j}}{t_{j}}, \qquad j \in \{i-1,\, i,\, i+1\}$$

By this formula, each of the three consecutive sample LDR video image frames is divided by its corresponding exposure coefficient, realizing the exposure correction of the three frames, placing them under the same exposure environment and mapping the three input Raw-domain sample LDR video image frames to the same exposure level.
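As a minimal sketch of this correction step (assuming the exposure coefficients are known per frame):

```python
import torch

def exposure_correct(frames: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Map consecutive alternating-exposure Raw frames to the same exposure level.

    frames: (3, C, H, W) tensor holding I_{i-1}, I_i, I_{i+1}.
    t:      (3,) exposure coefficients t_{i-1}, t_i, t_{i+1}.
    """
    return frames / t.view(-1, 1, 1, 1)  # divide each frame by its own coefficient
```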
Features are then extracted by convolution through a feature extraction module:

$$F_{i} = \mathrm{Conv}\big(\left[\, I_{i},\ \hat{I}_{i} \,\right]\big)$$

where $F_{i}$ denotes the features extracted for the $i$-th frame and $[\cdot,\cdot]$ denotes channel concatenation; the input sample LDR video image helps detect overexposed and underexposed regions, and the exposure-corrected HDR-domain input helps the subsequent alignment.
Furthermore, the optical flow guided alignment module can be utilized to perform feature alignment processing on the image features of the multi-frame sample LDR video images in the sample LDR video image sequence, so as to determine the alignment features of the multi-frame sample LDR video images:
In order to solve the problems that accurate optical flow estimation between alternately exposed images is difficult and that deformable-convolution alignment at the feature level is hard to train, the present disclosure adopts a pyramid deformable convolution structure guided by optical flow. The input is first passed through an optical flow network to obtain optical flows at five scales:

$$\{f^{s}\}_{s=1}^{5} = g\big(\tilde{I}_{i\pm1},\ \tilde{I}_{i}\big)$$

where $g$ represents the pre-trained optical flow network;
the Raw-domain input $\hat{I}_{i}$ is converted to the nonlinear domain $\tilde{I}_{i}$ by gamma correction:

$$\tilde{I}_{i} = \big(\hat{I}_{i}\big)^{1/\gamma}$$
At the $s$-th scale, the neighboring-frame feature $F^{s}_{i\pm1}$ is first warped to $\bar{F}^{s}_{i\pm1}$ according to the optical flow $f^{s}$, and $\bar{F}^{s}_{i\pm1}$ is then concatenated with the intermediate-frame feature $F^{s}_{i}$ to estimate an offset $\Delta^{s}$.
The calculated offset $\Delta^{s}$ serves as a correction to the optical flow; the final offset of the deformable convolution is applied to the neighboring-frame features before the convolution alignment, yielding the final alignment result at the current scale:

$$\hat{F}^{s}_{i\pm1} = \mathrm{DConv}\Big(F^{s}_{i\pm1},\ f^{s} + \Delta^{s} + \big(f^{s+1} + \Delta^{s+1}\big)^{\uparrow 2}\Big)$$

where $\hat{F}^{s}_{i\pm1}$ is the final alignment result at the current scale, $f^{s+1}$ is the optical flow of the previous (coarser) scale, $\Delta^{s+1}$ is the offset of the previous scale, $(\cdot)^{\uparrow 2}$ denotes 2-fold bilinear interpolation upsampling, $\big(\Delta^{s+1}\big)^{\uparrow 2}$ is the previous-scale offset after 2-fold bilinear upsampling, and $\big(f^{s+1}\big)^{\uparrow 2}$ is the previous-scale optical flow after 2-fold bilinear upsampling. The alignment result at each scale is further fused by convolution with the alignment result of the previous scale; through optical flow guidance and coarse-to-fine joint prediction, displacements at large scales and under different exposure conditions can be estimated more accurately. Finally, the weight prediction network of the inter-frame attention fusion module can be used to obtain a weight for each spatial position of the alignment features, helping the network reconstruct a ghost-free and correctly exposed HDR image:

$$w_{j} = \mathcal{A}\big(\hat{F}_{j},\ F_{i}\big), \qquad \tilde{F} = \sum_{j\in\{i-1,\,i,\,i+1\}} w_{j}\odot\hat{F}_{j}$$
where $w_{j}$ represents the estimated weights and $\tilde{F}$ represents the weighted fused features (taking $\hat{F}_{i} = F_{i}$ for the intermediate frame); $\tilde{F}$ then passes through a series of residual blocks, skip connections and a sigmoid layer to reconstruct the final Raw-domain HDR video image features $H$;
the HDR video image features may then be mapped to the reconstructed HDR video image using a tone mapping process:

$$\mathcal{T}(H) = \frac{\log(1 + \mu H)}{\log(1 + \mu)}$$

where $\mu$ is 5000 and $\mathcal{T}(H)$ represents the tone-mapped prediction of the reconstructed HDR video image.
Accordingly, the embodiment step 204 may specifically include: inputting the sample LDR video image sequence into an HDR image reconstruction model, and correcting continuous multi-frame sample LDR video images in the sample LDR video image sequence by using an exposure coefficient; extracting image characteristics of the multi-frame sample LDR video image after exposure correction; performing feature alignment processing on image features of the multi-frame sample LDR video images to determine alignment features of the multi-frame sample LDR video images; determining weight values of all the alignment features, and carrying out weighted fusion on the alignment features of the multi-frame sample LDR video image based on the weight values to obtain HDR video image features; and performing tone mapping processing on the HDR video image characteristics to obtain a reconstructed HDR video image.
As a possible implementation manner, when inputting the sample LDR video image sequence into the HDR image reconstruction model and correcting consecutive multi-frame sample LDR video images in the sample LDR video image sequence by using the exposure coefficient, the embodiment steps may specifically include: determining a sample LDR video image sequence corresponding to each frame and exposure coefficients of continuous multi-frame sample LDR video images in the sample LDR video image sequence; dividing the continuous multi-frame sample LDR video image by the corresponding exposure coefficient to obtain the multi-frame sample LDR video image after exposure correction processing.
As a possible implementation manner, when performing feature alignment processing on image features of the multi-frame sample LDR video image and determining alignment features of the multi-frame sample LDR video image, the embodiment steps may specifically include: inputting the image characteristics of the multi-frame sample LDR video image into an optical flow network to obtain optical flows under multiple scales; calculating optical flow offset of each frame of sample LDR video image under each scale and an intermediate frame of sample LDR video image in a sample LDR video image sequence; and carrying out characteristic alignment processing on the image characteristics of the multi-frame sample LDR video images based on the optical flow offset to obtain alignment characteristics respectively corresponding to the multi-frame sample LDR video images.
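The sketch below illustrates one scale of such flow-guided deformable alignment, using torchvision's DeformConv2d; the offset-prediction layer, channel width and offset-group count are assumptions, since the disclosure does not fix these details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import DeformConv2d

def flow_warp(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp features toward the intermediate frame with optical flow (N, 2, H, W) in (x, y) order."""
    n, _, h, w = feat.shape
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xx, yy), dim=0).float().to(feat.device)   # (2, H, W)
    coords = grid.unsqueeze(0) + flow                             # absolute sampling coords
    # normalize to [-1, 1] for grid_sample
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid_n = torch.stack((coords_x, coords_y), dim=-1)            # (N, H, W, 2)
    return F.grid_sample(feat, grid_n, align_corners=True)

class FlowGuidedAlign(nn.Module):
    """Single-scale flow-guided deformable alignment (assumed layer sizes)."""
    def __init__(self, c: int = 64, groups: int = 8):
        super().__init__()
        # predicts a residual offset from [warped neighbor, intermediate-frame] features
        self.offset_pred = nn.Conv2d(2 * c, 2 * groups * 9, 3, padding=1)
        self.dconv = DeformConv2d(c, c, 3, padding=1)
        self.groups = groups

    def forward(self, feat_nbr, feat_mid, flow):
        warped = flow_warp(feat_nbr, flow)                        # pre-align with flow
        delta = self.offset_pred(torch.cat([warped, feat_mid], dim=1))
        # final offset = optical flow + learned residual correction
        # (flow flipped to (y, x) and tiled per kernel location / offset group)
        offset = delta + flow.flip(1).repeat(1, self.groups * 9, 1, 1)
        return self.dconv(feat_nbr, offset)
```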
Step 205, calculating a loss function of the HDR image reconstruction model using the reconstructed HDR video image and the corresponding sample HDR video image.
For this embodiment, after the tone-mapped prediction of the reconstructed HDR video image is obtained in step 204, the tone mapping process may further be applied to the sample HDR video image (the Raw-domain ground-truth image) of the corresponding frame, and the loss is then computed from the tone-mapped Raw-domain ground-truth image and prediction:

$$L = \big\|\mathcal{T}(\hat{H}) - \mathcal{T}(H)\big\|_{1}$$

where $L$ is the loss function, and $\mathcal{T}(\hat{H})$ and $\mathcal{T}(H)$ represent the tone-mapped prediction and the tone-mapped Raw-domain ground-truth image, respectively.
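A sketch of this loss, under the assumption of an L1 distance between the μ-law tone-mapped images (the garbled original does not name the distance explicitly):

```python
import math
import torch
import torch.nn.functional as F

MU = 5000.0  # tone mapping parameter from the description

def mu_tonemap(x: torch.Tensor, mu: float = MU) -> torch.Tensor:
    """mu-law tone mapping T(H) = log(1 + mu*H) / log(1 + mu)."""
    return torch.log1p(mu * x) / math.log1p(mu)

def hdr_loss(pred_hdr: torch.Tensor, gt_hdr: torch.Tensor) -> torch.Tensor:
    """Loss between the tone-mapped prediction and the Raw-domain ground truth."""
    return F.l1_loss(mu_tonemap(pred_hdr), mu_tonemap(gt_hdr))
```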
Step 206, if the loss function is smaller than the preset threshold, determining that the HDR image reconstruction model meets the preset training standard.
The preset threshold is a value greater than 0 and less than 1, and the closer the preset threshold is to 0, the higher the accuracy of the finally obtained HDR image reconstruction model is, the specific value of the preset threshold can be set according to the actual application scene, and the specific value is not limited herein.
Step 207, determining a target LDR video image sequence corresponding to the HDR video image reconstruction task, inputting the target LDR video image sequence into an HDR image reconstruction model conforming to a preset training standard, and obtaining a reconstructed HDR video image.
The HDR video image reconstruction task is to reconstruct the HDR video image of the current frame. The target LDR video image sequence takes the current-frame LDR video image as the intermediate frame of the sequence and is formed, in video frame order, together with the N LDR video images preceding and the N LDR video images following the current-frame LDR video image. For example, N may be 1, in which case the target LDR video image sequence may be [previous-frame LDR video image; current-frame LDR video image; next-frame LDR video image]. Correspondingly, when performing the HDR video image reconstruction task, the first exposure image and the second exposure image captured for the current frame, the frame before it and the frame after it may first be obtained, and one of the first and second exposure images of each frame is alternately selected to construct the target LDR video image sequence, for example [first exposure image of the previous frame, second exposure image of the current frame, first exposure image of the next frame], or [second exposure image of the previous frame, first exposure image of the current frame, second exposure image of the next frame]; the target LDR video image sequence is then input into an HDR image reconstruction model that meets the preset training standard to obtain the reconstructed HDR video image.
For the disclosed embodiments, a real-world LDR-HDR video data set with continuous reference images may first be constructed to provide a reference data set for training and evaluating real-video HDR methods. The HDR image reconstruction model can then be trained on this video data set, undergoing refined prediction training through an exposure correction module, a feature extraction module, an optical flow guided alignment module and an inter-frame attention fusion module in sequence, so that the HDR image reconstruction model meets the preset training standard. When a specific HDR video image reconstruction task arises, the target LDR video image sequence corresponding to the task can be input directly into the HDR image reconstruction model that meets the preset training standard to obtain the reconstructed HDR video image. In this technical scheme, the HDR video image is obtained by weighted fusion of real LDR video images captured in the Raw domain, which ensures the authenticity and accuracy of the LDR-HDR video data set. Furthermore, training the HDR image reconstruction model on this video data set ensures the accuracy of the model, so that HDR video images of higher authenticity can be reconstructed and the processing effect on real-scene video is guaranteed. Moreover, based on the real-world Raw video HDR dataset, a Raw-HDR method is proposed that, with the proposed optical flow guided alignment module and inter-frame attention fusion module, extends the dynamic range of real LDR video.
Based on the specific implementation of the method shown in fig. 1-2, this embodiment provides a video image reconstruction device, as shown in fig. 3, including: a construction module 41, a training module 42, an input module 43.
The construction module 41 is configured to construct a video data set, where the video data set includes a data pair formed by a sample low dynamic range LDR video image and a sample high dynamic range HDR video image under each frame, the sample LDR video image includes one of a first exposure image and a second exposure image, and the sample HDR video image is an HDR video image obtained by performing pixel fusion on the first exposure image and the second exposure image;
A training module 42 operable to train the HDR image reconstruction model with the video dataset such that the HDR image reconstruction model meets a preset training standard;
The input module 43 may be configured to determine a target LDR video image sequence corresponding to an HDR video image reconstruction task, input the target LDR video image sequence into an HDR image reconstruction model that meets a preset training standard, and obtain a reconstructed HDR video image.
In some embodiments of the present disclosure, as shown in fig. 4, the build module 41 includes: a selection unit 411 and a processing unit 412;
The selecting unit 411 may be configured to output, by using the sensor, a first exposure image and a second exposure image corresponding to each frame, and alternately select, according to a video frame sequence, one of the first exposure image and the second exposure image corresponding to each frame, as a sample LDR video image under a corresponding video frame, where the first exposure image is a video image with an exposure time greater than a first preset threshold, and the second exposure image is a video image with an exposure time less than a second preset threshold, and the first preset threshold is greater than the second preset threshold;
the processing unit 412 may be configured to perform pixel fusion processing on the first exposure image and the second exposure image under each frame, so as to obtain a sample HDR video image under each frame.
In some embodiments of the present disclosure, the video dataset is a Raw domain dataset, and the processing unit 412 is specifically configured to pre-process the first exposure image and the second exposure image under each frame; calculating weight values respectively corresponding to the preprocessed first exposure image and the preprocessed second exposure image according to the designated channel; and carrying out pixel fusion on the first exposure image and the second exposure image under each frame based on the weight value to obtain a sample HDR video image under each frame.
In some embodiments of the present disclosure, as shown in fig. 4, training module 42 includes: the device comprises a construction unit 421, a training unit 422, a calculation unit 423 and a judgment unit 424;
A construction unit 421, configured to construct an alternately exposed sample LDR video image sequence using the sample LDR video images corresponding to each frame in the video dataset;
the training unit 422 may be configured to train an HDR image reconstruction model with the sample LDR video image sequence as an input feature, and the sample HDR video image under each frame as a feature tag, and output a reconstructed HDR video image;
A calculation unit 423 operable to calculate a loss function of the HDR image reconstruction model using the reconstructed HDR video image and the corresponding sample HDR video image;
the determining unit 424 may be configured to determine that the HDR image reconstruction model meets a predetermined training standard if the loss function is smaller than a predetermined threshold.
In some embodiments of the present disclosure, the training unit 422 may be configured to input the sample LDR video image sequence into an HDR image reconstruction model, and correct consecutive multi-frame sample LDR video images in the sample LDR video image sequence using the exposure coefficients; extracting image characteristics of the multi-frame sample LDR video image after exposure correction; performing feature alignment processing on image features of the multi-frame sample LDR video images to determine alignment features of the multi-frame sample LDR video images; determining weight values of all the alignment features, and carrying out weighted fusion on the alignment features of the multi-frame sample LDR video image based on the weight values to obtain HDR video image features; and performing tone mapping processing on the HDR video image characteristics to obtain a reconstructed HDR video image.
In some embodiments of the present disclosure, when inputting a sample LDR video image sequence into an HDR image reconstruction model, correcting consecutive multi-frame sample LDR video images in the sample LDR video image sequence with exposure coefficients, the training unit 422 may be used to determine the sample LDR video image sequence corresponding to each frame, and the exposure coefficients of consecutive multi-frame sample LDR video images in the sample LDR video image sequence; dividing the continuous multi-frame sample LDR video image by the corresponding exposure coefficient to obtain the multi-frame sample LDR video image after exposure correction processing.
In some embodiments of the present disclosure, when performing feature alignment processing on image features of a multi-frame sample LDR video image and determining alignment features of the multi-frame sample LDR video image, the training unit 422 may be configured to input the image features of the multi-frame sample LDR video image into an optical flow network to obtain optical flows under multiple scales; calculating optical flow offset of each frame of sample LDR video image under each scale and an intermediate frame of sample LDR video image in a sample LDR video image sequence; and carrying out characteristic alignment processing on the image characteristics of the multi-frame sample LDR video images based on the optical flow offset to obtain alignment characteristics respectively corresponding to the multi-frame sample LDR video images.
The specific manner in which the various modules perform operations in the apparatus of the above embodiments has been described in detail in the embodiments of the method and will not be elaborated here.
According to the embodiment of the disclosure, the HDR video image is obtained by weighted fusion of real LDR video images captured in the Raw domain, which ensures the authenticity and accuracy of the LDR-HDR video data set. Furthermore, training the HDR image reconstruction model on this video data set ensures the accuracy of the model, so that HDR video images of higher authenticity can be reconstructed and the processing effect on real-scene video is guaranteed. Moreover, based on the real-world Raw video HDR dataset, a Raw-HDR method is proposed that, with the proposed optical flow guided alignment module and inter-frame attention fusion module, extends the dynamic range of real LDR video.
The embodiments above introduce the method and apparatus provided by the embodiments of the present disclosure. To implement the functions of the method provided by the embodiments of the present disclosure, the electronic device may comprise a hardware structure, a software module, or both, implementing the functions in the form of a hardware structure, a software module, or a hardware structure plus a software module. One of the above functions may be implemented as a hardware structure, a software module, or a combination of a hardware structure and a software module.
Fig. 5 is a block diagram illustrating an electronic device 800 for implementing the video image reconstruction method described above, according to an exemplary embodiment.
For example, electronic device 800 may be a mobile phone, computer, messaging device, game console, tablet device, medical device, exercise device, personal digital assistant, or the like.
Referring to fig. 5, an electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interactions between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the electronic device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power supply component 806 provides power to the various components of the electronic device 800. The power components 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen between the electronic device 800 and the user that provides an output interface. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or sliding action, but also the duration and pressure associated with the touch or sliding operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the electronic device 800 is in an operational mode, such as a shooting mode or a video mode, the front camera and/or the rear camera may receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 further includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, click wheel, buttons, etc. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.
The sensor assembly 814 includes one or more sensors for providing status assessments of various aspects of the electronic device 800. For example, the sensor assembly 814 may detect an on/off state of the electronic device 800 and the relative positioning of components, such as the display and keypad of the electronic device 800; the sensor assembly 814 may also detect a change in position of the electronic device 800 or of a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a change in temperature of the electronic device 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate communication between the electronic device 800 and other devices, either wired or wireless. The electronic device 800 may access a wireless network based on a communication standard, such as WiFi,2G or 3G,4G LTE, 5G NR (New Radio), or a combination thereof. In one exemplary embodiment, the communication component 816 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, ultra Wideband (UWB) technology, bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for executing the methods described above.
In an exemplary embodiment, a non-transitory computer-readable storage medium is also provided, such as the memory 804 including instructions executable by the processor 820 of the electronic device 800 to perform the above-described method. For example, the non-transitory computer-readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
Embodiments of the present disclosure also provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the video image reconstruction method described in the above embodiments of the present disclosure.
Embodiments of the present disclosure also provide a computer program product comprising a computer program which, when executed by a processor, performs the video image reconstruction method described in the above embodiments of the present disclosure.
Embodiments of the present disclosure also provide a chip including one or more interface circuits and one or more processors. The interface circuit is configured to receive a signal from a memory of the electronic device and to send the signal to the processor, the signal including computer instructions stored in the memory; when the computer instructions are executed by the processor, the electronic device is caused to perform the video image reconstruction method described in the above embodiments of the present disclosure.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate, so that the embodiments of the disclosure described herein can be practiced in sequences other than those illustrated or described herein. The implementations described in the exemplary embodiments above do not represent all implementations consistent with the present disclosure; rather, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure, as recited in the appended claims.
In the description of this specification, reference to the terms "one embodiment," "some embodiments," "illustrative embodiment," "example," "specific example," "some examples," or the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present disclosure. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing specific logical functions or steps of the process. The scope of the preferred embodiments of the present disclosure includes further implementations in which functions may be executed out of the order shown or discussed, including substantially concurrently or in the reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the embodiments of the present disclosure.
The logic and/or steps represented in the flowcharts or otherwise described herein, for example, an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device, such as a computer-based system, a system including a processing module, or another system that can fetch instructions from the instruction execution system, apparatus, or device and execute them. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium include: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM or flash memory), an optical fiber device, and a portable Compact Disc Read-Only Memory (CD-ROM). The computer-readable medium may even be paper or another suitable medium on which the program is printed, since the program may be captured electronically, for example by optical scanning of the paper or other medium, and then compiled, interpreted, or otherwise processed in a suitable manner if necessary, and stored in a computer memory.
It should be understood that portions of the embodiments of the present disclosure may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the steps or methods may be implemented by software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, they may be implemented using any one of, or a combination of, the following techniques known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, Programmable Gate Arrays (PGAs), Field Programmable Gate Arrays (FPGAs), and the like.
Those of ordinary skill in the art will appreciate that all or some of the steps of the methods of the above-described embodiments may be implemented by a program instructing related hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one of, or a combination of, the steps of the method embodiments.
Furthermore, the functional units in the various embodiments of the present disclosure may be integrated into one processing module, each unit may exist alone physically, or two or more units may be integrated into one module. The integrated module may be implemented in hardware or as a software functional module. If implemented as a software functional module and sold or used as a stand-alone product, the integrated module may also be stored in a computer-readable storage medium. The above-mentioned storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.
While embodiments of the present disclosure have been shown and described above, it will be understood that the above embodiments are illustrative and are not to be construed as limiting the present disclosure, and that changes, modifications, substitutions, and variations of the above embodiments may be made by those of ordinary skill in the art within the scope of the present disclosure.

Claims (12)

1. A method for reconstructing a video image, comprising:
constructing a video data set, wherein the video data set comprises, for each frame, a data pair formed by a sample low dynamic range (LDR) video image and a sample high dynamic range (HDR) video image, the sample LDR video image comprises one of a first exposure image and a second exposure image, and the sample HDR video image is an HDR video image obtained by performing pixel fusion on the first exposure image and the second exposure image;
training an HDR image reconstruction model by using the video data set, so that the HDR image reconstruction model meets a preset training standard; and
determining a target LDR video image sequence corresponding to an HDR video image reconstruction task, and inputting the target LDR video image sequence into the HDR image reconstruction model that meets the preset training standard to obtain a reconstructed HDR video image.
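By way of illustration only, and not as the claimed implementation, the PyTorch sketch below shows one way the inference step above could look: a sliding window over the alternately exposed target LDR sequence yields a reconstructed HDR frame for each window's middle frame. `TinyHDRNet`, the window size of 3, and every layer choice are hypothetical stand-ins for the trained HDR image reconstruction model.

```python
import torch
from torch import nn

class TinyHDRNet(nn.Module):
    """Hypothetical stand-in for the trained HDR image reconstruction model."""
    def __init__(self, in_frames=3, ch=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_frames * 3, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 3, 3, padding=1),
        )

    def forward(self, x):  # x: (B, in_frames*3, H, W) stacked LDR window
        return self.net(x)

def reconstruct_video(model, ldr_frames, window=3):
    """Slide a window over the alternately exposed LDR sequence and
    reconstruct an HDR frame for each window's middle frame."""
    half = window // 2
    outputs = []
    with torch.no_grad():
        for t in range(half, len(ldr_frames) - half):
            clip = torch.cat(ldr_frames[t - half:t + half + 1], dim=0)
            outputs.append(model(clip.unsqueeze(0)).squeeze(0))
    return outputs

# Usage: five fake alternately exposed LDR frames of size 64x64.
frames = [torch.rand(3, 64, 64) for _ in range(5)]
hdr_frames = reconstruct_video(TinyHDRNet(), frames)  # estimates for frames 1..3
```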
2. The method of claim 1, wherein constructing the video data set comprises:
outputting, by a sensor, a first exposure image and a second exposure image corresponding to each frame, and alternately selecting one of the first exposure image and the second exposure image corresponding to each frame in video frame order, wherein the first exposure image is a video image whose exposure time is greater than a first preset threshold, the second exposure image is a video image whose exposure time is less than a second preset threshold, and the first preset threshold is greater than the second preset threshold; and
performing pixel fusion processing on the first exposure image and the second exposure image of each frame to obtain a sample HDR video image for each frame.
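A minimal sketch of this data-pair construction, assuming the long ("first") exposure is picked on even frame indices and the short ("second") exposure on odd ones; the `fuse` callable stands in for the pixel fusion detailed in claim 3, and the toy averaging fusion in the usage lines is purely hypothetical.

```python
import numpy as np

def build_pairs(long_frames, short_frames, fuse):
    """For each frame the sensor yields a long and a short exposure; one is
    picked alternately (in video-frame order) as the sample LDR image, and
    both are fused into the sample HDR target."""
    pairs = []
    for t, (long_img, short_img) in enumerate(zip(long_frames, short_frames)):
        ldr = long_img if t % 2 == 0 else short_img  # alternate exposures
        pairs.append((ldr, fuse(long_img, short_img)))
    return pairs

# Usage with random data and a toy averaging "fusion":
longs = [np.random.rand(4, 4, 3) for _ in range(4)]
shorts = [np.random.rand(4, 4, 3) for _ in range(4)]
pairs = build_pairs(longs, shorts, fuse=lambda a, b: (a + b) / 2)
```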
3. The method of claim 2, wherein the video data set is a Raw-domain data set, and wherein performing pixel fusion processing on the first exposure image and the second exposure image of each frame to obtain a sample HDR video image for each frame comprises:
preprocessing the first exposure image and the second exposure image of each frame;
calculating, for a designated channel, weight values respectively corresponding to the preprocessed first exposure image and the preprocessed second exposure image; and
performing pixel fusion on the first exposure image and the second exposure image of each frame based on the weight values to obtain a sample HDR video image for each frame.
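One plausible reading of this fusion in NumPy, under explicit assumptions: the Raw frames are packed as (H, W, 4) RGGB arrays in linear [0, 1], channel index 1 serves as the designated (green) channel, clipping stands in for the claimed preprocessing, and the saturation-based triangle weighting is a common choice from the HDR literature rather than the claimed formula.

```python
import numpy as np

def fuse_raw_pair(long_raw, short_raw, gain=4.0):
    """Weighted pixel fusion of a long/short Raw pair into an HDR frame."""
    long_lin = np.clip(long_raw, 0.0, 1.0)            # stand-in preprocessing
    short_lin = np.clip(short_raw, 0.0, 1.0) * gain   # same radiance scale
    # Weight from the designated (green) channel: trust the long exposure
    # except where it approaches saturation.
    g = long_lin[..., 1]
    w_long = np.clip((0.95 - g) / 0.15, 0.0, 1.0)[..., None]
    return w_long * long_lin + (1.0 - w_long) * short_lin

# Usage: a bright long exposure fused with a dim short exposure.
long_raw = 0.5 + 0.5 * np.random.rand(8, 8, 4)
short_raw = 0.2 * np.random.rand(8, 8, 4)
hdr = fuse_raw_pair(long_raw, short_raw, gain=4.0)
```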
4. The method of claim 1, wherein training an HDR image reconstruction model by using the video data set so that the HDR image reconstruction model meets a preset training standard comprises:
constructing an alternately exposed sample LDR video image sequence by using the sample LDR video images corresponding to the frames in the video data set;
training the HDR image reconstruction model by taking the sample LDR video image sequence as an input feature and the sample HDR video image of each frame as a feature label, and outputting a reconstructed HDR video image;
calculating a loss function of the HDR image reconstruction model by using the reconstructed HDR video image and the corresponding sample HDR video image; and
determining that the HDR image reconstruction model meets the preset training standard if the loss function is smaller than a preset threshold.
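A hedged sketch of this training loop follows; the L1 loss, the Adam optimizer, and the threshold value are assumptions, since the claim only requires comparing reconstructed and sample HDR frames against a preset threshold.

```python
import torch
from torch import nn

def train_to_standard(model, pairs, threshold=0.01, max_epochs=50, lr=1e-4):
    """Train until the average loss drops below the preset threshold,
    i.e. until the model meets the 'preset training standard'."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.L1Loss()
    for _ in range(max_epochs):
        total = 0.0
        for ldr_seq, hdr_target in pairs:  # input feature / feature label
            pred = model(ldr_seq.unsqueeze(0))
            loss = loss_fn(pred, hdr_target.unsqueeze(0))
            opt.zero_grad()
            loss.backward()
            opt.step()
            total += loss.item()
        if total / len(pairs) < threshold:  # training standard met
            return model, True
    return model, False

# Usage with a trivial stand-in model (3 stacked RGB frames -> HDR frame):
model = nn.Conv2d(9, 3, 3, padding=1)
pairs = [(torch.rand(9, 32, 32), torch.rand(3, 32, 32)) for _ in range(4)]
model, ok = train_to_standard(model, pairs, threshold=0.5, max_epochs=2)
```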
5. The method of claim 4, wherein training the HDR image reconstruction model by taking the sample LDR video image sequence as an input feature and the sample HDR video image of each frame as a feature label, and outputting a reconstructed HDR video image, comprises:
inputting the sample LDR video image sequence into the HDR image reconstruction model, and correcting continuous multi-frame sample LDR video images in the sample LDR video image sequence by using exposure coefficients;
extracting image features of the multi-frame sample LDR video images after exposure correction;
performing feature alignment processing on the image features of the multi-frame sample LDR video images, and determining alignment features of the multi-frame sample LDR video images;
determining weight values of the alignment features, and performing weighted fusion on the alignment features of the multi-frame sample LDR video images based on the weight values to obtain HDR video image features; and
performing tone mapping processing on the HDR video image features to obtain the reconstructed HDR video image.
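The five stages above, sketched as one PyTorch module under stated assumptions: a single convolution per stage, a crude learned offset in place of the claimed alignment, softmax weighting for the fusion, and mu-law compression (a common choice, not necessarily the claimed one) for the tone mapping.

```python
import torch
from torch import nn
import torch.nn.functional as F

class SketchHDRModel(nn.Module):
    """Hedged sketch of the five stages; all layer choices are assumptions."""

    def __init__(self, ch=16):
        super().__init__()
        self.feat = nn.Conv2d(3, ch, 3, padding=1)        # 2) feature extraction
        self.offset = nn.Conv2d(2 * ch, 2, 3, padding=1)  # 3) crude flow head (alignment stand-in)
        self.weight = nn.Conv2d(ch, 1, 3, padding=1)      # 4) per-frame fusion weights
        self.recon = nn.Conv2d(ch, 3, 3, padding=1)       # 5) HDR reconstruction head

    def forward(self, frames, exposures):
        # 1) exposure correction: divide each frame by its exposure coefficient.
        frames = [f / e for f, e in zip(frames, exposures)]
        # 2) feature extraction.
        feats = [torch.relu(self.feat(f)) for f in frames]
        ref = feats[len(feats) // 2]                      # middle-frame reference
        # 3) align every frame's features toward the middle frame.
        aligned = [self._warp(f, ref) for f in feats]
        # 4) softmax-weighted fusion of the aligned features.
        w = torch.softmax(torch.stack([self.weight(a) for a in aligned]), dim=0)
        fused = (w * torch.stack(aligned)).sum(dim=0)
        # 5) reconstruction followed by mu-law tone mapping (a common choice).
        hdr = torch.relu(self.recon(fused))
        return torch.log1p(5000.0 * hdr) / torch.log(torch.tensor(5001.0))

    def _warp(self, feat, ref):
        b, _, h, w = feat.shape
        flow = self.offset(torch.cat([feat, ref], dim=1))  # (B, 2, H, W) offsets
        ys, xs = torch.meshgrid(
            torch.linspace(-1.0, 1.0, h), torch.linspace(-1.0, 1.0, w), indexing="ij"
        )
        grid = torch.stack([xs, ys], dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
        flow_n = torch.stack([flow[:, 0] / (w / 2), flow[:, 1] / (h / 2)], dim=-1)
        return F.grid_sample(feat, grid + flow_n, align_corners=True)

# Usage: three alternately exposed frames with exposure coefficients 4x / 1x / 4x.
model = SketchHDRModel()
frames = [torch.rand(1, 3, 32, 32) for _ in range(3)]
out = model(frames, exposures=[4.0, 1.0, 4.0])  # (1, 3, 32, 32) tone-mapped HDR
```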
6. The method of claim 5, wherein inputting the sample LDR video image sequence into the HDR image reconstruction model and correcting the continuous multi-frame sample LDR video images in the sample LDR video image sequence by using exposure coefficients comprises:
determining the sample LDR video image sequence corresponding to each frame and the exposure coefficients of the continuous multi-frame sample LDR video images in the sample LDR video image sequence; and
dividing each of the continuous multi-frame sample LDR video images by its corresponding exposure coefficient to obtain the exposure-corrected multi-frame sample LDR video images.
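This correction fits in a few lines of NumPy; treating the exposure coefficient as the relative exposure ratio (for example, 2 to the EV) is an assumption, since the claim only specifies the division.

```python
import numpy as np

def exposure_correct(frames, coeffs):
    """Divide each consecutive LDR frame by its exposure coefficient so that
    alternately exposed frames land on a common radiance scale."""
    return [np.asarray(f, dtype=np.float32) / c for f, c in zip(frames, coeffs)]

# Usage: a +2EV / -2EV alternating pair (coefficients 2**2 and 2**-2).
frames = [np.full((2, 2), 0.8), np.full((2, 2), 0.05)]
corrected = exposure_correct(frames, coeffs=[4.0, 0.25])
```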
7. The method of claim 5, wherein performing feature alignment processing on the image features of the multi-frame sample LDR video images and determining the alignment features of the multi-frame sample LDR video images comprises:
inputting the image features of the multi-frame sample LDR video images into an optical flow network to obtain optical flows at multiple scales;
calculating, at each scale, the optical flow offset between each frame of the sample LDR video images and the intermediate-frame sample LDR video image in the sample LDR video image sequence; and
performing feature alignment processing on the image features of the multi-frame sample LDR video images based on the optical flow offsets to obtain alignment features respectively corresponding to the multi-frame sample LDR video images.
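A sketch of the final step above: warping each frame's features toward the intermediate frame with precomputed optical-flow offsets via bilinear sampling. The flow network and the multi-scale pyramid of the claim are omitted, and the pixel-unit flow convention is an assumption.

```python
import torch
import torch.nn.functional as F

def align_to_middle(feats, flows_to_mid):
    """Warp each frame's features toward the intermediate frame using
    optical-flow offsets. `flows_to_mid[i]` is a (B, 2, H, W) flow, in
    pixels, from frame i to the intermediate frame."""
    aligned = []
    for feat, flow in zip(feats, flows_to_mid):
        b, _, h, w = feat.shape
        ys, xs = torch.meshgrid(
            torch.arange(h, dtype=torch.float32),
            torch.arange(w, dtype=torch.float32),
            indexing="ij",
        )
        # Sample each output pixel from its flow-displaced source location.
        x_src = xs.unsqueeze(0) + flow[:, 0]
        y_src = ys.unsqueeze(0) + flow[:, 1]
        grid = torch.stack(
            [2.0 * x_src / (w - 1) - 1.0, 2.0 * y_src / (h - 1) - 1.0], dim=-1
        )
        aligned.append(F.grid_sample(feat, grid, align_corners=True))
    return aligned

# Usage: zero flow is the identity warp, leaving the features unchanged.
feats = [torch.rand(1, 8, 16, 16) for _ in range(3)]
flows = [torch.zeros(1, 2, 16, 16) for _ in range(3)]
out = align_to_middle(feats, flows)
```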
8. A video image reconstruction apparatus, comprising:
a construction module, configured to construct a video data set, wherein the video data set comprises, for each frame, a data pair formed by a sample low dynamic range (LDR) video image and a sample high dynamic range (HDR) video image, the sample LDR video image comprises one of a first exposure image and a second exposure image, and the sample HDR video image is an HDR video image obtained by performing pixel fusion on the first exposure image and the second exposure image;
a training module, configured to train an HDR image reconstruction model by using the video data set so that the HDR image reconstruction model meets a preset training standard; and
an input module, configured to determine a target LDR video image sequence corresponding to an HDR video image reconstruction task, and to input the target LDR video image sequence into the HDR image reconstruction model that meets the preset training standard to obtain a reconstructed HDR video image.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
10. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are configured to cause a computer to perform the method of any one of claims 1-7.
11. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-7.
12. A chip comprising one or more interface circuits and one or more processors; the interface circuit is configured to receive a signal from a memory of an electronic device and to send the signal to the processor, the signal comprising computer instructions stored in the memory, which when executed by the processor, cause the electronic device to perform the method of any of claims 1-7.
CN202211269032.2A 2022-10-17 2022-10-17 Video image reconstruction method and device and electronic equipment Pending CN117911295A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211269032.2A CN117911295A (en) 2022-10-17 2022-10-17 Video image reconstruction method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211269032.2A CN117911295A (en) 2022-10-17 2022-10-17 Video image reconstruction method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN117911295A (en)

Family

ID=90693302

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211269032.2A Pending CN117911295A (en) 2022-10-17 2022-10-17 Video image reconstruction method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN117911295A (en)

Similar Documents

Publication Publication Date Title
CN109345485B (en) Image enhancement method and device, electronic equipment and storage medium
US11475243B2 (en) Training method and device for an image enhancement model, and storage medium
CN110708559B (en) Image processing method, device and storage medium
EP3872744A1 (en) Method and apparatus for obtaining sample image set
CN111275653B (en) Image denoising method and device
US11580327B2 (en) Image denoising model training method, imaging denoising method, devices and storage medium
KR101727169B1 (en) Method and apparatus for generating image filter
TW202027033A (en) Image processing method and apparatus, electronic device and storage medium
KR20170116388A (en) Imaging device and operating method thereof
CN113160277A (en) Image processing method and device, electronic equipment and storage medium
CN112288657A (en) Image processing method, image processing apparatus, and storage medium
CN112750081A (en) Image processing method, device and storage medium
CN117911295A (en) Video image reconstruction method and device and electronic equipment
EP4033750A1 (en) Method and device for processing image, and storage medium
WO2022226963A1 (en) Image processing method and apparatus, electronic device, and storage medium
CN114640815A (en) Video processing method and device, electronic equipment and storage medium
CN115641269A (en) Image repairing method and device and readable storage medium
CN107085841B (en) Picture zooming processing method and terminal
CN113822806A (en) Image processing method, image processing device, electronic equipment and storage medium
CN113890984B (en) Photographing method, image processing method and electronic equipment
CN112188095B (en) Photographing method, photographing device and storage medium
CN113191994B (en) Image processing method, device and storage medium
CN117915203A (en) Image processing method, device, electronic equipment, chip and medium
CN116416505A (en) Training method of image processing model, image processing method, device and storage medium
CN117455782A (en) Image enhancement method, image enhancement device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination