WO2024130715A1 - Video processing method, video processing device and readable storage medium

Video processing method, video processing device and readable storage medium

Info

Publication number
WO2024130715A1
WO2024130715A1 (PCT/CN2022/141522)
Authority
WO
WIPO (PCT)
Prior art keywords
video
image
network
version
frame
Prior art date
Application number
PCT/CN2022/141522
Other languages
English (en)
French (fr)
Inventor
朱丹
陈冠男
Original Assignee
BOE Technology Group Co., Ltd. (京东方科技集团股份有限公司)
Priority date
Filing date
Publication date
Application filed by BOE Technology Group Co., Ltd.
Priority to PCT/CN2022/141522
Publication of WO2024130715A1


Description

  • Embodiments of the present disclosure relate to a video processing method, a video processing device, and a non-transitory readable storage medium.
  • High dynamic range (HDR) images can provide a wider dynamic range and more image detail than ordinary images, can more accurately record most of the color and lighting information of real scenes, and can present rich color details and gradations of light and dark.
  • HDR technology can be applied to fields with high requirements for image quality, such as medical imaging, video surveillance, satellite remote sensing, and computer vision.
  • At least one embodiment of the present disclosure provides a video processing method, comprising: dividing a plurality of video frames included in an initial video into a plurality of video segments, each of the video segments including one or more video frames, and the plurality of video frames being continuous; determining, according to any one of the one or more video frames, a display parameter set of the video segment to which that frame belongs; adjusting other frames in the video segment according to the display parameter set to obtain an intermediate video segment; performing high dynamic range conversion on the intermediate video segment to obtain a high dynamic range video segment; and generating a high dynamic range video according to the high dynamic range video segment.
  • the display parameter set includes a first display parameter, a second display parameter, and a third display parameter
  • the first display parameter and the third display parameter are used to adjust the brightness of the video frame
  • the second display parameter is used to adjust the contrast of the video frame.
  • the first display parameter is used to adjust the overall brightness level of the video frame
  • the third display parameter is used to partially adjust the brightness level of the video frame.
  • multiple video frames included in an initial video are divided into multiple video segments, including: calculating the similarity between each video frame and the previous video frame in sequence according to the playback order of the multiple video frames included in the initial video; and dividing the initial video into multiple video segments based on the calculated similarity between each two adjacent video frames.
  • before calculating the similarity between each video frame and the previous video frame in sequence according to the playback order of the multiple video frames included in the initial video, the method further includes: performing dimensionality reduction processing on each initial video frame in the initial video to obtain the multiple video frames.
  • the sequentially calculating the similarity between each video frame and the previous video frame includes: for each video frame among the multiple video frames, based on the mean of the image data of the video frame and the mean of the image data of the previous video frame, the standard deviation of the image data of the video frame and the standard deviation of the image data of the previous video frame, and the covariance of the image data of the video frame and the image data of the previous video frame, determining the structural similarity between the video frame and the previous video frame; based on the structural similarity between the video frame and the previous video frame, determining the similarity between the video frame and the previous video frame.
  • a display parameter set of the video segment to which any one of the one or more video frames belongs is determined, including: using an image processing network to perform parameter analysis on an initial video frame to obtain the display parameter set; the image processing network includes a first image analysis module and a second image analysis module; the first image analysis module is used to perform feature extraction on the initial video frame to obtain a first intermediate video frame; the second image analysis module is used to perform feature extraction and scale transformation on the first intermediate video frame to output the display parameter set.
  • the first image analysis module includes a first convolution layer, an average pooling layer, an activation layer and an instance normalization layer;
  • the second image analysis module includes a second convolution layer and a global average pooling layer.
  • the image processing network includes a plurality of the first image analysis modules.
  • adjusting other frames in the video segment according to the display parameter set to obtain the intermediate video segment includes:
  • all video frame data in each video segment is adjusted according to the corresponding display parameter set, using the following equation:
  • X in represents an input frame
  • X out represents an output frame
  • w1, w2, and w3 are the first display parameter, the second display parameter, and the third display parameter, respectively.
  • performing high dynamic range conversion on the intermediate video segment to obtain a high dynamic range video segment includes: using a video processing network to perform high dynamic range conversion on the intermediate video segment; the video processing network includes a basic network and a weight network; the basic network is used to perform feature extraction and feature reconstruction on the input frame to obtain a high dynamic range output frame; the weight network is used to perform feature extraction on the input frame to obtain feature matrix parameters, and the basic network is corrected according to the feature matrix parameters.
  • the basic network includes at least one information adjustment node, and the information adjustment node is used to integrate the feature extraction information of the input frame by the basic network and the feature matrix parameter information of the weight network.
  • the basic network includes a first information adjustment node, a second information adjustment node, a third information adjustment node, a fourth information adjustment node and a fifth information adjustment node.
  • the weight network includes at least one feature correction network
  • the feature correction network includes at least one attention module
  • the attention module uses dual channels to extract features from input information, including: using the first channel to perform local feature extraction on the input frame to obtain a first feature; using the second channel to perform global feature extraction on the input frame to obtain a second feature; and fusing the first feature and the second feature to obtain output information.
  • the weight network includes a first feature correction network, a second feature correction network and a third feature correction network
  • the method includes: inputting the input frame into the first feature correction network to obtain a first feature parameter matrix; inputting the first feature parameter matrix into the third information adjustment node; rearranging the feature channels of the first feature parameter matrix and the input frame and inputting them into the second feature correction network to obtain a second feature parameter matrix; inputting the second feature parameter matrix into the second information adjustment node and the fourth information adjustment node; rearranging the feature channels of the second feature parameter matrix and the input frame and inputting them into the third feature correction network to obtain a third feature parameter matrix; inputting the third feature parameter matrix into the first information adjustment node and the fifth information adjustment node.
  • the method provided by at least one embodiment of the present disclosure also includes: obtaining first sample data, wherein the first sample data includes a first version SDR image and a first version HDR image; using the first version HDR image corresponding to the first version SDR image as the first version true image; inputting the first version SDR image into a video processing network to obtain a first version predicted HDR image corresponding to the first version SDR image; inputting the first version predicted HDR image and the first version true image into a first loss function to obtain a first loss function value; and adjusting the model parameters of the video processing network according to the first loss function value; obtaining second sample data, wherein the second sample data includes a second version SDR image and a second version HDR image; using the second version HDR image corresponding to the second version SDR image as a second version true image; inputting the second version SDR image into the image processing network and the trained video processing network to obtain a second version predicted HDR image corresponding to the second version SDR image; fixing the parameters of the video processing network; inputting the second version predicted HDR image and the second version true image into a second loss function to obtain a second loss function value; and adjusting the model parameters of the image processing network according to the second loss function value.
  • the method provided by at least one embodiment of the present disclosure also includes: obtaining third sample data, wherein the third sample data includes a third version SDR image and a third version HDR image; using a third version HDR image corresponding to the third version SDR image as a third version true image; inputting the third version SDR image into the image processing network and the video processing network to obtain a third version predicted HDR image corresponding to the third version SDR image; inputting the third version predicted HDR image and the third version true image into a third loss function to obtain a third loss function value; and adjusting the model parameters of the image processing network and the video processing network according to the third loss function value.
  • At least one embodiment of the present disclosure further provides a video processing device, including a division module, an acquisition module and a processing module.
  • the division module is configured to divide the multiple video frames included in the initial video into multiple video segments, each of the video segments includes one or more video frames, and the multiple video frames are continuous.
  • the acquisition module is configured to determine, according to any one of the one or more video frames, the display parameter set of the video segment to which that frame belongs, and to adjust other frames in the video segment according to the display parameter set to obtain an intermediate video segment.
  • the processing module is configured to perform high dynamic range conversion on the intermediate video segment to obtain a high dynamic range video segment; and generate a high dynamic range video according to the high dynamic range video segment.
  • At least one embodiment of the present disclosure further provides a video processing device, which includes a processor and a memory.
  • the memory includes one or more computer program modules.
  • the one or more computer program modules are stored in the memory and are configured to be executed by the processor, and the one or more computer program modules include instructions for executing the video processing method described in any of the above embodiments.
  • At least one embodiment of the present disclosure further provides a non-transitory readable storage medium on which computer instructions are stored.
  • when the computer instructions are executed by a processor, the video processing method described in any of the above embodiments is performed.
  • FIG1 is a schematic block diagram of a method for generating an HDR video according to at least one embodiment of the present disclosure
  • FIG2 is an example flow chart of a video processing method according to at least one embodiment of the present disclosure
  • FIG3 is a flow chart of performing scene segmentation on a video provided by at least one embodiment of the present disclosure
  • FIG4 is a schematic diagram of the structure of an image processing network provided by at least one embodiment of the present disclosure.
  • FIG5 is a schematic diagram of a training process of an image processing network provided by at least one embodiment of the present disclosure
  • FIG6 is a schematic block diagram of another HDR video generation method provided by at least one embodiment of the present disclosure.
  • FIG7 is a schematic block diagram of an HDR model provided by at least one embodiment of the present disclosure.
  • FIG8 is a schematic block diagram of another HDR model provided by at least one embodiment of the present disclosure.
  • FIG9A is a schematic diagram of a structure of an extraction subnetwork provided by at least one embodiment of the present disclosure.
  • FIG9B is a schematic diagram of the structure of a residual network provided by at least one embodiment of the present disclosure.
  • FIG10A is a schematic diagram of the structure of a correction network provided by at least one embodiment of the present disclosure.
  • FIG10B is a schematic diagram of the structure of an attention module provided by at least one embodiment of the present disclosure.
  • FIG11 is a schematic block diagram of a video processing device provided by at least one embodiment of the present disclosure.
  • FIG12 is a schematic block diagram of another video processing device provided by at least one embodiment of the present disclosure.
  • FIG13 is a schematic block diagram of another video processing device provided by at least one embodiment of the present disclosure.
  • FIG14 is a schematic block diagram of a non-transitory readable storage medium provided by at least one embodiment of the present disclosure.
  • FIG. 15 is a schematic block diagram of an electronic device according to at least one embodiment of the present disclosure.
  • FIG. 1 is a schematic block diagram of an HDR video generation method provided by at least one embodiment of the present disclosure.
  • a simple HDR task can be understood as using a single HDR model (i.e., an HDR image generation algorithm) to complete the processing of the entire video.
  • the frame information of each video frame is input into the HDR model, and the HDR model is used to map a frame of image (such as a standard dynamic range (SDR) image or a low dynamic range (LDR) image, etc.) to an HDR image.
  • the mapping process may include dynamic range expansion, color gamut range expansion, and color adjustment of the picture.
  • the output information of the HDR model is encoded to generate an HDR video.
  • the HDR video generation method shown in FIG1 requires that the video content be relatively simple, such as high-definition TV series and movies that have already been broadcast on high-definition channels, whose overall brightness, contrast, and color information remain largely consistent.
  • Complex HDR tasks can be understood as involving complex source material, such as documentaries, episodes of the same TV series, or variety shows divided into several parts, each of which may differ slightly in brightness, contrast, color, etc. In this case, a single HDR model alone cannot complete the processing of complex sources, and when a scene that a single HDR model cannot handle is encountered, seeking professional colorists for color grading greatly increases the cost.
  • At least one embodiment of the present disclosure provides a video processing method, which includes: dividing a plurality of video frames included in an initial video into a plurality of video segments, where each video segment includes one or more video frames and the plurality of video frames are continuous; determining, according to any one of the one or more video frames, a display parameter set of the video segment to which that frame belongs; adjusting other frames in the video segment according to the display parameter set to obtain an intermediate video segment; performing high dynamic range conversion on the intermediate video segment to obtain a high dynamic range video segment; and generating a high dynamic range video according to the high dynamic range video segment.
  • At least one embodiment of the present disclosure further provides a video processing device, a non-transitory readable storage medium, and an electronic device corresponding to the above-mentioned video processing method.
  • the initial video can be divided into one or more video segments according to scene segmentation, a display parameter set corresponding to each video segment can be obtained, and the video frames in the video segment can be adjusted based on the display parameter set to obtain high dynamic range video segments, and further generate high dynamic range videos, so that a single HDR model can process the initial video with complex scenes, effectively improving the quality and efficiency of generating HDR videos.
  • the video processing method provided by the present disclosure is described in a non-restrictive manner below through multiple embodiments and examples thereof. As described below, different features in these specific examples or embodiments may be combined with each other without conflicting with each other to obtain new examples or embodiments, and these new examples or embodiments also fall within the scope of protection of the present disclosure.
  • FIG. 2 is an exemplary flow chart of a video processing method according to at least one embodiment of the present disclosure.
  • the video processing method 10 can be applied to any application scenario that needs to generate HDR images/videos; for example, it can be applied to displays, cameras, video cameras, video players, mobile terminals, etc., and can also be applied to other aspects, which are not limited by the embodiments of the present disclosure.
  • the video processing method 10 can include the following operations S101 to S105.
  • Step S101 dividing a plurality of video frames included in an initial video into a plurality of video segments, each video segment including one or more video frames, and the plurality of video frames are continuous.
  • Step S102 determining a display parameter set of a video segment to which any one of the one or more video frames belongs.
  • Step S103 adjusting other frames in the corresponding video segment according to the display parameter set to obtain an intermediate video segment.
  • Step S104 performing high dynamic range conversion on the intermediate video segment to obtain a high dynamic range video segment.
  • Step S105 generating a high dynamic range video according to the high dynamic range video segment.
  • the initial video may be a photographic work, a video downloaded from the Internet, or a locally stored video, etc., or may be an LDR video, an SDR video, etc., and the embodiments of the present disclosure do not impose any restrictions on this.
  • the initial video may include various video scenes, such as an indoor scene, a scenic spot scene, etc., and the embodiments of the present disclosure do not impose any restrictions on this.
  • the initial video can be divided into video segments according to the video scene.
  • the multiple video frames included in the initial video are divided into one or more video segments, and each video segment includes one or more video frames.
  • each video segment corresponds to a single video scene.
  • the initial video is divided into two video segments, the scene corresponding to the first video segment is in the classroom, and the scene corresponding to the second video segment is on the playground. It should be noted that the embodiments of the present disclosure are not limited to specific scenes and can be set according to actual needs.
  • various algorithms may be used to process the initial video according to scenes, and the embodiments of the present disclosure are not limited to this. As long as the scene division function of the video can be implemented, it can be set according to actual conditions.
  • the multiple video frames included in the initial video are divided into multiple video segments, including: calculating the similarity between each video frame and the previous video frame in turn according to the playback order of the multiple video frames included in the initial video; and dividing the initial video into multiple video segments based on the calculated similarity between each two adjacent video frames.
  • a dimensionality reduction process is performed on each initial video frame in the initial video to obtain multiple video frames.
  • the computing cost can be greatly saved and the efficiency can be improved.
  • the similarity between each video frame and the previous video frame is calculated in sequence, including: for each video frame in a plurality of video frames, based on the mean of image data of the video frame and the mean of image data of the previous video frame, the standard deviation of image data of the video frame and the standard deviation of image data of the previous video frame, and the covariance of image data of the video frame and image data of the previous video frame, determining the structural similarity between the video frame and the previous video frame; and, based on the structural similarity between the video frame and the previous video frame, determining the similarity between the video frame and the previous video frame.
  • FIG. 3 is a flowchart of scene segmentation of a video provided by at least one embodiment of the present disclosure.
  • a structural similarity (SSIM) algorithm may be used to perform scene segmentation processing on the initial video.
  • the SSIM algorithm is applied to segment the video content scene.
  • SSIM is used to measure the structural similarity of two images (video frames) and is an indicator for measuring the degree of similarity between two images (video frames). The larger the SSIM value, the more similar the two images are.
  • the value range of SSIM is [0,1]. For two images x and y, the standard SSIM definition used as equation (1) is: SSIM(x, y) = ((2 μ x μ y + c1)(2 σ xy + c2)) / ((μ x ² + μ y ² + c1)(σ x ² + σ y ² + c2)), where σ x and σ y are the standard deviations of the image data of x and y, and c1 and c2 are small constants for numerical stability:
  • x and y represent two input images (video frames)
  • SSIM(x, y) represents the similarity between the input images x and y
  • μ x represents the average value of the image data of the input image x
  • μ y represents the average value of the image data of the input image y
  • σ xy represents the covariance of the image data of the input images x and y
  • y represents the video frame at the current moment (i.e., the current video frame)
  • x represents the video frame at the previous moment (i.e., the video frame before the current video frame)
  • the structural similarity between the two adjacent video frames x and y can be calculated through the above equation (1).
  • when the structural similarity between the two frames is less than the threshold T, the current two frames y and x are considered not to belong to the same scene; video frame x is the last frame of the previous scene, and video frame y is the first frame of the next scene.
  • the value of the threshold T can be set according to actual conditions, and the embodiments of the present disclosure do not limit this.
  • each initial video frame in the initial video is subjected to dimensionality reduction processing (such as a downsampling operation), and then the SSIM is calculated, as shown in FIG3 , thereby greatly saving computing costs.
  • SSIM is used as a scene segmentation algorithm.
  • the SSIM algorithm is simple to calculate and only requires information of two consecutive frames of video. It can realize real-time computational processing of the video stream and does not require offline analysis of the video for scene segmentation.
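The following is a minimal sketch of the SSIM-based scene segmentation described above, assuming grayscale frames stored as NumPy arrays. The SSIM expression follows the standard definition built from the means, standard deviations, and covariance listed for equation (1); the downsampling factor used for dimensionality reduction, the constants c1/c2, and the threshold T are illustrative choices rather than values taken from the disclosure.

```python
# Sketch of SSIM-based scene segmentation, assuming grayscale frames as NumPy
# arrays. Downsampling factor, c1/c2, and threshold T are assumptions.
import numpy as np

def ssim(x, y, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """Global SSIM between two frames, from means, stds, and covariance."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    sigma_x, sigma_y = x.std(), y.std()
    sigma_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x ** 2 + sigma_y ** 2 + c2))

def segment_scenes(frames, threshold=0.5, downsample=4):
    """Split frames into scenes; a new scene starts when the SSIM between a
    frame and the previous frame drops below the threshold."""
    small = [f[::downsample, ::downsample] for f in frames]  # dimensionality reduction
    segments, current = [], [0]
    for i in range(1, len(frames)):
        if ssim(small[i - 1], small[i]) < threshold:
            segments.append(current)      # frame i-1 ends the previous scene
            current = [i]                 # frame i starts the next scene
        else:
            current.append(i)
    segments.append(current)
    return segments                       # lists of frame indices per scene
```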
  • step S102 includes: using an image processing network to perform parameter analysis on the initial video frame to obtain a display parameter set.
  • the image processing network includes a first image analysis module and a second image analysis module.
  • the first image analysis module is used to extract features from the initial video frame to obtain a first intermediate video frame; and the second image analysis module is used to extract features from and scale-transform the first intermediate video frame to output a display parameter set.
  • the initial video frame is the frontmost video frame in the video clip to which it belongs according to the video playback order.
  • the video stream data can be processed in real time, and the display parameter set of the current video clip can be obtained based on the frontmost video frame in the current video clip, and then the other frame images in the current video clip can be processed by the display parameter set.
  • preprocessing is performed only based on the first frame of the current video clip (i.e., the frontmost video frame in the playback order), which can prevent the flickering of information such as brightness and contrast between frames during video playback.
  • since the same scene has content continuity, the display parameter set corresponding to the current scene can be obtained using the information of the first video frame of the scene alone.
  • the initial video frame may be a randomly selected video frame from the corresponding video segment, and the embodiments of the present disclosure are not limited to this and may be set according to actual needs.
  • the image processing network includes a first image analysis module and a second image analysis module.
  • the first image analysis module is used to extract features from the initial video frame; obtain a first intermediate video frame
  • the second image analysis module is used to extract features and scale transform the first intermediate video frame to output a display parameter set.
  • the first image analysis module includes a first convolutional layer, an average pooling layer, an activation layer, and an instance normalization layer.
  • the second image analysis module includes a second convolutional layer and a global average pooling layer.
  • the image processing network includes multiple first image analysis modules.
  • the terms "first image analysis module" and "second image analysis module" are used to represent image analysis modules with specific structures, respectively; they are not limited to a specific one or a certain type of image analysis module, nor to a specific order, and can be set according to actual conditions.
  • the terms "first convolutional layer" and "second convolutional layer" are used to represent convolutional layers with specific convolution parameters, respectively; they are not limited to a specific one or a certain type of convolutional layer, nor to a specific order, and can be set according to actual conditions.
  • FIG. 4 is a schematic diagram of the structure of an image processing network provided by at least one embodiment of the present disclosure.
  • the image processing network can be any neural network model structure, also referred to as a preprocessing model.
  • for example, the image processing network may adopt the network model structure shown in FIG4.
  • the image processing network includes multiple first image analysis modules and a second image analysis module.
  • each first image analysis module includes a first convolution layer Conv (k3f64), an average pooling layer AvgPool, an activation layer ReLU, and an instance normalization layer IN.
  • the second image analysis module includes a second convolution layer Conv (k3f3) and a global average pooling layer GlobalAvgPool.
  • the global average pooling layer GlobalAvgPool in the second image analysis module average-pools the feature map output by the second convolution layer Conv(k3f3) into the parameters w1, w2, and w3 of the display parameter set.
  • the image processing network shown in Figure 4 is only an example.
  • the image processing network can adopt the architecture of any neural network model, and is not limited to the model architecture shown in Figure 4, and the embodiments of the present disclosure do not limit this.
  • the image processing network may include multiple first image analysis modules, for example, 3-6, and the embodiments of the present disclosure do not limit the number of first image analysis modules.
  • the image processing network may include 4 first image analysis modules.
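A hedged sketch of the image processing network (preprocessing model) of FIG. 4 is given below, assuming a PyTorch implementation with an RGB input and four first image analysis modules. Only the k3f64 and k3f3 convolution settings come from the text; strides, padding, and the exact pooling placement are assumptions.

```python
# Sketch of the preprocessing (image processing) network of FIG. 4, assuming
# an RGB input and four first image analysis modules. Kernel/stride/padding
# choices are assumptions; only k3f64 / k3f3 are given by the text.
import torch
import torch.nn as nn

class FirstImageAnalysisModule(nn.Module):
    def __init__(self, in_ch, out_ch=64):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),  # Conv(k3f64)
            nn.AvgPool2d(2),                                     # AvgPool
            nn.ReLU(inplace=True),                               # ReLU
            nn.InstanceNorm2d(out_ch),                           # IN
        )

    def forward(self, x):
        return self.block(x)

class PreprocessingModel(nn.Module):
    def __init__(self, num_modules=4):
        super().__init__()
        mods, ch = [], 3
        for _ in range(num_modules):
            mods.append(FirstImageAnalysisModule(ch))
            ch = 64
        self.features = nn.Sequential(*mods)
        self.head = nn.Conv2d(64, 3, kernel_size=3, padding=1)   # Conv(k3f3)
        self.pool = nn.AdaptiveAvgPool2d(1)                      # GlobalAvgPool

    def forward(self, frame):
        w = self.pool(self.head(self.features(frame)))           # (N, 3, 1, 1)
        w1, w2, w3 = w[:, 0], w[:, 1], w[:, 2]
        return w1, w2, w3
```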
  • the initial video frame (e.g., the frontmost video frame) of the current video clip is input into the image processing network (preprocessing model) to obtain a display parameter set corresponding to the current video clip.
  • the display parameter set can be used to preprocess the current video clip so that a good HDR video is finally generated.
  • the display parameter set includes a first display parameter w1, a second display parameter w2, and a third display parameter w3.
  • the first display parameter w1 and the third display parameter w3 are used to adjust the brightness of the video frame and the second display parameter w2 is used to adjust the contrast of the video frame.
  • the first display parameter w1 is used to adjust the overall brightness level of the video frame
  • the third display parameter w3 is used to locally adjust (micro-adjust) the brightness level of the video frame.
  • when the first display parameter w1 takes a value greater than 1, the overall brightness level of the current video frame image is increased; when w1 takes a value less than 1, the overall brightness level is reduced.
  • when the second display parameter w2 takes a value greater than 1, the contrast of the current video frame image is increased; when w2 takes a value less than 1, the contrast is reduced.
  • when the third display parameter w3 takes a value greater than 0, the brightness level of the current video frame image is increased; when w3 takes a value less than 0, the brightness level is reduced.
  • the "first display parameter w1", "second display parameter w2", and "third display parameter w3" are not limited to a specific one or a certain type of display parameter, nor to a specific order.
  • the display parameter set may also include other display parameters, such as display parameters for adjusting color components, etc.
  • the embodiments of the present disclosure are not limited to this and can be set according to actual conditions.
  • in step S103, other frames in the corresponding video segment are adjusted based on the display parameter set to obtain an intermediate video segment; that is, all video frame data in each video segment are adjusted according to the corresponding display parameter set, using the following equation (2).
  • a preprocessing operation is performed on the video frame data in the divided video segments.
  • the following equation (2) is applied to each frame in the current video segment to obtain the corresponding intermediate video segment.
  • each video frame in a video clip of a certain scene can be preprocessed/adjusted so that the processed/adjusted video frame meets the brightness and contrast range that can be input by the subsequent HDR model (also referred to as the video processing network in this article).
  • the operation of step S103 corresponds to a preprocessing operation or a preprocessing model (image processing network).
  • the preprocessing operation can use the above equation (2) to calculate the output, or other equations or algorithms can be used to calculate the output.
  • the embodiments of the present disclosure are not limited to this and can be set according to actual needs.
  • the preprocessing model can be a neural network, which is not limited to this in the embodiments of the present disclosure and can be set according to actual needs.
  • any frame image in the current video clip is taken out and input into a neural network (e.g., a preprocessing model) to be trained to obtain a first display parameter w1, a second display parameter w2, and a third display parameter w3.
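The exact form of equation (2) is not reproduced in this text. The sketch below therefore assumes an illustrative form, X out = (w1 · X in)^w2 + w3 on pixel values normalized to [0, 1], chosen only to be consistent with the stated roles of w1 (overall brightness gain), w2 (contrast adjustment), and w3 (local brightness offset); it is not the patent's equation.

```python
# Hedged sketch of the per-frame adjustment of step S103. The patent's
# equation (2) is not reproduced in this text; the form below, with w1 as a
# gain, w2 as a contrast exponent, and w3 as an offset, is only an assumed
# illustration consistent with the described roles of the parameters.
import numpy as np

def adjust_frame(x_in, w1, w2, w3):
    """x_in: frame with pixel values normalized to [0, 1]."""
    x = np.clip(x_in, 0.0, 1.0)
    x_out = np.power(w1 * x, w2) + w3     # assumed form of equation (2)
    return np.clip(x_out, 0.0, 1.0)

def adjust_segment(frames, w1, w2, w3):
    """Apply one display parameter set to every frame of a video segment."""
    return [adjust_frame(f, w1, w2, w3) for f in frames]
```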
  • At least one implementation of the present disclosure provides a training method for an image processing network (preprocessing model) and a video processing network (HDR model).
  • first sample data is obtained, which includes a first version SDR image and a first version HDR image; the first version HDR image corresponding to the first version SDR image is used as the first version true image; the first version SDR image is input into a video processing network to obtain a first version predicted HDR image corresponding to the first version SDR image; the first version predicted HDR image and the first version true image are input into a first loss function to obtain a first loss function value; and according to the first loss function value, the model parameters of the video processing network (i.e., the above-mentioned HDR model) are adjusted; second sample data is obtained, which includes a second version SDR image and a second version HDR image; the second version HDR image corresponding to the second version SDR image is used as the second version true image; the second version SDR image is input into an image processing network (i.e., the above-mentioned preprocessing model) and a trained video processing network (i.e., the above-mentioned HDR model) to obtain a second version predicted HDR image corresponding to the second version SDR image; the parameters of the video processing network are fixed; the second version predicted HDR image and the second version true image are input into a second loss function to obtain a second loss function value; and the model parameters of the image processing network are adjusted according to the second loss function value.
  • the model parameters of the HDR model are adjusted by the first sample data, and then the model parameters of the image processing network (preprocessing model) are adjusted by the second sample data and with the parameters of the HDR model fixed.
  • the first version true value image, the second version true value image, and the third version true value image may be standard/expected HDR images corresponding to the first version SDR image, the second version SDR image, and the third version SDR image, respectively.
  • for example, HDR images processed by professional colorists, HDR images that meet the needs of customers/designers, etc.; the embodiments of the present disclosure do not limit this, and they can be set according to actual needs.
  • the "first sample data", "second sample data" and "third sample data" are not limited to a specific one or a certain type of sample data, nor are they limited to a specific order, and can be set according to actual conditions.
  • the "first version SDR image”, “second version SDR image” and “third version SDR image” are not limited to a specific one or a certain type of SDR image, nor are they limited to a specific order, and can be set according to actual conditions.
  • the “first version HDR image”, “second version HDR image” and “third version HDR image” are not limited to a specific one or a certain type of HDR image, nor are they limited to a specific order, and can be set according to actual conditions.
  • the first loss function and the second loss function may be the same or different.
  • the first loss function and the second loss function may adopt any loss function, such as a square loss function, a logarithmic loss function, an exponential loss function, etc.
  • the embodiments of the present disclosure do not limit this and can be set according to actual conditions.
  • FIG5 is a schematic diagram of a training process of a preprocessing model provided by at least one embodiment of the present disclosure.
  • the video frame X in to be processed is input into the preprocessing model to obtain the display parameters w1, w2 and w3, and the output frame X out is obtained based on the above equation (2).
  • the preprocessed output frame X out is input into the HDR model, that is, the HDR image generation algorithm is applied to the preprocessed output frame X out , and the output frame Y out is finally generated.
  • the output frame Y out output by the HDR model is compared with the corresponding standard HDR image, for example, by calculating the loss function.
  • the display parameters output by the preprocessing model are adjusted/updated according to the comparison result, such as the first display parameter w1, the second display parameter w2, and the third display parameter w3.
  • the desired display parameter set can be obtained, for example, a display parameter set that makes the output frame Y out close to the corresponding standard HDR image.
  • the parameters of the HDR model are fixed parameters and do not need to be updated. That is, during the display parameter training, the parameters in the HDR image generation algorithm remain constant.
  • the HDR model is a trained model that is only used for mapping and color adjustment of HDR images. It should be noted that the embodiments of the present disclosure do not impose specific restrictions on the various parameters in the HDR model, and they can be set according to actual conditions.
  • the standard HDR image refers to an HDR image that meets expectations, for example, an HDR image processed by a professional colorist, an HDR image that meets the needs of customers/designers, etc.
  • the embodiments of the present disclosure are not limited to this and can be set according to actual needs.
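The staged training described above can be sketched as follows, assuming PyTorch models hdr_model (video processing network) and pre_model (image processing network), data loaders of (SDR, ground-truth HDR) pairs, and a hypothetical differentiable apply_params function implementing the per-frame adjustment of equation (2). The optimizer and loss choices (Adam, L1) are assumptions; the disclosure only requires that a first loss function trains the HDR model and that a second loss function trains the preprocessing model while the HDR model's parameters are fixed.

```python
# Sketch of the two-stage training, assuming PyTorch models `hdr_model` and
# `pre_model`, loaders of (SDR, ground-truth HDR) pairs, and a hypothetical
# differentiable `apply_params(sdr, w1, w2, w3)` implementing equation (2).
# Optimizer and loss choices (Adam, L1) are assumptions, not from the text.
import torch
import torch.nn.functional as F

def train_stage1(hdr_model, loader, epochs=1, lr=1e-4):
    opt = torch.optim.Adam(hdr_model.parameters(), lr=lr)
    for _ in range(epochs):
        for sdr, hdr_gt in loader:                     # first sample data
            loss = F.l1_loss(hdr_model(sdr), hdr_gt)   # first loss function
            opt.zero_grad(); loss.backward(); opt.step()

def train_stage2(pre_model, hdr_model, apply_params, loader, epochs=1, lr=1e-4):
    for p in hdr_model.parameters():                   # fix the HDR model parameters
        p.requires_grad_(False)
    opt = torch.optim.Adam(pre_model.parameters(), lr=lr)
    for _ in range(epochs):
        for sdr, hdr_gt in loader:                     # second sample data
            w1, w2, w3 = pre_model(sdr)
            pred = hdr_model(apply_params(sdr, w1, w2, w3))  # preprocess, then map
            loss = F.l1_loss(pred, hdr_gt)             # second loss function
            opt.zero_grad(); loss.backward(); opt.step()
```

For the joint adjustment described next (third sample data), both models would simply be left trainable and optimized against the third loss function.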
  • a third sample data is obtained, and the third sample data includes a third version SDR image and a third version HDR image; the third version HDR image corresponding to the third version SDR image is used as the third version true value image; the third version SDR image is input into the image processing network and the video processing network to obtain the third version predicted HDR image corresponding to the third version SDR image; the third version predicted HDR image and the third version true value image are input into the third loss function to obtain the third loss function value; and according to the third loss function value, the model parameters of the image processing network and the video processing network are adjusted.
  • model parameters of the HDR model and the pre-processing model are adjusted simultaneously through a set of sample data (third sample data).
  • the first loss function, the second loss function and the third loss function may be the same or different from each other, and the embodiments of the present disclosure are not limited to this.
  • the first loss function, the second loss function and the third loss function may adopt any loss function, such as a square loss function, a logarithmic loss function, an exponential loss function, etc., and the embodiments of the present disclosure are not limited to this and can be set according to actual conditions.
  • FIG. 6 is a schematic block diagram of another HDR video generation method provided by at least one embodiment of the present disclosure.
  • the decoded initial video (i.e., the video to be processed) is processed as follows.
  • when the preprocessing model determines that the video has undergone scene switching, the current frame is the first frame of the new scene.
  • the preprocessing model processes the first frame of the current new scene to obtain the display parameters w1, w2, and w3; all frames of the current scene are then adjusted/preprocessed using this set of display parameters (w1, w2, and w3) until the last frame of the scene, after which the next scene is entered and the operation is repeated, thereby preprocessing the entire video to be processed. Therefore, before the video is input into the HDR model, the initial video is preprocessed to adjust its brightness, contrast, etc., so that the adjusted video meets the input requirements of the subsequent HDR model (e.g., close to or within the brightness range, contrast range, etc. that the HDR model can accept), allowing a single HDR model to process complex videos with multiple scenes and effectively improving the quality and efficiency of the generated HDR video.
  • a video clip corresponding to a scene in the initial video uses a display parameter set.
  • different video clips corresponding to different scenes in the initial video use different display parameter sets, and in other examples, different video clips corresponding to different scenes use the same display parameter set.
  • the embodiments of the present disclosure are not limited to this and can be adjusted according to actual needs.
  • the HDR image generation algorithm can be implemented by various neural network models, such as an HDR model. It should be noted that the embodiments of the present disclosure do not limit the HDR image generation algorithm, nor the specific network structure of the HDR model, as long as the HDR image can be generated.
  • performing high dynamic range conversion on the intermediate video segment to obtain the high dynamic range video segment includes: performing high dynamic range conversion on the intermediate video segment using a video processing network.
  • the video processing network includes a basic network and a weight network, the basic network is used to extract features and reconstruct features on the input frame to obtain a high dynamic range output frame, the weight network is used to extract features on the input frame to obtain feature matrix parameters, and information correction is performed on the basic network according to the feature matrix parameters.
  • the basic network can be any deep learning network in the prior art that can realize the conversion of SDR video to HDR video.
  • the residual network (ResNet), the cycle generative adversarial network (CycleGAN) and the pixel-to-pixel generation network (Pixel2Pixel) are algorithm models for image-to-image translation.
  • the High Dynamic Range Network (HDRNet), the Conditional Sequential Retouching Network (CSRNet) and the Adaptive 3D lookup table (Ada-3DLUT) network are algorithm models for photo retouching.
  • Deep super-resolution inverse tone-mapping (Deep SR-ITM) and GAN-Based Joint Super-Resolution and Inverse Tone-Mapping (JSI-GAN) are algorithm models for converting SDR videos to HDR videos.
  • the embodiments of the present disclosure do not limit the specific structure of the basic network, as long as it contains multiple feature fusion nodes for fusion with the weight information and can achieve the conversion between SDR videos and HDR videos.
  • FIG. 7 is a schematic block diagram of a video processing network (HDR model) provided by at least one embodiment of the present disclosure.
  • a video processing network includes a base network and a weight network.
  • the base network includes a feature extraction network and a feature reconstruction network.
  • the feature extraction network includes multiple extraction subnetworks, for example, including 5 extraction subnetworks.
  • the video processing network shown in FIG7 includes two branches.
  • the right branch is the basic network, which is used to convert an SDR image or an LDR image into an HDR image, that is, to complete the task of generating an HDR image.
  • the basic network is used to extract and reconstruct features of the input frame to obtain a high dynamic range output frame.
  • the left branch is a weight network, which is used to correct the information of the basic network.
  • the weight network is used to extract features from the input frame to obtain feature matrix parameters, and to correct the information of the basic network according to the feature matrix parameters.
  • the video processing network may include only a single branch, for example, only the right branch in FIG. 7 , i.e., the basic network.
  • the embodiments of the present disclosure are not limited to this, as long as the HDR image generation task can be achieved, and can be set according to actual conditions.
  • the basic network shown in FIG. 7 adopts a UNET network structure.
  • the UNET network structure has two symmetrical halves: the first half of the network downsamples the feature maps as it computes them, and, to ensure that the size of the network output equals that of the input, the second half of the UNET structure upsamples the feature maps.
  • this upsampling is generally performed by deconvolution or linear interpolation. In this way, after the encoding and decoding process, that is, after dimensionality reduction and abstraction, the features are restored to the same size as the input, completing the regression task.
  • the basic network includes at least one information adjustment node, and the information adjustment node is used to integrate the feature extraction information of the basic network on the input frame and the feature matrix parameter information of the weight network.
  • the basic network includes a first information adjustment node, a second information adjustment node, a third information adjustment node, a fourth information adjustment node, and a fifth information adjustment node.
  • the node at which the weight network and the basic network are combined is the information adjustment node mentioned above, which represents point-to-point multiplication of the feature matrices.
  • the other node represents the rearrangement of feature channels, that is, a connection (concatenation) layer.
  • the basic network includes 5 information adjustment nodes.
  • the five information adjustment nodes shown in FIG7 are, from top to bottom, the first information adjustment node, the second information adjustment node, the third information adjustment node, the fourth information adjustment node, and the fifth information adjustment node. It should be noted that the embodiments of the present disclosure do not limit the number of information adjustment nodes, which can be set according to actual conditions.
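As a minimal illustration, the two node types described above might be expressed as follows, assuming feature maps stored as PyTorch tensors of shape (N, C, H, W); the tensor layout is an assumption.

```python
# Minimal sketch of the two node types, assuming (N, C, H, W) tensors.
import torch

def information_adjustment_node(features, weight_matrix):
    """Point-to-point (element-wise) multiplication of the base-network
    features by the feature parameter matrix from the weight network."""
    return features * weight_matrix

def channel_rearrangement_node(features_a, features_b):
    """Rearrangement of feature channels, i.e. a concatenation layer."""
    return torch.cat([features_a, features_b], dim=1)
```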
  • the feature reconstruction network in the HDR model shown in FIG7 is used to convert image features into output video frame information.
  • the simplest feature reconstruction network can use a single convolution layer Conv, a series of stacked convolution-activation (Conv-ReLU) layers, or a series of residual networks, as long as the purpose of outputting video frames can be achieved.
  • the embodiments of the present disclosure are not limited to this and can be set according to actual needs.
  • a weight network is used in the video processing network (HDR model) shown in Figure 7.
  • there are many scenes in the video including daytime, nighttime, indoor, outdoor, sports, still, people, animals, etc.
  • a weight network is used to make full use of the information of the current video frame and perform parameter information correction on the basic network.
  • the HDR model architecture shown in Figure 7 is only an example.
  • the HDR model can adopt the architecture of any neural network model, and is not limited to the model architecture shown in Figure 7, and the embodiments of the present disclosure do not limit this.
  • FIG8 is a schematic block diagram of another video processing network (HDR model) provided by at least one embodiment of the present disclosure.
  • in addition to the UNET network structure used in the base network, the weight network in the HDR model also uses the UNET network structure, thereby achieving information correction at different scales.
  • the weight network includes at least one feature correction network, and the sizes of the corresponding input images of the at least one feature correction network are different from each other.
  • the weight network includes a first feature correction network, a second feature correction network, and a third feature correction network.
  • the input frame is input to the first feature correction network to obtain a first feature parameter matrix;
  • the first feature parameter matrix is input to the third information adjustment node;
  • the first feature parameter matrix and the input frame are rearranged in feature channels and then input to the second feature correction network to obtain a second feature parameter matrix;
  • the second feature parameter matrix is input to the second information adjustment node and the fourth information adjustment node;
  • the second feature parameter matrix and the input frame are rearranged in feature channels and then input to the third feature correction network to obtain a third feature parameter matrix;
  • the third feature parameter matrix is simultaneously input to the first information adjustment node and the fifth information adjustment node.
  • the node at which the weight network and the basic network are combined is the information adjustment node mentioned above, which represents point-to-point multiplication of the feature matrices.
  • the other node represents the rearrangement of feature channels, that is, a connection (concatenation) layer.
  • the input frame is input to the three correction networks (the first feature correction network, the second feature correction network, and the third feature correction network) after being downsampled by a factor of 4, downsampled by a factor of 2, and left at its original size, respectively.
  • the sizes of the corresponding input images of the three correction networks are different from each other, so that the first feature correction network, the second feature correction network, and the third feature correction network can respectively provide information correction of different sizes to the basic network.
  • upsampling and downsampling represent 2x upsampling and 2x downsampling, respectively.
  • the size of the input frame is 64 × 64
  • the image size output by the first extraction subnetwork from top to bottom is 64 × 64
  • the image size output by the second extraction subnetwork from top to bottom is 32 × 32
  • the image size output by the third extraction subnetwork from top to bottom is 16 × 16
  • the image size output by the fourth extraction subnetwork from top to bottom is 32 × 32.
  • the image size output by the fifth extraction subnetwork from top to bottom is restored to the same size as the input frame, i.e., 64 × 64.
  • the input frame of size 64 × 64 is downsampled 4x to obtain an image size of 16 × 16, so the first correction network from top to bottom shown in FIG8 can provide a correction of size 16 × 16.
  • the input frame of size 64 × 64 is downsampled 2x to obtain an image size of 32 × 32, so the second correction network from top to bottom shown in FIG8 can provide a correction of size 32 × 32.
  • the third correction network from top to bottom shown in FIG8 can provide a correction of size 64 × 64.
  • the information corrections of different sizes output by the feature correction networks are respectively provided to the intermediate results of the same size in the feature extraction network.
  • the information correction output by the first correction network (16 × 16) is provided to the output of the third extraction sub-network (16 × 16)
  • the information correction output by the second correction network (32 × 32) is provided to the output of the second and fourth extraction sub-networks (32 × 32)
  • the information correction output by the third correction network (64 × 64) is provided to the output of the first and fifth extraction sub-networks (64 × 64).
  • the weight network can provide information correction of different sizes (e.g., 16 × 16, 32 × 32, or 64 × 64) to the base network.
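A hedged sketch of the multi-scale routing of FIG. 8 is given below, assuming a 64 × 64 input frame and treating the three feature correction networks as opaque modules. The placement of the 2x up/downsampling around the channel rearrangement, and the channel counts expected by each correction network, are assumptions inferred from the description rather than details read from the figure.

```python
# Sketch of the multi-scale weight-network routing of FIG. 8, assuming a
# 64x64 input frame. The correction networks c1..c3 are opaque modules, and
# the exact up/downsampling placement is an assumption inferred from the text.
import torch
import torch.nn.functional as F

def weight_branch(x, c1, c2, c3):
    """Return feature parameter matrices at 16x16, 32x32 and 64x64."""
    m1 = c1(F.avg_pool2d(x, 4))                                  # 16x16
    m1_up = F.interpolate(m1, scale_factor=2, mode="bilinear")   # -> 32x32
    m2 = c2(torch.cat([m1_up, F.avg_pool2d(x, 2)], dim=1))       # 32x32
    m2_up = F.interpolate(m2, scale_factor=2, mode="bilinear")   # -> 64x64
    m3 = c3(torch.cat([m2_up, x], dim=1))                        # 64x64
    return m1, m2, m3

# The matrices are then multiplied point-to-point into the base network:
#   m1 -> third information adjustment node (16x16 features),
#   m2 -> second and fourth adjustment nodes (32x32 features),
#   m3 -> first and fifth adjustment nodes (64x64 features).
```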
  • a weight network is used in the video processing network (HDR model) shown in Figure 8.
  • there are many scenes in the video including daytime, nighttime, indoor, outdoor, sports, still, people, animals, etc.
  • a weight network is used to make full use of the information of the current video frame and perform parameter information correction on the basic network.
  • the HDR model architecture shown in Figure 8 is only an example.
  • the HDR model can adopt the architecture of any neural network model, and is not limited to the model architecture shown in Figure 8, and the embodiments of the present disclosure do not limit this.
  • FIG9A is a schematic diagram of the structure of an extraction subnetwork provided by at least one embodiment of the present disclosure
  • FIG9B is a schematic diagram of the structure of a residual network ResNet provided by at least one embodiment of the present disclosure.
  • an extraction subnetwork includes a convolution layer Conv, an activation layer ReLU, a plurality of residual networks ResNet, etc.
  • the feature extraction network shown in FIG9A is only an example.
  • the extraction subnetwork can adopt any reasonable architecture, and is not limited to the model architecture shown in FIG9A .
  • the embodiments of the present disclosure are not limited to this, and can be set according to actual conditions.
  • the structure of the residual network ResNet is shown in Figure 9B.
  • the residual network ResNet includes a convolutional layer Conv, an activation function ReLU, and another convolutional layer Conv, followed by a superposition (skip addition) with the input.
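For illustration, the residual block of FIG. 9B might be written as follows; the kernel size and channel count are assumptions.

```python
# Minimal sketch of the residual block of FIG. 9B (Conv -> ReLU -> Conv plus a
# skip connection); kernel size and channel count are assumptions.
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return x + self.conv2(self.relu(self.conv1(x)))  # superposition (skip add)
```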
  • the residual network shown in Figure 9B is only an example.
  • the residual network can adopt any reasonable architecture and is not limited to the model architecture shown in Figure 9B.
  • the embodiments of the present disclosure are not limited to this and can be set according to actual conditions.
  • the weight network includes at least one feature correction network
  • the feature correction network includes at least one attention module.
  • the attention module uses dual channels to extract features from input information, including: using the first channel to extract local features from the input frame to obtain a first feature; using the second channel to extract global features from the input frame to obtain a second feature; fusing the first feature and the second feature to obtain output information.
  • FIG. 10A is a schematic diagram of the structure of a feature correction network provided by at least one embodiment of the present disclosure
  • FIG. 10B is a schematic diagram of the structure of an attention module provided by at least one embodiment of the present disclosure.
  • the feature correction network includes a convolution layer Conv, an activation layer ReLU and at least one attention (CSA) module.
  • the attention (CSA) module includes two branches (e.g., a first channel and a second channel), one branch (the first channel) includes a class variance (cstd) module, a convolution layer Conv, a pooling layer Pooling, an activation layer ReLU, a bilinear function (Bilinear), and a Sigmoid function, which can perform local feature extraction on the input frame to obtain the first feature.
  • the other branch (the second channel) includes multiple instance normalization layers InsNorm, a convolution layer Conv, and an activation layer ReLU, which can perform global feature extraction on the input frame to obtain the second feature.
  • the first feature and the second feature are integrated to obtain output information.
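A minimal sketch of the dual-channel attention (CSA) module is shown below, assuming 64-channel feature maps. The channel counts, kernel sizes, the simple per-pixel-deviation stand-in for the cstd statistic, and the multiplicative fusion of the two branches are assumptions; only the order of operations in each branch follows the description of FIG. 10B.

```python
# Hedged sketch of the dual-channel attention (CSA) module of FIG. 10B.
# Channel counts, kernel sizes and the fusion operation are assumptions; the
# cstd statistic here is a simple per-pixel deviation from the frame mean.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CSAModule(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        # First channel: local features (cstd -> Conv -> Pool -> ReLU -> Bilinear -> Sigmoid)
        self.local_conv = nn.Conv2d(ch, ch, 3, padding=1)
        # Second channel: global features (InsNorm -> Conv -> ReLU)
        self.global_branch = nn.Sequential(
            nn.InstanceNorm2d(ch),
            nn.Conv2d(ch, ch, 3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        h, w = x.shape[-2:]
        cstd = x - x.mean(dim=(-2, -1), keepdim=True)        # class-variance-like statistic
        local = F.relu(F.avg_pool2d(self.local_conv(cstd), 2))
        local = torch.sigmoid(F.interpolate(local, size=(h, w), mode="bilinear"))
        glob = self.global_branch(x)
        return local * glob                                    # fuse first and second features
```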
  • the attention CSA module shown in FIG10B is only an example.
  • the CSA module can adopt any reasonable architecture, and is not limited to the model architecture shown in FIG10B.
  • the embodiments of the present disclosure are not limited to this and can be set according to actual conditions.
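  • a minimal PyTorch sketch of such a dual-branch CSA module is given below; it follows the layer lists of FIG10B, but the exact layer ordering, the stand-in used for the variance-like cstd statistic, and the choice of element-wise multiplication as the fusion operator are assumptions:

```python
import torch
from torch import nn
import torch.nn.functional as F

class CSAModule(nn.Module):
    # Dual-channel attention: a local branch (variance-like cstd stand-in, Conv,
    # Pooling, ReLU, bilinear upsampling, Sigmoid) and a global branch
    # (InstanceNorm layers, Conv, ReLU), fused element-wise.
    def __init__(self, channels: int = 64):
        super().__init__()
        self.local_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.pool = nn.AvgPool2d(kernel_size=2)
        self.norm = nn.InstanceNorm2d(channels)
        self.global_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        # First channel (local features): variance-like centering as a cstd stand-in,
        # then Conv -> Pooling -> ReLU -> Bilinear upsampling -> Sigmoid attention map.
        centered = x - x.mean(dim=(-2, -1), keepdim=True)
        local = F.relu(self.pool(self.local_conv(centered)))
        local = torch.sigmoid(F.interpolate(local, size=(h, w), mode="bilinear",
                                            align_corners=False))
        # Second channel (global features): InstanceNorm -> Conv -> ReLU.
        global_feat = F.relu(self.global_conv(self.norm(x)))
        # Fuse the first and second features to obtain the output information.
        return global_feat * local
```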
  • for example, the upsampling shown in FIG7 and FIG8 can be implemented as a deconvolution layer DConv, e.g., DConv(k3f64s2), where k3f64s2 denotes a convolution kernel k=3, f=64 output channels, and stride s=2, or as Bilinear/Bicubic interpolation (2x upsampling); the downsampling can be implemented as pooling, Bilinear/Bicubic interpolation (2x downsampling), or a convolution layer Conv(k3f64s2); the convolution layers Conv appearing in FIG9A to FIG10B adopt Conv(k3f64s1), i.e., kernel k=3, f=64 output channels, stride s=1.
  • the network structures described in FIG7 to FIG10B are exemplary, and the embodiments of the present disclosure are not limited to this and can be adjusted according to actual conditions.
  • embodiments of the present disclosure do not limit the specific implementation methods of performing "upsampling” or “downsampling”, as long as “upsampling” or “downsampling” can be achieved. It should also be noted that the embodiments of the present disclosure do not limit the specific multiples of upsampling and downsampling, which can be set according to actual needs.
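  • for reference only, 2x up/down sampling of a 64-channel feature map could be realized either with learned layers matching the DConv(k3f64s2)/Conv(k3f64s2) notation or with interpolation; the padding values below are assumptions chosen so that the spatial size changes by exactly a factor of two:

```python
import torch
from torch import nn
import torch.nn.functional as F

x = torch.randn(1, 64, 32, 32)

# Learned 2x upsampling: DConv(k3f64s2), i.e. kernel 3, 64 output channels, stride 2.
deconv = nn.ConvTranspose2d(64, 64, kernel_size=3, stride=2, padding=1, output_padding=1)
up_learned = deconv(x)                                    # -> (1, 64, 64, 64)

# Interpolation-based 2x upsampling / downsampling (Bilinear in the description).
up_interp = F.interpolate(x, scale_factor=2.0, mode="bilinear", align_corners=False)
down_interp = F.interpolate(x, scale_factor=0.5, mode="bilinear", align_corners=False)

# Learned 2x downsampling: Conv(k3f64s2), or pooling.
down_conv = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1)(x)   # -> (1, 64, 16, 16)
down_pool = nn.AvgPool2d(kernel_size=2)(x)                             # -> (1, 64, 16, 16)
```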
  • the variance-like cstd module provides correction according to each pixel value of the current video frame and the size of the current video frame.
  • the feature correction network can effectively utilize the mean, variance and other information of the current video frame.
  • the variance-like cstd module uses the following equation (3) to calculate its output (the equation itself is reproduced only as an image in the published text; see the hedged sketch below): x represents the current video frame, u(x) represents the average value of the current video frame, M represents the width of the current video frame, N represents the height of the current video frame, x i,j represents the pixel with coordinates (i, j) in the current video frame, Cstd(x) represents the correlation of the current video frame (interpreted here as a variance-like statistic), and O x represents the output frame corresponding to the current video frame.
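  • since equation (3) is only reproduced as an image in the published text, the following NumPy sketch merely shows one plausible variance-like statistic and output consistent with the variables listed above; the actual formula may differ:

```python
import numpy as np

def cstd(x: np.ndarray) -> float:
    # One plausible variance-like statistic: the root-mean-square deviation of the
    # M x N frame x from its average value u(x). The true equation (3) is not
    # recoverable from the published text.
    u = x.mean()                    # u(x): average value of the current video frame
    N, M = x.shape                  # N: height, M: width (frame stored as H x W)
    return float(np.sqrt(((x - u) ** 2).sum() / (M * N)))

def cstd_output(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    # Hypothetical output frame O_x: the frame centered by its mean and scaled by
    # Cstd(x); this normalization is an assumption, not the published equation.
    return (x - x.mean()) / (cstd(x) + eps)
```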
  • At least one embodiment of the present disclosure provides a preprocessing model mechanism to dynamically adjust the video frames in the initial video, so that a single HDR model can process more video scenes, effectively improving the quality and efficiency of generating HDR videos.
  • at least one embodiment of the present disclosure also provides an HDR model that uses a variance-like cstd module, so that the feature correction network of the HDR model can effectively utilize the mean, variance and other information of the current video frame.
  • at least one embodiment of the present disclosure also provides an HDR model that uses a UNET network structure as a whole, so that the feature correction network of the HDR model can provide information correction on different sizes.
  • the execution order of the various steps of the video processing method 10 is not limited. Although the execution process of each step is described in a specific order above, this does not constitute a limitation on the embodiments of the present disclosure.
  • the various steps in the video processing method 10 can be executed serially or in parallel, which can be determined according to actual needs.
  • the video processing method 10 can also include more or fewer steps, and the embodiments of the present disclosure are not limited to this.
  • FIG. 11 is a schematic block diagram of a video processing device according to at least one embodiment of the present disclosure.
  • the video processing device 40 includes a dividing module 401 , an acquiring module 402 , and a processing module 403 .
  • the division module 401 is configured to divide the multiple video frames included in the initial video into multiple video segments, each video segment includes one or more video frames, and the multiple video frames are continuous.
  • the division module 401 can implement step S101, and its specific implementation method can refer to the relevant description of step S101, which will not be repeated here.
  • the acquisition module 402 is configured to determine the display parameter set of the video segment to which it belongs according to any one of the one or more video frames; adjust the other frames in the video segment to which it belongs according to the display parameter set to obtain an intermediate video segment.
  • the acquisition module 402 can implement steps S102 and S103, and its specific implementation method can refer to the relevant description of steps S102 and S103, which will not be repeated here.
  • the processing module 403 is configured to perform high dynamic range conversion on the intermediate video segment to obtain a high dynamic range video segment; and generate a high dynamic range video according to the high dynamic range video segment.
  • the processing module 403 may implement steps S104 and S105.
  • the specific implementation method may refer to the related description of steps S104 and S105, which will not be described in detail here.
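  • read together, the three modules realize the pipeline of steps S101 to S105; the sketch below only illustrates that control flow, with the scene-splitting, parameter-analysis, frame-adjustment and HDR-conversion functions left as hypothetical callables:

```python
from typing import Callable, List, Sequence

def process_video(frames: Sequence,
                  split_scenes: Callable,     # step S101: divide frames into segments
                  analyze_params: Callable,   # step S102: display parameter set from one frame
                  adjust_frame: Callable,     # step S103: adjust a frame with that set
                  to_hdr: Callable) -> List:  # step S104: high dynamic range conversion
    hdr_segments = []
    for segment in split_scenes(frames):
        params = analyze_params(segment[0])                      # e.g. the first frame
        intermediate = [adjust_frame(f, params) for f in segment]
        hdr_segments.append(to_hdr(intermediate))
    # Step S105: concatenate the HDR segments into the HDR video.
    return [frame for seg in hdr_segments for frame in seg]
```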
  • the division module 401, the acquisition module 402 and the processing module 403 can be implemented by software, hardware, firmware or any combination thereof.
  • for example, they can be implemented as a division circuit 401, an acquisition circuit 402 and a processing circuit 403, respectively.
  • the embodiments of the present disclosure do not limit their specific implementation methods.
  • the video processing device 40 provided in at least one embodiment of the present disclosure can implement the aforementioned video processing method 10, and can also achieve technical effects similar to the aforementioned video processing method 10.
  • the video processing device 40 provided in at least one embodiment of the present disclosure pre-processes the initial video and dynamically adjusts the video frames in the initial video, so that a single HDR model can process more video scenes, effectively improving the quality and efficiency of generating HDR videos.
  • the HDR model adopts a variance-like cstd module, so that the feature correction network of the HDR model can effectively utilize the mean, variance and other information of the current video frame.
  • the HDR model as a whole adopts a UNET network structure, so that the feature correction network of the HDR model can provide information correction on different sizes.
  • the video processing device 40 may include more or fewer circuits or units, and the connection relationship between the various circuits or units is not limited and can be determined according to actual needs.
  • the specific configuration of each circuit is not limited and can be composed of analog devices according to circuit principles, or can be composed of digital chips, or in other applicable ways.
  • FIG. 12 is a schematic block diagram of another video processing device provided by at least one embodiment of the present disclosure.
  • the video processing device 90 includes a processor 910 and a memory 920.
  • the memory 920 includes one or more computer program modules 921.
  • the one or more computer program modules 921 are stored in the memory 920 and are configured to be executed by the processor 910.
  • the one or more computer program modules 921 include instructions for executing the video processing method 10 provided by at least one embodiment of the present disclosure.
  • when the instructions are executed by the processor 910, one or more steps in the video processing method 10 provided by at least one embodiment of the present disclosure can be executed.
  • the memory 920 and the processor 910 can be interconnected through a bus system and/or other forms of connection mechanisms (not shown).
  • the processor 910 may be a central processing unit (CPU), a digital signal processor (DSP), or other forms of processing units with data processing capabilities and/or program execution capabilities, such as a field programmable gate array (FPGA), etc.; for example, the central processing unit (CPU) may be an X86 or ARM architecture, etc.
  • the processor 910 may be a general-purpose processor or a dedicated processor, and may control other components in the video processing device 90 to perform desired functions.
  • the memory 920 may include any combination of one or more computer program products, and the computer program product may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory.
  • Volatile memory may include, for example, random access memory (RAM) and/or cache memory (cache), etc.
  • Non-volatile memory may include, for example, read-only memory (ROM), hard disk, erasable programmable read-only memory (EPROM), portable compact disk read-only memory (CD-ROM), USB memory, flash memory, etc.
  • One or more computer program modules 921 may be stored on the computer-readable storage medium, and the processor 910 may run one or more computer program modules 921 to implement various functions of the video processing device 90.
  • FIG. 13 is a schematic block diagram of yet another video processing device provided by at least one embodiment of the present disclosure.
  • the terminal device in the embodiment of the present disclosure may include but is not limited to mobile terminals such as mobile phones, laptop computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), vehicle-mounted terminals (such as vehicle-mounted navigation terminals), etc., and fixed terminals such as digital TVs, desktop computers, etc.
  • the video processing device 600 shown in FIG. 13 is only an example, and should not impose any limitation on the functions and the scope of use of the embodiments of the present disclosure.
  • the video processing device 600 includes a processing device (e.g., a central processing unit, a graphics processor, etc.) 601, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage device 608 to a random access memory (RAM) 603.
  • in the RAM 603, various programs and data required for the operation of the computer system are also stored.
  • the processing device 601, ROM 602, and RAM 603 are connected to each other via a bus 604.
  • An input/output (I/O) interface 605 is also connected to the bus 604.
  • the following components may be connected to the I/O interface 605: an input device 606 including, for example, a touch screen, a touch pad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, etc.; an output device 607 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; a storage device 608 including, for example, a magnetic tape, a hard disk, etc.; and a communication device 609 including a network interface card such as a LAN card, a modem, etc.
  • the communication device 609 may allow the video processing device 600 to communicate with other devices wirelessly or by wire to exchange data, and perform communication processing via a network such as the Internet.
  • the drive 610 is also connected to the I/O interface 605 as needed.
  • a removable medium 611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, etc., is installed on the drive 610 as needed, so that a computer program read therefrom is installed into the storage device 608 as needed.
  • FIG. 13 shows a video processing device 600 including various devices, it should be understood that it is not required to implement or include all the devices shown. More or fewer devices may be implemented or included alternatively.
  • the video processing device 600 may further include a peripheral interface (not shown in the figure), etc.
  • the peripheral interface may be various types of interfaces, such as a USB interface, a lightning interface, etc.
  • the communication device 609 may communicate with a network and other devices through wireless communication, such as the Internet, an intranet and/or a wireless network such as a cellular phone network, a wireless local area network (LAN) and/or a metropolitan area network (MAN).
  • Wireless communication may use any of a variety of communication standards, protocols and technologies, including but not limited to Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), Wideband Code Division Multiple Access (W-CDMA), Code Division Multiple Access (CDMA), Time Division Multiple Access (TDMA), Bluetooth, Wi-Fi (e.g., based on IEEE 802.11a, IEEE 802.11b, IEEE 802.11g and/or IEEE 802.11n standards), Voice over Internet Protocol (VoIP), Wi-MAX, protocols for email, instant messaging and/or Short Message Service (SMS), or any other suitable communication protocol.
  • the video processing device 600 can be any device such as a mobile phone, a tablet computer, a laptop computer, an e-book, a game console, a television, a digital photo frame, a navigator, etc., or it can be any combination of data processing devices and hardware, and the embodiments of the present disclosure are not limited to this.
  • an embodiment of the present disclosure includes a computer program product, which includes a computer program carried on a non-transitory computer-readable medium, and the computer program contains program code for executing the method shown in the flowchart.
  • the computer program can be downloaded and installed from a network through a communication device 609, or installed from a storage device 608, or installed from a ROM 602.
  • when the computer program is executed by the processing device 601, the video processing method 10 disclosed in the embodiments of the present disclosure is executed.
  • the computer-readable medium disclosed above may be a computer-readable signal medium or a computer-readable storage medium or any combination of the above two.
  • the computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or any combination of the above.
  • Computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium containing or storing a program that may be used by or in combination with an instruction execution system, apparatus or device.
  • a computer-readable signal medium may include a data signal propagated in a baseband or as part of a carrier wave, in which a computer-readable program code is carried. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above.
  • the computer readable signal medium may also be any computer readable medium other than a computer readable storage medium, which may send, propagate or transmit a program for use by or in conjunction with an instruction execution system, apparatus or device.
  • the program code contained on the computer readable medium may be transmitted using any suitable medium, including but not limited to: wires, optical cables, RF (radio frequency), etc., or any suitable combination of the above.
  • the computer-readable medium may be included in the video processing device 600; or it may exist independently without being assembled into the video processing device 600.
  • FIG. 14 is a schematic block diagram of a non-transitory readable storage medium provided by at least one embodiment of the present disclosure.
  • a non-transitory readable storage medium 70 stores computer instructions 111, which, when executed by a processor, execute one or more steps in the video processing method 10 as described above.
  • the non-transitory readable storage medium 70 may be any combination of one or more computer-readable storage media, for example, one computer-readable storage medium includes a computer-readable program code for dividing a plurality of video frames included in an initial video into a plurality of video segments, each video segment including one or more video frames, another computer-readable storage medium includes a computer-readable program code for determining a display parameter set of a video segment to which it belongs according to any one of the one or more video frames; adjusting other frames in the video segment to which it belongs according to the display parameter set to obtain an intermediate video segment, and another computer-readable storage medium includes a computer-readable program code for performing high dynamic range conversion on the intermediate video segment to obtain a high dynamic range video segment; and generating a high dynamic range video according to the high dynamic range video segment.
  • the above-mentioned various program codes may also be stored in the same computer-readable medium, and the embodiments of the present disclosure are not limited to this.
  • the computer may execute the program code stored in the computer storage medium, and execute, for example, the video processing method 10 provided in any one of the embodiments of the present disclosure.
  • the storage medium may include a memory card of a smart phone, a storage component of a tablet computer, a hard disk of a personal computer, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a portable compact disk read-only memory (CD-ROM), a flash memory, or any combination of the above storage media, or other applicable storage media.
  • the readable storage medium may also be the memory 920 in FIG. 12 , and the related description may refer to the aforementioned content, which will not be repeated here.
  • FIG. 15 is a schematic block diagram of an electronic device according to at least one embodiment of the present disclosure.
  • the electronic device 120 may include the video processing device 40/90/600 as described above.
  • the electronic device 120 may implement the video processing method 10 provided by any embodiment of the present disclosure.
  • the term “plurality” refers to two or more than two, unless clearly defined otherwise.

Landscapes

  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

一种视频处理方法、视频处理装置和存储介质。视频处理方法包括:(S101)将初始视频包括的多个视频帧划分为多个视频片段,每个视频片段包括一个或多个视频帧,多个视频帧连续;(S102)根据一个或多个视频帧中的任意一帧,确定其所属视频片段的显示参数集;(S103)根据显示参数集对所属视频片段中的其它帧进行调整,获得中间视频片段;(S104)对中间视频片段进行高动态范围转换,获得高动态范围视频片段;(S105)根据高动态范围视频片段生成高动态范围视频。该视频处理方法通过显示参数集预处理初始视频,使得单一的HDR模型可以处理具有复杂场景的初始视频,有效提升生成HDR视频的质量和效率。

Description

视频处理方法、视频处理装置和可读存储介质 技术领域
本公开的实施例涉及一种视频处理方法、视频处理装置和非瞬时可读存储介质。
背景技术
高动态范围(high dynamic range,HDR)图像,相比普通的图像,可以提供更多的动态范围和图像细节,能够更加准确的记录真实场景的绝大部分色彩和光照信息,并能表现出丰富的色彩细节和明暗层次。通过根据不同的曝光时间的低动态范围(low dynamic range,LDR)图像,利用每个曝光时间相对应最佳细节的LDR图像来合成最终HDR图像,能够更好的反映人在真实环境中的视觉效果。HDR技术可以被应用于对图像质量要求较高的领域,如医学影像、视频监控、卫星遥感和计算机视觉等领域中。
发明内容
本公开至少一个实施例提供一种视频处理方法,包括:将初始视频包括的多个视频帧划分为多个视频片段,每个所述视频片段包括一个或多个视频帧,所述多个视频帧连续;根据所述一个或多个视频帧中的任意一帧,确定其所属视频片段的显示参数集;根据所述显示参数集对所述所属视频片段中的其它帧进行调整,获得中间视频片段;对所述中间视频片段进行高动态范围转换,获得高动态范围视频片段;根据所述高动态范围视频片段生成高动态范围视频。
例如,在本公开至少一个实施例提供的方法中,所述显示参数集包括第一显示参数、第二显示参数、第三显示参数,所述第一显示参数和所述第三显示参数用于调整视频帧的亮度,所述第二显示参数用于调整所述视频帧的对比度。
例如,在本公开至少一个实施例提供的方法中,所述第一显示参数用于调整所述视频帧的整体亮度水平,所述第三显示参数用于局部调整所述视频帧的亮度水平。
例如,在本公开至少一个实施例提供的方法中,将初始视频包括的多个视频帧划分为多个视频片段,包括:按照所述初始视频包括的多个视频帧的播放顺序,依次计算每个视频帧与前一个视频帧的相似度;基于计算得到的每相邻两个视频帧的相似度,将所述初始视频划分为多个视频片段。
例如,在本公开至少一个实施例提供的方法中,在按照所述初始视频包括的多个视频帧的播放顺序,依次计算每个视频帧与前一个视频帧的相似度之前,所述方法还包括:对所述初始视频中的每个初始视频帧进行降维处理,得到所述多个视频帧。
例如,在本公开至少一个实施例提供的方法中,所述依次计算每个视频帧与前一个视频帧的相似度,包括:对于所述多个视频帧中的每个视频帧,基于所述视频帧的图像数据的均值和前一个视频帧的图像数据的均值,所述视频帧的图像数据的标准差和所述前一个视频帧的图像数据的标准差,以及所述视频帧的图像数据和所述前一个视频帧的图像数据的协方差,确定所述视频帧与所述前一个视频帧的结构相似度;基于所述视频帧和所述前一个视频帧的结构相似度,确定所述视频帧与所述前一个视频帧的相似度。
例如,在本公开至少一个实施例提供的方法中,根据所述一个或多个视频帧中的任意一帧,确定其所属视频片段的显示参数集,包括:利用图像处理网络对初始视频帧进行参数分析以获得所述显示参数集;所述图像处理网络包括第一图像分析模块和第二图像分析模块;所述第一图像分析模块用于对所述初始视频帧进行特征提取;获得第一中间视频帧;所述第二图像分析模块用于对所述第一中间视频帧进行特征提取和尺度变换,以输出所述显示参数集。
例如,在本公开至少一个实施例提供的方法中,所述第一图像分析模块包括第一卷积层、平均池化层、激活层和实列正则归一化层;所述第二图像分析模块包括第二卷积层和全局平均池化层。
例如,在本公开至少一个实施例提供的方法中,所述图像处理网络包括多个所述第一图像分析模块。
例如,在本公开至少一个实施例提供的方法中,根据所述显示参数集对所述所属视频片段中的其它帧进行调整,获得中间视频片段,包括:
每个视频片段中所有视频帧数据根据每个视频帧对应的显示参数集按照以下等式进行调整:
Figure PCTCN2022141522-appb-000001
其中X in表示输入帧,X out表示输出帧,w1、w2、w3分别为所述第一显示参数、所述第二显示参数和所述第三显示参数。
例如,在本公开至少一个实施例提供的方法中,对所述中间视频片段进行高动态范围转换,获得高动态范围视频片段,包括:利用视频处理网络对所述中间视频片段进行高动态范围转换;所述视频处理网络包括基础网络和权重网络;所述基础网络用于对输入帧进行特征提取和特征重构,以获得高动态范围输出帧;所述权重网络用于对输入帧进行特征提取以获得特征矩阵参数,根据所述特征矩阵参数对所述基础网络进行信息矫正。
例如,在本公开至少一个实施例提供的方法中,所述基础网络包括至少一个信息调节节点,所述信息调节节点用于整合所述基础网络对所述输入帧的特征提取信息和所述权重网络的特征矩阵参数信息。
例如,在本公开至少一个实施例提供的方法中,所述基础网络包括第一信息调节节点、第二信息调节节点、第三信息调节节点、第四信息调节节点和第五信息调节节点。
例如,在本公开至少一个实施例提供的方法中,所述权重网络包括至少一个特征矫正网络,所述特征矫正网络包括至少一个注意力模块;所述注意力模块采用双通道对输入信息进行特征提取,包括:利用第一通道对所述输入帧进行局部特征提取获得第一特征;利用第二通道对所述输入帧进行全局特征提取获得第二特征;融合所述第一特征和所述第二特征,获得输出信息。
例如,在本公开至少一个实施例提供的方法中,所述权重网络包括第一特征矫正网络、第二特征矫正网络和第三特征矫正网络,所述方法包括:将所述输入帧输入第一特征矫正网络,获得第一特征参数矩阵;将所述第一特征参数矩阵输入第三信息调节节点;将所述第一特征参数矩阵与所述输入帧进行特征通道重排后输入所述第二特征矫正网络,以获得第二特征参数矩阵;将所述第二特征参数矩阵输入第二信息调节节点和第四信息调节节点;将所述第二特征参数矩阵与所述输入帧进行特征通道重排后输入所述第三特征矫正网络,以获得第三特征参数矩阵;将所述第三特征参数矩阵输入第一信息调节节点和第五信息调节节点。
例如,本公开至少一个实施例提供的方法还包括:获取第一样本数据,所述第一样本数据包括第一版本SDR图像和第一版本HDR图像;将所述第一 版本SDR图像对应的第一版本HDR图像作为第一版本真值图像;将所述第一版本SDR图像输入视频处理网络得到与所述第一版本SDR图像对应的第一版本预测HDR图像;将所述第一版本预测HDR图像和所述第一版本真值图像输入第一损失函数,得到第一损失函数值;以及根据所述第一损失函数值,调整所述视频处理网络的模型参数;获取第二样本数据,其中,所述第二样本数据包括第二版本SDR图像和第二版本HDR图像;将所述第二版本SDR图像对应的第二版本HDR图像作为第二版本真值图像;将所述第二版本SDR图像输入所述图像处理网络和已经训练好的所述视频处理网络得到与所述第二版本SDR图像对应的第二版本预测HDR图像;固定所述视频处理网络的参数;将所述第二版本预测HDR图像和所述第二版本真值图像输入第二损失函数,得到第二损失函数值;以及根据所述第二损失函数值,调整所述图像处理网络的模型参数。
例如,本公开至少一个实施例提供的方法还包括:获取第三样本数据,其中,所述第三样本数据包括第三版本SDR图像和第三版本HDR图像;将所述第三版本SDR图像对应的第三版本HDR图像作为第三版本真值图像;将所述第三版本SDR图像输入所述图像处理网络和视频处理网络,得到与所述第三版本SDR图像对应的第三版本预测HDR图像;将所述第三版本预测HDR图像和所述第三版本真值图像输入第三损失函数,得到第三损失函数值;以及根据所述第三损失函数值,调整所述图像处理网络和所述视频处理网络的模型参数。
例如,本公开至少一个实施例还提供了一种视频处理装置,包括划分模块、获取模块和处理模块。划分模块被配置为将初始视频包括的多个视频帧划分为多个视频片段,每个所述视频片段包括一个或多个视频帧,所述多个视频帧连续。获取模块被配置为根据所述一个或多个视频帧中的任意一帧,确定其所属视频片段的显示参数集;根据所述显示参数集对所述所属视频片段中的其它帧进行调整,获得中间视频片段。处理模块,被配置为对所述中间视频片段进行高动态范围转换,获得高动态范围视频片段;根据所述高动态范围视频片段生成高动态范围视频。
例如,本公开至少一个实施例还提供一种视频处理装置,该视频处理装置包括处理器和存储器。存储器包括一个或多个计算机程序模块。所述一个或多 个计算机程序模块被存储在所述存储器中并被配置为由所述处理器执行,所述一个或多个计算机程序模块包括用于执行上述任一实施例所述的视频处理方法的指令。
例如,本公开至少一个实施例还提供一种非瞬时可读存储介质,其上存储有计算机指令。所述计算机指令被处理器执行时执行上述任一实施例所述的视频处理方法。
附图说明
为了更清楚地说明本公开实施例的技术方案,下面将对实施例的附图作简单地介绍,显而易见地,下面描述的附图仅仅涉及本公开的一些实施例,而非对本公开的限制。
图1为本公开至少一个实施例提供的一种HDR视频生成方法的示意框图;
图2为本公开至少一个实施例的一种视频处理方法的示例流程图;
图3为本公开至少一个实施例提供的一种对视频进行场景切分的流程图;
图4为本公开至少一个实施例提供的一种图像处理网络的结构示意图;
图5为本公开至少一个实施例提供的一种图像处理网络的训练过程的示意图;
图6为本公开至少一个实施例提供的另一种HDR视频生成方法的示意框图;
图7为本公开至少一个实施例提供的一种HDR模型的示意框图;
图8为本公开至少一个实施例提供的另一种HDR模型的示意框图;
图9A为本公开至少一个实施例提供的一种提取子网络的结构示意图;
图9B为本公开至少一个实施例提供的一种残差网络的结构示意图;
图10A为本公开至少一个实施例提供的一种矫正网络的结构示意图;
图10B为本公开至少一个实施例提供的一种注意力模块的结构示意图;
图11为本公开至少一个实施例提供的一种视频处理装置的示意框图;
图12为本公开至少一个实施例提供另一种视频处理装置的示意框图;
图13为本公开至少一个实施例提供的又一种视频处理装置的示意框图;
图14为本公开至少一个实施例提供的一种非瞬时可读存储介质的示意框 图;以及
图15为根据本公开至少一个实施例的一种电子设备的示意框图。
具体实施方式
为使本公开实施例的目的、技术方案和优点更加清楚,下面将结合附图,对本公开实施例的技术方案进行清楚、完整地描述。显然,所描述的实施例是本公开的一部分实施例,而不是全部的实施例。基于所描述的本公开的实施例,本领域普通技术人员在无需创造性劳动的前提下所获得的所有其他实施例,都属于本公开保护的范围。
本公开中使用了流程图来说明根据本申请的实施例的***所执行的操作。应当理解的是,前面或下面操作不一定按照顺序来精确地执行。相反,根据需要,可以按照倒序或同时处理各种步骤。同时,也可以将其他操作添加到这些过程中,或从这些过程移除某一步或数步操作。
除非另外定义,本公开使用的技术术语或者科学术语应当为本公开所属领域内具有一般技能的人士所理解的通常意义。本公开中使用的“第一”、“第二”以及类似的词语并不表示任何顺序、数量或者重要性,而只是用来区分不同的组成部分。同样,“一个”、“一”或者“该”等类似词语也不表示数量限制,而是表示存在至少一个。“包括”或者“包含”等类似的词语意指出现该词前面的元件或者物件涵盖出现在该词后面列举的元件或者物件及其等同,而不排除其他元件或者物件。“连接”或者“相连”等类似的词语并非限定于物理的或者机械的连接,而是可以包括电性的连接,不管是直接的还是间接的。“上”、“下”、“左”、“右”等仅用于表示相对位置关系,当被描述对象的绝对位置改变后,则该相对位置关系也可能相应地改变。
图1为本公开至少一个实施例提供的一种HDR视频生成方法的示意框图。
例如,在本公开至少一个实施例中,如图1所示,一种简单的HDR任务可以理解为使用单一的HDR模型(即HDR图像生成算法)来完成对整个视频的处理。例如,将待处理的视频进行解码后,将每一视频帧/视频帧的帧信息输入至HDR模型,该HDR模型用于将一帧图像(例如标准动态范围(standard dynamic range,SDR)图像或LDR图像等)映射为HDR图像,该映射处理可以包括动态范围扩展、色域范围扩展、画面的调色处理等。然后, 将HDR模型的输出信息编码后生成HDR视频。图1所示的HDR视频生成方法要求视频相对内容简单,例如已经播出过的高清电视剧、电影等,这类视频已经在高清频道播出过,视频的整体亮度、对比度、色彩信息基本能够保持一致。
复杂的HDR任务可以理解为片源场景比较复杂,比如纪录片、同一系列的电视剧、或者分为几部的综艺,每一部可能在亮度、对比度、色彩等方面稍有差别。在这种情况下,仅使用单一的HDR模型无法完成对复杂片源的处理。遇到单一的HDR模型无法处理的场景时,如果寻求专业的调色师进行调色会极大的提升成本。
至少为了克服上述技术问题,本公开至少一个实施例提供了一种视频处理方法,视频处理方法包括:将初始视频包括的多个视频帧划分为多个视频片段,每个视频片段包括一个或多个视频帧,多个视频帧连续;根据一个或多个视频帧中的任意一帧,确定其所属视频片段的显示参数集;根据显示参数集对所属视频片段中的其它帧进行调整,获得中间视频片段;对中间视频片段进行高动态范围转换,获得高动态范围视频片段;以及根据高动态范围视频片段生成高动态范围视频。
相应地,本公开至少一个实施例还提供了一种对应于上述视频处理方法的视频处理装置、非瞬时可读存储介质和电子设备。
通过本公开至少一个实施例提供的视频处理方法,可以根据场景切分,将初始视频划分成一个或多个视频片段,获取每个视频片段对应的显示参数集,基于显示参数集调整视频片段中的视频帧,从而获得高动态范围视频片段,进一步生成高动态范围视频,使得单一的HDR模型可以处理具有复杂场景的初始视频,有效提升生成HDR视频的质量和效率。
下面通过多个实施例及其示例对根据本公开提供的视频处理方法进行非限制性的说明,如下面所描述的,在不相互抵触的情况下这些具体示例或实施例中不同特征可以相互组合,从而得到新的示例或实施例,这些新的示例或实施例也都属于本公开保护的范围。
图2为本公开至少一个实施例的一种视频处理方法的示例流程图。
例如,如图2所示,本公开至少一个实施例提供了一种视频处理方法10。例如,在本公开的实施例中,该视频处理方法10可以应用于任何需要生成 HDR图像/视频的应用场景,例如,可以应用于显示器、摄像机、照相机、视频播放器、移动终端等,还可以应用于其他方面,本公开的实施例对此不作限制。如图2所示,该视频处理方法10可以包括如下操作S101至S105。
步骤S101:将初始视频包括的多个视频帧划分为多个视频片段,每个视频片段包括一个或多个视频帧,多个视频帧连续。
步骤S102:根据一个或多个视频帧中的任意一帧,确定其所属视频片段的显示参数集。
步骤S103:根据显示参数集对所属视频片段中的其它帧进行调整,获得中间视频片段。
步骤S104:对中间视频片段进行高动态范围转换,获得高动态范围视频片段;
步骤S105:根据高动态范围视频片段生成高动态范围视频。
例如,在本公开的实施例中,初始视频可以是拍摄的摄像作品、从网络下载的视频、或者本地存储的视频等,也可以是LDR视频、SDR视频等,本公开的实施例对此不作任何限制。需要说明的是,初始视频中可以包括各种视频场景,例如某一室内场景、某一景点场景等,本公开的实施例对此不作任何限制。
例如,在本公开至少一个实施例中,对于步骤S101,可以对初始视频按照视频场景进行视频片段的切分。例如,在一些示例中,将初始视频包括的多个视频帧划分为一个或多个视频片段,每一个视频片段包括一个或多个视频帧。例如,每一个视频片段对应于单一的视频场景。例如,在一些示例中,将初始视频划分为两个视频片段,前一个视频片段对应的场景为教室内,后一个视频片段对应的场景为操场上。需要说明的是,本公开的实施例对具体场景不作限制,可以根据实际需求来设置。
例如,在本公开至少一个实施例中,对初始视频按照场景进行划分的处理可以采用各种算法,本公开的实施例对此不作限制,只要能实现视频的场景划分功能即可,可以根据实际情况设置。
例如,在本公开至少一个实施例中,对于步骤S101,将初始视频包括的多个视频帧划分为多个视频片段,包括:按照初始视频包括的多个视频帧的播放顺序,依次计算每个视频帧与前一个视频帧的相似度;以及基于计算得到的 每相邻两个视频帧的相似度,将初始视频划分为多个视频片段。
例如,在本公开至少一个实施例中,在按照初始视频包括的多个视频帧的播放顺序,依次计算每个视频帧与前一个视频帧的相似度之前,对初始视频中的每个初始视频帧进行降维处理,得到多个视频帧。通过降维处理,可以极大节约计算成本,提高效率。
例如,在本公开至少一个实施例中,依次计算每个视频帧与前一个视频帧的相似度,包括:对于多个视频帧中的每个视频帧,基于视频帧的图像数据的均值和前一个视频帧的图像数据的均值,视频帧的图像数据的标准差和前一个视频帧的图像数据的标准差,以及视频帧的图像数据和前一个视频帧的图像数据的协方差,确定视频帧与前一个视频帧的结构相似度;以及,基于视频帧和前一个视频帧的结构相似度,确定视频帧与前一个视频帧的相似度。
图3为本公开至少一个实施例提供的一种视频进行场景切分的流程图。
例如,如图3所示,在本公开至少一个实施例中,可以采用结构相似性(structural similarity,SSIM)算法来执行对初始视频的场景划分处理。例如,在一些示例中,应用SSIM算法来对视频内容场景进行切分。通常,SSIM用于衡量两张图像(视频帧)的结构相似性,是一种用于衡量两张图像(视频帧)相似程度的指标。SSIM的值越大,表示两张图像越相似。SSIM的取值范围为[0,1],计算公式如下等式(1)所示:
Figure PCTCN2022141522-appb-000002
上述等式(1)中x和y分别表示输入的两张图像(视频帧),SSIM(x,y)表示输入图像x和y之间的相似性,μ x表示输入图像x的图像数据的平均值,μ y表示输入图像y的图像数据的平均值,
Figure PCTCN2022141522-appb-000003
表示输入图像x的图像数据的标准差,
Figure PCTCN2022141522-appb-000004
表示输入图像y的图像数据的标准差,σ xy表示输入图像x和y的图像数据的协方差,L表示像素值的动态范围,例如对于8比特图像,L=255-0,对于8比特归一化图像,L=1.0-0,k 1=0.01,k 2=0.03。
例如,在本公开至少一个实施例中,按照初始视频的多个视频帧的播放顺序,可以设定当前时刻的视频帧(即当前视频帧)为上述等式(1)的输入图像x,设定当前时刻的前一时刻的视频帧(即当前视频帧的前一个视频帧)为上述等式(1)的输入图像y,通过上述等式(1)可以计算得到该相邻的两个 视频帧x和y之间的结构相似度。
例如,在本公开至少一个实施例中,设定阈值T=0.5,当相邻两帧(例如视频帧x和y)计算出的SSIM(x,y)≥T时,视为相邻两帧图像x和y属于同一场景。当SSIM(x,y)<T时,视为当前两帧图像y和x不属于同一场景,视频帧x为前一个场景的最后一帧,视频帧y为后一个场景的第一帧。
需要说明的是,阈值T的取值可以根据实际情况来设置,本公开的实施例对此不作限制。
例如,在本公开至少一个实施例中,由于SSIM算法不需要非常精确的像素信息,因此,在计算每个视频帧与前一个视频帧的相似度之前,对初始视频中的每个初始视频帧进行降维处理(例如下采样操作),之后再计算SSIM,如图3所示,从而能够极大的节省计算成本。
例如,在本公开至少一个实施例中,采用SSIM作为场景切分的算法,该SSIM算法计算简单,只需要连续两帧视频帧信息,能够实现对视频流实时的计算处理,并不需要对视频进行离线分析在进行场景切分。
例如,在本公开至少一个实施例中,对于步骤S102,包括:利用图像处理网络对初始视频帧进行参数分析以获得显示参数集。图像处理网络包括第一图像分析模块和第二图像分析模块。第一图像分析模块用于对初始视频帧进行特征提取;获得第一中间视频帧;第二图像特征分析模块用于对第一中间视频帧进行特征提取和尺度变换,以输出显示参数集。
例如,在本公开至少一个实施例中,初始视频帧是所属视频片段中按照视频播放顺序最靠前的一帧视频帧。这样,可以实时处理视频流数据,基于当前视频片段中最靠前的一帧视频帧来获取当前视频片段的显示参数集,然后通过该显示参数集来处理当前视频片段中的其他帧图像。这样,仅基于当前视频片段的第一帧(即按照播放顺序最靠前的视频帧)进行预处理,可以防止视频播放时出现帧间亮度、对比度等信息的闪烁。同一场景具有内容连续性,可以仅使用对应于当前场景的第一帧视频帧的信息,来调整对应当前场景的显示参数集合。
例如,在本公开至少一个实施例中,初始视频帧可以是所属视频片段中随机挑选的一帧视频帧,本公开的实施例对此不作限制,可以根据实际需求来设置。
例如,在本公开至少一个实施例中,图像处理网络包括第一图像分析模块和第二图像分析模块。第一图像分析模块用于对初始视频帧进行特征提取;获得第一中间视频帧,第二图像分析模块用于对第一中间视频帧进行特征提取和尺度变换,以输出显示参数集。例如,在一些示例中,第一图像分析模块包括第一卷积层、平均池化层、激活层和实列正则归一化层。第二图像分析模块包括第二卷积层和全局平均池化层。例如,在一些示例中,图像处理网络包括多个第一图像分析模块。
需要说明的是,在本公开的实施例中,“第一图像分析模块”和“第二图像分析模块”用于分别表示具有特定结构的图像分析模块,并不受限于特定的某一个或某一类图像分析模块,也不受限于特定的顺序,可以根据实际情况来设置。还需要说明的是,“第一卷积层”和“第二卷积层”用于分别表示具有特定卷积参数的卷积层,并不受限于特定的某一个或某一类卷积层,也不受限于特定的顺序,可以根据实际情况来设置。
图4为本公开至少一个实施例提供的一种图像处理网络的结构示意图。
例如,在本公开至少一个实施例中,图像处理网络可以是任意一种神经网络模型结构,也称为预处理模型。例如,如图4所示的网络模型结构。例如,如图4所示,图像处理网络(预处理模型)包括多个第一图像分析模块和一个第二图像分析模板。例如每一个第一图像分析模块包括第一卷积层Conv(k3f64)、平均池化层AvgPool、激活层ReLU和实列正则归一化层IN。第二图像分析模块包括第二卷积层Conv(k3f3)和全局平均池化层GlobalAvgPool。在图4所示的示例中,第一卷积层Conv(k3f64)中k3f64表示卷积核k=3,输出通道数量f=64,第二卷积层Conv(k3f3)中k3f3表示卷积核k=3,输出通道数量f=3。
例如,在图4所示的示例中,第二图像分析模块中的全局平均池化层GlobalAvgPool将第二卷积层Conv(k3f3)输出的特征图平均池化为显示参数集中的参数w1、w2和w3。
需要说明的是,图4所示的图像处理网络仅仅是一种示例。在本公开的实施例中,图像处理网络可以采用任意神经网络模型的架构,并不限于图4所示的模型架构,本公开的实施例对此不作限制。
还需要说明的是,图像处理网络中可以包括多个第一图像分析模块,例如 可以设置为3-6个,本公开的实施例对第一图像分析模块的数量不作限制。例如,在一个示例中,图像处理网络中可以包括4个第一图像分析模块。
例如,在本公开至少一个实施例中,通过将当前视频片段的初始视频帧(例如最靠前的视频帧)输入图像处理网络(预处理模型),获取对应当前视频片段的显示参数集。通过显示参数集,可以对当前视频片段进行预处理,以使得最终生成效果良好的HDR视频。
例如,在本公开至少一个实施例中,第一显示参数集包括第一显示参数w1、第二显示参数w2、第三显示参数w3。第一显示参数w1和第三显示参数w3用于调整视频帧的亮度并且第二显示参数w2用于调整视频帧的对比度。例如,在一些示例中,第一显示参数w1用于调整视频帧的整体亮度水平,第三显示参数w3用于局部调整(微调整)视频帧的亮度水平。例如,在一些示例中,当第一显示参数w1取值大于1时,可以整体提亮当前视频帧图像的亮度水平,当第一显示参数w1取值小于1时,整体降低当前视频帧图像的亮度水平。例如,在一些示例中,当第二显示参数w2取值大于1时,可以提高当前视频帧图像的对比度,当第二显示参数w2取值小于1时,可以降低当前视频帧图像的对比度。例如,在一些示例中,当第三显示参数w3取值大于0时,可以提高当前视频帧图像的亮度水平,当第三显示参数w3取值小于0时,可以降低当前视频帧图像的亮度水平。
需要说明的是,第一显示参数w1、第二显示参数w2或第三显示参数w3并不受限于特定的某一个或某一类显示参数,也不受限于特定的顺序。
还需要说明的是,在本公开的实施例中,显示参数集还可以包括其他的显示参数,例如用于调整色彩分量的显示参数等,本公开的实施例对此不作限制,可以根据实际情况来设置。
例如,在本公开至少一个实施例中,对于步骤S103,基于显示参数集对所属视频片段中的其它帧进行调整,获得中间视频片段,包括:每个视频片段中所有视频帧数据根据每个视频帧对应的显示参数集按照以下等式(2)进行调整。
例如,在本公开至少一个实施例中,对于划分后的视频片段中的视频帧数据执行预处理操作。例如,在一些示例中,对当前视频帧中每一帧,应用如下等式(2),从而获取相应的中间视频片段。
Figure PCTCN2022141522-appb-000005
在上述等式(2)中,X in表示输入帧,X out表示相应的输出帧。w1、w2、w3分别为第一显示参数、第二显示参数和第三显示参数。
例如,在本公开至少一个实施例中,通过显示参数集和上述等式(2),可以对某一场景的视频片段中每一帧视频帧进行预处理/调整,使得处理/调整后的视频帧满足后续HDR模型(本文中也称为视频处理网络)可输入的亮度、对比度范围。这样,使得可以通过单一的HDR模型来处理更多的场景视频。
例如,在本公开至少一个实施例中,步骤S103的操作对应于一种预处理操作或者预处理模型(图像处理网络)。例如,该预处理操作可以使用上述等式(2)来计算输出,也可以采用其他等式或算法来计算输出,本公开的实施例对此不作限制,可以根据实际需求来设置。又例如,该预处理模型可以是一种神经网络,本公开的实施例对此不作限制,可以根据实际需求来设置。
例如,在本公开至少一个实施例中,针对不同场景的不同视频片段,取出当前视频片段中任意一帧图像,例如当前视频片段中最靠前的一帧图像,输入至神经网络(例如,预处理模型),训练得到第一显示参数w1、第二显示参数w2、第三显示参数w3。
例如,本公开至少一个实施提供了一种图像处理网络(预处理模型)和视频处理网络(HDR模型)的训练方法。
例如,在一个实施例中,获取第一样本数据,该第一样本数据包括第一版本SDR图像和第一版本HDR图像;将第一版本SDR图像对应的第一版本HDR图像作为第一版本真值图像;将第一版本SDR图像输入视频处理网络得到与第一版本SDR图像对应的第一版本预测HDR图像;将第一版本预测HDR图像和第一版本真值图像输入第一损失函数,得到第一损失函数值;以及根据第一损失函数值,调整视频处理网络(即上述HDR模型)的模型参数;获取第二样本数据,该第二样本数据包括第二版本SDR图像和第二版本HDR图像;将第二版本SDR图像对应的第二版本HDR图像作为第二版本真值图像;将第二版本SDR图像输入图像处理网络(即上述预处理模型)和已经训练好的视频处理网络(即上述HDR模型)得到与第二版本SDR图像对应的第二版本预测HDR图像;固定视频处理网络参数;将第二版本预测HDR图像和第二版本真值图像输入第二损失函数,得到第二损失函数值;以及根据第 二损失函数值,调整图像处理网络的模型参数。
例如,在本公开至少一个实施例中,通过第一样本数据来调整HDR模型的模型参数,然后通过第二样本数据并且固定HDR模型的参数来调整图像处理网络(预处理模型)的模型参数。
需要说明的是,在本公开的实施例中,第一版本真值图像、第二版本真值图像和第三版本真值图像可以分别是第一版本SDR图像、第二版本SDR图像和第三版本SDR图像相对应的标准的/期望的HDR图像。例如,经过专业调色师处理过的HDR图像、满足客户/设计师需求的HDR图像等,本公开的实施例对此不作限制,可以根据实际需求来设置。
还需要说明的是,在本公开的实施例中,“第一样本数据”、“第二样本数据”和“第三样本数据”并不受限于特定的某一个或某一类样本数据,也不受限于特定的顺序,可以根据实际情况来设置。“第一版本SDR图像”和“第二版本SDR图像”和“第三版本SDR图像”并不受限于特定的某一个或某一类SDR图像,也不受限于特定的顺序,可以根据实际情况来设置。“第一版本HDR图像”和“第二版本HDR图像”和“第三版本HDR图像”并不受限于特定的某一个或某一类HDR图像,也不受限于特定的顺序,可以根据实际情况来设置。
还需要说明的是,在本公开的实施例中,第一损失函数和第二损失函数可以相同,也可以不同。第一损失函数和第二损失函数可以采用任意损失函数,例如平方损失函数、对数损失函数、指数损失函数等,本公开的实施例对此不作限制,可以根据实际情况来设置。
图5为本公开至少一个实施例提供的一种预处理模型的训练过程的示意图。例如,在本公开至少一个实施例中,如图5所示,将待处理的视频帧/视频帧X in输入至预处理模型,得到显示参数w1、w2和w3,并且基于上述等式(2)得到输出帧X out。将预处理后的输出帧X out输入至HDR模型,即对预处理后的输出帧X out应用HDR图像生成算法,最终生成输出帧Y out。将HDR模型输出的输出帧Y out和标准的对应HDR图像进行比较,例如通过损失函数计算。根据比较结果来调整/更新预处理模型输出的显示参数,例如第一显示参数w1、第二显示参数w2、第三显示参数w3。在多次训练迭代后,可以获取所期望的显示参数集合,例如,使得输出帧Y out接近标准的对应HDR图像的显示参数集合。
例如,在本公开至少一个实施例中,在对显示参数集进行训练期间,HDR模型的参数是固定参数,无需更新。即在显示参数训练期间,HDR图像生成算法中的参数保持恒定。例如,在一些示例中,HDR模型是已经训练完成的模型,只用于HDR图像的映射和调色处理。需要说明的是,本公开的实施例对HDR模型中的各个参数不作具体限制,可以根据实际情况来设置。
需要说明的是,在本公开的实施例中,标准的HDR图像是指满足期望的HDR图像,例如,经过专业调色师处理过的HDR图像、满足客户/设计师需求的HDR图像等,本公开的实施例对此不作限制,可以根据实际需求来设置。例如,在另一个实施例中,获取第三样本数据,该第三样本数据包括第三版本SDR图像和第三版本HDR图像;将第三版本SDR图像对应的第三版本HDR图像作为第三版本真值图像;将第三版本SDR图像输入图像处理网络和视频处理网络,得到与第三版本SDR图像对应的第三版本预测HDR图像;将第三版本预测HDR图像和第三版本真值图像输入第三损失函数,得到第三损失函数值;以及根据第三损失函数值,调整图像处理网络和视频处理网络的模型参数。
例如,在本公开至少一个实施例中,通过一组样本数据(第三样本数据)同时调整HDR模型和预处理模型的模型参数。
需要说明的是,在本公开的实施例中,第一损失函数、第二损失函数和第三损失函数可以相同,也可以彼此不同,本公开的实施例对此不作限制。第一损失函数、第二损失函数和第三损失函数可以采用任意损失函数,例如平方损失函数、对数损失函数、指数损失函数等,本公开的实施例对此不作限制,可以根据实际情况来设置。
图6为本公开至少一个实施例提供的另一种HDR视频生成方法的示意框图。
例如,在本公开至少一个实施例中,如图6所示,解码后的初始视频(即待处理视频)经过场景切分(例如,通过应用SSIM算法),输入至预处理模型。该预处理模型判断出视频已经进行了场景切换,当前为新场景的第一帧。该预处理模型对当前新场景的第一帧进行处理,得到显示参数w1、w2和w3,此后当前场景的所有帧均使用这一组显示参数(w1、w2和w3)进行调整/预处理,直到该场景结束的最后一帧,进入下一个场景,并重复该操作,从而预 处理/调整整个待处理的视频,即预处理输出视频。因此,在将视频输入HDR模型之前,通过对初始视频进行预处理,调整初始视频的亮度、对比度等,使得调整后的初始视频满足后续HDR模型的输入要求(例如,接近或落入后续HDR模型的可输入的亮度范围、对比度范围等),使得单一的HDR模型可以处理多种场景的复杂视频,有效提升生成的HDR视频的质量和效率。
例如,在本公开至少一个实施例中,初始视频中对应一个场景的视频片段采用一个显示参数集。例如,在一些示例中,初始视频中对应不同场景的不同视频片段采用不同的显示参数集,又例如,在另一些示例,一些对应不同场景的不同视频片段采用相同的显示参数集,本公开的实施例对此不作限制,可以根据实际需求来调整。
例如,在本公开的至少一个实施例中,HDR图像生成算法可以通过各种神经网络模型来实现,例如HDR模型。需要说明的是,本公开的实施例对HDR图像生成算法不作限制,对HDR模型的具体网络结构也不作限制,只要能生成HDR图像即可。
例如,在本公开至少一个实施例中,对于步骤S104,对中间视频片段进行高动态范围转换,获得高动态范围视频片段,包括:利用视频处理网络对中间视频片段进行高动态范围转换。视频处理网络包括基础网络和权重网络,基础网络用于对输入帧进行特征提取和特征重构,以获得高动态范围输出帧,权重网络用于对输入帧进行特征提取以获得特征矩阵参数,根据特征矩阵参数对基础网络进行信息矫正。
例如,在本公开的实施例中,基础网络可以为任意现有技术中能够实现SDR视频到HDR视频转换的深度学习网络。例如,残差网络(ResNet)、环形生成对抗网络(CycleGAN)和像素到像素生成网络(Pixel 2Pixel)是用于图像到图像的转换(image-to-image traslation)的算法模型。例如,高动态范围网络(High Dynamic Range Net,HDRNet)、条件序列图像修饰网络(ConditionalSequential Retouching Network,CSRNet)和自适应3D查找表(Adaptive 3D lookuptable,Ada-3DLUT)网络是用于图像修饰(photo retouching)的算法模型。又例如,深度超分辨联合逆色调映射方法(Deep super-resolution inverse tone-mapping,Deep SR-ITM)和超分辨联合逆色调映射生成对抗网络(GAN-Based Joint Super-Resolution and InverseTone-Mapping,JSI-GAN)是用 于SDR视频到HDR视频转换的算法模型。本公开的实施例对基础网络的具体结构不作限制,只要包含多个特征融合节点用于与权重信息融合,可以实现SDR视频与HDR视频之间的转换即可。
图7为本公开至少一个实施例提供的一种视频处理网络(HDR模型)的示意框图。
例如,在本公开的至少一个实施例中,如图7所示,一种视频处理网络(HDR模型)包括基础网络和权重网络。例如,在本公开的至少一个实施例中,基础网络包括特征提取网络和特征重构网络。例如,在一些示例中,如图7所示,特征提取网络包括多个提取子网络,例如,包括5个提取子网络。
图7所示的视频处理网络包括两个分支。右侧分支是基础网络,该分支是为了实现从SDR图像或LDR图像转换成HDR图像,即完成HDR图像的生成任务。例如,基础网络用于对输入帧进行特征提取和特征重构,以获得高动态范围输出帧。左侧分支是权重网络,该左侧分支是为了实现对基础网络的信息矫正。例如,权重网络用于对输入帧进行特征提取以获得特征矩阵参数,根据特征矩阵参数对基础网络进行信息矫正。
例如,在本公开至少一个实施例中,视频处理网络(HDR模型)可以只包括单个分支,例如,仅包括图7中的右侧分支,即基础网络,本公开的实施例对此不作限制,只要能实现HDR图像生成任务即可,可以根据实际情况来设置。
例如,在本公开至少一个实施例中,图7所示的基础网络采用UNET网络结构。通常,UNET网络结构拥有对称的两部分,前一半网络计算特征信息图像时会对其进行下采样操作,而为了确保其网络输出结果的尺寸等于输入,UNET网络结构的后一半会对特征信息进行上采样,这种上采样任务一般使用反卷积计算或利用线性插值完成。如此,输入图像经过编码与解码过程,即经过降维与抽象过程后又恢复出与输入相同的形式,完成回归任务。
例如,在本公开至少一个实施例中,基础网络包括至少一个信息调节节点,信息调节节点用于整合基础网络对输入帧的特征提取信息和权重网络的特征矩阵参数信息。例如,在本公开至少一个实施例中,基础网络包括第一信息调节节点、第二信息调节节点、第三信息调节节点、第四信息调节节点和第五信息调节节点。
例如,如图7所示,权重网络和基础网络相结合的节点
Figure PCTCN2022141522-appb-000006
即为上述信息调节节点,表示特征矩阵的点对点乘法,另一节点
Figure PCTCN2022141522-appb-000007
表示特征通道重排组合,也就是连接层。在图7所示的示例中,基础网络包括5个信息调节节点。例如,在一个示例中,图7中所示的5个信息调节节点
Figure PCTCN2022141522-appb-000008
按照从上到下的顺序,依次为第一信息调节节点、第二信息调节节点、第三信息调节节点、第四信息调节节点和第五信息调节节点。需要说明的是,本公开的实施例对信息调节节点的个数不作限制,可以根据实际情况来设置。
例如,在本公开至少一个实施例中,图7所示的HDR模型中的特征重构网络是为了将图像特征转换为视频帧信息输出。例如,在一些示例中,最简单的特征重构网络可以使用一层卷积层Conv,也可以是多层的卷积-激活函数Conv-ReLU的串联,或者残差网络的串联,只要能达到输出视频帧的目的即可,本公开的实施例对此不作限制,可以根据实际需求来设置。
例如,在本公开至少一个实施例中,图7所示的视频处理网络(HDR模型)中采用了权重网络。例如,视频中的场景繁多,有白天、夜晚、室内、室外、运动、静止、人物、动物等,为了使单一的HDR模型能够尽可能多的适应不同的场景下的视频帧信息,采用权重网络,充分利用当前视频帧的信息,对基础网络进行参数的信息矫正。
需要说明的是,图7所示的HDR模型架构仅仅是一种示例。在本公开的实施例中,HDR模型可以采用任意神经网络模型的架构,并不限于图7所示的模型架构,本公开的实施例对此不作限制。
图8为本公开至少一个实施例提供的另一种视频处理网络(HDR模型)的示意框图。
例如,在本公开的至少一个实施例中,如图8所示,除了基础网络采用UNET网络结构以外,HDR模型中的权重也采用UNET网络结构,从而实现了在不同尺度上的信息矫正。例如,在本公开至少一个实施例提供的方法中,权重网络包括至少一个特征矫正网络,至少一个特征矫正网络的相应输入图像的尺寸互不相同。
例如,在本公开的至少一个实施例中,如图8所示,权重网络包括第一特征矫正网络、第二特征矫正网络和第三特征矫正网络。将输入帧输出第一特征矫正网络,获得第一特征参数矩阵;将第一特征参数矩阵输入第三信息调节节 点;将第一特征参数矩阵与输入帧进行特征通道重排后输入第二特征矫正网络,以获得第二特征参数矩阵;将第二特征参数矩阵输入第二信息调节节点和第四信息调节节点;将第二特征参数矩阵与输入帧进行特征通道重排后输入第三特征矫正网络,以获得第三特征参数矩阵;将第三特征参数矩阵同时输入第一信息调节节点和第五信息调节节点。与图7类似,在图8所示的示例中,权重网络和基础网络相结合的节点
Figure PCTCN2022141522-appb-000009
即为上述信息调节节点,表示特征矩阵的点对点乘法,另一节点
Figure PCTCN2022141522-appb-000010
表示特征通道重排组合,也就是连接层。
例如,在图8所示的HDR模型中,输入帧分别经过一次4倍下采样、一次2倍下采样、以及不执行采样操作后输入至三个矫正网络(第一特征矫正网络、第二特征矫正网络和第三特征矫正网络),该三个矫正网络的相应输入图像的尺寸互不相同,从而第一特征矫正网络、第二特征矫正网络和第三特征矫正网络可以分别向基础网络提供不同尺寸的信息矫正。需要说明的是,在图7和图8所示的示例结构中,上采样和下采样分别表示2倍上采样和2倍下采样。
例如,在本公开的至少一个实施例中,如图8所示,输入帧的尺寸为64×64,在右侧分支(基础网络)上,由上到下的第一个提取子网络输出的图像尺寸为64×64。经过第一次2倍下采样后,由上到下的第二个提取子网络输出的图像尺寸为32×32。在经过第二次2倍下采样后,由上到下的第三个提取子网络输出的图像尺寸为16×16。经过第一次2倍上采样后由上到下的第四个提取子网络输出的图像尺寸为32×32。再经过第二次2倍上采样后,由上到下的第五个提取子网络输出的图像尺寸恢复到与输入帧的尺寸一致,即64×64。在左侧分支(权重网络)上,尺寸为64×64的输入帧经过4倍下采样后,得到图像尺寸为16×16,因此,图8中所示的由上到下的第一矫正网络可以提供尺寸为16×16的矫正。尺寸为64×64的输入帧经过2倍下采样后,得到图像尺寸为32×32,因此,图8中所示的由上到下的第二矫正网络可以提供尺寸为32×32的矫正。图8中所示的由上到下的第三矫正网络可以提供尺寸为64×64的矫正。例如,如图8所示,特征矫正网络输出的不同尺寸的信息矫正分别提供至特征提取网络中的具有相同尺寸的中间结果。例如,在图8所述的示例中,按由上到下的顺序,第一矫正网络输出的信息矫正(寸为16×16)提供至第三个提取子网络的输出(寸为16×16),第二矫正网络输出的信息矫 正(寸为32×32)提供至第二个和第四个提取子网络的输出(寸为32×32),第三矫正网络输出的信息矫正(寸为64×64)提供至第一个和第五个提取子网络的输出(寸为64×64)。如此,权重网络可以向基础网络提供不同尺寸(例如,16×16、32×32或64×64)的信息矫正。
例如,在本公开至少一个实施例中,图8所示的视频处理网络(HDR模型)中采用了权重网络。例如,视频中的场景繁多,有白天、夜晚、室内、室外、运动、静止、人物、动物等,为了使单一的HDR模型能够尽可能多的适应不同的场景下的视频帧信息,采用权重网络,充分利用当前视频帧的信息,对基础网络进行参数的信息矫正。
需要说明的是,图8所示的HDR模型架构仅仅是一种示例。在本公开的实施例中,HDR模型可以采用任意神经网络模型的架构,并不限于图8所示的模型架构,本公开的实施例对此不作限制。
图9A为本公开至少一个实施例提供的一种提取子网络的结构示意图,图9B为本公开至少一个实施例提供的一种残差网络ResNet的结构示意图。
例如,在本公开至少一个实施例中,如图9A所示,一个提取子网络包括卷积层Conv,激活层ReLU、多个残差网络ResNet等。需要说明的是,图9A所示的特征提取网络仅仅是一种示例。在本公开的实施例中,提取子网络可以采用任意合理的架构,并不限于图9A所示的模型架构,本公开的实施例对此不作限制,可以根据实际情况来设置。
例如,在本公开至少一个实施例中,残差网络ResNet的结构如图9B所示。例如,残差网络ResNet包括卷积层Conv、激活函数ReLU和卷积层Conv的叠加。需要说明的是,图9B所示的残差网络仅仅是一种示例。在本公开的实施例中,残差网络可以采用任意合理的架构,并不限于图9B所示的模型架构,本公开的实施例对此不作限制,可以根据实际情况来设置。
例如,在本公开的至少一个实施例中,权重网络包括至少一个特征矫正网络,特征矫正网络包括至少一个注意力模块。注意力模块采用双通道对输入信息进行特征提取,包括:利用第一通道对输入帧进行局部特征提取获得第一特征;利用第二通道对输入帧进行全局特征提取获得第二特征;融合第一特征和第二特征,获得输出信息。
图10A为本公开至少一个实施例提供的一种特征矫正网络的结构示意图, 图10B为本公开至少一个实施例提供的一种注意力模块的结构示意图。
例如,在本公开至少一个实施例中,如图10A所示,特征矫正网络包括卷积层Conv,激活层ReLU和至少一个注意力(CSA)模块。需要说明的是,图10A所示的特征矫正网络仅仅是一种示例。在本公开的实施例中,特征矫正网络可以采用任意合理的架构,并不限于图10A所示的模型架构,本公开的实施例对此不作限制,可以根据实际情况来设置。
例如,在本公开至少一个实施例中,如图10B所示,注意力(CSA)模块包括两个分支(例如第一通道和第二通道),一个分支(第一通道)包括类方差(cstd)模块、卷积层Conv、池化层Pooling、激活层ReLU、双线性函数(Bilinear)、Sigmoid函数,可以对输入帧进行局部特征提取获得第一特征。另一分支(第二通道)包括多个实列正则归一化层InsNorm、卷积层Conv、激活层ReLU,可以对输入帧进行全局特征提取获得第二特征。通过图10B中所示出的节点
Figure PCTCN2022141522-appb-000011
融合第一特征和第二特征,获得输出信息。
需要说明的是,图10B所示的注意力CSA模块仅仅是一种示例。在本公开的实施例中,CSA模块可以采用任意合理的架构,并不限于图10B所示的模型架构,本公开的实施例对此不作限制,可以根据实际情况来设置。
例如,在本公开至少一个实施例中,在图9A、9B、10A和10B所示的示例结构中,所出现的卷积层Conv采用Conv(k3f64s1),k3f64s1表示卷积核k=3,输出通道数量f=64,步长s=1。例如,图7和图8中所示的上采样可以实施为反卷积层DConv,例如,反卷积层DConv为DConv(k3f64s2),k3f64s2表示卷积核k=3,输出通道数量f=64,步长s=2,或者可以实施为插值算法,例如Bilinear/Bicubic(上采样2倍),以实现2倍上采样。例如,图7和图8中所示的下采样可以实施为池化pooling、函数Bilinear/Bicubic(下采样2倍),或者卷积层Conv(k3f64s2),k3f64s1表示卷积核k=3,输出通道数量f=64,步长s=2,以实现2倍下采样。需要说明的是,在本公开的实施例中,图7至图10B所述的网络结构均是示例性的,本公开的实施例对此不作限制,可以根据实际情况来调整。
需要说明的是,本公开的实施例并不限制执行“上采样”或“下采样”的具体实施方式,只要能够实现“上采样”或“下采样”即可。还需要说明的是,本公开的实施例并不限制上采样和下采样的具体倍数,可以根据实际需求来设置。
例如,在本公开至少一个实施中,如图10B所示,在注意力CSA模块中采用类方差cstd模块,可以有效利用当前视频帧的均值、方差等信息,从而实现更有效的信息矫正。
例如,在本公开至少一个实施中,类方差cstd模块根据当前视频帧的每个像素值以及当前视频帧的尺寸来提供矫正。如此,特征矫正网络能够有效利用当前视频帧的均值、方差等信息。
例如,在本公开至少一个实施中,类方差cstd模块利用如下等式(3)来计算输出:
Figure PCTCN2022141522-appb-000012
Figure PCTCN2022141522-appb-000013
在上述等式(3)中,x表示当前视频帧,u(x)表示当前视频帧的平均值,M表示当前视频帧的宽度,N表示当前视频帧的高度,x i,j表示当前视频帧中坐标为(i,j)的像素,Cstd(x)表示当前视频帧的相关性,O x表示对应于当前视频帧的输出帧。
例如,本公开至少一个实施例提供了一种预处理模型机制,对初始视频中的视频帧进行动态调整,使得单一的HDR模型能够处理更多的视频场景,有效提升生成HDR视频的质量和效率。例如,本公开至少一个实施例还提供了采用类方差cstd模块的HDR模型,使得HDR模型的特征矫正网络能够有效利用当前视频帧的均值、方差等信息。又例如,本公开至少一个实施例还提供了整体采用UNET网络结构的HDR模型,使得HDR模型的特征矫正网络能够提供不同尺寸上的信息矫正。
需要说明的是,在本公开的各个实施例中,视频处理方法10的各个步骤的执行顺序不受限制,虽然上文以特定顺序描述了各个步骤的执行过程,但这并不构成对本公开实施例的限制。视频处理方法10中的各个步骤可以串行执行或并行执行,这可以根据实际需求而定。例如,视频处理方法10还可以包括更多或更少的步骤,本公开的实施例对此不作限制。
图11为根据本公开至少一个实施例的一种视频处理装置的示意框图。
例如,在本公开至少一个实施例中,如图11所示,视频处理装置40包括划分模块401、获取模块402和处理模块403。
例如,在本公开至少一个实施例中,划分模块401配置为将初始视频包括的多个视频帧划分为多个视频片段,每个视频片段包括一个或多个视频帧,多个视频帧连续。例如,该划分模块401可以实现步骤S101,其具体实现方法可以参考步骤S101的相关描述,在此不再赘述。获取模块402配置为根据一个或多个视频帧中的任意一帧,确定其所属视频片段的显示参数集;根据显示参数集对所属视频片段中的其它帧进行调整,获得中间视频片段。例如,该获取模块402可以实现步骤S102和S103,其具体实现方法可以参考步骤S102和S103的相关描述,在此不再赘述。处理模块403被配置为对中间视频片段进行高动态范围转换,获得高动态范围视频片段;根据高动态范围视频片段生成高动态范围视频。例如,该处理模块403可以实现步骤S104和S105,其具体实现方法可以参考步骤S104和S105的相关描述,在此不再赘述。
需要说明的是,这些划分模块401、获取模块402和处理模块403可以通过软件、硬件、固件或它们的任意组合实现,例如,可以分别实现为划分电路401、获取电路402和处理电路403,本公开的实施例对它们的具体实施方式不作限制。
应当理解的是,本公开至少一个实施例提供的视频处理装置40可以实施前述视频处理方法10,也可以实现与前述视频处理方法10相似的技术效果。例如,本公开至少一个实施例提供的视频处理装置40通过预处理初始视频,对初始视频中的视频帧进行动态调整,使得单一的HDR模型能够处理更多的视频场景,有效提升生成HDR视频的质量和效率。例如,在本公开至少一个实施例提供的视频处理装置40中,HDR模型采用类方差cstd模块,使得HDR模型的特征矫正网络能够有效利用当前视频帧的均值、方差等信息。又例如,在本公开至少一个实施例提供的视频处理装置40中,HDR模型整体采用UNET网络结构,使得HDR模型的特征矫正网络能够提供不同尺寸上的信息矫正。
需要注意的是,在本公开的实施例中,该用于视频处理装置40可以包括更多或更少的电路或单元,并且各个电路或单元之间的连接关系不受限制,可 以根据实际需求而定。各个电路的具体构成方式不受限制,可以根据电路原理由模拟器件构成,也可以由数字芯片构成,或者以其他适用的方式构成。
图12是本公开至少一个实施例提供另一种视频处理装置的示意框图。
本公开至少一个实施例还提供了一种视频处理装置90。如图12所示,视频处理装置90包括处理器910和存储器920。存储器920包括一个或多个计算机程序模块921。一个或多个计算机程序模块921被存储在存储器920中并被配置为由处理器910执行,该一个或多个计算机程序模块921包括用于执行本公开的至少一个实施例提供的视频处理方法10的指令,其被处理器910执行时,可以执行本公开的至少一个实施例提供的视频处理方法10中的一个或多个步骤。存储器920和处理器910可以通过总线***和/或其它形式的连接机构(未示出)互连。
例如,处理器910可以是中央处理单元(CPU)、数字信号处理器(DSP)或者具有数据处理能力和/或程序执行能力的其它形式的处理单元,例如现场可编程门阵列(FPGA)等;例如,中央处理单元(CPU)可以为X86或ARM架构等。处理器910可以为通用处理器或专用处理器,可以控制视频处理装置90中的其它组件以执行期望的功能。
例如,存储器920可以包括一个或多个计算机程序产品的任意组合,计算机程序产品可以包括各种形式的计算机可读存储介质,例如易失性存储器和/或非易失性存储器。易失性存储器例如可以包括随机存取存储器(RAM)和/或高速缓冲存储器(cache)等。非易失性存储器例如可以包括只读存储器(ROM)、硬盘、可擦除可编程只读存储器(EPROM)、便携式紧致盘只读存储器(CD-ROM)、USB存储器、闪存等。在计算机可读存储介质上可以存储一个或多个计算机程序模块921,处理器910可以运行一个或多个计算机程序模块921,以实现视频处理装置90的各种功能。在计算机可读存储介质中还可以存储各种应用程序和各种数据以及应用程序使用和/或产生的各种数据等。视频处理装置90的具体功能和技术效果可以参考上文中关于视频处理方法10的描述,此处不再赘述。
图13为本公开至少一个实施例提供的又一种视频处理装置的示意框图。
本公开实施例中的终端设备可以包括但不限于诸如移动电话、笔记本电脑、数字广播接收器、PDA(个人数字助理)、PAD(平板电脑)、PMP(便携 式多媒体播放器)、车载终端(例如车载导航终端)等等的移动终端以及诸如数字TV、台式计算机等等的固定终端。图13示出的视频处理装置600仅仅是一个示例,不应对本公开实施例的功能和使用范围带来任何限制。
例如,如图13所示,在一些示例中,视频处理装置600包括处理装置(例如中央处理器、图形处理器等)601,其可以根据存储在只读存储器(ROM)602中的程序或者从存储装置608加载到随机访问存储器(RAM)603中的程序而执行各种适当的动作和处理。在RAM 603中,还存储有计算机***操作所需的各种程序和数据。处理装置601、ROM 602以及RAM 603通过总线604被此相连。输入/输出(I/O)接口605也连接至总线604。
例如,以下部件可以连接至I/O接口605:包括例如触摸屏、触摸板、键盘、鼠标、摄像头、麦克风、加速度计、陀螺仪等的输入装置606;包括诸如液晶显示器(LCD)、扬声器、振动器等的输出装置607;包括例如磁带、硬盘等的存储装置608;以及包括诸如LAN卡、调制解调器等的网络接口卡的通信装置609。通信装置609可以允许视频处理装置600与其他设备进行无线或有线通信以交换数据,经由诸如因特网的网络执行通信处理。驱动器610也根据需要连接至I/O接口605。可拆卸介质611,诸如磁盘、光盘、磁光盘、半导体存储器等等,根据需要安装在驱动器610上,以便于从其上读出的计算机程序根据需要被安装入存储装置608。虽然图13示出了包括各种装置的视频处理装置600,但是应理解的是,并不要求实施或包括所有示出的装置。可以替代地实施或包括更多或更少的装置。
例如,该视频处理装置600还可以进一步包括外设接口(图中未示出)等。该外设接口可以为各种类型的接口,例如为USB接口、闪电(lighting)接口等。该通信装置609可以通过无线通信来与网络和其他设备进行通信,该网络例如为因特网、内部网和/或诸如蜂窝电话网络之类的无线网络、无线局域网(LAN)和/或城域网(MAN)。无线通信可以使用多种通信标准、协议和技术中的任何一种,包括但不局限于全球移动通信***(GSM)、增强型数据GSM环境(EDGE)、宽带码分多址(W-CDMA)、码分多址(CDMA)、时分多址(TDMA)、蓝牙、Wi-Fi(例如基于IEEE 802.11a、IEEE 802.11b、IEEE 802.11g和/或IEEE 802.11n标准)、基于因特网协议的语音传输(VoIP)、Wi-MAX,用于电子邮件、即时消息传递和/或短消息服务(SMS)的协议,或任 何其他合适的通信协议。
例如,视频处理装置600可以为手机、平板电脑、笔记本电脑、电子书、游戏机、电视机、数码相框、导航仪等任何设备,也可以为任意的数据处理装置及硬件的组合,本公开的实施例对此不作限制。
例如,根据本公开的实施例,上文参考流程图描述的过程可以被实现为计算机软件程序。例如,本公开的实施例包括一种计算机程序产品,其包括承载在非暂态计算机可读介质上的计算机程序,该计算机程序包含用于执行流程图所示的方法的程序代码。在这样的实施例中,该计算机程序可以通过通信装置609从网络上被下载和安装,或者从存储装置608被安装,或者从ROM602被安装。在该计算机程序被处理装置601执行时,执行本公开实施例所公开的视频处理方法10。
需要说明的是,本公开上述的计算机可读介质可以是计算机可读信号介质或者计算机可读存储介质或者是上述两者的任意组合。计算机可读存储介质例如可以是——但不限于——电、磁、光、电磁、红外线、或半导体的***、装置或器件,或者任意以上的组合。计算机可读存储介质的更具体的例子可以包括但不限于:具有一个或多个导线的电连接、便携式计算机磁盘、硬盘、随机访问存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、光纤、便携式紧凑磁盘只读存储器(CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。在本公开的实施例中,计算机可读存储介质可以是任何包含或存储程序的有形介质,该程序可以被指令执行***、装置或者器件使用或者与其结合使用。而在本公开的实施例中,计算机可读信号介质可以包括在基带中或者作为载波一部分传播的数据信号,其中承载了计算机可读的程序代码。这种传播的数据信号可以采用多种形式,包括但不限于电磁信号、光信号或上述的任意合适的组合。计算机可读信号介质还可以是计算机可读存储介质以外的任何计算机可读介质,该计算机可读信号介质可以发送、传播或者传输用于由指令执行***、装置或者器件使用或者与其结合使用的程序。计算机可读介质上包含的程序代码可以用任何适当的介质传输,包括但不限于:电线、光缆、RF(射频)等等,或者上述的任意合适的组合。
上述计算机可读介质可以是上述视频处理装置600中所包含的;也可以 是单独存在,而未装配入该视频处理装置600中。
图14为本公开至少一个实施例提供的一种非瞬时可读存储介质的示意框图。
本公开的实施例还提供一种非瞬时可读存储介质。图14是根据本公开至少一个实施例的一种非瞬时可读存储介质的示意框图。如图14所示,非瞬时可读存储介质70上存储有计算机指令111,该计算机指令111被处理器执行时执行如上所述的视频处理方法10中的一个或多个步骤。
例如,该非瞬时可读存储介质70可以是一个或多个计算机可读存储介质的任意组合,例如,一个计算机可读存储介质包含用于将初始视频包括的多个视频帧划分为多个视频片段,每个视频片段包括一个或多个视频帧的计算机可读的程序代码,另一个计算机可读存储介质包含用于根据一个或多个视频帧中的任意一帧,确定其所属视频片段的显示参数集;根据显示参数集对所属视频片段中的其它帧进行调整,获得中间视频片段的计算机可读的程序代码,又一个计算机可读存储介质包含用于对中间视频片段进行高动态范围转换,获得高动态范围视频片段;根据高动态范围视频片段生成高动态范围视频的计算机可读的程序代码。。当然,上述各个程序代码也可以存储在同一个计算机可读介质中,本公开的实施例对此不作限制。
例如,当该程序代码由计算机读取时,计算机可以执行该计算机存储介质中存储的程序代码,执行例如本公开任一个实施例提供的视频处理方法10。
例如,存储介质可以包括智能电话的存储卡、平板电脑的存储部件、个人计算机的硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦除可编程只读存储器(EPROM)、便携式紧致盘只读存储器(CD-ROM)、闪存、或者上述存储介质的任意组合,也可以为其他适用的存储介质。例如,该可读存储介质也可以为图12中的存储器920,相关描述可以参考前述内容,此处不再赘述。
本公开的实施例还提供一种电子设备。图15是根据本公开至少一个实施例的一种电子设备的示意框图。如图15所示,该电子设备120可以包括如上所述的视频处理装置40/90/600。例如,该电子设备120可以实施本公开任一个实施例提供的视频处理方法10。
在本公开中,术语“多个”指两个或两个以上,除非另有明确的限定。
本领域技术人员在考虑说明书及实践这里公开的公开后,将容易想到本公开的其它实施方案。本公开旨在涵盖本公开的任何变型、用途或者适应性变化,这些变型、用途或者适应性变化遵循本公开的一般性原理并包括本公开未公开的本技术领域中的公知常识或惯用技术手段。说明书和实施例仅被视为示例性的,本公开的真正范围和精神由下面的权利要求指出。
应当理解的是,本公开并不局限于上面已经描述并在附图中示出的精确结构,并且可以在不脱离其范围进行各种修改和改变。本公开的范围仅由所附的权利要求来限制。

Claims (20)

  1. 一种视频处理方法,包括:
    将初始视频包括的多个视频帧划分为多个视频片段,每个所述视频片段包括一个或多个视频帧,其中,所述多个视频帧连续;
    根据所述一个或多个视频帧中的任意一帧,确定其所属视频片段的显示参数集;
    根据所述显示参数集对所述所属视频片段中的其它帧进行调整,获得中间视频片段;
    对所述中间视频片段进行高动态范围转换,获得高动态范围视频片段;
    根据所述高动态范围视频片段生成高动态范围视频。
  2. 根据权利要求1所述的方法,其中,所述显示参数集包括第一显示参数、第二显示参数、第三显示参数,所述第一显示参数和所述第三显示参数用于调整视频帧的亮度,所述第二显示参数用于调整所述视频帧的对比度。
  3. 根据权利要求2所述的方法,其中,所述第一显示参数用于调整所述视频帧的整体亮度水平,所述第三显示参数用于局部调整所述视频帧的亮度水平。
  4. 根据权利要求1-3中任一项所述的方法,其中,将初始视频包括的多个视频帧划分为多个视频片段,包括:
    按照所述初始视频包括的多个视频帧的播放顺序,依次计算每个视频帧与前一个视频帧的相似度;
    基于计算得到的每相邻两个视频帧的相似度,将所述初始视频划分为多个视频片段。
  5. 根据权利要求4所述的方法,其中,在按照所述初始视频包括的多个视频帧的播放顺序,依次计算每个视频帧与前一个视频帧的相似度之前,所述方法还包括:
    对所述初始视频中的每个初始视频帧进行降维处理,得到所述多个视频帧。
  6. 根据权利要求4或5所述的方法,其中,所述依次计算每个视频帧与前一个视频帧的相似度,包括:
    对于所述多个视频帧中的每个视频帧,基于所述视频帧的图像数据的均 值和前一个视频帧的图像数据的均值,所述视频帧的图像数据的标准差和所述前一个视频帧的图像数据的标准差,以及所述视频帧的图像数据和所述前一个视频帧的图像数据的协方差,确定所述视频帧与所述前一个视频帧的结构相似度;
    基于所述视频帧和所述前一个视频帧的结构相似度,确定所述视频帧与所述前一个视频帧的相似度。
  7. 根据权利要求1-6中任一项所述的方法,其中,根据所述一个或多个视频帧中的任意一帧,确定其所属视频片段的显示参数集,包括:
    利用图像处理网络对初始视频帧进行参数分析以获得所述显示参数集;
    所述图像处理网络包括第一图像分析模块和第二图像分析模块;
    所述第一图像分析模块用于对所述初始视频帧进行特征提取;获得第一中间视频帧;
    所述第二图像分析模块用于对所述第一中间视频帧进行特征提取和尺度变换,以输出所述显示参数集。
  8. 根据权利要求7所述的方法,其中,所述第一图像分析模块包括第一卷积层、平均池化层、激活层和实列正则归一化层;所述第二图像分析模块包括第二卷积层和全局平均池化层。
  9. 根据权利要求7或8所述的方法,其中,所述图像处理网络包括多个所述第一图像分析模块。
  10. 根据权利要求1-9中任一项所述的方法,其中,根据所述显示参数集对所述所属视频片段中的其它帧进行调整,获得中间视频片段,包括:
    每个视频片段中所有视频帧数据根据每个视频帧对应的显示参数集按照以下等式进行调整:
    Figure PCTCN2022141522-appb-100001
    其中X in表示输入帧,X out表示输出帧,w1、w2、w3分别为所述第一显示参数、所述第二显示参数和所述第三显示参数。
  11. 根据权利要求1-10中任一项所述的方法,其中,对所述中间视频片段进行高动态范围转换,获得高动态范围视频片段,包括:
    利用视频处理网络对所述中间视频片段进行高动态范围转换;
    所述视频处理网络包括基础网络和权重网络;
    所述基础网络用于对输入帧进行特征提取和特征重构,以获得高动态范 围输出帧;
    所述权重网络用于对输入帧进行特征提取以获得特征矩阵参数,根据所述特征矩阵参数对所述基础网络进行信息矫正。
  12. 根据权利要求11所述的方法,其中,所述基础网络包括至少一个信息调节节点,所述信息调节节点用于整合所述基础网络对所述输入帧的特征提取信息和所述权重网络的特征矩阵参数信息。
  13. 根据权利要求12所述的方法,其中,所述基础网络包括第一信息调节节点、第二信息调节节点、第三信息调节节点、第四信息调节节点和第五信息调节节点。
  14. 根据权利要求11-13中任一项所述的方法,其中,所述权重网络包括至少一个特征矫正网络,所述特征矫正网络包括至少一个注意力模块;
    所述注意力模块采用双通道对输入信息进行特征提取,包括:
    利用第一通道对所述输入帧进行局部特征提取获得第一特征;
    利用第二通道对所述输入帧进行全局特征提取获得第二特征;
    融合所述第一特征和所述第二特征,获得输出信息。
  15. 根据权利要求11-14中任一项所述的方法,其中,所述权重网络包括第一特征矫正网络、第二特征矫正网络和第三特征矫正网络,所述方法包括:
    将所述输入帧输入第一特征矫正网络,获得第一特征参数矩阵;
    将所述第一特征参数矩阵输入第三信息调节节点;
    将所述第一特征参数矩阵与所述输入帧进行特征通道重排后输入所述第二特征矫正网络,以获得第二特征参数矩阵;
    将所述第二特征参数矩阵输入第二信息调节节点和第四信息调节节点;
    将所述第二特征参数矩阵与所述输入帧进行特征通道重排后输入所述第三特征矫正网络,以获得第三特征参数矩阵;
    将所述第三特征参数矩阵输入第一信息调节节点和第五信息调节节点。
  16. 根据权利要求7-9中任一项所述的方法,其中,所述方法还包括:
    获取第一样本数据,其中,所述第一样本数据包括第一版本SDR图像和第一版本HDR图像;将所述第一版本SDR图像对应的第一版本HDR图像作为第一版本真值图像;
    将所述第一版本SDR图像输入视频处理网络得到与所述第一版本SDR图像对应的第一版本预测HDR图像;
    将所述第一版本预测HDR图像和所述第一版本真值图像输入第一损失函数,得到第一损失函数值;以及
    根据所述第一损失函数值,调整所述视频处理网络的模型参数;
    获取第二样本数据,其中,所述第二样本数据包括第二版本SDR图像和第二版本HDR图像;将所述第二版本SDR图像对应的第二版本HDR图像作为第二版本真值图像;
    将所述第二版本SDR图像输入所述图像处理网络和已经训练好的所述视频处理网络得到与所述第二版本SDR图像对应的第二版本预测HDR图像;
    固定所述视频处理网络的参数;
    将所述第二版本预测HDR图像和所述第二版本真值图像输入第二损失函数,得到第二损失函数值;以及
    根据所述第二损失函数值,调整所述图像处理网络的模型参数。
  17. 根据权利要求7-9和16中任一项所述的方法,还包括
    获取第三样本数据,其中,所述第三样本数据包括第三版本SDR图像和第三版本HDR图像;将所述第三版本SDR图像对应的第三版本HDR图像作为第三版本真值图像;
    将所述第三版本SDR图像输入所述图像处理网络和视频处理网络,得到与所述第三版本SDR图像对应的第三版本预测HDR图像;
    将所述第三版本预测HDR图像和所述第三版本真值图像输入第三损失函数,得到第三损失函数值;以及
    根据所述第三损失函数值,调整所述图像处理网络和所述视频处理网络的模型参数。
  18. 一种视频处理装置,包括:
    划分模块,被配置为将初始视频包括的多个视频帧划分为多个视频片段,每个所述视频片段包括一个或多个视频帧,其中,所述多个视频帧连续;
    获取模块,被配置为根据所述一个或多个视频帧中的任意一帧,确定其所属视频片段的显示参数集;根据所述显示参数集对所述所属视频片段中的其它帧进行调整,获得中间视频片段;
    处理模块,被配置为对所述中间视频片段进行高动态范围转换,获得高动态范围视频片段;根据所述高动态范围视频片段生成高动态范围视频。
  19. 一种视频处理装置,包括:
    处理器;
    存储器,包括一个或多个计算机程序模块;
    其中,所述一个或多个计算机程序模块被存储在所述存储器中并被配置为由所述处理器执行,所述一个或多个计算机程序模块包括用于执行权利要求1-17中任一项所述的视频处理方法的指令。
  20. 一种非瞬时可读存储介质,其上存储有计算机指令,其中,所述计算机指令被处理器执行时执行权利要求1-17中任一项所述的视频处理方法。
PCT/CN2022/141522 2022-12-23 2022-12-23 视频处理方法、视频处理装置和可读存储介质 WO2024130715A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/141522 WO2024130715A1 (zh) 2022-12-23 2022-12-23 视频处理方法、视频处理装置和可读存储介质

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/141522 WO2024130715A1 (zh) 2022-12-23 2022-12-23 视频处理方法、视频处理装置和可读存储介质

Publications (1)

Publication Number Publication Date
WO2024130715A1 true WO2024130715A1 (zh) 2024-06-27

Family

ID=91587605

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/141522 WO2024130715A1 (zh) 2022-12-23 2022-12-23 视频处理方法、视频处理装置和可读存储介质

Country Status (1)

Country Link
WO (1) WO2024130715A1 (zh)

Similar Documents

Publication Publication Date Title
CN108022212B (zh) 高分辨率图片生成方法、生成装置及存储介质
US20220222776A1 (en) Multi-Stage Multi-Reference Bootstrapping for Video Super-Resolution
CN110222758B (zh) 一种图像处理方法、装置、设备及存储介质
CN110189246B (zh) 图像风格化生成方法、装置及电子设备
CN112399120B (zh) 电子装置及其控制方法
JP6811796B2 (ja) 拡張現実アプリケーションのためのビデオにおけるリアルタイムオーバーレイ配置
RU2697928C1 (ru) Способ сверхразрешения изображения, имитирующего повышение детализации на основе оптической системы, выполняемый на мобильном устройстве, обладающем ограниченными ресурсами, и мобильное устройство, его реализующее
WO2021115242A1 (zh) 一种超分辨率图像处理方法以及相关装置
CN112686824A (zh) 图像校正方法、装置、电子设备和计算机可读介质
CN113962859B (zh) 一种全景图生成方法、装置、设备及介质
WO2023065604A1 (zh) 图像处理方法及装置
WO2022099710A1 (zh) 图像重建方法、电子设备和计算机可读存储介质
Xu et al. Exploiting raw images for real-scene super-resolution
CN114298900A (zh) 图像超分方法和电子设备
CN115375536A (zh) 图像处理方法及设备
CN114519667A (zh) 一种图像超分辨率重建方法及***
CN110211017B (zh) 图像处理方法、装置及电子设备
CN110958363A (zh) 图像处理方法及装置、计算机可读介质和电子设备
Zhang et al. Multi-scale-based joint super-resolution and inverse tone-mapping with data synthesis for UHD HDR video
CN111696034B (zh) 图像处理方法、装置及电子设备
CN117768774A (zh) 图像处理器、图像处理方法、拍摄装置和电子设备
CN113628115A (zh) 图像重建的处理方法、装置、电子设备和存储介质
CN115861891B (zh) 视频目标检测方法、装置、设备及介质
WO2024130715A1 (zh) 视频处理方法、视频处理装置和可读存储介质
CN115049572A (zh) 图像处理方法、装置、电子设备和计算机可读存储介质