CN115174839A - Video conversion method, device, equipment and storage medium - Google Patents

Video conversion method, device, equipment and storage medium

Info

Publication number
CN115174839A
CN115174839A (Application CN202210759825.6A)
Authority
CN
China
Prior art keywords
video
video frame
conversion
dynamic range
dimension information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210759825.6A
Other languages
Chinese (zh)
Inventor
张琦 (Zhang Qi)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210759825.6A priority Critical patent/CN115174839A/en
Publication of CN115174839A publication Critical patent/CN115174839A/en
Pending legal-status Critical Current

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00: Television systems
    • H04N7/01: Conversion of standards, e.g. involving analogue television standards or digital television standards processed at pixel level
    • H04N7/0117: Conversion of standards, e.g. involving analogue television standards or digital television standards processed at pixel level, involving conversion of the spatial resolution of the incoming video signal
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00: Television systems
    • H04N7/01: Conversion of standards, e.g. involving analogue television standards or digital television standards processed at pixel level
    • H04N7/0125: Conversion of standards, e.g. involving analogue television standards or digital television standards processed at pixel level, one of the standards being a high definition standard
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N9/00: Details of colour television systems
    • H04N9/77: Circuits for processing the brightness signal and the chrominance signal relative to each other, e.g. adjusting the phase of the brightness signal relative to the colour signal, correcting differential gain or differential phase

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Graphics (AREA)
  • Image Processing (AREA)

Abstract

The disclosure provides a video conversion method, apparatus, device, and storage medium, and relates to the field of artificial intelligence technology, in particular to the fields of deep learning, image processing, and computer vision, and can be applied to scenes such as video processing. The video conversion method comprises the following steps: acquiring a first video to be converted, and extracting a first video frame from the first video; performing spatial transformation, luminance mapping, and color gamut conversion on the first video frame to generate a second video frame corresponding to the first video frame; and generating a second video based on the second video frame; wherein one of the first video and the second video is a standard dynamic range video and the other is a high dynamic range video. The video conversion method provided by the embodiments of the disclosure can realize reversible conversion between standard dynamic range video and high dynamic range video, and simplifies the conversion process.

Description

Video conversion method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to the field of deep learning, image processing, and computer vision technologies, which may be applied to video processing and other scenes, and in particular, to a video conversion method, apparatus, device, storage medium, and computer program product.
Background
Today, with the rapid development of ultra-high-definition video technology, people's demand for ultra-high-definition video is increasing. However, high-quality ultra-high-definition content is still very scarce, and therefore existing high-definition or low-definition video resources need to be converted into ultra-high-definition video by technical means.
In the current video conversion scheme, the same neural network can only realize one-way conversion from an SDR (Standard Dynamic Range) video to an HDR (High Dynamic Range) video or from an HDR video to an SDR video, but cannot realize reversible conversion between the SDR video and the HDR video.
Disclosure of Invention
The present disclosure provides a video conversion method, apparatus, device, storage medium, and computer program product, which implement reversible conversion of SDR video and HDR video.
According to a first aspect of the present disclosure, there is provided a video conversion method, including:
acquiring a first video to be converted, and extracting a first video frame from the first video;
carrying out space transformation, brightness mapping and color gamut conversion on the first video frame to generate a second video frame corresponding to the first video frame;
generating a second video based on the second video frame;
wherein one of the first video and the second video is a standard dynamic range video and the other is a high dynamic range video.
According to a second aspect of the present disclosure, there is provided a video conversion apparatus including:
the device comprises an acquisition module, a conversion module and a conversion module, wherein the acquisition module is configured to acquire a first video to be converted and extract a first video frame from the first video;
the first generation module is configured to perform spatial transformation, brightness mapping and color gamut conversion on the first video frame and generate a second video frame corresponding to the first video frame;
a second generation module configured to generate a second video based on the second video frame;
wherein one of the first video and the second video is a standard dynamic range video and the other is a high dynamic range video.
According to a third aspect of the present disclosure, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method provided by the first aspect.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method as provided by the first aspect.
According to a fifth aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the method provided according to the first aspect.
It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present disclosure, nor are they intended to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 illustrates an exemplary system architecture to which the video conversion method of the present disclosure may be applied;
fig. 2 shows a flow chart of a first embodiment of a video conversion method according to the present disclosure;
fig. 3 shows a flow chart of a second embodiment of a video conversion method according to the present disclosure;
fig. 4 shows a schematic diagram of a convolution process of a gamut conversion process in a video conversion method according to the present disclosure;
fig. 5 shows a flow chart of a third embodiment of a video conversion method according to the present disclosure;
fig. 6 shows a schematic diagram of a reversible conversion model of a video conversion method according to the present disclosure;
FIG. 7 illustrates a schematic block diagram of one embodiment of a video conversion apparatus according to the present disclosure;
fig. 8 shows a block diagram of an electronic device for implementing a video conversion method according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of embodiments of the present disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It should be noted that, in the present disclosure, the embodiments and the features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the video conversion method or video conversion apparatus of the present disclosure may be applied.
As shown in fig. 1, system architecture 100 may include terminal device 101, network 102, and server 103. Network 102 is used to provide communication links between terminal devices 101 and server 103, and may include various types of connections, such as wired communication links, wireless communication links, or fiber optic cables, among others.
A user may use terminal device 101 to interact with server 103 over network 102 to receive or transmit information or the like. Various client applications may be installed on the terminal device 101.
The terminal apparatus 101 may be hardware or software. When the terminal device 101 is hardware, it can be various electronic devices including, but not limited to, a smart phone, a tablet computer, a laptop portable computer, a desktop computer, and the like. When the terminal device 101 is software, it can be installed in the electronic device described above. It may be implemented as a plurality of software or software modules or as a single software or software module. And is not particularly limited herein.
The server 103 may be hardware or software. When the server 103 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or may be implemented as a single server. When the server 103 is software, it may be implemented as a plurality of software or software modules (for example, to provide distributed services), or may be implemented as a single software or software module. And is not particularly limited herein.
The video conversion method provided by the embodiment of the present disclosure is generally executed by the server 103, and accordingly, the video conversion apparatus is generally disposed in the server 103.
It should be noted that the numbers of the terminal apparatus 101, the network 102, and the server 103 in fig. 1 are merely illustrative. There may be any number of terminal devices 101, networks 102, and servers 103, as desired for an implementation.
In the embodiment of the present disclosure, the video conversion method is executed by the server 103, and the converted video is sent to the terminal device 101 installed with the client, for example, the converted video is sent to the client installed with a corresponding video player or video playing application.
Fig. 2 shows a flow 200 of an embodiment of a video conversion method according to the present disclosure, which, referring to fig. 2, comprises the steps of:
s201, a first video to be converted is obtained, and a first video frame is extracted from the first video.
In the present embodiment, the execution subject of the video conversion method acquires a first video to be converted, and extracts a first video frame from the first video.
The manner in which the executing entity obtains the first video to be converted may be actively obtained from a user side, a network, or another server, for example, in response to a received link representing a storage location of the first video, actively obtaining from a storage location represented by the link; or passively received, for example, receiving a first video sent by a user.
In the embodiment of the present disclosure, the first video may be a standard dynamic range video or a high dynamic range video.
The executing entity may extract a plurality of first video frames from the first video at the same time, for example, extract a first video frame sequence including the plurality of first video frames from the first video, and then sequentially perform conversion processing on each first video frame in the first video frame sequence; or the first video frames can be extracted from the first video one by one in sequence and subjected to conversion processing. Wherein the content of different first video frames may be the same or different.
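For illustration, a minimal Python sketch of the frame extraction described above follows; it uses OpenCV's VideoCapture, which is an implementation choice of this sketch, not a library named by the disclosure.

```python
import cv2

# Illustrative helper, not part of the disclosure: decode the first video
# and return its frames as a list (the "frame sequence" strategy).
# Processing each frame inside the loop instead would give the one-by-one
# streaming strategy also described above.
def extract_frames(path):
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()    # decode the next first-video frame
        if not ok:
            break                 # end of stream
        frames.append(frame)
    cap.release()
    return frames
```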
S202, the first video frame is subjected to space transformation, brightness mapping and color gamut conversion, and a second video frame corresponding to the first video frame is generated.
In the present embodiment, the execution subject of the video conversion method performs spatial transformation, luminance mapping, and color gamut conversion on a first video frame, and then generates a second video frame corresponding to the first video frame.
It should be noted that, in the embodiment of the present disclosure, the processes of spatial transformation, luminance mapping and color gamut conversion are all reversible, that is, the corresponding second video frame can be generated by the first video frame through the processes, and the first video frame can also be generated reversely by the second video frame.
Note that the order in which the execution subject performs the spatial transformation, luminance mapping, and color gamut conversion on the first video frame need not follow the order of the textual description.
For example, if the first video frame is a standard dynamic range video frame and the second video frame is a high dynamic range video frame, the processing order of the execution main body on the first video frame may be spatial transformation, luminance mapping, and color gamut conversion in sequence, or may be spatial transformation, color gamut conversion, and luminance mapping in sequence.
For another example, if the first video frame is a high dynamic range video frame and the second video frame is a standard dynamic range video frame, the processing order of the execution main body on the first video frame may be color gamut conversion, luminance mapping, and spatial conversion in sequence, or may be luminance mapping, color gamut conversion, and spatial conversion in sequence.
In this embodiment, the adjustment of the spatial resolution of the first video frame is realized through spatial transformation, the luminance adjustment of the first video frame is realized through luminance mapping, and the color gamut of the first video frame is changed through color gamut conversion, so that the conversion of the first video frame to the corresponding second video frame is completed.
And S203, generating a second video based on the second video frame.
In the present embodiment, the execution subject of the video conversion method generates the second video based on the second video frame generated in step S202. Since the second video frame corresponds to the first video frame, the generated second video corresponds to the first video, thereby completing the conversion of the first video into the second video corresponding to the first video.
It should be noted that, in the video conversion method provided by the embodiment of the present disclosure, one of the first video and the second video is a standard dynamic range video (SDR video), and the other is a high dynamic range video (HDR video). For example, if the acquired first video is an SDR video, the second video generated after conversion is an HDR video; and if the acquired first video is the HDR video, the converted generated second video is the SDR video.
According to the video conversion method provided by the embodiment of the disclosure, reversible space transformation, brightness mapping and color gamut conversion are performed on a first video frame of a first video to be converted to generate a second video corresponding to the first video, so that the conversion precision and accuracy are improved; through the process, the reversible conversion between the standard dynamic range video and the high dynamic range video can be realized, and the conversion process is simplified.
Fig. 3 shows a flow 300 of a second embodiment of a video conversion method according to the present disclosure, and referring to fig. 3, in this embodiment, the first video is a standard dynamic range video (SDR video), and the second video is a high dynamic range video (HDR video), the video conversion method includes the following steps:
s301, a first video to be converted is obtained, and a first video frame is extracted from the first video.
In this embodiment, the main execution body of the video conversion method obtains a first video to be converted, and extracts a first video frame from the first video. The first video is an SDR video, and thus the first video frame is an SDR video frame.
Step S301 is substantially the same as step S201 in the embodiment shown in fig. 2, and the specific implementation manner may refer to the foregoing description of step S201, which is not described herein again.
S302, spatial dimension information of the first video frame is obtained.
In this embodiment, the executing body of the video conversion method acquires the spatial dimension information of the first video frame extracted in step S301, so as to complete the spatial transformation of the first video frame based on the spatial dimension information of the first video frame.
Illustratively, the spatial dimension information of the first video frame includes width information, height information, channel information, etc. of the features of the first video frame.
S303, generating channel dimension information of the first video frame based on the spatial dimension information of the first video frame.
In this embodiment, the execution subject of the video conversion method generates channel dimension information of the first video frame based on the spatial dimension information of the first video frame.
The number of channels of the features in the channel dimension information of the first video frame is an integral multiple of the number of channels of the features in the spatial dimension information of the first video frame.
According to the embodiment of the disclosure, the spatial dimension information of the first video frame is transformed, the number of channels of the characteristics of the first video frame is increased, the resolution of the first video frame is effectively improved, resolution support is provided for converting an SDR video frame into an HDR video frame, the resolution of the converted HDR video frame is ensured, and the conversion precision and accuracy are improved.
In some optional implementations of embodiments of the present disclosure, the spatial dimension information of the first video frame is downsampled to generate channel dimension information of the first video frame.
Illustratively, the transformation of the spatial dimension information of the first video frame into the channel dimension information may be implemented by a trained convolutional neural network having a spatial transformation function.
For example, suppose the spatial dimension information of the first video frame is $H \times W \times T$, where H is the feature height of the first video frame, W is the feature width of the first video frame, and T is the number of feature channels of the first video frame. The input of the convolutional neural network with the spatial transformation function is $y \in \mathbb{R}^{H \times W \times T}$, where $\mathbb{R}^{H \times W \times T}$ represents a space of size $H \times W \times T$, and the output is $y' \in \mathbb{R}^{(H/2) \times (W/2) \times 4T}$.
Exemplarily, $y'_{i,j,0} = y_{2i,2j}$, $y'_{i,j,1} = y_{2i,2j+1}$, $y'_{i,j,2} = y_{2i+1,2j}$, and $y'_{i,j,3} = y_{2i+1,2j+1}$, where $0 \le i < H/2$ and $0 \le j < W/2$, and $y'_{i,j,0}$ is the channel dimension information of the feature at height position i and width position j on channel 0.
Taking T = 1 feature channel in the spatial dimension information of the first video frame as an example, after the spatial dimension information of the first video frame is input to the convolutional neural network, the output channel dimension information has 4 channels, a feature width of W/2, and a feature height of H/2.
In the process of converting the spatial dimension information into the channel dimension information, the number of pixels of the first video frame is unchanged, the number of channels is increased to 4 times the original number, and the feature width and the feature height are both reduced to 1/2 of the original values. Therefore, through the transformation of the convolutional neural network, the resolution of the first video frame can be increased to 4 times the original resolution, achieving the technical effect of raising the resolution of the first video frame.
It should be noted that the spatial transformation process is a reversible process, that is, after the channel dimension information is generated according to the spatial dimension information of the first video frame, the spatial dimension information may also be reversely generated according to the channel dimension information.
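The index rearrangement defined above corresponds to the well-known space-to-depth (pixel unshuffle) operation. A minimal PyTorch sketch follows; the use of PyTorch is an assumption of this note, and its exact channel ordering may differ from the indexing above while performing the same reversible rearrangement.

```python
import torch
import torch.nn as nn

# Reversible spatial transform: H x W x T spatial dimension information
# becomes (H/2) x (W/2) x 4T channel dimension information, and back.
space_to_channel = nn.PixelUnshuffle(2)   # spatial -> channel dimensions
channel_to_space = nn.PixelShuffle(2)     # exact inverse

x = torch.randn(1, 1, 8, 8)               # T=1, H=8, W=8 (NCHW layout)
y = space_to_channel(x)                   # shape (1, 4, 4, 4): 4 channels
x_back = channel_to_space(y)              # recovers the input exactly
assert torch.equal(x, x_back)             # the transform is reversible
```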
S304, performing brightness mapping and color gamut conversion on the channel dimension information to generate a second video frame.
In the present embodiment, the execution subject of the video conversion method generates the second video frame by performing luminance mapping and color gamut conversion on the channel dimension information generated in step S303.
It should be noted that, in this embodiment, the order of performing the luminance mapping and the color gamut conversion on the channel dimension information may be arbitrarily performed, for example, the luminance mapping may be performed first, and then the color gamut conversion is performed, or the color gamut conversion may be performed first, and then the luminance mapping is performed.
Performing luminance mapping on the channel dimension information means performing feature extraction on the channel dimension information of the first video frame to simulate the imaging process of a static HDR image, for example, extracting luminance information from the channel dimension information. Illustratively, by extracting the luminance information in the channel dimension information, a plurality of SDR images with different exposure levels can be simulated from one SDR image, and the plurality of SDR images with different exposure levels are then fused into one HDR image, thereby realizing the conversion from the SDR video frame to the HDR video frame.
It is noted that the above-described luminance mapping process is a reversible process. That is, the luma mapping process may generate an HDR image through the feature extraction static simulation from one SDR image as described above, or may reversely statically simulate one SDR image through the corresponding feature decomposition from one HDR image.
In an alternative exemplary embodiment, the luminance mapping process described above may be implemented using a trained convolutional neural network with feature extraction. Illustratively, the convolutional neural network may have several convolutional layers for performing feature extraction by convolution, so as to ensure the feature extraction precision, i.e. ensure the luminance mapping precision. That is to say, the channel dimension information of the SDR video frame is input into the convolutional neural network with the feature extraction function, and the channel dimension information after luminance mapping can be output, so as to generate the HDR video frame.
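As a sketch only, such a feature-extraction subnetwork could be a few stacked convolutional layers; the depth, widths, and activation below are assumptions, and in the full model the reversibility comes from the coupling structure described next rather than from a plain stack like this.

```python
import torch.nn as nn

# Hypothetical luminance-mapping feature extractor over the channel
# dimension information (here 4 channels after the spatial transform).
def make_luminance_mapper(channels=4, depth=3):
    layers = []
    for _ in range(depth):
        layers += [nn.Conv2d(channels, channels, 3, padding=1),
                   nn.ReLU()]
    return nn.Sequential(*layers)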
The color gamut conversion of the channel dimension information is to realize a conversion process from the color gamut of the SDR video frame to the color gamut of the HDR video frame, that is, to realize the conversion from the BT709 color gamut to the BT2020 color gamut.
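For context (not part of the claimed method): the fixed linear-light conversion between these two gamuts is standardized in ITU-R BT.2087, and the learned gamut conversion can be seen as a data-driven generalization of applying such a per-pixel matrix.

```python
import numpy as np

# Standard linear-light RGB primary conversion, BT.709 -> BT.2020
# (ITU-R BT.2087). Each row sums to 1, so white maps to white.
M_709_TO_2020 = np.array([
    [0.6274, 0.3293, 0.0433],
    [0.0691, 0.9195, 0.0114],
    [0.0164, 0.0880, 0.8956],
])

rgb_709 = np.array([1.0, 0.0, 0.0])     # pure BT.709 red, linear light
rgb_2020 = M_709_TO_2020 @ rgb_709      # the same color in BT.2020 primaries
```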
Illustratively, the color gamut conversion is also implemented by a convolutional neural network having several convolutional layers, and is done in a layer-by-layer progressive manner.
Fig. 4 shows a schematic diagram of a convolution process 400 of the color gamut conversion process in an alternative embodiment. As shown in fig. 4, in this embodiment, let the channel dimension information after luminance mapping be $Q^{H \times W \times M}$, where M is the number of channels. In the color gamut conversion process, the channel dimension information may be divided into two groups: $s_a = Q^{H \times W \times [0:a]}$ and $s_b = Q^{H \times W \times [a:M]}$, where a is the channel split point and $0 \le a \le M$. Assuming that there are K mapping layers in total during luminance mapping, the relationship between the k-th layer and the (k+1)-th layer during the color gamut conversion can be formulated as follows:

$$s_b^{k+1} = s_b^{k} \odot \exp\left(\delta\left(\theta\left(s_a^{k}\right)\right)\right) + \mu\left(s_a^{k}\right)$$

$$s_a^{k+1} = s_a^{k} \odot \exp\left(\delta\left(\nu\left(s_b^{k+1}\right)\right)\right) + \lambda\left(s_b^{k+1}\right)$$

In the formulas, $s_a^{k}$ and $s_b^{k}$ are the channel dimension information of the k-th layer in the convolution process, and $s_a^{k+1}$ and $s_b^{k+1}$ are the channel dimension information of the (k+1)-th layer after convolution, with $k+1 \le K$.
For example, when calculating $s_b^{k+1}$, the $\theta$ convolutional layer convolves $s_a^{k}$, the exp function of the $\delta$ function of that result is multiplied with $s_b^{k}$, and the result of applying the $\mu$ convolutional layer to $s_a^{k}$ is added, where the $\delta$ function is an activation function.
After the layer-by-layer convolution, the output is the information obtained by connecting the convolution results of $s_a$ and $s_b$, which represents the channel dimension information of the second video frame, and accordingly the second video frame can be generated. For example, the two groups of convolution results may be combined by the concat function to generate the second video frame.
Note that, in this embodiment, the parameters of the convolutional layers are not shared. For example, when calculating $s_b^{k+1}$, the convolutional layers applied to $s_a^{k}$ are the $\theta$ convolutional layer and the $\mu$ convolutional layer; when calculating $s_a^{k+1}$, the convolutional layers applied to $s_b^{k+1}$ are the $\nu$ convolutional layer and the $\lambda$ convolutional layer.
In the embodiment of the present disclosure, the color gamut conversion process is a reversible process, i.e., the convolution process may be performed in reverse. That is, the channel dimension information of the first video frame may be generated after performing inverse color gamut conversion and inverse luminance mapping with respect to the second video frame.
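To make the coupling step above concrete, here is a minimal PyTorch sketch of one such reversible layer; the kernel sizes, the tanh choice for the activation $\delta$, and the channel widths are assumptions of the sketch, not values fixed by the disclosure.

```python
import torch
import torch.nn as nn

class CouplingLayer(nn.Module):
    # One reversible coupling step: theta/mu act on s_a, nu/lam on s_b,
    # with no parameter sharing among the four convolutional layers.
    def __init__(self, ch_a, ch_b):
        super().__init__()
        self.theta = nn.Conv2d(ch_a, ch_b, 3, padding=1)
        self.mu = nn.Conv2d(ch_a, ch_b, 3, padding=1)
        self.nu = nn.Conv2d(ch_b, ch_a, 3, padding=1)
        self.lam = nn.Conv2d(ch_b, ch_a, 3, padding=1)
        self.delta = nn.Tanh()  # activation; the disclosure does not fix it

    def forward(self, s_a, s_b):
        s_b = s_b * torch.exp(self.delta(self.theta(s_a))) + self.mu(s_a)
        s_a = s_a * torch.exp(self.delta(self.nu(s_b))) + self.lam(s_b)
        return s_a, s_b

    def inverse(self, s_a, s_b):
        # Undo the forward step in reverse order, dividing where it multiplied.
        s_a = (s_a - self.lam(s_b)) * torch.exp(-self.delta(self.nu(s_b)))
        s_b = (s_b - self.mu(s_a)) * torch.exp(-self.delta(self.theta(s_a)))
        return s_a, s_b
```

A round trip through forward and then inverse recovers the inputs up to floating-point error, which is what allows the same layers to serve both conversion directions.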
S305, generating a second video based on the second video frame.
In the present embodiment, the execution subject of the video conversion method generates the second video, which corresponds to the first video, based on the second video frame generated in step S304.
For example, in this embodiment, the first video is an SDR video, and the finally generated second video is an HDR video.
Step S305 is substantially the same as step S203 in the embodiment shown in fig. 2, and for a specific implementation, reference may be made to the foregoing description of step S203, which is not described herein again.
In the embodiment of the disclosure, the executing body generates channel dimension information from the extracted spatial dimension information of the SDR video frame to raise the resolution of the first video frame, then performs luminance mapping and color gamut conversion on the channel dimension information to raise the luminance of the first video frame and convert it from the BT709 color gamut to the BT2020 color gamut, thereby generating an HDR video frame. An HDR video is then generated from the generated HDR video frames, completing the conversion of the SDR video to the HDR video and improving the conversion precision and accuracy.
Fig. 5 shows an implementation flow 500 of an embodiment of the video conversion method of the present disclosure, in which the second video is a standard dynamic range video and the first video is a high dynamic range video. Referring to fig. 5, the video conversion method includes the steps of:
s501, a first video to be converted is obtained, and a first video frame is extracted from the first video.
In this embodiment, the executing body of the video conversion method obtains the first video to be converted, for example, obtains the first video from a server or a user terminal, and then extracts the first video frame from the first video, so as to process the first video frame and complete the conversion from the first video to the second video. Wherein the first video is an HDR video, and thus the first video frame is an HDR video frame.
Step S501 is substantially the same as step S201 in the embodiment shown in fig. 2, and the detailed implementation manner may refer to the foregoing description of step S201, which is not described herein again.
S502, performing brightness mapping and color gamut conversion on the first video frame to obtain channel dimension information of the first video frame.
In this embodiment, the executing body of the video conversion method performs luminance mapping and color gamut conversion on the first video frame extracted in step S501, so as to obtain channel dimension information of the first video frame.
The order of the luminance mapping and the color gamut conversion is not sequential, that is, the luminance mapping may be performed first and then the color gamut conversion may be performed, or the color gamut conversion may be performed first and then the luminance mapping may be performed.
The color gamut conversion of the first video frame is the process of converting the color gamut of the first video frame from the BT2020 color gamut to the BT709 color gamut. This process is the reverse execution of the color gamut conversion process in step S304 in the embodiment shown in fig. 3, i.e., the reverse of the convolution process 400 shown in fig. 4. That is, the convolution calculation is performed in the reverse of the direction indicated by the arrows in fig. 4, from $s_a^{k+1}, s_b^{k+1}$ back to $s_a^{k}, s_b^{k}$.
For the specific calculation process, reference may be made to the related description of the convolution process 400 shown in fig. 4, and details thereof are not repeated herein.
The luminance mapping is performed on the first video frame, which is the reverse execution process of the luminance mapping process in step S304 in the embodiment shown in fig. 3, that is, the luminance information of the first video frame is adjusted to the luminance information of the second video frame by means of feature decomposition. For example, the extraction and decomposition of luminance information may be performed by a convolutional neural network.
S503, generating spatial dimension information of the first video frame based on the channel dimension information of the first video frame.
In this embodiment, the executing body of the video conversion method generates the spatial dimension information of the first video frame based on the channel dimension information of the first video frame generated in step S502, so as to reduce the spatial resolution of the first video frame and improve the conversion accuracy and precision.
The execution process of step S503 can refer to the description in step S302 of the embodiment shown in fig. 3, and is the reverse process of step S302, i.e. the output in step S302 corresponds to the input in step S503, and the input in step S302 corresponds to the output in step S503.
That is to say, in the embodiment of the present disclosure, the channel dimension information with 4T channels, a feature width of W/2, and a feature height of H/2 can be converted into spatial dimension information with T channels, a feature width of W, and a feature height of H, so that the spatial resolution of the first video frame is reduced and the conversion from the spatial resolution of the HDR video frame to the spatial resolution of the SDR video frame is completed.
S504, generating a second video frame according to the space dimension information of the first video frame.
In the present embodiment, the execution subject of the video conversion method generates the second video frame from the spatial dimension information of the first video frame generated in step S503.
The execution main body can directly generate the corresponding SDR video frame according to the spatial dimension information after the spatial resolution is converted into the spatial resolution of the SDR video frame.
And S505, generating a second video based on the second video frame.
In the present embodiment, the executing body of the video conversion method generates, based on the second video frame generated in step S504, a second video corresponding to the first video, that is, an SDR video corresponding to the HDR video.
Step S505 is substantially the same as step S203 in the embodiment shown in fig. 2, and the specific implementation manner may refer to the foregoing description of step S203, which is not described herein again.
In the embodiment of the present disclosure, the HDR video frame is subjected to luminance mapping, color gamut conversion and spatial transformation to generate an SDR video frame, and accordingly, an SDR video is generated, so that conversion from the HDR video to the SDR video is completed, which is the reverse process of the embodiment shown in fig. 3. Therefore, the video conversion method provided by the embodiment of the disclosure can complete the interconversion between the SDR video and the HDR video, i.e. can perform the reversible conversion between the SDR video and the HDR video, improve the conversion precision and accuracy, and simplify the conversion process.
In an optional embodiment, performing spatial transformation, luminance mapping, and color gamut conversion on the first video frame to generate a second video frame corresponding to the first video frame may be implemented by a pre-trained convolutional neural network model. Illustratively, the convolutional neural network model may be a reversible conversion model. In the implementation process, the first video frame is input into the pre-trained reversible conversion model, which then generates the corresponding second video frame.
Fig. 6 shows a schematic diagram of a reversible conversion model 600 in the video conversion method of the present disclosure, and referring to fig. 6, the reversible conversion model 600 includes a spatial conversion module 601, a luminance mapping module 602, and a color gamut conversion module 603. The spatial transform module 601, the luminance mapping module 602, and the color gamut conversion module 603 are all modules that can be executed in reverse. That is, the respective outputs of the spatial transform module 601, the luminance mapping module 602, and the color gamut conversion module 603 may perform input, the inputs may perform output, and the specific implementation thereof may be reversible.
After the first video frame is input into the pre-trained reversible conversion model 600, the reversible conversion model 600 performs the following process:
in response to the first video being a standard dynamic range video (SDR video), the spatial dimension information of the first video frame is transformed into channel dimension information by the spatial transform module 601, and the channel dimension information is luminance-mapped and color-gamut-converted by the luminance mapping module 602 and the color-gamut conversion module 603, generating a second video frame; and the execution process is executed under supervision of the first loss function;
in response to that the first video is a high dynamic range video (HDR video), luminance mapping and color gamut conversion are performed on the first video frame through the luminance mapping module 602 and the color gamut conversion module 603 to obtain channel dimension information of the first video frame, and the channel dimension information of the first video frame is converted into spatial dimension information through the spatial conversion module 601 to generate a second video frame, where this execution process is executed under supervision of a second loss function.
It should be noted that, during execution, the spatial transform module 601 is executed in the order of the directions indicated by the arrows in the figure, while the color gamut conversion module 603 and the luminance mapping module 602 may be executed either in the order indicated by the arrows or in a different order.
In the embodiment of the disclosure, the reversible conversion between the SDR video frame and the HDR video frame can be realized through the reversible conversion model, that is, the same reversible conversion model can be used to convert the SDR video frame into the HDR video frame and also convert the HDR video frame into the SDR video frame, thereby realizing the reversible conversion between the SDR video and the HDR video and simplifying the conversion process.
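Putting the pieces together, a hypothetical composition of the three modules of fig. 6 might look as follows; it reuses the CouplingLayer sketch above, and the module names, layer count, and channel split are assumptions of this note, not values given by the disclosure.

```python
import torch
import torch.nn as nn

class ReversibleConversionModel(nn.Module):
    # Illustrative composition only: the spatial transform module plus a
    # stack of coupling layers standing in for luminance mapping and
    # color gamut conversion.
    def __init__(self, n_coupling=4, channels=4):
        super().__init__()
        self.space_to_channel = nn.PixelUnshuffle(2)
        self.channel_to_space = nn.PixelShuffle(2)
        half = channels // 2
        self.couplings = nn.ModuleList(
            [CouplingLayer(half, channels - half) for _ in range(n_coupling)]
        )

    def sdr_to_hdr(self, x):
        x = self.space_to_channel(x)             # spatial -> channel dims
        s_a, s_b = x.chunk(2, dim=1)             # split into the two groups
        for layer in self.couplings:
            s_a, s_b = layer(s_a, s_b)           # forward coupling steps
        return torch.cat([s_a, s_b], dim=1)

    def hdr_to_sdr(self, y):
        s_a, s_b = y.chunk(2, dim=1)
        for layer in reversed(self.couplings):
            s_a, s_b = layer.inverse(s_a, s_b)   # inverse coupling steps
        return self.channel_to_space(torch.cat([s_a, s_b], dim=1))
```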
In this embodiment, the above-mentioned reversible transformation model is obtained by training as follows:
acquiring sample video frames, wherein the sample video frames comprise a plurality of standard dynamic range video frames and a plurality of high dynamic range video frames, and the plurality of high dynamic range video frames correspond to the plurality of standard dynamic range video frames one to one;
taking a plurality of standard dynamic range video frames as input samples, taking a plurality of high dynamic range video frames as output samples, taking a space transformation module as a model inlet, and training a reversible transformation model;
taking a plurality of high dynamic range video frames as input samples, taking a plurality of standard dynamic range video frames as output samples, taking a space transformation module as a model outlet, and training a reversible transformation model;
and responding to the loss function of the reversible conversion model to meet the preset condition, and obtaining the trained reversible conversion model.
Illustratively, the loss function of the reversible conversion model comprises a first loss function and a second loss function, wherein the first loss function is used for convergence of the training process when a plurality of standard dynamic range video frames are used as input samples; the second penalty function is used for convergence of the training process when a plurality of high dynamic range video frames are used as input samples.
In the embodiment of the present disclosure, the first loss function and the second loss function respectively converge the training processes in different conversion directions, so that the conversion precision and the conversion effect of the trained reversible conversion model in the two conversion directions can be effectively ensured.
In an optional implementation manner of the embodiment of the present disclosure, both the first loss function and the second loss function adopt an L2 norm, which effectively prevents overfitting during training and improves the generalization capability of the reversible conversion model.
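A sketch of the two supervision terms, assuming the L2 losses are implemented as mean squared error and reusing the method names from the model sketch above (both assumptions of this note):

```python
import torch.nn.functional as F

def training_losses(model, sdr_frames, hdr_frames):
    # One-to-one paired samples, as in the training description above.
    pred_hdr = model.sdr_to_hdr(sdr_frames)        # spatial module as entry
    loss_1 = F.mse_loss(pred_hdr, hdr_frames)      # first loss function (L2)
    pred_sdr = model.hdr_to_sdr(hdr_frames)        # spatial module as exit
    loss_2 = F.mse_loss(pred_sdr, sdr_frames)      # second loss function (L2)
    return loss_1 + loss_2
```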
The training method disclosed by the embodiment of the disclosure is adopted to train the reversible conversion model, so that the fitting precision and the conversion effect of the reversible conversion model are effectively ensured, the conversion effect of the reversible conversion model on an SDR video frame or an HDR video frame is ensured, and the conversion effect between the SDR video and the HDR video is further ensured.
As an implementation of the methods illustrated in the above figures, fig. 7 illustrates one embodiment of a video conversion device according to the present disclosure. The video conversion apparatus corresponds to the method embodiment shown in fig. 2, and the apparatus can be applied to various electronic devices.
Referring to fig. 7, the video conversion apparatus 700 includes: an acquisition module 701, a first generation module 702, and a second generation module 703. Wherein the content of the first and second substances,
the obtaining module 701 is configured to obtain a first video to be converted and extract a first video frame from the first video; the first generation module 702 is configured to perform spatial transformation, luminance mapping and color gamut conversion on the first video frame to generate a second video frame corresponding to the first video frame; the second generating module 703 is configured to generate a second video based on the second video frame. Wherein one of the first video and the second video is a standard dynamic range video and the other is a high dynamic range video.
In this embodiment, in the video conversion apparatus 700, the detailed processing of the obtaining module 701, the first generating module 702, and the second generating module 703 and the technical effects thereof can refer to the related descriptions of steps S201 to S203 in the corresponding embodiment of fig. 2, which are not repeated herein.
In some optional implementation manners of the embodiment of the present disclosure, if the first video is a standard dynamic range video, the second video is a high dynamic range video, and the first generating module includes a first obtaining submodule, a first generating submodule, and a second generating submodule. The first obtaining submodule is configured to obtain spatial dimension information of a first video frame; the first generation submodule is configured to generate channel dimension information of the first video frame based on the spatial dimension information of the first video frame; the second generation submodule is configured to perform luminance mapping and color gamut conversion on the channel dimension information to generate a second video frame.
In this embodiment, in the video conversion apparatus 700, specific processing of the first obtaining sub-module, the first generating sub-module, and the second generating sub-module and technical effects thereof may refer to the related descriptions of steps S302-S304 in the embodiment corresponding to fig. 3, and are not described herein again.
In some optional implementations of embodiments of the present disclosure, the first generation sub-module is configured to down-sample the spatial dimension information of the first video frame, and generate the channel dimension information of the first video frame.
In some optional implementation manners of the embodiment of the present disclosure, if the second video is a standard dynamic range video, the first video is a high dynamic range video, and the first generating module includes a first obtaining submodule, a third generating submodule, and a fourth generating submodule. The first obtaining submodule is configured to perform brightness mapping and color gamut conversion on the first video frame to obtain channel dimension information of the first video frame; the third generation submodule is configured to generate spatial dimension information of the first video frame based on the channel dimension information of the first video frame; the fourth generation submodule is configured to generate the second video frame according to the spatial dimension information of the first video frame.
In the video conversion apparatus 700 of this embodiment, specific processing of the first obtaining sub-module, the third generating sub-module, and the fourth generating sub-module and technical effects thereof may refer to related descriptions of steps S502 to S504 in the embodiment corresponding to fig. 5, and are not described herein again.
In some optional implementations of embodiments of the present disclosure, the first generation module is configured to input the first video frame into a pre-trained reversible conversion model, wherein the reversible conversion model comprises a spatial transformation module, a luminance mapping module, and a color gamut conversion module; responding to the fact that the first video is a standard dynamic range video, converting space dimension information of a first video frame into channel dimension information through a space conversion module, and performing brightness mapping and color gamut conversion on the channel dimension information through a brightness mapping module and a color gamut conversion module to generate a second video frame; and responding to the fact that the first video is the high dynamic range video, performing brightness mapping and color gamut conversion on the first video frame through a brightness mapping module and a color gamut conversion module to obtain channel dimension information of the first video frame, converting the channel dimension information of the first video frame into space dimension information through a space conversion module, and generating a second video frame.
In the video conversion apparatus of this embodiment, the detailed processing of the first generating module and the technical effects thereof can refer to the relevant description in the embodiment corresponding to fig. 6, and are not repeated herein.
In some optional implementations of the embodiments of the present disclosure, the reversible transformation model is obtained by training as follows:
obtaining a sample video frame, wherein the sample video frame comprises a plurality of standard dynamic range video frames and a plurality of high dynamic range video frames, and the plurality of high dynamic range video frames correspond to the plurality of standard dynamic range video frames one to one;
taking a plurality of standard dynamic range video frames as input samples, taking a plurality of high dynamic range video frames as output samples, taking a space transformation module as a model inlet, and training a reversible transformation model;
taking a plurality of high dynamic range video frames as input samples, taking a plurality of standard dynamic range video frames as output samples, taking a space transformation module as a model outlet, and training a reversible transformation model;
and responding to the loss function of the reversible conversion model to meet the preset condition, and obtaining the trained reversible conversion model.
The present disclosure also provides an electronic device, a non-transitory computer-readable storage medium storing computer instructions, and a computer program product, in accordance with embodiments of the present disclosure.
The electronic device includes: at least one processor; and a memory communicatively coupled to the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the video conversion method described above.
In some embodiments, a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the above-described video conversion method.
In some embodiments, a computer program product comprises a computer program which, when executed by a processor, implements the video conversion method described above.
FIG. 8 illustrates a schematic block diagram of an example electronic device 800 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital processors, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 802 or a computer program loaded from a storage unit 808 into a random access memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
A number of components in the device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard, a mouse, or the like; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, or the like; and a communication unit 809 such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be any of various general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The computing unit 801 executes the respective methods and processes described above, such as the video conversion method. For example, in some embodiments, the video conversion method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program can be loaded and/or installed onto the device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the video conversion method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the video conversion method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, field Programmable Gate Arrays (FPGAs), application Specific Integrated Circuits (ASICs), application Specific Standard Products (ASSPs), system on a chip (SOCs), load programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved; no limitation is imposed herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims (15)

1. A video conversion method, comprising:
acquiring a first video to be converted, and extracting a first video frame from the first video;
performing spatial transformation, brightness mapping and color gamut conversion on the first video frame to generate a second video frame corresponding to the first video frame;
generating a second video based on the second video frame;
wherein one of the first video and the second video is a standard dynamic range video and the other is a high dynamic range video.
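For orientation, the following sketch (not part of the claims) shows the outer loop of claim 1 in Python: extract frames from the first video, convert each frame, and assemble the second video. It assumes OpenCV for video I/O; the `convert_frame` callable is hypothetical and stands in for the spatial transformation, brightness mapping and color gamut conversion elaborated in claims 2 to 6. OpenCV's default 8-bit frames would not suffice for a real HDR pipeline, so this is structural only.

```python
import cv2

def convert_video(src_path: str, dst_path: str, convert_frame) -> None:
    """Read frames from the first video, convert each one, and write the
    second video. `convert_frame` is a hypothetical per-frame SDR <-> HDR
    conversion callable."""
    reader = cv2.VideoCapture(src_path)
    fps = reader.get(cv2.CAP_PROP_FPS)
    writer = None
    ok, frame = reader.read()
    while ok:
        out = convert_frame(frame)
        if writer is None:  # lazily sized from the first converted frame
            h, w = out.shape[:2]
            writer = cv2.VideoWriter(dst_path, cv2.VideoWriter_fourcc(*"mp4v"),
                                     fps, (w, h))
        writer.write(out)
        ok, frame = reader.read()
    reader.release()
    if writer is not None:
        writer.release()
```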
2. The video conversion method according to claim 1, wherein in a case where the first video is a standard dynamic range video, the second video is a high dynamic range video;
the performing spatial transformation, brightness mapping and color gamut conversion on the first video frame to generate a second video frame corresponding to the first video frame comprises:
acquiring spatial dimension information of the first video frame;
generating channel dimension information of the first video frame based on the spatial dimension information of the first video frame;
and performing brightness mapping and color gamut conversion on the channel dimension information to generate the second video frame.
3. The video conversion method of claim 2, wherein the generating channel dimension information of the first video frame based on the spatial dimension information of the first video frame comprises:
downsampling the spatial dimension information of the first video frame to generate the channel dimension information of the first video frame.
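One plausible reading of this downsampling, given that the overall conversion is reversible, is a space-to-depth rearrangement: pixel values are regrouped from the spatial dimensions into the channel dimension without discarding anything, so the step can be inverted exactly. A minimal sketch with PyTorch's pixel_unshuffle, using an illustrative frame shape and an assumed downscale factor of 2:

```python
import torch
import torch.nn.functional as F

# A 1080p RGB frame as an N x C x H x W tensor (shapes are illustrative).
frame = torch.randn(1, 3, 1080, 1920)

# Space-to-depth: each 2x2 pixel block moves into the channel dimension.
channels = F.pixel_unshuffle(frame, downscale_factor=2)
print(channels.shape)  # torch.Size([1, 12, 540, 960])
```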
4. The video conversion method according to claim 1, wherein in a case where the second video is a standard dynamic range video, the first video is a high dynamic range video;
the performing spatial transformation, brightness mapping and color gamut conversion on the first video frame to generate a second video frame corresponding to the first video frame comprises:
performing brightness mapping and color gamut conversion on the first video frame to obtain channel dimension information of the first video frame;
generating spatial dimension information of the first video frame based on channel dimension information of the first video frame;
and generating the second video frame according to the spatial dimension information of the first video frame.
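The opposite direction can be sketched with the matching depth-to-space operation, pixel_shuffle, which rearranges the channel dimension information back into spatial dimensions and exactly inverts the step above; the shapes are again illustrative:

```python
import torch
import torch.nn.functional as F

# Channel dimension information produced by brightness mapping and color
# gamut conversion (shape matches the space-to-depth output above).
channels = torch.randn(1, 12, 540, 960)

# Depth-to-space: channels are rearranged back into spatial dimensions.
spatial = F.pixel_shuffle(channels, upscale_factor=2)
print(spatial.shape)  # torch.Size([1, 3, 1080, 1920])
```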
5. The video conversion method according to claim 1, wherein the performing spatial transformation, brightness mapping and color gamut conversion on the first video frame to generate a second video frame corresponding to the first video frame comprises:
inputting the first video frame into a pre-trained reversible conversion model, wherein the reversible conversion model comprises a spatial transformation module, a brightness mapping module and a color gamut conversion module;
in response to the first video being a standard dynamic range video, converting the spatial dimension information of the first video frame into channel dimension information through the spatial transformation module, and performing brightness mapping and color gamut conversion on the channel dimension information through the brightness mapping module and the color gamut conversion module to generate the second video frame;
and in response to the first video being a high dynamic range video, performing brightness mapping and color gamut conversion on the first video frame through the brightness mapping module and the color gamut conversion module to obtain channel dimension information of the first video frame, converting the channel dimension information of the first video frame into spatial dimension information through the spatial transformation module, and generating the second video frame.
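A skeleton of how the three modules might be wired, under stated assumptions: the spatial transformation module is realized as pixel_unshuffle/pixel_shuffle, while the brightness mapping and color gamut conversion modules, whose internals the claim does not specify, are stood in for by identity-initialized learnable 1x1 channel-mixing matrices, chosen only so that the model stays exactly invertible:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReversibleConverter(nn.Module):
    """Skeleton of the claimed reversible conversion model; the luma and
    gamut blocks are illustrative placeholders, not the patented design."""

    def __init__(self, factor: int = 2):
        super().__init__()
        self.factor = factor
        c = 3 * factor * factor
        self.luma = nn.Parameter(torch.eye(c))   # stands in for brightness mapping
        self.gamut = nn.Parameter(torch.eye(c))  # stands in for gamut conversion

    def _mix(self, x, w):
        # Apply a c x c channel-mixing matrix as a 1x1 convolution.
        return F.conv2d(x, w.unsqueeze(-1).unsqueeze(-1))

    def forward(self, sdr):
        # SDR -> HDR: spatial transformation module is the model entrance.
        x = F.pixel_unshuffle(sdr, self.factor)
        return self._mix(self._mix(x, self.luma), self.gamut)

    def inverse(self, hdr):
        # HDR -> SDR: run the modules in reverse; spatial module is the exit.
        x = self._mix(hdr, torch.inverse(self.gamut))
        x = self._mix(x, torch.inverse(self.luma))
        return F.pixel_shuffle(x, self.factor)
```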
6. The video conversion method of claim 5, wherein the reversible conversion model is trained by:
obtaining sample video frames, wherein the sample video frames comprise a plurality of standard dynamic range video frames and a plurality of high dynamic range video frames, and the plurality of high dynamic range video frames and the plurality of standard dynamic range video frames are in one-to-one correspondence;
taking the plurality of standard dynamic range video frames as input samples, taking the plurality of high dynamic range video frames as output samples, taking the spatial transformation module as the model entrance, and training the reversible conversion model;
taking the plurality of high dynamic range video frames as input samples, taking the plurality of standard dynamic range video frames as output samples, taking the spatial transformation module as the model exit, and training the reversible conversion model;
and in response to the loss function of the reversible conversion model meeting a preset condition, obtaining the trained reversible conversion model.
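A hedged sketch of this bidirectional training, reusing the ReversibleConverter skeleton above: the forward pass is supervised from SDR to HDR (spatial transformation module as the model entrance), the inverse pass from HDR to SDR (spatial transformation module as the exit), and the two directions share one loss. Both `paired_frame_loader` and the stopping threshold are hypothetical:

```python
import torch

# Assumed: paired_frame_loader yields one-to-one (sdr, hdr) tensor pairs
# shaped to match the model's input and output; it is hypothetical here.
model = ReversibleConverter()  # skeleton from the sketch above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
l1 = torch.nn.L1Loss()

for sdr, hdr in paired_frame_loader:
    # Forward direction: spatial transformation module as model entrance.
    loss_fwd = l1(model(sdr), hdr)
    # Reverse direction: spatial transformation module as model exit.
    loss_bwd = l1(model.inverse(hdr), sdr)
    loss = loss_fwd + loss_bwd
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if loss.item() < 1e-3:  # "loss function meets a preset condition"
        break
```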
7. A video conversion apparatus, comprising:
an acquisition module configured to acquire a first video to be converted and extract a first video frame from the first video;
a first generation module configured to perform spatial transformation, brightness mapping and color gamut conversion on the first video frame to generate a second video frame corresponding to the first video frame;
a second generation module configured to generate a second video based on the second video frame;
wherein one of the first video and the second video is a standard dynamic range video and the other is a high dynamic range video.
8. The video conversion apparatus according to claim 7, wherein in a case where the first video is a standard dynamic range video, the second video is a high dynamic range video, and the first generation module includes:
a first obtaining submodule configured to obtain spatial dimension information of the first video frame;
a first generation submodule configured to generate channel dimension information of the first video frame based on the spatial dimension information of the first video frame;
and a second generation submodule configured to perform brightness mapping and color gamut conversion on the channel dimension information to generate the second video frame.
9. The video conversion apparatus of claim 8, wherein the first generation submodule is configured to downsample the spatial dimension information of the first video frame to generate the channel dimension information of the first video frame.
10. The video conversion apparatus according to claim 7, wherein in a case where the second video is a standard dynamic range video, the first video is a high dynamic range video, and the first generation module includes:
a first obtaining submodule configured to perform brightness mapping and color gamut conversion on the first video frame to obtain channel dimension information of the first video frame;
a third generation submodule configured to generate spatial dimension information of the first video frame based on the channel dimension information of the first video frame;
a fourth generation submodule configured to generate the second video frame according to the spatial dimension information of the first video frame.
11. The video conversion apparatus of claim 7, wherein the first generation module is configured to:
input the first video frame into a pre-trained reversible conversion model, wherein the reversible conversion model comprises a spatial transformation module, a brightness mapping module and a color gamut conversion module;
in response to the first video being a standard dynamic range video, convert the spatial dimension information of the first video frame into channel dimension information through the spatial transformation module, and perform brightness mapping and color gamut conversion on the channel dimension information through the brightness mapping module and the color gamut conversion module to generate the second video frame;
and in response to the first video being a high dynamic range video, perform brightness mapping and color gamut conversion on the first video frame through the brightness mapping module and the color gamut conversion module to obtain channel dimension information of the first video frame, and convert the channel dimension information of the first video frame into spatial dimension information through the spatial transformation module to generate the second video frame.
12. The video conversion apparatus according to claim 11, wherein the reversible conversion model is trained by:
obtaining sample video frames, wherein the sample video frames comprise a plurality of standard dynamic range video frames and a plurality of high dynamic range video frames, and the plurality of high dynamic range video frames and the plurality of standard dynamic range video frames are in one-to-one correspondence;
taking the plurality of standard dynamic range video frames as input samples, taking the plurality of high dynamic range video frames as output samples, taking the spatial transformation module as the model entrance, and training the reversible conversion model;
taking the plurality of high dynamic range video frames as input samples, taking the plurality of standard dynamic range video frames as output samples, taking the spatial transformation module as the model exit, and training the reversible conversion model;
and in response to the loss function of the reversible conversion model meeting a preset condition, obtaining the trained reversible conversion model.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 6.
14. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1 to 6.
15. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 6.
CN202210759825.6A 2022-06-29 2022-06-29 Video conversion method, device, equipment and storage medium Pending CN115174839A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210759825.6A CN115174839A (en) 2022-06-29 2022-06-29 Video conversion method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115174839A (en) 2022-10-11

Family

ID=83488458

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination