CN116962689A - Method and device for dividing depth map coding units in a three-dimensional video frame

Method and device for dividing depth map coding units in a three-dimensional video frame

Info

Publication number
CN116962689A
CN116962689A
Authority
CN
China
Prior art keywords
depth map
depth
coding unit
map coding
current
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310900673.1A
Other languages
Chinese (zh)
Inventor
宋俊锋
龚鑫铠
季苏华
叶振
王国相
吴子健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Dianchuang Information Technology Co ltd
Original Assignee
Zhejiang Dianchuang Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Dianchuang Information Technology Co ltd filed Critical Zhejiang Dianchuang Information Technology Co ltd
Priority to CN202310900673.1A
Publication of CN116962689A
Legal status: Pending

Classifications

    • G06T 7/50: Image analysis; depth or shape recovery
    • G06N 3/0455: Neural networks; auto-encoder networks, encoder-decoder networks
    • G06N 3/0464: Neural networks; convolutional networks [CNN, ConvNet]
    • G06N 3/0499: Neural networks; feedforward networks
    • G06N 3/08: Neural networks; learning methods
    • H04N 19/11: Selection of coding mode or of prediction mode among a plurality of spatial predictive coding modes
    • H04N 19/119: Adaptive subdivision aspects, e.g. subdivision of a picture into rectangular or non-rectangular coding blocks
    • H04N 19/159: Prediction type, e.g. intra-frame, inter-frame or bidirectional frame prediction
    • H04N 19/172: Adaptive coding characterised by the coding unit, the unit being a picture, frame or field
    • H04N 19/182: Adaptive coding characterised by the coding unit, the unit being a pixel
    • H04N 19/96: Tree coding, e.g. quad-tree coding
    • G06T 2207/10016: Image acquisition modality; video, image sequence
    • G06T 2207/10028: Image acquisition modality; range image, depth image, 3D point clouds
    • G06T 2207/20081: Special algorithmic details; training, learning
    • G06T 2207/20084: Special algorithmic details; artificial neural networks [ANN]

Abstract

The invention discloses a method and a device for dividing depth map coding units in a three-dimensional video frame. First, a coding tree unit partition structure prediction network module is constructed from a Swin Transformer module and a convolutional neural network. Second, the partition structure prediction module is added to the intra-frame prediction of the encoder, and during encoding the original pixels of the current depth map block to be encoded are obtained to predict the partition depth of that block. The partition structure is built with the Swin Transformer module, which captures the global information of the depth map coding tree unit, compensating for the weakness of CNNs at extracting global information, and predicts the optimal partition structure of the current coding tree unit. The encoder only needs to pass in the depth map to obtain the optimal partition structure of each coding tree unit. With coding quality essentially unchanged, the intra-frame depth map coding time in 3D-HEVC is greatly reduced, achieving the goal of reducing 3D-HEVC coding complexity.

Description

Method and device for dividing depth map coding units in a three-dimensional video frame
Technical Field
The invention belongs to the technical field of neural-network-based video coding, and particularly relates to a method and a device for dividing depth map coding units in a three-dimensional video frame.
Background
In recent years, with the rapid development of three-dimensional (3D) video services, 3D video has entered countless households. 3D video provides a stereoscopic, immersive viewing experience: different views are presented to the viewer through 3D glasses to achieve 3D scene perception. At the same time, 3D video places higher demands on video coding technology. To address this challenge, the Joint Collaborative Team on Video Coding (JCT-VC) developed the three-dimensional High Efficiency Video Coding standard (3D-HEVC). A 3D-HEVC bitstream typically contains two to three views, each with a texture map and a corresponding depth map. The depth map is represented as a grayscale image that captures the distance between the camera and real objects, and virtual views can be synthesized from it using Depth-Image-Based Rendering (DIBR). Unlike texture maps, depth maps contain large flat regions separated by very sharp boundaries, and 3D-HEVC provides a number of complex depth map coding tools for these characteristics that distinguish depth maps from texture maps, which increases 3D-HEVC coding complexity.
Intra prediction is the most central component of the video coding standards HEVC/AV1/AVS. 3D-HEVC depth map intra coding adopts the quadtree-based coding tree unit structure of HEVC. The coding tree unit is the basic unit; it is divided into several coding units, which can be represented by a recursive quadtree structure. These coding units take square sizes, or depth levels (DL), from 64×64 (DL=0) down to 8×8 (DL=3). The 3D-HEVC Test Model (HTM) partitions each frame of the video sequence into largest coding units (64×64) and then recursively partitions the coding units at each level into four subunits until DL reaches 3. To determine the optimal partition structure of the current coding tree unit, starting from the root node, the rate-distortion cost of the undivided coding unit must be compared with the sum of the rate-distortion costs of its four subunits: if RD-cost(DL=n) > RD-cost(DL=n+1) for n = 0, 1, 2, the coding unit at depth n is divided into four sub-coding units; otherwise the division terminates. This exhaustive search leads to a significant increase in depth map coding complexity, so it is necessary to reduce the complexity of depth map coding unit partitioning in 3D-HEVC.
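To make the cost of this search concrete, the following sketch reproduces the recursion in Python; rd_cost is a toy stand-in (block variance plus a constant) for the encoder's true rate-distortion evaluation, and all names are illustrative rather than HTM code:

    import numpy as np

    def rd_cost(cu):
        # Toy proxy for the encoder's rate-distortion evaluation of an
        # undivided coding unit: pixel variance plus a constant for signalling.
        return float(cu.var()) + 1.0

    def split_into_four(cu):
        # Split a square block into its four quadtree subunits.
        h, w = cu.shape[0] // 2, cu.shape[1] // 2
        return [cu[:h, :w], cu[:h, w:], cu[h:, :w], cu[h:, w:]]

    def best_partition(cu, depth_level=0, max_depth=3):
        # Exhaustive search: compare RD-cost(DL=n) against the summed
        # RD-cost(DL=n+1) of the four subunits, as described above.
        cost_unsplit = rd_cost(cu)
        if depth_level == max_depth:          # 8x8 units cannot split further
            return cost_unsplit, "leaf"
        cost_split, subtrees = 0.0, []
        for sub in split_into_four(cu):
            sub_cost, sub_tree = best_partition(sub, depth_level + 1, max_depth)
            cost_split += sub_cost
            subtrees.append(sub_tree)
        if cost_unsplit > cost_split:         # splitting is cheaper: divide
            return cost_split, subtrees
        return cost_unsplit, "leaf"

    ctu = np.random.randint(0, 256, (64, 64)).astype(float)
    cost, tree = best_partition(ctu)          # full 4-level recursive search

Every 64×64 unit triggers up to 1 + 4 + 16 + 64 = 85 rate-distortion evaluations, which is exactly the complexity the present method aims to avoid.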
Disclosure of Invention
In order to overcome the defects of the prior art and improve the intra-frame prediction coding efficiency in video coding, the invention adopts the following technical scheme:
a depth map coding unit partitioning method, comprising the steps of:
step S1: constructing a partition structure prediction network, and carrying out partition prediction on a depth map coding unit;
step S2: obtaining a depth map coding unit to be partitioned, obtaining a predicted optimal partition structure through a partition structure prediction network, comparing the current depth of the depth map coding unit with the depth of the predicted optimal partition structure, calculating the rate distortion cost of the current depth map coding unit when the depths are the same, otherwise, determining the optimal partition structure of the current depth map coding unit based on the calculated rate distortion cost without calculating.
Further, in step S2, if the current depth is smaller than the predicted depth, the rate-distortion cost calculation for the current depth is skipped and the search continues at the next depth; if the current depth is larger than the predicted depth, the rate-distortion cost calculation for the current depth is skipped and the depth search stops, completing the partitioning of the current depth map coding unit.
Further, the step S1 includes the following steps:
step S1.1: feature extraction: acquiring the original-pixel grayscale map of the depth map coding unit, partitioning it into patches, performing a linear transformation on the channel data of each pixel, extracting features through a Swin Transformer (shifted-window Transformer), and merging the patches corresponding to the features to obtain first features; performing feature extraction on the original-pixel grayscale map of the depth map coding unit through convolution groups to obtain second features; and fusing the first features with the second features;
step S1.2: carrying out partition prediction on the fused features to obtain a predicted partition map.
Through multiple self-attention mechanisms and feed-forward network layers, the Swin Transformer (shifted-window Transformer) can capture global and local context information in the input feature map, so edge information in the depth map is extracted effectively for more accurate prediction in subsequent steps.
Further, in step S1, a multi-scale L1 loss function (MS-L1) is constructed for training the partition structure prediction; the predicted partition map pixels lie between 0 and 3:

MS-L1(y, ŷ) = Σ_i [ L1(MaxPool_{k=i}(y), MaxPool_{k=i}(ŷ)) + L1(MinPool_{k=i}(y), MinPool_{k=i}(ŷ)) ]

wherein MaxPool_{k=i} and MinPool_{k=i} respectively denote max pooling and min pooling with kernel size i, y denotes the partition result output by the network during training, and ŷ denotes the ground-truth partition result; the max and min pooling are intended to enhance local consistency, since depth maps are characterized by local consistency. L1 is defined as L1(A_{k=i}, B_{k=i}) = (1/N) Σ |A_{k=i} - B_{k=i}|, where A_{k=i} and B_{k=i} respectively denote the corresponding MaxPool_{k=i}(y), MaxPool_{k=i}(ŷ) or MinPool_{k=i}(y), MinPool_{k=i}(ŷ) in the L1 loss function.
A method for dividing depth map coding units in a three-dimensional video frame is based on the above depth map coding unit partitioning method, wherein the step S2 comprises the following steps:
step S2.1: pre-encoding the video sequences provided by the common coding standard, and extracting depth map coding units and their optimal partition structures as training data for training the partition structure prediction network;
step S2.2: reading the depth map to be encoded frame by frame from the depth map video, dividing the depth map into depth map coding units, and obtaining the predicted optimal partition structures through the trained partition structure prediction network;
step S2.3: obtaining from the depth map the depth of the depth map coding unit to be encoded and its predicted optimal partition structure, and comparing the current depth of the depth map coding unit with the depth of the predicted optimal partition structure; when the depths are the same, calculating the rate-distortion cost of the current depth map coding unit, and otherwise skipping the calculation;
step S2.4: obtaining the optimal partition structure of the current depth map coding unit based on the calculated rate-distortion costs, and returning to step S2.3 until the optimal partition structures of the coding units of all the depth map videos are obtained.
Further, in step S2.3, if the current depth is smaller than the predicted depth, the rate-distortion cost calculation for the current depth is skipped and the search continues at the next depth; if the current depth is larger than the predicted depth, the rate-distortion cost calculation for the current depth is skipped and the depth search stops, completing the partitioning of the current depth map coding unit.
The device for dividing depth map coding units in a three-dimensional video frame comprises a memory and one or more processors; executable code is stored in the memory, and when executing the executable code, the one or more processors implement the above method for dividing depth map coding units in a three-dimensional video frame.
A depth map coding unit partitioning device comprises a partition structure prediction module and an optimal partition structure generation module;
the partition structure prediction module is used for carrying out partition prediction on the depth map coding unit;
the optimal division structure generation module is used for acquiring the depth map coding units to be divided, obtaining a predicted optimal division structure through the division structure prediction module, comparing the current depth of the depth map coding units with the depth of the predicted optimal division structure, and calculating the rate distortion cost of the current depth map coding units when the depths are the same, otherwise, not calculating the rate distortion cost, and determining the optimal division structure of the current depth map coding units based on the calculated rate distortion cost.
Further, in the optimal partition structure generation module, if the current depth is smaller than the predicted depth, the rate-distortion cost calculation for the current depth is skipped and the search continues at the next depth; if the current depth is larger than the predicted depth, the rate-distortion cost calculation for the current depth is skipped and the depth search stops, completing the partitioning of the current depth map coding unit.
Further, the partition structure prediction module comprises a feature extraction module, a fusion module and a partition prediction module. The feature extraction module comprises a first feature extraction branch and a second feature extraction branch: the first branch comprises a patch partition layer, a linear transformation layer and a Swin Transformer (shifted-window Transformer) module connected in sequence; the second branch comprises a convolution layer, a normalization layer, an activation layer and a pooling layer connected in sequence. The fusion module fuses the features extracted by the two branches for the partition prediction module to predict the partition structure. The features extracted by the Swin Transformer are reduced in dimension by a convolution layer to lower the computational complexity.
Through multi-layer self-attention mechanisms and feed-forward network layers, the Swin Transformer can capture global and local context information in the input feature map, effectively extracting edge information in the depth map for more accurate prediction in subsequent steps. The depth map is filtered and downsampled by convolution operations, normalized by the BN layer, nonlinearly activated by the ReLU layer, and spatially reduced by the pooling layer; these components extract edge feature information when processing the depth map coding tree unit. Finally, through the combination of the two branches, the network obtains rich feature information.
The invention has the advantages that:
according to the method and the device for dividing the depth map coding units in the three-dimensional video frames, in 3D video depth map intra-frame prediction, the network module based on the Swin transform is utilized to predict the optimal dividing structure of the depth map coding tree units, so that redundant rate distortion cost calculation is skipped, the 3D-HEVC intra-frame depth map coding time is greatly reduced under the condition that the coding quality is basically unchanged, the 3D-HEVC coding complexity is reduced, and the intra-frame prediction coding efficiency in video coding is improved.
Drawings
Fig. 1 is a flowchart of a method for dividing a depth map coding unit in a three-dimensional video frame according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a partition structure prediction network according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of network input and output corresponding to a depth map of a frame according to an embodiment of the present invention.
Fig. 4 is a schematic structural diagram of a device for dividing depth map coding units in a three-dimensional video frame according to an embodiment of the present invention.
Detailed Description
The following describes specific embodiments of the present invention in detail with reference to the drawings. It should be understood that the detailed description and specific examples, while indicating and illustrating the invention, are not intended to limit the invention.
A method for dividing depth map coding units in a three-dimensional video frame adds a partition structure prediction module to the intra-frame prediction of the encoder: a Swin Transformer (Shifted-Window Transformer) module captures the global information of the depth map coding tree unit, compensating for the weakness of CNN (Convolutional Neural Network) models at extracting global information, and the optimal partition structure of the current coding tree unit is predicted. The encoder only needs to pass in the depth map to obtain the optimal partition structure of each coding tree unit, effectively saving video coding time.
In practical use, the encoder calls the method of the present invention to complete intra prediction of the depth map. As shown in fig. 1, the method specifically comprises the following steps:
Step S1: partition structure prediction based on the Swin Transformer. A neural network based on the Swin Transformer is adopted to generate the coding tree unit partition structure; the network structure is shown in fig. 2. The network takes the original-pixel grayscale map of the current coding tree unit to be encoded as input, extracts the feature information of the depth map through convolution, pooling and Swin Transformer modules, and then, after concatenation, outputs the predicted depth partition structure information through further convolution and pooling.
The input of the network is a depth map coding tree unit: a grayscale image with height 64, width 64 and 1 channel. The output is a 1×4×4 partition map that represents the quadtree segmentation result of the depth map coding tree unit; that is, the partition structure prediction network outputs the partition structure map of the current coding tree unit as a grayscale image with height 4, width 4 and 1 channel. The partition structure prediction mainly comprises the following steps:
Step S1.1: feature extraction. The feature extraction module of the network contains 2 parallel branches. The first branch first feeds the input into a Patch Partition module for patch splitting, then performs a linear transformation on the channel data of each pixel through a Linear Embedding layer, then builds feature maps through Swin Transformer Blocks, and finally downsamples through a Patch Merging layer. The Swin Transformer Block is a feature extraction module based on the Transformer architecture, which has shown excellent performance in the field of computer vision. Through multiple self-attention mechanisms and feed-forward network layers, it can capture global and local context information in the input feature map, so edge information in the depth map is extracted effectively for more accurate prediction in subsequent steps. The second branch extracts features through 3×3 convolution groups, each group containing one 3×3 convolution layer, one Batch Normalization (BN) layer, one Rectified Linear Unit (ReLU) activation layer, and one max-pooling layer with kernel size 2. The depth map is filtered and downsampled through the convolution operations, normalized by the BN layer, nonlinearly activated by the ReLU layer, and spatially reduced by the pooling layer; these components extract edge feature information when processing the depth map coding tree unit. Through the combination of the two branches, the network obtains rich feature information.
After the two branches have extracted features, the features from the Swin Transformer branch are reduced in dimension by a 1×1 convolution layer to lower the computational complexity, and the features extracted by the two branches are then fused by a concat operation to obtain a richer, multi-level feature representation.
Step S1.2: partition prediction. The partition prediction head consists of one group of 3×3 convolutions and two 1×1 convolution layers. These layers further process the fused features to predict the 1×4×4 partition map.
The choice of the loss function is determined by the problem the neural network handles. Coding tree unit partition structure prediction is formulated as a regression problem whose predicted partition map pixels lie between 0 and 3, and the loss functions commonly used for regression cannot train the partition structure prediction network well. Therefore, an improved L1 loss function, called the multi-scale L1 loss (MS-L1), is used for training and is defined as:

MS-L1(y, ŷ) = Σ_i [ L1(MaxPool_{k=i}(y), MaxPool_{k=i}(ŷ)) + L1(MinPool_{k=i}(y), MinPool_{k=i}(ŷ)) ]

In the multi-scale L1 loss, MaxPool_{k=i} and MinPool_{k=i} denote max pooling and min pooling with kernel size i, y denotes the partition label output by the network during training, and ŷ denotes the ground-truth partition label. The max and min pooling are intended to enhance local consistency, since depth maps are characterized by local consistency. L1 is expressed as:

L1(A_{k=i}, B_{k=i}) = (1/N) Σ |A_{k=i} - B_{k=i}|

wherein A_{k=i} and B_{k=i} respectively denote the corresponding MaxPool_{k=i}(y), MaxPool_{k=i}(ŷ) or MinPool_{k=i}(y), MinPool_{k=i}(ŷ) in the L1 loss function.
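A minimal PyTorch sketch of this loss follows. The pooling kernel sizes (2 and 4) and the presence of a plain full-resolution L1 term are assumptions, since the scales i are not enumerated in the text; min pooling is realized as max pooling of the negated map:

    import torch
    import torch.nn.functional as F

    def ms_l1_loss(y_pred, y_true, kernel_sizes=(2, 4)):
        # Full-resolution L1 term (assumed), plus L1 between max-pooled and
        # between min-pooled versions of the predicted and true partition maps.
        loss = F.l1_loss(y_pred, y_true)
        for k in kernel_sizes:
            loss = loss + F.l1_loss(F.max_pool2d(y_pred, k),
                                    F.max_pool2d(y_true, k))
            loss = loss + F.l1_loss(-F.max_pool2d(-y_pred, k),   # min pooling
                                    -F.max_pool2d(-y_true, k))
        return loss

    # Example on a batch of 4x4 partition maps with depths in [0, 3].
    y_hat = torch.rand(8, 1, 4, 4) * 3
    y_gt = torch.randint(0, 4, (8, 1, 4, 4)).float()
    print(ms_l1_loss(y_hat, y_gt))

Because max and min pooling preserve the extreme depths within each window, penalizing their differences pushes the network toward the locally consistent partitions that depth maps exhibit.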
Step S2: fast partitioning of the depth map intra-frame coding tree units. The encoder comprises the following operation steps:
step S2.1: and pre-coding the video sequence provided by the universal standard, and extracting a depth map coding tree unit and an optimal partition structure chart as training data. The depth prediction network constructed by the method shown in fig. 2 is used for network training, and then the trained depth neural network is embedded into the intra-frame prediction of the encoder.
Step S2.2: the depth map to be encoded is read frame by frame from the input depth map video and divided into 64×64 coding tree units, and the coding tree units are predicted by the depth prediction network to obtain the optimal partition structures of the coding tree units to be encoded.
Step S2.3: the coding tree unit to be encoded and its predicted depths are read from the input depth map, and the current depth of the coding unit is compared with the depth predicted by the depth prediction network to decide whether to calculate the rate-distortion cost of the current coding unit.
The partition structure is flattened into a 1×16 array, each number representing the depth of one coding unit. The current depth of the coding unit is compared with the depth predicted by the depth prediction network. When the current depth is the same as the predicted depth, the rate-distortion cost is calculated; otherwise, no cost calculation is performed.
Step S2.4: the optimal partition structure of the current coding tree unit is obtained, and the process returns to step S2.3.
Specifically, after the 3D video is read, the encoder passes the depth map to the partition structure prediction module, which splits it into 64×64 coding tree units and predicts them. The predicted partition map is flattened into a 1×16 array, rounded and clipped to integers between 0 and 3, each number representing the depth of one coding unit; after minor adjustments, the array conforms to the arrangement of coding tree unit partition depths. In the coding branch, it is first determined whether the video currently being encoded is a depth map. For coding tree units that are not from the depth map, i.e., coding tree units of the texture video, the original iterative coding tree unit partitioning method is used to obtain the optimal partition result. For a coding tree unit of the depth map, the unit to be encoded and its predicted depths are read, and when encoding reaches the current coding unit, its depth is compared with the predicted depth. If the current depth equals the predicted depth, the RD cost of the current depth is calculated and no further quadtree partitioning is performed. If the current depth is smaller than the predicted depth, the RD cost calculation for the current depth is skipped, quadtree partitioning continues, and the next depth is searched. If the current depth is larger than the predicted depth, not only is the current RD cost calculation skipped, but the depth search also stops. This ensures that the calculation is performed only when the current depth is the depth predicted by the network model, thus skipping unnecessary RD cost calculations and finally obtaining the optimal coding tree unit partition structure. The operation at the encoder side is shown in fig. 3.
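A compact sketch of the flattening and the three-way depth comparison just described is given below; it is illustrative Python, whereas the actual hook lives inside the HTM encoder's partition loop:

    import numpy as np

    def flatten_partition_map(pred_map):
        # 4x4 network output -> 1x16 depth array, rounded and clipped to 0..3;
        # each entry is the depth of one coding unit.
        return np.clip(np.rint(pred_map), 0, 3).astype(int).reshape(16)

    def partition_decision(current_depth, predicted_depth):
        # Returns (compute_rd_cost, continue_splitting) per the rules above.
        if current_depth == predicted_depth:
            return True, False    # evaluate RD cost here; stop splitting
        if current_depth < predicted_depth:
            return False, True    # skip this level's RD cost; keep descending
        return False, False       # past the predicted depth: skip and stop

    depths = flatten_partition_map(np.random.rand(4, 4) * 3)
    print(depths, partition_decision(1, int(depths[0])))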
Corresponding to the embodiment of the method for dividing the depth map coding units in the three-dimensional video frames, the invention also provides an embodiment of a device for dividing the depth map coding units in the three-dimensional video frames.
Referring to fig. 4, the device for dividing depth map coding units in a three-dimensional video frame provided by an embodiment of the present invention comprises a memory and one or more processors; executable code is stored in the memory, and when executing the executable code, the one or more processors implement the method for dividing depth map coding units in a three-dimensional video frame of the above embodiment.
The embodiment of the device for dividing depth map coding units in a three-dimensional video frame can be applied to any apparatus with data processing capability, such as a computer. The device embodiments may be implemented by software, or by hardware or a combination of hardware and software. Taking software implementation as an example, the device in the logical sense is formed by the processor of the apparatus with data processing capability reading the corresponding computer program instructions from nonvolatile memory into memory and running them. In terms of hardware, fig. 4 shows a hardware structure diagram of the apparatus with data processing capability where the device for dividing depth map coding units in a three-dimensional video frame of the present invention is located. Besides the processor, memory, network interface, and nonvolatile memory shown in fig. 4, the apparatus in the embodiment generally includes other hardware according to its actual function, which is not described here again.
The implementation process of the functions and roles of each unit in the above device is specifically shown in the implementation process of the corresponding steps in the above method, and will not be described herein again.
For the device embodiments, reference is made to the description of the method embodiments for the relevant points, since they essentially correspond to the method embodiments. The device embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purposes of the solution of the present invention. Those of ordinary skill in the art can understand and implement the present invention without creative effort.
The embodiment of the invention also provides a computer readable storage medium, on which a program is stored, which when executed by a processor, implements a method for dividing a depth map coding unit in a three-dimensional video frame in the above embodiment.
The computer readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any of the devices with data processing capability described in the previous embodiments. It may also be an external storage device of the device with data processing capability, such as a plug-in hard disk, a Smart Media Card (SMC), an SD card, or a Flash Card provided on the device. Further, the computer readable storage medium may include both an internal storage unit and an external storage device of the device with data processing capability. The computer readable storage medium is used to store the computer program and the other programs and data required by the device with data processing capability, and may also be used to temporarily store data that has been output or is to be output.
The above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced with equivalents; such modifications and substitutions do not depart from the spirit of the technical solutions according to the embodiments of the present invention.

Claims (10)

1. A depth map coding unit partitioning method, characterized by comprising the following steps:
step S1: constructing a partition structure prediction network, and carrying out partition prediction on a depth map coding unit;
step S2: obtaining a depth map coding unit to be partitioned, obtaining a predicted optimal partition structure through the partition structure prediction network, and comparing the current depth of the depth map coding unit with the depth of the predicted optimal partition structure; when the depths are the same, calculating the rate-distortion cost of the current depth map coding unit, and otherwise skipping the calculation; and determining the optimal partition structure of the current depth map coding unit based on the calculated rate-distortion costs.
2. The method for partitioning a depth map coding unit according to claim 1, wherein: in step S2, if the current depth is smaller than the predicted depth, the rate-distortion cost calculation of the current depth is skipped and the search continues at the next depth; if the current depth is larger than the predicted depth, the rate-distortion cost calculation of the current depth is skipped and the depth search stops, completing the partitioning of the current depth map coding unit.
3. The method for partitioning a depth map coding unit according to claim 1, wherein step S1 comprises the following steps:
step S1.1: feature extraction: acquiring a depth map coding unit, partitioning it into patches, performing a linear transformation on the channel data of each pixel, extracting features through a shifted-window Transformer, and merging the patches corresponding to the features to obtain first features; performing feature extraction on the depth map coding unit through convolution groups to obtain second features; and fusing the first features with the second features;
step S1.2: carrying out partition prediction on the fused features to obtain a predicted partition map.
4. The method for partitioning a depth map coding unit according to claim 1, wherein: in step S1, a multi-scale L1 loss function (MS-L1) is constructed for training the partition structure prediction:

MS-L1(y, ŷ) = Σ_i [ L1(MaxPool_{k=i}(y), MaxPool_{k=i}(ŷ)) + L1(MinPool_{k=i}(y), MinPool_{k=i}(ŷ)) ]

wherein MaxPool_{k=i} and MinPool_{k=i} respectively denote max pooling and min pooling with kernel size i, y denotes the partition result output by the network during training, ŷ denotes the ground-truth partition result, L1 is defined as L1(A_{k=i}, B_{k=i}) = (1/N) Σ |A_{k=i} - B_{k=i}|, and A_{k=i}, B_{k=i} respectively denote the corresponding MaxPool_{k=i}(y), MaxPool_{k=i}(ŷ) or MinPool_{k=i}(y), MinPool_{k=i}(ŷ) in the L1 loss function.
5. A method for dividing depth map coding units in a three-dimensional video frame, characterized in that, based on the depth map coding unit partitioning method according to claim 1, the step S2 comprises the following steps:
step S2.1: pre-encoding the video sequences provided by the common coding standard, and extracting depth map coding units and their optimal partition structures as training data for training the partition structure prediction network;
step S2.2: reading the depth map to be encoded frame by frame from the depth map video, dividing the depth map into depth map coding units, and obtaining the predicted optimal partition structures through the trained partition structure prediction network;
step S2.3: obtaining from the depth map the depth of the depth map coding unit to be encoded and its predicted optimal partition structure, and comparing the current depth of the depth map coding unit with the depth of the predicted optimal partition structure; when the depths are the same, calculating the rate-distortion cost of the current depth map coding unit, and otherwise skipping the calculation;
step S2.4: obtaining the optimal partition structure of the current depth map coding unit based on the calculated rate-distortion costs, and returning to step S2.3 until the optimal partition structures of the coding units of all the depth map videos are obtained.
6. The method for dividing depth map coding units in a three-dimensional video frame according to claim 5, wherein: in step S2.3, if the current depth is smaller than the predicted depth, the rate-distortion cost calculation of the current depth is skipped and the search continues at the next depth; if the current depth is larger than the predicted depth, the rate-distortion cost calculation of the current depth is skipped and the depth search stops, completing the partitioning of the current depth map coding unit.
7. A three-dimensional video intra-frame depth map coding unit partitioning apparatus, comprising a memory and one or more processors, wherein the memory stores executable code, and wherein the one or more processors are configured to implement the three-dimensional video intra-frame depth map coding unit partitioning method of claim 5 or 6 when the executable code is executed.
8. A depth map coding unit partitioning device, comprising a partition structure prediction module and an optimal partition structure generation module, characterized in that:
the partition structure prediction module is used for carrying out partition prediction on the depth map coding unit;
the optimal division structure generation module is used for acquiring the depth map coding units to be divided, obtaining a predicted optimal division structure through the division structure prediction module, comparing the current depth of the depth map coding units with the depth of the predicted optimal division structure, and calculating the rate distortion cost of the current depth map coding units when the depths are the same, otherwise, not calculating the rate distortion cost, and determining the optimal division structure of the current depth map coding units based on the calculated rate distortion cost.
9. The depth map coding unit partitioning apparatus of claim 8, wherein: in the optimal partition structure generation module, if the current depth is smaller than the predicted depth, skipping rate distortion cost calculation of the current depth, and continuing searching of the next depth; if the current depth is larger than the predicted depth, skipping rate distortion cost calculation of the current depth, stopping further depth search, and completing division of the current depth map coding unit.
10. A depth map coding unit partitioning device as claimed in claim 8 or 9, wherein: the partition structure prediction module comprises a feature extraction module, a fusion module and a partition prediction module; the feature extraction module comprises a first feature extraction branch and a second feature extraction branch, the first feature extraction branch comprising a patch partition layer, a linear transformation layer and a shifted-window Transformer connected in sequence, and the second feature extraction branch comprising a convolution layer, a normalization layer, an activation layer and a pooling layer connected in sequence; and the fusion module fuses the features extracted by the two branches for the partition prediction module to predict the partition structure.
CN202310900673.1A 2023-07-21 2023-07-21 Method and device for dividing depth map coding units in three-dimensional video frame Pending CN116962689A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310900673.1A CN116962689A (en) 2023-07-21 2023-07-21 Method and device for dividing depth map coding units in three-dimensional video frame

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310900673.1A CN116962689A (en) 2023-07-21 2023-07-21 Method and device for dividing depth map coding units in three-dimensional video frame

Publications (1)

Publication Number Publication Date
CN116962689A true CN116962689A (en) 2023-10-27

Family

ID=88444022

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310900673.1A Pending CN116962689A (en) 2023-07-21 2023-07-21 Method and device for dividing depth map coding units in three-dimensional video frame

Country Status (1)

Country Link
CN (1) CN116962689A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination