CN116962689A - Method and device for dividing depth map coding units in a three-dimensional video frame

Method and device for dividing depth map coding units in a three-dimensional video frame

Info

Publication number
CN116962689A
CN116962689A
Authority
CN
China
Prior art keywords
depth map
depth
coding unit
map coding
current
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310900673.1A
Other languages
Chinese (zh)
Inventor
宋俊锋
龚鑫铠
季苏华
叶振
王国相
吴子健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Dianchuang Information Technology Co ltd
Original Assignee
Zhejiang Dianchuang Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Dianchuang Information Technology Co ltd filed Critical Zhejiang Dianchuang Information Technology Co ltd
Priority to CN202310900673.1A
Publication of CN116962689A
Legal status: Pending

Classifications

    • G06T 7/50: Image analysis; depth or shape recovery
    • G06N 3/0455: Neural networks; auto-encoder networks, encoder-decoder networks
    • G06N 3/0464: Neural networks; convolutional networks [CNN, ConvNet]
    • G06N 3/0499: Neural networks; feedforward networks
    • G06N 3/08: Neural networks; learning methods
    • H04N 19/11: Selection of coding mode or of prediction mode among a plurality of spatial predictive coding modes
    • H04N 19/119: Adaptive subdivision aspects, e.g. subdivision of a picture into rectangular or non-rectangular coding blocks
    • H04N 19/159: Prediction type, e.g. intra-frame, inter-frame or bidirectional frame prediction
    • H04N 19/172: Adaptive coding characterised by the coding unit, the unit being a picture, frame or field
    • H04N 19/182: Adaptive coding characterised by the coding unit, the unit being a pixel
    • H04N 19/96: Tree coding, e.g. quad-tree coding
    • G06T 2207/10016: Image acquisition modality; video, image sequence
    • G06T 2207/10028: Image acquisition modality; range image, depth image, 3D point clouds
    • G06T 2207/20081: Special algorithmic details; training, learning
    • G06T 2207/20084: Special algorithmic details; artificial neural networks [ANN]

Abstract

The invention discloses a method and a device for dividing depth map coding units in a three-dimensional video frame. First, a coding tree unit partition structure prediction network module is constructed from a Swin Transformer module and a convolutional neural network. Second, the partition structure prediction module is added to the intra-frame prediction of the encoder, and during encoding the original pixels of the current depth map block to be encoded are obtained to predict the partition depth of that block. The partition structure is built with the Swin Transformer module, which captures the global information of the depth map coding tree unit, compensating for the weakness of CNNs at extracting global information, and predicts the optimal partition structure of the current coding tree unit. The encoder only needs to pass in the depth map to obtain the optimal partition structure of each coding tree unit. With coding quality essentially unchanged, the intra-frame depth map coding time in 3D-HEVC is greatly reduced, achieving the goal of reducing 3D-HEVC coding complexity.

Description

Method and device for dividing depth map coding units in a three-dimensional video frame
Technical Field
The invention belongs to the technical field of neural-network-based video coding, and particularly relates to a method and a device for dividing depth map coding units in a three-dimensional video frame.
Background
In recent years, with the rapid development of three-dimensional (3D) video services, 3D video has entered countless households. 3D video provides a stereoscopic, immersive viewing experience: different views are presented to the viewer through 3D glasses to achieve 3D scene perception. At the same time, 3D video places higher demands on video coding technology. To address this challenge, the Joint Collaborative Team on Video Coding (JCT-VC) developed the three-dimensional High Efficiency Video Coding standard (3D-HEVC). A 3D-HEVC bitstream typically contains two to three views, each with a texture map and a corresponding depth map. The depth map is represented as a grayscale image that captures the distance between the camera and real objects, and virtual views can be synthesized from it using Depth-Image-Based Rendering (DIBR). Unlike texture maps, depth maps contain large flat regions separated by very sharp boundaries, and 3D-HEVC provides a number of complex depth map coding tools for these characteristics that distinguish depth maps from texture maps, which increases 3D-HEVC coding complexity.
Intra prediction is the most central component of the video coding standards HEVC/AV1/AVS. 3D-HEVC depth map intra coding adopts the quadtree-based coding tree unit structure of HEVC. The coding tree unit is the basic unit; it is divided into several coding units, which can be represented by a recursive quadtree structure. These coding units take square sizes, or depth levels (DL), from 64×64 (DL=0) down to 8×8 (DL=3). The 3D-HEVC Test Model (HTM) partitions each frame of the video sequence into largest coding units (64×64) and then recursively partitions the coding units at each level into four subunits until DL reaches 3. To determine the optimal partition structure of the current coding tree unit, starting from the root node, the rate-distortion cost of the undivided coding unit must be compared with the sum of the rate-distortion costs of its four subunits: if RD-cost(DL=n) > RD-cost(DL=n+1) for n = 0, 1, 2, the coding unit at depth n is divided into four sub-coding units; otherwise the division terminates. This exhaustive search leads to a significant increase in depth map coding complexity, so it is necessary to reduce the complexity of depth map coding unit partitioning in 3D-HEVC.
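To make the cost of this search concrete, the following sketch reproduces the recursion in Python; rd_cost is a toy stand-in (block variance plus a constant) for the encoder's true rate-distortion evaluation, and all names are illustrative rather than HTM code:

    import numpy as np

    def rd_cost(cu):
        # Toy proxy for the encoder's rate-distortion evaluation of an
        # undivided coding unit: pixel variance plus a constant for signalling.
        return float(cu.var()) + 1.0

    def split_into_four(cu):
        # Split a square block into its four quadtree subunits.
        h, w = cu.shape[0] // 2, cu.shape[1] // 2
        return [cu[:h, :w], cu[:h, w:], cu[h:, :w], cu[h:, w:]]

    def best_partition(cu, depth_level=0, max_depth=3):
        # Exhaustive search: compare RD-cost(DL=n) against the summed
        # RD-cost(DL=n+1) of the four subunits, as described above.
        cost_unsplit = rd_cost(cu)
        if depth_level == max_depth:          # 8x8 units cannot split further
            return cost_unsplit, "leaf"
        cost_split, subtrees = 0.0, []
        for sub in split_into_four(cu):
            sub_cost, sub_tree = best_partition(sub, depth_level + 1, max_depth)
            cost_split += sub_cost
            subtrees.append(sub_tree)
        if cost_unsplit > cost_split:         # splitting is cheaper: divide
            return cost_split, subtrees
        return cost_unsplit, "leaf"

    ctu = np.random.randint(0, 256, (64, 64)).astype(float)
    cost, tree = best_partition(ctu)          # full 4-level recursive search

Every 64×64 unit triggers up to 1 + 4 + 16 + 64 = 85 rate-distortion evaluations, which is exactly the complexity the present method aims to avoid.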
Disclosure of Invention
In order to overcome the defects of the prior art and improve the intra-frame prediction coding efficiency in video coding, the invention adopts the following technical scheme:
a depth map coding unit partitioning method, comprising the steps of:
step S1: constructing a partition structure prediction network, and carrying out partition prediction on a depth map coding unit;
step S2: obtaining a depth map coding unit to be partitioned, obtaining a predicted optimal partition structure through a partition structure prediction network, comparing the current depth of the depth map coding unit with the depth of the predicted optimal partition structure, calculating the rate distortion cost of the current depth map coding unit when the depths are the same, otherwise, determining the optimal partition structure of the current depth map coding unit based on the calculated rate distortion cost without calculating.
Further, in step S2, if the current depth is smaller than the predicted depth, the rate-distortion cost calculation for the current depth is skipped and the search continues at the next depth; if the current depth is larger than the predicted depth, the rate-distortion cost calculation for the current depth is skipped and the depth search stops, completing the partitioning of the current depth map coding unit.
Further, the step S1 includes the following steps:
step S1.1: feature extraction: acquiring the original-pixel grayscale map of the depth map coding unit, partitioning it into patches, performing a linear transformation on the channel data of each pixel, extracting features through a Swin Transformer (shifted-window Transformer), and merging the patches corresponding to the features to obtain first features; performing feature extraction on the original-pixel grayscale map of the depth map coding unit through convolution groups to obtain second features; and fusing the first features with the second features;
step S1.2: carrying out partition prediction on the fused features to obtain a predicted partition map.
Through multiple self-attention mechanisms and feed-forward network layers, the Swin Transformer (shifted-window Transformer) can capture global and local context information in the input feature map, so edge information in the depth map is extracted effectively for more accurate prediction in subsequent steps.
Further, in step S1, a multi-scale L1 loss function (MS-L1) is constructed for training the partition structure prediction; the predicted partition map pixels lie between 0 and 3:

MS-L1(y, ŷ) = Σ_i [ L1(MaxPool_{k=i}(y), MaxPool_{k=i}(ŷ)) + L1(MinPool_{k=i}(y), MinPool_{k=i}(ŷ)) ]

wherein MaxPool_{k=i} and MinPool_{k=i} respectively denote max pooling and min pooling with kernel size i, y denotes the partition result output by the network during training, and ŷ denotes the ground-truth partition result; the max and min pooling are intended to enhance local consistency, since depth maps are characterized by local consistency. L1 is defined as L1(A_{k=i}, B_{k=i}) = (1/N) Σ |A_{k=i} - B_{k=i}|, where A_{k=i} and B_{k=i} respectively denote the corresponding MaxPool_{k=i}(y), MaxPool_{k=i}(ŷ) or MinPool_{k=i}(y), MinPool_{k=i}(ŷ) in the L1 loss function.
A method for dividing depth map coding units in a three-dimensional video frame is based on the above depth map coding unit partitioning method, wherein the step S2 comprises the following steps:
step S2.1: pre-encoding the video sequences provided by the common coding standard, and extracting depth map coding units and their optimal partition structures as training data for training the partition structure prediction network;
step S2.2: reading the depth map to be encoded frame by frame from the depth map video, dividing the depth map into depth map coding units, and obtaining the predicted optimal partition structures through the trained partition structure prediction network;
step S2.3: obtaining from the depth map the depth of the depth map coding unit to be encoded and its predicted optimal partition structure, and comparing the current depth of the depth map coding unit with the depth of the predicted optimal partition structure; when the depths are the same, calculating the rate-distortion cost of the current depth map coding unit, and otherwise skipping the calculation;
step S2.4: obtaining the optimal partition structure of the current depth map coding unit based on the calculated rate-distortion costs, and returning to step S2.3 until the optimal partition structures of the coding units of all the depth map videos are obtained.
Further, in step S2.3, if the current depth is smaller than the predicted depth, the rate-distortion cost calculation for the current depth is skipped and the search continues at the next depth; if the current depth is larger than the predicted depth, the rate-distortion cost calculation for the current depth is skipped and the depth search stops, completing the partitioning of the current depth map coding unit.
The device for dividing depth map coding units in a three-dimensional video frame comprises a memory and one or more processors; executable code is stored in the memory, and when executing the executable code, the one or more processors implement the above method for dividing depth map coding units in a three-dimensional video frame.
A depth map coding unit partitioning device comprises a partition structure prediction module and an optimal partition structure generation module;
the partition structure prediction module is used for carrying out partition prediction on the depth map coding unit;
the optimal division structure generation module is used for acquiring the depth map coding units to be divided, obtaining a predicted optimal division structure through the division structure prediction module, comparing the current depth of the depth map coding units with the depth of the predicted optimal division structure, and calculating the rate distortion cost of the current depth map coding units when the depths are the same, otherwise, not calculating the rate distortion cost, and determining the optimal division structure of the current depth map coding units based on the calculated rate distortion cost.
Further, in the optimal partition structure generation module, if the current depth is smaller than the predicted depth, the rate-distortion cost calculation for the current depth is skipped and the search continues at the next depth; if the current depth is larger than the predicted depth, the rate-distortion cost calculation for the current depth is skipped and the depth search stops, completing the partitioning of the current depth map coding unit.
Further, the partition structure prediction module comprises a feature extraction module, a fusion module and a partition prediction module. The feature extraction module comprises a first feature extraction branch and a second feature extraction branch: the first branch comprises a patch partition layer, a linear transformation layer and a Swin Transformer (shifted-window Transformer) module connected in sequence; the second branch comprises a convolution layer, a normalization layer, an activation layer and a pooling layer connected in sequence. The fusion module fuses the features extracted by the two branches for the partition prediction module to predict the partition structure. The features extracted by the Swin Transformer are reduced in dimension by a convolution layer to lower the computational complexity.
Through multi-layer self-attention mechanisms and feed-forward network layers, the Swin Transformer can capture global and local context information in the input feature map, effectively extracting edge information in the depth map for more accurate prediction in subsequent steps. The depth map is filtered and downsampled by convolution operations, normalized by the BN layer, nonlinearly activated by the ReLU layer, and spatially reduced by the pooling layer; these components extract edge feature information when processing the depth map coding tree unit. Finally, through the combination of the two branches, the network obtains rich feature information.
The invention has the advantages that:
according to the method and the device for dividing the depth map coding units in the three-dimensional video frames, in 3D video depth map intra-frame prediction, the network module based on the Swin transform is utilized to predict the optimal dividing structure of the depth map coding tree units, so that redundant rate distortion cost calculation is skipped, the 3D-HEVC intra-frame depth map coding time is greatly reduced under the condition that the coding quality is basically unchanged, the 3D-HEVC coding complexity is reduced, and the intra-frame prediction coding efficiency in video coding is improved.
Drawings
Fig. 1 is a flowchart of a method for dividing a depth map coding unit in a three-dimensional video frame according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a partition structure prediction network according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of network input and output corresponding to a depth map of a frame according to an embodiment of the present invention.
Fig. 4 is a schematic structural diagram of a device for dividing depth map coding units in a three-dimensional video frame according to an embodiment of the present invention.
Detailed Description
The following describes specific embodiments of the present invention in detail with reference to the drawings. It should be understood that the detailed description and specific examples, while indicating and illustrating the invention, are not intended to limit the invention.
A method for dividing depth map coding units in a three-dimensional video frame adds a partition structure prediction module to the intra-frame prediction of the encoder: a Swin Transformer (Shifted-Window Transformer) module captures the global information of the depth map coding tree unit, compensating for the weakness of CNN (Convolutional Neural Network) models at extracting global information, and the optimal partition structure of the current coding tree unit is predicted. The encoder only needs to pass in the depth map to obtain the optimal partition structure of each coding tree unit, effectively saving video coding time.
In practical use, the encoder calls the method of the present invention to complete intra prediction of the depth map. As shown in fig. 1, the method specifically comprises the following steps:
Step S1: partition structure prediction based on the Swin Transformer. A neural network based on the Swin Transformer is adopted to generate the coding tree unit partition structure; the network structure is shown in fig. 2. The network takes the original-pixel grayscale map of the current coding tree unit to be encoded as input, extracts the feature information of the depth map through convolution, pooling and Swin Transformer modules, and then, after concatenation, outputs the predicted depth partition structure information through further convolution and pooling.
The input of the network is a depth map coding tree unit: a grayscale image with height 64, width 64 and 1 channel. The output is a 1×4×4 partition map that represents the quadtree segmentation result of the depth map coding tree unit; that is, the partition structure prediction network outputs the partition structure map of the current coding tree unit as a grayscale image with height 4, width 4 and 1 channel. The partition structure prediction mainly comprises the following steps:
Step S1.1: feature extraction. The feature extraction module of the network contains 2 parallel branches. The first branch first feeds the input into a Patch Partition module for patch splitting, then performs a linear transformation on the channel data of each pixel through a Linear Embedding layer, then builds feature maps through Swin Transformer Blocks, and finally downsamples through a Patch Merging layer. The Swin Transformer Block is a feature extraction module based on the Transformer architecture, which has shown excellent performance in the field of computer vision. Through multiple self-attention mechanisms and feed-forward network layers, it can capture global and local context information in the input feature map, so edge information in the depth map is extracted effectively for more accurate prediction in subsequent steps. The second branch extracts features through 3×3 convolution groups, each group containing one 3×3 convolution layer, one Batch Normalization (BN) layer, one Rectified Linear Unit (ReLU) activation layer, and one max-pooling layer with kernel size 2. The depth map is filtered and downsampled through the convolution operations, normalized by the BN layer, nonlinearly activated by the ReLU layer, and spatially reduced by the pooling layer; these components extract edge feature information when processing the depth map coding tree unit. Through the combination of the two branches, the network obtains rich feature information.
After the two branches have extracted features, the features from the Swin Transformer branch are reduced in dimension by a 1×1 convolution layer to lower the computational complexity, and the features extracted by the two branches are then fused by a concat operation to obtain a richer, multi-level feature representation.
Step S1.2: partition prediction. The partition prediction head consists of one group of 3×3 convolutions and two 1×1 convolution layers. These layers further process the fused features to predict the 1×4×4 partition map.
The choice of the loss function is determined by the problem the neural network handles. Coding tree unit partition structure prediction is formulated as a regression problem whose predicted partition map pixels lie between 0 and 3, and the loss functions commonly used for regression cannot train the partition structure prediction network well. Therefore, an improved L1 loss function, called the multi-scale L1 loss (MS-L1), is used for training and is defined as:

MS-L1(y, ŷ) = Σ_i [ L1(MaxPool_{k=i}(y), MaxPool_{k=i}(ŷ)) + L1(MinPool_{k=i}(y), MinPool_{k=i}(ŷ)) ]

In the multi-scale L1 loss, MaxPool_{k=i} and MinPool_{k=i} denote max pooling and min pooling with kernel size i, y denotes the partition label output by the network during training, and ŷ denotes the ground-truth partition label. The max and min pooling are intended to enhance local consistency, since depth maps are characterized by local consistency. L1 is expressed as:

L1(A_{k=i}, B_{k=i}) = (1/N) Σ |A_{k=i} - B_{k=i}|

wherein A_{k=i} and B_{k=i} respectively denote the corresponding MaxPool_{k=i}(y), MaxPool_{k=i}(ŷ) or MinPool_{k=i}(y), MinPool_{k=i}(ŷ) in the L1 loss function.
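A minimal PyTorch sketch of this loss follows. The pooling kernel sizes (2 and 4) and the presence of a plain full-resolution L1 term are assumptions, since the scales i are not enumerated in the text; min pooling is realized as max pooling of the negated map:

    import torch
    import torch.nn.functional as F

    def ms_l1_loss(y_pred, y_true, kernel_sizes=(2, 4)):
        # Full-resolution L1 term (assumed), plus L1 between max-pooled and
        # between min-pooled versions of the predicted and true partition maps.
        loss = F.l1_loss(y_pred, y_true)
        for k in kernel_sizes:
            loss = loss + F.l1_loss(F.max_pool2d(y_pred, k),
                                    F.max_pool2d(y_true, k))
            loss = loss + F.l1_loss(-F.max_pool2d(-y_pred, k),   # min pooling
                                    -F.max_pool2d(-y_true, k))
        return loss

    # Example on a batch of 4x4 partition maps with depths in [0, 3].
    y_hat = torch.rand(8, 1, 4, 4) * 3
    y_gt = torch.randint(0, 4, (8, 1, 4, 4)).float()
    print(ms_l1_loss(y_hat, y_gt))

Because max and min pooling preserve the extreme depths within each window, penalizing their differences pushes the network toward the locally consistent partitions that depth maps exhibit.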
Step S2: fast partitioning of the depth map intra-frame coding tree units. The encoder comprises the following operation steps:
step S2.1: and pre-coding the video sequence provided by the universal standard, and extracting a depth map coding tree unit and an optimal partition structure chart as training data. The depth prediction network constructed by the method shown in fig. 2 is used for network training, and then the trained depth neural network is embedded into the intra-frame prediction of the encoder.
Step S2.2: the depth map to be encoded is read frame by frame from the input depth map video and divided into 64×64 coding tree units, and the coding tree units are predicted by the depth prediction network to obtain the optimal partition structures of the coding tree units to be encoded.
Step S2.3: the coding tree unit to be encoded and its predicted depths are read from the input depth map, and the current depth of the coding unit is compared with the depth predicted by the depth prediction network to decide whether to calculate the rate-distortion cost of the current coding unit.
The partition structure is flattened into a 1×16 array, each number representing the depth of one coding unit. The current depth of the coding unit is compared with the depth predicted by the depth prediction network. When the current depth is the same as the predicted depth, the rate-distortion cost is calculated; otherwise, no cost calculation is performed.
Step S2.4: the optimal partition structure of the current coding tree unit is obtained, and the process returns to step S2.3.
Specifically, after the 3D video is read, the encoder passes the depth map to the partition structure prediction module, which splits it into 64×64 coding tree units and predicts them. The predicted partition map is flattened into a 1×16 array, rounded and clipped to integers between 0 and 3, each number representing the depth of one coding unit; after minor adjustments, the array conforms to the arrangement of coding tree unit partition depths. In the coding branch, it is first determined whether the video currently being encoded is a depth map. For coding tree units that are not from the depth map, i.e., coding tree units of the texture video, the original iterative coding tree unit partitioning method is used to obtain the optimal partition result. For a coding tree unit of the depth map, the unit to be encoded and its predicted depths are read, and when encoding reaches the current coding unit, its depth is compared with the predicted depth. If the current depth equals the predicted depth, the RD cost of the current depth is calculated and no further quadtree partitioning is performed. If the current depth is smaller than the predicted depth, the RD cost calculation for the current depth is skipped, quadtree partitioning continues, and the next depth is searched. If the current depth is larger than the predicted depth, not only is the current RD cost calculation skipped, but the depth search also stops. This ensures that the calculation is performed only when the current depth is the depth predicted by the network model, thus skipping unnecessary RD cost calculations and finally obtaining the optimal coding tree unit partition structure. The operation at the encoder side is shown in fig. 3.
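A compact sketch of the flattening and the three-way depth comparison just described is given below; it is illustrative Python, whereas the actual hook lives inside the HTM encoder's partition loop:

    import numpy as np

    def flatten_partition_map(pred_map):
        # 4x4 network output -> 1x16 depth array, rounded and clipped to 0..3;
        # each entry is the depth of one coding unit.
        return np.clip(np.rint(pred_map), 0, 3).astype(int).reshape(16)

    def partition_decision(current_depth, predicted_depth):
        # Returns (compute_rd_cost, continue_splitting) per the rules above.
        if current_depth == predicted_depth:
            return True, False    # evaluate RD cost here; stop splitting
        if current_depth < predicted_depth:
            return False, True    # skip this level's RD cost; keep descending
        return False, False       # past the predicted depth: skip and stop

    depths = flatten_partition_map(np.random.rand(4, 4) * 3)
    print(depths, partition_decision(1, int(depths[0])))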
Corresponding to the embodiment of the method for dividing the depth map coding units in the three-dimensional video frames, the invention also provides an embodiment of a device for dividing the depth map coding units in the three-dimensional video frames.
Referring to fig. 4, the device for dividing depth map coding units in a three-dimensional video frame provided by an embodiment of the present invention comprises a memory and one or more processors; executable code is stored in the memory, and when executing the executable code, the one or more processors implement the method for dividing depth map coding units in a three-dimensional video frame of the above embodiment.
The embodiment of the device for dividing depth map coding units in a three-dimensional video frame can be applied to any apparatus with data processing capability, such as a computer. The device embodiments may be implemented by software, or by hardware or a combination of hardware and software. Taking software implementation as an example, the device in the logical sense is formed by the processor of the apparatus with data processing capability reading the corresponding computer program instructions from nonvolatile memory into memory and running them. In terms of hardware, fig. 4 shows a hardware structure diagram of the apparatus with data processing capability where the device for dividing depth map coding units in a three-dimensional video frame of the present invention is located. Besides the processor, memory, network interface, and nonvolatile memory shown in fig. 4, the apparatus in the embodiment generally includes other hardware according to its actual function, which is not described here again.
The implementation process of the functions and roles of each unit in the above device is specifically shown in the implementation process of the corresponding steps in the above method, and will not be described herein again.
For the device embodiments, reference is made to the description of the method embodiments for the relevant points, since they essentially correspond to the method embodiments. The device embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purposes of the solution of the present invention. Those of ordinary skill in the art can understand and implement the present invention without creative effort.
The embodiment of the invention also provides a computer readable storage medium, on which a program is stored, which when executed by a processor, implements a method for dividing a depth map coding unit in a three-dimensional video frame in the above embodiment.
The computer readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any of the devices with data processing capability described in the previous embodiments. It may also be an external storage device of the device with data processing capability, such as a plug-in hard disk, a Smart Media Card (SMC), an SD card, or a Flash Card provided on the device. Further, the computer readable storage medium may include both an internal storage unit and an external storage device of the device with data processing capability. The computer readable storage medium is used to store the computer program and the other programs and data required by the device with data processing capability, and may also be used to temporarily store data that has been output or is to be output.
The above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced with equivalents; such modifications and substitutions do not depart from the spirit of the technical solutions according to the embodiments of the present invention.

Claims (10)

1. A depth map coding unit partitioning method, characterized by comprising the following steps:
step S1: constructing a partition structure prediction network, and carrying out partition prediction on a depth map coding unit;
step S2: obtaining a depth map coding unit to be partitioned, obtaining a predicted optimal partition structure through the partition structure prediction network, and comparing the current depth of the depth map coding unit with the depth of the predicted optimal partition structure; when the depths are the same, calculating the rate-distortion cost of the current depth map coding unit, and otherwise skipping the calculation; and determining the optimal partition structure of the current depth map coding unit based on the calculated rate-distortion costs.
2. The method for partitioning a depth map coding unit according to claim 1, wherein: in step S2, if the current depth is smaller than the predicted depth, the rate-distortion cost calculation of the current depth is skipped and the search continues at the next depth; if the current depth is larger than the predicted depth, the rate-distortion cost calculation of the current depth is skipped and the depth search stops, completing the partitioning of the current depth map coding unit.
3. The method for partitioning a depth map coding unit according to claim 1, wherein step S1 comprises the following steps:
step S1.1: feature extraction: acquiring a depth map coding unit, partitioning it into patches, performing a linear transformation on the channel data of each pixel, extracting features through a shifted-window Transformer, and merging the patches corresponding to the features to obtain first features; performing feature extraction on the depth map coding unit through convolution groups to obtain second features; and fusing the first features with the second features;
step S1.2: carrying out partition prediction on the fused features to obtain a predicted partition map.
4. The method for partitioning a depth map coding unit according to claim 1, wherein: in step S1, a multi-scale L1 loss function (MS-L1) is constructed for training the partition structure prediction:

MS-L1(y, ŷ) = Σ_i [ L1(MaxPool_{k=i}(y), MaxPool_{k=i}(ŷ)) + L1(MinPool_{k=i}(y), MinPool_{k=i}(ŷ)) ]

wherein MaxPool_{k=i} and MinPool_{k=i} respectively denote max pooling and min pooling with kernel size i, y denotes the partition result output by the network during training, ŷ denotes the ground-truth partition result, L1 is defined as L1(A_{k=i}, B_{k=i}) = (1/N) Σ |A_{k=i} - B_{k=i}|, and A_{k=i}, B_{k=i} respectively denote the corresponding MaxPool_{k=i}(y), MaxPool_{k=i}(ŷ) or MinPool_{k=i}(y), MinPool_{k=i}(ŷ) in the L1 loss function.
5. A method for dividing depth map coding units in a three-dimensional video frame, characterized in that, based on the depth map coding unit partitioning method according to claim 1, the step S2 comprises the following steps:
step S2.1: pre-encoding the video sequences provided by the common coding standard, and extracting depth map coding units and their optimal partition structures as training data for training the partition structure prediction network;
step S2.2: reading the depth map to be encoded frame by frame from the depth map video, dividing the depth map into depth map coding units, and obtaining the predicted optimal partition structures through the trained partition structure prediction network;
step S2.3: obtaining from the depth map the depth of the depth map coding unit to be encoded and its predicted optimal partition structure, and comparing the current depth of the depth map coding unit with the depth of the predicted optimal partition structure; when the depths are the same, calculating the rate-distortion cost of the current depth map coding unit, and otherwise skipping the calculation;
step S2.4: obtaining the optimal partition structure of the current depth map coding unit based on the calculated rate-distortion costs, and returning to step S2.3 until the optimal partition structures of the coding units of all the depth map videos are obtained.
6. The method for dividing depth map coding units in a three-dimensional video frame according to claim 5, wherein: in step S2.3, if the current depth is smaller than the predicted depth, the rate-distortion cost calculation of the current depth is skipped and the search continues at the next depth; if the current depth is larger than the predicted depth, the rate-distortion cost calculation of the current depth is skipped and the depth search stops, completing the partitioning of the current depth map coding unit.
7. A three-dimensional video intra-frame depth map coding unit partitioning apparatus, comprising a memory and one or more processors, wherein the memory stores executable code, and wherein the one or more processors are configured to implement the three-dimensional video intra-frame depth map coding unit partitioning method of claim 5 or 6 when the executable code is executed.
8. A depth map coding unit partitioning device, comprising a partition structure prediction module and an optimal partition structure generation module, characterized in that:
the partition structure prediction module is used for carrying out partition prediction on the depth map coding unit;
the optimal division structure generation module is used for acquiring the depth map coding units to be divided, obtaining a predicted optimal division structure through the division structure prediction module, comparing the current depth of the depth map coding units with the depth of the predicted optimal division structure, and calculating the rate distortion cost of the current depth map coding units when the depths are the same, otherwise, not calculating the rate distortion cost, and determining the optimal division structure of the current depth map coding units based on the calculated rate distortion cost.
9. The depth map coding unit partitioning apparatus of claim 8, wherein: in the optimal partition structure generation module, if the current depth is smaller than the predicted depth, skipping rate distortion cost calculation of the current depth, and continuing searching of the next depth; if the current depth is larger than the predicted depth, skipping rate distortion cost calculation of the current depth, stopping further depth search, and completing division of the current depth map coding unit.
10. A depth map coding unit partitioning device as claimed in claim 8 or 9, wherein: the partition structure prediction module comprises a feature extraction module, a fusion module and a partition prediction module; the feature extraction module comprises a first feature extraction branch and a second feature extraction branch, the first feature extraction branch comprising a patch partition layer, a linear transformation layer and a shifted-window Transformer connected in sequence, and the second feature extraction branch comprising a convolution layer, a normalization layer, an activation layer and a pooling layer connected in sequence; and the fusion module fuses the features extracted by the two branches for the partition prediction module to predict the partition structure.
CN202310900673.1A 2023-07-21 2023-07-21 Method and device for dividing depth map coding units in three-dimensional video frame Pending CN116962689A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310900673.1A CN116962689A (en) 2023-07-21 2023-07-21 Method and device for dividing depth map coding units in three-dimensional video frame

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310900673.1A CN116962689A (en) 2023-07-21 2023-07-21 Method and device for dividing depth map coding units in three-dimensional video frame

Publications (1)

Publication Number Publication Date
CN116962689A true CN116962689A (en) 2023-10-27

Family

ID=88444022

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310900673.1A Pending CN116962689A (en) 2023-07-21 2023-07-21 Method and device for dividing depth map coding units in three-dimensional video frame

Country Status (1)

Country Link
CN (1) CN116962689A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination