CN114513660B - Interframe image mode decision method based on convolutional neural network - Google Patents


Info

Publication number
CN114513660B
Authority
CN
China
Prior art keywords
layer
residual
mode
coding
tree
Prior art date
Legal status
Active
Application number
CN202210407485.0A
Other languages
Chinese (zh)
Other versions
CN114513660A (en)
Inventor
Jiang Xiantao (蒋先涛)
Zhang Jizhuang (张纪庄)
Guo Yongmei (郭咏梅)
Guo Yongyang (郭咏阳)
Current Assignee
Ningbo Kangda Kaineng Medical Technology Co ltd
Original Assignee
Ningbo Kangda Kaineng Medical Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Ningbo Kangda Kaineng Medical Technology Co ltd filed Critical Ningbo Kangda Kaineng Medical Technology Co ltd
Priority to CN202210407485.0A
Publication of CN114513660A
Application granted
Publication of CN114513660B

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/103Selection of coding mode or of prediction mode
    • H04N19/109Selection of coding mode or of prediction mode among a plurality of temporal predictive coding modes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/189Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the adaptation method, adaptation tool or adaptation type used for the adaptive coding
    • H04N19/192Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the adaptation method, adaptation tool or adaptation type used for the adaptive coding the adaptation method, adaptation tool or adaptation type being iterative or recursive


Abstract

The invention discloses an interframe image mode decision method based on a convolutional neural network, relating to the technical field of image processing and comprising the following steps: acquire the coded image and the residual image of the next coding depth after the target inter-frame image executes merge mode; take the concatenated coded image and residual image as input information and extract bottom-layer features from the input information through the convolutional layer of the multi-layer tree CNN; perform layer-by-layer convolution based on the bottom-layer features through residual layers at preset levels of the multi-layer tree CNN; fully connect the convolution outputs of all layers through the fully connected layer of the multi-layer tree CNN to obtain the partition mode of each coding block at the current coding depth of the target inter-frame image; and encode the target inter-frame image according to the partition mode of each coding block at each coding depth. The method exploits the low bit-rate cost of merge mode and the feature-learning ability of the convolutional neural network to reduce encoding time while maintaining rate-distortion performance.

Description

Interframe image mode decision method based on convolutional neural network
Technical Field
The invention relates to the technical field of image processing, in particular to an interframe image mode decision method based on a convolutional neural network.
Background
With the development of multimedia technology, new video formats such as ultra high definition (UHD), virtual reality (VR), and 360-degree video have appeared, and demand is growing for new video coding standards that support higher resolution with higher coding efficiency. Versatile Video Coding (VVC) was developed by the Joint Video Experts Team (JVET) of VCEG and MPEG and was finalized in July 2020. As the latest video coding standard, VVC adopts several new coding schemes and tools, such as a coding tree unit (CTU) with a maximum size of 128×128, a quadtree plus multi-type tree (QT+MTT) coding unit (CU) partitioning structure, and affine motion compensated prediction. These new techniques achieve approximately 50% bit-rate reduction over the HEVC standard, but the computational complexity of encoding and decoding increases correspondingly and dramatically.
The VVC encoder exploits the redundancy that exists between pictures. After block division, motion compensation is performed on each coding block. There are two main coding methods for inter prediction: the advanced motion vector prediction (AMVP) mode and the merge mode. In AMVP mode, the optimal value among multiple motion vector candidates, the motion vector difference, the reference picture number, and the unidirectional/bidirectional prediction mode are all encoded. In merge mode, only the optimal value among the candidate motion vectors is encoded. AMVP mode has the advantage that parameters are determined and encoded freely, but the number of bits required to encode them is high, and a complex encoding process with motion estimation is needed. For merge mode, the number of bits required for encoding is very small, but the prediction is less accurate. Some studies on coding under the VVC standard have been completed; however, few have considered the characteristics of inter prediction, and related studies show that CNN-based methods are well suited to processing images. We therefore define the problem as using a convolutional neural network (CNN) to decide the split mode of the coding tree unit (CTU), and propose a new CNN-based fast mode decision method for VVC inter prediction.
Disclosure of Invention
In order to reduce the high computational complexity of inter-frame coding caused by the advanced motion vector prediction mode, the invention provides an interframe image mode decision method based on a convolutional neural network, comprising the following steps:
s1: acquiring the coded image and the residual image of the next coding depth after the target inter-frame image executes merge mode;
s2: taking the concatenated coded image and residual image as input information, and extracting bottom-layer features from the input information through the convolutional layer of the multi-layer tree CNN;
s3: performing layer-by-layer convolution based on the bottom-layer features through residual layers at preset levels of the multi-layer tree CNN, and acquiring the convolution output of each layer;
s4: fully connecting the convolution outputs of all layers through the fully connected layer of the multi-layer tree CNN, and obtaining the partition mode of each coding block at the current coding depth of the target inter-frame image;
s5: judging whether the current coding depth reaches the maximum depth; if so, encoding the target inter-frame image according to the partition mode of each coding block at each coding depth; if not, entering the next coding depth and returning to step S1.
Further, the step S1 is preceded by the step of:
s0: training the multi-layer tree CNN based on the partition mode selection results at each coding depth acquired with the advanced motion vector prediction mode and the corresponding inter-frame images.
Further, in the step S0 the multi-layer tree CNN is trained with a weighted classification cross entropy loss function, which can be expressed as:

$$ loss = \sum_{l=1}^{L} w_l \, CE_l $$

where loss is the weighted classification loss, L is the total number of residual layers in the multi-layer tree CNN, l is the layer index (a counter starting at 1), w_l is the weight of the l-th residual layer, and CE_l is the cross entropy loss at the l-th residual layer.
Further, the partition modes include the non-split mode, quadtree mode, horizontal binary tree mode, vertical binary tree mode, horizontal ternary tree mode, and vertical ternary tree mode.
Further, the step S4 is followed by the steps of:
s41: judging whether the partition mode of the coding block at the current coding depth is the non-split mode; if so, entering step S42; if not, entering step S5;
s42: stopping the partition mode decision for subsequent coding depths of this coding block; after the partition mode decisions of all coding blocks are completed, encoding the target inter-frame image according to the partition mode of each coding block at each coding depth.
Further, after the step S3, the method further includes the step of:
s31: concatenating, as an information vector, the convolution output with the picture-number information and the quantization parameter of the coding block at the current coding depth and partition.
Further, the multi-layer tree CNN includes:
a convolutional layer, containing 3×3 convolution kernels, for extracting bottom-layer features from the input information;
a transition residual layer, for outputting the first residual block according to the bottom-layer features;
a head-end residual layer, for outputting the first convolution output and the second residual block by convolution between the bottom-layer features and the first residual block;
a middle residual layer, for outputting the second convolution output and the third residual block by convolution between the bottom-layer features and the second residual block;
a tail-end residual layer, for outputting the third convolution output by convolution between the bottom-layer features and the third residual block;
a fully connected layer, for fully connecting the first, second, and third convolution outputs and outputting the partition mode decision;
the convolutional layer, the transition residual layer, the head-end residual layer, the middle residual layer, and the tail-end residual layer are connected in sequence.
Furthermore, an information vector connection layer is connected between each of the head-end residual layer, the middle residual layer, and the tail-end residual layer and the fully connected layer, and is used to concatenate the convolution output with the corresponding picture-number information and the quantization parameter of the coding block.
Compared with the prior art, the invention at least has the following beneficial effects:
(1) the interframe image mode decision method based on a convolutional neural network combines the merge mode of inter prediction with a convolutional neural network (CNN), exploiting the low bit-rate cost of merge mode and the feature learning of the CNN to reduce the time required for encoding while maintaining rate-distortion performance;
(2) the picture-number information and the quantization parameters of the coding blocks at the current coding depth and partition are added into the multi-layer tree CNN, so that the network better matches the parameter characteristics of the actual encoding process, further improving decision accuracy;
(3) for the coding block partitioning problem, considering that block division resembles a tree-shaped split structure, the hierarchical split-tree characteristics of coding block partitioning are learned through the multi-layer tree CNN with layer weights that vary across training iterations;
(4) during training of the multi-layer tree CNN, higher weights are set for the corresponding levels at different training stages, so that the trained multi-layer tree CNN solves complex problems more effectively.
Drawings
FIG. 1 is a diagram of method steps for an interframe image mode decision method based on a convolutional neural network;
FIG. 2 is a schematic diagram of the multi-layer tree CNN architecture.
Detailed Description
The following are specific embodiments of the present invention and are further described with reference to the drawings, but the present invention is not limited to these embodiments.
Embodiment 1
VVC inherits the quadtree partitioning of HEVC and, to better adapt to encoding ultra-high-definition video, raises the maximum allowed coding tree unit size to 128×128. For the VVC inter-coding and QT+MTT partitioning problems, a new computational-complexity optimization method is needed. Under the QT+MTT partition structure, a coding unit can be partitioned by a quadtree (QT), a binary tree (BT), or a ternary tree (TT); BT and TT can additionally split in the horizontal (H) or vertical (V) direction. A coding unit therefore has 6 split modes in total (the non-split mode Non-split, quadtree mode QT, horizontal binary tree mode BT_H, vertical binary tree mode BT_V, horizontal ternary tree mode TT_H, and vertical ternary tree mode TT_V, referred to by the numerals 0 to 5 in the present invention). More specifically, the coding tree unit is first partitioned by the QT structure, and the coding unit in each QT leaf node can then be further partitioned by a QT or MTT structure. The split modes and their indices can be captured in a small enumeration, as sketched below.
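For illustration only, a minimal Python sketch of the six split modes and the 0-5 numbering used in this description (the enum and its member names are this document's terminology, not part of any encoder API):

```python
from enum import IntEnum

class SplitMode(IntEnum):
    """The six QT+MTT split modes, numbered 0-5 as in this description."""
    NON_SPLIT = 0  # keep the coding block whole
    QT = 1         # quadtree: four equal sub-blocks
    BT_H = 2       # horizontal binary tree: two halves stacked vertically
    BT_V = 3       # vertical binary tree: two halves side by side
    TT_H = 4       # horizontal ternary tree: rows split 1/4, 1/2, 1/4
    TT_V = 5       # vertical ternary tree: columns split 1/4, 1/2, 1/4
```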
Since VVC enlarges the maximum allowed coding tree unit size of HEVC and introduces multi-type tree partitioning, VVC generally adopts the advanced motion vector prediction (AMVP) mode to better encode an inter-frame image, which requires a complex encoding process with motion estimation and a high number of bits for encoding parameters. As a result, VVC inter prediction with AMVP mode needs a large amount of computation to find the optimal coding mode, and coding efficiency is lower than desired. Meanwhile, considering that the block division of a coding unit resembles a tree-shaped split structure, the invention provides an interframe image mode decision method based on a convolutional neural network, as shown in fig. 1, comprising the following steps:
s1: acquiring the coded image and the residual image of the next coding depth after the target inter-frame image executes merge mode;
s2: taking the concatenated coded image and residual image as input information, and extracting bottom-layer features from the input information through the convolutional layer of the multi-layer tree CNN;
s3: performing layer-by-layer convolution based on the bottom-layer features through residual layers at preset levels of the multi-layer tree CNN, and acquiring the convolution output of each layer;
s4: fully connecting the convolution outputs of all layers through the fully connected layer of the multi-layer tree CNN, and obtaining the partition mode of each coding block at the current coding depth of the target inter-frame image;
s5: judging whether the current coding depth reaches the maximum depth; if so, encoding the target inter-frame image according to the partition mode of each coding block at each coding depth; if not, entering the next coding depth and returning to step S1.
The multi-layer tree CNN appearing in these steps is designed for the QT+MTT structure of the VVC coding standard. As shown in fig. 2, the network consists mainly of one convolutional layer and four residual layers of different sizes (each ResBlock contains a BN layer, a ReLU layer, and a Conv layer connected in sequence), organized into three hierarchical levels. The convolutional layer (Conv 3, 32), transition residual layer (ResBlock, 32), head-end residual layer (ResBlock, 64), middle residual layer (ResBlock, 128), and tail-end residual layer (ResBlock, 256) are connected in sequence. First, the lower-precision motion vector prediction result obtained after executing merge mode on the inter-frame image at the current coding depth is acquired, comprising the coded image and the residual image of the next coding depth. The coded image and the residual image are then concatenated (both are used, but kept as independent inputs) as the input information of the multi-layer tree CNN.
In the multi-layer tree CNN, pixel-level bottom-layer features are first extracted by the convolutional layer with a 3×3 kernel. The transition residual layer then produces the first residual block from the bottom-layer features. Next, the head-end, middle, and tail-end residual layers each convolve the image residual information of the residual block output by the preceding residual layer (the first, second, and third residual blocks, respectively) with the bottom-layer features, outputting the next residual block and the convolution output of that level (the first convolution output from the head-end layer, the second from the middle layer, and the third from the tail-end layer). Finally, the convolution outputs of the three levels of residual layers are fused by the fully connected layer (FC), which outputs the partition mode of each coding block at the current coding depth of the target inter-frame image.
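A minimal PyTorch sketch of this architecture follows. The channel widths (32/64/128/256), the BN→ReLU→Conv ordering inside each residual layer, and the fusion of the three level outputs with the (picture number, QP) information vector follow the description above; the skip projections, the pooling, the fully connected layer size, and the two input channels (one luma plane each for the coded and residual images) are assumptions made only so the sketch runs. The purely sequential chaining of residual layers is also a simplification of the described interplay between the bottom-layer features and each residual block.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """One residual layer: BN -> ReLU -> Conv, plus a 1x1 skip projection
    (the projection is an assumption to make channel counts match)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        )
        self.skip = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.skip(x) + self.body(x)

class MultiLayerTreeCNN(nn.Module):
    """Conv(3x3, 32) -> ResBlock(32) -> ResBlock(64) -> ResBlock(128) ->
    ResBlock(256); the head/middle/tail outputs are pooled, concatenated
    with the (picture number, QP) info vector, and fused by an FC layer
    into a 6-way split-mode decision."""
    def __init__(self, in_ch=2, n_modes=6):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, 32, kernel_size=3, padding=1)  # bottom-layer features
        self.transition = ResBlock(32, 32)    # first residual block
        self.head = ResBlock(32, 64)          # first convolution output
        self.middle = ResBlock(64, 128)       # second convolution output
        self.tail = ResBlock(128, 256)        # third convolution output
        self.pool = nn.AdaptiveAvgPool2d(1)   # pooling choice is an assumption
        self.fc = nn.Linear(64 + 128 + 256 + 3 * 2, n_modes)

    def forward(self, coded_img, residual_img, pic_no, qp):
        # concatenate coded image and residual image as independent channels
        x = self.stem(torch.cat([coded_img, residual_img], dim=1))
        h = self.head(self.transition(x))
        m = self.middle(h)
        t = self.tail(m)
        # info vector: picture number and QP (set to zero when unavailable)
        info = torch.stack([pic_no, qp], dim=1).float()
        levels = [torch.cat([self.pool(y).flatten(1), info], dim=1) for y in (h, m, t)]
        return self.fc(torch.cat(levels, dim=1))  # logits over the 6 split modes
```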
Since one merge-mode pass plus the multi-layer tree CNN can only decide the motion vector selection and partition mode at a single coding depth, the above operations are repeated at each coding depth until the maximum coding depth is reached. Once it is reached, the target inter-frame image can be encoded according to the partition mode and motion vector of each coding block at each coding depth.
It should be noted that, because of the non-split mode, a non-split decision indicates that the block at the current coding depth already achieves the optimal partitioning and requires no further division. Therefore, during operation of the multi-layer tree CNN, step S4 is followed by the steps of:
s41: judging whether the partition mode of the coding block at the current coding depth is the non-split mode; if so, entering step S42; if not, entering step S5;
s42: stopping the partition mode decision for subsequent coding depths of this coding block; after the partition mode decisions of all coding blocks are completed, encoding the target inter-frame image according to the partition mode of each coding block at each coding depth. A sketch of this per-depth loop with the early stop follows.
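Purely as an illustration of steps S1-S5 with the S41/S42 early stop, the following Python sketch walks one block through the per-depth decision loop; run_merge_mode, split_block, the block.pic_no/block.qp attributes, and max_depth are hypothetical stand-ins for the encoder internals, while SplitMode and the network refer to the earlier sketches.

```python
def decide_partitions(block, net, depth=0, max_depth=4):
    """Return a {block: SplitMode} map for this block and its sub-blocks."""
    # S1: a merge-mode pass yields the coded image and next-depth residual image
    coded_img, residual_img = run_merge_mode(block)          # hypothetical helper
    # S2-S4: the multi-layer tree CNN maps them to a split-mode decision
    logits = net(coded_img, residual_img, block.pic_no, block.qp)
    mode = SplitMode(int(logits.argmax()))                   # single-block batch assumed
    decisions = {block: mode}
    # S41/S42: a non-split decision (or reaching max depth) ends recursion here
    if mode == SplitMode.NON_SPLIT or depth + 1 >= max_depth:
        return decisions
    # S5: otherwise descend one coding depth into each resulting sub-block
    for sub in split_block(block, mode):                     # hypothetical helper
        decisions.update(decide_partitions(sub, net, depth + 1, max_depth))
    return decisions
```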
In order to keep the quantity of parameters involved in the block division decision consistent with the actual encoding process and to improve training performance, an information vector connection layer (Info. Vector) is connected between each of the head-end residual layer, the middle residual layer, and the tail-end residual layer and the fully connected layer, to concatenate each convolution output with the corresponding picture-number information and the quantization parameter of the coding block (set to zero if the information is unavailable). Correspondingly, after the step S3, the method further includes the step of:
s31: concatenating, as an information vector, the convolution output with the picture-number information and the quantization parameter of the coding block at the current coding depth and partition.
Of course, a multi-layer tree CNN specified only by the structure and functions described above cannot be applied directly to actual inter-frame image coding; a training process is required before actual operation. Therefore, before the multi-layer tree CNN is put into use, that is, before the step S1, the method further includes the step of:
s0: training the multi-layer tree CNN based on the partition mode selection results at each coding depth acquired with the advanced motion vector prediction mode and the corresponding inter-frame images.
Since partition mode selection results produced by the method of the invention are not yet available at the initial training stage, training data are first collected with the advanced motion vector prediction mode. After the multi-layer tree CNN has been trained and put into operation for a period of time, it can be updated with the partition mode selection results obtained by the method itself.
In the initial design stage of the multi-layer tree CNN, the cross entropy loss function adopted for training is:

$$ CE = -\sum_{c=1}^{C} p_c(s) \log q_c(s) $$

where s is the input sample, C is the total number of classes, and p_c(s) and q_c(s) are respectively the true probability and the predicted probability that sample s belongs to block-split class c. Further, q_c(s) can be expressed as the softmax:

$$ q_c(s) = \frac{e^{s_c}}{\sum_{j=1}^{C} e^{s_j}} $$

where j is the class index, e is the base of the natural logarithm, and s_c is the network output value for class c.
Therefore, in order to make the multi-layer tree CNN better fit the block division characteristics of coding units under the VVC coding standard, the invention trains it with a weighted classification cross entropy loss function, expressed as:

$$ loss = \sum_{l=1}^{L} w_l \, CE_l $$

where loss is the weighted classification loss, L is the total number of residual layers in the multi-layer tree CNN, l is the layer index (starting from 1), w_l is the weight of the l-th residual layer, and CE_l is the cross entropy loss at the l-th residual layer. In addition, w_l is iterated through further training once the multi-layer tree CNN has been in operation for a period of time. As the formula shows, when training the residual layers at different levels of the multi-layer tree CNN, more weight can be given to the loss of the head-end residual layer early in training, and as learning progresses the deeper residual layers (the middle and tail-end residual layers) also receive more weight, so that the multi-layer tree CNN trained under this cross entropy loss function handles complex cases more effectively. A sketch of this weighted loss follows.
In summary, the interframe image mode decision method based on a convolutional neural network combines the merge mode of inter prediction with a convolutional neural network (CNN), exploiting the low bit-rate cost of merge mode and the feature learning of the CNN to reduce the time required for encoding while maintaining rate-distortion performance.
The picture-number information and the quantization parameters of the coding blocks at the current coding depth and partition are added into the multi-layer tree CNN, so that the network better matches the parameter characteristics of the actual encoding process, further improving decision accuracy.
For the coding block partitioning problem, considering that block division resembles a tree-shaped split structure, the hierarchical split-tree characteristics of coding block partitioning are learned through the multi-layer tree CNN with layer weights that vary across training iterations. Meanwhile, during training, higher weights are set for the corresponding levels at different training stages, so that the trained multi-layer tree CNN solves complex problems more effectively.
Embodiment 2
In order to better verify the technical effect of the method of the present invention, this embodiment presents a set of specific experimental data. Specifically, the performance of the algorithm is verified by comparing its rate distortion and computational complexity against the VVC reference model (VTM) encoder, with standard VVC test video sequences used in the experiments. The multi-layer tree CNN is trained with the Adam optimizer at an initial learning rate of 0.0008. To evaluate the proposed algorithm, BDBR (Bjøntegaard Delta Bit Rate) is used to assess the overall rate-distortion characteristics, and the reduction in coding computational complexity is measured by the average coding time saving (ΔT):
$$ \Delta T = \frac{1}{N} \sum_{QP} \frac{T_{VTM}(QP) - T_{prop}(QP)}{T_{VTM}(QP)} \times 100\% $$

where T_{VTM} and T_{prop} are respectively the coding time of the reference software and of the algorithm proposed in this patent at each tested quantization parameter (QP) value, and N is the number of tested QP values. The experimental results are shown in Table 1: the method of the invention reduces encoding time by 34% while coding efficiency is lost by only 1.1%, confirming the effectiveness of the invention.
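A toy computation of ΔT under this definition, with made-up timings for two QP points (the values are illustrative only, not measurements from the patent):

```python
def delta_t(t_vtm, t_prop):
    """Average percentage coding-time saving over the tested QP points."""
    savings = [(t_vtm[qp] - t_prop[qp]) / t_vtm[qp] for qp in t_vtm]
    return 100.0 * sum(savings) / len(savings)

# e.g. VTM vs. proposed encoder times (seconds) at two QP values
print(delta_t({22: 100.0, 27: 80.0}, {22: 66.0, 27: 53.0}))  # ~33.9% saved
```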
Table 1: List of experimental results
[Table image not reproduced; per the text above, the average coding time saving ΔT is approximately 34% with a BDBR increase of approximately 1.1%.]
It should be noted that all directional indicators in the embodiments of the present invention (such as up, down, left, right, front, and rear) are only used to explain the relative positional relationships, movement, and the like between components in a specific posture (as shown in the drawings); if the specific posture changes, the directional indicators change accordingly.
Moreover, descriptions in the present invention involving "first," "second," and the like are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two or three, unless specifically limited otherwise.
In the present invention, unless otherwise expressly stated or limited, the terms "connected," "secured," and the like are to be construed broadly, and for example, "secured" may be a fixed connection, a removable connection, or an integral part; can be mechanically or electrically connected; they may be directly connected or indirectly connected through intervening media, or they may be interconnected within two elements or in a relationship where two elements interact with each other unless otherwise specifically limited. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
In addition, the technical solutions in the embodiments of the present invention may be combined with each other, but it must be based on the realization of those skilled in the art, and when the technical solutions are contradictory or cannot be realized, such a combination of technical solutions should not be considered to exist, and is not within the protection scope of the present invention.

Claims (6)

1. An interframe image mode decision method based on a convolutional neural network, characterized by comprising the following steps:
s1: acquiring the coded image and the residual image of the next coding depth after the target inter-frame image executes merge mode;
s2: taking the concatenated coded image and residual image as input information, and extracting bottom-layer features from the input information through the convolutional layer of the multi-layer tree CNN;
s3: performing layer-by-layer convolution based on the bottom-layer features through residual layers at preset levels of the multi-layer tree CNN, and acquiring the convolution output of each layer;
s4: fully connecting the convolution outputs of all layers through the fully connected layer of the multi-layer tree CNN, and obtaining the partition mode of each coding block at the current coding depth of the target inter-frame image;
s5: judging whether the current coding depth reaches the maximum depth; if so, encoding the target inter-frame image according to the partition mode of each coding block at each coding depth; if not, entering the next coding depth and returning to step S1;
the partition modes comprise the non-split mode, quadtree mode, horizontal binary tree mode, vertical binary tree mode, horizontal ternary tree mode, and vertical ternary tree mode;
the step S4 is followed by the steps of:
s41: judging whether the partition mode of the coding block at the current coding depth is the non-split mode; if so, entering step S42; if not, entering step S5;
s42: stopping the partition mode decision for subsequent coding depths of this coding block; after the partition mode decisions of all coding blocks are completed, encoding the target inter-frame image according to the partition mode of each coding block at each coding depth.
2. The convolutional neural network-based interframe image mode decision method of claim 1, wherein the step S1 is preceded by the step of:
s0: training the multi-layer tree CNN based on the partition mode selection results at each coding depth acquired with the advanced motion vector prediction mode and the corresponding inter-frame images.
3. The convolutional neural network-based interframe image mode decision method as claimed in claim 2, wherein in the step S0 the multi-layer tree CNN is trained with a weighted classification cross entropy loss function, which can be expressed as:

$$ loss = \sum_{l=1}^{L} w_l \, CE_l $$

where loss is the weighted classification loss, L is the total number of residual layers in the multi-layer tree CNN, l is the layer index (starting from 1), w_l is the weight of the l-th residual layer, and CE_l is the cross entropy loss at the l-th residual layer.
4. The convolutional neural network-based interframe image mode decision method as claimed in claim 1, wherein after the step S3 the method further comprises the step of:
s31: concatenating, as an information vector, the convolution output with the picture-number information and the quantization parameter of the coding block at the current coding depth and partition.
5. The convolutional neural network-based interframe image mode decision method as claimed in claim 1, wherein the multi-layer tree CNN comprises:
a convolutional layer, containing 3×3 convolution kernels, for extracting bottom-layer features from the input information;
a transition residual layer, for outputting the first residual block according to the bottom-layer features;
a head-end residual layer, for outputting the first convolution output and the second residual block by convolution between the bottom-layer features and the first residual block;
a middle residual layer, for outputting the second convolution output and the third residual block by convolution between the bottom-layer features and the second residual block;
a tail-end residual layer, for outputting the third convolution output by convolution between the bottom-layer features and the third residual block;
a fully connected layer, for fully connecting the first, second, and third convolution outputs and outputting the partition mode decision;
wherein the convolutional layer, the transition residual layer, the head-end residual layer, the middle residual layer, and the tail-end residual layer are connected in sequence.
6. The convolutional neural network-based interframe image mode decision method as claimed in claim 5, wherein an information vector connection layer is connected between each of the head-end residual layer, the middle residual layer, and the tail-end residual layer and the fully connected layer, and is used to concatenate the convolution output with the corresponding picture-number information and the quantization parameter of the coding block.
CN202210407485.0A 2022-04-19 2022-04-19 Interframe image mode decision method based on convolutional neural network Active CN114513660B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210407485.0A CN114513660B (en) 2022-04-19 2022-04-19 Interframe image mode decision method based on convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210407485.0A CN114513660B (en) 2022-04-19 2022-04-19 Interframe image mode decision method based on convolutional neural network

Publications (2)

Publication Number Publication Date
CN114513660A CN114513660A (en) 2022-05-17
CN114513660B true CN114513660B (en) 2022-09-06

Family

ID=81555492

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210407485.0A Active CN114513660B (en) 2022-04-19 2022-04-19 Interframe image mode decision method based on convolutional neural network

Country Status (1)

Country Link
CN (1) CN114513660B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115695803B (en) * 2023-01-03 2023-05-12 Ningbo Kangda Kaineng Medical Technology Co ltd Inter-frame image coding method based on extreme learning machine

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107609601B (en) * 2017-09-28 2021-01-22 Beijing Institute of Computer Technology and Application Ship target identification method based on multilayer convolutional neural network
CN111742553A (en) * 2017-12-14 2020-10-02 InterDigital VC Holdings, Inc. Deep learning based image partitioning for video compression
CN108924558B (en) * 2018-06-22 2021-10-22 University of Electronic Science and Technology of China Video predictive coding method based on neural network
CN113767400A (en) * 2019-03-21 2021-12-07 Google LLC Using rate distortion cost as a loss function for deep learning
CN110087087B (en) * 2019-04-09 2023-05-12 Tongji University VVC inter-frame coding unit prediction mode early decision and block division early termination method
US11936864B2 (en) * 2019-11-07 2024-03-19 Bitmovin, Inc. Fast multi-rate encoding for adaptive streaming using machine learning
CN115606179A (en) * 2020-05-15 2023-01-13 Huawei Technologies Co., Ltd. (CN) CNN filter for learning-based downsampling for image and video coding using learned downsampling features
US11477464B2 (en) * 2020-09-16 2022-10-18 Qualcomm Incorporated End-to-end neural network based video coding
CN112261414B (en) * 2020-09-27 2021-06-29 University of Electronic Science and Technology of China Video coding convolution filtering method divided by attention mechanism fusion unit
CN112702599B (en) * 2020-12-24 2022-05-20 Chongqing University of Technology VVC intra-frame rapid coding method based on deep learning
CN112887712B (en) * 2021-02-03 2021-11-19 Chongqing University of Posts and Telecommunications HEVC intra-frame CTU partitioning method based on convolutional neural network
CN114286093A (en) * 2021-12-24 2022-04-05 Hangzhou Dianzi University Rapid video coding method based on deep neural network

Also Published As

Publication number Publication date
CN114513660A (en) 2022-05-17


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant