WO2019179523A1 - Block partition coding complexity optimization method and apparatus based on a deep learning method - Google Patents

Block partition coding complexity optimization method and apparatus based on a deep learning method

Info

Publication number
WO2019179523A1
Authority
WO
WIPO (PCT)
Prior art keywords
eth
cnn
lstm
segmentation
training
Prior art date
Application number
PCT/CN2019/079312
Other languages
English (en)
French (fr)
Inventor
徐迈
李天一
杨韧
关振宇
黄典润
Original Assignee
北京航空航天大学
Priority date
Filing date
Publication date
Application filed by 北京航空航天大学
Publication of WO2019179523A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/103 Selection of coding mode or of prediction mode
    • H04N19/119 Adaptive subdivision aspects, e.g. subdivision of a picture into rectangular or non-rectangular coding blocks
    • H04N19/12 Selection from among a plurality of transforms or standards, e.g. selection between discrete cosine transform [DCT] and sub-band transform or selection between H.263 and H.264
    • H04N19/122 Selection of transform size, e.g. 8x8 or 2x4x8 DCT; Selection of sub-band transforms of varying structure or type
    • H04N19/134 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/146 Data rate or code amount at the encoder output
    • H04N19/149 Data rate or code amount at the encoder output by estimating the code amount by means of a model, e.g. mathematical model or statistical model
    • H04N19/157 Assigned coding mode, i.e. the coding mode being predefined or preselected to be further used for selection of another element or parameter
    • H04N19/159 Prediction type, e.g. intra-frame, inter-frame or bidirectional frame prediction
    • H04N19/169 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H04N19/176 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a block, e.g. a macroblock
    • H04N19/18 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being a set of transform coefficients
    • H04N19/90 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using coding techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals
    • H04N19/96 Tree coding, e.g. quad-tree coding

Definitions

  • the present disclosure relates to the field of video coding technologies, and in particular, to a block division coding complexity optimization method and apparatus based on a deep learning method.
  • the High Efficiency Video Coding (HEVC) standard can save about 50% bit rate at the same video quality. This is due to some advanced video coding techniques, such as a coding unit (CU) partition structure based on a quadtree structure. However, these techniques also bring considerable complexity.
  • the prior art HEVC has an encoding time on average about 253% longer than H.264/AVC, which limits the practical application of the standard. Therefore, it is necessary to significantly reduce the complexity of HEVC coding under the premise that the rate-distortion (RD) performance is hardly affected.
  • early CU segmentation prediction methods are generally heuristic. These methods determine the CU segmentation in advance, based on certain features of the encoding process, before performing the recursive search.
  • in heuristic methods, the brute-force search can be simplified by extracting intermediate features. For example, in a frame-level CU partition decision method, the CU depths that occur less frequently in the previous frames are skipped when determining the CU depth of the current frame.
  • the industry also proposes a CU segmentation decision method based on pyramidal motion divergence and based on the number of high-frequency key points.
  • also based on the Bayesian criterion, CU partitioning can be decided by taking the complete and the low-complexity RD costs as features.
  • the industry has also proposed a variety of heuristic methods to reduce coding complexity at the prediction unit (PU) and transform unit (TU) levels. For example, a fast PU size decision method has been proposed to adaptively integrate smaller PUs into larger PUs.
  • the prior art also predicts the most probable PU partitioning based on the coding block flag (CBF) and the RD cost of already-encoded CUs.
  • the coding coefficients are modeled using a mixed Laplacian distribution and based on this, the RDO quantization process is accelerated.
  • coding complexity is also simplified in the prior art at other levels of HEVC, such as intra or inter prediction mode selection, and loop filtering.
  • a CU depth decision method based on a support vector machine (SVM) three-level joint classifier is also proposed to predict whether three sizes of CUs need to be segmented.
  • these methods use a large amount of data to learn the coding rules of certain HEVC stages, to simplify or replace the brute-force search in the original encoding process.
  • the prior art utilizes logistic regression and SVM to perform two-class modeling on CU segmentation.
  • the trained model can be used to decide in advance whether each CU is split, avoiding the time-consuming recursive brute-force search.
  • three early termination mechanisms are proposed to estimate the optimal CTU segmentation result to simplify the CTU segmentation process in the original encoder.
  • the industry has studied several intermediate features related to CU segmentation and combined these features to determine the CU segmentation depth and skip the brute-force RDO search, thus reducing the HEVC coding complexity.
  • later, an SVM method combining binary and multi-class classification was proposed to predict CU partitioning and PU mode selection in advance, which can further reduce the coding time of HEVC.
  • the above learning-based approach relies heavily on manual extraction of features. This requires a lot of prior knowledge and may ignore some hidden but valuable features.
  • to avoid manual feature extraction while reducing the complexity of the HEVC intra mode, a CU segmentation prediction method based on a convolutional neural network (CNN) has been designed. However, the CNN structure of the prior art is shallow and its learning capability is limited, so it cannot accurately model the complex CU segmentation process.
  • the present disclosure provides a block partition coding complexity optimization method and apparatus based on a deep learning method, which can significantly shorten the time required to determine the CU segmentation during encoding while ensuring the CU segmentation prediction accuracy, effectively reducing the complexity of HEVC coding.
  • the present disclosure provides a block segmentation coding complexity optimization method based on a deep learning method, including:
  • the CU segmentation prediction model is a pre-established and trained model, and the model has early termination capability
  • the frame coding mode is an intra mode
  • the CU segmentation prediction model is a hierarchical convolutional neural network with early termination, ETH-CNN;
  • the frame coding mode is an inter mode
  • the CU segmentation prediction model is an ETH-LSTM capable of early termination together with the ETH-CNN.
  • before the step of checking the frame coding mode currently used by the HEVC, the method further includes:
  • the ETH-CNN is constructed and trained, and the ETH-LSTM is constructed and trained.
  • the step of constructing the ETH-CNN and training the ETH-CNN includes:
  • the ETH-CNN corresponding to the intra mode is trained using the positive sample and the negative sample.
  • the resolution of each original image in the first database is 4928×3264;
  • the first database includes: a training set, a verification set, and a test set; each of the training set, the verification set, and the test set includes four subsets;
  • the resolution of each image in the first subset of the four subsets is 4928×3264, the resolution of each image in the second subset is 2880×1920, and the resolution of each image in the third subset is 1536×1024;
  • the resolution of each image in the fourth subset is 768×512.
  • the step of constructing the ETH-CNN, training the ETH-CNN, constructing the ETH-LSTM, and training the ETH-LSTM includes:
  • the pre-processed video in the second database is encoded by using a HEVC standard reference program to obtain a positive sample and a negative sample in the second database;
  • the ETH-CNN corresponding to the inter mode and the ETH-LSTM corresponding to the inter mode are trained by using the positive sample and the negative sample.
  • the second database comprises videos of one or more of the following resolutions: SIF (352×240), CIF (352×288), NTSC (720×486), 4CIF (704×576), 240p (416×240), 480p (832×480), 720p (1280×720), 1080p (1920×1080), WQXGA (2560×1600) and 4K (4096×2160);
  • the second database includes: a training set, a verification set, and a test set.
  • the input of the ETH-CNN is a 64 ⁇ 64 matrix, which represents luminance information of the entire CTU, and is represented by U;
  • the ETH-CNN structured output consists of three branches that represent the predicted results of the three-level HCPM: ŷ_1(U), ŷ_2(U_i) and ŷ_3(U_{i,j});
  • the early termination mechanism of ETH-CNN can end the calculation of the full connection layer on the second and third branches ahead of time;
  • the specific structure of ETH-CNN includes two pre-processing layers, three convolutional layers, one merged layer and three fully connected layers.
  • the pre-processing layer is configured to perform a pre-processing operation on the matrix
  • the pre-processed data is convolved with 16 4 ⁇ 4 cores to obtain 16 different feature maps to extract low-level features in the image information, in preparation for determining the CU segmentation;
  • in the second and third convolutional layers, the above feature maps are sequentially convolved with 24 and 32 kernels of size 2×2 to extract higher-level features, finally yielding 32 feature maps in each branch B_l.
  • the step size of the convolution operation is equal to the side length of the core
  • the fully connected layers divide the merged features into three branches B_1, B_2 and B_3 for processing, likewise corresponding to the three-level output of the HCPM;
  • in each branch B_l, the feature vector passes through three fully connected layers in sequence: two hidden layers and one output layer; the output of the last layer is the final HCPM;
  • the number of features in each fully connected layer is related to the branch it belongs to, and ensures that the three branches B_1, B_2 and B_3 output 1, 4 and 16 features respectively, corresponding to the predicted values ŷ_1(U), ŷ_2(U_i) and ŷ_3(U_{i,j}) of the three-level HCPM.
  • QP is added as an external feature to the feature vector, enabling ETH-CNN to model the relationship between QP and CU segmentation.
  • the predicted CU segmentation result is represented by a structured output manner of the hierarchical CU segmentation graph HCPM;
  • the HCPM includes 1×1, 2×2 and 4×4 binary labels at levels 1, 2 and 3, respectively, corresponding to the ground-truth values y_1(U), y_2(U_i), y_3(U_{i,j}) and the predicted values ŷ_1(U), ŷ_2(U_i), ŷ_3(U_{i,j}).
  • the CU segmentation result includes the level-1 binary label;
  • when U or U_i is split, the CU segmentation result includes the level-2 or level-3 binary labels;
  • when U or U_i is not split, the corresponding level-2 or level-3 labels in the CU segmentation result take the null value;
  • that is, whatever the CU partitioning, the level-2 and level-3 binary labels always exist, but sometimes take the null value.
  • the objective function of the ETH-CNN model training is cross entropy
  • for each sample, the objective function L_r is the sum of the cross entropies of all valid binary labels:
  • L_r = H(y_1(U), ŷ_1(U)) + Σ_i H(y_2(U_i), ŷ_2(U_i)) + Σ_{i,j} H(y_3(U_{i,j}), ŷ_3(U_{i,j}))
  • where H(·,·) denotes the cross entropy between the predicted value and the ground-truth label of one binary classifier at level l ∈ {1, 2, 3} of the HCPM, r is the sample index within a batch of training samples, L_r is the objective function of the r-th sample, y_1(U), y_2(U_i) and y_3(U_{i,j}) denote the ground-truth values, and ŷ_1(U), ŷ_2(U_i) and ŷ_3(U_{i,j}) denote the predicted values.
  • the residual CTU obtained by fast pre-encoding is input to the ETH-CNN, and the CU partition labels in the second database are used as the ground truth to train the ETH-CNN of the inter mode;
  • the LSTM unit and the fully connected layers of each level of the ETH-LSTM are trained with the CUs of that level, i.e., level 1 of the ETH-LSTM is trained with the 64×64 CUs, level 2 with the 32×32 CUs, and level 3 with the 16×16 CUs.
  • the cross entropy is used as the loss function
  • the loss function of the t-th frame of the r-th sample is L r (t)
  • the training is performed by the momentum stochastic gradient descent method
  • HCPM is obtained by ETH-LSTM to predict the inter-mode CU partitioning result.
  • the embodiment of the present disclosure further provides a block segmentation coding complexity optimization device based on a deep learning method, including:
  • a memory, a processor, a bus, and a computer program stored on the memory and runnable on the processor, the processor executing the program to implement the method of any one of the first aspect.
  • a computer storage medium having a computer program stored thereon, wherein when the program is executed by a processor, the method of any one of the foregoing aspects is implemented.
  • the present disclosure utilizes the structured output of the HCPM to efficiently represent the CU partitioning process. The trained ETH-CNN/ETH-LSTM model only needs to be run once, and all the CU segmentation results of the entire CTU are obtained in the form of one HCPM, which significantly reduces the running time of the deep neural network itself and helps to reduce the overall coding complexity.
  • the deep ETH-CNN structure of the present disclosure overcomes the drawback of manual feature extraction in the prior art by automatically extracting features related to CU segmentation.
  • the deep ETH-CNN structure has more trainable parameters than the CNN structures of the prior art, which significantly improves the prediction accuracy of the CU segmentation.
  • the deep ETH-LSTM model proposed in the present disclosure learns the long- and short-term dependencies of CU partitioning between different frames in the inter mode. The present disclosure uses LSTM for the first time to predict CU partitioning and thereby reduce HEVC coding complexity.
  • a CU partition database is established in advance for both the intra mode and the inter mode. In contrast, other methods in the prior art rely only on the existing JCT-VC database, which is much smaller than the databases of the present disclosure.
  • FIG. 1 is a schematic diagram of a rate distortion cost check and comparison in the prior art
  • FIG. 2 is a schematic diagram of a CU partition structure
  • FIG. 3 is a schematic diagram of an HCPM according to an embodiment of the present disclosure.
  • FIG. 4 is a schematic diagram of an ETH-CNN structure according to an embodiment of the present disclosure.
  • FIG. 5 is a schematic diagram of an ETH-LSTM structure according to an embodiment of the present disclosure.
  • FIG. 6 is a schematic flowchart of a block partition coding complexity optimization method based on a deep learning method according to an embodiment of the present disclosure
  • FIG. 7 is a schematic structural diagram of a block partition coding complexity optimization apparatus based on a deep learning method according to an embodiment of the present disclosure
  • FIG. 8 is a schematic diagram of using ETH-LSTM according to an embodiment of the present disclosure.
  • Deep learning does not require manual extraction of features during the encoding process, but automatically extracts a variety of features associated with the encoded results from large-scale data.
  • in-depth research using deep learning to reduce coding complexity is still rare.
  • a shallow CNN structure has mainly been used for CU segmentation prediction in the intra mode; that CNN structure includes only two convolutional layers, containing 6 and 16 3×3 convolution kernels respectively.
  • prior work on simplifying coding complexity without deep learning has not explored the correlation of CU partitioning between frames at different distances.
  • the embodiments of the present disclosure propose a CU partition prediction model based on the ETH-CNN and ETH-LSTM deep network structures, which is used to accurately predict CU segmentation results and to reduce the complexity of the HEVC intra and inter modes, i.e., to reduce the coding complexity.
  • unlike conventional methods, which determine whether each single CU is split one at a time, the HCPM of the embodiment of the present disclosure predicts the CU partitioning of the entire CTU at once through a hierarchical structured output.
  • the deep CNN structure is improved by introducing an early termination mechanism for reducing the complexity of the HEVC intra mode.
  • the core improvements of the present disclosure may include: 1. Constructing a large-scale CU partition database suitable for the HEVC intra and inter modes, promoting research on reducing HEVC complexity based on deep learning. 2. Proposing a deep CNN, ETH-CNN, which outputs the CU partitioning in the structured form of the HCPM to reduce the complexity of the HEVC intra mode. 3. Proposing a deep LSTM network, ETH-LSTM, combined with ETH-CNN to learn the temporal-spatial correlation of CU partitioning, used to reduce the complexity of the HEVC inter mode.
  • the embodiment of the present disclosure proposes a block segmentation coding complexity optimization method based on a deep learning method, which is applicable to both intra-frame and inter-frame modes.
  • this method can learn the CU partitioning of the entire coding tree unit (CTU) from the above databases; that is, a hierarchical CU partition map (HCPM) is used to efficiently represent the CU partitioning of the entire CTU.
  • the deep learning network structure can be made deeper, so that enough parameters are learned to explore the wide variety of CU partitioning patterns.
  • the deep learning method of the embodiment of the present disclosure introduces an early-terminated hierarchical CNN (ETH-CNN), which generates a structured HCPM in a hierarchical manner. The early termination saves the computation time of the CNN itself and promotes the reduction of intra-mode HEVC coding complexity.
  • the embodiment of the present disclosure also introduces an early terminated hierarchical LSTM (ETH-LSTM) suitable for inter mode.
  • in ETH-LSTM, the temporal correlation of CU partitioning is learned in the LSTM units.
  • after taking the features of the ETH-CNN as input, the ETH-LSTM combines the learned LSTM units with the early termination mechanism to output the HCPM hierarchically. As such, the above method can effectively reduce the coding complexity of the HEVC inter mode.
  • the block partition coding complexity optimization method based on the deep learning method of the present disclosure may include the following steps:
  • a CU segmentation prediction model corresponding to the frame coding mode is selected according to the frame coding mode, where the CU segmentation prediction model is a model that is pre-established and trained and has an early termination mechanism.
  • the CU segmentation prediction result in the HEVC is predicted according to the selected CU segmentation prediction model, and the entire coding tree unit CTU is segmented according to the predicted CU segmentation result.
  • the above method may further include a step 600, not shown in the figure:
  • the frame coding mode is an intra mode
  • the CU segmentation prediction model is ETH-CNN; in this case, only the ETH-CNN needs to be constructed and trained.
  • the frame coding mode is an inter mode
  • the CU segmentation prediction model is ETH-LSTM and ETH-CNN, that is, ETH-CNN is constructed, ETH-CNN is trained, ETH-LSTM is constructed, and ETH-LSTM is trained. That is to say, a long-short-term memory structure is designed to learn the time-domain dependence of the inter-mode CU partitioning, and then the CNN is combined with the LSTM to predict the inter-mode CU partitioning. In this way, the HEVC coding complexity of the inter mode can be significantly reduced.
  • the training of the CU segmentation prediction model for the intra mode may include the following steps:
  • the training of the CU partition prediction model for the inter mode may include the following steps:
  • the ETH-CNN corresponding to the inter mode and the ETH-LSTM corresponding to the inter mode are trained by using the positive sample and the negative sample.
  • ETH-LSTM can effectively reduce the complexity of the HEVC inter mode.
  • a large-scale CU partition database is established in the embodiment of the present disclosure, covering the intra mode (2000 lossless images, each compressed with 4 quantization parameters (QPs)) and the inter mode (111 lossless videos, each compressed with 4 QPs), which also facilitates research on reducing HEVC complexity based on deep learning.
  • the CTU partition structure with CU partition as the core is one of the main components of the HEVC standard.
  • the default size of the CTU is 64 ⁇ 64 pixels.
  • a CTU can contain either a single CU or a number of smaller CUs based on the quadtree recursive structure.
  • the default minimum size of the CU is 8 ⁇ 8.
  • the CTU or CU size can be set before encoding, that is, the maximum and minimum CTU or CU size are artificially set according to the encoding requirements. Therefore, CUs in CTUs come in many possible sizes.
  • the CU size in each CTU is determined by a recursive search.
  • in the standard encoder this is a brute-force search process, which includes a top-down checking process and a bottom-up comparison process.
  • Figure 1 illustrates the RD cost checking and comparison process between the parent CU and its four sub-CUs.
  • first, the encoder checks the RD cost of the entire CTU, and then the RD costs of its sub-CUs. For each sub-CU that may still contain sub-CUs, the RD costs of its own sub-CUs are checked in turn, recursing in this way until the smallest CU size is checked.
  • the RD cost of the parent CU is denoted R_pa;
  • the RD cost of its child CUs is denoted R^sub_m, where m ∈ {1, 2, 3, 4} is the index of each sub-CU. By comparing the RD costs of the parent CU and its child CUs, it is decided whether the parent CU needs to be split: as shown in Figure 1-(b), if R_pa > Σ_{m=1}^{4} R^sub_m, the parent CU is split; otherwise it is not. It is worth noting that when deciding whether to split a CU, the RD cost of the split flag itself must also be considered. After the complete RDO search, the CU segmentation result with the lowest RD cost is obtained. Note that the recursive RDO search is extremely time consuming.
  • for a 64×64 CTU, 85 possible CUs need to be checked, comprising one 64×64 CU, four 32×32 CUs, 4² = 16 16×16 CUs and 4³ = 64 8×8 CUs (1 + 4 + 16 + 64 = 85).
  • to obtain the RD cost, the encoder pre-encodes each CU; in this process, all possible prediction and transform modes need to be encoded.
  • all 85 possible CUs must be precoded, which occupies most of the coding time.
  • the final CU partitioning retains only between 1 CU (the CTU is not split) and 64 CUs (the entire CTU is divided into 8×8 CUs of minimum size), far fewer than the 85 checked. Therefore, if a reasonable CU segmentation result can be predicted in advance, the RD cost checks of at most 84 and at least 21 CUs can be omitted, thereby reducing the coding complexity.
  • the first database of a large-scale CU partition database (CU partition of HEVC-Intra, CPH-Intra) suitable for HEVC intra mode is described below.
  • the first database is the first database for CU partitioning in HEVC.
  • first 2000 images of resolution 4928 ⁇ 3264 were selected from the Raw Images Dataset (RAISE).
  • the 2,000 images were randomly divided into a training set (1700), a validation set (100), and a test set (200).
  • each set is equally divided into 4 subsets: one of the subsets remains unchanged from the original resolution, and the other three subsets downsample the original image to 2880 ⁇ 1920, 1536 ⁇ 1024, and 768 ⁇ 512, respectively.
  • the CPH-Intra database contains images of multiple resolutions, ensuring the diversity of training data for CU segmentation.
  • the above images are then encoded using HEVC standard reference software such as HM16.5.
  • 4 different QPs ⁇ 22, 27, 32, 37 ⁇ are used for encoding, corresponding to the All-Intra (AI) configuration of the standard encoder (file encoder_intra_main.cfg).
  • the CPH-Intra database consists of 110,405,784 samples, all divided into 12 sub-databases according to QP value and CU size. The number of split CUs (49.2%) is close to the number of non-split CUs (50.8%), ensuring that the positive and negative samples are relatively balanced.
  • the CU partition database established for the inter mode is the second database: the CPH-Inter database.
  • to establish the second database, 111 lossless videos were first selected, including 6 1080p (1920×1080) videos, 18 standard test sequences recommended by the Joint Collaborative Team on Video Coding (JCT-VC), and 87 videos from Xiph.org.
  • the second database contains videos of various resolutions: SIF (352×240), CIF (352×288), NTSC (720×486), 4CIF (704×576), 240p (416×240), 480p (832×480), 720p (1280×720), 1080p (1920×1080), WQXGA (2560×1600) and 4K (4096×2160).
  • since HEVC only supports resolutions that are multiples of 8 pixels, videos that do not meet this requirement must be adjusted; therefore, in the second database, the bottom edge of each NTSC video is cropped so that its resolution becomes 720×480. Also, any video longer than 10 seconds is cut to 10 seconds.
  • the above videos are divided into mutually non-overlapping training (83), validation (10) and test (18) sets.
  • the video in the test set was derived from 18 standard sequences of JCT-VC.
  • the CPH-Inter database is also encoded using HM16.5 under four QPs ⁇ 22, 27, 32, 37 ⁇ .
  • encoding is performed under the Low Delay P (LDP), Low Delay B (LDB) and Random Access (RA) configurations of the standard encoder.
  • the default CU can take four different sizes: 64 ⁇ 64, 32 ⁇ 32, 16 ⁇ 16 and 8 ⁇ 8, corresponding to CU depths: 0, 1, 2 and 3.
  • a CU of a non-minimum size (16 ⁇ 16 or more) may or may not be divided.
  • the entire CU segmentation process can be regarded as three levels of binary labels.
  • a CU of depth 0 is denoted U, its depth-1 sub-CUs are denoted U_i, the depth-2 sub-CUs are denoted U_{i,j}, and the depth-3 sub-CUs are denoted U_{i,j,k};
  • the subscripts i, j, k ∈ {1, 2, 3, 4} represent the indices of the sub-CUs within U, U_i and U_{i,j}, respectively.
  • the hierarchical CU split label described above is shown by the downward arrow in FIG.
  • the standard HEVC encoder obtains the CU partition labels y_1(U), y_2(U_i) and y_3(U_{i,j}) through a time-consuming RDO process.
  • these tags can be predicted by machine learning to replace the traditional RDO process.
  • this is difficult to predict in one step by a simple multi-level classifier.
  • the CU partition labels should be predicted layer by layer, that is, predictions are made separately for the labels y_1(U), y_2(U_i) and y_3(U_{i,j}) of each CU layer, and the predictions are recorded as ŷ_1(U), ŷ_2(U_i) and ŷ_3(U_{i,j}).
  • the present embodiment utilizes a hierarchical CU partition map HCPM to efficiently represent the CU partition result with a structured output. In this way, the trained model can be called once, that is, the CU segmentation result in the entire CTU can be predicted, which greatly reduces the calculation time of the prediction process itself.
  • Figure 3 is an example of HCPM that hierarchically represents a CU partitioned label as a structured output.
  • the HCPM includes 1×1, 2×2 and 4×4 binary labels at levels 1, 2 and 3, respectively, corresponding to the ground-truth values y_1(U), y_2(U_i), y_3(U_{i,j}) and the predicted values ŷ_1(U), ŷ_2(U_i), ŷ_3(U_{i,j}). Regardless of the CU partitioning, the level-1 label always exists; but when U or U_i is not split, the corresponding sub-CU labels do not exist, and they are set to null in the HCPM, shown as "-" in Figure 3.
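As an illustration of this structured output (the array layout and names below are our own, not part of the disclosure), the three levels of an HCPM can be held as nested arrays in which non-existent labels are None:

```python
import numpy as np

# Level 1 is a single binary label, level 2 a 2x2 grid, level 3 a 4x4 grid.
# None marks a label that does not exist because the parent CU is not split
# (shown as "-" in Figure 3).
y1 = np.array([[1]], dtype=object)            # the 64x64 CTU is split
y2 = np.array([[1, 0],
               [0, 1]], dtype=object)         # two of the four 32x32 CUs split
y3 = np.full((4, 4), None, dtype=object)      # 16x16 labels: null by default

# level-3 labels exist only under the 32x32 CUs that are actually split
for i in range(2):
    for j in range(2):
        if y2[i, j] == 1:
            y3[2 * i:2 * i + 2, 2 * j:2 * j + 2] = 0   # none split further here

hcpm = {1: y1, 2: y2, 3: y3}
print(hcpm[3])
```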
  • the main task of the method of the present embodiment is to predict the segmentation result of the CU from the image information of the CTU, and the input information is represented by a matrix, which has significant spatial correlation, so in the present embodiment, the HCPM is modeled by CNN.
  • the input of ETH-CNN is a 64 ⁇ 64 matrix, which represents the luminance information of the entire CTU, and is represented by U.
  • the structured output of ETH-CNN consists of three branches representing the predicted results of the three-level HCPM: ŷ_1(U), ŷ_2(U_i) and ŷ_3(U_{i,j}). Compared with an ordinary CNN structure, ETH-CNN introduces an early termination mechanism, which can end the computation of the fully connected layers on the second and third branches ahead of time.
  • the specific structure of ETH-CNN consists of two pre-processing layers, three convolutional layers, one merged layer and three fully connected layers.
  • the CTU original luminance matrix (64 ⁇ 64) is subjected to pre-processing such as de-averaging and downsampling.
  • starting from the pre-processing layers, the input information is processed and transformed in three parallel branches B_1, B_2 and B_3. In the de-averaging operation of the three branches, the average luminance within a certain range of the image is subtracted from the luminance matrix of each input CTU, to reduce the luminance differences between images.
  • in branch B_1, the overall mean luminance of the CTU is subtracted from the luminance matrix, corresponding to the single prediction result of the first level of the HCPM.
  • in branch B_2, the 64×64 luma matrix is divided into 2×2 non-overlapping 32×32 units, and the internal mean luminance of each unit is subtracted from it, corresponding to the 4 labels of the second level of the HCPM.
  • in branch B_3, the 64×64 luminance matrix is divided into 4×4 non-overlapping 16×16 units, and the de-averaging is performed within each unit, corresponding to the 4×4 labels of the third level of the HCPM.
  • the de-averaged luminance matrices are then downsampled, as shown in the figure.
  • in branches B_1 and B_2, the matrices are downsampled to 16×16 and 32×32 respectively, which further reduces the subsequent computational complexity. Moreover, this selective downsampling ensures that the output sizes of the subsequent convolutional layers in B_1 to B_3 are consistent with the numbers of output labels of the first to third levels of the HCPM, so that the outputs of the convolutional layers have a relatively clear physical meaning.
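A minimal NumPy sketch of these two pre-processing layers; mean-pooling is assumed for the downsampling step, which the disclosure does not specify:

```python
import numpy as np

def preprocess(ctu):
    """De-average at CTU / 32x32 / 16x16 granularity, then downsample the
    64x64 luminance matrix to 16x16 / 32x32 / 64x64 for B1 / B2 / B3."""
    ctu = ctu.astype(np.float32)
    outs = []
    for unit, out_size in [(64, 16), (32, 32), (16, 64)]:
        x = ctu.copy()
        for r in range(0, 64, unit):              # subtract the mean inside
            for c in range(0, 64, unit):          # each unit x unit block
                x[r:r + unit, c:c + unit] -= x[r:r + unit, c:c + unit].mean()
        k = 64 // out_size                        # non-overlapping downsampling
        outs.append(x.reshape(out_size, k, out_size, k).mean(axis=(1, 3)))
    return outs

b1, b2, b3 = preprocess(np.random.randint(0, 256, (64, 64)))
print(b1.shape, b2.shape, b3.shape)               # (16, 16) (32, 32) (64, 64)
```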
  • in each branch B_l, three successive convolution operations are performed on all the pre-processed data; in the same layer, the convolution kernels of all three branches have the same size.
  • the pre-processed data is convolved with 16 4 ⁇ 4 cores to obtain 16 different feature maps to extract low-level features in the image information to prepare for determining the CU partition.
  • in the second and third convolutional layers, the above feature maps are sequentially convolved with 24 and 32 kernels of size 2×2 to extract higher-level features, finally yielding 32 feature maps in each branch B_l.
  • in all convolutional layers, the stride of the convolution equals the side length of the kernel, which achieves non-overlapping convolutions; the receptive fields of most convolutions are 8×8, 16×16, 32×32 or 64×64 (side lengths that are integer powers of 2), corresponding exactly to the positions and sizes of the non-overlapping CUs in HEVC.
  • merging layer: all the features of the second and third convolutional layers of the three branches are merged together into one vector. As shown in Figure 4, the features of this layer are combined from feature maps of six sources, namely the second- and third-layer outputs of B_1, B_2 and B_3, so as to obtain a variety of global and local features. After the features are merged, the subsequent fully connected layers can use features from the complete CTU to predict the segmentation result of a given level of CU in the HCPM, without being limited to the features of a single branch B_1, B_2 or B_3.
  • all convolutional layers and the first and second fully connected layers are activated with rectified linear units (ReLU) to introduce appropriate sparsity in the network to improve training efficiency.
  • in all branches, the third fully connected layer, i.e., the output layer, is activated by a sigmoid function, so that the output values lie within (0,1), which is compatible with the binary labels in HCPM.
  • the specific configuration of ETH-CNN is shown in Table 1; the network has 1,287,189 trainable parameters. Compared with the shallow CNNs of the prior art, ETH-CNN has a much higher network capacity and can model the CU partitioning problem more effectively. Thanks to the more than 100 million training samples in the CPH-Intra database, the network can reduce the risk of overfitting despite the large number of parameters. In addition, using the same network to predict the outputs of all three levels of the HCPM is a major advantage of ETH-CNN: the features in the convolutional layers and the merging layer are shared when predicting ŷ_1(U), ŷ_2(U_i) and ŷ_3(U_{i,j}).
  • since ETH-CNN uses the HCPM as its output, it shares the network structure and parameters across the three levels, which significantly reduces the computational complexity of the network itself under the premise of accurately predicting the CU segmentation, further reducing the overall coding complexity.
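The convolutional side of ETH-CNN (kernel counts, sizes and strides) is fixed by the description above; the PyTorch sketch below fills in the fully connected widths with illustrative values, so it is a sketch of the architecture rather than a reproduction of Table 1:

```python
import torch
import torch.nn as nn

class Branch(nn.Module):
    """One parallel branch B_l: 16@4x4/s4, 24@2x2/s2, 32@2x2/s2, all ReLU."""
    def __init__(self):
        super().__init__()
        self.c1 = nn.Sequential(nn.Conv2d(1, 16, 4, stride=4), nn.ReLU())
        self.c2 = nn.Sequential(nn.Conv2d(16, 24, 2, stride=2), nn.ReLU())
        self.c3 = nn.Sequential(nn.Conv2d(24, 32, 2, stride=2), nn.ReLU())

    def forward(self, x):
        f2 = self.c2(self.c1(x))
        f3 = self.c3(f2)
        return f2.flatten(1), f3.flatten(1)      # 2nd- and 3rd-layer features

class EthCnn(nn.Module):
    def __init__(self, hidden=((64, 48), (128, 96), (256, 192))):  # assumed widths
        super().__init__()
        self.branches = nn.ModuleList(Branch() for _ in range(3))
        merged = 2688  # six sources: 96+32 (B1) + 384+128 (B2) + 1536+512 (B3)
        self.heads = nn.ModuleList()
        for (h1, h2), n_out in zip(hidden, (1, 4, 16)):
            self.heads.append(nn.ModuleDict({
                "fc1": nn.Linear(merged + 1, h1),  # +1: QP as external feature
                "fc2": nn.Linear(h1 + 1, h2),
                "out": nn.Linear(h2, n_out),
            }))

    def forward(self, b1, b2, b3, qp, thr=0.5):
        feats = [f for x, br in zip((b1, b2, b3), self.branches) for f in br(x)]
        merged = torch.cat(feats, dim=1)           # merging layer
        outputs = []
        for head in self.heads:                    # HCPM levels 1..3
            x = torch.relu(head["fc1"](torch.cat([merged, qp], dim=1)))
            x = torch.relu(head["fc2"](torch.cat([x, qp], dim=1)))
            outputs.append(torch.sigmoid(head["out"](x)))
            if (outputs[-1] < thr).all():          # early termination: no split
                break
        return outputs                             # up to [y1, y2, y3] predictions

net = EthCnn()
b1, b2 = torch.randn(2, 1, 16, 16), torch.randn(2, 1, 32, 32)
b3, qp = torch.randn(2, 1, 64, 64), torch.full((2, 1), 22.0)
print([tuple(y.shape) for y in net(b1, b2, b3, qp)])
print(sum(p.numel() for p in net.parameters()), "trainable parameters")
```

The printed parameter count depends on the assumed hidden widths; Table 1 of the disclosure fixes the widths that yield the 1,287,189 parameters quoted above.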
  • H(·,·) denotes the cross entropy between the predicted value and the ground-truth label of one binary classifier at level l ∈ {1, 2, 3} of the HCPM. Considering that some ground-truth labels do not exist (the null labels shown as "-" in Figure 3), only valid ground-truth and predicted values are counted in the objective function.
  • the objective function over a batch of R samples is the average of the individual sample objective functions: L = (1/R) Σ_{r=1}^{R} L_r.
  • each batch is randomly drawn from the large set of training samples as network input, and optimization uses momentum stochastic gradient descent.
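A sketch of this objective and optimizer, reusing the EthCnn sketch above; the -1 encoding of null labels and the learning-rate/momentum values are illustrative, and during training the early termination would be disabled (e.g., thr=-1.0) so that all three levels are supervised:

```python
import torch
import torch.nn.functional as F

def hcpm_loss(preds, labels):
    """Cross entropy summed over valid HCPM labels only; null labels are
    encoded as -1 in `labels` and masked out of the objective."""
    loss = preds[0].new_zeros(())
    for y_hat, y in zip(preds, labels):            # HCPM levels 1..3
        valid = y >= 0
        if valid.any():
            loss = loss + F.binary_cross_entropy(
                y_hat[valid], y[valid].float(), reduction="sum")
    return loss / preds[0].shape[0]                # average over the batch

net = EthCnn()                                     # the sketch above
opt = torch.optim.SGD(net.parameters(), lr=0.01, momentum=0.9)
```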
  • the CU segmentation of the HEVC inter mode is temporally correlated: the closer two frames are, the more similar their CU segmentation results; as the frame distance increases, the similarity decreases.
  • the present disclosure further proposes an ETH-LSTM network to learn the long and short-term dependencies of inter-frame CU partitioning.
  • the overall framework of ETH-LSTM is shown in Figure 5.
  • the network takes the residual CTU as input.
  • the residual here is obtained by fast precoding the current frame. This process is similar to the standard encoding process. The only difference is that the CU and PU are forced to the maximum size of 64 ⁇ 64 to save time. Although the extra precoding process brings time redundancy, it only accounts for 3% of the standard encoding time and does not significantly affect the performance of the proposed algorithm.
  • the residual CTU is input to the ETH-CNN.
  • the parameters in the ETH-CNN are retrained by the residual CTU in the CPH-Inter database and the true value of the CU partition.
  • for each frame, the features of the 7th layer (the first fully connected layer) of the ETH-CNN are extracted and sent to the ETH-LSTM for further processing.
  • the three-level LSTM structure of ETH-LSTM used to determine the CU depth is shown in Figure 5.
  • the feature vector output from the LSTM unit passes through two fully connected layers in turn, and each fully connected layer also contains two external features: the QP value and the frame order of the current frame in the GOP.
  • the frame order is represented by a one-hot vector.
  • the output features of the LSTM unit and of the first fully connected layer are denoted f'_{1-l}(t) and f'_{2-l}(t), respectively.
  • the second fully connected layer then outputs the probability of CU partitioning, which is the result of the two classifications in HCPM.
  • the early termination mechanism is also introduced in ETH-LSTM: if the level-1 LSTM predicts that the CU is not split, the fully connected layers of levels 2 and 3 of the HCPM are skipped and the computation terminates early.
  • the output of ETH-LSTM takes the form of an HCPM, i.e., the partition result of the current CTU in the t-th frame.
  • the ETH-LSTM can utilize the segmentation results for the same location CTU in the previous frame by learning the long and short time correlation of the CU segmentation with different levels of LSTM units.
  • the LSTM units of each level in the ETH-LSTM are trained with the CUs of that level: level 1 of the ETH-LSTM is trained with the 64×64 CUs, level 2 with the 32×32 CUs, and level 3 with the 16×16 CUs.
  • the LSTM unit of the t-th frame of the first level is taken as an example to introduce the learning mechanism of the ETH-LSTM.
  • the LSTM unit consists of three gates: the input gate i_l(t), the output gate o_l(t), and the forget gate g_l(t).
  • given the input feature f_{1-l}(t) of the current frame's LSTM (i.e., the feature of the first fully connected layer of ETH-CNN) and the LSTM output feature f'_{1-l}(t-1) of the previous frame, the three gates can be expressed as:
  • i_l(t) = σ(W_i · [f_{1-l}(t), f'_{1-l}(t-1)] + b_i), o_l(t) = σ(W_o · [f_{1-l}(t), f'_{1-l}(t-1)] + b_o), g_l(t) = σ(W_f · [f_{1-l}(t), f'_{1-l}(t-1)] + b_f)
  • where σ(·) represents the sigmoid function, W_i, W_o and W_f are the trainable parameters of the three gates, and b_i, b_o and b_f are the corresponding biases.
  • the LSTM unit updates the state of the t-th frame as c_l(t) = g_l(t) ∘ c_l(t-1) + i_l(t) ∘ tanh(W_c · [f_{1-l}(t), f'_{1-l}(t-1)] + b_c), where W_c and b_c are trainable parameters and ∘ denotes element-wise multiplication;
  • the LSTM output f'_{1-l}(t) can then be expressed as f'_{1-l}(t) = o_l(t) ∘ c_l(t);
  • the lengths of the state vector c_l(t) and the output vector f'_{1-l}(t) are the same as that of the input vector f_{1-l}(t).
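The per-level recurrence and the cross-level early termination can be sketched as follows; nn.LSTMCell stands in for the gate equations above, the widths and the combined QP/frame-order side vector are illustrative, and for brevity skipping a level here also skips its state update:

```python
import torch
import torch.nn as nn

class EthLstmLevel(nn.Module):
    """One ETH-LSTM level: an LSTM cell over per-frame ETH-CNN features,
    then two fully connected layers that also take the side features."""
    def __init__(self, in_dim, hidden, n_out, n_side):
        super().__init__()
        self.cell = nn.LSTMCell(in_dim, hidden)    # gates i, o, g with W, b
        self.fc1 = nn.Linear(hidden + n_side, hidden)
        self.fc2 = nn.Linear(hidden + n_side, n_out)

    def forward(self, f_t, state, side):
        h, c = self.cell(f_t, state)               # f'(t), c(t) from f(t), f'(t-1)
        x = torch.relu(self.fc1(torch.cat([h, side], dim=1)))
        y = torch.sigmoid(self.fc2(torch.cat([x, side], dim=1)))
        return y, (h, c)

def eth_lstm_frame(levels, feats, states, side, thr=0.5):
    """One frame: run levels 1..3, terminating early if nothing is split."""
    outs = []
    for l, (level, f) in enumerate(zip(levels, feats)):
        y, states[l] = level(f, states[l], side)
        outs.append(y)
        if (y < thr).all():                        # skip the deeper levels
            break
    return outs, states

dims, n_outs, n_side = (64, 128, 256), (1, 4, 16), 5   # illustrative sizes
levels = [EthLstmLevel(d, d, n, n_side) for d, n in zip(dims, n_outs)]
states = [(torch.zeros(1, d), torch.zeros(1, d)) for d in dims]
feats = [torch.randn(1, d) for d in dims]              # per-level ETH-CNN features
side = torch.randn(1, n_side)                          # QP + one-hot frame order
outs, states = eth_lstm_frame(levels, feats, states, side)
print([tuple(y.shape) for y in outs])
```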
  • the configuration of the ETH-LSTM is shown in Table 2, which includes all the trainable parameters.
  • the training is performed using the momentum stochastic gradient descent method.
  • the HCPM can be obtained by ETH-LSTM to predict the inter-mode CU segmentation results.
  • Step P1 initializing the current frame.
  • Step P2 all CTUs of the current frame:
  • Step P3 Extract residuals of all CTUs of the current frame.
  • the residual is the residual mentioned in the HEVC standard, which is the difference between the result obtained after each PU prediction and the original image.
  • the source of the prediction may differ: a PU may be predicted from the previous frame, or from an earlier frame, and so on.
  • the residuals of all CTUs eventually form a residual frame; since the prediction source of each PU differs, this residual frame is not simply the difference between the current frame and any single other frame.
  • Step P4 All CTUs of the current frame:
  • Step P5 Perform post-processing on the current frame, such as loop filtering.
  • the embodiment of the present disclosure uses the trained deep neural network for prediction, which can be implemented with a general deep learning framework, such as TensorFlow, Caffe or PyTorch, as long as the above ETH-CNN and ETH-LSTM can be constructed. For example, TensorFlow can be called from the Python language.
  • ETH-LSTM is only used in inter mode because LSTM is used to extract inter-frame dependencies in image features.
  • Inter mode includes three sub-modes, LDP (Low Delay P), LDB (Low Delay B), and RA (Random Access)
  • these three sub-modes have multiple configurations; the performance of the algorithm is tested under the standard configuration of each sub-mode.
  • in the LDP mode, the frame order is IPPPPP..., that is, the first frame is an I frame (pure intra prediction) and all subsequent frames are P frames (supporting intra prediction, or inter prediction from a single reference frame).
  • the I frame is predicted by ETH-CNN, and only the P frames are input to the LSTM.
  • during training, the LSTM time length is set to 20, and to increase the number of training samples, two adjacent LSTM sequences overlap by 10 frames; that is, excluding the I frame, frames 1 to 20, 11 to 30, 21 to 40, and so on are each placed into an LSTM for training.
  • during testing, the LSTM length is set to the number of P frames, that is, all P frames are placed consecutively into the same LSTM until the last frame of the video.
  • the frame order is IBBBBBB..., that is, the first frame is an I frame, and then all frames are B frames (supporting intra prediction, or inter prediction of dual reference frames).
  • the LSTM time length is also the same as the LDP mode.
  • the standard RA is slightly more complicated, and the frame encoding order is different from the playback order.
  • information is passed in coding order, i.e., the first encoded frame is first input into the LSTM.
  • the coding order of the frames is I(BBB...BIBBBBBBB)(BBB...BIBBBBBBB)(BBB...BIBBBBBBB); that is, the first frame is an I frame, and thereafter, in each group of 32 frames, the 25th frame is an I frame and the other frames are B frames. Because of this 32-frame period, in both the training and test phases the LSTM length is set to 32, so that exactly one group corresponds to one LSTM, with no overlap between two adjacent LSTMs.
  • the test does not specifically distinguish between the I frame and the B frames; instead, the 32 frames of each group are input into the LSTM, and the CU partition is determined by the HCPM output by the LSTM at each time step. In this way, each group of 32 frames forms a whole without breakpoints, and the information can be transmitted continuously.
  • the LSTM length setting in this embodiment is flexible and configured according to actual requirements.
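A small helper (names illustrative) reproduces the window layouts described above, covering both the overlapping LDP/LDB training windows and the non-overlapping 32-frame RA groups:

```python
def lstm_windows(n_frames, length=20, stride=10):
    """1-based frame indices (I frame excluded for LDP/LDB) grouped into
    LSTM windows: length 20 / stride 10 gives 1-20, 11-30, 21-40, ...;
    length 32 / stride 32 gives the non-overlapping RA groups."""
    starts = range(0, max(n_frames - length + 1, 1), stride)
    return [list(range(s + 1, min(s + length, n_frames) + 1)) for s in starts]

print(lstm_windows(100)[:3])                   # windows 1-20, 11-30, 21-40
print(lstm_windows(96, length=32, stride=32))  # three 32-frame RA groups
```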
  • the block partition coding complexity optimization method based on the deep learning method of the above embodiments of the present disclosure may be implemented by a block partition coding complexity optimization apparatus, as shown in Figure 7.
  • the block partitioning coding complexity optimization apparatus based on the deep learning method may include a processor 501 and a memory 502 storing computer program instructions.
  • the processor 501 may include a central processing unit (CPU), or an application specific integrated circuit (ASIC), or may be configured to implement one or more integrated circuits of the embodiments of the present disclosure.
  • Memory 502 can include mass storage for data or instructions.
  • the memory 502 can include a hard disk drive (HDD), a floppy disk drive, flash memory, an optical disc, a magneto-optical disc, magnetic tape, or a Universal Serial Bus (USB) drive, or a combination of two or more of these.
  • Memory 502 may include removable or non-removable (or fixed) media, where appropriate.
  • Memory 502 may be internal or external to the data processing device, where appropriate.
  • memory 502 is a non-volatile solid state memory.
  • memory 502 includes a read only memory (ROM).
  • the ROM may be a mask-programmed ROM, a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), an electrically rewritable ROM (EAROM), or flash memory, or a combination of two or more of these.
  • the processor 501 implements any of the block division coding complexity optimization methods in the above embodiments by reading and executing computer program instructions stored in the memory 502.
  • the block-segment coding complexity optimization apparatus based on the deep learning method may further include a communication interface 503 and a bus 510. As shown in FIG. 7, the processor 501, the memory 502, and the communication interface 503 are connected by the bus 510 and complete communication with each other.
  • the communication interface 503 is mainly used to implement communication between modules, devices, units and/or devices in the embodiments of the present disclosure.
  • Bus 510 includes hardware, software, or both, coupling the components of the above described devices together.
  • by way of example and not limitation, the bus may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a Front Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCI-X) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association Local Bus (VLB), or a combination of two or more of these.
  • Bus 510 may include one or more buses, where appropriate. Although a particular bus is described and illustrated with respect to embodiments of the present disclosure, this disclosure contemplates any suitable bus or interconnect.
  • the embodiment of the present disclosure may be implemented by providing a computer readable storage medium.
  • the computer readable storage medium stores computer program instructions; when the computer program instructions are executed by a processor, the method of any one of the above embodiments is implemented.
  • the functional blocks shown in the block diagrams described above may be implemented as hardware, software, firmware, or a combination thereof.
  • when implemented in hardware, it may be, for example, an electronic circuit, an application specific integrated circuit (ASIC), suitable firmware, a plug-in, a function card, and the like.
  • elements of the present disclosure are programs or code segments that are used to perform the required tasks.
  • the program or code segments can be stored in a machine readable medium, or transmitted over a transmission medium or communication link by a data signal carried in a carrier wave.
  • a "machine-readable medium” can include any medium that can store or transfer information.
  • machine-readable media examples include electronic circuits, semiconductor memory devices, ROM, flash memory, erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, radio frequency (RF) links, and the like.
  • the code segments can be downloaded via a computer network such as the Internet, an intranet, and the like.
  • the exemplary embodiments mentioned in the present disclosure describe some methods or systems based on a series of steps or devices.
  • the present disclosure is not limited to the order of the above steps, that is, the steps may be performed in the order mentioned in the embodiment, or may be different from the order in the embodiment, or several steps may be simultaneously performed.
  • the present disclosure has the following beneficial effects: (1) compared with the prior-art three-level CU partition labels, the present disclosure utilizes the structured output of the HCPM to efficiently represent the CU partitioning process. The trained ETH-CNN/ETH-LSTM model only needs to be run once, and all the CU segmentation results of the entire CTU are obtained in the form of one HCPM, which significantly reduces the running time of the deep neural network itself and helps to reduce the overall coding complexity.
  • (2) the deep ETH-CNN structure of the present disclosure overcomes the drawback of manual feature extraction in the prior art by automatically extracting features related to CU segmentation. It also has more trainable parameters than the CNN structures of the prior art, which significantly improves the prediction accuracy of the CU segmentation.
  • (3) the deep ETH-LSTM model proposed in the present disclosure learns the long- and short-term dependencies of CU partitioning between different frames in the inter mode; the present disclosure uses LSTM for the first time to predict CU partitioning and thereby reduce HEVC coding complexity.
  • (4) a CU partition database is established in advance for both the intra mode and the inter mode; in contrast, other prior-art methods rely only on the existing JCT-VC database, which is much smaller than the databases of the present disclosure.
  • the present disclosure has strong industrial applicability.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Algebra (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Discrete Mathematics (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Image Analysis (AREA)

Abstract

The present invention provides a block partition coding complexity optimization method and apparatus based on a deep learning method. The method includes: in HEVC, checking the frame coding mode currently used by the HEVC; selecting, according to the frame coding mode, a CU partition prediction model corresponding to the frame coding mode, the CU partition prediction model being a pre-established and trained model; and predicting the CU partition result in the HEVC according to the selected CU partition prediction model, and partitioning the entire CTU according to the predicted CU partition result. In a specific application, if the frame coding mode is the intra mode, the CU partition prediction model is an ETH-CNN capable of early termination; if the frame coding mode is the inter mode, the CU partition prediction model is an ETH-LSTM capable of early termination together with the ETH-CNN. Under the premise of guaranteeing the CU partition prediction accuracy, the above method significantly shortens the time required to determine the CU partitioning during encoding and effectively reduces the HEVC coding complexity.

Description

Block partition coding complexity optimization method and apparatus based on a deep learning method
This application claims priority to the Chinese patent application No. 2018102409124, filed with the Chinese Patent Office on March 22, 2018 and entitled "Block partition coding complexity optimization method and apparatus based on a deep learning method", the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the field of video coding technologies, and in particular to a block partition coding complexity optimization method and apparatus based on a deep learning method.
Background
Compared with the previous-generation H.264/Advanced Video Coding (AVC) standard, the High Efficiency Video Coding (HEVC) standard saves about 50% of the bit rate at the same video quality. This benefits from a number of advanced video coding techniques, such as the quadtree-based coding unit (CU) partition structure. However, these techniques also bring considerable complexity: the encoding time of HEVC is on average about 253% longer than that of H.264/AVC, which limits the practical application of the standard. It is therefore necessary to significantly reduce the HEVC encoding complexity under the premise that the rate-distortion (RD) performance is hardly affected.
In the past few years, a variety of methods for reducing HEVC encoding complexity have been proposed. It has been measured that the quadtree-based recursive CU partition search occupies most of the encoding time (more than 80% in the standard reference software HM), so many methods reduce HEVC encoding complexity by simplifying the CU partitioning. The basic idea of such methods is to predict the CU partitioning in advance, replacing the recursive brute-force search that realizes rate-distortion optimization (RD optimization, RDO) in the original encoder.
At present, methods for reducing HEVC encoding complexity fall into two categories: heuristic methods and learning-based methods.
Early CU partition prediction methods are generally heuristic: based on certain features of the encoding process, they decide the CU partitioning in advance, before the recursive search. In heuristic methods, the brute-force search can be simplified by extracting intermediate features. For example, in a frame-level CU partition decision method, the CU depths that rarely occur in previous frames are skipped when deciding the CU depth of the current frame. At the CU level, CU partition decision methods based on pyramid motion divergence and on the number of high-frequency key points have also been proposed. In addition, some key and easily computed features (such as the RD cost and the inter-mode prediction error) have been used to predict a reasonable CU partitioning, with the CU partition then decided by minimizing the Bayesian risk criterion. Also based on the Bayesian criterion, the complete and the low-complexity RD costs can be taken as features to decide the CU partitioning. Besides simplifying the CU partitioning, a variety of heuristic methods have been proposed to reduce the encoding complexity at the prediction unit (PU) and transform unit (TU) levels. For example, a fast PU size decision method has been proposed that adaptively merges smaller PUs into larger PUs. The most probable PU partitioning has also been predicted from the coded block flag (CBF) and the RD cost of already-encoded CUs. In the latest research, the coding coefficients are modeled with a mixed Laplacian distribution, on the basis of which the RDO quantization process is accelerated. In addition, the encoding complexity has also been simplified at other levels of HEVC, such as intra or inter prediction mode selection and loop filtering.
In recent years, learning-based methods have achieved remarkable results in reducing HEVC complexity. That is, since 2015 several methods that use machine learning to predict the CU partitioning have been proposed to reduce HEVC encoding complexity. For example, to reduce the complexity of the HEVC inter mode, a CU depth decision method based on a three-level joint support vector machine (SVM) classifier has been proposed to predict whether CUs of three sizes need to be split. These methods learn the coding rules of certain HEVC stages from a large amount of data, to simplify or replace the brute-force search of the original encoding process. For example, for the intra mode, logistic regression and SVM have been used to model the CU partitioning as binary classification; the trained model can then decide in advance whether each CU is split, avoiding the time-consuming recursive brute-force search. For the inter mode, three early termination mechanisms based on data mining have been proposed to estimate the optimal CTU partition result, simplifying the CTU partition process of the original encoder. Several intermediate features related to the CU partitioning have been studied and combined to determine the CU partition depth and skip the brute-force RDO search, thereby reducing the HEVC encoding complexity. Later, an SVM method combining binary and multi-class classification was proposed to predict the CU partitioning and the PU mode selection in advance, further reducing the HEVC encoding time. However, the above learning-based methods rely heavily on manually extracted features, which requires considerable prior knowledge and may ignore hidden but valuable features.
To avoid manual feature extraction while reducing the complexity of the HEVC intra mode, a CU partition prediction method based on a convolutional neural network (CNN) has been designed. However, the CNN structure of the prior art is shallow and its learning capability is limited, so it cannot accurately model the complex CU partition process.
Summary
In view of the problems in the prior art, the present disclosure provides a block partition coding complexity optimization method and apparatus based on a deep learning method, which can significantly shorten the time required to determine the CU partitioning during encoding while guaranteeing the CU partition prediction accuracy, effectively reducing the HEVC coding complexity.
In a first aspect, the present disclosure provides a block partition coding complexity optimization method based on a deep learning method, including:
in High Efficiency Video Coding (HEVC), checking the frame coding mode currently used by the HEVC;
selecting, according to the frame coding mode, a coding unit (CU) partition prediction model corresponding to the frame coding mode, the CU partition prediction model being a pre-established and trained model with early termination capability;
predicting the CU partition result in the HEVC according to the selected CU partition prediction model, and partitioning the entire coding tree unit (CTU) according to the predicted CU partition result.
Optionally, if the frame coding mode is the intra mode, the CU partition prediction model is an early-terminated hierarchical convolutional neural network (ETH-CNN);
if the frame coding mode is the inter mode, the CU partition prediction model is an early-terminated hierarchical LSTM (ETH-LSTM) together with the ETH-CNN.
Optionally, before the step of checking the frame coding mode currently used by the HEVC, the method further includes:
constructing the ETH-CNN and training the ETH-CNN;
constructing the ETH-LSTM and training the ETH-LSTM.
Optionally, the step of constructing the ETH-CNN and training the ETH-CNN includes:
constructing a first database for predicting CU partition results in HEVC in the intra mode;
encoding the images in the first database with the HEVC standard reference software to obtain positive samples and negative samples of the first database;
training the ETH-CNN corresponding to the intra mode with the positive samples and the negative samples.
Optionally, each image in the first database has a resolution of 4928×3264;
the first database includes a training set, a validation set, and a test set; each of the training, validation, and test sets includes four subsets;
the images in the first of the four subsets have a resolution of 4928×3264, those in the second subset 2880×1920, those in the third subset 1536×1024, and those in the fourth subset 768×512.
Optionally, the steps of constructing the ETH-CNN, training the ETH-CNN, constructing the ETH-LSTM, and training the ETH-LSTM include:
constructing a second database for predicting CU partition results in HEVC in the inter mode;
preprocessing the resolutions of all videos in the second database so that the resolution of each video segment is within a preset range, and preprocessing the video lengths so that each video is within a preset length;
encoding the preprocessed videos in the second database with the HEVC standard reference software to obtain positive samples and negative samples of the second database;
training the ETH-CNN corresponding to the inter mode and the ETH-LSTM corresponding to the inter mode with the positive samples and the negative samples.
Optionally, the second database includes videos of one or more of the following resolutions:
SIF (352×240), CIF (352×288), NTSC (720×486), 4CIF (704×576), 240p (416×240), 480p (832×480), 720p (1280×720), 1080p (1920×1080), WQXGA (2560×1600) and 4K (4096×2160);
the second database includes a training set, a validation set, and a test set.
Optionally, the input of the ETH-CNN is a 64×64 matrix representing the luminance information of the entire CTU, denoted by U;
the structured output of the ETH-CNN contains three branches, representing the predictions of the three levels of the HCPM: $\hat{y}_1(U)$, $\{\hat{y}_2(U_i)\}_{i=1}^{4}$ and $\{\hat{y}_3(U_{i,j})\}_{i,j=1}^{4}$;
the early-termination mechanism of the ETH-CNN can terminate the computation of the fully connected layers on the second and third branches in advance;
and/or, the specific structure of the ETH-CNN contains two preprocessing layers, three convolutional layers, one concatenation layer, and three fully connected layers.
Optionally, the preprocessing layers perform preprocessing operations on the matrix;
starting from the preprocessing layers, the input information is processed and transformed in three parallel branches $\{B_l\}_{l=1}^{3}$;
in each branch $B_l$, the convolutional layers perform three levels of convolution on all preprocessed data, denoted by $C_{l-1}$, $C_{l-2}$ and $C_{l-3}$;
at the same level, the convolution kernels of all three branches have the same size;
first, in the first convolutional layer, the preprocessed data are convolved with sixteen 4×4 kernels to obtain 16 different feature maps, extracting low-level features of the image information in preparation for deciding the CU partition; in the second and third convolutional layers, the above feature maps are convolved successively with 24 and 32 2×2 kernels to extract higher-level features, finally yielding 32 feature maps in each branch $B_l$;
in all convolutional layers, the stride of the convolution equals the side length of the kernel;
the concatenation layer merges all features of the second and third convolutional layers of the three branches $\{B_l\}_{l=1}^{3}$ into a vector; the features of the concatenation layer are combined from feature maps of six sources in total, namely $C_{1-2}$, $C_{1-3}$, $C_{2-2}$, $C_{2-3}$, $C_{3-2}$ and $C_{3-3}$;
the fully connected layers divide the concatenated features again into three branches $\{B_l\}_{l=1}^{3}$ for processing, likewise corresponding to the three-level output of the HCPM;
in each branch $B_l$, the feature vector passes successively through three fully connected layers: two hidden layers and one output layer; the outputs of the two hidden layers are $f_{1-l}$ and $f_{2-l}$;
the output of the last layer is the final HCPM;
the number of features in each fully connected layer depends on the branch it belongs to, ensuring that the three branches $B_1$, $B_2$ and $B_3$ output 1, 4 and 16 features respectively, corresponding to the predictions of the three levels of the HCPM, $\hat{y}_1(U)$, $\{\hat{y}_2(U_i)\}_{i=1}^{4}$ and $\{\hat{y}_3(U_{i,j})\}_{i,j=1}^{4}$;
in the first and second fully connected layers of the ETH-CNN, the QP is added to the feature vector as an external feature, enabling the ETH-CNN to model the relationship between the QP and the CU partition.
Optionally, the predicted CU partition result is represented by the structured output of a hierarchical CU partition map (HCPM);
and/or, the HCPM contains 1×1, 2×2 and 4×4 binary labels at levels 1, 2 and 3 respectively, corresponding to the ground-truth labels $y_1(U)$, $\{y_2(U_i)\}_{i=1}^{4}$ and $\{y_3(U_{i,j})\}_{i,j=1}^{4}$, and to the predicted labels $\hat{y}_1(U)$, $\{\hat{y}_2(U_i)\}_{i=1}^{4}$ and $\{\hat{y}_3(U_{i,j})\}_{i,j=1}^{4}$;
the CU partition result includes the level-1 classification label;
and/or,
when U or $U_i$ is split, the CU partition result includes the level-2 or level-3 binary labels;
when U or $U_i$ is not split, the level-2 or level-3 classification labels included in the CU partition result take the null value.
That is, whatever the CU partition, the level-2 and level-3 binary labels always exist, but are sometimes null.
Optionally, the objective function for training the ETH-CNN model is the cross-entropy;
for each sample, the objective function $L_r$ is the sum of the cross-entropies of all binary labels:

$$L_r = H_1\big(y_1(U), \hat{y}_1(U)\big) + \sum_{i=1}^{4} H_2\big(y_2(U_i), \hat{y}_2(U_i)\big) + \sum_{i,j=1}^{4} H_3\big(y_3(U_{i,j}), \hat{y}_3(U_{i,j})\big) \qquad (1)$$

where $H_l(\cdot,\cdot)$ ($l \in \{1,2,3\}$) denotes the cross-entropy between the prediction and the ground-truth label of a level-l binary classifier in the HCPM, r is the index of a sample within a training batch, $L_r$ is the objective function of the r-th sample, $y_1(U)$, $\{y_2(U_i)\}_{i=1}^{4}$ and $\{y_3(U_{i,j})\}_{i,j=1}^{4}$ denote the ground-truth values, and $\hat{y}_1(U)$, $\{\hat{y}_2(U_i)\}_{i=1}^{4}$ and $\{\hat{y}_3(U_{i,j})\}_{i,j=1}^{4}$ denote the predicted values.
Optionally, the residual CTU obtained by fast pre-encoding is input into the ETH-CNN, and the inter-mode ETH-CNN is trained with the CU partition labels in the second database as ground truth;
the three vectors output by the first fully connected layer of the ETH-CNN are input into the three levels of the ETH-LSTM respectively;
and the inter-mode ETH-LSTM is trained with the CU partition labels in the second database as ground truth;
the LSTM cells and fully connected layers of each level of the ETH-LSTM are trained separately with the CUs of that level, i.e., level 1 of the ETH-LSTM is trained with 64×64 CUs, level 2 with 32×32 CUs, and level 3 with 16×16 CUs.
When training the parameters in the configuration of the ETH-LSTM corresponding to Table 2 below, the cross-entropy is used as the loss function;
assuming a training batch contains R samples, the temporal length of the LSTM in each sample is T (i.e., T LSTM cells), and the loss of the t-th frame of the r-th sample is $L_r(t)$, the loss function L of the batch is defined as the average of all $L_r(t)$:

$$L = \frac{1}{RT} \sum_{r=1}^{R} \sum_{t=1}^{T} L_r(t)$$

Training is then performed with stochastic gradient descent with momentum;
finally, given the trained LSTM, the HCPM is obtained from the ETH-LSTM to predict the inter-mode CU partition result.
In a second aspect, an embodiment of the present disclosure further provides an apparatus for optimizing block-partition coding complexity based on a deep learning method, including:
a memory, a processor, a bus, and a computer program stored in the memory and running on the processor, wherein the processor implements the method of any one of the first aspect when executing the program.
In a third aspect, a computer storage medium stores a computer program, wherein the program, when executed by a processor, implements the method of any one of the first aspect.
The present disclosure has the following beneficial effects:
(1) Compared with the three levels of separate CU partition labels in the prior art, the present disclosure uses the structured output of the HCPM to represent the CU partition process efficiently. The trained ETH-CNN/ETH-LSTM model needs to be run only once to obtain all CU partition results of the entire CTU in the form of a single HCPM, which markedly reduces the running time of the deep neural network itself and helps reduce the overall encoding complexity.
(2) The deep ETH-CNN structure of the present disclosure automatically extracts features relevant to the CU partition, overcoming the drawback of hand-crafted feature extraction in the prior art. In addition, the deep ETH-CNN structure has more trainable parameters than the CNN structures of the prior art, significantly improving CU partition prediction accuracy.
(3) The early-termination mechanism used in the present disclosure further saves computation time.
(4) The deep ETH-LSTM model proposed in the present disclosure learns the long- and short-term dependencies of the CU partition across frames in the inter mode. The present disclosure is the first to use an LSTM to predict the CU partition so as to reduce HEVC encoding complexity.
(5) To train the large number of parameters of the ETH-CNN and ETH-LSTM of the present disclosure, CU partition databases are established in advance for the intra and inter modes. Other methods in the prior art rely only on the existing JCT-VC database, whose scale is far smaller than the databases of the present disclosure. Establishing large-scale CU partition databases can promote follow-up research on reducing HEVC complexity by predicting the CU partition with deep learning.
Brief Description of the Drawings
Fig. 1 is a schematic diagram of RD cost checking and comparison in the prior art;
Fig. 2 is a schematic diagram of the CU partition structure;
Fig. 3 is a schematic diagram of the HCPM provided by an embodiment of the present disclosure;
Fig. 4 is a schematic diagram of the ETH-CNN structure provided by an embodiment of the present disclosure;
Fig. 5 is a schematic diagram of the ETH-LSTM structure provided by an embodiment of the present disclosure;
Fig. 6 is a schematic flowchart of the method for optimizing block-partition coding complexity based on a deep learning method provided by an embodiment of the present disclosure;
Fig. 7 is a schematic structural diagram of the apparatus for optimizing block-partition coding complexity based on a deep learning method provided by an embodiment of the present disclosure;
Fig. 8 is a schematic diagram of using the ETH-LSTM provided by an embodiment of the present disclosure.
Detailed Description
To overcome the drawback of hand-crafted feature extraction in the prior art, deep-learning-based methods have emerged. Deep learning does not require features to be extracted manually during encoding; instead, it automatically extracts multiple features related to the coding results from large-scale data. However, in-depth research on reducing encoding complexity with deep learning is still scarce. At present, the prior art mainly uses a rather shallow CNN structure for CU partition prediction in the intra mode; that CNN contains only two convolutional layers, with 6 and 16 3×3 kernels respectively. For the inter mode, there is as yet no work that uses deep learning to reduce encoding complexity, i.e., none explores the correlation of the CU partition between frames at different distances.
By contrast, embodiments of the present disclosure propose CU partition prediction models based on the deep ETH-CNN and ETH-LSTM network structures, which accurately predict CU partition results and reduce the HEVC intra- and inter-mode complexity, i.e., the encoding complexity.
Specifically, unlike conventional methods that decide separately whether each single CU is split, the HCPM of the embodiments predicts the CU partition of the entire CTU at once through a hierarchical, structured output. On the basis of the HCPM, the deep CNN structure is improved by introducing an early-termination mechanism, which is used to reduce the complexity of the HEVC intra mode.
The core improvements of the present disclosure may include: 1. constructing a large-scale CU partition database applicable to the HEVC intra and inter modes, promoting research on reducing HEVC complexity based on deep learning; 2. proposing a deep CNN, ETH-CNN, which produces a structured output of the CU partition through the HCPM and is used to reduce HEVC intra-mode complexity; 3. proposing a deep LSTM, ETH-LSTM, which is combined with the ETH-CNN to learn the spatio-temporal correlation of the CU partition and is used to reduce HEVC inter-mode complexity.
Embodiments of the present disclosure propose a method for optimizing block-partition coding complexity based on a deep learning method, applicable to both the intra and inter modes. The method can learn the CU partition of an entire coding tree unit (CTU) from the above databases; that is, the CU partition of the entire CTU is represented efficiently by a hierarchical CU partition map (HCPM). Given sufficient training data and the efficient HCPM representation, the deep-learning network structure can go deeper, learning enough parameters to explore the great diversity of CU partition patterns.
In addition, the deep learning method of the embodiments introduces an early-terminated hierarchical CNN (ETH-CNN), which generates the structured HCPM in a hierarchical manner. This early termination saves computation time of the CNN itself and helps reduce intra-mode HEVC encoding complexity. The embodiments further introduce an early-terminated hierarchical LSTM (ETH-LSTM) applicable to the inter mode. In the ETH-LSTM, the temporal correlation of the CU partition is learned in the LSTM cells. Taking the features of the ETH-CNN as input, the ETH-LSTM combines the learned LSTM cells with the early-termination mechanism to output the HCPM hierarchically. The above method can thus effectively reduce the encoding complexity of the HEVC inter mode.
Embodiment 1
As shown in Fig. 6, the method for optimizing block-partition coding complexity based on a deep learning method of the present disclosure may include the following steps:
101. In HEVC, check the frame coding mode currently used by the HEVC.
102. Select, according to the frame coding mode, the CU partition prediction model corresponding to the frame coding mode; the CU partition prediction model is a model established and trained in advance and having an early-termination mechanism.
103. Predict the CU partition result in the HEVC according to the selected CU partition prediction model, and partition the entire coding tree unit CTU according to the predicted CU partition result.
Specifically, before the method shown in Fig. 6 is performed, the method may further include the following step 600, not shown in the figure:
constructing the ETH-CNN and training the ETH-CNN; constructing the ETH-LSTM and training the ETH-LSTM.
If the frame coding mode is the intra mode, the CU partition prediction model is the ETH-CNN; in this case only the ETH-CNN needs to be constructed and trained.
If the frame coding mode is the inter mode, the CU partition prediction model is the ETH-LSTM together with the ETH-CNN, i.e., the ETH-CNN is constructed and trained, and the ETH-LSTM is constructed and trained. In other words, a long short-term memory structure is designed to learn the temporal dependencies of the inter-mode CU partition, after which the CNN is combined with the LSTM to predict the inter-mode CU partition. In this way, inter-mode HEVC encoding complexity can be significantly reduced. A minimal dispatch sketch of steps 101-103 is given below.
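The following sketch only illustrates the mode-dependent model selection described in steps 101-103; the model classes and their methods (`predict`, `first_fc_features`, `step`) are hypothetical stand-ins for the trained networks and are not part of HM or any real HEVC API.

```python
# Hypothetical dispatch of the CU partition prediction model by frame coding mode.
def predict_ctu_partitions(frame_mode, ctu_lumas, qp, eth_cnn, eth_lstm=None):
    """Return one HCPM per CTU of the current frame."""
    if frame_mode == "intra":
        # Intra mode: the ETH-CNN alone predicts the three-level HCPM.
        return [eth_cnn.predict(ctu, qp) for ctu in ctu_lumas]
    if frame_mode == "inter":
        # Inter mode: ETH-CNN first-FC features feed the three-level ETH-LSTM.
        feats = [eth_cnn.first_fc_features(ctu, qp) for ctu in ctu_lumas]
        return [eth_lstm.step(f, qp) for f in feats]
    raise ValueError(f"unknown frame coding mode: {frame_mode}")
```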
Further, in a specific implementation, training the intra-mode CU partition prediction model may include the following steps:
S1. Construct the first database for predicting CU partition results in HEVC in the intra mode.
S2. Encode the images in the first database with the HEVC standard reference software to obtain the positive and negative samples of the first database.
S3. Train the intra-mode ETH-CNN with the positive and negative samples.
In addition, training the inter-mode CU partition prediction model may include the following steps:
M1. Construct the second database for predicting CU partition results in HEVC in the inter mode.
M2. Preprocess the resolutions of all videos in the second database so that the resolution of each video segment is within a preset range, and preprocess the video lengths so that each video is within a preset length (e.g., 10 s).
M3. Encode the preprocessed videos in the second database with the HEVC standard reference software to obtain the positive and negative samples of the second database.
M4. Train the inter-mode ETH-CNN and the inter-mode ETH-LSTM with the positive and negative samples.
The ETH-LSTM can effectively reduce the complexity of the HEVC inter mode. To train the ETH-LSTM, a large-scale inter-mode CU partition database is established in the embodiments of the present disclosure. The databases cover both the intra mode (2000 lossless images, compressed with 4 quantization parameters (QPs)) and the inter mode (111 lossless videos, compressed with 4 QPs), and can also promote research on reducing HEVC complexity based on deep learning.
For a better understanding, the present disclosure is described in detail below.
1. CU partition databases
A. Overview of the CU partition
The CTU partition structure, with the CU partition at its core, is one of the main components of the HEVC standard. The default size of a CTU is 64×64 pixels; a CTU may contain a single CU or be split into several smaller CUs based on the recursive quadtree structure, and the default minimum CU size is 8×8. In addition, the CTU or CU sizes can be configured before encoding, i.e., the maximum and minimum CTU or CU sizes are set manually according to coding requirements. The CUs in a CTU therefore have multiple possible sizes.
In the HEVC standard, the CU sizes in each CTU are determined by a recursive search. In the standard encoder this is a brute-force process comprising a top-down checking process and a bottom-up comparison process. Fig. 1 illustrates the RD cost checking and comparison between a parent CU and its four sub-CUs. In the checking process, the encoder checks the RD cost of the entire CTU and then checks the RD costs of its sub-CUs; for each sub-CU, if further sub-CUs are possible, the RD cost of each sub-CU of the next generation is checked in turn, recursively, until the minimum-size CUs have been checked.
In Fig. 1, the RD cost of the parent CU is denoted by $R_{pa}$ and the RD costs of the sub-CUs by $\{R_{sub}^{(m)}\}_{m=1}^{4}$, where $m \in \{1,2,3,4\}$ is the index of each sub-CU. Whether the parent CU needs to be split is then determined by comparing the RD costs of the parent CU and its sub-CUs. As shown in Fig. 1-(b), if $R_{pa} > \sum_{m=1}^{4} R_{sub}^{(m)}$, the parent CU needs to be split; otherwise it does not. Note that the RD cost of the split flag itself also has to be considered when deciding whether to split a CU. After the complete RDO search, the CU partition with the minimum RD cost is obtained. Note that the recursive RDO search is extremely time-consuming: for a 64×64 CTU, 85 possible CUs have to be checked, including one 64×64 CU, four 32×32 CUs, $4^2$ 16×16 CUs, and $4^3$ 8×8 CUs. To obtain the RD cost of each CU, the encoder pre-encodes the CU, during which the possible prediction and transform modes have to be encoded. Moreover, to decide the CU partition of a complete CTU, all 85 possible CUs must be pre-encoded, which occupies most of the encoding time. However, the final CU partition retains only 1 (the CTU is not split) to 64 (the entire CTU is split into minimum-size 8×8 CUs) CUs, far fewer than the total of 85. Hence, if a reasonable CU partition can be predicted in advance, the RD cost checking of at most 84 and at least 21 CUs can be omitted, reducing encoding complexity. A small sketch of this search space follows.
B. Intra-mode database
The large-scale CU partition database for the HEVC intra mode (CU Partition of HEVC - Intra, CPH-Intra), i.e., the first database, is introduced below. It is the first database for the CU partition in HEVC. To build the first database, 2000 images with a resolution of 4928×3264 were first selected from the Raw Images Dataset (RAISE). The 2000 images were randomly divided into a training set (1700 images), a validation set (100 images), and a test set (200 images). Furthermore, each set was equally divided into four subsets: one subset keeps the original resolution, and the other three are downsampled to 2880×1920, 1536×1024 and 768×512. The CPH-Intra database thus contains images of multiple resolutions, ensuring the diversity of the CU partition training data.
The above images are then encoded with the HEVC standard reference software, e.g., HM 16.5, at four different QPs {22, 27, 32, 37}, corresponding to the All-Intra (AI) configuration of the standard encoder (file encoder_intra_main.cfg). After encoding, binary labels of all CUs are obtained, indicating split (=1) or non-split (=0); the combination of the image information and the label of one CU constitutes one sample. The CPH-Intra database finally contains 110,405,784 samples; all samples are divided into 12 sub-databases according to their QP values and CU sizes, in which the proportions of split (49.2%) and non-split (50.8%) CUs are close, keeping positive and negative samples relatively balanced.
C. Inter-mode database
In addition, the inter-mode CU partition database, i.e., the second database, CPH-Inter, is established. To build the second database, 111 lossless videos were selected, including six 1080p (1920×1080) videos, the 18 Class A-E standard test sequences recommended by the Joint Collaborative Team on Video Coding (JCT-VC), and 87 videos from Xiph.org. The second database thus contains videos of multiple resolutions: SIF (352×240), CIF (352×288), NTSC (720×486), 4CIF (704×576), 240p (416×240), 480p (832×480), 720p (1280×720), 1080p (1920×1080), WQXGA (2560×1600) and 4K (4096×2160). Also note that since HEVC only supports resolutions that are multiples of 8×8, videos not meeting this requirement have to be adjusted; in the second database, the bottom edges of the NTSC videos are uniformly cropped so that their resolution becomes 720×480. Meanwhile, any video longer than 10 seconds is uniformly cut to 10 seconds.
In the CPH-Inter database, the above videos are divided into non-overlapping training (83 videos), validation (10 videos), and test (18 videos) sets. The videos in the test set come from the 18 JCT-VC standard sequences. Like the CPH-Intra database, the CPH-Inter database is encoded with HM 16.5 at four QPs {22, 27, 32, 37}. Considering the different coding requirements of the inter mode, all videos are encoded with three configurations: the Low Delay P (LDP) configuration (standard file encoder_lowdelay_P_main.cfg), the Low Delay B (LDB) configuration (standard file encoder_lowdelay_main.cfg), and the Random Access (RA) configuration (standard file encoder_randomaccess_main.cfg). Under each configuration, 12 sub-databases are obtained according to the different QPs and CU sizes. The CPH-Inter database contains 307,831,288 samples under the LDP configuration, 275,163,224 under the LDB configuration, and 232,095,164 under the RA configuration, guaranteeing a sufficient amount of data for deep learning.
2. HEVC intra-mode complexity reduction method
A. Hierarchical CU partition map (HCPM)
According to the CU partition structure of HEVC, a CU can by default take four sizes: 64×64, 32×32, 16×16 and 8×8, corresponding to CU depths 0, 1, 2 and 3. A CU of non-minimum size (at least 16×16) may be either split or non-split. As shown in Fig. 2, the whole CU partition can be viewed as a combination of three levels of binary labels $\{y_l\}_{l=1}^{3}$, where $l \in \{1,2,3\}$ denotes the level of the split. Specifically, l = 1 denotes the first level, which decides whether to split a 64×64 CU into four 32×32 CUs; l = 2 decides whether to split a 32×32 CU into 16×16 CUs; and l = 3 decides whether to split a 16×16 CU into 8×8 CUs.
For a given CTU, the CU of depth 0 is denoted by U. For U, the first-level label $y_1(U)$ indicates splitting (=1) or not splitting (=0) this CU. If U is split, its sub-CUs of depth 1 are denoted by $\{U_i\}_{i=1}^{4}$. The second-level labels $\{y_2(U_i)\}_{i=1}^{4}$ then indicate whether each of these sub-CUs is split (split = 1, non-split = 0). For each split $U_i$, its sub-CUs of depth 2 are denoted by $\{U_{i,j}\}_{j=1}^{4}$. Likewise, the third-level labels $\{y_3(U_{i,j})\}_{j=1}^{4}$ indicate whether each CU of depth 2 is split. For each split $U_{i,j}$, its sub-CUs of depth 3 are $\{U_{i,j,k}\}_{k=1}^{4}$. The subscripts $i, j, k \in \{1,2,3,4\}$ are the indices of the sub-CUs of U, $U_i$ and $U_{i,j}$ respectively. The above hierarchical CU partition labels are indicated by the downward arrows in Fig. 2. Because of the many possible combinations, the overall CU partition of a CTU is highly complex. For example, for a 64×64 U, if $y_1(U) = 1$, it is split into four 32×32 CUs, i.e., $\{U_i\}_{i=1}^{4}$. For each $U_i$, considering its four possible sub-CUs $\{U_{i,j}\}_{j=1}^{4}$, there are $1 + 2^4 = 17$ ways of splitting. Hence, for the entire CTU there are $1 + 17^4 = 83{,}522$ possible CU partitions, as the short check below confirms.
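A short check of the combinatorics stated above (plain Python, no encoder involved): a 32×32 CU is either not split (1 way) or split into four 16×16 CUs that are each independently split or not ($2^4$ ways); a 64×64 CTU is either not split or split into four such 32×32 CUs.

```python
ways_32 = 1 + 2 ** 4           # 17 partitions of one 32x32 CU
ways_ctu = 1 + ways_32 ** 4    # 83522 partitions of one 64x64 CTU
print(ways_32, ways_ctu)       # -> 17 83522
```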
As described in Section A of Chapter 1 on the partition databases, the standard HEVC encoder obtains the CU partition labels $y_1(U)$, $\{y_2(U_i)\}_{i=1}^{4}$ and $\{y_3(U_{i,j})\}_{i,j=1}^{4}$ through the time-consuming RDO process. In fact, these labels can be predicted by machine learning, replacing the conventional RDO process. However, because of the many possible CU partitions (the 83,522 above), this can hardly be predicted in one step by a simple multi-class classifier.
Therefore, the CU partition labels should be predicted level by level, i.e., the CU partition labels of each level, $y_1(U)$, $\{y_2(U_i)\}_{i=1}^{4}$ and $\{y_3(U_{i,j})\}_{i,j=1}^{4}$, are predicted separately, and the predictions are denoted by $\hat{y}_1(U)$, $\{\hat{y}_2(U_i)\}_{i=1}^{4}$ and $\{\hat{y}_3(U_{i,j})\}_{i,j=1}^{4}$. In the prior art, the binary labels deciding the 64×64, 32×32 and 16×16 CUs are predicted separately. To decide the CU partition of an entire CTU, the trained model has to be invoked many times, which causes considerable computational redundancy. To overcome this drawback, this embodiment uses the hierarchical CU partition map (HCPM) to represent the CU partition result efficiently with a structured output. In this way, the trained model needs to be invoked only once to predict the CU partition of the entire CTU, greatly reducing the computation time of the prediction itself.
Fig. 3 is an example of the HCPM, which represents the CU partition labels hierarchically as a structured output. Specifically, the HCPM contains 1×1, 2×2 and 4×4 binary labels at levels 1, 2 and 3 respectively, corresponding to the ground truth $y_1(U)$, $\{y_2(U_i)\}_{i=1}^{4}$ and $\{y_3(U_{i,j})\}_{i,j=1}^{4}$ and the predictions $\hat{y}_1(U)$, $\{\hat{y}_2(U_i)\}_{i=1}^{4}$ and $\{\hat{y}_3(U_{i,j})\}_{i,j=1}^{4}$. Whatever the CU partition, the level-1 label always exists; but when U or $U_i$ is not split, the corresponding sub-CUs $\{U_i\}_{i=1}^{4}$ or $\{U_{i,j}\}_{j=1}^{4}$ do not exist, and the corresponding $y_2(U_i)$ or $y_3(U_{i,j})$ labels in the HCPM are set to null, shown as "-" in Fig. 3. A minimal data-structure sketch follows.
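Below is a hypothetical in-memory representation of one HCPM, with `None` standing in for the null labels just described; the class name and layout are illustrative and not part of any reference software.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Hcpm:
    level1: int                                   # y1(U): 1 = split, 0 = non-split
    level2: List[Optional[int]] = field(default_factory=lambda: [None] * 4)
    level3: List[List[Optional[int]]] = field(
        default_factory=lambda: [[None] * 4 for _ in range(4)])

# Example: the CTU is split; only U_1 is split further; its four 16x16 CUs
# are all non-split, so the other level-3 labels stay null.
hcpm = Hcpm(level1=1)
hcpm.level2 = [1, 0, 0, 0]
hcpm.level3[0] = [0, 0, 0, 0]
print(hcpm)
```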
B. ETH-CNN structure adapted to the HCPM
Since the main task of the method of this embodiment is to predict the CU partition from the image information of the CTU, and the input is a matrix with strong spatial correlation, this embodiment models the HCPM with a CNN.
The ETH-CNN structure designed according to the CU partition principle is shown in Fig. 4.
The input of the ETH-CNN is a 64×64 matrix representing the luminance information of the entire CTU, denoted by U. The structured output of the ETH-CNN contains three branches, representing the predictions of the three levels of the HCPM: $\hat{y}_1(U)$, $\{\hat{y}_2(U_i)\}_{i=1}^{4}$ and $\{\hat{y}_3(U_{i,j})\}_{i,j=1}^{4}$.
Compared with an ordinary CNN, the ETH-CNN introduces an early-termination mechanism that can terminate the computation of the fully connected layers on the second and third branches in advance. The specific structure of the ETH-CNN contains two preprocessing layers, three convolutional layers, one concatenation layer, and three fully connected layers.
The configuration and function of each part are described below (a layer-by-layer sketch is given after Table 1).
● Preprocessing layers. The original CTU luminance matrix (64×64) first undergoes preprocessing such as mean removal and downsampling. To match the three-level output of the HCPM, from the preprocessing layers onward the input is processed and transformed in three parallel branches $\{B_l\}_{l=1}^{3}$. In the mean-removal operations of the three branches, the luminance matrix of each input CTU is reduced by the mean luminance over a certain region of the image, to lessen luminance differences between images. In branch $B_1$, the luminance matrix is reduced by the overall mean luminance of the CTU, corresponding to the single level-1 prediction $\hat{y}_1(U)$ of the HCPM. In branch $B_2$, the 64×64 luminance matrix is divided into 2×2 non-overlapping 32×32 units, each reduced by its internal mean luminance, corresponding exactly to the four level-2 labels $\{\hat{y}_2(U_i)\}_{i=1}^{4}$ of the HCPM. Similarly, in $B_3$ the 64×64 luminance matrix is divided into 4×4 non-overlapping 16×16 units, and the mean removal is performed inside each unit, corresponding to the 4×4 level-3 labels $\{\hat{y}_3(U_{i,j})\}_{i,j=1}^{4}$ of the HCPM. Then, considering that the image content of CTUs with shallow partition depth is generally smooth and lacks fine detail, in $B_1$ and $B_2$ the mean-removed luminance matrices are further downsampled, as shown in the figure, to sizes of 16×16 and 32×32, further reducing subsequent computation. Moreover, this selective downsampling ensures that the output sizes of the subsequent convolutional layers of $B_1$-$B_3$ match the numbers of output labels of levels 1-3 of the HCPM, giving the convolution outputs a clear, well-defined meaning.
● Convolutional layers. In each branch $B_l$, three levels of convolution are performed on all preprocessed data, denoted by $C_{l-1}$, $C_{l-2}$ and $C_{l-3}$. At the same level, the convolution kernels of all three branches have the same size. First, in the first convolutional layer, the preprocessed data are convolved with sixteen 4×4 kernels, yielding 16 different feature maps that extract low-level features of the image information in preparation for deciding the CU partition. In the second and third convolutional layers, the above feature maps are convolved successively with 24 and 32 2×2 kernels to extract higher-level features, finally yielding 32 feature maps in each branch $B_l$. In all convolutional layers the convolution stride equals the side length of the kernel, which exactly realizes non-overlapping convolution; and the receptive fields of most kernels are 8×8, 16×16, 32×32 or 64×64 (side lengths that are integer powers of 2), corresponding exactly to the positions and sizes of the non-overlapping CUs in HEVC.
● Concatenation layer. All features of the second and third convolutional layers of the three branches $\{B_l\}_{l=1}^{3}$ are merged together into a vector. As shown in Fig. 4, the features of this layer are combined from feature maps of six sources in total, namely $C_{1-2}$, $C_{1-3}$, $C_{2-2}$, $C_{2-3}$, $C_{3-2}$ and $C_{3-3}$, so as to obtain a variety of global and local features. After this feature concatenation, the subsequent fully connected layers can use the features of the complete CTU to predict the CU partition of a given level of the HCPM, instead of being confined to the features of a single branch $B_1$, $B_2$ or $B_3$.
● Fully connected layers. The concatenated features are again divided into three branches $\{B_l\}_{l=1}^{3}$ for processing, likewise corresponding to the three-level output of the HCPM. In each branch $B_l$, the feature vector passes successively through three fully connected layers: two hidden layers and one output layer. The outputs of the two hidden layers are $f_{1-l}$ and $f_{2-l}$, and the output of the last layer is the final HCPM. The number of features in each fully connected layer depends on the branch (i.e., the HCPM level l), ensuring that the three branches $B_1$, $B_2$ and $B_3$ output 1, 4 and 16 features respectively, corresponding exactly to the predictions of the three HCPM levels, $\hat{y}_1(U)$, $\{\hat{y}_2(U_i)\}_{i=1}^{4}$ and $\{\hat{y}_3(U_{i,j})\}_{i,j=1}^{4}$. In addition, the influence of the quantization parameter QP on the CU partition has to be considered: generally, as the QP decreases, more CUs are split; conversely, as the QP increases, CUs tend not to be split. Therefore, in the first and second fully connected layers of the ETH-CNN, the QP is added to the feature vector as an external feature, enabling the network to model the relationship between the QP and the CU partition, to predict the partition accurately under different QPs, and to improve the adaptability of the algorithm to different coding qualities and bit rates. Moreover, through the early-termination mechanism of the ETH-CNN, the level-2 and level-3 fully connected layers can be skipped to save computation time. Specifically, if U at level 1 is not split, the level-2 computation for $\{\hat{y}_2(U_i)\}_{i=1}^{4}$ is unnecessary; if none of $\{U_i\}_{i=1}^{4}$ at level 2 is split, the level-3 computation for $\{\hat{y}_3(U_{i,j})\}_{i,j=1}^{4}$ is unnecessary.
● Other layers. In the CNN training phase, the features of the first and second fully connected layers are randomly dropped out with probabilities of 50% and 20% respectively, preventing overfitting and improving the generalization ability of the network. In both the training and test phases, all convolutional layers and the first and second fully connected layers are activated with rectified linear units (ReLU), introducing appropriate sparsity into the network to improve training efficiency. The third fully connected layer of every branch $\{B_l\}_{l=1}^{3}$, i.e., the output layer, is activated with the sigmoid function so that the output values lie in (0, 1), matching the binary labels in the HCPM.
The specific configuration of the ETH-CNN is shown in Table 1. The network has 1,287,189 trainable parameters in total. Compared with the shallow CNNs of the prior art, the ETH-CNN has a higher network capacity and can model the CU partition problem more effectively. Thanks to the more than one hundred million training samples in the CPH-Intra database, the risk of overfitting is reduced despite the large number of parameters. In addition, predicting all three levels of the HCPM with a single network is another major advantage of the ETH-CNN: when predicting $y_1(U)$, $\{y_2(U_i)\}_{i=1}^{4}$ and $\{y_3(U_{i,j})\}_{i,j=1}^{4}$, the network shares the features of the convolutional and concatenation layers. Unlike conventional learning-based methods, which predict the partition of the 64×64, 32×32 and 16×16 CUs in turn, the ETH-CNN outputs the HCPM and therefore shares the network structure and parameters; on the premise of accurate CU partition prediction, this markedly reduces the computation of the network itself and further lowers the overall encoding complexity.
Table 1. ETH-CNN configuration
[Table 1 is rendered as an image in the original publication and is not reproduced here.]
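The following is a minimal Keras sketch of the ETH-CNN topology described above: three branches with selective downsampling, non-overlapping (stride-equals-kernel) convolutions with 16/24/32 kernels, concatenation of the conv-2 and conv-3 features of all branches, and per-branch fully connected stacks with the QP appended as an external feature. The hidden-layer widths (64, 48) are assumptions, since Table 1 is an image here; early termination is a runtime shortcut that a static graph does not express, and mean removal is assumed to happen before the network.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_eth_cnn():
    ctu = layers.Input(shape=(64, 64, 1), name="ctu_luma")  # mean-removed CTU
    qp = layers.Input(shape=(1,), name="qp")                # external QP feature

    # Branch inputs: B1 downsampled to 16x16, B2 to 32x32, B3 kept at 64x64.
    branches = [layers.AveragePooling2D(4)(ctu),
                layers.AveragePooling2D(2)(ctu),
                ctu]

    merged_feats = []
    for x in branches:
        x = layers.Conv2D(16, 4, strides=4, activation="relu")(x)  # conv 1
        x = layers.Conv2D(24, 2, strides=2, activation="relu")(x)  # conv 2
        merged_feats.append(layers.Flatten()(x))
        x = layers.Conv2D(32, 2, strides=2, activation="relu")(x)  # conv 3
        merged_feats.append(layers.Flatten()(x))
    merged = layers.Concatenate(name="concat_6_sources")(merged_feats)

    outputs = []
    for level, n_out in zip((1, 2, 3), (1, 4, 16)):
        f = layers.Dense(64, activation="relu")(layers.Concatenate()([merged, qp]))
        f = layers.Dropout(0.5)(f)                                  # 50% dropout
        f = layers.Dense(48, activation="relu")(layers.Concatenate()([f, qp]))
        f = layers.Dropout(0.2)(f)                                  # 20% dropout
        outputs.append(layers.Dense(n_out, activation="sigmoid",
                                    name=f"hcpm_level_{level}")(f))
    return tf.keras.Model(inputs=[ctu, qp], outputs=outputs)

model = build_eth_cnn()
model.summary()
```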
C. Objective function for training the ETH-CNN model
After the structure of the deep CNN is determined, a suitable objective function has to be sought from the ground truth of the training process and the output of the model, so that the model can predict the HCPM effectively. Since the total number of samples is very large, feeding them all into the network at once would cause problems such as insufficient hardware resources and overly slow updates of the network weights; the CNN is therefore trained in batches. Suppose R samples are fed into the network each time, with corresponding HCPM ground-truth labels $\{Y_r\}_{r=1}^{R}$ and predicted labels $\{\hat{Y}_r\}_{r=1}^{R}$. Since the network outputs and the corresponding ground-truth labels are binarized with a range of [0, 1], this embodiment adopts the cross-entropy as the objective function.
For each sample, the objective function $L_r$ is the sum of the cross-entropies of all binary labels:

$$L_r = H_1\big(y_1(U), \hat{y}_1(U)\big) + \sum_{i=1}^{4} H_2\big(y_2(U_i), \hat{y}_2(U_i)\big) + \sum_{i,j=1}^{4} H_3\big(y_3(U_{i,j}), \hat{y}_3(U_{i,j})\big) \qquad (1)$$

where $H_l(\cdot,\cdot)$ ($l \in \{1,2,3\}$) denotes the cross-entropy between the prediction and the ground-truth label of a level-l binary classifier in the HCPM. Since some ground-truth labels do not exist, such as the null $y_2(U_i)$ or $y_3(U_{i,j})$ in Fig. 2, only valid pairs of ground truth and prediction are counted in the objective function.
The objective function over a batch is the average of the objective functions of all its samples:

$$L = \frac{1}{R} \sum_{r=1}^{R} L_r \qquad (2)$$

Since the CNN is trained in batches, to give every sample an equal chance of being selected, a portion of the large training set should be selected at random each time as network input; stochastic gradient descent with momentum is therefore used for optimization. A masked-loss sketch follows.
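The sketch below implements the per-sample loss of Eq. (1) in NumPy, under the assumed convention that null labels (non-existent sub-CUs) are encoded as NaN and masked out; the encoding convention is an illustration, not part of the disclosure.

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    return -(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))

def hcpm_loss(y_true_levels, y_pred_levels):
    """y_*_levels: [level1 (1,), level2 (4,), level3 (16,)] arrays; NaN = null."""
    total = 0.0
    for y, y_hat in zip(y_true_levels, y_pred_levels):
        valid = ~np.isnan(y)                    # only valid labels are counted
        total += binary_cross_entropy(y[valid], y_hat[valid]).sum()
    return total

# Example: CTU split, only U_1 split further, other level-3 labels null.
y_true = [np.array([1.0]), np.array([1.0, 0.0, 0.0, 0.0]),
          np.concatenate([np.zeros(4), np.full(12, np.nan)])]
y_pred = [np.array([0.9]), np.array([0.8, 0.1, 0.2, 0.1]), np.full(16, 0.3)]
print(hcpm_loss(y_true, y_pred))
```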
3. HEVC inter-mode complexity reduction method
Research and analysis show that the CU partition in the HEVC inter mode is temporally correlated: the closer two frames are, the more similar their CU partitions; as the frame distance grows, the similarity decreases. On the basis of the ETH-CNN, the present disclosure further proposes an ETH-LSTM network to learn the long- and short-term dependencies of the inter-frame CU partition. The overall framework of the ETH-LSTM is shown in Fig. 5.
To exploit the spatial correlation of images in the inter mode, the network takes the residual CTU as input. The residual here is obtained by fast pre-encoding of the current frame; this process is similar to standard encoding, the only difference being that the CU and PU are forced to the maximum size of 64×64 to save time. Although the extra pre-encoding introduces a time overhead, it accounts for less than 3% of the standard encoding time and does not significantly affect the performance of the algorithm. After pre-encoding, the residual CTU is fed into the ETH-CNN. In the inter mode, the parameters of the ETH-CNN are retrained from the residual CTUs in the CPH-Inter database and the ground truth of the CU partition. Next, in every frame, the features $f_{1-l}(t)$ output by the 7th layer (the first fully connected layer) of the ETH-CNN are sent to the ETH-LSTM for subsequent processing.
In the ETH-LSTM, the three levels of LSTM for deciding the CU depth are shown in Fig. 5. Specifically, levels 1, 2 and 3 of the ETH-LSTM each have one LSTM cell, corresponding to $\hat{y}_1(U(t))$, $\{\hat{y}_2(U_i(t))\}_{i=1}^{4}$ and $\{\hat{y}_3(U_{i,j}(t))\}_{i,j=1}^{4}$ of the three-level HCPM, where $\hat{y}_1(U(t))$ indicates whether U (64×64, depth = 0) of frame t is split; similarly, $\{\hat{y}_2(U_i(t))\}_{i=1}^{4}$ and $\{\hat{y}_3(U_{i,j}(t))\}_{i,j=1}^{4}$ indicate whether $U_i$ (32×32, depth = 1) and $U_{i,j}$ (16×16, depth = 2) are split. At each level, the feature vector output by the LSTM cell then passes through two fully connected layers, and each fully connected layer also incorporates two external features: the QP value and the frame order of the current frame in the GOP. Notably, the frame order is represented as a one-hot vector. For level l of frame t, the output features of the LSTM cell and of the first fully connected layer are denoted by $f'_{1-l}(t)$ and $f'_{2-l}(t)$ respectively. The second fully connected layer outputs the probability of CU splitting, i.e., the binary results in the HCPM. Like the ETH-CNN, the ETH-LSTM also introduces an early-termination mechanism: if the level-1 LSTM predicts that the CU is not split, the level-2 $\{\hat{y}_2(U_i(t))\}_{i=1}^{4}$ of the HCPM skips its two fully connected layers and terminates early; similarly, if the level-2 LSTM predicts that none of the four CUs is split, the level-3 fully connected layers of the HCPM also terminate early. This removes redundant computation time in the ETH-LSTM. Finally, the result of the ETH-LSTM, i.e., the partition of the current CTU in frame t, is output in the form of the HCPM.
When deciding the HCPM of each CTU, the ETH-LSTM can use the partition of the co-located CTU in previous frames, which is achieved by learning the long- and short-term correlations of the CU partition with LSTM cells at different levels. During training, the LSTM cells of each level of the ETH-LSTM are trained separately with the CUs of that level, i.e., level 1 of the ETH-LSTM is trained with 64×64 CUs, level 2 with 32×32 CUs, and level 3 with 16×16 CUs.
Next, taking the LSTM cell of level l at frame t as an example, the learning mechanism of the ETH-LSTM is introduced. The LSTM network contains three gates: the input gate $i_l(t)$, the output gate $o_l(t)$, and the forget gate $g_l(t)$. Given the LSTM input features of the current frame, $f_{1-l}(t)$ (i.e., the features of the first fully connected layer of the ETH-CNN), and the LSTM output features of the previous frame, $f'_{1-l}(t-1)$, the three gates can be expressed as:

$$i_l(t) = \sigma\big(W_i \cdot [f_{1-l}(t), f'_{1-l}(t-1)] + b_i\big) \qquad (3)$$
$$o_l(t) = \sigma\big(W_o \cdot [f_{1-l}(t), f'_{1-l}(t-1)] + b_o\big) \qquad (4)$$
$$g_l(t) = \sigma\big(W_f \cdot [f_{1-l}(t), f'_{1-l}(t-1)] + b_f\big) \qquad (5)$$

where $\sigma(\cdot)$ is the sigmoid function. In the three equations above, $W_i$, $W_o$ and $W_f$ are the trainable parameters of the three gates, and $b_i$, $b_o$ and $b_f$ are the corresponding biases. With these three gates, the LSTM cell updates its state at frame t by:

$$c_l(t) = i_l(t) \odot \tanh\big(W_c \cdot [f_{1-l}(t), f'_{1-l}(t-1)] + b_c\big) + g_l(t) \odot c_l(t-1) \qquad (6)$$

where $\odot$ denotes element-wise multiplication, and $W_c$ and $b_c$ are the trainable parameters and bias required to compute $c_l(t)$.
Finally, the LSTM cell output $f'_{1-l}(t)$ can be expressed as:

$$f'_{1-l}(t) = o_l(t) \odot c_l(t) \qquad (7)$$

In the two equations above, the state vector $c_l(t)$ and the output vector $f'_{1-l}(t)$ have the same length as the input vector $f_{1-l}(t)$.
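A NumPy sketch of one LSTM cell step following Eqs. (3)-(7); the weight shapes are assumptions (with input and state length n, each W is (n, 2n) and each b is (n,)), and the random initialization is purely illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(f_in, f_prev, c_prev, W, b):
    """W, b: dicts with keys 'i', 'o', 'f', 'c' for the gates of Eqs. (3)-(6)."""
    z = np.concatenate([f_in, f_prev])           # [f_{1-l}(t), f'_{1-l}(t-1)]
    i = sigmoid(W['i'] @ z + b['i'])             # input gate,  Eq. (3)
    o = sigmoid(W['o'] @ z + b['o'])             # output gate, Eq. (4)
    g = sigmoid(W['f'] @ z + b['f'])             # forget gate, Eq. (5)
    c = i * np.tanh(W['c'] @ z + b['c']) + g * c_prev   # state update, Eq. (6)
    return o * c, c                              # output f'_{1-l}(t), Eq. (7)

n = 8
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((n, 2 * n)) * 0.1 for k in 'iofc'}
b = {k: np.zeros(n) for k in 'iofc'}
f_out, c = lstm_step(rng.standard_normal(n), np.zeros(n), np.zeros(n), W, b)
print(f_out.shape, c.shape)   # -> (8,) (8,)
```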
The configuration of the ETH-LSTM, including all trainable parameters, is shown in Table 2.
Table 2. ETH-LSTM configuration
[Table 2 is rendered as an image in the original publication and is not reproduced here.]
As with the ETH-CNN, the cross-entropy is used as the loss function when training the parameters of Table 2, as in Eq. (1). Suppose a training batch contains R samples and the temporal length of the LSTM in each sample is T (i.e., T LSTM cells), and let $L_r(t)$ be the loss of the t-th frame of the r-th sample. The loss L of the batch is then defined as the average of all $L_r(t)$:

$$L = \frac{1}{RT} \sum_{r=1}^{R} \sum_{t=1}^{T} L_r(t)$$

Training is then performed with stochastic gradient descent with momentum. Finally, given the trained LSTM, the HCPM is obtained from the ETH-LSTM to predict the inter-mode CU partition result. A sketch of the momentum update follows.
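A minimal sketch of the momentum SGD update referred to above; the learning rate and momentum coefficient here are illustrative assumptions, not values from the disclosure.

```python
import numpy as np

def momentum_sgd_step(params, grads, velocity, lr=0.01, momentum=0.9):
    """In-place momentum SGD: v <- m*v - lr*grad; p <- p + v."""
    for k in params:
        velocity[k] = momentum * velocity[k] - lr * grads[k]
        params[k] += velocity[k]

params = {'W': np.ones((2, 2))}
velocity = {'W': np.zeros((2, 2))}
momentum_sgd_step(params, {'W': np.full((2, 2), 0.5)}, velocity)
print(params['W'])
```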
4. Main procedure of the improved HM encoder in HEVC
Step P1. Initialize the current frame.
Step P2. For all CTUs of the current frame:
(1) directly set the CU and PU (prediction unit) sizes to a fixed 64×64; for portions at the frame edge smaller than 64×64, also take the largest possible size (fixing the maximum size avoids recursive checking and comparison for each CU, saving time);
(2) encode the current CTU. During this process the HM encoder records the residual of each CTU, as shown in Fig. 8.
Step P3. Extract the residuals of all CTUs of the current frame. The residual here is the residual referred to in the HEVC standard, i.e., the difference between the prediction result of each PU and the original image. The image information comes from different sources for different PUs; for instance, one PU may be predicted from the previous frame and another from an earlier frame, and so on. The residuals of all CTUs finally form a residual frame, so it is hard to say which two frames the current residual frame is the difference of, since each PU has its own prediction source.
Step P4. For all CTUs of the current frame:
(1) if t is the starting time of the ETH-LSTM, initialize the state vector of each LSTM cell to 0; otherwise skip this step;
(2) feed the luminance information of the CTU residual into the ETH-CNN to obtain the output vectors of the first fully connected layer, $f_{1-1}(t)$, $f_{1-2}(t)$ and $f_{1-3}(t)$;
(3) input $f_{1-1}(t)$, $f_{1-2}(t)$ and $f_{1-3}(t)$ into the three levels of the ETH-LSTM; each LSTM cell reads its own input vector and state vector and obtains the updated state vector and output vector;
(4) pass the output vector of each LSTM cell through the two fully connected layers to obtain the final result of the current CTU at time t, i.e., the HCPM;
(5) directly decide the CU partition scheme from the predicted HCPM;
(6) encode the current CTU with this CU partition scheme.
Step P5. Post-process the current frame, e.g., with in-loop filtering.
Based on the above steps P1 to P5, the embodiments of the present disclosure perform prediction with trained deep neural networks, which can be implemented with any general deep learning framework, such as TensorFlow, Caffe, or PyTorch, as long as the above ETH-CNN and ETH-LSTM can be built. For example, they can be implemented in Python with TensorFlow. A per-frame pipeline sketch of step P4 follows.
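The following high-level sketch mirrors step P4 for one frame; `eth_cnn_fc1`, `lstm_step`, and `fc_head` are stand-in callables for the trained ETH-CNN first fully connected layer, the per-level LSTM cells, and the per-level output heads, and the 0.5 split threshold is an assumption. None of these names comes from the reference software.

```python
def predict_frame_hcpm(residual_ctus, states, eth_cnn_fc1, lstm_step, fc_head):
    """states[idx][l]: persistent LSTM state of CTU idx at level l (0-indexed)."""
    hcpms = []
    for idx, ctu in enumerate(residual_ctus):
        levels = []
        for l in range(3):                       # HCPM levels 1..3
            f_in = eth_cnn_fc1(ctu, level=l)     # f_{1-l}(t)
            f_out, states[idx][l] = lstm_step(f_in, states[idx][l], level=l)
            levels.append(fc_head(f_out, level=l))   # split probabilities
            if max(levels[-1]) < 0.5:            # early termination: no split here
                levels += [None] * (2 - l)       # deeper levels stay null
                break
        hcpms.append(levels)
    return hcpms
```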
5. Notes on the LSTM length
Only the inter mode uses the ETH-LSTM, because the LSTM extracts inter-frame dependencies from the image features.
The inter mode includes three sub-modes: LDP (Low Delay P), LDB (Low Delay B) and RA (Random Access).
In HEVC, each of these sub-modes has multiple configurations; for testing algorithm performance, the standard configuration of each sub-mode is used.
1. In the standard LDP mode, the frame order is IPPPPPP..., i.e., the first frame is an I-frame (pure intra prediction), and all following frames are P-frames (intra prediction, or inter prediction with a single reference frame). In this method, the I-frame is predicted with the ETH-CNN, and only the P-frames are input into the LSTM. In the training phase, the LSTM temporal length is set to 20, and to increase the number of training samples, adjacent LSTMs overlap by 10 frames; i.e., excluding the I-frame, frames 1-20, 11-30, 21-40 and so on are put into the same LSTM for training. In the prediction phase, for convenience, the LSTM length is set to the number of all P-frames, i.e., all P-frames are fed continuously into the same LSTM until the last frame of the video.
2. In the standard LDB mode, the frame order is IBBBBBB..., i.e., the first frame is an I-frame and all following frames are B-frames (intra prediction, or inter prediction with two reference frames). Compared with LDP, only the P-frames are replaced by B-frames; everything else is the same, so the LSTM temporal length is the same as in the LDP mode.
3. The standard RA mode is slightly more complicated: the frame coding order differs from the display order. In the LSTM of the present disclosure, information is passed in coding order, i.e., frames encoded earlier are input into the LSTM first. The frame coding order is I(BBB...BIBBBBBBB)(BBB...BIBBBBBBB)(BBB...BIBBBBBBB)..., i.e., the first frame is an I-frame, and thereafter the frames form groups of 32, in which the 25th frame of each group is an I-frame and all other frames are B-frames. Because of this 32-frame period, the LSTM length is set to 32 in both the training and test phases, so that one group corresponds exactly to one LSTM, with no overlap between adjacent LSTMs. For ease of implementation, I-frames and B-frames are not specially distinguished at test time; instead, all 32 frames of each group are input into the LSTM, and the HCPM output by the LSTM at each time step decides the CU partition. In this way, the 32 frames of a group form a whole without break points, and information can be passed continuously.
The setting of the LSTM length in this embodiment is rather flexible and can be configured according to actual needs. A windowing sketch for the LDP training case is given below.
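The sketch below generates the LDP training windows described above: length-20 windows over the P-frame indices with a 10-frame overlap between adjacent LSTMs (0-based indices here, whereas the text counts from 1).

```python
def ldp_training_windows(num_p_frames, length=20, stride=10):
    return [list(range(start, start + length))
            for start in range(0, num_p_frames - length + 1, stride)]

windows = ldp_training_windows(50)
print(windows[0])   # P-frames 0..19
print(windows[1])   # P-frames 10..29 (10-frame overlap)
```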
Embodiment 2
In addition, the method for optimizing block-partition coding complexity based on a deep learning method described in the embodiments of the present disclosure can be implemented by an apparatus for optimizing block-partition coding complexity, as shown in Fig. 7.
The apparatus may include a processor 501 and a memory 502 storing computer program instructions.
Specifically, the processor 501 may include a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present disclosure.
The memory 502 may include mass storage for data or instructions. By way of example and not limitation, the memory 502 may include a hard disk drive (HDD), a floppy disk drive, flash memory, an optical disc, a magneto-optical disc, magnetic tape, a Universal Serial Bus (USB) drive, or a combination of two or more of these. Where appropriate, the memory 502 may include removable or non-removable (or fixed) media. Where appropriate, the memory 502 may be internal or external to the data processing apparatus. In a particular embodiment, the memory 502 is non-volatile solid-state memory. In a particular embodiment, the memory 502 includes read-only memory (ROM). Where appropriate, the ROM may be mask-programmed ROM, programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), electrically alterable ROM (EAROM), flash memory, or a combination of two or more of these.
The processor 501 reads and executes the computer program instructions stored in the memory 502 to implement any one of the block-partition coding complexity optimization methods of the above embodiments.
In one example, the apparatus may further include a communication interface 503 and a bus 510. As shown in Fig. 7, the processor 501, the memory 502, and the communication interface 503 are connected by the bus 510 and communicate with one another through it.
The communication interface 503 is mainly used to implement communication between the modules, apparatuses, units, and/or devices in the embodiments of the present disclosure.
The bus 510 includes hardware, software, or both, coupling the components of the apparatus to one another. By way of example and not limitation, the bus may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a front-side bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a low-pin-count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association local (VLB) bus, another suitable bus, or a combination of two or more of these. Where appropriate, the bus 510 may include one or more buses. Although specific buses are described and shown in the embodiments of the present disclosure, any suitable bus or interconnect is contemplated.
In addition, in combination with the method for optimizing block-partition coding complexity based on a deep learning method of the above embodiments, an embodiment of the present disclosure may provide a computer-readable storage medium for implementation. Computer program instructions are stored on the computer-readable storage medium; when executed by a processor, the instructions implement any one of the methods of the above embodiments.
To be clear, the present disclosure is not limited to the specific configurations and processing described above and shown in the figures. For brevity, detailed descriptions of known methods are omitted here. In the above embodiments, several specific steps are described and shown as examples. However, the method process of the present disclosure is not limited to the specific steps described and shown; those skilled in the art can make various changes, modifications, and additions, or change the order between steps, after comprehending the spirit of the present disclosure.
The functional blocks shown in the structural block diagrams described above may be implemented as hardware, software, firmware, or a combination thereof. When implemented in hardware, they may be, for example, electronic circuits, application-specific integrated circuits (ASICs), appropriate firmware, plug-ins, function cards, and so on. When implemented in software, the elements of the present disclosure are programs or code segments used to perform the required tasks. The programs or code segments may be stored in a machine-readable medium or transmitted over a transmission medium or communication link by a data signal carried in a carrier wave. A "machine-readable medium" may include any medium capable of storing or transmitting information. Examples of machine-readable media include electronic circuits, semiconductor memory devices, ROM, flash memory, erasable ROM (EROM), floppy disks, CD-ROMs, optical discs, hard disks, fiber-optic media, radio-frequency (RF) links, and so on. Code segments may be downloaded via computer networks such as the Internet or an intranet.
It should also be noted that the exemplary embodiments mentioned in the present disclosure describe some methods or systems on the basis of a series of steps or apparatuses. However, the present disclosure is not limited to the order of the above steps; that is, the steps may be performed in the order mentioned in the embodiments or in an order different from that in the embodiments, or several steps may be performed simultaneously.
Finally, it should be noted that the embodiments described above are merely intended to illustrate, rather than limit, the technical solutions of the present disclosure. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent substitutions for some or all of the technical features therein, and such modifications or substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present disclosure.
Industrial Applicability
The present disclosure has the following beneficial effects: (1) Compared with the three levels of separate CU partition labels in the prior art, the present disclosure uses the structured output of the HCPM to represent the CU partition process efficiently. The trained ETH-CNN/ETH-LSTM model needs to be run only once to obtain all CU partition results of the entire CTU in the form of a single HCPM, which markedly reduces the running time of the deep neural network itself and helps reduce the overall encoding complexity.
(2) The deep ETH-CNN structure of the present disclosure automatically extracts features relevant to the CU partition, overcoming the drawback of hand-crafted feature extraction in the prior art. In addition, the deep ETH-CNN structure has more trainable parameters than the CNN structures of the prior art, significantly improving CU partition prediction accuracy.
(3) The early-termination mechanism used in the present disclosure further saves computation time.
(4) The deep ETH-LSTM model proposed in the present disclosure learns the long- and short-term dependencies of the CU partition across frames in the inter mode. The present disclosure is the first to use an LSTM to predict the CU partition so as to reduce HEVC encoding complexity.
(5) To train the large number of parameters of the ETH-CNN and ETH-LSTM of the present disclosure, CU partition databases are established in advance for the intra and inter modes. Other methods in the prior art rely only on the existing JCT-VC database, whose scale is far smaller than the databases of the present disclosure. Establishing large-scale CU partition databases can promote follow-up research on reducing HEVC complexity by predicting the CU partition with deep learning.
The present disclosure therefore has strong industrial applicability.

Claims (12)

  1. A method for optimizing block-partition coding complexity based on a deep learning method, characterized by comprising:
    in High Efficiency Video Coding (HEVC), checking the frame coding mode currently used by the HEVC;
    selecting, according to the frame coding mode, a coding unit (CU) partition prediction model corresponding to the frame coding mode; the CU partition prediction model being a model established and trained in advance;
    predicting a CU partition result in the HEVC according to the selected CU partition prediction model, and partitioning the entire coding tree unit (CTU) according to the predicted CU partition result.
  2. The method according to claim 1, characterized in that:
    if the frame coding mode is the intra mode, the CU partition prediction model is an early-terminated hierarchical convolutional neural network (ETH-CNN);
    if the frame coding mode is the inter mode, the CU partition prediction model is an early-terminated hierarchical LSTM (ETH-LSTM) together with the ETH-CNN.
  3. The method according to claim 2, characterized in that before the step of checking the frame coding mode currently used by the HEVC, the method further comprises:
    constructing the ETH-CNN and training the ETH-CNN;
    constructing the ETH-LSTM and training the ETH-LSTM.
  4. The method according to claim 3, characterized in that the step of constructing the ETH-CNN and training the ETH-CNN comprises:
    constructing a first database for predicting CU partition results in HEVC in the intra mode;
    encoding the images in the first database with the HEVC standard reference software to obtain positive samples and negative samples of the first database;
    training the ETH-CNN corresponding to the intra mode with the positive samples and the negative samples.
  5. The method according to claim 4, characterized in that each image in the first database has a resolution of 4928×3264;
    the first database comprises a training set, a validation set, and a test set; each of the training, validation, and test sets comprises four subsets;
    the images in the first of the four subsets have a resolution of 4928×3264, those in the second subset 2880×1920, those in the third subset 1536×1024, and those in the fourth subset 768×512.
  6. The method according to claim 3, characterized in that the steps of constructing the ETH-CNN, training the ETH-CNN, constructing the ETH-LSTM, and training the ETH-LSTM comprise:
    constructing a second database for predicting CU partition results in HEVC in the inter mode;
    preprocessing the resolutions of all videos in the second database so that the resolution of each video segment is within a preset range, and preprocessing the video lengths so that each video is within a preset length;
    encoding the preprocessed videos in the second database with the HEVC standard reference software to obtain positive samples and negative samples of the second database;
    training the ETH-CNN corresponding to the inter mode and the ETH-LSTM corresponding to the inter mode with the positive samples and the negative samples.
  7. The method according to claim 4, characterized in that:
    the input of the ETH-CNN is a 64×64 matrix representing the luminance information of the entire CTU, denoted by U;
    the structured output of the ETH-CNN contains three branches, representing the predictions of the three levels of the HCPM: $\hat{y}_1(U)$, $\{\hat{y}_2(U_i)\}_{i=1}^{4}$ and $\{\hat{y}_3(U_{i,j})\}_{i,j=1}^{4}$;
    the early-termination mechanism of the ETH-CNN can terminate the computation of the fully connected layers on the second and third branches in advance;
    and/or, the specific structure of the ETH-CNN contains two preprocessing layers, three convolutional layers, one concatenation layer, and three fully connected layers.
  8. The method according to claim 7, characterized in that:
    the preprocessing layers perform preprocessing operations on the matrix;
    starting from the preprocessing layers, the input information is processed and transformed in three parallel branches $\{B_l\}_{l=1}^{3}$;
    in each branch $B_l$, the convolutional layers perform three levels of convolution on all preprocessed data, denoted by $C_{l-1}$, $C_{l-2}$ and $C_{l-3}$;
    at the same level, the convolution kernels of all three branches have the same size;
    first, in the first convolutional layer, the preprocessed data are convolved with sixteen 4×4 kernels to obtain 16 different feature maps, extracting low-level features of the image information in preparation for deciding the CU partition; in the second and third convolutional layers, the above feature maps are convolved successively with 24 and 32 2×2 kernels to extract higher-level features, finally yielding 32 feature maps in each branch $B_l$;
    in all convolutional layers, the stride of the convolution equals the side length of the kernel;
    the concatenation layer merges all features of the second and third convolutional layers of the three branches $\{B_l\}_{l=1}^{3}$ into a vector; the features of the concatenation layer are combined from feature maps of six sources in total, namely $C_{1-2}$, $C_{1-3}$, $C_{2-2}$, $C_{2-3}$, $C_{3-2}$ and $C_{3-3}$;
    the fully connected layers divide the concatenated features again into three branches $\{B_l\}_{l=1}^{3}$ for processing, likewise corresponding to the three-level output of the HCPM;
    in each branch $B_l$, the feature vector passes successively through three fully connected layers: two hidden layers and one output layer; the outputs of the two hidden layers are $f_{1-l}$ and $f_{2-l}$;
    the output of the last layer is the final HCPM;
    the number of features in each fully connected layer depends on the branch it belongs to, ensuring that the three branches $B_1$, $B_2$ and $B_3$ output 1, 4 and 16 features respectively, corresponding to the predictions of the three levels of the HCPM, $\hat{y}_1(U)$, $\{\hat{y}_2(U_i)\}_{i=1}^{4}$ and $\{\hat{y}_3(U_{i,j})\}_{i,j=1}^{4}$;
    in the first and second fully connected layers of the ETH-CNN, the QP is added to the feature vector as an external feature, enabling the ETH-CNN to model the relationship between the QP and the CU partition.
  9. The method according to any one of claims 2 to 8, characterized in that the predicted CU partition result is represented by the structured output of a hierarchical CU partition map (HCPM);
    and/or, the HCPM contains 1×1, 2×2 and 4×4 binary labels at levels 1, 2 and 3 respectively, corresponding to the ground-truth labels $y_1(U)$, $\{y_2(U_i)\}_{i=1}^{4}$ and $\{y_3(U_{i,j})\}_{i,j=1}^{4}$, and to the predicted labels $\hat{y}_1(U)$, $\{\hat{y}_2(U_i)\}_{i=1}^{4}$ and $\{\hat{y}_3(U_{i,j})\}_{i,j=1}^{4}$;
    the CU partition result comprises the level-1 classification label;
    and/or,
    when U or $U_i$ is split, the CU partition result comprises the level-2 or level-3 binary labels;
    when U or $U_i$ is not split, the level-2 or level-3 classification labels comprised in the CU partition result take the null value;
    and/or, the objective function for training the ETH-CNN model is the cross-entropy;
    for each sample, the objective function $L_r$ is the sum of the cross-entropies of all binary labels:

    $$L_r = H_1\big(y_1(U), \hat{y}_1(U)\big) + \sum_{i=1}^{4} H_2\big(y_2(U_i), \hat{y}_2(U_i)\big) + \sum_{i,j=1}^{4} H_3\big(y_3(U_{i,j}), \hat{y}_3(U_{i,j})\big)$$

    where $H_l(\cdot,\cdot)$ ($l \in \{1,2,3\}$) denotes the cross-entropy between the prediction and the ground-truth label of a level-l binary classifier in the HCPM, r is the index of a sample within a training batch, $L_r$ is the objective function of the r-th sample, $y_1(U)$, $\{y_2(U_i)\}_{i=1}^{4}$ and $\{y_3(U_{i,j})\}_{i,j=1}^{4}$ denote the ground-truth values, and $\hat{y}_1(U)$, $\{\hat{y}_2(U_i)\}_{i=1}^{4}$ and $\{\hat{y}_3(U_{i,j})\}_{i,j=1}^{4}$ denote the predicted values.
  10. The method according to claim 6, characterized in that:
    the residual CTU obtained by fast pre-encoding is input into the ETH-CNN, and the inter-mode ETH-CNN is trained with the CU partition labels in the second database as ground truth;
    the three vectors output by the first fully connected layer of the ETH-CNN are input into the three levels of the ETH-LSTM respectively;
    and the inter-mode ETH-LSTM is trained with the CU partition labels in the second database as ground truth;
    the LSTM cells and fully connected layers of each level of the ETH-LSTM are trained separately with the CUs of that level, i.e., level 1 of the ETH-LSTM is trained with 64×64 CUs, level 2 with 32×32 CUs, and level 3 with 16×16 CUs;
    and/or,
    when training the parameters in the configuration information of the ETH-LSTM, the cross-entropy is used as the loss function;
    assuming a training batch contains R samples, the temporal length of the LSTM in each sample is T, i.e., T LSTM cells, and the loss of the t-th frame of the r-th sample is $L_r(t)$, the loss function L of the batch is defined as the average of all $L_r(t)$:

    $$L = \frac{1}{RT} \sum_{r=1}^{R} \sum_{t=1}^{T} L_r(t)$$

    training is then performed with stochastic gradient descent with momentum;
    finally, given the trained LSTM, the HCPM is obtained from the ETH-LSTM to predict the inter-mode CU partition result.
  11. An apparatus for optimizing block-partition coding complexity based on a deep learning method, characterized by comprising:
    a memory, a processor, a bus, and a computer program stored in the memory and running on the processor, wherein the processor implements the method of any one of claims 1-10 when executing the program.
  12. A computer storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method of any one of claims 1-10.
PCT/CN2019/079312 2018-03-22 2019-03-22 Method and apparatus for optimizing block-partition coding complexity based on a deep learning method WO2019179523A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810240912.4A CN108495129B (zh) 2018-03-22 2018-03-22 Method and apparatus for optimizing block-partition coding complexity based on a deep learning method
CN201810240912.4 2018-03-22

Publications (1)

Publication Number Publication Date
WO2019179523A1 true WO2019179523A1 (zh) 2019-09-26

Family

ID=63319290

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/079312 WO2019179523A1 (zh) 2018-03-22 2019-03-22 Method and apparatus for optimizing block-partition coding complexity based on a deep learning method

Country Status (2)

Country Link
CN (1) CN108495129B (zh)
WO (1) WO2019179523A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111654698A (zh) * 2020-06-12 2020-09-11 Zhengzhou University of Light Industry Fast CU partition decision method for H.266/VVC

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108495129B (zh) * 2018-03-22 2019-03-08 Beihang University Method and apparatus for optimizing block-partition coding complexity based on a deep learning method
EP3743855A1 (en) * 2018-09-18 2020-12-02 Google LLC Receptive-field-conforming convolution models for video coding
CN111163320A (zh) * 2018-11-07 2020-05-15 Hefei Tuya Information Technology Co., Ltd. Video compression method and system
CN110009640B (zh) * 2018-11-20 2023-09-26 Tencent Technology (Shenzhen) Co., Ltd. Method, device, and readable medium for processing cardiac video
CN109769119B (zh) * 2018-12-18 2021-01-19 Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences Low-complexity video signal coding processing method
CN109788296A (zh) * 2018-12-25 2019-05-21 Sun Yat-sen University Inter coding unit partition method, apparatus, and storage medium for HEVC
CN109714584A (zh) * 2019-01-11 2019-05-03 Hangzhou Dianzi University Fast decision method for 3D-HEVC depth-map coding units based on deep learning
CN109996084B (zh) * 2019-04-30 2022-11-01 Huaqiao University HEVC intra prediction method based on a multi-branch convolutional neural network
CN112087624A (zh) * 2019-06-13 2020-12-15 Shenzhen ZTE Microelectronics Technology Co., Ltd. Coding management method based on High Efficiency Video Coding
CN110675893B (zh) * 2019-09-19 2022-04-05 Tencent Music Entertainment Technology (Shenzhen) Co., Ltd. Song recognition method, apparatus, storage medium, and electronic device
CN110717898A (zh) * 2019-09-25 2020-01-21 Shanghai Zhongyi Cloud Computing Technology Co., Ltd. Automatic semiconductor manufacturing defect management method using AI and big data management
CN111263145B (zh) * 2020-01-17 2022-03-22 Fuzhou University Multifunctional fast video coding method based on deep neural networks
CN111405295A (zh) * 2020-02-24 2020-07-10 Hexin Interconnect Technology (Qingdao) Co., Ltd. Video coding unit partition method, system, and hardware implementation method
CN111385585B (zh) * 2020-03-18 2022-05-24 Beijing University of Technology Machine-learning-based 3D-HEVC depth-map coding unit partition method
CN111556316B (zh) * 2020-04-08 2022-06-03 Hangzhou Innovation Institute of Beihang University Fast block-partition coding method and apparatus based on deep neural network acceleration
JP2021175126A (ja) * 2020-04-28 2021-11-01 Canon Inc. Partition pattern determination device, partition pattern determination method, learning device, learning method, and program
CN111583364A (zh) * 2020-05-07 2020-08-25 Jiangsu Original Force Digital Technology Co., Ltd. Neural-network-based group animation generation method
CN111596366B (zh) * 2020-06-24 2021-07-30 Xiamen University Wave impedance inversion method based on optimized seismic signal processing
CN112084949B (zh) * 2020-09-10 2022-07-19 Shanghai Jiao Tong University Method and apparatus for real-time video recognition, segmentation, and detection
CN111931732B (zh) * 2020-09-24 2022-07-15 Suzhou Keda Technology Co., Ltd. Salient object detection method, system, device, and storage medium for compressed video
CN112465664B (zh) * 2020-11-12 2022-05-03 Guizhou Power Grid Co., Ltd. Intelligent AVC control method based on artificial neural networks and deep reinforcement learning
WO2023198057A1 (en) * 2022-04-12 2023-10-19 Beijing Bytedance Network Technology Co., Ltd. Method, apparatus, and medium for video processing
CN117319679A (zh) * 2023-07-20 2023-12-29 Nantong University Fast HEVC inter coding method based on long short-term memory networks

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104602000A (zh) * 2014-12-30 2015-05-06 Beijing QIYI Century Science & Technology Co., Ltd. Coding unit partition method and apparatus
CN104754357A (zh) * 2015-03-24 2015-07-01 Tsinghua University Intra coding optimization method and apparatus based on a convolutional neural network
US20150189270A1 (en) * 2013-10-08 2015-07-02 Kabushiki Kaisha Toshiba Image compression device, image compression method, image decompression device, and image decompression method
JP2016213615A (ja) * 2015-05-01 2016-12-15 Fujitsu Limited Video encoding device, video encoding method, and computer program for video encoding
CN108495129A (zh) * 2018-03-22 2018-09-04 Beihang University Method and apparatus for optimizing block-partition coding complexity based on a deep learning method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106162167B (zh) * 2015-03-26 2019-05-17 Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences Learning-based efficient video coding method
JP2017034531A (ja) * 2015-08-04 2017-02-09 Fujitsu Limited Video encoding device and video encoding method
CN105120295B (zh) * 2015-08-11 2018-05-18 Beihang University HEVC complexity control method based on quadtree coding partition

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150189270A1 (en) * 2013-10-08 2015-07-02 Kabushiki Kaisha Toshiba Image compression device, image compression method, image decompression device, and image decompression method
CN104602000A (zh) * 2014-12-30 2015-05-06 Beijing QIYI Century Science & Technology Co., Ltd. Coding unit partition method and apparatus
CN104754357A (zh) * 2015-03-24 2015-07-01 Tsinghua University Intra coding optimization method and apparatus based on a convolutional neural network
JP2016213615A (ja) * 2015-05-01 2016-12-15 Fujitsu Limited Video encoding device, video encoding method, and computer program for video encoding
CN108495129A (zh) * 2018-03-22 2018-09-04 Beihang University Method and apparatus for optimizing block-partition coding complexity based on a deep learning method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI, TIANYI ET AL.: "A Deep Convolutional Neural Network Approach For Complexity Reduction On Intra-mode HEVC", 2017 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 14 July 2017 (2017-07-14), pages 1256 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111654698A (zh) * 2020-06-12 2020-09-11 Zhengzhou University of Light Industry Fast CU partition decision method for H.266/VVC
CN111654698B (zh) * 2020-06-12 2022-03-22 Zhengzhou University of Light Industry Fast CU partition decision method for H.266/VVC

Also Published As

Publication number Publication date
CN108495129A (zh) 2018-09-04
CN108495129B (zh) 2019-03-08

Similar Documents

Publication Publication Date Title
WO2019179523A1 (zh) Method and apparatus for optimizing block-partition coding complexity based on a deep learning method
Xu et al. Reducing complexity of HEVC: A deep learning approach
Kim et al. Fast CU depth decision for HEVC using neural networks
TWI744827B (zh) 用以壓縮類神經網路參數之方法與裝置
TWI806199B (zh) 特徵圖資訊的指示方法,設備以及電腦程式
US20230336758A1 (en) Encoding with signaling of feature map data
CN114286093A (zh) Fast video coding method based on a deep neural network
US11062210B2 (en) Method and apparatus for training a neural network used for denoising
US20230353764A1 (en) Method and apparatus for decoding with signaling of feature map data
CN111800642B (zh) HEVC intra angular mode selection method, apparatus, device, and readable storage medium
CN115311605B (zh) Semi-supervised video classification method and system based on neighbor consistency and contrastive learning
CN114710667A (zh) Fast prediction method and apparatus for intra CU partition of H.266/VVC screen content
CN116508320A (zh) Chroma subsampling format handling method in machine-learning-based image decoding
TWI814540B (zh) Video encoding and decoding method and apparatus
WO2023122132A2 (en) Video and feature coding for multi-task machine learning
Bakkouri et al. Effective CU size decision algorithm based on depth map homogeneity for 3D-HEVC inter-coding
US20240185572A1 (en) Systems and methods for joint optimization training and encoder side downsampling
CN114449273B (zh) HEVC-based enhanced block partition search method and apparatus
WO2023081091A2 (en) Systems and methods for motion information transfer from visual to feature domain and feature-based decoder-side motion vector refinement control
WO2023024115A1 (zh) Encoding method, decoding method, encoder, decoder, and decoding system
WO2023122244A1 (en) Intelligent multi-stream video coding for video surveillance
WO2023122149A2 (en) Systems and methods for video coding of features using subpictures
CN116634173A (zh) Video feature extraction and slicing method, apparatus, electronic device, and storage medium
WO2023137003A1 (en) Systems and methods for privacy protection in video communication systems
WO2023069337A1 (en) Systems and methods for optimizing a loss function for video coding for machines

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19772007

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19772007

Country of ref document: EP

Kind code of ref document: A1


32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 23.03.2021)
