CN112468808A - I frame target bandwidth allocation method and device based on reinforcement learning - Google Patents


Info

Publication number
CN112468808A
Authority
CN
China
Prior art keywords
frame
target bandwidth
current
reinforcement learning
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011354798.1A
Other languages
Chinese (zh)
Other versions
CN112468808B (en)
Inventor
王妙辉 (Wang Miaohui)
黄丽蓉 (Huang Lirong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen University
Original Assignee
Shenzhen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen University
Priority to CN202011354798.1A
Publication of CN112468808A
Application granted
Publication of CN112468808B
Status: Active
Anticipated expiration

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/115Selection of the code volume for a coding unit prior to coding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/156Availability of hardware or computational resources, e.g. encoding based on power-saving criteria
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/70Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by syntax aspects related to video coding, e.g. related to compression standards

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention provides an I-frame target bandwidth allocation method and device based on reinforcement learning, comprising the following steps: S1, inputting the video sequence into the HM coding system; S2, after the HM coding system allocates the target bandwidth to the GOP, calling the reinforcement learning neural network to allocate a target bandwidth to the current I frame; S3, the HM coding system uses the allocated target bandwidth to code the current I frame data, continues coding the remaining frames in the GOP to obtain the completed GOP data, and inputs the completed GOP data into a buffer; and S4, judging whether the video sequence has been fully coded; if not, acquiring the next GOP data and returning to S2. The invention has the beneficial effects that, by continuously perceiving the environment state, the method can select the optimal target bandwidth for the current video sequence, helping to obtain better video quality and a smaller code rate error.

Description

I frame target bandwidth allocation method and device based on reinforcement learning
Technical Field
The present invention relates to the field of video processing technologies, and in particular, to a method and an apparatus for allocating I-frame target bandwidths based on reinforcement learning.
Background
The goal of a rate control algorithm is to provide a high-quality compressed sequence under a given bandwidth or storage budget, which is crucial for maintaining the quality of video applications, especially for systems with high real-time requirements. In video coding, balancing the code rate and the distortion of video frames is a key issue for rate control. In the prior art, a mathematical model is established from experimental data and research experience, and bandwidth allocation, quantization and parameter adjustment are performed accordingly.
The rate control algorithm of H.265/HEVC still employs the traditional two-step approach: target bandwidth allocation followed by quantization parameter determination. The key to picture-level target bandwidth allocation is accounting for the rate-distortion interdependence among video frames, and the allocated bandwidth weight is closely related to the target code rate, the video content characteristics and the temporal prediction structure.
In HEVC, target bandwidth allocation is divided into the GOP level, the picture level and the CTU level. At the GOP level there are I, P and B frame types: the I frame is the first frame of each GOP and is an independent frame carrying all information, while P frames and B frames must be predicted from other frames. When a video sequence contains drastic motion changes and fast scene changes, the inter-frame correlation between two I frames drops significantly, so more bandwidth must be spent on encoding. The existing picture-level target bandwidth allocation strategy assigns weights to pictures according to the target code rate, content characteristics and temporal prediction structure, with no targeted design for the above conditions, so effective handling cannot be guaranteed. A reinforcement learning-based method can optimize the target bandwidth allocation process end to end and further improve performance. We therefore adopt reinforcement learning in the hope of obtaining a more reasonable I-frame target bandwidth allocation strategy.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: aiming at the defects of the prior art, an I-frame target bandwidth allocation method and device based on reinforcement learning are provided, with the purpose of optimizing picture-level target bandwidth allocation in the rate control process, thereby reducing distortion and improving video quality.
In order to solve the technical problems, the invention adopts the technical scheme that: an I-frame target bandwidth allocation method based on reinforcement learning comprises the following steps:
S1, inputting the video sequence into the HM coding system;
S2, after the HM coding system allocates the target bandwidth to the GOP, calling the reinforcement learning neural network to allocate a target bandwidth to the current I frame;
S3, the HM coding system uses the allocated target bandwidth to code the current I frame data, continues coding the remaining frames in the GOP to obtain the completed GOP data, and inputs the completed GOP data into a buffer;
and S4, judging whether the video sequence has been fully coded; if not, acquiring the next GOP data and returning to S2.
Further, before step S2, the method further includes establishing a training model:
S21, selecting at least two videos differing in resolution, at least two videos differing in content and at least two videos differing in duration, performing bandwidth allocation and quantization parameter selection for the H.265/HEVC encoding flow according to the HM encoding system, and recording the encoding information of each video;
and S22, inputting the encoding information into a reinforcement learning neural network for reinforcement learning.
Further, in step S22, an A2C (Advantage Actor-Critic) neural network is used for reinforcement learning.
Further, after step S21, the method further includes obtaining supplementary encoding information:
S211, obtaining the texture features of the current I frame through a multi-scale Gaussian difference fusion formula:
[Formula shown as an image in the original: the multi-scale Gaussian difference fusion formula]
where (x, y) are the spatial coordinates and σ determines the degree of smoothing of the image, with σ1 = 0.54, σ2 = 0.87, σ3 = 1.19,
w is the weight of the Gaussian difference term, w = 0.284,
a and b are parameters of the Gaussian difference, with a = 0.75 and b = 0.66;
S212, generating a two-dimensional Gaussian distribution matrix according to σ1; the calculation formula is:
[Formula shown as an image in the original: the two-dimensional Gaussian distribution matrix]
where x and y are the dimensions of the Gaussian kernel, and w1, w2, w3 are three parameters related to human visual characteristics, with w1 = 0.536, w2 = 0.277, w3 = 0.187;
acquiring the edge features of the current I frame by calculating a pixel gradient matrix Gxy, where the calculation formula of the pixel gradient matrix is:
[Formula shown as an image in the original: the pixel gradient matrix Gxy built from Sobel responses]
where I is the grayscale image matrix, S is the Sobel operator and c = 2; the origin of the image matrix coordinate system is at the upper-left corner, with the positive x direction running from left to right and the positive y direction from top to bottom;
S213, obtaining the color features of the current I frame through a color feature extraction formula:
[Formula shown as an image in the original: the color feature extraction formula]
where h_{i,j} represents the probability that a pixel with gray value j occurs in the i-th color channel component, n represents the number of image gray levels, and d = 1.33;
S214, packing the texture features, edge features and color features of the current I frame into the supplementary coding information of the current I frame, and inputting the supplementary coding information into the reinforcement learning neural network for reinforcement learning.
Further, after step S2, the method further includes evaluating the I-frame target bandwidth allocated by the action network using a reward calculation formula, in combination with the distortion degree of the encoded current frame and the distortion-degree history of the encoded frames; the reward calculation formula for evaluating bandwidth allocation is:
[Formula shown as an image in the original: the reward calculation formula]
where i is the frame index, N represents the number of encoded frames, Qi denotes the PSNR value of an image, a = 2, Bi denotes the sliding window size, Ri represents the amount of encoding bandwidth consumed, and λ is the Lagrangian optimization factor.
The invention also relates to an I frame target bandwidth allocation device based on reinforcement learning, which comprises a transmission module, an allocation module, a calling module, a coding module and a judgment module,
the transmission module is used for inputting the video sequence into the HM coding system;
the distribution module is used for allocating a target bandwidth to the GOP;
the calling module is used for calling the reinforcement learning neural network to allocate a target bandwidth for the current I frame;
the encoding module is used for encoding the current I frame data with the allocated target bandwidth and continuing to encode the remaining frames in the GOP to obtain the completed GOP data;
the transmission module is also used for inputting the completed GOP data into a buffer area;
the judging module is used for judging whether the video sequence has been fully coded.
Further, the device comprises a learning module, wherein the learning module is used for selecting at least two videos differing in resolution, at least two videos differing in content and at least two videos differing in duration, performing bandwidth allocation and quantization parameter selection for the H.265/HEVC encoding flow according to the HM (HEVC test model) encoding system, recording the encoding information of each video, and inputting the encoding information into the reinforcement learning neural network for reinforcement learning.
Further, the learning module is also used for performing reinforcement learning by using an A2C neural network.
Further, the device includes an obtaining module, wherein the obtaining module is configured to obtain supplementary coding information, which includes the texture features, edge features and color features of the current I frame, specifically:
acquiring the texture features of the current I frame through a multi-scale Gaussian difference fusion formula:
[Formula shown as an image in the original: the multi-scale Gaussian difference fusion formula]
where (x, y) are the spatial coordinates and σ determines the degree of smoothing of the image (the value of σ governs which image contour and detail features are retained), with σ1 = 0.54, σ2 = 0.87, σ3 = 1.19,
w is the weight of the Gaussian difference term, w = 0.284,
a and b are parameters of the Gaussian difference, with a = 0.75 and b = 0.66;
generating a two-dimensional Gaussian distribution matrix according to σ1, where the calculation formula is:
[Formula shown as an image in the original: the two-dimensional Gaussian distribution matrix]
where x and y are the dimensions of the Gaussian kernel, and w1, w2, w3 are three parameters related to human visual characteristics, with w1 = 0.536, w2 = 0.277, w3 = 0.187;
acquiring the edge features of the current I frame by calculating a pixel gradient matrix Gxy, where the calculation formula of the pixel gradient matrix is:
[Formula shown as an image in the original: the pixel gradient matrix Gxy built from Sobel responses]
where I is the grayscale image matrix, S is the Sobel operator and c = 2; the origin of the image matrix coordinate system is at the upper-left corner, with the positive x direction running from left to right and the positive y direction from top to bottom;
obtaining the color features of the current I frame through a color feature extraction formula:
[Formula shown as an image in the original: the color feature extraction formula]
where h_{i,j} represents the probability that a pixel with gray value j occurs in the i-th color channel component, n represents the number of image gray levels, and d = 1.33;
the obtaining module packs the texture features, edge features and color features of the current I frame into the supplementary coding information of the current I frame.
Further, the learning module is further configured to evaluate the I-frame target bandwidth allocated by the action network using a reward calculation formula, in combination with the distortion degree of the encoded current frame and the distortion-degree history of the encoded frames; the reward calculation formula for evaluating bandwidth allocation is:
[Formula shown as an image in the original: the reward calculation formula]
where i is the frame index, N represents the number of encoded frames, Qi denotes the PSNR value of an image, a = 2, Bi denotes the sliding window size, Ri represents the amount of encoding bandwidth consumed, and λ is the Lagrangian optimization factor.
The invention has the beneficial effects that: by continuously perceiving the environment state, the method can select the optimal target bandwidth for the current video sequence, helping to obtain better video quality and a smaller code rate error.
Drawings
The specific process and structure of the present invention are detailed below with reference to the accompanying drawings:
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a schematic diagram of a reinforcement learning neural network structure according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the descriptions of "first", "second", etc. in the invention are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features concerned. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, technical solutions of the various embodiments may be combined with each other, provided such combinations can be realized by a person skilled in the art; when a combination is contradictory or cannot be realized, it should be considered non-existent and outside the protection scope of the present invention.
Referring to fig. 1, a method for allocating I-frame target bandwidth based on reinforcement learning includes:
s1, inputting the video sequence into the HM coding system;
s2, after the HM coding system allocates the target bandwidth to the GOP, calling the reinforcement learning neural network to allocate the target bandwidth to the current I frame;
In order to give the reinforcement learning neural network a preliminary target bandwidth allocation capability, a training model needs to be established for it:
S21, selecting at least two videos differing in resolution, at least two videos differing in content and at least two videos differing in duration, performing bandwidth allocation and quantization parameter selection for the H.265/HEVC encoding flow according to the HM encoding system, and recording the encoding information of each video.
In this embodiment, there are 5 selected video resolutions: 352 × 288, 720 × 480, 1280 × 720, 1920 × 1080 and 3840 × 2160.
There are 3 kinds of selected video content features: (1) simple background, small picture color changes, simple foreground textures and contours, and uniform, smooth motion; (2) complex background, rich picture colors, foreground containing the textures and contours of various objects, and slow motion with object rotation; (3) complex background, complicated picture colors, numerous texture and contour details, and violent motion or fast scene switching.
There are 3 kinds of selected video durations: within 10 seconds; 10-30 seconds; and 30-60 seconds.
According to these differences, at least 10 videos in each category are selected as training data and 2 as test data, giving 450 training samples and 90 test samples. The training data set is encoded using the same quantization parameter for every frame, with the QP taking integer values from 20 to 44, and the actual encoding information is recorded.
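For concreteness, the category arithmetic above can be written out in a few lines of Python; the labels and helper names below are illustrative placeholders, not part of the patent.

from itertools import product

# Category grid implied by the embodiment: 5 resolutions x 3 content types
# x 3 duration ranges = 45 categories; 10 training + 2 test videos each.
resolutions = ["352x288", "720x480", "1280x720", "1920x1080", "3840x2160"]
content_types = ["simple", "moderate", "complex"]   # paraphrased labels
durations = ["<=10s", "10-30s", "30-60s"]

categories = list(product(resolutions, content_types, durations))
assert len(categories) == 45
n_train = 10 * len(categories)   # 450 training videos
n_test = 2 * len(categories)     # 90 test videos
qp_values = range(20, 45)        # integer QPs 20..44, fixed per encoding pass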
In order to better embody the inter-frame relevance of the I frame, the content features of the current I frame can be extracted, wherein the content features comprise texture features, contour features and color features, and are used as supplementary coding information of the training set.
Acquiring the texture features of the current I frame through a multi-scale Gaussian difference fusion formula:
[Formula shown as an image in the original: the multi-scale Gaussian difference fusion formula]
where (x, y) are the spatial coordinates and σ determines the degree of smoothing of the image (the value of σ governs which image contour and detail features are retained), with σ1 = 0.54, σ2 = 0.87, σ3 = 1.19,
w is the weight of the Gaussian difference term, w = 0.284,
a and b are parameters of the Gaussian difference, with a = 0.75 and b = 0.66;
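As an illustrative sketch only: the fused formula itself is published as an image, so its exact form is not recoverable from this text. The snippet below combines standard difference-of-Gaussians bands using the constants quoted above; the linear fusion at the end is an assumption, not the patented formula.

import numpy as np
from scipy.ndimage import gaussian_filter

SIGMAS = (0.54, 0.87, 1.19)      # sigma1..sigma3 from the text
W, A, B = 0.284, 0.75, 0.66      # w, a, b from the text

def texture_feature(gray: np.ndarray) -> np.ndarray:
    """Hypothetical multi-scale difference-of-Gaussians fusion of a grayscale frame."""
    g1, g2, g3 = (gaussian_filter(gray.astype(np.float64), s) for s in SIGMAS)
    dog_fine = g1 - g2     # fine-scale DoG band
    dog_coarse = g2 - g3   # coarse-scale DoG band
    return W * (A * dog_fine + B * dog_coarse)   # assumed linear fusion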
S212, generating a two-dimensional Gaussian distribution matrix according to σ1; the calculation formula is:
[Formula shown as an image in the original: the two-dimensional Gaussian distribution matrix]
where x and y are the dimensions of the Gaussian kernel, and w1, w2, w3 are three parameters related to human visual characteristics, with w1 = 0.536, w2 = 0.277, w3 = 0.187;
The edge features of the current I frame are acquired by calculating a pixel gradient matrix Gxy, where the calculation formula of the pixel gradient matrix is:
[Formula shown as an image in the original: the pixel gradient matrix Gxy built from Sobel responses]
where I is the grayscale image matrix, S is the Sobel operator and c = 2; the origin of the image matrix coordinate system is at the upper-left corner, with the positive x direction running from left to right and the positive y direction from top to bottom;
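A similarly hedged sketch of the two preceding steps: a 2-D Gaussian kernel generated from σ1, and a Sobel-based gradient matrix Gxy. The published Gaussian matrix also involves w1-w3 and the gradient formula involves c = 2 in ways not recoverable from this text, so the kernel normalization and the weighted magnitude below are assumptions.

import numpy as np
from scipy.ndimage import convolve

def gaussian_kernel(size: int = 5, sigma: float = 0.54) -> np.ndarray:
    """2-D Gaussian distribution matrix from sigma1 (w1-w3 weighting omitted)."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)   # x right, y down, matching the text's convention
    k = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)

def edge_feature(gray: np.ndarray, c: float = 2.0) -> np.ndarray:
    """Hypothetical pixel gradient matrix Gxy from Sobel responses."""
    gx = convolve(gray.astype(np.float64), SOBEL_X)      # horizontal gradient
    gy = convolve(gray.astype(np.float64), SOBEL_X.T)    # vertical gradient
    return np.sqrt(gx ** 2 + c * gy ** 2)                # assumed role of c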
The color features of the current I frame are obtained through a color feature extraction formula:
[Formula shown as an image in the original: the color feature extraction formula]
where h_{i,j} represents the probability that a pixel with gray value j occurs in the i-th color channel component, n represents the number of image gray levels, and d = 1.33;
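A hedged sketch of the color feature: per-channel gray-level histograms give h_{i,j} as described, while the d-power aggregation is an assumption standing in for the formula shown only as an image.

import numpy as np

def color_feature(image: np.ndarray, n: int = 256, d: float = 1.33) -> np.ndarray:
    """Hypothetical per-channel color descriptor; image is H x W x C, uint8."""
    feats = []
    for i in range(image.shape[2]):
        hist, _ = np.histogram(image[..., i], bins=n, range=(0, n))
        h = hist / hist.sum()               # h_{i,j}: probability of gray value j in channel i
        feats.append(float(np.power(h, d).sum()))   # assumed d-power aggregation
    return np.asarray(feats)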
The texture features, edge features and color features of the current I frame are then packed into the supplementary coding information of the current I frame.
And S22, inputting the coding information and the supplementary coding information into the reinforcement learning neural network for reinforcement learning.
An A2C (Advantage Actor-Critic) neural network is used for reinforcement learning; the reinforcement learning neural network comprises an action network and an evaluation network, whose structure is shown in FIG. 2.
The action network takes as input the target bandwidth of the GOP containing the current I frame, the texture, contour and color features of the current I frame, and the texture, contour and color features of the previous I frame.
The action network outputs the target bandwidth of the current I frame.
In the action network, the reinforcement learning neural network intelligently combines information such as the historical coding information, the degree of correlation between the features of the current I frame and the previous I frame, the target bandwidth of the current GOP and the frame-level target bandwidth to decide the target bandwidth of the current I frame.
In order to evaluate the target bandwidth of the current I frame output by the action network, the evaluation network of the reinforcement learning neural network takes this target bandwidth as input and outputs an evaluation value for the action network.
In the evaluation network, the reinforcement learning neural network intelligently combines the distortion degree of the encoded current frame with the distortion-degree history of the encoded frames, and evaluates the I-frame target bandwidth allocated by the action network using a reward calculation formula. Meanwhile, the evaluation network back-propagates the computed gradients and updates the network parameters.
The reward calculation formula for evaluating bandwidth allocation is as follows:
[Formula shown as an image in the original: the reward calculation formula]
where i is the frame index, N represents the number of encoded frames, Qi denotes the PSNR value of an image, a = 2, Bi denotes the sliding window size, Ri represents the amount of encoding bandwidth consumed, and λ is the Lagrangian optimization factor.
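Since the reward formula is published only as an image, the sketch below merely combines the named quantities (per-frame PSNR Qi, consumed bandwidth Ri, a sliding window standing in for Bi, and the Lagrangian factor λ) into one plausible quality-versus-rate-error trade-off; the exact functional form is an assumption.

def reward(psnr, bits, target_bits, lam=1.0, window=4):
    """Plausible reward over a sliding window of encoded frames.

    psnr, bits: per-frame PSNR (Qi) and consumed bandwidth (Ri) histories;
    window stands in for Bi; lam is the Lagrangian factor lambda.
    """
    q = psnr[-window:]
    r = bits[-window:]
    quality = sum(q) / len(q)                                # average recent quality
    rate_error = abs(sum(r) - target_bits) / max(target_bits, 1)
    return quality - lam * rate_error                        # assumed trade-off shape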
Feature sampling is performed on the video I frames in the data set according to the feature extraction method above, and the features together with the required information are input into the action network. When the evaluation network performs its evaluation, the encoding information in the data set is used as part of the historical information for rate-distortion performance evaluation, so that continuous reinforcement of the reinforcement learning neural network can be realized.
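A minimal PyTorch sketch of such an action/evaluation (actor/critic) pair and one A2C-style update; the layer sizes, state dimension, Gaussian exploration policy and optimizer settings are all illustrative assumptions, since the patent discloses the structure only at the level of FIG. 2.

import torch
import torch.nn as nn

class ActionNet(nn.Module):              # actor: state -> I-frame bandwidth ratio
    def __init__(self, state_dim: int):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                  nn.Linear(64, 1), nn.Sigmoid())
    def forward(self, s):
        return self.body(s)              # mean share of the GOP budget for the I frame

class EvalNet(nn.Module):                # critic: state -> scalar value estimate
    def __init__(self, state_dim: int):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                  nn.Linear(64, 1))
    def forward(self, s):
        return self.body(s)

state_dim = 8                            # e.g. GOP budget + feature statistics (assumed)
actor, critic = ActionNet(state_dim), EvalNet(state_dim)
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-4)

# One A2C-style update with dummy tensors standing in for real encoder states.
s, s_next = torch.randn(1, state_dim), torch.randn(1, state_dim)
r, gamma = torch.tensor([[30.0]]), 0.99              # reward, e.g. from the sketch above

dist = torch.distributions.Normal(actor(s), 0.1)     # fixed exploration std (assumption)
action = dist.sample()                               # sampled bandwidth ratio
advantage = r + gamma * critic(s_next).detach() - critic(s)

actor_loss = -(dist.log_prob(action) * advantage.detach()).mean()
critic_loss = advantage.pow(2).mean()                # critic regresses the TD target
opt.zero_grad()
(actor_loss + critic_loss).backward()                # gradients back-propagate, as described
opt.step()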
S3, using the I-frame target bandwidth output by the reinforcement learning neural network in the subsequent target bandwidth allocation and quantization parameter decisions, encoding the current I frame data, continuing to encode the remaining frames in the GOP to obtain the completed GOP data, and inputting the completed GOP data into a buffer;
and S4, judging whether the video sequence has been fully coded; if not, acquiring the next GOP data and returning to S2, looping in this way until the entire video sequence has been encoded.
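Steps S1-S4 amount to the following control loop, written against a hypothetical wrapper around the HM encoder; `hm`, `agent` and all their methods are placeholders for illustration rather than real HM or patent APIs.

def encode_sequence(hm, agent, video):
    """Control loop for steps S1-S4; `hm` and `agent` are hypothetical wrappers."""
    hm.load(video)                                   # S1: feed the sequence to HM
    while not hm.sequence_done():                    # S4: stop once fully encoded
        gop = hm.next_gop()
        gop_budget = hm.allocate_gop_bandwidth(gop)          # S2: GOP-level budget
        i_budget = agent.allocate_i_frame(gop, gop_budget)   # S2: RL picks the I-frame share
        hm.encode_i_frame(gop, i_budget)                     # S3: encode the I frame
        hm.encode_remaining_frames(gop)                      # S3: finish the GOP
        hm.push_to_buffer(gop)                               # S3: buffer the completed GOP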
From the above description, the beneficial effects of the present invention are: by extracting the texture, contour and color features of the I frame, correlating the content features of the video foreground and analyzing the image complexity of the I frame, the method assists more accurate bandwidth allocation; and by continuously perceiving the environment state, it selects the optimal target bandwidth for the current video sequence, helping to obtain better video quality and smaller code rate errors.
The invention also relates to an I frame target bandwidth allocation device based on reinforcement learning, which comprises a transmission module, an allocation module, a calling module, a coding module and a judgment module,
the transmission module is used for inputting the video sequence into the HM coding system;
the distribution module is used for allocating a target bandwidth to the GOP;
the calling module is used for calling the reinforcement learning neural network to allocate a target bandwidth for the current I frame;
the encoding module is used for encoding the current I frame data with the allocated target bandwidth and continuing to encode the remaining frames in the GOP to obtain the completed GOP data;
the transmission module is also used for inputting the completed GOP data into a buffer area;
the judging module is used for judging whether the video sequence has been fully coded.
In order to give the reinforcement learning neural network an initial target bandwidth allocation capability, the device further comprises a learning module, wherein the learning module is used for selecting at least two videos differing in resolution, at least two videos differing in content and at least two videos differing in duration, performing bandwidth allocation and quantization parameter selection for the H.265/HEVC encoding flow according to the HM encoding system, recording the encoding information of each video, and inputting the encoding information into the reinforcement learning neural network for reinforcement learning.
In this embodiment, there are 5 selected video resolutions: 352 × 288, 720 × 480, 1280 × 720, 1920 × 1080 and 3840 × 2160.
There are 3 kinds of selected video content features: (1) simple background, small picture color changes, simple foreground textures and contours, and uniform, smooth motion; (2) complex background, rich picture colors, foreground containing the textures and contours of various objects, and slow motion with object rotation; (3) complex background, complicated picture colors, numerous texture and contour details, and violent motion or fast scene switching.
There are 3 kinds of selected video durations: within 10 seconds; 10-30 seconds; and 30-60 seconds.
According to these differences, at least 10 videos in each category are selected as training data and 2 as test data, giving 450 training samples and 90 test samples. The training data set is encoded using the same quantization parameter for every frame, with the QP taking integer values from 20 to 44, and the actual encoding information is recorded.
In order to better embody the inter-frame relevance of the I frame, the device further includes an obtaining module, which can extract the content features of the current I frame, including texture features, contour features and color features, to serve as supplementary encoding information for the training set, specifically:
acquiring the texture features of the current I frame through a multi-scale Gaussian difference fusion formula:
[Formula shown as an image in the original: the multi-scale Gaussian difference fusion formula]
where (x, y) are the spatial coordinates and σ determines the degree of smoothing of the image (the value of σ governs which image contour and detail features are retained), with σ1 = 0.54, σ2 = 0.87, σ3 = 1.19,
w is the weight of the Gaussian difference term, w = 0.284,
a and b are parameters of the Gaussian difference, with a = 0.75 and b = 0.66;
generating a two-dimensional Gaussian distribution matrix according to σ1, where the calculation formula is:
[Formula shown as an image in the original: the two-dimensional Gaussian distribution matrix]
where x and y are the dimensions of the Gaussian kernel, and w1, w2, w3 are three parameters related to human visual characteristics, with w1 = 0.536, w2 = 0.277, w3 = 0.187;
acquiring the edge features of the current I frame by calculating a pixel gradient matrix Gxy, where the calculation formula of the pixel gradient matrix is:
[Formula shown as an image in the original: the pixel gradient matrix Gxy built from Sobel responses]
where I is the grayscale image matrix, S is the Sobel operator and c = 2; the origin of the image matrix coordinate system is at the upper-left corner, with the positive x direction running from left to right and the positive y direction from top to bottom;
obtaining the color features of the current I frame through a color feature extraction formula:
[Formula shown as an image in the original: the color feature extraction formula]
where h_{i,j} represents the probability that a pixel with gray value j occurs in the i-th color channel component, n represents the number of image gray levels, and d = 1.33;
Finally, the obtaining module packs the texture features, edge features and color features of the current I frame into the supplementary coding information of the current I frame.
In order to ensure the learning effect of the reinforcement learning neural network, the learning module adopts the A2C neural network to carry out reinforcement learning.
The reinforcement learning neural network comprises an action network and an evaluation network.
The action network takes as input the target bandwidth of the GOP containing the current I frame, the texture, contour and color features of the current I frame, and the texture, contour and color features of the previous I frame.
The action network outputs the target bandwidth of the current I frame.
In the action network, the reinforcement learning neural network intelligently combines information such as the historical coding information, the degree of correlation between the features of the current I frame and the previous I frame, the target bandwidth of the current GOP and the frame-level target bandwidth to decide the target bandwidth of the current I frame.
In order to evaluate the target bandwidth of the current I frame output by the action network, the learning module is further configured to evaluate the I-frame target bandwidth allocated by the action network using a reward calculation formula, in combination with the distortion degree of the encoded current frame and the distortion-degree history of the encoded frames; the evaluation network back-propagates the computed gradients and updates the network parameters.
The reward calculation formula for evaluating bandwidth allocation is as follows:
[Formula shown as an image in the original: the reward calculation formula]
where i is the frame index, N represents the number of encoded frames, Qi denotes the PSNR value of an image, a = 2, Bi denotes the sliding window size, Ri represents the amount of encoding bandwidth consumed, and λ is the Lagrangian optimization factor.
Feature sampling is performed on the video I frames in the data set according to the feature extraction method above, and the features together with the required information are input into the action network. When the evaluation network performs its evaluation, the encoding information in the data set is used as part of the historical information for rate-distortion performance evaluation, so that continuous reinforcement of the reinforcement learning neural network can be realized.
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes performed by the present specification and drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. An I-frame target bandwidth allocation method based on reinforcement learning comprises the following steps:
S1, inputting the video sequence into the HM coding system;
S2, after the HM coding system allocates the target bandwidth to the GOP, calling the reinforcement learning neural network to allocate a target bandwidth to the current I frame;
S3, the HM coding system uses the allocated target bandwidth to code the current I frame data, continues coding the remaining frames in the GOP to obtain the completed GOP data, and inputs the completed GOP data into a buffer;
and S4, judging whether the video sequence has been fully coded; if not, acquiring the next GOP data and returning to S2.
2. The reinforcement learning-based I-frame target bandwidth allocation method of claim 1, wherein: before step S2, the method further includes establishing a training model:
S21, selecting at least two videos differing in resolution, at least two videos differing in content and at least two videos differing in duration, performing bandwidth allocation and quantization parameter selection for the H.265/HEVC encoding flow according to the HM encoding system, and recording the encoding information of each video;
and S22, inputting the encoding information into a reinforcement learning neural network for reinforcement learning.
3. The reinforcement learning-based I-frame target bandwidth allocation method of claim 2, wherein: in step S22, reinforcement learning is performed using the A2C neural network.
4. The reinforcement learning-based I-frame target bandwidth allocation method according to claim 3, wherein: in step S21, the method further includes obtaining the supplemental encoding information:
S211, obtaining the texture features of the current I frame through a multi-scale Gaussian difference fusion formula:
[Formula shown as an image in the original: the multi-scale Gaussian difference fusion formula]
where (x, y) are the spatial coordinates and σ determines the degree of smoothing of the image, with σ1 = 0.54, σ2 = 0.87, σ3 = 1.19,
w is the weight of the Gaussian difference term, w = 0.284,
a and b are parameters of the Gaussian difference, with a = 0.75 and b = 0.66;
S212, generating a two-dimensional Gaussian distribution matrix according to σ1; the calculation formula is:
[Formula shown as an image in the original: the two-dimensional Gaussian distribution matrix]
where x and y are the dimensions of the Gaussian kernel, and w1, w2, w3 are three parameters related to human visual characteristics, with w1 = 0.536, w2 = 0.277, w3 = 0.187;
acquiring the edge features of the current I frame by calculating a pixel gradient matrix Gxy, wherein the calculation formula of the pixel gradient matrix is:
[Formula shown as an image in the original: the pixel gradient matrix Gxy built from Sobel responses]
where I is the grayscale image matrix, S is the Sobel operator and c = 2; the origin of the image matrix coordinate system is at the upper-left corner, with the positive x direction running from left to right and the positive y direction from top to bottom;
S213, obtaining the color features of the current I frame through a color feature extraction formula:
[Formula shown as an image in the original: the color feature extraction formula]
where h_{i,j} represents the probability that a pixel with gray value j occurs in the i-th color channel component, n represents the number of image gray levels, and d = 1.33;
S214, packing the texture features, edge features and color features of the current I frame into the supplementary coding information of the current I frame, and inputting the supplementary coding information into the reinforcement learning neural network for reinforcement learning.
5. The reinforcement learning-based I-frame target bandwidth allocation method according to claim 4, wherein: after step S2, the method further includes evaluating the I-frame target bandwidth allocated by the action network using a reward calculation formula, in combination with the distortion degree of the encoded current frame and the distortion-degree history of the encoded frames; the reward calculation formula for evaluating bandwidth allocation is:
[Formula shown as an image in the original: the reward calculation formula]
where i is the frame index, N represents the number of encoded frames, Qi denotes the PSNR value of an image, a = 2, Bi denotes the sliding window size, Ri represents the amount of encoding bandwidth consumed, and λ is the Lagrangian optimization factor.
6. An I-frame target bandwidth allocation apparatus based on reinforcement learning, characterized in that: comprises a transmission module, a distribution module, a calling module, a coding module and a judgment module,
the transmission module is used for inputting the video sequence into the HM coding system;
the distribution module is used for allocating a target bandwidth to the GOP;
the calling module is used for calling the reinforcement learning neural network to allocate a target bandwidth for the current I frame;
the encoding module is used for encoding the current I frame data with the allocated target bandwidth and continuing to encode the remaining frames in the GOP to obtain the completed GOP data;
the transmission module is also used for inputting the completed GOP data into a buffer area;
the judging module is used for judging whether the video sequence has been fully coded.
7. The reinforcement learning-based I-frame target bandwidth allocation apparatus of claim 6, wherein: the apparatus further comprises a learning module, wherein the learning module is used for selecting at least two videos differing in resolution, at least two videos differing in content and at least two videos differing in duration, performing bandwidth allocation and quantization parameter selection for the H.265/HEVC encoding flow according to the HM encoding system, recording the encoding information of each video, and inputting the encoding information into the reinforcement learning neural network for reinforcement learning.
8. The reinforcement learning-based I-frame target bandwidth allocation apparatus of claim 7, wherein: the learning module is also used for performing reinforcement learning by adopting an A2C neural network.
9. The reinforcement learning-based I-frame target bandwidth allocation apparatus of claim 8, wherein: the apparatus further comprises an obtaining module, wherein the obtaining module is configured to obtain supplementary encoding information, which includes the texture features, edge features and color features of the current I frame, specifically:
acquiring the texture features of the current I frame through a multi-scale Gaussian difference fusion formula:
[Formula shown as an image in the original: the multi-scale Gaussian difference fusion formula]
where (x, y) are the spatial coordinates and σ determines the degree of smoothing of the image, with σ1 = 0.54, σ2 = 0.87, σ3 = 1.19,
w is the weight of the Gaussian difference term, w = 0.284,
a and b are parameters of the Gaussian difference, with a = 0.75 and b = 0.66;
generating a two-dimensional Gaussian distribution matrix according to σ1; the calculation formula is:
[Formula shown as an image in the original: the two-dimensional Gaussian distribution matrix]
where x and y are the dimensions of the Gaussian kernel, and w1, w2, w3 are three parameters related to human visual characteristics, with w1 = 0.536, w2 = 0.277, w3 = 0.187;
acquiring the edge features of the current I frame by calculating a pixel gradient matrix Gxy, wherein the calculation formula of the pixel gradient matrix is:
[Formula shown as an image in the original: the pixel gradient matrix Gxy built from Sobel responses]
where I is the grayscale image matrix, S is the Sobel operator and c = 2; the origin of the image matrix coordinate system is at the upper-left corner, with the positive x direction running from left to right and the positive y direction from top to bottom;
obtaining the color features of the current I frame through a color feature extraction formula:
[Formula shown as an image in the original: the color feature extraction formula]
where h_{i,j} represents the probability that a pixel with gray value j occurs in the i-th color channel component, n represents the number of image gray levels, and d = 1.33;
the obtaining module packs the texture features, edge features and color features of the current I frame into the supplementary encoding information of the current I frame.
10. The reinforcement learning-based I-frame target bandwidth allocation apparatus of claim 9, wherein: the learning module is further configured to evaluate the I-frame target bandwidth allocated by the action network using a reward calculation formula, in combination with the distortion degree of the encoded current frame and the distortion-degree history of the encoded frames; the reward calculation formula for evaluating bandwidth allocation is:
[Formula shown as an image in the original: the reward calculation formula]
where i is the frame index, N represents the number of encoded frames, Qi denotes the PSNR value of an image, a = 2, Bi denotes the sliding window size, Ri represents the amount of encoding bandwidth consumed, and λ is the Lagrangian optimization factor.
CN202011354798.1A 2020-11-26 2020-11-26 I frame target bandwidth allocation method and device based on reinforcement learning Active CN112468808B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011354798.1A CN112468808B (en) 2020-11-26 2020-11-26 I frame target bandwidth allocation method and device based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011354798.1A CN112468808B (en) 2020-11-26 2020-11-26 I frame target bandwidth allocation method and device based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN112468808A (en) 2021-03-09
CN112468808B CN112468808B (en) 2022-08-12

Family

ID=74809592

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011354798.1A Active CN112468808B (en) 2020-11-26 2020-11-26 I frame target bandwidth allocation method and device based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN112468808B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109743778A (en) * 2019-01-14 2019-05-10 长沙学院 A kind of resource allocation optimization method and system based on intensified learning
US20200344472A1 (en) * 2019-04-23 2020-10-29 National Chiao Tung University Reinforcement learning method for video encoder
CN111031387A (en) * 2019-11-21 2020-04-17 南京大学 Method for controlling video coding flow rate of monitoring video sending end
CN111294595A (en) * 2020-02-04 2020-06-16 清华大学深圳国际研究生院 Video coding intra-frame code rate control method based on deep reinforcement learning
CN111405327A (en) * 2020-04-03 2020-07-10 广州市百果园信息技术有限公司 Network bandwidth prediction model training method, video data playing method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MINGLIANG ZHOU et al., "Rate Control Method Based on Deep Reinforcement Learning for Dynamic Video Sequences in HEVC," IEEE Transactions on Multimedia *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116208788A (en) * 2023-05-04 2023-06-02 海马云(天津)信息技术有限公司 Method and device for providing network application service, server equipment and storage medium
CN116208788B (en) * 2023-05-04 2023-07-21 海马云(天津)信息技术有限公司 Method and device for providing network application service, server equipment and storage medium
CN117196999A (en) * 2023-11-06 2023-12-08 浙江芯劢微电子股份有限公司 Self-adaptive video stream image edge enhancement method and system
CN117196999B (en) * 2023-11-06 2024-03-12 浙江芯劢微电子股份有限公司 Self-adaptive video stream image edge enhancement method and system

Also Published As

Publication number Publication date
CN112468808B (en) 2022-08-12

Similar Documents

Publication Publication Date Title
CN111432207B (en) Perceptual high-definition video coding method based on salient target detection and salient guidance
Tang Spatiotemporal visual considerations for video coding
CN110087087B (en) VVC inter-frame coding unit prediction mode early decision and block division early termination method
US20200329233A1 (en) Hyperdata Compression: Accelerating Encoding for Improved Communication, Distribution & Delivery of Personalized Content
CN112399176B (en) Video coding method and device, computer equipment and storage medium
CN108495135B (en) Quick coding method for screen content video coding
CN103188493B (en) Image encoding apparatus and image encoding method
CN110751649B (en) Video quality evaluation method and device, electronic equipment and storage medium
CN112468808B (en) I frame target bandwidth allocation method and device based on reinforcement learning
CN108063944B (en) Perception code rate control method based on visual saliency
CN107371022B (en) Inter-frame coding unit rapid dividing method applied to HEVC medical image lossless coding
CN101710993A (en) Block-based self-adaptive super-resolution video processing method and system
CN104539962A (en) Layered video coding method fused with visual perception features
CN111083477B (en) HEVC (high efficiency video coding) optimization algorithm based on visual saliency
CN108347612A (en) A kind of monitored video compression and reconstructing method of view-based access control model attention mechanism
Liu et al. End-to-end neural video coding using a compound spatiotemporal representation
CN112291562A (en) Fast CU partition and intra mode decision method for H.266/VVC
CN111447452B (en) Data coding method and system
CN108513132B (en) Video quality evaluation method and device
CN115941943A (en) HEVC video coding method
Wang et al. Perceptually quasi-lossless compression of screen content data via visibility modeling and deep forecasting
Wang et al. Semantic-aware video compression for automotive cameras
CN106686383A (en) Depth map intra-frame coding method capable of preserving edge of depth map
EP1802127B1 (en) Method for performing motion estimation
CN111723735B (en) Pseudo high bit rate HEVC video detection method based on convolutional neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant