CN114745549A - Video coding method and system based on region of interest - Google Patents

Video coding method and system based on region of interest

Info

Publication number
CN114745549A
CN114745549A (application CN202210350595.8A)
Authority
CN
China
Prior art keywords
frame
prediction
original image
region
satd
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210350595.8A
Other languages
Chinese (zh)
Other versions
CN114745549B (en)
Inventor
毕江
王立冬
金强
肖春艳
韩强
樊思津
张文东
周骋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Radio And Television Station
Sumavision Technologies Co Ltd
Original Assignee
Beijing Radio And Television Station
Sumavision Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Radio And Television Station, Sumavision Technologies Co Ltd filed Critical Beijing Radio And Television Station
Priority to CN202210350595.8A priority Critical patent/CN114745549B/en
Publication of CN114745549A publication Critical patent/CN114745549A/en
Application granted granted Critical
Publication of CN114745549B publication Critical patent/CN114745549B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/167Position within a video image, e.g. region of interest [ROI]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/146Data rate or code amount at the encoder output
    • H04N19/147Data rate or code amount at the encoder output according to rate distortion criteria
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H04N19/176Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a block, e.g. a macroblock
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51Motion estimation or motion compensation
    • H04N19/513Processing of motion vectors
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/593Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving spatial prediction techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention relates to a region-of-interest based video coding method and system. A down-sampling module down-samples the original image to obtain a low-resolution image. A preliminary-selection prediction module divides the low-resolution image into a number of macroblocks and performs intra-frame or inter-frame prediction on the macroblocks inside the region of interest to obtain the best preliminary prediction mode of the current macroblock. An encoding module sets coding units in the original image and encodes each coding unit; during encoding it obtains, from the result of the best preliminary prediction mode of the current macroblock, the optimal prediction angle Dir_best or the best predicted motion vector required for actual coding, together with the corresponding rate-distortion optimization (RDO) value.

Description

Video coding method and system based on region of interest
Technical Field
The invention relates to video coding technology, and in particular to bit-rate control when video is encoded according to a region of interest.
Background
The purpose of video coding is to remove redundant information from the video and reduce the amount of data; compression coding currently generally adopts a hybrid framework of prediction, transformation, and quantization.
Prediction uses the information of known pixels to predict the current pixel and falls into two categories: intra-frame prediction and inter-frame prediction. Intra-frame prediction exploits the spatial correlation between pixels within the same frame; for example, the pixel values (predicted values) of the current block are projected from the reconstructed pixels of blocks neighbouring the current coding unit. Inter-frame prediction exploits the temporal correlation between different frames; for example, the motion trajectory between the current coding block and the corresponding block in a reference picture is tracked, the current coding unit is predicted from temporally adjacent reference blocks, and motion-estimation precision is improved by means such as interpolation.
The predicted values are subtracted point by point from the pixel values of the original video image, the residual is transform-coded, and a cosine transform further concentrates the energy in the low-frequency region.
Quantization is the only step that introduces quality loss, so the encoder must balance the choice of the quantization parameter (QP): the larger the QP, the more high-frequency signal is lost and the image becomes blurred and loses texture detail; the smaller the QP, the larger the residual coefficients that are retained, which may exceed the nominal bandwidth of the target bitrate.
Ultra-high definition, high dynamic range, wide colour gamut, and high-frame-rate playback place higher demands on coding performance and quality. Because the amount of information presented by the terminal is larger than before, viewers tend to be more sensitive to flat, low-frequency areas or eye-catching regions of a scene, such as facial expressions, rolling caption text, and station logos in dramas and galas; regions that change drastically in the spatio-temporal domain, such as fast-moving objects or decorations with complex textures, are often ignored by the human eye. Accurately understanding image content, and coding accordingly, is therefore a key link for improving quality and balancing bitrate allocation during encoding.
Traditional video coding must traverse many combinations of coding tools, such as coding-unit partition strategies of different sizes and different prediction methods, measure the coding loss by comparing the rate-distortion cost of each combination, and then choose the best coding mode. This is the most time-consuming link in encoding, yet it still cannot guarantee the best coding quality. For example, describing the importance of a local region with conventional complexity indices such as SATD (the sum of absolute values of the Hadamard-transformed residual) or SAD (the sum of absolute differences) tends to allocate more bits to regions the human eye is not sensitive to, and consumes more computing resources, resulting in poor real-time performance.
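As an illustration of these two block metrics (not of the patented method itself), the sketch below computes SAD and a 4x4-Hadamard SATD for a residual block; the DC-style predictor is only an example input.

```python
# Minimal sketch of the SAD and SATD block metrics mentioned above,
# computed on 4x4 blocks with an (unnormalized) Hadamard transform.
import numpy as np

H4 = np.array([[1,  1,  1,  1],
               [1, -1,  1, -1],
               [1,  1, -1, -1],
               [1, -1, -1,  1]])

def sad(orig: np.ndarray, pred: np.ndarray) -> int:
    """Sum of absolute differences between an original and a predicted block."""
    return int(np.abs(orig.astype(int) - pred.astype(int)).sum())

def satd_4x4(orig: np.ndarray, pred: np.ndarray) -> int:
    """Sum of absolute values of the 4x4 Hadamard transform of the residual."""
    residual = orig.astype(int) - pred.astype(int)
    transformed = H4 @ residual @ H4      # transform rows and columns
    return int(np.abs(transformed).sum())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    block = rng.integers(0, 256, (4, 4))
    flat_pred = np.full((4, 4), int(block.mean()))   # DC-style prediction, for illustration
    print("SAD :", sad(block, flat_pred))
    print("SATD:", satd_4x4(block, flat_pred))
```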
In view of these problems in the prior art, an object of the present invention is to provide a region-of-interest based video coding decision method and system that reduce the computational resources required for encoding and improve real-time performance while providing users with a better viewing experience.
Disclosure of Invention
In order to solve the above problem, a first technical solution is a region-of-interest based video coding method, including,
an information reading step of sequentially acquiring the original image data of each frame of a video and the pixel position information of the regions of interest in the original image, wherein the original image contains at least one region of interest;
a down-sampling step, namely down-sampling the original image to obtain a low-resolution image;
a preliminary-selection prediction step of dividing the low-resolution image into a number of macroblocks, performing intra-frame prediction on the macroblocks located in the region of interest, traversing the prediction angles supported by the coding standard, calculating the distortion (SATD) between the projection-reconstructed intra-prediction pixels and the pixels of the low-resolution image, and obtaining the minimum distortion SATD_best and the corresponding prediction angle Dir_best;
for the original image of an I frame, taking the prediction angle Dir_best corresponding to the minimum distortion SATD_best as the result of the best preliminary prediction mode of the current macroblock;
for the original image of a P frame or B frame, searching the coordinates of the area corresponding to the region of interest in the adjacent frame, calculating the motion vector corresponding to the change of the centre-of-gravity position of the region of interest between the current frame and the reference frame, searching with this motion vector as the starting vector, calculating the SATD at different offsets in turn, and determining the best motion-vector predictor MV_best and the minimum inter-prediction distortion SATD_inter; and comparing the minimum distortion SATD_best with the minimum inter-prediction distortion SATD_inter and selecting the prediction result with the smaller distortion as the result of the best preliminary prediction mode of the current macroblock;
an encoding step of setting coding units in the original image and encoding each coding unit, wherein during encoding, for the original image of an I frame, a set of prediction reference angles is constructed according to the prediction angle Dir_best, the angles in the set are traversed, and the rate-distortion optimization (RDO) values corresponding to the angles are compared to obtain the optimal prediction angle Dir_best required for actual coding;
for the original image of a P frame or B frame, according to the selection result of the preliminary-selection prediction step: if intra-frame prediction was selected, the optimal prediction angle Dir_best required for actual coding is obtained in the same way as for the I frame; if inter-frame prediction was selected, the motion vector obtained in the preliminary-selection prediction step is stretched according to the scaling ratio, and the RDO values of the different stretched motion vectors within the same search range are compared to obtain the best predicted motion vector required for actual coding and the corresponding RDO value.
Therefore, during encoding the invention can perform prediction only for the macroblocks of the region of interest according to the region's pixel position information, and, when encoding the coding units, calculate from the obtained result of the best preliminary prediction mode the optimal prediction angle Dir_best and corresponding RDO value, or the best predicted motion vector and corresponding RDO value, required for actual coding, which improves the real-time performance of encoding. In addition, different coding strategies do not need to be selected for regions of interest with different characteristics, so the method is particularly suitable for video coding when regions of interest with different characteristics are mixed in the same frame, providing users with a better viewing experience while improving overall efficiency.
Preferably, in the encoding step, a reference quantization parameter QP_base is allocated to the original image, the sum of the distortion (SATD) values of the different regions of interest in the original image is counted, a local target bitrate is allocated to the regions of interest according to the ratio of their area to that of the original image, the sum of the SATD values is used as the input of the rate-control algorithm, and a quantization parameter QP is allocated to each region of interest according to the local target bitrate,
[QP allocation formula omitted: presented as an image in the original publication]
where clip3(x, min, max) limits x to the range (min, max).
Because the region of interest is smaller than the whole video image, when bits are allocated to the coding units in the region of interest a certain offset is applied relative to the quantization parameter QP of the current image; that is, bitrate resources are moderately tilted towards the region of interest, which improves the effective use of the bitrate and makes the bitrate allocation of the region of interest more reasonable.
Preferably, the original image includes data of the three channels Y, U and V, and in the down-sampling step only the data of the Y component of the original image is down-sampled to obtain the low-resolution image.
Because only the data of the Y component is down-sampled to obtain the low-resolution image, calculation time and computation are saved.
Preferably, in the down-sampling step, pixels closest to the edge of the low-resolution image are sequentially copied, and then pixels extending outward around the low-resolution image are added.
Since the pixels closest to the edge of the low-resolution image are copied in sequence and then added with the pixels which are expanded outwards at the periphery of the low-resolution image, the search area can contain the boundary of the low-resolution image.
Preferably, in the encoding step, for the original image of an I frame the coding units are divided down to the minimum size, a set of prediction reference angles is constructed for the macroblocks at the different division levels, the angles in the set are traversed, and the corresponding RDO values are compared to obtain the optimal prediction angle for the actual coding of each level.
Because the I frame serves as a reference frame when P frames and B frames are coded, dividing the coding units to the minimum size preserves the detail information of the image and improves the coding quality of the whole video sequence.
Preferably, the region of interest is a face region. In the encoding step, when the current coding unit of the original image of a P frame or B frame contains a face region, it is determined whether the coding unit contains an edge between the face and the background or of the facial features; if so, the coding unit is divided to the minimum size; if not, one level of division is tried, and the coding unit is divided by one level when the sum of the RDO values of all the sub-units after division is smaller than the RDO value corresponding to the best prediction mode without division.
In this way the high-frequency detail of the face region and of its edges with the background or facial features is preserved, improving the viewing experience while making more reasonable use of the bitrate.
Preferably, the region of interest is a subtitle region. In the encoding step, when the current coding unit of the original image of a P frame or B frame contains a subtitle region, it is determined whether the coding unit contains the boundary between the subtitle and the background region; if so, the coding unit is divided to the minimum size; if not, the coding unit is not divided.
Since the movement of subtitles is usually a rigid horizontal or vertical shift, the calculation time of the coarse-selection process can be saved and the real-time performance of encoding is improved.
Preferably, the region of interest is a fixed-identification region. In the encoding step, when the current coding unit of the original image of a P frame or B frame contains a fixed identification, it is determined whether the coding unit contains the edge of the fixed identification; if so, the coding unit is divided to the minimum size; if not, the coding unit is not divided.
Because fixed identifications such as station logos usually stay at a fixed position in the video picture, the calculation time of the coarse-selection process can be saved while the viewing experience is improved, and the real-time performance of encoding is improved.
A second technical solution is a region-of-interest based video coding system, comprising,
an information reading module 100, which sequentially acquires the original image data of each frame of a video and the pixel position information of the regions of interest in the original image, wherein the original image contains at least one region of interest;
a down-sampling module 200, which down-samples the original image to obtain a low resolution image;
the initial selection prediction module 300 divides the low-resolution image into a plurality of macro blocks, performs intra-frame prediction on the macro blocks in the region of interest, traverses the prediction angle supported in the coding standard, calculates the distortion SATD value of the intra-frame prediction pixel after projection reconstruction and the pixel of the low-resolution image, and obtains the inter-frame minimum distortion SATD valuebestAnd pair ofCorresponding predicted angle Dirbest
for the original image of an I frame, the prediction angle Dir_best corresponding to the minimum distortion SATD_best is taken as the result of the best preliminary prediction mode of the current macroblock;
for the original image of a P frame or B frame, the coordinates of the area corresponding to the region of interest in the adjacent frame are searched, the motion vector corresponding to the change of the centre-of-gravity position of the region of interest between the current frame and the reference frame is calculated, the search starts from this motion vector, the SATD at different offsets is calculated in turn, and the best motion-vector predictor MV_best and the minimum inter-prediction distortion SATD_inter are determined; the minimum distortion SATD_best is then compared with the minimum inter-prediction distortion SATD_inter, and the prediction result with the smaller distortion is selected as the result of the best preliminary prediction mode of the current macroblock;
an encoding module 400, configured to set coding units in the original image and encode each coding unit, wherein during encoding, for the original image of an I frame, a set of prediction reference angles is constructed according to the prediction angle Dir_best, the angles in the set are traversed, and the RDO values corresponding to the angles are compared to obtain the optimal prediction angle Dir_best required for actual coding;
for the original image of a P frame or B frame, according to the selection result of the preliminary-selection prediction: if intra-frame prediction was selected, the optimal prediction angle Dir_best required for actual coding is obtained in the same way as for the I frame; if inter-frame prediction was selected, the motion vector obtained in the preliminary-selection prediction is stretched according to the scaling ratio, and the RDO values of the different stretched motion vectors within the same search range are compared to obtain the best predicted motion vector required for actual coding and the corresponding RDO value.
The technical effect is the same as that of the first technical scheme.
Preferably, the encoding module 400 allocates a reference quantization parameter QP_base to the original image, counts the sum of the distortion (SATD) values of the different regions of interest in the original image, allocates a local target bitrate to the regions of interest according to the ratio of their area to that of the original image, uses the sum of the SATD values as the input of the rate-control algorithm, and allocates a quantization parameter QP to each region of interest according to the local target bitrate,
[QP allocation formula omitted: presented as an image in the original publication]
where clip3(x, min, max) limits x to the range (min, max).
The technical effect is the same as that of the first technical scheme.
Drawings
FIG. 1 is a diagram illustrating an embodiment of a region of interest based video encoding system;
fig. 2 is an explanatory diagram of downsampling an original image;
fig. 3 is an explanatory diagram of dividing a low resolution image into macroblocks;
FIG. 4 is a flowchart of setting an optimal prediction mode when the region of interest is a face region;
FIG. 5 is a flowchart illustrating setting an optimal prediction mode when the region of interest is a subtitle region;
fig. 6 is a flowchart of setting the optimal prediction mode when the region of interest is the logo region.
Detailed Description
In the following detailed description of the preferred embodiments of the invention, reference is made to the accompanying drawings that form a part hereof, and in which is shown by way of illustration, specific features of the invention, such that the advantages and features of the invention may be more readily understood and appreciated. The following description is an embodiment of the claimed invention, and other embodiments related to the claims not specifically described also fall within the scope of the claims.
First, a video coding decision based on a region of interest will be explained.
In the first step, each original frame of the video is down-sampled by a factor of 1/s to obtain a low-resolution image; for example, only the Y component of the original image is down-sampled, to save calculation time.
In the second step, the edges of the low-resolution image are extended outward by s pixels to obtain an extended low-resolution image. Extending the pixels allows the search to reach the boundary of the low-resolution image and enlarges the search range.
In the third step, the low-resolution-image portion of the extended low-resolution image is divided into a number of macroblocks.
In addition, in each original frame, faces, subtitles, station logos, and the like are recognized as regions of interest by a system such as a neural network, and the position information of every pixel of each region of interest is obtained. Faces, subtitles, and station logos are all regions of interest, but their texture features and motion features differ from one another.
In the fourth step, intra-frame prediction is performed on the macroblocks located in the region of interest to obtain the best coarse pre-selection result. This result is later used, during coding, to calculate the RDO value of each coding unit and the corresponding optimal prediction angle and optimal motion vector of the actual coding process.
In the fifth step, the original image is divided into coding units, for example coding units of 64x64 pixels, and encoded. During encoding, if the original image is an I frame, the coding units are divided down to the minimum size, for example 8x8 pixels, so as to retain the detail information of the image and improve the coding quality of the whole video sequence.
For the original image of a P frame or B frame, when a coding unit belongs to the face region, it is determined whether it contains an edge between the face and the background or of the facial features; if it does, the coding unit is subdivided to the minimum size, for example 8x8 pixels. Otherwise it is divided by one level; for example, if the maximum coding unit is 64x64 pixels, it is divided into four sub-units of 32x32 pixels.
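A hedged sketch of this split rule follows. `contains_face_edge` and `rdo_cost` are hypothetical callables standing in for the encoder's own edge test and rate-distortion evaluation, and the size constants are the example values from the text; the RDO comparison for the non-edge case follows the rule stated in the disclosure above.

```python
# Hypothetical illustration of the face-region split decision described above.
MIN_CU, MAX_CU = 8, 64   # example sizes from the text

def partition_face_cu(cu, contains_face_edge, rdo_cost):
    """Return the list of leaf CU rectangles chosen for one coding unit.

    cu                 -- (x, y, size) of the current coding unit
    contains_face_edge -- callable(cu) -> True if the CU covers a face/background
                          or facial-feature edge
    rdo_cost           -- callable(cu) -> rate-distortion cost of coding `cu` whole
    """
    x, y, size = cu
    if contains_face_edge(cu):
        # Edge regions keep high-frequency detail: split straight to the minimum size.
        n = size // MIN_CU
        return [(x + i * MIN_CU, y + j * MIN_CU, MIN_CU)
                for i in range(n) for j in range(n)]
    if size == MIN_CU:
        return [cu]
    # Otherwise try one split level and keep it only if it lowers the total RDO cost.
    half = size // 2
    children = [(x, y, half), (x + half, y, half),
                (x, y + half, half), (x + half, y + half, half)]
    if sum(rdo_cost(c) for c in children) < rdo_cost(cu):
        return children
    return [cu]
```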
For a coding unit containing a subtitle region, it is determined whether it lies on the boundary between the subtitle and the background region; if so, it is divided to the minimum size of 8x8 pixels to keep the edge region sharp. If not, since the motion of subtitles in the image is a regular vertical or horizontal shift, a rigid motion that does not cause large deformation, the coding unit is not divided further.
In the sixth step, the RDO value of the coding unit and the corresponding optimal prediction angle and optimal motion vector of the actual coding process are calculated according to the best coarse pre-selection result obtained during the preliminary-selection prediction.
If the decision of the preliminary-selection prediction for the current coding unit is intra-frame prediction, the decision follows the prediction-direction decision process used for I-frame images in the preliminary-selection prediction and the best RDO value is calculated; if it is inter-frame prediction, the motion vector obtained by coarse selection is first stretched, motion estimation is then carried out, for example within a rectangular search range of 8x4 pixels, and finally the best predicted motion vector and the corresponding RDO value are determined.
If the current frame is a P frame or B frame and the coding unit belongs to the logo region, the coding units at edge positions are likewise divided to the minimum size of 8x8 pixels; otherwise the coding unit is not divided. For the coding-mode decision, similarly, if the best preliminary prediction mode is intra-frame prediction, the method used in coarse selection is applied, for example a reference angle set is constructed and the best projection angle is determined after traversal; if the preliminary selection yields inter-frame prediction, the motion vector of the preliminary prediction is stretched and used directly as the prediction vector of the current coding unit, and the coding mode is determined by comparing RDO values to decide whether the residual is retained.
In the seventh step, during encoding a reference quantization parameter QP_base is allocated to the original picture, the sum of the distortion (SATD) values of the different regions of interest in the original image is counted, a local target bitrate is allocated to the regions of interest according to the ratio of their area to that of the original image, the sum of the SATD values is used as the input of the rate-control algorithm, and a quantization parameter QP is allocated to each region of interest according to the local target bitrate,
[QP allocation formula omitted: presented as an image in the original publication]
where clip3(x, min, max) limits x to the range (min, max).
Since the region of interest (ROI) is small relative to the size of the video image, when allocating bits to the coding units in these regions a strategy of applying a certain offset to the reference QP of the current image is adopted, so that bitrate resources are moderately tilted towards the ROI. The viewing experience of the whole video is further improved without a noticeable increase in bitrate.
When allocating the offset for a region of interest, the offset range must be limited to avoid extreme values; based on the change of the quantization parameter observed when the compression ratio doubles under different metrics, the offset range is set to ±3.
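The exact QP-allocation formula is given as an image in the original publication and is not reproduced here; the sketch below only illustrates clip3-style clamping of a per-ROI QP offset around the reference QP_base, assuming the ±3 range mentioned above. The function and parameter names are illustrative.

```python
# Illustrative only: the patent's full QP formula is an image and is not reproduced.
# This shows clip3-style clamping of a per-ROI QP offset around the frame QP_base.
def clip3(x, lo, hi):
    """Limit x to the closed range [lo, hi]."""
    return max(lo, min(hi, x))

def roi_qp(qp_base: int, qp_offset: float, max_offset: int = 3) -> int:
    """Assumed behaviour: offset produced by rate control, clamped to +/- max_offset."""
    return qp_base + clip3(round(qp_offset), -max_offset, max_offset)

print(roi_qp(32, -5.2))  # -> 29: offset clamped to -3
print(roi_qp(32, 1.4))   # -> 33
```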
Down-sampling may, for example, be performed at a ratio of 1/2, 1/4, or 1/8 in both the horizontal and vertical directions of the image, as needed.
At a sampling ratio of 1/16, i.e. when each frame is down-sampled to 1/4 of the original size in both the vertical and horizontal directions, the resulting image has 1/16 of the original resolution. The invention uses a Gaussian filter function to perform the down-sampling and obtain the low-resolution image.
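A minimal sketch of Gaussian-filtered down-sampling of the Y plane, with the ROI coordinates scaled along with the image; the patent does not specify the kernel, so the sigma below and the use of scipy.ndimage.gaussian_filter are illustrative assumptions.

```python
# Gaussian blur followed by decimation of the luma plane (illustrative parameters).
import numpy as np
from scipy.ndimage import gaussian_filter

def downsample_y(y_plane: np.ndarray, factor: int = 4) -> np.ndarray:
    """Blur then decimate the Y plane by `factor` in both directions
    (factor=4 gives a 1/16-resolution image, as in the example above)."""
    blurred = gaussian_filter(y_plane.astype(np.float32), sigma=factor / 2.0)
    return blurred[::factor, ::factor].astype(np.uint8)

y = np.random.default_rng(0).integers(0, 256, (1080, 1920), dtype=np.uint8)
low_res = downsample_y(y)             # 270 x 480
roi = (400, 600, 64, 64)              # x, y, w, h in the original image (example values)
roi_low = tuple(v // 4 for v in roi)  # ROI coordinates scaled with the image
```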
In the invention, the preliminary-selection prediction is carried out on the low-resolution image, which speeds up the coarse-selection prediction and saves time.
Before coarse-selection prediction, the low-resolution image is extended outward by 16 pixels at the top, bottom, left, and right; that is, the pixels closest to the image edge are copied 16 times in sequence to obtain the extended low-resolution image. This allows the prediction process to reference data outside the down-sampled image and balances the speed of the preliminary-selection prediction against its accuracy.
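A minimal sketch of this border extension, using numpy "edge" padding to replicate the pixels nearest each border 16 times.

```python
# Replicate the edge pixels of the low-resolution image outward by `pad` samples
# so that motion search can reference positions outside the down-sampled picture.
import numpy as np

def extend_borders(low_res: np.ndarray, pad: int = 16) -> np.ndarray:
    """Replicate edge pixels `pad` times on every side (numpy 'edge' padding)."""
    return np.pad(low_res, pad_width=pad, mode="edge")

img = np.arange(12, dtype=np.uint8).reshape(3, 4)
print(extend_borders(img, pad=2).shape)  # (7, 8)
```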
Meanwhile, the coordinates of the region of interest are scaled along with the low-resolution image, so that the position of the region of interest relative to the image does not change.
Fig. 1 is a diagram illustrating an embodiment of a region-of-interest based video coding system. As shown in fig. 1, the present embodiment includes a region-of-interest recognition device 90, an information reading module 100, a down-sampling module 200, a preliminary-selection prediction module 300, and an encoding module 400.
The region-of-interest identifying device 90 detects a region of interest in the video image, and obtains position information of each pixel of the region of interest in each frame image. In this embodiment, the region-of-interest recognition apparatus 90 includes a face recognition module 91, a subtitle recognition module 92, and a logo recognition module 93, and the face pixel position, the subtitle pixel position, and the logo pixel position information recognized by each module are respectively read by the information reading module 100 together with the video image. That is, the information reading module 100 sequentially reads the original image information and the pixel position information of the identified region of interest frame by frame.
The down-sampling module 200 down-samples the original image to obtain a low resolution image. In this embodiment, the original image includes Y, U, V three-channel data, and the downsampling module 200 downsamples the Y component data in the original image to obtain a low resolution image, sequentially copies the pixels closest to the edge of the low resolution image, and adds the pixels extending outward around the low resolution image to obtain an extended low resolution image.
Fig. 2 illustrates the generation process from the original image to the low-resolution image and the extended low-resolution image. As the image is down-sampled, the pixel positions of the regions of interest are adjusted accordingly, so that the position of each region of interest in the original image remains consistent with its position in the low-resolution image. The face 21, the subtitle 22, and the station logo 23 are each recognized as regions of interest by the region-of-interest recognition device 90.
The preliminary-selection prediction module 300 divides the low-resolution image into 8x8-pixel macroblocks (see macroblock 11 in fig. 3), performs intra-frame prediction on the macroblocks inside the region of interest, traverses the prediction angles supported by the coding standard, calculates the distortion (SATD) between the projection-reconstructed intra-prediction pixels and the pixels of the low-resolution image, and obtains the minimum distortion SATD_best and the corresponding prediction angle Dir_best; that is, by traversing the prediction angles, the best prediction direction and its distortion are obtained (steps S10, S11, S20, S21, S30, S31 in figs. 4 to 6).
It is then judged whether the current frame is a P frame or a B frame (steps S12, S22, S32 in figs. 4 to 6). If not, the picture is the original image of an I frame and the best prediction mode is intra-frame prediction (steps S13, S23, S33 in figs. 4 to 6): after traversing all possible prediction angles, the minimum distortion SATD_best and its corresponding prediction angle Dir_best are determined, and the prediction angle Dir_best corresponding to the minimum distortion SATD_best is taken as the result of the best preliminary prediction mode of the current macroblock.
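A simplified stand-in for this angle traversal (an assumption, not the full angle set of the coding standard): only vertical, horizontal, and DC predictions are built from the macroblock's neighbouring pixels and scored with an 8x8 SATD, and the minimum is kept. The helper names are illustrative.

```python
# Traverse a few candidate intra modes and keep the one with the smallest SATD.
import numpy as np
from scipy.linalg import hadamard

H8 = hadamard(8)

def satd(block, pred):
    t = H8 @ (block.astype(int) - pred.astype(int)) @ H8
    return int(np.abs(t).sum())

def best_intra_mode(block, top_row, left_col):
    """Return (mode_name, SATD_best) for an 8x8 macroblock."""
    candidates = {
        "vertical":   np.tile(top_row, (8, 1)),            # copy the row above downward
        "horizontal": np.tile(left_col[:, None], (1, 8)),   # copy the left column rightward
        "dc":         np.full((8, 8), (int(top_row.mean()) + int(left_col.mean())) // 2),
    }
    return min(((name, satd(block, pred)) for name, pred in candidates.items()),
               key=lambda item: item[1])

rng = np.random.default_rng(1)
mb = rng.integers(0, 256, (8, 8))
top, left = rng.integers(0, 256, 8), rng.integers(0, 256, 8)
print(best_intra_mode(mb, top, left))
```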
In the case of a P frame or B frame, the macroblock can be predicted both intra-frame and inter-frame. When the current macroblock belongs to the face region, because the motion of people in a video scene is not fixed, the best coarse-selection prediction mode of the current macroblock may be either intra-frame or inter-frame prediction, so the best method of each kind is evaluated.
First, inter-frame prediction is carried out: according to the position of the face region in the P frame or B frame, the coordinates of the corresponding face region in the adjacent frame are found, the motion vector corresponding to the change of the centre-of-gravity position of the face rectangles in the two frames is calculated (step S14 in fig. 4), motion estimation starts from this vector using a conventional motion-estimation algorithm (step S15 in fig. 4), the search is carried out within a rectangular window of fixed size 16x8 pixels (S16), the SATD values at the different motion-vector offsets are compared in turn, and the best motion-vector predictor MV_best and the minimum inter-prediction distortion SATD_inter are determined.
Then intra-frame prediction is carried out in the same way as for an I frame, and after traversing all possible prediction angles the minimum distortion SATD_intra and the corresponding prediction angle Dir_intra are determined; SATD_intra and SATD_inter are then compared, and the prediction mode with the smaller distortion is selected as the best coarse prediction mode of the current macroblock (step S18 in fig. 4). A sketch of the inter-frame part of this preliminary prediction is given below.
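In the sketch, the start vector is taken from the displacement of the ROI centre of gravity between the reference and current frames, and a full search over a fixed 16x8 window keeps the offset with the smallest block cost; `block_cost` stands in for the SATD computation shown earlier and all names are illustrative, not the patent's.

```python
# Hedged sketch of inter-frame preliminary prediction for a face macroblock.
import numpy as np

def centroid(mask: np.ndarray):
    """Centre of gravity (x, y) of a boolean ROI mask."""
    ys, xs = np.nonzero(mask)
    return float(xs.mean()), float(ys.mean())

def preselect_mv(cur_block, ref_frame, block_xy, roi_cur, roi_ref,
                 block_cost, search_w=16, search_h=8):
    """Return (MV_best, SATD_inter) for one macroblock.

    cur_block          -- pixels of the current macroblock
    ref_frame          -- full low-resolution reference frame
    block_xy           -- (x, y) position of the macroblock in the current frame
    roi_cur, roi_ref   -- boolean ROI masks in the current and reference frames
    block_cost         -- callable(block_a, block_b) -> distortion (e.g. SATD)
    """
    bx, by = block_xy
    bh, bw = cur_block.shape
    (rx, ry), (cx, cy) = centroid(roi_ref), centroid(roi_cur)
    # MV points from the current block into the reference frame.
    start = (round(rx - cx), round(ry - cy))
    best_mv, best_cost = start, float("inf")
    for dy in range(-search_h // 2, search_h // 2 + 1):
        for dx in range(-search_w // 2, search_w // 2 + 1):
            mvx, mvy = start[0] + dx, start[1] + dy
            x, y = bx + mvx, by + mvy
            if 0 <= x and 0 <= y and x + bw <= ref_frame.shape[1] and y + bh <= ref_frame.shape[0]:
                cost = block_cost(cur_block, ref_frame[y:y + bh, x:x + bw])
                if cost < best_cost:
                    best_mv, best_cost = (mvx, mvy), cost
    return best_mv, best_cost
```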
If the current macroblock belongs to the subtitle region, it can be assumed in advance, from prior knowledge about subtitles, that their motion is a rigid horizontal or vertical shift. In this case, for the image of an I frame the same prediction process as for the face region is used: all prediction angles are traversed, and the prediction angle Dir_best corresponding to the minimum distortion SATD_best is taken as the result of the best preliminary prediction mode of the current macroblock (step S23 in fig. 5).
For the image of a P frame or B frame, it is first judged whether the current frame is the one nearest to the I frame. If it is, the same method as for the face region is used to calculate the centre-of-gravity motion vector of the subtitle region between the I frame and the P or B frame (step S26 in fig. 5); this vector is used as the starting motion vector of the current macroblock, a smaller search range of 8x4 pixels is selected (step S27 in fig. 5), and motion estimation is started (step S28 in fig. 5) to obtain and record the best motion-vector predictor MV_best and the minimum inter-prediction distortion SATD_inter (step S29 in fig. 5). Considering that the best motion vectors of different macroblocks differ because of pixel noise and other factors, the MV_best values of the subtitle macroblocks in the current frame are averaged and stored.
If the current frame is not the one nearest to the I frame, the previously calculated best motion-vector predictor MV_best is used as the base vector, and the corresponding best prediction vector is obtained by stretching it according to the temporal distance (step S25 in fig. 5), which saves calculation time in the coarse-selection process.
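A hypothetical helper for that reuse step: the motion vector found against the nearest I frame is stretched linearly with the temporal distance instead of re-running the search. The names and the linear-scaling assumption are illustrative.

```python
# Stretch a stored motion vector according to the temporal distance to the current frame.
def scale_mv(mv_best, dist_found, dist_current):
    """Scale (mvx, mvy) found at temporal distance `dist_found` to `dist_current`."""
    s = dist_current / dist_found
    return (round(mv_best[0] * s), round(mv_best[1] * s))

print(scale_mv((6, -2), dist_found=1, dist_current=3))  # -> (18, -6)
```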
When the current macroblock belongs to the station-logo region, the coarse-selection process can be simplified further. A station logo is generally fixed at a set position in the video picture, and its position can be considered unchanged over a long video sequence; for the I frame, the minimum distortion SATD_best and the corresponding prediction angle Dir_best are obtained with the same prediction procedure as for faces and subtitles.
For a P frame or B frame, because the position of the station logo is fixed, the starting motion vector for motion estimation is set directly to (0, 0) (step S34 in fig. 6), a search range of 2x2 pixels is set (step S35 in fig. 6), motion estimation is started within this smaller range (step S36 in fig. 6), and finally the best motion-vector predictor MV_best and the minimum inter-prediction distortion SATD_inter are determined (step S37 in fig. 6).
The encoding module 400 encodes each coding unit, and in the present embodiment, encodes according to HEVC and AVS2 standards.
In the encoding process, parameters are set as follows.
For the original image of an I frame, a set of prediction reference angles is constructed according to the prediction angle Dir_best, the angles in the set are traversed, the RDO values corresponding to the angles are compared, and the optimal prediction angle Dir_best required for actual coding is obtained.
For the original image of a P frame or B frame, according to the selection result of the preliminary-selection prediction step: if intra-frame prediction was selected, the optimal prediction angle Dir_best required for actual coding is obtained in the same way as for the I frame; if inter-frame prediction was selected, the motion vector obtained in the preliminary-selection prediction step is stretched according to the scaling ratio, and the RDO values of the different stretched motion vectors within the same search range are compared to obtain the best predicted motion vector required for actual coding and the corresponding RDO value.
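Both decisions reduce to comparing rate-distortion costs of the form J = D + λ·R over a candidate set (prediction angles from the reference-angle set, or stretched motion vectors). The sketch below shows that selection loop; `distortion` (e.g. the SSD of the reconstruction) and `rate_bits` are placeholders for the encoder's own measurements.

```python
# Generic RDO selection: keep the candidate minimising J = D + lambda * R.
def rdo_select(candidates, distortion, rate_bits, lam):
    """Return (best_candidate, best_cost) over the candidate set."""
    best, best_cost = None, float("inf")
    for cand in candidates:
        cost = distortion(cand) + lam * rate_bits(cand)
        if cost < best_cost:
            best, best_cost = cand, cost
    return best, best_cost
```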
During encoding, the encoding module 400 allocates a reference quantization parameter QP_base to the original image, counts the sum of the distortion (SATD) values of the different regions of interest in the original image, allocates a local target bitrate to the regions of interest according to the ratio of their area to that of the original image, uses the sum of the SATD values as the input of the rate-control algorithm, and allocates a quantization parameter QP to each region of interest according to the local target bitrate,
[QP allocation formula omitted: presented as an image in the original publication]
where clip3(x, min, max) limits x to the range (min, max).
In the invention, coarse-selection prediction is performed on down-sampled video frames, which speeds up the coarse-selection prediction; when the region of interest is a face, only the edges of the face and the regions containing the facial features are divided into smaller blocks, which reduces the amount of computation during encoding and ensures the real-time performance of the system; and fine adjustment of the bitrate makes the bitrate allocation of the region of interest more reasonable.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim.

Claims (10)

1. A region-of-interest based video coding method is characterized by comprising,
an information reading step of sequentially acquiring the original image data of each frame of a video and the pixel position information of the regions of interest in the original image, wherein the original image contains at least one region of interest;
a down-sampling step, namely down-sampling the original image to obtain a low-resolution image;
a preliminary-selection prediction step of dividing the low-resolution image into a number of macroblocks, performing intra-frame prediction on the macroblocks located in the region of interest, traversing the prediction angles supported by the coding standard, calculating the distortion (SATD) between the projection-reconstructed intra-prediction pixels and the pixels of the low-resolution image, and obtaining the minimum distortion SATD_best and the corresponding prediction angle Dir_best;
for the original image of an I frame, taking the prediction angle Dir_best corresponding to the minimum distortion SATD_best as the result of the best preliminary prediction mode of the current macroblock;
for the original image of a P frame or B frame, searching the coordinates of the area corresponding to the region of interest in the adjacent frame, calculating the motion vector corresponding to the change of the centre-of-gravity position of the region of interest between the current frame and the reference frame, searching with this motion vector as the starting vector, calculating the SATD at different offsets in turn, and determining the best motion-vector predictor MV_best and the minimum inter-prediction distortion SATD_inter; and comparing the minimum distortion SATD_best with the minimum inter-prediction distortion SATD_inter and selecting the prediction result with the smaller distortion as the result of the best preliminary prediction mode of the current macroblock;
an encoding step of setting coding units in the original image and encoding each coding unit, wherein during encoding, for the original image of an I frame, a set of prediction reference angles is constructed according to the prediction angle Dir_best, the angles in the set are traversed, and the rate-distortion optimization (RDO) values corresponding to the angles are compared to obtain the optimal prediction angle Dir_best required for actual coding;
for the original image of a P frame or B frame, according to the selection result of the preliminary-selection prediction step: if intra-frame prediction was selected, the optimal prediction angle Dir_best required for actual coding is obtained in the same way as for the I frame; if inter-frame prediction was selected, the motion vector obtained in the preliminary-selection prediction step is stretched according to the scaling ratio, and the RDO values of the different stretched motion vectors within the same search range are compared to obtain the best predicted motion vector required for actual coding and the corresponding RDO value.
2. The region-of-interest-based video coding method according to claim 1, wherein: in the encoding step, a reference quantization parameter QP_base is allocated to the original image, the sum of the distortion (SATD) values of the different regions of interest in the original image is counted, a local target bitrate is allocated to the regions of interest according to the ratio of their area to that of the original image, the sum of the SATD values is used as the input of the rate-control algorithm, and a quantization parameter QP is allocated to each region of interest according to the local target bitrate,
[QP allocation formula omitted: presented as an image in the original publication]
where clip3(x, min, max) limits x to the range (min, max).
3. A region-of-interest based video coding method according to claim 1 or 2, characterized in that: the original image comprises Y, U, V data of three channels, and in the down-sampling step, the data of Y component in the original image is down-sampled to obtain a low-resolution image.
4. The region-of-interest-based video coding method according to claim 3, wherein: in the down-sampling step, pixels closest to the edge of the low-resolution image are sequentially copied and then added with outward-extended pixels around the low-resolution image.
5. The region-of-interest-based video coding method according to claim 4, wherein: in the encoding step, for the original image of an I frame, the coding units are divided down to the minimum size, a set of prediction reference angles is constructed for the macroblocks at the different division levels, the angles in the set are traversed, and the corresponding rate-distortion optimization RDO values are compared to obtain the optimal prediction angle for the actual coding of each level.
6. The region-of-interest-based video coding method according to claim 5, wherein: the region of interest is a face region, and in the encoding step, when the current coding unit of the original image of a P frame or B frame contains a face region, it is judged whether the coding unit contains an edge between the face and the background or of the facial features; if so, the coding unit is divided to the minimum size; if not, one level of division is tried, and the coding unit is divided by one level when the sum of the rate-distortion optimization RDO values of all the sub-units after division is smaller than the RDO value corresponding to the best prediction mode without division.
7. The region-of-interest-based video coding method according to claim 5, wherein: the region of interest is a subtitle region, and in the encoding step, when the current coding unit of the original image of a P frame or B frame contains a subtitle region, it is judged whether the coding unit contains the boundary between the subtitle and the background region; if so, the coding unit is divided to the minimum size; if not, the coding unit is not divided.
8. The region-of-interest-based video coding method according to claim 5, wherein: the region of interest is a fixed-identification region, and in the encoding step, when the current coding unit of the original image of a P frame or B frame contains a fixed identification, it is judged whether the coding unit contains the edge of the fixed identification; if so, the coding unit is divided to the minimum size; if not, the coding unit is not divided.
9. A region-of-interest based video coding system, comprising,
an information reading module (100) for sequentially acquiring the original image data of each frame of a video and the pixel position information of the regions of interest in the original image, wherein the original image contains at least one region of interest;
a down-sampling module (200) for down-sampling the original image to obtain a low-resolution image;
a preliminary-selection prediction module (300) which divides the low-resolution image into a number of macroblocks, performs intra-frame prediction on the macroblocks located in the region of interest, traverses the prediction angles supported by the coding standard, calculates the distortion (SATD) between the projection-reconstructed intra-prediction pixels and the pixels of the low-resolution image, and obtains the minimum distortion SATD_best and the corresponding prediction angle Dir_best;
for the original image of an I frame, the prediction angle Dir_best corresponding to the minimum distortion SATD_best is taken as the result of the best preliminary prediction mode of the current macroblock;
for the original image of a P frame or B frame, the coordinates of the area corresponding to the region of interest in the adjacent frame are searched, the motion vector corresponding to the change of the centre-of-gravity position of the region of interest between the current frame and the reference frame is calculated, the search starts from this motion vector, the SATD at different offsets is calculated in turn, and the best motion-vector predictor MV_best and the minimum inter-prediction distortion SATD_inter are determined; the minimum distortion SATD_best is then compared with the minimum inter-prediction distortion SATD_inter, and the prediction result with the smaller distortion is selected as the result of the best preliminary prediction mode of the current macroblock;
an encoding module (400) for setting coding units in the original image and encoding each coding unit, wherein during encoding, for the original image of an I frame, a set of prediction reference angles is constructed according to the prediction angle Dir_best, the angles in the set are traversed, and the RDO values corresponding to the angles are compared to obtain the optimal prediction angle Dir_best required for actual coding;
for the original image of a P frame or B frame, according to the selection result of the preliminary-selection prediction step: if intra-frame prediction was selected, the optimal prediction angle Dir_best required for actual coding is obtained in the same way as for the I frame; if inter-frame prediction was selected, the motion vector obtained in the preliminary-selection prediction step is stretched according to the scaling ratio, and the RDO values of the different stretched motion vectors within the same search range are compared to obtain the best predicted motion vector required for actual coding and the corresponding RDO value.
10. The region-of-interest based video coding system of claim 9, wherein: the encoding module (400) allocates a reference quantization parameter QP_base to the original image, counts the sum of the distortion (SATD) values of the different regions of interest in the original image, allocates a local target bitrate to the regions of interest according to the ratio of their area to that of the original image, uses the sum of the SATD values as the input of the rate-control algorithm, and allocates a quantization parameter QP to each region of interest according to the local target bitrate,
[QP allocation formula omitted: presented as an image in the original publication]
where clip3(x, min, max) limits x to the range (min, max).
CN202210350595.8A 2022-04-02 2022-04-02 Video coding method and system based on region of interest Active CN114745549B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210350595.8A CN114745549B (en) 2022-04-02 2022-04-02 Video coding method and system based on region of interest

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210350595.8A CN114745549B (en) 2022-04-02 2022-04-02 Video coding method and system based on region of interest

Publications (2)

Publication Number Publication Date
CN114745549A true CN114745549A (en) 2022-07-12
CN114745549B CN114745549B (en) 2023-03-17

Family

ID=82278276

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210350595.8A Active CN114745549B (en) 2022-04-02 2022-04-02 Video coding method and system based on region of interest

Country Status (1)

Country Link
CN (1) CN114745549B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116506628A (en) * 2023-06-27 2023-07-28 苇创微电子(上海)有限公司 Pixel block-based coding predictor method, coding system and coding device
CN116828154A (en) * 2023-07-14 2023-09-29 湖南中医药大学第一附属医院((中医临床研究所)) Remote video monitoring system
CN116962685A (en) * 2023-09-21 2023-10-27 杭州爱芯元智科技有限公司 Video encoding method, video encoding device, electronic equipment and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006059848A1 (en) * 2004-12-03 2006-06-08 Samsung Electronics Co., Ltd. Method and apparatus for multi-layered video encoding and decoding
CN101163241A (en) * 2007-09-06 2008-04-16 武汉大学 Video sequence coding/decoding structure
CN101282479A (en) * 2008-05-06 2008-10-08 武汉大学 Method for encoding and decoding airspace with adjustable resolution based on interesting area
CN101572810A (en) * 2008-04-29 2009-11-04 合肥坤安电子科技有限公司 Video encoding method based on interested regions
CN102510496A (en) * 2011-10-14 2012-06-20 北京工业大学 Quick size reduction transcoding method based on region of interest
US20170041605A1 (en) * 2015-08-04 2017-02-09 Fujitsu Limited Video encoding device and video encoding method
US20170085892A1 (en) * 2015-01-20 2017-03-23 Beijing University Of Technology Visual perception characteristics-combining hierarchical video coding method
CN113079376A (en) * 2021-04-02 2021-07-06 北京数码视讯软件技术发展有限公司 Video coding method and device for static area

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006059848A1 (en) * 2004-12-03 2006-06-08 Samsung Electronics Co., Ltd. Method and apparatus for multi-layered video encoding and decoding
CN101163241A (en) * 2007-09-06 2008-04-16 武汉大学 Video sequence coding/decoding structure
CN101572810A (en) * 2008-04-29 2009-11-04 合肥坤安电子科技有限公司 Video encoding method based on interested regions
CN101282479A (en) * 2008-05-06 2008-10-08 武汉大学 Method for encoding and decoding airspace with adjustable resolution based on interesting area
CN102510496A (en) * 2011-10-14 2012-06-20 北京工业大学 Quick size reduction transcoding method based on region of interest
US20170085892A1 (en) * 2015-01-20 2017-03-23 Beijing University Of Technology Visual perception characteristics-combining hierarchical video coding method
US20170041605A1 (en) * 2015-08-04 2017-02-09 Fujitsu Limited Video encoding device and video encoding method
CN113079376A (en) * 2021-04-02 2021-07-06 北京数码视讯软件技术发展有限公司 Video coding method and device for static area

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116506628A (en) * 2023-06-27 2023-07-28 苇创微电子(上海)有限公司 Pixel block-based coding predictor method, coding system and coding device
CN116506628B (en) * 2023-06-27 2023-10-24 苇创微电子(上海)有限公司 Pixel block-based coding predictor method, coding system and coding device
CN116828154A (en) * 2023-07-14 2023-09-29 湖南中医药大学第一附属医院((中医临床研究所)) Remote video monitoring system
CN116828154B (en) * 2023-07-14 2024-04-02 湖南中医药大学第一附属医院((中医临床研究所)) Remote video monitoring system
CN116962685A (en) * 2023-09-21 2023-10-27 杭州爱芯元智科技有限公司 Video encoding method, video encoding device, electronic equipment and storage medium
CN116962685B (en) * 2023-09-21 2024-01-30 杭州爱芯元智科技有限公司 Video encoding method, video encoding device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN114745549B (en) 2023-03-17

Similar Documents

Publication Publication Date Title
CN114745549B (en) Video coding method and system based on region of interest
CA2408364C (en) Method for encoding and decoding video information, a motion compensated video encoder and a corresponding decoder
US8817871B2 (en) Adaptive search range method for motion estimation and disparity estimation
JP2006519565A (en) Video encoding
CN101378504B (en) Method for estimating block matching motion of H.264 encode
JP2006519564A (en) Video encoding
EP1419650A2 (en) method and apparatus for motion estimation between video frames
US11425413B2 (en) Encoder, decoder, encoding method, decoding method, and recording medium
EP1461959A2 (en) Sharpness enhancement in post-processing of digital video signals using coding information and local spatial features
US20130235935A1 (en) Preprocessing method before image compression, adaptive motion estimation for improvement of image compression rate, and method of providing image data for each image type
US20220030241A1 (en) Encoder, decoder, encoding method, and decoding method
CN113079376B (en) Video coding method and device for static area
US20070258521A1 (en) Method for motion search between video frames with multiple blocks
US20230308649A1 (en) Encoder, decoder, encoding method, and decoding method
Paul et al. Pattern-based video coding with dynamic background modeling
CN112954365A (en) HEVC interframe motion estimation pixel search improvement method
Eisips et al. Global motion estimation for image sequence coding applications
Yang et al. Fast depth map coding based on virtual view quality
CN106878753B (en) 3D video residual coding mode selection method using texture smoothing information
JP2883592B2 (en) Moving picture decoding apparatus and moving picture decoding method
CN117294861B (en) Coding block dividing method based on inter-frame prediction and coder
US11716470B2 (en) Encoder, decoder, encoding method, and decoding method
JPH0993537A (en) Digital video signal recording and reproducing device and digital video signal coding method
JP2883585B2 (en) Moving picture coding apparatus and moving picture coding method
Hammani et al. Fast Depth Map Intra Mode Prediction Based on Self-organizing Map

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant