CN111726633A - Compressed video stream recoding method based on deep learning and significance perception - Google Patents

Compressed video stream recoding method based on deep learning and significance perception

Info

Publication number
CN111726633A
CN111726633A · Application CN202010394906.1A · Granted publication CN111726633B
Authority
CN
China
Prior art keywords
frame
image
video
compressed
video image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010394906.1A
Other languages
Chinese (zh)
Other versions
CN111726633B (en)
Inventor
李永军
李莎莎
杜浩浩
邓浩
陈立家
曹雪
王赞
陈竞
李鹏飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan University
Original Assignee
Henan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan University
Priority to CN202010394906.1A
Publication of CN111726633A
Application granted
Publication of CN111726633B
Legal status: Active
Anticipated expiration

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/60: … using transform coding
    • H04N19/61: … using transform coding in combination with predictive coding
    • H04N19/10: … using adaptive coding
    • H04N19/102: … using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/124: Quantisation
    • H04N19/169: … using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/177: … the unit being a group of pictures [GOP]
    • H04N19/625: … using transform coding using discrete cosine transform [DCT]

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Discrete Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention provides a compressed video stream re-encoding method based on deep learning and saliency perception, which comprises the following steps: constructing and training a compressed-domain video image saliency detection deep learning model (CDVNet); inputting a compressed video image X to be re-encoded into the trained model CDVNet; partially decoding the compressed video image X to be re-encoded with the model; and re-encoding the video image with HEVC (High Efficiency Video Coding) using the updated quantization parameter of each coding unit. The method extracts saliency features in the compressed domain, performing saliency detection on the compressed bitstream directly from the data obtained by partial decoding. This overcomes the drawback of prior-art pixel-domain saliency detection, which must fully decompress the compressed video to the pixel domain before feature extraction and saliency detection can begin, and it therefore has the advantages of low computational cost and low time consumption.

Description

Compressed video stream recoding method based on deep learning and significance perception
Technical Field
The invention relates to the technical field of video image processing, and in particular to a compressed video stream re-encoding method based on deep learning and saliency perception in the field of video image compression.
Background
The systematization and standardization of video image compression techniques such as JPEG, JPEG2000, H.264/AVC, and HEVC have made it routine for massive amounts of video image data to be stored and transmitted in compressed form. Subject to commercial, privacy, or bandwidth constraints, some applications must provide or transmit compressed image data at different resolutions. For example, when high-definition video is transmitted over a bandwidth-limited network, the resolution and transmission rate must be reduced; in a space-based integrated combat command system, the hyperspectral images transmitted from a communication satellite to a military command center differ in grade from those transmitted to individual soldiers. In addition, the display precision of the various display devices and communication terminals on the market differs greatly, which likewise calls for video images at different resolutions. Already-compressed video image data must therefore be re-encoded efficiently to meet the requirements of different transmission bandwidths and different code rates across application scenarios such as display terminals and communication terminals.
At present, re-encoding of a compressed video image is mainly realized by cascading two independent stages, an image decoder and an image encoder: the input compressed video image is fully decoded to restore the original pixel-domain signal, which is then compressed a second time according to the requirements of the application scenario. NARI Group Co., Ltd. disclosed a video image recompression method in its patent application "A video image recompression method" (application No. 201811379107.6, publication No. CN109640100A). That method fully decodes the compressed video image, classifies the video segments obtained by dividing the original video using shot boundary detection (SBD), processes each class of segment separately, and finally recompresses as required. The method improves the compression ratio to some extent, but its "full decompression, full compression" structure cannot exploit the information produced by the first compression: it wastes computation and cache resources, the compression time is long, and real-time processing is difficult to achieve.
Disclosure of Invention
The invention aims to provide a compressed video stream re-encoding method based on deep learning and saliency perception that overcomes the drawback of prior-art pixel-domain saliency detection, which must fully decompress the compressed video to the pixel domain before feature extraction and saliency detection can be carried out.
In order to achieve this purpose, the invention adopts the following technical scheme:
The compressed video stream re-encoding method based on deep learning and saliency perception comprises the following steps:
Step 1, construct and train a compressed-domain video image saliency detection deep learning model, specifically as follows:
Step 1.1, batch-normalize the discrete cosine transform (DCT) residual coefficients of the compressed-domain video images used for training and the corresponding video image saliency maps;
Step 1.2, take a ResNeXt network as the feature extraction network and construct the compressed-domain video image saliency detection model CDVNet with the loss function of the feature extraction network; specifically, the loss function is
loss = −Σ_(i,j) [ α·(1 − S(i,j))^γ·G(i,j)·log S(i,j) + (1 − α)·S(i,j)^γ·(1 − G(i,j))·log(1 − S(i,j)) ]
(a focal-loss form reconstructed from the parameter descriptions that follow; the original formula survives only as an image in the source),
where G(i,j) = 1 indicates that the image position corresponding to the residual DCT macroblock in row i, column j is salient, G(i,j) = 0 indicates that it is not, and S(i,j) is the probability that the residual DCT coefficient in row i, column j is predicted to be salient; α = 0.5 balances the uneven proportion of positive and negative samples, and γ = 2 adjusts the rate at which easy samples are down-weighted;
Step 1.3, feed the batch-normalized DCT residual coefficients of the compressed-domain video images and the corresponding video image saliency maps into the deep learning model CDVNet and train it with the stochastic optimization algorithm Adam, using training batch size Batch = 64, momentum Momentum = 0.9, and initial learning rate lr = 0.001; train for Epoch = 200 epochs to obtain the trained compressed-domain video image saliency detection deep learning model CDVNet.
Step 2, input the compressed video image X to be re-encoded into the compressed-domain video image saliency detection deep learning model CDVNet trained in step 1.
Step 3, partially decode the compressed video image X to be re-encoded using the compressed-domain video image saliency detection deep learning model CDVNet; specifically, partially decoding the compressed video image X to be re-encoded yields:
the prediction-residual DCT coefficients of each frame of the compressed video image X to be re-encoded;
the height H and width W of the video frame images;
the quantization parameter QP and the number of quantization parameters l_QP;
the number of groups of pictures (GOPs) G of the compressed video image X to be re-encoded, the number of video frames F in each GOP, the number K of coding units (CUs) contained in each frame, and the total number of frames R of the video image.
Step 4, extract the local saliency features of the compressed video image X partially decoded in step 3; specifically:
Step 4.1, initialize the frame index r of the video frame images of the partially decoded compressed video image X to 1;
Step 4.2, compute the norm of the quantized prediction-residual DCT coefficients of each macroblock of frame r to obtain the RDCN feature map, specifically:
RDCN(i,j) = ‖Q(D_(i,j))‖,
where RDCN(i,j) is the norm of the quantized prediction-residual DCT coefficient block D_(i,j) of the macroblock in row i, column j (the original formula, which survives only as an image in the source, also involves the macroblock's motion vector);
Step 4.3, apply max-min normalization to the RDCN feature map of frame r obtained in step 4.2;
Step 4.4, convolve the max-min-normalized RDCN feature map from step 4.3 with a 3 × 3 Gaussian filter to achieve spatial filtering;
Step 4.5, apply motion median filtering over the previous frames to the spatially filtered feature map from step 4.4 to obtain the local saliency feature map SRDCN of frame r; specifically:
SRDCN_r(i,j) = Med[RDCN′_(r−t)(i,j)], t ∈ {1, 2, …, r−2},
where Med[·] is the median of the spatially filtered feature values of the prior frames and RDCN′_(r−t)(i,j) is the spatially filtered RDCN feature value of the macroblock in row i, column j of frame r−t;
Step 5, extract the high-level saliency features of the compressed video image X with the compressed-domain video image saliency detection deep learning model CDVNet, as follows:
Step 5.1, normalize the DCT residual coefficients of the compressed video image X so that the normalized data are distributed around the value 0;
Step 5.2, input the DCT residual coefficients normalized in step 5.1 into the compressed-domain video image saliency detection deep learning model CDVNet trained in step 1 to obtain the global saliency feature map GSFI of frame r of the compressed video image X;
Step 6, fuse and enhance the local saliency feature map SRDCN and the global saliency feature map GSFI of frame r, as follows:
Step 6.1, fuse the local saliency feature map SRDCN of frame r obtained in step 4.5 with the global saliency feature map GSFI of frame r obtained in step 5.2 according to the following formula to obtain the fused saliency map S_fuse of frame r:
S_fuse = Norm(α·GSFI + β·SRDCN + γ·SRDCN ⊙ GSFI);
where Norm(·) denotes normalization to the interval [0, 1] and ⊙ denotes the element-wise (Hadamard) product; α = QP/(3·l_QP) and β = 2·(1 − (QP − 3)/(3·l_QP)), while the definition of γ survives only as an image in the source; here QP and l_QP are the quantization parameter and the number of quantization parameters obtained by partially decoding the compressed video image;
Step 6.2, apply saliency enhancement and non-saliency suppression to the fused saliency map S_fuse of frame r with a center saliency map based on a Gaussian model, obtaining the center map S_central over the image positions of the fused feature values:
S_central(x_i, y_i) = exp(−((x_i − x_c)²/(2σ_x²) + (y_i − y_c)²/(2σ_y²)));
where x_i and y_i denote the position in the image corresponding to a macroblock; σ_x and σ_y are set from the number of macroblocks in each row and each column of the video frame, respectively (the exact scale factors survive only as images in the source); and x_c and y_c denote the means of the coordinates of the first 10 maxima of S_fuse, i.e. x_c = (1/10)·Σ_(i=1..10) x_i and y_c = (1/10)·Σ_(i=1..10) y_i, with the fused saliency feature values S_fuse(x_i, y_i) sorted so that S_fuse(x_1, y_1) ≥ S_fuse(x_2, y_2) ≥ … ≥ S_fuse(x_N, y_N);
Step 6.3, combine the fused saliency map S_fuse of frame r obtained in step 6.1 with the center map obtained in step 6.2 according to the following formula to obtain the final saliency map S_r of frame r:
S_r = S_fuse ⊙ S_central;
Step 6.4, add 1 to the frame index r and judge whether the incremented index equals the total number of video frames R; if yes, execute step 7; otherwise, execute step 4.1;
Step 7, construct the region-of-interest R-λ model, as follows:
Step 7.1, initialize the GOP index g, the video frame index f within each GOP, and the coding unit (CU) index k within each frame of the compressed video image X obtained in step 3 to 1;
Step 7.2, using the final saliency map S_r of frame r obtained in step 6.3, reallocate the target bit number T_G to each GOP of the compressed video image X partially decoded in step 3 (the allocation formula itself survives only as an image in the source), where T_G is the target number of bits allocated to the g-th GOP, R_u is the target code rate, fps is the video frame rate, offset defaults to 0.75, γ is the ROI ratio, and N_GSFI is the number of salient macroblocks in the GOP; the saliency-dependent weighting factor varies between 0.75 and 1.75;
Step 7.3, obtain the target bit number T_F of the f-th frame according to the following formula:
T_F = (T_G − R_GOPcoded) · ω_f / Σ_(i ∈ NotCoded) ω_i
(the standard R-λ frame-level form, reconstructed from the definitions that follow), where T_F is the target bit number of the current frame, R_GOPcoded is the number of bits the current GOP has already consumed, ω_i is the frame-level bit-allocation weight adjusted according to the target bits, the coding structure, and the characteristics of the coded frames, and NotCoded is the set of frames not yet coded;
Step 7.4, obtain the target bit number T_CU of the k-th coding unit CU according to the following formula:
T_CU = P_CU · T_F,
where P_CU is the probability value derived from the normalized RDCN feature value of the macroblock in each frame;
Step 7.5, calculate the quantization parameter QP value and the λ value of the k-th coding unit CU from the R-λ model, specifically:
λ = α × bpp^β,
QP = C1 · ln λ + C2,
where α and β are parameters related to the characteristics of the sequence content with initial values 3.2005 and −1.367, continuously updated in adaptation to the content, and C1 = 4.2005, C2 = 13.7122;
Step 7.6, add 1 to the coding unit index k and judge whether the incremented index equals the total number of coding units K; if yes, execute step 7.7; otherwise, execute step 7.4;
Step 7.7, add 1 to the video frame index f and judge whether the incremented index equals the number of video frames F in the GOP; if yes, execute step 7.8; otherwise, execute step 7.3;
Step 7.8, add 1 to the GOP index g and judge whether the incremented index equals the total number of GOPs G; if yes, execute step 8; otherwise, execute step 7.1;
Step 8, re-encode the video image with HEVC (High Efficiency Video Coding), using the updated quantization parameter of each coding unit.
The HEVC coding technique of step 8 follows the international standard H.265, established in 2013.
The invention has the following beneficial effects:
First, the invention extracts saliency features in the compressed domain, performing saliency detection on the compressed bitstream using the data obtained by partial decoding. This overcomes the drawback of prior-art pixel-domain saliency detection, which must fully decompress the compressed video to the pixel domain before feature extraction and saliency detection can begin, and it has the advantages of low computational cost and low time consumption.
Second, because a deep convolutional neural network is used, the constructed and trained network model CDVNet extracts high-level saliency features from the bitstream. This overcomes the limitation of traditional detection methods, whose notion of saliency rests only on visual cues such as luminance, chrominance, and edges; the model can extract high-level image features and handles the deep characterization of scene saliency well.
Third, because the invention adopts an improved algorithm based on the R-λ model, quantization step sizes of different magnitudes are assigned, via the quantization parameters in the model, to the salient and non-salient regions of the fused feature map, achieving a rational allocation of the bit rate. This mitigates video distortion and the degradation of perceived video quality, yields good coding performance, and achieves better subjective quality at a higher compression ratio.
Drawings
To more clearly illustrate the embodiments of the present invention and the technical solutions of the prior art, the drawings used in describing them are briefly introduced below. The drawings described below show some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The technical solutions of the present invention are described below clearly and completely with reference to the accompanying drawings. The described embodiments are some, but not all, embodiments of the present invention; all other embodiments obtained by those skilled in the art without creative effort shall fall within the protection scope of the present invention.
As shown in FIG. 1, the invention relates to a compressed video stream re-encoding method based on deep learning and saliency perception, which comprises the following steps:
Step 1, construct and train a compressed-domain video image saliency detection deep learning model, specifically as follows:
Step 1.1, batch-normalize the discrete cosine transform (DCT) residual coefficients of the compressed-domain video images used for training and the corresponding video image saliency maps;
Step 1.2, take a ResNeXt network as the feature extraction network and construct the compressed-domain video image saliency detection model CDVNet with the loss function of the feature extraction network; specifically, the loss function is
loss = −Σ_(i,j) [ α·(1 − S(i,j))^γ·G(i,j)·log S(i,j) + (1 − α)·S(i,j)^γ·(1 − G(i,j))·log(1 − S(i,j)) ]
(a focal-loss form reconstructed from the parameter descriptions that follow; the original formula survives only as an image in the source),
where G(i,j) = 1 indicates that the image position corresponding to the residual DCT macroblock in row i, column j is salient, G(i,j) = 0 indicates that it is not, and S(i,j) is the probability that the residual DCT coefficient in row i, column j is predicted to be salient; α = 0.5 balances the uneven proportion of positive and negative samples, and γ = 2 adjusts the rate at which easy samples are down-weighted;
Step 1.3, feed the batch-normalized DCT residual coefficients of the compressed-domain video images and the corresponding video image saliency maps into the deep learning model CDVNet and train it with the stochastic optimization algorithm Adam, using training batch size Batch = 64, momentum Momentum = 0.9, and initial learning rate lr = 0.001; train for Epoch = 200 epochs to obtain the trained compressed-domain video image saliency detection deep learning model CDVNet.
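For concreteness, here is a PyTorch sketch of the training just described, pairing a focal-style loss (matching the α and γ roles of step 1.2) with the Adam settings of step 1.3; the stand-in architecture, tensor shapes, and names are assumptions, since the patent specifies only a ResNeXt backbone:

```python
import torch
import torch.nn as nn

def focal_loss(S, G, alpha=0.5, gamma=2.0, eps=1e-7):
    """Focal-style loss matching the alpha/gamma roles described in step 1.2:
    alpha balances positive/negative samples, gamma down-weights easy samples."""
    S = S.clamp(eps, 1 - eps)                            # guard against log(0)
    pos = -alpha * (1 - S) ** gamma * G * torch.log(S)   # salient positions
    neg = -(1 - alpha) * S ** gamma * (1 - G) * torch.log(1 - S)  # non-salient
    return (pos + neg).sum()

class TinyCDVNet(nn.Module):
    """Stand-in for CDVNet: the patent specifies a ResNeXt backbone but no
    further architecture, so one conv layer keeps the sketch runnable."""
    def __init__(self):
        super().__init__()
        self.head = nn.Conv2d(64, 1, kernel_size=1)      # 64 DCT channels -> saliency
    def forward(self, x):
        return torch.sigmoid(self.head(x))

model = TinyCDVNet()
# Adam with lr = 0.001; beta1 = 0.9 plays the role of Momentum = 0.9
opt = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
x = torch.randn(64, 64, 32, 32)                  # Batch = 64 normalized DCT residual maps
g = (torch.rand(64, 1, 32, 32) > 0.5).float()    # toy saliency ground truth
for epoch in range(200):                          # Epoch = 200
    opt.zero_grad()
    loss = focal_loss(model(x), g)
    loss.backward()
    opt.step()
```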
Step 2, input the compressed video image X to be re-encoded into the compressed-domain video image saliency detection deep learning model CDVNet trained in step 1.
Step 3, partially decode the compressed video image X to be re-encoded using the compressed-domain video image saliency detection deep learning model CDVNet; specifically, partially decoding the compressed video image X to be re-encoded yields:
the prediction-residual DCT coefficients of each frame of the compressed video image X to be re-encoded;
the height H and width W of the video frame images;
the quantization parameter QP and the number of quantization parameters l_QP;
the number of groups of pictures (GOPs) G of the compressed video image X to be re-encoded, the number of video frames F in each GOP, the number K of coding units (CUs) contained in each frame, and the total number of frames R of the video image.
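For concreteness, the quantities listed above can be carried in a simple record such as the Python sketch below; the structure and field names are illustrative, since the patent names the quantities but prescribes no data structure:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class PartialDecode:
    """Quantities obtained by partially decoding the compressed stream X."""
    residual_dct: np.ndarray   # per-frame prediction-residual DCT coefficients
    height: int                # H, frame height
    width: int                 # W, frame width
    qp: int                    # quantization parameter QP
    n_qp: int                  # number of quantization parameters l_QP
    n_gop: int                 # number of GOPs G
    frames_per_gop: int        # F, frames per GOP
    cus_per_frame: int         # K, coding units per frame
    total_frames: int          # R, total frame count
```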
Step 4, extract the local saliency features of the compressed video image X partially decoded in step 3; specifically:
Step 4.1, initialize the frame index r of the video frame images of the partially decoded compressed video image X to 1;
Step 4.2, compute the norm of the quantized prediction-residual DCT coefficients of each macroblock of frame r to obtain the RDCN feature map, specifically:
RDCN(i,j) = ‖Q(D_(i,j))‖,
where RDCN(i,j) is the norm of the quantized prediction-residual DCT coefficient block D_(i,j) of the macroblock in row i, column j (the original formula, which survives only as an image in the source, also involves the macroblock's motion vector);
Step 4.3, apply max-min normalization to the RDCN feature map of frame r obtained in step 4.2;
Step 4.4, convolve the max-min-normalized RDCN feature map from step 4.3 with a 3 × 3 Gaussian filter to achieve spatial filtering;
Step 4.5, apply motion median filtering over the previous frames to the spatially filtered feature map from step 4.4 to obtain the local saliency feature map SRDCN of frame r; specifically:
SRDCN_r(i,j) = Med[RDCN′_(r−t)(i,j)], t ∈ {1, 2, …, r−2},
where Med[·] is the median of the spatially filtered feature values of the prior frames and RDCN′_(r−t)(i,j) is the spatially filtered RDCN feature value of the macroblock in row i, column j of frame r−t;
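A NumPy sketch of steps 4.2 through 4.5 under the reconstruction above — a per-macroblock norm, max-min normalization, 3 × 3 Gaussian smoothing, and a temporal median over the spatially filtered maps of prior frames; the L2 norm, zero padding, and array shapes are assumptions:

```python
import numpy as np

def rdcn_map(residual_dct):
    """One norm per macroblock; residual_dct has shape (rows, cols, 8, 8)."""
    flat = residual_dct.reshape(residual_dct.shape[0], residual_dct.shape[1], -1)
    return np.linalg.norm(flat, axis=-1)          # assumed L2 norm

def min_max(x):
    """Step 4.3: max-min normalization to [0, 1]."""
    return (x - x.min()) / (x.max() - x.min() + 1e-12)

def gaussian_3x3(x):
    """Step 4.4: spatial filtering with a 3x3 Gaussian kernel (zero padding assumed)."""
    k = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]], dtype=float) / 16.0
    p = np.pad(x, 1)
    out = np.empty_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(p[i:i + 3, j:j + 3] * k)
    return out

def srdcn(filtered_history):
    """Step 4.5: temporal median over the spatially filtered RDCN maps of prior frames."""
    return np.median(np.stack(filtered_history, axis=0), axis=0)
```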
Step 5, extract the high-level saliency features of the compressed video image X with the compressed-domain video image saliency detection deep learning model CDVNet, as follows:
Step 5.1, normalize the DCT residual coefficients of the compressed video image X so that the normalized data are distributed around the value 0;
Step 5.2, input the DCT residual coefficients normalized in step 5.1 into the compressed-domain video image saliency detection deep learning model CDVNet trained in step 1 to obtain the global saliency feature map GSFI of frame r of the compressed video image X;
Step 6, fuse and enhance the local saliency feature map SRDCN and the global saliency feature map GSFI of frame r, as follows:
Step 6.1, fuse the local saliency feature map SRDCN of frame r obtained in step 4.5 with the global saliency feature map GSFI of frame r obtained in step 5.2 according to the following formula to obtain the fused saliency map S_fuse of frame r:
S_fuse = Norm(α·GSFI + β·SRDCN + γ·SRDCN ⊙ GSFI);
where Norm(·) denotes normalization to the interval [0, 1] and ⊙ denotes the element-wise (Hadamard) product; α = QP/(3·l_QP) and β = 2·(1 − (QP − 3)/(3·l_QP)), while the definition of γ survives only as an image in the source; here QP and l_QP are the quantization parameter and the number of quantization parameters obtained by partially decoding the compressed video image;
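A sketch of the fusion in step 6.1; the α and β weights follow the formulas above, while the γ weight survives only as an image in the source, so a unit placeholder is used here:

```python
import numpy as np

def fuse_saliency(gsfi, srdcn_map, qp, l_qp, gamma=1.0):
    """S_fuse = Norm(a*GSFI + b*SRDCN + gamma*SRDCN (.) GSFI), (.) element-wise.
    gamma = 1.0 is a placeholder for the unrecoverable weight."""
    a = qp / (3.0 * l_qp)
    b = 2.0 * (1.0 - (qp - 3.0) / (3.0 * l_qp))
    s = a * gsfi + b * srdcn_map + gamma * srdcn_map * gsfi
    return (s - s.min()) / (s.max() - s.min() + 1e-12)   # Norm(.) onto [0, 1]
```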
Step 6.2, apply saliency enhancement and non-saliency suppression to the fused saliency map S_fuse of frame r with a center saliency map based on a Gaussian model, obtaining the center map S_central over the image positions of the fused feature values:
S_central(x_i, y_i) = exp(−((x_i − x_c)²/(2σ_x²) + (y_i − y_c)²/(2σ_y²)));
where x_i and y_i denote the position in the image corresponding to a macroblock; σ_x and σ_y are set from the number of macroblocks in each row and each column of the video frame, respectively (the exact scale factors survive only as images in the source); and x_c and y_c denote the means of the coordinates of the first 10 maxima of S_fuse, i.e. x_c = (1/10)·Σ_(i=1..10) x_i and y_c = (1/10)·Σ_(i=1..10) y_i, with the fused saliency feature values S_fuse(x_i, y_i) sorted so that S_fuse(x_1, y_1) ≥ S_fuse(x_2, y_2) ≥ … ≥ S_fuse(x_N, y_N);
Step 6.3, combine the fused saliency map S_fuse of frame r obtained in step 6.1 with the center map obtained in step 6.2 according to the following formula to obtain the final saliency map S_r of frame r:
S_r = S_fuse ⊙ S_central;
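Steps 6.2 and 6.3 can be sketched as follows; the mean of the top-10 response coordinates follows the text, while the Gaussian scales (set here to a third of the macroblock grid size) are an assumption, since the originals survive only as images:

```python
import numpy as np

def center_map(s_fuse, top_k=10):
    """Gaussian center-saliency map around the mean coordinate of the top-k responses."""
    rows, cols = s_fuse.shape
    flat_order = np.argsort(s_fuse, axis=None)[-top_k:]      # indices of top-k responses
    ys, xs = np.unravel_index(flat_order, s_fuse.shape)
    xc, yc = xs.mean(), ys.mean()                            # mean top-k coordinates
    sx, sy = cols / 3.0, rows / 3.0                          # assumed Gaussian scales
    yy, xx = np.mgrid[0:rows, 0:cols]
    return np.exp(-((xx - xc) ** 2 / (2 * sx ** 2) + (yy - yc) ** 2 / (2 * sy ** 2)))

def final_saliency(s_fuse):
    return s_fuse * center_map(s_fuse)                       # S_r = S_fuse (.) S_central
```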
Step 6.4, adding 1 to the video frame serial number R of the R frame video frame image, and judging whether the video frame serial number added with 1 is equal to the total number R of the video frames; if yes, executing step 7, otherwise, executing step 4.1;
Step 7, construct the region-of-interest R-λ model, as follows:
Step 7.1, initialize the GOP index g, the video frame index f within each GOP, and the coding unit (CU) index k within each frame of the compressed video image X obtained in step 3 to 1;
Step 7.2, using the final saliency map S_r of frame r obtained in step 6.3, reallocate the target bit number T_G to each GOP of the compressed video image X partially decoded in step 3 (the allocation formula itself survives only as an image in the source),
where T_G is the target number of bits allocated to the g-th GOP, R_u is the target code rate, fps is the video frame rate, offset defaults to 0.75, γ is the ROI ratio, and N_GSFI is the number of salient macroblocks in the GOP; the saliency-dependent weighting factor varies between 0.75 and 1.75;
Step 7.3, obtain the target bit number T_F of the f-th frame according to the following formula:
T_F = (T_G − R_GOPcoded) · ω_f / Σ_(i ∈ NotCoded) ω_i
(the standard R-λ frame-level form, reconstructed from the definitions that follow), where T_F is the target bit number of the current frame, R_GOPcoded is the number of bits the current GOP has already consumed, ω_i is the frame-level bit-allocation weight adjusted according to the target bits, the coding structure, and the characteristics of the coded frames, and NotCoded is the set of frames not yet coded;
Step 7.4, obtain the target bit number T_CU of the k-th coding unit CU according to the following formula:
T_CU = P_CU · T_F,
where P_CU is the probability value derived from the normalized RDCN feature value of the macroblock in each frame;
Step 7.5, calculate the quantization parameter QP value and the λ value of the k-th coding unit CU from the R-λ model, specifically:
λ = α × bpp^β,
QP = C1 · ln λ + C2,
where α and β are parameters related to the characteristics of the sequence content with initial values 3.2005 and −1.367, continuously updated in adaptation to the content, and C1 = 4.2005, C2 = 13.7122;
Step 7.6, add 1 to the coding unit index k and judge whether the incremented index equals the total number of coding units K; if yes, execute step 7.7; otherwise, execute step 7.4;
Step 7.7, add 1 to the video frame index f and judge whether the incremented index equals the number of video frames F in the GOP; if yes, execute step 7.8; otherwise, execute step 7.3;
Step 7.8, add 1 to the GOP index g and judge whether the incremented index equals the total number of GOPs G; if yes, execute step 8; otherwise, execute step 7.1;
Step 8, re-encode the video image with HEVC (High Efficiency Video Coding), using the updated quantization parameter of each coding unit.
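To make the rate-control chain of steps 7.3 through 7.5 concrete, the sketch below combines the reconstructed frame-level allocation, the CU-level allocation T_CU = P_CU · T_F, and the λ/QP mapping; weight handling and argument shapes are assumptions:

```python
import math
import numpy as np

ALPHA, BETA = 3.2005, -1.367     # initial R-lambda model parameters (step 7.5)
C1, C2 = 4.2005, 13.7122

def frame_target_bits(t_gop, bits_consumed, weights, f):
    """Step 7.3 (reconstructed standard R-lambda form): remaining GOP bits,
    weighted by frame f's weight over the not-yet-coded frames f, f+1, ..."""
    return (t_gop - bits_consumed) * weights[f] / np.sum(weights[f:])

def cu_target_bits(t_frame, p_cu):
    """Step 7.4: T_CU = P_CU * T_F, with P_CU the CU's normalized-RDCN probability."""
    return p_cu * t_frame

def cu_lambda_qp(t_cu, n_pixels):
    """Step 7.5: lambda = alpha * bpp^beta and QP = C1*ln(lambda) + C2
    (the standard HEVC R-lambda relation, consistent with C1 and C2 above)."""
    bpp = t_cu / n_pixels            # bits per pixel of this CU
    lam = ALPHA * bpp ** BETA
    qp = C1 * math.log(lam) + C2
    return lam, int(round(qp))
```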
First, the invention extracts saliency features in the compressed domain, performing saliency detection on the compressed bitstream using the data obtained by partial decoding. This overcomes the drawback of prior-art pixel-domain saliency detection, which must fully decompress the compressed video to the pixel domain before feature extraction and saliency detection can begin, and it has the advantages of low computational cost and low time consumption.
Second, because a deep convolutional neural network is used, the constructed and trained network model CDVNet extracts high-level saliency features from the bitstream. This overcomes the limitation of traditional detection methods, whose notion of saliency rests only on visual cues such as luminance, chrominance, and edges; the model can extract high-level image features and handles the deep characterization of scene saliency well.
Third, because the invention adopts an improved algorithm based on the R-λ model, quantization step sizes of different magnitudes are assigned, via the quantization parameters in the model, to the salient and non-salient regions of the fused feature map, achieving a rational allocation of the bit rate. This mitigates video distortion and the degradation of perceived video quality, yields good coding performance, and achieves better subjective quality at a higher compression ratio.
Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solution of the present invention. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described therein may still be modified, or some or all of their technical features equivalently replaced, without departing from the scope of the technical solutions of the embodiments of the present invention.

Claims (2)

1. A compressed video stream re-encoding method based on deep learning and saliency perception, characterized by comprising the following steps:
Step 1, construct and train a compressed-domain video image saliency detection deep learning model, specifically as follows:
Step 1.1, batch-normalize the discrete cosine transform (DCT) residual coefficients of the compressed-domain video images used for training and the corresponding video image saliency maps;
Step 1.2, take a ResNeXt network as the feature extraction network and construct the compressed-domain video image saliency detection model CDVNet with the loss function of the feature extraction network; specifically, the loss function is
loss = −Σ_(i,j) [ α·(1 − S(i,j))^γ·G(i,j)·log S(i,j) + (1 − α)·S(i,j)^γ·(1 − G(i,j))·log(1 − S(i,j)) ]
(a focal-loss form reconstructed from the parameter descriptions that follow; the original formula survives only as an image in the source),
where G(i,j) = 1 indicates that the image position corresponding to the residual DCT macroblock in row i, column j is salient, G(i,j) = 0 indicates that it is not, and S(i,j) is the probability that the residual DCT coefficient in row i, column j is predicted to be salient; α = 0.5 balances the uneven proportion of positive and negative samples, and γ = 2 adjusts the rate at which easy samples are down-weighted;
Step 1.3, feed the batch-normalized DCT residual coefficients of the compressed-domain video images and the corresponding video image saliency maps into the deep learning model CDVNet and train it with the stochastic optimization algorithm Adam, using training batch size Batch = 64, momentum Momentum = 0.9, and initial learning rate lr = 0.001; train for Epoch = 200 epochs to obtain the trained compressed-domain video image saliency detection deep learning model CDVNet.
Step 2, input the compressed video image X to be re-encoded into the compressed-domain video image saliency detection deep learning model CDVNet trained in step 1.
Step 3, partially decode the compressed video image X to be re-encoded using the compressed-domain video image saliency detection deep learning model CDVNet; specifically, partially decoding the compressed video image X to be re-encoded yields:
the prediction-residual DCT coefficients of each frame of the compressed video image X to be re-encoded;
the height H and width W of the video frame images;
the quantization parameter QP and the number of quantization parameters l_QP;
the number of groups of pictures (GOPs) G of the compressed video image X to be re-encoded, the number of video frames F in each GOP, the number K of coding units (CUs) contained in each frame, and the total number of frames R of the video image.
Step 4, extract the local saliency features of the compressed video image X partially decoded in step 3; specifically:
Step 4.1, initialize the frame index r of the video frame images of the partially decoded compressed video image X to 1;
Step 4.2, compute the norm of the quantized prediction-residual DCT coefficients of each macroblock of frame r to obtain the RDCN feature map, specifically:
RDCN(i,j) = ‖Q(D_(i,j))‖,
where RDCN(i,j) is the norm of the quantized prediction-residual DCT coefficient block D_(i,j) of the macroblock in row i, column j (the original formula, which survives only as an image in the source, also involves the macroblock's motion vector);
Step 4.3, apply max-min normalization to the RDCN feature map of frame r obtained in step 4.2;
Step 4.4, convolve the max-min-normalized RDCN feature map from step 4.3 with a 3 × 3 Gaussian filter to achieve spatial filtering;
Step 4.5, apply motion median filtering over the previous frames to the spatially filtered feature map from step 4.4 to obtain the local saliency feature map SRDCN of frame r; specifically:
SRDCN_r(i,j) = Med[RDCN′_(r−t)(i,j)], t ∈ {1, 2, …, r−2},
where Med[·] is the median of the spatially filtered feature values of the prior frames and RDCN′_(r−t)(i,j) is the spatially filtered RDCN feature value of the macroblock in row i, column j of frame r−t;
Step 5, extract the high-level saliency features of the compressed video image X with the compressed-domain video image saliency detection deep learning model CDVNet, as follows:
Step 5.1, normalize the DCT residual coefficients of the compressed video image X so that the normalized data are distributed around the value 0;
Step 5.2, input the DCT residual coefficients normalized in step 5.1 into the compressed-domain video image saliency detection deep learning model CDVNet trained in step 1 to obtain the global saliency feature map GSFI of frame r of the compressed video image X;
Step 6, fuse and enhance the local saliency feature map SRDCN and the global saliency feature map GSFI of frame r, as follows:
Step 6.1, fuse the local saliency feature map SRDCN of frame r obtained in step 4.5 with the global saliency feature map GSFI of frame r obtained in step 5.2 according to the following formula to obtain the fused saliency map S_fuse of frame r:
S_fuse = Norm(α·GSFI + β·SRDCN + γ·SRDCN ⊙ GSFI);
where Norm(·) denotes normalization to the interval [0, 1] and ⊙ denotes the element-wise (Hadamard) product; α = QP/(3·l_QP) and β = 2·(1 − (QP − 3)/(3·l_QP)), while the definition of γ survives only as an image in the source; here QP and l_QP are the quantization parameter and the number of quantization parameters obtained by partially decoding the compressed video image;
Step 6.2, apply saliency enhancement and non-saliency suppression to the fused saliency map S_fuse of frame r with a center saliency map based on a Gaussian model, obtaining the center map S_central over the image positions of the fused feature values:
S_central(x_i, y_i) = exp(−((x_i − x_c)²/(2σ_x²) + (y_i − y_c)²/(2σ_y²)));
where x_i and y_i denote the position in the image corresponding to a macroblock; σ_x and σ_y are set from the number of macroblocks in each row and each column of the video frame, respectively (the exact scale factors survive only as images in the source); and x_c and y_c denote the means of the coordinates of the first 10 maxima of S_fuse, i.e. x_c = (1/10)·Σ_(i=1..10) x_i and y_c = (1/10)·Σ_(i=1..10) y_i, with the fused saliency feature values S_fuse(x_i, y_i) sorted so that S_fuse(x_1, y_1) ≥ S_fuse(x_2, y_2) ≥ … ≥ S_fuse(x_N, y_N);
Step 6.3, combine the fused saliency map S_fuse of frame r obtained in step 6.1 with the center map obtained in step 6.2 according to the following formula to obtain the final saliency map S_r of frame r:
S_r = S_fuse ⊙ S_central;
Step 6.4, add 1 to the frame index r and judge whether the incremented index equals the total number of video frames R; if yes, execute step 7; otherwise, execute step 4.1;
Step 7, construct the region-of-interest R-λ model, as follows:
Step 7.1, initialize the GOP index g, the video frame index f within each GOP, and the coding unit (CU) index k within each frame of the compressed video image X obtained in step 3 to 1;
Step 7.2, using the final saliency map S_r of frame r obtained in step 6.3, reallocate the target bit number T_G to each GOP of the compressed video image X partially decoded in step 3 (the allocation formula itself survives only as an image in the source), where T_G is the target number of bits allocated to the g-th GOP, R_u is the target code rate, fps is the video frame rate, offset defaults to 0.75, γ is the ROI ratio, and N_GSFI is the number of salient macroblocks in the GOP; the saliency-dependent weighting factor varies between 0.75 and 1.75;
Step 7.3, obtain the target bit number T_F of the f-th frame according to the following formula:
T_F = (T_G − R_GOPcoded) · ω_f / Σ_(i ∈ NotCoded) ω_i
(the standard R-λ frame-level form, reconstructed from the definitions that follow), where T_F is the target bit number of the current frame, R_GOPcoded is the number of bits the current GOP has already consumed, ω_i is the frame-level bit-allocation weight adjusted according to the target bits, the coding structure, and the characteristics of the coded frames, and NotCoded is the set of frames not yet coded;
Step 7.4, obtain the target bit number T_CU of the k-th coding unit CU according to the following formula:
T_CU = P_CU · T_F,
where P_CU is the probability value derived from the normalized RDCN feature value of the macroblock in each frame;
Step 7.5, calculate the quantization parameter QP value and the λ value of the k-th coding unit CU from the R-λ model, specifically:
λ = α × bpp^β,
QP = C1 · ln λ + C2,
where α and β are parameters related to the characteristics of the sequence content with initial values 3.2005 and −1.367, continuously updated in adaptation to the content, and C1 = 4.2005, C2 = 13.7122;
Step 7.6, add 1 to the coding unit index k and judge whether the incremented index equals the total number of coding units K; if yes, execute step 7.7; otherwise, execute step 7.4;
Step 7.7, add 1 to the video frame index f and judge whether the incremented index equals the number of video frames F in the GOP; if yes, execute step 7.8; otherwise, execute step 7.3;
Step 7.8, add 1 to the GOP index g and judge whether the incremented index equals the total number of GOPs G; if yes, execute step 8; otherwise, execute step 7.1;
Step 8, re-encode the video image with HEVC (High Efficiency Video Coding), using the updated quantization parameter of each coding unit.
2. The method of claim 1, characterized in that the HEVC coding technique described in step 8 follows the international standard H.265, established in 2013.
CN202010394906.1A 2020-05-11 2020-05-11 Compressed video stream recoding method based on deep learning and significance perception Active CN111726633B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010394906.1A CN111726633B (en) 2020-05-11 2020-05-11 Compressed video stream recoding method based on deep learning and significance perception

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010394906.1A CN111726633B (en) 2020-05-11 2020-05-11 Compressed video stream recoding method based on deep learning and significance perception

Publications (2)

Publication Number Publication Date
CN111726633A 2020-09-29
CN111726633B CN111726633B (en) 2021-03-26

Family

ID=72564323

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010394906.1A Active CN111726633B (en) 2020-05-11 2020-05-11 Compressed video stream recoding method based on deep learning and significance perception

Country Status (1)

Country Link
CN (1) CN111726633B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112399176A (en) * 2020-11-17 2021-02-23 深圳大学 Video coding method and device, computer equipment and storage medium
CN112399177A (en) * 2020-11-17 2021-02-23 深圳大学 Video coding method and device, computer equipment and storage medium
CN113242433A (en) * 2021-04-27 2021-08-10 中国科学院国家空间科学中心 Image compression method and image compression system based on ARM multi-core heterogeneous processor
CN113660498A (en) * 2021-10-20 2021-11-16 康达洲际医疗器械有限公司 Inter-frame image universal coding method and system based on significance detection
CN113709464A (en) * 2021-09-01 2021-11-26 展讯通信(天津)有限公司 Video coding method and related device
WO2022073159A1 (en) * 2020-10-07 2022-04-14 浙江大学 Feature data encoding method, apparatus and device, feature data decoding method, apparatus and device, and storage medium
CN114786011A (en) * 2022-06-22 2022-07-22 苏州浪潮智能科技有限公司 JPEG image compression method, system, equipment and storage medium
CN114866784A (en) * 2022-04-19 2022-08-05 东南大学 Vehicle detection method based on compressed video DCT (discrete cosine transformation) coefficients
WO2022206960A1 (en) * 2021-03-29 2022-10-06 京东方科技集团股份有限公司 Video transcoding method and system, and electronic device
CN115314722A (en) * 2022-06-17 2022-11-08 百果园技术(新加坡)有限公司 Video code rate distribution method, system, equipment and storage medium
CN116847101A (en) * 2023-09-01 2023-10-03 易方信息科技股份有限公司 Video bit rate ladder prediction method, system and equipment based on transform network

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107437096A (en) * 2017-07-28 2017-12-05 北京大学 Image classification method based on the efficient depth residual error network model of parameter
US20180240221A1 (en) * 2017-02-17 2018-08-23 Cogisen S.R.L. Method for image processing and video compression
CN109118469A (en) * 2018-06-20 2019-01-01 国网浙江省电力有限公司 Prediction technique for saliency
CN109309834A (en) * 2018-11-21 2019-02-05 北京航空航天大学 Video-frequency compression method based on convolutional neural networks and the significant information of HEVC compression domain
CN109451310A (en) * 2018-11-21 2019-03-08 北京航空航天大学 A kind of Rate-distortion optimization method and device based on significance weighted
CN109547803A (en) * 2018-11-21 2019-03-29 北京航空航天大学 A kind of detection of time-space domain conspicuousness and fusion method
CN109859166A (en) * 2018-12-26 2019-06-07 上海大学 It is a kind of based on multiple row convolutional neural networks without ginseng 3D rendering method for evaluating quality
CN110135435A (en) * 2019-04-17 2019-08-16 上海师范大学 A kind of conspicuousness detection method and device based on range learning system
CN111028153A (en) * 2019-12-09 2020-04-17 南京理工大学 Image processing and neural network training method and device and computer equipment
CN111083477A (en) * 2019-12-11 2020-04-28 北京航空航天大学 HEVC (high efficiency video coding) optimization algorithm based on visual saliency

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180240221A1 (en) * 2017-02-17 2018-08-23 Cogisen S.R.L. Method for image processing and video compression
CN107437096A (en) * 2017-07-28 2017-12-05 北京大学 Image classification method based on the efficient depth residual error network model of parameter
CN109118469A (en) * 2018-06-20 2019-01-01 国网浙江省电力有限公司 Prediction technique for saliency
CN109309834A (en) * 2018-11-21 2019-02-05 北京航空航天大学 Video-frequency compression method based on convolutional neural networks and the significant information of HEVC compression domain
CN109451310A (en) * 2018-11-21 2019-03-08 北京航空航天大学 A kind of Rate-distortion optimization method and device based on significance weighted
CN109547803A (en) * 2018-11-21 2019-03-29 北京航空航天大学 A kind of detection of time-space domain conspicuousness and fusion method
CN109859166A (en) * 2018-12-26 2019-06-07 上海大学 It is a kind of based on multiple row convolutional neural networks without ginseng 3D rendering method for evaluating quality
CN110135435A (en) * 2019-04-17 2019-08-16 上海师范大学 A kind of conspicuousness detection method and device based on range learning system
CN111028153A (en) * 2019-12-09 2020-04-17 南京理工大学 Image processing and neural network training method and device and computer equipment
CN111083477A (en) * 2019-12-11 2020-04-28 北京航空航天大学 HEVC (high efficiency video coding) optimization algorithm based on visual saliency

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022073159A1 (en) * 2020-10-07 2022-04-14 浙江大学 Feature data encoding method, apparatus and device, feature data decoding method, apparatus and device, and storage medium
CN112399177A (en) * 2020-11-17 2021-02-23 深圳大学 Video coding method and device, computer equipment and storage medium
CN112399176A (en) * 2020-11-17 2021-02-23 深圳大学 Video coding method and device, computer equipment and storage medium
CN112399176B (en) * 2020-11-17 2022-09-16 深圳市创智升科技有限公司 Video coding method and device, computer equipment and storage medium
WO2022206960A1 (en) * 2021-03-29 2022-10-06 京东方科技集团股份有限公司 Video transcoding method and system, and electronic device
CN113242433A (en) * 2021-04-27 2021-08-10 中国科学院国家空间科学中心 Image compression method and image compression system based on ARM multi-core heterogeneous processor
CN113242433B (en) * 2021-04-27 2022-01-21 中国科学院国家空间科学中心 Image compression method and image compression system based on ARM multi-core heterogeneous processor
CN113709464A (en) * 2021-09-01 2021-11-26 展讯通信(天津)有限公司 Video coding method and related device
CN113660498A (en) * 2021-10-20 2021-11-16 康达洲际医疗器械有限公司 Inter-frame image universal coding method and system based on significance detection
CN113660498B (en) * 2021-10-20 2022-02-11 康达洲际医疗器械有限公司 Inter-frame image universal coding method and system based on significance detection
CN114866784A (en) * 2022-04-19 2022-08-05 东南大学 Vehicle detection method based on compressed video DCT (discrete cosine transformation) coefficients
CN115314722A (en) * 2022-06-17 2022-11-08 百果园技术(新加坡)有限公司 Video code rate distribution method, system, equipment and storage medium
CN115314722B (en) * 2022-06-17 2023-12-08 百果园技术(新加坡)有限公司 Video code rate distribution method, system, equipment and storage medium
WO2023241376A1 (en) * 2022-06-17 2023-12-21 广州市百果园信息技术有限公司 Video bitrate allocation method, system and device, and storage medium
CN114786011A (en) * 2022-06-22 2022-07-22 苏州浪潮智能科技有限公司 JPEG image compression method, system, equipment and storage medium
CN116847101A (en) * 2023-09-01 2023-10-03 易方信息科技股份有限公司 Video bit rate ladder prediction method, system and equipment based on transform network
CN116847101B (en) * 2023-09-01 2024-02-13 易方信息科技股份有限公司 Video bit rate ladder prediction method, system and equipment based on transform network

Also Published As

Publication number Publication date
CN111726633B (en) 2021-03-26

Similar Documents

Publication Publication Date Title
CN111726633B (en) Compressed video stream recoding method based on deep learning and significance perception
US5892548A (en) Adaptive quantizer with modification of high frequency coefficients
EP1867175B1 (en) Method for locally adjusting a quantization step
EP3179720B1 (en) Quantization method and apparatus in encoding/decoding
JP6141295B2 (en) Perceptually lossless and perceptually enhanced image compression system and method
US20070025621A1 (en) Coding device, coding method, decoding device, decoding method, and programs of same
WO2020238439A1 (en) Video quality-of-service enhancement method under restricted bandwidth of wireless ad hoc network
JP2002543693A (en) Quantization method and video compression device
CN103501438B (en) A kind of content-adaptive method for compressing image based on principal component analysis
US6934418B2 (en) Image data coding apparatus and image data server
CN111131828B (en) Image compression method and device, electronic equipment and storage medium
CN113301336A (en) Video coding method, device, equipment and medium
CN115604488A (en) Method and device for loop filtering
Kumar et al. Human visual system based enhanced AMBTC for color image compression using interpolation
JP3532470B2 (en) Techniques for video communication using coded matched filter devices.
CN116582685A (en) AI-based grading residual error coding method, device, equipment and storage medium
CN116916036A (en) Video compression method, device and system
US8139881B2 (en) Method for locally adjusting a quantization step and coding device implementing said method
CN112001854A (en) Method for repairing coded image and related system and device
CN111277835A (en) Monitoring video compression and decompression method combining yolo3 and flownet2 network
CN113194312B (en) Planetary science exploration image adaptive quantization coding system combined with visual saliency
CN112040231B (en) Video coding method based on perceptual noise channel model
Peng et al. An optimized algorithm based on generalized difference expansion method used for HEVC reversible video information hiding
Ning et al. Video Reversible Data Hiding Based on Motion Vector
CN112040246B (en) Low-delay low-complexity fixed code rate control method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant