CN116095328A - Video encoding method, model training method, apparatus, and storage medium - Google Patents

Video encoding method, model training method, apparatus, and storage medium

Info

Publication number
CN116095328A
CN116095328A
Authority
CN
China
Prior art keywords
video frame
coding
video
loss value
target loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111290503.3A
Other languages
Chinese (zh)
Inventor
孔德辉
徐科
宋剑军
任聪
易自尧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sanechips Technology Co Ltd
Original Assignee
Sanechips Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sanechips Technology Co Ltd filed Critical Sanechips Technology Co Ltd
Priority to CN202111290503.3A priority Critical patent/CN116095328A/en
Priority to PCT/CN2022/080753 priority patent/WO2023077707A1/en
Publication of CN116095328A publication Critical patent/CN116095328A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/115 Selection of the code volume for a coding unit prior to coding
    • H04N19/134 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/136 Incoming video signal characteristics or properties
    • H04N19/146 Data rate or code amount at the encoder output
    • H04N19/149 Data rate or code amount at the encoder output by estimating the code amount by means of a model, e.g. mathematical model or statistical model
    • H04N19/154 Measured or subjectively estimated visual quality after decoding, e.g. measurement of distortion
    • H04N19/169 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, the unit being an image region, e.g. an object
    • H04N19/176 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, the unit being an image region, the region being a block, e.g. a macroblock
    • H04N19/70 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by syntax aspects related to video coding, e.g. related to compression standards

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Algebra (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention provides a video coding method, a model training method, a device, and a storage medium. The model training method includes: extracting a current video frame from an original video sample and acquiring coding constraint parameters corresponding to the current video frame; obtaining a reconstructed video frame corresponding to the current video frame according to the current video frame and the coding constraint parameters; determining a target loss value according to the current video frame and the reconstructed video frame; and training a video coding model according to the target loss value until the video coding model meets a convergence condition. The embodiments of the invention train the video coding model on the video frames to be coded and their corresponding reconstructed frames, so that coding is performed on the global information of each video frame; this overcomes the limitation of classical codec schemes based on block partitioning and improves video coding quality under the same constraints.

Description

Video encoding method, model training method, apparatus, and storage medium
Technical Field
The present invention relates to the field of video processing technologies, and in particular, to a video encoding method, a model training method, a device, and a storage medium.
Background
Video coding starts from the high correlation of video signals and the visual characteristics of the human eye, and eliminates the redundancy arising from these correlations and perceptual characteristics through a suitable coding scheme, so as to compress the video signal and reduce the transmission bit rate.
Current video coding standards basically follow a block-based scheme: a video frame is divided into coding blocks, each block is intra- or inter-coded, and the coding residual is transformed and then compressed with a given quantization parameter (Quantization Parameter, QP), so that each frame is compressed block by block. Because these standard coding frameworks concentrate on shallow image information and local blocks, coding is not performed on the global information of the video frame, and coding efficiency is therefore limited.
Disclosure of Invention
The embodiments of the present invention provide a video coding method, a model training method, a device, and a storage medium, so as to improve coding performance.
In a first aspect, an embodiment of the present invention provides a model training method, where the method includes:
extracting a current video frame from an original video sample;
acquiring coding constraint parameters corresponding to the current video frame;
obtaining a reconstructed video frame corresponding to the current video frame according to the current video frame and the coding constraint parameter;
determining a target loss value according to the current video frame and the reconstructed video frame;
and training a video coding model according to the target loss value until the video coding model meets a convergence condition.
In a second aspect, an embodiment of the present invention provides a video encoding method, including:
acquiring an original video;
dividing the original video into a plurality of video frames;
and obtaining an output code stream according to each video frame, preset coding constraint parameters, and a video coding model obtained through training by the model training method of the first aspect.
In a third aspect, an embodiment of the present invention provides an electronic device, including: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the model training method as provided above in the first aspect or the video encoding method as provided above in the second aspect when the computer program is executed.
In a fourth aspect, an embodiment of the present invention provides a computer readable storage medium storing a computer program which, when executed by a processor, implements the model training method provided in the first aspect above, or the video encoding method provided in the second aspect above.
According to the scheme of the embodiments of the present invention, a current video frame is first extracted from an original video sample, and the coding constraint parameters corresponding to the current video frame are acquired; a reconstructed video frame corresponding to the current video frame is obtained according to the current video frame and the coding constraint parameters; a target loss value is determined according to the current video frame and the reconstructed video frame; and the video coding model is trained according to the target loss value until it meets a convergence condition. The embodiments of the present invention train the video coding model on the video frames to be coded and their corresponding reconstructed frames, so that coding is performed on the global information of each video frame; this overcomes the limitation of classical codec schemes based on block partitioning and improves video coding quality under the same constraints.
Drawings
Fig. 1 is a schematic diagram of the framework of a video coding model according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the architecture of a video coding model according to an embodiment of the present invention;
FIG. 3 is a flow chart illustrating steps of a model training method according to an embodiment of the present invention;
FIG. 4 is a schematic flowchart of the sub-steps of step S330 in FIG. 3 in the supervised training mode;
FIG. 5 is a schematic flowchart of the sub-steps of step S330 in FIG. 3 in the unsupervised training mode;
fig. 6 is a flowchart illustrating steps of a video encoding method according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following describes the technical solutions in the embodiments of the present invention with reference to the accompanying drawings. It is apparent that the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the present disclosure without creative effort fall within the scope of the present disclosure.
It should be appreciated that in the description of the embodiments of the present invention, terms such as "first" and "second" are used only to distinguish technical features and are not to be construed as indicating or implying relative importance, the number of the indicated technical features, or their precedence. "At least one" means one or more, and "a plurality" means two or more. "And/or" describes an association between objects and indicates that three relationships may exist: for example, "A and/or B" may mean that A exists alone, A and B exist together, or B exists alone, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects. "At least one of the following" and similar expressions mean any combination of the listed items, including any single item or any group of items. For example, "at least one of a, b, and c" may represent: a; b; c; a and b; a and c; b and c; or a, b, and c, where each of a, b, and c may be single or multiple.
The embodiment of the invention provides a video coding method, a model training method, equipment and a storage medium, which are used for coding based on global information of video frames so as to overcome the defect that a classical coding and decoding scheme is based on block division and improve the video coding quality under the same constraint.
First, the framework of the video coding model to which the video coding method of the embodiments of the present invention applies is described. Referring to Fig. 1, Fig. 1 is a schematic diagram of the framework of a video coding model according to an embodiment of the present invention. The video coding model of the embodiment of the present invention may include a coding element generation module and a coding module, where the coding element generation module is used to determine coding elements according to the video frame input to the model and the coding constraint parameters, and the coding module is used to code the current video frame according to the coding elements and output the coded video frame.
It should be appreciated that video encoding, as described in the embodiments of the present invention, generally refers to the process of compressing an original video into a bitstream to reduce the amount of data required to represent the video, thereby enabling more efficient storage and/or transmission. Video decoding is the reverse process, in which the video code stream is reconstructed to obtain the reconstructed video. Most coding standards use block-based adaptive inter/intra prediction, and the basic block unit of video coding is often referred to as a Coding Unit (CU). A CU is typically obtained by block partitioning of the video frames of the original video.
In the encoding process, a reference block is generally generated through spatial (intra) prediction or temporal (inter) prediction and subtracted from the current block (the block currently processed or to be processed) to obtain a residual block; the residual is transformed into a transform domain and quantized to reduce the amount of data to be transmitted. In inter coding, a reference block may be selected from other video frames, and the motion estimation unit provides a reference picture and/or a spatial offset between the position (X, Y coordinates) of the reference block and the position of the current block as inter prediction parameters. This offset is also called a Motion Vector (MV).
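As an illustrative aid (not part of the claimed scheme), the residual-quantization step of this classical block-based pipeline can be sketched in a few lines of Python; the transform stage is omitted for brevity, and the QP-to-step-size mapping below is an assumption borrowed from common H.26x practice rather than something specified here.

```python
import numpy as np

def block_residual_code(current_block: np.ndarray,
                        reference_block: np.ndarray,
                        qp: int) -> np.ndarray:
    """Quantize the prediction residual of one block.

    `current_block` and `reference_block` are 2-D luma arrays of equal
    shape; the QP-to-step mapping is an assumed H.26x-style convention
    in which the quantization step grows exponentially with QP.
    """
    residual = current_block.astype(np.int32) - reference_block.astype(np.int32)
    step = 2.0 ** ((qp - 4) / 6.0)         # assumed step-size mapping
    return np.round(residual / step)       # lossy step: this is where rate is saved

def block_reconstruct(quantized: np.ndarray,
                      reference_block: np.ndarray,
                      qp: int) -> np.ndarray:
    """Inverse-quantize and add the prediction back (decoder side)."""
    step = 2.0 ** ((qp - 4) / 6.0)
    return (quantized * step + reference_block).clip(0, 255)
```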
The embodiment of the invention constructs the coding element generation module of the video coding model based on the coding elements used in the coding process. Specifically, the encoding element generation module may include at least one among a CU partition sub-module, a QP determination sub-module, and an MV estimation sub-module.
Referring to Fig. 2, Fig. 2 is a schematic diagram of the architecture of a video coding model according to an embodiment of the present invention. In the example shown in Fig. 2, the video coding model adopts an architecture based on a Graph Neural Network (GNN). The coding element generation module includes a CU partitioning sub-module, a QP determination sub-module, and an MV estimation sub-module: the CU partitioning sub-module is used to output CU partitioning results based on the video data input to the model, the QP determination sub-module is used to output QP values based on that video data, and the MV estimation sub-module is used to output MV estimation results based on that video data. It should be appreciated that the video data input to the model includes the current (to-be-encoded) video frame and the coding constraint parameters; when the coding mode is inter-frame coding, the video data also includes a reference video frame.
The CU partitioning sub-module, the QP determination sub-module, and the MV estimation sub-module are interrelated: the CU partitioning sub-module further updates its own output based on the output of the QP determination sub-module; the QP determination sub-module further updates its own output based on the outputs of the CU partitioning sub-module and the MV estimation sub-module; and the MV estimation sub-module further updates its own output based on the output of the CU partitioning sub-module. Of course, in a particular implementation, the coding element generation module may include more or fewer sub-modules than in the example shown in Fig. 2.
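For illustration only, a minimal PyTorch sketch of such a coding element generation module is given below. The patent does not disclose a concrete network, so the layer sizes, the single round of message passing among the three sub-modules, and the tensor interfaces are all assumptions.

```python
import torch
import torch.nn as nn

class CodingElementGenerator(nn.Module):
    """Sketch of the coding element generation module of Fig. 2.

    The three heads exchange messages like nodes of a small graph:
    CU partitioning informs QP and MV, while QP and MV feed back into
    CU partitioning, as described in the text above.
    """

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(            # shared frame features
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.cu_head = nn.Linear(feat_dim + 1, feat_dim)   # +1: constraint scalar
        self.qp_head = nn.Linear(feat_dim + 1, feat_dim)
        self.mv_head = nn.Linear(feat_dim + 1, feat_dim)
        self.msg_qp_to_cu = nn.Linear(feat_dim, feat_dim)  # edges of the graph
        self.msg_cu_to_qp = nn.Linear(feat_dim, feat_dim)
        self.msg_mv_to_qp = nn.Linear(feat_dim, feat_dim)
        self.msg_cu_to_mv = nn.Linear(feat_dim, feat_dim)

    def forward(self, frame: torch.Tensor, constraint: torch.Tensor):
        # frame: (B, 3, H, W); constraint: (B, 1) coding constraint parameter
        f = self.backbone(frame)                   # (B, feat_dim)
        f = torch.cat([f, constraint], dim=1)      # append rate/mode parameter
        cu = torch.relu(self.cu_head(f))
        qp = torch.relu(self.qp_head(f))
        mv = torch.relu(self.mv_head(f))
        # one round of message passing along the edges described in the text
        cu = cu + self.msg_qp_to_cu(qp)
        qp = qp + self.msg_cu_to_qp(cu) + self.msg_mv_to_qp(mv)
        mv = mv + self.msg_cu_to_mv(cu)
        return cu, qp, mv
```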
A video coding model based on the GNN architecture can be deployed on a GNN accelerator to reduce the hardware cost of iteration. When new nodes or edges are introduced (equivalent to updating the codec scheme), the model can be recompiled through the tool chain and redeployed on a similar GNN accelerator, so that product iteration is possible without replacing the hardware encoder, which offers better flexibility.
Referring to fig. 3, fig. 3 illustrates a model training method according to an embodiment of the present invention, which can be applied to the video coding model shown in fig. 1 or fig. 2. The model training method provided by the embodiment of the invention comprises the following steps:
step S310, extracting a current video frame from an original video sample;
step S320, obtaining coding constraint parameters corresponding to the current video frame;
step S330, obtaining a reconstructed video frame corresponding to the current video frame according to the current video frame and the coding constraint parameter;
step S340, determining a target loss value according to the current video frame and the reconstructed video frame;
and step S350, training a video coding model according to the target loss value until the video coding model meets a convergence condition.
The model training method provided by the embodiments of the present invention may be executed by a server, which may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster composed of multiple servers, or as a single server. When the server is software, it may be implemented as multiple software modules (for example, to provide distributed services) or as a single software module. This is not specifically limited in the present invention.
In step S310, original (uncompressed) videos of different application scenarios may be collected to construct training samples. When there is an explicit deployment requirement (for example, a video cloud-conference or video-surveillance scenario), the proportion of original video samples from the corresponding scenario may be increased to improve the processing performance of the video coding model for videos of that scenario. It should be appreciated that each collected original video sample includes multiple video frames, and the video frame currently to be encoded is selected from them as the current video frame.
In step S320, the coding constraint parameters may be preconfigured by the user. Illustratively, the coding constraint parameters may include at least one of a coding mode indication parameter and a rate control parameter; for example, the coding mode indication parameter may represent intra coding with 0 and inter coding with 1, and the rate control parameter may use a specific numerical value to express the corresponding bit-rate requirement. It should be noted that the coding constraint parameters may also include other types of parameters besides those mentioned above, which is not limited by the embodiments of the present invention.
The coding constraint parameters indicate the coding mode to the video coding model, so the model can code flexibly based on them, which gives it stronger generality. In addition, introducing coding constraint parameters such as the coding mode indication parameter and the rate control parameter into the coding framework enables finer-grained coding control.
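A minimal sketch of how such constraint parameters might be packaged for the model follows; the field names and the two-element vector layout are assumptions for illustration.

```python
from dataclasses import dataclass

import torch

INTRA, INTER = 0, 1   # mode encoding taken from the example above

@dataclass
class CodingConstraints:
    """Coding constraint parameters passed to the model with each frame."""
    mode: int               # coding mode indication: 0 = intra, 1 = inter
    target_bitrate: float   # rate control requirement, e.g. in kbit/s

    def as_tensor(self) -> torch.Tensor:
        """Flatten into the numeric vector the model consumes."""
        return torch.tensor([float(self.mode), self.target_bitrate])

# example: request inter coding under a 2000 kbit/s budget
constraints = CodingConstraints(mode=INTER, target_bitrate=2000.0)
```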
In step S330, the reconstructed video frame corresponding to the current video frame may be determined by different method steps depending on the training mode adopted. The training modes here include a supervised training mode and an unsupervised training mode.
Referring to fig. 4, for example, in the supervised training mode, in step S330, the acquisition of the reconstructed video frame corresponding to the current video frame may be specifically implemented by the following method steps:
step S331, a first code stream sample is obtained, where the first code stream sample is obtained by pre-encoding the original video sample based on a preset video encoding standard.
Step S332, decoding the first code stream sample, and obtaining a reconstructed video frame corresponding to the current video frame according to a decoding result.
Illustratively, in step S331, the original video samples may be pre-encoded using the H.266 (Versatile Video Coding) standard to obtain an output code stream.
In step S332, the output code stream obtained in step S331 may be decoded based on the same video coding standard to obtain a decoding result, where the decoding result is the reconstructed video; the reconstructed video frame corresponding to the current video frame is then obtained from the reconstructed video.
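The pre-encode/decode step can be sketched as follows; `reference_encode` and `reference_decode` are hypothetical wrappers around a standard codec (for example an H.266 reference implementation), since the patent does not name one.

```python
def build_supervision_pairs(original_frames, reference_encode, reference_decode):
    """Pair each original frame with its standard-codec reconstruction.

    `reference_encode` maps a frame sequence to a bitstream (step S331)
    and `reference_decode` maps the bitstream back to reconstructed
    frames (step S332). Both are hypothetical callables.
    """
    bitstream = reference_encode(original_frames)     # first code stream sample
    reconstructed = reference_decode(bitstream)       # reconstructed video
    return list(zip(original_frames, reconstructed))  # (x, x') training pairs
```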
It will be appreciated that embodiments of the present invention train the video coding model based on a target loss value determined from the current video frame and the reconstructed video frame. In the supervised training mode, in step S340, the target loss value may be determined in the following three ways.
In a first way, a first target loss value is determined from a norm of a difference between the reconstructed video frame and the current video frame. Specifically, the first target loss value may be calculated by the following formula (1):
f1 = ||x′ - x||_1 (1)
In formula (1), x represents the current video frame, x′ represents the reconstructed video frame, ||·||_1 represents the L1 norm, and f1 represents the first target loss value.
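A direct PyTorch rendering of formula (1) might look as follows; the mean reduction over pixels is an assumption, as the text only specifies the L1 norm.

```python
import torch

def first_target_loss(x: torch.Tensor, x_rec: torch.Tensor) -> torch.Tensor:
    """Formula (1): f1 = ||x' - x||_1, with an assumed mean reduction."""
    return torch.mean(torch.abs(x_rec - x))
```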
In a second mode, inputting the current video frame into a preset discriminator network to obtain a first image quality discrimination result; inputting the reconstructed video frame into a preset discriminator network to obtain a second image quality discrimination result; and determining a second target loss value according to the norm of the difference between the second image quality judging result and the first image quality judging result.
The discriminator network may be the discriminator of a trained Generative Adversarial Network (GAN) or of a trained Enhanced Super-Resolution Generative Adversarial Network (ESRGAN); the embodiments of the present invention do not limit the specific type of discriminator.
In this second way, the discriminator is used to assess the image quality degradation of the reconstructed video frame relative to the original video frame: the current video frame is input into the discriminator network to obtain the first image quality discrimination result, the reconstructed video frame is input into the same network to obtain the second image quality discrimination result, and the target loss value determined from the two discrimination results serves as a measure of the coding quality.
Specifically, the second target loss value may be calculated by the following formula (2):
f2 = ||g(h′(h(x))) - g(x)||_1 (2)
In formula (2), x represents the current video frame, h represents the encoding function, h′ represents the decoding function, h′(h(x)) represents the reconstructed video frame, g represents the discriminator output, g(x) represents the first image quality discrimination result, g(h′(h(x))) represents the second image quality discrimination result, and f2 represents the second target loss value.
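Formula (2) could be sketched as follows, assuming a pretrained and frozen discriminator standing in for g.

```python
import torch

def second_target_loss(x: torch.Tensor, x_rec: torch.Tensor,
                       discriminator) -> torch.Tensor:
    """Formula (2): f2 = ||g(h'(h(x))) - g(x)||_1.

    `discriminator` plays the role of g and is assumed to be a
    pretrained, frozen GAN/ESRGAN discriminator; x_rec stands for
    h'(h(x)).
    """
    with torch.no_grad():                 # g(x): first quality result, no grad
        g_x = discriminator(x)
    g_rec = discriminator(x_rec)          # g(h'(h(x))): second quality result
    return torch.mean(torch.abs(g_rec - g_x))
```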
A third way is to obtain a first target loss value, determined according to the norm of the difference between the reconstructed video frame and the current video frame; to acquire a length parameter, determined according to the length of the current video frame after pre-encoding; and to determine a third target loss value according to the first target loss value and the length parameter. Specifically, the third target loss value may be calculated by the following formula (3):
f3 = ||x′ - x||_1 + λ·len(h(x)) (3)
In formula (3), x represents the current video frame; x′ represents the reconstructed video frame; ||·||_1 represents the L1 norm; ||x′ - x||_1 represents the first target loss value; len(h(x)) represents the length of the current video frame after pre-encoding, i.e. the length parameter; λ represents a weighting coefficient; and f3 represents the third target loss value.
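Formula (3) might be rendered as follows; the value of λ is left open by the text, so the default below is purely illustrative.

```python
import torch

def third_target_loss(x: torch.Tensor, x_rec: torch.Tensor,
                      code_length: torch.Tensor,
                      lam: float = 0.01) -> torch.Tensor:
    """Formula (3): f3 = ||x' - x||_1 + lambda * len(h(x)).

    `code_length` is a (differentiable estimate of the) pre-encoded
    length of the current frame; lam = 0.01 is an illustrative value.
    """
    return torch.mean(torch.abs(x_rec - x)) + lam * code_length
```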
In step S350, training the video coding model according to the target loss value until the video coding model meets a convergence condition may include: training the video coding model according to at least one of the first target loss value, the second target loss value, and the third target loss value until the video coding model meets the convergence condition.
In particular implementations, the target loss function (for example, any one or more of formulas (1), (2), and (3)) may be selected flexibly; the target loss value is then determined from the current video frame, the reconstructed video frame, and the selected target loss function, and the video coding model is finally trained based on that target loss value.
For example, in the inter-frame coding mode, a convergence condition may be defined in terms of the target loss value. The current video frame, the coding constraint parameters, and the reference video frame are taken as inputs to train the video coding model; a current loss value is determined from the coding result output by the model and the target loss function; and whether the model meets the convergence condition is determined by comparing the current loss value with the target loss value. If the convergence condition is not satisfied, the model parameters of the video coding model are adjusted, and the following steps are executed iteratively: train the video coding model with the current video frame, the coding constraint parameters, and the reference video frame as inputs; determine the current loss value from the coding result output by the model and the target loss function; and determine whether the model meets the convergence condition by comparing the current loss value with the target loss value. If the convergence condition is met, training ends.
It will be appreciated that if the coding mode is intra coding, the video coding model is trained with only the current video frame and the coding constraint parameters as inputs.
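Putting the above together, an assumed training loop might look like this; the optimizer choice, learning rate, epoch cap, and the model call signature (reference frame only in inter mode) are all illustrative assumptions.

```python
import torch

def train_until_convergence(model, loader, loss_fn, target_loss: float,
                            lr: float = 1e-4, max_epochs: int = 100) -> None:
    """Iterate as described above: encode, compute the current loss,
    and stop once it no longer exceeds the target loss value."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(max_epochs):
        running = 0.0
        for x, constraint, x_ref in loader:     # x_ref is None for intra coding
            x_rec = (model(x, constraint) if x_ref is None
                     else model(x, constraint, x_ref))
            loss = loss_fn(x, x_rec)            # current loss value
            opt.zero_grad()
            loss.backward()
            opt.step()
            running += loss.item()
        if running / len(loader) <= target_loss:    # convergence condition met
            return
```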
In the supervised training method described above, the original video samples are first pre-encoded with an advanced coding standard to obtain an output code stream; the output code stream obtained in step S331 is decoded with the same standard to obtain the reconstructed video; the reconstructed video frame corresponding to the current video frame is obtained from the reconstructed video; and the target loss value is determined based on the reconstructed video frame so that model training can be performed on that value. Because the target loss value is determined based on the reconstructed video frame, the overall reconstruction quality of the code stream output by the model can be improved.
It can be understood that the video coding model provided by the embodiment of the invention is also applicable to an unsupervised training mode.
Referring to fig. 5, in the unsupervised training mode, in step S330, the obtaining the reconstructed video frame corresponding to the current video frame may be specifically implemented by the following method steps:
step S333, performing CU division on the current video frame by using the video coding model to obtain a plurality of CU division results;
step S334, coding based on each CU partition result by using the video coding model, to obtain a plurality of second code stream samples corresponding to the plurality of CU partition results one by one;
step S335, decoding each of the second code stream samples to obtain a plurality of reconstructed video frames corresponding to the plurality of CU partitioning results one by one.
Illustratively, the current video frame is input into the video coding model multiple times, each pass yielding a CU partitioning result. Under each CU partitioning result, the video coding model outputs a corresponding second code stream sample. Each second code stream sample is then decoded to obtain the reconstructed video frames corresponding one-to-one to the CU partitioning results.
In an unsupervised training mode, in step S340, the target loss value may be specifically determined by the following method steps: determining norms of differences between the reconstructed video frame and the current video frame corresponding to each CU dividing result to obtain a fourth target loss value corresponding to each CU dividing result; and selecting a minimum value from fourth target loss values corresponding to the CU division results respectively as a fifth target loss value.
Specifically, the fifth target loss value may be calculated by the following formula (4):
f = min ||x - h′(h(g(x)))||_1 (4)
In formula (4), g(x) represents a current CU partitioning result, h′(h(g(x))) represents the reconstructed video frame obtained based on that CU partitioning result, and the minimum is taken over the CU partitioning results; f represents the fifth target loss value.
In the above unsupervised training process, the CU partitioning is modeled so as to find the best partition for processing the video frame, achieving maximum coupling within each CU under the coding parameter constraints and minimum correlation between CUs.
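A sketch of formula (4) under these assumptions follows; `reconstruct_with_new_partition` is a hypothetical interface that returns a reconstruction under a fresh CU partitioning result, and `num_partitions` is likewise an assumption.

```python
import torch

def fifth_target_loss(x: torch.Tensor, model,
                      num_partitions: int = 8) -> torch.Tensor:
    """Formula (4): f = min ||x - h'(h(g(x)))||_1 over CU partitionings."""
    fourth_losses = []
    for _ in range(num_partitions):
        x_rec = model.reconstruct_with_new_partition(x)        # h'(h(g(x)))
        fourth_losses.append(torch.mean(torch.abs(x - x_rec))) # fourth loss values
    return torch.stack(fourth_losses).min()                    # fifth target loss
```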
According to the scheme of the embodiments of the present invention, a current video frame is first extracted from an original video sample, and the coding constraint parameters corresponding to the current video frame are acquired; a reconstructed video frame corresponding to the current video frame is obtained according to the current video frame and the coding constraint parameters; a target loss value is determined according to the current video frame and the reconstructed video frame; and the video coding model is trained according to the target loss value until it meets a convergence condition. The embodiments of the present invention train the video coding model on the video frames to be coded and their corresponding reconstructed frames, so that coding is performed on the global information of each video frame; this overcomes the limitation of classical codec schemes based on block partitioning and improves video coding quality under the same constraints.
In addition, the embodiment of the invention provides a video coding model based on the GNN architecture, which can be deployed to the GNN accelerator to reduce the iterative hardware cost.
The embodiment of the invention also provides a video coding method. Referring to fig. 6, the video encoding method provided by the embodiment of the invention includes, but is not limited to, the following steps:
s410, acquiring an original video;
s420, dividing the original video into a plurality of video frames;
s430, obtaining an output code stream according to each video frame, the preset coding constraint parameters and the video coding model obtained through training by the model training method provided by any one of the previous embodiments.
It should be noted that the video encoding method provided by the embodiments of the present invention may be executed by a terminal device, which may be hardware or software. When the terminal device is hardware, it may be any of various electronic devices, including but not limited to smartphones, tablet computers, e-book readers, in-vehicle computers, laptop computers, desktop computers, and the like. When the terminal device is software, it may be installed in the electronic devices listed above, and may be implemented as multiple software modules (for example, to provide distributed services) or as a single software module. No specific limitation is made here.
In a specific implementation, the terminal device divides an uncompressed original video into a plurality of video frames, inputs the video frames into the trained video coding model, and encodes them with the video coding model based on the preset coding constraint parameters.
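A minimal sketch of steps S410-S430 on the terminal side might look as follows; the assumption that the model returns a per-frame payload as bytes, and the plain concatenation of payloads in place of a real container format, are simplifications for illustration.

```python
import torch

def encode_video(model, frames, constraints) -> bytes:
    """Steps S410-S430: run each frame of the original video through the
    trained model under the preset coding constraints."""
    model.eval()
    stream = bytearray()
    with torch.no_grad():
        for frame in frames:                     # S420: the frame sequence
            stream += model(frame, constraints)  # S430: assumed per-frame bytes
    return bytes(stream)
```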
Fig. 7 shows an electronic device 500 provided by an embodiment of the invention. The electronic device 500 includes, but is not limited to:
a memory 510 for storing a program;
the processor 520 is configured to execute the program stored in the memory 510, and when the processor 520 executes the program stored in the memory 510, the processor 520 is configured to execute the model training method or the video encoding method described above.
The processor 520 and the memory 510 may be connected by a bus or other means.
The memory 510, as a non-transitory computer-readable storage medium, can be used to store non-transitory software programs and non-transitory computer-executable programs, such as those implementing the model training method or the video encoding method described in any embodiment of the present invention. The processor 520 implements the model training method or the video encoding method described above by running the non-transitory software programs and instructions stored in the memory 510.
The memory 510 may include a program storage area, which may store an operating system and the application programs required for at least one function, and a data storage area, which may store the data needed to perform the model training method or the video encoding method described above. In addition, the memory 510 may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some implementations, the memory 510 may optionally include memory located remotely from the processor 520, connected to the processor 520 via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The non-transitory software programs and instructions required to implement the above-described model training method or video encoding method are stored in the memory 510, which when executed by the one or more processors 520, perform the model training method or video encoding method provided by any embodiment of the present invention.
The embodiment of the invention also provides a storage medium which stores computer executable instructions for executing the model training method or the video coding method.
In an embodiment, the storage medium stores computer-executable instructions that are executed by one or more control processors 520, for example, by one of the processors 520 in the electronic device 500, so that the one or more processors 520 perform the model training method or the video encoding method provided in any embodiment of the present invention.
The embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, that is, they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Those of ordinary skill in the art will appreciate that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as known to those skilled in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically include computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and may include any information delivery media.
The preferred embodiments of the present invention have been described in detail above, but the present invention is not limited to the above embodiments. Those skilled in the art can make various equivalent modifications and substitutions without departing from the spirit of the present invention, and such equivalent modifications and substitutions are intended to fall within the scope of the present invention as defined by the following claims.

Claims (11)

1. A method of model training, the method comprising:
extracting a current video frame from an original video sample;
acquiring coding constraint parameters corresponding to the current video frame;
obtaining a reconstructed video frame corresponding to the current video frame according to the current video frame and the coding constraint parameter;
determining a target loss value according to the current video frame and the reconstructed video frame;
and training a video coding model according to the target loss value until the video coding model meets a convergence condition.
2. The method of claim 1, wherein the obtaining the reconstructed video frame corresponding to the current video frame comprises:
acquiring a first code stream sample, wherein the first code stream sample is obtained by pre-coding the original video sample based on a preset video coding standard;
and decoding the first code stream sample, and acquiring a reconstructed video frame corresponding to the current video frame according to a decoding result.
3. The method of claim 1, wherein said determining a target loss value from said current video frame and said reconstructed video frame comprises:
determining a first target loss value according to a norm of a difference between the reconstructed video frame and the current video frame;
the training of the video coding model according to the target loss value comprises the following steps:
and training a video coding model according to the first target loss value.
4. The method of claim 1, wherein said determining a target loss value from said current video frame and said reconstructed video frame comprises:
inputting the current video frame into a preset discriminator network to obtain a first image quality discrimination result;
inputting the reconstructed video frame into a preset discriminator network to obtain a second image quality discrimination result;
determining a second target loss value according to the norm of the difference between the second image quality discrimination result and the first image quality discrimination result;
the training of the video coding model according to the target loss value comprises the following steps:
and training a video coding model according to the second target loss value.
5. The method of claim 1, wherein said determining a target loss value from said current video frame and said reconstructed video frame comprises:
obtaining a first target loss value, wherein the first target loss value is determined according to a norm of a difference value between the reconstructed video frame and the current video frame;
Acquiring a length parameter, wherein the length parameter is determined according to the length of the current video frame after precoding;
determining a third target loss value according to the first target loss value and the length parameter;
the training of the video coding model according to the target loss value comprises the following steps:
and training a video coding model according to the third target loss value.
6. The method of claim 1, wherein the obtaining the reconstructed video frame corresponding to the current video frame comprises:
dividing a current video frame into coding units CUs by using the video coding model to obtain a plurality of CU dividing results;
coding based on each CU dividing result by utilizing the video coding model to obtain a plurality of second code stream samples corresponding to the CU dividing results one by one;
decoding each second code stream sample to obtain a plurality of reconstructed video frames corresponding to a plurality of CU dividing results one by one;
the determining a target loss value according to the current video frame and the reconstructed video frame comprises the following steps:
determining norms of differences between the reconstructed video frame and the current video frame corresponding to each CU dividing result to obtain a fourth target loss value corresponding to each CU dividing result;
selecting a minimum value from fourth target loss values corresponding to the CU division results respectively as a fifth target loss value;
the training of the video coding model according to the target loss value comprises the following steps:
and training a video coding model according to the fifth target loss value.
7. The method of claim 1, wherein the coding constraint parameters comprise at least one of coding mode indication parameters and rate control parameters.
8. The method of claim 1, wherein the video coding model comprises a coding element generation module and a coding module;
the coding element generation module is used for generating a coding element according to the current video frame input to the model and the coding constraint parameters;
the encoding module is used for encoding the current video frame according to the encoding element;
the coding element generation module comprises at least one of a CU partitioning sub-module, a quantization parameter QP determination sub-module, and a motion vector MV estimation sub-module.
9. A video encoding method, the method comprising:
acquiring an original video;
dividing the original video into a plurality of video frames;
obtaining an output code stream according to each video frame, preset coding constraint parameters and a video coding model obtained by training according to the method of any one of claims 1-8.
10. An electronic device, comprising: memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method according to any of claims 1-9 when the computer program is executed.
11. A computer readable storage medium, characterized in that a computer program is stored, which computer program, when being executed by a processor, implements the method according to any of claims 1-9.
CN202111290503.3A 2021-11-02 2021-11-02 Video encoding method, model training method, apparatus, and storage medium Pending CN116095328A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111290503.3A CN116095328A (en) 2021-11-02 2021-11-02 Video encoding method, model training method, apparatus, and storage medium
PCT/CN2022/080753 WO2023077707A1 (en) 2021-11-02 2022-03-14 Video encoding method, model training method, device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111290503.3A CN116095328A (en) 2021-11-02 2021-11-02 Video encoding method, model training method, apparatus, and storage medium

Publications (1)

Publication Number Publication Date
CN116095328A 2023-05-09

Family

ID=86210650

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111290503.3A Pending CN116095328A (en) 2021-11-02 2021-11-02 Video encoding method, model training method, apparatus, and storage medium

Country Status (2)

Country Link
CN (1) CN116095328A (en)
WO (1) WO2023077707A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116233445A (en) * 2023-05-10 2023-06-06 腾讯科技(深圳)有限公司 Video encoding and decoding processing method and device, computer equipment and storage medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117459732A (en) * 2023-10-25 2024-01-26 书行科技(北京)有限公司 Video encoding method, apparatus, device, readable storage medium, and program product

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111163314A (en) * 2018-11-07 2020-05-15 合肥图鸭信息科技有限公司 Image compression method and system
CN110267041B (en) * 2019-06-28 2021-11-09 Oppo广东移动通信有限公司 Image encoding method, image encoding device, electronic device, and computer-readable storage medium
CN111919220A (en) * 2019-11-13 2020-11-10 深圳信息职业技术学院 Adaptive pre-coding model training method, adaptive pre-coding method and base station
CN113132723B (en) * 2019-12-31 2023-11-14 武汉Tcl集团工业研究院有限公司 Image compression method and device
CN111862995A (en) * 2020-06-22 2020-10-30 北京达佳互联信息技术有限公司 Code rate determination model training method, code rate determination method and device
CN113194320B (en) * 2021-04-30 2022-11-22 北京达佳互联信息技术有限公司 Parameter prediction model training method and device and parameter prediction method and device

Also Published As

Publication number Publication date
WO2023077707A1 (en) 2023-05-11

Similar Documents

Publication Publication Date Title
CN107105278B (en) The video coding and decoding system that motion vector automatically generates
EP3821373B1 (en) Video processing
US20170359584A1 (en) A method and apparatus for performing graph-based prediction using optimazation function
CN111263161B (en) Video compression processing method and device, storage medium and electronic equipment
CN113766249B (en) Loop filtering method, device, equipment and storage medium in video coding and decoding
Pessoa et al. End-to-end learning of video compression using spatio-temporal autoencoders
EP3334163A1 (en) Device and method for performing transform by using singleton coefficient update
CN116095328A (en) Video encoding method, model training method, apparatus, and storage medium
CN110740319B (en) Video encoding and decoding method and device, electronic equipment and storage medium
Ayzik et al. Deep image compression using decoder side information
JP2017537518A (en) Method and apparatus for decoding / encoding a video signal using a transformation derived from a graph template
KR102059842B1 (en) Method and apparatus for performing graph-based transformation using generalized graph parameters
JP2024520151A (en) Feature data encoding and decoding method and apparatus
CN115442618A (en) Time domain-space domain self-adaptive video compression based on neural network
CN114157863B (en) Video coding method, system and storage medium based on digital retina
KR102605285B1 (en) Method and device for encoding/decoding video signals using optimized transformation according to a multigraph-based model
Zhou et al. $\ell_ {2} $ Restoration of $\ell_ {\infty} $-Decoded Images Via Soft-Decision Estimation
US9648336B2 (en) Encoding apparatus and method
EP2510694A1 (en) Method and apparatus for coding and decoding an image block
CN110876058B (en) Historical candidate list updating method and device
JP2024511084A (en) Multidistribution entropy modeling of latency features in image and video coding using neural networks
US11589038B2 (en) Methods for video encoding and video decoding
CN116320410A (en) Data processing method, device, equipment and readable storage medium
CN114422804A (en) Method, device and system for jointly encoding and decoding digital retina video stream and feature stream
US11647228B2 (en) Method and apparatus for encoding and decoding video signal using transform domain prediction for prediction unit partition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination