WO2024093627A1 - Video compression method, video decoding method, and related apparatus

Info

Publication number
WO2024093627A1
Authority
WO
WIPO (PCT)
Prior art keywords
video frame
position information
processed
video
previous
Application number
PCT/CN2023/123893
Other languages
English (en)
French (fr)
Inventor
罗凤
项进喜
田宽
张军
Original Assignee
腾讯科技(深圳)有限公司
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2024093627A1


Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/42: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/50: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N 19/503: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N 19/51: Motion estimation or motion compensation
    • H04N 19/513: Processing of motion vectors
    • H04N 19/517: Processing of motion vectors by encoding
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/44: Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N 21/44008: Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/44: Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N 21/4402: Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display

Definitions

  • the present application relates to the field of communication technology, and in particular to video compression technology and video decoding technology.
  • Video communication is widely used in scenarios such as video conferencing, online education, and online entertainment.
  • how to reduce video freezes, reduce the bandwidth requirements of video communication, and ensure the user's video communication experience is an urgent problem to be solved.
  • Video compression is the key technology to solve this problem.
  • the main method is to calculate the motion information of the video frame to be processed compared with the previous video frame, and then send the motion information so as to restore the video frame to be processed based on the previous video frame and the motion information.
  • however, the motion information consumes a large byte stream; moreover, when complex picture motion occurs in the video frame, the motion information is difficult to estimate and the reconstructed picture is prone to distortion.
  • the present application provides a video compression method, a video decoding method and related devices, thereby alleviating the video frame distortion phenomenon caused by complex picture motion and improving the robustness of the algorithm.
  • the video compression file includes the first position information and the second position information, not the dense feature vector representing the motion information, so that when the video compression is realized, the byte stream consumed by the motion information is greatly reduced, and the transmission bandwidth of the video compression file is reduced.
  • an embodiment of the present application provides a video compression method, the method being executed by a computer device, the method comprising:
  • the video is compressed according to the first position information, the second position information and the latent feature to obtain a video compression file.
  • an embodiment of the present application provides a video decoding method, the method being executed by a computer device, the method comprising:
  • the video compression file includes first position information of a first key point of a video frame to be processed, second position information of a second key point of a previous video frame, and a hidden feature, wherein the previous video frame is a video frame in a video frame sequence that is adjacent to the video frame to be processed and is located before the video frame to be processed;
  • secondary restoration is performed on the initial video frame using the latent feature to obtain a final video frame.
  • an embodiment of the present application provides a video compression device, which is deployed on a computer device, and includes an acquisition unit, an extraction unit, a determination unit, a repair unit, and a compression unit:
  • the acquisition unit is used to acquire a video frame to be processed and a previous video frame of the video frame to be processed, wherein the previous video frame is a video frame in a video frame sequence that is adjacent to the video frame to be processed and is located before the video frame to be processed;
  • the extraction unit is used to extract key points from the video frame to be processed to obtain first position information of a first key point in the video frame to be processed, and to extract key points from the previous video frame to obtain second position information of a second key point in the previous video frame;
  • the determining unit is configured to perform motion estimation according to the first position information and the second position information to obtain motion information of the to-be-processed video frame relative to the previous video frame;
  • the restoration unit is used to perform image restoration according to the motion information and the previous video frame to obtain an initial video frame
  • the determining unit is further configured to determine a latent feature according to the to-be-processed video frame and the initial video frame, wherein the latent feature represents a restoration deviation of the initial video frame relative to the to-be-processed video frame;
  • the compression unit is used to compress the video according to the first position information, the second position information and the latent feature to obtain a video compression file.
  • an embodiment of the present application provides a video decoding device, which is deployed on a computer device, and includes an acquisition unit, a determination unit, and a repair unit:
  • the acquisition unit is used to acquire a video compression file, wherein the video compression file includes first position information of a first key point of a video frame to be processed, second position information of a second key point of a previous video frame, and a hidden feature, wherein the previous video frame is a video frame in a video frame sequence that is adjacent to the video frame to be processed and is located before the video frame to be processed;
  • the determining unit is configured to perform motion estimation according to the first position information and the second position information to obtain motion information of the to-be-processed video frame relative to the previous video frame;
  • the restoration unit is used to perform image restoration according to the motion information and the previous video frame to obtain an initial video frame
  • the restoration unit is further configured to perform secondary restoration on the initial video frame using the latent features to obtain a final video frame.
  • an embodiment of the present application provides a computer device, the computer device comprising a processor and a memory:
  • the memory is used to store program code and transmit the program code to the processor
  • the processor is configured to execute the method described in any one of the preceding aspects according to instructions in the program code.
  • an embodiment of the present application provides a computer-readable storage medium, wherein the computer-readable storage medium is used to store program code, and when the program code is executed by a processor, the processor executes the method described in any of the aforementioned aspects.
  • an embodiment of the present application provides a computer program product, including a computer program, which implements the method described in any of the aforementioned aspects when executed by a processor.
  • the video frame to be processed and the previous video frame of the video frame to be processed can be obtained, and the previous video frame is a video frame adjacent to the video frame to be processed and located before the video frame to be processed in the video frame sequence.
  • the key points of the video frame to be processed and the previous video frame are extracted respectively to obtain the first position information of the first key point of the video frame to be processed and the second position information of the second key point of the previous video frame, so as to perform motion estimation according to the first position information and the second position information, and obtain the motion information of the video frame to be processed relative to the previous video frame.
  • Image restoration is performed according to the motion information and the previous video frame to obtain the initial video frame.
  • the present application can further determine hidden features according to the video frame to be processed and the initial video frame during video compression, and characterize the restoration deviation of the initial video frame relative to the video frame to be processed by the hidden features, so as to perform video compression according to the first position information, the second position information and the hidden features to obtain a video compression file.
  • the video receiving end can calculate the motion information based on the first position information and the second position information, and perform image restoration based on the motion information and the previous video frame to obtain the initial video frame.
  • the video compression file also includes latent features, the latent features represent the restoration deviation of the initial video frame relative to the video frame to be processed, so the latent features can be further used to perform secondary restoration on the initial video frame, thereby alleviating the video frame distortion caused by complex picture motion and improving the robustness of the algorithm.
  • the video compression file includes the first position information and the second position information, not the dense feature vector representing the motion information, so that when video compression is achieved, the byte stream consumed by the motion information is greatly reduced, and the transmission bandwidth of the video compression file is reduced.
  • FIG1 is an application scenario architecture diagram of a video compression method provided by an embodiment of the present application.
  • FIG2 is a flow chart of a video compression method provided by an embodiment of the present application.
  • FIG3 is a structural example diagram of a video compression model provided in an embodiment of the present application.
  • FIG4 is a diagram showing a specific process example of probability modeling provided by an embodiment of the present application.
  • FIG5 is a flowchart of a video decoding method provided by an embodiment of the present application.
  • FIG6 is a diagram showing a specific process example of AI video compression provided by an embodiment of the present application.
  • FIG7 is a performance comparison diagram of various solutions provided in the embodiments of the present application in terms of subjective image quality
  • FIG8 is a performance comparison diagram of various solutions provided in the embodiments of the present application in complex scene reconstruction
  • FIG9 is a performance comparison diagram of various solutions provided in embodiments of the present application in non-face scenes.
  • FIG10 is a structural diagram of a video compression device provided in an embodiment of the present application.
  • FIG11 is a structural diagram of a video decoding device provided in an embodiment of the present application.
  • FIG12 is a structural diagram of a terminal provided in an embodiment of the present application.
  • FIG. 13 is a structural diagram of a server provided in an embodiment of the present application.
  • Video communication is widely used in scenarios such as video conferencing, online education, and online entertainment.
  • the form of business communication between people has gradually shifted from offline to online, making video communication more widely used in video conferencing.
  • online video conferencing reduces the spatial location restrictions of participants and promotes efficient and cost-effective collaboration.
  • how to reduce video freezes, reduce the demand for bandwidth for video conferencing, and ensure the user's video conferencing experience is an urgent problem to be solved.
  • Video compression is a key technology to solve this problem.
  • Video compression is divided into lossy video compression and lossless video compression according to the quality difference between the decompressed video and the original video. This application focuses on lossy video compression.
  • When performing video compression, a video is given; the first video frame in the video frame sequence is recorded as an I frame and the remaining video frames are recorded as P frames. Since there is often repeated and redundant information between different video frames of the same video, video compression restores the current video frame (i.e., the video frame to be processed) x_t based on its previous video frame x_{t-1}.
  • When x_{t-1} is known, it is only necessary to determine the difference between the current video frame x_t and the previous video frame x_{t-1} to reconstruct the current video frame.
  • the difference between the current video frame x_t and the previous video frame x_{t-1} can be reflected by motion information, so in the related art the motion information of the current video frame relative to the previous video frame is usually obtained and then sent, so that the current video frame can be restored based on the previous video frame and the motion information.
  • however, the motion information is a dense motion feature vector that consumes a large byte stream; when complex picture motion occurs in a video frame, the motion information is difficult to estimate and the reconstructed picture is easily distorted.
  • an embodiment of the present application provides a video compression method, which performs video compression according to the first position information of the first key point of the video frame to be processed, the second position information of the second key point of the previous video frame and the hidden features to obtain a video compression file.
  • the video receiving end can calculate the motion information according to the first position information and the second position information, and perform image repair based on the motion information and the previous video frame to obtain the initial video frame.
  • the video compression file also includes hidden features
  • the hidden features represent the repair deviation of the initial video frame relative to the video frame to be processed, so the hidden features can be further used to perform secondary repair on the initial video frame, thereby alleviating the video frame distortion caused by complex picture motion and improving the robustness of the algorithm.
  • the video compression file includes the first position information and the second position information, not the dense feature vector representing the motion information, so that when the video compression is realized, the byte stream consumed by the motion information is greatly reduced, and the transmission bandwidth of the video compression file is reduced.
  • the video compression method provided in the embodiment of the present application is applicable to various video communication scenarios, such as video conferencing, online education, online entertainment and the like.
  • the video compression method provided in the embodiment of the present application can be executed by a computer device, which can be used as a video transmitter.
  • the computer device can be, for example, a server or a terminal.
  • the server can be an independent physical server, a server cluster or a distributed system composed of multiple physical servers, or a cloud server providing cloud computing services.
  • the terminal includes but is not limited to a smart phone, a computer, an intelligent voice interaction device, a smart home appliance, a vehicle terminal, an aircraft, etc.
  • FIG1 shows an application scenario architecture diagram of a video compression method
  • the application scenario may include a video transmitter 101 and a video receiver 102.
  • the video transmitter 101 and the video receiver 102 perform video communication
  • the video transmitter 101 needs to transmit video frames to the video receiver 102, and all transmitted video frames can constitute a video frame sequence transmitted in the video communication.
  • the video frames to be sent are usually compressed.
  • When the current video frame is transmitted, it can be used as the video frame to be processed, and video compression can be performed on the video frame to be processed.
  • the video transmitting end 101 can obtain the video frame to be processed and the previous video frame of the video frame to be processed.
  • the video frame to be processed is a video frame that currently needs to be compressed and transmitted to the video receiving end 102
  • the previous video frame is a video frame in the video frame sequence that is adjacent to the video frame to be processed and is located before the video frame to be processed.
  • the video sending end 101 can extract key points from the video frame to be processed to obtain the first position information of the first key point of the video frame to be processed, and extract key points from the previous video frame to obtain the second position information of the second key point of the previous video frame, so as to perform motion estimation based on the first position information and the second position information to obtain the motion information of the video frame to be processed relative to the previous video frame.
  • the key points can be representative points on the objects included in the video frame, and the objects included in the video frame are represented by the key points.
  • the objects can be people, animals, etc.
  • the key points can be representative points on various body parts of the human body included in the video frame, and various body parts can include, for example, human face, hands, arms, body, feet, legs, etc.
  • when the body part included in the video frame is a human face, the key points can be representative points on the human face; when the body parts included in the video frame are a human face and hands, the key points can be representative points on the human face and hands, and so on.
  • the key point of the video frame to be processed can be called the first key point, which can be a representative point on the first object included in the video frame to be processed.
  • the key point of the previous video frame can be called the second key point, which can be a representative point on the second object included in the previous video frame.
  • the first object and the second object can be the same or different.
  • the video transmitter 101 performs image restoration based on the motion information and the previous video frame to obtain an initial video frame.
  • the video transmitter 101 of the present application can further determine hidden features based on the video frame to be processed and the initial video frame during video compression.
  • the hidden features can be the features of the unclear parts of the initial video frame after image restoration relative to the video frame to be processed, and the hidden features are used to characterize the restoration deviation of the initial video frame relative to the video frame to be processed.
  • the video sending end 101 can compress the video according to the first position information, the second position information and the latent feature to obtain a video compression file, and send the video compression file to the video receiving end 102.
  • the video receiving end 102 can calculate the motion information according to the first position information and the second position information, and perform image restoration based on the motion information and the previous video frame to obtain the initial video frame.
  • the video compression file also includes latent features, and the latent features represent the restoration deviation of the initial video frame relative to the video frame to be processed, so the latent features can be further used to perform secondary restoration on the initial video frame, thereby alleviating the video frame distortion caused by complex picture motion and improving the robustness of the algorithm.
  • the video compression file includes the first position information and the second position information, not the dense feature vector representing the motion information, so that when the video compression is realized, the byte stream consumed by the motion information is greatly reduced, and the transmission bandwidth of the video compression file is reduced.
  • the method provided in the embodiment of the present application mainly involves artificial intelligence technology, and video compression and video decoding are automatically performed through artificial intelligence (AI) technology.
  • the video compression model can be trained through machine learning, and the video frame to be processed and the previous video frame can be pre-processed through image processing in computer vision technology, and key points, hidden features, etc. can be extracted through image semantic understanding.
  • FIG2 shows a flow chart of a video compression method, the method comprising:
  • a video transmitter obtains a video frame to be processed and a previous video frame of the video frame to be processed, wherein the previous video frame is a video frame in a video frame sequence that is adjacent to the video frame to be processed and is located before the video frame to be processed.
  • the current video frame can be used as a video frame to be processed for video compression.
  • the previous video frame is usually used as a reference to reconstruct the video frame to be processed based on the difference between the video frame to be processed and the previous video frame.
  • the video transmitter can obtain the video frame to be processed and the previous video frame of the video frame to be processed.
  • the video frame to be processed is a video frame that currently needs to be video compressed and transmitted to the video receiver
  • the previous video frame is a video frame in the video frame sequence that is adjacent to the video frame to be processed and is located before the video frame to be processed.
  • the video frame to be processed can be represented by x_t
  • the previous video frame can be represented by x_{t-1}.
  • FIG. 3 shows a structural example diagram of a video compression model; the video frame to be processed x_t and the previous video frame x_{t-1} are shown in FIG. 3.
  • a video frame sequence is obtained by sorting multiple video frames to be transmitted according to their time order in the time domain.
  • the previous video frame refers to a video frame in the video frame sequence that is adjacent to the video frame to be processed and is located before the video frame to be processed in the time domain.
  • the video sending end extracts key points from the video frame to be processed to obtain first position information of a first key point in the video frame to be processed, and extracts key points from the previous video frame to obtain second position information of a second key point in the previous video frame.
  • the video transmitter can determine the difference between the two, and the difference between the two can be represented by motion information.
  • the byte stream consumed by motion information can be reduced.
  • the main body in the picture is generally a face, a human body and other instances with low complexity.
  • the motion information between video frames can be measured through special points in the instance, such as key points. Since the motion information in common video compression is stored in a dense feature vector of size (N, 2, H/16, W/16), recording the position information of the key points can greatly reduce the byte stream consumed by the motion information.
  • N represents the number of key points used to determine the motion information
  • H represents the height of the video frame to be processed
  • W represents the width of the video frame to be processed.
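  • As a rough illustration (not part of the original disclosure) of the savings, the following sketch compares the storage of a dense motion feature vector of size (N, 2, H/16, W/16) with the storage of N key point coordinates per frame; the float32 assumption and the values of N, H and W are illustrative only.

        # Rough byte-count comparison: dense motion feature vector vs. key point coordinates.
        # Assumes float32 (4 bytes per value) and illustrative values of N, H, W.
        N, H, W = 10, 256, 256

        dense_motion_bytes = N * 2 * (H // 16) * (W // 16) * 4   # dense feature vector (N, 2, H/16, W/16)
        keypoint_bytes = 2 * N * 2 * 4                           # N key points (x, y) for each of the two frames

        print(dense_motion_bytes)  # 20480
        print(keypoint_bytes)      # 160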
  • the video sending end can extract key points of the video frame to be processed and the previous video frame respectively, identify the first key point of the video frame to be processed and the second key point of the previous video frame, and then obtain the first position information of the first key point and the second position information of the second key point, so as to replace the dense feature vector representing the motion information with the first position information and the second position information to transmit to the video receiving end.
  • the position information may be represented by coordinates, that is, the first position information may be the coordinates of the first key point, and the second position information may be the coordinates of the second key point.
  • the key points may be facial landmarks, which are a set of fixed points predefined based on the structure of a person's facial features.
  • the objects included in the video frame may include not only faces, but also hands, arms, feet, etc.
  • the key points may be the key points of various parts of the object in the video frame.
  • the first key points may include the key points of various body parts included in the first object in the video frame to be processed
  • the second key points may include the key points of various body parts included in the second object in the previous video frame.
  • for example, when the first object includes a human face and hands, the first key points may be the key points of the face and the key points of the hands.
  • when the second object includes a human face, the second key points may be the key points of the face.
  • this method makes the extracted key points applicable to various scenarios, thereby making the video compression method more scalable and improving the effect of subsequent reconstruction.
  • extracting key points from the video frame to be processed and the previous video frame to obtain the corresponding key points can be performed as follows: identify the body parts included in the first object in the video frame to be processed and the body parts included in the second object in the previous video frame; then, according to the mapping relationship between body parts and key points, determine the key points corresponding to the body parts included in the first object and their first position information in the video frame to be processed, and likewise determine the key points corresponding to the body parts included in the second object and their second position information in the previous video frame.
  • the mapping relationship between the body parts and the key points can be predetermined, and the key points of each body part can be, for example, a set of fixed points predefined according to the structure of the body part, and the embodiment of the present application does not limit the definition method of the key points of each body part.
  • the manner of extracting key points from the video frame to be processed to obtain the first position information of the first key point, and extracting key points from the previous video frame to obtain the second position information of the second key point, can be: the video sending end extracts key points from the video frame to be processed through a key point detection model to obtain the first position information, and extracts key points from the previous video frame through the key point detection model to obtain the second position information.
  • the key point detection model is obtained by training based on training samples, and the training samples include multiple sample images, and the sample objects in each sample image include body parts, and the body parts included in the sample objects in the multiple sample images cover various body parts.
  • the key point detection model can be trained by adaptive learning, so as to learn which key points are extracted from video frames in different scenes.
  • the key point detection model will gradually acquire the ability to predict the key points of video frames in different scenes.
  • the key point detection model is obtained through adaptive learning and training, so that the key point detection model has adaptive capabilities, so that the method provided in the embodiment of the present application can be applied to various scenarios, including scenarios other than human faces, thereby improving the key point extraction capability.
  • the key point detection model can be, for example, a key point detection network or a key point detector (Keypoint Detector).
  • the embodiment of the present application mainly introduces the key point detection model as a key point detector.
  • the key point detector can be, for example, a deep residual network (ResNet), specifically ResNet18.
  • the video frame to be processed x_t or the previous video frame x_{t-1} can be used as the image I, and x_t and x_{t-1} can be respectively input into the key point detector to predict the corresponding key points, namely the first key point and the second key point.
  • the first key point and the second key point predicted by the key point detector are shown in FIG. 3.
  • the predicted key points can be represented by position information, such as first position information for the first key point and second position information for the second key point.
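  • As an illustrative sketch only (the exact network of this application is not specified here), a ResNet18-based key point detector as described above could be organised as follows in PyTorch; the output head, the number of key points and the coordinate normalisation are assumptions.

        import torch
        import torch.nn as nn
        from torchvision.models import resnet18

        class KeypointDetector(nn.Module):
            """Predict N key point coordinates for an input frame (illustrative sketch)."""
            def __init__(self, num_keypoints: int = 10):
                super().__init__()
                backbone = resnet18(weights=None)
                # replace the classification head with a regression head for N (x, y) pairs
                backbone.fc = nn.Linear(backbone.fc.in_features, num_keypoints * 2)
                self.backbone = backbone
                self.num_keypoints = num_keypoints

            def forward(self, frame: torch.Tensor) -> torch.Tensor:
                # frame: (B, 3, H, W) -> key point coordinates in [-1, 1], shape (B, N, 2)
                coords = torch.tanh(self.backbone(frame))
                return coords.view(-1, self.num_keypoints, 2)

        detector = KeypointDetector(num_keypoints=10)
        x_t, x_prev = torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256)
        first_position_info = detector(x_t)       # first key points of the frame to be processed
        second_position_info = detector(x_prev)   # second key points of the previous frame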
  • S203 The video sending end performs motion estimation according to the first position information and the second position information to obtain motion information of the to-be-processed video frame relative to the previous video frame.
  • the first position information and the second position information can be used to represent motion information for use in encoding and decoding of the video frame to be processed.
  • the video transmitter can perform motion estimation based on the first position information and the second position information to obtain the motion information of the video frame to be processed relative to the previous video frame, so as to preliminarily predict the initial video frame restored after video compression based on the first position information and the second position information, and then determine the distortion of the initial video frame, thereby alleviating possible distortion during the video compression process.
  • the video frame to be processed may have relative motion relative to the previous video frame, so the motion information obtained here is also relative motion information.
  • in a possible implementation, performing motion estimation based on the first position information and the second position information to obtain the motion information of the video frame to be processed relative to the previous video frame can be achieved as follows: the video transmitter performs a thin plate spline transformation (Thin Plate Spline transformation, TPS transformation) based on the first position information and the second position information to obtain a thin plate spline transformation matrix.
  • the previous video frame is transformed according to the thin plate spline transformation matrix to obtain a transformed image, and then a contribution map is output through a motion network based on the transformed image.
  • the contribution map is used to represent the contribution of the thin plate spline transformation matrix to the motion of each pixel on the previous video frame, so that the motion information can be calculated based on the contribution map and the thin plate spline transformation matrix.
  • the first key points of the video frame to be processed and the second key points of the previous video frame can be divided into K groups, and each group of key points can include a first key point and a second key point.
  • thin plate spline transformation can be performed on each group to obtain K thin plate spline transformation matrices.
  • the thin plate spline transformation matrix can be represented by T_k; the size of each thin plate spline transformation matrix is H×W, where H is the height of the video frame to be processed and W is the width of the video frame to be processed.
  • the size of the transformed image obtained can be (K+1, 3, H, W).
  • K+1 contribution maps can be obtained according to the transformed image; the contribution map can be represented by M_k, and the size of each contribution map is H×W.
  • the contribution maps can be used as weights to linearly weight the different thin plate spline transformation matrices at the same position to obtain the motion information.
  • the motion information can be an optical flow field.
  • the calculation formula for calculating the motion information based on the contribution maps and the thin plate spline transformation matrices can be as follows: T(x,y) = Σ_k M_k(x,y) · T_k(x,y), where T(x,y) represents the motion information (such as the optical flow field), M_k(x,y) represents the k-th contribution map, T_k(x,y) represents the k-th thin plate spline transformation matrix, K represents the number of key point groups (i.e., the number of thin plate spline transformations), and (x,y) represents the coordinates of each pixel.
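  • A minimal sketch of this weighted combination is given below; it assumes the per-group flows are stored as 2-channel fields and that the contribution maps are normalised across groups before weighting, which is an assumption for illustration rather than a statement of the exact network behaviour.

        import torch

        def combine_motion(tps_flows: torch.Tensor, contribution_maps: torch.Tensor) -> torch.Tensor:
            """Linearly weight per-group flows by their contribution maps.

            tps_flows:         (K, H, W, 2) flow produced by each thin plate spline transformation
            contribution_maps: (K, H, W)    per-pixel contribution of each transformation
            returns:           (H, W, 2)    combined motion information (optical flow field)
            """
            weights = contribution_maps.softmax(dim=0)   # assumption: normalise weights over the K groups
            return (weights.unsqueeze(-1) * tps_flows).sum(dim=0)

        K, H, W = 10, 64, 64
        flow = combine_motion(torch.randn(K, H, W, 2), torch.randn(K, H, W))
        print(flow.shape)  # torch.Size([64, 64, 2])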
  • the above process can be implemented through a motion network (Motion Network) on the video sending end.
  • the motion network can be used to predict motion information (see Figure 3) for subsequent image restoration.
  • the embodiment of the present application does not limit the network structure of the motion network.
  • the video frame to be processed may include a background area in addition to the first object.
  • the first object serves as the foreground and may obstruct the background area to a certain extent.
  • mask information can also be output through the motion network based on the transformed image. The mask information is used to indicate that the attention of image restoration should be focused more on the foreground (i.e., the first object), thereby reducing the influence of the background area on the restoration of the foreground image and improving the image restoration effect.
  • the embodiment of the present application can also additionally predict the affine transformation matrix of the background to model the background motion.
  • the video frame to be processed and the previous video frame can be spliced, and the second splicing result obtained after splicing is input into the background motion prediction network to obtain the affine transformation matrix, which is used to represent the background motion of the video frame to be processed relative to the previous video frame.
  • the prediction of the above-mentioned affine transformation matrix can be realized by the background motion prediction network (BG Motion Predictor) on the video transmitter: the video frame to be processed x_t and the previous video frame x_{t-1} are spliced in the channel direction, the second splicing result is input into the background motion prediction network, the image features of the second splicing result are extracted by the background motion prediction network, and then a single fully connected layer is used to regress the two-dimensional affine transformation matrix (see FIG. 3).
  • the embodiment of the present application does not limit the network structure of the background motion prediction network; for example, it can be another ResNet18. The affine transformation matrix can be expressed as A_bg, and its size is 2×3.
  • the method of transforming the previous video frame according to the thin plate spline transformation matrix to obtain the transformed image can be to transform the previous video frame using the thin plate spline transformation matrix and the affine transformation matrix to obtain the transformed image.
  • since it is necessary to use both the affine transformation matrix and the thin plate spline transformation matrix to transform the previous video frame, in order to facilitate the operation between the two, the affine transformation matrix can be converted into a two-dimensional vector field of the same size as the thin plate spline transformation matrix T_k using the following formula: T_bg(x,y) = A_bg · (x, y, 1)^T, where p = (x, y)^T with x ∈ {0, 1, ..., H-1} and y ∈ {0, 1, ..., W-1} represents the coordinates of each pixel, H represents the height of the previous video frame, and W represents the width of the previous video frame.
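  • The conversion can be pictured as applying A_bg to the homogeneous coordinates of every pixel, which yields a per-pixel field of the same spatial size as T_k; the following sketch assumes exactly that interpretation and uses an arbitrary translation matrix for illustration.

        import torch

        def affine_to_flow(A_bg: torch.Tensor, H: int, W: int) -> torch.Tensor:
            """Expand a 2x3 affine matrix into per-pixel target coordinates of shape (H, W, 2)."""
            ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                                    torch.arange(W, dtype=torch.float32), indexing="ij")
            p = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1)   # homogeneous coordinates (H, W, 3)
            return p @ A_bg.T                                        # A_bg applied to every pixel

        A_bg = torch.tensor([[1.0, 0.0, 2.0],
                             [0.0, 1.0, -1.0]])                      # illustrative: translate by (+2, -1)
        bg_flow = affine_to_flow(A_bg, H=64, W=64)
        print(bg_flow.shape)  # torch.Size([64, 64, 2])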
  • S204 The video sending end performs image restoration according to the motion information and the previous video frame to obtain an initial video frame.
  • the video transmitter can perform image restoration based on the motion information and the previous video frame to obtain an initial video frame, so as to determine the possible distortion when directly reconstructing the video frame to be processed using the motion information, thereby alleviating such distortion.
  • the image restoration method can be, for example, image warping (warp) processing.
  • the motion information can reflect the difference between the video frame to be processed and the previous video frame, so the initial video frame can be obtained by performing transformation on the basis of the previous video frame according to the motion information.
  • the image restoration network can adopt an encoder-decoder structure.
  • the image restoration network takes the previous video frame x_{t-1} and the motion information (such as the optical flow field T(x,y)) as input, and outputs the transformed initial video frame (see FIG. 3).
  • specifically, x_{t-1} is downsampled and upsampled four times in sequence by the image restoration network, the optical flow field T is used to transform the feature maps at all levels, and the initial video frame is finally output.
  • the aforementioned motion network may output mask information.
  • in a possible implementation, the manner of performing image restoration according to the motion information and the previous video frame to obtain the initial video frame can be: performing image restoration according to the motion information, the mask information and the previous video frame to obtain the initial video frame.
  • the mask information is used to indicate that the attention of the image restoration should be more focused on the foreground (i.e., the first object), that is, the background area is ignored under the instruction of the mask information, and then the initial video frame is obtained by transforming the previous video frame according to the motion information, thereby reducing the influence of the background area on the foreground image restoration, so as to improve the effect of image restoration.
  • the above-mentioned image restoration network can also be used for implementation, and its implementation process is similar to that described in FIG3 above, which will not be repeated here.
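  • The warp operation at the heart of this image restoration can be sketched as follows; the real image restoration network is an encoder-decoder that warps feature maps at several scales, so this single-scale example with an identity sampling grid is only an illustration of the idea.

        import torch
        import torch.nn.functional as F

        def warp(frame: torch.Tensor, grid: torch.Tensor) -> torch.Tensor:
            """Warp a frame (or feature map) with a sampling grid normalised to [-1, 1].

            frame: (B, C, H, W)  previous video frame or one of its feature maps
            grid:  (B, H, W, 2)  sampling coordinates derived from the optical flow field
            """
            return F.grid_sample(frame, grid, align_corners=True)

        x_prev = torch.randn(1, 3, 64, 64)
        # identity sampling grid stands in for the predicted optical flow field T(x, y)
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, 64), torch.linspace(-1, 1, 64), indexing="ij")
        grid = torch.stack([xs, ys], dim=-1).unsqueeze(0)
        initial_frame = warp(x_prev, grid)   # coarse initial video frame before secondary restoration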
  • S205 The video sending end determines a latent feature according to the to-be-processed video frame and the initial video frame, where the latent feature represents a restoration deviation of the initial video frame relative to the to-be-processed video frame.
  • the video transmitter of the embodiment of the present application can further determine a latent feature based on the video frame to be processed and the initial video frame. The latent feature can be a feature of the parts of the initial video frame that are not clearly restored relative to the video frame to be processed after image restoration, and it is used to characterize the restoration deviation of the initial video frame relative to the video frame to be processed.
  • specifically, the initial video frame may be compared with the video frame to be processed to obtain the restoration deviation of the image-restored initial video frame relative to the video frame to be processed, that is, the latent feature.
  • the determination of hidden features can be realized by a video frame refinement module based on a condition (context), that is, the video frame to be processed and the initial video frame can be input into the video frame refinement module, thereby outputting hidden features.
  • the core of S205 implementation can be to use the initial video frame obtained by the aforementioned image restoration as a condition to assist the video compression in the next stage.
  • the video frame refinement module can include a feature extractor (Feature Extractor) and a conditional encoder (Context Encoder).
  • the video transmitter can extract features from the initial video frame through the feature extractor in the video frame refinement module to obtain a feature vector of the initial video frame, and use the feature vector of the initial video frame as a video frame compression condition, and then use the video frame compression condition to assist in encoding the video frame to be processed.
  • the pixel matrix of the video frame to be processed and the video compression condition can be spliced, and the first splicing result obtained after splicing is input into the conditional encoder to obtain hidden features.
  • this process is shown in FIG. 3, where the latent feature can be represented by y_t.
  • the feature vector of the initial video frame can reflect the features of the initial video frame
  • the pixel matrix of the video frame to be processed can reflect the features of the video frame to be processed, and then more accurate latent features can be obtained based on the pixel matrix and video compression conditions.
  • the feature extractor can be represented by f_ex.
  • the video frame compression condition obtained by the feature extractor from the features of the initial video frame can be expressed by the following formula (denoting the initial video frame as x̃_t and the video frame compression condition as x̄_t): x̄_t = f_ex(x̃_t).
  • the conditional encoder can be represented by f_enc.
  • the conditional encoder encodes the first splicing result obtained by splicing the video frame to be processed and the video frame compression condition, and the resulting latent feature can be represented by the following formula: y_t = f_enc([x_t, x̄_t]), where y_t represents the latent feature, f_enc represents the conditional encoder, x_t represents the video frame to be processed (specifically, its corresponding pixel matrix), and x̄_t represents the video frame compression condition.
  • the feature extractor may include 1 convolutional layer, 2 residual modules, and 1 convolutional layer, wherein the size of the convolutional layer may be 3 ⁇ 3, and the convolutional layer may be represented as conv3 ⁇ 3.
  • the feature extractor takes the initial video frame x̃_t as input and passes it through 1 conv3×3, 2 residual modules, and 1 conv3×3 in sequence, outputting a feature vector with 64 channels, that is, the video frame compression condition.
  • the video frame compression condition x̄_t therefore has size 64×H×W, where H represents the height of the initial video frame and W represents the width of the initial video frame. Compared with the commonly used residual compensation, the video frame compression condition assists the encoding and decoding of the video frame from the feature domain, which can provide more flexible and richer auxiliary information.
  • the conditional encoder is composed of three convolutional layers and normalization layer modules stacked together, wherein the normalization module can be any of various types of normalization modules. Since generalized divisive normalization (GDN) is more suitable for image reconstruction, the normalization layer used here can be GDN.
  • alternatively, when determining the latent feature, the feature of the video frame to be processed can first be extracted to obtain a feature vector of the video frame to be processed, then the feature vector of the video frame to be processed is spliced with the video compression condition, and the first splicing result obtained after splicing is input into the conditional encoder to obtain the latent feature.
  • the embodiment of the present application does not limit the specific implementation method of determining the latent feature, and any method that can achieve a similar effect can be used as an implementation method for determining the latent feature.
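  • One way to picture the feature extractor and conditional encoder described above is the following sketch; the residual-module design, channel counts and the use of ReLU in place of GDN are assumptions made only to keep the example self-contained.

        import torch
        import torch.nn as nn

        class ResBlock(nn.Module):
            def __init__(self, ch: int = 64):
                super().__init__()
                self.body = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
                                          nn.Conv2d(ch, ch, 3, padding=1))
            def forward(self, x):
                return x + self.body(x)

        # feature extractor f_ex: conv3x3 -> 2 residual modules -> conv3x3, 64-channel condition
        feature_extractor = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1),
                                          ResBlock(64), ResBlock(64),
                                          nn.Conv2d(64, 64, 3, padding=1))

        # conditional encoder f_enc: stacked stride-2 convolutions (GDN replaced by ReLU here)
        conditional_encoder = nn.Sequential(nn.Conv2d(3 + 64, 96, 3, stride=2, padding=1), nn.ReLU(),
                                            nn.Conv2d(96, 96, 3, stride=2, padding=1), nn.ReLU(),
                                            nn.Conv2d(96, 96, 3, stride=2, padding=1))

        x_t = torch.randn(1, 3, 64, 64)            # video frame to be processed
        initial_frame = torch.randn(1, 3, 64, 64)  # coarse restoration result
        condition = feature_extractor(initial_frame)                   # video frame compression condition
        y_t = conditional_encoder(torch.cat([x_t, condition], dim=1))  # latent feature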
  • the video sending end compresses the video according to the first position information, the second position information and the hidden feature to obtain a video compression file.
  • through the above steps, the first position information, the second position information and the latent feature can be obtained.
  • although the motion information is also obtained, in order to reduce the byte stream consumption compared with the related art, the embodiment of the present application uses the first position information and the second position information obtained based on the key points to replace the dense-feature-vector motion information, so as to perform video compression according to the first position information, the second position information and the latent feature to obtain a video compression file.
  • the video transmitter can also send the video compressed file to the video receiver.
  • when the affine transformation matrix is obtained in the above manner, in order for the video receiving end to use the affine transformation matrix for decoding, the affine transformation matrix can also be written into the video compression file; that is, when video compression is performed according to the first position information, the second position information and the latent feature to obtain the video compression file, the first position information, the second position information, the latent feature and the affine transformation matrix are written into the video compression file.
  • in this way, when the video receiving end receives the video compression file and decodes and reconstructs the video frame to be processed according to the video compression file, the affine transformation matrix can be used to improve the accuracy of motion estimation, thereby improving the reconstruction effect.
  • the latent feature includes information that reflects the restoration deviation, which can be numerical values, and some values in the latent feature appear with a significantly higher probability than others.
  • therefore, the video transmitter can perform probability modeling on the latent feature to obtain distribution parameters. The distribution parameters are used to represent the distribution of different information in the latent feature, and the distribution parameters are then used to assist in arithmetic coding of the latent feature to obtain the encoded latent feature.
  • the latent features included in the video compression file are encoded latent features, that is, the way to obtain the video compression file by performing video compression according to the first position information, the second position information and the latent features can be to write the first position information, the second position information, the encoded latent features and the distribution parameters into the video compression file.
  • the expression form of the latent feature can be a feature map. Assuming that the latent feature y_t obeys the Laplace distribution, the distribution parameters can be μ_t and σ_t.
  • distribution parameters can be obtained, which can reflect the distribution of different information in the latent features, and then reflect the probability of different information appearing in the latent features, so that the latent features can be encoded according to the distribution parameters, so that fewer bits can be used to encode the information with a higher probability, thereby further reducing the redundant information in the latent features.
  • the above process can be implemented by an entropy model; that is, the video frame refinement module can also include an entropy model, and the entropy model is used to perform probability modeling on the latent feature to obtain the distribution parameters, as shown in FIG. 3.
  • the greater the probability of occurrence of certain information in the latent feature, the smaller the entropy output by the entropy model.
  • a priori prediction structure that integrates hierarchical information, spatial information, and temporal information can be used to predict more accurate distribution parameters.
  • the method of probabilistically modeling the latent feature to obtain the distribution parameters of different information in the latent feature can be: performing hierarchical prior learning on the latent feature to obtain first prior information (i.e., hierarchical information), performing spatial prior learning on the latent feature to obtain second prior information (i.e., spatial information), and performing temporal prior learning to obtain third prior information (i.e., temporal information), and then predicting the distribution parameters based on the first prior information, the second prior information and the third prior information.
  • the first prior information can be obtained by hierarchical prior learning through a super prior model (this process is the hierarchical prior branch)
  • the second prior information can be obtained by spatial prior learning through an autoregressive network (this process is the spatial prior branch)
  • the third prior information can be obtained by temporal prior learning through a temporal prior encoder (this process is the temporal prior branch).
  • when the entropy model is used for probability modeling (i.e., the entropy model is used as a prior prediction structure), the entropy model can include a hyper-prior model, an autoregressive network, and a temporal prior encoder.
  • the hyper-prior model may include a hyper-prior encoder (HPE) and a hyper-prior decoder (HPD).
  • the embodiment of the present application does not limit the network structure of the hyper-prior encoder and the hyper-prior decoder.
  • the hyper-prior encoder may be composed of three layers of convolutional layers
  • the hyper-prior decoder may be composed of three layers of deconvolutional layers.
  • the embodiment of the present application does not limit the network structure of the temporal prior encoder.
  • the temporal prior encoder may be composed of 3 deconvolutional layers, denormalization layers such as inverse generalized divisive normalization (IGDN), and 1 convolutional layer (e.g., conv3×3).
  • Figure 4 takes the entropy model as an example to show an example diagram of probability modeling using an entropy model.
  • the hierarchical prior branch takes y_t as input, passes it through a hyper-prior encoder composed of three convolutional layers, quantizes the result (the quantization is denoted by Q), and then passes it through a hyper-prior decoder composed of three deconvolutional layers to output the hierarchical prior feature map (i.e., the first prior information).
  • the spatial prior branch quantizes the input y_t and passes it through the autoregressive network to obtain the spatial prior feature map (i.e., the second prior information); the temporal prior branch takes the video compression condition x̄_t as input and passes it through the temporal prior encoder, composed of 3 deconvolutional layers, denormalization layers and a conv3×3, to obtain the temporal prior feature map (i.e., the third prior information). The first prior information, the second prior information and the third prior information are concatenated in the channel dimension and input into three stacked convolutional layers to obtain μ_t and σ_t for the prediction probability model of y_t.
  • the probability model is used to guide the arithmetic encoding (AE) and arithmetic decoding (AD) of the quantized y_t (see FIG. 3), thereby reducing the byte stream consumption of the video compression file. It should be noted that because the feature vector of the initial video frame, i.e., the video compression condition x̄_t, is used, the estimation of the probability model can be made more accurate.
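  • The fusion of the three prior branches into the distribution parameters can be sketched as below; the channel counts, spatial size and activation choices are placeholders, since the exact layer configuration is not reproduced here.

        import torch
        import torch.nn as nn

        C = 96  # assumed channel count of each prior feature map
        fusion = nn.Sequential(nn.Conv2d(3 * C, 2 * C, 3, padding=1), nn.ReLU(),
                               nn.Conv2d(2 * C, 2 * C, 3, padding=1), nn.ReLU(),
                               nn.Conv2d(2 * C, 2 * C, 3, padding=1))

        hier_prior = torch.randn(1, C, 8, 8)       # hierarchical prior branch output
        spatial_prior = torch.randn(1, C, 8, 8)    # spatial prior branch (autoregressive network) output
        temporal_prior = torch.randn(1, C, 8, 8)   # temporal prior branch output

        params = fusion(torch.cat([hier_prior, spatial_prior, temporal_prior], dim=1))
        mu_t, sigma_t = params.chunk(2, dim=1)                  # distribution parameters of y_t
        sigma_t = torch.nn.functional.softplus(sigma_t)         # keep the scale positive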
  • the distribution parameters of latent features can be estimated more accurately, reducing the byte stream consumed by compressing latent features, thereby reducing the byte stream required for video frame compression.
  • the quantized results may be arithmetically encoded and arithmetically decoded, and the output of the arithmetic decoding may be input into the hyper-prior decoder.
  • the distribution parameters can be used to assist in video compression and subsequent video decoding.
  • the distribution parameters need to be further processed to obtain the cumulative probability density, so as to use the cumulative probability density for video compression or video decoding.
  • Video compression can also be called video coding, which can be implemented by an arithmetic encoder. There are many implementation versions of the arithmetic encoder, and the embodiment of the present application uses an open source version.
  • the formula for calculating the cumulative probability density using the distribution parameters is as follows:
  • cdf represents the cumulative probability density
  • G() represents probability modeling
  • yt represents the latent feature
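  • the exact cdf expression is not reproduced above; as a sketch only, one standard way of turning Laplace parameters μt and σt into the per-symbol probabilities consumed by an arithmetic coder is shown below, and it stands in for, rather than reproduces, the document's formula.

```python
# Hedged sketch: under the Laplace assumption for y_t, a cumulative probability density and
# the probability mass of an integer-quantized symbol can be evaluated as follows. This is a
# standard construction, not a reproduction of the document's (omitted) formula.
import torch

def laplace_cdf(y, mu, sigma):
    z = (y - mu) / sigma
    return torch.where(z < 0, 0.5 * torch.exp(z), 1.0 - 0.5 * torch.exp(-z))

def symbol_pmf(y_hat, mu_t, sigma_t):
    # probability mass of the quantized latent symbol y_hat, used to drive AE/AD
    return laplace_cdf(y_hat + 0.5, mu_t, sigma_t) - laplace_cdf(y_hat - 0.5, mu_t, sigma_t)
```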
  • the video frame to be processed and the previous video frame of the video frame to be processed can be obtained, and the previous video frame is a video frame in the video frame sequence that is adjacent to the video frame to be processed and is located before the video frame to be processed.
  • the key points of the video frame to be processed and the previous video frame are extracted respectively to obtain the first position information of the key points of the video frame to be processed and the second position information of the key points of the previous video frame, so as to perform motion estimation according to the first position information and the second position information, and obtain the motion information of the video frame to be processed relative to the previous video frame.
  • Image restoration is performed according to the motion information and the previous video frame to obtain the initial video frame.
  • the present application can further determine hidden features according to the video frame to be processed and the initial video frame during video compression, and characterize the restoration deviation of the initial video frame relative to the video frame to be processed by the hidden features, so as to perform video compression according to the first position information, the second position information and the hidden features to obtain a video compression file.
  • the video receiving end can calculate the motion information based on the first position information and the second position information, and perform image repair based on the motion information and the previous video frame to obtain the initial video frame.
  • the video compression file also includes latent features, the latent features represent the repair deviation of the initial video frame relative to the video frame to be processed, so the latent features can be further used to perform secondary repair on the initial video frame, thereby alleviating the video frame distortion caused by complex picture motion and improving the robustness of the algorithm.
  • the video compression file includes the first position information and the second position information, not the dense feature vector representing the motion information, so that when video compression is achieved, the byte stream consumed by the motion information is greatly reduced, and the transmission bandwidth of the video compression file is reduced.
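  • as a rough, illustrative comparison only (the exact serialization and entropy coding are not specified here): a dense motion representation of size (N, 2, H/16, W/16) for a 256×256 frame and N = 50 key points already holds 50×2×16×16 = 25,600 values, whereas transmitting only the key-point positions of the two frames amounts to about 2×50×2 = 200 coordinate values, which is where the byte-stream saving described above comes from.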
  • compared with residual-based methods such as deep video compression (DVC) in the related art, condition-based methods compensate video frames from the feature space and can achieve better video compression.
  • the above embodiment introduces a video compression method.
  • the video sending end compresses the video frame to be processed based on the above method to obtain a video compression file, and after the video compression file is sent to the video receiving end, the video receiving end can perform video decoding according to the video compression file to reconstruct the video frame to be processed.
  • the video decoding method will be described in detail, as shown in FIG5, the method includes:
  • the video receiving end obtains a video compression file.
  • the video receiving end may obtain the video compression file by receiving the video compression file sent by the video sending end, wherein the content included in the video compression file is the content added to the video compression file by the video sending end.
  • the video compression file includes at least the first position information of the first key point of the video frame to be processed, the second position information of the second key point of the previous video frame, and the hidden feature.
  • the previous video frame is a video frame in the video frame sequence that is adjacent to the video frame to be processed and is located before the video frame to be processed.
  • the video receiving end performs motion estimation according to the first position information and the second position information to obtain motion information of the to-be-processed video frame relative to the previous video frame.
  • the video receiving end may perform motion estimation based on the received first position information and second position information, thereby obtaining motion information of the to-be-processed video frame relative to the previous video frame, so as to perform image restoration based on the motion information.
  • the video receiving end performs motion estimation based on the first position information and the second position information by first performing a thin plate spline transformation based on the first position information and the second position information to obtain a thin plate spline transformation matrix; the previous video frame is then transformed according to the thin plate spline transformation matrix to obtain a transformed image, and a contribution map is output by the motion network based on the transformed image. The contribution map represents the contribution of each thin plate spline transformation matrix to the motion of each pixel of the previous video frame, so that the motion information can be calculated from the contribution map and the thin plate spline transformation matrices (a minimal sketch of this weighting step is given after this bullet).
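  • the sketch below illustrates only that final per-pixel weighting, assuming the K thin plate spline transformations have already been rendered as displacement fields and the contribution maps are already normalized; both tensor shapes are assumptions of this sketch.

```python
# Hedged sketch of the per-pixel weighting that turns the K thin-plate-spline (TPS)
# candidate flows and the motion network's contribution maps into the motion information,
# i.e. T(x, y) = sum_k M_k(x, y) * T_k(x, y).
# Assumed shapes: tps_flows (K, 2, H, W); contributions (K, H, W), normalized over K.
import torch

def combine_tps_flows(tps_flows: torch.Tensor, contributions: torch.Tensor) -> torch.Tensor:
    return (contributions.unsqueeze(1) * tps_flows).sum(dim=0)  # -> (2, H, W) flow field

if __name__ == "__main__":
    K, H, W = 11, 64, 64   # e.g. 10 TPS transforms plus one background transform (assumed)
    flows = torch.randn(K, 2, H, W)
    weights = torch.softmax(torch.randn(K, H, W), dim=0)
    print(combine_tps_flows(flows, weights).shape)
```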
  • the video compression file also includes an affine transformation matrix.
  • the method of transforming the previous video frame according to the thin plate spline transformation matrix to obtain a transformed image can be to transform the previous video frame using the thin plate spline transformation matrix and the affine transformation matrix to obtain a transformed image.
  • the method of outputting the contribution map through the motion network according to the transformed image may be to output the contribution map and mask information through the motion network according to the transformed image.
  • the method of performing image restoration according to the motion information and the previous video frame to obtain the initial video frame may be to perform image restoration according to the motion information, the mask information and the previous video frame to obtain the initial video frame.
  • the mask information is used to indicate that the attention of the image restoration should be focused more on the foreground (i.e., the first object), thereby reducing the influence of the background area on the foreground image restoration and improving the effect of the image restoration.
  • S503 The video receiving end performs image restoration according to the motion information and the previous video frame to obtain an initial video frame.
  • the motion information reflects the difference between the video frame to be processed and the previous video frame, and then the video receiving end can perform image restoration based on the motion information and the previous video frame to obtain the initial video frame.
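  • as a sketch of the basic operation behind this restoration step, the snippet below backward-warps the previous frame with the estimated flow; in the embodiment this is carried out inside the image restoration (inpainting) network on multi-scale feature maps, optionally modulated by the mask information, and the pixel-unit flow convention used here is an assumption of this sketch.

```python
# Hedged sketch of warping the previous frame with the estimated motion information.
# The embodiment applies this inside an encoder-decoder inpainting network (on feature
# maps, optionally weighted by the mask); here a single image-level warp is shown.
# The flow is assumed to be a per-pixel displacement in pixel units.
import torch
import torch.nn.functional as F

def warp(prev_frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    # prev_frame: (1, 3, H, W); flow: (1, 2, H, W)
    _, _, h, w = prev_frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0)   # sampling positions (x, y)
    pos = base + flow
    gx = 2.0 * pos[:, 0] / (w - 1) - 1.0                       # normalize to [-1, 1]
    gy = 2.0 * pos[:, 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                       # (1, H, W, 2)
    return F.grid_sample(prev_frame, grid, align_corners=True)
```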
  • S504 The video receiving end performs secondary repair on the initial video frame using the latent features to obtain a final video frame.
  • the video receiving end can also use the hidden features included in the video compression file to perform secondary repair on the initial video frame to obtain a higher quality final video frame (see Figure 3). Specifically, the video receiving end can extract features from the initial video frame to obtain a feature vector of the initial video frame, and then use the feature vector of the initial video frame as a video frame compression condition combined with the hidden features for secondary repair to obtain the final video frame.
  • the feature extraction in this step can be implemented by a feature extractor, and the final secondary restoration can be implemented by a context decoder.
  • the latent features can be quantized first, and then secondary restoration is performed based on the quantized latent features to obtain the final video frame.
  • the final video frame is reconstructed by the conditional decoder from the quantized latent features and the video frame compression condition; written out functionally: final video frame = f_dec(round(yt), compression condition), where yt = f_enc(xt, compression condition) and compression condition = f_ex(initial video frame).
  • f_dec() represents the conditional decoder, round() represents quantization, yt represents the latent features, and f_enc() represents the conditional encoder.
  • xt represents the video frame to be processed (specifically, its corresponding pixel matrix), the compression condition is the feature vector of the initial video frame, and f_ex() represents the feature extractor.
  • the computation of the latent features can be implemented at the video sending end, and the video receiving end can directly use the latent features it obtains from the video compression file.
  • conditional encoder and feature extractor can be referred to the embodiment corresponding to Figure 2, and will not be repeated here.
  • the conditional decoder used in the embodiment of the present application can be composed of 3 multi-layer deconvolution layers, an inverse normalization layer (IGDN), 1 conv3×3, 2 residual modules, and 1 conv3×3 stacked together.
  • the quantized latent features are input into the conditional decoder, and the reconstructed image features are obtained after passing through the 3 multi-layer deconvolution layers and the inverse normalization layer (IGDN) of the conditional decoder.
  • the reconstructed image features are concatenated with the video compression condition in the channel direction and then fed into the subsequent convolutional layer and residual modules to obtain the final video frame (an illustrative sketch of such a decoder follows).
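  • the sketch below assembles a decoder of this shape purely for illustration; the channel counts, the upsampling factors, and the use of ReLU in place of IGDN are assumptions of this sketch, not values given by the embodiment (only the 64-channel compression condition is taken from the description).

```python
# Hedged sketch of the conditional (context) decoder described above: quantized latents are
# upsampled by a deconvolution stack, concatenated with the video frame compression
# condition along channels, then refined by conv + residual modules into the final frame.
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1),
        )
    def forward(self, x):
        return x + self.body(x)

class ContextDecoder(nn.Module):
    def __init__(self, latent_ch=128, cond_ch=64, mid_ch=64):
        super().__init__()
        self.upsample = nn.Sequential(   # stands in for the 3 deconvolution + IGDN layers
            nn.ConvTranspose2d(latent_ch, mid_ch, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(mid_ch, mid_ch, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(mid_ch, mid_ch, 4, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.refine = nn.Sequential(     # conv3x3 + 2 residual modules + conv3x3
            nn.Conv2d(mid_ch + cond_ch, mid_ch, 3, padding=1),
            ResBlock(mid_ch), ResBlock(mid_ch),
            nn.Conv2d(mid_ch, 3, 3, padding=1),
        )

    def forward(self, y_hat, condition):
        feats = self.upsample(y_hat)
        # reconstructed image features concatenated with the compression condition
        return self.refine(torch.cat([feats, condition], dim=1))

# usage (shapes assumed by this sketch): y_hat (1, 128, H/8, W/8), condition (1, 64, H, W)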
  • the video compression file may also include distribution parameters.
  • the latent features included in the video compression file may be latent features obtained by arithmetic encoding assisted by the distribution parameters.
  • before the latent features are used to perform the secondary repair of the initial video frame, the encoded latent features may first be arithmetically decoded with the aid of the distribution parameters to obtain the latent features, and the arithmetically decoded latent features are then used for the secondary repair.
  • the embodiments corresponding to FIG. 2 and FIG. 5 introduce the entire process of video compression and video decoding, which can be called AI video compression technology based on key points and can be implemented through a video compression model.
  • the above method can be integrated into the video communication software as a video communication tool to ensure the user's video communication experience.
  • the video compression model provided in the embodiment of the present application is mainly divided into a key point-based motion estimation module and a conditional video frame extraction module, wherein the motion estimation module is composed of four sub-modules, namely a key point detector, a background motion prediction network, a motion network, and an image restoration network, and the conditional video frame extraction module mainly includes two parts, a conditional encoder and a conditional decoder.
  • the specific process of AI video compression through the video compression model can be seen in Figure 6, which is mainly divided into two stages. The first stage is to predict motion information through the video frame to be processed and the previous video frame, and thereby obtain the initial video frame. The second stage is to encode and decode the video frame to be processed using the initial video frame as a priori condition to obtain the final video frame.
  • the video transmitter writes the first position information and the second position information output by the key point detector in the motion estimation module, the affine transformation matrix output by the background motion prediction network in the motion estimation module, and the hidden features and distribution parameters output by the conditional video frame extraction module into a video compression file, and stores the video compression file and transmits it to the video receiver.
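  • for illustration, the container below lists the fields that, per the description above, make up the compression file for one P-frame; the concrete types and the serialization format are assumptions of this sketch and are not specified by the embodiment.

```python
# Hedged sketch of the per-frame contents of the video compression file described above.
# Field types (numpy arrays, raw bytes) and shapes are assumptions of this sketch.
from dataclasses import dataclass
import numpy as np

@dataclass
class CompressedFrame:
    first_positions: np.ndarray    # key points of the frame to be processed, e.g. (K*N, 2)
    second_positions: np.ndarray   # key points of the previous frame, e.g. (K*N, 2)
    affine_matrix: np.ndarray      # background affine transform, e.g. (2, 3)
    encoded_latent: bytes          # arithmetically encoded latent features
    mu: np.ndarray                 # distribution parameters used to drive the
    sigma: np.ndarray              #   arithmetic decoder on the receiving side
```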
  • the video receiver determines the motion information through the motion network based on the first position information, the second position information and the affine transformation matrix, and then performs image restoration through the image restoration network based on the motion information to obtain the initial video frame. Then, the initial video frame is restored twice using the hidden features to obtain the final video frame.
  • the size of the video compression file transmitted by the above method is significantly reduced compared to the video frame to be processed, which can reduce the file transmission bandwidth.
  • the video compression model used in the embodiment of the present application can be pre-trained, and the training data can come from VoxCeleb (a data set), with a total of 145,569 videos of size 256×256 used as the training set and 4,911 used as the test set.
  • the number of frames per video in the training data ranges from 64 to 1024.
  • the video compression model can be set to train for a total of 2e6 steps; the optimizer uses Adam by default with an initial learning rate of 1e-4, and after 1.8e6 training steps the learning rate drops to 1e-5.
  • during training, the model parameters can be optimized with the loss function Loss = R + λ(D1 + D2), where R is the bit rate of the video compression file;
  • D1 represents the restoration quality of the initial video frame;
  • D2 represents the restoration quality of the final video frame and is calculated in the form of a perceptual loss; λ is a tuning factor with a default value of 0.0001 (a minimal training sketch follows).
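  • the snippet below is a minimal sketch of this objective and schedule, assuming the rate term R and the distortion terms D1/D2 are produced by the model's forward pass; the perceptual-loss implementation and batching are not specified here.

```python
# Hedged sketch of the stated training objective and learning-rate schedule.
import torch

def total_loss(rate_R, d1_initial, d2_final, lam=1e-4):
    # Loss = R + lambda * (D1 + D2), with lambda defaulting to 1e-4 as stated above
    return rate_R + lam * (d1_initial + d2_final)

def make_optimizer(model):
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    # learning rate drops to 1e-5 after 1.8e6 of the 2e6 training steps
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[1_800_000], gamma=0.1)
    return opt, sched
```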
  • the performance of the video compression method provided by the embodiment of the present application is better than the method provided by the related art.
  • the embodiment of the present application compares the performance of the video compression model based on facial landmark (marked as solution one), the video compression model using only key points (marked as solution two), and the video compression model provided by the embodiment of the present application (marked as solution three) from different aspects.
  • the embodiment of the present application further provides a video compression device 1000.
  • the video compression device 1000 includes an acquisition unit 1001, an extraction unit 1002, a determination unit 1003, a repair unit 1004 and a compression unit 1005:
  • the acquisition unit 1001 is used to acquire a video frame to be processed and a previous video frame of the video frame to be processed, wherein the previous video frame is a video frame in a video frame sequence that is adjacent to the video frame to be processed and is located before the video frame to be processed;
  • the extraction unit 1002 is used to extract key points from the video frame to be processed to obtain first position information of a first key point in the video frame to be processed, and to extract key points from the previous video frame to obtain second position information of a second key point in the previous video frame;
  • the determining unit 1003 is configured to perform motion estimation according to the first position information and the second position information to obtain motion information of the to-be-processed video frame relative to the previous video frame;
  • the restoration unit 1004 is used to perform image restoration according to the motion information and the previous video frame to obtain an initial video frame;
  • the determining unit 1003 is further configured to determine a latent feature according to the to-be-processed video frame and the initial video frame, wherein the latent feature represents a restoration deviation of the initial video frame relative to the to-be-processed video frame;
  • the compression unit 1005 is used to perform video compression according to the first position information, the second position information and the latent feature to obtain a video compression file.
  • the determining unit 1003 is specifically configured to:
  • the pixel matrix of the video frame to be processed and the video compression condition are spliced, and a first splicing result obtained after the splicing is input into a conditional encoder to obtain the latent feature.
  • the device further includes a modeling unit and an encoding unit:
  • the modeling unit is used to perform probability modeling on the latent features to obtain distribution parameters, where the distribution parameters are used to represent the distribution of different information in the latent features;
  • the encoding unit is used to use the distribution parameter to assist the latent feature in performing arithmetic encoding to obtain the encoded latent feature;
  • the compression unit 1005 is specifically used for:
  • the first position information, the second position information, the encoded latent features and the distribution parameters are written into the video compression file.
  • the modeling unit is specifically used to:
  • the first priori information, the second priori information and the third priori information are fused to obtain the distribution parameter.
  • the determining unit 1003 is specifically configured to:
  • the motion information is calculated based on the contribution map and the thin plate spline transformation matrix.
  • the determining unit 1003 is further configured to:
  • the determining unit 1003 is specifically configured to:
  • the compression unit 1005 is specifically used for:
  • the first position information, the second position information, the latent features and the affine transformation matrix are written into the video compression file.
  • the determining unit 1003 is specifically configured to:
  • the repair unit 1004 is specifically used for:
  • Image restoration is performed according to the motion information, the mask information and the previous video frame to obtain the initial video frame.
  • the first key points include key points of various body parts included in the first object in the video frame to be processed
  • the second key points include key points of various body parts included in the second object in the previous video frame.
  • the extracting unit 1002 is specifically configured to:
  • the key points corresponding to the body parts included in the first object are determined, and the first position information of the key points corresponding to the body parts included in the first object in the video frame to be processed is determined; and according to the mapping relationship between body parts and key points, the key points corresponding to the body parts included in the second object are determined, and the second position information of the key points corresponding to the body parts included in the second object in the previous video frame is determined.
  • the extracting unit 1002 is specifically configured to:
  • the key point detection model is used to extract key points from the video frame to be processed to obtain the first position information, and the key point detection model is used to extract key points from the previous video frame to obtain the second position information;
  • the key point detection model is trained based on training samples, and the training samples include multiple sample images, and the sample objects in each of the sample images include body parts, and the body parts included in the sample objects in the multiple sample images cover various body parts.
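  • as an illustration of such a key point detection model, the sketch below follows the structure mentioned in the description (a ResNet18 backbone with a single fully connected layer regressing K×N key-point positions, with K=10 and N=5 given as example values); the torchvision backbone, pooling choice, and tanh normalization of the coordinates are assumptions of this sketch.

```python
# Hedged sketch of a key-point detector: ResNet18 features + one FC layer regressing
# K groups of N key points per image. Backbone source and output normalization assumed.
import torch
import torch.nn as nn
import torchvision

class KeypointDetector(nn.Module):
    def __init__(self, num_groups: int = 10, points_per_group: int = 5):
        super().__init__()
        backbone = torchvision.models.resnet18()                 # randomly initialized
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop classifier
        self.fc = nn.Linear(512, num_groups * points_per_group * 2)     # (x, y) per point
        self.num_groups, self.points_per_group = num_groups, points_per_group

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        feats = self.features(image).flatten(1)
        coords = torch.tanh(self.fc(feats))      # assumed normalization to [-1, 1]
        return coords.view(-1, self.num_groups, self.points_per_group, 2)
```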
  • the video frame to be processed and the previous video frame of the video frame to be processed can be obtained, and the previous video frame is a video frame in the video frame sequence that is adjacent to the video frame to be processed and is located before the video frame to be processed.
  • key point extraction is performed on the video frame to be processed and the previous video frame respectively to obtain the first position information of the key points of the video frame to be processed and the second position information of the key points of the previous video frame, so as to perform motion estimation based on the first position information and the second position information to obtain the motion information of the video frame to be processed relative to the previous video frame.
  • Image restoration is performed based on the motion information and the previous video frame to obtain the initial video frame.
  • the present application can further determine the hidden features according to the video frame to be processed and the initial video frame, and characterize the repair deviation of the initial video frame relative to the video frame to be processed by the hidden features, so as to obtain the video compression file by performing video compression according to the first position information, the second position information and the hidden features.
  • the video receiving end can calculate the motion information according to the first position information and the second position information, and perform image repair based on the motion information and the previous video frame to obtain the initial video frame.
  • the video compression file also includes hidden features, the hidden features characterize the repair deviation of the initial video frame relative to the video frame to be processed, so the hidden features can be further used to perform secondary repair on the initial video frame, thereby alleviating the video frame distortion caused by complex image motion and improving the robustness of the algorithm.
  • the video compression file includes the first position information and the second position information, not the dense feature vector representing the motion information, so that when the video compression is realized, the byte stream consumed by the motion information is greatly reduced, and the transmission bandwidth of the video compression file is reduced.
  • the embodiment of the present application further provides a video decoding device 1100.
  • the video decoding device 1100 includes an acquisition unit 1101, a determination unit 1102, and a repair unit 1103:
  • the acquisition unit 1101 is used to acquire a video compression file, wherein the video compression file includes first position information of a first key point of a video frame to be processed, second position information of a second key point of a previous video frame, and a latent feature, wherein the previous video frame is a video frame in a video frame sequence that is adjacent to the video frame to be processed and is located before the video frame to be processed;
  • the determining unit 1102 is configured to perform motion estimation according to the first position information and the second position information to obtain motion information of the to-be-processed video frame relative to the previous video frame;
  • the restoration unit 1103 is used to perform image restoration according to the motion information and the previous video frame to obtain an initial video frame;
  • the restoration unit 1103 is further configured to perform secondary restoration on the initial video frame using the latent features to obtain a final video frame.
  • the video compression file further includes distribution parameters
  • the latent features included in the video compression file are latent features obtained by arithmetic encoding assisted by the distribution parameters.
  • the device further includes a decoding unit:
  • the decoding unit is used to arithmetically decode the encoded latent features with the aid of the distribution parameters to obtain the latent features, before the latent features are used to perform secondary repair on the initial video frame to obtain the final video frame.
  • the determining unit 1102 is specifically configured to:
  • the motion information is calculated based on the contribution map and the thin plate spline transformation matrix.
  • the video compression file also includes an affine transformation matrix
  • the determining unit 1102 is specifically configured to:
  • the previous video frame is transformed using the thin plate spline transformation matrix and the affine transformation matrix to obtain the transformed image.
  • the determining unit 1102 is specifically configured to:
  • Performing image restoration according to the motion information and the previous video frame to obtain an initial video frame includes:
  • Image restoration is performed according to the motion information, the mask information and the previous video frame to obtain an initial video frame.
  • the embodiment of the present application also provides a computer device, which can be used as a video transmitter or a video receiver.
  • the computer device can be, for example, a terminal, and taking a smart phone as an example:
  • FIG12 is a block diagram showing a partial structure of a smartphone provided in an embodiment of the present application.
  • the smartphone includes components such as a radio frequency (RF) circuit 1210, a memory 1220, an input unit 1230, a display unit 1240, a sensor 1250, an audio circuit 1260, a wireless fidelity (WiFi) module 1270, a processor 1280, and a power supply 1290.
  • the input unit 1230 may include a touch panel 1231 and other input devices 1232
  • the display unit 1240 may include a display panel 1241
  • the audio circuit 1260 may include a speaker 1261 and a microphone 1262.
  • the smartphone structure shown in FIG12 does not constitute a limitation on the smartphone, and may include more or fewer components than shown in the figure, or combine certain components, or arrange the components differently.
  • the memory 1220 can be used to store software programs and modules.
  • the processor 1280 executes various functional applications and data processing of the smartphone by running the software programs and modules stored in the memory 1220.
  • the memory 1220 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application required for at least one function (such as a sound playback function, an image playback function, etc.), etc.; the data storage area may store data created according to the use of the smartphone (such as audio data, a phone book, etc.), etc.
  • the memory 1220 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one disk storage device, a flash memory device, or other volatile solid-state storage devices.
  • the processor 1280 is the control center of the smartphone, which uses various interfaces and lines to connect various parts of the entire smartphone, and executes various functions of the smartphone and processes data by running or executing software programs and/or modules stored in the memory 1220, and calling data stored in the memory 1220.
  • the processor 1280 may include one or more processing units; preferably, the processor 1280 may integrate an application processor and a modem processor, wherein the application processor mainly processes the operating system, user interface, and application programs, and the modem processor mainly processes wireless communications. It is understandable that the above-mentioned modem processor may not be integrated into the processor 1280.
  • the processor 1280 in the smartphone may perform the following steps:
  • the video is compressed according to the first position information, the second position information and the latent feature to obtain a video compression file.
  • the video compression file includes first position information of a first key point of a video frame to be processed, second position information of a second key point of a previous video frame, and a hidden feature, wherein the previous video frame is a video frame in a video frame sequence that is adjacent to the video frame to be processed and is located before the video frame to be processed;
  • the initial video frame is repaired twice using the latent features to obtain a final video frame.
  • the computer device provided in the embodiment of the present application may also be a server, as shown in FIG. 13 , which is a structural diagram of a server 1300 provided in the embodiment of the present application.
  • the server 1300 may vary considerably with configuration or performance, and may include one or more processors such as central processing units (CPU) 1322, a memory 1332, and one or more storage media 1330 (such as one or more mass storage devices) storing application programs 1342 or data 1344.
  • the memory 1332 and the storage medium 1330 may be temporary storage or permanent storage.
  • the program stored in the storage medium 1330 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations in the server.
  • the central processing unit 1322 may be configured to communicate with the storage medium 1330 to execute a series of instruction operations in the storage medium 1330 on the server 1300.
  • the server 1300 may also include one or more power supplies 1326, one or more wired or wireless network interfaces 1350, one or more input and output interfaces 1358, and/or one or more operating systems 1341, such as Windows Server TM , Mac OS X TM , Unix TM , Linux TM , FreeBSD TM , etc.
  • the CPU 1322 in the server 1300 may perform the following steps:
  • the video is compressed according to the first position information, the second position information and the latent feature to obtain a video compression file.
  • the video compression file includes first position information of a first key point of a video frame to be processed, second position information of a second key point of a previous video frame, and a hidden feature, wherein the previous video frame is a video frame in a video frame sequence that is adjacent to the video frame to be processed and is located before the video frame to be processed;
  • the initial video frame is repaired twice using the latent features to obtain a final video frame.
  • a computer-readable storage medium is provided, wherein the computer-readable storage medium is used to store program code, and the program code is used to execute the video compression method or the video decoding method described in the above-mentioned embodiments.
  • a computer program product comprising a computer program, the computer program being stored in a computer-readable storage medium.
  • a processor of a computer device reads the computer program from the computer-readable storage medium, and the processor executes the computer program, so that the computer device executes the method provided in various optional implementations of the above-mentioned embodiments.
  • the disclosed systems, devices and methods can be implemented in other ways.
  • the device embodiments described above are only schematic.
  • the division of the units is only a logical function division. There may be other division methods in actual implementation, such as multiple units or components can be combined or integrated into another system, or some features can be ignored or not executed.
  • Another point is that the mutual coupling or direct coupling or communication connection shown or discussed can be an indirect coupling or communication connection through some interfaces, devices or units, which can be electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit may be implemented in the form of hardware or in the form of software functional units.
  • the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the computer software product is stored in a storage medium, including several instructions for a computer device (which can be a computer, a server, or a network device, etc.) to perform all or part of the steps of the method described in each embodiment of the present application.
  • the aforementioned storage media include: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc, and other media that can store computer programs.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The present application discloses a video compression method, a video decoding method and related apparatus. Key point extraction is performed on the video frame to be processed and on the previous video frame respectively to obtain first position information and second position information, and motion estimation is performed according to the first position information and the second position information to obtain motion information. Image restoration is performed according to the motion information and the previous video frame to obtain an initial video frame. A latent feature is determined according to the video frame to be processed and the initial video frame, and video compression is performed according to the first position information, the second position information and the latent feature to obtain a video compression file, reducing the byte stream consumed by the motion information and reducing the transmission bandwidth. Since the video compression file includes the latent feature, after obtaining the initial video frame on the basis of the first position information and the second position information, the video receiving end uses the latent feature to perform secondary restoration on the initial video frame, alleviating the distortion of reconstructed video frames caused by complex picture motion and improving the robustness of the algorithm.

Description

一种视频压缩方法、视频解码方法和相关装置
本申请要求于2022年11月4日提交中国专利局、申请号202211377480.4、申请名称为“一种视频压缩方法、视频解码方法和相关装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及通信技术领域,特别是涉及视频压缩技术、视频解码技术。
背景技术
随着计算机技术、网络技术、通信技术和流媒体技术的迅速发展,为多媒体视频通信的发展提供了强有力的技术保障。视频通信被广泛的应用于如视频会议、在线教育、在线娱乐等场景中。然而如何减少视频卡顿,降低视频通信对带宽的需求,保证用户的视频通信体验是一个亟需解决的问题。
视频压缩是解决这个问题的关键技术,通过对视频帧进行压缩,使得能用较低字节流传输视频,并尽可能保证根据较低字节流的视频压缩文件恢复出高质量视频。目前,主要是计算待处理视频帧相较前一视频帧的运动信息,进而发送该运动信息以便基于前一视频帧和该运动信息恢复出待处理视频帧。
然而,这种方法中,运动信息消耗的字节流较大,并且在视频帧出现复杂画面运动的情况下很难估计运动信息,重建画面容易失真。
发明内容
为了解决上述技术问题,本申请提供了一种视频压缩方法、视频解码方法和相关装置,从而缓解复杂画面运动造成的视频帧失真现象,提升算法鲁棒性。另外,视频压缩文件中包括的是第一位置信息和第二位置信息,并非表示运动信息的稠密特征向量,从而在实现视频压缩的情况下,极大减小运动信息消耗的字节流,减小视频压缩文件传输带宽。
本申请实施例公开了如下技术方案:
一方面,本申请实施例提供一种视频压缩方法,所述方法由计算机设备执行,所述方法包括:
获取待处理视频帧和所述待处理视频帧的前一视频帧,所述前一视频帧是视频帧序列中与所述待处理视频帧相邻、且位于所述待处理视频帧之前的视频帧;
对所述待处理视频帧进行关键点提取,得到所述待处理视频帧中第一关键点的第一位置信息,以及对所述前一视频帧进行关键点提取,得到所述前一视频帧中第二关键点的第二位置信息;
根据所述第一位置信息和所述第二位置信息进行运动估计,得到所述待处理视频帧相对于所述前一视频帧的运动信息;
根据所述运动信息和所述前一视频帧进行图像修复,得到初始视频帧;
根据所述待处理视频帧和所述初始视频帧确定隐特征,所述隐特征表征所述初始视频帧相对于所述待处理视频帧的修复偏差;
根据所述第一位置信息、所述第二位置信息和所述隐特征进行视频压缩得到视频压缩文件。
一方面,本申请实施例提供一种视频解码方法,所述方法由计算机设备执行,所述方法包括:
获取视频压缩文件,所述视频压缩文件中包括待处理视频帧的第一关键点的第一位置信息、前一视频帧的第二关键点的第二位置信息和隐特征,所述前一视频帧是视频帧序列中与所述待处理视频帧相邻、且位于所述待处理视频帧之前的视频帧;
根据所述第一位置信息和所述第二位置信息进行运动估计,得到所述待处理视频帧相对于所述前一视频帧的运动信息;
根据所述运动信息和所述前一视频帧进行图像修复,得到初始视频帧;
利用所述隐特征对所述初始视频帧进行二次修复,得到最终视频帧。
一方面,本申请实施例提供一种视频压缩装置,所述装置部署在计算机设备上,所述装置包括获取单元、提取单元、确定单元、修复单元和压缩单元:
所述获取单元,用于获取待处理视频帧和所述待处理视频帧的前一视频帧,所述前一视频帧是视频帧序列中与所述待处理视频帧相邻、且位于所述待处理视频帧之前的视频帧;
所述提取单元,用于对所述待处理视频帧进行关键点提取,得到所述待处理视频帧中第一关键点的第一位置信息,以及对所述前一视频帧进行关键点提取,得到所述前一视频帧中第二关键点的第二位置信息;
所述确定单元,用于根据所述第一位置信息和所述第二位置信息进行运动估计,得到所述待处理视频帧相对于所述前一视频帧的运动信息;
所述修复单元,用于根据所述运动信息和所述前一视频帧进行图像修复,得到初始视频帧;
所述确定单元,还用于根据所述待处理视频帧和所述初始视频帧确定隐特征,所述隐特征表征所述初始视频帧相对于所述待处理视频帧的修复偏差;
所述压缩单元,用于根据所述第一位置信息、所述第二位置信息和所述隐特征进行视频压缩得到视频压缩文件。
一方面,本申请实施例提供一种视频解码装置,所述装置部署在计算机设备上,所述装置包括获取单元、确定单元和修复单元:
所述获取单元,用于获取视频压缩文件,所述视频压缩文件中包括待处理视频帧的第一关键点的第一位置信息、前一视频帧的第二关键点的第二位置信息和隐特征,所述前一视频帧是视频帧序列中与所述待处理视频帧相邻、且位于所述待处理视频帧之前的视频帧;
所述确定单元,用于根据所述第一位置信息和所述第二位置信息进行运动估计,得到所述待处理视频帧相对于所述前一视频帧的运动信息;
所述修复单元,用于根据所述运动信息和所述前一视频帧进行图像修复,得到初始视频帧;
所述修复单元,还用于利用所述隐特征对所述初始视频帧进行二次修复,得到最终视频帧。
一方面,本申请实施例提供一种计算机设备,所述计算机设备包括处理器以及存储器:
所述存储器用于存储程序代码,并将所述程序代码传输给所述处理器;
所述处理器用于根据所述程序代码中的指令执行前述任一方面所述的方法。
一方面,本申请实施例提供一种计算机可读存储介质,所述计算机可读存储介质用于存储程序代码,所述程序代码当被处理器执行时使所述处理器执行前述任一方面所述的方法。
一方面,本申请实施例提供一种计算机程序产品,包括计算机程序,该计算机程序被处理器执行时实现前述任一方面所述的方法。
由上述技术方案可以看出,在需要对待处理视频帧进行视频压缩时,可以获取待处理视频帧和待处理视频帧的前一视频帧,前一视频帧是视频帧序列中与待处理视频帧相邻、且位于待处理视频帧之前的视频帧。接着,对待处理视频帧和前一视频帧分别进行关键点提取,得到待处理视频帧的第一关键点的第一位置信息和前一视频帧的第二关键点的第二位置信息,以便根据第一位置信息和第二位置信息进行运动估计,得到待处理视频帧相对于前一视频帧的运动信息。根据运动信息和前一视频帧进行图像修复,得到初始视频帧。为了避免在待处理视频帧中包括多个对象运动、出现前一视频帧未出现的对象等画面复杂的情况下,导致重建画面失真,本申请在视频压缩时,还可以进一步根据待处理视频帧和初始视频帧确定隐特征,通过隐特征表征初始视频帧相对于待处理视频帧的修复偏差,从而根据第一位置信息、第二位置信息和隐特征进行视频压缩得到视频压缩文件。这样,视频接收端在获取到视频压缩文件后,便可以根据第一位置信息和第二位置信息计算得到运动信息,并基于运动信息和前一视频帧进行图像修复得到初始视频帧,由于视频压缩文件中还包括隐特征,隐特征表征初始视频帧相对于待处理视频帧的修复偏差,故可以进一步利用隐特征对初始视频帧进行二次修复,从而缓解复杂画面运动造成的视频帧失真现象,提升算法鲁棒性。另外,视频压缩文件中包括的是第一位置信息和第二位置信息,并非表示运动信息的稠密特征向量,从而在实现视频压缩的情况下,极大减小运动信息消耗的字节流,减小视频压缩文件传输带宽。
附图说明
为了更清楚地说明本申请实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术成员来讲,在不付出创造性劳动性的前提下,还可以根据这些附图获得其他的附图。
图1为本申请实施例提供的一种视频压缩方法的应用场景架构图;
图2为本申请实施例提供的一种视频压缩方法的流程图;
图3为本申请实施例提供的一种视频压缩模型的结构示例图;
图4为本申请实施例提供的一种概率建模的具体过程示例图;
图5为本申请实施例提供的一种视频解码方法的流程图;
图6为本申请实施例提供的一种AI视频压缩的具体过程示例图;
图7为本申请实施例提供的多种方案在图像主观质量方面的性能对比图;
图8为本申请实施例提供的多种方案在复杂场景重建方面的性能对比图;
图9为本申请实施例提供的多种方案在非人脸场景的性能对比图;
图10为本申请实施例提供的一种视频压缩装置的结构图;
图11为本申请实施例提供的一种视频解码装置的结构图;
图12为本申请实施例提供的一种终端的结构图;
图13为本申请实施例提供的一种服务器的结构图。
具体实施方式
下面结合附图,对本申请的实施例进行描述。
随着计算机技术、网络技术、通信技术和流媒体技术的迅速发展,为多媒体视频通信的发展提供了强有力的技术保障。视频通信被广泛的应用于如视频会议、在线教育、在线娱乐等场景中。尤其是在近两年,由于病毒的传播,公司组织的运作方式发生了重大变化,人和人之间开展业务交流的形式逐步从线下转到线上,使得视频通信在视频会议中应用更为广泛。相较线下会议,线上的视频会议减少了参会者的空间地点限制,推动了高效且具有成本效益的协作。然而如何减少视频卡顿,降低视频会议对带宽的需求,保证用户的视频会议体验是一个亟需解决的问题。视频压缩是解决这个问题的关键技术,通过对视频帧进行压缩,使得能用较低字节流传输视频,并尽可能保证根据较低字节流文件恢复高质量视频。视频压缩根据解压视频与原始视频的质量差异分为有损视频压缩和无损视频压缩,本申请聚焦在有损视频压缩。
在进行视频压缩时,已知一个视频,将视频帧序列中的第一个视频帧记为I帧,将其余视频帧记为P帧。由于同一视频不同视频帧间往往存在重复、冗余信息,因此视频压缩依据当前视频帧(即待处理视频帧)xt的前一视频帧xt-1恢复当前视频帧。在已知xt-1的情况下,仅需要确定当前视频帧xt和前一视频帧xt-1的差异即可重建当前视频帧。
当前视频帧xt和前一视频帧xt-1的差异可以通过运动信息来体现,故相关技术中通常会得到当前视频帧相较前一视频帧的运动信息,进而发送该运动信息以便基于前一视频帧和该运动信息恢复出当前视频帧。然而,这种方法中,运动信息是稠密运动特征向量,消耗的字节流较大,并且在视频帧出现复杂画面运动的情况下很难估计运动信息,重建画面容易失真。
为了解决上述技术问题,本申请实施例提供一种视频压缩方法,该方法根据待处理视频帧的第一关键点的第一位置信息、前一视频帧的第二关键点的第二位置信息和隐特征进行视频压缩得到视频压缩文件。这样,视频接收端在获取到视频压缩文件后,便可以根据第一位置信息和第二位置信息计算得到运动信息,并基于运动信息和前一视频帧进行图像修复得到初始视频帧,由于视频压缩文件中还包括隐特征,隐特征表征初始视频帧相对于待处理视频帧的修复偏差,故可以进一步利用隐特征对初始视频帧进行二次修复,从而缓解复杂画面运动造成的视频帧失真现象,提升算法鲁棒性。另外,视频压缩文件中包括的是第一位置信息和第二位置信息,并非表示运动信息的稠密特征向量,从而在实现视频压缩的情况下,极大减小运动信息消耗的字节流,减小视频压缩文件传输带宽。
需要说明的是,本申请实施例提供的视频压缩方法适用于各种视频通信场景,例如视频会议、在线教育、在线娱乐等场景。
本申请实施例提供的视频压缩方法可以由计算机设备执行,该计算机设备可以作为视频发送端,该计算机设备例如可以是服务器,也可以是终端。服务器可以是独立的物理服务器,也可以是多个物理服务器构成的服务器集群或者分布式***,还可以是提供云计算服务的云服务器。终端包括但不限于智能手机、电脑、智能语音交互设备、智能家电、车载终端、飞行器等。
如图1所示,图1示出了一种视频压缩方法的应用场景架构图,该应用场景中可以包括视频发送端101和视频接收端102。当视频发送端101与视频接收端102之间进行视频通信时,视频发送端101需要向视频接收端102传输视频帧,传输的所有视频帧可以构成视频通信中传输的视频帧序列。在发送视频帧时,为了减少传输视频帧所消耗的字节流,通常会对需要发送的视频帧进行视频压缩。
当传输到当前视频帧时,可以将当前视频帧作为待处理视频帧,对该待处理视频帧进行视频压缩。具体地,视频发送端101可以获取待处理视频帧和待处理视频帧的前一视频帧。其中,待处理视频帧为当前需要进行视频压缩,并传输至视频接收端102的视频帧,前一视频帧是视频帧序列中与待处理视频帧相邻、且位于待处理视频帧之前的视频帧。
接着,视频发送端101可以对待处理视频帧进行关键点提取,得到待处理视频帧的第一关键点的第一位置信息,以及对前一视频帧进行关键点提取,得到前一视频帧的第二关键点的第二位置信息,以便根据第一位置信息和第二位置信息进行运动估计,得到待处理视频帧相对于前一视频帧的运动信息。其中,关键点可以是视频帧中所包括的对象上具有代表性的点,通过关键点代表视频帧中所包括的对象,对象可以是人、动物等。以对象是人为例,关键点可以是视频帧中所包括人体各个身体部位上具有代表性的点,各个身体部位例如可以包括人脸、手、胳膊、身体、脚、腿部等。例如当视频帧中包括的身体部位为人脸时,关键点可以是人脸上具有代表性的点;当视频帧中包括的身体部位是人脸和手时,关键点可以是人脸以及手上具有代表性的点,等等。待处理视频帧的关键点可以称为第一关键点,第一关键点可以是待处理视频帧中所包括的第一对象上具有代表性的点,前一视频帧的关键点可以称为第二关键点,第二关键点可以是前一视频帧所包括的第二对象上具有代表性的点,第一对象与第二对象可以相同,也可以不同。
视频发送端101根据运动信息和前一视频帧进行图像修复,得到初始视频帧。为了避免在待处理视频帧中包括多个对象运动、出现前一视频帧未出现的对象等画面复杂的情况下,导致重建画面失真,本申请在视频压缩时,视频发送端101还可以进一步根据待处理视频帧和初始视频帧确定隐特征。其中,隐特征可以是图像修复后的初始视频帧相对于待处理视频帧修复不清楚的部分的特征,隐特征用于表征初始视频帧相对于待处理视频帧的修复偏差。
之后,视频发送端101可以根据第一位置信息、第二位置信息和隐特征进行视频压缩得到视频压缩文件,并将视频压缩文件发送至视频接收端102。视频接收端102在接收到视频压缩文件后,便可以根据第一位置信息和第二位置信息计算得到运动信息,并基于运动信息和前一视频帧进行图像修复得到初始视频帧,由于视频压缩文件中还包括隐特征,隐特征表征初始视频帧相对于待处理视频帧的修复偏差,故可以进一步利用隐特征对初始 视频帧进行二次修复,从而缓解复杂画面运动造成的视频帧失真现象,提升算法鲁棒性。另外,视频压缩文件中包括的是第一位置信息和第二位置信息,并非表示运动信息的稠密特征向量,从而在实现视频压缩的情况下,极大减小运动信息消耗的字节流,减小视频压缩文件传输带宽。
需要说明的是,本申请实施例提供的方法主要涉及人工智能技术,通过人工智能(Artificial Intelligence,AI)技术自动进行视频压缩、视频解码。在本申请实施例中,可以通过机器学习训练视频压缩模型,还可以通过计算机视觉技术中的图像处理对待处理视频帧和前一视频帧进行预处理,通过图像语义理解提取关键点、隐特征等。
接下来,将结合附图对本申请实施例提供的视频压缩方法进行详细介绍。参见图2,图2示出了一种视频压缩方法的流程图,所述方法包括:
S201、视频发送端获取待处理视频帧和所述待处理视频帧的前一视频帧,所述前一视频帧是视频帧序列中与所述待处理视频帧相邻、且位于所述待处理视频帧之前的视频帧。
当视频发送端需要向视频接收端发送当前视频帧时,可以将当前视频帧作为待处理视频帧进行视频压缩。基于视频压缩的原理,通常将前一视频帧作为参考依据,以便根据待处理视频帧和前一视频帧之间的差异重建待处理视频帧。为此,视频发送端可以获取待处理视频帧和待处理视频帧的前一视频帧。其中,待处理视频帧为当前需要进行视频压缩,并传输至视频接收端的视频帧,前一视频帧是视频帧序列中与待处理视频帧相邻、且位于待处理视频帧之前的视频帧。待处理视频帧可以用xt表示,前一视频帧可以用xt-1表示。参见图3所示,图3示出了一种视频压缩模型的结构示例图,待处理视频帧xt和前一视频帧xt-1可以参见图3所示。
在一种可能的实现方式中,视频帧序列是需要传输的多个视频帧按照其在时域中的时间顺序进行排序得到的,相应的,前一视频帧是指视频帧序列中与待处理视频帧相邻、且时域中位于待处理视频帧之前的视频帧。
S202、视频发送端对所述待处理视频帧进行关键点提取,得到所述待处理视频帧中第一关键点的第一位置信息,以及对所述前一视频帧进行关键点提取,得到所述前一视频帧中第二关键点的第二位置信息。
在获取到待处理视频帧和前一视频帧后,视频发送端可以确定二者之间的差异,二者之间的差异可以用运动信息表示。在一些情况下,为了降低运动信息消耗字节流可以降低。比如,在视频会议等特殊场景中,画面中的主体一般是人脸、人体等复杂度不高的实例,在度量其运动信息时可以通过实例中的特殊点例如关键点,度量视频帧间的运动信息。由于常见视频压缩中的运动信息的保存形式是一个尺寸为(N,2,H/16,W/16)的稠密特征向量,记录关键点的位置信息能极大减小运动信息消耗的字节流。其中,N表示确定运动信息所使用的关键点的数量,H表示待处理视频帧的高度,W表示待处理视频帧的宽度。
基于此,为了减少所消耗的字节流,在本申请实施例中,视频发送端可以对待处理视频帧和所述前一视频帧分别进行关键点提取,识别出待处理视频帧的第一关键点和前一视频帧的第二关键点,进而得到第一关键点的第一位置信息和第二关键点的第二位置信息,以便通过第一位置信息和第二位置信息代替表示运动信息的稠密特征向量向视频接收端传 输。在一种可能的情况下,位置信息可以用坐标表示,即第一位置信息可以是第一关键点的坐标,第二位置信息可以是第二关键点的坐标。
在一种可能的实现方式中,关键点可以是人脸标记点(facial landmark),人脸标记点是根据人的五官结构预先定义好的一套固定点。然而在一些场景下,例如在人体运动场景下,视频帧中包括的对象可能不仅包括人脸,还可能包括手、胳膊、脚等。在这种情况下,为了使得提取到的关键点可以适用于各种场景,提高后续重建的效果,关键点可以是视频帧中对象所包括的各个部位的关键点。具体地,第一关键点可以包括待处理视频帧中第一对象所包括的各个身体部位的关键点,第二关键点可以包括前一视频帧中第二对象所包括的各个身体部位的关键点。
例如,第一对象与第二对象是同一个人,待处理视频帧中显示第一对象的人脸和手,则第一关键点可以为人脸的关键点和手的关键点。同理,前一视频帧中显示第二对象的人脸,则第二关键点可以为人脸的关键点。
通过提取上述关键点,该方法与相关技术提供的基于facial landmark的视频压缩算法相比,使得提取到的关键点可以适用于各种场景,进而使得视频压缩方法的可拓展性更强,提高后续重建的效果。
本申请实施例中提供了多种对待处理视频帧和前一视频帧分别进行关键点提取得到各自对应的关键点的方式。在一种可能的实现方式中,对待处理视频帧进行关键点提取,得到待处理视频帧中第一关键点的第一位置信息,以及对前一视频帧进行关键点提取,得到前一视频帧中第二关键点的第二位置信息的方式可以是识别待处理视频帧中第一对象所包括的身体部位,以及识别前一视频帧中所述第二对象包括的身体部位,进而根据身体部位与关键点的映射关系,确定第一对象所包括身体部位对应的关键点,并确定第一对象所包括身体部位对应的关键点在待处理视频帧中的第一位置信息。以及根据身体部位与关键点的映射关系,确定第二对象所包括身体部位对应的关键点,并确定第二对象所包括身体部位对应的关键点在前一视频帧中的第二位置信息。其中,身体部位与关键点的映射关系可以是预先确定的,每个身体部位的关键点例如可以是根据身体部位的结构预先定义好的一套固定点,本申请实施例对每个身体部位的关键点的定义方式不作限定。
在另一种可能的实现方式中,对待处理视频帧进行关键点提取,得到待处理视频帧中第一关键点的第一位置信息,以及对前一视频帧进行关键点提取,得到前一视频帧中第二关键点的第二位置信息的方式可以是通过视频发送端上的关键点检测模型对待处理视频帧进行关键点提取,得到第一位置信息,以及通过关键点检测模型对前一视频帧进行关键点提取,得到第二位置信息。其中,关键点检测模型是根据训练样本训练得到的,该训练样本包括多个样本图像,每个样本图像中的样本对象包括身体部位,多个样本图像中的样本对象包括的身体部位覆盖各种身体部位。也就是说,为了可以在不同的场景下提取到较为准确的关键点,可以通过自适应学习的方式对关键点检测模型进行训练,从而学习不同场景下的视频帧提取到的关键点有哪些,在训练的过程中,随着不断的迭代,关键点检测模型会逐步获得预测不同场景下视频帧关键点的能力。
通过自适应学习训练得到关键点检测模型,使得关键点检测模型具备的自适应能力,使得本申请实施例提供的方法可以适用于各种场景,包括人脸以外的其他场景,从而提高关键点提取能力。
需要说明的是,本申请实施例对关键点检测模型的网络结构不作限定,关键点检测模型例如可以是关键点检测网络,也可以是关键点检测器(Keypoint Detector),本申请实施例主要以关键点检测模型是关键点检测器为例进行介绍。关键点检测器例如可以是深度残差神经网络(Deep residual network,ResNet),具体可以是ResNet18。关键点检测器的使用原理是:以图像I为输入,使用ResNet18提取图像特征后,用单个全连接层回归图像I的K×N个关键点的位置信息,其中K可以表示关键点的组数,N可以表示每组关键点的数目,K和N可以是预先设置的,例如可以设置为K=10,N=5。基于上述原理,在本申请实施例中可以将待处理视频帧xt或前一视频帧xt-1作为图像I,将xt和xt-1分别输入关键点检测器,预测各自对应的关键点,分别是第一关键点和第二关键点,第一关键点可以表示为第二关键点可以表示为参见图3所示。通常情况下,预测到的关键点可以通过位置信息表示,例如针对第一关键点的第一位置信息,针对第二关键点的第二位置信息。
S203、视频发送端根据所述第一位置信息和所述第二位置信息进行运动估计,得到所述待处理视频帧相对于所述前一视频帧的运动信息。
在得到第一位置信息和第二位置信息后,便可以利用第一位置信息和第二位置信息表示运动信息,以便用于待处理视频帧的编码和解码。然而,为了缓解复杂画面运动的情况下造成的重建画面失真现象,视频发送端可以根据第一位置信息和第二位置信息进行运动估计,得到待处理视频帧相对于前一视频帧的运动信息,以便先初步预测出根据第一位置信息和第二位置信息进行视频压缩后恢复出的初始视频帧,进而确定初始视频帧的失真情况,从而在视频压缩过程中缓解可能出现的失真现象。可以理解的是,待处理视频帧相对于前一视频帧来说,画面可能会发生相对运动,故此处得到的运动信息也即相对运动信息。
一种可能的实现方式中,根据第一位置信息和第二位置信息进行运动估计,得到待处理视频帧相对于前一视频帧的运动信息的方式可以是视频发送端根据第一位置信息和第二位置信息进行薄板样条变换(Thin Plate Spline transformation,TPS transformation),得到薄板样条变换矩阵。然后根据薄板样条变换矩阵对前一视频帧进行变换得到变换图像,再根据变换图像通过运动网络输出贡献图。其中,贡献图用于表示薄板样条变换矩阵对前一视频帧上的每个像素的运动的贡献,这样,便可以根据贡献图和薄板样条变换矩阵计算运动信息。
其中,得到的待处理图像的第一关键点和前一视频帧的第二关键点可以分成K组,每组关键点可以包括一个第一关键点和一个第二关键点。针对K组关键点 在进行薄板样条变换时,可以对每组进行薄板样条变换,从而得到K个薄板样条变换矩阵。薄板样条变换矩阵可以用Tk表示,且表示每个薄板样条变换矩阵的大小为H×W,H为待处理图像的高度,W为待处理图像的宽度。
在计算得到贡献图时,根据前述得到的薄板样条变换矩阵对前一视频帧进行变换得到变换图像时,得到的变换图像的尺寸可以为(K+1,3,H,W)。此时根据变换图像可以得到K+1个贡献图,贡献图可以用Mk表示,表示每个贡献图的大小为H×W。
在根据贡献图和薄板样条变换矩阵计算运动信息时，可以将贡献图作为权重，将同位置的不同薄板样条变换矩阵线性加权从而得到运动信息，此时运动信息可以是光流场，根据贡献图和薄板样条变换矩阵计算运动信息的计算公式可以如下所示：
T(x,y)=Σ_k M_k(x,y)·T_k(x,y)
其中，T(x,y)表示运动信息（例如光流场），Mk(x,y)表示第k个贡献图，Tk(x,y)表示第k个薄板样条变换矩阵，K表示关键点的组数（即薄板样条变换的组数），(x,y)表示每个像素的坐标。
需要说明的是,上述过程可以通过视频发送端上的运动网络(Motion Network)实现,运动网络可以用于预测运动信息(参见图3所示),以便用于后续的图像修复,本申请实施例对运动网络的网络结构不做限定。
需要说明的是,在一些情况下,待处理视频帧除了第一对象之外,可能还包括背景区域,在这种情况下,第一对象作为前景,第一对象在一定程度上可能会遮挡住背景区域,为了避免将图像修复的关注点过于分散在背景区域,从而影响更为重要的第一对象(前景)的重建,在根据变换图像通过运动网络输出贡献图的同时,还可以根据变换图像,通过运动网络输出掩码信息,掩码信息用于指示图像修复的注意力应该更加集中在前景(即第一对象),从而减少背景区域对前景图像修复的影响,以便提高图像修复的效果。
在一些情况下,为了避免采集视频的摄像机运动导致预测关键点出现在背景区域,进而导致运动估计出现偏差的问题,本申请实施例还可以额外预测背景的仿射变换矩阵来建模背景运动。具体地,可以将待处理视频帧和前一视频帧进行拼接,并将拼接后得到的第二拼接结果输入至背景运动预测网络得到仿射变换矩阵,仿射变换矩阵用于表示待处理视频帧相对于前一视频帧的背景运动。
需要说明的是,上述仿射变换矩阵的预测可以是通过视频发送端上的背景运动预测网络(BG Motion Predictor)实现的,将待处理视频帧xt-1和前一视频帧xt在通道方向拼接并将第二拼接结果输入至背景运动预测网络,通过背景运动预测网络提取第二拼接结果的图像特征,然后用单个全连接层回归二维的仿射变换矩阵(参见图3所示)。本申请实施例对背景运动预测网络的网络结构不做限定,例如可以是另一个ResNet18,仿射变换矩阵可以表示为Abg表示仿射变换矩阵的大小为2×3。
在这种情况下,根据薄板样条变换矩阵对前一视频帧进行变换得到变换图像的方式可以是利用薄板样条变换矩阵和仿射变换矩阵对前一视频帧进行变换得到变换图像。在使用仿射变换矩阵时,由于既需要利用仿射变换矩阵,有需要利用薄板样条变换矩阵对前一视频帧进行变换,因此,为了方便仿射变换矩阵和薄板样条变换矩阵之间的运算,对于仿射变换矩阵可以使用如下公式将其转换为与薄板样条变换矩阵Tk等大的二维向量:
其中,p=(x,y)T(x∈{0,1,..H-1},y∈{0,1,..W-1}),表示每个像素的坐标,H表示前一视频帧的高度,W表示前一视频帧的宽度。
使用K个薄板样条变换矩阵和仿射变换矩阵分别对前一视频帧xt-1进行变换,其中表示前一视频帧的尺寸是3×H×W,从而得到尺寸为(K+1,3,H,W)的变换图像。
通过计算仿射变换矩阵,使用仿射变换矩阵得到变换图像,从而在确定变换图像时考虑到可能出现的背景运动,提高后续运动估计的准确性。
S204、视频发送端根据所述运动信息和所述前一视频帧进行图像修复,得到初始视频帧。
视频发送端可以根据运动信息和前一视频帧进行图像修复,得到初始视频帧,以便确定利用运动信息直接重建待处理视频帧时可能出现的失真情况,从而缓解这种失真。其中,图像修复的方式例如可以是图像扭曲(例如warp)处理。
运动信息可以体现待处理视频帧与前一视频帧的区别,故可以按照运动信息在前一视频帧的基础上进行变换得到初始视频帧。
S204可以通过视频发送端上的图像修复网络(Inpainting Network)实现,即可以将运动信息和前一视频帧输入至图像修复网络,从而输出初始视频帧。本申请实施例对图像修复网络的网络结构不做限定,例如图像修复网络可以采用编码器-解码器结构。图像修复网络以前一帧视频帧xt-1和运动信息(例如光流场T(x,y))为输入,输出经过变换后的初始视频帧(参见图3所示),初始视频帧可以用表示。当图像修复网络采用编码器-解码器的网络结构时,可以通过图像修复网络依次对xt-1进行4次下采样和4次上采样。上采样阶段,利用光流场T对各级特征图进行变换,最终输出初始视频帧
当待处理图像中存在前景遮挡背景区域的情况,前述运动网络可能会输出掩码信息,在这种情况下,根据运动信息和前一视频帧进行图像修复,得到初始视频帧的方式可以是根据运动信息、掩码信息和前一视频帧进行图像修复,得到初始视频帧。掩码信息用于指示图像修复的注意力应该更加集中在前景(即第一对象),即在掩码信息的指示下忽略背景区域,进而按照运动信息在前一视频帧的基础上进行变换得到初始视频帧,从而减少背景区域对前景图像修复的影响,以便提高图像修复的效果。
可以理解的是,在根据运动信息、掩码信息和前一视频帧进行图像修复,得到初始视频帧时,也可以采用上述图像修复网络实现,其实现过程与上述图3介绍的类似,此处不再赘述。
S205、视频发送端根据所述待处理视频帧和所述初始视频帧确定隐特征,所述隐特征表征所述初始视频帧相对于所述待处理视频帧的修复偏差。
在得到初始图像视频帧后,为了避免在待处理视频帧中包括多个对象运动、出现前一视频帧未出现的对象等画面复杂的情况下,导致重建画面失真,本申请实施例在视频压缩时,视频发送端还可以进一步根据待处理视频帧和初始视频帧确定隐特征。其中,隐特征 可以是图像修复后的初始视频帧相对于待处理视频帧修复不清楚的部分的特征,隐特征用于表征初始视频帧相对于待处理视频帧的修复偏差。
可以理解的是,在确定隐特征时,可以将初始视频帧与待处理视频帧进行比对,得到图像修复出的初始视频帧相对于待处理视频帧的修复偏差,即隐特征。
隐特征的确定可以通过基于条件(context)的视频帧提炼(refine)模块实现,即可以将待处理视频帧和初始视频帧输入至视频帧提炼模块,从而输出隐特征。S205实现的核心可以在于使用前述图像修复得到的初始视频帧作为条件,辅助下一阶段的视频压缩。具体地,视频帧提炼模块可以包括特征提取器(Feature Extractor)和条件编码器(Context Encoder),视频发送端可以通过视频帧提炼模块中的特征提取器对初始视频帧进行特征提取,得到初始视频帧的特征向量,并将初始视频帧的特征向量作为视频帧压缩条件,然后利用该视频帧压缩条件辅助对待处理视频帧进行编码,具体可以将待处理视频帧的像素矩阵和视频压缩条件进行拼接,并将拼接后得到的第一拼接结果输入至条件编码器得到隐特征,该过程可以参见图3所示。其中,隐特征可以用yt表示。
在上述方式中,初始视频帧的特征向量可以体现初始视频帧的特征,待处理视频帧的像素矩阵可以体现待处理视频帧的特征,进而基于像素矩阵和视频压缩条件可以得到较为准确的隐特征。
其中,特征提取器可以用fex表示,特征提取器对初始视频帧的特征提取得到视频帧压缩条件可以通过以下公式表示:
其中,表示视频压缩条件,表示初始视频帧,fex()表示特征提取器。
条件编码器可以通过fenc表示,条件编码器对待处理视频帧和视频压缩条件进行拼接得到的第一拼接结果进行编码得到隐特征可以通过以下公式表示:
其中,yt表示隐特征,fenc表示条件编码器,xt表示待处理视频帧,具体可以指其对应的像素矩阵。
可以理解的是,本申请实施例对特征提取器和条件编码器的网络结构不做限定。在一种可能的实现方式中,特征提取器可以包括1个卷积(Convolutional)层、2个残差模块、1个卷积层,其中卷积层的大小可以是3×3,此时卷积层可以表示为conv3×3。特征提取器以初始视频帧为输入,依次经过1个conv3×3、2个残差模块、1个conv3×3,输出通道数为64的特征向量,即视频帧压缩条件,视频帧压缩条件可以用表表示,表示其大小为64×H×W,H表示初始视频帧的高度,W表示初始视频帧的宽度。相较常用的残差补偿条,视频帧压缩条件从特征域辅助视频帧编解码,能够提供更灵活丰富的辅助信息。
条件编码器由3个卷积层和归一化层模块叠加组成,其中,归一化模块可以是各种类型的归一化模块,由于广义归一化(Generalized Normalization,GDN)更加适用于图像重建,故此处使用的归一化层可以是GDN。
在另一种可能的实现方式中,在确定隐特征时,还可以先对待处理视频帧进行特征提取得到待处理视频帧的特征向量,进而将待处理视频帧的特征向量与视频压缩条件进行拼接,并将拼接后得到的第一拼接结果输入至条件编码器得到隐特征。本申请实施例对确定隐特征的具体实现方式不做限定,任何能实现类似作用的方式均可作为确定隐特征的实现方式。
S206、视频发送端根据所述第一位置信息、所述第二位置信息和所述隐特征进行视频压缩得到视频压缩文件。
通过S201-S205可以得到第一位置信息、第二位置信息和隐特征,除此之外虽然也得到了运动信息,但是本申请实施例为了相对于相关技术能够减少字节流的消耗,采用基于关键点得到的第一位置信息和第二位置信息代替稠密特征向量的运动信息,从而根据第一位置信息、第二位置信息和隐特征进行视频压缩得到视频压缩文件。在视频通信中,视频发送端还可以将视频压缩文件发送至视频接收端。
需要说明的是,当通过上述方法得到仿射变换矩阵时,为了可以利用仿射变换矩阵在视频接收端进行解码,还可以将仿射变换矩阵写入视频压缩文件,即根据第一位置信息、第二位置信息和隐特征进行视频压缩得到视频压缩文件的方式可以是将第一位置信息、第二位置信息、隐特征和仿射变换矩阵写入视频压缩文件。这样,视频接收端在接收到视频压缩文件,根据视频压缩文件解码重建待处理视频帧时,可以通过仿射变换矩阵提高运动估计的准确性,进而提高重建效果。
在一些情况下,隐特征中包括能够体现修复偏差的信息,这些信息可以是数字,隐特征中有些数字出现的概率明显更高,为了减少这种数字造成的隐特征中的信息冗余,在一种可能的实现方式中,视频发送端可以对隐特征进行概率建模,得到分布参数。分布参数用于表示隐特征中不同信息的分布情况,进而利用该分布参数辅助隐特征进行算数编码得到编码后的隐特征。在这种情况下,视频压缩文件中包括的隐特征为编码后的隐特征,即根据第一位置信息、第二位置信息和隐特征进行视频压缩得到视频压缩文件的方式可以是将第一位置信息、第二位置信息、编码后的隐特征和分布参数写入视频压缩文件。其中,隐特征的表现形式可以是特征图,假设隐特征yt服从拉普拉斯分布,则分布参数可以是μt和σt
通过概率建模方式可以得到分布参数,进而体现隐特征中不同信息的分布情况,进而体现不同信息在隐特征中出现的概率,以便按照分布参数对隐特征进行编码,从而针对概率更大的信息采用更少的位数对其进行编码,从而更进一步减小隐特征中的冗余信息。
需要说明的是,在本申请实施例中,可以通过熵模型(Entropy Model)实现上述过程,即视频帧提炼模块中还可以包括熵模型,利用熵模型对隐特征进行概率建模得到分布参数,参见图3所示。此时,隐特征中某些信息出现的概率越大,熵模型输出的熵越小。
在一种可能的实现方式中,为了提高概率建模的准确性,可以采用融合层次信息、空间信息和时序信息的先验预测结构预测得到更加准确的分布参数。在这种情况下,对隐特征进行概率建模,得到隐特征中不同信息的分布参数的方式可以是对隐特征进行层次先验学习得到第一先验信息(即层次信息),对隐特征进行空间先验学习得到第二先验信息(即 空间信息),对隐特征进行时序先验学习得到第三先验信息(即时序信息),进而将第一先验信息、第二先验信息和第三先验信息进行融合得到分布参数。其中,第一先验信息可以通过超先验模型进行层次先验学习得到(此过程即层次先验分支),第二先验信息可以通过自回归网络进行空间先验学习得到(此过程即空间先验分支),第三先验信息可以通过时序先验编码器(Temporal Prior Encoder)进行时序先验学习得到第三先验信息(此过程即时序先验分支),在这种情况下,当使用熵模型进行概率建模(即将熵模型作为先验预测结构)时,熵模型可以包括超先验模型、自回归网络和时序先验编码器。
其中,超先验模型可以包括超先验编码器(Hyper Prior Encoder,HPE)和超先验解码器(Hyper Prior Decoder,HPD),本申请实施例对超先验编码器和超先验解码器的网络结构不做限定,例如超先验编码器可以由三层卷积层构成,超先验解码器可以由三层反卷积层构成。本申请实施例对时序先验编码器的网络结构不做限定,例如时序先验编码器可以由3个多层反卷积层、反归一化层例如图像广义归一化(Image Generalized Normalization,IGDN)和1个卷积层(例如conv3×3)构成。
基于以上先验预测结构,概率建模的具体过程可以参见图4所示,图4以熵模型为例,示出了一种熵模型进行概率建模的示例图。层次先验分支以yt为输入,经过由三层卷积层构成的超先验编码器后量化(Quantize),再经由三层反卷积层够成的超先验解码器输出尺寸为的层次先验特征图(即第一先验信息)。其中,量化可以用Q表示。空间先验分支对输入yt进行量化后,将其经过自回归网络得到尺寸为的空间先验特征图(即第二先验信息)。时序先验分支以视频压缩条件为输入,经过3个多层反卷积层、反归一化层和1个conv3×3构成的时序先验编码器得到尺寸为的时序先验特征图(即第三先验信息)。将第一先验信息、第二先验信息和第三先验信息在通道维度拼接后输入堆叠的三层卷积,得到为yt预测概率模型的μt和σt。概率模型用于指导对量化yt(量化yt可以表示为参见图3所示)的算术编码(Arithmetic Encoding,AE)和算术解码(Arithmetic Decoding,AD),从而减少视频压缩文件的字节流消耗。需要说明的是,因为利用了初始视频帧的特征向量,即视频压缩条件可以使得概率模型的估计更准确。
采用融合层次信息、空间信息和时序信息的先验预测结构,可以更准确地估计隐特征的分布参数,减少压缩隐特征消耗的字节流,进而减少视频帧压缩所需字节流。
在一种可能的实现方式中,超先验模型进行层次先验学习的过程中,还可以对量化后的结果进行算数编码和算数解码,并将算数解码输出的结果输入至超先验解码器。
在得到分布参数之后,可以使用分布参数辅助进行视频压缩,以及后续的视频解码。在一种可能的实现方式中,为了能够利用分布参数辅助进行视频压缩,还需要对分布参数进行进一步处理得到累计概率密度,从而采用累计概率密度进行视频压缩或视频解码。视频压缩也可以称为视频编码,可以通过算术编码器实现,算术编码器有很多实现版本,本申请实施例采用的是开源的版本。
以分布参数是μt和σt为例,利用分布参数计算累计概率密度的公式如下所示:
其中,cdf表示累计概率密度,G()表示概率建模,yt表示隐特征。
由上述技术方案可以看出,在需要对待处理视频帧进行视频压缩时,可以获取待处理视频帧和待处理视频帧的前一视频帧,前一视频帧是视频帧序列中与待处理视频帧相邻、且位于待处理视频帧之前的视频帧。接着,对待处理视频帧和前一视频帧分别进行关键点提取,得到待处理视频帧的关键点的第一位置信息和前一视频帧的关键点的第二位置信息,以便根据第一位置信息和第二位置信息进行运动估计,得到待处理视频帧相对于前一视频帧的运动信息。根据运动信息和前一视频帧进行图像修复,得到初始视频帧。为了避免在待处理视频帧中包括多个对象运动、出现前一视频帧未出现的对象等画面复杂的情况下,导致重建画面失真,本申请在视频压缩时,还可以进一步根据待处理视频帧和初始视频帧确定隐特征,通过隐特征表征初始视频帧相对于待处理视频帧的修复偏差,从而根据第一位置信息、第二位置信息和隐特征进行视频压缩得到视频压缩文件。这样,视频接收端在获取到视频压缩文件后,便可以根据第一位置信息和第二位置信息计算得到运动信息,并基于运动信息和前一视频帧进行图像修复得到初始视频帧,由于视频压缩文件中还包括隐特征,隐特征表征初始视频帧相对于待处理视频帧的修复偏差,故可以进一步利用隐特征对初始视频帧进行二次修复,从而缓解复杂画面运动造成的视频帧失真现象,提升算法鲁棒性。另外,视频压缩文件中包括的是第一位置信息和第二位置信息,并非表示运动信息的稠密特征向量,从而在实现视频压缩的情况下,极大减小运动信息消耗的字节流,减小视频压缩文件传输带宽。
相较相关技术提供的深度视频编码(Deep Video Compression)等基于残差的方法,基于条件的方法从特征空间弥补视频帧,能实现更好的视频压缩。
前述实施例介绍了视频压缩方法,在视频发送端基于上述方法将待处理视频帧进行视频压缩得到视频压缩文件,并将视频压缩文件发送至视频接收端后,视频接收端可以根据视频压缩文件进行视频解码,从而重建待处理视频帧。接下来将对视频解码方法进行详细介绍,参见图5所示,所述方法包括:
S501、视频接收端获取视频压缩文件。
视频接收端获取视频压缩文件的方式可以是视频接收端接收视频发送端发送的视频压缩文件,其中,视频压缩文件包括的内容即为视频发送端加入到视频压缩文件中的内容。通常情况下,视频压缩文件中至少包括待处理视频帧的第一关键点的第一位置信息、前一视频帧的第二关键点的第二位置信息和隐特征。前一视频帧是视频帧序列中与待处理视频帧相邻、且位于待处理视频帧之前的视频帧。
S502、视频接收端根据所述第一位置信息和所述第二位置信息进行运动估计,得到所述待处理视频帧相对于所述前一视频帧的运动信息。
视频接收端可以根据接收到的第一位置信息和第二位置信息进行运动估计,从而得到待处理视频帧相对于前一视频帧的运动信息,以便根据运动信息进行图像修复。
在一种可能的实现方式中,视频接收端根据第一位置信息和第二位置信息进行运动估计,得到待处理视频帧相对于前一视频帧的运动信息的方式可以是根据第一位置信息和第二位置信息进行薄板样条变换,得到薄板样条变换矩阵。然后根据薄板样条变换矩阵对前 一视频帧进行变换得到变换图像,在根据变换图像,通过运动网络输出贡献图。其中,贡献图用于表示薄板样条变换矩阵对前一视频帧上的每个像素的运动的贡献,这样便可以根据贡献图和薄板样条变换矩阵计算运动信息。
在一种可能的实现方式中,视频压缩文件中还包括仿射变换矩阵,根据薄板样条变换矩阵对前一视频帧进行变换得到变换图像的方式可以是利用薄板样条变换矩阵和仿射变换矩阵对前一视频帧进行变换得到变换图像。
通过使用仿射变换矩阵得到变换图像,从而在确定变换图像时考虑到可能出现的背景运动,提高后续运动估计的准确性。
在一种可能的实现方式中,根据变换图像,通过运动网络输出贡献图的方式可以是根据变换图像,通过运动网络输出贡献图和掩码信息。此时根据运动信息和前一视频帧进行图像修复,得到初始视频帧的方式可以是根据运动信息、掩码信息和前一视频帧进行图像修复,得到初始视频帧。
掩码信息用于指示图像修复的注意力应该更加集中在前景(即第一对象),从而减少背景区域对前景图像修复的影响,以便提高图像修复的效果。
需要说明的是,上述运动信息的计算可以通过运动网络实现,运动网络计算运动信息的具体实现方式可以参见图2对应的实施例所述,此处不再详细介绍。
S503、视频接收端根据所述运动信息和所述前一视频帧进行图像修复,得到初始视频帧。
运动信息体现了待处理视频帧相对于前一视频帧的差异,进而视频接收端可以根据运动信息和前一视频帧进行图像修复得到初始视频帧。
S504、视频接收端利用所述隐特征对所述初始视频帧进行二次修复,得到最终视频帧。
在得到初始图像视频帧后,为了避免在待处理视频帧中包括多个对象运动、出现前一视频帧未出现的对象等画面复杂的情况下,导致重建画面失真,在本申请实施例中,视频接收端还可以利用视频压缩文件中包括的隐特征对初始视频帧进行二次修复,得到质量更高的最终视频帧(参见图3所示)。具体地,视频接收端可以对初始视频帧进行特征提取,得到初始视频帧的特征向量,进而以初始视频帧的特征向量作为视频帧压缩条件结合隐特征进行二次修复,得到最终视频帧。
需要说明的是,该步骤中的特征提取可以通过特征提取器实现,最终的二次修复可以通过条件解码器(Context Decoder)实现,在一种可能的实现方式中,还可以先对隐特征进行量化,进而根据量化后的隐特征进行二次修复,得到最终视频帧。重建得到最终视频帧的公式如下所示:

其中,表示最终视频帧,fdec()表示条件解码器,round()表示量化,表示隐特征,fenc()表示条件编码器,xt表示待处理视频帧(具体可以指其对应的像素矩阵),表 示视频帧压缩条件,表示初始视频帧,fex()表示特征提取器。的处理过程可以在视频发送端实现,视频接收端可以直接使用得到的隐特征。
需要说明的是,上述条件编码器、特征提取器各自功能的实现方式可以参见图2对应的实施例所示,此处不再赘述。
本申请实施例使用的条件解码器可以由3个多层反卷积层、反归一化层IGDN和1个conv3×3、2个残差模块、1个conv3×3叠加构成。量化后的隐特征输入到条件解码器,在通过条件解码器的3个多层反卷积层和反归一化层IGDN后得到重建图像特征,重建图像特征与视频压缩条件在通道方向拼接后输入后续的卷积层、残差模块,得到最终视频帧
在一种可能的实现方式中,视频压缩文件中还可以包括分布参数,此时视频压缩文件中包括的隐特征可以是基于分布参数进行算数编码后得到的隐特征,在这种情况下,利用隐特征对初始视频帧进行二次修复,得到最终视频帧之前,可以先利用分布参数辅助编码后的隐特征进行算数解码得到隐特征,从而利用算数解码后的隐特征进行二次修复。
图2和图5对应的实施例介绍了视频压缩和视频解码的全部过程,上述全部过程可以称为基于关键点的AI视频压缩技术,可以通过视频压缩模型实现。在视频通信场景中,可以将上述方法可以集成在视频通信软件中,从作为视频通信工具,从而保证用户的视频通信体验。
本申请实施例提供的基于视频压缩模型主要分为基于关键点的运动估计模块和基于条件的视频帧提炼模块,其中,运动估计模块由关键点检测器、背景运动预测网络、运动网络、图像修复网络等4个子模块构成,基于条件的视频帧提炼模块主要包含条件编码器和条件解码器两部分。通过该视频压缩模型进行AI视频压缩的具体过程可以参见图6所示,主要分为两个阶段,第一阶段是通过待处理视频帧和前一视频帧预测运动信息,并以此得到初始视频帧。第二阶段是将初始视频帧作为先验条件对待处理视频帧进行编、解码,得到最终视频帧。
针对待处理视频帧,视频发送端将运动估计模块中的关键点检测器输出的第一位置信息、第二位置信息,以及运动估计模块中的背景运动预测网络输出的仿射变换矩阵,以及基于条件的视频帧提炼模块输出的隐特征及分布参数写入视频压缩文件,并将视频压缩文件存储,传输到视频接收端。视频接收端基于第一位置信息、第二位置信息和仿射变换矩阵,通过运动网络确定运动信息,进而根据运动信息,通过图像修复网络进行图像修复得到初始视频帧。接着,利用隐特征对初始视频帧进行二次修复,得到最终视频帧。
通过上述方法传输的视频压缩文件大小相对待处理视频帧显著降低,可以减小文件传输带宽。
需要说明的是,本申请实施例使用的视频压缩模型可以是预先训练得到的,训练数据可以来自于VoxCeleb(是一种数据集),共采用了145569个256*256的视频作为训练集,4911个作为测试集。训练数据中每个视频帧数范围在64到1024之间。在训练过程中,可以设置视频压缩模型共训练2e6步,优化器默认采用Adam,初始化学习率为1e-4。训练 1.8e6步后,学习率下降至1e-5。训练过程中可以采用损失函数进行模型参数优化,损失函数的公式可以如下所示:
Loss=R+λ1(D1+D2)
其中,R是视频压缩文件的码率,D1表示初始视频帧的恢复质量,D2表示最终视频帧的恢复质量,使用感知损失形式进行计算,λ表示调节因子,默认λ=0.0001。
通过分析可以知本申请实施例提供的视频压缩方法的性能优于相关技术提供的方法。本申请实施例从不同方面对比了基于facial landmark的视频压缩模型(标记为方案一)、仅使用关键点的视频压缩模型(标记为方案二)、本申请实施例提供的视频压缩模型(标记为方案三)的性能表现。
从图像主观质量方面进行比对,针对方案二,对比了关键点数目分别为15、25、50、75的模型性能,方案是三使用的关键点数目为50。对不同模型的主观质量进行评估,使用常用的学习感知图像块相似度(Learned Perceptual Image Patch Similarity,LPIPS)、Frechet Inception距离得分(Frechet Inception Distance score,FID)指标,其中LPIPS、FID均越小越好。参见图7所示,图7中纵坐标即图像主观质量,横坐标表示关键点数目(Num_keypoints),以及表示码率(也可以称为每像素位数(bit per pixel,bpp))。
从图7中可以看到,随着关键点数目的增加,图像主观质量整体呈上升趋势。对比三个方案,当使用方案二更少的关键点数目时,即码流消耗更小的情况下,方案二仍然能取得更好的主观质量结果,证明了模型的有效性。从第2列看,在一定的bpp消耗下,方案三能进一步提升方案二的图像主观质量。
从复杂场景重建性能进行比对,在视频帧出现运动物体等画面复杂的情况下,比对方案一、方案二和方案三的性能。参见图8所示,当画面出现一个运动的手时,通过方案一恢复的图像中手非常模糊,即方案一基本失效,方案二存在不同程度地失真,方案三可以更好地重建图像。因此,在后续使用中可以对简单视频帧使用方案二进行视频压缩以节省码流消耗,在复杂场景下使用方案三进行视频压缩保证画面质量。
从非人脸场景零次学习(zero-shot)测试进行比对,例如可以在非人脸场景的域外数据集直接测试了上述方案一和方案三的性能,即为zero-shot测试。参见图9所示,从图9中第二行可以看出,当手运动时,方案一无法恢复手使其与待处理视频帧本身相同,从第一行可以看出,即使可以恢复出手,但是方案一恢复的效果较为模糊,可见方案一在非人脸场景基本无法使用,而方案三仍然适用。
需要说明的是,本申请在上述各方面提供的实现方式的基础上,还可以进行进一步组合以提供更多实现方式。
基于图2对应实施例提供的视频压缩方法,本申请实施例还提供一种视频压缩装置1000。参见图10,所述视频压缩装置1000包括获取单元1001、提取单元1002、确定单元1003、修复单元1004和压缩单元1005:
所述获取单元1001,用于获取待处理视频帧和所述待处理视频帧的前一视频帧,所述前一视频帧是视频帧序列中与所述待处理视频帧相邻、且位于所述待处理视频帧之前的视频帧;
所述提取单元1002,用于对所述待处理视频帧进行关键点提取,得到所述待处理视频帧中第一关键点的第一位置信息,以及对所述前一视频帧进行关键点提取,得到所述前一视频帧中第二关键点的第二位置信息;
所述确定单元1003,用于根据所述第一位置信息和所述第二位置信息进行运动估计,得到所述待处理视频帧相对于所述前一视频帧的运动信息;
所述修复单元1004,用于根据所述运动信息和所述前一视频帧进行图像修复,得到初始视频帧;
所述确定单元1003,还用于根据所述待处理视频帧和所述初始视频帧确定隐特征,所述隐特征表征所述初始视频帧相对于所述待处理视频帧的修复偏差;
所述压缩单元1005,用于根据所述第一位置信息、所述第二位置信息和所述隐特征进行视频压缩得到视频压缩文件。
在一种可能的实现方式中,所述确定单元1003,具体用于:
通过特征提取器对所述初始视频帧进行特征提取,得到所述初始视频帧的特征向量,并将所述初始视频帧的特征向量作为视频帧压缩条件;
将所述待处理视频帧的像素矩阵和所述视频压缩条件进行拼接,并将拼接后得到的第一拼接结果输入至条件编码器得到所述隐特征。
在一种可能的实现方式中,所述装置还包括建模单元和编码单元:
所述建模单元,用于对所述隐特征进行概率建模,得到分布参数,所述分布参数用于表示所述隐特征中不同信息的分布情况;
所述编码单元,用于利用所述分布参数辅助所述隐特征进行算数编码得到编码后的隐特征;
所述压缩单元1005,具体用于:
将所述第一位置信息、所述第二位置信息、所述编码后的隐特征和所述分布参数写入所述视频压缩文件。
在一种可能的实现方式中,所述建模单元,具体用于:
对所述隐特征进行层次先验学习得到第一先验信息;
对所述隐特征进行空间先验学习得到第二先验信息;
对所述隐特征进行时序先验学习得到第三先验信息;
将所述第一先验信息、所述第二先验信息和所述第三先验信息进行融合得到所述分布参数。
在一种可能的实现方式中,所述确定单元1003,具体用于:
根据所述第一位置信息和所述第二位置信息进行薄板样条变换,得到薄板样条变换矩阵;
根据所述薄板样条变换矩阵对所述前一视频帧进行变换得到变换图像;
根据所述变换图像,通过运动网络输出贡献图,所述贡献图用于表示所述薄板样条变换矩阵对所述前一视频帧上的每个像素的运动的贡献;
根据所述贡献图和所述薄板样条变换矩阵计算所述运动信息。
在一种可能的实现方式中,所述确定单元1003还用于:
将所述待处理视频帧和所述前一视频帧进行拼接,并将拼接后得到的第二拼接结果输入至背景运动预测网络得到仿射变换矩阵,所述仿射变换矩阵用于表示所述待处理视频帧相对于所述前一视频帧的背景运动;
所述确定单元1003,具体用于:
利用所述薄板样条变换矩阵和所述仿射变换矩阵对所述前一视频帧进行变换得到所述变换图像;
所述压缩单元1005,具体用于:
将所述第一位置信息、所述第二位置信息、所述隐特征和所述仿射变换矩阵写入所述视频压缩文件。
在一种可能的实现方式中,所述确定单元1003,具体用于:
根据所述变换图像,通过所述运动网络输出所述贡献图和掩码信息;
所述修复单元1004,具体用于:
根据所述运动信息、所述掩码信息和所述前一视频帧进行图像修复,得到所述初始视频帧。
在一种可能的实现方式中,所述第一关键点包括所述待处理视频帧中第一对象所包括的各个身体部位的关键点,所述第二关键点包括所述前一视频帧中第二对象所包括的各个身体部位的关键点。
在一种可能的实现方式中,所述提取单元1002,具体用于:
识别所述待处理视频帧中所述第一对象所包括的身体部位,以及识别所述前一视频帧中所述第二对象所述包括的身体部位;
根据身体部位与关键点的映射关系,确定所述第一对象所包括身体部位对应的关键点,并确定所述第一对象所包括身体部位对应的关键点在所述待处理视频帧中的第一位置信息,以及根据身体部位与关键点的映射关系,确定所述第二对象所包括身体部位对应的关键点,并确定所述第二对象所包括身体部位对应的关键点在所述前一视频帧中的第二位置信息。
在一种可能的实现方式中,所述提取单元1002,具体用于:
通过关键点检测模型对所述待处理视频帧进行关键点提取,得到所述第一位置信息,以及通过所述关键点检测模型对所述前一视频帧进行关键点提取,得到所述第二位置信息;所述关键点检测模型是根据训练样本训练得到的,所述训练样本包括多个样本图像,每个所述样本图像中的样本对象包括身体部位,多个所述样本图像中的样本对象包括的身体部位覆盖各种身体部位。
由上述技术方案可以看出,在需要对待处理视频帧进行视频压缩时,可以获取待处理视频帧和待处理视频帧的前一视频帧,前一视频帧是视频帧序列中与待处理视频帧相邻、且位于待处理视频帧之前的视频帧。接着,对待处理视频帧和前一视频帧分别进行关键点提取,得到待处理视频帧的关键点的第一位置信息和前一视频帧的关键点的第二位置信息,以便根据第一位置信息和第二位置信息进行运动估计,得到待处理视频帧相对于前一视频帧的运动信息。根据运动信息和前一视频帧进行图像修复,得到初始视频帧。为了避免在 待处理视频帧中包括多个对象运动、出现前一视频帧未出现的对象等画面复杂的情况下,导致重建画面失真,本申请在视频压缩时,还可以进一步根据待处理视频帧和初始视频帧确定隐特征,通过隐特征表征初始视频帧相对于待处理视频帧的修复偏差,从而根据第一位置信息、第二位置信息和隐特征进行视频压缩得到视频压缩文件。这样,视频接收端在获取到视频压缩文件后,便可以根据第一位置信息和第二位置信息计算得到运动信息,并基于运动信息和前一视频帧进行图像修复得到初始视频帧,由于视频压缩文件中还包括隐特征,隐特征表征初始视频帧相对于待处理视频帧的修复偏差,故可以进一步利用隐特征对初始视频帧进行二次修复,从而缓解复杂画面运动造成的视频帧失真现象,提升算法鲁棒性。另外,视频压缩文件中包括的是第一位置信息和第二位置信息,并非表示运动信息的稠密特征向量,从而在实现视频压缩的情况下,极大减小运动信息消耗的字节流,减小视频压缩文件传输带宽。
基于图5对应实施例提供的视频解码方法,本申请实施例还提供一种视频解码装置1100。参见图11,所述视频解码装置1100包括获取单元1101、确定单元1102和修复单元1103:
所述获取单元1101,用于获取视频压缩文件,所述视频压缩文件中包括待处理视频帧的第一关键点的第一位置信息、前一视频帧的第二关键点的第二位置信息和隐特征,所述前一视频帧是视频帧序列中与所述待处理视频帧相邻、且位于所述待处理视频帧之前的视频帧;
所述确定单元1102,用于根据所述第一位置信息和所述第二位置信息进行运动估计,得到所述待处理视频帧相对于所述前一视频帧的运动信息;
所述修复单元1103,用于根据所述运动信息和所述前一视频帧进行图像修复,得到初始视频帧;
所述修复单元1103,还用于利用所述隐特征对所述初始视频帧进行二次修复,得到最终视频帧。
在一种可能的实现方式中,所述视频压缩文件中还包括分布参数,所述视频压缩文件中包括的隐特征是基于所述分布参数进行算术编码后得到的隐特征,所述装置还包括解码单元:
所述解码单元,用于在利用所述隐特征对所述初始视频帧进行二次修复,得到最终视频帧之前,利用所述分布参数辅助编码后的隐特征进行算术解码得到所述隐特征。
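下面给出解码端"读取视频压缩文件→算术解码隐特征→运动估计与图像修复→二次修复"流程的示意性草图(Python),其中 arithmetic_decode、motion_estimator、inpainting_net、refine_net 等均为与前文编码端草图配对的假设实现,文件格式沿用前文示例。

import pickle
import torch

def decode_frame(path, prev_frame, arithmetic_decode, motion_estimator,
                 inpainting_net, refine_net):
    """arithmetic_decode 为与编码端配对的假设算术解码实现;
    motion_estimator/inpainting_net/refine_net 分别对应运动估计、图像修复与二次修复模块。"""
    with open(path, "rb") as f:
        payload = pickle.load(f)                            # 读取视频压缩文件

    kp_cur = torch.as_tensor(payload["kp_cur"])             # 第一位置信息
    kp_prev = torch.as_tensor(payload["kp_prev"])           # 第二位置信息
    mean, scale = map(torch.as_tensor, payload["dist_params"])  # 分布参数

    # 利用分布参数辅助编码后的隐特征进行算术解码,还原隐特征
    latent = arithmetic_decode(payload["latent_bitstream"], mean, scale)

    # 与编码端一致:运动估计、图像修复,得到初始视频帧
    motion, mask = motion_estimator(prev_frame, kp_prev, kp_cur)
    initial_frame = inpainting_net(prev_frame, motion, mask)

    # 利用隐特征对初始视频帧进行二次修复,得到最终视频帧
    final_frame = refine_net(initial_frame, latent)
    return final_frame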
在一种可能的实现方式中,所述确定单元1102,具体用于:
根据所述第一位置信息和所述第二位置信息进行薄板样条变换,得到薄板样条变换矩阵;
根据所述薄板样条变换矩阵对所述前一视频帧进行变换得到变换图像;
根据所述变换图像,通过运动网络输出贡献图,所述贡献图用于表示所述薄板样条变换矩阵对所述前一视频帧上的每个像素的运动的贡献;
根据所述贡献图和所述薄板样条变换矩阵计算所述运动信息。
在一种可能的实现方式中,所述视频压缩文件中还包括仿射变换矩阵,所述确定单元1102,具体用于:
利用所述薄板样条变换矩阵和所述仿射变换矩阵对所述前一视频帧进行变换得到所述变换图像。
在一种可能的实现方式中,所述确定单元1102,具体用于:
根据所述变换图像,通过所述运动网络输出所述贡献图和掩码信息;
根据所述运动信息和所述前一视频帧进行图像修复,得到初始视频帧,包括:
根据所述运动信息、所述掩码信息和所述前一视频帧进行图像修复,得到初始视频帧。
本申请实施例还提供了一种计算机设备,该计算机设备可以作为视频发送端或者视频接收端。该计算机设备例如可以是终端,下面以终端是智能手机为例进行说明:
图12示出的是与本申请实施例提供的智能手机的部分结构的框图。参考图12,智能手机包括:射频(英文全称:Radio Frequency,英文缩写:RF)电路1210、存储器1220、输入单元1230、显示单元1240、传感器1250、音频电路1260、无线保真(英文缩写:WiFi)模块1270、处理器1280、以及电源1290等部件。输入单元1230可包括触控面板1231以及其他输入设备1232,显示单元1240可包括显示面板1241,音频电路1260可以包括扬声器1261和传声器1262。可以理解的是,图12中示出的智能手机结构并不构成对智能手机的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。
存储器1220可用于存储软件程序以及模块,处理器1280通过运行存储在存储器1220的软件程序以及模块,从而执行智能手机的各种功能应用以及数据处理。存储器1220可主要包括存储程序区和存储数据区,其中,存储程序区可存储操作系统、至少一个功能所需的应用程序(比如声音播放功能、图像播放功能等)等;存储数据区可存储根据智能手机的使用所创建的数据(比如音频数据、电话本等)等。此外,存储器1220可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件、闪存器件、或其他非易失性固态存储器件。
处理器1280是智能手机的控制中心,利用各种接口和线路连接整个智能手机的各个部分,通过运行或执行存储在存储器1220内的软件程序和/或模块,以及调用存储在存储器1220内的数据,执行智能手机的各种功能和处理数据。可选的,处理器1280可包括一个或多个处理单元;优选的,处理器1280可集成应用处理器和调制解调处理器,其中,应用处理器主要处理操作系统、用户界面和应用程序等,调制解调处理器主要处理无线通信。可以理解的是,上述调制解调处理器也可以不集成到处理器1280中。
在本实施例中,智能手机中的处理器1280可以执行以下步骤:
获取待处理视频帧和所述待处理视频帧的前一视频帧,所述前一视频帧是视频帧序列中与所述待处理视频帧相邻、且位于所述待处理视频帧之前的视频帧;
对所述待处理视频帧进行关键点提取,得到所述待处理视频帧中第一关键点的第一位置信息,以及对所述前一视频帧进行关键点提取,得到所述前一视频帧中第二关键点的第二位置信息;
根据所述第一位置信息和所述第二位置信息进行运动估计,得到所述待处理视频帧相对于所述前一视频帧的运动信息;
根据所述运动信息和所述前一视频帧进行图像修复,得到初始视频帧;
根据所述待处理视频帧和所述初始视频帧确定隐特征,所述隐特征表征所述初始视频帧相对于所述待处理视频帧的修复偏差;
根据所述第一位置信息、所述第二位置信息和所述隐特征进行视频压缩得到视频压缩文件。
或,
获取视频压缩文件,所述视频压缩文件中包括待处理视频帧的第一关键点的第一位置信息、前一视频帧的第二关键点的第二位置信息和隐特征,所述前一视频帧是视频帧序列中与所述待处理视频帧相邻、且位于所述待处理视频帧之前的视频帧;
根据所述第一位置信息和所述第二位置信息进行运动估计,得到所述待处理视频帧相对于所述前一视频帧的运动信息;
根据所述运动信息和所述前一视频帧进行图像修复,得到初始视频帧;
利用所述隐特征对所述初始视频帧进行二次修复,得到最终视频帧。
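下面给出将上述处理器执行的编码侧步骤串联起来的示意性流程草图(Python),各模块句柄对应前文草图中的假设实现,函数名与参数划分不构成对本申请的限定;概率建模与算术编码步骤此处省略,可参见前文相应草图。

def compress_frame(cur_frame, prev_frame, kp_model, motion_estimator,
                   inpainting_net, latent_estimator):
    """cur_frame/prev_frame:待处理视频帧与前一视频帧;各模块句柄均为假设实现。"""
    kp_cur = kp_model(cur_frame)                                   # 第一位置信息
    kp_prev = kp_model(prev_frame)                                 # 第二位置信息

    motion, mask = motion_estimator(prev_frame, kp_prev, kp_cur)   # 运动估计
    initial_frame = inpainting_net(prev_frame, motion, mask)       # 图像修复,得到初始视频帧

    latent = latent_estimator(cur_frame, initial_frame)            # 隐特征,表征修复偏差
    # 后续对隐特征进行概率建模与算术编码,并与两组位置信息一并写入视频压缩文件
    return kp_cur, kp_prev, latent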
本申请实施例提供的计算机设备还可以是服务器,请参见图13所示,图13为本申请实施例提供的服务器1300的结构图,服务器1300可因配置或性能不同而产生比较大的差异,可以包括一个或一个以上处理器,例如中央处理器(Central Processing Units,简称CPU)1322,以及存储器1332,一个或一个以上存储应用程序1342或数据1344的存储介质1330(例如一个或一个以上海量存储设备)。其中,存储器1332和存储介质1330可以是短暂存储或持久存储。存储在存储介质1330的程序可以包括一个或一个以上模块(图示没标出),每个模块可以包括对服务器中的一系列指令操作。更进一步地,中央处理器1322可以设置为与存储介质1330通信,在服务器1300上执行存储介质1330中的一系列指令操作。
服务器1300还可以包括一个或一个以上电源1326,一个或一个以上有线或无线网络接口1350,一个或一个以上输入输出接口1358,和/或,一个或一个以上操作系统1341,例如Windows Server™、Mac OS X™、Unix™、Linux™、FreeBSD™等等。
在本实施例中,服务器1300中的中央处理器1322可以执行以下步骤:
获取待处理视频帧和所述待处理视频帧的前一视频帧,所述前一视频帧是视频帧序列中与所述待处理视频帧相邻、且位于所述待处理视频帧之前的视频帧;
对所述待处理视频帧进行关键点提取,得到所述待处理视频帧中第一关键点的第一位置信息,以及对所述前一视频帧进行关键点提取,得到所述前一视频帧中第二关键点的第二位置信息;
根据所述第一位置信息和所述第二位置信息进行运动估计,得到所述待处理视频帧相对于所述前一视频帧的运动信息;
根据所述运动信息和所述前一视频帧进行图像修复,得到初始视频帧;
根据所述待处理视频帧和所述初始视频帧确定隐特征,所述隐特征表征所述初始视频帧相对于所述待处理视频帧的修复偏差;
根据所述第一位置信息、所述第二位置信息和所述隐特征进行视频压缩得到视频压缩文件。
或,
获取视频压缩文件,所述视频压缩文件中包括待处理视频帧的第一关键点的第一位置信息、前一视频帧的第二关键点的第二位置信息和隐特征,所述前一视频帧是视频帧序列中与所述待处理视频帧相邻、且位于所述待处理视频帧之前的视频帧;
根据所述第一位置信息和所述第二位置信息进行运动估计,得到所述待处理视频帧相对于所述前一视频帧的运动信息;
根据所述运动信息和所述前一视频帧进行图像修复,得到初始视频帧;
利用所述隐特征对所述初始视频帧进行二次修复,得到最终视频帧。
根据本申请的一个方面,提供了一种计算机可读存储介质,所述计算机可读存储介质用于存储程序代码,所述程序代码用于执行前述各个实施例所述的视频压缩方法或视频解码方法。
根据本申请的一个方面,提供了一种计算机程序产品,该计算机程序产品包括计算机程序,该计算机程序存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机程序,处理器执行该计算机程序,使得该计算机设备执行上述实施例各种可选实现方式中提供的方法。
上述各个附图对应的流程或结构的描述各有侧重,某个流程或结构中没有详述的部分,可以参见其他流程或结构的相关描述。
本申请的说明书及上述附图中的术语“第一”、“第二”、“第三”、“第四”等(如果存在)是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的数据在适当情况下可以互换,以便这里描述的本申请的实施例例如能够以除了在这里图示或描述的那些以外的顺序实施。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,例如,包含了一系列步骤或单元的过程、方法、***、产品或设备不必限于清楚地列出的那些步骤或单元,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它步骤或单元。
在本申请所提供的几个实施例中,应该理解到,所揭露的***,装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个***,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(Read-Only Memory,简称ROM)、随机存取存储器(Random Access Memory,简称RAM)、磁碟或者光盘等各种可以存储计算机程序的介质。
以上所述,以上实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围。

Claims (20)

  1. 一种视频压缩方法,所述方法由计算机设备执行,所述方法包括:
    获取待处理视频帧和所述待处理视频帧的前一视频帧,所述前一视频帧是视频帧序列中与所述待处理视频帧相邻、且位于所述待处理视频帧之前的视频帧;
    对所述待处理视频帧进行关键点提取,得到所述待处理视频帧中第一关键点的第一位置信息,以及对所述前一视频帧进行关键点提取,得到所述前一视频帧中第二关键点的第二位置信息;
    根据所述第一位置信息和所述第二位置信息进行运动估计,得到所述待处理视频帧相对于所述前一视频帧的运动信息;
    根据所述运动信息和所述前一视频帧进行图像修复,得到初始视频帧;
    根据所述待处理视频帧和所述初始视频帧确定隐特征,所述隐特征表征所述初始视频帧相对于所述待处理视频帧的修复偏差;
    根据所述第一位置信息、所述第二位置信息和所述隐特征进行视频压缩得到视频压缩文件。
  2. 根据权利要求1所述的方法,所述根据所述待处理视频帧和所述初始视频帧确定隐特征,包括:
    通过特征提取器对所述初始视频帧进行特征提取,得到所述初始视频帧的特征向量,并将所述初始视频帧的特征向量作为视频帧压缩条件;
    将所述待处理视频帧的像素矩阵和所述视频压缩条件进行拼接,并将拼接后得到的第一拼接结果输入至条件编码器得到所述隐特征。
  3. 根据权利要求1或2所述的方法,所述方法还包括:
    对所述隐特征进行概率建模,得到分布参数,所述分布参数用于表示所述隐特征中不同信息的分布情况;
    利用所述分布参数辅助所述隐特征进行算术编码得到编码后的隐特征;
    所述根据所述第一位置信息、所述第二位置信息和所述隐特征进行视频压缩得到视频压缩文件,包括:
    将所述第一位置信息、所述第二位置信息、所述编码后的隐特征和所述分布参数写入所述视频压缩文件。
  4. 根据权利要求3所述的方法,所述对所述隐特征进行概率建模,得到分布参数,包括:
    对所述隐特征进行层次先验学习得到第一先验信息;
    对所述隐特征进行空间先验学习得到第二先验信息;
    对所述隐特征进行时序先验学习得到第三先验信息;
    将所述第一先验信息、所述第二先验信息和所述第三先验信息进行融合得到所述分布参数。
  5. 根据权利要求1-4任一项所述的方法,所述根据所述第一位置信息和所述第二位置信息进行运动估计,得到所述待处理视频帧相对于所述前一视频帧的运动信息,包括:
    根据所述第一位置信息和所述第二位置信息进行薄板样条变换,得到薄板样条变换矩阵;
    根据所述薄板样条变换矩阵对所述前一视频帧进行变换得到变换图像;
    根据所述变换图像,通过运动网络输出贡献图,所述贡献图用于表示所述薄板样条变换矩阵对所述前一视频帧上的每个像素的运动的贡献;
    根据所述贡献图和所述薄板样条变换矩阵计算所述运动信息。
  6. 根据权利要求5所述的方法,所述方法还包括:
    将所述待处理视频帧和所述前一视频帧进行拼接,并将拼接后得到的第二拼接结果输入至背景运动预测网络得到仿射变换矩阵,所述仿射变换矩阵用于表示所述待处理视频帧相对于所述前一视频帧的背景运动;
    所述根据所述薄板样条变换矩阵对所述前一视频帧进行变换得到变换图像,包括:
    利用所述薄板样条变换矩阵和所述仿射变换矩阵对所述前一视频帧进行变换得到所述变换图像;
    所述根据所述第一位置信息、所述第二位置信息和所述隐特征进行视频压缩得到视频压缩文件,包括:
    将所述第一位置信息、所述第二位置信息、所述隐特征和所述仿射变换矩阵写入所述视频压缩文件。
  7. 根据权利要求5或6所述的方法,所述根据所述变换图像,通过运动网络输出贡献图,包括:
    根据所述变换图像,通过所述运动网络输出所述贡献图和掩码信息;
    根据所述运动信息和所述前一视频帧进行图像修复,得到初始视频帧,包括:
    根据所述运动信息、所述掩码信息和所述前一视频帧进行图像修复,得到所述初始视频帧。
  8. 根据权利要求1-7任一项所述的方法,所述第一关键点包括所述待处理视频帧中第一对象所包括的各个身体部位的关键点,所述第二关键点包括所述前一视频帧中第二对象所包括的各个身体部位的关键点。
  9. 根据权利要求8所述的方法,所述对所述待处理视频帧进行关键点提取,得到所述待处理视频帧中第一关键点的第一位置信息,以及对所述前一视频帧进行关键点提取,得到所述前一视频帧中第二关键点的第二位置信息,包括:
    识别所述待处理视频帧中所述第一对象所包括的身体部位,以及识别所述前一视频帧中所述第二对象所包括的身体部位;
    根据身体部位与关键点的映射关系,确定所述第一对象所包括身体部位对应的关键点,并确定所述第一对象所包括身体部位对应的关键点在所述待处理视频帧中的第一位置信息,以及根据身体部位与关键点的映射关系,确定所述第二对象所包括身体部位对应的关键点,并确定所述第二对象所包括身体部位对应的关键点在所述前一视频帧中的第二位置信息。
  10. 根据权利要求8所述的方法,所述对所述待处理视频帧进行关键点提取,得到所述待处理视频帧中第一关键点的第一位置信息,以及对所述前一视频帧进行关键点提取,得到所述前一视频帧中第二关键点的第二位置信息,包括:
    通过关键点检测模型对所述待处理视频帧进行关键点提取,得到所述第一位置信息,以及通过所述关键点检测模型对所述前一视频帧进行关键点提取,得到所述第二位置信息;所述关键点检测模型是根据训练样本训练得到的,所述训练样本包括多个样本图像,每个所述样本图像中的样本对象包括身体部位,多个所述样本图像中的样本对象包括的身体部位覆盖各种身体部位。
  11. 一种视频解码方法,所述方法由计算机设备执行,所述方法包括:
    获取视频压缩文件,所述视频压缩文件中包括待处理视频帧的第一关键点的第一位置信息、前一视频帧的第二关键点的第二位置信息和隐特征,所述前一视频帧是视频帧序列中与所述待处理视频帧相邻、且位于所述待处理视频帧之前的视频帧;
    根据所述第一位置信息和所述第二位置信息进行运动估计,得到所述待处理视频帧相对于所述前一视频帧的运动信息;
    根据所述运动信息和所述前一视频帧进行图像修复,得到初始视频帧;
    利用所述隐特征对所述初始视频帧进行二次修复,得到最终视频帧。
  12. 根据权利要求11所述的方法,所述视频压缩文件中还包括分布参数,所述视频压缩文件中包括的隐特征是基于所述分布参数进行算术编码后得到的隐特征,在利用所述隐特征对所述初始视频帧进行二次修复,得到最终视频帧之前,所述方法还包括:
    利用所述分布参数辅助编码后的隐特征进行算术解码得到所述隐特征。
  13. 根据权利要求11或12所述的方法,所述根据所述第一位置信息和所述第二位置信息进行运动估计,得到所述待处理视频帧相对于所述前一视频帧的运动信息,包括:
    根据所述第一位置信息和所述第二位置信息进行薄板样条变换,得到薄板样条变换矩阵;
    根据所述薄板样条变换矩阵对所述前一视频帧进行变换得到变换图像;
    根据所述变换图像,通过运动网络输出贡献图,所述贡献图用于表示所述薄板样条变换矩阵对所述前一视频帧上的每个像素的运动的贡献;
    根据所述贡献图和所述薄板样条变换矩阵计算所述运动信息。
  14. 根据权利要求13所述的方法,所述视频压缩文件中还包括仿射变换矩阵,所述根据所述薄板样条变换矩阵对所述前一视频帧进行变换得到变换图像,包括:
    利用所述薄板样条变换矩阵和所述仿射变换矩阵对所述前一视频帧进行变换得到所述变换图像。
  15. 根据权利要求13或14所述的方法,所述根据所述变换图像,通过运动网络输出贡献图,包括:
    根据所述变换图像,通过所述运动网络输出所述贡献图和掩码信息;
    根据所述运动信息和所述前一视频帧进行图像修复,得到初始视频帧,包括:
    根据所述运动信息、所述掩码信息和所述前一视频帧进行图像修复,得到初始视频帧。
  16. 一种视频压缩装置,所述装置部署在计算机设备上,所述装置包括获取单元、提取单元、确定单元、修复单元和压缩单元:
    所述获取单元,用于获取待处理视频帧和所述待处理视频帧的前一视频帧,所述前一视频帧是视频帧序列中与所述待处理视频帧相邻、且位于所述待处理视频帧之前的视频帧;
    所述提取单元,用于对所述待处理视频帧进行关键点提取,得到所述待处理视频帧中第一关键点的第一位置信息,以及对所述前一视频帧进行关键点提取,得到所述前一视频帧中第二关键点的第二位置信息;
    所述确定单元,用于根据所述第一位置信息和所述第二位置信息进行运动估计,得到所述待处理视频帧相对于所述前一视频帧的运动信息;
    所述修复单元,用于根据所述运动信息和所述前一视频帧进行图像修复,得到初始视频帧;
    所述确定单元,还用于根据所述待处理视频帧和所述初始视频帧确定隐特征,所述隐特征表征所述初始视频帧相对于所述待处理视频帧的修复偏差;
    所述压缩单元,用于根据所述第一位置信息、所述第二位置信息和所述隐特征进行视频压缩得到视频压缩文件。
  17. 一种视频解码装置,所述装置部署在计算机设备上,所述装置包括获取单元、确定单元和修复单元:
    所述获取单元,用于获取视频压缩文件,所述视频压缩文件中包括待处理视频帧的第一关键点的第一位置信息、前一视频帧的第二关键点的第二位置信息和隐特征,所述前一视频帧是视频帧序列中与所述待处理视频帧相邻、且位于所述待处理视频帧之前的视频帧;
    所述确定单元,用于根据所述第一位置信息和所述第二位置信息进行运动估计,得到所述待处理视频帧相对于所述前一视频帧的运动信息;
    所述修复单元,用于根据所述运动信息和所述前一视频帧进行图像修复,得到初始视频帧;
    所述修复单元,还用于利用所述隐特征对所述初始视频帧进行二次修复,得到最终视频帧。
  18. 一种计算机设备,所述计算机设备包括处理器以及存储器:
    所述存储器用于存储程序代码,并将所述程序代码传输给所述处理器;
    所述处理器用于根据所述程序代码中的指令执行权利要求1-10任一项所述的方法或11-15任一项所述的方法。
  19. 一种计算机可读存储介质,所述计算机可读存储介质用于存储程序代码,所述程序代码当被处理器执行时使所述处理器执行权利要求1-10任一项所述的方法或11-15任一项所述的方法。
  20. 一种计算机程序产品,包括计算机程序,该计算机程序被处理器执行时实现权利要求1-10任一项所述的方法或11-15任一项所述的方法。
