WO2024104014A1 - Video compression method, decompression method, apparatus, computer device and storage medium - Google Patents

Video compression method, decompression method, apparatus, computer device and storage medium

Info

Publication number
WO2024104014A1
Authority
WO
WIPO (PCT)
Prior art keywords
frame
pixel
compressed
video
key frame
Prior art date
Application number
PCT/CN2023/124015
Other languages
English (en)
French (fr)
Inventor
关永航
田宽
项进喜
张军
Original Assignee
Tencent Technology (Shenzhen) Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Co., Ltd.
Publication of WO2024104014A1


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/42 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/157 Assigned coding mode, i.e. the coding mode being predefined or preselected to be further used for selection of another element or parameter
    • H04N19/159 Prediction type, e.g. intra-frame, inter-frame or bidirectional frame prediction
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H04N19/172 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a picture, frame or field
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/182 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being a pixel

Definitions

  • the present application relates to the field of video processing technology, and in particular to a video compression method, decompression method, device, computer equipment and storage medium.
  • typical AI video compression algorithms mainly target forward search frame compression; that is, for motion estimation, an optical flow network estimates the optical flow field between frames, and the optical flow field is applied to the previous frame as a prediction offset to obtain a predicted frame.
  • the optical flow network can only perform pixel-level offset prediction, and cannot accurately estimate the offset for more complex motion deformations.
  • additional residual compensation information will appear due to inaccurate offset estimation.
  • a video compression method, a decompression method, an apparatus, a computer device, and a storage medium are provided.
  • the present application provides a video compression method, which is performed by a computer device, and the method includes:
  • the prediction frame is a video frame generated based on the target pixel kernel and the key frame;
  • a compressed video packet is obtained according to the compressed map, the compressed key frame and the compressed pixel kernel.
  • the present application also provides a video compression device.
  • the device comprises:
  • a first generating module used for generating a pixel kernel of each pixel point in the key frame based on the key frame and the forward search frame of the video;
  • a constraint processing module used for performing smooth constraint processing on the pixel kernel to obtain a target pixel kernel;
  • a first compression module used for compressing the key frame and the target pixel kernel to obtain a compressed key frame and a compressed pixel kernel;
  • a second compression module used for compressing a residual map between the forward search frame and a prediction frame to obtain a compressed map;
  • the prediction frame is a video frame generated based on the target pixel kernel and the key frame;
  • a second generating module used for obtaining a compressed video packet according to the compressed map, the compressed key frame and the compressed pixel kernel.
  • the present application further provides a computer device.
  • the computer device includes a memory and a processor.
  • the memory stores a computer program
  • the processor implements the steps of the video compression method when executing the computer program.
  • the present application further provides a computer-readable storage medium having a computer program stored thereon, and when the computer program is executed by a processor, the steps of the video compression method are implemented.
  • the present application further provides a computer program product, wherein the computer program product comprises a computer program, and when the computer program is executed by a processor, the steps of the video compression method are implemented.
  • the present application provides a video decompression method, which is performed by a computer device, and the method includes:
  • the predicted frame is subjected to image compensation processing based on the residual map to obtain a target predicted frame; wherein the key frame and the target predicted frame are video frames in a video.
  • the present application also provides a video decompression device.
  • the device comprises:
  • a first decompression module used for decompressing the compressed key frame and the compressed map in the video packet to obtain the key frame and the residual map;
  • a second decompression module used for decompressing the compressed pixel kernel in the video packet to obtain a target pixel kernel of each pixel point in the key frame;
  • an offset processing module used for performing inter-frame offset processing on the pixel points in the key frame and the neighboring pixel points of those pixel points in turn based on the target pixel kernel to obtain a predicted frame;
  • a compensation module used for performing image compensation processing on the predicted frame based on the residual map to obtain a target predicted frame; wherein the key frame and the target predicted frame are video frames in a video.
  • the present application further provides a computer device, wherein the computer device comprises a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of the video decompression method when executing the computer program.
  • the present application further provides a computer-readable storage medium having a computer program stored thereon, and when the computer program is executed by a processor, the steps of the video decompression method are implemented.
  • the present application further provides a computer program product, wherein the computer program product comprises a computer program, and when the computer program is executed by a processor, the steps of the video decompression method are implemented.
  • FIG. 1 is a diagram showing an application environment of a video compression method and a video decompression method in one embodiment
  • FIG. 2 is a schematic flow chart of a video compression method in one embodiment
  • FIG. 3 is a schematic diagram of the system structure of a video compression method and a video decompression method in one embodiment
  • FIG. 4 is a schematic diagram of generating a target pixel kernel based on a local weighting module in one embodiment
  • FIG. 5 is a schematic diagram comparing pixel kernels of different orders in one embodiment
  • FIG. 6 is a schematic diagram of compressing and decompressing a feature map of a target pixel kernel in one embodiment
  • FIG. 7 is a schematic diagram of compressing and decompressing a feature map of a target pixel kernel in another embodiment
  • FIG. 8 is a schematic diagram comparing the effects of different weights on the distribution of target pixel kernels in one embodiment
  • FIG. 9 is a schematic diagram of a process for extracting a target feature map in one embodiment
  • FIG. 10 is a schematic structural diagram of a model based on a local weighting module in one embodiment
  • FIG. 11 is a schematic flow chart of a video decompression method in one embodiment
  • FIG. 12 is a schematic diagram of jumping to the corresponding progress for playback according to a triggered interactive control when playing a decompressed video in one embodiment
  • FIG. 13 is a schematic diagram of video compression and decompression results in one embodiment
  • FIG. 14 is a structural block diagram of a video compression device in one embodiment
  • FIG. 15 is a structural block diagram of a video compression device in another embodiment
  • FIG. 16 is a structural block diagram of a video decompression device in one embodiment
  • FIG. 17 is a diagram showing the internal structure of a computer device in one embodiment.
  • the video compression method and decompression method provided in the embodiment of the present application can be applied to the application environment shown in FIG1.
  • the terminal 102, the server 104 and the terminal 106 can communicate with each other through a network.
  • the data storage system can store data that the server 104 needs to process.
  • the data storage system can be integrated on the server 104, or placed on a cloud or other network server.
  • although Figure 1 shows terminal 102 executing the video compression method and terminal 106 executing the video decompression method, both methods can be applied to the terminal 102, the server 104 or the terminal 106 in Figure 1; that is, any device among the terminal 102, the server 104 and the terminal 106 can use the compression method and decompression method of the present application to compress and decompress a video.
  • when the terminal 102 executes the video compression method, it can store the obtained video packet locally, upload it to the server 104 for the terminal 106 to watch on demand, or send it to the terminal 106 through a live broadcast platform, and the terminal 106 can execute the decompression method to decompress the received video packet.
  • when the terminal 106 executes the video compression method, reference can be made to the above description.
  • when the server 104 executes the video compression method, the server 104 can store the obtained video packets locally, or send the video packets to the terminal 102 or the terminal 106 in the form of a data stream, so that the terminal 102 or the terminal 106 executes the decompression method to decompress the received video packets.
  • the present application uses the key frame and forward search frame of the video to generate a pixel kernel for each pixel in the key frame, and performs smooth constraint processing on the pixel kernels to obtain target pixel kernels that characterize the inter-frame motion. Since there is a target pixel kernel for each pixel in the key frame, the target pixel kernels can be used to perform weighted processing on the pixels in the key frame and their neighborhood pixels, thereby modeling the inter-frame motion, effectively improving the accuracy of the estimated inter-frame motion, and thus eliminating the need for additional residual compensation during compression.
  • the terminals 102 and 106 may be smart phones, tablet computers, laptop computers, desktop computers, smart speakers, smart watches, IoT devices, and portable wearable devices.
  • IoT devices may be smart speakers, smart TVs, smart air conditioners, and smart vehicle-mounted devices, etc.
  • Portable wearable devices may be smart watches, smart bracelets, head-mounted devices, etc.
  • Server 104 can be an independent physical server or a service node in a blockchain system; each service node in the blockchain system forms a peer-to-peer (P2P) network, and the P2P protocol is an application-layer protocol that runs on top of the Transmission Control Protocol (TCP).
  • the server 104 may also be a server cluster composed of multiple physical servers, and may be a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (CDNs), and big data and artificial intelligence platforms.
  • Terminal 102, terminal 106 and server 104 can be connected via Bluetooth, USB (Universal Serial Bus), a communication network or other communication connection methods, which is not limited in the present application.
  • a video compression method is provided.
  • the method can be applied to the terminal 102, the server 104 or the terminal 106 in FIG. 1 .
  • the method is applied to the terminal 102 in FIG. 1 as an example for description, and includes the following steps:
  • the video can be any of various types of videos produced by a creator, such as short videos, long videos and live videos.
  • the short video and the long video can be saved in the content distribution network so that they can be played on demand when needed; and the live video can be transmitted to the object watching the live broadcast after being compressed using the compression method of the present application.
  • when the video is a live video, it can be compressed in segments during the compression process; that is, each time a segment of the video is obtained, the compression method of the present application can be used to compress that segment. When the video is a short video or a long video for on-demand playback or local storage, the compression method of the present application can be used to compress the entire video.
  • a key frame can be a video frame to be encoded intra-frame in a video. For example, if there are multiple shots in a video (i.e., the segments between every two splicing points in the video), the first frame of each shot can be used as a key frame. For another example, when the difference between two adjacent video frames reaches a preset threshold, the next video frame can also be used as a key frame.
  • the key frame can be an RGB (Red Green Blue) three-channel image or other types of images.
  • the forward search frame may refer to a video frame other than a key frame in the video, and may be reconstructed using the key frame and inter-frame difference information (such as the residual map and target pixel kernel of the present application).
  • the forward search frame may be a three-channel image or an image of another type.
  • the relationship between each other may be understood in conjunction with the following examples: Assuming that the first frame, the fourth frame, and the seventh frame of video A are key frames, the other video frames in video A, such as the second frame, the third frame, the fifth frame, the sixth frame, the eighth frame, the ninth frame, and the tenth frame are forward search frames; the terminal may reconstruct the second and third frames using the first frame and the corresponding inter-frame difference information, reconstruct the fifth and sixth frames using the fourth frame and the corresponding inter-frame difference information, and reconstruct the eighth, ninth, and tenth frames using the seventh frame and the corresponding inter-frame difference information.
  • the pixel kernel can be a matrix containing inter-frame spatial motion information. After smoothing constraint processing, the pixel kernel can learn blur information and accurate inter-frame spatial motion (i.e., the spatial motion of objects between different frames, or the spatial motion of the shooting equipment at different shooting times), and it is ensured that the pixel kernel structures of adjacent pixels are similar. It should be pointed out that when the number of forward search frames corresponding to a key frame is 1, the number of pixel kernels for each pixel in the key frame is 1 or 1×k, where k is a positive integer greater than 1.
  • by applying the pixel kernel to the corresponding pixel point and its neighborhood in the key frame, the corresponding pixel points in the forward search frame and the values of those pixel points can be predicted; when the number of forward search frames corresponding to a key frame is n (n ≥ 2), the number of pixel kernels for each pixel point in the key frame is n or n×k. Therefore, by applying the n or n×k pixel kernels to the corresponding pixel points and their neighborhoods in the key frame in sequence, the corresponding pixel points in the n forward search frames and the values of those pixel points can be predicted.
  • the above-mentioned spatial motion may also be called inter-frame motion and spatial offset.
  • the terminal may input key frames and forward search frames in the video into a local weighting module, and the local weighting module may generate corresponding pixel kernels for each pixel point in the key frame according to the forward search frame and the key frame, as shown in FIG3 .
  • the local weighting module can be mainly composed of a feature extraction network, and the network structure of the feature extraction network can be an auto-encoder structure, which can include an encoder part and a decoder part, with skip connections between the encoders in the encoder part and the decoders in the decoder part; reference can be made to the local weighting module shown in Figure 4.
  • the local weighting module is a module proposed in this application, which can adaptively construct spatial and scale information between frames at the same time, and has good fine granularity.
  • the present application constructs a smooth constraint regularization term, so based on the smooth constraint regularization term, the pixel kernel can be subjected to smooth constraint processing.
  • the smooth constraint regularization term may include an intra-core smooth constraint regularization term and an inter-core smooth constraint regularization term
  • the smooth constraint processing may include an intra-core smooth constraint processing and an inter-core smooth constraint processing
  • the intra-core smooth constraint processing represents the smooth constraint processing inside the pixel kernel
  • the inter-core smooth constraint processing represents the smooth constraint processing between pixel kernels.
  • with the target pixel kernel obtained by the smooth constraint processing, accurate inter-frame spatial motion can be learned (i.e., the inter-frame spatial motion can be accurately predicted using the target pixel kernel), and blur information can also be learned.
  • the target pixel kernel can be a matrix composed of various weight values, and the distribution of non-zero weight values in the target pixel kernel can be used for spatial offset prediction.
  • the target pixel kernels of adjacent pixels have similar structures, which is more in line with the prior of motion estimation (i.e., the motion information of adjacent regions is relatively similar) and is conducive to information compression. It should be pointed out that the blur information can be used to model the uncertainty of spatial offset prediction, making residual compensation easier without having to compensate for the additional error caused by motion estimation errors.
  • the terminal performs intra-kernel smooth constraint processing on the pixel kernel of each pixel point to obtain the processed pixel kernel of each pixel point, and then performs inter-kernel smooth constraint processing on the processed pixel kernels to obtain the target pixel kernels for characterizing inter-frame motion. Specifically, the terminal combines the pixel kernels of all pixel points into a feature map of the pixel kernels, performs intra-kernel smooth constraint processing on the feature map, and then performs inter-kernel smooth constraint processing.
  • the terminal can perform the intra-kernel smooth constraint processing on the pixel kernel of each pixel point based on the intra-kernel smooth constraint regularization term to obtain the processed pixel kernel of each pixel point, and can perform the inter-kernel smooth constraint processing on the processed pixel kernels based on the inter-kernel smooth constraint regularization term to obtain the target pixel kernels for characterizing inter-frame motion.
  • the smooth constraint regularization term is implemented by constraining the L2 norm of the pixel kernel gradient (such as at least one of the first-order gradient, the second-order gradient or the gradient of other orders).
  • let the feature map of the pixel kernel be Kernels, whose size is (K×K×H×W), where K×K represents the size of the pixel kernel, and H×W represents the spatial size (i.e., the image size) of the key frame and the forward search frame. The two regularization terms can then be written as follows:
  • the intra-kernel and inter-kernel smooth constraint regularization terms, written in slice notation over Kernels, are:

    L_smooth-intra = mean(‖Kernels[1:, :, :, :] − Kernels[:K−1, :, :, :]‖₂²) + mean(‖Kernels[:, 1:, :, :] − Kernels[:, :K−1, :, :]‖₂²)

    L_smooth-inter = mean(‖Kernels[:, :, 1:, :] − Kernels[:, :, :H−1, :]‖₂²) + mean(‖Kernels[:, :, :, 1:] − Kernels[:, :, :, :W−1]‖₂²)

  • mean() in the above calculation formulas represents mean calculation, and ‖·‖₂ represents the L2 norm.
  • the first index position in the brackets selects the row elements within the current pixel kernel, the second selects the column elements within the current pixel kernel, the third selects the row position of the pixel kernels across the key frame, and the fourth selects the column position of the pixel kernels across the key frame.
  • "1:" in the calculation formulas represents all weight values from the second weight value to the last weight value along that dimension; ":" represents all weight values from the first weight value to the last weight value along that dimension; ":K−1", ":H−1" and ":W−1" represent the first weight value through the (K−1)-th, (H−1)-th and (W−1)-th weight values respectively, so each term is the mean of the squared L2 norm of a first-order gradient of the pixel kernels.
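  • The two regularization terms above are first-order finite differences; the following is a minimal PyTorch sketch of how they could be computed (the tensor layout and function name are illustrative assumptions, not taken from the source):

    import torch

    def smoothness_losses(kernels: torch.Tensor):
        # kernels: (K, K, H, W) - one K x K pixel kernel per key-frame pixel.
        # Intra-kernel term: gradients of the weights inside each pixel kernel.
        intra = (kernels[1:, :, :, :] - kernels[:-1, :, :, :]).pow(2).mean() \
              + (kernels[:, 1:, :, :] - kernels[:, :-1, :, :]).pow(2).mean()
        # Inter-kernel term: gradients between spatially adjacent pixel kernels.
        inter = (kernels[:, :, 1:, :] - kernels[:, :, :-1, :]).pow(2).mean() \
              + (kernels[:, :, :, 1:] - kernels[:, :, :, :-1]).pow(2).mean()
        return intra, inter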
  • a filter can also be used for smoothing constraint processing; or, a mean value is calculated and then the weight value in the pixel kernel is processed using the mean value, such as calculating the difference between the mean value and each weight value.
  • after one round of smooth constraint processing, a first-order target pixel kernel is obtained; in addition, after obtaining the first-order target pixel kernel, it can be further subjected to smooth constraint processing in the above manner to obtain a second-order target pixel kernel; after obtaining the second-order target pixel kernel, the first-order and second-order target pixel kernels can be fused to obtain a fused target pixel kernel; FIG. 5 shows a comparison between pixel kernels of different orders.
  • the first-order target pixel kernel can be compressed, or the second-order target pixel kernel can be compressed, or after the first-order and second-order target pixel kernels are fused, the fused target pixel kernel is compressed.
  • Motion estimation based on pixel kernels may be affected by the size of the pixel kernel when predicting inter-frame offsets: when the actual inter-frame offset exceeds the range covered by the pixel kernel, the accuracy of motion prediction using the pixel kernel will be reduced. Therefore, a small pixel kernel size will limit offset prediction, while a large pixel kernel size will cause excessive video memory usage.
  • the present application proposes a strategy of cascading pixel kernels.
  • the terminal copies the target pixel kernel of each pixel point in the key frame so that each pixel point corresponds to at least two target pixel kernels, and cascades the target pixel kernels of the same pixel point, thereby achieving a larger inter-frame offset prediction range with only a small increase in video memory and effectively improving the accuracy of spatial motion prediction (see the sketch below).
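  • A minimal sketch of the cascading strategy follows, assuming an illustrative local_weighting helper that applies each pixel's K×K kernel (K odd) to its neighborhood; applying the same K×K kernels twice in cascade enlarges the effective prediction range to (2K−1)×(2K−1) while storing only two copies of the kernels instead of one (2K−1)×(2K−1) kernel per pixel:

    import torch
    import torch.nn.functional as F

    def local_weighting(frame, kernels, k):
        # frame: (1, C, H, W); kernels: (1, K*K, H, W); k assumed odd.
        _, c, h, w = frame.shape
        # Gather each pixel's K x K neighborhood and take a weighted sum.
        patches = F.unfold(frame, k, padding=k // 2).view(1, c, k * k, h, w)
        return (patches * kernels.view(1, 1, k * k, h, w)).sum(dim=2)

    def cascaded_prediction(key_frame, kernels, k, copies=2):
        # Cascade the (copied) target pixel kernels of the same pixel point.
        out = key_frame
        for _ in range(copies):
            out = local_weighting(out, kernels, k)
        return out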
  • the compressed key frame may refer to a video frame obtained by intra-coding a key frame (which belongs to a video compression technology).
  • the compressed key frame may be used as an I frame (Intra-frame, intra-coded frame).
  • when compressing a key frame, the terminal may encode the key frame in an intra-frame encoding manner to remove spatially redundant image information from the key frame and obtain a compressed key frame.
  • after the video is compressed, it can be stored and transmitted, and the target pixel kernel also needs to be stored and transmitted after compression, so that when the video is decompressed, the target pixel kernel can be used to reconstruct the corresponding video frame.
  • in order to reduce the space occupied by video storage or the bandwidth occupied during video transmission, the target pixel kernel needs to be compressed; the compression of the target pixel kernel can be carried out in the following two ways:
  • Compression method 1: directly compress the target pixel kernel.
  • the terminal converts the target pixel kernel into a first latent variable; quantizes the first latent variable to obtain a quantized first latent variable; and encodes the quantized first latent variable to obtain a compressed pixel kernel.
  • the first latent variable may be a variable in the latent space that matches the target pixel kernel.
  • the target pixel kernels of each pixel point of the key frame are combined into a three-dimensional feature map, which is input into the encoder, and the encoder maps the feature map to the latent space to obtain the first latent variable; then the first latent variable is lossy quantized and arithmetic encoded in sequence to obtain a compressed pixel kernel in the form of a byte stream for storage and transmission, as shown in Figure 6.
  • during arithmetic encoding, the probability of the input data can be estimated first, and the data is then encoded according to the estimated probability.
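  • The following is a minimal PyTorch sketch of Compression method 1, assuming a simple convolutional autoencoder; the arithmetic coder itself is stubbed out in comments, since a real system would pair the quantized latent with a learned entropy model and an arithmetic/range coding library:

    import torch
    import torch.nn as nn

    class KernelCodec(nn.Module):
        def __init__(self, k=5, latent_ch=32):   # k: pixel kernel size (assumed)
            super().__init__()
            # Encoder maps the (K*K, H, W) kernel feature map to the latent space
            # (H and W assumed divisible by 4 for a clean round trip).
            self.encoder = nn.Sequential(
                nn.Conv2d(k * k, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, latent_ch, 3, stride=2, padding=1),
            )
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(latent_ch, 64, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(64, k * k, 4, stride=2, padding=1),
            )

        def compress(self, kernel_map):           # kernel_map: (1, K*K, H, W)
            latent = self.encoder(kernel_map)      # first latent variable
            q_latent = torch.round(latent)         # lossy quantization
            # q_latent would now be arithmetic-encoded into a byte stream,
            # using estimated probabilities of the quantized values.
            return q_latent

        def decompress(self, q_latent):
            # After arithmetic decoding, map the quantized latent back to a
            # quantized kernel feature map.
            return self.decoder(q_latent)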
  • Compression method 2: use a codebook to compress the target pixel kernel.
  • the terminal obtains a codebook, maps the target pixel kernels to a word sequence based on the codebook, and encodes the word sequence to obtain a compressed pixel kernel.
  • the codebook may be constructed by selecting several representative video frames from the video to be compressed and using the selected video frames to build it; each word in the codebook corresponds to a target pixel kernel with a specific offset and blur information.
  • the target pixel kernels of the pixel points of the key frame are composed into a three-dimensional feature map, and each pixel kernel in the feature map is then mapped to a word using the codebook (that is, represented by its word index).
  • the codebook can be used to compress the pixel kernels from a data volume of H×W×K×K to a data volume of H×W×1, effectively reducing the amount of data; the words can then be entropy encoded for further compression, as shown in Figure 7. It should be pointed out that the codebook is predefined at both the encoder and the decoder, and does not need to be transmitted.
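  • A minimal sketch of Compression method 2 as nearest-codeword vector quantization (the function names and the distance criterion are illustrative assumptions):

    import torch

    def codebook_compress(kernels, codebook):
        # kernels:  (H*W, K*K) - one flattened pixel kernel per pixel position.
        # codebook: (N, K*K)   - one representative pixel kernel per word.
        dists = torch.cdist(kernels, codebook)   # (H*W, N) pairwise distances
        words = dists.argmin(dim=1)              # (H*W,) words: H*W*K*K -> H*W*1
        return words                             # entropy-code for storage/transport

    def codebook_decompress(words, codebook):
        # Look each word up in the shared, predefined codebook.
        return codebook[words]                   # (H*W, K*K) recovered kernels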
  • the terminal may use either of the above compression methods to compress the at least two cascaded target pixel kernels of each pixel point in the key frame to obtain a compressed pixel kernel.
  • the predicted frame is a video frame generated based on the target pixel kernel and the key frame.
  • the terminal can perform inter-frame offset processing on the pixels in the key frame and the neighboring pixels of the pixels in turn based on the target pixel kernel to obtain a predicted frame.
  • For example, let the target pixel kernel of the pixel in the i-th row and j-th column of the key frame be Kernels(i, j), and let that pixel together with its neighboring pixels be pixel(i, j); the terminal can then use the weight values in Kernels(i, j) to perform weighted processing on pixel(i, j), so as to obtain the predicted frame. A direct sketch of this computation follows below.
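  • A direct, unvectorized sketch of this weighted processing, mirroring the Kernels(i, j) notation above (zero padding at the borders is an illustrative assumption):

    import torch
    import torch.nn.functional as F

    def predict_frame_naive(key_frame, kernels, k):
        # key_frame: (C, H, W); kernels: (H, W, K, K); k assumed odd.
        c, h, w = key_frame.shape
        r = k // 2
        padded = F.pad(key_frame, (r, r, r, r))   # zero-pad so borders have neighbors
        pred = torch.zeros_like(key_frame)
        for i in range(h):
            for j in range(w):
                # pixel(i, j): the pixel and its K x K neighborhood.
                patch = padded[:, i:i + k, j:j + k]          # (C, K, K)
                # Weighted processing with the weight values in Kernels(i, j).
                pred[:, i, j] = (patch * kernels[i, j]).sum(dim=(1, 2))
        return pred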
  • the terminal determines a residual map between the prediction frame and the forward search frame; converts the residual map into a second latent variable; quantizes the second latent variable to obtain a quantized second latent variable; and encodes the quantized second latent variable to obtain a compressed map.
  • the second latent variable may be a variable in the latent space that matches the residual map.
  • the residual map is input to the encoder, and the residual map is mapped to the latent space through the encoder to obtain a second latent variable; then the second latent variable is lossily quantized and arithmetic encoded in sequence to obtain a compressed map in the form of a byte stream for storage and transmission.
  • reference can be made to Figure 6 for the compression process.
  • the video packet may refer to a data packet of video content, which may be stored and transmitted in a network.
  • the terminal can directly package the compressed map, the compressed key frame and the compressed pixel kernel to obtain a compressed video packet.
  • the terminal can also store or transmit the video package, such as storing it in a video library, or transmitting it to a target end.
  • the target end can be a terminal that plays the video, or a CDN server.
  • the terminal can store the video package.
  • the compressed key frame and the compressed map in the video packet are decompressed to obtain the key frame and the residual map
  • the compressed pixel kernel in the video packet is decompressed to obtain the target pixel kernel
  • the pixel points in the key frame and the neighboring pixel points of those pixel points are sequentially subjected to inter-frame offset processing, such as weighted processing based on the weight values in the target pixel kernel, to obtain the predicted frame
  • the predicted frame is subjected to image compensation processing based on the residual map to obtain the target predicted frame
  • the target predicted frame can also be called the target video frame, which is the reconstructed video frame of the forward search frame.
  • if at least two copied target pixel kernels are compressed into the compressed pixel kernel during the compression process, then during decompression, the at least two cascaded target pixel kernels of each pixel point in the key frame can be obtained; then, based on the at least two cascaded target pixel kernels of each pixel point in the key frame, inter-frame offset processing is performed on the pixel points in the key frame and the neighboring pixel points of those pixel points in turn to obtain a predicted frame.
  • Decompression of the compressed map may specifically include: the terminal decodes the compressed map in the video packet to obtain a quantized second latent variable, and converts the quantized second latent variable into a residual map.
  • the compressed map is input to a decoder, and the decoder decodes the compressed map to obtain a quantized second latent variable; the second latent variable is then converted from the latent space into the residual map.
  • reference can be made to Figure 6 for the decompression process that yields the quantized residual map.
  • Decompression of the compressed pixel kernel may specifically include: the terminal decodes the compressed pixel kernel in the video packet to obtain a quantized first latent variable, and converts the quantized first latent variable into the target pixel kernel of each pixel point in the key frame for characterizing the inter-frame motion.
  • alternatively, the terminal decodes the compressed pixel kernel to obtain a word sequence, searches a codebook, and based on the codebook converts the word sequence into the target pixel kernel of each pixel point in the key frame for characterizing the inter-frame motion; a combined sketch of this decompression path follows below.
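  • Putting these steps together, a minimal sketch of the codebook-based decompression path, reusing the illustrative codebook_decompress and predict_frame_naive helpers sketched earlier and assuming the residual map has already been decoded:

    import torch

    def decompress_packet(key_frame, words, codebook, residual, k, h, w):
        # 1. Recover the target pixel kernel of each key-frame pixel.
        kernels = codebook_decompress(words, codebook).view(h, w, k, k)
        # 2. Inter-frame offset processing on each pixel and its neighborhood.
        predicted = predict_frame_naive(key_frame, kernels, k)
        # 3. Image compensation with the residual map -> target predicted frame.
        return predicted + residual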
  • the compression method can be applied in the application stage and the training stage of video compression.
  • the intelligent compression model can be optimized using distortion loss and bit rate loss.
  • the distortion loss can measure the recovery quality of the decompressed frame
  • the bit rate loss can measure the number of bytes of the compressed file. The combination of the two can balance the relationship between the bit rate and the reconstruction quality.
  • the intelligent compression model can be a network model built based on the intelligent compression algorithm.
  • the residual map and the target pixel kernel are compressed by an intelligent (AI) compression algorithm, and the compressed map and the compressed pixel kernel are decompressed by the intelligent compression model.
  • the steps of optimizing the intelligent compression model include: the terminal determines the distortion loss between the prediction frame and the forward search frame to obtain a first distortion loss; determines the distortion loss between the target prediction frame and the forward search frame to obtain a second distortion loss; determines the bit rate losses of the target pixel kernel and the residual map respectively to obtain a first bit rate loss and a second bit rate loss; and adjusts the parameters of the intelligent compression model based on the first distortion loss, the second distortion loss, the first bit rate loss, and the second bit rate loss.
  • the first distortion loss can be the mean square error (MSE) between the prediction frame and the forward search frame
  • the second distortion loss can be the mean square error between the target prediction frame and the forward search frame
  • the first bit rate loss can be the ratio of the compressed file size of the target pixel kernel's feature map to the number of pixels
  • the second bit rate loss can be the ratio of the compressed file size of the residual map to the number of pixels.
  • the terminal may determine a comprehensive loss based on the first distortion loss, the second distortion loss, the first bit rate loss, and the second bit rate loss, and adjust the parameters of the intelligent compression model according to the comprehensive loss.
  • the comprehensive loss combines the distortion losses and the bit rate losses; a representative form of the loss function is:

    Loss = D1 + D2 + λ1·R1 + λ2·R2

    where D1 and D2 are the first and second distortion losses, and R1 and R2 are the first and second bit rate losses.
  • λ1 and λ2 represent adjustment factors used to control the compression quality of the video, and the values of λ1 and λ2 can be greater than 0.
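  • A minimal sketch of this objective, assuming the representative loss form above; d1/d2 are MSE distortion terms and r1/r2 are bits-per-pixel estimates for the compressed pixel kernel feature map and the compressed residual map (the λ values are illustrative):

    import torch.nn.functional as F

    def comprehensive_loss(pred_frame, target_pred_frame, fwd_frame,
                           kernel_bits, residual_bits, num_pixels,
                           lam1=0.01, lam2=0.01):
        d1 = F.mse_loss(pred_frame, fwd_frame)        # first distortion loss
        d2 = F.mse_loss(target_pred_frame, fwd_frame) # second distortion loss
        r1 = kernel_bits / num_pixels                 # first bit rate loss
        r2 = residual_bits / num_pixels               # second bit rate loss
        return d1 + d2 + lam1 * r1 + lam2 * r2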
  • the intelligent compression model can be used for compression processing and also for smooth constraint processing. Therefore, in the process of adjusting the parameters of the intelligent compression model, the weights of the target pixel kernel can be adjusted so that they reach an optimal state.
  • reference can be made to Figure 8 for a comparison of the influence of different weights on the distribution of the target pixel kernel.
  • the pixel kernel of each pixel point in the key frame is generated based on the key frame and the forward search frame of the video, and the pixel kernel is subjected to smooth constraint processing to obtain a target pixel kernel that accurately represents the inter-frame motion. Since the target pixel kernel is obtained by smooth constraint processing of the pixel kernel, the target pixel kernels of adjacent pixels are similar in structure, which is beneficial to video compression, and the pixels in the neighborhood of the corresponding pixel point in the key frame can be accurately captured during decompression. Therefore, the spatial offset can be accurately estimated even for more complex motion deformations, which is beneficial to accurate decompression of the video.
  • the key frame and the target pixel kernel are compressed to obtain a compressed key frame and a compressed pixel kernel; the residual map between the forward search frame and the predicted frame is compressed, and a compressed video packet can be obtained from the compressed map corresponding to the residual map, the compressed key frame and the compressed pixel kernel.
  • the compression greatly reduces the amount of video data and avoids the additional residual compensation information caused by inaccurate offset estimation, which is beneficial to improving the efficiency and accuracy of video decompression.
  • S202 may specifically include:
  • each key frame may correspond to one or more forward search frames, and there is a certain similarity between the key frame and the corresponding forward search frame, such as a similarity greater than or equal to 70%.
  • For example, for a video, when a video frame in which character A appears is a key frame, a subsequent video frame in which character A's hand starts to move from bottom to top is a forward search frame.
  • the terminal may first splice the key frame and the forward search frame to obtain a spliced video frame; encode the spliced video frame through an encoder of a feature extraction network to obtain a coded feature; and decode the coded feature through a decoder of the feature extraction network to obtain a target feature map.
  • the feature extraction network may include an encoder part and a decoder part, with skip connections between the encoders in the encoder part and the decoders in the decoder part.
  • Figure 10 is a detailed structural diagram of the feature extraction network. The left side of the figure is the encoder part, and the right side is the decoder part.
  • the rectangular box with small black dots in the figure corresponds to multi-channel features, and the number of channels is displayed above the rectangular box; the white rectangular box in the figure is the copied and cropped features, and arrows with different directions indicate different operations.
  • the terminal may spatially align the target feature map with the key frame;
  • the target feature map is a three-dimensional feature map, and the width and height values of the target feature map are the same as those of the key frame, as shown in FIG3;
  • the feature vectors in the target feature map corresponding to each pixel in the key frame are converted into pixel kernels respectively. Since the target feature map is obtained by performing image feature processing on the key frame and the forward search frame, and the width and height values of the target feature map are the same as those of the key frame, there are feature vectors in the target feature map that correspond one-to-one to each pixel in the key frame, and the position of the pixel in the key frame is the same as the planar position of the feature vector in the target feature map.
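  • The data flow described above can be sketched as follows, assuming a U-Net-like feature_extractor whose output has K*K channels and the key frame's spatial size; the softmax normalization of each kernel is an illustrative assumption, not stated by the source:

    import torch

    def generate_pixel_kernels(key_frame, fwd_frame, feature_extractor, k):
        # Splice the two frames along the channel axis: (1, 2C, H, W).
        spliced = torch.cat([key_frame, fwd_frame], dim=1)
        feat = feature_extractor(spliced)     # target feature map: (1, K*K, H, W)
        # The feature vector at each planar position corresponds one-to-one to
        # the key-frame pixel at that position; it becomes that pixel's kernel.
        return torch.softmax(feat, dim=1)     # one K x K kernel per pixel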
  • a pixel kernel for each pixel in the key frame is generated based on the key frame and the forward search frame of the video.
  • the pixel kernel can be used to model inter-frame motion. Compared with the traditional scheme of using an optical flow network for pixel-level offset prediction, it can estimate the spatial offset more accurately, and after smooth constraint processing, the accuracy of predicting the spatial offset can be further improved.
  • a video decompression method is provided.
  • the method can be applied to the terminal 102, the server 104 or the terminal 106 in FIG. 1 .
  • the method is applied to the terminal 106 in FIG. 1 as an example for description, and includes the following steps:
  • the compressed key frame is obtained by intra-frame coding of the key frame in the video.
  • the compressed map is obtained by compressing the residual map between the forward search frame and the prediction frame, and the residual map can be a map obtained according to the difference between the forward search frame and the prediction frame.
  • a key frame can be a video frame to be encoded intra-frame in a video. For example, if there are multiple shots in a video (i.e., the segments between every two splicing points in the video), the first frame of each shot can be used as a key frame. For another example, when the difference between two adjacent video frames reaches a preset threshold, the next video frame can also be used as a key frame.
  • the key frame can be an RGB (Red Green Blue) three-channel image or other types of images.
  • the forward search frame may refer to a video frame other than a key frame in the video, and may be reconstructed using the key frame and inter-frame difference information (such as the residual map and target pixel kernel of the present application); the forward search frame can be a three-channel image or another type of image.
  • the compressed key frame in the video packet is intra-decoded to obtain the key frame; then the compressed map in the video packet is decoded to obtain a quantized second latent variable, and the quantized second latent variable is converted into a residual map.
  • the second latent variable may be a variable in the latent space that matches the residual map.
  • the compressed map is input to a decoder, and the decoder decodes the compressed map to obtain a quantized second latent variable; then the second latent variable is converted from the latent space to a quantized residual map.
  • the decompression process can be referred to in FIG6 .
  • the target pixel kernel can be used to characterize the inter-frame motion, which is obtained by smoothing constraint processing on the pixel kernel.
  • Smooth constraint processing can include intra-kernel smooth constraint processing and inter-kernel smooth constraint processing, and the intra-kernel smooth constraint processing represents the smooth constraint processing inside the pixel kernel, and the inter-kernel smooth constraint processing represents the smooth constraint processing between pixel kernels.
  • with the target pixel kernel obtained by smooth constraint processing, accurate inter-frame spatial motion can be learned, and blur information can also be learned.
  • the target pixel kernels of adjacent pixels have similar structures, which conforms to the prior of motion estimation and is more conducive to information compression. The blur information can be used to model the uncertainty of spatial offset prediction, making residual compensation easier without having to compensate for the additional errors caused by motion estimation errors.
  • the pixel kernel may be a matrix containing spatial motion information between frames, which is a matrix generated based on the key frames and forward search frames of the video and corresponding to each pixel in the key frame.
  • Each pixel in the key frame corresponds to one or more pixel kernels. For example, when the number of forward search frames corresponding to a key frame is 1, the number of pixel kernels for each pixel in the key frame is 1 or 1 ⁇ k, where k is a positive integer greater than 1.
  • by applying the pixel kernel to the corresponding pixel and its neighborhood in the key frame, the corresponding pixel in the forward search frame and the value of the pixel can be predicted; when the number of forward search frames corresponding to a key frame is n (n ≥ 2), the number of pixel kernels for each pixel in the key frame is n or n×k. Therefore, by applying the n or n×k pixel kernels to the corresponding pixel and its neighborhood of the key frame in sequence, the corresponding pixels in the n forward search frames and their values can be predicted.
  • if at least two copied target pixel kernels are compressed to obtain the compressed pixel kernel during the compression process, then at least two cascaded target pixel kernels for each pixel point in the key frame can be obtained during decompression.
  • when the source end (i.e., the terminal 102 in FIG. 1) compresses the target pixel kernel, it can adopt either the method of directly compressing the target pixel kernel or the method of compressing the target pixel kernel using a codebook. Therefore, when decompressing, a matching decompression method is selected according to the actual situation, as follows:
  • Decompression method 1: directly decompress the compressed pixel kernel.
  • the terminal decodes the compressed pixel kernel in the video packet to obtain a quantized first latent variable; and converts the quantized first latent variable into a target pixel kernel for representing inter-frame motion of each pixel point in the key frame.
  • the first latent variable may be a variable in the latent space that matches the target pixel kernel.
  • the compressed pixel kernel is decoded by arithmetic decoding to obtain a quantized first latent variable; then, the quantized first latent variable is converted into a quantized feature map through a decoder, and each target pixel kernel is decomposed from the feature map.
  • Decompression method 2: use the codebook to decompress the compressed pixel kernel.
  • the terminal decodes the compressed pixel kernel to obtain a word sequence; searches a codebook; and based on the codebook, converts the word sequence into a target pixel kernel for representing inter-frame motion of each pixel point in a key frame.
  • the codebook is defined in both the encoder and the decoder and does not need to be transmitted with the video packet.
  • the codebook is constructed by selecting several representative video frames from the video to be compressed and using the selected video frames to generate it; each word in the codebook corresponds to a target pixel kernel with specific offset and blur information.
  • the feature map is recovered from the word sequence by searching the codebook, and then the target pixel kernel for representing the inter-frame motion of each pixel point in the key frame is decomposed from the feature map.
  • the predicted frame is a video frame that matches the real forward search frame.
  • the neighborhood pixel points may be a set of pixel points whose distance to the pixel point is less than a preset distance.
  • the terminal may perform inter-frame offset processing on the pixels in the key frame and the neighboring pixels of the pixels in turn based on at least two target pixel cores cascaded from each pixel in the key frame to obtain a predicted frame.
  • the terminal performs weighted processing on the pixel points in the key frame and the neighboring pixel points of the pixel points in turn based on the weight value in the target pixel kernel, so as to obtain a predicted frame.
  • the target prediction frame is also a reconstructed video frame that matches the actual forward search frame.
  • the target prediction frame is substantially the same as the forward search frame; the difference is that the forward search frame is the original video frame in the video, while the target prediction frame is a video frame reconstructed based on the key frame, the target pixel kernel and the residual map.
  • the above-mentioned video may be any of various types of videos produced by a creator, such as short videos, long videos, and live videos.
  • after obtaining the key frame and the target prediction frame, a video can be synthesized from the key frame and the target prediction frame, and then played on the playback page.
  • when the video being played is synthesized from multiple interactive videos, a transparent or translucent interactive page (such as an H5 page) can be overlaid on the video during playback, and the video jumps to the progress corresponding to the interactive control selected by a triggering operation for playback.
  • each interactive video can be compressed separately when the video is compressed.
  • the video being played is a composite of multiple interactive videos
  • interactive controls for interacting with each interactive video can be displayed, such as different interactive controls corresponding to different interactive videos.
  • the interactive video corresponding to the interactive control can be jumped to for playback.
  • For example, interactive controls for "hairstyle change" and "voice change" are displayed in the interactive page carried by a WebView.
  • when the interactive control for "hairstyle change" is clicked, the interactive video for hairstyle change is determined, and playback then jumps to the interactive video for hairstyle change.
  • the compressed key frame and the compressed map in the video packet are decompressed to obtain the key frame and the residual map; the compressed pixel kernel in the video packet is decompressed to obtain the target pixel kernel of each pixel point in the key frame for characterizing the inter-frame motion.
  • the target pixel kernel is obtained by smooth constraint processing of the pixel kernel. Therefore, during decompression, the pixels in the neighborhood of the corresponding pixel point in the key frame can be accurately captured, and the spatial offset can be accurately estimated, which is conducive to accurately decompressing the video to obtain the predicted frame.
  • since the spatial offset can be accurately estimated using the target pixel kernel, it is only necessary to perform conventional image compensation processing on the predicted frame based on the residual map to obtain the target predicted frame, which avoids the additional residual compensation information caused by inaccurate offset estimation and is conducive to improving the efficiency and accuracy of video decompression.
  • This application proposes a local weighting module and constructs a corresponding AI video compression process based on the local weighting module, as shown in Figure 3.
  • the local weighting module is used for inter-frame motion estimation, and can adaptively generate inter-frame motion and uncertainty information based on two adjacent video frames input.
  • the network structure of the local weighting module adopts a U-Net-like autoencoder structure, and introduces skip connections between the encoder part and the decoder part.
  • a spliced image pair consisting of a key frame and a forward search frame is used as the input of the local weighting module, which generates a feature map with the same size as the key frame, aligns the feature map with the key frame in space, and then reshapes the feature map to obtain the pixel kernel of each pixel in the key frame.
  • This application constructs two smoothing constraint methods: intra-kernel smoothing constraint and inter-kernel smoothing constraint.
  • the intra-kernel smoothing constraint makes the generated distribution continuous and smooth, and concentrated in a certain area, which can learn the correct offset and blur information, so that when decompressing, it does not randomly and discretely grab pixel values in the neighborhood of the key frame.
  • the inter-kernel smoothing constraint ensures that the adjacent pixel kernel structures are relatively similar, which is more in line with the prior of motion estimation and is more conducive to information compression.
  • the pixel kernel smoothness constraint can be implemented by constraining the L2 norm of the pixel kernel gradient (first-order gradient and second-order gradient). Specifically, let the generated pixel kernel be Kernels, whose size is (K ⁇ K ⁇ H ⁇ W), where K ⁇ K represents the size of the pixel kernel and H ⁇ W represents the spatial size of the image.
  • the corresponding smoothness constraint terms are:

    L_smooth-intra = mean(‖Kernels[1:, :, :, :] − Kernels[:K−1, :, :, :]‖₂²) + mean(‖Kernels[:, 1:, :, :] − Kernels[:, :K−1, :, :]‖₂²)

    L_smooth-inter = mean(‖Kernels[:, :, 1:, :] − Kernels[:, :, :H−1, :]‖₂²) + mean(‖Kernels[:, :, :, 1:] − Kernels[:, :, :, :W−1]‖₂²)

  • mean() represents mean calculation, and ‖·‖₂ represents the L2 norm.
  • Motion estimation based on pixel kernels may be affected by the size of the pixel kernel when predicting inter-frame offsets: when the actual inter-frame offset exceeds the range covered by the pixel kernel, the accuracy of motion prediction using the pixel kernel will decrease. Therefore, a small pixel kernel size will limit offset prediction, while a large pixel kernel size will lead to excessive video memory consumption.
  • by cascading the pixel kernels, a larger inter-frame offset prediction range can be achieved without increasing the video memory, effectively improving the accuracy of spatial motion prediction.
  • the pixel kernel output by the local weighting module needs to be compressed by an intelligent compression algorithm to reduce the bit rate and facilitate storage and transmission.
  • Strategy 1: compress the feature map of the pixel kernel directly.
  • the feature map of the pixel kernel is first mapped to the latent space through an encoder of an autoencoder, and then a byte stream for storage and transmission is obtained through lossy quantization and arithmetic coding.
  • the received byte stream is first arithmetically decoded, and then the latent variables obtained are decoded into quantized feature maps through the decoder part of the autoencoder.
  • Strategy 2: this application proposes a compression method based on a codebook, which constructs a codebook for the pixel kernels.
  • Each word in the codebook corresponds to a pixel kernel with specific offset and blur information.
  • at the encoding end, the pixel kernel is mapped to a word through the codebook, and at the decoding end, the pixel kernel is recovered from the word by searching the codebook.
  • the feature map of the pixel kernel can be compressed from H ⁇ W ⁇ K ⁇ K to a data volume of H ⁇ W ⁇ 1 through the word, and can be further compressed through entropy coding and other technologies. It should be noted that the codebook has been predefined at the encoding and decoding ends and does not need to be transmitted.
  • the pixel kernel generated by the local weighting module is applied to the key frame to obtain the predicted frame. Since the predicted frame usually has a certain gap with the actual forward search frame, this part of the error needs to be compensated. Specifically, the forward search frame is subtracted from the predicted frame to obtain the residual map. Since the residual map needs to be stored and transmitted, it needs to be lossy quantized and entropy encoded, as shown in Figure 6; then the quantized and decoded residual map is added to the predicted frame to obtain the reconstructed frame as the final result.
  • the intelligent compression algorithm proposed in this application can directly perform end-to-end optimization, and its loss function includes two parts: distortion loss and bit rate loss.
  • Distortion loss can represent the recovery quality of the decompressed frame
  • bit rate loss measures the number of bytes of the compressed file. The combination of the two can balance the relationship between bit rate and reconstruction quality.
  • The loss function is computed as:
  • L = D_pred + D_rec + λ1 · R_kernel + λ2 · R_res
  • where D_pred is the root-mean-square error between the predicted frame and the forward search frame, D_rec is the root-mean-square error between the reconstructed (target predicted) frame and the forward search frame, R_kernel and R_res are the bit rates of the compressed pixel-kernel and residual files, and λ1 and λ2 are adjustment factors used to regulate the compression quality of the video.
  • The loss computed with this loss function is used to optimize the intelligent compression algorithm, yielding the algorithm that is finally deployed (see the sketch below).
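  • Put together, a sketch of the training objective. The placement of λ1 and λ2 on the two rate terms is one plausible reading of the formula above, and in a real learned codec the byte counts would be replaced by a differentiable rate estimate from an entropy model.

```python
import torch
import torch.nn.functional as F

def total_loss(pred, rec, target, kernel_bits, residual_bits, num_pixels,
               lam1: float = 0.01, lam2: float = 0.01) -> torch.Tensor:
    d_pred = torch.sqrt(F.mse_loss(pred, target))  # distortion of predicted frame
    d_rec = torch.sqrt(F.mse_loss(rec, target))    # distortion of reconstructed frame
    r_kernel = kernel_bits / num_pixels            # rate of compressed pixel kernels
    r_res = residual_bits / num_pixels             # rate of compressed residual map
    return d_pred + d_rec + lam1 * r_kernel + lam2 * r_res
```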
  • Figure 13 shows the results of video compression and decompression proposed in this application, where different rows represent different samples.
  • In the first sample, the inter-frame motion is toward the upper left, so the kernel weights at the foreground position (visualization 1) are distributed toward the lower right; at the background position (visualization 2), the kernel performs no offset prediction and can capture the pixel directly at the current position (i.e., the kernel center).
  • The second sample moves toward the upper right as a whole, and contains motion that estimation cannot predict (such as objects appearing from behind).
  • In regions where the motion estimate is fairly certain, the kernels shift pixel capture toward the lower left overall, while for uncertain areas (such as visualization 2) a degree of blurring is applied alongside the offset prediction.
  • The blurred pixel kernels effectively improve the accuracy of motion estimation.
  • Here, "reconstructed frame" refers to the target predicted frame.
  • Steps in the flowcharts of the above embodiments may include multiple sub-steps or stages; these need not complete at the same moment and can execute at different times, and their order need not be sequential: they can run in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
  • The embodiments of this application also provide a video compression device for implementing the video compression method above, and a video decompression device for implementing the video decompression method.
  • The solutions these devices provide are similar to those recorded for the methods above, so for the specific limitations in the one or more video compression device embodiments below, refer to the limitations on the video compression method above, and for the specific limitations in the video decompression device embodiments, refer to the limitations on the video decompression method above; they are not repeated here.
  • A video compression device is provided, including a first generation module 1402, a constraint processing module 1404, a first compression module 1406, a second compression module 1408, and a second generation module 1410, wherein:
  • the first generation module 1402 is used to generate a pixel kernel for each pixel in the key frame based on the key frame and the forward search frame of the video;
  • the constraint processing module 1404 is used to perform smoothing constraint processing on the pixel kernels to obtain target pixel kernels characterizing inter-frame motion;
  • the first compression module 1406 is used to compress the key frame and the target pixel kernels to obtain a compressed key frame and compressed pixel kernels;
  • the second compression module 1408 is used to compress the residual map between the forward search frame and the predicted frame to obtain a compressed map;
  • the predicted frame is a video frame generated based on the target pixel kernels and the key frame;
  • the second generation module 1410 is used to obtain a compressed video packet according to the compressed map, the compressed key frame, and the compressed pixel kernels.
  • In this embodiment, pixel kernels are generated for each pixel of the key frame based on the key frame and the forward search frame of the video, and smoothing constraint processing yields target pixel kernels that accurately characterize inter-frame motion. Because the target kernels result from smoothing constraint processing of the pixel kernels, the target kernels of adjacent pixels are structurally similar, which benefits video compression, and at decompression the pixels in the neighborhood of the corresponding key-frame pixel can be captured accurately. Spatial offsets can therefore be estimated accurately no matter how complex the motion and deformation, which benefits accurate decompression of the video.
  • Furthermore, the key frame and the target pixel kernels are compressed to obtain the compressed key frame and compressed pixel kernels; the residual map between the forward search frame and the predicted frame is compressed, and a compressed video packet is obtained from the resulting compressed map, the compressed key frame, and the compressed pixel kernels. Not every frame of the video needs compressing, only the key frame, the target kernels, and the residual map, which greatly reduces the video's data volume; it also avoids the extra residual-compensation information caused by inaccurate offset estimation, improving the efficiency and accuracy of video decompression.
  • The first generation module 1402 is further used to extract the key frame and the forward search frame from the video; perform image feature extraction on them to obtain a target feature map; and convert each feature vector in the target feature map into the pixel kernel of the corresponding pixel in the key frame.
  • The first generation module 1402 is further used to concatenate the key frame and the forward search frame into a spliced video frame; encode the spliced video frame with the encoder of the feature extraction network to obtain encoded features; and decode the encoded features with the network's decoder to obtain the target feature map.
  • The first generation module 1402 is further used to spatially align the target feature map with the key frame, the target feature map being a three-dimensional feature map whose width and height values equal those of the key frame, and to convert the feature vectors in the target feature map corresponding to each pixel of the key frame into pixel kernels respectively.
  • In this way, a pixel kernel is generated for each pixel of the key frame based on the key frame and the forward search frame of the video.
  • The pixel kernels can be used to model inter-frame motion. Compared with the traditional scheme of pixel-level offset prediction with an optical flow network, they estimate the spatial offset more accurately, and smoothing constraint processing further improves the accuracy of spatial-offset prediction.
  • The constraint processing module 1404 is further used to perform intra-kernel smoothing constraint processing on each pixel's kernel to obtain each pixel's processed kernel, and to perform inter-kernel smoothing constraint processing on the processed kernels to obtain the target pixel kernels characterizing inter-frame motion.
  • the device further comprises:
  • the copy module 1412 is used to copy the target pixel kernel of each pixel in the key frame, so that each pixel corresponds to at least two target pixel kernels.
  • The first compression module 1406 is further used to perform intra-frame compression on the key frame to obtain the compressed key frame; convert the target pixel kernels into a first latent variable; quantize the first latent variable to obtain a quantized first latent variable; and encode the quantized first latent variable to obtain the compressed pixel kernels.
  • The first compression module 1406 is further used to perform intra-frame compression on the key frame to obtain the compressed key frame; obtain a codebook; map the target pixel kernels to a word sequence based on the codebook; and encode the word sequence to obtain the compressed pixel kernels.
  • The second compression module 1408 is further used to determine the residual map between the predicted frame and the forward search frame; convert the residual map into a second latent variable; quantize the second latent variable to obtain a quantized second latent variable; and encode the quantized second latent variable to obtain the compressed map.
  • the device further comprises:
  • the storage module 1414 is used to store the video packet;
  • the first decompression module 1416 is used, when the video packet needs to be decompressed, to decompress the compressed key frame and the compressed map in the packet to obtain the key frame and the residual map;
  • the second decompression module 1418 is used to decompress the compressed pixel kernels in the video packet to obtain the target pixel kernels;
  • the offset processing module 1420 is used to perform inter-frame offset processing, based on the target pixel kernels, on each pixel in the key frame and the neighboring pixels of that pixel in turn to obtain the predicted frame;
  • the compensation processing module 1422 is used to perform image compensation processing on the predicted frame based on the residual map to obtain the target predicted frame, the target predicted frame being the reconstructed video frame corresponding to the forward search frame.
  • The residual map and the target pixel kernels are compressed by the intelligent compression algorithm, and the compressed map and the compressed pixel kernels are decompressed by the intelligent compression algorithm; as shown in FIG. 15, the device further includes:
  • the training module 1424, used to determine the distortion loss between the predicted frame and the forward search frame to obtain a first distortion loss; determine the distortion loss between the target predicted frame and the forward search frame to obtain a second distortion loss; determine the bit-rate losses of the target pixel kernels and of the residual map respectively to obtain a first bit-rate loss and a second bit-rate loss; and adjust the parameters of the intelligent compression algorithm based on the first distortion loss, the second distortion loss, the first bit-rate loss, and the second bit-rate loss.
  • A video decompression device is provided, including a first decompression module 1602, a second decompression module 1604, an offset processing module 1606, and a compensation module 1608, wherein:
  • the first decompression module 1602 is used to decompress the compressed key frame and the compressed map in a video packet to obtain the key frame and the residual map;
  • the second decompression module 1604 is used to decompress the compressed pixel kernels in the video packet to obtain, for each pixel of the key frame, the target pixel kernel characterizing inter-frame motion;
  • the offset processing module 1606 is used to perform inter-frame offset processing, based on the target pixel kernels, on each pixel in the key frame and the neighboring pixels of that pixel in turn to obtain the predicted frame;
  • the compensation module 1608 is used to perform image compensation processing on the predicted frame based on the residual map to obtain the target predicted frame, the key frame and the target predicted frame being video frames of the video.
  • The second decompression module 1604 is further used to decode the compressed pixel kernels in the video packet to obtain a quantized first latent variable, and to convert the quantized first latent variable into the target pixel kernels, characterizing inter-frame motion, of each pixel in the key frame.
  • The second decompression module 1604 is further used to decode the compressed pixel kernels to obtain a word sequence; look up the codebook; and, based on the codebook, convert the word sequence into the target pixel kernel of each pixel in the key frame.
  • The first decompression module 1602 is further used to perform intra-frame decoding on the compressed key frame in the video packet to obtain the key frame; decode the compressed map in the packet to obtain a quantized second latent variable; and convert the quantized second latent variable into the residual map.
  • The compressed key frame and the compressed map in the video packet are decompressed to obtain the key frame and the residual map;
  • the compressed pixel kernels in the video packet are decompressed to obtain, for each pixel of the key frame, the target pixel kernel characterizing inter-frame motion.
  • Because the target pixel kernels result from smoothing constraint processing of the pixel kernels, decompression can accurately capture the pixels in the neighborhood of the corresponding key-frame pixel, so the spatial offset can be estimated accurately no matter how complex the motion and deformation, which is conducive to accurately decompressing the video into the predicted frame.
  • And since the target pixel kernels estimate the spatial offset accurately, only conventional image compensation based on the residual map is needed to obtain the target predicted frame, which avoids the extra residual-compensation information caused by inaccurate offset estimation and improves the efficiency and accuracy of video decompression.
  • Each module in the above video compression device and video decompression device can be implemented wholly or partly by software, hardware, or a combination thereof.
  • Each module can be embedded in, or independent of, a processor of a computer device in hardware form, or stored in the memory of a computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
  • A computer device is provided, which may be a server or a terminal. Taking a terminal as the example, its internal structure diagram may be as shown in FIG. 17.
  • the computer device includes a processor, a memory, an input/output interface, a communication interface, a display unit, and an input device.
  • the processor, the memory, and the input/output interface are connected via a system bus, and the communication interface, the display unit, and the input device are connected to the system bus via the input/output interface.
  • the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system and a computer program.
  • the internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium.
  • the input/output interface of the computer device is used to exchange information between the processor and an external device.
  • the communication interface of the computer device is used to communicate with an external terminal in a wired or wireless manner, and the wireless manner may be implemented through WIFI, a mobile cellular network, NFC (near field communication) or other technologies.
  • When the computer program is executed by the processor, a video compression method and a video decompression method are implemented.
  • The display unit of the computer device is used to present visual output and can be a display screen, a projection device, or a virtual-reality imaging device.
  • The display screen can be a liquid crystal display screen or an electronic ink display screen.
  • The input device of the computer device can be a touch layer covering the display screen, a button, trackball, or touchpad on the device casing, or an external keyboard, touchpad, or mouse.
  • FIG. 17 is merely a block diagram of the part of the structure relevant to the solution of this application and does not limit the computer devices to which the solution applies.
  • A specific computer device may include more or fewer components than shown in the figure, combine certain components, or arrange the components differently.
  • a computer device including a memory and a processor, wherein a computer program is stored in the memory, and when the processor executes the computer program, the steps of the video compression method and the steps of the video decompression method are implemented.
  • A computer-readable storage medium is provided, on which a computer program is stored.
  • When the computer program is executed by a processor, the steps of the above video compression method and the steps of the above video decompression method are implemented.
  • a computer program product including a computer program, which, when executed by a processor, implements the steps of the above-mentioned video compression method and the steps of the above-mentioned video decompression method.
  • any reference to the memory, database or other medium used in the embodiments provided in the present application can include at least one of non-volatile and volatile memory.
  • Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, etc.
  • Volatile memory can include random access memory (RAM) or external cache memory, etc.
  • RAM can be in various forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM).
  • the database involved in each embodiment provided in this application may include at least one of a relational database and a non-relational database.
  • Non-relational databases may include distributed databases based on blockchain, etc., but are not limited to this.
  • the processor involved in each embodiment provided in this application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, etc., but are not limited to this.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

This application relates to a video compression method, decompression method, apparatus, device, and storage medium, applicable to AI-based video compression and decompression scenarios. The compression method includes: generating a pixel kernel for each pixel in a key frame of a video based on the key frame and a forward search frame (202); performing smoothing constraint processing on the pixel kernel to obtain a target pixel kernel (204); compressing the key frame and the target pixel kernel to obtain a compressed key frame and a compressed pixel kernel (206); compressing the residual map between the forward search frame and a predicted frame to obtain a compressed map, the predicted frame being a video frame generated based on the target pixel kernel and the key frame (208); and obtaining a compressed video packet according to the compressed map, the compressed key frame, and the compressed pixel kernel (210).

Description

Video compression method, decompression method, apparatus, computer device, and storage medium
This application claims priority to Chinese patent application No. 2022114463520, entitled "Video compression method, decompression method, apparatus, device, and storage medium", filed with the China National Intellectual Property Administration on November 18, 2022, the entire contents of which are incorporated herein by reference.
Technical field
This application relates to the field of video processing technology, and in particular to a video compression method, decompression method, apparatus, computer device, and storage medium.
Background
With the development of the mobile Internet, video data accounts for most network traffic, and demand for video in its various forms (such as live and on-demand streaming) keeps growing, while the very large file size of raw video data creates great difficulties for video storage and transmission.
Current general-purpose AI video compression algorithms mainly compress forward search frames: motion estimation uses an optical flow network to estimate the inter-frame optical flow field, which is applied to the previous frame as a predicted offset to obtain the predicted frame. However, an optical flow network can only perform pixel-level offset prediction; when motion and deformation are complex it cannot estimate the offset accurately, and inaccurate offset estimation also produces extra residual-compensation information.
Summary
According to various embodiments of this application, a video compression method, a decompression method, an apparatus, a computer device, and a storage medium are provided.
In a first aspect, this application provides a video compression method, executed by a computer device, the method comprising:
generating a pixel kernel for each pixel in a key frame of a video based on the key frame and a forward search frame of the video;
performing smoothing constraint processing on the pixel kernel to obtain a target pixel kernel;
compressing the key frame and the target pixel kernel to obtain a compressed key frame and a compressed pixel kernel;
compressing a residual map between the forward search frame and a predicted frame to obtain a compressed map, the predicted frame being a video frame generated based on the target pixel kernel and the key frame; and
obtaining a compressed video packet according to the compressed map, the compressed key frame, and the compressed pixel kernel.
In a second aspect, this application further provides a video compression apparatus, the apparatus comprising:
a first generation module, configured to generate a pixel kernel for each pixel in a key frame of a video based on the key frame and a forward search frame of the video;
a constraint processing module, configured to perform smoothing constraint processing on the pixel kernel to obtain a target pixel kernel;
a first compression module, configured to compress the key frame and the target pixel kernel to obtain a compressed key frame and a compressed pixel kernel;
a second compression module, configured to compress a residual map between the forward search frame and a predicted frame to obtain a compressed map, the predicted frame being a video frame generated based on the target pixel kernel and the key frame; and
a second generation module, configured to obtain a compressed video packet according to the compressed map, the compressed key frame, and the compressed pixel kernel.
In a third aspect, this application further provides a computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the video compression method when executing the computer program.
In a fourth aspect, this application further provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the video compression method.
In a fifth aspect, this application further provides a computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the steps of the video compression method.
In a sixth aspect, this application provides a video decompression method, executed by a computer device, the method comprising:
decompressing a compressed key frame and a compressed map in a video packet to obtain a key frame and a residual map;
decompressing a compressed pixel kernel in the video packet to obtain a target pixel kernel of each pixel in the key frame;
performing inter-frame offset processing, based on the target pixel kernel, on each pixel in the key frame and the neighboring pixels of that pixel in turn to obtain a predicted frame; and
performing image compensation processing on the predicted frame based on the residual map to obtain a target predicted frame, wherein the key frame and the target predicted frame are video frames of the video.
In a seventh aspect, this application further provides a video decompression apparatus, the apparatus comprising:
a first decompression module, configured to decompress a compressed key frame and a compressed map in a video packet to obtain a key frame and a residual map;
a second decompression module, configured to decompress a compressed pixel kernel in the video packet to obtain a target pixel kernel of each pixel in the key frame;
an offset processing module, configured to perform inter-frame offset processing, based on the target pixel kernel, on each pixel in the key frame and the neighboring pixels of that pixel in turn to obtain a predicted frame; and
a compensation module, configured to perform image compensation processing on the predicted frame based on the residual map to obtain a target predicted frame, wherein the key frame and the target predicted frame are video frames of the video.
In an eighth aspect, this application further provides a computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the video decompression method when executing the computer program.
In a ninth aspect, this application further provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the video decompression method.
In a tenth aspect, this application further provides a computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the steps of the video decompression method.
Details of one or more embodiments of this application are set forth in the drawings and description below. Other features and advantages of this application will become apparent from the specification, drawings, and claims.
Brief description of the drawings
Figure 1 is a diagram of an application environment of the video compression method and video decompression method in one embodiment;
Figure 2 is a schematic flowchart of the video compression method in one embodiment;
Figure 3 is a schematic diagram of the system structure of the video compression and decompression methods in one embodiment;
Figure 4 is a schematic diagram of generating target pixel kernels with the local weighting module in one embodiment;
Figure 5 is a comparison diagram of pixel kernels of different orders in one embodiment;
Figure 6 is a schematic diagram of compressing and decompressing the feature map of target pixel kernels in one embodiment;
Figure 7 is a schematic diagram of compressing and decompressing the feature map of target pixel kernels in another embodiment;
Figure 8 is a comparison diagram of how different weights affect the distribution of target pixel kernels in one embodiment;
Figure 9 is a schematic flowchart of extracting the target feature map in one embodiment;
Figure 10 is a schematic diagram of the structural model of the local weighting module in one embodiment;
Figure 11 is a schematic flowchart of the video decompression method in one embodiment;
Figure 12 is a schematic diagram of jumping to the playback position corresponding to a triggered interactive control when playing a decompressed video in one embodiment;
Figure 13 is a schematic diagram of video compression and decompression results in one embodiment;
Figure 14 is a structural block diagram of the video compression apparatus in one embodiment;
Figure 15 is a structural block diagram of the video compression apparatus in another embodiment;
Figure 16 is a structural block diagram of the video decompression apparatus in one embodiment;
Figure 17 is an internal structure diagram of a computer device in one embodiment.
Detailed description
To make the objectives, technical solutions, and advantages of this application clearer, this application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here merely explain this application and do not limit it.
The video compression and decompression methods provided by the embodiments of this application can be applied in the environment shown in Figure 1, where terminal 102, server 104, and terminal 106 communicate over a network. A data storage system can store the data that server 104 needs to process; it can be integrated on server 104 or placed on the cloud or another network server.
Although Figure 1 shows terminal 102 performing the video compression method and terminal 106 performing the video decompression method, in practice both methods can be applied on any of terminal 102, server 104, or terminal 106; that is, any of these devices can compress and decompress video using the compression and decompression methods of this application.
When terminal 102 performs the compression method, it can store the resulting video packet locally, upload it to server 104 for on-demand playback by terminal 106, or send it to terminal 106 through a live-streaming platform, in which case terminal 106 performs the decompression method on the received packet. When terminal 106 performs the compression method, the same approach can be followed.
When server 104 performs the compression method, it can store the resulting video packet locally; when terminal 102 or 106 requests on-demand playback, the packet is streamed to it, and the receiving terminal performs the decompression method on the received packet.
When compressing video, to avoid the inaccuracy of offset prediction with an optical flow network in traditional schemes, this application uses the key frame and forward search frame of a video to generate a pixel kernel for each pixel of the key frame and applies smoothing constraint processing to it, obtaining target pixel kernels that characterize inter-frame motion. Because a target pixel kernel exists for every pixel of the key frame, weighting each key-frame pixel and its neighborhood with its kernel models the inter-frame motion, effectively improving the accuracy of inter-frame motion estimation and avoiding extra residual compensation during compression.
Terminal 102 and terminal 106 can be smartphones, tablets, laptops, desktop computers, smart speakers, smart watches, Internet-of-Things devices, and portable wearable devices; the IoT devices can be smart speakers, smart TVs, smart air conditioners, smart vehicle-mounted devices, and the like; the portable wearables can be smart watches, smart bands, head-mounted devices, and the like.
Server 104 can be an independent physical server, or a service node in a blockchain system whose service nodes form a peer-to-peer (P2P) network, the P2P protocol being an application-layer protocol running on top of the Transmission Control Protocol (TCP). Server 104 can also be a server cluster of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery network (CDN), and big-data and artificial-intelligence platforms.
Terminals 102 and 106 can connect to server 104 via Bluetooth, USB (Universal Serial Bus), a communication network, or other connection methods, which this application does not restrict.
In one embodiment, as shown in Figure 2, a video compression method is provided. The method can be applied to terminal 102, server 104, or terminal 106 in Figure 1; taking its application to terminal 102 as the example, it includes the following steps:
S202: generate a pixel kernel for each pixel in a key frame of a video based on the key frame and a forward search frame of the video.
The video can be any type of video produced by a creator, such as short videos, long videos, and live videos. Short and long videos, once compressed with this application's compression method, can be stored on a content delivery network for on-demand playback; a live video can be compressed with this method and transmitted to the viewers. For a live video, compression can proceed segment by segment: each segment is compressed with this method as it is acquired. For a short or long video intended for on-demand playback or local storage, the whole video can be compressed at once.
A key frame can be a video frame of the video that is to be intra-coded. For example, when a video has multiple shots (segments between adjacent cut points), the first frame of each shot can serve as a key frame; likewise, when the difference between two adjacent video frames reaches a preset threshold, the later frame can serve as a key frame. A key frame can be a three-channel RGB (red, green, blue) image or another type of image.
A forward search frame is a video frame other than a key frame; it can be reconstructed from the key frame and inter-frame difference information (here, the residual map and the target pixel kernels). It can likewise be a three-channel image or another type. The relationship between key frames and forward search frames can be seen in an example: suppose frames 1, 4, and 7 of video A are key frames; then the other frames of video A (frames 2, 3, 5, 6, 8, 9, and 10) are forward search frames. The terminal can reconstruct frames 2 and 3 from frame 1 and the corresponding difference information, frames 5 and 6 from frame 4 and its difference information, and frames 8, 9, and 10 from frame 7 and its difference information.
A pixel kernel is a matrix carrying inter-frame spatial-motion information. After smoothing constraint processing, the kernel learns blur information and accurate inter-frame spatial motion (the spatial motion of objects between frames, or the motion of the capture device between capture times), while keeping the kernel structures of adjacent pixels similar. Note that when a key frame corresponds to one forward search frame, each pixel of that key frame has 1 or 1×k pixel kernels, k being a positive integer greater than 1; applying the 1 or 1×k kernels to the corresponding key-frame pixel and its neighborhood predicts the corresponding pixel of the forward search frame and its value. When a key frame corresponds to n (n ≥ 2) forward search frames, each pixel has n or n×k kernels; applying them in turn to the corresponding key-frame pixel and its neighborhood predicts the corresponding pixels and values of the n forward search frames. The spatial motion above is also called inter-frame motion or spatial offset.
In one embodiment, the terminal inputs the key frame and forward search frame of the video into a local weighting module, which generates a pixel kernel for each pixel of the key frame from the forward search frame and the key frame, as shown in Figure 3.
The local weighting module is composed mainly of a feature extraction network whose structure can be an auto-encoder, comprising an encoder part and a decoder part with skip connections between the encoders and decoders; see the local weighting module shown in Figure 4. This module, proposed by this application, adaptively constructs inter-frame spatial and scale information at the same time, with good granularity.
S204: perform smoothing constraint processing on the pixel kernels to obtain target pixel kernels.
To constrain the pixel kernels to satisfy a smooth, continuous prior and make their information easier to compress and store, this application constructs smoothing-constraint regularization terms, on the basis of which the kernels can be smoothed. The regularization comprises an intra-kernel smoothing term and an inter-kernel smoothing term, so the processing includes intra-kernel smoothing constraint processing (smoothing within a pixel kernel) and inter-kernel smoothing constraint processing (smoothing between pixel kernels). The target pixel kernels obtained through this processing learn accurate inter-frame spatial motion (i.e., the inter-frame spatial motion can be predicted accurately from them) as well as blur information. A target pixel kernel can be a matrix of weight values; the distribution of its non-zero weights is usable for spatial-offset prediction. In addition, the target kernels of adjacent pixels are structurally similar, which better matches the motion-estimation prior (motion information in neighboring regions is similar) and benefits information compression. Note that the blur information can model the uncertainty of spatial-offset prediction, making residual compensation easier: extra errors caused by wrong motion estimates need not be compensated.
In one embodiment, the terminal performs intra-kernel smoothing constraint processing on each pixel's kernel to obtain each pixel's processed kernel, then performs inter-kernel smoothing constraint processing on the processed kernels to obtain the target pixel kernels characterizing inter-frame motion. Specifically, the terminal combines all pixels' kernels into a kernel feature map, performs intra-kernel smoothing on the feature map, and then inter-kernel smoothing. Specifically, the terminal can perform the intra-kernel smoothing based on the intra-kernel regularization term and the inter-kernel smoothing on the processed kernels based on the inter-kernel regularization term, obtaining the target kernels characterizing inter-frame motion.
The smoothing regularization is implemented by constraining the L2 norm of the kernel gradients (at least one of first-order gradients, second-order gradients, or gradients of other orders). For example, let the kernel feature map be Kernels, of size (K×K×H×W), where K×K is the kernel size and H×W the spatial size (image size) of the key frame and forward search frame. The intra-kernel and inter-kernel regularization terms are computed as:
smooth_intra = mean(||Kernels[1:, :, :, :] - Kernels[:K-1, :, :, :]||_2 + ||Kernels[:, 1:, :, :] - Kernels[:, :K-1, :, :]||_2)
smooth_inter = mean(||Kernels[:, :, 1:, :] - Kernels[:, :, :H-1, :]||_2 + ||Kernels[:, :, :, 1:] - Kernels[:, :, :, :W-1]||_2)
Here mean() denotes mean calculation and || · ||_2 the L2 norm. In the brackets, the first index runs over the row elements within the current kernel, the second over the column elements within the current kernel, the third over the row elements across kernels, and the fourth over the column elements across kernels. Furthermore, 1: denotes all weights from the second to the last within a kernel, : denotes all weights from the first to the last, :K-1 denotes the first K-1 weights within a kernel, and :H-1 and :W-1 denote the first H-1 and W-1 positions along the spatial axes of the other kernels.
Besides the smoothing approach above, a filter can also be used for smoothing constraint processing; alternatively, a mean can be computed and the kernel's weight values processed with it, e.g., by taking the difference between the mean and each weight.
Smoothing as above yields first-order target pixel kernels. After obtaining them, the same procedure can be applied again to the first-order target kernels to obtain second-order target kernels; the first- and second-order kernels can then be fused into fused target kernels, as shown in Figure 5, which compares kernels of different orders. Notably, compression can operate on the first-order target kernels, on the second-order target kernels, or on the fused kernels after fusing the two; target kernels of different orders learn different blur information.
Kernel-based motion estimation may be limited by kernel size when predicting inter-frame offsets: once the inter-frame offset exceeds the kernel's extent, the accuracy of kernel-based motion prediction drops, so a small kernel size restricts offset prediction while a large kernel size makes GPU memory usage too high. This application therefore proposes a kernel cascading strategy: the terminal copies the target kernel of each key-frame pixel so that each pixel corresponds to at least two target kernels, and cascades the target kernels of the same pixel, thereby achieving larger inter-frame offset prediction with only a small increase in GPU memory and effectively improving the accuracy of spatial-motion prediction.
S206: compress the key frame and the target pixel kernels to obtain a compressed key frame and compressed pixel kernels.
The compressed key frame is the video frame obtained by intra-frame coding (a video compression technique) of the key frame, and can serve as an I-frame (intra-coded frame).
In one embodiment, when compressing the key frame, the terminal can encode it by intra-frame coding, removing the spatial redundancy in the image to obtain the compressed key frame.
After compression, the video can be stored and transmitted, and the compressed target pixel kernels must be stored and transmitted along with it so that the corresponding video frames can be reconstructed at decompression. To reduce the storage space the video occupies, or the bandwidth it uses in transmission, the target kernels must be compressed, in either of the following two ways:
Compression method 1: compress the target pixel kernels directly.
In one embodiment, the terminal converts the target kernels into a first latent variable, quantizes the first latent variable to obtain a quantized first latent variable, and encodes the quantized first latent variable to obtain the compressed pixel kernels. The first latent variable can be the variable in latent space matched to the target kernels.
For example, the target kernels of all key-frame pixels are assembled into a three-dimensional feature map, which is input to an encoder that maps it into latent space to give the first latent variable; lossy quantization and arithmetic coding are then applied in turn, producing the compressed kernels as a byte stream for storage and transmission, as shown in Figure 6. During arithmetic coding, the probabilities of the input data can be estimated first and encoding performed afterwards.
Compression method 2: compress the target pixel kernels with a codebook.
In one embodiment, the terminal obtains a codebook, maps the target kernels to a word sequence based on the codebook, and encodes the word sequence to obtain the compressed pixel kernels.
The codebook can be built by selecting several representative video frames from the video to be compressed and constructing the codebook from the selected frames. Each word in the codebook corresponds to a target pixel kernel with specific offset and blur information.
For example, the target kernels of all key-frame pixels are assembled into a three-dimensional feature map, which the codebook maps to words (represented by word indices); through the codebook, the kernels are compressed from a data volume of H×W×K×K to H×W×1, effectively reducing the data volume. The words can then be entropy-coded for further compression, as shown in Figure 7. Note that the codebook is predefined at both the encoder and the decoder and need not be transmitted.
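The text above says only that the codebook is built from a few representative frames; it does not specify the construction algorithm. One plausible instantiation (an assumption, sketched below) is k-means clustering of the kernels generated for those frames:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(sample_kernels: np.ndarray, vocab_size: int = 512) -> np.ndarray:
    """Cluster kernels gathered from the representative frames into codewords.

    sample_kernels: (N, K*K) kernels generated for the selected frames
    returns:        (V, K*K) codebook shared by encoder and decoder
    """
    km = KMeans(n_clusters=vocab_size, n_init=10, random_state=0)
    km.fit(sample_kernels)
    return km.cluster_centers_
```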
Note that when at least two target kernels are cascaded for the same pixel, the terminal can use either of the compression methods above to compress the at least two cascaded target kernels of each key-frame pixel, obtaining the compressed pixel kernels.
S208: compress the residual map between the forward search frame and the predicted frame to obtain a compressed map.
The predicted frame is a video frame generated based on the target pixel kernels and the key frame.
In one embodiment, the terminal can perform inter-frame offset processing, based on the target kernels, on each key-frame pixel and the neighboring pixels of that pixel in turn to obtain the predicted frame. For example, for a key frame of size H×W, let Kernels(i,j) be the target kernel of the pixel at row i, column j of the key frame, and let pixel(i,j) be that pixel together with its neighboring pixels; weighting pixel(i,j) with the weight values in Kernels(i,j) yields the predicted frame.
In one embodiment, the terminal determines the residual map between the predicted frame and the forward search frame, converts the residual map into a second latent variable, quantizes the second latent variable to obtain a quantized second latent variable, and encodes the quantized second latent variable to obtain the compressed map. The second latent variable can be the variable in latent space matched to the residual map.
For example, the residual map is input to an encoder that maps it into latent space to give the second latent variable, which is then lossily quantized and arithmetically coded in turn, producing the compressed map as a byte stream for storage and transmission; see Figure 6 for the compression process.
S210: obtain a compressed video packet according to the compressed map, the compressed key frame, and the compressed pixel kernels.
A video packet is a data packet of video content that can be stored and transmitted over a network.
In one embodiment, the terminal can directly pack the compressed map, compressed key frame, and compressed kernels to obtain the compressed video packet. After obtaining the packet, the terminal can also store or transmit it, e.g., store it in a video library or transmit it to a target end, which can be a terminal that plays video or a CDN server.
In one embodiment, after obtaining the video packet, the terminal can store it. When the packet needs to be decompressed, the terminal decompresses the compressed key frame and compressed map in the packet to obtain the key frame and residual map; decompresses the compressed kernels in the packet to obtain the target kernels; performs inter-frame offset processing, based on the target kernels, on each key-frame pixel and its neighboring pixels in turn, e.g., by weighting them in turn with the kernels' weight values, to obtain the predicted frame; and performs image compensation processing on the predicted frame based on the residual map to obtain the target predicted frame, also called the target video frame, which is the reconstructed video frame of the forward search frame.
Note that if at least two copied target kernels were compressed into the compressed kernels during compression, then decompression recovers the at least two cascaded target kernels of each key-frame pixel, which are then used for the inter-frame offset processing of each key-frame pixel and its neighbors in turn to obtain the predicted frame.
Decompressing the compressed map can specifically include: the terminal decodes the compressed map in the packet to obtain the quantized second latent variable, and converts the quantized second latent variable into the residual map. For example, the compressed map is input to a decoder that decodes it into the quantized second latent variable, which is then converted from latent space into the quantized residual map; see Figure 6 for the decompression process.
Decompressing the compressed kernels can specifically include: the terminal decodes the compressed kernels in the packet to obtain the quantized first latent variable and converts it into the target kernels, characterizing inter-frame motion, of each key-frame pixel; or the terminal decodes the compressed kernels into a word sequence, looks up the codebook, and converts the word sequence into the target kernels, characterizing inter-frame motion, of each key-frame pixel based on the codebook.
In one embodiment, this compression method applies both to the deployment stage and to the training stage of video compression. In the training stage, the intelligent compression model can be optimized with a distortion loss and a rate loss: the distortion loss measures the recovery quality of decompressed frames, the rate loss measures the byte count of the compressed files, and combining the two balances bit rate against reconstruction quality. The intelligent compression model can be a network model built on the intelligent compression algorithm.
The residual map and target kernels are compressed by the intelligent (AI) compression algorithm, and the compressed map and compressed kernels are decompressed by the intelligent compression model. Optimizing the model includes: the terminal determines the distortion loss between the predicted frame and the forward search frame to obtain a first distortion loss; determines the distortion loss between the target predicted frame and the forward search frame to obtain a second distortion loss; determines the rate losses of the target kernels and of the residual map respectively to obtain a first and a second rate loss; and adjusts the model's parameters based on the first distortion loss, the second distortion loss, the first rate loss, and the second rate loss.
The first distortion loss can be the root-mean-square error (RMSE) between the predicted frame and the forward search frame, and the second the RMSE between the target predicted frame and the forward search frame; the first rate loss can be the ratio of the compressed file size of the kernel feature map to the pixel count, and the second the ratio of the compressed residual-map file size to the pixel count.
In one embodiment, the terminal can determine an overall loss from the four losses and adjust the model's parameters according to it. The overall loss can be expressed as:
L = D_pred + D_rec + λ1 · R_kernel + λ2 · R_res
where D_pred is the RMSE between the predicted frame and the forward search frame, i.e., the first distortion loss; D_rec is the RMSE between the target predicted frame and the forward search frame, i.e., the second distortion loss; R_kernel is the ratio of the compressed kernel-feature-map file size to the pixel count, i.e., the first rate loss; R_res is the ratio of the compressed residual-map file size to the pixel count, i.e., the second rate loss; and λ1 and λ2 are adjustment factors, taking values greater than 0, used to regulate the compression quality of the video.
The intelligent compression model can perform compression and can also be used for smoothing constraint processing, so while its parameters are adjusted, the weights of the target kernels can be tuned toward their optimum; Figure 8 compares how different weights affect the distribution of the target kernels.
In the embodiment above, pixel kernels are generated for each key-frame pixel based on the key frame and forward search frame of the video; smoothing constraint processing yields target kernels that accurately characterize inter-frame motion. Because the target kernels result from smoothing the pixel kernels, the target kernels of adjacent pixels are structurally similar, which benefits video compression, and decompression can accurately capture the pixels in the neighborhood of the corresponding key-frame pixel, so the spatial offset can be estimated accurately even when motion and deformation are complex, benefiting accurate decompression. Furthermore, the key frame and target kernels are compressed into a compressed key frame and compressed kernels; the residual map between the forward search frame and the predicted frame is compressed; and the compressed video packet is obtained from the compressed map corresponding to the residual map, the compressed key frame, and the compressed kernels. Not all frames of the video need compressing, only the key frame, the target kernels, and the residual map, which greatly reduces the video's data volume; it also avoids the extra residual-compensation information caused by inaccurate offset estimation, improving the efficiency and accuracy of video decompression.
In one embodiment, as shown in Figure 9, S202 can specifically include:
S902: extract the key frame and the forward search frame from the video.
A video can have multiple key frames, each corresponding to one or more forward search frames; a key frame and its forward search frames have a certain similarity, e.g., at least 70%. For instance, if the frame in which person a first appears is a key frame, then the subsequent frames in which person a's hand moves from bottom to top are forward search frames.
S904: perform image feature extraction on the key frame and the forward search frame to obtain a target feature map.
In one embodiment, the terminal can first concatenate the key frame and the forward search frame to obtain a spliced video frame, encode the spliced frame with the encoder of the feature extraction network to obtain encoded features, and decode the encoded features with the network's decoder to obtain the target feature map.
The feature extraction network can include an encoder part and a decoder part, with skip connections between the encoders of the encoder part and the decoders of the decoder part. Figure 10 shows its detailed structure: the encoder part on the left, the decoder part on the right; the rectangles with small black dots correspond to multi-channel features, with channel counts shown above them; the white rectangles are copied-and-cropped features; arrows of different kinds denote different operations.
S906: convert each feature vector in the target feature map into the pixel kernel of the corresponding pixel in the key frame.
In one embodiment, the terminal can spatially align the target feature map with the key frame; the target feature map is a three-dimensional feature map whose width and height values equal those of the key frame, as shown in Figure 3; the feature vectors in the target feature map corresponding to each key-frame pixel are converted into pixel kernels respectively. Because the target feature map comes from image feature processing of the key frame and forward search frame and shares the key frame's width and height, its feature vectors correspond one-to-one with the key-frame pixels, a pixel's position in the key frame coinciding with its vector's planar position in the feature map.
In the embodiment above, pixel kernels for each key-frame pixel are generated based on the key frame and the forward search frame of the video. These kernels can model inter-frame motion; compared with the traditional scheme of pixel-level offset prediction with an optical flow network, they estimate the spatial offset more accurately, and smoothing constraint processing further improves the accuracy of spatial-offset prediction.
In one embodiment, as shown in Figure 11, a video decompression method is provided. The method can be applied to terminal 102, server 104, or terminal 106 in Figure 1; taking its application to terminal 106 as the example, it includes the following steps:
S1102: decompress the compressed key frame and the compressed map in the video packet to obtain the key frame and the residual map.
The compressed key frame results from intra-frame coding of a key frame of the video. The compressed map results from compressing the residual map between the forward search frame and the predicted frame, the residual map being derived from the difference between the forward search frame and the predicted frame.
A key frame can be a video frame of the video to be intra-coded: when a video has multiple shots (segments between adjacent cut points), the first frame of each shot can serve as a key frame; likewise, when the difference between two adjacent frames reaches a preset threshold, the later frame can serve as a key frame. A key frame can be a three-channel RGB (red, green, blue) image or another type of image.
A forward search frame is a video frame other than a key frame, reconstructible from the key frame and inter-frame difference information (here, the residual map and the target pixel kernels); it can likewise be a three-channel image or another type.
In one embodiment, the compressed key frame in the packet is intra-decoded to obtain the key frame; the compressed map in the packet is then decoded to obtain a quantized second latent variable, which is converted into the residual map. The second latent variable can be the variable in latent space matched to the residual map. For example, the compressed map is input to a decoder that decodes it into the quantized second latent variable, which is then converted from latent space into the quantized residual map; see Figure 6 for the decompression process.
S1104: decompress the compressed pixel kernels in the video packet to obtain the target pixel kernels.
The target pixel kernels can characterize inter-frame motion and result from smoothing constraint processing of the pixel kernels. That processing can include intra-kernel smoothing constraint processing (smoothing within a pixel kernel) and inter-kernel smoothing constraint processing (smoothing between pixel kernels). The target kernels obtained through it learn accurate inter-frame spatial motion as well as blur information; moreover, the target kernels of adjacent pixels are structurally similar, matching the motion-estimation prior and further benefiting information compression. The blur information can model the uncertainty of spatial-offset prediction, making residual compensation easier: extra errors caused by wrong motion estimates need not be compensated.
A pixel kernel can be a matrix carrying inter-frame spatial-motion information, generated for each key-frame pixel based on the video's key frame and forward search frame; each key-frame pixel corresponds to one or more kernels. For example, when a key frame corresponds to one forward search frame, each pixel of that key frame has 1 or 1×k kernels, k being a positive integer greater than 1, and applying the 1 or 1×k kernels to the corresponding key-frame pixel and its neighborhood predicts the corresponding pixel of the forward search frame and its value; when a key frame corresponds to n (n ≥ 2) forward search frames, each pixel has n or n×k kernels, applied in turn to the corresponding key-frame pixel and its neighborhood to predict the corresponding pixels and values of the n forward search frames.
Note that if at least two copied target kernels were compressed into the compressed kernels during compression, decompression recovers the at least two cascaded target kernels of each key-frame pixel.
When the source end (terminal 102 in Figure 1) compressed the target kernels, it either compressed them directly or compressed them with a codebook, so decompression selects the matching method according to the actual case, as follows:
Decompression method 1: decompress the compressed pixel kernels directly.
In one embodiment, the terminal decodes the compressed kernels in the packet to obtain a quantized first latent variable, and converts the quantized first latent variable into the target kernels, characterizing inter-frame motion, of each key-frame pixel.
The first latent variable can be the variable in latent space matched to the target kernels.
For example, as shown in Figure 6, upon obtaining the compressed kernels in the packet, they are arithmetically decoded into the quantized first latent variable; the decoder then converts the quantized first latent variable into a quantized feature map, from which the target kernels are separated.
Decompression method 2: decompress the target pixel kernels with the codebook.
In one embodiment, the terminal decodes the compressed kernels to obtain a word sequence, looks up the codebook, and converts the word sequence into the target kernels, characterizing inter-frame motion, of each key-frame pixel based on the codebook.
The codebook is defined at both the encoder and the decoder and need not travel with the video packet. It can be constructed by selecting several representative video frames from the video to be compressed and building the codebook from the selected frames. Each word in it corresponds to a target kernel with specific offset and blur information.
For example, as shown in Figure 7, during decompression the codebook is looked up to recover the feature map from the word sequence, and the target kernels, characterizing inter-frame motion, of each key-frame pixel are then separated out of the feature map.
S1106: based on the target pixel kernels, perform inter-frame offset processing on each pixel in the key frame and the neighboring pixels of that pixel in turn to obtain the predicted frame.
The predicted frame is the video frame matched to the true forward search frame. A pixel's neighboring pixels can be the set of pixels whose distance to it is below a preset distance.
In one embodiment, the terminal can perform the inter-frame offset processing on each key-frame pixel and its neighboring pixels in turn based on the at least two cascaded target kernels of each key-frame pixel, obtaining the predicted frame.
In one embodiment, the terminal weights each key-frame pixel and its neighboring pixels in turn based on the weight values in the target kernels, thereby obtaining the predicted frame.
S1108: perform image compensation processing on the predicted frame based on the residual map to obtain the target predicted frame; the key frame and the target predicted frame are video frames of the video.
The target predicted frame is likewise a reconstructed video frame matched to the true forward search frame. In video content the target predicted frame and the forward search frame are the same; the difference is that the forward search frame is an original frame of the video, while the target predicted frame is reconstructed from the key frame, the target kernels, and the residual map.
The video above can be any type of video produced by a creator, such as short videos, long videos, and live videos.
In one embodiment, after the key frame and target predicted frames are obtained, a video can be synthesized from them and played on a playback page. If the video was synthesized from multiple interactive videos, then when playback reaches a target progress point, a transparent or semi-transparent interactive page (such as an H5 page) can be loaded, displaying at least two interactive controls on it; in response to a trigger operation on an interactive control, playback jumps to the progress position corresponding to the selected control. Note that when the video is synthesized from multiple interactive videos, each interactive video can be compressed separately during video compression.
For example, when the played video was synthesized from multiple interactive videos and playback reaches a specific progress point, controls for interacting with the component videos can be displayed, different controls corresponding to different interactive videos; clicking one of them jumps playback to that control's interactive video. As shown in Figure 12, at a specific progress point the interactive page hosted in a WebView shows "change hairstyle" and "change voice" controls; clicking the "change hairstyle" control selects the hairstyle-changing interactive video, and playback jumps to it.
In the embodiment above, the compressed key frame and compressed map in the packet are decompressed to obtain the key frame and residual map, and the compressed kernels are decompressed to obtain the target kernels, characterizing inter-frame motion, of each key-frame pixel. Because the target kernels result from smoothing constraint processing of the pixel kernels, decompression can accurately capture the pixels in the neighborhood of the corresponding key-frame pixel and accurately estimate the spatial offset, benefiting accurate decompression of the video into the predicted frame. Moreover, since the target kernels estimate the spatial offset accurately, only conventional image compensation of the predicted frame based on the residual map is needed to obtain the target predicted frame, avoiding the extra residual-compensation information caused by inaccurate offset estimation and improving the efficiency and accuracy of video decompression.
To make this application's scheme clearer, it is described here in connection with a practical application scenario, as follows.
This application proposes a local weighting module and, on its basis, builds the corresponding AI video-compression pipeline, as shown in Figure 3.
(I) Local weighting module
(1) Module description
The local weighting module performs inter-frame motion estimation, adaptively generating inter-frame motion and uncertainty information from the two adjacent video frames it receives as input. Its network structure is a U-Net-like auto-encoder, with skip connections introduced between the encoder part and the decoder part.
As shown in Figure 3, the concatenated image pair formed by the key frame and the forward predicted frame is the module's input; the module generates a feature map of the same size as the key frame, which is spatially aligned with the key frame and then reshaped to give the pixel kernel of each key-frame pixel.
(2) Pixel-kernel smoothing constraints
This application constructs two smoothing constraints: an intra-kernel constraint and an inter-kernel constraint. The intra-kernel constraint makes each generated kernel's distribution continuous, smooth, and concentrated in a region, which allows reasonable blur information to be learned and also helps compress and store the kernel information; the inter-kernel constraint keeps adjacent kernels structurally similar, which better matches the motion-estimation prior and further benefits information compression.
The kernel smoothing constraints can be implemented by constraining the L2 norm of the kernel gradients (first-order and second-order). Specifically, let the generated kernels be Kernels, of size (K×K×H×W), where K×K is the kernel size and H×W the spatial size of the image; the intra-kernel and inter-kernel constraints can be expressed as:
smooth_intra = mean(||Kernels[1:, :, :, :] - Kernels[:K-1, :, :, :]||_2 + ||Kernels[:, 1:, :, :] - Kernels[:, :K-1, :, :]||_2)
smooth_inter = mean(||Kernels[:, :, 1:, :] - Kernels[:, :, :H-1, :]||_2 + ||Kernels[:, :, :, 1:] - Kernels[:, :, :, :W-1]||_2)
where mean() denotes mean calculation and || · ||_2 the L2 norm.
(3) Cascading strategy
Kernel-based motion estimation may be limited by kernel size when predicting inter-frame offsets: once the inter-frame offset exceeds the kernel's extent, the accuracy of kernel-based motion prediction drops, so a small kernel size restricts offset prediction while a large kernel size makes GPU memory consumption too high. Cascading the kernels of the same pixel achieves larger inter-frame offset prediction without adding much GPU memory, effectively improving the accuracy of spatial-motion prediction.
(II) Compression strategies for the pixel kernels
The pixel kernels output by the local weighting module must be compressed by the intelligent compression algorithm to lower the bit rate and facilitate storage and transmission. For kernel compression, this application constructs two strategies, as follows:
Strategy 1: compress the kernels' feature map directly.
Referring to Figure 6, the kernel feature map is first mapped into latent space by the encoder of an auto-encoder, after which lossy quantization and arithmetic coding produce the byte stream for storage and transmission. At the decoding end, the received byte stream is first arithmetically decoded, after which the decoder part of the auto-encoder decodes the resulting latent variables into the quantized feature map.
Strategy 2: compress via a codebook.
This application proposes codebook-based compression, building a codebook for the kernels in which each word corresponds to a kernel with specific offset and blur information. Referring to Figure 7, the codebook maps a kernel to a word at encoding, and decoding recovers the kernel from the word by looking up the codebook. Through the words, the kernel feature map can be compressed from H×W×K×K to a data volume of H×W×1, after which entropy coding and similar techniques can compress it further. Note that the codebook is predefined at the encoding and decoding ends and need not be transmitted.
(III) Residual compensation and loss function
Applying the kernels generated by the local weighting module to the key frame yields the predicted frame. Since the predicted frame usually differs somewhat from the true forward search frame, this part of the error must be compensated. Specifically, the predicted frame is subtracted from the forward search frame to obtain the residual map; since the residual map must be stored and transmitted, it is lossily quantized and entropy-coded, as in Figure 6; the quantized and decoded residual map is then added to the predicted frame, giving the reconstructed frame as the final result.
The intelligent compression algorithm proposed by this application can be optimized directly end to end. Its loss function comprises two parts, a distortion loss and a rate loss: the distortion loss represents the recovery quality of decompressed frames, while the rate loss measures the byte count of the compressed files; combining the two balances bit rate against reconstruction quality. The loss is computed as:
L = D_pred + D_rec + λ1 · R_kernel + λ2 · R_res
where D_pred is the RMSE between the predicted frame and the forward search frame, D_rec the RMSE between the target predicted frame and the forward search frame, R_kernel and R_res the bit rates of the compressed kernel and residual files respectively, and λ1 and λ2 adjustment factors used to regulate the video's compression quality. The loss computed with this loss function is used to optimize the intelligent compression algorithm, yielding the intelligent compression algorithm that is finally deployed.
This application does not depend on a particular hardware environment and can be flexibly embedded into any video-compression framework, with good extensibility. Figure 13 shows the compression and decompression results of this application, different rows showing different samples. In the first sample, the inter-frame motion is toward the upper left, so the kernel weights at the foreground position (visualization 1) are distributed toward the lower right; at the background position (visualization 2), the kernel performs no offset prediction and can capture the pixel directly at the current position (i.e., the kernel center). The second sample moves toward the upper right as a whole, with some motion that estimation cannot predict (such as objects appearing from behind). So in regions where motion estimation is fairly certain (such as visualization 1), the kernel shifts pixel capture toward the lower left overall, while in uncertain regions (such as visualization 2) a degree of blurring is applied alongside the offset prediction; the blurred kernels effectively improve motion-estimation accuracy, and applying them to the key frame to obtain the predicted frame avoids the extra residual compensation caused by inaccurate estimation. Here, the reconstructed frame can refer to the target predicted frame.
It should be understood that although the steps in the flowcharts of the embodiments above are shown sequentially as the arrows indicate, they are not necessarily executed in that order. Unless explicitly stated herein, their execution is not strictly ordered and they can run in other orders. Moreover, at least some steps in those flowcharts can comprise multiple sub-steps or stages, which need not complete at the same moment but can execute at different times, and whose execution order need not be sequential: they can run in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
Based on the same inventive concept, the embodiments of this application also provide a video compression apparatus for implementing the video compression method above, and a video decompression apparatus for implementing the video decompression method. The solutions these apparatuses provide are similar to those recorded in the methods above, so for the specific limitations in the one or more video compression apparatus embodiments below, see the limitations on the video compression method above, and for those in the video decompression apparatus embodiments, see the limitations on the video decompression method above; they are not repeated here.
In one embodiment, as shown in Figure 14, a video compression apparatus is provided, comprising a first generation module 1402, a constraint processing module 1404, a first compression module 1406, a second compression module 1408, and a second generation module 1410, wherein:
the first generation module 1402 is configured to generate a pixel kernel for each pixel in a key frame of a video based on the key frame and a forward search frame of the video;
the constraint processing module 1404 is configured to perform smoothing constraint processing on the pixel kernels to obtain target pixel kernels characterizing inter-frame motion;
the first compression module 1406 is configured to compress the key frame and the target kernels to obtain a compressed key frame and compressed pixel kernels;
the second compression module 1408 is configured to compress the residual map between the forward search frame and a predicted frame to obtain a compressed map, the predicted frame being a video frame generated based on the target kernels and the key frame;
the second generation module 1410 is configured to obtain a compressed video packet according to the compressed map, the compressed key frame, and the compressed kernels.
In the embodiment above, pixel kernels are generated for each key-frame pixel based on the key frame and forward search frame of the video; smoothing constraint processing yields target kernels that accurately characterize inter-frame motion. Because the target kernels result from smoothing the pixel kernels, the target kernels of adjacent pixels are structurally similar, which benefits video compression, and decompression can accurately capture the pixels in the neighborhood of the corresponding key-frame pixel, so the spatial offset can be estimated accurately even when motion and deformation are complex, benefiting accurate decompression. Furthermore, the key frame and target kernels are compressed into a compressed key frame and compressed kernels; the residual map between the forward search frame and the predicted frame is compressed, and the compressed video packet is obtained from the compressed map corresponding to the residual map, the compressed key frame, and the compressed kernels. Not all frames of the video need compressing, only the key frame, the target kernels, and the residual map, which greatly reduces the video's data volume; it also avoids the extra residual-compensation information caused by inaccurate offset estimation, improving the efficiency and accuracy of video decompression.
In one embodiment, the first generation module 1402 is further configured to extract the key frame and forward search frame from the video; perform image feature extraction on them to obtain a target feature map; and convert each feature vector in the target feature map into the pixel kernel of the corresponding key-frame pixel.
In one embodiment, the first generation module 1402 is further configured to concatenate the key frame and forward search frame into a spliced video frame; encode the spliced frame with the encoder of the feature extraction network to obtain encoded features; and decode those features with the network's decoder to obtain the target feature map.
In one embodiment, the first generation module 1402 is further configured to spatially align the target feature map with the key frame, the target feature map being a three-dimensional feature map whose width and height values equal those of the key frame, and to convert the feature vectors in the target feature map corresponding to each key-frame pixel into pixel kernels respectively.
In the embodiment above, pixel kernels for each key-frame pixel are generated based on the key frame and forward search frame of the video. These kernels can model inter-frame motion; compared with the traditional scheme of pixel-level offset prediction with an optical flow network, they estimate the spatial offset more accurately, and smoothing constraint processing further improves the accuracy of spatial-offset prediction.
In one embodiment, the constraint processing module 1404 is further configured to perform intra-kernel smoothing constraint processing on each pixel's kernel to obtain each pixel's processed kernel, and inter-kernel smoothing constraint processing on the processed kernels to obtain the target kernels characterizing inter-frame motion.
In one embodiment, as shown in Figure 15, the apparatus further comprises:
a copy module 1412, configured to copy the target kernel of each key-frame pixel so that each pixel corresponds to at least two target kernels.
In one embodiment, the first compression module 1406 is further configured to perform intra-frame compression on the key frame to obtain the compressed key frame; convert the target kernels into a first latent variable; quantize the first latent variable to obtain a quantized first latent variable; and encode the quantized first latent variable to obtain the compressed kernels.
In one embodiment, the first compression module 1406 is further configured to perform intra-frame compression on the key frame to obtain the compressed key frame; obtain a codebook; map the target kernels to a word sequence based on the codebook; and encode the word sequence to obtain the compressed kernels.
In one embodiment, the second compression module 1408 is further configured to determine the residual map between the predicted frame and the forward search frame; convert the residual map into a second latent variable; quantize it to obtain a quantized second latent variable; and encode the quantized second latent variable to obtain the compressed map.
In one embodiment, as shown in Figure 15, the apparatus further comprises:
a storage module 1414, configured to store the video packet;
a first decompression module 1416, configured, when the packet needs decompressing, to decompress the compressed key frame and compressed map in the packet into the key frame and residual map;
a second decompression module 1418, configured to decompress the compressed kernels in the packet into the target kernels;
an offset processing module 1420, configured to perform inter-frame offset processing, based on the target kernels, on each key-frame pixel and its neighboring pixels in turn to obtain the predicted frame;
a compensation processing module 1422, configured to perform image compensation processing on the predicted frame based on the residual map to obtain the target predicted frame, the target predicted frame being the reconstructed video frame corresponding to the forward search frame.
In one embodiment, the residual map and target kernels are compressed by the intelligent compression algorithm, and the compressed map and compressed kernels are decompressed by it; as shown in Figure 15, the apparatus further comprises:
a training module 1424, configured to determine the distortion loss between the predicted frame and the forward search frame to obtain a first distortion loss; determine the distortion loss between the target predicted frame and the forward search frame to obtain a second distortion loss; determine the rate losses of the target kernels and residual map respectively to obtain a first and a second rate loss; and adjust the parameters of the intelligent compression algorithm based on the first distortion loss, the second distortion loss, the first rate loss, and the second rate loss.
In one embodiment, as shown in Figure 16, a video decompression apparatus is provided, comprising a first decompression module 1602, a second decompression module 1604, an offset processing module 1606, and a compensation module 1608, wherein:
the first decompression module 1602 is configured to decompress the compressed key frame and the compressed map in a video packet to obtain the key frame and the residual map;
the second decompression module 1604 is configured to decompress the compressed pixel kernels in the video packet to obtain, for each pixel of the key frame, the target pixel kernel characterizing inter-frame motion;
the offset processing module 1606 is configured to perform inter-frame offset processing, based on the target kernels, on each pixel in the key frame and the neighboring pixels of that pixel in turn to obtain the predicted frame;
the compensation module 1608 is configured to perform image compensation processing on the predicted frame based on the residual map to obtain the target predicted frame, the key frame and the target predicted frame being video frames of the video.
In one embodiment, the second decompression module 1604 is further configured to decode the compressed kernels in the packet to obtain a quantized first latent variable, and convert it into the target kernels, characterizing inter-frame motion, of each key-frame pixel.
In one embodiment, the second decompression module 1604 is further configured to decode the compressed kernels to obtain a word sequence; look up the codebook; and, based on the codebook, convert the word sequence into the target kernel of each key-frame pixel.
In one embodiment, the first decompression module 1602 is further configured to perform intra-frame decoding on the compressed key frame in the packet to obtain the key frame; decode the compressed map in the packet to obtain a quantized second latent variable; and convert the quantized second latent variable into the residual map.
In the embodiment above, the compressed key frame and compressed map in the packet are decompressed to obtain the key frame and residual map, and the compressed kernels are decompressed to obtain the target kernels, characterizing inter-frame motion, of each key-frame pixel. Because the target kernels result from smoothing constraint processing of the pixel kernels, decompression can accurately capture the pixels in the neighborhood of the corresponding key-frame pixel, so the spatial offset can be estimated accurately even when motion and deformation are complex, benefiting accurate decompression of the video into the predicted frame. Moreover, since the target kernels estimate the spatial offset accurately, only conventional image compensation of the predicted frame based on the residual map is needed to obtain the target predicted frame, avoiding the extra residual-compensation information caused by inaccurate offset estimation and improving the efficiency and accuracy of video decompression.
Each module in the video compression apparatus and video decompression apparatus above can be implemented wholly or partly by software, hardware, or a combination thereof. Each module can be embedded in, or independent of, a processor of a computer device in hardware form, or stored in the memory of a computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
In one embodiment, a computer device is provided, which can be a server or a terminal. Taking a terminal as the example, its internal structure diagram can be as shown in Figure 17. The computer device comprises a processor, a memory, an input/output interface, a communication interface, a display unit, and an input apparatus. The processor, memory, and input/output interface are connected via a system bus; the communication interface, display unit, and input apparatus are connected to the system bus via the input/output interface. The processor of the computer device provides computing and control capability. The memory of the computer device comprises a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system and a computer program, and the internal memory provides the environment for running the operating system and computer program in the non-volatile storage medium. The input/output interface of the computer device exchanges information between the processor and external devices. The communication interface of the computer device communicates with external terminals by wire or wirelessly, the wireless manner being implementable via WIFI, a mobile cellular network, NFC (near field communication), or other technologies. When the computer program is executed by the processor, a video compression method and a video decompression method are implemented. The display unit of the computer device presents visual output and can be a display screen, a projection apparatus, or a virtual-reality imaging apparatus; the display screen can be a liquid-crystal or electronic-ink screen; the input apparatus can be a touch layer covering the display screen, buttons, a trackball, or a touchpad on the device housing, or an external keyboard, touchpad, or mouse.
Those skilled in the art will understand that the structure shown in Figure 17 is merely a block diagram of the part of the structure relevant to this application's scheme and does not limit the computer devices to which the scheme applies; a specific computer device can include more or fewer components than shown, combine certain components, or arrange the components differently.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory storing a computer program; when executing the computer program, the processor implements the steps of the video compression method above and the steps of the video decompression method above.
In one embodiment, a computer-readable storage medium is provided, storing a computer program that, when executed by a processor, implements the steps of the video compression method above and the steps of the video decompression method above.
In one embodiment, a computer program product is provided, comprising a computer program that, when executed by a processor, implements the steps of the video compression method above and the steps of the video decompression method above.
Those of ordinary skill in the art will understand that all or part of the procedures of the method embodiments above can be completed by instructing relevant hardware through a computer program; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the procedures of the method embodiments above. Any reference to the memory, databases, or other media used in the embodiments provided by this application can include at least one of non-volatile and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory can include random access memory (RAM) or external cache memory, among others. By way of illustration and not limitation, RAM can take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases involved in the embodiments of this application can include at least one of relational and non-relational databases; non-relational databases can include, without limitation, blockchain-based distributed databases. The processors involved in the embodiments of this application can be, without limitation, general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, or data-processing logic devices based on quantum computing.
The technical features of the embodiments above can be combined arbitrarily; for brevity, not all possible combinations of the technical features in the embodiments above are described, but any combination of these features that contains no contradiction should be regarded as within the scope of this specification.
The embodiments above express only several implementations of this application, and their description is relatively specific and detailed, but they should not therefore be understood as limiting the scope of the patent. It should be pointed out that those of ordinary skill in the art can make several variations and improvements without departing from the concept of this application, all of which fall within its scope of protection. The scope of protection of this application shall therefore be subject to the appended claims.

Claims (19)

  1. A video compression method, executed by a computer device, wherein the method comprises:
    generating a pixel kernel for each pixel in a key frame of a video based on the key frame and a forward search frame of the video;
    performing smoothing constraint processing on the pixel kernel to obtain a target pixel kernel;
    compressing the key frame and the target pixel kernel to obtain a compressed key frame and a compressed pixel kernel;
    compressing a residual map between the forward search frame and a predicted frame to obtain a compressed map, the predicted frame being a video frame generated based on the target pixel kernel and the key frame; and
    obtaining a compressed video packet according to the compressed map, the compressed key frame, and the compressed pixel kernel.
  2. The method according to claim 1, wherein generating the pixel kernel of each pixel in the key frame based on the key frame and the forward search frame of the video comprises:
    extracting the key frame and the forward search frame from the video;
    performing image feature extraction on the key frame and the forward search frame to obtain a target feature map; and
    converting each feature vector in the target feature map into the pixel kernel of the corresponding pixel in the key frame.
  3. The method according to claim 2, wherein performing image feature extraction on the key frame and the forward search frame to obtain the target feature map comprises:
    concatenating the key frame and the forward search frame to obtain a spliced video frame;
    encoding the spliced video frame by an encoder of a feature extraction network to obtain encoded features; and
    decoding the encoded features by a decoder of the feature extraction network to obtain the target feature map.
  4. The method according to claim 2, wherein converting each feature vector in the target feature map into the pixel kernel of the corresponding pixel in the key frame comprises:
    spatially aligning the target feature map with the key frame, the target feature map being a three-dimensional feature map whose width and height values equal those of the key frame; and
    converting the feature vectors in the target feature map corresponding to each pixel in the key frame into pixel kernels respectively.
  5. The method according to claim 1, wherein performing smoothing constraint processing on the pixel kernel to obtain the target pixel kernel comprises:
    performing intra-kernel smoothing constraint processing on the pixel kernel of each pixel to obtain the processed pixel kernel of each pixel; and
    performing inter-kernel smoothing constraint processing on the processed pixel kernel of each pixel to obtain the target pixel kernel.
  6. The method according to claim 1, wherein before compressing the key frame and the target pixel kernel to obtain the compressed key frame and the compressed pixel kernel, the method further comprises:
    copying the target pixel kernel of each pixel in the key frame, so that each pixel corresponds to at least two target pixel kernels.
  7. The method according to claim 1, wherein compressing the key frame and the target pixel kernel to obtain the compressed key frame and the compressed pixel kernel comprises:
    performing intra-frame compression on the key frame to obtain the compressed key frame; and
    converting the target pixel kernel into a first latent variable; quantizing the first latent variable to obtain a quantized first latent variable; and encoding the quantized first latent variable to obtain the compressed pixel kernel.
  8. The method according to claim 1, wherein compressing the key frame and the target pixel kernel to obtain the compressed key frame and the compressed pixel kernel comprises:
    performing intra-frame compression on the key frame to obtain the compressed key frame; and
    obtaining a codebook; mapping the target pixel kernel to a word sequence based on the codebook; and encoding the word sequence to obtain the compressed pixel kernel.
  9. The method according to claim 1, wherein compressing the residual map between the forward search frame and the predicted frame to obtain the compressed map comprises:
    determining the residual map between the predicted frame and the forward search frame;
    converting the residual map into a second latent variable;
    quantizing the second latent variable to obtain a quantized second latent variable; and
    encoding the quantized second latent variable to obtain the compressed map.
  10. The method according to any one of claims 1 to 9, wherein the method further comprises:
    when the video packet needs to be decompressed, decompressing the compressed key frame and the compressed map in the video packet to obtain the key frame and the residual map;
    decompressing the compressed pixel kernel in the video packet to obtain the target pixel kernel;
    performing inter-frame offset processing, based on the target pixel kernel, on each pixel in the key frame and the neighboring pixels of that pixel in turn to obtain the predicted frame; and
    performing image compensation processing on the predicted frame based on the residual map to obtain a target predicted frame, the target predicted frame being a reconstructed video frame of the forward search frame.
  11. The method according to claim 10, wherein the residual map and the target pixel kernel are compressed by an intelligent compression model, and the compressed map and the compressed pixel kernel are decompressed by the intelligent compression model; the method further comprises:
    determining a distortion loss between the predicted frame and the forward search frame to obtain a first distortion loss;
    determining a distortion loss between the target predicted frame and the forward search frame to obtain a second distortion loss;
    determining bit-rate losses of the target pixel kernel and the residual map respectively to obtain a first bit-rate loss and a second bit-rate loss; and
    adjusting parameters of the intelligent compression model based on the first distortion loss, the second distortion loss, the first bit-rate loss, and the second bit-rate loss.
  12. A video decompression method, executed by a computer device, wherein the method comprises:
    decompressing a compressed key frame and a compressed map in a video packet to obtain a key frame and a residual map;
    decompressing a compressed pixel kernel in the video packet to obtain a target pixel kernel of each pixel in the key frame;
    performing inter-frame offset processing, based on the target pixel kernel, on each pixel in the key frame and the neighboring pixels of that pixel in turn to obtain a predicted frame; and
    performing image compensation processing on the predicted frame based on the residual map to obtain a target predicted frame, wherein the key frame and the target predicted frame are video frames of the video.
  13. The method according to claim 12, wherein decompressing the compressed pixel kernel in the video packet to obtain the target pixel kernel of each pixel in the key frame comprises:
    decoding the compressed pixel kernel in the video packet to obtain a quantized first latent variable; and
    converting the quantized first latent variable into the target pixel kernel of each pixel in the key frame.
  14. The method according to claim 12, wherein decompressing the compressed pixel kernel in the video packet to obtain the target pixel kernel of each pixel in the key frame comprises:
    decoding the compressed pixel kernel to obtain a word sequence;
    looking up a codebook; and
    converting the word sequence into the target pixel kernel of each pixel in the key frame based on the codebook.
  15. The method according to any one of claims 12 to 14, wherein decompressing the compressed key frame and the compressed map in the video packet to obtain the key frame and the residual map comprises:
    performing intra-frame decoding on the compressed key frame in the video packet to obtain the key frame; and
    decoding the compressed map in the video packet to obtain a quantized second latent variable, and converting the quantized second latent variable into the residual map.
  16. A video compression apparatus, wherein the apparatus comprises:
    a first generation module, configured to generate a pixel kernel for each pixel in a key frame of a video based on the key frame and a forward search frame of the video;
    a constraint processing module, configured to perform smoothing constraint processing on the pixel kernel to obtain a target pixel kernel;
    a first compression module, configured to compress the key frame and the target pixel kernel to obtain a compressed key frame and a compressed pixel kernel;
    a second compression module, configured to compress a residual map between the forward search frame and a predicted frame to obtain a compressed map, the predicted frame being a video frame generated based on the target pixel kernel and the key frame; and
    a second generation module, configured to obtain a compressed video packet according to the compressed map, the compressed key frame, and the compressed pixel kernel.
  17. A video decompression apparatus, wherein the apparatus comprises:
    a first decompression module, configured to decompress a compressed key frame and a compressed map in a video packet to obtain a key frame and a residual map;
    a second decompression module, configured to decompress a compressed pixel kernel in the video packet to obtain a target pixel kernel of each pixel in the key frame;
    an offset processing module, configured to perform inter-frame offset processing, based on the target pixel kernel, on each pixel in the key frame and the neighboring pixels of that pixel in turn to obtain a predicted frame; and
    a compensation module, configured to perform image compensation processing on the predicted frame based on the residual map to obtain a target predicted frame, wherein the key frame and the target predicted frame are video frames of the video.
  18. A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method according to any one of claims 1 to 15 when executing the computer program.
  19. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 15.
PCT/CN2023/124015 2022-11-18 2023-10-11 Video compression method, decompression method, apparatus, computer device and storage medium WO2024104014A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211446352.0 2022-11-18
CN202211446352.0A CN115499666B (zh) Video compression method, decompression method, apparatus, device and storage medium

Publications (1)

Publication Number Publication Date
WO2024104014A1 true WO2024104014A1 (zh) 2024-05-23

Family

ID=85116223

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/124015 WO2024104014A1 (zh) Video compression method, decompression method, apparatus, computer device and storage medium

Country Status (2)

Country Link
CN (1) CN115499666B (zh)
WO (1) WO2024104014A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115499666B (zh) * 2022-11-18 2023-03-24 Tencent Technology (Shenzhen) Company Limited Video compression method, decompression method, apparatus, device and storage medium


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109753975B (zh) * 2019-02-02 2021-03-09 杭州睿琪软件有限公司 Training sample obtaining method and apparatus, electronic device, and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070206672A1 (en) * 2004-06-14 2007-09-06 Shinichi Yamashita Motion Image Encoding And Decoding Method
CN103533356A (zh) * 2013-10-10 2014-01-22 深圳市智美达科技有限公司 Video data transmission method and apparatus
US20170244962A1 (en) * 2014-03-07 2017-08-24 Eagle Eye Networks Inc Adaptive Security Camera Image Compression Method of Operation
CN103974068A (zh) * 2014-05-07 2014-08-06 University of Electronic Science and Technology of China Content-based video size reduction method
WO2022184031A1 (zh) * 2021-03-04 2022-09-09 Huawei Technologies Co., Ltd. Video encoding and decoding method and apparatus
CN115499666A (zh) * 2022-11-18 2022-12-20 Tencent Technology (Shenzhen) Company Limited Video compression method, decompression method, apparatus, device and storage medium

Also Published As

Publication number Publication date
CN115499666B (zh) 2023-03-24
CN115499666A (zh) 2022-12-20

Similar Documents

Publication Publication Date Title
TWI680671B (zh) Picture file processing method, device and system, and storage medium
US20200145692A1 Video processing method and apparatus
JP2020010331A (ja) Method for improving image quality
US11102477B2 DC coefficient sign coding scheme
CN108921910B (zh) Method for restoring JPEG-coded compressed images based on a scalable convolutional neural network
CN110198492B (zh) Video watermarking method, apparatus, device, and storage medium
WO2024104014A1 (zh) Video compression method, decompression method, apparatus, computer device and storage medium
CN112887739A (zh) Electronic device, system, and control method thereof
US11677987B2 Joint termination of bidirectional data blocks for parallel coding
WO2019109955A1 (zh) Inter-frame prediction method, apparatus, and terminal device
TW202239209A (zh) Multi-scale optical flow for learned video compression
CN109151503B (zh) Picture file processing method and device
WO2020078211A1 (zh) Video encoding and decoding method and apparatus, computer device, and storage medium
WO2019091372A1 (zh) Image prediction method and apparatus
CN112887722B (zh) Lossless image compression method
CN112601095B (zh) Method and system for creating a fractional interpolation model for video luma and chroma
CN111080729B (zh) Method and system for constructing a training-image compression network based on the attention mechanism
WO2023225808A1 Learned image compression and decompression using long and short attention module
CN115866245A (zh) Video encoding method, apparatus, computer device, and storage medium
WO2022100173A1 (zh) Video frame compression and video frame decompression method and apparatus
CN113938687A (zh) Multi-reference inter-frame prediction method, system, device, and storage medium
Yang et al. Graph-convolution network for image compression
WO2023206420A1 (zh) Video encoding and decoding method, apparatus, device, system, and storage medium
WO2024078403A1 (zh) Image processing method, apparatus, and device
WO2023246655A1 (zh) Image encoding and decoding method, and encoding and decoding apparatus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23890451

Country of ref document: EP

Kind code of ref document: A1