CN113810695A - Video encoding method, apparatus and computer-readable storage medium

Info

Publication number
CN113810695A
Authority
CN (China)
Prior art keywords
key frame, current frame, shot
Legal status
Pending (assumed status; not a legal conclusion)
Application number
CN202010541739.9A
Other languages
Chinese (zh)
Inventors
王慧芬, 张园, 史敏锐
Current and original assignee
China Telecom Corp Ltd
Priority and filing date
2020-06-15
Publication date
2021-12-17
Application filed by China Telecom Corp Ltd; priority to CN202010541739.9A
Classifications

    • H04N 19/132: Sampling, masking or truncation of coding units, e.g. adaptive resampling, frame skipping, frame interpolation or high-frequency transform coefficient masking (under H04N 19/00 video coding; H04N 19/10 adaptive coding; H04N 19/102 characterised by the element, parameter or selection affected or controlled by the adaptive coding)
    • G06F 18/23: Clustering techniques (under G06F 18/00 pattern recognition; G06F 18/20 analysing)
    • G06T 9/002: Image coding using neural networks (under G06T 9/00 image coding)
    • H04N 19/136: Incoming video signal characteristics or properties (under H04N 19/134 adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding)
    • H04N 19/146: Data rate or code amount at the encoder output (under H04N 19/134)
    • H04N 19/42: Coding or decoding characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation


Abstract

The present disclosure provides a video encoding method, a video encoding apparatus, and a computer-readable storage medium, relating to the technical field of signal encoding. The video encoding method comprises: extracting key frames and their time indices from a video; extracting a feature vector for each key frame using a deep learning neural network; and encoding the feature vectors and time indices of the key frames. The method and apparatus reduce the overhead of video encoding and thereby improve its encoding efficiency.

Description

Video encoding method, apparatus and computer-readable storage medium
Technical Field
The present disclosure relates to the field of signal encoding technologies, and in particular, to a video encoding method and apparatus, and a computer-readable storage medium.
Background
As machine learning applications proliferate, intelligent platforms are increasingly adopted in fields such as the Internet of Vehicles, video surveillance, and smart cities, generating massive communication traffic between these platforms and large numbers of sensors.
As data volumes grow, the inefficiency of conventional video encoding methods becomes increasingly prominent, making it difficult to meet service requirements in terms of data latency and data scale. There is therefore a need for a video encoding technique oriented to intelligent platforms.
Disclosure of Invention
One technical problem addressed by the present disclosure is how to improve the efficiency of video encoding.
According to one aspect of the embodiments of the present disclosure, a video encoding method is provided, comprising: extracting key frames and their time indices from a video; extracting a feature vector for each key frame using a deep learning neural network; and encoding the feature vectors and time indices of the key frames.
In some embodiments, extracting the key frames from the video comprises: down-sampling the video at a preset sampling interval to obtain sample frames of the video, the preset sampling interval being inversely proportional to the bitrate of the video; and determining whether each sample frame is a key frame according to the RGB histogram distance between the sample frame and a reference sample frame.
In some embodiments, determining whether each sample frame is a key frame according to its RGB histogram distance from the reference sample frame comprises: calculating the RGB histogram distance between the current sample frame and the reference sample frame, the reference sample frame being initialized to the first sample frame; if the RGB histogram distance is not smaller than a first threshold, determining that the current sample frame is a key frame and taking it as the new reference sample frame, the first threshold being inversely proportional to the bitrate of the video; and if the RGB histogram distance is smaller than the first threshold, determining that the current sample frame is not a key frame.
In some embodiments, encoding the feature vectors and time indices of the key frames comprises: dividing the key frames into different shots according to the feature vectors, each shot containing at least one key frame; and encoding the feature vectors and time indices of the key frames within the same shot.
In some embodiments, dividing the key frames into different shots according to the feature vectors comprises: starting from the first key frame of the current shot, taking each key frame in turn as the current key frame and determining whether the current key frame belongs to the current shot; if the current key frame belongs to the current shot, taking the next key frame as the current key frame and continuing the determination; and if the current key frame does not belong to the current shot, grouping the key frames from the first key frame up to the key frame preceding the current key frame into a complete shot, taking the current key frame as the first key frame of the next shot, taking the next shot as the current shot, and repeating the above steps.
In some embodiments, determining whether the current key frame belongs to the current shot comprises: calculating the spatial distance between the feature vector of the current key frame and that of the first key frame as a first spatial distance; when the key frame preceding the current key frame is the first key frame, determining that the current key frame belongs to the current shot if the first spatial distance is smaller than a second threshold, and that it does not if the first spatial distance is not smaller than the second threshold; and when the key frame preceding the current key frame is not the first key frame, determining that the current key frame belongs to the current shot if the first spatial distance is smaller than a third threshold, and that it does not if the first spatial distance is not smaller than the third threshold; the third threshold being proportional to a second spatial distance, the second spatial distance being the average spatial distance between the first key frame and each key frame determined to belong to the current shot.
In some embodiments, encoding the feature vectors and time indices of the key frames within the same shot comprises: clustering the key frames of the same shot and taking the centroid obtained by the clustering as a representative frame; calculating, for each key frame in the shot, the feature vector difference between its feature vector and that of the representative frame; and encoding the feature vector differences and time indices of the key frames in ascending order of the norms of the differences, the number of key frames encoded within the same shot not exceeding a fourth threshold.
According to another aspect of the embodiments of the present disclosure, a video encoding apparatus is provided, comprising: a data extraction module configured to extract key frames and their time indices from a video; a feature vector extraction module configured to extract a feature vector for each key frame using a deep learning neural network; and an information encoding module configured to encode the feature vectors and time indices of the key frames.
In some embodiments, the data extraction module is configured to: down-sample the video at a preset sampling interval to obtain sample frames of the video, the preset sampling interval being inversely proportional to the bitrate of the video; and determine whether each sample frame is a key frame according to the RGB histogram distance between the sample frame and a reference sample frame.
In some embodiments, the data extraction module is configured to: calculate the RGB histogram distance between the current sample frame and the reference sample frame, the reference sample frame being initialized to the first sample frame; if the RGB histogram distance is not smaller than a first threshold, determine that the current sample frame is a key frame and take it as the new reference sample frame, the first threshold being inversely proportional to the bitrate of the video; and if the RGB histogram distance is smaller than the first threshold, determine that the current sample frame is not a key frame.
In some embodiments, the information encoding module is configured to: divide the key frames into different shots according to the feature vectors, each shot containing at least one key frame; and encode the feature vectors and time indices of the key frames within the same shot.
In some embodiments, the information encoding module is configured to: starting from the first key frame of the current shot, take each key frame in turn as the current key frame and determine whether the current key frame belongs to the current shot; if it does, take the next key frame as the current key frame and continue the determination; and if it does not, group the key frames from the first key frame up to the key frame preceding the current key frame into a complete shot, take the current key frame as the first key frame of the next shot, take the next shot as the current shot, and repeat the above steps.
In some embodiments, the information encoding module is configured to: calculate the spatial distance between the feature vector of the current key frame and that of the first key frame as a first spatial distance; when the key frame preceding the current key frame is the first key frame, determine that the current key frame belongs to the current shot if the first spatial distance is smaller than a second threshold, and that it does not if the first spatial distance is not smaller than the second threshold; and when the key frame preceding the current key frame is not the first key frame, determine that the current key frame belongs to the current shot if the first spatial distance is smaller than a third threshold, and that it does not if the first spatial distance is not smaller than the third threshold; the third threshold being proportional to a second spatial distance, the second spatial distance being the average spatial distance between the first key frame and each key frame determined to belong to the current shot.
In some embodiments, the information encoding module is configured to: cluster the key frames of the same shot and take the centroid obtained by the clustering as a representative frame; calculate, for each key frame in the shot, the feature vector difference between its feature vector and that of the representative frame; and encode the feature vector differences and time indices of the key frames in ascending order of the norms of the differences, with the number of key frames encoded within the same shot not exceeding a fourth threshold.
According to yet another aspect of the embodiments of the present disclosure, a video encoding apparatus is provided, comprising: a memory; and a processor coupled to the memory, the processor configured to perform the aforementioned video encoding method based on instructions stored in the memory.
According to still another aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, storing computer instructions which, when executed by a processor, implement the aforementioned video encoding method.
The disclosed video encoding method and apparatus reduce the overhead of video encoding and thereby improve its encoding efficiency.
Other features of the present disclosure and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
To illustrate the embodiments of the present disclosure or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present disclosure; those skilled in the art can derive other drawings from them without inventive effort.
Fig. 1 shows a flow diagram of a video encoding method of some embodiments of the present disclosure.
Fig. 2 shows a schematic illustration of an application scenario of an embodiment.
Fig. 3 shows a flow chart of dividing the key frames into different shots.
Fig. 4 shows a flow chart of encoding the feature vectors and time indices of the key frames within the same shot.
Fig. 5 shows a schematic structural diagram of a video encoding apparatus according to some embodiments of the present disclosure.
Fig. 6 is a schematic structural diagram of a video encoding apparatus according to further embodiments of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present disclosure. The following description of at least one exemplary embodiment is merely illustrative and in no way limits the disclosure, its application, or its uses. All other embodiments derived by those skilled in the art from the disclosed embodiments without inventive effort fall within the protection scope of the present disclosure.
Some embodiments of the disclosed video encoding methods are first described in conjunction with fig. 1.
Fig. 1 shows a flow diagram of a video encoding method of some embodiments of the present disclosure. As shown in fig. 1, the present embodiment includes steps S101 to S103.
In step S101, each key frame and the time index of each key frame are extracted from the video.
To extract the key frames and their time indices, the video may first be down-sampled at a preset sampling interval to obtain its sample frames. The preset sampling interval may be set inversely proportional to the bitrate of the video; an example correspondence between sampling interval and bitrate is shown in Table 1. For instance, at a bitrate of 256 kb/s the video is down-sampled at a ratio of 4:1. Those skilled in the art will appreciate that the current bitrate of the video may be detected before down-sampling to determine the corresponding preset sampling interval.
TABLE 1

Bitrate (kb/s)    Sampling interval
16                8
64                6
256               4
512               2
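
As a concrete illustration, the table lookup and the down-sampling step might be sketched in Python as follows. The use of OpenCV for frame iteration and the fallback interval for bitrates below the table range are assumptions of this sketch, not details given in the text.

    import cv2

    # Mapping from video bitrate (kb/s) to sampling interval, per Table 1.
    SAMPLING_INTERVALS = {16: 8, 64: 6, 256: 4, 512: 2}

    def sampling_interval(bitrate_kbps):
        # Use the interval of the largest table bitrate not exceeding the input;
        # falling back to the densest interval below 16 kb/s is an assumption.
        eligible = [b for b in SAMPLING_INTERVALS if b <= bitrate_kbps]
        return SAMPLING_INTERVALS[max(eligible)] if eligible else 8

    def downsample(video_path, bitrate_kbps):
        # Yield (time_index, frame) pairs retained after down-sampling.
        interval = sampling_interval(bitrate_kbps)
        cap = cv2.VideoCapture(video_path)
        index = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if index % interval == 0:  # keep one frame per sampling interval
                yield index, frame
            index += 1
        cap.release()

With a 256 kb/s stream this keeps every fourth frame, matching the 4:1 ratio described above.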
Next, whether each sample frame is a key frame is determined according to the RGB histogram distance between the sample frame and a reference sample frame. The RGB histogram distance may be computed concretely as the root mean square value over the red, green, and blue channel values.
For example, the RGB histogram distance between the current sample frame and the reference sample frame is calculated, the reference sample frame being initialized to the first sample frame. If the RGB histogram distance is not smaller than a first threshold, the current sample frame is determined to be a key frame and becomes the new reference sample frame; if the distance is smaller than the first threshold, the current sample frame is not a key frame and is not processed further. These steps are repeated, taking each sample frame in turn as the current sample frame starting from the second sample frame. The first threshold is inversely proportional to the bitrate of the video; an example correspondence between the first threshold and the bitrate is shown in Table 2. For instance, at a bitrate of 256 kb/s the first threshold is 0.4.
TABLE 2

Bitrate (kb/s)    First threshold
16                0.6
64                0.5
256               0.4
512               0.3
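
A minimal sketch of this key frame decision loop follows. Interpreting the RGB histogram distance as the root mean square difference of normalized per-channel histograms is one reading of the text; the bin count and the treatment of the first sample frame as a key frame are assumptions of this sketch.

    import numpy as np

    FIRST_THRESHOLDS = {16: 0.6, 64: 0.5, 256: 0.4, 512: 0.3}  # Table 2

    def rgb_histogram(frame, bins=16):
        # Concatenated, normalized histograms of the R, G and B channels.
        hists = [np.histogram(frame[..., c], bins=bins, range=(0, 256))[0]
                 for c in range(3)]
        h = np.concatenate(hists).astype(np.float64)
        return h / h.sum()

    def extract_key_frames(sample_frames, first_threshold):
        # sample_frames: iterable of (time_index, frame) pairs.
        key_frames = []
        ref_hist = None
        for t, frame in sample_frames:
            h = rgb_histogram(frame)
            if ref_hist is None:
                ref_hist = h                   # first sample frame is the initial reference
                key_frames.append((t, frame))  # assumption: also treat it as a key frame
                continue
            distance = np.sqrt(np.mean((h - ref_hist) ** 2))  # RMS histogram distance
            if distance >= first_threshold:    # not smaller than the first threshold
                key_frames.append((t, frame))
                ref_hist = h                   # key frame becomes the new reference
        return key_frames

Chaining the two steps for a 256 kb/s stream would then read extract_key_frames(downsample(path, 256), FIRST_THRESHOLDS[256]).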
In step S102, a feature vector is extracted for each key frame using a deep learning neural network.
The deep learning neural network may be a backbone network such as VGG16 or ResNet50, or a detection network such as SSD (Single Shot MultiBox Detector).
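
As an illustration, feature extraction with a ResNet50 backbone, one of the networks named above, might be sketched as follows; torchvision and the 2048-dimensional pooled output are implementation assumptions, since the patent does not prescribe a framework.

    import torch
    from torchvision import models, transforms

    backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    backbone.fc = torch.nn.Identity()   # drop the classifier; keep the pooled feature
    backbone.eval()

    preprocess = transforms.Compose([
        transforms.ToPILImage(),
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    @torch.no_grad()
    def feature_vector(frame):
        # frame: H x W x 3 RGB uint8 array (convert from BGR first if it came
        # from OpenCV); returns a 2048-d feature vector as a numpy array.
        x = preprocess(frame).unsqueeze(0)
        return backbone(x).squeeze(0).numpy()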
In step S103, the feature vectors and time indices of the key frames are encoded.
For example, the key frames may be divided into different shots according to their feature vectors, each shot containing at least one key frame. The feature vectors and time indices of the key frames within the same shot are then encoded.
If the video is partitioned into blocks and each block is assigned its own set of encoding parameters, the final combined video can be optimized. Such partitioning requires grouping mutually similar video frames into sequences, so that the encoding parameters within a sequence are relatively similar; these sequences may also be called shot sequences. A shot is typically a video segment of relatively short duration, usually captured by the same camera under relatively constant lighting and environmental conditions, and containing the same or similar visual content. Conventional algorithms for finding natural shot boundaries examine statistics such as the amount of difference between the pixels of consecutive frames: when that difference exceeds a fixed or dynamically adjusted threshold, a new shot boundary is declared. The final output of such an algorithm is a list of shots and their timestamps. These shots can then serve as the basic video encoding blocks instead of fixed-length blocks. In step S103, rather than examining pixel differences between consecutive frames, shots are divided according to the differences between the feature vectors of consecutive key frames.
Fig. 2 is a schematic diagram of an application scenario of this embodiment. As shown in Fig. 2, feature vector extraction and compression are performed at the sensor side. The compressed feature vectors are then transmitted to the intelligent platform, which decodes them and uses the decoded feature vectors for machine-vision task analysis.
The inventors observe that the key to machine learning is feature extraction: building, training, and running a deep learning neural network all depend on the feature information extracted from the raw input. In view of this, the video encoding method of this embodiment is oriented to intelligent platforms: instead of encoding, transmitting, and decoding video frames meant for human vision, it encodes, transmits, and decodes the feature vectors of the key frames of the video. On the premise of satisfying the intelligent platform's machine learning tasks, this reduces the overhead of video encoding, lowers its encoding latency, and improves its encoding efficiency.
This embodiment also introduces adaptive streaming: it senses network quality and adaptively adjusts the sampling interval and the related thresholds according to the bitrate, thereby dynamically adjusting the encoding rate of the video. This enables smooth video encoding while further reducing encoding overhead and improving encoding efficiency.
How to divide each key frame into different shots is described below in conjunction with fig. 3.
Fig. 3 shows a flow chart of dividing the key frames into different shots. As shown in Fig. 3, this embodiment includes steps S301 to S303.
In step S301, starting from the first key frame of the current shot, each key frame is taken in turn as the current key frame, and whether the current key frame belongs to the current shot is determined. If it does, step S302 is performed; if it does not, step S303 is performed.
Whether the current key frame belongs to the current shot is determined as follows.
First, the spatial distance between the feature vector of the current key frame and that of the first key frame is calculated as the first spatial distance.
When the key frame preceding the current key frame is the first key frame: if the first spatial distance is smaller than the second threshold, the current key frame belongs to the current shot; if the first spatial distance is not smaller than the second threshold, it does not. The second threshold may be, for example, 0.0005.
When the key frame preceding the current key frame is not the first key frame: if the first spatial distance is smaller than the third threshold, the current key frame belongs to the current shot; if the first spatial distance is not smaller than the third threshold, it does not. The third threshold is proportional to the second spatial distance, i.e., the average spatial distance between the first key frame and each key frame already determined to belong to the current shot. For example, the third threshold may be the product of the second spatial distance and a preset weight k, where k = 1 + seg_grad_t and seg_grad_t is an adaptive parameter that may, for example, be set to 0.3.
In step S302, the next key frame is taken as the current key frame, and the determination of whether the current key frame belongs to the current shot continues.
In step S303, the key frames from the first key frame up to the key frame preceding the current key frame are grouped into a complete shot, the current key frame becomes the first key frame of the next shot, and the next shot becomes the current shot.
Those skilled in the art will understand that after step S302 or S303 completes, the process returns to step S301; that is, steps S301 to S303 are repeated.
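
Under the example parameter values above, the segmentation loop of Fig. 3 might be sketched as follows; reading the "spatial distance" as the Euclidean norm of the feature vector difference is an assumption of this sketch.

    import numpy as np

    SECOND_THRESHOLD = 0.0005   # example value from the text
    SEG_GRAD_T = 0.3            # example value of the adaptive parameter
    K = 1 + SEG_GRAD_T          # weight defining the third threshold

    def split_into_shots(features):
        # features: list of 1-D feature vectors, one per key frame in time order.
        # Returns a list of shots, each a list of key frame indices.
        shots = []
        first = 0          # index of the current shot's first key frame
        accepted = []      # distances of key frames accepted into the current shot
        i = 1
        while i < len(features):
            d = np.linalg.norm(features[i] - features[first])  # first spatial distance
            if not accepted:   # the preceding key frame is the first key frame
                belongs = d < SECOND_THRESHOLD
            else:              # third threshold: K times the mean accepted distance
                belongs = d < K * np.mean(accepted)
            if belongs:        # S302: continue with the next key frame
                accepted.append(d)
            else:              # S303: close the current shot and start a new one here
                shots.append(list(range(first, i)))
                first, accepted = i, []
            i += 1
        shots.append(list(range(first, len(features))))  # flush the final shot
        return shots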
How the feature vectors and time indices of the key frames within the same shot are encoded is described below in conjunction with Fig. 4.
Fig. 4 shows a flow chart of encoding the feature vectors and time indices of the key frames within the same shot. As shown in Fig. 4, this embodiment includes steps S401 to S403.
In step S401, the key frames of the same shot are clustered, and the centroid obtained from the clustering is taken as the representative frame.
For example, a clustering algorithm such as K-means may be used to obtain the representative frame.
In step S402, for each key frame in the shot, the feature vector difference between its feature vector and that of the representative frame is calculated.
In step S403, the feature vector differences and time indices of the key frames in the shot are encoded in ascending order of the norms of the differences, with the number of key frames encoded within the same shot not exceeding a fourth threshold.
The encoding itself may use a scheme such as Huffman coding. Those skilled in the art will appreciate that there is an upper limit on the number of key frames encoded within one shot. When the number of key frames exceeds this upper limit (for example, 10), the feature vectors and time indices of the excess key frames are discarded, and only those within the limit are retained. Since the feature vector differences are sorted in ascending order of their norms before encoding, the discarded key frames are exactly those whose feature vectors differ most from that of the representative frame.
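
A minimal sketch of steps S401 to S403 follows. Running K-means with a single cluster, whose centroid is simply the mean feature vector, and using 10 as the fourth threshold are assumptions drawn from the examples in the text; the final Huffman coding step is elided.

    import numpy as np

    FOURTH_THRESHOLD = 10   # example upper limit on key frames encoded per shot

    def encode_shot(shot_features, shot_time_indices):
        # Returns the representative (centroid) vector and the list of
        # (time_index, feature vector difference) pairs selected for encoding.
        X = np.stack(shot_features)
        centroid = X.mean(axis=0)         # k-means with k=1 reduces to the mean
        diffs = X - centroid              # S402: feature vector differences
        order = np.argsort(np.linalg.norm(diffs, axis=1))  # ascending by norm
        kept = order[:FOURTH_THRESHOLD]   # discard key frames beyond the upper limit
        payload = [(shot_time_indices[i], diffs[i]) for i in kept]
        # S403 would next entropy-code the payload, e.g. with Huffman coding.
        return centroid, payload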
The feature-vector-based shot detection strategy of the embodiment of Fig. 3 detects key frames belonging to the same shot more efficiently, and the shot-based video encoding strategy of the embodiment of Fig. 4 encodes similar feature vectors with similar encoding parameters. Together, these two strategies further reduce encoding overhead and improve encoding efficiency, so as to meet the ever-growing connection scale of intelligent platforms.
Some embodiments of the disclosed video encoding apparatus are described below in conjunction with fig. 5.
Fig. 5 shows a schematic structural diagram of a video encoding apparatus of some embodiments of the present disclosure. As shown in Fig. 5, the video encoding apparatus 50 of this embodiment includes: a data extraction module 501 configured to extract key frames and their time indices from a video; a feature vector extraction module 502 configured to extract a feature vector for each key frame using a deep learning neural network; and an information encoding module 503 configured to encode the feature vectors and time indices of the key frames.
The video encoding apparatus of this embodiment is oriented to intelligent platforms: instead of encoding, transmitting, and decoding video frames meant for human vision, it encodes, transmits, and decodes the feature vectors of the key frames of the video. On the premise of satisfying the intelligent platform's machine learning tasks, this reduces the overhead of video encoding, lowers its encoding latency, and improves its encoding efficiency.
In some embodiments, the data extraction module 501 is configured to: down-sample the video at a preset sampling interval to obtain sample frames of the video, the preset sampling interval being inversely proportional to the bitrate of the video; and determine whether each sample frame is a key frame according to the RGB histogram distance between the sample frame and a reference sample frame.
In some embodiments, the data extraction module 501 is configured to: calculate the RGB histogram distance between the current sample frame and the reference sample frame, the reference sample frame being initialized to the first sample frame; if the RGB histogram distance is not smaller than a first threshold, determine that the current sample frame is a key frame and take it as the new reference sample frame, the first threshold being inversely proportional to the bitrate of the video; and if the RGB histogram distance is smaller than the first threshold, determine that the current sample frame is not a key frame.
These embodiments introduce adaptive streaming: the apparatus senses network quality and adaptively adjusts the sampling interval and the related thresholds according to the bitrate, thereby dynamically adjusting the encoding rate of the video. This enables smooth video encoding while further reducing encoding overhead and improving encoding efficiency.
In some embodiments, the information encoding module 503 is configured to: divide the key frames into different shots according to the feature vectors, each shot containing at least one key frame; and encode the feature vectors and time indices of the key frames within the same shot.
In some embodiments, the information encoding module 503 is configured to: starting from the first key frame of the current shot, take each key frame in turn as the current key frame and determine whether the current key frame belongs to the current shot; if it does, take the next key frame as the current key frame and continue the determination; and if it does not, group the key frames from the first key frame up to the key frame preceding the current key frame into a complete shot, take the current key frame as the first key frame of the next shot, take the next shot as the current shot, and repeat the above steps.
In some embodiments, the information encoding module 503 is configured to: calculate the spatial distance between the feature vector of the current key frame and that of the first key frame as a first spatial distance; when the key frame preceding the current key frame is the first key frame, determine that the current key frame belongs to the current shot if the first spatial distance is smaller than a second threshold, and that it does not if the first spatial distance is not smaller than the second threshold; and when the key frame preceding the current key frame is not the first key frame, determine that the current key frame belongs to the current shot if the first spatial distance is smaller than a third threshold, and that it does not if the first spatial distance is not smaller than the third threshold; the third threshold being proportional to a second spatial distance, the second spatial distance being the average spatial distance between the first key frame and each key frame determined to belong to the current shot.
In some embodiments, the information encoding module 503 is configured to: cluster the key frames of the same shot and take the centroid obtained by the clustering as a representative frame; calculate, for each key frame in the shot, the feature vector difference between its feature vector and that of the representative frame; and encode the feature vector differences and time indices of the key frames in ascending order of the norms of the differences, with the number of key frames encoded within the same shot not exceeding a fourth threshold.
The feature-vector-based shot detection strategy detects key frames belonging to the same shot more efficiently, and the shot-based video encoding strategy encodes similar feature vectors with similar encoding parameters. Together they further reduce encoding overhead and improve encoding efficiency, so as to meet the ever-growing connection scale of intelligent platforms.
Further embodiments of the video encoding apparatus of the present disclosure are described below in conjunction with fig. 6.
Fig. 6 is a schematic structural diagram of a video encoding apparatus according to further embodiments of the present disclosure. As shown in fig. 6, the video encoding device 60 of this embodiment includes: a memory 610 and a processor 620 coupled to the memory 610, the processor 620 being configured to perform the video encoding method in any of the foregoing embodiments based on instructions stored in the memory 610.
The memory 610 may include, for example, system memory and fixed non-volatile storage media. The system memory stores, for example, an operating system, application programs, a boot loader, and other programs.
The video encoding apparatus 60 may further include an input/output interface 630, a network interface 640, a storage interface 650, and so on. These interfaces 630, 640, 650, the memory 610, and the processor 620 may be connected, for example, through a bus 660. The input/output interface 630 provides a connection interface for input/output devices such as a display, a mouse, a keyboard, or a touch screen. The network interface 640 provides a connection interface for various networked devices. The storage interface 650 provides a connection interface for external storage devices such as SD cards and USB drives.
The present disclosure also includes a computer readable storage medium having stored thereon computer instructions that, when executed by a processor, implement a video encoding method in any of the foregoing embodiments.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only exemplary of the present disclosure and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present disclosure shall fall within its protection scope.

Claims (16)

1. A video encoding method, comprising:
extracting key frames and their time indices from a video;
extracting a feature vector for each key frame using a deep learning neural network; and
encoding the feature vectors and time indices of the key frames.
2. The video encoding method of claim 1, wherein extracting the key frames from the video comprises:
down-sampling the video at a preset sampling interval to obtain sample frames of the video, the preset sampling interval being inversely proportional to the bitrate of the video; and
determining whether each sample frame is a key frame according to the RGB histogram distance between the sample frame and a reference sample frame.
3. The video encoding method of claim 2, wherein determining whether each sample frame is a key frame according to its RGB histogram distance from the reference sample frame comprises:
calculating the RGB histogram distance between the current sample frame and the reference sample frame, the reference sample frame being initialized to the first sample frame;
if the RGB histogram distance is not smaller than a first threshold, determining that the current sample frame is a key frame and taking it as the new reference sample frame, the first threshold being inversely proportional to the bitrate of the video; and
if the RGB histogram distance is smaller than the first threshold, determining that the current sample frame is not a key frame.
4. The video encoding method of claim 1, wherein encoding the feature vectors and time indices of the key frames comprises:
dividing the key frames into different shots according to the feature vectors, each shot containing at least one key frame; and
encoding the feature vectors and time indices of the key frames within the same shot.
5. The video encoding method of claim 4, wherein dividing the key frames into different shots according to the feature vectors comprises:
starting from the first key frame of the current shot, taking each key frame in turn as the current key frame and determining whether the current key frame belongs to the current shot;
if the current key frame belongs to the current shot, taking the next key frame as the current key frame and continuing the determination; and
if the current key frame does not belong to the current shot, grouping the key frames from the first key frame up to the key frame preceding the current key frame into a complete shot, taking the current key frame as the first key frame of the next shot, taking the next shot as the current shot, and repeating the above steps.
6. The video encoding method of claim 5, wherein determining whether the current key frame belongs to the current shot comprises:
calculating the spatial distance between the feature vector of the current key frame and that of the first key frame as a first spatial distance;
when the key frame preceding the current key frame is the first key frame, determining that the current key frame belongs to the current shot if the first spatial distance is smaller than a second threshold, and that it does not if the first spatial distance is not smaller than the second threshold; and
when the key frame preceding the current key frame is not the first key frame, determining that the current key frame belongs to the current shot if the first spatial distance is smaller than a third threshold, and that it does not if the first spatial distance is not smaller than the third threshold;
wherein the third threshold is proportional to a second spatial distance, the second spatial distance being the average spatial distance between the first key frame and each key frame determined to belong to the current shot.
7. The video encoding method of claim 4, wherein encoding the feature vectors and time indices of the key frames within the same shot comprises:
clustering the key frames of the same shot and taking the centroid obtained by the clustering as a representative frame;
calculating, for each key frame in the shot, the feature vector difference between its feature vector and that of the representative frame; and
encoding the feature vector differences and time indices of the key frames in ascending order of the norms of the differences, the number of key frames encoded within the same shot not exceeding a fourth threshold.
8. A video encoding apparatus, comprising:
a data extraction module configured to extract key frames and their time indices from a video;
a feature vector extraction module configured to extract a feature vector for each key frame using a deep learning neural network; and
an information encoding module configured to encode the feature vectors and time indices of the key frames.
9. The video encoding apparatus of claim 8, wherein the data extraction module is configured to:
down-sample the video at a preset sampling interval to obtain sample frames of the video, the preset sampling interval being inversely proportional to the bitrate of the video; and
determine whether each sample frame is a key frame according to the RGB histogram distance between the sample frame and a reference sample frame.
10. The video encoding apparatus of claim 9, wherein the data extraction module is configured to:
calculate the RGB histogram distance between the current sample frame and the reference sample frame, the reference sample frame being initialized to the first sample frame;
if the RGB histogram distance is not smaller than a first threshold, determine that the current sample frame is a key frame and take it as the new reference sample frame, the first threshold being inversely proportional to the bitrate of the video; and
if the RGB histogram distance is smaller than the first threshold, determine that the current sample frame is not a key frame.
11. The video encoding apparatus of claim 8, wherein the information encoding module is configured to:
divide the key frames into different shots according to the feature vectors, each shot containing at least one key frame; and
encode the feature vectors and time indices of the key frames within the same shot.
12. The video encoding apparatus of claim 11, wherein the information encoding module is configured to:
starting from the first key frame of the current shot, take each key frame in turn as the current key frame and determine whether the current key frame belongs to the current shot;
if the current key frame belongs to the current shot, take the next key frame as the current key frame and continue the determination; and
if the current key frame does not belong to the current shot, group the key frames from the first key frame up to the key frame preceding the current key frame into a complete shot, take the current key frame as the first key frame of the next shot, take the next shot as the current shot, and repeat the above steps.
13. The video encoding apparatus of claim 12, wherein the information encoding module is configured to:
calculate the spatial distance between the feature vector of the current key frame and that of the first key frame as a first spatial distance;
when the key frame preceding the current key frame is the first key frame, determine that the current key frame belongs to the current shot if the first spatial distance is smaller than a second threshold, and that it does not if the first spatial distance is not smaller than the second threshold; and
when the key frame preceding the current key frame is not the first key frame, determine that the current key frame belongs to the current shot if the first spatial distance is smaller than a third threshold, and that it does not if the first spatial distance is not smaller than the third threshold;
wherein the third threshold is proportional to a second spatial distance, the second spatial distance being the average spatial distance between the first key frame and each key frame determined to belong to the current shot.
14. The video encoding apparatus of claim 11, wherein the information encoding module is configured to:
cluster the key frames of the same shot and take the centroid obtained by the clustering as a representative frame;
calculate, for each key frame in the shot, the feature vector difference between its feature vector and that of the representative frame; and
encode the feature vector differences and time indices of the key frames in ascending order of the norms of the differences, the number of key frames encoded within the same shot not exceeding a fourth threshold.
15. A video encoding apparatus, comprising:
a memory; and
a processor coupled to the memory, the processor configured to perform the video encoding method of any one of claims 1 to 7 based on instructions stored in the memory.
16. A computer-readable storage medium storing computer instructions which, when executed by a processor, implement the video encoding method of any one of claims 1 to 7.
CN202010541739.9A (filed 2020-06-15, priority 2020-06-15): Video encoding method, apparatus and computer-readable storage medium. Status: Pending.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202010541739.9A | 2020-06-15 | 2020-06-15 | Video encoding method, apparatus and computer-readable storage medium

Publications (1)

Publication Number | Publication Date | Title
CN113810695A | 2021-12-17 | Video encoding method, apparatus and computer-readable storage medium

Family ID: 78892437
Country status: CN (1) CN113810695A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN102292979A * | 2009-01-23 | 2011-12-21 | 日本电气株式会社 | Device for generating video descriptor
US20120087583A1 * | 2010-10-06 | 2012-04-12 | Futurewei Technologies, Inc. | Video Signature Based on Image Hashing and Shot Detection
CN102427529A * | 2011-09-30 | 2012-04-25 | 北京暴风科技股份有限公司 | Video coding and compressing method
RU2016102514A * | 2016-01-27 | 2017-08-01 | Акционерное общество "Творческо-производственное объединение "Центральная киностудия детских и юношеских фильмов им. М. Горького" | Device for semantic classification and search in archives of digitized film materials
CN108540833A * | 2018-04-16 | 2018-09-14 | 北京交通大学 | Shot-based television advertisement recognition method
CN109151501A * | 2018-10-09 | 2019-01-04 | 北京周同科技有限公司 | Video key frame extraction method, apparatus, terminal device and storage medium
CN110427517A * | 2019-07-18 | 2019-11-08 | 华戎信息产业有限公司 | Scene-dictionary-tree-based image-to-video retrieval method, apparatus and computer-readable storage medium

Cited By (2)

Publication number | Priority date | Publication date | Assignee | Title
CN115379236A * | 2022-08-26 | 2022-11-22 | 重庆紫光华山智安科技有限公司 | Video processing method, device, medium and equipment
CN115379236B * | 2022-08-26 | 2024-04-09 | 重庆紫光华山智安科技有限公司 | Video processing method, device, medium and equipment


Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination