CN113658045A - Video processing method and device

Info

Publication number
CN113658045A
Authority
CN
China
Prior art keywords
video frame
region
current video
current
spatial attention
Prior art date
Legal status
Pending
Application number
CN202110933328.9A
Other languages
Chinese (zh)
Inventor
磯部駿
陶鑫
章佳杰
戴宇荣
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202110933328.9A
Publication of CN113658045A


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4053Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4038Image mosaicing, e.g. composing plane images from plane sub-images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2200/00Indexing scheme for image data processing or generation, in general
    • G06T2200/32Indexing scheme for image data processing or generation, in general involving image mosaicing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Processing (AREA)

Abstract

The disclosure relates to a video processing method and device. The video processing method comprises the following steps: acquiring all video frames of a video; and, for each of the video frames, performing the following: calculating a change image between a current video frame and an adjacent video frame, wherein the adjacent video frame is a video frame adjacent to the current video frame; calculating spatial attention information between the current video frame and the adjacent video frame based on the change image; obtaining features for super resolution of the current video frame based on the spatial attention information between the current video frame and the adjacent video frame; and generating a high resolution video frame of the current video frame based on the features for super resolution. According to the video processing method and device of the disclosure, attention to complementary regions can be increased and attention to redundant regions reduced based on the spatial attention information, thereby improving super-resolution performance.

Description

Video processing method and device
Technical Field
The present disclosure relates to the field of video technology. More particularly, the present disclosure relates to a video processing method and apparatus.
Background
Super-resolution is a classic task in computer vision: detail information lost in going from low resolution to high resolution is filled in by means of image prior knowledge and image self-similarity, so that a corresponding high-resolution image is generated. Thanks to deep learning, convolutional neural networks establish a mapping from low resolution to high resolution by extracting low-level texture information, and achieve good performance on the super-resolution task. In recent years, with the development of video technology, video super-resolution has become a hot research topic. Compared with image super-resolution, video contains rich temporal complementary information and can cope more robustly with external factors such as occlusion and illumination.
Disclosure of Invention
An object of exemplary embodiments of the present disclosure is to provide a video processing method and apparatus that address at least the problems of video processing in the related art; they are not, however, required to solve any of those problems.
According to an exemplary embodiment of the present disclosure, there is provided a video processing method including: acquiring all video frames of a video; and, for each of the video frames, performing the following: calculating a change image between a current video frame and an adjacent video frame, wherein the adjacent video frame is a video frame adjacent to the current video frame; calculating spatial attention information between the current video frame and the adjacent video frame based on the change image, wherein the spatial attention information represents information of a region of interest of the current video frame with respect to the adjacent video frame; obtaining features for super resolution of the current video frame based on the spatial attention information between the current video frame and the adjacent video frame; and generating a high resolution video frame of the current video frame based on the features for super resolution.
Optionally, the neighboring video frame comprises a next video frame of the current video frame, wherein the next video frame is a next video frame neighboring the current video frame. The step of calculating a change image between the current video frame and the neighboring video frame may comprise: a change image between the current video frame and the next video frame is calculated. The step of calculating spatial attention information between a current video frame and the neighboring video frame based on the change image may comprise: spatial attention information between a current video frame and a next video frame is calculated based on a change image between the current video frame and the next video frame.
Optionally, the step of calculating spatial attention information between the current video frame and the next video frame based on the change image between the current video frame and the next video frame comprises: dividing the current video frame into a first region and a second region based on the degree of change of each pixel in the change image between the current video frame and the next video frame; determining, in the change image between the current video frame and the next video frame, a first region of the change image corresponding to the first region and a second region of the change image corresponding to the second region, and determining, in the next video frame, a first region of the next video frame corresponding to the first region and a second region of the next video frame corresponding to the second region; calculating spatial attention information between the first region of the current video frame and the first region of the next video frame based on the change image first region; and calculating spatial attention information between the second region of the current video frame and the second region of the next video frame based on the change image second region, wherein the spatial attention information between the current video frame and the next video frame comprises the spatial attention information between the first region of the current video frame and the first region of the next video frame and the spatial attention information between the second region of the current video frame and the second region of the next video frame.
Optionally, the step of calculating spatial attention information between the first region of the current video frame and the first region of the next video frame based on the changed image first region may include: performing point multiplication on the first area of the change image and the first area of the current video frame to obtain a first point multiplication result; performing spatial gating convolution operation on the first area of the change image and the first point multiplication result to obtain a first convolution result; spatial attention information between a first region of the current video frame and a first region of a next video frame is determined based on the first convolution result, wherein the first region of the next video frame is a region of the next video frame corresponding to the first region of the current video frame.
Optionally, the step of determining spatial attention information between the first region of the current video frame and the first region of the next video frame based on the first convolution result may comprise: performing linear rectification on the first convolution result to obtain a first rectification result; the first rectification result is taken as first spatial attention information between a first region of the current video frame and a first region of a next video frame. The step of linearly rectifying the first convolution result may include: and setting the value less than zero in the first convolution result to be equal to zero, and keeping the value more than zero in the first convolution result unchanged.
Optionally, the step of calculating spatial attention information between the second region of the current video frame and the second region of the next video frame based on the changed image second region may include: performing negation operation on the second area of the change image to obtain a negation image of the second area of the change image; performing dot multiplication on the negation image and a second area of the current video frame to obtain a second dot multiplication result; performing spatial gating convolution operation on the negation image and the second dot product result to obtain a second convolution result; spatial attention information between a second region of the current video frame and a second region of a next video frame is determined based on the second convolution result, wherein the second region of the next video frame is a region of the next video frame corresponding to the second region of the current video frame.
Optionally, the step of determining spatial attention information between the second region of the current video frame and the second region of the next video frame based on the second convolution result may include: performing linear rectification on the second convolution result to obtain a second rectification result; and taking the second rectification result as the spatial attention information between the second area of the current video frame and the second area of the next video frame. The step of linearly rectifying the second convolution result may include: and setting the value less than zero in the second convolution result to be equal to zero, and keeping the value more than zero in the second convolution result unchanged.
Optionally, the adjacent video frame may include a previous video frame of the current video frame, wherein the previous video frame is a previous video frame adjacent to the current video frame. The step of calculating a change image between the current video frame and the neighboring video frame may comprise: and calculating a change image between the current video frame and the last video frame. The step of calculating spatial attention information between a current video frame and the neighboring video frame based on the change image may comprise: spatial attention information between a current video frame and a previous video frame is calculated based on a changed image between the current video frame and the previous video frame.
Optionally, the neighboring video frames may include a previous video frame and a next video frame of the current video frame. The step of calculating a change image between the current video frame and the neighboring video frame may comprise: a change image between the current video frame and the previous video frame is calculated, and a change image between the current video frame and the next video frame is calculated. The step of calculating spatial attention information between a current video frame and the neighboring video frame based on the change image may comprise: spatial attention information between a current video frame and a previous video frame is calculated based on a change image between the current video frame and the previous video frame, and spatial attention information between the current video frame and a next video frame is calculated based on a change image between the current video frame and the next video frame.
Optionally, the step of obtaining the features for super resolution of the current video frame based on the spatial attention information between the current video frame and the neighboring video frame may comprise: using a fusion network to obtain features of the current video frame for super resolution based on spatial attention information between the current video frame and the neighboring video frame.
Optionally, the step of obtaining the features for super resolution of the current video frame based on the spatial attention information between the current video frame and the neighboring video frame may comprise: splicing the spatial attention information between the current video frame and the next video frame with the spatial attention information between the current video frame and the previous video frame; and obtaining the characteristics of the current video frame for super resolution by using a fusion network based on the splicing result.
Optionally, the step of generating a high resolution video frame of the current video frame based on the features for super resolution may comprise: generating a high resolution video frame of the current video frame using a neural network based on the features for super resolution.
Optionally, the change image may comprise a difference image or an optical flow image.
According to an exemplary embodiment of the present disclosure, there is provided a video processing apparatus including: a video frame acquisition unit configured to acquire all video frames of a video; and a video processing unit configured to process for each of the all video frames, wherein the video processing unit comprises: an attention calculating unit configured to calculate a change image between a current video frame and an adjacent video frame, calculate spatial attention information between the current video frame and the adjacent video frame based on the change image, wherein the spatial attention information represents information of a region in which the current video frame is focused with respect to the adjacent frame, the adjacent video frame being a video frame adjacent to the current video frame, a feature obtaining unit configured to obtain a feature for super resolution of the current video frame based on the spatial attention between the current video frame and the adjacent video frame, and a generating unit configured to generate a high resolution video frame of the current video frame based on the feature for super resolution.
Optionally, the neighboring video frame comprises a next video frame of the current video frame, wherein the next video frame is a next video frame neighboring the current video frame. The attention calculation unit may be configured to: a change image between the current video frame and the next video frame is calculated, and spatial attention information between the current video frame and the next video frame is calculated based on the change image between the current video frame and the next video frame.
Optionally, the attention calculation unit may be configured to: dividing a current video frame into a first area and a second area based on the change degree of each frame of pixels in a change image between the current video frame and a next video frame; determining a first region of a change image corresponding to the first region and a second region of the change image corresponding to the second region in a change image between the current video frame and a next video frame, and determining a first region of the next video frame corresponding to the first region and a second region of the next video frame corresponding to the second region in the next video frame; calculating spatial attention information between a first region of a current video frame and a first region of a next video frame based on the change image first region; and calculating spatial attention information between the second region of the current video frame and the second region of the next video frame based on the changed image second region, wherein the spatial attention information between the current video frame and the next video frame comprises the spatial attention information between the first region of the current video frame and the first region of the next video frame and the spatial attention information between the second region of the current video frame and the second region of the next video frame.
Optionally, the attention calculation unit may be configured to: performing point multiplication on the first area of the change image and the first area of the current video frame to obtain a first point multiplication result; performing spatial gating convolution operation on the first area of the change image and the first point multiplication result to obtain a first convolution result; spatial attention information between a first region of the current video frame and a first region of a next video frame is determined based on the first convolution result, wherein the first region of the next video frame is a region of the next video frame corresponding to the first region of the current video frame.
Optionally, the attention calculation unit may be configured to: performing linear rectification on the first convolution result to obtain a first rectification result; taking the first rectification result as spatial attention information between the first region of the current video frame and the first region of the next video frame, wherein the attention calculation unit may be further configured to: and setting the value less than zero in the first convolution result to be equal to zero, and keeping the value more than zero in the first convolution result unchanged.
Optionally, the attention calculation unit may be configured to: performing negation operation on the second area of the change image to obtain a negation image of the second area of the change image; performing dot multiplication on the negation image and a second area of the current video frame to obtain a second dot multiplication result; performing spatial gating convolution operation on the negation image and the second dot product result to obtain a second convolution result; spatial attention information between a second region of the current video frame and a second region of a next video frame is determined based on the second convolution result, wherein the second region of the next video frame is a region of the next video frame corresponding to the second region of the current video frame.
Optionally, the attention calculation unit may be configured to: performing linear rectification on the second convolution result to obtain a second rectification result; taking the second rectification result as spatial attention information between the second region of the current video frame and the second region of the next video frame, wherein the attention calculation unit may be further configured to: and setting the value less than zero in the second convolution result to be equal to zero, and keeping the value more than zero in the second convolution result unchanged.
Optionally, the adjacent video frame may include a previous video frame of the current video frame, wherein the previous video frame is a previous video frame adjacent to the current video frame. The attention calculation unit may be configured to: a change image between the current video frame and a previous video frame is calculated, and spatial attention information between the current video frame and the previous video frame is calculated based on the change image between the current video frame and the previous video frame.
Optionally, the neighboring video frames may include a previous video frame and a next video frame of the current video frame. The attention calculation unit may be configured to: calculating a change image between a current video frame and a previous video frame and calculating a change image between the current video frame and a next video frame, calculating spatial attention information between the current video frame and the previous video frame based on the change image between the current video frame and the previous video frame, and calculating spatial attention information between the current video frame and the next video frame based on the change image between the current video frame and the next video frame.
Optionally, the feature obtaining unit may be configured to: using a fusion network to obtain features of the current video frame for super resolution based on spatial attention information between the current video frame and the neighboring video frame.
Optionally, the feature obtaining unit may be configured to: splicing the spatial attention information between the current video frame and the next video frame with the spatial attention information between the current video frame and the previous video frame; and obtaining the characteristics of the current video frame for super resolution by using a fusion network based on the splicing result.
Optionally, the generating unit is configured to: generating a high resolution video frame of the current video frame using a neural network based on the features for super resolution.
Optionally, the change image may comprise a difference image or an optical flow image.
According to an exemplary embodiment of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to execute the instructions to implement a video processing method according to an exemplary embodiment of the present disclosure.
According to an exemplary embodiment of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor of an electronic device, causes the electronic device to execute a video processing method according to an exemplary embodiment of the present disclosure.
According to an exemplary embodiment of the present disclosure, a computer program product is provided, comprising computer programs/instructions which, when executed by a processor, implement a video processing method according to an exemplary embodiment of the present disclosure.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
improving the attention to the complementary region based on the spatial attention information, and reducing the attention to the redundant region;
and the super-resolution performance is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
Fig. 1 illustrates an exemplary system architecture 100 in which exemplary embodiments of the present disclosure may be applied.
Fig. 2 illustrates a flow chart of a video processing method according to an exemplary embodiment of the present disclosure.
Fig. 3 illustrates a block diagram of a video processing apparatus according to an exemplary embodiment of the present disclosure.
Fig. 4 shows a block diagram of the video processing unit 32 according to an exemplary embodiment of the present disclosure.
Fig. 5 is a block diagram of an electronic device 500 according to an example embodiment of the present disclosure.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The embodiments described in the following examples do not represent all embodiments consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Herein, the expression "at least one of the items" covers three parallel cases: "any one of the items", "a combination of any plurality of the items", and "all of the items". For example, "including at least one of A and B" covers the following three parallel cases: (1) including A; (2) including B; (3) including A and B. For another example, "performing at least one of step one and step two" covers the following three parallel cases: (1) performing step one; (2) performing step two; (3) performing step one and step two.
Video super-resolution technology is also widely applied in image quality enhancement scenarios, providing users with a more pleasing visual experience. From the perspective of temporal difference, the present disclosure reconsiders how to use spatio-temporal information to help the algorithm enhance the information of a video frame, so as to further strengthen existing video super-resolution algorithms. For video, two kinds of information exist between frames, namely redundant information and complementary information, and fully mining the complementary information brings rich details to the super-resolution result. The present disclosure captures the change information of adjacent frames by a simple and effective temporal difference method, and directs the neural network to focus on this information.
Hereinafter, a video processing method and apparatus according to an exemplary embodiment of the present disclosure will be described in detail with reference to fig. 1 to 5.
Fig. 1 illustrates an exemplary system architecture 100 in which exemplary embodiments of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few. The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. Various video applications may be installed on the terminal devices 101, 102, 103. The terminal apparatuses 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices having a display screen and capable of playing, recording, editing, etc. video, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, etc. When the terminal device 101, 102, 103 is software, it may be installed in the electronic devices listed above, it may be implemented as a plurality of software or software modules (for example, to provide distributed services), or it may be implemented as a single software or software module. And is not particularly limited herein.
The terminal devices 101, 102, 103 may be equipped with an image capture device (e.g., a camera) to capture video data. In practice, the smallest visual unit that makes up a video is a Frame (Frame). Each frame is a static image. Temporally successive sequences of frames are composited together to form a motion video. Further, the terminal apparatuses 101, 102, 103 may also be mounted with a component (e.g., a speaker) for converting an electric signal into sound to play the sound, and may also be mounted with a device (e.g., a microphone) for converting an analog audio signal into a digital audio signal to pick up the sound.
The server 105 may be a server providing various services, such as a background server providing support for multimedia applications installed on the terminal devices 101, 102, 103. The background server can analyze and store the received data such as the audio and video data uploading request, and can also receive the video processing request sent by the terminal equipment 101, 102, 103, and feed back the video processing result to the terminal equipment 101, 102, 103.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or may be implemented as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be noted that the video processing method provided by the embodiment of the present disclosure is generally executed by a terminal device, but may also be executed by a server, or may also be executed by cooperation of the terminal device and the server. Accordingly, the video processing apparatus may be provided in the terminal device, the server, or both the terminal device and the server.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation, and the disclosure is not limited thereto.
Fig. 2 illustrates a flow chart of a video processing method according to an exemplary embodiment of the present disclosure. The video processing method of fig. 2 is applicable to video super resolution of low resolution video.
Referring to fig. 2, in step S201, all video frames of a video are acquired. A high resolution video frame is then generated for each of the video frames in steps S202 to S205.
In step S202, a change image between the current video frame and an adjacent video frame is calculated. Here, the adjacent video frame is a video frame adjacent to the current video frame, and may be a previous video frame and/or a next video frame of the current video frame. The next video frame is the next video frame adjacent to the current video frame, and the previous video frame is the previous video frame adjacent to the current video frame. The adjacent video frame may also be several video frames adjacent to the current video frame. Hereinafter, description is given taking the next video frame of the current video frame as an example.
In an exemplary embodiment of the present disclosure, the change image may be a difference image or an optical flow image.
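For illustration only, a minimal sketch of computing such a difference image between the current frame and an adjacent frame is given below; the (C, H, W) tensor layout, the channel averaging, and the normalization are assumptions rather than details fixed by the disclosure.

```python
import torch

def difference_image(cur: torch.Tensor, nxt: torch.Tensor) -> torch.Tensor:
    """Compute a normalized per-pixel change image between two video frames.

    cur, nxt: float tensors of shape (C, H, W), values in [0, 1].
    Returns a (1, H, W) map in [0, 1]; larger values indicate larger change.
    """
    diff = (cur - nxt).abs().mean(dim=0, keepdim=True)  # average absolute change over channels
    return diff / diff.max().clamp(min=1e-8)            # normalize so the map lies in [0, 1]
```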
In step S203, spatial attention information between the current video frame and the neighboring video frame is calculated based on the change image between the current video frame and the neighboring video frame. Here, the spatial attention information represents information of a region of interest of the current video frame with respect to the neighboring frame.
In an exemplary embodiment of the present disclosure, when a difference image is used as the change image, the speed and accuracy of calculating the spatial attention may be improved.
In an exemplary embodiment of the present disclosure, if the neighboring video frame is a next video frame of the current video frame, when calculating a change image between the current video frame and the neighboring video frame, the change image between the current video frame and the next video frame may be calculated, and when calculating spatial attention information between the current video frame and the neighboring video frame based on the change image between the current video frame and the neighboring video frame, the spatial attention information between the current video frame and the next video frame may be calculated based on the change image between the current video frame and the next video frame.
In an exemplary embodiment of the present disclosure, when calculating the spatial attention information between the current video frame and the next video frame based on the change image between the current video frame and the next video frame, the current video frame may first be divided into a first region and a second region based on the degree of change of each pixel in the change image between the current video frame and the next video frame. In the change image between the current video frame and the next video frame, a change image first region corresponding to the first region and a change image second region corresponding to the second region are determined, and in the next video frame, a next video frame first region corresponding to the first region and a next video frame second region corresponding to the second region are determined. Spatial attention information between the first region of the current video frame and the first region of the next video frame is then calculated based on the change image first region, and spatial attention information between the second region of the current video frame and the second region of the next video frame is calculated based on the change image second region. Here, the spatial attention information between the current video frame and the next video frame includes the spatial attention information between the first region of the current video frame and the first region of the next video frame, and the spatial attention information between the second region of the current video frame and the second region of the next video frame.
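A minimal sketch of this division step is shown below, assuming the change image is a single-channel map normalized to [0, 1]; the fixed threshold is an assumption, since the disclosure does not specify how the degree of change is thresholded.

```python
import torch

def split_regions(change: torch.Tensor, threshold: float = 0.1):
    """Divide a frame into a first region (large change) and a second region (small change).

    change: (1, H, W) change image in [0, 1].
    Returns two binary masks of the same shape whose union covers the whole frame.
    """
    first_region = (change > threshold).float()   # strongly changing, complementary content
    second_region = 1.0 - first_region            # weakly changing, largely redundant content
    return first_region, second_region
```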
In an exemplary embodiment of the present disclosure, when spatial attention information between a first region of a current video frame and a first region of a next video frame is calculated based on a first region of a change image, the first region of the change image and the first region of the current video frame may be first dot-multiplied to obtain a first dot-multiplied result, the first region of the change image and the first dot-multiplied result may be subjected to a spatial gate convolution operation to obtain a first convolution result, and then spatial attention information between the first region of the current video frame and the first region of the next video frame may be determined based on the first convolution result. Here, the next video frame first region is a region in the next video frame corresponding to the first region of the current video frame.
In an exemplary embodiment of the present disclosure, when determining spatial attention information between a first region of a current video frame and a first region of a next video frame based on a first convolution result, the first convolution result may be first linearly rectified to obtain a first rectification result, and then the first rectification result may be used as the spatial attention information between the first region of the current video frame and the first region of the next video frame.
In an exemplary embodiment of the present disclosure, when the first convolution result is linearly rectified, a value smaller than zero in the first convolution result may be set equal to zero, and a value larger than zero in the first convolution result may be kept unchanged.
For example, the spatial attention information between the first region of the current video frame and the first region of the next video frame may be calculated by the formula

Z_1 = Relu(Conv(D, D ⊙ I_t))

Here, Z_1 represents the spatial attention information between the first region of the current video frame and the first region of the next video frame, Relu represents a linear rectification function, Conv represents a convolution, D represents the change image (e.g., a difference image or an optical flow image), ⊙ represents a dot product, and I_t represents the current video frame.
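A non-authoritative PyTorch sketch of this formula follows. The two-input spatial gating convolution is read here as one convolution over the channel-wise concatenation of the change image D and the gated frame D ⊙ I_t; that reading, the 3x3 kernel, and the channel counts are assumptions.

```python
import torch
import torch.nn as nn

class FirstRegionAttention(nn.Module):
    """Sketch of Z_1 = Relu(Conv(D, D * I_t)) for the strongly changing region."""

    def __init__(self, frame_channels: int = 3, out_channels: int = 64):
        super().__init__()
        # input channels: change image (1) + gated current frame (frame_channels)
        self.conv = nn.Conv2d(1 + frame_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, change: torch.Tensor, frame: torch.Tensor) -> torch.Tensor:
        # change: (N, 1, H, W) first region of the change image; frame: (N, C, H, W)
        gated = change * frame                               # first dot-multiplication result
        z1 = self.conv(torch.cat([change, gated], dim=1))    # spatial gating convolution
        return torch.relu(z1)                                # negative values set to zero
```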
In an exemplary embodiment of the present disclosure, when calculating spatial attention information between a second region of a current video frame and a second region of a next video frame based on the second region of the change image, an inversion operation may be first performed on the second region of the change image to obtain an inverted image of the second region of the change image, then the inverted image may be dot-multiplied with the second region of the current video frame to obtain a second dot-multiplied result, the inverted image and the second dot-multiplied result may be subjected to a spatial gate convolution operation to obtain a second convolution result, and then spatial attention information between the second region of the current video frame and the second region of the next video frame may be determined based on the second convolution result. Here, the second region of the next video frame is a region in the next video frame corresponding to the second region of the current video frame.
In an exemplary embodiment of the present disclosure, when determining the spatial attention information between the second region of the current video frame and the second region of the next video frame based on the second convolution result, the second convolution result may be first linearly rectified to obtain a second rectification result, and then the second rectification result may be used as the spatial attention information between the second region of the current video frame and the second region of the next video frame.
In an exemplary embodiment of the present disclosure, when the second convolution result is linearly rectified, a value smaller than zero in the second convolution result may be set equal to zero, and a value larger than zero in the second convolution result may be kept unchanged.
For example, the spatial attention information between the second region of the current video frame and the second region of the next video frame may be calculated by the formula

Z_2 = Relu(Conv(D̄, D̄ ⊙ I_t))

Here, Z_2 represents the spatial attention information between the second region of the current video frame and the second region of the next video frame, Relu represents a linear rectification function, Conv represents a convolution, D̄ represents the inverted image of the change image (e.g., of a difference image or an optical flow image), ⊙ represents a dot product, and I_t represents the current video frame.
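The second-region branch differs only in that the change image is negated first. Below is a minimal functional sketch under the same concatenation-based reading, with the convolution parameters passed in explicitly (both assumptions).

```python
import torch
import torch.nn.functional as F

def second_region_attention(change, frame, weight, bias=None):
    """Sketch of Z_2 = Relu(Conv(D_bar, D_bar * I_t)) for the weakly changing region.

    change: (N, 1, H, W) second region of the change image; frame: (N, C, H, W) current frame.
    weight, bias: parameters of a 3x3 convolution over (1 + C) input channels.
    """
    inverted = 1.0 - change                                   # negation image of the change image
    gated = inverted * frame                                  # second dot-multiplication result
    z2 = F.conv2d(torch.cat([inverted, gated], dim=1), weight, bias, padding=1)
    return torch.relu(z2)                                     # linear rectification
```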
In an exemplary embodiment of the present disclosure, if the neighboring video frame is a previous video frame of the current video frame, when calculating a change image between the current video frame and the neighboring video frame, the change image between the current video frame and the previous video frame may be calculated, and when calculating spatial attention information between the current video frame and the neighboring video frame based on the change image, the spatial attention information between the current video frame and the previous video frame may be calculated based on the change image between the current video frame and the previous video frame.
In an exemplary embodiment of the present disclosure, if the neighboring video frames include a previous video frame and a next video frame of a current video frame, in calculating a change image between the current video frame and the neighboring video frame, the change image between the current video frame and the previous video frame may be calculated, and the change image between the current video frame and the next video frame may be calculated; in calculating the spatial attention information between the current video frame and the neighboring video frame based on the change image between the current video frame and the neighboring video frame, the spatial attention information between the current video frame and the previous video frame may be calculated based on the change image between the current video frame and the previous video frame, and the spatial attention information between the current video frame and the next video frame may be calculated based on the change image between the current video frame and the next video frame.
In step S204, a feature for super resolution of the current video frame is obtained based on spatial attention information between the current video frame and the neighboring video frame.
In an exemplary embodiment of the present disclosure, in obtaining the feature for super resolution of the current video frame based on the spatial attention information between the current video frame and the neighboring video frame, the feature for super resolution of the current video frame may be obtained using a fusion network based on the spatial attention information between the current video frame and the neighboring video frame.
In an exemplary embodiment of the present disclosure, in obtaining a feature for super resolution of a current video frame based on spatial attention information between the current video frame and the neighboring video frame, spatial attention information between the current video frame and a next video frame and spatial attention information between the current video frame and a previous video frame may be first stitched; then, based on the splicing result, a fusion network is used to obtain the characteristics of the current video frame for super resolution.
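A sketch of this splicing-and-fusion step is given below; the two-layer convolutional fusion network and the channel counts are assumptions, since the disclosure does not fix the architecture of the fusion network.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Concatenate the attention information from the previous and next frames, then fuse it."""

    def __init__(self, attn_channels: int = 64, out_channels: int = 64):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * attn_channels, out_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
        )

    def forward(self, attn_prev: torch.Tensor, attn_next: torch.Tensor) -> torch.Tensor:
        spliced = torch.cat([attn_prev, attn_next], dim=1)  # splicing along the channel axis
        return self.fuse(spliced)                           # features used for super-resolution
```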
In step S205, a high-resolution video frame of the current video frame is generated based on the features for super-resolution.
In an exemplary embodiment of the present disclosure, in generating a high resolution video frame of a current video frame based on a feature for super resolution, the high resolution video frame of the current video frame may be generated using a neural network based on the feature for super resolution.
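A sketch of this final step using sub-pixel (PixelShuffle) upsampling is shown below; the x4 scale factor and the single-convolution head are assumptions, and any learned upsampling network could serve as the neural network mentioned here.

```python
import torch
import torch.nn as nn

class ReconstructionHead(nn.Module):
    """Map the fused super-resolution features to a high-resolution video frame."""

    def __init__(self, in_channels: int = 64, scale: int = 4, out_channels: int = 3):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, out_channels * scale * scale, kernel_size=3, padding=1),
            nn.PixelShuffle(scale),  # rearranges channels into a scale-times larger spatial grid
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.head(features)  # high-resolution estimate of the current frame
```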
It should be noted that although it is described in step S202 and step S203 that the spatial attention between the current video frame and the next video frame is calculated based on the change image between the current video frame and the next video frame, the present disclosure is not limited thereto, and for example, the spatial attention between the current video frame and the previous video frame may also be calculated based on the change image between the current video frame and the previous video frame.
It should be noted that in exemplary embodiments of the present disclosure, spatial attention may also be calculated by other methods (for example, but not limited to, non-local attention or Transformer-based methods) as computational overhead allows. Alternatively, step S202 and step S203 may be omitted, and spatial attention may be calculated by such other methods.
The video processing method according to the exemplary embodiment of the present disclosure has been described above with reference to fig. 1 to 2. Hereinafter, a video processing apparatus and units thereof according to an exemplary embodiment of the present disclosure will be described with reference to fig. 3 and 4.
Fig. 3 illustrates a block diagram of a video processing apparatus according to an exemplary embodiment of the present disclosure. Fig. 4 shows a block diagram of the video processing unit 32 according to an exemplary embodiment of the present disclosure.
Referring to fig. 3, the video processing apparatus includes a video frame acquisition unit 31 and a video processing unit 32.
The video frame acquisition unit 31 is configured to acquire all video frames of a video.
The video processing unit 32 is configured to process for each of all video frames.
Referring to fig. 4, the video processing unit 32 includes an attention calculating unit 321, a feature obtaining unit 322, and a generating unit 323.
The attention calculation unit 321 is configured to calculate a change image between the current video frame and the neighboring video frame, and calculate spatial attention information between the current video frame and the neighboring video frame based on the change image between the current video frame and the neighboring video frame. Here, the spatial attention information represents information of a region in which the current video frame is focused with respect to the neighboring frame, which is a video frame neighboring the current video frame. Here, the neighboring video frame may be a previous video frame and/or a next video frame of the current video frame. The next video frame is the next video frame adjacent to the current video frame, and the previous video frame is the previous video frame adjacent to the current video frame.
In an exemplary embodiment of the present disclosure, the change image may be a difference image or an optical flow image.
In an exemplary embodiment of the present disclosure, if the neighboring video frame is a next video frame of the current video frame, the attention calculation unit is configured to: a change image between the current video frame and the next video frame is calculated, and spatial attention information between the current video frame and the next video frame is calculated based on the change image between the current video frame and the next video frame.
In an exemplary embodiment of the present disclosure, the attention calculating unit 321 may be configured to: dividing a current video frame into a first area and a second area based on the change degree of each frame of pixels in a change image between the current video frame and a next video frame; determining a first region of a change image corresponding to the first region and a second region of the change image corresponding to the second region in a change image between the current video frame and a next video frame, and determining a first region of the next video frame corresponding to the first region and a second region of the next video frame corresponding to the second region in the next video frame; calculating spatial attention information between a first region of a current video frame and a first region of a next video frame based on the change image first region; and calculating spatial attention information between the second region of the current video frame and the second region of the next video frame based on the changed image second region, wherein the spatial attention information between the current video frame and the next video frame comprises the spatial attention information between the first region of the current video frame and the first region of the next video frame and the spatial attention information between the second region of the current video frame and the second region of the next video frame.
In an exemplary embodiment of the present disclosure, the attention calculating unit 321 may be configured to: performing point multiplication on the first area of the change image and the first area of the current video frame to obtain a first point multiplication result; performing spatial gating convolution operation on the first area of the change image and the first point multiplication result to obtain a first convolution result; spatial attention information between a first region of the current video frame and a first region of a next video frame is determined based on the first convolution result, wherein the first region of the next video frame is a region of the next video frame corresponding to the first region of the current video frame.
In an exemplary embodiment of the present disclosure, the attention calculating unit 321 may be configured to: performing linear rectification on the first convolution result to obtain a first rectification result; and taking the first rectification result as the spatial attention information between the first region of the current video frame and the first region of the next video frame.
In an exemplary embodiment of the present disclosure, the attention calculating unit 321 may be configured to: and setting the value less than zero in the first convolution result to be equal to zero, and keeping the value more than zero in the first convolution result unchanged.
In an exemplary embodiment of the present disclosure, the attention calculating unit 321 may be configured to: performing negation operation on the second area of the change image to obtain a negation image of the second area of the change image; performing dot multiplication on the negation image and a second area of the current video frame to obtain a second dot multiplication result; performing spatial gating convolution operation on the negation image and the second dot product result to obtain a second convolution result; spatial attention information between a second region of the current video frame and a second region of a next video frame is determined based on the second convolution result, wherein the second region of the next video frame is a region of the next video frame corresponding to the second region of the current video frame.
In an exemplary embodiment of the present disclosure, the attention calculating unit 321 may be configured to: performing linear rectification on the second convolution result to obtain a second rectification result; and taking the second rectification result as the spatial attention information between the second area of the current video frame and the second area of the next video frame.
In an exemplary embodiment of the present disclosure, the attention calculating unit 321 may be configured to: and setting the value less than zero in the second convolution result to be equal to zero, and keeping the value more than zero in the second convolution result unchanged.
In an exemplary embodiment of the present disclosure, if the neighboring video frame is a previous video frame of the current video frame, the attention calculation unit is configured to: a change image between the current video frame and a previous video frame is calculated, and spatial attention information between the current video frame and the previous video frame is calculated based on the change image between the current video frame and the previous video frame.
In an exemplary embodiment of the present disclosure, if the neighboring video frames include a previous video frame and a next video frame of the current video frame, the attention calculation unit is configured to: calculating a change image between a current video frame and a previous video frame and calculating a change image between the current video frame and a next video frame, calculating spatial attention information between the current video frame and the previous video frame based on the change image between the current video frame and the previous video frame, and calculating spatial attention information between the current video frame and the next video frame based on the change image between the current video frame and the next video frame.
The feature obtaining unit 322 is configured to obtain features for super resolution of the current video frame based on spatial attention between the current video frame and the neighboring video frame.
In an exemplary embodiment of the present disclosure, the feature obtaining unit 322 may be configured to: using a fusion network to obtain features of the current video frame for super resolution based on spatial attention information between the current video frame and the neighboring video frame.
In an exemplary embodiment of the present disclosure, the feature obtaining unit 322 may be configured to: splicing the spatial attention information between the current video frame and the next video frame with the spatial attention information between the current video frame and the previous video frame; and obtaining the characteristics of the current video frame for super resolution by using a fusion network based on the splicing result.
The generating unit 323 is configured to generate a high resolution video frame of the current video frame based on the feature for super resolution.
In an exemplary embodiment of the present disclosure, the generating unit 323 may be configured to: generating a high resolution video frame of the current video frame using a neural network based on the features for super resolution.
With regard to the apparatus in the above-described embodiment, the specific manner in which each unit performs the operation has been described in detail in the embodiment related to the method, and will not be described in detail here.
The video processing apparatus according to the exemplary embodiment of the present disclosure has been described above with reference to fig. 3 and 4. Next, an electronic device according to an exemplary embodiment of the present disclosure is described with reference to fig. 5.
Fig. 5 is a block diagram of an electronic device 500 according to an example embodiment of the present disclosure.
Referring to fig. 5, an electronic device 500 includes at least one memory 501 and at least one processor 502, the at least one memory 501 having stored therein a set of computer-executable instructions that, when executed by the at least one processor 502, perform a method of video processing according to an exemplary embodiment of the present disclosure.
In an exemplary embodiment of the present disclosure, the electronic device 500 may be a personal computer (PC), a tablet device, a personal digital assistant, a smartphone, or another device capable of executing the above-described set of instructions. Here, the electronic device 500 need not be a single electronic device; it can be any collection of devices or circuits that can execute the above instructions (or instruction sets) individually or jointly. The electronic device 500 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (e.g., via wireless transmission).
In the electronic device 500, the processor 502 may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a special-purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processors may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.
The processor 502 may execute instructions or code stored in the memory 501, wherein the memory 501 may also store data. The instructions and data may also be transmitted or received over a network via a network interface device, which may employ any known transmission protocol.
The memory 501 may be integrated with the processor 502, for example, by having RAM or flash memory disposed within an integrated circuit microprocessor or the like. Further, memory 501 may comprise a stand-alone device, such as an external disk drive, storage array, or any other storage device usable by a database system. The memory 501 and the processor 502 may be operatively coupled or may communicate with each other, e.g., through I/O ports, network connections, etc., such that the processor 502 is able to read files stored in the memory.
In addition, the electronic device 500 may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device 500 may be connected to each other via a bus and/or a network.
There is also provided, in accordance with an exemplary embodiment of the present disclosure, a computer-readable storage medium, such as the memory 501, comprising instructions executable by the processor 502 of the electronic device 500 to perform the above-described method. Alternatively, the computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
According to an exemplary embodiment of the present disclosure, a computer program product may also be provided, which comprises computer programs/instructions, which when executed by a processor, implement the method of video processing according to an exemplary embodiment of the present disclosure.
The video processing method and apparatus according to the exemplary embodiments of the present disclosure have been described above with reference to fig. 1 to 5. However, it should be understood that the video processing apparatus and its units shown in fig. 3 and 4 may each be configured as software, hardware, firmware, or any combination thereof to perform a specific function, and that the electronic device shown in fig. 5 is not limited to the components shown above: components may be added or deleted as needed, and the above components may also be combined.
According to the video processing method and apparatus of the present disclosure, all video frames of a video are acquired, and the following is performed for each of the video frames: a change image between the current video frame and an adjacent video frame (a video frame adjacent to the current video frame) is calculated; spatial attention information between the current video frame and the adjacent video frame is calculated based on the change image; features for super resolution of the current video frame are obtained based on the spatial attention information; and a high-resolution video frame of the current video frame is generated based on those features. In this way, attention to complementary regions is increased and attention to redundant regions is reduced based on the spatial attention information, thereby improving super-resolution performance.
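Putting these steps together, the per-frame loop might be sketched as below. The `fusion` and `reconstruct` arguments stand for networks like the illustrative ones above; the absolute-difference change images, the normalized change maps used as attention, and the clamped handling of the first and last frames are assumptions rather than the disclosure's implementation:

```python
import torch

def super_resolve_video(frames, fusion, reconstruct):
    """Illustrative per-frame loop: frames is a list of (C, H, W) tensors."""
    outputs = []
    for i, current in enumerate(frames):
        prev_frame = frames[max(i - 1, 0)]                  # clamp at the first frame
        next_frame = frames[min(i + 1, len(frames) - 1)]    # clamp at the last frame

        attention_maps = []
        for neighbor in (next_frame, prev_frame):
            change = (current - neighbor).abs()             # change image
            attention_maps.append(change / (change.max() + 1e-8))

        spliced = torch.cat(attention_maps, dim=0).unsqueeze(0)  # concatenate along channels
        features = fusion(spliced)                          # features for super resolution
        outputs.append(reconstruct(features).squeeze(0))    # high-resolution frame
    return outputs
```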
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A video processing method, comprising:
acquiring all video frames of a video;
for each of the all video frames, performing the following:
calculating a change image between a current video frame and an adjacent video frame, wherein the adjacent video frame is a video frame adjacent to the current video frame,
calculating spatial attention information between the current video frame and the adjacent video frame based on the change image, wherein the spatial attention information represents information of a region of interest of the current video frame with respect to the adjacent video frame,
obtaining features for super resolution of the current video frame based on the spatial attention information between the current video frame and the adjacent video frame, and
generating a high resolution video frame of the current video frame based on the features for super resolution.
2. The video processing method according to claim 1, wherein the adjacent video frame comprises a next video frame of the current video frame, wherein the next video frame is a next video frame adjacent to the current video frame,
wherein the step of calculating a change image between the current video frame and the adjacent video frame comprises:
calculating a change image between the current video frame and the next video frame,
wherein the step of calculating spatial attention information between the current video frame and the adjacent video frame based on the change image comprises:
calculating spatial attention information between the current video frame and the next video frame based on the change image between the current video frame and the next video frame.
3. The video processing method of claim 2, wherein the step of calculating the spatial attention information between the current video frame and the next video frame based on the change image between the current video frame and the next video frame comprises:
dividing the current video frame into a first region and a second region based on the degree of change of each pixel in the change image between the current video frame and the next video frame;
determining, in the change image between the current video frame and the next video frame, a first region of the change image corresponding to the first region and a second region of the change image corresponding to the second region, and determining, in the next video frame, a first region of the next video frame corresponding to the first region and a second region of the next video frame corresponding to the second region;
calculating spatial attention information between the first region of the current video frame and the first region of the next video frame based on the first region of the change image; and
calculating spatial attention information between the second region of the current video frame and the second region of the next video frame based on the second region of the change image,
wherein the spatial attention information between the current video frame and the next video frame comprises the spatial attention information between the first region of the current video frame and the first region of the next video frame, and the spatial attention information between the second region of the current video frame and the second region of the next video frame.
4. The video processing method according to claim 3, wherein the step of calculating the spatial attention information between the first region of the current video frame and the first region of the next video frame based on the first region of the change image comprises:
performing dot multiplication on the first region of the change image and the first region of the current video frame to obtain a first dot multiplication result;
performing a spatial gating convolution operation on the first region of the change image and the first dot multiplication result to obtain a first convolution result; and
determining spatial attention information between the first region of the current video frame and the first region of the next video frame based on the first convolution result, wherein the first region of the next video frame is a region of the next video frame corresponding to the first region of the current video frame.
5. The video processing method of claim 4, wherein the step of determining spatial attention information between the first region of the current video frame and the first region of the next video frame based on the first convolution result comprises:
performing linear rectification on the first convolution result to obtain a first rectification result;
taking the first rectification result as the spatial attention information between the first region of the current video frame and the first region of the next video frame,
wherein the step of linearly rectifying the first convolution result comprises:
setting values less than zero in the first convolution result to zero, and keeping values greater than zero in the first convolution result unchanged.
6. The video processing method according to claim 3, wherein the step of calculating the spatial attention information between the second region of the current video frame and the second region of the next video frame based on the second region of the change image comprises:
performing a negation operation on the second region of the change image to obtain a negation image of the second region of the change image;
performing dot multiplication on the negation image and the second region of the current video frame to obtain a second dot multiplication result;
performing a spatial gating convolution operation on the negation image and the second dot multiplication result to obtain a second convolution result; and
determining spatial attention information between the second region of the current video frame and the second region of the next video frame based on the second convolution result, wherein the second region of the next video frame is a region of the next video frame corresponding to the second region of the current video frame.
7. The video processing method of claim 6, wherein the step of determining spatial attention information between the second region of the current video frame and the second region of the next video frame based on the second convolution result comprises:
performing linear rectification on the second convolution result to obtain a second rectification result;
taking the second rectification result as the spatial attention information between the second region of the current video frame and the second region of the next video frame,
wherein the step of linearly rectifying the second convolution result comprises:
setting values less than zero in the second convolution result to zero, and keeping values greater than zero in the second convolution result unchanged.
8. A video processing apparatus, comprising:
a video frame acquisition unit configured to acquire all video frames of a video; and
a video processing unit configured to perform processing for each of the all video frames,
wherein the video processing unit includes:
an attention calculation unit configured to calculate a change image between a current video frame and an adjacent video frame and to calculate spatial attention information between the current video frame and the adjacent video frame based on the change image, wherein the spatial attention information represents information of a region of interest of the current video frame with respect to the adjacent video frame, the adjacent video frame being a video frame adjacent to the current video frame,
a feature obtaining unit configured to obtain features for super resolution of the current video frame based on the spatial attention information between the current video frame and the adjacent video frame, and
a generating unit configured to generate a high resolution video frame of the current video frame based on the features for super resolution.
9. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the video processing method of any of claims 1 to 7.
10. A computer-readable storage medium storing a computer program, which when executed by a processor of an electronic device causes the electronic device to perform the video processing method of any one of claims 1 to 7.
CN202110933328.9A 2021-08-14 2021-08-14 Video processing method and device Pending CN113658045A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110933328.9A CN113658045A (en) 2021-08-14 2021-08-14 Video processing method and device

Publications (1)

Publication Number Publication Date
CN113658045A true CN113658045A (en) 2021-11-16

Family

ID=78491632

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110933328.9A Pending CN113658045A (en) 2021-08-14 2021-08-14 Video processing method and device

Country Status (1)

Country Link
CN (1) CN113658045A (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113034380A (en) * 2021-02-09 2021-06-25 浙江大学 Video space-time super-resolution method and device based on improved deformable convolution correction

Similar Documents

Publication Publication Date Title
CN111476871B (en) Method and device for generating video
CN112954450B (en) Video processing method and device, electronic equipment and storage medium
CN112584077B (en) Video frame interpolation method and device and electronic equipment
CN111669502B (en) Target object display method and device and electronic equipment
CN110310299B (en) Method and apparatus for training optical flow network, and method and apparatus for processing image
US11641446B2 (en) Method for video frame interpolation, and electronic device
WO2023103576A1 (en) Video processing method and apparatus, and computer device and storage medium
CN114331820A (en) Image processing method, image processing device, electronic equipment and storage medium
US11627281B2 (en) Method and apparatus for video frame interpolation, and device and storage medium
CN109121000A (en) A kind of method for processing video frequency and client
CN114299088A (en) Image processing method and device
CN114245209A (en) Video resolution determination method, video resolution determination device, video model training method, video coding device and video coding device
CN114187177A (en) Method, device and equipment for generating special effect video and storage medium
CN112308950A (en) Video generation method and device
CN113610034B (en) Method and device for identifying character entities in video, storage medium and electronic equipment
CN112785669B (en) Virtual image synthesis method, device, equipment and storage medium
CN112532907A (en) Video frame frequency improving method, device, equipment and medium
CN113194270B (en) Video processing method and device, electronic equipment and storage medium
CN113658045A (en) Video processing method and device
CN114422698A (en) Video generation method, device, equipment and storage medium
CN113409199A (en) Image processing method, image processing device, electronic equipment and computer readable medium
CN113905177A (en) Video generation method, device, equipment and storage medium
CN114005063A (en) Video processing method and device
CN110599437A (en) Method and apparatus for processing video
US11647153B1 (en) Computer-implemented method, device, and computer program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination