CN112750094B - Video processing method and system - Google Patents

Video processing method and system

Info

Publication number
CN112750094B
CN112750094B (application CN202011611610.7A)
Authority
CN
China
Prior art keywords
frame
fusion
feature
time sequence
characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011611610.7A
Other languages
Chinese (zh)
Other versions
CN112750094A (en
Inventor
赵洋
马彦博
曹力
贾伟
李琳
刘晓平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology filed Critical Hefei University of Technology
Priority to CN202011611610.7A
Publication of CN112750094A
Application granted
Publication of CN112750094B
Legal status: Active

Classifications

    • G06T 5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N 3/045 Neural networks; Combinations of networks
    • G06T 3/4053 Scaling of whole images or parts thereof based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G06T 5/70 Denoising; Smoothing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Television Systems (AREA)

Abstract

The invention relates to a video processing method and system. The method specifically comprises the following steps: obtaining an initial interlaced input frame; vertically interpolating the initial interlaced input frame to the complete frame resolution and performing time sequence alignment fusion to obtain a time sequence fusion feature frame; and removing compression noise from the time sequence fusion feature frame to obtain a reconstructed output image. Compared with the prior art, the method can uniformly realize de-interlacing, compression post-processing and super-resolution of low-quality video, thereby recovering a reconstructed video with high visual quality.

Description

Video processing method and system
Technical Field
The present invention relates to the field of video and image processing, and in particular, to a video processing method and system.
Background
Interlaced scanning found widespread use in early television broadcast systems (e.g., NTSC, PAL and SECAM). In an interlaced frame, the odd-line pixels and the even-line pixels come from two different fields, called the odd field and the even field; this scheme strikes a good balance between the frame rate and the bandwidth of the video. Since the two fields are captured at different time instants, the images of the different fields have a certain displacement difference and cannot be perfectly aligned spatially. When the content of the two fields is interleaved, noticeable jagged artifacts are therefore observed in the video, and the artifacts are more severe in the presence of large motion. Besides the interlacing artifacts, old videos often contain other complex noise and have poor definition. Traditional de-interlacing methods are single-purpose and of limited effect, and cannot remove these complex degradations in a unified way; current deep-learning methods are all based on a single frame and cannot handle the severe artifacts caused by large motion well enough to recover a clean picture.
Therefore, how to design a processing method and system capable of recovering higher visual quality from video images containing complex interlacing artifacts has become a problem to be solved in the art.
Disclosure of Invention
The invention aims to provide a video processing method and a video processing system for video images containing complex interlacing artifacts. A dual-stream network structure is used to perform high-frequency enhancement and deep information reuse, respectively, on the fusion features containing time sequence information, thereby further removing compression noise and recovering details. Finally, the image features are super-resolved using an up-sampling method to obtain the final high-resolution output image.
In order to achieve the purpose, the invention provides the following scheme:
a video processing method, comprising the steps of:
obtaining an initial interleaved input frame;
vertically interpolating the initial interlaced input frame to the complete frame resolution, and performing time sequence alignment fusion to obtain a time sequence fusion feature frame;
and removing compression noise from the time sequence fusion characteristic frame to obtain a reconstructed output image.
Optionally, the vertically interpolating the initial interlaced input frame to the complete frame resolution, and performing time sequence alignment fusion to obtain a time sequence fusion feature frame specifically includes the following steps:
splitting the initial interlaced input frame into an odd field and an even field;
acquiring depth features corresponding to the odd field and the even field by adopting a feature extraction network based on a depth neural network;
adopting a vertical up-sampling network based on a depth neural network to carry out vertical interpolation on the depth characteristics to obtain a vertical interpolation reconstruction frame;
and carrying out time sequence alignment fusion on the vertically interpolated reconstructed frame to obtain the time sequence fusion characteristic frame.
Optionally, the obtaining of the depth features corresponding to the odd field and the even field by using the feature extraction network based on the deep neural network specifically includes the following steps:
acquiring images corresponding to the odd field and the even field;
converting the image into 1 times of feature maps through a first convolution layer, and transmitting the 1 times of feature maps to a plurality of first residual blocks connected with the first convolution layer through residual errors;
the first residual blocks perform feature extraction and reconstruction on the feature maps with the quantity being 1 time of that of the feature maps to obtain depth features corresponding to the odd fields and the even fields;
outputting the depth feature through a second convolutional layer.
Optionally, the vertically interpolating the depth feature by using a vertical upsampling network based on a depth neural network to obtain a vertically interpolated reconstructed frame specifically includes the following steps:
inputting said depth features into a convolutional neural network comprising a third convolutional layer and a vertical pixel scrambling block;
increasing the dimension of the 1-fold number of feature maps to 2-fold number of feature maps by the third convolutional layer, and transmitting the 2-fold number of feature maps to the vertical pixel scrambling block;
and performing up-sampling on the feature maps with the quantity being 2 times of that of the feature maps in the vertical direction through the vertical pixel scrambling block to obtain the vertically interpolated reconstructed frame.
Optionally, performing time sequence alignment fusion on the vertically interpolated reconstructed frame to obtain the time sequence fusion feature frame specifically includes the following steps:
connecting adjacent frames in series with the frames in the vertically interpolated reconstructed frames corresponding to the adjacent frames, wherein the adjacent frames are symmetrical to each frame in the vertically interpolated reconstructed frames;
obtaining the deformable convolution offset required by the adjacent frame through offset network learning;
according to the deformable convolution offset, sequentially aligning the adjacent frames through a deformable convolution layer to obtain aligned frame characteristics;
and performing fusion operation on the aligned frame features through a fourth convolution layer to obtain the time sequence fusion feature frame.
Optionally, the obtaining of the deformable convolution offset required by the adjacent frame through offset network learning specifically includes the following steps:
inputting said vertically interpolated reconstructed frame to an offset learning unit comprising a fifth convolution layer, a U-Net structure;
reducing the feature maps of the adjacent frames from 2 times to 1 times by the fifth convolution layer;
and obtaining the deformable convolution offset required by the adjacent frame through an offset learning unit of the U-Net structure according to the adjacent frame after the dimension reduction.
Optionally, the step of removing compression noise from the time-series fusion feature frame to obtain a reconstructed output image specifically includes the following steps:
inputting the time sequence fusion characteristic frame into a convolutional neural network comprising a plurality of multi-scale blocks to obtain a multi-layer output result, and connecting the multi-layer output result to an output layer from a residual error for accumulation to obtain an enhanced processing characteristic frame;
meanwhile, inputting the time sequence fusion feature frame into a convolutional neural network comprising a plurality of second residual blocks to obtain an information reuse feature frame;
and accumulating the enhanced processing characteristic frame and the information reuse characteristic frame to obtain an accumulated image characteristic frame.
And performing super-resolution reconstruction on the accumulated image characteristic frame to obtain the reconstructed output image.
Optionally, network training is carried out on the whole neural network, the network training is end-to-end whole training,
the samples of the network training use the synthesized frame sequence subjected to the interleaving degradation processing as a training set, and the loss function L in the whole training process is as follows:
L = √(‖Ô_t − O_t‖² + ε²)

wherein Ô_t is the output image, O_t is the reference image, t is the timestamp, and ε is a constant.
The present invention also provides a video processing system, comprising:
an acquisition module for acquiring an initial interlaced input frame;
a multi-field fusion alignment de-interlacing module for vertically interpolating the initial interlaced input frame to a complete frame resolution, and performing time sequence alignment fusion to obtain a time sequence fusion feature frame;
and the de-interlacing feature optimization module is used for removing compression noise from the time sequence fusion feature frame to obtain a reconstructed output image.
Optionally, the multi-field fusion alignment de-interlacing module includes:
a field splitting unit for splitting the initial interlaced input frame into an odd field and an even field;
the characteristic extraction unit is connected with the field splitting unit and used for acquiring depth characteristics corresponding to the odd field and the even field;
the vertical up-sampling unit is connected with the feature extraction unit and is used for performing vertical interpolation on the depth features to obtain a vertical interpolation reconstruction frame;
and the multi-frame alignment fusion unit is connected with the vertical up-sampling unit and is used for carrying out time sequence alignment fusion on the vertically interpolated reconstructed frame to obtain the time sequence fusion characteristic frame.
The de-interlacing feature optimization module comprises:
the multi-scale enhancement unit is used for carrying out high-frequency enhancement processing on the time sequence fusion feature frame to obtain an enhanced processing feature frame;
the depth residual error unit is used for reusing depth information of the time sequence fusion characteristic frame to obtain an information reuse characteristic frame;
and the accumulation unit is used for accumulating the enhanced processing characteristic frame and the information reuse characteristic frame to obtain an accumulated image characteristic.
And the up-sampling unit is used for performing super-resolution reconstruction on the accumulated image characteristics to obtain the reconstructed output image.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
Different from traditional single-purpose de-interlacing methods, the method can uniformly realize de-interlacing, de-compression, frame interpolation and super-resolution of old videos. Through vertical field interpolation of multiple fields and efficient time sequence alignment, the limited field information is jointly utilized to effectively remove interlacing jaggies; meanwhile, the full utilization of temporally redundant information further eliminates complex artifacts such as compression artifacts and recovers as many high-frequency details as possible, thereby restoring a high-resolution image with high visual quality.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required in the embodiments will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a flowchart of a video processing method according to embodiment 1 of the present invention.
Fig. 2 is a flowchart of a video processing system according to embodiment 2 of the present invention.
Fig. 3 is a block diagram of a video processing system according to embodiment 2 of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a video processing method and a video processing system. An input frame is split into an odd field and an even field, which are fed into a network for feature extraction; the fields are then recombined through vertical pixel interpolation to obtain several reconstructed frames at full resolution. The frame features of adjacent frames are temporally aligned with the features of the intermediate frame, and a convolution network fuses the multiple features and reduces their dimensionality to obtain fusion features carrying time sequence information. A dual-stream network structure then performs high-frequency enhancement and deep information reuse, respectively, on the fusion features containing time sequence information, thereby further removing compression noise and recovering details. Finally, the image features are super-resolved using an up-sampling method to obtain the final high-resolution output image. Compared with the prior art, the method can uniformly realize de-interlacing, compression post-processing and super-resolution of low-quality video, thereby recovering a reconstructed video with high visual quality.
In order to make the aforementioned objects, features and advantages of the present invention more comprehensible, the present invention is described in detail with reference to the accompanying drawings and the detailed description thereof.
Example 1:
referring to fig. 1, a video processing method according to the present invention includes the following steps:
s1: obtaining an initial interlaced input frame;
S2: vertically interpolating the initial interlaced input frames to the complete frame resolution, and performing time sequence alignment fusion to obtain time sequence fusion feature frames; wherein S2 specifically comprises the following steps:
s21: splitting the initial interlaced input frame into an odd field and an even field;
s22: acquiring depth features corresponding to the odd field and the even field by adopting a feature extraction network based on a depth neural network; specifically, S22 further includes the following steps:
s221: acquiring images corresponding to the odd field and the even field;
s222: converting the image into 1 times of feature maps through a first convolution layer, and transmitting the 1 times of feature maps to a plurality of first residual blocks connected with the first convolution layer through residual errors;
s223: the first residual blocks perform feature extraction and reconstruction on the feature maps with the quantity being 1 time of that of the feature maps to obtain depth features corresponding to the odd fields and the even fields;
s224: outputting the depth feature through a second convolution layer.
S23: adopting a vertical up-sampling network based on a depth neural network to carry out vertical interpolation on the depth characteristics to obtain a vertical interpolation reconstruction frame; specifically, S23 further includes the following steps:
s231: inputting said depth features into a convolutional neural network comprising a third convolutional layer and a vertical pixel scrambling block;
s232, increasing the dimension of the feature maps of 1 time number to feature maps of 2 times number through the third convolution layer, and transmitting the feature maps of 2 times number to the vertical pixel scrambling block;
and S233, performing up-sampling on the feature maps of 2 times of the number in the vertical direction through the vertical pixel scrambling block to obtain the vertical interpolation reconstruction frame.
S24: and carrying out time sequence alignment fusion on the vertically interpolated reconstructed frame to obtain the time sequence fusion characteristic frame. Specifically, S24 further includes the following steps:
s241: connecting adjacent frames in series with frames in the vertically interpolated reconstructed frames corresponding to the adjacent frames, wherein the adjacent frames are symmetric to each frame in the vertically interpolated reconstructed frames;
s242: obtaining the deformable convolution offset required by the adjacent frame through offset network learning; specifically, S242 further includes the following steps:
s2421: inputting said vertically interpolated reconstructed frame to an offset learning unit comprising a fifth convolution layer, a U-Net structure;
s2422: reducing the feature maps of the adjacent frames from 2 times to 1 times by the fifth convolution layer;
s2423: and obtaining the deformable convolution offset required by the adjacent frame through an offset learning unit of the U-Net structure according to the adjacent frame after the dimension reduction.
S243: according to the deformable convolution deviation, sequentially aligning the adjacent frames through a deformable convolution layer to obtain aligned frame characteristics;
s244: and performing fusion operation on the aligned frame features through a fourth convolution layer to obtain the time sequence fusion feature frame.
S3: and removing compression noise from the time sequence fusion characteristic frame to obtain a reconstructed output image. Specifically, S3 further includes the following steps:
s31: inputting the time sequence fusion characteristic frame into a convolutional neural network comprising a plurality of multi-scale blocks to obtain a multi-layer output result, and connecting the multi-layer output result to an output layer from a residual error for accumulation to obtain an enhanced processing characteristic frame;
s32: meanwhile, inputting the time sequence fusion feature frame into a convolutional neural network comprising a plurality of second residual blocks to obtain an information reuse feature frame;
s33: and accumulating the enhanced processing characteristic frame and the information reuse characteristic frame to obtain an accumulated image characteristic frame.
S34: and performing super-resolution reconstruction on the accumulated image characteristic frame to obtain the reconstructed output image.
As a possible implementation manner, the network training is carried out on the whole neural network, the network training is end-to-end whole training,
the samples of the network training use the synthesized frame sequence subjected to the interleaving degradation processing as a training set, and the loss function L in the whole training process is as follows:
L = √(‖Ô_t − O_t‖² + ε²)

wherein Ô_t is the output image, O_t is the reference image, t is the timestamp, and ε = 1 × 10⁻³.
Through the above steps, the method can uniformly realize de-interlacing, de-compression, frame interpolation and super-resolution of old videos. Through vertical field interpolation of multiple fields and efficient time sequence alignment, the limited field information is jointly utilized to effectively remove interlacing jaggies; meanwhile, the full utilization of temporally redundant information further eliminates complex artifacts such as compression artifacts and recovers as many high-frequency details as possible, thereby restoring a high-resolution image with high visual quality.
Example 2:
referring to fig. 2 and fig. 3, the present invention further provides a video processing system, including:
an acquisition module for acquiring initial interleaved input frames;
a multi-field fusion alignment de-interlacing module 1, configured to vertically interpolate the initial interlaced input frame to a complete frame resolution, and perform time sequence alignment fusion to obtain a time sequence fusion feature frame;
specifically, the multi-field fusion alignment de-interlacing module 1 includes:
a field splitting unit 3 for splitting the initial interleaved input frame into an odd field and an even field;
the feature extraction unit 4 is connected with the field splitting unit 3 and is used for acquiring depth features corresponding to the odd field and the even field; the feature extraction unit 4 is a feature extraction network based on a deep neural network, and the feature extraction network sequentially comprises:
an input layer for converting the images corresponding to the odd field and the even field into 1 times of feature maps by a first convolution layer;
the first residual blocks are connected with the input layer through residual errors and used for carrying out feature extraction and reconstruction on the feature maps with the quantity being 1 time that of the feature maps to obtain depth features corresponding to the odd fields and the even fields;
an output layer for outputting the depth feature through a second convolution layer.
A vertical up-sampling unit 5 connected to the feature extraction unit 4, configured to perform vertical interpolation on the depth features to obtain a vertically interpolated reconstructed frame; the vertical upsampling unit 5 is a vertical upsampling network based on a deep neural network, and the vertical upsampling network sequentially comprises:
an input layer for increasing the 1-fold number of feature maps to 2-fold number of feature maps by a third convolution layer;
and the vertical pixel scrambling block is used for performing up-sampling on the feature maps with the quantity being 2 times in the vertical direction to obtain the vertically interpolated reconstructed frame.
And the multi-frame alignment fusion unit 6 is connected with the vertical up-sampling unit 5 and is used for performing time sequence alignment fusion on the vertically interpolated reconstructed frame to obtain the time sequence fusion characteristic frame. The method specifically comprises the following steps:
a concatenation unit configured to concatenate an adjacent frame with a frame of the vertically interpolated reconstructed frames corresponding to the adjacent frame, where the adjacent frame is symmetric to each of the vertically interpolated reconstructed frames;
the deformable convolution offset unit is used for obtaining the deformable convolution offset required by the adjacent frame through offset network learning; specifically, the offset network is an offset network based on a convolutional neural network, and the offset network sequentially includes:
an input layer for reducing the feature maps of the adjacent frames from 2 times to 1 times by a fifth convolution layer;
the offset learning unit of the U-Net structure is used for obtaining the deformable convolution offset required by the adjacent frame according to the adjacent frame after the dimension reduction;
the offset learning unit of the U-Net structure comprises: 3 downsample blocks and 3 upsample blocks, wherein each downsample block comprises, in order, a sixth convolutional layer, a first LRelu activation function, a seventh convolutional layer, and a second LRelu activation line number; each up-sampling block comprises an eighth convolution layer, a third LRelu activating function, a ninth convolution layer, a fourth LRelu function and a bilinear interpolation operation in sequence; each of the downsample blocks and the corresponding upsample block are connected by a residual.
The alignment unit is used for sequentially aligning the adjacent frames through a deformable convolution layer according to the deformed convolution offset to obtain aligned frame characteristics;
and the fusion unit is used for carrying out fusion operation on the aligned frame features through a fourth convolution layer to obtain the time sequence fusion feature frame.
And the de-interlacing feature optimization module 2 is used for removing compression noise from the time sequence fusion feature frame to obtain a reconstructed output image.
Specifically, the de-interlacing feature optimization module includes:
the multi-scale enhancement unit 7 is used for performing high-frequency enhancement processing on the time sequence fusion feature frame to obtain an enhanced processing feature frame; specifically, the multi-scale enhancement unit is a convolutional neural network-based multi-scale enhancement network, the multi-scale enhancement network includes multiple multi-scale blocks, each multi-scale block is formed by stacking a tenth convolutional layer, an eleventh convolutional layer and a twelfth convolutional layer, and outputs of the tenth convolutional layer, the eleventh convolutional layer and the twelfth convolutional layer are connected to a tail through residual errors for accumulation.
A depth residual error unit 8, configured to perform depth information reuse on the time sequence fusion feature frame to obtain an information reuse feature frame; specifically, the depth residual unit is a depth residual network based on a convolutional neural network, the depth residual network includes a plurality of second residual blocks, and each of the second residual blocks includes a thirteenth convolutional layer, a Relu activation function, and a fourteenth convolutional layer; the deep residual network further comprises a residual connection from the input to the plurality of second residual blocks.
And an accumulation unit 9, configured to accumulate the enhanced processing feature frame and the information reuse feature frame to obtain an accumulated image feature.
And the up-sampling unit 10 is used for performing super-resolution reconstruction on the accumulated image characteristics to obtain a reconstructed output image. Specifically, the upsampling unit is an upsampling network based on a convolutional neural network, and the upsampling network sequentially includes:
an input layer for increasing the dimension of the reduced 1-fold number of feature maps to 2-fold number of feature maps by a fifteenth convolution layer;
a sub-pixel convolution layer for performing 2 times of upsampling on the feature map with the quantity being 2 times to obtain the feature map with the quantity being 1 time after the upsampling;
an output layer for outputting the up-sampled 1-fold number of feature maps through a sixteenth convolution layer.
As a possible implementation manner, the system further includes a network training unit, which is used for performing network training on the whole network, wherein the network training is end-to-end whole training,
the sample of the network training uses a synthesized frame sequence of high-quality high-resolution video subjected to interleaving and quality degradation processing as a training set, wherein the quality degradation refers to changing a common clear video frame into a fuzzy interleaved frame so as to simulate the effect of old video, and a loss function L in the whole training process is as follows:
L = √(‖Ô_t − O_t‖² + ε²)

wherein Ô_t is the output image, O_t is the reference image, t is the timestamp, and ε = 1 × 10⁻³.
In a specific implementation, the acquisition module acquires N consecutive interlaced frames, where N is an odd number greater than or equal to 3; for convenience of description, N = 3 in the following example. For 3 consecutive interlaced frames I_{t−1}, I_t, I_{t+1}, the intermediate frame I_t is taken as the reference frame and the two frames symmetric about it as the adjacent frames, all with a resolution of H × W, where H is the height and W is the width. The odd lines of each frame form its odd field and the even lines its even field, so field separation yields six temporally adjacent fields with resolution H/2 × W. For these fields, the corresponding depth features are acquired using the feature extraction unit 4.
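As an illustration of the field-splitting step described above, the following minimal sketch separates a frame tensor into its odd and even fields by row slicing. It assumes PyTorch tensors of shape (C, H, W); the function name and the example frame size are illustrative only.

```python
import torch

def split_fields(frame: torch.Tensor):
    """Split an interlaced frame of shape (C, H, W) into its two fields,
    each of shape (C, H/2, W): one from the odd lines, one from the even lines."""
    odd_field = frame[:, 0::2, :]   # rows 1, 3, 5, ... in 1-based line numbering
    even_field = frame[:, 1::2, :]  # rows 2, 4, 6, ...
    return odd_field, even_field

# Three consecutive interlaced frames yield six temporally adjacent fields.
frames = [torch.randn(3, 480, 640) for _ in range(3)]
fields = [field for frame in frames for field in split_fields(frame)]
print(len(fields), fields[0].shape)  # 6 torch.Size([3, 240, 640])
```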
It should be noted that the feature extraction unit 4 in the present invention is a deep neural network, for example, an example of the feature extraction unit 4 in the present invention is a feature extraction network based on a convolutional neural network, and the specific structure sequentially includes:
an input layer, specifically a convolution layer of 3 × 3 size, converts the input image into a 64-layer feature map;
5 residual blocks, wherein each residual block specifically consists of a 3 × 3 convolution layer, a ReLU activation function layer and a 3 × 3 convolution layer, with 64 feature channels, and is used for feature extraction and reconstruction;
a residual join from after the input layer to after the residual block.
An output layer, specifically a 3 x 3 sized convolutional layer.
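A minimal PyTorch sketch of a feature extraction network with this structure (3 × 3 input convolution, 5 residual blocks with 64 channels, a skip connection around the blocks, and a 3 × 3 output convolution) is given below. The class names and default arguments are illustrative assumptions, not the patented implementation.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """3x3 convolution, ReLU, 3x3 convolution, with a local skip connection."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.body(x)

class FeatureExtractor(nn.Module):
    """3x3 input convolution -> 5 residual blocks -> skip connection -> 3x3 output convolution."""

    def __init__(self, in_channels: int = 3, channels: int = 64):
        super().__init__()
        self.head = nn.Conv2d(in_channels, channels, 3, padding=1)
        self.blocks = nn.Sequential(*[ResidualBlock(channels) for _ in range(5)])
        self.tail = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, field: torch.Tensor) -> torch.Tensor:
        shallow = self.head(field)              # 64-layer shallow features
        deep = self.blocks(shallow) + shallow   # residual connection around the 5 blocks
        return self.tail(deep)

# Each of the six H/2 x W fields is mapped to a 64-layer depth feature.
print(FeatureExtractor()(torch.randn(1, 3, 240, 640)).shape)  # torch.Size([1, 64, 240, 640])
```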
After the depth features are acquired, the features are vertically interpolated using a vertical upsampling unit to restore the original resolution and reduce the interleaved comb artifacts.
The vertical upsampling unit 5 in the present invention is a vertical upsampling network based on a deep neural network. Specifically, the invention provides one of the vertical field interpolation network examples based on the convolutional neural network, and the structures of the vertical field interpolation network examples sequentially comprise:
an input layer, specifically a convolution layer with a size of 3 × 3, which increases the feature maps of 64 layers in the previous stage to 128 layers;
a vertical pixel scrambling block for up-sampling the image in the vertical direction. It should be noted that, the Pixel Scrambling (PS) module is widely used in super-resolution networks, and our vertical pixel scrambling module only multiplies the features by 2 in the vertical direction.
Six consecutive reconstructed feature frames F_t, t = 1, 2, ..., 6, with resolution H × W are then obtained.
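The vertical pixel scrambling operation can be sketched as a pixel-shuffle rearrangement restricted to the height dimension, preceded by the 3 × 3 convolution that lifts 64 channels to 128. The sketch below assumes the operation behaves like torch.nn.PixelShuffle with an upscale factor of 2 applied only vertically; names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class VerticalPixelShuffle(nn.Module):
    """Pixel shuffle that only expands the vertical dimension:
    (B, r*C, H, W) -> (B, C, r*H, W)."""

    def __init__(self, scale: int = 2):
        super().__init__()
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        r = self.scale
        x = x.view(b, c // r, r, h, w)          # split the channels into (C, r)
        x = x.permute(0, 1, 3, 2, 4)            # (B, C, H, r, W)
        return x.reshape(b, c // r, h * r, w)   # interleave the r rows vertically

# 3x3 convolution lifts 64 -> 128 channels, then the shuffle restores the full frame height.
vertical_up = nn.Sequential(nn.Conv2d(64, 128, 3, padding=1), VerticalPixelShuffle(2))
field_feat = torch.randn(1, 64, 240, 640)       # H/2 x W field features
print(vertical_up(field_feat).shape)            # torch.Size([1, 64, 480, 640])
```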
The interlaced artifacts have been removed preliminarily from the reconstructed frames in the above steps, but since only the information of a single field is utilized, a large amount of temporal and spatial feature information is missing, and many high-frequency details cannot be recovered. For this reason we need to aggregate more timing information to enrich the feature information.
Since occlusions and large-scale motion in a video frame sequence seriously affect the performance of the model, the frames need to be aligned in time sequence in order to make full use of the temporal information of the multiple reconstructed frames.
The multi-frame alignment fusion unit 6 of the present invention is shown in fig. 3. For the consecutive reconstructed frames F_1, ..., F_6 obtained in the previous step, our goal is to obtain fusion features F_3* and F_4* that carry time sequence information. To this end, all frames adjacent to a target frame must first be aligned with it. Because of the various complex artifacts, traditional optical-flow alignment cannot learn an accurate flow well enough to achieve alignment, so implicit alignment based on deformable convolution is adopted, and an efficient offset-learning structure is designed to learn more adaptive offsets. Taking F_3* as an example, each adjacent frame is first concatenated with F_3, and the deformable convolution offset ΔP_j required for alignment is learned by the offset network; denoting the offset network by R,

ΔP_j = R([F_3, F_j]), j = 1, 2, 4, 5
The offset network R in the invention is based on a convolutional neural network; its specific structure comprises, in order:
An input layer, specifically a 3 × 3 convolution layer, which reduces the dimension of the concatenated adjacent-frame features from 128 layers to 64 layers;
an offset learning unit of U-Net structure comprises 3 downsampling blocks and 3 upsampling blocks, wherein the downsampling blocks sequentially comprise a 3 x 3 convolutional layer with an expansion rate of 2, an LRelu activating function, a 3 x 3 convolutional layer with an expansion rate of 1 and an LRelu activating function. The upsampling block comprises, in order, a 3 x 3 convolutional layer with a spreading factor of 1, an LRelu activation function and a bilinear interpolation operation with a scaling factor of 2. Wherein each downsample block and corresponding upsample block residual are concatenated.
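A rough PyTorch sketch of such an offset network follows. It assumes that the "expansion rate of 2" in the downsampling block corresponds to a stride of 2 (so that the bilinear ×2 interpolation in the upsampling block restores the resolution), that the LReLU negative slope is 0.1, and that the output has the 18 channels required by a 3 × 3 deformable convolution; none of these details are stated explicitly in the description.

```python
import torch
import torch.nn as nn

def down_block(ch: int = 64) -> nn.Sequential:
    # "expansion rate of 2" is read here as stride 2, so each block halves the resolution
    return nn.Sequential(
        nn.Conv2d(ch, ch, 3, stride=2, padding=1),
        nn.LeakyReLU(0.1, inplace=True),
        nn.Conv2d(ch, ch, 3, stride=1, padding=1),
        nn.LeakyReLU(0.1, inplace=True),
    )

def up_block(ch: int = 64) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(ch, ch, 3, padding=1),
        nn.LeakyReLU(0.1, inplace=True),
        nn.Conv2d(ch, ch, 3, padding=1),
        nn.LeakyReLU(0.1, inplace=True),
        nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
    )

class OffsetUNet(nn.Module):
    """Offset network R: concatenated target/neighbour features -> deformable-conv offsets."""

    def __init__(self, ch: int = 64, kernel_size: int = 3):
        super().__init__()
        self.reduce = nn.Conv2d(2 * ch, ch, 3, padding=1)           # 128 -> 64 input layer
        self.downs = nn.ModuleList(down_block(ch) for _ in range(3))
        self.ups = nn.ModuleList(up_block(ch) for _ in range(3))
        self.to_offset = nn.Conv2d(ch, 2 * kernel_size * kernel_size, 3, padding=1)

    def forward(self, target_feat: torch.Tensor, neighbour_feat: torch.Tensor) -> torch.Tensor:
        x = self.reduce(torch.cat([target_feat, neighbour_feat], dim=1))
        skips = []
        for down in self.downs:
            skips.append(x)          # keep the feature at this scale for the residual link
            x = down(x)
        for up, skip in zip(self.ups, reversed(skips)):
            x = up(x) + skip         # residual connection between matching scales
        return self.to_offset(x)
```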
After the deformable convolution offsets ΔP_j are obtained, the adjacent frames are aligned in turn to obtain the aligned frame features A_j:

A_j = D(F_j, ΔP_j), j = 1, 2, 4, 5

where D is the alignment operation, performed by a deformable convolution layer.

The aligned frame features A_j are then passed through a fusion operation to obtain the final output fusion feature F_3* of this step. In the invention, the fusion operation is a 1 × 1 convolution layer, which fuses the groups of features and reduces their dimensionality to 64 layers.
The fusion feature F_4* is obtained in exactly the same way as F_3*, with adjacent-frame indices j = 2, 3, 5, 6.
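Combining an offset network of the kind sketched above with torchvision's deformable convolution, the alignment and fusion step can be sketched as follows. Whether the target feature itself joins the final concatenation is an assumption (the text only states that the aligned features A_j are fused by a 1 × 1 convolution), and all names are illustrative.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class AlignAndFuse(nn.Module):
    """ΔP_j = R([F_target, F_j]);  A_j = D(F_j, ΔP_j);  fusion by a 1x1 convolution."""

    def __init__(self, offset_net: nn.Module, ch: int = 64, num_neighbours: int = 4):
        super().__init__()
        self.offset_net = offset_net                        # R, e.g. the OffsetUNet sketched above
        self.deform = DeformConv2d(ch, ch, 3, padding=1)    # D, the deformable alignment layer
        # assumption: the target feature is concatenated with the aligned neighbours before fusion
        self.fuse = nn.Conv2d((num_neighbours + 1) * ch, ch, 1)

    def forward(self, target_feat: torch.Tensor, neighbour_feats: list) -> torch.Tensor:
        aligned = []
        for f_j in neighbour_feats:                         # j = 1, 2, 4, 5 when the target is F_3
            offset = self.offset_net(target_feat, f_j)      # learn the deformable-conv offsets
            aligned.append(self.deform(f_j, offset))        # align the neighbour to the target
        return self.fuse(torch.cat([target_feat] + aligned, dim=1))   # fused feature F*
```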
The obtained fusion features contain rich spatio-temporal information and largely eliminate the interlacing artifacts, but the image features still suffer from complex degradations such as blur and compression block artifacts. To further learn depth features for detail restoration, the fusion features output in the previous stage are fed into the de-interlacing feature optimization module for further feature optimization and are up-sampled to increase the resolution.
The de-interlacing feature optimization module 2 of the present invention further comprises a multi-scale enhancement unit 7, a depth residual connection unit 8 and an up-sampling unit 10, as shown in fig. 3. For the input fusion feature F, the up-sampling unit connected at the tail of the dual-stream network consisting of the multi-scale enhancement unit and the depth residual unit is used to obtain the final reconstructed image Ô_t. Denoting the multi-scale enhancement unit by M, the depth residual connection unit by S and the up-sampling unit by U, namely:

Ô_t = U(M(F) + S(F))
the multi-scale enhancement unit 7 in the invention is a method based on a convolutional neural network, and the specific structure of the embodiment is as follows:
3 multi-scale blocks, each multi-scale block being stacked of one 3 x 3 convolutional layer with a span of 1, one 3 x 3 convolutional layer with a span of 2, one 3 x 3 convolutional layer with a span of 1 and one 3 x 3 convolutional layer with a span of 2, the output of each layer being connected by a residual to a tail accumulation.
The depth residual error unit 8 in the present invention is also a method based on a convolutional neural network, and its specific structure is as follows:
6 residual blocks, each of which consists of a 3 × 3 convolutional layer, a Relu activation function and a 3 × 3 convolutional layer;
a residual concatenation from the input to the output of the residual block;
the multi-scale enhancement unit 7 aims at enhancing the representation of high frequency details by a multi-scale stacked reception field, and the depth residual unit 8 aims at further extracting spatio-temporal features and avoiding information loss. The outputs of the two units are accumulated and input into an up-sampling unit 10 for super-resolution reconstruction.
The upsampling unit 10 in the present invention may be a conventional upsampling algorithm, such as interpolation or the like; or an up-sampling network based on a deep neural network.
Specifically, the present invention provides one example of an upsampling network based on a convolutional neural network, and the structure of the upsampling network sequentially includes:
an input layer, specifically a convolution layer with a size of 3 × 3, for increasing the feature maps of 64 layers in the previous stage to 128 layers;
a sub-pixel convolution layer with a magnification factor of 2 for up-sampling the image by a factor of 2;
An output layer, specifically a 3 × 3 convolution layer, which reconstructs the 64-layer feature map into the output result Ô_t.
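A sketch of such a ×2 sub-pixel up-sampling head is shown below. The intermediate channel count before the pixel shuffle (128 → 256) is an assumption made so that the standard PixelShuffle arithmetic yields a 64-layer feature map, since the layer counts in the text are not fully explicit.

```python
import torch
import torch.nn as nn

class SubPixelUpsampler(nn.Module):
    """x2 up-sampling head: input convolution (64 -> 128), sub-pixel convolution
    (convolution + PixelShuffle), and a 3x3 output convolution."""

    def __init__(self, ch: int = 64, out_ch: int = 3):
        super().__init__()
        self.head = nn.Conv2d(ch, 2 * ch, 3, padding=1)    # 64 -> 128 feature layers
        # sub-pixel convolution: lift to 4*ch channels, then rearrange into 2x spatial resolution
        self.subpixel = nn.Sequential(nn.Conv2d(2 * ch, 4 * ch, 3, padding=1), nn.PixelShuffle(2))
        self.tail = nn.Conv2d(ch, out_ch, 3, padding=1)    # 64-layer feature map -> output image

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.tail(self.subpixel(self.head(x)))

up = SubPixelUpsampler()
print(up(torch.randn(1, 64, 480, 640)).shape)  # torch.Size([1, 3, 960, 1280])
```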
It should be noted that the vertical up-sampling unit 5 in the present invention may be trained separately in advance, or may be integrated into the entire network for end-to-end training. The training samples for the entire network use frame sequences synthesized from high-quality high-resolution video through interlacing degradation as the training set. The main constraint is that the final reconstructed high-resolution result should be consistent with the original unprocessed sample image, and the training loss function is as follows:
L = √(‖Ô_t − O_t‖² + ε²)

wherein Ô_t is the output image, O_t is the reference image, t is the timestamp, and ε is empirically set to 1 × 10⁻³. This is the Charbonnier loss, which is commonly used in image enhancement and reconstruction.
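A per-pixel Charbonnier loss of this form can be sketched as follows; averaging over the pixels of the batch is an implementation choice rather than something the text specifies.

```python
import torch

def charbonnier_loss(output: torch.Tensor, reference: torch.Tensor, eps: float = 1e-3) -> torch.Tensor:
    """L = sqrt((Ô_t - O_t)^2 + eps^2), averaged over all pixels of the batch."""
    return torch.sqrt((output - reference) ** 2 + eps ** 2).mean()

# Example: compare a reconstructed frame against its reference frame.
pred, ref = torch.rand(1, 3, 960, 1280), torch.rand(1, 3, 960, 1280)
print(charbonnier_loss(pred, ref).item())
```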
Different from traditional single-purpose de-interlacing methods, the method can uniformly realize de-interlacing, de-compression, frame interpolation and super-resolution of old videos. Through vertical field interpolation of multiple fields and efficient time sequence alignment, the limited field information is jointly utilized to effectively remove interlacing jaggies; meanwhile, the full utilization of temporally redundant information further eliminates complex artifacts such as compression artifacts and recovers as many high-frequency details as possible, thereby restoring a high-resolution image with high visual quality.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (7)

1. A video processing method, comprising the steps of:
obtaining an initial interleaved input frame;
vertically interpolating the initial interlaced input frame to the complete frame resolution, and performing time sequence alignment fusion to obtain a time sequence fusion feature frame;
carrying out compression noise removal processing on the time sequence fusion characteristic frame to obtain a reconstructed output image;
the step of vertically interpolating the initial interleaved input frames to a complete frame resolution and performing time sequence alignment fusion to obtain a time sequence fusion feature frame specifically comprises the following steps:
splitting the initial interlaced input frame into an odd field and an even field;
acquiring depth features corresponding to the odd field and the even field by adopting a feature extraction network based on a depth neural network;
adopting a vertical up-sampling network based on a depth neural network to carry out vertical interpolation on the depth characteristics to obtain a vertical interpolation reconstruction frame;
performing time sequence alignment fusion on the vertically interpolated reconstructed frame to obtain a time sequence fusion characteristic frame;
the method for acquiring the depth features corresponding to the odd field and the even field by adopting the feature extraction network based on the deep neural network specifically comprises the following steps:
acquiring images corresponding to the odd field and the even field;
converting the image into 1 times of feature maps through a first convolution layer, and transmitting the 1 times of feature maps to a plurality of first residual blocks connected with the first convolution layer through residual errors;
the first residual blocks perform feature extraction and reconstruction on the feature maps with the quantity being 1 time of that of the feature maps to obtain depth features corresponding to the odd fields and the even fields;
outputting the depth feature through a second convolution layer.
2. The video processing method according to claim 1, wherein said vertically interpolating the depth features using a vertical upsampling network based on a depth neural network to obtain a vertically interpolated reconstructed frame comprises the following steps:
inputting said depth features into a convolutional neural network comprising a third convolutional layer and a vertical pixel scrambling block;
increasing the dimension of the 1-fold number of feature maps to 2-fold number of feature maps by the third convolutional layer, and transmitting the 2-fold number of feature maps to the vertical pixel scrambling block;
and upsampling the feature maps with the quantity of 2 times in the vertical direction through the vertical pixel scrambling block to obtain the vertical interpolation reconstruction frame.
3. The video processing method according to claim 2, wherein performing time-series alignment fusion on the vertically interpolated reconstructed frame to obtain the time-series fusion feature frame specifically comprises the following steps:
connecting adjacent frames in series with frames in the vertically interpolated reconstructed frames corresponding to the adjacent frames, wherein the adjacent frames are symmetric to each frame in the vertically interpolated reconstructed frames;
obtaining the deformable convolution offset required by the adjacent frame through offset network learning;
according to the deformable convolution deviation, sequentially aligning the adjacent frames through a deformable convolution layer to obtain aligned frame characteristics;
and performing fusion operation on the aligned frame features through a fourth convolution layer to obtain the time sequence fusion feature frame.
4. The video processing method according to claim 3, wherein said obtaining the deformable convolution offset required for the adjacent frame through offset net learning specifically comprises the steps of:
inputting said vertically interpolated reconstructed frame to an offset learning unit comprising a fifth convolution layer, a U-Net structure;
reducing the feature maps of the adjacent frames from 2 times to 1 times by the fifth convolution layer;
and according to the adjacent frames after the dimension reduction, obtaining the deformable convolution offset required by the adjacent frames through an offset learning unit of the U-Net structure.
5. The video processing method according to claim 4, wherein the step of removing compression noise from the time-series fusion feature frame to obtain a reconstructed output image specifically comprises the steps of:
inputting the time sequence fusion characteristic frame into a convolutional neural network comprising a plurality of multi-scale blocks to obtain a multi-layer output result, and connecting the multi-layer output result to an output layer from a residual error for accumulation to obtain an enhanced processing characteristic frame;
meanwhile, inputting the time sequence fusion characteristic frame into a convolutional neural network comprising a plurality of second residual error blocks to obtain an information reuse characteristic frame;
accumulating the enhanced processing characteristic frame and the information reuse characteristic frame to obtain an accumulated image characteristic frame;
and performing super-resolution reconstruction on the accumulated image characteristic frame to obtain a reconstructed output image.
6. The video processing method according to any of claims 1-5, further comprising network training the entire neural network, said network training being end-to-end overall training,
the sample of the network training uses the synthesized frame sequence subjected to the interleaving and quality reduction processing as a training set, and a loss function L in the whole training process is as follows:
L = √(‖Ô_t − O_t‖² + ε²)

wherein Ô_t is the output image, O_t is the reference image, t is the timestamp, and ε is a constant.
7. A video processing system, comprising:
an acquisition module for acquiring an initial interlaced input frame;
a multi-field fusion alignment de-interlacing module used for vertically interpolating the initial interlaced input frame to a complete frame resolution ratio and carrying out time sequence alignment fusion to obtain a time sequence fusion characteristic frame;
the de-interlacing feature optimization module is used for removing compression noise from the time sequence fusion feature frame to obtain a reconstructed output image;
the multi-field fusion alignment de-interlacing module comprises:
a field splitting unit for splitting the initial interlaced input frame into an odd field and an even field;
the characteristic extraction unit is connected with the field splitting unit and used for acquiring depth characteristics corresponding to the odd field and the even field; the feature extraction unit is a feature extraction network based on a deep neural network, and the feature extraction network sequentially comprises:
an input layer for converting the images corresponding to the odd field and the even field into characteristic maps of 1 time number by a first convolution layer;
the first residual blocks are connected with the input layer through residual errors and used for carrying out feature extraction and reconstruction on the feature maps with the quantity being 1 time that of the feature maps to obtain depth features corresponding to the odd fields and the even fields;
an output layer for outputting said depth features through a second convolutional layer;
the vertical up-sampling unit is connected with the feature extraction unit and is used for performing vertical interpolation on the depth features to obtain a vertical interpolation reconstruction frame;
the multi-frame alignment fusion unit is connected with the vertical up-sampling unit and is used for carrying out time sequence alignment fusion on the vertically interpolated reconstructed frame to obtain the time sequence fusion characteristic frame;
the de-interlacing feature optimization module comprises:
the multi-scale enhancement unit is used for carrying out high-frequency enhancement processing on the time sequence fusion feature frame to obtain an enhanced processing feature frame;
the depth residual error unit is used for reusing depth information of the time sequence fusion characteristic frame to obtain an information reuse characteristic frame;
the accumulation unit is used for accumulating the enhanced processing characteristic frame and the information reuse characteristic frame to obtain an accumulated image characteristic;
and the up-sampling unit is used for carrying out super-resolution reconstruction on the accumulated image characteristics to obtain the reconstructed output image.
CN202011611610.7A 2020-12-30 2020-12-30 Video processing method and system Active CN112750094B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011611610.7A CN112750094B (en) 2020-12-30 2020-12-30 Video processing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011611610.7A CN112750094B (en) 2020-12-30 2020-12-30 Video processing method and system

Publications (2)

Publication Number Publication Date
CN112750094A CN112750094A (en) 2021-05-04
CN112750094B true CN112750094B (en) 2022-12-09

Family

ID=75649732

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011611610.7A Active CN112750094B (en) 2020-12-30 2020-12-30 Video processing method and system

Country Status (1)

Country Link
CN (1) CN112750094B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117730339A (en) * 2021-07-01 2024-03-19 抖音视界有限公司 Super-resolution positioning and network structure
CN114092339B (en) * 2022-01-24 2022-05-20 南京理工大学 Space-time video super-resolution reconstruction method based on cross-frame self-attention transformation network
CN115348432B (en) * 2022-08-15 2024-05-07 上海壁仞科技股份有限公司 Data processing method and device, image processing method, electronic equipment and medium
CN115150201B (en) * 2022-09-02 2022-11-08 南通市艺龙科技有限公司 Remote encryption transmission method for cloud computing data
CN115633144A (en) * 2022-10-21 2023-01-20 抖音视界有限公司 Video processing method and device, electronic equipment and storage medium
CN115994857B (en) * 2023-01-09 2023-10-13 深圳大学 Video super-resolution method, device, equipment and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101477684A (en) * 2008-12-11 2009-07-08 西安交通大学 Process for reconstructing human face image super-resolution by position image block
CN101621652A (en) * 2009-07-21 2010-01-06 上海华平信息技术股份有限公司 Method for transmitting interlaced picture in high quality and changing the interlaced picture into non-interlaced picture in picture transmission system
CN102204250A (en) * 2008-08-29 2011-09-28 GVBB Holdings Co., Ltd. Encoding method, encoding device, and encoding program for encoding interlaced image
CN104519363A (en) * 2013-09-26 2015-04-15 汤姆逊许可公司 Video encoding/decoding methods, corresponding computer programs and video encoding/decoding devices
WO2019009488A1 (en) * 2017-07-06 2019-01-10 Samsung Electronics Co., Ltd. Method and device for encoding or decoding image
CN111524068A (en) * 2020-04-14 2020-08-11 长安大学 Variable-length input super-resolution video reconstruction method based on deep learning
CN111583112A (en) * 2020-04-29 2020-08-25 华南理工大学 Method, system, device and storage medium for video super-resolution
CN112102163A (en) * 2020-08-07 2020-12-18 南京航空航天大学 Continuous multi-frame image super-resolution reconstruction method based on multi-scale motion compensation framework and recursive learning


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Odd-even mass differences from self-consistent mean field theory; G.F. Bertsch et al.; PHYSICAL REVIEW; 2009-03-09; entire document *
Research on de-interlacing algorithms for ghost images; Jing Wenbo et al.; Journal of Changchun University of Science and Technology (Natural Science Edition); 2008-09-30; Vol. 31, No. 3; entire document *
Video resolution enhancement technology and its GPU implementation; Bi Yulai; China Master's Theses Full-text Database, Information Science and Technology; 2020-01-15; entire document *

Also Published As

Publication number Publication date
CN112750094A (en) 2021-05-04

Similar Documents

Publication Publication Date Title
CN112750094B (en) Video processing method and system
CN102714726B (en) Edge enhancement for temporal scaling with metadata
DE69120139T2 (en) Device and method for the adaptive compression of successive blocks of a digital video signal
US6118488A (en) Method and apparatus for adaptive edge-based scan line interpolation using 1-D pixel array motion detection
EP2101506B1 (en) Image processing apparatus and method for format conversion
CN111885280B (en) Hybrid convolutional neural network video coding loop filtering method
CN110062232A (en) A kind of video-frequency compression method and system based on super-resolution
CN113055674B (en) Compressed video quality enhancement method based on two-stage multi-frame cooperation
CN113850718A (en) Video synchronization space-time super-resolution method based on inter-frame feature alignment
CN115689917A (en) Efficient space-time super-resolution video compression restoration method based on deep learning
CN102447887B (en) Method and device for changing image length-width ratio in video signals of security system
Kwon et al. A motion-adaptive de-interlacing method
US6804303B2 (en) Apparatus and method for increasing definition of digital television
CN112862675A (en) Video enhancement method and system for space-time super-resolution
CN111860363A (en) Video image processing method and device, electronic equipment and storage medium
KR101979584B1 (en) Method and Apparatus for Deinterlacing
EP1590964B1 (en) Video coding
US9049448B2 (en) Bidimensional bit-rate reduction processing
KR101037023B1 (en) High resolution interpolation method and apparatus using high frequency synthesis based on clustering
CN115861078B (en) Video enhancement method and system based on bidirectional space-time recursion propagation neural network
CN113852830B (en) Median filtering video de-interlacing method
US20080225162A1 (en) Method and Apparatus for Deinterlaccing of Video Usung Motion Compensated Temporal Interpolation
CN100336393C (en) Low bit rate compression format conversion for improved resolution
Kim et al. Image coding based on selective super-resolution network
CN115633145A (en) Video de-interlacing and super-parting method based on circulating Unet network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant