CN116091868A - Online video anti-shake device, online video anti-shake method and learning method thereof - Google Patents

Online video anti-shake device, online video anti-shake method and learning method thereof

Info

Publication number
CN116091868A
CN116091868A (application CN202310102762.1A)
Authority
CN
China
Prior art keywords
video
frame
shake
motion
inter
Prior art date
Legal status
Pending
Application number
CN202310102762.1A
Other languages
Chinese (zh)
Inventor
刘帅成
张卓凡
刘震
曾兵
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202310102762.1A priority Critical patent/CN116091868A/en
Publication of CN116091868A publication Critical patent/CN116091868A/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 Details of television systems
    • H04N 5/04 Synchronising
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/62 Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems


Abstract

The invention discloses an online video anti-shake device, an online video anti-shake method and a learning method thereof, belonging to the technical field of video processing. The learning method for video anti-shake comprises the following steps: acquiring training data; and training a neural network model on the training data. Acquiring training data comprises: obtaining a shaky video and a stable video; extracting the first inter-frame motion of the shaky video; transforming each frame of the stable video based on the first inter-frame motion of the shaky video to obtain a processed video; and using the stable video and the processed video as training data. The learning method transfers the motion of a shaky video onto a stable video, synthesizing an unstable video that corresponds to the original stable video; the original stable video and its corresponding unstable video are then used as the training data required by the video anti-shake method. The invention does not require the stable video and the shaky video to be shot synchronously, and their picture content may be unrelated.

Description

Online video anti-shake device, online video anti-shake method and learning method thereof
Technical Field
The invention belongs to the technical field of video processing, and particularly relates to an online video anti-shake device, an online video anti-shake method and a learning method thereof.
Background
Video anti-shake aims to convert a shaky video into a satisfactory stable video by smoothing the camera trajectory, and is now widely applied in fields such as smartphones, unmanned aerial vehicles and security. Video anti-shake currently falls into three main categories: mechanical anti-shake, optical anti-shake and digital anti-shake. Mechanical anti-shake typically accomplishes the task with sensors and mechanical structures. Optical anti-shake detects the angle and speed of motion through a set of lenses and sensors to achieve video stabilization. Digital anti-shake is implemented purely in software, without dedicated hardware, so digital video anti-shake can be regarded as a problem in the fields of video processing and computer vision. Because digital anti-shake relies solely on software algorithms, besides saving cost and removing the requirement for dedicated equipment, it is also the only way to stabilize video that has already been recorded.
Digital video anti-shake can be considered in two different settings: offline anti-shake and online anti-shake. In the offline case, information from all frames of the video can be used, which yields better results, especially for post-processing of recorded video. In the online case, no future frames are used for anti-shake, and the video can be stabilized immediately while it is being recorded, which makes the online anti-shake method important for real-time streaming scenarios.
The traditional digital anti-shake method detects feature points in video frames, then estimates a 2D transformation such as a homography, optical flow or mesh flow (MeshFlow), or estimates the 3D camera pose, as the motion representation, and finally smooths the camera path formed by these motions to realize video anti-shake. Conventional deep-learning-based anti-shake methods use a neural network model, such as a convolutional neural network, to directly learn the mapping from unstable video to stable video. However, the conventional approaches have the following drawbacks: 1. The traditional method is limited by the feature algorithm; feature detection and tracking can fail on low-quality video, causing anti-shake to fail. 2. Although deep-learning methods perform well on low-quality video, they depend heavily on the quality and quantity of training data, and since they usually take video frames directly as input, they are also affected by picture texture. 3. Deep-learning training data for video anti-shake is collected by dual-camera shooting: two video recording devices of identical model, one with and one without an external mechanical stabilizer, synchronously shoot stable/unstable video pairs, which is costly, inefficient and prone to path divergence.
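For illustration, the traditional feature-based pipeline described above can be sketched with standard OpenCV primitives. The following is a minimal sketch of the prior art, not the method of this application; the feature counts and smoothing radius are arbitrary choices.

```python
# Minimal sketch of the traditional feature-based anti-shake pipeline
# (illustrative prior art only; not the method of this application).
import cv2
import numpy as np

def estimate_inter_frame_transforms(frames):
    """Estimate a 2D similarity transform between consecutive frames."""
    transforms = []
    for prev, curr in zip(frames[:-1], frames[1:]):
        prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
        curr_gray = cv2.cvtColor(curr, cv2.COLOR_BGR2GRAY)
        pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                                      qualityLevel=0.01, minDistance=30)
        nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts, None)
        good_prev = pts[status.flatten() == 1]
        good_next = nxt[status.flatten() == 1]
        # Feature-based estimation; may fail on low-quality frames (drawback 1).
        m, _ = cv2.estimateAffinePartial2D(good_prev, good_next)
        transforms.append(m)
    return transforms

def smooth_path(transforms, radius=15):
    """Moving-average smoothing of the accumulated (dx, dy, angle) camera path."""
    path = np.cumsum([[m[0, 2], m[1, 2], np.arctan2(m[1, 0], m[0, 0])]
                      for m in transforms], axis=0)
    kernel = np.ones(2 * radius + 1) / (2 * radius + 1)
    padded = np.pad(path, ((radius, radius), (0, 0)), mode='edge')
    return np.stack([np.convolve(padded[:, i], kernel, mode='valid')
                     for i in range(3)], axis=1)
```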
Disclosure of Invention
The invention provides an online video anti-shake device, an online video anti-shake method and a learning method thereof, which can synthesize training data for video anti-shake tasks without dual-camera shooting.
The invention is realized by the following technical scheme:
In one aspect, the present invention provides a learning method for video anti-shake, comprising the steps of: acquiring training data; and training a neural network model on the training data. Acquiring training data comprises: obtaining a shaky video and a stable video; extracting the first inter-frame motion of the shaky video; transforming each frame of the stable video based on the first inter-frame motion of the shaky video to obtain a processed video; and using the stable video and the processed video as training data.
In some of these embodiments, the loss function of the neural network model to be trained is:
L = L_MC + α·L_SC + β·L_SP

where L_MC is the motion consistency loss function, L_SC is the shape consistency loss function, L_SP is the scale-preserving loss function, and α and β are balance parameters used to balance the contributions of the three loss functions.
In some of these embodiments, the motion consistency loss function is:

L_MC: [formula given as an image in the original publication]

where B′_t and B′_{t-1} denote the transformed field maps of two adjacent frames estimated by the network, and B̂_t and B̂_{t-1} denote the ground-truth transformed field maps of the two adjacent frames;

the shape consistency loss function is:

L_SC: [formula given as an image in the original publication]

where v_i denotes the i-th mesh vertex and N denotes the total number of mesh vertices;

the scale-preserving loss function is:

L_SP: [formula given as an image in the original publication]

where s denotes a scale factor.
In another aspect, the present application provides a minimum-delay online video anti-shake method, comprising the steps of: obtaining an unstable frame in a video; extracting, through a preset neural network model, the second inter-frame motion of the video formed by the unstable frame and the consecutive frames preceding it; performing path smoothing on the unstable frame based on the second inter-frame motion and the trained neural network model to obtain a transformed field map; and resetting the unstable frame through the transformed field map.
In some of these embodiments, resetting the unstable frame through the transformed field map comprises the steps of: adjusting the positions of all pixels on the unstable frame according to the displacement vectors of all pixel points provided by the transformed field map, to obtain a stable frame.
In some of these embodiments, the neural network model being trained is a convolutional neural network model.
In some of these embodiments, the second inter-frame motion is represented in the form of a sparse mesh. After extracting the second inter-frame motion of the video formed by the unstable frame and the preceding consecutive frames, and before performing path smoothing on the unstable frame based on the second inter-frame motion and the trained neural network model to obtain a transformed field map, the method comprises the following steps: processing the input data of the convolutional neural network model, namely interpolating the sparse mesh formed by the second inter-frame motion to obtain a flow field map, the flow field map comprising a channel dimension and height and width dimensions; and concatenating the flow field maps along the channel dimension in temporal order using the sliding window to form the input data of the convolutional neural network model.
The application also provides a minimum-delay online video anti-shake device, comprising: a motion extraction device for extracting the second inter-frame motion of the video; a path smoothing device for smoothing the path of the video; a memory having a computer program stored thereon; and a processor executing the computer program to implement the minimum-delay online video anti-shake method of any of the above embodiments.
Compared with the prior art, the invention has the following advantages:
according to the learning method for video anti-shake, the motion of one shake video is transferred to one stable video, so that an unstable video corresponding to the original stable video is synthesized, and then the original stable video and the corresponding unstable video are used as training data required by the video anti-shake method. The invention does not need to carry out synchronous shooting on the stable video and the jittering video, and the picture content can be irrelevant.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the following description will briefly describe the drawings in the embodiments, it being understood that the following drawings only illustrate some embodiments of the present invention and should not be considered as limiting the scope, and that other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic diagram of a relationship between motion and a transformed field diagram of two adjacent frames in a video anti-shake method based on a deep learning method according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a synthesis relationship of a processed video in a video anti-shake method based on a deep learning method according to an embodiment of the present invention;
fig. 3 is a flowchart of a video anti-shake method based on a deep learning method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the relationship between the motion and the transformed field maps of two adjacent frames in the loss function according to an embodiment of the present invention;
fig. 5 is a comparison chart of effects of a video anti-shake method based on a deep learning method according to an embodiment of the present invention;
FIG. 6 is a path diagram of a prior art dual camera video;
fig. 7 is a path diagram of a video anti-shake method based on a deep learning method according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention.
In the description of the present invention, it should be noted that terms such as "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner" and "outer" indicate orientations or positional relationships based on those shown in the drawings, or those conventionally assumed when the product of the present invention is used; they are merely used to facilitate and simplify the description of the present invention, and do not indicate or imply that the apparatus or elements referred to must have a specific orientation or be constructed and operated in a specific orientation, and therefore should not be construed as limiting the present invention.
Furthermore, terms such as "horizontal" and "vertical" in the description of the present invention, if any, do not denote absolutely horizontal or vertical structures; a "horizontal" structure may be slightly inclined, the term merely meaning that its direction is closer to horizontal than "vertical" is, not that the structure must be perfectly horizontal.
In the description of the present invention, it should also be noted that, unless explicitly stated and limited otherwise, the terms "disposed," "mounted," "connected," and "connected" should be construed broadly, and may be, for example, fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.
The terms "comprising" and "having," and any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or modules is not limited to only those steps or modules but may include other steps or modules not expressly listed or inherent to such process, method, article, or apparatus.
In one aspect, an embodiment of the present application provides a learning method for video anti-shake, including the following steps:
s10, training data are acquired. In S10, first, the first inter-frame motion of a jittered video is extracted using video motion estimation, the motion is expressed in the form of a grid stream, and then each frame of a stabilized video is transformed based on these first inter-frame motions, thereby obtaining a new jittered video. The first inter-frame motion of the plum is used for distinguishing from the second inter-frame motion, and the first inter-frame motion refers to the inter-frame motion of the acquired known jittering video in the process of acquiring training data; the second inter-frame motion is in the video anti-shake process, and the inter-frame motion of appointed continuous frames in the acquired video to be processed is acquired. The method does not need to carry out synchronous shooting on the stable video and the jittering video, and the picture content can be irrelevant.
S10 may specifically include the following steps:
s101, obtaining jittering video V ust And stabilizing video V stb . Wherein the video V is dithered ust And stabilizing video V stb May be uncorrelated, i.e. jittery video V ust Content and stabilized video V of (2) stb May be different.
S102: extracting the first inter-frame motion of the shaky video. In S102, a deep neural network model such as DeepMeshFlow may be used to estimate the first inter-frame motions {F_t^ust} and {F_t^stb} of the shaky video V_ust and the stable video V_stb.
s103, stabilizing the video V based on the first inter-frame motion of the jittery video stb Each frame is transformed to obtain a new processed video V syn . In S103, a video V with jitter is synthesized by migrating the first inter-frame motion of the jittered video onto a stable video ust Is a dithering effect but picture and main path and stabilizing video V stb New processed video V that remains consistent syn For convenience of description, use
Figure BDA0004073734850000063
Figure BDA0004073734850000064
Frames respectively representing the three videos are obtained by +.>
Figure BDA0004073734850000065
To->
Figure BDA0004073734850000066
Transform to synthesize +.>
Figure BDA0004073734850000067
Figure BDA0004073734850000068
With this arrangement, every stable video can be synthesized into a new processed video, and each stable video together with its corresponding synthesized processed video forms a stable/shaky video pair for network training. Referring to Fig. 2, the motions of the three videos satisfy a fixed relationship (the equation is given as an image in the original publication). Since {F_t^ust} and {F_t^stb} have been calculated beforehand, the ground-truth transformed field map B̂_t can be expressed in terms of them (the expression is given as an image in the original publication). In subsequent training, the path smoothing network takes the inter-frame motions of the synthesized video as input, and its output is supervised against the ground-truth transformed field maps.
S104: using the stable video and the corresponding processed video as training data.
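For illustration, the control flow of steps S101 to S104 can be sketched as follows. The callables estimate_inter_frame_motion and warp_fn are hypothetical stand-ins for the DeepMeshFlow-style estimator and the mesh-based warping of the embodiment, and the simple additive accumulation of motion is likewise an assumption, since the exact composition of motions is given only in the figures.

```python
def synthesize_training_pair(shaky_frames, stable_frames,
                             estimate_inter_frame_motion, warp_fn):
    """Sketch of S101-S104: transfer the inter-frame motion of a shaky
    video onto an unrelated stable video to obtain a (stable, shaky)
    training pair. `estimate_inter_frame_motion(prev, curr)` and
    `warp_fn(frame, motion)` are hypothetical helpers standing in for
    the DeepMeshFlow-style estimator and mesh-based warping."""
    processed = [stable_frames[0]]            # V_syn starts at the stable frame
    accumulated = None
    for t in range(1, min(len(shaky_frames), len(stable_frames))):
        # First inter-frame motion of the shaky video (S102).
        motion = estimate_inter_frame_motion(shaky_frames[t - 1], shaky_frames[t])
        # Simplified accumulation; the exact composition is an assumption.
        accumulated = motion if accumulated is None else accumulated + motion
        # Transform the stable frame by the transferred motion (S103).
        processed.append(warp_fn(stable_frames[t], accumulated))
    # The stable video and the processed video form the training pair (S104).
    return stable_frames, processed
```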
S20: training the neural network model on the training data.
In the deep learning method, the loss functions used by the neural network model to be trained are mainly the following:
Motion consistency loss function (Motion-consistency Loss):

L_MC: [formula given as an image in the original publication]

where B′_t and B′_{t-1} denote the transformed field maps of two adjacent frames estimated by the network, and B̂_t and B̂_{t-1} denote the ground-truth transformed field maps of the two adjacent frames. The motion consistency loss function constrains the network to learn a reasonable anti-shake result while maintaining inter-frame continuity.
Shape consistency loss function (Shape-consistency Loss):

L_SC: [formula given as an image in the original publication]

where v_i denotes the i-th mesh vertex (the different mesh vertices are shown in Fig. 4) and N denotes the total number of mesh vertices. The shape consistency loss function constrains the output of the convolutional neural network model from deviating greatly from the regular mesh shape; otherwise the resulting picture would be warped and distorted.
Scale-preserving loss function (Scale-preserving Loss):

L_SP: [formula given as an image in the original publication]

where s denotes a scale factor. Because the sparse mesh-form motion is converted into a dense flow field map while a mesh-like transformed field map is predicted, a scale-preserving loss function must be introduced to ensure that the network keeps the output consistent under this scale transformation.
This gives the final total loss function:

L = L_MC + α·L_SC + β·L_SP

where α and β are balance parameters used to balance the contributions of the three loss functions; both may take the value 0.01.
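Because the exact expressions of the three terms appear only as images in the original publication, the sketch below implements plausible stand-ins under explicit assumptions: L_MC as an L1 penalty on the predicted transformed field maps and on their temporal difference, L_SC as the deviation of mesh edges from their mean edge vector, and L_SP as a penalty keeping the scale factor near 1. The true formulas may differ.

```python
import torch

def total_loss(B_t, B_tm1, B_t_gt, B_tm1_gt, mesh_vertices, s,
               alpha=0.01, beta=0.01):
    """Hedged stand-in for L = L_MC + alpha * L_SC + beta * L_SP.
    The patent gives the exact formulas only as images; every term below
    is an assumption consistent with the surrounding description.
    B_t, B_tm1:       predicted transformed field maps, (B, 2, H, W)
    B_t_gt, B_tm1_gt: ground-truth transformed field maps, (B, 2, H, W)
    mesh_vertices:    warped mesh vertices, (B, Hm, Wm, 2)
    s:                scale factor tensor"""
    # Motion consistency (assumed): match each field and its temporal change.
    l_mc = (torch.abs(B_t - B_t_gt).mean()
            + torch.abs((B_t - B_tm1) - (B_t_gt - B_tm1_gt)).mean())

    # Shape consistency (assumed): mesh edges should stay close to the
    # average edge, i.e. the mesh should not deviate far from a regular grid.
    dx = mesh_vertices[:, :, 1:, :] - mesh_vertices[:, :, :-1, :]
    dy = mesh_vertices[:, 1:, :, :] - mesh_vertices[:, :-1, :, :]
    l_sc = ((dx - dx.mean(dim=(1, 2), keepdim=True)).abs().mean()
            + (dy - dy.mean(dim=(1, 2), keepdim=True)).abs().mean())

    # Scale preservation (assumed): keep the scale factor close to 1.
    l_sp = (s - 1.0).abs().mean()

    return l_mc + alpha * l_sc + beta * l_sp
```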
On the other hand, the application provides a video anti-shake method based on the deep learning method of any of the above embodiments. It first estimates the second inter-frame motion of the input video with a neural network model; specifically, a deep neural network model may be adopted. Preferably, the trained neural network model is a convolutional neural network model: a sliding window feeds the second inter-frame motion sequence of the input video into a convolutional neural network model with an attention mechanism for path smoothing, the network outputs the transformed field map of the last frame of the sliding window, and finally the shape and position of the last frame in the window are transformed through the transformed field map to realize anti-shake. Different motion estimation methods express motion in different forms; motions estimated by different methods are therefore converted into a unified dense flow field map according to the per-pixel offsets they induce. This solves the problem of inconsistent motion representations and makes the representation naturally suitable as input to a convolutional neural network model.
Specifically, the video anti-shake method includes the following steps:
T10: obtaining an unstable frame in the video. In T10, the unstable frame can be captured directly by existing software; take as an example the unstable frame I_t captured by the video recording device at time t.
T20: extracting, through a preset neural network model, the second inter-frame motion of the video formed by the unstable frame I_t captured at time t and the consecutive frames preceding it. The preset neural network model may be the same as the deep neural network model in step S102. A fixed window is then used to record the past r video frames of I_t, {I_t}_r = <I_t, I_{t-1}, ..., I_{t-r}>, and these are used to stabilize I_t. Since the whole process uses no frame after I_t, the frame can be stabilized and output immediately after being captured; the method is therefore a minimum-delay method. Given the second inter-frame motions {F_t} (the motion estimation itself may be handled by another deep neural network model), the path smoothing network of the present application predicts transformed field maps based solely on the estimated motion:

{B′_t} = φ({F_t}; θ)

where φ(·) denotes the camera path smoothing network and θ denotes the network parameters to be optimized.
T30: the second inter-frame motion is represented in the form of a sparse mesh, so the input data of the convolutional neural network model is processed as follows: a flow field map is obtained by interpolating the sparse mesh formed by the second inter-frame motion; the flow field map comprises a channel dimension and height and width dimensions; and the flow field maps within the sliding window are concatenated along the channel dimension in temporal order to form the input data of the convolutional neural network model.
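A minimal sketch of this preprocessing, assuming the sparse mesh motion is stored as a (2, Hm, Wm) tensor of per-vertex displacements and that the interpolation is bilinear (the patent does not fix the interpolation scheme):

```python
import torch
import torch.nn.functional as F

def mesh_to_dense_flow(mesh_motion, out_h, out_w):
    """Interpolate a sparse (2, Hm, Wm) mesh motion into a dense
    (2, out_h, out_w) flow field map (bilinear interpolation assumed)."""
    return F.interpolate(mesh_motion.unsqueeze(0), size=(out_h, out_w),
                         mode="bilinear", align_corners=True).squeeze(0)

def stack_window(mesh_motions, out_h, out_w):
    """Concatenate the window's flow field maps along the channel
    dimension in temporal order -> (2 * r, out_h, out_w) network input."""
    flows = [mesh_to_dense_flow(m, out_h, out_w) for m in mesh_motions]
    return torch.cat(flows, dim=0)
```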
and T40, performing path smoothing on the unstable frames based on the second inter-frame motion and the trained neural network model obtained by the learning method for video anti-shake, and obtaining a transformation field diagram. At T40, the continuous flow field map within the sliding window may be input into a convolutional neural network model with a channel attention mechanism, and the transformed field map of the last frame in the sliding window is estimated. The convolutional neural network model used in the method adds a channel attention mechanism to the jump connection part on the basis of the UNet structure, so that the network can set weights for flow field diagrams at different time sequence positions according to the motion mode of an input sequence, and the anti-shake effect is improved.
T50: resetting the unstable frame through the transformed field map. In T50, the elements of the transformed field map estimated in T40 correspond one-to-one with the pixels at the same positions in the original frame, each element representing the displacement vector of that pixel from its position on the original frame to its position on the stable frame. According to the displacement vectors of all pixel points provided by the transformed field map, the positions of all pixels on the original frame can be adjusted to synthesize a stable frame I′_t.
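A minimal sketch of this resetting step with torch.nn.functional.grid_sample, assuming the transformed field map is a dense (1, 2, H, W) per-pixel displacement field measured in pixels and that the resampling is backward warping:

```python
import torch
import torch.nn.functional as F

def reset_frame(frame, field_map):
    """Adjust every pixel of an unstable (1, C, H, W) frame by the per-pixel
    displacement vectors in a (1, 2, H, W) transformed field map, yielding
    the stable frame I'_t (backward warping assumed)."""
    _, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).unsqueeze(0)   # (1, H, W, 2), (x, y)
    grid = base + field_map.permute(0, 2, 3, 1)         # displaced sample points
    grid[..., 0] = 2.0 * grid[..., 0] / (w - 1) - 1.0   # normalize x to [-1, 1]
    grid[..., 1] = 2.0 * grid[..., 1] / (h - 1) - 1.0   # normalize y to [-1, 1]
    return F.grid_sample(frame, grid, align_corners=True)
```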
The embodiment of the application also provides a minimum-delay online video anti-shake device, comprising:
a motion extraction device for extracting the second inter-frame motion of the video;
a path smoothing device for smoothing the path of the video;
a memory having a computer program stored thereon; and
a processor executing the computer program to implement the minimum-delay online video anti-shake method of any of the above embodiments.
In the above embodiment, processing efficiency can be improved by providing a dedicated device responsible for extracting motion, while the neural network model of the other device focuses on smoothing the path.
In a specific example, the training is supervised and requires ground-truth transformed field maps. In the training phase, the flow field map sequences of two consecutive windows need to be input together, because the motion consistency loss function is a temporal loss that compares the transformed field map estimates of two consecutive frames. The shape consistency loss function and the scale-preserving loss function constrain the quality of a single estimate and need no special treatment. In the inference stage, no loss function is computed, and the flow field map sequences within the windows are fed into the convolutional network sequentially in sliding-window order.
The training process adopts Adam as the optimizer, with the initial learning rate set to 1e-4 and no weight decay strategy. The three optimizer parameters β_1, β_2 and ε are set to 0.9, 0.999 and 1e-8 respectively; training runs for 100,000 iterations and takes about 20 hours in total on two NVIDIA 1080Ti graphics cards.
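These settings translate directly into the following sketch; the model, batch shapes and stand-in L1 loss are hypothetical placeholders, while the optimizer hyperparameters and iteration count are those stated above:

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-in for the path smoothing network (the real model is a
# UNet with channel attention); channel and spatial sizes are illustrative.
model = torch.nn.Conv2d(32, 2, kernel_size=3, padding=1)

# Adam with the settings stated in the text: lr 1e-4, betas (0.9, 0.999),
# eps 1e-8, and no weight decay strategy.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                             betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0)

for step in range(100_000):                       # 100,000 training iterations
    # Hypothetical batch: stacked flow field maps and ground-truth transformed
    # field maps (random here; a real loader would supply two consecutive
    # windows so the temporal motion consistency loss can be computed).
    x = torch.randn(4, 32, 64, 64)
    y = torch.randn(4, 2, 64, 64)
    loss = F.l1_loss(model(x), y)                 # stand-in for the total loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```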
Effect demonstration:
referring to fig. 5, fig. 5 shows a comparison of the proposed method with two existing online anti-shake methods (columns 1, 2: two other methods; column 3: the method; column 4 original frame). The method can obtain good anti-shake effect in different scenes (rotation, scaling and the like), and can avoid the problems of excessive clipping, distortion and the like of the result.
Referring to fig. 6 and fig. 7, which show the effect of the proposed method on shaky-video synthesis: fig. 6 is a path comparison for a video pair shot with dual cameras, and fig. 7 is a path comparison for a video pair synthesized by the proposed method; the dashed lines are shaky video paths and the solid lines are stable video paths. It can be seen that the proposed method can synthesize high-quality training samples whose paths do not diverge from those of the original stable videos.
The embodiment of the application further provides a computer storage medium on which a computer program is stored, the computer program being loaded by a processor to perform the video anti-shake method based on the deep learning method according to any of the above embodiments.
The foregoing description is only a preferred embodiment of the present invention and is not intended to limit the present invention in any way; any simple modification, equivalent variation, etc. made to the above embodiment according to the technical substance of the present invention falls within the scope of the present invention.

Claims (8)

1. A learning method for video anti-shake, comprising the steps of:
acquiring training data;
training a neural network model based on the training data;
the acquiring training data includes:
obtaining a shaky video and a stable video;
extracting a first inter-frame motion of the shaky video;
transforming each frame of the stable video based on the first inter-frame motion of the shaky video to obtain a processed video;
and using the stable video and the processed video as training data.
2. The learning method for video anti-shake according to claim 1, wherein, when training a neural network model, a training process is constrained by a loss function, and the loss function of the neural network model to be trained is:
L = L_MC + α·L_SC + β·L_SP

wherein L_MC is a motion consistency loss function, L_SC is a shape consistency loss function, L_SP is a scale-preserving loss function, and α and β are balance parameters used to balance the contributions of the three loss functions.
3. The learning method for video anti-shake according to claim 2, wherein the motion consistency loss function is:

L_MC: [formula given as an image in the original publication]

wherein B′_t and B′_{t-1} denote the transformed field maps of two adjacent frames estimated by the network, and B̂_t and B̂_{t-1} denote the ground-truth transformed field maps of the two adjacent frames;

the shape consistency loss function is:

L_SC: [formula given as an image in the original publication]

wherein v_i denotes the i-th mesh vertex and N denotes the total number of mesh vertices;

the scale-preserving loss function is:

L_SP: [formula given as an image in the original publication]

wherein s denotes a scale factor.
4. The minimum delay online video anti-shake method is characterized by comprising the following steps of:
obtaining an unstable frame in a video;
extracting, through a preset neural network model, a second inter-frame motion of a video formed by an unstable frame and the consecutive frames preceding it;
performing path smoothing on the unstable frame based on the second inter-frame motion and the trained neural network model to obtain a transformed field map;
resetting the unstable frame through the transformation field diagram.
5. The minimum delay online video anti-shake method according to claim 4, wherein said resetting the unstable frame through the transformed field map comprises the steps of:
and adjusting the positions of all pixels on the unstable frame according to the displacement vectors of all pixel points provided by the transformation field diagram to obtain a stable frame.
6. The minimum delay online video anti-shake method of claim 4, wherein the neural network model being trained is a convolutional neural network model.
7. The minimum delay online video anti-shake method of claim 6, wherein the second inter-frame motion is represented in the form of a sparse mesh;
after said extracting the second inter-frame motion of the video formed by the unstable frame and the consecutive frames preceding it, and before said performing path smoothing on the unstable frame based on the second inter-frame motion and the trained neural network model to obtain a transformed field map, the method comprises the following steps:
processing input data of the convolutional neural network model:
interpolating the sparse mesh formed by the second inter-frame motion to obtain a flow field map, the flow field map comprising a channel dimension and height and width dimensions;
and concatenating the flow field maps along the channel dimension in temporal order using the sliding window to form the input data of the convolutional neural network model.
8. A minimum delay online video anti-shake apparatus, comprising:
a motion extraction device for extracting a second inter-frame motion of a video;
a path smoothing device for smoothing a path of the video;
a memory having a computer program stored thereon; and
a processor executing the computer program to implement the minimum delay online video anti-shake method of any one of claims 4 to 7.
CN202310102762.1A 2023-01-17 2023-01-17 Online video anti-shake device, online video anti-shake method and learning method thereof Pending CN116091868A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310102762.1A CN116091868A (en) 2023-01-17 2023-01-17 Online video anti-shake device, online video anti-shake method and learning method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310102762.1A CN116091868A (en) 2023-01-17 2023-01-17 Online video anti-shake device, online video anti-shake method and learning method thereof

Publications (1)

Publication Number Publication Date
CN116091868A 2023-05-09

Family

ID=86211852

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310102762.1A Pending CN116091868A (en) 2023-01-17 2023-01-17 Online video anti-shake device, online video anti-shake method and learning method thereof

Country Status (1)

Country Link
CN (1) CN116091868A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117291252A (en) * 2023-11-27 2023-12-26 浙江华创视讯科技有限公司 Stable video generation model training method, generation method, equipment and storage medium
CN117291252B (en) * 2023-11-27 2024-02-20 浙江华创视讯科技有限公司 Stable video generation model training method, generation method, equipment and storage medium
CN117714875A (en) * 2024-02-06 2024-03-15 博大视野(厦门)科技有限公司 End-to-end video anti-shake method based on deep neural network
CN117714875B (en) * 2024-02-06 2024-04-30 博大视野(厦门)科技有限公司 End-to-end video anti-shake method based on deep neural network


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination