CN116866637A - Video alignment method, device, computer equipment and readable storage medium

Info

Publication number: CN116866637A
Authority: CN (China)
Prior art keywords: video, aligned, frame, determining, average similarity
Legal status: Pending
Application number: CN202310900694.3A
Other languages: Chinese (zh)
Inventor: 周凡
Assignee (original and current): Guangzhou Huya Technology Co Ltd
Application filed by Guangzhou Huya Technology Co Ltd
Priority to CN202310900694.3A
Publication of CN116866637A

Classifications

    • H04N21/23418: Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics
    • H04N21/23424: Processing of video elementary streams involving splicing one content stream with another content stream, e.g. for inserting or substituting an advertisement
    • H04N21/44008: Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H04N21/44016: Processing of video elementary streams involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • Y02T10/40: Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a video alignment method, a video alignment device, computer equipment and a readable storage medium, wherein the method comprises the following steps: acquiring a first video to be aligned and a second video to be aligned, and unifying the frame rates of the first video to be aligned and the second video to be aligned; determining a target overlapping video area at which the average similarity of the first video to be aligned and the second video to be aligned is the largest; determining, according to the maximum average similarity, whether clipping exists between the first video to be aligned and the second video to be aligned; and if no clipping exists, synthesizing the video frames of the first video to be aligned and the second video to be aligned in the target overlapping video area to obtain the aligned videos of the first video to be aligned and the second video to be aligned. The invention can automatically align two videos, avoids the problem of inaccurate alignment caused by clipping, and greatly improves the accuracy of video alignment.

Description

Video alignment method, device, computer equipment and readable storage medium
Technical Field
The present invention relates to the field of image quality enhancement, and in particular, to a video alignment method, apparatus, computer device, and readable storage medium.
Background
Image quality crowding is an important link of image quality enhancement effect evaluation, and is usually to compare the effects of the same video with different sources after different image quality enhancement applications, and verify the effects of different image quality enhancement applications by combining manual evaluation. Because the video clipping and the video encoding and decoding are involved, the two videos need to be aligned before the test is carried out, the existing video alignment mode has excessive manual intervention, and if the video clipping exists, the video pictures cannot be completely aligned, so that the working efficiency of the crowded test stage and the accuracy of the test result are greatly reduced.
Disclosure of Invention
One of the purposes of the present invention is to provide a video alignment method, apparatus, computer device and storage medium, which improve the accuracy of video alignment and provide data assurance for testing different image quality enhancement applications. The present invention can be realized as follows:
in a first aspect, the present invention provides a video alignment method, the method comprising: acquiring a first video to be aligned and a second video to be aligned, and unifying the frame rates of the first video to be aligned and the second video to be aligned; determining a target overlapping video area at which the average similarity of the first video to be aligned and the second video to be aligned is the largest; determining, according to the maximum average similarity, whether clipping exists between the first video to be aligned and the second video to be aligned; and if no clipping exists, synthesizing the video frames of the first video to be aligned and the second video to be aligned in the target overlapping video area to obtain the aligned videos of the first video to be aligned and the second video to be aligned.
In a second aspect, the present invention provides a video alignment apparatus comprising: the device comprises an acquisition module, a determination module and an alignment module; the acquisition module is used for acquiring a first video to be aligned and a second video to be aligned, and carrying out frame rate unification on the first video to be aligned and the second video to be aligned; the determining module is used for determining a target overlapping video area when the average similarity of the first video to be aligned and the second video to be aligned is the largest; the determining module is further configured to determine whether clipping exists between the first video to be aligned and the second video to be aligned according to the maximum average similarity; and the alignment module is used for synthesizing video frames of the first video to be aligned and the second video to be aligned in the target overlapping video area when the determination module determines that no clipping exists, so as to obtain the alignment videos of the first video to be aligned and the second video to be aligned.
In a third aspect, the present invention provides a computer device comprising a processor and a memory, the memory storing a computer program executable by the processor, the processor being executable to implement the video alignment method of the first aspect.
In a fourth aspect, the present invention provides a readable storage medium having stored thereon a computer program which, when executed by a processor, implements the video alignment method of the first aspect.
According to the video alignment method, device, computer equipment and readable storage medium, after the frame rates of the first video to be aligned and the second video to be aligned are unified, the target overlapping video area at which the average similarity of the two videos is the largest is determined; the maximum average similarity indicates that the degree of overlap of the two videos to be aligned is the highest. At this point, whether a clipping condition exists between the two videos is determined according to the maximum average similarity. If no clipping exists, video synthesis can be performed in the overlapping video area of the two videos to obtain the aligned videos; otherwise, no synthesis is performed. The whole process adopts a global optimization mode, so the two videos can be aligned automatically; at the same time, it fills the technical gap of judging the clipping condition in existing video alignment methods, avoids the problem of inaccurate alignment caused by clipping, and greatly improves the accuracy of video alignment.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a diagram illustrating a first video alignment scheme in the prior art;
FIG. 2 is a diagram of a second video alignment scheme in the prior art;
FIG. 3 is a simplified schematic diagram of a video alignment method according to an embodiment of the present invention;
FIG. 4 is a schematic flow chart of a video alignment method according to an embodiment of the present invention;
FIG. 5 is a schematic flow chart of step S402 provided by an embodiment of the present invention;
FIG. 6 is a schematic diagram of a scenario of step S402-2 according to an embodiment of the present invention;
FIG. 7 is a functional block diagram of a video alignment apparatus according to an embodiment of the present invention;
fig. 8 is a block diagram of a computer device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
In the description of the present invention, it should be noted that, if the terms "upper", "lower", "inner", "outer", and the like indicate an azimuth or a positional relationship based on the azimuth or the positional relationship shown in the drawings, or the azimuth or the positional relationship in which the inventive product is conventionally put in use, it is merely for convenience of describing the present invention and simplifying the description, and it is not indicated or implied that the apparatus or element referred to must have a specific azimuth, be configured and operated in a specific azimuth, and thus it should not be construed as limiting the present invention.
Furthermore, the terms "first," "second," and the like, if any, are used merely for distinguishing between descriptions and not for indicating or implying a relative importance.
It should be noted that the features of the embodiments of the present invention may be combined with each other without conflict.
In order to test and evaluate the image quality enhancement effect of different image quality enhancement applications, the same image or the same video can be enhanced with each application, and the results are then evaluated to determine which application has the better enhancement effect; many tools exist for comparing the same frame images. There is currently no automated solution that can fully align two videos; instead, aligned videos are obtained in the following two mainstream ways.
Referring to fig. 1, fig. 1 shows a first video alignment method in the prior art, in which video A and video B are two video segments to be aligned from the same video. The method is implemented as follows: manual first-frame alignment is combined with frame-rate matching. The two videos are decoded at the same frame rate, and then the same frame of the two video sequences is manually aligned as the first frame, which strictly ensures that the video frames of video A and video B with the same timestamps after the first frame correspond completely. The whole process is semi-automatic, and the first-frame alignment requires manual intervention.
Referring to fig. 2, fig. 2 shows a second conventional video alignment method, which mainly relies on automatically searching for image structural similarity. It is implemented as follows: the first frame of video A or video B is selected as the matching reference frame; then, within a set search range in the other video segment, the frame with the maximum structural similarity (SSIM, Structural Similarity) to the reference frame is computed as the timestamp-aligned video frame; the next frame after the reference frame is taken as the new reference frame, and the frame with the maximum SSIM within the updated window is computed as the next result, and so on, until the reference frame or the matching frame reaches the last frame, finally yielding two sequences with fully corresponding timestamps for synthesizing the comparison video. This alignment method requires setting a suitable search range (the timestamp difference of the starting frames of videos A and B must be smaller than the search range) and configuring other hyperparameters, and the scheme is essentially a local optimization algorithm, which greatly reduces its flexibility and practicability.
Clearly, both the video alignment methods shown in fig. 1 and fig. 2 have defects, and they share a common technical problem: when cropping exists in one of the videos to be aligned, the other video to be aligned will contain video content that the cropped video lacks, so the contents of the finally synthesized video frames are not completely aligned, and neither method has a way to detect this.
In order to solve the above technical problems, an embodiment of the present invention provides a video alignment method. Referring to fig. 3, fig. 3 is a simplified schematic diagram of the video alignment method provided by the embodiment of the present invention. The main technical concept is: each video to be aligned is converted into a high-dimensional characteristic curve of the same length as the video, and the two high-dimensional characteristic curves are then globally optimized to achieve the best matching effect; at the same time, combined with a preset threshold, cases of video clipping can easily be screened out, so the method has universality and practicability.
The video alignment method provided by the embodiment of the invention is described in detail below.
Referring to fig. 4, fig. 4 is a schematic flowchart of a video alignment method according to an embodiment of the present invention, where the video alignment method may be applied to a computer device, and includes the following steps:
s401, acquiring a first video to be aligned and a second video to be aligned, and unifying frame rates of the first video to be aligned and the second video to be aligned;
s402, determining a target overlapping video area when the average similarity of the first video to be aligned and the second video to be aligned is the largest;
s403, determining whether clipping exists between the first video to be aligned and the second video to be aligned according to the maximum average similarity;
And S404, if no clipping exists, synthesizing the video frames of the first video to be aligned and the second video to be aligned in the target overlapping video area to obtain the aligned videos of the first video to be aligned and the second video to be aligned.
In the technical solution of steps S401 to S404, after the frame rates of the first video to be aligned and the second video to be aligned are unified, the target overlapping video area at which the average similarity of the two videos is the largest is determined; the maximum average similarity indicates that the two videos overlap to the highest degree. Whether a clipping condition exists between the two videos is then determined from the maximum average similarity. If not, video synthesis can be performed within the overlapping video area of the two videos to obtain the aligned videos; otherwise, no synthesis is performed. Because the whole process adopts a global optimization mode, the two videos can be aligned automatically; at the same time, it fills the technical gap of judging clipping conditions in existing video alignment methods, avoids the problem of inaccurate alignment caused by clipping, and greatly improves the accuracy of video alignment.
The above steps are described in detail below.
In step S401, the first video to be aligned and the second video to be aligned may be the same video from different sources, for example copies of the same movie or drama obtained from two different platforms (e.g. the Huya platform and another source platform). In practice, the two videos may have different frame rates, resolutions and durations; for example, the first video to be aligned is a 1080P video A and the second video to be aligned is a 720P video B, so video A and video B have different resolutions. After the first video to be aligned and the second video to be aligned are obtained, the two videos need to be unified to the same resolution and frame rate; in the subsequent processing, more accurate results can be obtained when the two videos share the same resolution and frame rate.
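As a concrete illustration of step S401, the following is a minimal sketch of frame rate and resolution unification, assuming the ffmpeg command-line tool is available; the file names, target frame rate and target resolution are illustrative placeholders rather than values prescribed by this embodiment.

```python
import subprocess

def unify(src: str, dst: str, fps: int = 25, width: int = 1280, height: int = 720) -> None:
    """Re-encode src so that both input videos share one resolution and frame rate."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src,
         "-vf", f"scale={width}:{height},fps={fps}",  # unify resolution, then frame rate
         dst],
        check=True,
    )

unify("video_a_1080p.mp4", "video_a_unified.mp4")  # first video to be aligned
unify("video_b_720p.mp4", "video_b_unified.mp4")   # second video to be aligned
```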
In step S402, a global optimization manner is adopted to determine the overlapping video area of the first video to be aligned and the second video to be aligned. It can be understood that when the degree of coincidence between the two videos is the highest, the average similarity is the largest, so accurate alignment of identical video frame content can be achieved; compared with a local optimization algorithm, the accuracy and robustness are higher.
An embodiment of the present invention provides a feasible implementation manner for the step S402, please refer to fig. 5, and fig. 5 is a schematic flowchart of the step S402 provided by the embodiment of the present invention, which may include the following steps:
s402-1, converting the first video to be aligned and the second video to be aligned respectively to obtain a first characteristic curve and a second characteristic curve.
In the embodiment of the invention, the characteristic curve is a representation of the video to be aligned in the time dimension: within the duration of the video to be aligned, each moment corresponds to the feature vector of one video frame. In alternative embodiments, a structural feature encoder may be used to convert the video to be aligned into a characteristic curve.
The structural feature encoder and its optimization method are described in detail below. The structural feature encoder in the embodiment of the invention may be, but is not limited to, a classical image feature extraction model such as VGG, and can be optimized in the following way:
step a1, acquiring a training video frame sequence set.
In the embodiment of the invention, since the frame rate of a video is usually greater than 25 FPS, the difference between two consecutive frames is very small and difficult for a model to distinguish. To reduce the learning difficulty of the model, the training video can be sampled at intervals to obtain video frames at a frame rate below a preset value (such as 5 FPS). A sliding window is applied to the sampled video frames, and the continuous video frame sequence within the window is extracted and denoted X ∈ R^(N×H×W), where N is the number of frames in the sliding window and H and W are the height and width of the video frames. Since the differences between the videos to be compared mainly lie in resolution, sharpness, noise, image quality effects and the like, X can be processed by various image quality conversion methods, which may be, but are not limited to: changing the resolution by scaling; reducing sharpness with Gaussian blur; adding Gaussian or Poisson noise; and controlling the degree of image quality degradation. Random combinations of these conversion modes, applied in random order, produce different image quality effects for the same video frames; the converted video frame sequences are recorded in turn as {X_1′, X_2′, ..., X_n′} and, together with X, form the final set of video frame sequences {X, X_1′, X_2′, ..., X_n′} as the training video frame sequence set.
Thus, step a1 may be implemented as follows: performing interval sampling on an original video according to a preset frame rate, and extracting from the sampled video frames according to a preset window length to obtain an original video frame sequence; performing image quality conversion on the original video frame sequence according to a plurality of image quality conversion modes to obtain a plurality of video frame sequences with different image quality effects; and taking the original video frame sequence together with the converted video frame sequences as the training video frame sequence set. A possible implementation is sketched below.
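The following sketch of step a1 uses OpenCV and NumPy; the sampling rate, window length, number of degraded copies, non-overlapping windows, and the specific degradations are assumptions chosen to mirror the description above, not values fixed by this embodiment.

```python
import cv2
import numpy as np

def sample_frames(path: str, target_fps: float = 5.0) -> np.ndarray:
    """Interval-sample a video down to roughly target_fps, as grayscale frames."""
    cap = cv2.VideoCapture(path)
    src_fps = cap.get(cv2.CAP_PROP_FPS)
    step = max(1, round(src_fps / target_fps))       # interval sampling
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
        idx += 1
    cap.release()
    return np.stack(frames)                          # shape (T, H, W)

def degrade(seq: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """One random image-quality conversion of the whole sequence X -> X'."""
    out = []
    for f in seq:
        if rng.random() < 0.5:                       # resolution change by scaling
            h, w = f.shape
            f = cv2.resize(cv2.resize(f, (w // 2, h // 2)), (w, h))
        if rng.random() < 0.5:                       # sharpness reduction
            f = cv2.GaussianBlur(f, (5, 5), 1.5)
        if rng.random() < 0.5:                       # Gaussian noise
            f = np.clip(f + rng.normal(0, 5, f.shape), 0, 255).astype(np.uint8)
        out.append(f)
    return np.stack(out)

rng = np.random.default_rng(0)
x = sample_frames("train_video.mp4")
n_window = 16                                        # sliding-window length N
windows = [x[i:i + n_window]                         # non-overlapping for brevity
           for i in range(0, len(x) - n_window + 1, n_window)]
# each group is {X, X_1', X_2', X_3'}: the original window plus degraded copies
train_set = [[w] + [degrade(w, rng) for _ in range(3)] for w in windows]
```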
Step a2, determining any two groups of training video frame sequences in the training video frame sequence set as target training video frame sequences, and sequentially inputting the target training video frame sequences into a structural feature encoder to obtain feature vectors of target training video frames in each target training video frame sequence;
in the embodiment of the invention, the input is a training video frame sequence set { X, X } 1 ′,X 2 ′,...,X n Any two sets of video frame sequences, e.g. taking video frame sequence X I ∈R N×H×W ,X J ∈R N×H×W Wherein image X corresponds to the kth frame position I,k ∈R H ×W ,X J,k ∈R H×W But the picture content of the picture is consistent, but the picture quality effect is different. X is to be I And X J And obtaining X= |X after splicing I ,X J |∈R 2N×H×W As a Batch and input into the structural feature encoder, the output is that the feature vector is Y= [ Y ] I ,Y J ]∈R 2N×M Where M represents the dimension of the feature vector, the structural feature encoder converts each frame into a feature vector of dimension M.
Optionally, before the training video frame sequences are input to the structural feature encoder, they may first be converted to grayscale, which reduces the influence of color differences between the two input sequences and lowers the learning difficulty of the model.
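To make step a2 concrete, here is a minimal PyTorch sketch; the small VGG-style backbone, the feature dimension M = 128, and the input sizes are illustrative assumptions, not the encoder mandated by this embodiment.

```python
import torch
import torch.nn as nn

class StructuralFeatureEncoder(nn.Module):
    def __init__(self, m_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(                 # small VGG-like stack
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(4),
        )
        self.fc = nn.Linear(64 * 4 * 4, m_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 1, H, W) grayscale frames -> (B, M) unit-norm feature vectors
        f = self.conv(x).flatten(1)
        return nn.functional.normalize(self.fc(f), dim=1)

encoder = StructuralFeatureEncoder()
x_i = torch.rand(16, 1, 360, 640)                  # sequence X_I, N = 16 frames
x_j = torch.rand(16, 1, 360, 640)                  # sequence X_J, same content
y = encoder(torch.cat([x_i, x_j], dim=0))          # Y = [Y_I, Y_J], shape (2N, M)
y_i, y_j = y[:16], y[16:]
```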
Step a3, taking each video frame in the two target training video frame sequences as a reference frame in turn, and calculating the similarity of feature vectors between the reference frame and all video frames except the reference frame;
In the embodiment of the invention, the optimization goal of the structural feature encoder is to maximize the similarity between the feature vectors Y_(I,k), Y_(J,k) ∈ R^M of video frames X_(I,k), X_(J,k) that correspond to the same timestamp, while minimizing the similarity between the feature vectors Y_(I,k), Y_(J,l) ∈ R^M, k ≠ l, of video frames X_(I,k), X_(J,l) that correspond to different timestamps. Thus, Y_I and Y_J are taken in turn as reference vectors, and feature vector similarity is calculated against all feature vectors in Y_J and Y_I respectively, where the feature vector similarity may be, but is not limited to, cosine similarity; for example, the similarity of Y_(I,k) and Y_(J,k) is S(Y_(I,k), Y_(J,k)), and the similarity of Y_(I,k) and Y_(J,l) is S(Y_(I,k), Y_(J,l)), k ≠ l.
Step a4, determining a loss value of a loss function of the structural feature encoder according to the similarity of all the feature vectors, and updating model parameters of the structural feature encoder according to the loss value;
in the embodiment of the invention, for each reference frame, the feature vector similarity with the video frame at the same time position is taken as the numerator, and the sum of the feature vector similarities between the reference frame and all the video frames of the other sequence is taken as the denominator, constructing a loss function of, for example, the following form:

L = −(1/N) Σ_(k=1)^(N) log( S(Y_(I,k), Y_(J,k)) / Σ_(l=1)^(N) S(Y_(I,k), Y_(J,l)) )

As can be seen from the formula, the feature vector similarity of the k-th frame pair is the numerator and the sum of the feature vector similarities between the k-th frame and the other frames is the denominator, which serves the purpose of maximizing the feature similarity of identical frames while minimizing the feature similarity of different video frames. Substituting the feature vector similarities calculated in step a3 into the loss function L yields the loss value, and the model parameters of the structural feature encoder are updated according to the loss value.
And returning to the step a2 until a preset optimization condition is reached, and obtaining the optimized structural feature encoder.
The feature vector similarities calculated in step a3 are substituted into the loss function to obtain a loss value, the model parameters of the structural feature encoder are adjusted according to the loss value, and target training video frame sequences continue to be selected for optimization until a preset optimization condition is reached. The preset optimization condition may be that the loss value of the loss function no longer decreases, or that the number of iterations reaches a preset count; the optimized structural feature encoder is obtained by fixing the model parameters at that point. A minimal training-step sketch follows.
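Continuing the sketch above (reusing the assumed names `encoder`, `y_i`, `y_j`), steps a3 and a4 can be expressed as the following contrastive loss; exponentiating the cosine similarities is an added numerical-stability assumption layered on the numerator/denominator structure described for the loss function L.

```python
import torch

def alignment_loss(y_i: torch.Tensor, y_j: torch.Tensor) -> torch.Tensor:
    """Same-timestamp similarity as numerator, sum over all frames as denominator."""
    sim = torch.exp(y_i @ y_j.T)   # (N, N) pairwise cosine similarities (unit-norm inputs)
    pos = sim.diagonal()           # S(Y_(I,k), Y_(J,k)) for every k
    return -(pos / sim.sum(dim=1)).log().mean()

optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-4)  # lr is illustrative
loss = alignment_loss(y_i, y_j)   # step a3: similarities -> loss value
optimizer.zero_grad()
loss.backward()                   # step a4: update model parameters from the loss
optimizer.step()
```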
In combination with the optimized structural feature encoder, the implementation of the step S402-1 may include:
step b1, extracting video frames from a first video to be aligned and a second video to be aligned respectively to obtain a first video frame sequence and a second video frame sequence;
it can be understood that the first video frame sequence and the second video frame sequence are obtained in the same manner as the training video frame sequence is obtained, and a detailed description thereof is omitted herein.
In order to ensure the accuracy of the processing result, the first video frame sequence and the second video frame sequence may also be converted to grayscale before step b2 is performed.
Step b2, respectively inputting the first video frame sequence and the second video frame sequence into a pre-optimized structural feature encoder to perform feature extraction to obtain a first feature vector and a second feature vector;
and b3, forming a first characteristic curve according to the first characteristic vector, and forming a second characteristic curve according to the second characteristic vector.
It should be noted that, in the embodiment of the present invention, the form of the characteristic curves is used to determine the maximum average similarity and the target overlapping video area, which is merely an example, and is not a limitation of the embodiment of the present invention, and the two videos to be aligned may be converted into other forms to determine the maximum average similarity and the target overlapping video area, which is not limited herein.
As can be seen from the optimization of the structural feature encoder, its output is a feature vector for each video frame in the video frame sequence, and each video frame corresponds to a timestamp. Each feature vector can therefore be viewed directly as a feature point: within the video duration, every timestamp on the time axis corresponds to one feature point, and the curve formed by connecting these feature points is the characteristic curve corresponding to the video frame sequence, as shown in fig. 3.
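In code, the characteristic curve is simply the ordered array of per-frame feature vectors; a brief sketch reusing the assumed `encoder` from above:

```python
import torch

@torch.no_grad()
def feature_curve(frames: torch.Tensor) -> torch.Tensor:
    """frames: (T, 1, H, W) grayscale -> (T, M) curve, one feature point per timestamp."""
    return encoder(frames)

curve_a = feature_curve(torch.rand(100, 1, 360, 640))  # first video,  N_A = 100 frames
curve_b = feature_curve(torch.rand(80, 1, 360, 640))   # second video, N_B = 80 frames
```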
And S402-2, matching the first characteristic curve and the second characteristic curve, and determining the maximum average similarity and the target overlapping video area.
In the embodiment of the invention, when the content of the first video to be aligned is the same as that of the second video to be aligned, the corresponding feature matrices (composed of the feature vectors of each video frame) have a clear similarity in the high-dimensional space and are consistent in how they change over time. Therefore, one of the characteristic curves is used as a reference curve and the other is matched against it. During matching, the average similarity of the corresponding video frames in the overlapping area of the two characteristic curves is calculated repeatedly until the overlapping video area with the maximum average similarity is determined; at that point the degree of coincidence between the two curves is highest, this overlapping video area can be taken as the target overlapping video area, and the video frames within it can be accurately aligned.
An embodiment of the above matching process will be described in detail, that is, step S402-2 may include the steps of:
step c1, determining one of the first characteristic curve and the second characteristic curve as a reference curve, aligning the tail frame position of the other characteristic curve with the head frame position of the reference curve, and moving the other characteristic curve along the sequence direction of the reference curve according to a preset step length;
Step c2, after each movement by the preset step length, determining the start frame position and the end frame position of the region where the other characteristic curve overlaps the reference curve;
step c3, taking the region formed by the initial frame position and the end frame position as an overlapped video region, and calculating the average similarity between the feature vectors of the video frames in the overlapped video region;
and c4, stopping moving when the first frame position of the other characteristic curve is overlapped with the tail frame position of the reference curve, obtaining the maximum average similarity in the whole moving process, and taking the overlapping video area corresponding to the maximum average similarity as a target overlapping video area.
For easy understanding of the above embodiment, please refer to fig. 6, which is a schematic view of the scenario of step S402-2 provided by the embodiment of the present invention. Assume that the first video to be aligned is video A and the second video to be aligned is video B, and that N_A and N_B are the numbers of video frames of video A and video B respectively. Y_A and Y_B are the characteristic curves corresponding to video A and video B. In one embodiment, Y_A is used as the reference curve, and Y_B is moved to match Y_A. As shown in fig. 6, the start frame position of curve Y_A is 0 and its end frame position is N_A; the initial position range corresponding to Y_B is then [−N_B, 0], with its end frame corresponding to the start frame of Y_A.
The matching process may be as follows: Y_B is moved gradually along Y_A from the start frame position toward the end frame position, each movement step being one video frame. During the movement, the start frame position and the end frame position of Y_B on the time axis of Y_A are denoted L_B and H_B respectively, where L_B takes values in [−N_B, N_A] and H_B takes values in [0, N_A+N_B]; curve Y_B needs to be moved N_A+N_B steps in total. The overlapping video area during the movement is denoted M = [L_M, H_M], where L_M and H_M are respectively the start frame and the end frame of the overlapping video area of Y_A and Y_B, with L_M = max(L_B, 0) and H_M = min(H_B, N_A). For each video frame that Y_B is moved, the average similarity of Y_B and Y_A within the overlapping video area is calculated, specifically: the average similarity of the feature vectors of all the video frames of Y_B and Y_A located within the overlapping video area.
For example, after Y_B has been moved for t steps, the overlapping video area of Y_A and Y_B is M_t = [L_M^t, H_M^t], with corresponding frame number N_(M_t) = H_M^t − L_M^t. The average similarity of the feature vectors of all the video frames of Y_B and Y_A located within M_t, i.e. the average similarity for the t-th movement step, is:

S_M^t = (1 / N_(M_t)) · Σ_(i ∈ M_t) S(Y_(A,i), Y_(B,i))

where t ∈ [0, N_A+N_B], and Y_(A,i), Y_(B,i) are respectively the i-th frames of Y_A and Y_B located in the overlapping video area.
Y_B is moved step by step until, after N_A+N_B steps, the start frame of Y_B corresponds to the end frame of Y_A, at which point the range corresponding to curve Y_B is N_A to N_A+N_B. The set of average similarities after each movement step, S_M = {S_M^0, S_M^1, ..., S_M^(N_A+N_B)}, is thereby obtained; the maximum average similarity is MAXS_M = max{S_M} ∈ R, and the video frames within the overlapping interval corresponding to MAXS_M may be used to compose the comparison video.
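The matching loop of steps c1 to c4 can be sketched as follows with NumPy, reusing the curves from the sketch above; cosine similarity reduces to a dot product because the encoder (as assumed) outputs unit-norm vectors.

```python
import numpy as np

def match_curves(y_a: np.ndarray, y_b: np.ndarray):
    """Slide curve B along reference curve A; return the maximum average
    similarity MAXS_M, the overlap [L_M, H_M) on A's axis, and the best shift L_B."""
    n_a, n_b = len(y_a), len(y_b)
    best = (-1.0, None, None)
    for l_b in range(-n_b, n_a + 1):          # tail-at-head ... head-at-tail
        lo, hi = max(l_b, 0), min(l_b + n_b, n_a)
        if hi <= lo:                          # no overlap at the extremes
            continue
        # the frame at position p on A's axis pairs with frame p - l_b of B
        sims = (y_a[lo:hi] * y_b[lo - l_b:hi - l_b]).sum(axis=1)
        avg = float(sims.mean())              # average similarity S_M^t
        if avg > best[0]:
            best = (avg, (lo, hi), l_b)
    return best

max_avg_sim, target_region, shift_b = match_curves(
    curve_a.cpu().numpy(), curve_b.cpu().numpy())
```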
Because the above average similarity takes the whole overlapping video area into account, it uses all the video frame information available for reference and achieves globally optimized similarity: the average similarity is maximal exactly when the degree of coincidence between the characteristic curves of video A and video B is highest, so compared with a local optimization algorithm, the accuracy and robustness of precisely aligning identical video frame content are higher. The overlapping video area with the maximum average similarity is denoted M_C; after the video frames of video A and video B located within M_C are synthesized into a comparison video, every frame is guaranteed to have completely aligned picture content.
Through the above embodiment, the maximum average similarity corresponding to the first video to be aligned and the second video to be aligned and the overlapping video area at this time have been obtained, and then the maximum average similarity can be used to determine whether the two videos to be aligned have clipping conditions, please refer to step S403.
In step S403, whether clipping exists between the first video to be aligned and the second video to be aligned is determined according to the maximum average similarity. It can be understood that when clipping exists between the two videos, the maximum average similarity is smaller; a threshold can therefore be preset. When the maximum average similarity is greater than the preset threshold, it is considered that no clipping exists between the first video to be aligned and the second video to be aligned, and step S404 may be executed to obtain the aligned videos. When the maximum average similarity is smaller than the preset threshold, it can be considered that a clipping condition exists, and step S404 should not be performed.
In step S404, when no clipping exists between the first video to be aligned and the second video to be aligned, the video frames of the two videos located in the target overlapping video area are synthesized to obtain the aligned videos of the first video to be aligned and the second video to be aligned, achieving an accurate alignment effect, as sketched below.
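A closing sketch of steps S403 and S404, continuing from the matching sketch above: the preset threshold value and the side-by-side synthesis format are illustrative assumptions, and curve indices are assumed to map one-to-one to frame numbers of the unified videos.

```python
import cv2

SIM_THRESHOLD = 0.9                            # assumed preset threshold
if max_avg_sim > SIM_THRESHOLD:                # S403: no clipping detected
    lo, hi = target_region                     # overlap on video A's time axis
    cap_a = cv2.VideoCapture("video_a_unified.mp4")
    cap_b = cv2.VideoCapture("video_b_unified.mp4")
    cap_a.set(cv2.CAP_PROP_POS_FRAMES, lo)
    cap_b.set(cv2.CAP_PROP_POS_FRAMES, lo - shift_b)   # B's frame for position lo
    writer = None
    for _ in range(hi - lo):                   # S404: synthesize aligned frames
        ok_a, fa = cap_a.read()
        ok_b, fb = cap_b.read()
        if not (ok_a and ok_b):
            break
        frame = cv2.hconcat([fa, fb])          # side-by-side comparison frame
        if writer is None:
            h, w = frame.shape[:2]
            writer = cv2.VideoWriter("aligned_comparison.mp4",
                                     cv2.VideoWriter_fourcc(*"mp4v"), 25, (w, h))
        writer.write(frame)
    if writer is not None:
        writer.release()
else:
    print("Clipping detected between the two videos; synthesis is skipped")
```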
In summary, the embodiment of the invention provides a video alignment method based on image structural feature coding: two versions of the same video (identical picture content, different durations and different image quality) are feature-encoded frame by frame with a structural feature encoder to obtain the high-dimensional characteristic curves corresponding to the two video segments, and global optimization is finally performed by maximizing the average similarity to obtain the final comparison video. This improves the accuracy of the alignment result, provides assurance for comparing and testing different image quality enhancement applications, greatly improves the working efficiency of the comparison-testing stage, and shortens the iteration and release period of the whole image quality enhancement function.
Based on the same inventive concept, the embodiment of the present invention further provides a video alignment device, as shown in fig. 7, fig. 7 is a functional block diagram of the video alignment device provided in the embodiment of the present invention, and the video alignment device 700 may include: an acquisition module 710, a determination module 720, and an alignment module 730.
The acquiring module 710 is configured to acquire a first video to be aligned and a second video to be aligned, and perform frame rate unification on the first video to be aligned and the second video to be aligned;
a determining module 720, configured to determine a target overlapping video area when the average similarity between the first video to be aligned and the second video to be aligned is the largest; determining whether clipping exists between the first video to be aligned and the second video to be aligned according to the maximum average similarity;
the alignment module 730 is configured to, if the determination module 720 determines that clipping does not exist, synthesize video frames of the first video to be aligned and the second video to be aligned in the target overlapping video area, and obtain aligned videos of the first video to be aligned and the second video to be aligned.
It is appreciated that the acquisition module 710, the determination module 720, and the alignment module 730 may cooperatively perform the steps of fig. 4 to achieve corresponding technical effects.
In an alternative embodiment, the determining module 720 is specifically configured to: converting the first video to be aligned and the second video to be aligned respectively to obtain a first characteristic curve and a second characteristic curve; and matching the first characteristic curve and the second characteristic curve, and determining the maximum average similarity and the target overlapping video area.
In an alternative embodiment, the determining module 720 is specifically configured to: extracting video frames from the first video to be aligned and the second video to be aligned respectively to obtain a first video frame sequence and a second video frame sequence; respectively inputting the first video frame sequence and the second video frame sequence into a pre-optimized structural feature encoder to perform feature extraction to obtain a first feature vector and a second feature vector; a first characteristic curve is formed according to the first characteristic vector, and a second characteristic curve is formed according to the second characteristic vector.
In an alternative embodiment, the structural feature encoder is optimized by: acquiring a training video frame sequence set; determining any two groups of training video frame sequences in the training video frame sequence set as target training video frame sequences, and inputting the target training video frame sequences into the structural feature encoder to obtain feature vectors of the target training video frames in each target training video frame sequence; taking each video frame in the two target training video frame sequences as a reference frame in turn, and calculating the feature vector similarity between the reference frame and all video frames except the reference frame; determining a loss value of the loss function of the structural feature encoder according to all the feature vector similarities, and updating the model parameters of the structural feature encoder according to the loss value; and returning to the step of determining any two groups of training video frame sequences in the training video frame sequence set as target training video frame sequences until a preset optimization condition is reached, obtaining the optimized structural feature encoder.
In an alternative embodiment, the determining module 720 is specifically configured to: perform interval sampling on an original video according to a preset frame rate, and extract from the sampled video frames according to a preset window length to obtain an original video frame sequence; and perform image quality conversion on the original video frame sequence according to a plurality of image quality conversion modes to obtain a plurality of video frame sequences with different image quality effects, taking the original video frame sequence and the converted video frame sequences as the training video frame sequence set.
In an alternative embodiment, the determining module 720 is specifically configured to: determining one of the first characteristic curve and the second characteristic curve as a reference curve, aligning the tail frame position of the other characteristic curve with the head frame position of the reference curve, and moving the other characteristic curve along the sequence direction of the reference curve by a preset step length; after each movement by the preset step length, determining the start frame position and the end frame position of the region where the other characteristic curve overlaps the reference curve; taking the region formed by the start frame position and the end frame position as an overlapping video area, and calculating the average similarity between the feature vectors of the video frames in the overlapping video area; and stopping moving when the head frame position of the other characteristic curve coincides with the tail frame position of the reference curve, obtaining the maximum average similarity over the whole moving process, and taking the overlapping video area corresponding to the maximum average similarity as the target overlapping video area.
In an alternative embodiment, the determining module 720 is further specifically configured to: if the maximum average similarity is larger than a preset threshold, determining that the first video to be aligned and the second video to be aligned do not have clipping, otherwise, determining that the first video to be aligned and the second video to be aligned have clipping.
It should be noted that, in the above embodiments of the present application, the division of the modules is merely schematic, and there may be another division manner in actual implementation, and in addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or may exist separately and physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, including several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor (processor) to execute all or part of the steps of the methods of the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
Referring to fig. 8, fig. 8 is a block diagram of a computer device according to an embodiment of the present invention, where the computer device is configured to execute a video alignment method according to an embodiment of the present invention, and the computer device 800 includes: the memory 801, the processor 802, the communication interface 803, and the bus 804 are electrically connected directly or indirectly to each other to realize transmission or interaction of data. For example, the components may be electrically connected to each other via one or more communication buses or signal lines.
Alternatively, bus 804 may be a peripheral component interconnect standard (peripheral component interconnect, PCI) bus, or an extended industry standard architecture (extended industry standard architecture, EISA) bus, among others. The buses may be divided into address buses, data buses, control buses, etc. For ease of illustration, only one thick line is shown in fig. 8, but not only one bus or one type of bus.
In an embodiment of the present invention, the processor 802 may be a general purpose processor, a digital signal processor, an application specific integrated circuit, a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, where the methods, steps, and logic blocks disclosed in the embodiments of the present invention may be implemented or performed. The general purpose processor may be a microprocessor or any conventional processor or the like. The steps of a method disclosed in connection with the embodiments of the present invention may be embodied directly in a hardware processor for execution, or in a combination of hardware and software modules in the processor for execution. A software module may be located in the memory 801, the processor 802 reading the program instructions in the memory 801 and performing the steps of the above method in connection with its hardware.
In the embodiment of the present invention, the memory 801 may be a nonvolatile memory, such as a hard disk (HDD) or a Solid State Drive (SSD), or may be a volatile memory (RAM). The memory may also be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to such. The memory in embodiments of the present invention may also be circuitry or any other device capable of performing memory functions for storing instructions and/or data.
The memory 801 may be used to store software programs and modules; for example, the instructions/modules of the video alignment apparatus 700 provided in the embodiments of the present invention may be stored in the memory 801 in the form of software or firmware, or be solidified in the operating system (OS) of the computer device 800, and the processor 802 executes the software programs and modules stored in the memory 801 to perform various functional applications and data processing. The communication interface 803 may be used for communication of signaling or data with other node devices.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the apparatus and units described above may refer to corresponding procedures in the foregoing method embodiments, which are not described herein again.
It is to be understood that the configuration shown in fig. 8 is merely illustrative, and that the computer device 800 may also include more or fewer components than shown in fig. 8, or have a different configuration than shown in fig. 8. The components shown in fig. 8 may be implemented in hardware, software, or a combination thereof.
The computer device 800 may be any electronic product that can interact with a user, such as a personal computer, tablet, smart phone, personal digital assistant (Personal Digital Assistant, PDA), game console, interactive web television (Internet Protocol Television, IPTV), smart wearable device, etc.
Computer device 800 may also include network devices and/or user devices. Network devices include, but are not limited to, a single network server, a server group of multiple network servers, or a Cloud based Cloud Computing (Cloud Computing) composed of a large number of hosts or network servers.
The network in which the computer device 800 is located includes, but is not limited to, the internet, a wide area network, a metropolitan area network, a local area network, a virtual private network (Virtual Private Network, VPN), and the like.
Based on the above embodiments, the present application also provides a storage medium in which a computer program is stored, which when executed by a computer, causes the computer to execute the video alignment method provided in the above embodiments.
Based on the above embodiments, the present invention also provides a computer program, which when run on a computer, causes the computer to perform the video alignment method provided in the above embodiments.
Based on the above embodiments, the present invention further provides a chip, where the chip is configured to read a computer program stored in a memory, and is configured to perform the video alignment method provided in the above embodiments.
Embodiments of the present invention also provide a computer program product comprising instructions which, when run on a computer, cause the computer to perform the video alignment method provided in the above embodiments.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by instructions. These instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The present invention is not limited to the above embodiments, and any changes or substitutions that can be readily conceived by those skilled in the art within the technical scope disclosed by the present invention are intended to fall within its scope. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method of video alignment, the method comprising:
acquiring a first video to be aligned and a second video to be aligned, and unifying frame rates of the first video to be aligned and the second video to be aligned;
determining a target overlapping video area at which the average similarity between the first video to be aligned and the second video to be aligned is maximized;
determining whether clipping exists between the first video to be aligned and the second video to be aligned according to the maximum average similarity;
if no clipping exists, synthesizing the video frames of the first video to be aligned and the second video to be aligned in the target overlapping video area to obtain an aligned video of the first video to be aligned and the second video to be aligned.
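For illustration only, the following Python sketch shows one way the pipeline of claim 1 could be wired together. OpenCV/NumPy, the helper names (unify_frame_rate, find_best_overlap, encode_frame), the 5 fps unified rate, and the 0.8 threshold are all assumptions of this sketch, not part of the claim; find_best_overlap is sketched under claim 6 below, and the final synthesis step is reduced here to trimming both videos to the target overlapping video area.

```python
# Minimal sketch of the claim-1 pipeline (all names and defaults assumed).
import cv2
import numpy as np

def unify_frame_rate(path: str, target_fps: float) -> list:
    """Decode a video and resample it to target_fps by nearest-frame selection."""
    cap = cv2.VideoCapture(path)
    src_fps = cap.get(cv2.CAP_PROP_FPS) or target_fps
    frames = []
    ok, frame = cap.read()
    while ok:
        frames.append(frame)
        ok, frame = cap.read()
    cap.release()
    if not frames:
        return []
    n_out = max(1, int(len(frames) * target_fps / src_fps))
    idx = np.minimum((np.arange(n_out) * src_fps / target_fps).astype(int),
                     len(frames) - 1)
    return [frames[i] for i in idx]

def align_videos(path_a, path_b, encode_frame, fps=5.0, threshold=0.8):
    frames_a = unify_frame_rate(path_a, fps)                 # unified frame rates
    frames_b = unify_frame_rate(path_b, fps)
    curve_a = np.stack([encode_frame(f) for f in frames_a])  # feature curves (claims 2-3)
    curve_b = np.stack([encode_frame(f) for f in frames_b])
    # target overlapping area at maximum average similarity (claim-6 sketch below)
    max_sim, (sa, ea), (sb, eb) = find_best_overlap(curve_a, curve_b)
    if max_sim <= threshold:                                 # clipping check (claim 7)
        return None
    # "synthesis" reduced here to trimming both videos to the overlap
    return frames_a[sa:ea], frames_b[sb:eb]
```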
2. The video alignment method according to claim 1, wherein determining a target overlapping video area at which the average similarity between the first video to be aligned and the second video to be aligned is maximized comprises:
converting the first video to be aligned and the second video to be aligned into a first feature curve and a second feature curve, respectively;
and matching the first feature curve with the second feature curve to determine the maximum average similarity and the target overlapping video area.
3. The video alignment method according to claim 2, wherein converting the first video to be aligned and the second video to be aligned into a first feature curve and a second feature curve, respectively, comprises:
extracting video frames from the first video to be aligned and the second video to be aligned respectively to obtain a first video frame sequence and a second video frame sequence;
the first video frame sequence and the second video frame sequence are respectively and sequentially input into a pre-optimized structural feature encoder for feature extraction, and a first feature vector and a second feature vector of corresponding frames are obtained;
and forming the first characteristic curve according to the first characteristic vector, and forming the second characteristic curve according to the second characteristic vector.
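As a concrete but purely illustrative reading of claims 2-3, the sketch below uses a small PyTorch CNN as a stand-in for the structural feature encoder; the claims do not fix any architecture, and the feature curve is simply the ordered sequence of per-frame feature vectors.

```python
# Illustrative stand-in for the pre-optimized structural feature encoder
# (architecture and feature dimension are assumptions of this sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class StructuralFeatureEncoder(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, 3, H, W) frames -> (N, dim) L2-normalized feature vectors
        return F.normalize(self.net(x), dim=-1)

def feature_curve(frames: torch.Tensor, encoder: nn.Module) -> torch.Tensor:
    """Map a video frame sequence (T, 3, H, W) to its feature curve (T, dim)."""
    with torch.no_grad():
        return encoder(frames)
```

L2-normalizing the output means the dot product of two feature vectors is their cosine similarity, which keeps the similarity computations in the later claims simple.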
4. The video alignment method according to claim 3, wherein the structural feature encoder is optimized by:
acquiring a training video frame sequence set;
determining any two groups of training video frame sequences in the training video frame sequence set as target training video frame sequences, and inputting the target training video frame sequences into the structural feature encoder to obtain feature vectors of target training video frames in each target training video frame sequence;
taking each video frame in the two target training video frame sequences as a reference frame in turn, and calculating the feature vector similarity between the reference frame and every video frame other than the reference frame;
determining a loss value of a loss function of the structural feature encoder according to all the feature vector similarities, and updating model parameters of the structural feature encoder according to the loss value;
and returning to the step of determining any two groups of training video frame sequences in the training video frame sequence set as target training video frame sequences until a preset optimization condition is reached, to obtain the optimized structural feature encoder.
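Claim 4 leaves the exact loss function open. One plausible instantiation, sketched below, is an InfoNCE-style contrastive objective: for each reference frame, the same-index frame in the other (quality-converted) sequence is treated as its positive and all remaining frames as negatives. The pairing rule, temperature, and batch shape are assumptions of this sketch.

```python
# One assumed instantiation of the claim-4 optimization step: the claim
# only requires a loss computed from the similarities between each
# reference frame and all other frames; InfoNCE is used here for concreteness.
import torch
import torch.nn.functional as F

def training_step(encoder, seq_a, seq_b, optimizer, temperature=0.07):
    # seq_a, seq_b: (T, 3, H, W), two quality-converted views of one clip
    feats = torch.cat([encoder(seq_a), encoder(seq_b)])   # (2T, dim)
    sim = feats @ feats.T / temperature                   # every frame vs. every other
    t = seq_a.shape[0]
    # assumed positive: frame i of one sequence matches frame i of the other
    targets = torch.cat([torch.arange(t, 2 * t), torch.arange(0, t)])
    mask = torch.eye(2 * t, dtype=torch.bool)             # a frame never matches itself
    loss = F.cross_entropy(sim.masked_fill(mask, float("-inf")), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```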
5. The video alignment method according to claim 4, wherein acquiring a training video frame sequence set comprises:
the method comprises the steps of performing interval sampling on an original video according to a preset frame rate, and extracting from sampled video frames according to a preset window length to obtain an original video frame sequence;
and carrying out image quality conversion on the original video frame sequence according to a plurality of image quality conversion modes to obtain a plurality of video frame sequences with different image quality effects, and taking the original video frame sequence and the video frame sequence subjected to the image quality conversion as the training video frame sequence set.
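A sketch of how one entry of such a training set might be built, assuming three illustrative image quality conversion modes (Gaussian blur, downscale-upscale, and heavy JPEG re-encoding); the claim only requires that several conversion modes be applied, without fixing which ones.

```python
# Sketch of the claim-5 training-set construction; the three conversion
# modes are assumptions of this illustration.
import cv2

def build_training_entry(frames, src_fps, preset_fps, window):
    """frames: list of HxWx3 uint8 arrays decoded from the original video."""
    step = max(1, round(src_fps / preset_fps))   # interval sampling at the preset rate
    original = frames[::step][:window]           # cut to the preset window length

    def blur(f):
        return cv2.GaussianBlur(f, (5, 5), 0)

    def lowres(f):
        h, w = f.shape[:2]
        return cv2.resize(cv2.resize(f, (w // 4, h // 4)), (w, h))

    def jpeg(f):
        ok, buf = cv2.imencode(".jpg", f, [cv2.IMWRITE_JPEG_QUALITY, 20])
        return cv2.imdecode(buf, cv2.IMREAD_COLOR)

    converted = [[mode(f) for f in original] for mode in (blur, lowres, jpeg)]
    return [original] + converted   # the original plus its quality-converted variants
```

Because every converted sequence depicts the same frames at degraded quality, same-index frames across sequences form natural positive pairs for the optimization in claim 4.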
6. The video alignment method according to claim 2, wherein matching the first feature curve with the second feature curve to determine the maximum average similarity and the target overlapping video area comprises:
determining one of the first feature curve and the second feature curve as a reference curve, aligning the tail frame position of the other feature curve with the head frame position of the reference curve, and moving the other feature curve along the sequence direction of the reference curve by a preset step length;
determining, at each movement by the preset step length, a start frame position and an end frame position at which the other feature curve overlaps the reference curve;
taking the region formed between the start frame position and the end frame position as the overlapping video area, and calculating the average similarity between the feature vectors of the video frames within the overlapping video area;
and stopping the movement when the head frame position of the other feature curve overlaps the tail frame position of the reference curve, obtaining the maximum average similarity over the whole movement, and taking the overlapping video area corresponding to the maximum average similarity as the target overlapping video area.
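The sliding match of claim 6 amounts to scanning every relative offset of the two feature curves and scoring each overlap by its mean per-frame similarity. Below is a NumPy sketch under the assumption that the feature vectors are L2-normalized (so a dot product equals cosine similarity); the step size and return convention are illustrative.

```python
# NumPy sketch of the claim-6 sliding-window match (assumes L2-normalized
# feature curves, so a dot product is the cosine similarity).
import numpy as np

def find_best_overlap(curve_a: np.ndarray, curve_b: np.ndarray, step: int = 1):
    """curve_a is the reference (Ta, dim); curve_b (Tb, dim) slides along it.
    Returns (max_avg_sim, (start_a, end_a), (start_b, end_b))."""
    ta, tb = len(curve_a), len(curve_b)
    best = (-1.0, (0, 0), (0, 0))
    # offset = index of curve_b's head frame relative to curve_a's head frame;
    # the scan starts with curve_b's tail frame on curve_a's head frame ...
    for offset in range(-(tb - 1), ta, step):
        sa, ea = max(0, offset), min(ta, offset + tb)   # overlap inside curve_a
        sb, eb = sa - offset, ea - offset               # the same span inside curve_b
        # mean cosine similarity over the aligned frame pairs in the overlap
        avg = float(np.mean(np.sum(curve_a[sa:ea] * curve_b[sb:eb], axis=1)))
        if avg > best[0]:
            best = (avg, (sa, ea), (sb, eb))
    # ... and stops once curve_b's head frame reaches curve_a's tail frame
    return best
```

Each offset is scored with one vectorized dot product over the overlap, so the whole scan costs O(Ta x Tb x dim) in the worst case.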
7. The video alignment method according to claim 1, wherein determining whether clipping exists between the first video to be aligned and the second video to be aligned according to the maximum average similarity comprises:
if the maximum average similarity is greater than a preset threshold, determining that no clipping exists between the first video to be aligned and the second video to be aligned; otherwise, determining that clipping exists between the first video to be aligned and the second video to be aligned.
8. A video alignment apparatus, comprising an acquisition module, a determining module, and an alignment module, wherein:
the acquisition module is used for acquiring a first video to be aligned and a second video to be aligned, and unifying the frame rates of the first video to be aligned and the second video to be aligned;
the determining module is used for determining a target overlapping video area at which the average similarity between the first video to be aligned and the second video to be aligned is maximized;
the determining module is further used for determining whether clipping exists between the first video to be aligned and the second video to be aligned according to the maximum average similarity;
and the alignment module is used for synthesizing, when the determining module determines that no clipping exists, the video frames of the first video to be aligned and the second video to be aligned in the target overlapping video area to obtain an aligned video of the first video to be aligned and the second video to be aligned.
9. A computer device comprising a processor and a memory, the memory storing a computer program executable by the processor, wherein the processor executes the computer program to implement the video alignment method of any one of claims 1 to 7.
10. A computer readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the video alignment method of any one of claims 1 to 7.
CN202310900694.3A 2023-07-20 2023-07-20 Video alignment method, device, computer equipment and readable storage medium Pending CN116866637A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310900694.3A CN116866637A (en) 2023-07-20 2023-07-20 Video alignment method, device, computer equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310900694.3A CN116866637A (en) 2023-07-20 2023-07-20 Video alignment method, device, computer equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN116866637A (en) 2023-10-10

Family

ID=88233979

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310900694.3A Pending CN116866637A (en) 2023-07-20 2023-07-20 Video alignment method, device, computer equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN116866637A (en)

Similar Documents

Publication Publication Date Title
CN111050219B (en) Method and system for processing video content using a spatio-temporal memory network
EP3271865B1 (en) Detecting segments of a video program
US11222211B2 (en) Method and apparatus for segmenting video object, electronic device, and storage medium
CN111522996B (en) Video clip retrieval method and device
Zhang et al. Robust metric reconstruction from challenging video sequences
US11042991B2 (en) Determining multiple camera positions from multiple videos
CN110691259B (en) Video playing method, system, device, electronic equipment and storage medium
WO2020022956A1 (en) Method and apparatus for video content validation
US20220270354A1 (en) Monocular image-based model training method and apparatus, and data processing device
CN112597824A (en) Behavior recognition method and device, electronic equipment and storage medium
CN114419091A (en) Foreground matting method and device and electronic equipment
KR102037997B1 (en) Electronic apparatus and method for generating contents
US10924637B2 (en) Playback method, playback device and computer-readable storage medium
CN110738067A (en) Method and device for improving code scanning efficiency
Men et al. Visual quality assessment for interpolated slow-motion videos based on a novel database
US11546577B2 (en) Video jitter detection method and apparatus
WO2024032494A1 (en) Image processing method and apparatus, computer, readable storage medium, and program product
CN116866637A (en) Video alignment method, device, computer equipment and readable storage medium
CN110633630B (en) Behavior identification method and device and terminal equipment
CN116188535A (en) Video tracking method, device, equipment and storage medium based on optical flow estimation
CN114841870A (en) Image processing method, related device and system
CN113497886B (en) Video processing method, terminal device and computer-readable storage medium
CN115243073A (en) Video processing method, device, equipment and storage medium
Žižakić et al. Efficient local image descriptors learned with autoencoders
CN111143619B (en) Video fingerprint generation method, search method, electronic device and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination