CN108898150B - Video structure alignment method and system - Google Patents


Info

Publication number
CN108898150B
CN108898150B
Authority
CN
China
Prior art keywords
layer
input video
edge feature
alignment
input
Prior art date
Legal status
Active
Application number
CN201810903732.XA
Other languages
Chinese (zh)
Other versions
CN108898150A (en)
Inventor
胡事民 (Shi-Min Hu)
汪淼 (Miao Wang)
方晓楠 (Xiao-Nan Fang)
杨国炜 (Guo-Wei Yang)
Current Assignee
Tsinghua University
Original Assignee
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN201810903732.XA priority Critical patent/CN108898150B/en
Publication of CN108898150A publication Critical patent/CN108898150A/en
Application granted granted Critical
Publication of CN108898150B publication Critical patent/CN108898150B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention provides a video structure alignment method and a video structure alignment system. The alignment method comprises the following steps: fusing the structure edge information map of either of two input videos with the gradient information map of that input video to obtain the salient edge feature map of that input video; down-sampling the salient edge feature map of that input video in the temporal and spatial dimensions to construct its salient edge feature map layers, wherein the resolution of the layers increases layer by layer from the top layer to the bottom layer; and matching the salient edge feature map layers of the two input videos with a correlation-based matching metric to obtain a preset number of alignment results, and selecting any one of them to align the video structures. The embodiment of the invention avoids the heavy computational cost of operations such as edge detection, and can obtain alignment results efficiently and quickly to align the videos.

Description

Video structure alignment method and system
Technical Field
The embodiment of the invention relates to the technical field of image processing, in particular to a video structure alignment method and a video structure alignment system.
Background
Video matching technology is a very important tool in the field of computer graphics and can help users quickly retrieve video content that meets their requirements. Since video space-time alignment was proposed in 2002, video matching has been widely studied, but existing techniques are based on local feature matching or color matching and are used to quickly retrieve video content of identical or visually similar scenes; in a video serving system, for example, the alignment of unmatched scene intervals on the timeline can be predicted from the intervals where scenes partially match. Video alignment between different scenes, however, remains a challenge, because: first, video content from different scenes has no robust local features to match; second, matching videos of different scenes must satisfy the alignment of visual structural information.
Edge detection in images is a fundamental problem in the fields of computer vision and graphics. Since the Sobel operator in 1983 and the Canny edge detection algorithm in 1986, a great deal of work has been proposed, such as the statistical edge method and the gPb method. With the development of deep learning in recent years, methods based on convolutional neural networks have appeared, such as N4-fields-based edge detection and holistically-nested edge detection. However, there has been little research on how to extract salient edges from video in order to achieve video structure alignment between different scenes.
Disclosure of Invention
Aiming at the technical problem of how to match the salient edge features of videos in order to align their structure, the embodiment of the invention provides a video structure alignment method and system.
The embodiment of the invention provides a video structure alignment method, comprising: fusing the structure edge information map of either of two input videos with the gradient information map of that input video to obtain the salient edge feature map of that input video; down-sampling the salient edge feature map of that input video in the temporal and spatial dimensions to construct its salient edge feature map layers, wherein the resolution of the layers increases layer by layer from the top layer to the bottom layer; and matching the salient edge feature map layers of the two input videos with a correlation-based matching metric to obtain a preset number of alignment results, and selecting any one of them to align the video structures.
An embodiment of the present invention provides a video structure alignment system, including: an obtaining module, configured to fuse the structure edge information map of either of two input videos with the gradient information map of that input video to obtain its salient edge feature map; a constructing module, configured to down-sample the salient edge feature map of that input video in the temporal and spatial dimensions and construct its salient edge feature map layers, wherein the resolution of the layers increases layer by layer from the top layer to the bottom layer; and a matching module, configured to match the salient edge feature map layers of the two input videos with a correlation-based matching metric, obtain a preset number of alignment results, and select any one of them to align the video structures.
An embodiment of the present invention provides a video structure alignment apparatus, including: at least one processor; and at least one memory communicatively coupled to the processor, wherein: the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the alignment method described above.
Embodiments of the present invention provide a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the above alignment method.
According to the video structure alignment method and system provided by the embodiments of the invention, constructing the salient edge feature map layers of the input videos makes the subsequent matching of the videos' salient edge features for alignment convenient. Searching for alignment results with a correlation-based matching metric over the salient edge features avoids the heavy computational cost of operations such as edge detection, so that alignment results can be obtained efficiently and quickly to align the videos.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flow chart of an embodiment of a video structure alignment method of the present invention;
FIG. 2 is a block diagram of an embodiment of a video structure alignment system of the present invention;
fig. 3 is a schematic diagram of a framework of a video structure alignment apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of an embodiment of the video structure alignment method of the present invention. As shown in fig. 1, the method includes: S101, fusing the structure edge information map of either of two input videos with the gradient information map of that input video to obtain the salient edge feature map of that input video; S102, down-sampling the salient edge feature map of that input video in the temporal and spatial dimensions to construct its salient edge feature map layers, wherein the resolution of the layers increases layer by layer from the top layer to the bottom layer; S103, matching the salient edge feature map layers of the two input videos with a correlation-based matching metric to obtain a preset number of alignment results, and selecting any one of them to align the video structures.
Specifically, in step S101 the structure edge information map of an input video contains the edge information of the main content of that video. The gradient information map comprises a gradient magnitude component and a gradient direction component of the video, the direction component being represented as an angle. The salient edge feature map of the video comprises intensity information and angle information of the video.
Further, step S101 fuses the structure edge information map of an input video with the gradient information map of that video to obtain its salient edge feature map. That is, the information map containing the gradient magnitude component and the gradient direction component of the video is fused with the edge information map of its main content, yielding a structure edge information map that contains both intensity information and angle information; this map is referred to as the salient edge feature map. Image fusion refers to extracting, by image processing and computing techniques, the useful information from each channel of image data collected about the same object from multiple source channels, and finally synthesizing this information into a high-quality image.
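As an illustration of the gradient information map described above, the following sketch computes a per-frame gradient magnitude component and gradient direction component with Sobel filters. It is only a minimal example, assuming a grayscale frame is given; the use of OpenCV and the function name are illustrative and not part of the patent.

```python
import cv2
import numpy as np

def gradient_information_map(frame_gray):
    """Compute the gradient information map (I, theta) of one grayscale frame.

    Returns the gradient magnitude component I and the gradient direction
    component theta, the latter represented as an angle (radians).
    """
    gx = cv2.Sobel(frame_gray, cv2.CV_32F, 1, 0, ksize=3)   # horizontal derivative
    gy = cv2.Sobel(frame_gray, cv2.CV_32F, 0, 1, ksize=3)   # vertical derivative
    magnitude = np.sqrt(gx ** 2 + gy ** 2)                  # I
    direction = np.arctan2(gy, gx)                          # theta
    return magnitude, direction
```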
Further, after step S101 the salient edge feature maps of the two input videos have been obtained. The problem to be solved by the embodiment of the present invention is how to match the salient edge features of the videos in order to align them, and step S102 is introduced for this purpose. The aim of step S102 is to obtain a hierarchical structure of the salient edge feature map of each input video, that is, the salient edge feature map represented at several levels. The salient edge feature map of an input video has a temporal dimension and a spatial dimension: it has a temporal dimension because it is computed from a video file, which carries time information; it has a spatial dimension because the input video contains content at different scales. In step S102 the salient edge feature map is therefore down-sampled in both the temporal and spatial dimensions to obtain a hierarchical structure in which different levels correspond to different resolutions. It should be noted that down-sampling refers to the process of reducing the sampling rate of a signal.
Further, step S102 yields the hierarchical structure of the salient edge feature map of each input video. By matching the salient edge feature map layers of the two input videos, matched salient edge feature map layers are obtained; concretely, the hierarchies of the two videos have the same number of layers, each layer of one video's salient edge feature map layers is matched against the corresponding layer of the other video's, and the matched salient edge feature map layers are finally obtained. By traversing the matched layers, several per-layer alignment results are obtained at each layer; all per-layer results are then combined, and a preset number of them are taken as the alignment results. It should be noted that matching in the embodiment of the present invention is based on a correlation matching metric: matching the feature layers of two videos requires a matching metric, and the embodiment uses one based on correlation.
The embodiment of the invention preferably sorts all the per-layer alignment results in descending order of alignment quality, from best to worst, and takes the preset number of alignment results in order starting from the first.
It should be noted that an alignment result is the result that would be obtained if the two input videos were aligned accordingly.
Further, after the preset number of alignment results has been obtained, one of them is selected to align the two input videos; the embodiment of the present invention preferably selects the result with the best alignment quality for the final alignment operation. Preferably, the best alignment quality means that the difference remaining after aligning the two videos is minimal.
It should further be noted that the two input videos in the embodiment of the present invention may belong to the same scene or to different scenes.
Further, the video structure alignment method provided by the embodiment of the invention is particularly applicable to scene transitions in film and television production: when the two input videos are the outgoing and incoming scenes of a cut, the method can align them efficiently so that the transition appears coherent. The method can also be used for video retrieval, to find the video in a pool that aligns best with a given video. In addition, it can be applied to the visualization of video structure alignment and to unified stylization in video content production.
According to the video structure alignment method provided by the embodiment of the invention, constructing the salient edge feature map layers of the input videos facilitates the subsequent matching of the videos' salient edge features for alignment. Searching for alignment results with a correlation-based matching metric over the salient edge features avoids the heavy computational cost of operations such as edge detection, so that alignment results can be obtained efficiently and quickly to align the videos.
Based on the above embodiment, step S101, fusing the structure edge information map of either of the two input videos with the gradient information map of that input video to obtain its salient edge feature map, specifically includes: acquiring a binarized structure edge information map of the input video; and fusing the binarized structure edge information map of the input video with the gradient information map of the input video to obtain the salient edge feature map of the input video. In the binarized structure edge information map, a pixel is assigned the first of two values if it belongs to a structure edge, and the second of the two values if it does not.
Specifically, in the embodiment of the present invention the structure edge information map of the input video is binarized; preferably, a pixel p is assigned g(p) = 1 if it belongs to a structure edge in the binarized structure edge information map, and g(p) = 0 if it does not.
Further, the floating-point gradient information of a pixel is represented as (I, θ), where I is the gradient magnitude component of the pixel and θ is its gradient direction component. The salient edge feature of the pixel is then given by the following formula:
M'(p) = g(p) · M(p)
where M'(p) is the salient edge feature of the pixel p and M(p) is its gradient information.
Further, a pixel belonging to a structure edge means, in the embodiment of the present invention, that the content displayed at that pixel belongs to the edge information of the main content of the input video.
Further, the salient edge features of all pixels are assembled into the salient edge feature map of the input video.
In the video structure alignment method provided by the embodiment of the invention, the binarized structure edge information map of an input video is fused with the gradient information map of that video to obtain its salient edge feature map; by fusing in the gradient information (gradient magnitude component and gradient direction component), the salient edge features that express the structure of the video content can be extracted automatically and effectively.
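A minimal sketch of this fusion step, assuming the per-pixel formula M'(p) = g(p) · M(p) above amounts to masking the gradient magnitude with the binarized structure edge map while keeping the direction component; the function and argument names are illustrative only.

```python
import numpy as np

def salient_edge_feature_map(binary_edge, magnitude, direction):
    """Fuse the binarized structure edge map g(p) with the gradient
    information map M(p) = (I, theta) into the salient edge feature map.

    binary_edge: array of {0, 1}, 1 where a pixel lies on a structure edge
    magnitude, direction: gradient magnitude I and gradient angle theta
    Returns the intensity component I' and the angle component theta'.
    """
    intensity = binary_edge.astype(np.float32) * magnitude   # I'(p) = g(p) * I(p)
    angle = direction                                         # theta' kept unchanged
    return intensity, angle
```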
Based on the above embodiment, obtaining the binarized structure edge information map of either of the two input videos specifically includes: smoothing each frame of the input video with an edge-preserving image filter based on L0 smoothing to obtain the smoothed input video; and computing the structure edges of the smoothed input video with a three-dimensional mean shift algorithm, then obtaining the binarized structure edge information map of the input video from those structure edges.
Specifically, this embodiment is a preprocessing stage that includes two steps: the first is image smoothing, and the second is obtaining the binarized structure edge information map.
Further, image smoothing is a conventional image processing technique. Under the influence of various factors, an image may contain regions whose brightness changes too sharply, or isolated bright spots (also referred to as noise); image smoothing is a processing method that smooths the brightness of the image to suppress such noise, and is in effect a low-pass filtering operation.
It should be noted that the smoothing strength parameter of the edge-preserving image filter based on L0 smoothing is preferably set to 0.05.
Further, the mean shift algorithm is an effective iterative statistical algorithm that has been widely applied to cluster analysis, tracking, image segmentation, image smoothing, filtering, image edge extraction, information fusion and the like.
Specifically, computing the structure edges of the smoothed input video with the three-dimensional mean shift algorithm means, as indicated in the above embodiment, assigning the first of two values to a pixel of the binarized structure edge information map if it belongs to a structure edge, and the second of the two values if it does not.
Further, obtaining the binarized structure edge information map of the input video from the structure edges of the smoothed video means assembling the values assigned to all pixels into the binarized structure edge information map.
In the video structure alignment method provided by the embodiment of the invention, the image smoothing and binarization steps make it convenient to obtain the salient edge features of the video in the subsequent steps.
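The following per-frame sketch illustrates this preprocessing. It assumes OpenCV's contrib module for the L0 smoothing call (cv2.ximgproc.l0Smooth) and substitutes a per-frame pyrMeanShiftFiltering for the three-dimensional mean shift of the patent; the spatial/color radii and the edge threshold are illustrative assumptions, not values from the patent.

```python
import cv2
import numpy as np

def binarized_structure_edge_map(frame_bgr, smooth_lambda=0.05, edge_thresh=10.0):
    """Per-frame approximation of the preprocessing: L0-based edge-preserving
    smoothing followed by mean-shift-based structure edge extraction.

    Requires opencv-contrib-python for cv2.ximgproc.l0Smooth; the patent's
    three-dimensional mean shift over the whole video is approximated here
    frame by frame with pyrMeanShiftFiltering.
    """
    # L0 gradient-minimization smoothing; the 0.05 strength follows the text.
    smoothed = cv2.ximgproc.l0Smooth(frame_bgr, None, smooth_lambda)

    # Mean shift groups the smoothed frame into piecewise-constant regions.
    shifted = cv2.pyrMeanShiftFiltering(smoothed, sp=10, sr=20)

    # Boundaries between the mean-shifted regions are taken as structure edges.
    gray = cv2.cvtColor(shifted, cv2.COLOR_BGR2GRAY)
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
    boundary = np.sqrt(gx ** 2 + gy ** 2)

    # g(p) = 1 on a structure edge, 0 otherwise (the two binarization values).
    return (boundary > edge_thresh).astype(np.uint8)
```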
Based on the above embodiment, in step S102 the salient edge feature map of either input video is down-sampled in the temporal and spatial dimensions to construct its salient edge feature map layers, which specifically includes: down-sampling the salient edge feature map of the input video to a preset number of frames per second in the temporal dimension, and constructing a feature pyramid with a preset number of layers in the spatial dimension on the salient edge feature map of the input video; and obtaining the salient edge feature map layers of the input video from the feature map that has been down-sampled to the preset number of frames per second in time and organized into the preset number of pyramid layers in space.
Specifically, as pointed out in the above embodiments, down-sampling is the process of reducing the sampling rate of a signal. Applied here, the salient edge feature map of the input video is down-sampled to a preset number of frames per second in the temporal dimension, preferably 1 frame per second.
Further, as pointed out above, the resolution of the salient edge feature map layers increases layer by layer from top to bottom; constructing a feature pyramid with a preset number of layers in the spatial dimension realizes this layer-by-layer increase. Preferably 9 layers are constructed, the scaling ratio between adjacent layers is set to √2/2, and the length and width of the topmost layer are each 1/16 of the original size.
Further, once the temporal dimension has been down-sampled to the preset number of frames per second and the feature pyramid with the preset number of layers has been constructed in the spatial dimension, the construction of the salient edge feature map layers of the input video is complete.
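A sketch of the layer construction under stated assumptions: it operates on the intensity component only (the angle component would be resampled the same way), keeps one frame per second, and builds 9 spatial levels scaled by √2/2, so the top level is 1/16 of the original width and height. The function name and the choice of cv2.resize are illustrative.

```python
import cv2
import numpy as np

def salient_edge_feature_layers(intensity_frames, fps, num_levels=9,
                                scale=np.sqrt(2.0) / 2.0):
    """Temporal down-sampling to 1 frame per second, then a 9-level spatial
    pyramid with inter-level ratio sqrt(2)/2 (top level = 1/16 of original).

    intensity_frames: list of 2-D arrays, one intensity map per frame.
    Returns a list of (frames, height, width) volumes, bottom level first.
    """
    step = max(1, int(round(fps)))          # keep one frame per second
    sampled = intensity_frames[::step]

    layers = []
    for level in range(num_levels):
        factor = scale ** level             # level 0 is the full-resolution bottom
        resized = [cv2.resize(f, None, fx=factor, fy=factor,
                              interpolation=cv2.INTER_AREA) for f in sampled]
        layers.append(np.stack(resized))
    return layers
```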
Based on the above embodiment, in step S103, matching the salient edge feature map layers of the two input videos with the correlation-based matching metric to obtain the preset number of alignment results specifically includes: computing the correlation of the corresponding salient edge feature maps of the two input videos under a given set of alignment parameters and taking this correlation as the matching metric, where a set of alignment parameters comprises a translation of each input video in the global three-dimensional coordinate system; matching the salient edge feature map layers of the two input videos with this correlation-based metric to obtain matched salient edge feature map layers; and traversing the sets of alignment parameters for each layer of the matched salient edge feature map layers to obtain the preset number of alignment results.
Specifically, the global three-dimensional coordinate system is the three-dimensional coordinate system in which both input videos are placed. The correlation of the corresponding salient edge feature maps of the two input videos under a set of alignment parameters is expressed by the following formula:
[Formula for the correlation C(o1, o2); rendered as an image in the original document.]
where C(o1, o2) is the correlation, o1 is the translation of one video in the global three-dimensional coordinate system, o2 is the translation of the other video, D is the three-dimensional extent of the video in time and space, r traverses the three-dimensional coordinate points (x, y, z) within the video extent D, θ' is the angle component of the video's salient edge feature map, I' is the intensity component of the video's salient edge feature map, and ε = 0.01 is an auxiliary constant that prevents division by zero.
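The exact correlation formula is given as an image in the original document and is not reproduced here. As a stand-in, the sketch below implements one plausible normalized form that uses only the quantities named above: the two intensity components, the cosine of the angle difference, and the auxiliary constant 0.01. It assumes the two feature volumes have already been cropped to their overlapping region under the translations (o1, o2); this form is an assumption, not the patented formula.

```python
import numpy as np

EPSILON = 0.01  # auxiliary constant from the text, prevents division by zero

def correlation(i1, a1, i2, a2):
    """Hypothetical correlation between two salient edge feature volumes.

    i1, i2: intensity components I' of the two videos
    a1, a2: angle components theta' of the two videos
    All four arrays cover the same traversed region r of extent D.
    """
    weighted = i1 * i2 * np.cos(a1 - a2)          # agreement of edge directions
    norm = np.sqrt(np.sum(i1 ** 2) * np.sum(i2 ** 2)) + EPSILON
    return float(np.sum(weighted) / norm)
```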
Further, the foregoing embodiment stated that each layer of the matched salient edge feature map layers is traversed to obtain the preset number of alignment results; this embodiment specifies that it is the sets of alignment parameters that are traversed to obtain those results.
The previous embodiment indicated that the sets of alignment parameters need to be traversed to obtain the preset number of alignment results. How they are traversed is explained as follows. Traversing the sets of alignment parameters for each layer of the matched salient edge feature map layers to obtain the preset number of alignment results specifically includes: traversing the sets of alignment parameters for each layer in order of increasing resolution to obtain, for each layer, that layer's optimal alignment result together with a plurality of further high-ranking alignment results; and obtaining the preset number of alignment results from the per-layer optimal alignment results and the further high-ranking alignment results.
Specifically, traversing the sets of alignment parameters for each layer in order of increasing resolution, and obtaining an optimal alignment result and a plurality of further high-ranking alignment results for each layer, can be understood as follows. First, the sets of alignment parameters are traversed on the top layer, which has the lowest resolution, to obtain the top layer's optimal alignment result and its further high-ranking alignment results; these results correspond to small regions of the top layer. After the top layer's results are obtained, the alignment parameters are traversed over the local region of the second layer that corresponds to the small region of the top layer's optimal result, giving the second layer's optimal and further high-ranking alignment results. The remaining layers are traversed in the same manner in turn, finally yielding an optimal alignment result and a plurality of further high-ranking alignment results for each layer.
Further, the preset number of alignment results is then obtained from the optimal alignment result and the further high-ranking alignment results of each layer.
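A sketch of this coarse-to-fine traversal, reusing the hypothetical correlation function above. It simplifies the patent in several labeled ways: the pair of translations (o1, o2) is folded into a single relative offset, shifting is done with a circular np.roll instead of true translation with cropping, and offsets are doubled between levels as a stand-in for the √2/2 inter-level scale; the search radius and the number of kept candidates are illustrative.

```python
import itertools
import numpy as np

def coarse_to_fine_search(layers1, layers2, keep=10, radius=2):
    """Traverse candidate offsets from the coarsest (top) layer to the finest,
    keeping the best `keep` alignment results per layer and refining around
    them on the next, higher-resolution layer.

    layers1 / layers2: lists of (intensity, angle) volume pairs, top layer first.
    Returns a list of (score, offset) pairs sorted from best to worst.
    """
    # Exhaustive small grid of (dt, dy, dx) offsets at the coarsest layer.
    candidates = [np.array(o) for o in
                  itertools.product(range(-radius, radius + 1), repeat=3)]
    results = []
    for (i1, a1), (i2, a2) in zip(layers1, layers2):
        scored = []
        for off in candidates:
            # Circular shift as a simplification of translating the second volume.
            i2s = np.roll(i2, tuple(off), axis=(0, 1, 2))
            a2s = np.roll(a2, tuple(off), axis=(0, 1, 2))
            scored.append((correlation(i1, a1, i2s, a2s), tuple(off)))
        scored.sort(key=lambda s: s[0], reverse=True)       # best alignment first
        results = scored[:keep]                             # this layer's results
        # Refine around the surviving offsets on the next, finer layer.
        candidates = [np.array(o) * 2 + np.array(d)
                      for _, o in results
                      for d in itertools.product((-1, 0, 1), repeat=3)]
    return results
```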
Based on the above embodiments, fig. 2 is a block diagram of an embodiment of the video structure alignment system of the present invention. As shown in fig. 2, the system includes: an obtaining module 201, configured to fuse the structure edge information map of either of two input videos with the gradient information map of that input video to obtain its salient edge feature map; a constructing module 202, configured to down-sample the salient edge feature map of that input video in the temporal and spatial dimensions and construct its salient edge feature map layers, whose resolution increases layer by layer from the top layer to the bottom layer; and a matching module 203, configured to match the salient edge feature map layers of the two input videos with a correlation-based matching metric, obtain a preset number of alignment results, and select any one of them to align the video structures.
The system of the embodiment of the present invention may be used to carry out the technical solution of the method embodiment shown in fig. 1; its implementation principle and technical effect are similar and are not described again here.
Based on the above embodiments, fig. 3 is a schematic frame diagram of a video structure alignment apparatus in an embodiment of the present invention. Referring to fig. 3, an embodiment of the present invention provides a video structure alignment apparatus, including: a processor (processor)310, a communication Interface (communication Interface)320, a memory (memory)330 and a bus 340, wherein the processor 310, the communication Interface 320 and the memory 330 complete communication with each other through the bus 340. The processor 310 may call logic instructions in the memory 330 to perform methods comprising: fusing a structure edge information graph of any one of two input videos and a gradient information graph of the any one input video to obtain a salient edge feature graph of the any one input video; down-sampling a time dimension and a space dimension of the salient edge feature map of any input video to construct a salient edge feature map layer of any input video, wherein the resolution of the salient edge feature map layer is increased layer by layer from the top layer to the bottom layer; and matching the significant edge feature layers of the two input videos based on the matching metric of the correlation to obtain the preset number of alignment results, and selecting any one of the preset number of alignment results to align the video structure.
An embodiment of the present invention discloses a computer program product, which includes a computer program stored on a non-transitory computer readable storage medium, the computer program includes program instructions, and when the program instructions are executed by a computer, the computer can execute the alignment method provided by the above-mentioned method embodiments, for example, the method includes: fusing a structure edge information graph of any one of two input videos and a gradient information graph of the any one input video to obtain a salient edge feature graph of the any one input video; down-sampling a time dimension and a space dimension of the salient edge feature map of any input video to construct a salient edge feature map layer of any input video, wherein the resolution of the salient edge feature map layer is increased layer by layer from the top layer to the bottom layer; and matching the significant edge feature layers of the two input videos based on the matching metric of the correlation to obtain the preset number of alignment results, and selecting any one of the preset number of alignment results to align the video structure.
Based on the foregoing embodiments, an embodiment of the present invention provides a non-transitory computer-readable storage medium, which stores computer instructions, where the computer instructions cause the computer to execute the alignment method provided by the foregoing method embodiments, for example, including: fusing a structure edge information graph of any one of two input videos and a gradient information graph of the any one input video to obtain a salient edge feature graph of the any one input video; down-sampling a time dimension and a space dimension of the salient edge feature map of any input video to construct a salient edge feature map layer of any input video, wherein the resolution of the salient edge feature map layer is increased layer by layer from the top layer to the bottom layer; and matching the significant edge feature layers of the two input videos based on the matching metric of the correlation to obtain the preset number of alignment results, and selecting any one of the preset number of alignment results to align the video structure.
Those of ordinary skill in the art will understand that: the implementation of the above-described apparatus embodiments or method embodiments is merely illustrative, wherein the processor and the memory may or may not be physically separate components, i.e. may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as a usb disk, a removable hard disk, a ROM/RAM, a magnetic disk, an optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute the method according to the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (7)

1. A method for video structure alignment, comprising:
fusing a structure edge information graph of any one of two input videos and a gradient information graph of the any one input video to obtain a salient edge feature graph of the any one input video;
down-sampling a time dimension and a space dimension of the salient edge feature map of any input video to construct a salient edge feature map layer of any input video, wherein the resolution of the salient edge feature map layer is increased layer by layer from the top layer to the bottom layer;
matching the significant edge feature layers of the two input videos based on the matching measurement of the correlation to obtain a preset number of alignment results, and selecting any one of the preset number of alignment results to align the video structure;
the merging the structure edge information graph of any one of the two input videos and the gradient information graph of any one of the two input videos to obtain the salient edge feature graph of any one of the input videos specifically comprises:
acquiring a binary structure edge information graph of any one of the two input videos;
fusing the binaryzation structure edge information graph of any input video with the gradient information graph of any input video to obtain a salient edge feature graph of any input video;
if any pixel in the binarized structure edge information map belongs to a structure edge, assigning a first value in two values to the any pixel, and if any pixel in the binarized structure edge information map does not belong to the structure edge, assigning a second value in two values to the any pixel;
the acquiring of the binarized structure edge information map of any one of the two input videos specifically includes:
by being based on L0A smooth edge image holding filter for smoothing each frame of any one of the two input videos to obtain any input video after smoothing;
calculating and obtaining the structure edge of any input video after smoothing through a three-dimensional mean shift algorithm, and obtaining a binary structure edge information graph of any input video based on the structure edge of any input video after smoothing;
the matching metric based on the correlation is used for matching the significant edge feature layers of the two input videos to obtain the alignment results with the preset number, and the method specifically comprises the following steps:
calculating the correlation of the salient edge feature maps corresponding to the two input videos under any one set of alignment parameters, and taking the correlation as the matching measurement of the correlation, wherein any one set of alignment parameters comprises any translation amount of each input video under the global three-dimensional coordinate;
matching the salient edge feature layers of the two input videos based on the matching measurement of the correlation to obtain matched salient edge feature layers;
and traversing any one group of alignment parameters for each layer in the matched significant edge feature layer to obtain a preset number of alignment results.
2. The alignment method according to claim 1, wherein the downsampling a time dimension and a space dimension of the significant edge feature map of any input video to construct the significant edge feature map layer of any input video specifically comprises:
down-sampling the salient edge feature map of any input video to a preset frame number per second in a time dimension, and constructing a feature pyramid with a preset layer number in a space dimension on the salient edge feature map of any input video;
and acquiring the significant edge feature map layer of any input video based on the significant edge feature map of any input video, which is obtained by down-sampling to a preset frame number per second in a time dimension and constructing a preset number of layers of feature pyramids in a space dimension.
3. The alignment method according to claim 1, wherein the traversing any one set of alignment parameters for each of the matched significant edge feature image layers to obtain a preset number of alignment results specifically comprises:
traversing any one group of alignment parameters for each layer in the matched significant edge feature layer in the order of increasing resolution to obtain a layer optimal alignment result and a plurality of layer optimal alignment results corresponding to each layer;
and acquiring a preset number of alignment results based on the optimal alignment result of one layer corresponding to each layer and a plurality of layer optimal alignment results.
4. The alignment method according to claim 1, wherein the gradient information map comprises a gradient magnitude component and a gradient direction component of the any input video, and the significant edge feature map of the any input video comprises intensity information and angle information of the any input video.
5. A video structure alignment system, comprising:
the acquisition module is used for fusing a structure edge information graph of any input video of the two input videos and a gradient information graph of the any input video to acquire a salient edge feature graph of the any input video;
the construction module is used for conducting time dimension and space dimension down-sampling on the salient edge feature map of any input video and constructing a salient edge feature map layer of any input video, wherein the resolution of the salient edge feature map layer is increased layer by layer from the top layer to the bottom layer;
the matching module is used for matching the significant edge feature layers of the two input videos based on the matching measurement of the correlation, acquiring the alignment results of the preset number, and selecting any one of the alignment results of the preset number to align the video structure;
the merging the structure edge information graph of any one of the two input videos and the gradient information graph of any one of the two input videos to obtain the salient edge feature graph of any one of the input videos specifically comprises:
acquiring a binary structure edge information graph of any one of the two input videos;
fusing the binaryzation structure edge information graph of any input video with the gradient information graph of any input video to obtain a salient edge feature graph of any input video;
if any pixel in the binarized structure edge information map belongs to a structure edge, assigning a first value in two values to the any pixel, and if any pixel in the binarized structure edge information map does not belong to the structure edge, assigning a second value in two values to the any pixel;
the acquiring of the binarized structure edge information map of any one of the two input videos specifically includes:
by being based on L0A smooth edge image holding filter for smoothing each frame of any one of the two input videos to obtain any input video after smoothing;
calculating and obtaining the structure edge of any input video after smoothing through a three-dimensional mean shift algorithm, and obtaining a binary structure edge information graph of any input video based on the structure edge of any input video after smoothing;
the matching metric based on the correlation is used for matching the significant edge feature layers of the two input videos to obtain the alignment results with the preset number, and the method specifically comprises the following steps:
calculating the correlation of the salient edge feature maps corresponding to the two input videos under any one set of alignment parameters, and taking the correlation as the matching measurement of the correlation, wherein any one set of alignment parameters comprises any translation amount of each input video under the global three-dimensional coordinate;
matching the salient edge feature layers of the two input videos based on the matching measurement of the correlation to obtain matched salient edge feature layers;
and traversing any one group of alignment parameters for each layer in the matched significant edge feature layer to obtain a preset number of alignment results.
6. A video structure alignment apparatus, comprising:
at least one processor; and
at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the alignment method of any of claims 1 to 4.
7. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the alignment method according to any one of claims 1 to 4.
CN201810903732.XA 2018-08-09 2018-08-09 Video structure alignment method and system Active CN108898150B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810903732.XA CN108898150B (en) 2018-08-09 2018-08-09 Video structure alignment method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810903732.XA CN108898150B (en) 2018-08-09 2018-08-09 Video structure alignment method and system

Publications (2)

Publication Number Publication Date
CN108898150A CN108898150A (en) 2018-11-27
CN108898150B true CN108898150B (en) 2020-08-21

Family

ID=64353655

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810903732.XA Active CN108898150B (en) 2018-08-09 2018-08-09 Video structure alignment method and system

Country Status (1)

Country Link
CN (1) CN108898150B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109743591B (en) * 2019-01-04 2022-01-25 广州虎牙信息科技有限公司 Method for video frame alignment

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2240903B1 (en) * 2008-01-18 2018-11-14 Nederlandse Organisatie voor toegepast- natuurwetenschappelijk onderzoek TNO Method of improving the resolution of a moving object in a digital image sequence
US8885945B2 (en) * 2012-12-27 2014-11-11 Mitutoyo Corporation Method for improving repeatability in edge location results of a machine vision inspection system
KR20140102038A (en) * 2013-02-13 2014-08-21 삼성전자주식회사 Video matching device and video matching method
CN103679702B (en) * 2013-11-20 2016-08-31 华中科技大学 A kind of matching process based on image border vector
CN103731664B (en) * 2013-12-25 2015-09-30 华为技术有限公司 Full reference video quality appraisal procedure, device and video quality tester
CN106034213B (en) * 2015-03-12 2019-02-22 株式会社理光 Generate the method, apparatus and system of light carving project content
CN107423677A (en) * 2017-05-19 2017-12-01 深圳市唯特视科技有限公司 A kind of face's figure description method based on local edge gradient G abor amplitudes
CN107454284B (en) * 2017-09-13 2020-05-15 厦门美图之家科技有限公司 Video denoising method and computing device

Also Published As

Publication number Publication date
CN108898150A (en) 2018-11-27

Similar Documents

Publication Publication Date Title
Liu et al. Single image dehazing via large sky region segmentation and multiscale opening dark channel model
Yu et al. Fast single image fog removal using edge-preserving smoothing
CN107045634B (en) Text positioning method based on maximum stable extremum region and stroke width
CN109064505B (en) Depth estimation method based on sliding window tensor extraction
Lo et al. Joint trilateral filtering for depth map super-resolution
US10230935B2 (en) Method and a system for generating depth information associated with an image
CN107886471B (en) Method for removing redundant objects of photo based on super-pixel voting model
Lee et al. Depth estimation from light field by accumulating binary maps based on foreground–background separation
Yan et al. Depth map generation for 2d-to-3d conversion by limited user inputs and depth propagation
CN111353955A (en) Image processing method, device, equipment and storage medium
Wang et al. An efficient method for image dehazing
CN108898150B (en) Video structure alignment method and system
Liu et al. A second-order variational framework for joint depth map estimation and image dehazing
CN111881925B (en) Significance detection method based on camera array selective light field refocusing
CN107292892B (en) Video frame image segmentation method and device
CN113421210A (en) Surface point cloud reconstruction method based on binocular stereo vision
Weber et al. Interactive video segmentation based on quasi-flat zones
Mathai et al. Automatic 2D to 3D video and image conversion based on global depth map
CN110263676A (en) A method of for generating high quality crowd density figure
CN115115535A (en) Depth map denoising method, device, medium and equipment
RU2718429C1 (en) Device for detecting local-stationary areas on an image
Aboali et al. A Multistage Hybrid Median Filter Design of Stereo Matching Algorithms on Image Processing
CN111932470A (en) Image restoration method, device, equipment and medium based on visual selection fusion
Avagyan et al. Modified SSR-NET: A Shallow Convolutional Neural Network for Efficient Hyperspectral Image Super-Resolution
Xu et al. Progressive image painting outside the box with edge domain

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant