CN111984820A - Video abstraction method based on double-self-attention capsule network - Google Patents

Video abstraction method based on double-self-attention capsule network

Info

Publication number
CN111984820A
CN111984820A
Authority
CN
China
Prior art keywords
video
capsule
frame
attention
self
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911313856.3A
Other languages
Chinese (zh)
Other versions
CN111984820B (en)
Inventor
王洪星
傅豪
徐玲
杨梦宁
洪明坚
葛永新
黄晟
陈飞宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University
Original Assignee
Chongqing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University
Priority to CN201911313856.3A
Publication of CN111984820A
Application granted
Publication of CN111984820B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/738Presentation of query results
    • G06F16/739Presentation of query results in form of a video summary, e.g. the video summary being a video sequence, a composite still image or having synthesized frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Television Signal Processing For Recording (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video abstraction method based on a double-self-attention capsule network, which comprises the following steps: S1: treat the video summarization problem as a labeling problem over a sequence of video frames; S2: for a given video, extract an initial feature vector for each video frame; S3: refine the initial feature vectors with the double self-attention model; S4: fuse the refined features with a double-flow capsule network and label each frame of the video; S5: train the model in a deep-learning manner with the corresponding objective function; S6: generate the final summary based on the model trained in S5. Advantageous effects: the method effectively captures short-term and long-term dependencies without being limited by video duration, can process in parallel to reduce running time, and finally obtains a complete, non-redundant summary video.

Description

Video abstraction method based on double-self-attention capsule network
Technical Field
The invention relates to the technical field of video processing, in particular to a video abstraction method based on a double-self-attention capsule network.
Background
With the development and popularization of video shooting devices such as mobile phones and digital cameras, the number of videos is increasing dramatically. Because most photographers lack professional photographic knowledge, the videos people shoot are often redundant and may contain very little important information. Browsing and understanding such videos is very time consuming. Therefore, for easy browsing and understanding, we need to generate a concise, non-redundant summary for a given video without losing important semantic information.
Video summarization is inherently a subset selection problem. By subset selection we can obtain summaries at three levels: frames, shots and objects. To obtain a frame-level video summary, it is generally necessary first to extract visual features for each frame, for example using a widely used pre-trained CNN (convolutional neural network) model. However, video is not only spatial but also temporal, and these pre-trained CNN models ignore the temporal dependencies between video frames. To integrate temporal information, recent studies have utilized sequence models such as RNNs (recurrent neural networks) and LSTMs (long short-term memory networks). However, these RNN/LSTM-based approaches still face two major challenges: (1) RNN/LSTM is not friendly to parallel processing, which imposes a heavy computational burden; (2) RNN/LSTM has difficulty capturing long-term dependencies beyond about 100 frames. Therefore, a method is desired that can process temporal information between video frames in parallel and can handle long-term dependencies without being limited by the number of video frames.
Furthermore, once the features of the video frames are generated, we also need to use these features to learn the underlying frame selection criterion. Most existing methods attempt to score the frames and pick out the higher-scoring ones. However, since most adjacent frames in a video are visually similar, it is difficult to obtain an accurate scoring function.
An effective solution to the problems in the related art has not been proposed yet.
Disclosure of Invention
Technical problem to be solved
Aiming at the defects of the prior art, the invention provides a video abstraction method based on a double-self-attention capsule network, which can effectively capture short-term and long-term dependencies, perform feature fusion and learn potential key-frame selection criteria, thereby solving the problems identified in the background art.
(II) technical scheme
In order to effectively capture short-term and long-term dependencies, perform feature fusion and learn potential key-frame selection criteria, the invention adopts the following specific technical scheme:
A video abstraction method based on a double-self-attention capsule network comprises the following steps:
S1: the video abstract problem is regarded as a marking problem of a video frame sequence through a preset method;
S2: for a given video, extracting an initial feature vector of each video frame using a GoogLeNet model pre-trained on the ImageNet dataset;
S3: performing feature refinement on the initial feature vectors obtained in step S2 by using a double-attention model;
S4: fusing the refined features by using a double-flow capsule network and marking each frame of the video to obtain a key frame capsule u_1 and a non-key frame capsule u_2, where the length of each capsule represents the probability that the frame belongs to that category;
S5: training the model in a deep learning manner by using a corresponding objective function so that the model can generate a concise and complete summary;
S6: according to the model trained in S5, executing the steps S1-S4 on a newly input video and using the obtained key frame capsule u_1 to generate the final summary.
Preferably, the step S1 of regarding the video summarization problem as a marking problem of a video frame sequence by a preset method specifically includes the following steps:
S11: define V = {v_1, v_2, ..., v_T} to denote a video, where T denotes the total number of frames of the video and v_t denotes the t-th frame;
S12: assign the video a label sequence Y = {y_1, y_2, ..., y_T}, where y_t ∈ {0, 1}; y_t = 1 indicates that the t-th frame is a key frame and should be selected into the summary, while y_t = 0 indicates that the t-th frame is a non-key frame.
Preferably, the step S2 of extracting, for a given video, an initial feature vector of each video frame using a GoogLeNet model pre-trained on the ImageNet dataset specifically includes the following steps:
S21: acquire a built and trained GoogLeNet model;
S22: input a video and parse it into frames;
S23: extract the initial feature vector of each frame of the video data with the GoogLeNet model from S21, using f_t to denote the feature vector of the t-th frame, defined as follows:
f_t = CNN(v_t)  (1);
where CNN(·) denotes the pre-trained GoogLeNet model and v_t denotes the t-th frame;
S24: combine the feature vectors of all frames to obtain the feature vectors of the video, F = {f_1, f_2, ..., f_T}. Since video is unstructured data, it cannot be solved directly as a mathematical problem; the unstructured data must first be converted into structured data so that the task becomes a mathematical problem. Meanwhile, a video contains a large amount of irrelevant information from which the effective information needs to be extracted. Therefore the GoogLeNet model is used to extract the initial feature vector of each video frame, converting each frame into a 1024-dimensional vector and preliminarily screening out the effective information.
Preferably, the feature refinement of the initial feature vector obtained in S2 by using the dual attention model in step S3 specifically includes the following steps:
s31: constructing a double attention module comprising a local self-attention network and a global self-attention network based on a self-attention mechanism;
s32: dividing a video sequence containing T frames into M video segments, wherein each segment contains N continuous frames, and all frames of each segment are used as an input stream and input into the local self-attention network to capture short-term dependency; extracting frames at corresponding positions from each segment to form a new input stream, inputting the global self-attention network to capture long-term dependency relationship, and expressing the number of video segments as follows:
M = ⌊T / N⌋  (2);
where ⌊·⌋ denotes the round-down (floor) operation;
S33: performing feature refinement on the initial features through the local self-attention network module; in video summarization, viewers tend to focus on the action parts of a video, so the aim of this step is to use the local self-attention module to refine the motion information of each frame;
S34: performing feature refinement on the initial features through the global self-attention network module. Since video is sequence data, preceding and following frames are correlated, yet the initial features extracted in S2 describe each frame in isolation and ignore the context between video frames; the purpose of this step is therefore to refine the initial feature vectors with a global self-attention model so that the feature vector of each frame carries contextual information.
Preferably, the step S33 of performing feature refinement on the initial features through the local self-attention network module specifically includes the following steps:
S331: all frames of each segment form an input stream {l_m : m = 1, 2, ..., M} of the local self-attention network:
l_m = {f_t : t = (m-1)·N+1, ..., m·N-1, m·N}  (3);
where l_m is the m-th input stream, M is the total number of segments, and f_t is the feature vector of the t-th frame;
S332: the m-th input stream l_m obtained in S331 is fed into the local self-attention network to obtain the refined features of the m-th local input stream:
l'_m = SA(l_m)  (4);
where SA(·) denotes the self-attention network and l'_m denotes the refined features;
S333: all the refined features are recombined, according to the original video frame order of the video data, into F^l = {f^l_1, f^l_2, ..., f^l_T}, where T denotes the total number of video frames and f^l_t denotes the feature vector of the t-th frame after refinement by the local self-attention network.
Preferably, the step S34 of performing feature refinement on the initial features through the global self-attention network module specifically includes the following steps:
S341: the frame at the corresponding position is selected from each video segment to form an input stream {g_n : n = 1, 2, ..., N} of the global self-attention network:
g_n = {f_t : t = n, n+N, ..., n+(M-1)·N}  (5);
where g_n is the n-th input stream, N is the total number of frames per segment, M is the total number of segments, and f_t is the feature vector of the t-th frame;
S342: a position-encoding vector is added to the features of the n-th input stream to strengthen the sequence information of the video, and the result is fed into the self-attention network to obtain the refined features of the n-th global input stream:
g'_n = SA(g_n + PE(g_n))  (6);
where PE(·) denotes the position-encoding function, g'_n denotes the refined features, and g_n is the n-th data stream;
S343: all the refined features are recombined, according to the original video frame order of the video data, into F^g = {f^g_1, f^g_2, ..., f^g_T}, where T denotes the total number of video frames and f^g_t denotes the feature vector of the t-th frame after refinement by the global self-attention network.
Preferably, the dual-flow capsule network in step S4 comprises two streams of capsules, namely a local flow and a global flow; each flow includes a convolutional layer and a main capsule layer, and the two flows are finally merged using the category capsule layer. The convolutional layer, the main capsule layer and the category capsule layer classify the refined features of each video frame, i.e., they determine whether the frame is a key frame or a non-key frame.
Preferably, the step S4 of fusing the refined features with the dual-stream capsule network and marking each frame of the video to obtain a key frame capsule u_1 and a non-key frame capsule u_2 specifically includes the following steps:
S41: high-level features are extracted with a plurality of convolution kernels; the 1024-dimensional vector obtained from the self-attention network is reshaped into a 32x32 map and fed into the convolutional layer to obtain:
C^l_t = Conv1_l(Reshape(f^l_t))  (7);
C^g_t = Conv1_g(Reshape(f^g_t))  (8);
where Reshape(·) denotes the reshaping function, Conv1_l and Conv1_g denote the convolution operations of the local and global streams respectively, f^l_t and f^g_t denote the features of the t-th frame refined by the local and global self-attention networks respectively, and C^l_t and C^g_t denote the feature maps of the local and global streams respectively;
S42: the scalar outputs produced by the convolutional layer are converted into vector-output capsules through the main capsule layer; each convolution kernel, after convolving the feature maps C^l_t and C^g_t, yields a series of capsules that form a capsule channel, and the j-th capsule channels generated by the local flow and the global flow are respectively represented as:
P^l_j = squash(C^l_t * W^l_j + b^l_j)  (9);
P^g_j = squash(C^g_t * W^g_j + b^g_j)  (10);
where * denotes the convolution operation, b^l_j and b^g_j are the bias terms of the local and global streams respectively, squash(·) is a compression function that limits the length of each capsule vector, compressing it to between 0 and 1, and W^l_j and W^g_j are the convolution kernels of the local and global streams respectively;
S43: all the capsule channels of the local and global flows are stacked separately, yielding:
P_local = [P^l_1, P^l_2, ..., P^l_J]  (11);
P_global = [P^g_1, P^g_2, ..., P^g_J]  (12);
where J denotes the number of capsule channels;
S44: after the capsules are obtained from the main capsule layer of each flow, all capsules are connected and the short-term and long-term dependencies are fused using a dynamic routing algorithm to obtain the category capsule layer, which includes the key frame capsule u_1 and the non-key frame capsule u_2:
[u_1, u_2] = Routing([P_local, P_global])  (13);
where Routing(·) denotes the dynamic routing function. The preceding steps yield the refined local features and the refined global features; both are important factors for judging whether a frame is a key frame, so the two must be fused.
Preferably, the step S5 of training the model in a deep learning manner with the corresponding objective function, so that the model can generate a concise and complete summary, specifically includes the following steps:
S51: the model uses an objective function whose expression is as follows:
L_k = T_k max(0, m^+ - ||u_k||)^2 + λ(1 - T_k) max(0, ||u_k|| - m^-)^2  (14);
where L_k is the loss value of the k-th capsule, k ∈ {1, 2}; T_k = 1 if the training sample belongs to the k-th class and T_k = 0 otherwise; u_k denotes the k-th capsule; m^+ and m^- are the upper and lower margins respectively; and λ is a balance parameter;
s52: training a model of the dual self-attention capsule network by objective function minimization using a video dataset with keyframe markers;
s53: iteratively training as per S1-S52 for each video in the data set until convergence or a maximum number of allowed iteration rounds is reached;
s54: and saving the trained model. Through the constraint of the objective function, the model can distinguish key frames and non-key frames for obtaining the abstract video.
Preferably, the step S6 of executing the above steps S1-S4 on the newly input video according to the model trained in S5 and using the obtained key frame capsule u_1 to generate the final summary specifically includes the following steps:
S61: for the newly input video, first execute S1-S4 to obtain the key frame capsule u_1 corresponding to each frame;
S62: from the length of the key frame capsule u_1, obtain the key-frame probability corresponding to every frame;
S63: divide the input video into a plurality of shots using the Kernel Temporal Segmentation algorithm;
S64: compute the average of the key frame capsule probabilities of the frames within each shot to obtain the average probability of the shot;
S65: select shots with a dynamic programming algorithm based on the average shot probabilities obtained in S64, where the objective of the dynamic programming is to maximize the sum of shot-level scores under the condition that the total length of the selected shots does not exceed a preset proportion of the total length of the original video;
S66: combine the shots selected in S65 in temporal order to obtain the summary.
(III) advantageous effects
Compared with the prior art, the invention provides a video abstraction method based on a double-self-attention capsule network, which has at least the following beneficial effects:
A novel double self-attention capsule network for video summarization is provided. By designing a double self-attention network that can be computed in parallel, short-term and long-term dependencies can be captured effectively without being limited by video duration, and the feature expression of video frames is refined. By introducing a capsule network, which excels at classification problems, and proposing a double-flow capsule network, feature fusion can be performed and potential key-frame selection criteria can be learned. Furthermore, treating video summarization as a classification problem facilitates the collection of datasets: for example, only pairs of original and edited videos need to be collected, without a large amount of manual annotation, which alleviates problems such as data scarcity and labeling difficulty. Compared with the most advanced methods, the method can process in parallel, reduces running time, and finally obtains a complete, non-redundant summary video with good performance.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.
FIG. 1 is a flow chart of a video summarization method based on a dual-attention capsule network according to an embodiment of the present invention;
FIG. 2 is a simplified schematic diagram of the video sequence tagging problem based on a dual self-attention network and a dual-flow capsule network in accordance with an embodiment of the present invention;
FIG. 3 is an overview diagram of dual-attention network in a video summarization method based on dual-attention capsule network according to an embodiment of the present invention;
fig. 4 is a comparison diagram of summaries generated on a video of a "tree crawl" by the video summarization method based on the dual-self-attention capsule network according to an embodiment of the present invention.
Detailed Description
For further explanation of the various embodiments, the drawings, which form a part of the disclosure and are incorporated in and constitute a part of this specification, illustrate the embodiments and, together with the description, serve to explain their principles of operation and to enable others of ordinary skill in the art to understand the embodiments and advantages of the invention. The figures are not to scale, and like reference numerals generally refer to like elements.
According to the embodiment of the invention, a video summarization method based on a double-self-attention capsule network is provided.
The present invention will be further explained with reference to the accompanying drawings and specific embodiments. As shown in figs. 1 to 4, a video summarization method based on a dual self-attention capsule network according to an embodiment of the present invention includes the following steps:
S1: the video summarization problem is regarded as a marking problem of a video frame sequence through a preset method;
wherein, the S1 specifically includes the following steps:
S11: define V = {v_1, v_2, ..., v_T} to denote a video, where T denotes the total number of frames of the video and v_t denotes the t-th frame;
S12: assign the video a label sequence Y = {y_1, y_2, ..., y_T}, where y_t ∈ {0, 1}; y_t = 1 indicates that the t-th frame is a key frame and should be selected into the summary, while y_t = 0 indicates that the t-th frame is a non-key frame.
S2: for a given video, an initial feature vector of each video frame is extracted using a GoogLeNet model pre-trained on the ImageNet dataset;
wherein, the S2 specifically includes the following steps:
S21: acquire a built and trained GoogLeNet model; in a specific application, the GoogLeNet model can be downloaded from the Internet;
S22: input a video and parse it into frames; that is, V in S11 is the frame set corresponding to the video;
S23: extract the initial feature vector of each frame of the video data with the GoogLeNet model from S21, using f_t to denote the feature vector of the t-th frame, defined as follows:
f_t = CNN(v_t)  (1);
where CNN(·) denotes the pre-trained GoogLeNet model and v_t denotes the t-th frame;
S24: combine the feature vectors of all frames to obtain the feature vectors of the video, F = {f_1, f_2, ..., f_T}.
In order to obtain the short-term and long-term dependency relationship of each video frame and to be free from the limitation of the video length in the long-term dependency, a dual attention network (i.e., the dual attention net in fig. 2 and the dual attention feature refinement module in fig. 3) is proposed in this embodiment. The original video features are refined by learning attention weights for inter-frame context dependencies. To fully utilize the computing device, the present embodiment therefore uses a parallel computable self-attention mechanism. The double attention network refines the feature expression of the original video frame by learning short-term and long-term dependent attention weights between the video frames respectively.
S3: performing feature refinement on the initial feature vector obtained in the step S2 by using a double-attention model;
wherein, the S3 specifically includes the following steps:
S31: constructing a double attention module comprising a local self-attention network and a global self-attention network based on a self-attention mechanism; in particular, the local and global self-attention networks are used to learn short-term and long-term dependency information, respectively, the difference between these two networks being that the input data is obtained within a video segment or between segments.
S32: dividing a video sequence containing T frames into M video segments, wherein each segment contains N continuous frames, and all frames of each segment are used as an input stream and input into the local self-attention network to capture short-term dependency; extracting frames at corresponding positions from each segment to form a new input stream, inputting the global self-attention network to capture long-term dependency relationship, and expressing the number of video segments as follows:
M = ⌊T / N⌋  (2);
where ⌊·⌋ denotes the round-down (floor) operation;
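The segment division of S32 can be illustrated with the short sketch below. It assumes that the trailing T - M·N frames are simply dropped, since the patent does not state how the remainder is handled; the rows of the reshaped tensor are the local streams l_m and its columns are the global streams g_n.

import torch

def build_streams(F: torch.Tensor, N: int = 8):
    """F: (T, D) initial features; returns local streams (M, N, D) and global streams (N, M, D)."""
    T, D = F.shape
    M = T // N                           # number of segments, M = floor(T / N), equation (2)
    F = F[: M * N]                       # assumption: frames beyond M*N are dropped
    local = F.view(M, N, D)              # row m -> l_m = frames (m-1)N+1, ..., mN     (equation (3))
    global_ = local.transpose(0, 1)      # row n -> g_n = frames n, n+N, ..., n+(M-1)N (equation (5))
    return local, global_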
s33: performing feature refinement on the initial features through the local self-attention network module;
specifically, the S33 specifically includes the following steps:
S331: all frames of each segment form an input stream {l_m : m = 1, 2, ..., M} of the local self-attention network:
l_m = {f_t : t = (m-1)·N+1, ..., m·N-1, m·N}  (3);
where l_m is the m-th input stream, M is the total number of segments, and f_t is the feature vector of the t-th frame;
S332: the m-th input stream l_m obtained in S331 is fed into the self-attention network to obtain the refined features of the m-th local input stream:
l'_m = SA(l_m)  (4);
where SA(·) denotes the self-attention network and l'_m denotes the refined features;
S333: all the refined features are recombined, according to the original video frame order of the video data, into
F^l = {f^l_1, f^l_2, ..., f^l_T},
where T denotes the total number of video frames and f^l_t denotes the feature vector of the t-th frame after refinement by the local self-attention network.
S34: feature refinement is performed on the initial features by the global self-attention network module.
Specifically, the S34 specifically includes the following steps:
S341: the frame at the corresponding position is selected from each video segment to form an input stream {g_n : n = 1, 2, ..., N} of the global self-attention network:
g_n = {f_t : t = n, n+N, ..., n+(M-1)·N}  (5);
where g_n is the n-th input stream, N is the total number of frames per segment, M is the total number of segments, and f_t is the feature vector of the t-th frame;
S342: since the self-attention mechanism does not contain any recurrent operation, a position-encoding vector is added to the features of each input stream to strengthen the sequence information of the video, and the result is fed into the self-attention network to obtain the refined features of the n-th global input stream:
g'_n = SA(g_n + PE(g_n))  (6);
where PE(·) denotes the position-encoding function, g'_n denotes the refined features, and g_n is the n-th data stream;
s343: according to the video dataThe original video frame sequence of (a) recombining all the refined features into
Figure BDA0002325274350000104
Wherein T represents the total number of video frames,
Figure BDA0002325274350000105
and representing the feature vector of the t-th frame after global self-attention network refinement.
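The position-encoding function PE(·) of equation (6) is likewise not specified; a common choice, assumed here, is the sinusoidal encoding of the Transformer.

import math
import torch

def positional_encoding(length: int, dim: int = 1024) -> torch.Tensor:
    """Sinusoidal position codes added to a global stream g_n before self-attention."""
    pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim))
    pe = torch.zeros(length, dim)
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions
    return pe

# usage, cf. equation (6):  g_refined = SelfAttention(1024)(g_n + positional_encoding(g_n.size(0)))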
Rather than learning a real-valued scoring function, which is difficult to make accurate, the video summarization in this embodiment is completed by sequence labeling. That is, a binary sequence labeling function is trained to indicate which frames of the video are selected as representative frames to form the final summary. Compared with learning a real-valued frame score, directly training a binary frame labeling function is more intuitive, and the training data are easier to obtain. In particular, this implementation uses a capsule network, which excels at classification problems, for the labeling. To accommodate the dual output of DualAttentionNet, the capsule network is further extended to the two-stream case (i.e., the Two-Stream Capsule Network in FIG. 2 and the dual-stream capsule network in FIG. 3).
S4: the refined features are fused using the dual-stream capsule network and each frame of the video is marked, obtaining a key frame capsule u_1 and a non-key frame capsule u_2, where the length of each capsule represents the probability that the frame belongs to that category. Specifically, the dual-stream capsule network comprises two streams of capsules, namely a local flow and a global flow; each flow includes a convolutional layer and a main capsule layer, and the two flows are finally fused using a category capsule layer, where the convolutional layer is an ordinary 2D convolution, the main capsule layer comprises 32 channels, and the capsule dimension on each channel is 8.
Wherein, the S4 specifically includes the following steps:
S41: high-level features are extracted with a plurality of convolution kernels; the 1024-dimensional vector obtained from the self-attention network is reshaped into a 32x32 map and fed into the convolutional layer to obtain:
C^l_t = Conv1_l(Reshape(f^l_t))  (7);
C^g_t = Conv1_g(Reshape(f^g_t))  (8);
where Reshape(·) denotes the reshaping function, Conv1_l and Conv1_g denote the convolution operations of the local and global streams respectively, f^l_t and f^g_t denote the features of the t-th frame refined by the local and global self-attention networks respectively, and C^l_t and C^g_t denote the feature maps of the local and global streams respectively;
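A sketch of S41 under the hyper-parameters given in the implementation details later in this section (9x9 kernels, stride 1, 64 output channels); treating the 1024-dimensional refined vector as a single-channel 32x32 map and applying a ReLU after the convolution are assumptions.

import torch
import torch.nn as nn

conv1_local = nn.Conv2d(in_channels=1, out_channels=64, kernel_size=9, stride=1)    # Conv1_l
conv1_global = nn.Conv2d(in_channels=1, out_channels=64, kernel_size=9, stride=1)   # Conv1_g

def conv_features(f_local: torch.Tensor, f_global: torch.Tensor):
    """f_local, f_global: (1024,) refined features of one frame -> feature maps C^l_t, C^g_t."""
    c_l = torch.relu(conv1_local(f_local.view(1, 1, 32, 32)))     # (1, 64, 24, 24), equation (7)
    c_g = torch.relu(conv1_global(f_global.view(1, 1, 32, 32)))   # (1, 64, 24, 24), equation (8)
    return c_l, c_g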
S42: the scalar outputs produced by the convolutional layer are converted into vector-output capsules through the main capsule layer, wherein each convolution kernel, after convolving the feature maps
C^l_t and C^g_t, yields a series of capsules that form a capsule channel, and the j-th capsule channels generated by the local flow and the global flow are respectively represented as:
P^l_j = squash(C^l_t * W^l_j + b^l_j)  (9);
P^g_j = squash(C^g_t * W^g_j + b^g_j)  (10);
where * denotes the convolution operation, b^l_j and b^g_j are the bias terms of the local and global streams respectively, squash(·) is a compression function that limits the length of each capsule vector, compressing it to between 0 and 1, and W^l_j and W^g_j are the convolution kernels of the local and global streams respectively;
S43: all the capsule channels of the local flow and the global flow are stacked separately to obtain all channels P_local of the local stream and all channels P_global of the global stream:
P_local = [P^l_1, P^l_2, ..., P^l_J]  (11);
P_global = [P^g_1, P^g_2, ..., P^g_J]  (12);
where J denotes the number of capsule channels;
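A sketch of the main capsule layer of S42-S43, assuming the standard squash non-linearity of capsule networks; per the implementation details later in this section, each stream uses 9x9 convolutions with stride 2 and produces J = 32 capsule channels of 8-dimensional capsules, implemented here as one joint convolution.

import torch
import torch.nn as nn

def squash(s: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Compression function: scales each capsule vector to a length between 0 and 1."""
    norm_sq = (s ** 2).sum(dim=dim, keepdim=True)
    return (norm_sq / (1.0 + norm_sq)) * s / torch.sqrt(norm_sq + 1e-8)

class PrimaryCapsules(nn.Module):
    def __init__(self, in_ch=64, channels=32, caps_dim=8, kernel=9, stride=2):
        super().__init__()
        self.caps_dim = caps_dim
        # one convolution per capsule channel j, implemented jointly: W_j * C + b_j, equations (9)-(10)
        self.conv = nn.Conv2d(in_ch, channels * caps_dim, kernel_size=kernel, stride=stride)

    def forward(self, feature_map):                      # feature_map: (B, 64, 24, 24), C^l_t or C^g_t
        out = self.conv(feature_map)                     # (B, 32*8, 8, 8)
        out = out.view(out.size(0), -1, self.caps_dim)   # stacked capsule channels, P_local or P_global
        return squash(out)                               # each row is an 8-d capsule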
S44: after the capsules are obtained from the main capsule layer of each flow, all capsules are connected and the short-term and long-term dependencies are fused using a dynamic routing algorithm to obtain the category capsule layer, which includes the key frame capsule u_1 and the non-key frame capsule u_2:
[u_1, u_2] = Routing([P_local, P_global])  (13);
where Routing(·) denotes the dynamic routing function.
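The Routing(·) function of equation (13) is not detailed in the patent; the sketch below assumes the routing-by-agreement procedure of the original capsule network, fusing the concatenated local and global primary capsules into the two 16-dimensional category capsules u_1 (key frame) and u_2 (non-key frame).

import torch
import torch.nn as nn

def squash(s, dim=-1):
    norm_sq = (s ** 2).sum(dim=dim, keepdim=True)
    return (norm_sq / (1.0 + norm_sq)) * s / torch.sqrt(norm_sq + 1e-8)

class RoutingCapsules(nn.Module):
    def __init__(self, num_in, in_dim=8, num_out=2, out_dim=16, iters=3):
        super().__init__()
        self.iters = iters
        # transformation matrices mapping each 8-d input capsule to a 16-d prediction vector
        self.W = nn.Parameter(0.01 * torch.randn(num_out, num_in, out_dim, in_dim))

    def forward(self, u):                                # u: (B, num_in, 8) = [P_local, P_global]
        u_hat = torch.einsum("kiod,bid->bkio", self.W, u)          # prediction vectors (B, 2, num_in, 16)
        b = torch.zeros(u.size(0), self.W.size(0), self.W.size(1), device=u.device)
        for _ in range(self.iters):                      # routing-by-agreement iterations
            c = torch.softmax(b, dim=1)                  # coupling coefficients over the two outputs
            v = squash((c.unsqueeze(-1) * u_hat).sum(dim=2))       # candidate category capsules (B, 2, 16)
            b = b + (u_hat * v.unsqueeze(2)).sum(dim=-1)           # agreement update
        return v                                         # [u_1, u_2]; ||u_k|| is the class probability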
Finally, a dual self-attention capsule network for video summarization is proposed; its overall framework is shown in fig. 3. Initial features of each frame are extracted using GoogLeNet. The dual self-attention network of this implementation is built both inside each video segment and between segments to learn short-term and long-term dependency features, and the two types of features are finally fed into the Two-Stream CapsNet for feature fusion and frame labeling. Fig. 3 gives an overview of the dual self-attention capsule network: given a video, GoogLeNet extracts features for each frame; the video is then divided into segments for dual-attention feature refinement; and the refined features produced by the two self-attention models are fed into a dual-flow capsule network to learn the sequence labeling criterion.
S5: training the model in a deep learning mode by using a corresponding objective function so that the model can generate a concise and complete abstract;
wherein, the S5 specifically includes the following steps:
s51: obtaining an objective function, wherein the expression of the objective function is as follows:
L_k = T_k max(0, m^+ - ||u_k||)^2 + λ(1 - T_k) max(0, ||u_k|| - m^-)^2  (14);
where L_k is the loss value of the k-th capsule, k ∈ {1, 2}; T_k = 1 if the training sample belongs to the k-th class and T_k = 0 otherwise; u_k denotes the k-th capsule; m^+ and m^- are the upper and lower margins respectively; and λ is a balance parameter;
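A sketch of the margin loss of equation (14); the values m+ = 0.9, m- = 0.1 and λ = 0.5 are commonly used defaults and are assumed here, since the patent does not state them.

import torch

def margin_loss(v: torch.Tensor, target: torch.Tensor,
                m_pos: float = 0.9, m_neg: float = 0.1, lam: float = 0.5) -> torch.Tensor:
    """v: (B, 2, 16) category capsules; target: (B,) class index, 0 = key frame (u_1), 1 = non-key frame (u_2)."""
    lengths = v.norm(dim=-1)                                            # ||u_k||, shape (B, 2)
    T_k = torch.nn.functional.one_hot(target.long(), num_classes=2).float()
    loss = T_k * torch.clamp(m_pos - lengths, min=0) ** 2 \
         + lam * (1 - T_k) * torch.clamp(lengths - m_neg, min=0) ** 2   # equation (14)
    return loss.sum(dim=1).mean()                                       # sum over capsules, mean over batch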
s52: training a model of the dual self-attention capsule network by objective function minimization using a video dataset with keyframe markers;
s53: iteratively training as per S1-S52 for each video in the data set until convergence or a maximum number of allowed iteration rounds is reached;
s54: storing the trained model;
s6: according to the model trained in S5, for the newly input video, the above-mentioned steps S1-S4 are executed to obtain the key frame capsule u1For generating the final digest.
Wherein, the S6 specifically includes the following steps:
S61: for the newly input video, first execute steps S1-S4 to obtain the key frame capsule u_1 corresponding to each frame;
S62: from the length of the key frame capsule u_1, obtain the key-frame probability corresponding to every frame;
S63: divide the input video into a plurality of shots using the Kernel Temporal Segmentation (KTS) algorithm; the KTS algorithm is prior art and is not described in detail herein;
S64: compute the average of the key frame capsule probabilities of the frames within each shot to obtain the average probability of the shot;
S65: select shots with a dynamic programming algorithm based on the average shot probabilities obtained in S64, where the objective of the dynamic programming is to maximize the sum of shot-level scores under the condition that the total length of the selected shots does not exceed a preset proportion of the total length of the original video;
S66: combine the shots selected in S65 in temporal order to obtain the summary.
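Steps S64-S66 can be sketched as follows, assuming the shot boundaries from KTS are given and using the 15% length budget mentioned in the evaluation section below; the selection of S65 is solved as a 0/1 knapsack by dynamic programming over shot lengths.

from typing import List, Tuple

def select_shots(frame_probs: List[float], shots: List[Tuple[int, int]], budget_ratio: float = 0.15):
    """frame_probs: per-frame ||u_1||; shots: (start, end) index pairs from KTS, end exclusive."""
    scores = [sum(frame_probs[s:e]) / (e - s) for s, e in shots]    # S64: average probability per shot
    lengths = [e - s for s, e in shots]
    budget = int(budget_ratio * sum(lengths))                       # maximum total length of the summary

    # S65: 0/1 knapsack by dynamic programming: maximise total shot score under the length budget.
    dp = [[0.0] * (budget + 1) for _ in range(len(shots) + 1)]
    for i in range(1, len(shots) + 1):
        for c in range(budget + 1):
            dp[i][c] = dp[i - 1][c]
            if lengths[i - 1] <= c:
                dp[i][c] = max(dp[i][c], dp[i - 1][c - lengths[i - 1]] + scores[i - 1])

    chosen, c = [], budget                                          # backtrack the selected shots
    for i in range(len(shots), 0, -1):
        if dp[i][c] != dp[i - 1][c]:
            chosen.append(i - 1)
            c -= lengths[i - 1]
    return sorted(chosen)                                           # S66: keep temporal order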
In order to facilitate understanding of the technical scheme of the invention, corresponding experiments are also carried out in the invention. The experiment included: experimental setup, quantitative analysis and qualitative analysis.
1) Experimental setup:
① Datasets. This implementation was evaluated on two widely used video summarization datasets, SumMe and TVSum. The SumMe dataset consists of 25 videos covering various activities such as travel, cooking and sports. The TVSum dataset includes 50 videos collected from a video website. The labels of both datasets are frame-level importance scores. In addition, this implementation also utilizes two additional datasets, OVP and YouTube, to augment the training data and evaluate the approach under different training settings. Since this implementation treats video summarization as a sequence labeling problem, the frame-level scores are converted into a unified set of key frames according to the disclosed Score2Frame method.
② Evaluation criteria. As in conventional methods, the F-score is used as the evaluation metric in this embodiment. Specifically, the KTS algorithm is first used to divide a video into a plurality of non-overlapping shots. The mean of the reference scores and of the frame-level scores predicted by the model within each shot is then computed as the shot-level score. Finally, with the summary length limited to no more than 15% of the original video, a knapsack algorithm selects shots according to the shot-level scores so that the total score of the summary is maximized. Let "predicted" denote the predicted summary, "annotated" the user-annotated summary, and "overlap" the overlapping portion of "predicted" and "annotated". Then the F-score can be calculated as:
precision = overlap / predicted,  recall = overlap / annotated  (15);
F-score = 2 × precision × recall / (precision + recall)  (16);
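A sketch of equations (15)-(16) computed from binary per-frame masks, with "predicted", "annotated" and "overlap" measured in numbers of frames.

def f_score(predicted, annotated):
    """predicted, annotated: binary per-frame masks (1 = frame is in the summary)."""
    overlap = sum(p and a for p, a in zip(predicted, annotated))
    if overlap == 0:
        return 0.0
    precision = overlap / sum(predicted)                  # equation (15)
    recall = overlap / sum(annotated)
    return 2 * precision * recall / (precision + recall)  # equation (16)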
in addition, the present implementation further evaluates the correlation between the prediction scores and the user annotation scores of the present solution on two evaluation criteria, Kendall's τ and Spearman's ρ.
③ Evaluation settings. For a fair comparison, this implementation uses the same three evaluation settings as prior approaches: canonical (C), augmented (A) and transfer (T). Table 1 shows the evaluation settings for SumMe; when testing on TVSum, it suffices to swap SumMe and TVSum in the table. This implementation randomly selects 80% of the videos of the test dataset as the training set and uses the rest as the test set. Five training/test splits are randomly generated and evaluated separately, and the average is taken as the final result.
TABLE 1 training set of SumMe datasets
④ Implementation details. As in previous methods, this implementation downsamples each video to 2 fps and uses GoogLeNet pre-trained on ImageNet to extract features of the video frames. The length of the local self-attention network input stream is set to 8. For the dual-flow capsule network, the convolution kernel size, stride and number of output channels of the convolutional layer are (9, 9), 1 and 64 respectively; the convolution kernel size and stride of the main capsule layer are (9, 9) and 2, respectively. The dimensions of the capsules in the main and category capsule layers are 8 and 16, respectively. The initial learning rates for SumMe and TVSum are set to 5e-4 and 1e-4 respectively, decayed by a factor of 0.5 every 20 epochs. The model is implemented with PyTorch.
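For reference, the hyper-parameters stated in this section can be collected into a single configuration sketch; every value below is taken from the text above, and nothing beyond it is implied.

CONFIG = {
    "frame_rate": 2,                    # videos down-sampled to 2 fps
    "feature_dim": 1024,                # GoogLeNet frame features
    "local_stream_length": 8,           # length of the local self-attention input stream
    "conv_layer": {"kernel": (9, 9), "stride": 1, "out_channels": 64},
    "main_capsule_layer": {"kernel": (9, 9), "stride": 2, "channels": 32, "capsule_dim": 8},
    "category_capsule_dim": 16,
    "initial_learning_rate": {"SumMe": 5e-4, "TVSum": 1e-4},
    "lr_decay": {"factor": 0.5, "every_epochs": 20},
    "framework": "PyTorch",
}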
2) Quantitative analysis
① Ablation experiment. To demonstrate the effectiveness of the dual attention network, ablation experiments were performed on the model of this implementation. The following three variants were used for comparison:
ours-local: models using only local self-attention networks and single-flow capsule networks;
ours-global: models using only global self-attention networks and single-flow capsule networks;
ours: the implementation proposes a dual self-attention capsule network model.
TABLE 2 ablation test results on SumMe and TVSum
Method SumMe TVSum
Ours-local 45.2 58.3
Ours-global 45.4 58.5
Ours 47.5 59.4
Table 2 shows the results of the ablation experiments. It can be observed that Ours-local and Ours-global behave almost identically. Furthermore, Ours outperformed Ours-local and Ours-global on both the SumMe and TVSum datasets. This verifies that local and global self-attention, which rely on short-term and long-term temporal information, are equally important for generating a high-quality summary.
TABLE 3F-score (%) values for comparative methods on SumMe and TVSum
TABLE 4 comparison of Kendall's τ and Spearman's ρ values on TVSum dataset by methods
Method  Kendall's τ  Spearman's ρ
Random 0.000 0.000
DR-DSNsup 0.020 0.026
dppLSTM 0.042 0.055
Ours 0.058 0.065
② Comparison with existing methods. Compared with LSTM-based approaches, the model of this implementation can be computed in parallel and is not limited by video duration. Table 3 compares the F-scores of existing LSTM-based methods with that of this implementation. It can be seen that the method of this implementation achieves superior results under multiple settings on both datasets.
This example also performed experiments on TVSum using Kendall's τ and Spearman's ρ. It should be noted that the model of this implementation is trained with binary sequence labels, which cannot be used to compute Kendall's τ and Spearman's ρ. Therefore, the frame-level scores used to generate the binary labels, rather than the binary labels themselves, are adopted as the reference data, and for each frame the length of its corresponding key frame capsule u_1 is used directly as the predicted score. The experimental results on the TVSum dataset are shown in Table 4. It can be observed that the method of this implementation is significantly better than the other methods under both evaluation criteria. This further demonstrates the effectiveness of the present approach.
3) Qualitative analysis
To better illustrate the results of the method of the present invention, a video summary generated by the method is shown in fig. 4. The video shows two bears climbing down a tree, so the user-annotated reference summary is mainly located in the middle and rear parts of the original video. It can be observed that there is a large amount of overlap between the summary generated by the method of the present invention and the user-annotated summary, and the generated summary reveals the complete process of the bears climbing down from the tree.
In summary, by means of the above technical solution, the present invention provides a new dual self-attention capsule network for video summarization. By designing a dual self-attention network that can be computed in parallel, short-term and long-term dependencies can be captured effectively without being limited by video duration, and the feature expression of video frames is refined. By introducing a capsule network, which excels at classification problems, and proposing a double-flow capsule network, feature fusion can be performed and potential key-frame selection criteria can be learned. Furthermore, treating video summarization as a classification problem facilitates the collection of datasets: for example, only pairs of original and edited videos need to be collected, without a large amount of manual annotation, which alleviates problems such as data scarcity and labeling difficulty. Compared with the most advanced methods, the method can process in parallel, reduces running time, and finally obtains a complete, non-redundant summary video with good performance.
In the present invention, the above-mentioned embodiments are only preferred embodiments of the present invention, and the present invention is not limited thereto, and any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A video abstraction method based on double-self-attention capsule network is characterized by comprising the following steps:
s1: the video abstract problem is regarded as a marking problem of a video frame sequence through a preset method;
S2: for a given video, extracting an initial feature vector of each video frame using a GoogLeNet model pre-trained on the ImageNet dataset;
s3: performing feature refinement on the initial feature vector obtained in the step S2 by using a double-attention model;
S4: fusing the refined features by using a double-flow capsule network and marking each frame of the video to obtain a key frame capsule u_1 and a non-key frame capsule u_2, wherein the length of each capsule represents the probability that the frame belongs to that category;
s5: training the model in a deep learning mode by using a corresponding objective function so that the model can generate a concise and complete abstract;
S6: according to the model trained in S5, executing the above steps S1-S4 on the newly input video and using the obtained key frame capsule u_1 to generate the final summary.
2. The method for video summarization based on dual-attention capsule network of claim 1, wherein the step S1 regards the video summarization problem as a labeling problem of a sequence of video frames by a preset method, and comprises the following steps:
S11: defining V = {v_1, v_2, ..., v_T} to denote a video, where T denotes the total number of frames of the video and v_t denotes the t-th frame;
S12: assigning the video a label sequence Y = {y_1, y_2, ..., y_T}, where y_t ∈ {0, 1}; y_t = 1 indicates that the t-th frame is a key frame and should be selected into the summary, while y_t = 0 indicates that the t-th frame is a non-key frame.
3. The video summarization method based on the dual self-attention capsule network of claim 1, wherein the step S2 of extracting, for a given video, an initial feature vector of each video frame using a GoogLeNet model pre-trained on the ImageNet dataset specifically comprises the following steps:
S21: acquiring a built and trained GoogLeNet model;
S22: inputting a video and parsing the video into frames;
S23: extracting the initial feature vector of each frame of the video data with the GoogLeNet model from S21, using f_t to denote the feature vector of the t-th frame, defined as follows:
f_t = CNN(v_t)  (1);
where CNN(·) denotes the pre-trained GoogLeNet model and v_t denotes the t-th frame;
S24: combining the feature vectors of all frames to obtain the feature vectors of the video, F = {f_1, f_2, ..., f_T}.
4. The video summarization method based on dual-attention capsule network of claim 1, wherein the step S3 feature-refining the initial feature vector obtained in S2 by using dual-attention model comprises the following steps:
s31: constructing a double attention module comprising a local self-attention network and a global self-attention network based on a self-attention mechanism;
s32: dividing a video sequence containing T frames into M video segments, wherein each segment contains N continuous frames, and all frames of each segment are used as an input stream and input into the local self-attention network to capture short-term dependency; extracting frames at corresponding positions from each segment to form a new input stream, inputting the global self-attention network to capture long-term dependency relationship, and expressing the number of video segments as follows:
M = ⌊T / N⌋  (2);
where ⌊·⌋ denotes the round-down (floor) operation;
s33: performing feature refinement on the initial features through the local self-attention network module;
s34: feature refinement is performed on the initial features by the global self-attention network module.
5. The video summarization method based on dual self-attention capsule network of claim 4 wherein the step S33 of feature refining the initial features through the local self-attention network module specifically comprises the following steps:
S331: all frames of each segment form an input stream {l_m : m = 1, 2, ..., M} of the local self-attention network:
l_m = {f_t : t = (m-1)·N+1, ..., m·N-1, m·N}  (3);
where l_m is the m-th input stream, M is the total number of segments, and f_t is the feature vector of the t-th frame;
S332: feeding the m-th input stream l_m obtained in S331 into the self-attention network to obtain the refined features of the m-th local input stream:
l'_m = SA(l_m)  (4);
where SA(·) denotes the self-attention network and l'_m denotes the refined features;
S333: recombining all the refined features, according to the original video frame order of the video data, into
F^l = {f^l_1, f^l_2, ..., f^l_T},
where T denotes the total number of video frames and f^l_t denotes the feature vector of the t-th frame after refinement by the local self-attention network.
6. The video summarization method based on dual self-attention capsule network of claim 4, wherein the step S34 of feature refining the initial features through the global self-attention network module specifically comprises the following steps:
S341: selecting the frame at the corresponding position from each video segment to form an input stream {g_n : n = 1, 2, ..., N} of the global self-attention network:
g_n = {f_t : t = n, n+N, ..., n+(M-1)·N}  (5);
where g_n is the n-th input stream, N is the total number of frames per segment, M is the total number of segments, and f_t is the feature vector of the t-th frame;
S342: adding a position-encoding vector to the features of the n-th input stream to strengthen the sequence information of the video, and feeding the result into the self-attention network to obtain the refined features of the n-th global input stream:
g'_n = SA(g_n + PE(g_n))  (6);
where PE(·) denotes the position-encoding function, g'_n denotes the refined features, and g_n is the n-th data stream;
S343: recombining all the refined features, according to the original video frame order of the video data, into
F^g = {f^g_1, f^g_2, ..., f^g_T},
where T denotes the total number of video frames and f^g_t denotes the feature vector of the t-th frame after refinement by the global self-attention network.
7. The video summarization method based on dual-attention capsule network of claim 1 wherein the dual-stream capsule network in step S4 comprises two-stream capsule network, i.e. local stream and global stream, each stream comprising a convolutional layer and a main capsule layer, and finally the two streams are merged using the category capsule layer.
8. The method for video summarization based on dual-attention capsule network of claim 7, wherein the step S4 of utilizing the dual-stream capsule network to fuse the refined features and label each frame of the video to obtain a key frame capsule u_1 and a non-key frame capsule u_2 specifically comprises the following steps:
s41: extracting high-level features from a feature map with a plurality of convolution kernels as input, reshaping a 1024-dimensional vector obtained from the self-attention network into 32x32, and inputting the vector into the convolution layer to obtain:
C^l_t = Conv1_l(Reshape(f^l_t))  (7);
C^g_t = Conv1_g(Reshape(f^g_t))  (8);
where Reshape(·) denotes the reshaping function, Conv1_l and Conv1_g denote the convolution operations of the local and global streams respectively, f^l_t and f^g_t denote the features of the t-th frame refined by the local and global self-attention networks respectively, and C^l_t and C^g_t denote the feature maps of the local and global streams respectively;
S42: converting the scalar outputs produced by the convolutional layer into vector-output capsules through the main capsule layer, wherein each convolution kernel, after convolving the feature maps
C^l_t and C^g_t, yields a series of capsules that form a capsule channel, and the j-th capsule channels generated by the local flow and the global flow are respectively represented as:
P^l_j = squash(C^l_t * W^l_j + b^l_j)  (9);
P^g_j = squash(C^g_t * W^g_j + b^g_j)  (10);
where * denotes the convolution operation, b^l_j and b^g_j are the bias terms of the local and global streams respectively, squash(·) is a compression function that limits the length of each capsule vector, compressing it to between 0 and 1, and W^l_j and W^g_j are the convolution kernels of the local and global streams respectively;
S43: stacking all the capsule channels of the local and global flows separately, yielding:
P_local = [P^l_1, P^l_2, ..., P^l_J]  (11);
P_global = [P^g_1, P^g_2, ..., P^g_J]  (12);
where J denotes the number of capsule channels;
S44: after obtaining the capsules from the main capsule layer of each flow, connecting all capsules and fusing the short-term and long-term dependencies using a dynamic routing algorithm to obtain the category capsule layer, which includes the key frame capsule u_1 and the non-key frame capsule u_2:
[u_1, u_2] = Routing([P_local, P_global])  (13);
where Routing(·) denotes the dynamic routing function.
9. The method for video summarization based on dual-attention capsule network of claim 1, wherein the step S5 trains the model in a deep learning manner using the corresponding objective function, so that the model can generate a concise and complete summary specifically comprises the following steps:
S51: construct the objective function, whose expression is

L_k = T_k · max(0, m^+ - ||u_k||)^2 + λ · (1 - T_k) · max(0, ||u_k|| - m^-)^2 (14);

wherein L_k is the loss value of the k-th capsule and k ∈ {1, 2}; T_k = 1 if the training sample belongs to the k-th class and T_k = 0 otherwise; u_k denotes the k-th capsule; m^+ and m^- are the upper margin and the lower margin respectively; and λ is a balance parameter;
S52: train the dual self-attention capsule network model by minimizing the objective function on a video dataset with key-frame annotations;
S53: iterate steps S1-S52 for each video in the dataset until convergence or until the maximum permitted number of iteration rounds is reached;
S54: save the trained model.
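Purely as an illustration of equation (14), a minimal PyTorch rendering of the margin loss might look as follows; the values m^+ = 0.9, m^- = 0.1 and λ = 0.5 are commonly used defaults assumed here, since the claim leaves them unspecified.

    import torch
    import torch.nn.functional as F

    def capsule_margin_loss(class_caps, targets, m_pos=0.9, m_neg=0.1, lam=0.5):
        # class_caps: (B, 2, D) class capsules (index 0 = key frame, 1 = non-key frame).
        # targets: (B,) integer labels in {0, 1} giving the true class of each frame.
        lengths = class_caps.norm(dim=-1)                      # ||u_k||, shape (B, 2)
        T = F.one_hot(targets, num_classes=2).float()          # T_k in equation (14)
        loss = (T * torch.clamp(m_pos - lengths, min=0) ** 2
                + lam * (1 - T) * torch.clamp(lengths - m_neg, min=0) ** 2)
        return loss.sum(dim=1).mean()                          # sum over capsules, mean over batch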
10. The video summarization method based on the dual self-attention capsule network of claim 1, wherein step S6 of processing a newly input video with the model trained in S5 through steps S1-S4 to obtain the key-frame capsule u_1 and generate the final summary specifically comprises the following steps:
S61: for the newly input video, first execute S1-S4 to obtain the key-frame capsule u_1 corresponding to each frame;
S62: use the length of the key-frame capsule u_1 to obtain, for every frame, the probability that the frame is a key frame;
S63: divide the input video into a plurality of shots using the Kernel Temporal Segmentation algorithm;
S64: compute the mean of the key-frame capsule probabilities of the frames inside each shot to obtain the average probability of the shot;
S65: select shots by applying a dynamic programming algorithm to the shot-level average probabilities obtained in S64, the objective being to maximize the sum of the shot-level scores under the constraint that the total length of the selected shots does not exceed a preset proportion of the total length of the original video;
S66: concatenate the shots selected in S65 in temporal order to obtain the summary.
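As an illustrative aid for steps S62-S66, and not the claimed implementation, the sketch below averages key-frame probabilities per shot and then selects shots with a 0/1-knapsack dynamic program. The 15% length budget mentioned in the comment and the shot boundaries are assumptions; the Kernel Temporal Segmentation step itself is not reproduced here.

    import numpy as np

    def shot_scores(frame_probs, shot_bounds):
        # S64: mean key-frame probability of the frames inside each shot.
        # frame_probs[t] is the length of the key-frame capsule u_1 for frame t (S62);
        # shot_bounds is a list of (start, end) frame indices, e.g. from KTS (S63).
        return [float(np.mean(frame_probs[s:e])) for s, e in shot_bounds]

    def select_shots(scores, lengths, budget):
        # S65: maximize the sum of shot scores subject to a total-length budget in frames
        # (e.g. budget = 0.15 * number of frames of the original video); lengths are ints.
        n = len(scores)
        dp = np.zeros((n + 1, budget + 1))
        keep = np.zeros((n + 1, budget + 1), dtype=bool)
        for i in range(1, n + 1):
            w, v = lengths[i - 1], scores[i - 1]
            for c in range(budget + 1):
                dp[i, c] = dp[i - 1, c]
                if w <= c and dp[i - 1, c - w] + v > dp[i, c]:
                    dp[i, c] = dp[i - 1, c - w] + v
                    keep[i, c] = True
        chosen, c = [], budget
        for i in range(n, 0, -1):        # backtrack to recover the selected shots
            if keep[i, c]:
                chosen.append(i - 1)
                c -= lengths[i - 1]
        return sorted(chosen)            # S66: temporal order for concatenation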
CN201911313856.3A 2019-12-19 2019-12-19 Video abstraction method based on double self-attention capsule network Active CN111984820B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911313856.3A CN111984820B (en) 2019-12-19 2019-12-19 Video abstraction method based on double self-attention capsule network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911313856.3A CN111984820B (en) 2019-12-19 2019-12-19 Video abstraction method based on double self-attention capsule network

Publications (2)

Publication Number Publication Date
CN111984820A true CN111984820A (en) 2020-11-24
CN111984820B CN111984820B (en) 2023-10-27

Family

ID=73441638

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911313856.3A Active CN111984820B (en) 2019-12-19 2019-12-19 Video abstraction method based on double self-attention capsule network

Country Status (1)

Country Link
CN (1) CN111984820B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101072305A (en) * 2007-06-08 2007-11-14 华为技术有限公司 Lens classifying method, situation extracting method, abstract generating method and device
CN101477633A (en) * 2009-01-21 2009-07-08 北京大学 Method for automatically estimating visual significance of image and video
CN105228033A (en) * 2015-08-27 2016-01-06 联想(北京)有限公司 A kind of method for processing video frequency and electronic equipment
CN105657580A (en) * 2015-12-30 2016-06-08 北京工业大学 Capsule endoscopy video summary generation method
CA3016953A1 (en) * 2017-09-07 2019-03-07 Comcast Cable Communications, Llc Relevant motion detection in video
CN110110140A (en) * 2019-04-19 2019-08-09 天津大学 Video summarization method based on attention expansion coding and decoding network
CN110287374A (en) * 2019-06-14 2019-09-27 天津大学 Self-attention video summarization method based on distribution consistency

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HAO FU et al.: "Video Summarization with a Dual Attention Capsule Network", 2020 25th International Conference on Pattern Recognition, pages 446 - 451 *
JI ZHONG; JIANG JUNJIE: "Video Summarization Based on Decoder Attention Mechanism", Journal of Tianjin University (Science and Technology), vol. 51, no. 10, pages 1023 - 1030 *
LIU WEI; WU YIHONG: "2D-to-3D Video Conversion Based on Layer Optimization and Fusion", Journal of Computer-Aided Design & Computer Graphics, vol. 24, no. 11, pages 1426 - 1439 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112307258A (en) * 2020-11-25 2021-02-02 中国计量大学 Short video click rate prediction method based on double-layer capsule network
CN112307258B (en) * 2020-11-25 2021-07-20 中国计量大学 Short video click rate prediction method based on double-layer capsule network
CN113099374A (en) * 2021-03-30 2021-07-09 四川省人工智能研究院(宜宾) Audio frequency three-dimensional method based on multi-attention audio-visual fusion
CN113099374B (en) * 2021-03-30 2022-08-05 四川省人工智能研究院(宜宾) Audio frequency three-dimensional method based on multi-attention audio-visual fusion
CN115002559A (en) * 2022-05-10 2022-09-02 上海大学 Video abstraction algorithm and system based on gated multi-head position attention mechanism
CN115002559B (en) * 2022-05-10 2024-01-05 上海大学 Video abstraction algorithm and system based on gating multi-head position attention mechanism
CN115695950A (en) * 2023-01-04 2023-02-03 石家庄铁道大学 Video abstract generation method based on content perception

Also Published As

Publication number Publication date
CN111984820B (en) 2023-10-27

Similar Documents

Publication Publication Date Title
Dai et al. Human action recognition using two-stream attention based LSTM networks
CN111984820B (en) Video abstraction method based on double self-attention capsule network
US11741711B2 (en) Video classification method and server
Xiao et al. Convolutional hierarchical attention network for query-focused video summarization
CN109344288A (en) Combined video description method based on multi-modal features and a multi-layer attention mechanism
Perez-Martin et al. Improving video captioning with temporal composition of a visual-syntactic embedding
Chen et al. Learning and fusing multiple user interest representations for micro-video and movie recommendations
Zhang et al. S3d: single shot multi-span detector via fully 3d convolutional networks
CN110933518B (en) Method for generating query-oriented video abstract by using convolutional multi-layer attention network mechanism
CN112417097B (en) Multi-modal data feature extraction and association method for public opinion analysis
Hii et al. Multigap: Multi-pooled inception network with text augmentation for aesthetic prediction of photographs
CN112749330B (en) Information pushing method, device, computer equipment and storage medium
CN111274440A (en) Video recommendation method based on visual and audio content relevancy mining
Liu et al. Hybrid design for sports data visualization using AI and big data analytics
Liu et al. Video captioning with listwise supervision
CN115731498B (en) Video abstract generation method combining reinforcement learning and contrast learning
Li et al. Learning hierarchical video representation for action recognition
Gao et al. Play and rewind: Context-aware video temporal action proposals
Yan et al. CHAM: action recognition using convolutional hierarchical attention model
CN113051468B (en) Movie recommendation method and system based on knowledge graph and reinforcement learning
Guan et al. Pidro: Parallel isomeric attention with dynamic routing for text-video retrieval
Huo et al. Semantic relevance learning for video-query based video moment retrieval
Jiang Web-scale multimedia search for internet video content
CN113792167B (en) Cross-media cross-retrieval method based on attention mechanism and modal dependence
Zhang et al. SAPS: Self-attentive pathway search for weakly-supervised action localization with background-action augmentation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant