CN111984820A - Video abstraction method based on double-self-attention capsule network - Google Patents

Video abstraction method based on double-self-attention capsule network

Info

Publication number
CN111984820A
CN111984820A
Authority
CN
China
Prior art keywords
video
capsule
frame
attention
self
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911313856.3A
Other languages
Chinese (zh)
Other versions
CN111984820B (en)
Inventor
王洪星
傅豪
徐玲
杨梦宁
洪明坚
葛永新
黄晟
陈飞宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University
Original Assignee
Chongqing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University
Priority to CN201911313856.3A
Publication of CN111984820A
Application granted
Publication of CN111984820B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/738Presentation of query results
    • G06F16/739Presentation of query results in form of a video summary, e.g. the video summary being a video sequence, a composite still image or having synthesized frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Television Signal Processing For Recording (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video abstraction method based on a double-self-attention capsule network, which comprises the following steps: S1: treat the video summarization problem as a labeling problem over a sequence of video frames; S2: for a given video, extract an initial feature vector for each video frame; S3: refine the initial feature vectors with the double self-attention model; S4: fuse the refined features with a double-flow capsule network and label each frame of the video; S5: train the model in a deep-learning manner with the corresponding objective function; S6: generate the final summary based on the model trained in S5. Advantageous effects: the method effectively captures short-term and long-term dependencies without being limited by video duration, can process in parallel to reduce running time, and finally obtains a complete, non-redundant summary video.

Description

Video abstraction method based on double-self-attention capsule network
Technical Field
The invention relates to the technical field of video processing, in particular to a video abstraction method based on a double-self-attention capsule network.
Background
With the development and popularization of video shooting devices such as mobile phones and digital cameras, the number of videos is increasing dramatically. Because most photographers lack professional photographic knowledge, the videos people shoot are often redundant and may contain very little important information. Browsing and understanding such videos is very time consuming. Therefore, for easy browsing and understanding, we need to generate a concise, non-redundant summary for a given video without losing important semantic information.
Video summarization is inherently a subset selection problem. By subset selection we can obtain summaries at three levels: frames, shots and objects. To obtain a frame-level video summary, it is generally necessary first to extract visual features for each frame, for example using a widely used pre-trained CNN (convolutional neural network) model. However, video is not only spatial but also temporal, and these pre-trained CNN models ignore the temporal dependencies between video frames. To integrate temporal information, recent studies have utilized sequence models such as RNNs (recurrent neural networks) and LSTMs (long short-term memory networks). However, these RNN/LSTM-based approaches still face two major challenges: (1) RNN/LSTM is not friendly to parallel processing, which imposes a heavy computational burden; (2) RNN/LSTM has difficulty capturing long-term dependencies beyond about 100 frames. Therefore, a method is desired that can process temporal information between video frames in parallel and can handle long-term dependencies without being limited by the number of video frames.
Furthermore, once the features of the video frames are generated, we also need to use these features to learn the underlying frame selection criterion. Most existing methods attempt to score the frames and pick out the higher-scoring ones. However, since most adjacent frames in a video are visually similar, it is difficult to obtain an accurate scoring function.
An effective solution to the problems in the related art has not been proposed yet.
Disclosure of Invention
Technical problem to be solved
Aiming at the defects of the prior art, the invention provides a video abstraction method based on a double-self-attention capsule network, which can effectively capture short-term and long-term dependencies, perform feature fusion and learn potential key-frame selection criteria, thereby solving the problems identified in the background art.
(II) technical scheme
In order to effectively capture short-term and long-term dependencies, perform feature fusion and learn potential key-frame selection criteria, the invention adopts the following specific technical scheme:
A video abstraction method based on a double-self-attention capsule network comprises the following steps:
S1: the video abstract problem is regarded as a marking problem of a video frame sequence through a preset method;
S2: for a given video, extracting an initial feature vector of each video frame using a GoogLeNet model pre-trained on the ImageNet dataset;
S3: performing feature refinement on the initial feature vectors obtained in step S2 by using a double-attention model;
S4: fusing the refined features by using a double-flow capsule network and marking each frame of the video to obtain a key frame capsule u_1 and a non-key frame capsule u_2, where the length of each capsule represents the probability that the frame belongs to that category;
S5: training the model in a deep learning manner by using a corresponding objective function so that the model can generate a concise and complete summary;
S6: according to the model trained in S5, executing the steps S1-S4 on a newly input video and using the obtained key frame capsule u_1 to generate the final summary.
Preferably, the step S1 of regarding the video summarization problem as a marking problem of a video frame sequence by a preset method specifically includes the following steps:
S11: define V = {v_1, v_2, ..., v_T} to denote a video, where T denotes the total number of frames of the video and v_t denotes the t-th frame;
S12: assign the video a label sequence Y = {y_1, y_2, ..., y_T}, where y_t ∈ {0, 1}; y_t = 1 indicates that the t-th frame is a key frame and should be selected into the summary, while y_t = 0 indicates that the t-th frame is a non-key frame.
Preferably, the step S2 of extracting, for a given video, an initial feature vector of each video frame using a GoogLeNet model pre-trained on the ImageNet dataset specifically includes the following steps:
S21: acquire a built and trained GoogLeNet model;
S22: input a video and parse it into frames;
S23: extract the initial feature vector of each frame of the video data with the GoogLeNet model from S21, using f_t to denote the feature vector of the t-th frame, defined as follows:
f_t = CNN(v_t)  (1);
where CNN(·) denotes the pre-trained GoogLeNet model and v_t denotes the t-th frame;
S24: combine the feature vectors of all frames to obtain the feature vectors of the video, F = {f_1, f_2, ..., f_T}. Since video is unstructured data, it cannot be solved directly as a mathematical problem; the unstructured data must first be converted into structured data so that the task becomes a mathematical problem. Meanwhile, a video contains a large amount of irrelevant information from which the effective information needs to be extracted. Therefore the GoogLeNet model is used to extract the initial feature vector of each video frame, converting each frame into a 1024-dimensional vector and preliminarily screening out the effective information.
Preferably, the feature refinement of the initial feature vector obtained in S2 by using the dual attention model in step S3 specifically includes the following steps:
s31: constructing a double attention module comprising a local self-attention network and a global self-attention network based on a self-attention mechanism;
s32: dividing a video sequence containing T frames into M video segments, wherein each segment contains N continuous frames, and all frames of each segment are used as an input stream and input into the local self-attention network to capture short-term dependency; extracting frames at corresponding positions from each segment to form a new input stream, inputting the global self-attention network to capture long-term dependency relationship, and expressing the number of video segments as follows:
M = ⌊T / N⌋  (2);
where ⌊·⌋ denotes the round-down (floor) operation;
S33: performing feature refinement on the initial features through the local self-attention network module; in video summarization, viewers tend to focus on the action parts of a video, so the aim of this step is to use the local self-attention module to refine the motion information of each frame;
S34: performing feature refinement on the initial features through the global self-attention network module. Since video is sequence data, preceding and following frames are correlated, yet the initial features extracted in S2 describe each frame in isolation and ignore the context between video frames; the purpose of this step is therefore to refine the initial feature vectors with a global self-attention model so that the feature vector of each frame carries contextual information.
Preferably, the step S33 of performing feature refinement on the initial features through the local self-attention network module specifically includes the following steps:
S331: all frames of each segment form an input stream {l_m : m = 1, 2, ..., M} of the local self-attention network:
l_m = {f_t : t = (m-1)·N+1, ..., m·N-1, m·N}  (3);
where l_m is the m-th input stream, M is the total number of segments, and f_t is the feature vector of the t-th frame;
S332: the m-th input stream l_m obtained in S331 is fed into the local self-attention network to obtain the refined features of the m-th local input stream:
l'_m = SA(l_m)  (4);
where SA(·) denotes the self-attention network and l'_m denotes the refined features;
S333: all the refined features are recombined, according to the original video frame order of the video data, into F^l = {f^l_1, f^l_2, ..., f^l_T}, where T denotes the total number of video frames and f^l_t denotes the feature vector of the t-th frame after refinement by the local self-attention network.
Preferably, the step S34 of performing feature refinement on the initial features through the global self-attention network module specifically includes the following steps:
S341: the frame at the corresponding position is selected from each video segment to form an input stream {g_n : n = 1, 2, ..., N} of the global self-attention network:
g_n = {f_t : t = n, n+N, ..., n+(M-1)·N}  (5);
where g_n is the n-th input stream, N is the total number of frames per segment, M is the total number of segments, and f_t is the feature vector of the t-th frame;
S342: a position-encoding vector is added to the features of the n-th input stream to strengthen the sequence information of the video, and the result is fed into the self-attention network to obtain the refined features of the n-th global input stream:
g'_n = SA(g_n + PE(g_n))  (6);
where PE(·) denotes the position-encoding function, g'_n denotes the refined features, and g_n is the n-th data stream;
S343: all the refined features are recombined, according to the original video frame order of the video data, into F^g = {f^g_1, f^g_2, ..., f^g_T}, where T denotes the total number of video frames and f^g_t denotes the feature vector of the t-th frame after refinement by the global self-attention network.
Preferably, the dual-flow capsule network in step S4 comprises two streams of capsules, namely a local flow and a global flow; each flow includes a convolutional layer and a main capsule layer, and the two flows are finally merged using the category capsule layer. The convolutional layer, the main capsule layer and the category capsule layer classify the refined features of each video frame, i.e., they determine whether the frame is a key frame or a non-key frame.
Preferably, the step S4 of fusing the refined features with the dual-stream capsule network and marking each frame of the video to obtain a key frame capsule u_1 and a non-key frame capsule u_2 specifically includes the following steps:
S41: high-level features are extracted with a plurality of convolution kernels; the 1024-dimensional vector obtained from the self-attention network is reshaped into a 32x32 map and fed into the convolutional layer to obtain:
C^l_t = Conv1_l(Reshape(f^l_t))  (7);
C^g_t = Conv1_g(Reshape(f^g_t))  (8);
where Reshape(·) denotes the reshaping function, Conv1_l and Conv1_g denote the convolution operations of the local and global streams respectively, f^l_t and f^g_t denote the features of the t-th frame refined by the local and global self-attention networks respectively, and C^l_t and C^g_t denote the feature maps of the local and global streams respectively;
S42: the scalar outputs produced by the convolutional layer are converted into vector-output capsules through the main capsule layer; each convolution kernel, after convolving the feature maps C^l_t and C^g_t, yields a series of capsules that form a capsule channel, and the j-th capsule channels generated by the local flow and the global flow are respectively represented as:
P^l_j = squash(C^l_t * W^l_j + b^l_j)  (9);
P^g_j = squash(C^g_t * W^g_j + b^g_j)  (10);
where * denotes the convolution operation, b^l_j and b^g_j are the bias terms of the local and global streams respectively, squash(·) is a compression function that limits the length of each capsule vector, compressing it to between 0 and 1, and W^l_j and W^g_j are the convolution kernels of the local and global streams respectively;
S43: all the capsule channels of the local and global flows are stacked separately, yielding:
P_local = [P^l_1, P^l_2, ..., P^l_J]  (11);
P_global = [P^g_1, P^g_2, ..., P^g_J]  (12);
where J denotes the number of capsule channels;
S44: after the capsules are obtained from the main capsule layer of each flow, all capsules are connected and the short-term and long-term dependencies are fused using a dynamic routing algorithm to obtain the category capsule layer, which includes the key frame capsule u_1 and the non-key frame capsule u_2:
[u_1, u_2] = Routing([P_local, P_global])  (13);
where Routing(·) denotes the dynamic routing function. The preceding steps yield the refined local features and the refined global features; both are important factors for judging whether a frame is a key frame, so the two must be fused.
Preferably, the step S5 of training the model in a deep learning manner with the corresponding objective function, so that the model can generate a concise and complete summary, specifically includes the following steps:
S51: the model uses an objective function whose expression is as follows:
L_k = T_k max(0, m^+ - ||u_k||)^2 + λ(1 - T_k) max(0, ||u_k|| - m^-)^2  (14);
where L_k is the loss value of the k-th capsule, k ∈ {1, 2}; T_k = 1 if the training sample belongs to the k-th class and T_k = 0 otherwise; u_k denotes the k-th capsule; m^+ and m^- are the upper and lower margins respectively; and λ is a balance parameter;
s52: training a model of the dual self-attention capsule network by objective function minimization using a video dataset with keyframe markers;
s53: iteratively training as per S1-S52 for each video in the data set until convergence or a maximum number of allowed iteration rounds is reached;
s54: and saving the trained model. Through the constraint of the objective function, the model can distinguish key frames and non-key frames for obtaining the abstract video.
Preferably, the step S6 of executing the above steps S1-S4 on the newly input video according to the model trained in S5 and using the obtained key frame capsule u_1 to generate the final summary specifically includes the following steps:
S61: for the newly input video, first execute S1-S4 to obtain the key frame capsule u_1 corresponding to each frame;
S62: from the length of the key frame capsule u_1, obtain the key-frame probability corresponding to every frame;
S63: divide the input video into a plurality of shots using the Kernel Temporal Segmentation algorithm;
S64: compute the average of the key frame capsule probabilities of the frames within each shot to obtain the average probability of the shot;
S65: select shots with a dynamic programming algorithm based on the average shot probabilities obtained in S64, where the objective of the dynamic programming is to maximize the sum of shot-level scores under the condition that the total length of the selected shots does not exceed a preset proportion of the total length of the original video;
S66: combine the shots selected in S65 in temporal order to obtain the summary.
(III) advantageous effects
Compared with the prior art, the invention provides a video abstraction method based on a double-self-attention capsule network, which has at least the following beneficial effects:
A novel double self-attention capsule network for video summarization is provided. By designing a double self-attention network that can be computed in parallel, short-term and long-term dependencies can be captured effectively without being limited by video duration, and the feature expression of video frames is refined. By introducing a capsule network, which excels at classification problems, and proposing a double-flow capsule network, feature fusion can be performed and potential key-frame selection criteria can be learned. Furthermore, treating video summarization as a classification problem facilitates the collection of datasets: for example, only pairs of original and edited videos need to be collected, without a large amount of manual annotation, which alleviates problems such as data scarcity and labeling difficulty. Compared with the most advanced methods, the method can process in parallel, reduces running time, and finally obtains a complete, non-redundant summary video with good performance.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.
FIG. 1 is a flow chart of a video summarization method based on a dual-attention capsule network according to an embodiment of the present invention;
FIG. 2 is a simplified schematic diagram of the video sequence tagging problem based on a dual self-attention network and a dual-flow capsule network in accordance with an embodiment of the present invention;
FIG. 3 is an overview diagram of dual-attention network in a video summarization method based on dual-attention capsule network according to an embodiment of the present invention;
fig. 4 is a comparison diagram of summaries generated on a video of a "tree crawl" by the video summarization method based on the dual-self-attention capsule network according to an embodiment of the present invention.
Detailed Description
For further explanation of the various embodiments, the drawings, which form a part of the disclosure and are incorporated in and constitute a part of this specification, illustrate the embodiments and, together with the description, serve to explain their principles of operation and to enable others of ordinary skill in the art to understand the embodiments and advantages of the invention. The figures are not to scale, and like reference numerals generally refer to like elements.
According to the embodiment of the invention, a video summarization method based on a double-self-attention capsule network is provided.
The present invention will be further explained with reference to the accompanying drawings and specific embodiments. As shown in figs. 1 to 4, a video summarization method based on a dual self-attention capsule network according to an embodiment of the present invention includes the following steps:
S1: the video summarization problem is regarded as a marking problem of a video frame sequence through a preset method;
wherein, the S1 specifically includes the following steps:
S11: define V = {v_1, v_2, ..., v_T} to denote a video, where T denotes the total number of frames of the video and v_t denotes the t-th frame;
S12: assign the video a label sequence Y = {y_1, y_2, ..., y_T}, where y_t ∈ {0, 1}; y_t = 1 indicates that the t-th frame is a key frame and should be selected into the summary, while y_t = 0 indicates that the t-th frame is a non-key frame.
S2: for a given video, an initial feature vector of each video frame is extracted using a GoogLeNet model pre-trained on the ImageNet dataset;
wherein, the S2 specifically includes the following steps:
S21: acquire a built and trained GoogLeNet model; in a specific application, the GoogLeNet model can be downloaded from the Internet;
S22: input a video and parse it into frames; that is, V in S11 is the frame set corresponding to the video;
S23: extract the initial feature vector of each frame of the video data with the GoogLeNet model from S21, using f_t to denote the feature vector of the t-th frame, defined as follows:
f_t = CNN(v_t)  (1);
where CNN(·) denotes the pre-trained GoogLeNet model and v_t denotes the t-th frame;
S24: combine the feature vectors of all frames to obtain the feature vectors of the video, F = {f_1, f_2, ..., f_T}.
In order to obtain the short-term and long-term dependency relationship of each video frame and to be free from the limitation of the video length in the long-term dependency, a dual attention network (i.e., the dual attention net in fig. 2 and the dual attention feature refinement module in fig. 3) is proposed in this embodiment. The original video features are refined by learning attention weights for inter-frame context dependencies. To fully utilize the computing device, the present embodiment therefore uses a parallel computable self-attention mechanism. The double attention network refines the feature expression of the original video frame by learning short-term and long-term dependent attention weights between the video frames respectively.
S3: performing feature refinement on the initial feature vector obtained in the step S2 by using a double-attention model;
wherein, the S3 specifically includes the following steps:
S31: constructing a double attention module comprising a local self-attention network and a global self-attention network based on a self-attention mechanism; in particular, the local and global self-attention networks are used to learn short-term and long-term dependency information, respectively, the difference between these two networks being that the input data is obtained within a video segment or between segments.
S32: dividing a video sequence containing T frames into M video segments, wherein each segment contains N continuous frames, and all frames of each segment are used as an input stream and input into the local self-attention network to capture short-term dependency; extracting frames at corresponding positions from each segment to form a new input stream, inputting the global self-attention network to capture long-term dependency relationship, and expressing the number of video segments as follows:
M = ⌊T / N⌋  (2);
where ⌊·⌋ denotes the round-down (floor) operation;
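The segment division of S32 can be illustrated with the short sketch below. It assumes that the trailing T - M·N frames are simply dropped, since the patent does not state how the remainder is handled; the rows of the reshaped tensor are the local streams l_m and its columns are the global streams g_n.

import torch

def build_streams(F: torch.Tensor, N: int = 8):
    """F: (T, D) initial features; returns local streams (M, N, D) and global streams (N, M, D)."""
    T, D = F.shape
    M = T // N                           # number of segments, M = floor(T / N), equation (2)
    F = F[: M * N]                       # assumption: frames beyond M*N are dropped
    local = F.view(M, N, D)              # row m -> l_m = frames (m-1)N+1, ..., mN     (equation (3))
    global_ = local.transpose(0, 1)      # row n -> g_n = frames n, n+N, ..., n+(M-1)N (equation (5))
    return local, global_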
s33: performing feature refinement on the initial features through the local self-attention network module;
specifically, the S33 specifically includes the following steps:
S331: all frames of each segment form an input stream {l_m : m = 1, 2, ..., M} of the local self-attention network:
l_m = {f_t : t = (m-1)·N+1, ..., m·N-1, m·N}  (3);
where l_m is the m-th input stream, M is the total number of segments, and f_t is the feature vector of the t-th frame;
S332: the m-th input stream l_m obtained in S331 is fed into the self-attention network to obtain the refined features of the m-th local input stream:
l'_m = SA(l_m)  (4);
where SA(·) denotes the self-attention network and l'_m denotes the refined features;
S333: all the refined features are recombined, according to the original video frame order of the video data, into
F^l = {f^l_1, f^l_2, ..., f^l_T},
where T denotes the total number of video frames and f^l_t denotes the feature vector of the t-th frame after refinement by the local self-attention network.
S34: feature refinement is performed on the initial features by the global self-attention network module.
Specifically, the S34 specifically includes the following steps:
S341: the frame at the corresponding position is selected from each video segment to form an input stream {g_n : n = 1, 2, ..., N} of the global self-attention network:
g_n = {f_t : t = n, n+N, ..., n+(M-1)·N}  (5);
where g_n is the n-th input stream, N is the total number of frames per segment, M is the total number of segments, and f_t is the feature vector of the t-th frame;
S342: since the self-attention mechanism does not contain any recurrent operation, a position-encoding vector is added to the features of each input stream to strengthen the sequence information of the video, and the result is fed into the self-attention network to obtain the refined features of the n-th global input stream:
g'_n = SA(g_n + PE(g_n))  (6);
where PE(·) denotes the position-encoding function, g'_n denotes the refined features, and g_n is the n-th data stream;
s343: according to the video dataThe original video frame sequence of (a) recombining all the refined features into
Figure BDA0002325274350000104
Wherein T represents the total number of video frames,
Figure BDA0002325274350000105
and representing the feature vector of the t-th frame after global self-attention network refinement.
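The position-encoding function PE(·) of equation (6) is likewise not specified; a common choice, assumed here, is the sinusoidal encoding of the Transformer.

import math
import torch

def positional_encoding(length: int, dim: int = 1024) -> torch.Tensor:
    """Sinusoidal position codes added to a global stream g_n before self-attention."""
    pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim))
    pe = torch.zeros(length, dim)
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions
    return pe

# usage, cf. equation (6):  g_refined = SelfAttention(1024)(g_n + positional_encoding(g_n.size(0)))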
Rather than learning a real-valued scoring function, which is difficult to make accurate, the video summarization in this embodiment is completed by sequence labeling. That is, a binary sequence labeling function is trained to indicate which frames of the video are selected as representative frames to form the final summary. Compared with learning a real-valued frame score, directly training a binary frame labeling function is more intuitive, and the training data are easier to obtain. In particular, this implementation uses a capsule network, which excels at classification problems, for the labeling. To accommodate the dual output of DualAttentionNet, the capsule network is further extended to the two-stream case (i.e., the Two-Stream Capsule Network in FIG. 2 and the dual-stream capsule network in FIG. 3).
S4: the refined features are fused using the dual-stream capsule network and each frame of the video is marked, obtaining a key frame capsule u_1 and a non-key frame capsule u_2, where the length of each capsule represents the probability that the frame belongs to that category. Specifically, the dual-stream capsule network comprises two streams of capsules, namely a local flow and a global flow; each flow includes a convolutional layer and a main capsule layer, and the two flows are finally fused using a category capsule layer, where the convolutional layer is an ordinary 2D convolution, the main capsule layer comprises 32 channels, and the capsule dimension on each channel is 8.
Wherein, the S4 specifically includes the following steps:
S41: high-level features are extracted with a plurality of convolution kernels; the 1024-dimensional vector obtained from the self-attention network is reshaped into a 32x32 map and fed into the convolutional layer to obtain:
C^l_t = Conv1_l(Reshape(f^l_t))  (7);
C^g_t = Conv1_g(Reshape(f^g_t))  (8);
where Reshape(·) denotes the reshaping function, Conv1_l and Conv1_g denote the convolution operations of the local and global streams respectively, f^l_t and f^g_t denote the features of the t-th frame refined by the local and global self-attention networks respectively, and C^l_t and C^g_t denote the feature maps of the local and global streams respectively;
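A sketch of S41 under the hyper-parameters given in the implementation details later in this section (9x9 kernels, stride 1, 64 output channels); treating the 1024-dimensional refined vector as a single-channel 32x32 map and applying a ReLU after the convolution are assumptions.

import torch
import torch.nn as nn

conv1_local = nn.Conv2d(in_channels=1, out_channels=64, kernel_size=9, stride=1)    # Conv1_l
conv1_global = nn.Conv2d(in_channels=1, out_channels=64, kernel_size=9, stride=1)   # Conv1_g

def conv_features(f_local: torch.Tensor, f_global: torch.Tensor):
    """f_local, f_global: (1024,) refined features of one frame -> feature maps C^l_t, C^g_t."""
    c_l = torch.relu(conv1_local(f_local.view(1, 1, 32, 32)))     # (1, 64, 24, 24), equation (7)
    c_g = torch.relu(conv1_global(f_global.view(1, 1, 32, 32)))   # (1, 64, 24, 24), equation (8)
    return c_l, c_g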
S42: the scalar outputs produced by the convolutional layer are converted into vector-output capsules through the main capsule layer, wherein each convolution kernel, after convolving the feature maps
C^l_t and C^g_t, yields a series of capsules that form a capsule channel, and the j-th capsule channels generated by the local flow and the global flow are respectively represented as:
P^l_j = squash(C^l_t * W^l_j + b^l_j)  (9);
P^g_j = squash(C^g_t * W^g_j + b^g_j)  (10);
where * denotes the convolution operation, b^l_j and b^g_j are the bias terms of the local and global streams respectively, squash(·) is a compression function that limits the length of each capsule vector, compressing it to between 0 and 1, and W^l_j and W^g_j are the convolution kernels of the local and global streams respectively;
S43: all the capsule channels of the local flow and the global flow are stacked separately to obtain all channels P_local of the local stream and all channels P_global of the global stream:
P_local = [P^l_1, P^l_2, ..., P^l_J]  (11);
P_global = [P^g_1, P^g_2, ..., P^g_J]  (12);
where J denotes the number of capsule channels;
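A sketch of the main capsule layer of S42-S43, assuming the standard squash non-linearity of capsule networks; per the implementation details later in this section, each stream uses 9x9 convolutions with stride 2 and produces J = 32 capsule channels of 8-dimensional capsules, implemented here as one joint convolution.

import torch
import torch.nn as nn

def squash(s: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Compression function: scales each capsule vector to a length between 0 and 1."""
    norm_sq = (s ** 2).sum(dim=dim, keepdim=True)
    return (norm_sq / (1.0 + norm_sq)) * s / torch.sqrt(norm_sq + 1e-8)

class PrimaryCapsules(nn.Module):
    def __init__(self, in_ch=64, channels=32, caps_dim=8, kernel=9, stride=2):
        super().__init__()
        self.caps_dim = caps_dim
        # one convolution per capsule channel j, implemented jointly: W_j * C + b_j, equations (9)-(10)
        self.conv = nn.Conv2d(in_ch, channels * caps_dim, kernel_size=kernel, stride=stride)

    def forward(self, feature_map):                      # feature_map: (B, 64, 24, 24), C^l_t or C^g_t
        out = self.conv(feature_map)                     # (B, 32*8, 8, 8)
        out = out.view(out.size(0), -1, self.caps_dim)   # stacked capsule channels, P_local or P_global
        return squash(out)                               # each row is an 8-d capsule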
S44: after the capsules are obtained from the main capsule layer of each flow, all capsules are connected and the short-term and long-term dependencies are fused using a dynamic routing algorithm to obtain the category capsule layer, which includes the key frame capsule u_1 and the non-key frame capsule u_2:
[u_1, u_2] = Routing([P_local, P_global])  (13);
where Routing(·) denotes the dynamic routing function.
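The Routing(·) function of equation (13) is not detailed in the patent; the sketch below assumes the routing-by-agreement procedure of the original capsule network, fusing the concatenated local and global primary capsules into the two 16-dimensional category capsules u_1 (key frame) and u_2 (non-key frame).

import torch
import torch.nn as nn

def squash(s, dim=-1):
    norm_sq = (s ** 2).sum(dim=dim, keepdim=True)
    return (norm_sq / (1.0 + norm_sq)) * s / torch.sqrt(norm_sq + 1e-8)

class RoutingCapsules(nn.Module):
    def __init__(self, num_in, in_dim=8, num_out=2, out_dim=16, iters=3):
        super().__init__()
        self.iters = iters
        # transformation matrices mapping each 8-d input capsule to a 16-d prediction vector
        self.W = nn.Parameter(0.01 * torch.randn(num_out, num_in, out_dim, in_dim))

    def forward(self, u):                                # u: (B, num_in, 8) = [P_local, P_global]
        u_hat = torch.einsum("kiod,bid->bkio", self.W, u)          # prediction vectors (B, 2, num_in, 16)
        b = torch.zeros(u.size(0), self.W.size(0), self.W.size(1), device=u.device)
        for _ in range(self.iters):                      # routing-by-agreement iterations
            c = torch.softmax(b, dim=1)                  # coupling coefficients over the two outputs
            v = squash((c.unsqueeze(-1) * u_hat).sum(dim=2))       # candidate category capsules (B, 2, 16)
            b = b + (u_hat * v.unsqueeze(2)).sum(dim=-1)           # agreement update
        return v                                         # [u_1, u_2]; ||u_k|| is the class probability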
Finally, a dual self-attention capsule network for video summarization is proposed; its overall framework is shown in fig. 3. Initial features of each frame are extracted using GoogLeNet. The dual self-attention network of this implementation is built both inside each video segment and between segments to learn short-term and long-term dependency features, and the two types of features are finally fed into the Two-Stream CapsNet for feature fusion and frame labeling. Fig. 3 gives an overview of the dual self-attention capsule network: given a video, GoogLeNet extracts features for each frame; the video is then divided into segments for dual-attention feature refinement; and the refined features produced by the two self-attention models are fed into a dual-flow capsule network to learn the sequence labeling criterion.
S5: training the model in a deep learning mode by using a corresponding objective function so that the model can generate a concise and complete abstract;
wherein, the S5 specifically includes the following steps:
s51: obtaining an objective function, wherein the expression of the objective function is as follows:
L_k = T_k max(0, m^+ - ||u_k||)^2 + λ(1 - T_k) max(0, ||u_k|| - m^-)^2  (14);
where L_k is the loss value of the k-th capsule, k ∈ {1, 2}; T_k = 1 if the training sample belongs to the k-th class and T_k = 0 otherwise; u_k denotes the k-th capsule; m^+ and m^- are the upper and lower margins respectively; and λ is a balance parameter;
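A sketch of the margin loss of equation (14); the values m+ = 0.9, m- = 0.1 and λ = 0.5 are commonly used defaults and are assumed here, since the patent does not state them.

import torch

def margin_loss(v: torch.Tensor, target: torch.Tensor,
                m_pos: float = 0.9, m_neg: float = 0.1, lam: float = 0.5) -> torch.Tensor:
    """v: (B, 2, 16) category capsules; target: (B,) class index, 0 = key frame (u_1), 1 = non-key frame (u_2)."""
    lengths = v.norm(dim=-1)                                            # ||u_k||, shape (B, 2)
    T_k = torch.nn.functional.one_hot(target.long(), num_classes=2).float()
    loss = T_k * torch.clamp(m_pos - lengths, min=0) ** 2 \
         + lam * (1 - T_k) * torch.clamp(lengths - m_neg, min=0) ** 2   # equation (14)
    return loss.sum(dim=1).mean()                                       # sum over capsules, mean over batch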
s52: training a model of the dual self-attention capsule network by objective function minimization using a video dataset with keyframe markers;
s53: iteratively training as per S1-S52 for each video in the data set until convergence or a maximum number of allowed iteration rounds is reached;
s54: storing the trained model;
s6: according to the model trained in S5, for the newly input video, the above-mentioned steps S1-S4 are executed to obtain the key frame capsule u1For generating the final digest.
Wherein, the S6 specifically includes the following steps:
S61: for the newly input video, first execute steps S1-S4 to obtain the key frame capsule u_1 corresponding to each frame;
S62: from the length of the key frame capsule u_1, obtain the key-frame probability corresponding to every frame;
S63: divide the input video into a plurality of shots using the Kernel Temporal Segmentation (KTS) algorithm; the KTS algorithm is prior art and is not described in detail herein;
S64: compute the average of the key frame capsule probabilities of the frames within each shot to obtain the average probability of the shot;
S65: select shots with a dynamic programming algorithm based on the average shot probabilities obtained in S64, where the objective of the dynamic programming is to maximize the sum of shot-level scores under the condition that the total length of the selected shots does not exceed a preset proportion of the total length of the original video;
S66: combine the shots selected in S65 in temporal order to obtain the summary.
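Steps S64-S66 can be sketched as follows, assuming the shot boundaries from KTS are given and using the 15% length budget mentioned in the evaluation section below; the selection of S65 is solved as a 0/1 knapsack by dynamic programming over shot lengths.

from typing import List, Tuple

def select_shots(frame_probs: List[float], shots: List[Tuple[int, int]], budget_ratio: float = 0.15):
    """frame_probs: per-frame ||u_1||; shots: (start, end) index pairs from KTS, end exclusive."""
    scores = [sum(frame_probs[s:e]) / (e - s) for s, e in shots]    # S64: average probability per shot
    lengths = [e - s for s, e in shots]
    budget = int(budget_ratio * sum(lengths))                       # maximum total length of the summary

    # S65: 0/1 knapsack by dynamic programming: maximise total shot score under the length budget.
    dp = [[0.0] * (budget + 1) for _ in range(len(shots) + 1)]
    for i in range(1, len(shots) + 1):
        for c in range(budget + 1):
            dp[i][c] = dp[i - 1][c]
            if lengths[i - 1] <= c:
                dp[i][c] = max(dp[i][c], dp[i - 1][c - lengths[i - 1]] + scores[i - 1])

    chosen, c = [], budget                                          # backtrack the selected shots
    for i in range(len(shots), 0, -1):
        if dp[i][c] != dp[i - 1][c]:
            chosen.append(i - 1)
            c -= lengths[i - 1]
    return sorted(chosen)                                           # S66: keep temporal order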
In order to facilitate understanding of the technical scheme of the invention, corresponding experiments are also carried out in the invention. The experiment included: experimental setup, quantitative analysis and qualitative analysis.
1) Experimental setup:
① Datasets. This implementation was evaluated on two widely used video summarization datasets, SumMe and TVSum. The SumMe dataset consists of 25 videos covering various activities such as travel, cooking and sports. The TVSum dataset includes 50 videos collected from a video website. The labels of both datasets are frame-level importance scores. In addition, this implementation also utilizes two additional datasets, OVP and YouTube, to augment the training data and evaluate the approach under different training settings. Since this implementation treats video summarization as a sequence labeling problem, the frame-level scores are converted into a unified set of key frames according to the disclosed Score2Frame method.
② Evaluation criteria. As in conventional methods, the F-score is used as the evaluation metric in this embodiment. Specifically, the KTS algorithm is first used to divide a video into a plurality of non-overlapping shots. The mean of the reference scores and of the frame-level scores predicted by the model within each shot is then computed as the shot-level score. Finally, with the summary length limited to no more than 15% of the original video, a knapsack algorithm selects shots according to the shot-level scores so that the total score of the summary is maximized. Let "predicted" denote the predicted summary, "annotated" the user-annotated summary, and "overlap" the overlapping portion of "predicted" and "annotated". Then the F-score can be calculated as:
precision = overlap / predicted,  recall = overlap / annotated  (15);
F-score = 2 × precision × recall / (precision + recall)  (16);
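A sketch of equations (15)-(16) computed from binary per-frame masks, with "predicted", "annotated" and "overlap" measured in numbers of frames.

def f_score(predicted, annotated):
    """predicted, annotated: binary per-frame masks (1 = frame is in the summary)."""
    overlap = sum(p and a for p, a in zip(predicted, annotated))
    if overlap == 0:
        return 0.0
    precision = overlap / sum(predicted)                  # equation (15)
    recall = overlap / sum(annotated)
    return 2 * precision * recall / (precision + recall)  # equation (16)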
in addition, the present implementation further evaluates the correlation between the prediction scores and the user annotation scores of the present solution on two evaluation criteria, Kendall's τ and Spearman's ρ.
③ Evaluation settings. For a fair comparison, this implementation uses the same three evaluation settings as prior approaches: canonical (C), augmented (A) and transfer (T). Table 1 shows the evaluation settings for SumMe; when testing on TVSum, it suffices to swap SumMe and TVSum in the table. This implementation randomly selects 80% of the videos of the test dataset as the training set and uses the rest as the test set. Five training/test splits are randomly generated and evaluated separately, and the average is taken as the final result.
TABLE 1 training set of SumMe datasets
④ Implementation details. As in previous methods, this implementation downsamples each video to 2 fps and uses GoogLeNet pre-trained on ImageNet to extract features of the video frames. The length of the local self-attention network input stream is set to 8. For the dual-flow capsule network, the convolution kernel size, stride and number of output channels of the convolutional layer are (9, 9), 1 and 64 respectively; the convolution kernel size and stride of the main capsule layer are (9, 9) and 2, respectively. The dimensions of the capsules in the main and category capsule layers are 8 and 16, respectively. The initial learning rates for SumMe and TVSum are set to 5e-4 and 1e-4 respectively, decayed by a factor of 0.5 every 20 epochs. The model is implemented with PyTorch.
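For reference, the hyper-parameters stated in this section can be collected into a single configuration sketch; every value below is taken from the text above, and nothing beyond it is implied.

CONFIG = {
    "frame_rate": 2,                    # videos down-sampled to 2 fps
    "feature_dim": 1024,                # GoogLeNet frame features
    "local_stream_length": 8,           # length of the local self-attention input stream
    "conv_layer": {"kernel": (9, 9), "stride": 1, "out_channels": 64},
    "main_capsule_layer": {"kernel": (9, 9), "stride": 2, "channels": 32, "capsule_dim": 8},
    "category_capsule_dim": 16,
    "initial_learning_rate": {"SumMe": 5e-4, "TVSum": 1e-4},
    "lr_decay": {"factor": 0.5, "every_epochs": 20},
    "framework": "PyTorch",
}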
2) Quantitative analysis
① Ablation experiment. To demonstrate the effectiveness of the dual attention network, ablation experiments were performed on the model of this implementation. The following three variants were used for comparison:
ours-local: models using only local self-attention networks and single-flow capsule networks;
ours-global: models using only global self-attention networks and single-flow capsule networks;
ours: the implementation proposes a dual self-attention capsule network model.
TABLE 2 ablation test results on SumMe and TVSum
Method SumMe TVSum
Ours-local 45.2 58.3
Ours-global 45.4 58.5
Ours 47.5 59.4
Table 2 shows the results of the ablation experiments. It can be observed that Ours-local and Ours-global behave almost identically. Furthermore, Ours outperformed Ours-local and Ours-global on both the SumMe and TVSum datasets. This verifies that local and global self-attention, which rely on short-term and long-term temporal information, are equally important for generating a high-quality summary.
TABLE 3F-score (%) values for comparative methods on SumMe and TVSum
TABLE 4 comparison of Kendall's τ and Spearman's ρ values on TVSum dataset by methods
Method  Kendall's τ  Spearman's ρ
Random 0.000 0.000
DR-DSNsup 0.020 0.026
dppLSTM 0.042 0.055
Ours 0.058 0.065
② Comparison with existing methods. Compared with LSTM-based approaches, the model of this implementation can be computed in parallel and is not limited by video duration. Table 3 compares the F-scores of existing LSTM-based methods with that of this implementation. It can be seen that the method of this implementation achieves superior results under multiple settings on both datasets.
This example also performed experiments on TVSum using Kendall's τ and Spearman's ρ. It should be noted that the model of this implementation is trained with binary sequence labels, which cannot be used to compute Kendall's τ and Spearman's ρ. Therefore, the frame-level scores used to generate the binary labels, rather than the binary labels themselves, are adopted as the reference data, and for each frame the length of its corresponding key frame capsule u_1 is used directly as the predicted score. The experimental results on the TVSum dataset are shown in Table 4. It can be observed that the method of this implementation is significantly better than the other methods under both evaluation criteria. This further demonstrates the effectiveness of the present approach.
3) Qualitative analysis
To better illustrate the results of the method of the present invention, a video summary generated by the method is shown in fig. 4. The video shows two bears climbing down a tree, so the user-annotated reference summary is mainly located in the middle and rear parts of the original video. It can be observed that there is a large amount of overlap between the summary generated by the method of the present invention and the user-annotated summary, and the generated summary reveals the complete process of the bears climbing down from the tree.
In summary, by means of the above technical solution, the present invention provides a new dual self-attention capsule network for video summarization. By designing a dual self-attention network that can be computed in parallel, short-term and long-term dependencies can be captured effectively without being limited by video duration, and the feature expression of video frames is refined. By introducing a capsule network, which excels at classification problems, and proposing a double-flow capsule network, feature fusion can be performed and potential key-frame selection criteria can be learned. Furthermore, treating video summarization as a classification problem facilitates the collection of datasets: for example, only pairs of original and edited videos need to be collected, without a large amount of manual annotation, which alleviates problems such as data scarcity and labeling difficulty. Compared with the most advanced methods, the method can process in parallel, reduces running time, and finally obtains a complete, non-redundant summary video with good performance.
In the present invention, the above-mentioned embodiments are only preferred embodiments of the present invention, and the present invention is not limited thereto, and any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A video abstraction method based on double-self-attention capsule network is characterized by comprising the following steps:
s1: the video abstract problem is regarded as a marking problem of a video frame sequence through a preset method;
S2: for a given video, extracting an initial feature vector of each video frame using a GoogLeNet model pre-trained on the ImageNet dataset;
s3: performing feature refinement on the initial feature vector obtained in the step S2 by using a double-attention model;
S4: fusing the refined features by using a double-flow capsule network and marking each frame of the video to obtain a key frame capsule u_1 and a non-key frame capsule u_2, wherein the length of each capsule represents the probability that the frame belongs to that category;
s5: training the model in a deep learning mode by using a corresponding objective function so that the model can generate a concise and complete abstract;
S6: according to the model trained in S5, executing the above steps S1-S4 on the newly input video and using the obtained key frame capsule u_1 to generate the final summary.
2. The method for video summarization based on dual-attention capsule network of claim 1, wherein the step S1 regards the video summarization problem as a labeling problem of a sequence of video frames by a preset method, and comprises the following steps:
S11: defining V = {v_1, v_2, ..., v_T} to denote a video, where T denotes the total number of frames of the video and v_t denotes the t-th frame;
S12: assigning the video a label sequence Y = {y_1, y_2, ..., y_T}, where y_t ∈ {0, 1}; y_t = 1 indicates that the t-th frame is a key frame and should be selected into the summary, while y_t = 0 indicates that the t-th frame is a non-key frame.
3. The video summarization method based on the dual self-attention capsule network of claim 1, wherein the step S2 of extracting, for a given video, an initial feature vector of each video frame using a GoogLeNet model pre-trained on the ImageNet dataset specifically comprises the following steps:
S21: acquiring a built and trained GoogLeNet model;
S22: inputting a video and parsing the video into frames;
S23: extracting the initial feature vector of each frame of the video data with the GoogLeNet model from S21, using f_t to denote the feature vector of the t-th frame, defined as follows:
f_t = CNN(v_t)  (1);
where CNN(·) denotes the pre-trained GoogLeNet model and v_t denotes the t-th frame;
S24: combining the feature vectors of all frames to obtain the feature vectors of the video, F = {f_1, f_2, ..., f_T}.
4. The video summarization method based on dual-attention capsule network of claim 1, wherein the step S3 feature-refining the initial feature vector obtained in S2 by using dual-attention model comprises the following steps:
s31: constructing a double attention module comprising a local self-attention network and a global self-attention network based on a self-attention mechanism;
s32: dividing a video sequence containing T frames into M video segments, wherein each segment contains N continuous frames, and all frames of each segment are used as an input stream and input into the local self-attention network to capture short-term dependency; extracting frames at corresponding positions from each segment to form a new input stream, inputting the global self-attention network to capture long-term dependency relationship, and expressing the number of video segments as follows:
M = ⌊T / N⌋  (2);
where ⌊·⌋ denotes the round-down (floor) operation;
s33: performing feature refinement on the initial features through the local self-attention network module;
s34: feature refinement is performed on the initial features by the global self-attention network module.
5. The video summarization method based on dual self-attention capsule network of claim 4 wherein the step S33 of feature refining the initial features through the local self-attention network module specifically comprises the following steps:
S331: all frames of each segment form an input stream {l_m : m = 1, 2, ..., M} of the local self-attention network:
l_m = {f_t : t = (m-1)·N+1, ..., m·N-1, m·N}  (3);
where l_m is the m-th input stream, M is the total number of segments, and f_t is the feature vector of the t-th frame;
S332: feeding the m-th input stream l_m obtained in S331 into the self-attention network to obtain the refined features of the m-th local input stream:
l'_m = SA(l_m)  (4);
where SA(·) denotes the self-attention network and l'_m denotes the refined features;
S333: recombining all the refined features, according to the original video frame order of the video data, into
F^l = {f^l_1, f^l_2, ..., f^l_T},
where T denotes the total number of video frames and f^l_t denotes the feature vector of the t-th frame after refinement by the local self-attention network.
6. The video summarization method based on dual self-attention capsule network of claim 4, wherein the step S34 of feature refining the initial features through the global self-attention network module specifically comprises the following steps:
S341: selecting the frame at the corresponding position from each video segment to form an input stream {g_n : n = 1, 2, ..., N} of the global self-attention network:
g_n = {f_t : t = n, n+N, ..., n+(M-1)·N}  (5);
where g_n is the n-th input stream, N is the total number of frames per segment, M is the total number of segments, and f_t is the feature vector of the t-th frame;
S342: adding a position-encoding vector to the features of the n-th input stream to strengthen the sequence information of the video, and feeding the result into the self-attention network to obtain the refined features of the n-th global input stream:
g'_n = SA(g_n + PE(g_n))  (6);
where PE(·) denotes the position-encoding function, g'_n denotes the refined features, and g_n is the n-th data stream;
S343: recombining all the refined features, according to the original video frame order of the video data, into
F^g = {f^g_1, f^g_2, ..., f^g_T},
where T denotes the total number of video frames and f^g_t denotes the feature vector of the t-th frame after refinement by the global self-attention network.
7. The video summarization method based on dual-attention capsule network of claim 1 wherein the dual-stream capsule network in step S4 comprises two-stream capsule network, i.e. local stream and global stream, each stream comprising a convolutional layer and a main capsule layer, and finally the two streams are merged using the category capsule layer.
8. The method for video summarization based on dual-attention capsule network of claim 7, wherein the step S4 of utilizing the dual-stream capsule network to fuse the refined features and label each frame of the video to obtain a key frame capsule u_1 and a non-key frame capsule u_2 specifically comprises the following steps:
s41: extracting high-level features from a feature map with a plurality of convolution kernels as input, reshaping a 1024-dimensional vector obtained from the self-attention network into 32x32, and inputting the vector into the convolution layer to obtain:
C^l_t = Conv1_l(Reshape(f^l_t))  (7);
C^g_t = Conv1_g(Reshape(f^g_t))  (8);
where Reshape(·) denotes the reshaping function, Conv1_l and Conv1_g denote the convolution operations of the local and global streams respectively, f^l_t and f^g_t denote the features of the t-th frame refined by the local and global self-attention networks respectively, and C^l_t and C^g_t denote the feature maps of the local and global streams respectively;
S42: converting the scalar outputs produced by the convolutional layer into vector-output capsules through the main capsule layer, wherein each convolution kernel, after convolving the feature maps
C^l_t and C^g_t, yields a series of capsules that form a capsule channel, and the j-th capsule channels generated by the local flow and the global flow are respectively represented as:
P^l_j = squash(C^l_t * W^l_j + b^l_j)  (9);
P^g_j = squash(C^g_t * W^g_j + b^g_j)  (10);
where * denotes the convolution operation, b^l_j and b^g_j are the bias terms of the local and global streams respectively, squash(·) is a compression function that limits the length of each capsule vector, compressing it to between 0 and 1, and W^l_j and W^g_j are the convolution kernels of the local and global streams respectively;
S43: stacking all the capsule channels of the local and global flows separately, yielding:
P_local = [P^l_1, P^l_2, ..., P^l_J]  (11);
P_global = [P^g_1, P^g_2, ..., P^g_J]  (12);
where J denotes the number of capsule channels;
S44: after obtaining the capsules from the main capsule layer of each flow, connecting all capsules and fusing the short-term and long-term dependencies using a dynamic routing algorithm to obtain the category capsule layer, which includes the key frame capsule u_1 and the non-key frame capsule u_2:
[u_1, u_2] = Routing([P_local, P_global])  (13);
where Routing(·) denotes the dynamic routing function.
9. The method for video summarization based on dual-attention capsule network of claim 1, wherein the step S5 trains the model in a deep learning manner using the corresponding objective function, so that the model can generate a concise and complete summary specifically comprises the following steps:
S51: construct the objective function, whose expression is

L_k = T_k · max(0, m^+ - ||u_k||)^2 + λ · (1 - T_k) · max(0, ||u_k|| - m^-)^2 (14);

wherein L_k is the loss value of the k-th capsule and k ∈ {1, 2}; T_k = 1 if the training sample belongs to the k-th class and T_k = 0 otherwise; u_k denotes the k-th capsule; m^+ and m^- are the upper margin and the lower margin respectively; and λ is a balance parameter;
S52: train the dual self-attention capsule network model by minimizing the objective function on a video dataset with key-frame annotations;
S53: iterate steps S1-S52 for each video in the dataset until convergence or until the maximum permitted number of iteration rounds is reached;
S54: save the trained model.
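Purely as an illustration of equation (14), a minimal PyTorch rendering of the margin loss might look as follows; the values m^+ = 0.9, m^- = 0.1 and λ = 0.5 are commonly used defaults assumed here, since the claim leaves them unspecified.

    import torch
    import torch.nn.functional as F

    def capsule_margin_loss(class_caps, targets, m_pos=0.9, m_neg=0.1, lam=0.5):
        # class_caps: (B, 2, D) class capsules (index 0 = key frame, 1 = non-key frame).
        # targets: (B,) integer labels in {0, 1} giving the true class of each frame.
        lengths = class_caps.norm(dim=-1)                      # ||u_k||, shape (B, 2)
        T = F.one_hot(targets, num_classes=2).float()          # T_k in equation (14)
        loss = (T * torch.clamp(m_pos - lengths, min=0) ** 2
                + lam * (1 - T) * torch.clamp(lengths - m_neg, min=0) ** 2)
        return loss.sum(dim=1).mean()                          # sum over capsules, mean over batch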
10. The video summarization method based on the dual self-attention capsule network of claim 1, wherein step S6 of processing a newly input video with the model trained in S5 through steps S1-S4 to obtain the key-frame capsule u_1 and generate the final summary specifically comprises the following steps:
S61: for the newly input video, first execute S1-S4 to obtain the key-frame capsule u_1 corresponding to each frame;
S62: use the length of the key-frame capsule u_1 to obtain, for every frame, the probability that the frame is a key frame;
S63: divide the input video into a plurality of shots using the Kernel Temporal Segmentation algorithm;
S64: compute the mean of the key-frame capsule probabilities of the frames inside each shot to obtain the average probability of the shot;
S65: select shots by applying a dynamic programming algorithm to the shot-level average probabilities obtained in S64, the objective being to maximize the sum of the shot-level scores under the constraint that the total length of the selected shots does not exceed a preset proportion of the total length of the original video;
S66: concatenate the shots selected in S65 in temporal order to obtain the summary.
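As an illustrative aid for steps S62-S66, and not the claimed implementation, the sketch below averages key-frame probabilities per shot and then selects shots with a 0/1-knapsack dynamic program. The 15% length budget mentioned in the comment and the shot boundaries are assumptions; the Kernel Temporal Segmentation step itself is not reproduced here.

    import numpy as np

    def shot_scores(frame_probs, shot_bounds):
        # S64: mean key-frame probability of the frames inside each shot.
        # frame_probs[t] is the length of the key-frame capsule u_1 for frame t (S62);
        # shot_bounds is a list of (start, end) frame indices, e.g. from KTS (S63).
        return [float(np.mean(frame_probs[s:e])) for s, e in shot_bounds]

    def select_shots(scores, lengths, budget):
        # S65: maximize the sum of shot scores subject to a total-length budget in frames
        # (e.g. budget = 0.15 * number of frames of the original video); lengths are ints.
        n = len(scores)
        dp = np.zeros((n + 1, budget + 1))
        keep = np.zeros((n + 1, budget + 1), dtype=bool)
        for i in range(1, n + 1):
            w, v = lengths[i - 1], scores[i - 1]
            for c in range(budget + 1):
                dp[i, c] = dp[i - 1, c]
                if w <= c and dp[i - 1, c - w] + v > dp[i, c]:
                    dp[i, c] = dp[i - 1, c - w] + v
                    keep[i, c] = True
        chosen, c = [], budget
        for i in range(n, 0, -1):        # backtrack to recover the selected shots
            if keep[i, c]:
                chosen.append(i - 1)
                c -= lengths[i - 1]
        return sorted(chosen)            # S66: temporal order for concatenation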
CN201911313856.3A 2019-12-19 2019-12-19 Video abstraction method based on double self-attention capsule network Active CN111984820B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911313856.3A CN111984820B (en) 2019-12-19 2019-12-19 Video abstraction method based on double self-attention capsule network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911313856.3A CN111984820B (en) 2019-12-19 2019-12-19 Video abstraction method based on double self-attention capsule network

Publications (2)

Publication Number Publication Date
CN111984820A true CN111984820A (en) 2020-11-24
CN111984820B CN111984820B (en) 2023-10-27

Family

ID=73441638

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911313856.3A Active CN111984820B (en) 2019-12-19 2019-12-19 Video abstraction method based on double self-attention capsule network

Country Status (1)

Country Link
CN (1) CN111984820B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101072305A (en) * 2007-06-08 2007-11-14 华为技术有限公司 Lens classifying method, situation extracting method, abstract generating method and device
CN101477633A (en) * 2009-01-21 2009-07-08 北京大学 Method for automatically estimating visual significance of image and video
CN105228033A (en) * 2015-08-27 2016-01-06 联想(北京)有限公司 A kind of method for processing video frequency and electronic equipment
CN105657580A (en) * 2015-12-30 2016-06-08 北京工业大学 Capsule endoscopy video summary generation method
CA3016953A1 (en) * 2017-09-07 2019-03-07 Comcast Cable Communications, Llc Relevant motion detection in video
CN110110140A (en) * 2019-04-19 2019-08-09 天津大学 Video summarization method based on attention expansion coding and decoding network
CN110287374A (en) * 2019-06-14 2019-09-27 天津大学 Self-attention video summarization method based on distribution consistency

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HAO FU et al.: "Video Summarization with a Dual Attention Capsule Network", 2020 25th International Conference on Pattern Recognition, pages 446 - 451 *
JI ZHONG; JIANG JUNJIE: "Video Summarization Based on Decoder Attention Mechanism", Journal of Tianjin University (Science and Technology), vol. 51, no. 10, pages 1023 - 1030 *
LIU WEI; WU YIHONG: "2D-to-3D Video Conversion Based on Layer Optimization and Fusion", Journal of Computer-Aided Design & Computer Graphics, vol. 24, no. 11, pages 1426 - 1439 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112307258A (en) * 2020-11-25 2021-02-02 中国计量大学 Short video click rate prediction method based on double-layer capsule network
CN112307258B (en) * 2020-11-25 2021-07-20 中国计量大学 Short video click rate prediction method based on double-layer capsule network
CN113099374A (en) * 2021-03-30 2021-07-09 四川省人工智能研究院(宜宾) Audio frequency three-dimensional method based on multi-attention audio-visual fusion
CN113099374B (en) * 2021-03-30 2022-08-05 四川省人工智能研究院(宜宾) Audio frequency three-dimensional method based on multi-attention audio-visual fusion
CN115002559A (en) * 2022-05-10 2022-09-02 上海大学 Video abstraction algorithm and system based on gated multi-head position attention mechanism
CN115002559B (en) * 2022-05-10 2024-01-05 上海大学 Video abstraction algorithm and system based on gating multi-head position attention mechanism
CN115695950A (en) * 2023-01-04 2023-02-03 石家庄铁道大学 Video abstract generation method based on content perception

Also Published As

Publication number Publication date
CN111984820B (en) 2023-10-27

Similar Documents

Publication Publication Date Title
Dai et al. Human action recognition using two-stream attention based LSTM networks
CN111984820B (en) Video abstraction method based on double self-attention capsule network
US11741711B2 (en) Video classification method and server
Xiao et al. Convolutional hierarchical attention network for query-focused video summarization
CN109344288A (en) Combined video description method based on multi-modal features and a multi-layer attention mechanism
Perez-Martin et al. Improving video captioning with temporal composition of a visual-syntactic embedding
Chen et al. Learning and fusing multiple user interest representations for micro-video and movie recommendations
Zhang et al. S3d: single shot multi-span detector via fully 3d convolutional networks
CN110933518B (en) Method for generating query-oriented video abstract by using convolutional multi-layer attention network mechanism
CN112417097B (en) Multi-modal data feature extraction and association method for public opinion analysis
Hii et al. Multigap: Multi-pooled inception network with text augmentation for aesthetic prediction of photographs
CN112749330B (en) Information pushing method, device, computer equipment and storage medium
CN111274440A (en) Video recommendation method based on visual and audio content relevancy mining
Liu et al. Hybrid design for sports data visualization using AI and big data analytics
Liu et al. Video captioning with listwise supervision
CN115731498B (en) Video abstract generation method combining reinforcement learning and contrast learning
Li et al. Learning hierarchical video representation for action recognition
Gao et al. Play and rewind: Context-aware video temporal action proposals
Yan et al. CHAM: action recognition using convolutional hierarchical attention model
CN113051468B (en) Movie recommendation method and system based on knowledge graph and reinforcement learning
Guan et al. Pidro: Parallel isomeric attention with dynamic routing for text-video retrieval
Huo et al. Semantic relevance learning for video-query based video moment retrieval
Jiang Web-scale multimedia search for internet video content
CN113792167B (en) Cross-media cross-retrieval method based on attention mechanism and modal dependence
Zhang et al. SAPS: Self-attentive pathway search for weakly-supervised action localization with background-action augmentation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant