CN117312603B - Unsupervised segmentation video abstraction method based on double-attention mechanism - Google Patents
- Publication number
- CN117312603B CN117312603B CN202311598370.5A CN202311598370A CN117312603B CN 117312603 B CN117312603 B CN 117312603B CN 202311598370 A CN202311598370 A CN 202311598370A CN 117312603 B CN117312603 B CN 117312603B
- Authority
- CN
- China
- Prior art keywords
- segment
- video
- lens
- feature
- features
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/73—Querying
- G06F16/738—Presentation of query results
- G06F16/739—Presentation of query results in form of a video summary, e.g. the video summary being a video sequence, a composite still image or having synthesized frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/71—Indexing; Data structures therefor; Storage structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7847—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/088—Non-supervised learning, e.g. competitive learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/443—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
- G06V10/449—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
- G06V10/451—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
- G06V20/47—Detecting features for summarising video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/49—Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The embodiment of the application provides an unsupervised segmented video summarization method based on a dual-attention mechanism, comprising the following steps: after preprocessing an original video, dividing it into video segments and shot segments to obtain a video segment group and a shot segment group; inputting the video segment group and the shot segment group into a video summarization model respectively, and obtaining weighted video segment features and weighted shot segment features after processing; taking the weighted video segment features and the weighted shot segment features as inputs, calculating an importance score for each shot, and selecting the shots with high scores, i.e., high importance, to generate a dynamic summary; constructing a reward function, calculating the diversity and representativeness of the dynamic summary, and training the video summarization model by unsupervised reinforcement learning. The method emphasizes the importance of visual content while modeling temporal relationships, strengthens the representation capability of the summary features, and improves the model's ability to understand and analyze video content.
Description
Technical Field
The application belongs to the field of computer vision, and particularly relates to an unsupervised segmented video summarization method based on a dual-attention mechanism.
Background
In recent years, with the popularization of video-sharing platforms such as Douyin, iQiyi, and Kuaishou, uploading and sharing videos anytime and anywhere has become the norm. While these platforms satisfy people's cultural and entertainment needs, the accompanying multimedia data such as audio and video have grown explosively. How to process and manage such a huge and complicated volume of video data has therefore become an urgent problem. Video summarization is a technology by which a computer automatically extracts important frames or video clips from an original long video; it preserves the original video content to the greatest extent while shortening the total duration, facilitating efficient storage and browsing, and has thus gradually attracted extensive attention from researchers in the field of computer vision.
Modeling the temporal relationships of video sequences is a challenge in video summarization tasks; at the same time, accurately extracting features that effectively characterize the entire video is also very important. Existing video summarization methods can be divided into two categories. The first takes static image features as input and uses temporal feature aggregation to capture temporal interaction relationships. For example, Ji et al. employ GoogLeNet as the extraction network for visual features of video frames, use a long short-term memory network as the encoder to model remote dependencies of the video frame sequence, and use an attention mechanism to enhance long-term dependencies between frames; Li et al. also use image features extracted by GoogLeNet as input, model the similarity between all pairs of frames with a self-attention mechanism, and capture global relationships over the entire video. Although these methods take temporal relationships into account, they merely simulate the front-to-back relationship of a set of still images and do not consider the true latent temporal relationships of a sequence of consecutive frames, and the still-image features extracted by 2D convolution lack correlation between consecutive frames.
To solve the above problems, a second category of methods takes dynamic video features covering fine-grained temporal information as input. For example, Lin et al. propose using 3D ResNeXt-101 as the extraction network for dynamic video features, design a hierarchical long short-term memory network, and obtain long-term dependencies of the video in parallel with a sequential annotation mechanism; Liu et al. study which of the ST3D and I3D networks is better suited as a feature extraction network for video summarization, and propose using 3DST-UNet to explore context information and map spatio-temporal features into a latent space that encodes spatio-temporal dependencies. However, these methods overemphasize temporal relationships at the expense of visual content, so the models are biased in their understanding of visual content. Therefore, a video summarization method that can fully and accurately understand visual content while completing temporal correlation modeling is needed.
Disclosure of Invention
In order to address one of the above technical defects, an embodiment of the present application provides an unsupervised segmented video summarization method based on a dual-attention mechanism.
According to a first aspect of the embodiments of the present application, there is provided an unsupervised segmented video summarization method based on a dual-attention mechanism, including:
preprocessing an original video and then segmenting it to obtain a video segment group and a shot segment group;
inputting the video segment group and the shot segment group into a video summarization model respectively, and obtaining weighted video segment features and weighted shot segment features after processing;
taking the weighted video segment features and the weighted shot segment features as inputs, calculating an importance score for each shot, and selecting the shots with high scores, i.e., high importance, to generate a dynamic summary;
and constructing a reward function, calculating the diversity and representativeness of the dynamic summary, and training the video summarization model by unsupervised reinforcement learning.
Preferably, the preprocessing of the original video followed by segmentation into a video segment group and a shot segment group includes:
inputting an original long video and sampling a video frame sequence according to the frame rate;
inputting the video frame sequence into a feature extraction module to obtain spatio-temporal features that reflect both the visual content and the temporal relationships;
and detecting visual appearance change points on the spatio-temporal features using a kernel segmentation algorithm, segmenting to obtain video segment groups, and equally dividing the spatio-temporal features of each shot into non-overlapping shot segment sets.
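The segmentation step above can be sketched as follows. This is a simplified stand-in under stated assumptions: the patent uses a kernel segmentation (change-point detection) algorithm, whereas the sketch places a boundary wherever the cosine distance between consecutive shot features exceeds a threshold; the function name, the threshold, and the number of sub-segments per shot are illustrative, not taken from the patent.

```python
import numpy as np

def split_video(shot_feats, threshold=0.3, seg_len=4):
    """Split shot-level spatio-temporal features into a video segment group
    and per-shot segment groups.

    A simplified stand-in for the kernel change-point detection in the
    method: a boundary is placed wherever the cosine distance between
    consecutive shot features exceeds `threshold`.
    """
    f = shot_feats / np.linalg.norm(shot_feats, axis=1, keepdims=True)
    # cosine distance between each pair of consecutive shots
    dist = 1.0 - np.sum(f[:-1] * f[1:], axis=1)
    boundaries = np.where(dist > threshold)[0] + 1
    # unequal, non-overlapping video segment groups
    video_segments = np.split(shot_feats, boundaries)

    # equally divide every shot feature into non-overlapping shot segments
    dim = shot_feats.shape[1] // seg_len
    shot_segments = [x[: seg_len * dim].reshape(seg_len, dim) for x in shot_feats]
    return video_segments, shot_segments
```

With a threshold above the maximum cosine distance no boundary is placed and the whole video forms one segment; with a negative threshold every shot becomes its own segment.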
Preferably, the video summarization model comprises a video segment attention module and a shot segment attention module.
Preferably, the inputting of the video segment group and the shot segment group into the video summarization model respectively, and obtaining the weighted video segment features and the weighted shot segment features after processing, includes:
inputting the video segment group into the video segment attention module, calculating a similarity matrix between the shots in each video segment, aggregating the intra-segment features into weighted video segment features that represent high-level semantic information with short-term temporal dependencies, and outputting the weighted video segment features;
and inputting the shot segment group into the shot segment attention module, filtering out the segments that are irrelevant or weakly correlated to the target segment through a coarse-grained similarity calculation, aggregating the remaining shot segments within the shot, and calculating the similarity matrix of the remaining segments to obtain weighted shot segment features that enhance locally correlated regional semantic information.
Preferably, the inputting of the video segment group into the video segment attention module, calculating a similarity matrix between the shots in each video segment, aggregating the intra-segment features into weighted video segment features that represent high-level semantic information with short-term temporal dependencies, and outputting the weighted video segment features specifically includes:
taking the video segment group as input, and linearly mapping it into video segment query features, video segment key features and video segment value features using three matrices with different weights;
calculating the product of the video segment query features and the video segment key features to obtain a segment-level similarity matrix, and obtaining the segment-level similarity normalization matrix of the segment after linear scaling and a Softmax function;
and weighting the video segment value features with the segment-level similarity normalization matrix by matrix multiplication to obtain the weighted video segment features.
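The three steps above describe scaled dot-product self-attention over the shots of one segment. A minimal sketch, with the learned weight matrices passed in explicitly since the claim does not fix how they are trained:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def segment_attention(seg, wq, wk, wv):
    """Weighted video-segment features via intra-segment attention.

    `seg` holds the shot features of one video segment (n_shots x d);
    `wq`, `wk`, `wv` are the three differently weighted linear maps.
    """
    q, k, v = seg @ wq, seg @ wk, seg @ wv   # query / key / value features
    sim = q @ k.T / np.sqrt(k.shape[1])      # segment-level similarity, linearly scaled
    attn = softmax(sim)                      # row-normalised similarity matrix
    return attn @ v                          # weighted video-segment features
```

Each row of the normalised similarity matrix sums to one, so the output is a convex combination of the value features of the shots in the segment.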
Preferably, the inputting of the shot segment group into the shot segment attention module, filtering out the segments that are irrelevant or weakly correlated to the target segment through a coarse-grained similarity calculation, aggregating the remaining shot segments within the shot, and calculating the similarity matrix of the remaining segments to obtain weighted shot segment features that enhance locally correlated regional semantic information specifically includes:
taking the shot segment group as input, and linearly mapping it into shot segment query features, shot segment key features and shot segment value features using three matrices with different weights;
averaging the shot segment query features and the shot segment key features by rows to obtain query average features and key average features that represent the whole shot content, and multiplying the query average features and the key average features as matrices to obtain a similarity matrix that reflects the correlation between different regions of the shot;
filtering out the regions with no or low correlation from the similarity matrix to obtain an index set of strongly correlated regions, and taking the corresponding key feature set and value feature set from the shot segment key features and shot segment value features according to the region index set;
and calculating the product of the shot segment query features and the key feature set, obtaining the region-level similarity normalization matrix of the segment after linear scaling and a Softmax function, and weighting the value feature set with the region-level similarity normalization matrix by matrix multiplication to obtain the weighted shot segment features.
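The coarse-to-fine procedure above can be sketched as follows. The retention rule (keeping a fixed fraction of the most similar regions) and the parameter names are assumptions; the claim only specifies that weakly correlated regions are filtered out before the fine-grained attention.

```python
import numpy as np

def shot_segment_attention(segs, wq, wk, wv, keep=0.5):
    """Two-stage attention over the shot-segment group of one shot.

    Stage 1 (coarse): queries and keys are averaged row-wise into one
    vector per segment, and segment pairs with low similarity are dropped.
    Stage 2 (fine): scaled attention against the retained keys/values only.
    """
    n = len(segs)
    q = np.stack([s @ wq for s in segs])   # (n_segments, rows, d)
    k = np.stack([s @ wk for s in segs])
    v = np.stack([s @ wv for s in segs])

    q_avg, k_avg = q.mean(axis=1), k.mean(axis=1)   # whole-segment summaries
    coarse = q_avg @ k_avg.T                        # coarse region similarity
    n_keep = max(1, int(np.ceil(keep * n)))
    out = []
    for i in range(n):
        idx = np.argsort(coarse[i])[::-1][:n_keep]  # strongly correlated regions
        ks = k[idx].reshape(-1, k.shape[2])         # gathered key feature set
        vs = v[idx].reshape(-1, v.shape[2])         # gathered value feature set
        sim = q[i] @ ks.T / np.sqrt(ks.shape[1])    # linearly scaled similarity
        a = np.exp(sim - sim.max(axis=1, keepdims=True))
        a /= a.sum(axis=1, keepdims=True)           # region-level normalisation
        out.append(a @ vs)                          # weighted shot-segment feature
    return np.stack(out)
```

Filtering before the fine attention shrinks the key/value set, so the expensive row-by-row similarity is computed only against regions that the coarse pass judged relevant.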
Preferably, the reward function comprises a representativeness reward \(R_{\mathrm{rep}}\), a diversity reward \(R_{\mathrm{div}}\) and a regularization term \(R_{\mathrm{reg}}\). The reward function \(R\) is:

\[ R = R_{\mathrm{rep}} + R_{\mathrm{div}} + R_{\mathrm{reg}} \tag{1} \]

The representativeness reward \(R_{\mathrm{rep}}\) is given by the following equation (2):

\[ R_{\mathrm{rep}} = \exp\!\left(-\frac{1}{T}\sum_{i=1}^{T}\min_{j\in\mathcal{Y}}\left\lVert x_i - x_j\right\rVert_2\right) \tag{2} \]

In equation (2), \(T\) denotes the total number of shots, \(\mathcal{Y}\) denotes the set of shots in the summary result (its size \(\lvert\mathcal{Y}\rvert\) is the shot length of the summary), \(x_i\) denotes the spatio-temporal feature vector of the \(i\)-th shot, and \(x_j\) denotes the spatio-temporal feature vector of the \(j\)-th shot.

The diversity reward \(R_{\mathrm{div}}\) is given by the following equation (3):

\[ R_{\mathrm{div}} = \frac{1}{\lvert\mathcal{Y}\rvert\left(\lvert\mathcal{Y}\rvert-1\right)}\sum_{i\in\mathcal{Y}}\sum_{\substack{j\in\mathcal{Y}\\ j\neq i}}\left(1-\frac{x_i^{\top}x_j}{\lVert x_i\rVert\,\lVert x_j\rVert}\right) \tag{3} \]

In equation (3), \(\mathcal{Y}\) denotes the generated dynamic summary, \(x_i\) denotes the spatio-temporal feature vector of the \(i\)-th shot, and \(x_j\) denotes the spatio-temporal feature vector of the \(j\)-th shot.

The regularization term \(R_{\mathrm{reg}}\) is obtained by the following equation (4):

\[ R_{\mathrm{reg}} = -\left\lVert \frac{1}{T}\sum_{t=1}^{T} p_t - \varepsilon \right\rVert^2 \tag{4} \]

In equation (4), \(p_t\) denotes the importance score of the \(t\)-th shot and \(\varepsilon\) denotes the desired proportion of shots selected into the summary.
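A sketch of how the representativeness and diversity terms might be computed for a candidate summary. The original formula images in this record are not recoverable, so the constants below follow the standard diversity-representativeness reward formulation for unsupervised reinforcement-learning summarization and should be treated as assumptions:

```python
import numpy as np

def rewards(x, picks):
    """Representativeness and diversity rewards for a candidate summary.

    `x` holds shot features (T x d); `picks` indexes the selected shots.
    """
    y = x[picks]
    # representativeness: exp of minus the mean distance from every shot
    # to its nearest selected shot (how well the summary covers the video)
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    r_rep = float(np.exp(-np.sqrt(d2).min(axis=1).mean()))

    # diversity: mean pairwise cosine dissimilarity of the selected shots
    yn = y / np.linalg.norm(y, axis=1, keepdims=True)
    cos = yn @ yn.T
    m = len(picks)
    r_div = float((1.0 - cos)[~np.eye(m, dtype=bool)].mean()) if m > 1 else 0.0

    return r_rep, r_div
```

Selecting every shot drives the coverage term to zero, so the representativeness reward reaches its maximum of 1; the diversity term then pulls the policy back toward dissimilar shots.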
Preferably, the feature extraction module adopts an X3D deep convolutional neural network, and the extracted shot feature vector is 2048-dimensional, serving as the input spatio-temporal feature.
According to a second aspect of embodiments of the present application, there is provided a computer device comprising:
a memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the dual attention mechanism based unsupervised segmented video summarization method as claimed in any one of the above.
According to a third aspect of embodiments of the present application, there is provided a computer-readable storage medium having a computer program stored thereon; the computer program is executed by a processor to implement the dual attention mechanism based unsupervised segmented video summarization method as described in any one of the above.
By adopting the unsupervised segmented video summarization method based on the dual-attention mechanism provided by the embodiments of the present application, the original video is preprocessed and then divided into video segments and shot segments to obtain a video segment group and a shot segment group; the two groups are respectively input into the video summarization model, the importance score of each shot is calculated, and the shots with high scores, i.e., high importance, are selected to generate a dynamic summary. The method emphasizes the importance of visual content while modeling temporal relationships, strengthens the representation capability of the summary features, and improves the model's ability to understand and analyze video content.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
fig. 1 is a flowchart of an unsupervised segmented video summarization method based on a dual-attention mechanism according to the first embodiment of the present application;
fig. 2 is a system framework diagram of an unsupervised segmented video summarization method based on a dual-attention mechanism according to the first embodiment of the present application;
fig. 3 is a flowchart of an unsupervised segmented video summarization method based on a dual-attention mechanism according to the second embodiment of the present application;
fig. 4 is a network structure block diagram of an unsupervised segmented video summarization method based on a dual-attention mechanism according to the third embodiment of the present application;
fig. 5 is a flowchart of video segment group processing in the third embodiment of the present application;
fig. 6 is a flowchart of shot segment group processing in the third embodiment of the present application;
fig. 7 is a diagram of the shot segment group processing structure in the third embodiment of the present application;
fig. 8 is a graph comparing experimental results of the method provided in the present application with other methods.
Detailed Description
In order to make the technical solutions and advantages of the embodiments of the present application more apparent, exemplary embodiments of the present application are described in detail below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. It should be noted that, in the absence of conflict, the embodiments and the features in the embodiments may be combined with each other.
In the process of implementing the present application, the inventors found that current video summarization methods do not consider unifying temporal relationships and visual content, so the models are biased in their understanding of video content.
In view of the above problems, an unsupervised segmented video summarization method based on a dual-attention mechanism is provided in the first embodiment of the present application. Fig. 1 is a flowchart of the first embodiment, and fig. 2 is a system framework diagram of the first embodiment. As shown in figs. 1 and 2, the method includes:
S1, preprocessing an original video and then segmenting it to obtain a video segment group and a shot segment group;
S2, inputting the video segment group and the shot segment group into a video summarization model respectively, and obtaining weighted video segment features and weighted shot segment features after processing; the video summarization model comprises a video segment attention module and a shot segment attention module;
S3, taking the weighted video segment features and the weighted shot segment features as inputs, calculating an importance score for each shot, and selecting the shots with high scores, i.e., high importance, to generate a dynamic summary;
and S4, constructing a reward function, calculating the diversity and representativeness of the dynamic summary, and training the video summarization model by unsupervised reinforcement learning.
After the original video is preprocessed, it is segmented to obtain a video segment group and a shot segment group; the two groups are respectively input into the video summarization model, the importance score of each shot is calculated, and the shots with high scores, i.e., high importance, are selected to generate a dynamic summary. The method emphasizes the importance of visual content while modeling temporal relationships, strengthens the representation capability of the summary features, and improves the model's ability to understand and analyze video content.
Fig. 3 is a flowchart of an unsupervised segmented video summarization method based on a dual-attention mechanism according to the second embodiment of the present application. On the basis of the first embodiment, as shown in fig. 3, the preprocessing of the original video followed by segmentation into a video segment group and a shot segment group includes:
S11, inputting an original long video and sampling a video frame sequence according to the frame rate; the frame rate may differ across videos, and the sequence length is the total number of downsampled frames.
S12, inputting the video frame sequence into a feature extraction module to obtain spatio-temporal features reflecting the visual content and the temporal relationships; the feature extraction module adopts an X3D deep convolutional neural network, and each extracted shot vector is 2048-dimensional, serving as the input spatio-temporal feature;
S13, detecting visual appearance change points on the spatio-temporal features using a kernel segmentation algorithm, segmenting to obtain video segment groups, and equally dividing the spatio-temporal features of each shot into non-overlapping shot segment sets.
Specifically, the input original long video is downsampled according to the video frame rate to obtain a video frame sequence \(F=\{f_1, f_2, \dots, f_T\}\), where \(T\) denotes the total number of downsampled frames and may differ across videos. An X3D deep neural network is used as the spatio-temporal feature extractor: a fixed number of temporally consecutive frames of \(F\) is taken as one shot and input to the X3D network, yielding for each shot a spatio-temporal feature vector that represents both visual content and temporal relationships. The shot features of each video are denoted \(X=\{x_1, x_2, \dots, x_N\}\), where \(x_i\in\mathbb{R}^{D}\) and \(D\) is the shot feature dimension output by the X3D network, with value 2048. Taking the shot features \(X\) as input, the kernel segmentation algorithm detects the shots in \(X\) whose visual content changes sharply; using these shots as boundary points, the video is divided into unequal, non-overlapping video segment groups \(\{V_1, \dots, V_M\}\), where \(M\) is the number of video segments. At the same time, each shot feature is equally divided into a non-overlapping shot segment group \(\{s_1, \dots, s_K\}\), where \(K\) is the number of shot segments.
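The grouping of the downsampled frame sequence into shot-level features can be sketched as follows; the number of frames per shot and the extractor interface are placeholders, since the text elides the exact frame count and the X3D network itself is not reproduced here:

```python
import numpy as np

def frames_to_shot_features(frames, frames_per_shot, extractor):
    """Group a downsampled frame sequence into shots and extract one
    spatio-temporal feature vector per shot.

    `extractor` stands in for the X3D network: any callable mapping a
    (frames_per_shot, H, W, 3) clip to a 2048-d feature vector.
    """
    n_shots = len(frames) // frames_per_shot   # drop the trailing partial shot
    clips = [frames[i * frames_per_shot:(i + 1) * frames_per_shot]
             for i in range(n_shots)]
    return np.stack([extractor(np.asarray(c)) for c in clips])  # (n_shots, 2048)
```

The resulting (N, 2048) matrix is exactly the shot feature set that the kernel segmentation algorithm consumes in the next step.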
Fig. 4 is a network structure block diagram of the unsupervised segmented video summarization method based on the dual-attention mechanism according to a third embodiment of the present application. As shown in Fig. 4, respectively inputting the video segment group and the shot segment group into the video summarization model and obtaining the weighted video segment features and the weighted shot segment features after processing includes:
S21, inputting the video segment group into the video segment attention module, calculating the inter-shot similarity matrix within the video segment, aggregating the intra-segment features into weighted video segment features capable of characterizing short-term temporally dependent high-level semantic information, and outputting the weighted video segment features;
S22, inputting the shot segment group into the shot segment attention module, filtering out segments that are irrelevant or weakly correlated to the target segment through coarse-grained similarity calculation, and calculating the similarity matrix of the remaining shot segments after aggregation to obtain weighted shot segment features capable of enhancing locally correlated regional semantic information.
As shown in Fig. 5, the processing flow for the video segment group includes:
S211, taking the video segment group as input, and linearly mapping it with three differently weighted matrices into video segment query features, video segment key features, and video segment value features;
S212, calculating the product of the video segment query features and the video segment key features to obtain a segment-level similarity matrix, and obtaining the segment's segment-level similarity normalization matrix after linear scaling and the Softmax function;
S213, weighting the segment-level similarity matrix onto the video segment value features by matrix multiplication to obtain the weighted video segment features.
Specifically, the video segment attention module consists of several self-attention units capable of capturing intra-segment temporal correlation. Each unit takes the features of one video segment as input and computes the inter-shot similarity matrix S within the segment; one entry S_ij of the inter-shot similarity matrix is computed as:

S_ij = (x_i W_q)(x_j W_k)^T / sqrt(d)   (5)

In formula (5), W_q and W_k are parameters to be learned by the model, and d is a constant (the projection dimension used for scaling).

The inter-shot similarity matrix S is output as a normalized similarity matrix A through the softmax function; A then passes through a random-deactivation (dropout) layer and is matrix-multiplied with the segment features projected through a learnable linear mapping layer W_v, giving the weighted video segment features Y:

Y = Dropout(softmax(S)) X W_v   (6)
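Formulas (5) and (6) can be sketched as a plain NumPy self-attention unit; the projection matrices and toy dimensions below are illustrative placeholders, not parameters from the patent:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def segment_self_attention(X, Wq, Wk, Wv, drop_p=0.0, rng=None):
    """One self-attention unit over the shots of a video segment:
    S = (X Wq)(X Wk)^T / sqrt(d), Y = Dropout(softmax(S)) X Wv."""
    d = Wq.shape[1]
    S = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)   # inter-shot similarity matrix, formula (5)
    A = softmax(S, axis=-1)                  # normalized similarity matrix
    if drop_p > 0 and rng is not None:       # random-deactivation (dropout) layer
        A = A * (rng.random(A.shape) >= drop_p) / (1 - drop_p)
    return A @ (X @ Wv)                      # weighted video segment features, formula (6)

rng = np.random.default_rng(0)
n, d = 6, 32                                 # 6 shots in the segment, toy dimension
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
Y = segment_self_attention(X, Wq, Wk, Wv)
assert Y.shape == (n, d)
```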
As shown in Fig. 6, the processing of the shot segment group includes:
S221, taking the shot segment group as input, and linearly mapping it with three differently weighted matrices into shot segment query features, shot segment key features, and shot segment value features;
S222, averaging the shot segment query features and the shot segment key features by rows to obtain query mean features and key mean features that characterize the overall shot content, and matrix-multiplying the two to obtain a similarity matrix reflecting the correlation between different regions of the shot;
S223, filtering regions with no or low correlation out of the similarity matrix to obtain a strongly correlated region index set, and extracting the corresponding key feature set and value feature set from the shot segment key features and value features according to the region index set;
S224, calculating the product of the shot segment query features and the key feature set, obtaining the segment's region-level similarity normalization matrix after linear scaling and the Softmax function, and weighting the region-level similarity matrix onto the value feature set by matrix multiplication to obtain the weighted shot segment features.
Specifically, as shown in Fig. 7, the shot segment attention module consists of several dual attention units that steer attention toward the more important content. Each unit takes the shot segment group as input and performs coarse-grained shot segment screening first, followed by fine-grained weight assignment. First, the shot segment group S is mapped by matrices into the shot segment query features Q = S W_q, shot segment key features K = S W_k, and shot segment value features V = S W_v, where W_q, W_k, and W_v are all learnable mapping matrices. The query features Q and key features K are averaged by rows to obtain the query mean feature Q̄ and key mean feature K̄, which characterize the visual content at coarse granularity; matrix-multiplying Q̄ and K̄ gives a similarity matrix C covering the interaction between shot segments. For each row of C, the index values I of the top-k shot segments with the largest similarity are taken; according to I, the corresponding entries of the shot segment key features K and value features V are gathered into a key feature set K_g and a value feature set V_g containing only the highly important segments. The shot segment query features Q are matrix-multiplied with the key feature set K_g to compute the region-level similarity matrix between intra-shot segments, which is then output as a normalized similarity matrix through the softmax function; this matrix passes through a random-deactivation (dropout) layer and is matrix-multiplied with the value feature set V_g, giving the weighted shot segment features Z.
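A minimal sketch of one dual attention unit, assuming a (segments × regions × dim) layout for the shot segment group and a per-row top-k screening rule; this is an illustrative reconstruction of the coarse-to-fine flow described above, not the patent's exact implementation:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def dual_attention_unit(S, Wq, Wk, Wv, k=2):
    """Coarse stage: mean-pooled query/key features select the top-k most
    related segments; fine stage: region-level attention over the kept set."""
    Q, K, V = S @ Wq, S @ Wk, S @ Wv                 # segment query/key/value features
    q_mean, k_mean = Q.mean(axis=1), K.mean(axis=1)  # row means characterize each segment coarsely
    C = q_mean @ k_mean.T                            # coarse inter-segment similarity matrix
    idx = np.argsort(-C, axis=1)[:, :k]              # top-k most similar segment indices per row
    out = np.empty_like(Q)
    for i in range(S.shape[0]):
        Kg = K[idx[i]].reshape(-1, K.shape[-1])      # gathered key set (high-importance only)
        Vg = V[idx[i]].reshape(-1, V.shape[-1])      # gathered value set
        A = softmax(Q[i] @ Kg.T / np.sqrt(K.shape[-1]))  # region-level similarity, normalized
        out[i] = A @ Vg                              # weighted shot segment features
    return out

rng = np.random.default_rng(0)
n_seg, m, d = 4, 5, 16                               # toy sizes: 4 segments of 5 regions each
S = rng.normal(size=(n_seg, m, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
Z = dual_attention_unit(S, Wq, Wk, Wv, k=2)
assert Z.shape == (n_seg, m, d)
```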
The weighted video segment features, which reflect high-level temporal relations, and the weighted shot segment features, which reflect high-level visual content, are fused and mapped into an importance score p_i for each shot, where i = 1, ..., N; the shots with high scores, i.e., high importance, are selected to generate the dynamic summary.
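The selection step can be sketched greedily; note the patent only states that high-score shots are chosen, so the length budget and the greedy rule below are common conventions in video summarization, not details from this document:

```python
import numpy as np

def select_shots(scores, shot_lengths, budget_ratio=0.15):
    """Pick shots in descending score order until the summary reaches a fixed
    fraction of the video length (budget_ratio is an assumed convention)."""
    budget = budget_ratio * sum(shot_lengths)
    chosen, used = [], 0
    for i in np.argsort(-np.asarray(scores)):        # highest score first
        if used + shot_lengths[i] <= budget:
            chosen.append(int(i))
            used += shot_lengths[i]
    return sorted(chosen)

# Five shots of equal length, 40% budget: the two best-scoring shots fit.
summary = select_shots([0.9, 0.1, 0.8, 0.3, 0.7], [10, 10, 10, 10, 10], 0.4)
assert summary == [0, 2]
```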
The present application introduces a dual attention design consisting of a video segment attention module and a shot segment attention module: irrelevant regions are first filtered out through coarse-grained mean similarity to prevent the attention weights from being dispersed, after which a content-guided weight assignment is completed through fine-grained region-level similarity, yielding weighted shot segment features that represent the important visual content. In modeling the temporal relations, unequal segmentation is adopted, preventing erroneous temporal dependencies from being captured due to damaging the integrity of video segments. In obtaining the high-level visual semantic information, a dual-attention mechanism is adopted, realizing an importance-guided weight assignment process.
The reward function includes a representativeness reward R_rep, a diversity reward R_div, and a regularization term R_reg; the reward function R is:

R = R_rep + R_div + R_reg   (1)

The representativeness reward R_rep is given by formula (2):

R_rep = exp( -(1/T) Σ_{i=1}^{T} min_{j∈Y} ||x_i - x_j||_2 )   (2)

In formula (2), T denotes the shot length of the summary result, Y denotes the set of shots selected into the dynamic summary, x_i denotes the spatio-temporal feature vector of the i-th shot, and x_j denotes the spatio-temporal feature vector of the j-th shot;

The diversity reward R_div is given by formula (3):

R_div = (1/(|Y|(|Y|-1))) Σ_{i∈Y} Σ_{j∈Y, j≠i} ( 1 - (x_i · x_j)/(||x_i||_2 ||x_j||_2) )   (3)

In formula (3), Y represents the generated dynamic summary, x_i denotes the spatio-temporal feature vector of the i-th shot, and x_j denotes the spatio-temporal feature vector of the j-th shot;

The regularization term R_reg is obtained by formula (4):

R_reg = -( (1/N) Σ_{i=1}^{N} p_i - ε )²   (4)

In formula (4), p_i denotes the importance score of the i-th shot, N denotes the number of shots, and ε is a preset selection-ratio constant.
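Under the assumption that formulas (1)-(4) follow the standard diversity-representativeness reward design that the variable descriptions match, the rewards can be sketched as follows; signs and normalizations are assumptions where the document is ambiguous:

```python
import numpy as np

def rewards(X, selected, probs, eps=0.15):
    """X: (T, d) shot features; selected: indices of summary shots;
    probs: per-shot importance scores; eps: assumed selection-ratio constant."""
    Y = np.asarray(selected)
    # (2) representativeness: how well the selected shots cover all shots
    dists = np.linalg.norm(X[:, None, :] - X[None, Y, :], axis=-1).min(axis=1)
    r_rep = float(np.exp(-dists.mean()))
    # (3) diversity: mean pairwise cosine dissimilarity among selected shots
    Xn = X[Y] / np.linalg.norm(X[Y], axis=1, keepdims=True)
    cos = Xn @ Xn.T
    n = len(Y)
    r_div = float((1.0 - cos)[~np.eye(n, dtype=bool)].sum() / (n * (n - 1)))
    # (4) regularization: keep the mean importance score near the ratio eps
    r_reg = -float((np.mean(probs) - eps) ** 2)
    return r_rep + r_div + r_reg, (r_rep, r_div, r_reg)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 16))
total, parts = rewards(X, [0, 3, 6], rng.random(8))
assert np.isfinite(total) and 0 < parts[0] <= 1
```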
To verify the effectiveness of the invention, experiments were performed on two video summarization benchmark datasets, SumMe and TVSum, and two augmentation datasets, YouTube and OVP, evaluated under three settings: canonical (C), augmented (A), and transfer (T). In the canonical setting, the given dataset is randomly divided into five parts; 80% of the data is used for training and the remaining 20% for testing. In the augmented setting, 80% of the given dataset together with the other three datasets is used for training, while the remaining 20% is used for testing. In the transfer setting, three datasets are used for training and the remaining one for testing. In all settings, the model is evaluated with the F-score; each experiment is run five times and the average of the five runs is taken as the final result. As shown in Fig. 8, compared with other state-of-the-art methods, the present application achieves the best performance in the canonical and augmented settings on SumMe, achieves highly competitive results in the canonical and augmented settings on TVSum, and outperforms most models in the transfer setting on both datasets. In summary, the proposed method effectively models temporal relations and visual content to improve model performance.
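The canonical-setting protocol above (random five-way split, train on 80%, test on 20%, F-score averaged over five runs) can be sketched as follows; `eval_fn` is a hypothetical stand-in for training and testing the summarizer on one split:

```python
import numpy as np

def five_fold_eval(video_ids, eval_fn, rng):
    """Shuffle the dataset, split it into five parts, and for each fold train on
    the other four (80%) and test on the held-out part (20%); return the mean
    F-score over the five runs."""
    ids = np.array(video_ids)
    rng.shuffle(ids)
    folds = np.array_split(ids, 5)
    scores = []
    for i in range(5):
        test = folds[i]
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        scores.append(eval_fn(train, test))
    return float(np.mean(scores))

rng = np.random.default_rng(0)
# Dummy eval_fn returning a constant F-score, just to exercise the protocol.
f = five_fold_eval(list(range(25)), lambda tr, te: 0.5, rng)
assert abs(f - 0.5) < 1e-9
```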
A computer device, comprising:
a memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the dual attention mechanism based unsupervised segmented video summarization method as described above.
A computer-readable storage medium having a computer program stored thereon; the computer program is executed by a processor to implement the unsupervised segmented video summarization method based on the dual-attention mechanism as described above.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The schemes in the embodiments of the present application may be implemented in various computer languages, for example, C language, VHDL language, verilog language, object-oriented programming language Java, and transliteration scripting language JavaScript, etc.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present application without departing from the spirit or scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.
Claims (7)
1. An unsupervised segmented video summarization method based on a dual-attention mechanism, characterized by comprising the following steps:
preprocessing an original video and then segmenting it to obtain a video segment group and a shot segment group;
respectively inputting the video segment group and the shot segment group into a video summarization model, and obtaining weighted video segment features and weighted shot segment features after processing; the video summarization model comprises a video segment attention module and a shot segment attention module;
taking the weighted video segment features and the weighted shot segment features as inputs, calculating an importance score for each shot, and selecting the shots with high scores, i.e., high importance, to generate a dynamic summary;
constructing a reward function, calculating the diversity and representativeness of the dynamic summary, and training the video summarization model in an unsupervised reinforcement-learning manner;
the respectively inputting the video segment group and the shot segment group into the video summarization model and obtaining weighted video segment features and weighted shot segment features after processing comprises:
inputting the video segment group into the video segment attention module, calculating the inter-shot similarity matrix within the video segment, aggregating the intra-segment features into weighted video segment features capable of characterizing short-term temporally dependent high-level semantic information, and outputting the weighted video segment features;
inputting the shot segment group into the shot segment attention module, filtering out segments that are irrelevant or weakly correlated to the target segment through coarse-grained similarity calculation, and calculating the similarity matrix of the remaining shot segments after aggregation to obtain weighted shot segment features capable of enhancing locally correlated regional semantic information; specifically comprising: taking the shot segment group as input, and linearly mapping it with three differently weighted matrices into shot segment query features, shot segment key features, and shot segment value features; averaging the shot segment query features and key features by rows to obtain query mean features and key mean features that characterize the overall shot content, and matrix-multiplying the two to obtain a similarity matrix reflecting the correlation between different regions of the shot; filtering regions with no or low correlation out of the similarity matrix to obtain a strongly correlated region index set, and extracting the corresponding key feature set and value feature set from the shot segment key and value features according to the region index set; calculating the product of the shot segment query features and the key feature set, obtaining the segment's region-level similarity normalization matrix after linear scaling and the Softmax function, and weighting the region-level similarity matrix onto the value feature set by matrix multiplication to obtain the weighted shot segment features;
the shot segment group S is taken as input, and the shot segment query features Q = S W_q, shot segment key features K = S W_k, and shot segment value features V = S W_v are obtained through matrix mapping, where W_q, W_k, and W_v are all learnable mapping matrices; the query features Q and key features K are averaged by rows to obtain the query mean feature Q̄ and key mean feature K̄ that characterize the visual content at coarse granularity; Q̄ and K̄ are matrix-multiplied to preliminarily obtain a similarity matrix C covering the interaction between shot segments; for each row of C, the index values I of the top-k shot segments with the largest similarity are taken; according to I, the corresponding entries of the shot segment key features K and value features V are gathered into a key feature set K_g and a value feature set V_g containing only the highly important segments; the query features Q and the key feature set K_g are matrix-multiplied to compute the region-level similarity matrix between intra-shot segments, which is then output as a normalized similarity matrix through the softmax function; this matrix passes through a random-deactivation (dropout) layer and is matrix-multiplied with the value feature set V_g, giving the weighted shot segment features Z.
2. The dual-attention-mechanism-based unsupervised segmented video summarization method according to claim 1, wherein preprocessing the original video and then segmenting it to obtain a video segment group and a shot segment group comprises:
inputting an original long video, sampling and extracting a video frame sequence according to a frame rate;
inputting the video frame sequence into a feature extraction module for extraction to obtain space-time features capable of reflecting the visual content and the time sequence relationship;
and detecting visual appearance change points on the space-time characteristics by using a kernel segmentation algorithm, segmenting to obtain video segment groups, and equally dividing the space-time characteristics of each shot into non-overlapping shot segment sets.
3. The dual-attention-mechanism-based unsupervised segmented video summarization method according to claim 1, wherein inputting the video segment group into the video segment attention module, calculating the inter-shot similarity matrix within the video segment, aggregating the intra-segment features into weighted video segment features capable of characterizing short-term temporally dependent high-level semantic information, and outputting the weighted video segment features specifically comprises:
taking the video segment group as input, and linearly mapping the video segment group into video segment query features, video segment key features and video segment value features by using matrixes with three different weights;
calculating the product of the video segment query feature and the video segment key feature to obtain a segment level similarity matrix, and obtaining a segment level similarity normalization matrix of the segment after linear scaling and Softmax functions;
and weighting the segment-level similarity matrix to the video segment value characteristic by matrix multiplication to obtain the weighted video segment characteristic.
4. The dual-attention-mechanism-based unsupervised segmented video summarization method according to claim 1, wherein the reward function comprises a representativeness reward R_rep, a diversity reward R_div, and a regularization term R_reg; the reward function R is:

R = R_rep + R_div + R_reg   (1)

The representativeness reward R_rep is given by formula (2):

R_rep = exp( -(1/T) Σ_{i=1}^{T} min_{j∈Y} ||x_i - x_j||_2 )   (2)

In formula (2), T denotes the shot length of the summary result, Y denotes the set of shots selected into the dynamic summary, x_i denotes the spatio-temporal feature vector of the i-th shot, and x_j denotes the spatio-temporal feature vector of the j-th shot;

The diversity reward R_div is given by formula (3):

R_div = (1/(|Y|(|Y|-1))) Σ_{i∈Y} Σ_{j∈Y, j≠i} ( 1 - (x_i · x_j)/(||x_i||_2 ||x_j||_2) )   (3)

In formula (3), Y represents the generated dynamic summary, x_i denotes the spatio-temporal feature vector of the i-th shot, and x_j denotes the spatio-temporal feature vector of the j-th shot;

The regularization term R_reg is obtained by formula (4):

R_reg = -( (1/N) Σ_{i=1}^{N} p_i - ε )²   (4)

In formula (4), p_i denotes the importance score of the i-th shot, N denotes the number of shots, and ε is a preset selection-ratio constant.
5. The dual-attention-mechanism-based unsupervised segmented video summarization method according to claim 2, wherein the feature extraction module adopts an X3D deep convolutional neural network, and the extracted shot vector is 2048-dimensional, serving as the input spatio-temporal feature.
6. A computer device, comprising:
a memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the dual attention mechanism based unsupervised segmented video summarization method of any one of claims 1 to 5.
7. A computer-readable storage medium, characterized in that a computer program is stored thereon; the computer program is executed by a processor to implement the dual attention mechanism based unsupervised segmented video summarization method of any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311598370.5A CN117312603B (en) | 2023-11-28 | 2023-11-28 | Unsupervised segmentation video abstraction method based on double-attention mechanism |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117312603A CN117312603A (en) | 2023-12-29 |
CN117312603B true CN117312603B (en) | 2024-03-01 |
Family
ID=89281414
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100284670A1 (en) * | 2008-06-30 | 2010-11-11 | Tencent Technology (Shenzhen) Company Ltd. | Method, system, and apparatus for extracting video abstract |
CN115002559A (en) * | 2022-05-10 | 2022-09-02 | 上海大学 | Video abstraction algorithm and system based on gated multi-head position attention mechanism |
CN116662604A (en) * | 2023-06-26 | 2023-08-29 | 浙江千从科技有限公司 | Video abstraction method based on layered Transformer |
Non-Patent Citations (1)
Title |
---|
Research on Dynamic Video Summarization Methods Based on Self-Attention Networks; Yao Huimin; China Master's Theses Full-text Database, Information Science and Technology (No. 01); pp. 1-75 *
Legal Events
Date | Code | Title | Description
---|---|---|---
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||