CN117312603B - Unsupervised segmentation video abstraction method based on double-attention mechanism - Google Patents

Unsupervised segmentation video abstraction method based on double-attention mechanism

Info

Publication number
CN117312603B
Authority
CN
China
Prior art keywords
segment
video
lens
feature
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311598370.5A
Other languages
Chinese (zh)
Other versions
CN117312603A (en)
Inventor
单晓冬
梁梦男
徐恩格
蒋鹏飞
鲍复劼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou International Science Park Data Center Co ltd
Original Assignee
Suzhou International Science Park Data Center Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou International Science Park Data Center Co ltd filed Critical Suzhou International Science Park Data Center Co ltd
Priority to CN202311598370.5A priority Critical patent/CN117312603B/en
Publication of CN117312603A publication Critical patent/CN117312603A/en
Application granted granted Critical
Publication of CN117312603B publication Critical patent/CN117312603B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/738Presentation of query results
    • G06F16/739Presentation of query results in form of a video summary, e.g. the video summary being a video sequence, a composite still image or having synthesized frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/71Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7847Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/088Non-supervised learning, e.g. competitive learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V20/47Detecting features for summarising video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/49Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Library & Information Science (AREA)
  • Medical Informatics (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Image Analysis (AREA)

Abstract

An embodiment of the present application provides an unsupervised segmented video summarization method based on a dual-attention mechanism, comprising the following steps: after preprocessing an original video, segmenting it into video segments and lens segments to obtain a video segment group and a lens segment group; respectively inputting the video segment group and the lens segment group into a video abstract model and processing them to obtain weighted video segment features and weighted lens segment features; taking the weighted video segment features and the weighted lens segment features as input, calculating an importance score for each lens, and selecting the lenses with high scores, i.e., high importance, to generate a dynamic abstract; constructing a reward function, calculating the diversity and representativeness of the dynamic abstract, and training the video abstract model in an unsupervised reinforcement-learning manner. The method emphasizes the importance of visual content on the basis of modeling the temporal relationship, strengthens the representational capability of the video summary features, and improves the model's ability to understand and analyze the video content.

Description

Unsupervised segmentation video abstraction method based on double-attention mechanism
Technical Field
The application belongs to the field of computer vision, and particularly relates to an unsupervised segmentation video abstraction method based on a double-attention mechanism.
Background
In recent years, with the popularity of video-sharing platforms such as Douyin (TikTok), iQIYI and Kuaishou, uploading, downloading and sharing videos anytime and anywhere has become the norm. While these platforms satisfy people's cultural and recreational needs, the accompanying multimedia data such as audio and video have grown explosively. How to process and manage such huge and complicated video data has therefore become an urgent problem. Video summarization is a technique by which a computer automatically extracts important frames or video clips from an original long video; it preserves the original video content to the greatest extent while shortening the total duration, facilitating efficient subsequent storage and browsing, and has thus gradually attracted extensive attention from researchers in the field of computer vision.
Modeling the temporal relationships of video sequences is a challenge in video summarization tasks; at the same time, accurately extracting features that effectively characterize the entire video is also very important. Existing video summarization methods can be divided into two categories. The first category takes static image features as input and uses temporal feature aggregation to analyze and capture temporal interaction relationships. For example, Ji et al. adopt GoogLeNet as the extraction network for visual features of video frames, use a long short-term memory network as the encoder to model long-range dependencies of the video frame sequence, and use an attention mechanism to enhance long-term dependencies between frames; Li et al. also use image features extracted by GoogLeNet as input, use a self-attention mechanism to model the similarity between all pairs of frames, and thereby capture global relationships over the entire video. Although these methods take the temporal relationship into account, they merely simulate the front-to-back relationship of a set of still images, do not consider the true latent temporal relationship of a sequence of consecutive frames, and the still-image features extracted by 2D convolution lack correlation between consecutive frames. To address these problems, the second category takes dynamic video features covering fine-grained temporal information as input. For example, Lin et al. propose 3D ResNeXt-101 as the extraction network for dynamic video features and design a hierarchical long short-term memory network that obtains long-term dependencies of the video in parallel with a sequential annotation mechanism; Liu et al. study which of the ST3D and I3D networks is better suited as the feature extraction network for video summarization, and propose using 3DST-UNet to explore context information and map spatio-temporal features into a latent space that can encode spatio-temporal dependencies. However, these methods overemphasize the temporal relationship and neglect the visual content, so the models are biased in their understanding of visual content. Therefore, a video summarization method that can fully and accurately understand visual content while completing temporal-correlation modeling is needed.
Disclosure of Invention
In order to solve one of the technical defects, an unsupervised segmented video summarization method based on a dual-attention mechanism is provided in an embodiment of the present application.
According to a first aspect of an embodiment of the present application, there is provided an unsupervised segmented video summarization method based on a dual-attention mechanism, including:
preprocessing an original video, and then dividing the original video to obtain a video fragment group and a lens fragment group;
respectively inputting the video segment group and the lens segment group into a video abstract model, and obtaining the video segment characteristics with weight and the lens segment characteristics with weight after processing;
taking the video segment characteristics with weight and the lens segment characteristics with weight as inputs, calculating the importance score of each lens, and selecting the lens with high score, namely high importance degree, to generate a dynamic abstract;
and constructing a reward function, calculating the diversity and representativeness of the dynamic abstract, and training the video abstract model in an unsupervised reinforcement-learning manner.
Preferably, after the preprocessing is performed on the original video, a video segment group and a lens segment group are obtained by segmentation, including:
inputting an original long video, sampling and extracting a video frame sequence according to a frame rate;
inputting the video frame sequence into a feature extraction module for extraction to obtain space-time features capable of reflecting the visual content and the time sequence relationship;
and detecting visual appearance change points on the space-time characteristics by using a kernel segmentation algorithm, segmenting to obtain video segment groups, and equally dividing the space-time characteristics of each shot into non-overlapping shot segment sets.
Preferably, the video summary model comprises a video segment attention module and a lens segment attention module.
Preferably, the inputting the video segment group and the lens segment group into the video abstract model respectively, and obtaining the weighted video segment feature and the weighted lens segment feature after processing includes:
inputting the video segment group into a video segment attention module, calculating a similarity matrix between shots in the video segment, aggregating the features in the segment into weighted video segment features capable of representing short-term time sequence dependency high-level semantic information, and outputting the weighted video segment features;
and inputting the lens segment group into a lens segment attention module, filtering out the segments which are irrelevant to the target segment or have lower correlation degree through coarse granularity similarity calculation, and calculating a similarity matrix of the residual segments after the residual lens segments in the lens are aggregated to obtain the weighted lens segment characteristics capable of enhancing the local region correlation semantic information.
Preferably, the inputting the video segment group into the video segment attention module, calculating a similarity matrix between shots in the video segment, aggregating features in the segment into weighted video segment features capable of representing short-term time sequence dependency advanced semantic information, and outputting the weighted video segment features, which specifically includes:
taking the video segment group as input, and linearly mapping the video segment group into video segment query features, video segment key features and video segment value features by using matrixes with three different weights;
calculating the product of the video segment query feature and the video segment key feature to obtain a segment level similarity matrix, and obtaining a segment level similarity normalization matrix of the segment after linear scaling and Softmax functions;
and weighting the segment-level similarity matrix to the video segment value characteristic by matrix multiplication to obtain the weighted video segment characteristic.
Preferably, the lens segment group is input into a lens segment attention module, the segments irrelevant to the target segment or having lower correlation degree are filtered through coarse granularity similarity calculation, the similarity matrix of the remaining segments is calculated after the remaining lens segments in the lens are aggregated, and the weighted lens segment characteristics capable of enhancing local region correlation semantic information are obtained, and the method specifically comprises the following steps:
taking a lens segment group as input, and linearly mapping the lens segment group into lens segment query characteristics, lens segment key characteristics and lens segment value characteristics by using matrixes with three different weights;
averaging the lens fragment inquiry features and the lens fragment key features according to rows to obtain inquiry average features and key average features which can represent the whole lens content, and multiplying the inquiry average features and the key average features by a matrix to obtain a similarity matrix which can reflect the correlation between different areas of the lens;
filtering out areas with no correlation or low correlation from the similarity matrix to obtain a strongly correlated area index set, and taking out corresponding key feature sets and value feature sets from the lens segment key features and the lens segment value features according to the area index set;
and calculating the product of the lens segment query feature and the key feature set, obtaining a regional level similarity normalization matrix of the segment after linear scaling and Softmax functions, and weighting the regional level similarity matrix to the value feature set by matrix multiplication to obtain the weighted lens segment feature.
Preferably, the reward function comprises a representativeness reward R_rep, a diversity reward R_div and a regularization term R_reg, and the overall reward function R is given by Equation (1).
The representativeness reward R_rep is given by Equation (2), in which T denotes the number of shots in the summary result, x_i denotes the spatio-temporal feature vector of the i-th shot, and x_j denotes the spatio-temporal feature vector of the j-th shot.
The diversity reward R_div is given by Equation (3), in which Y denotes the generated dynamic summary, x_i denotes the spatio-temporal feature vector of the i-th shot, and x_j denotes the spatio-temporal feature vector of the j-th shot.
The regularization term R_reg is given by Equation (4), in which p_i denotes the importance score of the i-th shot.
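Because Equations (1)-(4) appear only as images in the published text, the following non-limiting LaTeX sketch gives one plausible instantiation, written in the standard diversity-representativeness reward form used for unsupervised reinforcement-learning video summarization; the additive combination, the exponential and cosine forms, and the target mean score ε are assumptions consistent with the variable descriptions above, not the patent's verbatim formulas.

    % Hedged reconstruction (assumption), cf. Equations (1)-(4).
    % T: total number of shots; Y: set of selected summary shots;
    % x_i: spatio-temporal feature of shot i; p_i: predicted importance score.
    \begin{align*}
      R &= R_{\mathrm{rep}} + R_{\mathrm{div}} + R_{\mathrm{reg}} \\
      R_{\mathrm{rep}} &= \exp\Big(-\frac{1}{T}\sum_{i=1}^{T}\min_{j\in Y}\lVert x_i - x_j\rVert_2\Big) \\
      R_{\mathrm{div}} &= \frac{1}{|Y|\,(|Y|-1)}\sum_{i\in Y}\sum_{j\in Y,\, j\neq i}\Big(1 - \frac{x_i^{\top}x_j}{\lVert x_i\rVert_2\,\lVert x_j\rVert_2}\Big) \\
      R_{\mathrm{reg}} &= -\Big(\frac{1}{T}\sum_{i=1}^{T}p_i - \varepsilon\Big)^{2}, \qquad \varepsilon \approx 0.5
    \end{align*}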
Preferably, the feature extraction module adopts an X3D deep convolutional neural network, and the extracted 2048-dimensional lens vectors serve as the input spatio-temporal features.
According to a second aspect of embodiments of the present application, there is provided a computer device comprising:
a memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the dual attention mechanism based unsupervised segmented video summarization method as claimed in any one of the above.
According to a third aspect of embodiments of the present application, there is provided a computer-readable storage medium having a computer program stored thereon; the computer program is executed by a processor to implement the dual attention mechanism based unsupervised segmented video summarization method as described in any one of the above.
By adopting the unsupervised segmented video summarization method based on the dual-attention mechanism provided in the embodiments of the present application, the original video is preprocessed and then segmented into video segments and lens segments to obtain a video segment group and a lens segment group; the video segment group and the lens segment group are respectively input into the video abstract model, an importance score is calculated for each lens, and the lenses with high scores, i.e., high importance, are selected to generate a dynamic abstract. The method emphasizes the importance of visual content on the basis of modeling the temporal relationship, strengthens the representational capability of the video summary features, and improves the model's ability to understand and analyze the video content.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
fig. 1 is a flowchart of an unsupervised segmentation video abstraction method based on a dual-attention mechanism according to an embodiment of the present application;
fig. 2 is a system frame diagram of an unsupervised segmentation video abstraction method based on a dual-attention mechanism according to an embodiment of the present application;
fig. 3 is a flow chart of an unsupervised segmentation video abstraction method based on a dual-attention mechanism according to a second embodiment of the present application;
fig. 4 is a network structure block diagram of an unsupervised segmentation video abstraction method based on a dual-attention mechanism according to the third embodiment of the present application;
fig. 5 is a flowchart of a video clip group processing in a third embodiment of the present application;
FIG. 6 is a flowchart illustrating a lens segment group processing procedure in a third embodiment of the present application;
fig. 7 is a diagram illustrating a lens segment group processing structure in a third embodiment of the present application;
FIG. 8 is a graph showing experimental results of the method provided in the present application compared with other methods.
Detailed Description
In order to make the technical solutions and advantages of the embodiments of the present application more apparent, the following detailed description of exemplary embodiments of the present application is given with reference to the accompanying drawings, and it is apparent that the described embodiments are only some of the embodiments of the present application and not exhaustive of all the embodiments. It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other.
In the course of implementing the present application, the inventors found that current video summarization methods do not consider unifying the temporal relationship and the visual content, so the models are biased in their understanding of the video content.
In view of the above problems, a method for unsupervised segmentation video summarization based on a dual-attention mechanism is provided in a first embodiment of the present application, fig. 1 is a schematic flow chart of the first embodiment, and fig. 2 is a system frame diagram of the first embodiment, as shown in fig. 1 and 2, where the method includes:
s1, preprocessing an original video, and then dividing the original video to obtain a video fragment group and a lens fragment group;
s2, respectively inputting the video segment group and the lens segment group into a video abstract model, and obtaining the video segment characteristics with weight and the lens segment characteristics with weight after processing; the video abstract model comprises a video segment attention module and a lens segment attention module;
s3, taking the video segment features with the weight and the lens segment features with the weight as inputs, calculating the importance score of each lens, and selecting the lens with high score, namely the lens with high importance degree, so as to generate a dynamic abstract;
and S4, constructing a reward function, calculating the diversity and representativeness of the dynamic abstract, and training the video abstract model in an unsupervised reinforcement-learning manner.
After the original video is preprocessed, it is segmented to obtain a video segment group and a lens segment group; the video segment group and the lens segment group are respectively input into the video abstract model, an importance score is calculated for each lens, and the lenses with high scores, i.e., high importance, are selected to generate a dynamic abstract. The method emphasizes the importance of visual content on the basis of modeling the temporal relationship, strengthens the representational capability of the video summary features, and improves the model's ability to understand and analyze the video content.
Fig. 3 is a flow chart of an unsupervised segmentation video abstraction method based on a dual-attention mechanism according to a second embodiment of the present application, on the basis of the first embodiment, as shown in fig. 3, after preprocessing an original video, segmenting the original video to obtain a video segment group and a lens segment group, including:
s11, inputting an original long video and sampling it according to the frame rate to extract a video frame sequence; the frame rate may differ between videos, and the length of the sequence is the total number of downsampled frames.
S12, inputting the video frame sequence into a feature extraction module for extraction to obtain space-time features capable of reflecting the visual content and the time sequence relationship; the feature extraction module adopts an X3D deep convolutional neural network, and the extracted lens vector is 2048D and is used as an input space-time feature;
s13, detecting visual appearance change points on the spatio-temporal features by using a kernel segmentation algorithm, segmenting at these points to obtain the video segment group, and equally dividing the spatio-temporal features of each shot into a non-overlapping set of lens segments.
Specifically, the input original long video is downsampled according to the video frame rate to obtain a video frame sequence F = {f_1, f_2, ..., f_T}, where T denotes the total number of downsampled frames and the frame rate may differ between videos. Using the X3D deep neural network as the spatio-temporal feature extractor, the video frame sequence F is fed to the network shot by shot, each shot consisting of a fixed number of temporally consecutive frames, and the network outputs for each shot a spatio-temporal feature vector that can characterize both the visual content and the temporal relationship. The shot features of each video are denoted X = {x_1, x_2, ..., x_N}, where x_n ∈ R^D and D is the shot feature dimension output by the X3D deep neural network, with a value of 2048. Taking the shot features X of each video as input, the kernel segmentation algorithm detects the shots in X whose visual content changes drastically and uses them as boundary points between video segments, dividing the video into an unequal, non-overlapping video segment group S = {s_1, s_2, ..., s_M}, where M is the number of video segments; at the same time, each shot feature is equally divided into a non-overlapping lens segment group B = {b_1, b_2, ..., b_K}, where K is the number of lens segments.
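As a non-limiting illustration of this grouping step, the following Python sketch arranges precomputed X3D shot features into the two groups used by the model; the helper name, the NumPy layout, the externally supplied change points, and the assumption that the 2048-dimensional feature divides evenly by K are all illustrative choices, not specifics fixed by the text.

    import numpy as np

    def group_shot_features(shot_feats, change_points, n_lens_segments):
        """Arrange X3D shot features into a video segment group and a lens segment group.

        shot_feats      : (N, D) array, one 2048-dimensional spatio-temporal vector per shot.
        change_points   : sorted shot indices where the visual content changes sharply,
                          e.g. produced by a kernel temporal segmentation step (assumed helper).
        n_lens_segments : K, the number of equal, non-overlapping lens segments per shot;
                          D must be divisible by K for this simple layout.
        """
        # Unequal, non-overlapping video segment group: split the shot sequence at change points.
        video_segments = np.split(shot_feats, change_points)          # list of (n_m, D) arrays

        # Non-overlapping lens segment group: equally divide every shot feature into K chunks.
        n_shots, dim = shot_feats.shape
        lens_segments = shot_feats.reshape(n_shots, n_lens_segments, dim // n_lens_segments)

        return video_segments, lens_segments

With N shots, the sketch returns M variable-length video segments and an (N, K, D/K) tensor of lens segments, matching the two inputs of the video abstract model under this reading of the text.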
Fig. 4 is a network structure block diagram of an unsupervised segmentation video abstraction method based on a dual-attention mechanism according to a third embodiment of the present application, as shown in fig. 4, the video segment group and the shot segment group are respectively input into a video abstraction model, and the video segment feature with weight and the shot segment feature with weight are obtained after processing, including:
s21, inputting the video segment group into a video segment attention module, calculating a similarity matrix between shots in the video segment, aggregating features in the segment into weighted video segment features capable of representing short-term time sequence dependency high-level semantic information, and outputting the weighted video segment features;
s22, inputting the lens segment group into a lens segment attention module, filtering out segments which are irrelevant to the target segment or have low correlation degree through coarse granularity similarity calculation, and calculating a residual segment similarity matrix after the residual lens segments in the lens are aggregated to obtain weighted lens segment characteristics capable of enhancing local region correlation semantic information.
As shown in fig. 5, the processing flow for the video clip group includes:
s211, taking the video segment group as input, and linearly mapping the video segment group into a video segment query feature, a video segment key feature and a video segment value feature by using matrixes of three different weights;
s212, calculating the product of the video segment query feature and the video segment key feature to obtain a segment level similarity matrix, and obtaining a segment level similarity normalization matrix of the segment after linear scaling and Softmax functions;
s213, weighting the segment level similarity matrix to the video segment value characteristic by matrix multiplication to obtain the weighted video segment characteristic.
Specifically, the video segment attention module is composed of a plurality of self-attention units that capture the temporal correlation within a segment. Each unit takes the features of one video segment as input and calculates the inter-shot similarity matrix A of that segment; one entry a_ij of the inter-shot similarity matrix is computed as follows:
a_ij = (x_i W_q)(x_j W_k)^T / d    (5)
In Equation (5), W_q and W_k are parameters to be learned by the model, and d is a constant scaling factor.
The inter-shot similarity matrix A is normalized into the similarity matrix A' by the softmax function; after a random-inactivation (dropout) layer, A' is matrix-multiplied with the segment features mapped through a learnable linear layer W_v to obtain the weighted video segment features Y:
Y = Dropout(A')(X_m W_v)    (6)
where X_m denotes the shot features of the m-th video segment.
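A non-limiting PyTorch sketch of one such self-attention unit, consistent with Equations (5) and (6), is given below; the square-root scaling constant, the dropout rate and the parameter names are assumptions rather than details fixed by the text.

    import torch
    import torch.nn as nn

    class VideoSegmentAttention(nn.Module):
        """One intra-segment self-attention unit (cf. Equations (5) and (6))."""

        def __init__(self, dim, dropout=0.1):
            super().__init__()
            self.w_q = nn.Linear(dim, dim, bias=False)   # W_q
            self.w_k = nn.Linear(dim, dim, bias=False)   # W_k
            self.w_v = nn.Linear(dim, dim, bias=False)   # learnable value mapping
            self.drop = nn.Dropout(dropout)
            self.scale = dim ** 0.5                      # constant d (assumed sqrt(dim))

        def forward(self, seg):                          # seg: (n_shots, dim), one video segment
            q, k, v = self.w_q(seg), self.w_k(seg), self.w_v(seg)
            sim = q @ k.transpose(0, 1) / self.scale     # inter-shot similarity matrix A, cf. Eq. (5)
            attn = self.drop(torch.softmax(sim, dim=-1)) # normalization plus random inactivation
            return attn @ v                              # weighted video segment features, cf. Eq. (6)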
as shown in fig. 6, the processing of the lens segment group includes:
s221, taking a lens segment group as input, and linearly mapping a matrix of three different weights into a lens segment query feature, a lens segment key feature and a lens segment value feature;
s222, averaging the lens fragment inquiry characteristics and the lens fragment key characteristics according to rows to obtain inquiry average characteristics and key average characteristics which can represent the whole lens content, and multiplying the inquiry average characteristics and the key average characteristics by a matrix to obtain a similarity matrix which can reflect the correlation between different areas of the lens;
s223, filtering out areas without correlation or with low correlation from the similarity matrix to obtain a strongly correlated area index set, and taking out the corresponding key feature set and value feature set from the lens segment key features and the lens segment value features according to the area index set;
s224, calculating the product of the lens segment query feature and the key feature set, obtaining a regional level similarity normalization matrix of the segment after linear scaling and Softmax functions, and weighting the regional level similarity matrix to the value feature set by matrix multiplication to obtain the weighted lens segment feature.
Specifically, as shown in fig. 7, the lens segment attention module is composed of a plurality of dual-attention units that focus the model on the more important content. Each unit takes a lens segment group as input and first performs coarse-grained lens segment screening, followed by fine-grained weight assignment. First, the lens segment group B is mapped by matrices into the lens segment query features Q, the lens segment key features K and the lens segment value features V, where W_Q, W_K and W_V are learnable mapping matrices. The lens segment query features Q and the lens segment key features K are averaged over rows to obtain the query mean feature and the key mean feature, which coarsely characterize the visual content of the whole lens; the query mean feature and the key mean feature are then matrix-multiplied to obtain a similarity matrix C covering the interaction relationships among the lens segments. For each row of the similarity matrix C, the index values of the top lens segments with the largest similarity are taken to form the index set I. According to I, the corresponding entries of the lens segment key features K and the lens segment value features V are extracted and aggregated into a key feature set K_I and a value feature set V_I that contain only the highly important segments. The lens segment query features Q and the key feature set K_I are matrix-multiplied to compute the region-level similarity matrix between the segments within the lens, which is normalized by the softmax function; after a random-inactivation (dropout) layer, the normalized similarity matrix is matrix-multiplied with the value feature set V_I to obtain the weighted lens segment features.
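A non-limiting PyTorch sketch of one dual-attention unit is given below. The tensor layout (N lenses, each contributing K lens segments, with coarse screening at the whole-lens level before fine-grained segment attention), the top-k value, the scaling and the dropout rate are assumptions about details the text leaves open.

    import torch
    import torch.nn as nn

    class LensSegmentDualAttention(nn.Module):
        """Coarse-to-fine dual attention over lens segments (sketch under stated assumptions)."""

        def __init__(self, dim, top_k=4, dropout=0.1):
            super().__init__()
            self.w_q = nn.Linear(dim, dim, bias=False)   # W_Q
            self.w_k = nn.Linear(dim, dim, bias=False)   # W_K
            self.w_v = nn.Linear(dim, dim, bias=False)   # W_V
            self.drop = nn.Dropout(dropout)
            self.top_k = top_k                           # assumed hyper-parameter (top_k <= N)
            self.scale = dim ** 0.5

        def forward(self, b):                            # b: (N, K, dim), lens segment group
            n, k_seg, d = b.shape
            q, k, v = self.w_q(b), self.w_k(b), self.w_v(b)

            # Coarse screening: row-averaged query/key features characterize each whole lens.
            q_mean, k_mean = q.mean(dim=1), k.mean(dim=1)        # (N, dim)
            coarse_sim = q_mean @ k_mean.t()                     # coarse similarity matrix C
            idx = coarse_sim.topk(self.top_k, dim=-1).indices    # strongly correlated index set I

            # Keep only the key/value segment features of the top-k most related lenses.
            k_sel = k[idx].reshape(n, self.top_k * k_seg, d)
            v_sel = v[idx].reshape(n, self.top_k * k_seg, d)

            # Fine-grained, region-level attention over the retained segments.
            fine_sim = q @ k_sel.transpose(1, 2) / self.scale    # (N, K, top_k * K)
            attn = self.drop(torch.softmax(fine_sim, dim=-1))
            return attn @ v_sel                                  # weighted lens segment features

The coarse filtering step keeps the attention weights from being dispersed over unrelated regions, which is the stated motivation for the two-stage design.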
The weighted video segment features, which reflect the high-level temporal relationship, and the weighted lens segment features, which reflect the high-level visual content, are fused and mapped into an importance score for each shot; the shots with high scores, i.e., high importance, are selected to generate the dynamic abstract.
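The fusion and selection step can be sketched as follows; the concatenation-based fusion layer, the sigmoid scoring head and the simple fixed-ratio top-score selection rule are illustrative assumptions, since the text only states that the two weighted features are fused into per-shot importance scores and that high-scoring shots are selected.

    import torch
    import torch.nn as nn

    class ScoreHead(nn.Module):
        """Fuse weighted video segment and lens segment features into per-shot scores."""

        def __init__(self, dim):
            super().__init__()
            self.fuse = nn.Linear(2 * dim, dim)
            self.score = nn.Linear(dim, 1)

        def forward(self, seg_feat, lens_feat):          # both (N, dim), one row per shot
            h = torch.relu(self.fuse(torch.cat([seg_feat, lens_feat], dim=-1)))
            return torch.sigmoid(self.score(h)).squeeze(-1)   # importance score per shot

    def select_summary(scores, ratio=0.15):
        """Pick the highest-scoring shots up to an assumed summary budget of `ratio`."""
        budget = max(1, int(len(scores) * ratio))
        return torch.topk(scores, budget).indices.sort().values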
The present application introduces a dual-attention design consisting of a video segment attention module and a lens segment attention module. Irrelevant regions are filtered out through coarse-grained mean similarity to prevent the attention weights from being dispersed, and a content-guided weight assignment process is then completed through fine-grained region-level similarity to obtain weighted lens segment features that represent the important visual content. In modeling the temporal relationship, unequal segmentation is adopted to avoid destroying the integrity of video segments and thereby capturing erroneous temporal dependencies; in obtaining the high-level visual semantic information, a dual-attention mechanism is adopted to realize a weight assignment process guided by the important content.
The reward function includes a representativeness reward R_rep, a diversity reward R_div and a regularization term R_reg; the overall reward function R is given by Equation (1).
The representativeness reward R_rep is given by Equation (2), in which T denotes the number of shots in the summary result, x_i denotes the spatio-temporal feature vector of the i-th shot, and x_j denotes the spatio-temporal feature vector of the j-th shot.
The diversity reward R_div is given by Equation (3), in which Y denotes the generated dynamic summary, x_i denotes the spatio-temporal feature vector of the i-th shot, and x_j denotes the spatio-temporal feature vector of the j-th shot.
The regularization term R_reg is given by Equation (4), in which p_i denotes the importance score of the i-th shot.
To verify the effectiveness of the invention, experiments were performed on two standard video summarization datasets, SumMe and TVSum, and two augmentation datasets, YouTube and OVP, and the invention was evaluated under three settings: standard (C), augmented (A) and transfer (T). In the standard setting, the specified dataset is randomly divided into five parts; 80% of the data is used for training and the remaining 20% for testing. In the augmented setting, 80% of the given dataset together with the other three datasets is used for training, while the remaining 20% is used for testing. In the transfer setting, three datasets are used for training and the remaining one is used for testing. In all settings, the model is evaluated with the F-score; each experiment is run five times and the average of the five runs is taken as the final result. As shown in fig. 8, compared with other advanced methods, the present application achieves the best performance in the standard and augmented settings of SumMe, achieves highly competitive results in the standard and augmented settings of TVSum, and outperforms most models in the transfer setting of both datasets. In summary, the proposed method can effectively model the temporal relationship and the visual content, thereby improving the performance of the model.
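The F-score reported in fig. 8 is the usual harmonic mean of precision and recall between the generated and user-annotated summaries; a minimal sketch over binary per-frame (or per-shot) selection vectors is given below, as an assumption about the exact evaluation code, which is not reproduced in the text.

    import numpy as np

    def f_score(pred, gt):
        """Harmonic mean of precision and recall between two binary selection vectors."""
        pred, gt = np.asarray(pred, dtype=bool), np.asarray(gt, dtype=bool)
        overlap = np.logical_and(pred, gt).sum()
        if overlap == 0:
            return 0.0
        precision = overlap / pred.sum()
        recall = overlap / gt.sum()
        return 2 * precision * recall / (precision + recall)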
A computer device, comprising:
a memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the dual attention mechanism based unsupervised segmented video summarization method as described above.
A computer-readable storage medium having a computer program stored thereon; the computer program is executed by a processor to implement the unsupervised segmented video summarization method based on the dual-attention mechanism as described above.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The schemes in the embodiments of the present application may be implemented in various computer languages, for example, C language, VHDL language, verilog language, object-oriented programming language Java, and transliteration scripting language JavaScript, etc.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present application without departing from the spirit or scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.

Claims (7)

1. An unsupervised segmented video abstraction method based on a double-attention mechanism is characterized by comprising the following steps:
preprocessing an original video, and then dividing the original video to obtain a video fragment group and a lens fragment group;
respectively inputting the video segment group and the lens segment group into a video abstract model, and obtaining the video segment characteristics with weight and the lens segment characteristics with weight after processing; the video abstract model comprises a video segment attention module and a lens segment attention module;
taking the video segment characteristics with weight and the lens segment characteristics with weight as inputs, calculating the importance score of each lens, and selecting the lens with high score, namely high importance degree, to generate a dynamic abstract;
constructing a reward function, calculating the diversity and representativeness of the dynamic abstract, and training the video abstract model in an unsupervised reinforcement-learning manner;
the video segment group and the lens segment group are respectively input into a video abstract model, and the video segment characteristic with weight and the lens segment characteristic with weight are obtained after processing, and the method comprises the following steps:
inputting the video segment group into a video segment attention module, calculating a similarity matrix between shots in the video segment, aggregating the features in the segment into weighted video segment features capable of representing short-term time sequence dependency high-level semantic information, and outputting the weighted video segment features;
inputting the lens segment group into a lens segment attention module, filtering out segments which are irrelevant to a target segment or have low correlation degree through coarse granularity similarity calculation, and calculating a similarity matrix of the remaining segments after the remaining lens segments in the lens are aggregated to obtain weighted lens segment characteristics capable of enhancing local region correlation semantic information; the method specifically comprises the following steps: taking a lens segment group as input, and linearly mapping the lens segment group into lens segment query characteristics, lens segment key characteristics and lens segment value characteristics by using matrixes with three different weights; averaging the lens fragment inquiry features and the lens fragment key features according to rows to obtain inquiry average features and key average features which can represent the whole lens content, and multiplying the inquiry average features and the key average features by a matrix to obtain a similarity matrix which can reflect the correlation between different areas of the lens; filtering out areas with no correlation or low correlation from the similarity matrix to obtain a strongly correlated area index set, and taking out corresponding key feature sets and value feature sets from the lens segment key features and the lens segment value features according to the area index set; calculating the product of the lens segment query feature and the key feature set, obtaining a regional level similarity normalization matrix of the segment after linear scaling and Softmax functions, and weighting the regional level similarity matrix to the value feature set by matrix multiplication to obtain the weighted lens segment feature;
the lens segment group B is taken as input and mapped by matrices into the lens segment query features Q, the lens segment key features K and the lens segment value features V, where W_Q, W_K and W_V are learnable mapping matrices; the lens segment query features Q and the lens segment key features K are averaged over rows to obtain the query mean feature and the key mean feature, which coarsely characterize the visual content; the query mean feature and the key mean feature are matrix-multiplied to preliminarily obtain a similarity matrix C covering the interaction relationships among the lens segments; for each row of the similarity matrix C, the index values of the top lens segments with the largest similarity are taken to form the index set I; according to I, the corresponding entries of the lens segment key features K and the lens segment value features V are extracted and aggregated into a key feature set K_I and a value feature set V_I containing only the highly important segments; the lens segment query features Q and the key feature set K_I are matrix-multiplied to compute the region-level similarity matrix between the segments within the lens, which is normalized by the softmax function; after a random-inactivation layer, the normalized similarity matrix is matrix-multiplied with the value feature set V_I to obtain the weighted lens segment features.
2. The method for summarizing an unsupervised segmented video based on a dual-attention mechanism according to claim 1, wherein the preprocessing of the original video and the segmentation to obtain a video segment group and a shot segment group comprises:
inputting an original long video, sampling and extracting a video frame sequence according to a frame rate;
inputting the video frame sequence into a feature extraction module for extraction to obtain space-time features capable of reflecting the visual content and the time sequence relationship;
and detecting visual appearance change points on the space-time characteristics by using a kernel segmentation algorithm, segmenting to obtain video segment groups, and equally dividing the space-time characteristics of each shot into non-overlapping shot segment sets.
3. The method for the unsupervised segmentation video abstraction based on the dual-attention mechanism as set forth in claim 1, wherein the inputting the video segment group into the video segment attention module calculates an inter-shot similarity matrix in the video segment, aggregates the features in the segment into weighted video segment features capable of characterizing short-term timing dependency high-level semantic information, and outputs the weighted video segment features, and specifically includes:
taking the video segment group as input, and linearly mapping the video segment group into video segment query features, video segment key features and video segment value features by using matrixes with three different weights;
calculating the product of the video segment query feature and the video segment key feature to obtain a segment level similarity matrix, and obtaining a segment level similarity normalization matrix of the segment after linear scaling and Softmax functions;
and weighting the segment-level similarity matrix to the video segment value characteristic by matrix multiplication to obtain the weighted video segment characteristic.
4. The dual attention mechanism based unsupervised segmented video summarization method of claim 1, wherein the reward function comprises a representativeness reward R_rep, a diversity reward R_div and a regularization term R_reg, and the overall reward function R is given by Equation (1);
the representativeness reward R_rep is given by Equation (2), in which T denotes the number of shots in the summary result, x_i denotes the spatio-temporal feature vector of the i-th shot, and x_j denotes the spatio-temporal feature vector of the j-th shot;
the diversity reward R_div is given by Equation (3), in which Y denotes the generated dynamic summary, x_i denotes the spatio-temporal feature vector of the i-th shot, and x_j denotes the spatio-temporal feature vector of the j-th shot;
the regularization term R_reg is given by Equation (4), in which p_i denotes the importance score of the i-th shot.
5. The method for unsupervised segmented video summarization based on the dual-attention mechanism according to claim 2, wherein the feature extraction module uses an X3D deep convolutional neural network, and the extracted 2048-dimensional shot vectors serve as the input spatio-temporal features.
6. A computer device, comprising:
a memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the dual attention mechanism based unsupervised segmented video summarization method of any one of claims 1 to 5.
7. A computer-readable storage medium, characterized in that a computer program is stored thereon; the computer program is executed by a processor to implement the dual attention mechanism based unsupervised segmented video summarization method of any one of claims 1 to 5.
CN202311598370.5A 2023-11-28 2023-11-28 Unsupervised segmentation video abstraction method based on double-attention mechanism Active CN117312603B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311598370.5A CN117312603B (en) 2023-11-28 2023-11-28 Unsupervised segmentation video abstraction method based on double-attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311598370.5A CN117312603B (en) 2023-11-28 2023-11-28 Unsupervised segmentation video abstraction method based on double-attention mechanism

Publications (2)

Publication Number Publication Date
CN117312603A CN117312603A (en) 2023-12-29
CN117312603B (en) 2024-03-01

Family

ID=89281414

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311598370.5A Active CN117312603B (en) 2023-11-28 2023-11-28 Unsupervised segmentation video abstraction method based on double-attention mechanism

Country Status (1)

Country Link
CN (1) CN117312603B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100284670A1 (en) * 2008-06-30 2010-11-11 Tencent Technology (Shenzhen) Company Ltd. Method, system, and apparatus for extracting video abstract
CN115002559A (en) * 2022-05-10 2022-09-02 上海大学 Video abstraction algorithm and system based on gated multi-head position attention mechanism
CN116662604A (en) * 2023-06-26 2023-08-29 浙江千从科技有限公司 Video abstraction method based on layered Transformer

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100284670A1 (en) * 2008-06-30 2010-11-11 Tencent Technology (Shenzhen) Company Ltd. Method, system, and apparatus for extracting video abstract
CN115002559A (en) * 2022-05-10 2022-09-02 上海大学 Video abstraction algorithm and system based on gated multi-head position attention mechanism
CN116662604A (en) * 2023-06-26 2023-08-29 浙江千从科技有限公司 Video abstraction method based on layered Transformer

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Dynamic Video Summarization Methods Based on Self-Attention Networks; Yao Huimin; China Master's Theses Full-text Database, Information Science and Technology, No. 01; pp. 1-75 *

Also Published As

Publication number Publication date
CN117312603A (en) 2023-12-29

Similar Documents

Publication Publication Date Title
CN111079532B (en) Video content description method based on text self-encoder
CN111768432B (en) Moving target segmentation method and system based on twin deep neural network
US11200424B2 (en) Space-time memory network for locating target object in video content
Liu et al. Teinet: Towards an efficient architecture for video recognition
CN110503076B (en) Video classification method, device, equipment and medium based on artificial intelligence
Li et al. Short-term and long-term context aggregation network for video inpainting
CN111046821B (en) Video behavior recognition method and system and electronic equipment
CN112016682B (en) Video characterization learning and pre-training method and device, electronic equipment and storage medium
CN112232134B (en) Human body posture estimation method based on hourglass network and attention mechanism
CN111028166B (en) Video deblurring method based on iterative neural network
CN115695950B (en) Video abstract generation method based on content perception
GB2579262A (en) Space-time memory network for locating target object in video content
CN114549913A (en) Semantic segmentation method and device, computer equipment and storage medium
CN110852199A (en) Foreground extraction method based on double-frame coding and decoding model
CN115131218A (en) Image processing method, image processing device, computer readable medium and electronic equipment
Zhou et al. Transformer-based multi-scale feature integration network for video saliency prediction
CN113763385A (en) Video object segmentation method, device, equipment and medium
CN117312603B (en) Unsupervised segmentation video abstraction method based on double-attention mechanism
WO2023185320A1 (en) Cold start object recommendation method and apparatus, computer device and storage medium
CN116229073A (en) Remote sensing image segmentation method and device based on improved ERFNet network
CN113627342B (en) Method, system, equipment and storage medium for video depth feature extraction optimization
CN112926697B (en) Abrasive particle image classification method and device based on semantic segmentation
CN111046232B (en) Video classification method, device and system
Xia et al. MFC-Net: Multi-scale fusion coding network for Image Deblurring
CN113553471A (en) Video abstract generation method of LSTM model based on space attention constraint

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant