CN117312603B - Unsupervised segmentation video abstraction method based on double-attention mechanism - Google Patents
- Publication number
- CN117312603B CN117312603B CN202311598370.5A CN202311598370A CN117312603B CN 117312603 B CN117312603 B CN 117312603B CN 202311598370 A CN202311598370 A CN 202311598370A CN 117312603 B CN117312603 B CN 117312603B
- Authority
- CN
- China
- Prior art keywords
- segment
- video
- lens
- feature
- features
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/73—Querying
- G06F16/738—Presentation of query results
- G06F16/739—Presentation of query results in form of a video summary, e.g. the video summary being a video sequence, a composite still image or having synthesized frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/71—Indexing; Data structures therefor; Storage structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7847—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/088—Non-supervised learning, e.g. competitive learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/443—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
- G06V10/449—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
- G06V10/451—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
- G06V20/47—Detecting features for summarising video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/49—Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The embodiment of the application provides an unsupervised segmented video summarization method based on a dual-attention mechanism, comprising the following steps: after preprocessing an original video, dividing it into video segments and shot segments to obtain a video segment group and a shot segment group; inputting the video segment group and the shot segment group into a video summarization model respectively, and obtaining weighted video segment features and weighted shot segment features after processing; taking the weighted video segment features and the weighted shot segment features as inputs, calculating an importance score for each shot, and selecting the shots with high scores, i.e., high importance, to generate a dynamic summary; constructing a reward function, calculating the diversity and representativeness of the dynamic summary, and training the video summarization model by unsupervised reinforcement learning. The method emphasizes the importance of visual content while modeling temporal relationships, strengthens the representation capability of the summary features, and improves the model's ability to understand and analyze video content.
Description
Technical Field
The application belongs to the field of computer vision, and particularly relates to an unsupervised segmented video summarization method based on a dual-attention mechanism.
Background
In recent years, with the popularization of video-sharing platforms such as Douyin, iQiyi, and Kuaishou, uploading and sharing videos anytime and anywhere has become the norm. While these platforms satisfy people's cultural and entertainment needs, the accompanying multimedia data such as audio and video have grown explosively. How to process and manage such a huge and complicated volume of video data has therefore become an urgent problem. Video summarization is a technology by which a computer automatically extracts important frames or video clips from an original long video; it preserves the original video content to the greatest extent while shortening the total duration, facilitating efficient storage and browsing, and has thus gradually attracted extensive attention from researchers in the field of computer vision.
Modeling the temporal relationships of video sequences is a challenge in video summarization tasks; at the same time, accurately extracting features that effectively characterize the entire video is also very important. Existing video summarization methods can be divided into two categories. The first takes static image features as input and uses temporal feature aggregation to capture temporal interaction relationships. For example, Ji et al. employ GoogLeNet as the extraction network for visual features of video frames, use a long short-term memory network as the encoder to model remote dependencies of the video frame sequence, and use an attention mechanism to enhance long-term dependencies between frames; Li et al. also use image features extracted by GoogLeNet as input, model the similarity between all pairs of frames with a self-attention mechanism, and capture global relationships over the entire video. Although these methods take temporal relationships into account, they merely simulate the front-to-back relationship of a set of still images and do not consider the true latent temporal relationships of a sequence of consecutive frames, and the still-image features extracted by 2D convolution lack correlation between consecutive frames.
To solve the above problems, a second category of methods takes dynamic video features covering fine-grained temporal information as input. For example, Lin et al. propose using 3D ResNeXt-101 as the extraction network for dynamic video features, design a hierarchical long short-term memory network, and obtain long-term dependencies of the video in parallel with a sequential annotation mechanism; Liu et al. study which of the ST3D and I3D networks is better suited as a feature extraction network for video summarization, and propose using 3DST-UNet to explore context information and map spatio-temporal features into a latent space that encodes spatio-temporal dependencies. However, these methods overemphasize temporal relationships at the expense of visual content, so the models are biased in their understanding of visual content. Therefore, a video summarization method that can fully and accurately understand visual content while completing temporal correlation modeling is needed.
Disclosure of Invention
In order to address one of the above technical defects, an embodiment of the present application provides an unsupervised segmented video summarization method based on a dual-attention mechanism.
According to a first aspect of the embodiments of the present application, there is provided an unsupervised segmented video summarization method based on a dual-attention mechanism, including:
preprocessing an original video and then segmenting it to obtain a video segment group and a shot segment group;
inputting the video segment group and the shot segment group into a video summarization model respectively, and obtaining weighted video segment features and weighted shot segment features after processing;
taking the weighted video segment features and the weighted shot segment features as inputs, calculating an importance score for each shot, and selecting the shots with high scores, i.e., high importance, to generate a dynamic summary;
and constructing a reward function, calculating the diversity and representativeness of the dynamic summary, and training the video summarization model by unsupervised reinforcement learning.
Preferably, the preprocessing of the original video followed by segmentation into a video segment group and a shot segment group includes:
inputting an original long video and sampling a video frame sequence according to the frame rate;
inputting the video frame sequence into a feature extraction module to obtain spatio-temporal features that reflect both the visual content and the temporal relationships;
and detecting visual appearance change points on the spatio-temporal features using a kernel segmentation algorithm, segmenting to obtain video segment groups, and equally dividing the spatio-temporal features of each shot into non-overlapping shot segment sets.
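The segmentation step above can be sketched as follows. This is a simplified stand-in under stated assumptions: the patent uses a kernel segmentation (change-point detection) algorithm, whereas the sketch places a boundary wherever the cosine distance between consecutive shot features exceeds a threshold; the function name, the threshold, and the number of sub-segments per shot are illustrative, not taken from the patent.

```python
import numpy as np

def split_video(shot_feats, threshold=0.3, seg_len=4):
    """Split shot-level spatio-temporal features into a video segment group
    and per-shot segment groups.

    A simplified stand-in for the kernel change-point detection in the
    method: a boundary is placed wherever the cosine distance between
    consecutive shot features exceeds `threshold`.
    """
    f = shot_feats / np.linalg.norm(shot_feats, axis=1, keepdims=True)
    # cosine distance between each pair of consecutive shots
    dist = 1.0 - np.sum(f[:-1] * f[1:], axis=1)
    boundaries = np.where(dist > threshold)[0] + 1
    # unequal, non-overlapping video segment groups
    video_segments = np.split(shot_feats, boundaries)

    # equally divide every shot feature into non-overlapping shot segments
    dim = shot_feats.shape[1] // seg_len
    shot_segments = [x[: seg_len * dim].reshape(seg_len, dim) for x in shot_feats]
    return video_segments, shot_segments
```

With a threshold above the maximum cosine distance no boundary is placed and the whole video forms one segment; with a negative threshold every shot becomes its own segment.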
Preferably, the video summarization model comprises a video segment attention module and a shot segment attention module.
Preferably, the inputting of the video segment group and the shot segment group into the video summarization model respectively, and obtaining the weighted video segment features and the weighted shot segment features after processing, includes:
inputting the video segment group into the video segment attention module, calculating a similarity matrix between the shots in each video segment, aggregating the intra-segment features into weighted video segment features that represent high-level semantic information with short-term temporal dependencies, and outputting the weighted video segment features;
and inputting the shot segment group into the shot segment attention module, filtering out the segments that are irrelevant or weakly correlated to the target segment through a coarse-grained similarity calculation, aggregating the remaining shot segments within the shot, and calculating the similarity matrix of the remaining segments to obtain weighted shot segment features that enhance locally correlated regional semantic information.
Preferably, the inputting of the video segment group into the video segment attention module, calculating a similarity matrix between the shots in each video segment, aggregating the intra-segment features into weighted video segment features that represent high-level semantic information with short-term temporal dependencies, and outputting the weighted video segment features specifically includes:
taking the video segment group as input, and linearly mapping it into video segment query features, video segment key features and video segment value features using three matrices with different weights;
calculating the product of the video segment query features and the video segment key features to obtain a segment-level similarity matrix, and obtaining the segment-level similarity normalization matrix of the segment after linear scaling and a Softmax function;
and weighting the video segment value features with the segment-level similarity normalization matrix by matrix multiplication to obtain the weighted video segment features.
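The three steps above describe scaled dot-product self-attention over the shots of one segment. A minimal sketch, with the learned weight matrices passed in explicitly since the claim does not fix how they are trained:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def segment_attention(seg, wq, wk, wv):
    """Weighted video-segment features via intra-segment attention.

    `seg` holds the shot features of one video segment (n_shots x d);
    `wq`, `wk`, `wv` are the three differently weighted linear maps.
    """
    q, k, v = seg @ wq, seg @ wk, seg @ wv   # query / key / value features
    sim = q @ k.T / np.sqrt(k.shape[1])      # segment-level similarity, linearly scaled
    attn = softmax(sim)                      # row-normalised similarity matrix
    return attn @ v                          # weighted video-segment features
```

Each row of the normalised similarity matrix sums to one, so the output is a convex combination of the value features of the shots in the segment.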
Preferably, the inputting of the shot segment group into the shot segment attention module, filtering out the segments that are irrelevant or weakly correlated to the target segment through a coarse-grained similarity calculation, aggregating the remaining shot segments within the shot, and calculating the similarity matrix of the remaining segments to obtain weighted shot segment features that enhance locally correlated regional semantic information specifically includes:
taking the shot segment group as input, and linearly mapping it into shot segment query features, shot segment key features and shot segment value features using three matrices with different weights;
averaging the shot segment query features and the shot segment key features by rows to obtain query average features and key average features that represent the whole shot content, and multiplying the query average features and the key average features as matrices to obtain a similarity matrix that reflects the correlation between different regions of the shot;
filtering out the regions with no or low correlation from the similarity matrix to obtain an index set of strongly correlated regions, and taking the corresponding key feature set and value feature set from the shot segment key features and shot segment value features according to the region index set;
and calculating the product of the shot segment query features and the key feature set, obtaining the region-level similarity normalization matrix of the segment after linear scaling and a Softmax function, and weighting the value feature set with the region-level similarity normalization matrix by matrix multiplication to obtain the weighted shot segment features.
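The coarse-to-fine procedure above can be sketched as follows. The retention rule (keeping a fixed fraction of the most similar regions) and the parameter names are assumptions; the claim only specifies that weakly correlated regions are filtered out before the fine-grained attention.

```python
import numpy as np

def shot_segment_attention(segs, wq, wk, wv, keep=0.5):
    """Two-stage attention over the shot-segment group of one shot.

    Stage 1 (coarse): queries and keys are averaged row-wise into one
    vector per segment, and segment pairs with low similarity are dropped.
    Stage 2 (fine): scaled attention against the retained keys/values only.
    """
    n = len(segs)
    q = np.stack([s @ wq for s in segs])   # (n_segments, rows, d)
    k = np.stack([s @ wk for s in segs])
    v = np.stack([s @ wv for s in segs])

    q_avg, k_avg = q.mean(axis=1), k.mean(axis=1)   # whole-segment summaries
    coarse = q_avg @ k_avg.T                        # coarse region similarity
    n_keep = max(1, int(np.ceil(keep * n)))
    out = []
    for i in range(n):
        idx = np.argsort(coarse[i])[::-1][:n_keep]  # strongly correlated regions
        ks = k[idx].reshape(-1, k.shape[2])         # gathered key feature set
        vs = v[idx].reshape(-1, v.shape[2])         # gathered value feature set
        sim = q[i] @ ks.T / np.sqrt(ks.shape[1])    # linearly scaled similarity
        a = np.exp(sim - sim.max(axis=1, keepdims=True))
        a /= a.sum(axis=1, keepdims=True)           # region-level normalisation
        out.append(a @ vs)                          # weighted shot-segment feature
    return np.stack(out)
```

Filtering before the fine attention shrinks the key/value set, so the expensive row-by-row similarity is computed only against regions that the coarse pass judged relevant.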
Preferably, the reward function comprises a representativeness reward \(R_{\mathrm{rep}}\), a diversity reward \(R_{\mathrm{div}}\) and a regularization term \(R_{\mathrm{reg}}\). The reward function \(R\) is:

\[ R = R_{\mathrm{rep}} + R_{\mathrm{div}} + R_{\mathrm{reg}} \tag{1} \]

The representativeness reward \(R_{\mathrm{rep}}\) is given by the following equation (2):

\[ R_{\mathrm{rep}} = \exp\!\left(-\frac{1}{T}\sum_{i=1}^{T}\min_{j\in\mathcal{Y}}\left\lVert x_i - x_j\right\rVert_2\right) \tag{2} \]

In equation (2), \(T\) denotes the total number of shots, \(\mathcal{Y}\) denotes the set of shots in the summary result (its size \(\lvert\mathcal{Y}\rvert\) is the shot length of the summary), \(x_i\) denotes the spatio-temporal feature vector of the \(i\)-th shot, and \(x_j\) denotes the spatio-temporal feature vector of the \(j\)-th shot.

The diversity reward \(R_{\mathrm{div}}\) is given by the following equation (3):

\[ R_{\mathrm{div}} = \frac{1}{\lvert\mathcal{Y}\rvert\left(\lvert\mathcal{Y}\rvert-1\right)}\sum_{i\in\mathcal{Y}}\sum_{\substack{j\in\mathcal{Y}\\ j\neq i}}\left(1-\frac{x_i^{\top}x_j}{\lVert x_i\rVert\,\lVert x_j\rVert}\right) \tag{3} \]

In equation (3), \(\mathcal{Y}\) denotes the generated dynamic summary, \(x_i\) denotes the spatio-temporal feature vector of the \(i\)-th shot, and \(x_j\) denotes the spatio-temporal feature vector of the \(j\)-th shot.

The regularization term \(R_{\mathrm{reg}}\) is obtained by the following equation (4):

\[ R_{\mathrm{reg}} = -\left\lVert \frac{1}{T}\sum_{t=1}^{T} p_t - \varepsilon \right\rVert^2 \tag{4} \]

In equation (4), \(p_t\) denotes the importance score of the \(t\)-th shot and \(\varepsilon\) denotes the desired proportion of shots selected into the summary.
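A sketch of how the representativeness and diversity terms might be computed for a candidate summary. The original formula images in this record are not recoverable, so the constants below follow the standard diversity-representativeness reward formulation for unsupervised reinforcement-learning summarization and should be treated as assumptions:

```python
import numpy as np

def rewards(x, picks):
    """Representativeness and diversity rewards for a candidate summary.

    `x` holds shot features (T x d); `picks` indexes the selected shots.
    """
    y = x[picks]
    # representativeness: exp of minus the mean distance from every shot
    # to its nearest selected shot (how well the summary covers the video)
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    r_rep = float(np.exp(-np.sqrt(d2).min(axis=1).mean()))

    # diversity: mean pairwise cosine dissimilarity of the selected shots
    yn = y / np.linalg.norm(y, axis=1, keepdims=True)
    cos = yn @ yn.T
    m = len(picks)
    r_div = float((1.0 - cos)[~np.eye(m, dtype=bool)].mean()) if m > 1 else 0.0

    return r_rep, r_div
```

Selecting every shot drives the coverage term to zero, so the representativeness reward reaches its maximum of 1; the diversity term then pulls the policy back toward dissimilar shots.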
Preferably, the feature extraction module adopts an X3D deep convolutional neural network, and the extracted shot feature vector is 2048-dimensional, serving as the input spatio-temporal feature.
According to a second aspect of embodiments of the present application, there is provided a computer device comprising:
a memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the dual attention mechanism based unsupervised segmented video summarization method as claimed in any one of the above.
According to a third aspect of embodiments of the present application, there is provided a computer-readable storage medium having a computer program stored thereon; the computer program is executed by a processor to implement the dual attention mechanism based unsupervised segmented video summarization method as described in any one of the above.
By adopting the unsupervised segmented video summarization method based on the dual-attention mechanism provided by the embodiments of the present application, the original video is preprocessed and then divided into video segments and shot segments to obtain a video segment group and a shot segment group; the two groups are respectively input into the video summarization model, the importance score of each shot is calculated, and the shots with high scores, i.e., high importance, are selected to generate a dynamic summary. The method emphasizes the importance of visual content while modeling temporal relationships, strengthens the representation capability of the summary features, and improves the model's ability to understand and analyze video content.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
fig. 1 is a flowchart of an unsupervised segmented video summarization method based on a dual-attention mechanism according to the first embodiment of the present application;
fig. 2 is a system framework diagram of an unsupervised segmented video summarization method based on a dual-attention mechanism according to the first embodiment of the present application;
fig. 3 is a flowchart of an unsupervised segmented video summarization method based on a dual-attention mechanism according to the second embodiment of the present application;
fig. 4 is a network structure block diagram of an unsupervised segmented video summarization method based on a dual-attention mechanism according to the third embodiment of the present application;
fig. 5 is a flowchart of video segment group processing in the third embodiment of the present application;
fig. 6 is a flowchart of shot segment group processing in the third embodiment of the present application;
fig. 7 is a diagram of the shot segment group processing structure in the third embodiment of the present application;
fig. 8 is a graph comparing experimental results of the method provided in the present application with other methods.
Detailed Description
In order to make the technical solutions and advantages of the embodiments of the present application more apparent, exemplary embodiments of the present application are described in detail below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. It should be noted that, in the absence of conflict, the embodiments and the features in the embodiments may be combined with each other.
In the process of implementing the present application, the inventors found that current video summarization methods do not consider unifying temporal relationships and visual content, so the models are biased in their understanding of video content.
In view of the above problems, an unsupervised segmented video summarization method based on a dual-attention mechanism is provided in the first embodiment of the present application. Fig. 1 is a flowchart of the first embodiment, and fig. 2 is a system framework diagram of the first embodiment. As shown in figs. 1 and 2, the method includes:
S1, preprocessing an original video and then segmenting it to obtain a video segment group and a shot segment group;
S2, inputting the video segment group and the shot segment group into a video summarization model respectively, and obtaining weighted video segment features and weighted shot segment features after processing; the video summarization model comprises a video segment attention module and a shot segment attention module;
S3, taking the weighted video segment features and the weighted shot segment features as inputs, calculating an importance score for each shot, and selecting the shots with high scores, i.e., high importance, to generate a dynamic summary;
and S4, constructing a reward function, calculating the diversity and representativeness of the dynamic summary, and training the video summarization model by unsupervised reinforcement learning.
After the original video is preprocessed, it is segmented to obtain a video segment group and a shot segment group; the two groups are respectively input into the video summarization model, the importance score of each shot is calculated, and the shots with high scores, i.e., high importance, are selected to generate a dynamic summary. The method emphasizes the importance of visual content while modeling temporal relationships, strengthens the representation capability of the summary features, and improves the model's ability to understand and analyze video content.
Fig. 3 is a flowchart of an unsupervised segmented video summarization method based on a dual-attention mechanism according to the second embodiment of the present application. On the basis of the first embodiment, as shown in fig. 3, the preprocessing of the original video followed by segmentation into a video segment group and a shot segment group includes:
S11, inputting an original long video and sampling a video frame sequence according to the frame rate; the frame rate may differ across videos, and the sequence length is the total number of downsampled frames.
S12, inputting the video frame sequence into a feature extraction module to obtain spatio-temporal features reflecting the visual content and the temporal relationships; the feature extraction module adopts an X3D deep convolutional neural network, and each extracted shot vector is 2048-dimensional, serving as the input spatio-temporal feature;
S13, detecting visual appearance change points on the spatio-temporal features using a kernel segmentation algorithm, segmenting to obtain video segment groups, and equally dividing the spatio-temporal features of each shot into non-overlapping shot segment sets.
Specifically, the input original long video is downsampled according to the video frame rate to obtain a video frame sequence \(F=\{f_1, f_2, \dots, f_T\}\), where \(T\) denotes the total number of downsampled frames and may differ across videos. An X3D deep neural network is used as the spatio-temporal feature extractor: a fixed number of temporally consecutive frames of \(F\) is taken as one shot and input to the X3D network, yielding for each shot a spatio-temporal feature vector that represents both visual content and temporal relationships. The shot features of each video are denoted \(X=\{x_1, x_2, \dots, x_N\}\), where \(x_i\in\mathbb{R}^{D}\) and \(D\) is the shot feature dimension output by the X3D network, with value 2048. Taking the shot features \(X\) as input, the kernel segmentation algorithm detects the shots in \(X\) whose visual content changes sharply; using these shots as boundary points, the video is divided into unequal, non-overlapping video segment groups \(\{V_1, \dots, V_M\}\), where \(M\) is the number of video segments. At the same time, each shot feature is equally divided into a non-overlapping shot segment group \(\{s_1, \dots, s_K\}\), where \(K\) is the number of shot segments.
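The grouping of the downsampled frame sequence into shot-level features can be sketched as follows; the number of frames per shot and the extractor interface are placeholders, since the text elides the exact frame count and the X3D network itself is not reproduced here:

```python
import numpy as np

def frames_to_shot_features(frames, frames_per_shot, extractor):
    """Group a downsampled frame sequence into shots and extract one
    spatio-temporal feature vector per shot.

    `extractor` stands in for the X3D network: any callable mapping a
    (frames_per_shot, H, W, 3) clip to a 2048-d feature vector.
    """
    n_shots = len(frames) // frames_per_shot   # drop the trailing partial shot
    clips = [frames[i * frames_per_shot:(i + 1) * frames_per_shot]
             for i in range(n_shots)]
    return np.stack([extractor(np.asarray(c)) for c in clips])  # (n_shots, 2048)
```

The resulting (N, 2048) matrix is exactly the shot feature set that the kernel segmentation algorithm consumes in the next step.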
Fig. 4 is a network structure block diagram of the unsupervised segmented video summarization method based on the dual-attention mechanism according to a third embodiment of the present application. As shown in Fig. 4, respectively inputting the video segment group and the shot segment group into the video summarization model and obtaining the weighted video segment features and the weighted shot segment features after processing includes:
S21, inputting the video segment group into the video segment attention module, calculating the inter-shot similarity matrix within the video segment, aggregating the intra-segment features into weighted video segment features capable of characterizing short-term temporally dependent high-level semantic information, and outputting the weighted video segment features;
S22, inputting the shot segment group into the shot segment attention module, filtering out segments that are irrelevant or weakly correlated to the target segment through coarse-grained similarity calculation, and calculating the similarity matrix of the remaining shot segments after aggregation to obtain weighted shot segment features capable of enhancing locally correlated regional semantic information.
As shown in Fig. 5, the processing flow for the video segment group includes:
S211, taking the video segment group as input, and linearly mapping it with three differently weighted matrices into video segment query features, video segment key features, and video segment value features;
S212, calculating the product of the video segment query features and the video segment key features to obtain a segment-level similarity matrix, and obtaining the segment's segment-level similarity normalization matrix after linear scaling and the Softmax function;
S213, weighting the segment-level similarity matrix onto the video segment value features by matrix multiplication to obtain the weighted video segment features.
Specifically, the video segment attention module consists of several self-attention units capable of capturing intra-segment temporal correlation. Each unit takes the features of one video segment as input and computes the inter-shot similarity matrix S within the segment; one entry S_ij of the inter-shot similarity matrix is computed as:

S_ij = (x_i W_q)(x_j W_k)^T / sqrt(d)   (5)

In formula (5), W_q and W_k are parameters to be learned by the model, and d is a constant (the projection dimension used for scaling).

The inter-shot similarity matrix S is output as a normalized similarity matrix A through the softmax function; A then passes through a random-deactivation (dropout) layer and is matrix-multiplied with the segment features projected through a learnable linear mapping layer W_v, giving the weighted video segment features Y:

Y = Dropout(softmax(S)) X W_v   (6)
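Formulas (5) and (6) can be sketched as a plain NumPy self-attention unit; the projection matrices and toy dimensions below are illustrative placeholders, not parameters from the patent:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def segment_self_attention(X, Wq, Wk, Wv, drop_p=0.0, rng=None):
    """One self-attention unit over the shots of a video segment:
    S = (X Wq)(X Wk)^T / sqrt(d), Y = Dropout(softmax(S)) X Wv."""
    d = Wq.shape[1]
    S = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)   # inter-shot similarity matrix, formula (5)
    A = softmax(S, axis=-1)                  # normalized similarity matrix
    if drop_p > 0 and rng is not None:       # random-deactivation (dropout) layer
        A = A * (rng.random(A.shape) >= drop_p) / (1 - drop_p)
    return A @ (X @ Wv)                      # weighted video segment features, formula (6)

rng = np.random.default_rng(0)
n, d = 6, 32                                 # 6 shots in the segment, toy dimension
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
Y = segment_self_attention(X, Wq, Wk, Wv)
assert Y.shape == (n, d)
```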
As shown in Fig. 6, the processing of the shot segment group includes:
S221, taking the shot segment group as input, and linearly mapping it with three differently weighted matrices into shot segment query features, shot segment key features, and shot segment value features;
S222, averaging the shot segment query features and the shot segment key features by rows to obtain query mean features and key mean features that characterize the overall shot content, and matrix-multiplying the two to obtain a similarity matrix reflecting the correlation between different regions of the shot;
S223, filtering regions with no or low correlation out of the similarity matrix to obtain a strongly correlated region index set, and extracting the corresponding key feature set and value feature set from the shot segment key features and value features according to the region index set;
S224, calculating the product of the shot segment query features and the key feature set, obtaining the segment's region-level similarity normalization matrix after linear scaling and the Softmax function, and weighting the region-level similarity matrix onto the value feature set by matrix multiplication to obtain the weighted shot segment features.
Specifically, as shown in Fig. 7, the shot segment attention module consists of several dual attention units that steer attention toward the more important content. Each unit takes the shot segment group as input and performs coarse-grained shot segment screening first, followed by fine-grained weight assignment. First, the shot segment group S is mapped by matrices into the shot segment query features Q = S W_q, shot segment key features K = S W_k, and shot segment value features V = S W_v, where W_q, W_k, and W_v are all learnable mapping matrices. The query features Q and key features K are averaged by rows to obtain the query mean feature Q̄ and key mean feature K̄, which characterize the visual content at coarse granularity; matrix-multiplying Q̄ and K̄ gives a similarity matrix C covering the interaction between shot segments. For each row of C, the index values I of the top-k shot segments with the largest similarity are taken; according to I, the corresponding entries of the shot segment key features K and value features V are gathered into a key feature set K_g and a value feature set V_g containing only the highly important segments. The shot segment query features Q are matrix-multiplied with the key feature set K_g to compute the region-level similarity matrix between intra-shot segments, which is then output as a normalized similarity matrix through the softmax function; this matrix passes through a random-deactivation (dropout) layer and is matrix-multiplied with the value feature set V_g, giving the weighted shot segment features Z.
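A minimal sketch of one dual attention unit, assuming a (segments × regions × dim) layout for the shot segment group and a per-row top-k screening rule; this is an illustrative reconstruction of the coarse-to-fine flow described above, not the patent's exact implementation:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def dual_attention_unit(S, Wq, Wk, Wv, k=2):
    """Coarse stage: mean-pooled query/key features select the top-k most
    related segments; fine stage: region-level attention over the kept set."""
    Q, K, V = S @ Wq, S @ Wk, S @ Wv                 # segment query/key/value features
    q_mean, k_mean = Q.mean(axis=1), K.mean(axis=1)  # row means characterize each segment coarsely
    C = q_mean @ k_mean.T                            # coarse inter-segment similarity matrix
    idx = np.argsort(-C, axis=1)[:, :k]              # top-k most similar segment indices per row
    out = np.empty_like(Q)
    for i in range(S.shape[0]):
        Kg = K[idx[i]].reshape(-1, K.shape[-1])      # gathered key set (high-importance only)
        Vg = V[idx[i]].reshape(-1, V.shape[-1])      # gathered value set
        A = softmax(Q[i] @ Kg.T / np.sqrt(K.shape[-1]))  # region-level similarity, normalized
        out[i] = A @ Vg                              # weighted shot segment features
    return out

rng = np.random.default_rng(0)
n_seg, m, d = 4, 5, 16                               # toy sizes: 4 segments of 5 regions each
S = rng.normal(size=(n_seg, m, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
Z = dual_attention_unit(S, Wq, Wk, Wv, k=2)
assert Z.shape == (n_seg, m, d)
```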
The weighted video segment features, which reflect high-level temporal relations, and the weighted shot segment features, which reflect high-level visual content, are fused and mapped into an importance score p_i for each shot, where i = 1, ..., N; the shots with high scores, i.e., high importance, are selected to generate the dynamic summary.
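The selection step can be sketched greedily; note the patent only states that high-score shots are chosen, so the length budget and the greedy rule below are common conventions in video summarization, not details from this document:

```python
import numpy as np

def select_shots(scores, shot_lengths, budget_ratio=0.15):
    """Pick shots in descending score order until the summary reaches a fixed
    fraction of the video length (budget_ratio is an assumed convention)."""
    budget = budget_ratio * sum(shot_lengths)
    chosen, used = [], 0
    for i in np.argsort(-np.asarray(scores)):        # highest score first
        if used + shot_lengths[i] <= budget:
            chosen.append(int(i))
            used += shot_lengths[i]
    return sorted(chosen)

# Five shots of equal length, 40% budget: the two best-scoring shots fit.
summary = select_shots([0.9, 0.1, 0.8, 0.3, 0.7], [10, 10, 10, 10, 10], 0.4)
assert summary == [0, 2]
```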
The present application introduces a dual attention design consisting of a video segment attention module and a shot segment attention module: irrelevant regions are first filtered out through coarse-grained mean similarity to prevent the attention weights from being dispersed, after which a content-guided weight assignment is completed through fine-grained region-level similarity, yielding weighted shot segment features that represent the important visual content. In modeling the temporal relations, unequal segmentation is adopted, preventing erroneous temporal dependencies from being captured due to damaging the integrity of video segments. In obtaining the high-level visual semantic information, a dual-attention mechanism is adopted, realizing an importance-guided weight assignment process.
The reward function includes a representativeness reward R_rep, a diversity reward R_div, and a regularization term R_reg; the reward function R is:

R = R_rep + R_div + R_reg   (1)

The representativeness reward R_rep is given by formula (2):

R_rep = exp( -(1/T) Σ_{i=1}^{T} min_{j∈Y} ||x_i - x_j||_2 )   (2)

In formula (2), T denotes the shot length of the summary result, Y denotes the set of shots selected into the dynamic summary, x_i denotes the spatio-temporal feature vector of the i-th shot, and x_j denotes the spatio-temporal feature vector of the j-th shot;

The diversity reward R_div is given by formula (3):

R_div = (1/(|Y|(|Y|-1))) Σ_{i∈Y} Σ_{j∈Y, j≠i} ( 1 - (x_i · x_j)/(||x_i||_2 ||x_j||_2) )   (3)

In formula (3), Y represents the generated dynamic summary, x_i denotes the spatio-temporal feature vector of the i-th shot, and x_j denotes the spatio-temporal feature vector of the j-th shot;

The regularization term R_reg is obtained by formula (4):

R_reg = -( (1/N) Σ_{i=1}^{N} p_i - ε )²   (4)

In formula (4), p_i denotes the importance score of the i-th shot, N denotes the number of shots, and ε is a preset selection-ratio constant.
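Under the assumption that formulas (1)-(4) follow the standard diversity-representativeness reward design that the variable descriptions match, the rewards can be sketched as follows; signs and normalizations are assumptions where the document is ambiguous:

```python
import numpy as np

def rewards(X, selected, probs, eps=0.15):
    """X: (T, d) shot features; selected: indices of summary shots;
    probs: per-shot importance scores; eps: assumed selection-ratio constant."""
    Y = np.asarray(selected)
    # (2) representativeness: how well the selected shots cover all shots
    dists = np.linalg.norm(X[:, None, :] - X[None, Y, :], axis=-1).min(axis=1)
    r_rep = float(np.exp(-dists.mean()))
    # (3) diversity: mean pairwise cosine dissimilarity among selected shots
    Xn = X[Y] / np.linalg.norm(X[Y], axis=1, keepdims=True)
    cos = Xn @ Xn.T
    n = len(Y)
    r_div = float((1.0 - cos)[~np.eye(n, dtype=bool)].sum() / (n * (n - 1)))
    # (4) regularization: keep the mean importance score near the ratio eps
    r_reg = -float((np.mean(probs) - eps) ** 2)
    return r_rep + r_div + r_reg, (r_rep, r_div, r_reg)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 16))
total, parts = rewards(X, [0, 3, 6], rng.random(8))
assert np.isfinite(total) and 0 < parts[0] <= 1
```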
To verify the effectiveness of the invention, experiments were performed on two video summarization benchmark datasets, SumMe and TVSum, and two augmentation datasets, YouTube and OVP, evaluated under three settings: canonical (C), augmented (A), and transfer (T). In the canonical setting, the given dataset is randomly divided into five parts; 80% of the data is used for training and the remaining 20% for testing. In the augmented setting, 80% of the given dataset together with the other three datasets is used for training, while the remaining 20% is used for testing. In the transfer setting, three datasets are used for training and the remaining one for testing. In all settings, the model is evaluated with the F-score; each experiment is run five times and the average of the five runs is taken as the final result. As shown in Fig. 8, compared with other state-of-the-art methods, the present application achieves the best performance in the canonical and augmented settings on SumMe, achieves highly competitive results in the canonical and augmented settings on TVSum, and outperforms most models in the transfer setting on both datasets. In summary, the proposed method effectively models temporal relations and visual content to improve model performance.
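The canonical-setting protocol above (random five-way split, train on 80%, test on 20%, F-score averaged over five runs) can be sketched as follows; `eval_fn` is a hypothetical stand-in for training and testing the summarizer on one split:

```python
import numpy as np

def five_fold_eval(video_ids, eval_fn, rng):
    """Shuffle the dataset, split it into five parts, and for each fold train on
    the other four (80%) and test on the held-out part (20%); return the mean
    F-score over the five runs."""
    ids = np.array(video_ids)
    rng.shuffle(ids)
    folds = np.array_split(ids, 5)
    scores = []
    for i in range(5):
        test = folds[i]
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        scores.append(eval_fn(train, test))
    return float(np.mean(scores))

rng = np.random.default_rng(0)
# Dummy eval_fn returning a constant F-score, just to exercise the protocol.
f = five_fold_eval(list(range(25)), lambda tr, te: 0.5, rng)
assert abs(f - 0.5) < 1e-9
```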
A computer device, comprising:
a memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the dual attention mechanism based unsupervised segmented video summarization method as described above.
A computer-readable storage medium having a computer program stored thereon; the computer program is executed by a processor to implement the unsupervised segmented video summarization method based on the dual-attention mechanism as described above.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The schemes in the embodiments of the present application may be implemented in various computer languages, for example, C language, VHDL language, verilog language, object-oriented programming language Java, and transliteration scripting language JavaScript, etc.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present application without departing from the spirit or scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.
Claims (7)
1. An unsupervised segmented video summarization method based on a dual-attention mechanism, characterized by comprising the following steps:
preprocessing an original video and then segmenting it to obtain a video segment group and a shot segment group;
respectively inputting the video segment group and the shot segment group into a video summarization model, and obtaining weighted video segment features and weighted shot segment features after processing; the video summarization model comprises a video segment attention module and a shot segment attention module;
taking the weighted video segment features and the weighted shot segment features as inputs, calculating an importance score for each shot, and selecting the shots with high scores, i.e., high importance, to generate a dynamic summary;
constructing a reward function, calculating the diversity and representativeness of the dynamic summary, and training the video summarization model in an unsupervised reinforcement-learning manner;
the respectively inputting the video segment group and the shot segment group into the video summarization model and obtaining weighted video segment features and weighted shot segment features after processing comprises:
inputting the video segment group into the video segment attention module, calculating the inter-shot similarity matrix within the video segment, aggregating the intra-segment features into weighted video segment features capable of characterizing short-term temporally dependent high-level semantic information, and outputting the weighted video segment features;
inputting the shot segment group into the shot segment attention module, filtering out segments that are irrelevant or weakly correlated to the target segment through coarse-grained similarity calculation, and calculating the similarity matrix of the remaining shot segments after aggregation to obtain weighted shot segment features capable of enhancing locally correlated regional semantic information; specifically comprising: taking the shot segment group as input, and linearly mapping it with three differently weighted matrices into shot segment query features, shot segment key features, and shot segment value features; averaging the shot segment query features and key features by rows to obtain query mean features and key mean features that characterize the overall shot content, and matrix-multiplying the two to obtain a similarity matrix reflecting the correlation between different regions of the shot; filtering regions with no or low correlation out of the similarity matrix to obtain a strongly correlated region index set, and extracting the corresponding key feature set and value feature set from the shot segment key and value features according to the region index set; calculating the product of the shot segment query features and the key feature set, obtaining the segment's region-level similarity normalization matrix after linear scaling and the Softmax function, and weighting the region-level similarity matrix onto the value feature set by matrix multiplication to obtain the weighted shot segment features;
the shot segment group S is taken as input, and the shot segment query features Q = S W_q, shot segment key features K = S W_k, and shot segment value features V = S W_v are obtained through matrix mapping, where W_q, W_k, and W_v are all learnable mapping matrices; the query features Q and key features K are averaged by rows to obtain the query mean feature Q̄ and key mean feature K̄ that characterize the visual content at coarse granularity; Q̄ and K̄ are matrix-multiplied to preliminarily obtain a similarity matrix C covering the interaction between shot segments; for each row of C, the index values I of the top-k shot segments with the largest similarity are taken; according to I, the corresponding entries of the shot segment key features K and value features V are gathered into a key feature set K_g and a value feature set V_g containing only the highly important segments; the query features Q and the key feature set K_g are matrix-multiplied to compute the region-level similarity matrix between intra-shot segments, which is then output as a normalized similarity matrix through the softmax function; this matrix passes through a random-deactivation (dropout) layer and is matrix-multiplied with the value feature set V_g, giving the weighted shot segment features Z.
2. The dual-attention-mechanism-based unsupervised segmented video summarization method according to claim 1, wherein preprocessing the original video and then segmenting it to obtain a video segment group and a shot segment group comprises:
inputting an original long video, sampling and extracting a video frame sequence according to a frame rate;
inputting the video frame sequence into a feature extraction module for extraction to obtain space-time features capable of reflecting the visual content and the time sequence relationship;
and detecting visual appearance change points on the space-time characteristics by using a kernel segmentation algorithm, segmenting to obtain video segment groups, and equally dividing the space-time characteristics of each shot into non-overlapping shot segment sets.
3. The dual-attention-mechanism-based unsupervised segmented video summarization method according to claim 1, wherein inputting the video segment group into the video segment attention module, calculating the inter-shot similarity matrix within the video segment, aggregating the intra-segment features into weighted video segment features capable of characterizing short-term temporally dependent high-level semantic information, and outputting the weighted video segment features specifically comprises:
taking the video segment group as input, and linearly mapping the video segment group into video segment query features, video segment key features and video segment value features by using matrixes with three different weights;
calculating the product of the video segment query feature and the video segment key feature to obtain a segment level similarity matrix, and obtaining a segment level similarity normalization matrix of the segment after linear scaling and Softmax functions;
and weighting the segment-level similarity matrix to the video segment value characteristic by matrix multiplication to obtain the weighted video segment characteristic.
4. The dual-attention-mechanism-based unsupervised segmented video summarization method according to claim 1, wherein the reward function comprises a representativeness reward R_rep, a diversity reward R_div, and a regularization term R_reg; the reward function R is:

R = R_rep + R_div + R_reg   (1)

The representativeness reward R_rep is given by formula (2):

R_rep = exp( -(1/T) Σ_{i=1}^{T} min_{j∈Y} ||x_i - x_j||_2 )   (2)

In formula (2), T denotes the shot length of the summary result, Y denotes the set of shots selected into the dynamic summary, x_i denotes the spatio-temporal feature vector of the i-th shot, and x_j denotes the spatio-temporal feature vector of the j-th shot;

The diversity reward R_div is given by formula (3):

R_div = (1/(|Y|(|Y|-1))) Σ_{i∈Y} Σ_{j∈Y, j≠i} ( 1 - (x_i · x_j)/(||x_i||_2 ||x_j||_2) )   (3)

In formula (3), Y represents the generated dynamic summary, x_i denotes the spatio-temporal feature vector of the i-th shot, and x_j denotes the spatio-temporal feature vector of the j-th shot;

The regularization term R_reg is obtained by formula (4):

R_reg = -( (1/N) Σ_{i=1}^{N} p_i - ε )²   (4)

In formula (4), p_i denotes the importance score of the i-th shot, N denotes the number of shots, and ε is a preset selection-ratio constant.
5. The dual-attention-mechanism-based unsupervised segmented video summarization method according to claim 2, wherein the feature extraction module adopts an X3D deep convolutional neural network, and the extracted shot vector is 2048-dimensional, serving as the input spatio-temporal feature.
6. A computer device, comprising:
a memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the dual attention mechanism based unsupervised segmented video summarization method of any one of claims 1 to 5.
7. A computer-readable storage medium, characterized in that a computer program is stored thereon; the computer program is executed by a processor to implement the dual attention mechanism based unsupervised segmented video summarization method of any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311598370.5A CN117312603B (en) | 2023-11-28 | 2023-11-28 | Unsupervised segmentation video abstraction method based on double-attention mechanism |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117312603A CN117312603A (en) | 2023-12-29 |
CN117312603B true CN117312603B (en) | 2024-03-01 |
Family
ID=89281414
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100284670A1 (en) * | 2008-06-30 | 2010-11-11 | Tencent Technology (Shenzhen) Company Ltd. | Method, system, and apparatus for extracting video abstract |
CN115002559A (en) * | 2022-05-10 | 2022-09-02 | 上海大学 | Video abstraction algorithm and system based on gated multi-head position attention mechanism |
CN116662604A (en) * | 2023-06-26 | 2023-08-29 | 浙江千从科技有限公司 | Video abstraction method based on layered Transformer |
Non-Patent Citations (1)
Title |
---|
Research on Dynamic Video Summarization Methods Based on Self-Attention Networks; Yao Huimin; China Master's Theses Full-text Database, Information Science and Technology (No. 01); pp. 1-75 *
Legal Events
Date | Code | Title | Description
---|---|---|---
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||