CN111526434A - Converter-based video abstraction method - Google Patents

Converter-based video abstraction method

Info

Publication number
CN111526434A
CN111526434A (application CN202010329511.3A)
Authority
CN
China
Prior art keywords
video
frame
encoder
attention mechanism
decoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010329511.3A
Other languages
Chinese (zh)
Other versions
CN111526434B (en)
Inventor
梁国强
张艳宁
吕艳兵
李书成
吉时雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University
Priority to CN202010329511.3A
Publication of CN111526434A
Application granted
Publication of CN111526434B
Expired - Fee Related
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85 Assembly of content; Generation of multimedia applications
    • H04N21/854 Content authoring
    • H04N21/8549 Creating video summaries, e.g. movie trailer
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Computer Security & Cryptography (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a converter (Transformer)-based video summary extraction method. First, a selected data set is processed to obtain a training data set for the model. Then, a video summary converter neural network model containing a self-attention mechanism is constructed: the self-attention mechanism computes the similarity between video frames, and adding the importance score of the previous frame strengthens the model's ability to capture the global dependencies of the video frame sequence; the model is trained with the training data set. Finally, the trained model processes the video to be summarized to obtain the importance score of each frame, and the video summary is selected according to these scores. The invention captures the temporal information between video frame sequences well and predicts the importance of video frames through scoring; the network can be trained on frame sequences in a fully parallel manner, so training is fast and the resulting video summary is complete yet short.

Description

Converter-based video abstraction method
Technical Field
The invention belongs to the technical field of computer vision and deep representation learning, and particularly relates to a converter (Transformer)-based video summarization method.
Background
With the rapid development of cameras and video sharing technology, the number of videos is growing explosively. Faced with massive video data, efficiently extracting useful information from video has become an important problem. Video summarization, an important technology for addressing this problem, aims to generate a complete yet short summary video for an original video; the summary conveys the information of the original video within a short duration, and has become a research hotspot in multimedia, computer vision and related fields. Video summarization combines techniques such as machine learning and artificial intelligence, and plays an important role in video retrieval, storage, recommendation and other applications.
Currently, most video summarization methods consist of two stages: the first stage predicts the importance scores of all video frames, and the second stage uses the results of the first stage to select the key shots of the video and obtain the final summary. The first stage is the key stage of a video summarization method, most current research aims at predicting frame importance scores, and many methods perform well. For example, the document "Ke Zhang, Wei-Lun Chao, Fei Sha, et al. Video Summarization with Long Short-Term Memory [C]// European Conference on Computer Vision. Springer, Cham, 2016" uses two LSTM networks, one running front to back and one back to front, to extract the sequence information of the video frames and predict frame importance scores; the network structure is simple and can extract key sequence information, but the recurrent neural network has difficulty capturing long-term dependencies, and early sequence dependencies are easily lost when processing long videos. A further method is described in the document "Ji, Zhong, Xiong, Kailin, Pang, Yanwei, et al.".
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a converter (Transformer)-based video summarization method. An attention-based converter optimizes the information flow from the features to the decoder; the importance scores output by the decoder are weighted with the original features to predict the importance score of the next frame, which strengthens the relation between model input and output, enables fully parallelized training, and better captures global dependency information.
A converter-based video summarization method includes the following steps:
Step 1: downsampling the videos in the selected data set, and extracting the feature vector h_f ∈ R^d of each frame of the video using a pre-trained neural network, where f is the frame number, f = 1, 2, …, F, F is the total length of the downsampled video, and d is the length of the feature vector; the feature vectors of all frames of a video together with the corresponding importance scores form one sample of the training set; the selected data sets include TvSum and SumMe;
step 2: a position vector for the video frame is generated using:
PE_f(2i) = sin(f / 10000^(2i/d)),   PE_f(2i+1) = cos(f / 10000^(2i/d))

where PE_f(i) denotes the i-th element value, i = 1, 2, …, d, of the position vector of the f-th frame of the video;
then, the position vector of each frame of the video is added element by element to its feature vector, and each frame obtains a new vector x_f with the position vector added;
Step 3: constructing a video summary converter neural network model comprising an encoder and a decoder. The encoder is formed by connecting two encoder units of identical structure in sequence; each encoder unit consists, in order, of a multi-head self-attention mechanism module, a residual connection and normalization module 1, a two-layer feedforward network, and a residual connection and normalization module 2. The video frame sequence with position vectors added is input into the first encoder unit, and the second encoder unit outputs an intermediate variable Y that carries sequence information and has the same dimension as the input;
the decoder is formed by connecting two decoder units of identical structure in sequence; each decoder unit consists, in order, of a masked multi-head self-attention mechanism module, a residual connection and normalization module 1, a multi-head self-attention mechanism module, a residual connection and normalization module 2, a two-layer feedforward network, and a residual connection and normalization module 3. The decoder has two inputs: when predicting the importance score of the k-th frame, the products of the already predicted importance scores of the first k-1 video frames and their feature vectors are the input of the masked multi-head self-attention mechanism module in the first decoder unit, and the intermediate variable output by the encoder is input into the multi-head self-attention mechanism module of each decoder unit. A linear layer and a sigmoid function are connected after the second decoder unit to output the importance score prediction result of each frame;
Initializing the input of the network model specifically includes: the input of the n-th head of the multi-head self-attention mechanism module in the encoder unit is initialized as Q_n = Q_0 W_n^Q ∈ R^(F×d), K_n = K_0 W_n^K ∈ R^(F×d), V_n = V_0 W_n^V ∈ R^(F×d), where n = 1, 2, 3, 4; in the first encoder unit Q_0 = K_0 = V_0 = X, the video frame features with position vectors added obtained in step 2, and in the second encoder unit Q_0, K_0, V_0 are the output of the first encoder unit;
W_n^Q, W_n^K, W_n^V ∈ R^(d×d) are randomly generated matrices of size d × d that are learned in the training process. The inputs Q_n, K_n and V_n of the n-th head of the masked multi-head self-attention mechanism module in the decoder unit are initialized in the same way as for the multi-head self-attention mechanism module in the encoder, except that in the first decoder unit

Q_0 = K_0 = V_0 = (s_1·h_1, s_2·h_2, …, s_(k-1)·h_(k-1))

where h_f is the feature vector of the f-th frame obtained in step 1 and s_f is the predicted importance score corresponding to the f-th frame, and in the second decoder unit Q_0, K_0, V_0 are the output of the first decoder unit; the inputs Q_n, K_n and V_n of the n-th head of the multi-head self-attention mechanism module in the decoder unit are initialized in the same way as for the multi-head self-attention mechanism module in the encoder, except that K_0 = V_0 = Y and Q_0 = Z, where Y is the intermediate variable output by the encoder and Z is the variable output by the residual connection and normalization module 1 in the decoder unit;
and 4, step 4: training the neural network model of the video abstract converter constructed in the step 3 by using the training data set obtained in the step 1, and setting the loss function of the network as a mean square loss function
L = (1/F) Σ_f (s_f - s'_f)^2

where the sum runs over f = 1, 2, …, F, L denotes the network loss, and s_f and s'_f are respectively the importance score of the f-th frame of the video predicted by the model and the manually annotated importance score in the data set;
and 5: preprocessing a video data set to be processed, including segment extraction, down-sampling, feature extraction and position vector addition, to obtain feature representation of each frame; then, extracting and obtaining the importance score of each frame of video by using the network model trained in the step 4; dividing the video into a plurality of scene shots by using a KTS algorithm, selecting an important video shot as a video abstract according to the importance score of the video frame by using a knapsack algorithm, wherein the length of the selected video abstract is not more than 15% of the length of the original video.
The invention has the following beneficial effects. Because the recurrent neural network is abandoned and a multi-head self-attention mechanism is used in the encoder-decoder, associations between video frames are modeled directly; during training, a mask is applied to the input of the masked multi-head self-attention mechanism module in the decoder (the product of the manually annotated scores and the feature vectors), so training over the video frame sequence is fully parallelized and training is fast. The input at the bottom of the decoder is the product of the feature vectors and the importance scores: when predicting the importance score of the k-th frame, the products of the already predicted importance scores of the first k-1 video frames and their feature vectors are used as the input of the masked multi-head self-attention mechanism module in the first decoder unit, which links the outputs at different time steps during decoding; the output at the previous moment improves the output at the next moment, so the sequence information is more complete and the importance score prediction performance is better. Overall, the model is built entirely on the self-attention mechanism and avoids recurrent structures and excessive convolution operations, so it is simple and easy to implement; the self-attention mechanism lets the model attend to the detailed information within the sequence, and the encoder-decoder structure keeps the global sequence information more complete.
Drawings
FIG. 1 is a flow chart of a converter-based video summarization method of the present invention.
Detailed Description
The present invention will be further described with reference to the drawings and embodiments, which include, but are not limited to, the following example.
As shown in fig. 1, the present invention provides a converter-based video summarization method, which is implemented as follows:
1. data processing
Downsample the videos in the selected data set, and extract the feature vector h_f ∈ R^d of each frame of the video using a pre-trained neural network, where f is the frame number, f = 1, 2, …, F, F is the total length of the downsampled video, and d is the length of the feature vector. The feature vectors of all frames of a video together with the corresponding importance scores form one sample of the training set. The selected data sets include TvSum and SumMe, which contain several videos together with the manually annotated importance score s'_f of each frame.
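For illustration only, the following is a minimal sketch of this preprocessing step. The torchvision GoogLeNet backbone, the 2 fps sampling rate, and the helper name extract_features are assumptions made for the example; the description above only specifies downsampling and "a pre-trained neural network".

```python
# Illustrative sketch of step 1 (assumed backbone: torchvision GoogLeNet;
# assumed downsampling rate: 2 fps; neither is fixed by the description).
import cv2
import torch
import torchvision.models as models
import torchvision.transforms as T

device = "cuda" if torch.cuda.is_available() else "cpu"
backbone = models.googlenet(weights="DEFAULT").to(device).eval()
backbone.fc = torch.nn.Identity()            # keep the pooled feature as h_f

preprocess = T.Compose([
    T.ToPILImage(), T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_features(video_path, fps_out=2):
    """Return an (F, d) tensor with one feature vector h_f per sampled frame."""
    cap = cv2.VideoCapture(video_path)
    fps_in = cap.get(cv2.CAP_PROP_FPS) or 30
    step = max(int(round(fps_in / fps_out)), 1)
    feats, idx = [], 0
    with torch.no_grad():
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % step == 0:
                rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                x = preprocess(rgb).unsqueeze(0).to(device)
                feats.append(backbone(x).squeeze(0).cpu())
            idx += 1
    cap.release()
    return torch.stack(feats)                # h_1 ... h_F, each in R^d
```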
2. Adding position vectors
In order to represent the position information of each frame, a position representation vector needs to be added. A position vector for the video frame is generated using:
PE_f(2i) = sin(f / 10000^(2i/d)),   PE_f(2i+1) = cos(f / 10000^(2i/d))

where PE_f(i) denotes the i-th element value, i = 1, 2, …, d, of the position vector of the f-th frame of the video.
Then the position vector of each frame of the video is added element by element to its feature vector, and each frame obtains a new vector x_f = h_f + PE_f after the position vector is added; the resulting vectors are used as the input of the encoder of the subsequent network model.
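For concreteness, a minimal sketch of computing the position vectors and adding them to the frame features, assuming the standard sinusoidal Transformer encoding and an even feature length d:

```python
# Sinusoidal position vectors PE_f and the sum x_f = h_f + PE_f (sketch;
# d is assumed to be even, e.g. 1024).
import torch

def positional_encoding(F, d):
    """Return an (F, d) matrix whose f-th row is PE_f."""
    pe = torch.zeros(F, d)
    pos = torch.arange(F, dtype=torch.float32).unsqueeze(1)        # frame index f
    div = torch.pow(10000.0, torch.arange(0, d, 2, dtype=torch.float32) / d)
    pe[:, 0::2] = torch.sin(pos / div)                             # even elements
    pe[:, 1::2] = torch.cos(pos / div)                             # odd elements
    return pe

def add_position(H):
    """H: (F, d) frame features h_f; returns X with x_f = h_f + PE_f."""
    return H + positional_encoding(H.size(0), H.size(1))
```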
3. Constructing a video abstract converter neural network model
The present invention designs a converter model for video summarization, including an encoder and a decoder, using which importance scores of video frames are obtained.
The encoder is formed by connecting two encoder units of identical structure in sequence; each unit consists, in order, of a multi-head self-attention mechanism module, a residual connection and normalization module 1, a two-layer feedforward network, and a residual connection and normalization module 2. The video frame feature sequence with position vectors added obtained in step 2 is denoted
X = (x_1, x_2, …, x_F)

and is input into the first encoder unit; the second encoder unit finally outputs an intermediate variable Y that carries sequence information and has the same dimension as the input X. The multi-head attention mechanism is described in the document "Ashish Vaswani, Noam Shazeer, Niki Parmar, et al. Attention Is All You Need [C]// Advances in Neural Information Processing Systems, 2017".
The decoder is formed by connecting two decoder units of identical structure in sequence; each unit consists, in order, of a masked multi-head self-attention mechanism module, a residual connection and normalization module 1, a multi-head self-attention mechanism module, a residual connection and normalization module 2, a two-layer feedforward network, and a further residual connection and normalization module 3. The decoder has two inputs: when predicting the importance score of the k-th frame (k = 1, 2, …, F), the products of the already predicted importance scores of the first k-1 video frames and their feature vectors are used as the input of the masked multi-head self-attention mechanism module in the first decoder unit, and the intermediate variable output by the encoder is input into the multi-head self-attention mechanism module of each decoder unit. A linear layer and a sigmoid function are added after the last decoder unit, and the importance score prediction result of each frame is output.
The processing procedure of the encoder is as follows: first, the encoder receives X and initializes the input of the n-th (n = 1, 2, 3, 4) head of the multi-head self-attention mechanism module:
Q_n = Q_0 W_n^Q ∈ R^(F×d)
K_n = K_0 W_n^K ∈ R^(F×d)
V_n = V_0 W_n^V ∈ R^(F×d)

where Q_0 = K_0 = V_0 = X, and W_n^Q, W_n^K, W_n^V ∈ R^(d×d) are randomly generated matrices of size d × d that are learned during training. Then Q_n, K_n, V_n are processed according to the multi-head self-attention mechanism:

H_n = Attention(Q_n, K_n, V_n) = softmax(Q_n K_n^T / √d) V_n
M(Q_0, K_0, V_0) = Concat(H_1, …, H_4) W^O    (7)

where Concat is the concatenation function, W^O is a randomly generated matrix of size 4d × d that is learned during training, and M(Q_0, K_0, V_0) is the final output of the multi-head self-attention mechanism module. Residual connection and normalization are then applied. Finally, a two-layer feedforward network and a residual connection and normalization module further map the features; the resulting variables are input into the second encoder unit, which finally outputs the intermediate variable Y with sequence information and the same dimension as the input;
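A minimal PyTorch sketch of one encoder unit following the formulas above is given below. Concrete choices such as the feedforward width d_ff, the ReLU activation, and the absence of bias terms are assumptions that the description does not fix.

```python
# Sketch of the multi-head self-attention module and one encoder unit
# (4 heads of width d, Concat followed by W^O, residual connection and
# layer normalization, then a two-layer feedforward network).
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d, n_heads=4):
        super().__init__()
        self.W_q = nn.ModuleList([nn.Linear(d, d, bias=False) for _ in range(n_heads)])
        self.W_k = nn.ModuleList([nn.Linear(d, d, bias=False) for _ in range(n_heads)])
        self.W_v = nn.ModuleList([nn.Linear(d, d, bias=False) for _ in range(n_heads)])
        self.W_o = nn.Linear(n_heads * d, d, bias=False)       # W^O, size 4d x d

    def forward(self, Q0, K0, V0, mask=None):
        heads = []
        for W_q, W_k, W_v in zip(self.W_q, self.W_k, self.W_v):
            Qn, Kn, Vn = W_q(Q0), W_k(K0), W_v(V0)
            scores = Qn @ Kn.transpose(-2, -1) / math.sqrt(Qn.size(-1))
            if mask is not None:                   # used only by the decoder
                scores = scores.masked_fill(mask, float("-inf"))
            heads.append(torch.softmax(scores, dim=-1) @ Vn)   # H_n
        return self.W_o(torch.cat(heads, dim=-1))              # Concat(H_1..H_4) W^O

class EncoderUnit(nn.Module):
    def __init__(self, d, d_ff=2048):
        super().__init__()
        self.attn = MultiHeadSelfAttention(d)
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.ff = nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(), nn.Linear(d_ff, d))

    def forward(self, x):                          # x: (F, d)
        x = self.norm1(x + self.attn(x, x, x))     # residual connection + normalization 1
        return self.norm2(x + self.ff(x))          # residual connection + normalization 2
```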
the decoder processes as follows: first, when predicting the importance score of the kth frame, the product of the importance scores of the first k-1 predicted video frames and their feature vectors is used as the input of the first decoder unit, i.e.:
Q_0 = K_0 = V_0 = (s_1·h_1, s_2·h_2, …, s_(k-1)·h_(k-1))    (8)
it should be noted that the first decoder unit uses the product of all the artificial labeling scores and the feature vectors as input in the training process to implement parallelization of training, so a mask needs to be added in the self-attention mechanism module to ensure that the prediction of the importance score of the current frame only depends on the output before the frame, and the processing process of the self-attention mechanism module is the same as that in the above-mentioned encoder;
then, residual connection and normalization are applied to the output of the masked self-attention mechanism module to obtain Z, and Z together with the intermediate variable Y obtained from the encoder is input into the self-attention mechanism module of the decoder unit:
K_0 = V_0 = Y,   Q_0 = Z    (9)
then, adding and normalizing the features output by the self-attention mechanism module in the previous step and the original features, inputting the features into a two-layer feedforward network, finally performing residual connection and normalization operation again, and inputting the obtained variables into a second decoder unit;
finally, the output of the second decoder unit is processed by a linear layer and a sigmoid function to obtain the importance score prediction result of each frame.
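Continuing the sketch, one possible decoder unit is shown below, reusing the MultiHeadSelfAttention class from the encoder sketch. The causal mask realizes the masked self-attention, and the second attention module takes K_0 = V_0 = Y and Q_0 = Z; the exact handling of the score-feature products and the placement of normalization are illustrative assumptions.

```python
# Sketch of one decoder unit: masked self-attention over the products
# s_f * h_f, attention against the encoder output Y, then a two-layer
# feedforward network, each followed by residual connection and layer
# normalization.
class DecoderUnit(nn.Module):
    def __init__(self, d, d_ff=2048):
        super().__init__()
        self.masked_attn = MultiHeadSelfAttention(d)
        self.cross_attn = MultiHeadSelfAttention(d)
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(d), nn.LayerNorm(d), nn.LayerNorm(d)
        self.ff = nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(), nn.Linear(d_ff, d))

    def forward(self, s_times_h, Y):               # both (F, d)
        F_len = s_times_h.size(0)
        # True above the diagonal: positions after the current frame are hidden
        causal = torch.triu(torch.ones(F_len, F_len, dtype=torch.bool,
                                       device=s_times_h.device), diagonal=1)
        Z = self.norm1(s_times_h +
                       self.masked_attn(s_times_h, s_times_h, s_times_h, mask=causal))
        U = self.norm2(Z + self.cross_attn(Z, Y, Y))   # K_0 = V_0 = Y, Q_0 = Z
        return self.norm3(U + self.ff(U))

# During training, s_times_h is built from the manually annotated scores
# (s'_f * h_f) so the whole sequence can be processed in parallel; at
# inference the scores predicted so far are used instead.
```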
4. Training network model
Training the neural network model of the video abstract converter introduced in the step 3 by using the training data set obtained in the step 1, and setting the loss function of the network as a mean square loss function
L = (1/F) Σ_f (s_f - s'_f)^2

where the sum runs over f = 1, 2, …, F, L denotes the network loss, and s_f and s'_f are respectively the importance score of the f-th frame of the video predicted by the model and the manually annotated importance score in the data set; iterative training over multiple epochs yields the trained model.
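A minimal training-loop sketch under the mean square error loss above; the Adam optimizer, learning rate and epoch count are assumptions, and `model` stands for a network assembled from the encoder and decoder units sketched earlier that maps the position-encoded features and the score-weighted features to per-frame scores in (0, 1).

```python
# Illustrative training loop for step 4 (mean square error loss; optimizer
# and hyperparameters are assumptions, not taken from the description).
import torch
import torch.nn as nn

def train(model, dataset, epochs=50, lr=1e-4):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    mse = nn.MSELoss()
    for _ in range(epochs):
        for X, H, gt_scores in dataset:            # X: (F, d) with PE, H: (F, d), gt: (F,)
            decoder_in = gt_scores.unsqueeze(-1) * H   # s'_f * h_f (masked inside the model)
            pred = model(X, decoder_in)                # predicted scores s_f, shape (F,)
            loss = mse(pred, gt_scores)                # L = (1/F) * sum_f (s_f - s'_f)^2
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```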
5. obtaining video summary using model
Preprocess the video data to be processed, including segment extraction, downsampling, feature extraction and position vector addition, to obtain the feature representation of each frame; then use the network model trained in step 4 to obtain the importance score of each video frame. Finally, divide the video into several scene shots with the KTS algorithm, and use a knapsack algorithm to select the important video shots, i.e. the video summary, according to the frame importance scores. The length of the selected video summary cannot exceed 15% of the original video length.
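As a rough illustration of this final selection step, the sketch below picks shots by 0/1 knapsack under the 15% length budget, assuming the value of a shot is the mean frame score inside it; the KTS segmentation itself is not reproduced, and the shot boundaries are taken as given.

```python
# Shot selection by 0/1 knapsack (sketch). frame_scores: per-frame importance
# scores from the model; shot_boundaries: (start, end) index pairs from KTS,
# end exclusive; the summary length is capped at 15% of the frame count.
def select_shots(frame_scores, shot_boundaries, budget_ratio=0.15):
    lengths = [end - start for start, end in shot_boundaries]
    values = [sum(frame_scores[start:end]) / max(end - start, 1)
              for start, end in shot_boundaries]
    budget = int(len(frame_scores) * budget_ratio)

    # dp[c] = (best total value, chosen shots) using at most c frames
    dp = [(0.0, []) for _ in range(budget + 1)]
    for i in range(len(lengths)):
        for c in range(budget, lengths[i] - 1, -1):
            cand = dp[c - lengths[i]][0] + values[i]
            if cand > dp[c][0]:
                dp[c] = (cand, dp[c - lengths[i]][1] + [i])
    return dp[budget][1]                           # indices of the selected shots
```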

Claims (1)

1. A converter-based video summarization method, comprising the following steps:
Step 1: downsampling the videos in the selected data set, and extracting the feature vector h_f ∈ R^d of each frame of the video using a pre-trained neural network, where f is the frame number, f = 1, 2, …, F, F is the total length of the downsampled video, and d is the length of the feature vector; the feature vectors of all frames of a video together with the corresponding importance scores form one sample of the training set; the selected data sets include TvSum and SumMe;
step 2: a position vector for the video frame is generated using:
PE_f(2i) = sin(f / 10000^(2i/d)),   PE_f(2i+1) = cos(f / 10000^(2i/d))

where PE_f(i) denotes the i-th element value, i = 1, 2, …, d, of the position vector of the f-th frame of the video;
then, the position vector of each frame of the video is added element by element to its feature vector, and each frame obtains a new vector x_f with the position vector added;
Step 3: constructing a video summary converter neural network model comprising an encoder and a decoder. The encoder is formed by connecting two encoder units of identical structure in sequence; each encoder unit consists, in order, of a multi-head self-attention mechanism module, a residual connection and normalization module 1, a two-layer feedforward network, and a residual connection and normalization module 2. The video frame sequence with position vectors added is input into the first encoder unit, and the second encoder unit outputs an intermediate variable Y that carries sequence information and has the same dimension as the input;
the decoder is formed by connecting two decoder units of identical structure in sequence; each decoder unit consists, in order, of a masked multi-head self-attention mechanism module, a residual connection and normalization module 1, a multi-head self-attention mechanism module, a residual connection and normalization module 2, a two-layer feedforward network, and a residual connection and normalization module 3. The decoder has two inputs: when predicting the importance score of the k-th frame, the products of the already predicted importance scores of the first k-1 video frames and their feature vectors are the input of the masked multi-head self-attention mechanism module in the first decoder unit, and the intermediate variable output by the encoder is input into the multi-head self-attention mechanism module of each decoder unit. A linear layer and a sigmoid function are connected after the second decoder unit to output the importance score prediction result of each frame;
initializing the input of the network model, specifically including: the input of the nth head of the multi-head self-attention mechanism module in the encoder unit is initialized as follows:
Q_n = Q_0 W_n^Q ∈ R^(F×d),  K_n = K_0 W_n^K ∈ R^(F×d),  V_n = V_0 W_n^V ∈ R^(F×d)

where n = 1, 2, 3, 4; in the first encoder unit Q_0 = K_0 = V_0 = X, the video frame features with position vectors added obtained in step 2, and in the second encoder unit Q_0, K_0, V_0 are the output of the first encoder unit; W_n^Q, W_n^K, W_n^V ∈ R^(d×d) are randomly generated matrices of size d × d that are learned in the training process; the inputs Q_n, K_n and V_n of the n-th head of the masked multi-head self-attention mechanism module in the decoder unit are initialized in the same way as for the multi-head self-attention mechanism module in the encoder, except that in the first decoder unit

Q_0 = K_0 = V_0 = (s_1·h_1, s_2·h_2, …, s_(k-1)·h_(k-1))

where h_f is the feature vector of the f-th frame obtained in step 1 and s_f is the predicted importance score corresponding to the f-th frame, and in the second decoder unit Q_0, K_0, V_0 are the output of the first decoder unit; the inputs Q_n, K_n and V_n of the n-th head of the multi-head self-attention mechanism module in the decoder unit are initialized in the same way as for the multi-head self-attention mechanism module in the encoder, except that K_0 = V_0 = Y and Q_0 = Z, where Y is the intermediate variable output by the encoder and Z is the variable output by the residual connection and normalization module 1 in the decoder unit;
Step 4: training the video summary converter neural network model constructed in step 3 with the training data set obtained in step 1, and setting the loss function of the network to the mean square error loss
L = (1/F) Σ_f (s_f - s'_f)^2

where the sum runs over f = 1, 2, …, F, L denotes the network loss, and s_f and s'_f are respectively the importance score of the f-th frame of the video predicted by the model and the manually annotated importance score in the data set;
Step 5: preprocessing the video data to be processed, including segment extraction, downsampling, feature extraction and position vector addition, to obtain the feature representation of each frame; then obtaining the importance score of each video frame with the network model trained in step 4; dividing the video into several scene shots with the KTS algorithm and selecting the important video shots as the video summary with a knapsack algorithm according to the frame importance scores, where the length of the selected video summary does not exceed 15% of the length of the original video.
CN202010329511.3A 2020-04-24 2020-04-24 Converter-based video abstraction method Expired - Fee Related CN111526434B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010329511.3A CN111526434B (en) 2020-04-24 2020-04-24 Converter-based video abstraction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010329511.3A CN111526434B (en) 2020-04-24 2020-04-24 Converter-based video abstraction method

Publications (2)

Publication Number Publication Date
CN111526434A true CN111526434A (en) 2020-08-11
CN111526434B CN111526434B (en) 2021-05-18

Family

ID=71903775

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010329511.3A Expired - Fee Related CN111526434B (en) 2020-04-24 2020-04-24 Converter-based video abstraction method

Country Status (1)

Country Link
CN (1) CN111526434B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111986181A (en) * 2020-08-24 2020-11-24 中国科学院自动化研究所 Intravascular stent image segmentation method and system based on double-attention machine system
CN112231516A (en) * 2020-09-29 2021-01-15 北京三快在线科技有限公司 Training method of video abstract generation model, video abstract generation method and device
CN112257572A (en) * 2020-10-20 2021-01-22 神思电子技术股份有限公司 Behavior identification method based on self-attention mechanism
CN112380949A (en) * 2020-11-10 2021-02-19 大连理工大学 Microseismic wave arrival time point detection method and system
CN113438509A (en) * 2021-06-23 2021-09-24 腾讯音乐娱乐科技(深圳)有限公司 Video abstract generation method, device and storage medium
CN113657257A (en) * 2021-08-16 2021-11-16 浙江大学 End-to-end sign language translation method and system
CN115002559A (en) * 2022-05-10 2022-09-02 上海大学 Video abstraction algorithm and system based on gated multi-head position attention mechanism

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105357594A (en) * 2015-11-19 2016-02-24 南京云创大数据科技股份有限公司 Massive video abstraction generation method based on cluster and H264 video concentration algorithm
CN105530554A (en) * 2014-10-23 2016-04-27 中兴通讯股份有限公司 Video abstraction generation method and device
CN107484017A (en) * 2017-07-25 2017-12-15 天津大学 Supervision video abstraction generating method is had based on attention model
CN108427713A (en) * 2018-02-01 2018-08-21 宁波诺丁汉大学 A kind of video summarization method and system for homemade video
US10289912B1 (en) * 2015-04-29 2019-05-14 Google Llc Classifying videos using neural networks
CN109889923A (en) * 2019-02-28 2019-06-14 杭州一知智能科技有限公司 Utilize the method for combining the layering of video presentation to summarize video from attention network
CN109885728A (en) * 2019-01-16 2019-06-14 西北工业大学 Video summarization method based on meta learning
CN110110140A (en) * 2019-04-19 2019-08-09 天津大学 Video summarization method based on attention expansion coding and decoding network
CN110287374A (en) * 2019-06-14 2019-09-27 天津大学 It is a kind of based on distribution consistency from attention video summarization method

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105530554A (en) * 2014-10-23 2016-04-27 中兴通讯股份有限公司 Video abstraction generation method and device
US10289912B1 (en) * 2015-04-29 2019-05-14 Google Llc Classifying videos using neural networks
CN105357594A (en) * 2015-11-19 2016-02-24 南京云创大数据科技股份有限公司 Massive video abstraction generation method based on cluster and H264 video concentration algorithm
CN107484017A (en) * 2017-07-25 2017-12-15 天津大学 Supervision video abstraction generating method is had based on attention model
CN108427713A (en) * 2018-02-01 2018-08-21 宁波诺丁汉大学 A kind of video summarization method and system for homemade video
CN109885728A (en) * 2019-01-16 2019-06-14 西北工业大学 Video summarization method based on meta learning
CN109889923A (en) * 2019-02-28 2019-06-14 杭州一知智能科技有限公司 Utilize the method for combining the layering of video presentation to summarize video from attention network
CN110110140A (en) * 2019-04-19 2019-08-09 天津大学 Video summarization method based on attention expansion coding and decoding network
CN110287374A (en) * 2019-06-14 2019-09-27 天津大学 It is a kind of based on distribution consistency from attention video summarization method

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111986181A (en) * 2020-08-24 2020-11-24 中国科学院自动化研究所 Intravascular stent image segmentation method and system based on double-attention machine system
CN112231516A (en) * 2020-09-29 2021-01-15 北京三快在线科技有限公司 Training method of video abstract generation model, video abstract generation method and device
CN112231516B (en) * 2020-09-29 2024-02-27 北京三快在线科技有限公司 Training method of video abstract generation model, video abstract generation method and device
CN112257572B (en) * 2020-10-20 2022-02-01 神思电子技术股份有限公司 Behavior identification method based on self-attention mechanism
CN112257572A (en) * 2020-10-20 2021-01-22 神思电子技术股份有限公司 Behavior identification method based on self-attention mechanism
WO2022083335A1 (en) * 2020-10-20 2022-04-28 神思电子技术股份有限公司 Self-attention mechanism-based behavior recognition method
CN112380949A (en) * 2020-11-10 2021-02-19 大连理工大学 Microseismic wave arrival time point detection method and system
CN112380949B (en) * 2020-11-10 2024-03-26 大连理工大学 Method and system for detecting arrival time of micro-seismic waves
CN113438509A (en) * 2021-06-23 2021-09-24 腾讯音乐娱乐科技(深圳)有限公司 Video abstract generation method, device and storage medium
CN113657257A (en) * 2021-08-16 2021-11-16 浙江大学 End-to-end sign language translation method and system
CN113657257B (en) * 2021-08-16 2023-12-19 浙江大学 End-to-end sign language translation method and system
CN115002559A (en) * 2022-05-10 2022-09-02 上海大学 Video abstraction algorithm and system based on gated multi-head position attention mechanism
CN115002559B (en) * 2022-05-10 2024-01-05 上海大学 Video abstraction algorithm and system based on gating multi-head position attention mechanism

Also Published As

Publication number Publication date
CN111526434B (en) 2021-05-18

Similar Documents

Publication Publication Date Title
CN111526434B (en) Converter-based video abstraction method
CN107480261B (en) Fine-grained face image fast retrieval method based on deep learning
Wu et al. 3-D PersonVLAD: Learning deep global representations for video-based person reidentification
Wu et al. A compact dnn: approaching googlenet-level accuracy of classification and domain adaptation
Lorre et al. Temporal contrastive pretraining for video action recognition
CN111626245B (en) Human behavior identification method based on video key frame
CN111144448A (en) Video barrage emotion analysis method based on multi-scale attention convolutional coding network
CN111400494B (en) Emotion analysis method based on GCN-Attention
CN111104555A (en) Video hash retrieval method based on attention mechanism
Gao et al. Co-saliency detection with co-attention fully convolutional network
CN111242033A (en) Video feature learning method based on discriminant analysis of video and character pairs
Zhang et al. Learning implicit class knowledge for RGB-D co-salient object detection with transformers
CN110852295B (en) Video behavior recognition method based on multitasking supervised learning
CN114493014A (en) Multivariate time series prediction method, multivariate time series prediction system, computer product and storage medium
CN112115796A (en) Attention mechanism-based three-dimensional convolution micro-expression recognition algorithm
CN112200096B (en) Method, device and storage medium for realizing real-time abnormal behavior identification based on compressed video
CN114037930A (en) Video action recognition method based on space-time enhanced network
CN115022711B (en) System and method for ordering shot videos in movie scene
CN112766378A (en) Cross-domain small sample image classification model method focusing on fine-grained identification
Xu et al. Graphical modeling for multi-source domain adaptation
CN114780767A (en) Large-scale image retrieval method and system based on deep convolutional neural network
Savadi Hosseini et al. A hybrid deep learning architecture using 3d cnns and grus for human action recognition
CN110942463B (en) Video target segmentation method based on generation countermeasure network
Lee et al. Capturing long-range dependencies in video captioning
CN115797827A (en) ViT human body behavior identification method based on double-current network architecture

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee (granted publication date: 20210518)