CN109885728B - Video abstraction method based on meta-learning - Google Patents

Video abstraction method based on meta-learning

Info

Publication number
CN109885728B
CN109885728B
Authority
CN
China
Prior art keywords
video
model
theta
learner
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910037959.5A
Other languages
Chinese (zh)
Other versions
CN109885728A (en)
Inventor
李学龙 (Li Xuelong)
李红丽 (Li Hongli)
董永生 (Dong Yongsheng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN201910037959.5A priority Critical patent/CN109885728B/en
Publication of CN109885728A publication Critical patent/CN109885728A/en
Application granted granted Critical
Publication of CN109885728B publication Critical patent/CN109885728B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Landscapes

  • Image Analysis (AREA)

Abstract

The invention relates to a video abstraction method based on meta-learning. Following the idea of meta-learning, the summarization problem of each video is treated as an independent video summarization task, and a learner model learns in the space of video summarization tasks so as to improve the generalization performance of the model and explore the video summarization mechanism. Specifically, the invention uses a video-summarization Long Short-Term Memory network (vsLSTM) as the learner model. The method mainly comprises the following steps: (1) randomly dividing all tasks (the summarization problem of each video in the data set) into a training set and a test set; (2) the learner model learns across the tasks of the training set according to the two-stage learning scheme provided by the method and explores a video summarization mechanism; (3) the model is tested on the test set to complete the performance evaluation.

Description

Video abstraction method based on meta-learning
Technical Field
The invention belongs to the technical field of computer vision and also concerns a key problem of machine learning and pattern recognition. The invention summarizes a video by extracting its key frames, which reduces the time people spend browsing the video and can be applied to video retrieval, video management and other tasks.
Background
With the widespread use of devices capable of shooting video, such as mobile phones and portable cameras, a large amount of video data is generated and distributed every day. On one hand, these data provide people with rich information; on the other hand, the time consumed browsing and retrieving them is considerable. Against this background, video summarization, as a video compression technique, has received much attention from researchers in the field of computer vision.
Video summarization analyzes the temporal-spatial redundancy in video structure and content, removes redundant segments (frames) from the original video, and extracts meaningful segments (frames) in a semi-automatic or automatic manner. It can improve the efficiency of video browsing, lay a foundation for subsequent video analysis and processing, and has been widely applied to video retrieval, video management and the like. Since its emergence it has received much attention, and many representative methods have appeared. However, because different people focus on different content when browsing videos, no video summarization method so far is universal or fully satisfies users' needs; therefore, research on video summarization algorithms still has a wide space for exploration.
Because of the inherent structure and sequential nature of video data and the excellent sequence-modeling ability of Long Short-Term Memory networks (LSTM), most recent methods use LSTM as the basic model. The video-summarization LSTM (vsLSTM) and the determinantal-point-process LSTM (dppLSTM) proposed by Zhang et al. in K. Zhang, W. L. Chao, F. Sha, and K. Grauman, "Video summarization with long short-term memory," in Proc. Eur. Conf. Comput. Vis., pp. 766-782, 2016, are two typical video summarization network models built on the basic LSTM in recent years and can model temporal dependencies of different lengths in video well. The unsupervised and supervised Deep Summarization Networks (DSNs) proposed by Zhou and Qiao in K. Zhou and Y. Qiao, "Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward," arXiv:1801.00054, 2017, integrate deep reinforcement learning into the training of LSTM networks to better capture the structural characteristics of video data. The attention-based encoder-decoder video summarization architecture (AVS) proposed by Ji et al. in Z. Ji, K. Xiong, Y. Pang, and X. Li, "Video summarization with attention-based encoder-decoder networks," arXiv:1708.09545, 2017, combines an encoder that uses LSTM as the basic model with an attention-based decoder to extract video key frames.
The existing methods have the following problems:
1) much attention is paid to the structural or sequential nature of video data, not to the video summarization task itself;
2) the model is not explicitly required to explore the video summarization mechanism, and the generalization ability of the model is not good enough.
Disclosure of Invention
Technical problem to be solved
In view of the above shortcomings of existing methods, the present invention provides a video summarization method based on meta-learning. Following the idea of meta-learning, the summarization problem of each video is treated as an independent video summarization task, and the model learns in the space of video summarization tasks so that it pays more attention to the video summarization task itself; by learning in this task space, the method explicitly requires the model to explore the video summarization mechanism, so as to improve its generalization performance.
Technical scheme
A video abstraction method based on meta-learning is characterized by comprising the following steps:
step 1: preparing a data set
Using the open source video summary datasets SumMe, TVSum, Youtube and OVP: when SumMe is used as a test set, Youtube and OVP are used as training sets, and TVSum is used as a verification set; when TVSum is taken as a test set, Youtube and OVP are taken as training sets, and SumMe is taken as a verification set;
step 2: extracting video frame features
Inputting the video frames into a GoogLeNet network and taking the output of the penultimate layer of the network as the deep feature of each video frame; using a color histogram, GIST, HOG and dense SIFT as traditional features, wherein the color histogram is extracted from the RGB form of the video frame and the other traditional features are extracted from the corresponding grayscale map of the video frame;
and step 3: training video abstract model
Learning of a learner model vsLSTM network f theta parameter theta by adopting a two-stage network training algorithm based on a meta-learning thought, and randomly initializing the model parameter theta to theta before training0The ith iteration of the method is to change the model parameters from thetai-1Is updated to thetaiEach iteration in the training consists of a two-stage random gradient descent process:
the first stage is to change the parameter from thetai-1Is updated to
Figure BDA0001946541880000031
Randomly selecting a task from a training set
Figure BDA0001946541880000032
Calculating the current parameter theta of the learneri-1Performance on the task under State
Figure BDA0001946541880000033
And loss function
Figure BDA0001946541880000034
To find
Figure BDA0001946541880000035
To thetai-1Derivative and update the learner parameter thetai-1To is that
Figure BDA0001946541880000036
The learner model may then be recalculated
Figure BDA0001946541880000037
Rendering on the task and updating its parameters
Figure BDA0001946541880000038
This parameter update may be performed n times, where n is a positive integer as shown in the following equation:
Figure BDA0001946541880000039
wherein, alpha represents the learning rate,
Figure BDA00019465418800000310
and
Figure BDA00019465418800000311
is a learner model
Figure BDA00019465418800000312
And
Figure BDA00019465418800000313
at task
Figure BDA00019465418800000314
L of1Loss function in which the parameters of the learner model are each θi-1And
Figure BDA00019465418800000315
L1the loss function is defined as:
Figure BDA00019465418800000316
wherein y represents the output vector of the model, x represents the ground truth vector, and N represents the number of elements in the vector;
the second stage is to make the parameters from
Figure BDA00019465418800000317
Is updated to thetai: randomly selecting a task from a training set
Figure BDA00019465418800000318
Calculating learner's present parameters
Figure BDA00019465418800000319
Performance on the task under State
Figure BDA00019465418800000320
And loss function
Figure BDA00019465418800000321
To find
Figure BDA00019465418800000322
To thetai-1And updating the learner parameter to θiAs shown in formula (3):
Figure BDA0001946541880000041
wherein β represents the meta-learning rate, which is used as a hyper-parameter in the method of the invention;
Figure BDA0001946541880000042
is a learner model
Figure BDA0001946541880000043
At task
Figure BDA0001946541880000044
L of1A loss function in which the parameters of the learner model are
Figure BDA0001946541880000045
The two-stage training algorithm is used as a meta learner model to guide the training of the learner model vsLSTM so as to explore a video abstraction mechanism, and the parameter theta of the learner model can be obtained in multiple iterations by maximizing the generalization ability of the learner model on a test set, namely minimizing the expected generalization error of the learner model on the test set;
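As a worked special case, when n = 1 the two stages combine into a single composite update, obtained by substituting equation (1) into equation (3); this makes explicit that the meta-gradient is taken through the first-stage adaptation:

\theta_i = \theta_{i-1} - \beta \, \nabla_{\theta_{i-1}} L_{T_q}\big( f_{\,\theta_{i-1} - \alpha \nabla_{\theta_{i-1}} L_{T_s}(f_{\theta_{i-1}})} \big)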
step 4: Inputting the video frame features from step 2 into the learner model vsLSTM network trained in step 3 to obtain the probability of each frame being selected into the video summary.
The specific steps of step 4 are as follows: firstly, the video is divided into temporally disjoint segments according to the probabilities or scores output by vsLSTM; then the average of the frame scores within each segment is taken as the score of that segment, and the segments are sorted in descending order of score; the highest-scoring segments are retained in turn and, to avoid an overly long summary, retention stops when the total length of the retained segments reaches 15% of the length of the original video; the selected segments are taken as the summary of the original video.
Advantageous effects
The video abstraction method based on meta-learning provided by the invention has the following beneficial effects:
1) the concept of meta-learning is applied for the first time to solve the video abstraction problem;
2) a simple and effective video abstract model training method is provided, so that the model focuses more on the video abstract task;
3) the video summarization model is explicitly required to explore the video summarization mechanism, with the aim of improving the generalization ability of the model;
4) qualitative and quantitative experimental comparisons show that the algorithm is advanced and effective and has high practical application value.
Drawings
FIG. 1 is a conceptual overall flow diagram of the present invention
FIG. 2 is a schematic diagram of an iterative process of the training method of the present invention
FIG. 3 is a graph showing the performance of the present invention under different hyper-parameters
FIG. 4 is a visualization result chart of the present invention
Detailed Description
The invention will now be further described with reference to the following examples and drawings:
the technical scheme for realizing the invention comprises the following steps:
1) preparing a data set
The method uses the open-source video summarization datasets SumMe (M. Gygli, H. Grabner, H. Riemenschneider, and L. Van Gool, "Creating summaries from user videos," in Proc. Eur. Conf. Comput. Vis., pp. 505-520, 2014), TVSum (Y. Song, J. Vallmitjana, A. Stent, and A. Jaimes, "TVSum: Summarizing web videos using titles," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 5179-5187, 2015), Youtube (S. E. F. De Avila, A. P. B. Lopes, A. da Luz Jr., and A. de Albuquerque Araújo, "VSUMM: A mechanism designed to produce static video summaries and a novel evaluation method," Pattern Recognit. Lett., vol. 32, no. 1, pp. 56-68, 2011) and OVP (Open Video Project, http://www.open-video.org/).
to explore the generalization performance of the model, SumMe or TVSum was used as the test set, followed by three as the training and validation sets. When SumMe is used as a test set, TVSum, Youtube and OVP are used as training and verification sets; when SumMe is used as a test set, Youtube and OVP are used as training sets, and TVSum is used as a verification set; when TVSum is taken as the test set, Youtube and OVP are taken as the training set, and SumMe is taken as the verification set.
2) Extracting video frame features
The method of the invention uses two types of features, deep and traditional, to verify the effectiveness of the model. Video frames are input into a GoogLeNet network (C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich et al., "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015), and the output of the penultimate layer is taken as the deep feature of each frame. The traditional features are a color histogram, GIST, HOG (Histogram of Oriented Gradients) and dense SIFT (Scale-Invariant Feature Transform), where the color histogram is extracted from the RGB form of the video frame and the other traditional features are extracted from the corresponding grayscale map of the video frame.
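For illustration only, and not the authors' exact pipeline, the following minimal sketch assumes PyTorch/torchvision and uses the ImageNet-pretrained GoogLeNet, replacing the final classifier with an identity so that the penultimate-layer (1024-dimensional) activation is returned; the helper name extract_deep_features and the preprocessing choices are assumptions.

```python
import torch
import torchvision
from torchvision import transforms

def extract_deep_features(frames):
    """Return penultimate-layer (1024-d) GoogLeNet features for a list of PIL frames."""
    model = torchvision.models.googlenet(weights="IMAGENET1K_V1")
    model.fc = torch.nn.Identity()          # drop the classifier, keep the pooled 1024-d output
    model.eval()

    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])
    batch = torch.stack([preprocess(f) for f in frames])
    with torch.no_grad():
        feats = model(batch)                 # shape: (num_frames, 1024)
    return feats
```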
3) Training video abstract model
The method provides a two-stage network training algorithm, MetaL-VS, based on the meta-learning idea. Each iteration of training consists of a two-stage stochastic gradient descent procedure; the two-stage training algorithm serves as the meta-learner model that guides the training of the learner model, and vsLSTM is used as the learner model to explore the video summarization mechanism.
As shown in FIG. 1, the method is based on the idea of meta-learning: the summarization problem of each video is treated as an independent video summarization task, and the model learns in the space of video summarization tasks. Finally, by treating the summarization problem of a test video as a new task, the model can produce the corresponding summary of that video. Specifically, the method provides a two-stage network training algorithm based on the meta-learning idea to learn the parameter θ of the learner model f_θ (the method uses a vsLSTM network as the learner model). As shown in FIG. 2, the i-th iteration updates the model parameters from θ_{i-1} to θ_i (the model parameters are randomly initialized to θ_0 before training), and each iteration of training consists of a two-stage stochastic gradient descent process.
The first stage updates the parameters from θ_{i-1} to θ_i^n (in the illustrated case n = 2): a task T_s is randomly selected from the training set, the performance of the learner with its current parameters θ_{i-1} on this task and the loss function L_{T_s}(f_{θ_{i-1}}) are computed, the derivative of L_{T_s}(f_{θ_{i-1}}) with respect to θ_{i-1} is taken, and the learner parameters are updated from θ_{i-1} to θ_i^1; the learner model f_{θ_i^1} can then be evaluated on the task again and its parameters updated to θ_i^2. Theoretically, this parameter update can be performed n times (n a positive integer), as shown in equation (1):

\theta_i^j = \theta_i^{j-1} - \alpha \nabla_{\theta_i^{j-1}} L_{T_s}(f_{\theta_i^{j-1}}), \quad j = 1, \ldots, n, \quad \theta_i^0 = \theta_{i-1}    (1)

where α denotes the learning rate, which is used as a hyper-parameter in the method of the invention; L_{T_s}(f_{θ_{i-1}}) and L_{T_s}(f_{θ_i^{j-1}}) are the L1 losses of the learner models f_{θ_{i-1}} and f_{θ_i^{j-1}} on task T_s, i.e. the losses of the learner model with parameters θ_{i-1} and θ_i^{j-1}, respectively; the L1 loss function is defined as:

L(y, x) = \frac{1}{N} \sum_{k=1}^{N} |y_k - x_k|    (2)
wherein y represents the output vector of the model, x represents the ground truth vector, and N represents the number of elements in the vector.
The second stage updates the parameters from θ_i^n to θ_i: a task T_q is randomly selected from the training set, the performance of the learner with its current parameters θ_i^n on this task and the loss function L_{T_q}(f_{θ_i^n}) are computed, the derivative of L_{T_q}(f_{θ_i^n}) with respect to θ_{i-1} is taken, and the learner parameters are updated to θ_i, as shown in equation (3):

\theta_i = \theta_{i-1} - \beta \nabla_{\theta_{i-1}} L_{T_q}(f_{\theta_i^n})    (3)

where β denotes the meta-learning rate, which is used as a hyper-parameter in the method of the invention, and L_{T_q}(f_{θ_i^n}) is the L1 loss of the learner model f_{θ_i^n}, whose parameters are θ_i^n, on task T_q.
This two-stage training algorithm acts as the meta-learner model that guides the training of the learner model (vsLSTM) toward exploring the video summarization mechanism. By maximizing the generalization ability of the learner model on the test set (i.e. minimizing its expected generalization error on the test set), the parameter θ of the learner model is obtained over multiple iterations.
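For concreteness, the two-stage update of equations (1)-(3) can be sketched as follows. This is a minimal, first-order sketch assuming PyTorch: `model` stands for the vsLSTM-style learner, a task is a (features, ground-truth importance scores) pair, the function name `metal_vs_iteration` is illustrative, the first-order treatment of the meta-gradient is a simplification, and the default hyper-parameters follow the values reported in the experiments (lr = 0.0001, mlr = 0.001, n = 1).

```python
import copy
import random
import torch
import torch.nn.functional as F

def metal_vs_iteration(model, train_tasks, alpha=1e-4, beta=1e-3, n=1):
    """One iteration of the two-stage update (first-order sketch of Eqs. (1)-(3)).

    model       -- the learner network with parameters theta_{i-1}
    train_tasks -- list of (features, gt_scores) tensor pairs, one per training video
    """
    # Stage 1, Eq. (1): adapt a copy of the learner on a randomly chosen task T_s.
    x_s, y_s = random.choice(train_tasks)
    adapted = copy.deepcopy(model)                      # parameters start at theta_{i-1}
    inner_opt = torch.optim.SGD(adapted.parameters(), lr=alpha)
    for _ in range(n):                                  # theta_i^{j-1} -> theta_i^j
        loss = F.l1_loss(adapted(x_s), y_s)             # L1 loss, Eq. (2)
        inner_opt.zero_grad()
        loss.backward()
        inner_opt.step()

    # Stage 2, Eq. (3): evaluate the adapted learner on another task T_q and apply
    # the resulting gradient to the original parameters theta_{i-1}.
    # (First-order approximation: the gradient is taken at theta_i^n rather than
    # back-propagated through the inner updates.)
    x_q, y_q = random.choice(train_tasks)
    meta_loss = F.l1_loss(adapted(x_q), y_q)
    grads = torch.autograd.grad(meta_loss, adapted.parameters())
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p -= beta * g                               # theta_{i-1} -> theta_i
    return meta_loss.item()
```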
4) Outputting video summaries
The input of the video summarization model is the video frame features (deep or traditional), and the output is the probability of each frame being selected into the summary (the output is a vector whose elements lie between 0 and 1 and whose length equals the number of frames; each element represents the probability that the corresponding frame is selected into the video summary, and can also be understood as the importance score of that frame). This output can be converted into a summary following K. Zhang, W. L. Chao, F. Sha, and K. Grauman, "Video summarization with long short-term memory," in Proc. Eur. Conf. Comput. Vis., pp. 766-782, 2016. Inputting the features of each frame of a test video into the trained learner model and processing the output yields the video summary.
The specific steps are as follows: first, based on the probabilities or scores output by vsLSTM, the video is divided into temporally disjoint segments using Kernel Temporal Segmentation (KTS) (following K. Zhang, W. L. Chao, F. Sha, and K. Grauman, "Video summarization with long short-term memory," in Proc. Eur. Conf. Comput. Vis., pp. 766-782, 2016); then the average of the frame scores within each segment is taken as the score of that segment, and the segments are sorted in descending order of score; the highest-scoring segments are retained in turn (in order of segment score from high to low) and, to avoid an overly long summary, retention stops when the total length of the retained segments reaches 15% of the length of the original video; the selected segments are taken as the summary of the original video.
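A minimal sketch of this selection step (segment scoring plus the 15% length budget) is given below; the segment boundaries are assumed to come from a KTS-style shot detector, which is not reproduced here, and the function name `select_summary` is illustrative.

```python
import numpy as np

def select_summary(frame_scores, segment_bounds, budget_ratio=0.15):
    """Pick segments whose total length stays within budget_ratio of the video.

    frame_scores   -- 1-D array of per-frame importance scores from vsLSTM
    segment_bounds -- list of (start, end) frame indices (end exclusive), e.g. from KTS
    Returns a boolean per-frame mask marking the summary frames.
    """
    n_frames = len(frame_scores)
    # Score each segment by the mean score of its frames.
    seg_scores = [frame_scores[s:e].mean() for s, e in segment_bounds]
    # Sort segments by score, highest first.
    order = np.argsort(seg_scores)[::-1]

    keep = np.zeros(n_frames, dtype=bool)
    used = 0
    budget = budget_ratio * n_frames
    for idx in order:
        s, e = segment_bounds[idx]
        if used + (e - s) > budget:      # stop once the 15% budget would be exceeded
            break
        keep[s:e] = True
        used += e - s
    return keep
```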
1) Simulation conditions
The simulations were implemented as Python programs using Anaconda software on a machine with an Intel(R) Core(TM) i5-3470 3.2 GHz CPU, 16 GB of memory and the CentOS operating system. The datasets used in the experiments were obtained from public databases:
SumMe dataset (http://classif.ai/dataset/ethz-cvl-video-summe)
TVSum dataset (https://github.com/yalesong/tvsum)
Youtube dataset (http://www.npdi.dcc.ufmg.br/VSUMM)
OVP dataset (http://www.open-video.org)
The SumMe dataset contains 25 annotated videos, and TVSum, Youtube and OVP each contain 50 annotated videos. When training the learner model, the training set contains the ground truth, while the ground truth of the test set is hidden. When SumMe is used as the test set, 10 videos are randomly selected from TVSum as the validation set, and the remaining TVSum videos together with the Youtube and OVP videos form the training set; when TVSum is used as the test set, 25 videos are randomly selected from TVSum as the test set, the remaining TVSum videos are used as the validation set, and the other three datasets form the training set. In our experiments the test set is used to verify the effectiveness of the method. The performance evaluation index is the F-score F:
F = \frac{2 \times P \times R}{P + R}

where P denotes precision and R denotes recall:

P = \frac{|A \cap B|}{|A|}, \qquad R = \frac{|A \cap B|}{|B|}
wherein A represents the summary generated by the model, B represents the ground truth, and |A ∩ B| denotes their temporal overlap.
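A small sketch of this evaluation, assuming both the generated summary and the ground truth are given as boolean per-frame masks (the helper name `f_score` is illustrative):

```python
import numpy as np

def f_score(pred_mask, gt_mask):
    """F-score between a predicted summary (A) and a ground-truth summary (B)."""
    pred_mask = np.asarray(pred_mask, dtype=bool)
    gt_mask = np.asarray(gt_mask, dtype=bool)
    overlap = np.logical_and(pred_mask, gt_mask).sum()   # |A ∩ B|
    if pred_mask.sum() == 0 or gt_mask.sum() == 0:
        return 0.0
    precision = overlap / pred_mask.sum()                # |A ∩ B| / |A|
    recall = overlap / gt_mask.sum()                     # |A ∩ B| / |B|
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```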
2) Emulated content
(1) To show the process of searching for the hyper-parameters (learning rate lr, meta-learning rate mlr and the number n of first-stage parameter updates) that give the best performance of the method, in the first experiment we evaluated the model performance under different hyper-parameters.
FIG. 3 shows the performance of the model under different hyper-parameters. It can be seen from the figure that when lr takes 0.0001 and mlr takes 0.001, the model performs best on both datasets.
Table 1 shows the F-scores of the model on the two datasets for different values of the hyper-parameter n, with the best indicators in bold. Because of the GPU memory limit of the graphics card used in the experiments, the maximum value of n is 2; when n is greater than 2 an out-of-memory error occurs. It can be seen from the table that the model performs best on both datasets when n is 1.
TABLE 1. Performance (F-score) of the model on the two datasets for different values of the hyper-parameter n

n       1       2
SumMe   44.1%   42.5%
TVSum   58.2%   58.1%
(2) To demonstrate the effectiveness of the present algorithm, in the second experiment we compared it with typical methods of recent years. The first comparative method was proposed by Gygli et al. in 2015; see: M. Gygli, H. Grabner, and L. Van Gool, "Video summarization by learning submodular mixtures of objectives," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 3090-3098. The second comparative method is vsLSTM; see: K. Zhang, W. L. Chao, F. Sha, and K. Grauman, "Video summarization with long short-term memory," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 766-782. The third comparative method was proposed by Zhang et al. in 2016; see: K. Zhang, W. L. Chao, F. Sha, and K. Grauman, "Summary Transfer: Exemplar-based subset selection for video summarization," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 1059-1067. The fourth comparative method is SUM-GAN_sup; see: B. Mahasseni, M. Lam, and S. Todorovic, "Unsupervised video summarization with adversarial LSTM networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017. The fifth comparative method is DR-DSN_sup; see: K. Zhou and Y. Qiao, "Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward," arXiv:1801.00054, 2017. The sixth comparative method was proposed by Li et al. in 2017; see: X. Li, B. Zhao, and X. Lu, "A general framework for edited video and raw video summarization," IEEE Trans. Image Process., vol. 26, no. 8, pp. 3652-3664, 2017. Table 2 compares the quantitative index F-score, with the best indicators in bold. As can be seen from the table, the method MetaL-VS presented herein performs best in the comparison. The advancement of the invention is thus further demonstrated by comparison with recent representative methods in this field.
FIG. 4 shows visualization results of MetaL-VS, where the Air_Force_One and car_over_camera videos come from the SumMe dataset and the AwmHb44_ouw and qqR6AEXwxoQ videos come from the TVSum dataset. The blue part of each histogram is the ground truth, i.e. the manually annotated probability that each frame is a summary frame; the red part is the result of MetaL-VS; the pictures below each histogram are example frames from the MetaL-VS summary. It can be seen that, although there is some deviation, MetaL-VS selects the frames of high importance from the original video and ignores frames of insufficient importance. The effectiveness of the invention can be seen from the visualization.
TABLE 2. Comparison of the video summarization index F-score for the 7 methods

Method                  SumMe    TVSum
Gygli et al.            39.7%    -
vsLSTM                  40.7%    56.9%
Zhang et al.            40.9%    -
SUM-GAN_sup             41.7%    56.3%
DR-DSN_sup              42.1%    58.1%
Li et al.               43.1%    52.7%
MetaL-VS (invention)    44.1%    58.2%
(3) To test the robustness of MetaL-VS on traditional features, a comparative experiment on video summarization performance with traditional features was carried out against two representative methods of the last two years. The first comparative method is SUM-GAN_sup; see: B. Mahasseni, M. Lam, and S. Todorovic, "Unsupervised video summarization with adversarial LSTM networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017. The second comparative method is dppLSTM; see: K. Zhang, W. L. Chao, F. Sha, and K. Grauman, "Video summarization with long short-term memory," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 766-782. Table 3 compares the quantitative index F-score, with the best indicators in bold. As can be seen from the table, MetaL-VS achieves performance comparable to these classical methods of the last two years and, on the SumMe dataset, exceeds the two comparative methods by 4 and 2.8 percentage points, respectively. The performance of MetaL-VS on traditional features shows that the invention has a certain robustness and generalization ability with respect to traditional features.
TABLE 3. Comparison of F-score performance using traditional features

Method                  SumMe    TVSum
SUM-GAN_sup             39.5%    59.5%
dppLSTM                 40.7%    57.9%
MetaL-VS (invention)    43.5%    57.9%
The method is the first to explore meta-learning in the field of video summarization. Based on the idea of meta-learning, the video summarization model learns in the space of video summarization tasks, which helps the model pay more attention to the video summarization task itself rather than only to the structured, sequential video data, helps the model explore the video summarization mechanism, and helps improve its generalization ability. Qualitative and quantitative experimental comparisons show that the algorithm is advanced and effective.

Claims (2)

1. A video abstraction method based on meta-learning is characterized by comprising the following steps:
step 1: preparing a data set
Using the open source video summary datasets SumMe, TVSum, Youtube and OVP: when SumMe is used as a test set, Youtube and OVP are used as training sets, and TVSum is used as a verification set; when TVSum is taken as a test set, Youtube and OVP are taken as training sets, and SumMe is taken as a verification set;
step 2: extracting video frame features
Inputting the video frame into a GoogLeNet network and taking the output of the penultimate layer of the network as the deep feature of the video frame; using a color histogram, GIST, HOG and dense SIFT as traditional features, wherein the color histogram is extracted from the RGB form of the video frame and the other traditional features are extracted from the corresponding grayscale map of the video frame;
and step 3: training video abstract model
Two-stage network training algorithm based on meta-learning thought is adopted to carry out learner model vsLSTM network fθLearning parameter theta, and randomly initializing model parameter theta to theta before training0The ith iteration thereof willModel parameters are represented byi-1Is updated to thetaiEach iteration in the training consists of a two-stage random gradient descent process:
the first stage is to change the parameter from thetai-1Is updated to thetai n: randomly selecting a task from a training set
Figure FDA0003591471900000011
Calculating the current parameter theta of the learneri-1Performance on the task under State
Figure FDA0003591471900000012
And loss function
Figure FDA0003591471900000013
To find
Figure FDA0003591471900000014
To thetai-1And updating the learner parameter thetai-1To thetai lThe learner model may then be calculated again
Figure FDA0003591471900000015
Rendering on the task and updating its parameters
Figure FDA00035914719000000115
This parameter update may be performed n times, where n is a positive integer as shown in the following equation:
Figure FDA0003591471900000016
wherein, alpha represents the learning rate,
Figure FDA0003591471900000017
and
Figure FDA0003591471900000018
is a learner model
Figure FDA0003591471900000019
And
Figure FDA00035914719000000110
at task
Figure FDA00035914719000000111
L on1Loss functions in which the parameters of the learner model are each θi-1And thetai j-1;L1The loss function is defined as:
Figure FDA00035914719000000112
wherein y represents the output vector of the model, x represents the ground truth vector, and N represents the number of elements in the vector;
the second stage is to make the parameters from
Figure FDA0003591471900000029
Is updated to thetai: randomly selecting a task from a training set
Figure FDA0003591471900000021
Calculating learner's present parameters
Figure FDA00035914719000000210
Performance on the task under State
Figure FDA0003591471900000022
And loss function
Figure FDA0003591471900000023
To find
Figure FDA0003591471900000024
To thetai-1And updating the learner parameter to θiAs shown in formula (3):
Figure FDA0003591471900000025
wherein β represents the meta-learning rate, as a hyper-parameter in the above method;
Figure FDA0003591471900000026
is a learner model
Figure FDA0003591471900000027
At task
Figure FDA0003591471900000028
L of1A loss function in which the parameters of the learner model are
Figure FDA00035914719000000211
the two-stage training algorithm serves as the meta-learner model and guides the training of the learner model vsLSTM so as to explore the video summarization mechanism; by maximizing the generalization ability of the learner model on the test set, namely minimizing its expected generalization error on the test set, the parameter θ of the learner model is obtained over multiple iterations;
step 4: Inputting the video frame features from step 2 into the learner model vsLSTM network trained in step 3 to obtain the probability of each frame being selected into the video summary.
2. The video abstraction method based on meta-learning as claimed in claim 1, wherein the specific steps of step 4 are as follows: firstly, the video is divided into temporally disjoint segments according to the probabilities or scores output by vsLSTM; then the average of the frame scores within each segment is taken as the score of that segment, and the segments are sorted in descending order of score; the highest-scoring segments are retained in turn and, to avoid an overly long summary, retention stops when the total length of the retained segments reaches 15% of the length of the original video; the selected segments are taken as the summary of the original video.
CN201910037959.5A 2019-01-16 2019-01-16 Video abstraction method based on meta-learning Active CN109885728B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910037959.5A CN109885728B (en) 2019-01-16 2019-01-16 Video abstraction method based on meta-learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910037959.5A CN109885728B (en) 2019-01-16 2019-01-16 Video abstraction method based on meta-learning

Publications (2)

Publication Number Publication Date
CN109885728A CN109885728A (en) 2019-06-14
CN109885728B true CN109885728B (en) 2022-06-07

Family

ID=66926054

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910037959.5A Active CN109885728B (en) 2019-01-16 2019-01-16 Video abstraction method based on meta-learning

Country Status (1)

Country Link
CN (1) CN109885728B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111062284B (en) * 2019-12-06 2023-09-29 浙江工业大学 Visual understanding and diagnosis method for interactive video abstract model
CN111031390B (en) * 2019-12-17 2022-10-21 南京航空航天大学 Method for summarizing process video of outputting determinant point with fixed size
CN111526434B (en) * 2020-04-24 2021-05-18 西北工业大学 Converter-based video abstraction method
CN112884160B (en) * 2020-12-31 2024-03-12 北京爱笔科技有限公司 Meta learning method and related device
KR20230129724A (en) 2022-02-03 2023-09-11 인하대학교 산학협력단 Method and Apparatus for Summarization of Unsupervised Video with Efficient Keyframe Selection Reward Functions

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104239501A (en) * 2014-09-10 2014-12-24 中国电子科技集团公司第二十八研究所 Mass video semantic annotation method based on Spark
CN107103614A (en) * 2017-04-12 2017-08-29 合肥工业大学 The dyskinesia detection method encoded based on level independent element
CN107590505A (en) * 2017-08-01 2018-01-16 天津大学 The learning method of joint low-rank representation and sparse regression
CN109064493A (en) * 2018-08-01 2018-12-21 北京飞搜科技有限公司 A kind of method for tracking target and device based on meta learning
CN109213896A (en) * 2018-08-06 2019-01-15 杭州电子科技大学 Underwater video abstraction generating method based on shot and long term memory network intensified learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180357543A1 (en) * 2016-01-27 2018-12-13 Bonsai AI, Inc. Artificial intelligence system configured to measure performance of artificial intelligence over time

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104239501A (en) * 2014-09-10 2014-12-24 中国电子科技集团公司第二十八研究所 Mass video semantic annotation method based on Spark
CN107103614A (en) * 2017-04-12 2017-08-29 合肥工业大学 The dyskinesia detection method encoded based on level independent element
CN107590505A (en) * 2017-08-01 2018-01-16 天津大学 The learning method of joint low-rank representation and sparse regression
CN109064493A (en) * 2018-08-01 2018-12-21 北京飞搜科技有限公司 A kind of method for tracking target and device based on meta learning
CN109213896A (en) * 2018-08-06 2019-01-15 杭州电子科技大学 Underwater video abstraction generating method based on shot and long term memory network intensified learning

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Meta Learning for Task-Driven Video; Xuelong Li et al.; IEEE Transactions on Industrial Electronics; 2020-07-31; vol. 67, no. 7; pp. 5778-5786 *
Optimization as a Model for Few-Shot Learning; Sachin Ravi et al.; published as a conference paper at ICLR 2017; 2017-03-01; pp. 1-11 *
Unsupervised Video Summarization with Adversarial LSTM Networks; Behrooz Mahasseni et al.; 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2017-07-26; pp. 2982-2991 *
Video summarization with long short-term memory; K. Zhang et al.; Proceedings of the European Conference on Computer Vision; 2016-08-16; pp. 766-782 *
Framework for automatic selection of optimization algorithms based on meta-learning recommendation and empirical analysis; Cui Jianshuang et al.; Journal of Computer Applications (计算机应用); 2017-04-10; vol. 37, no. 4; pp. 1105-1110 *

Also Published As

Publication number Publication date
CN109885728A (en) 2019-06-14

Similar Documents

Publication Publication Date Title
CN109885728B (en) Video abstraction method based on meta-learning
Kang et al. Shakeout: A new approach to regularized deep neural network training
CN107239565B (en) Image retrieval method based on saliency region
CN106033426A (en) A latent semantic min-Hash-based image retrieval method
CN109783691B (en) Video retrieval method for deep learning and Hash coding
Tan et al. Selective dependency aggregation for action classification
CN113177141B (en) Multi-label video hash retrieval method and device based on semantic embedded soft similarity
CN110598022B (en) Image retrieval system and method based on robust deep hash network
CN111080551B (en) Multi-label image complement method based on depth convolution feature and semantic neighbor
Li et al. DAHP: Deep attention-guided hashing with pairwise labels
Song et al. Deep and fast: Deep learning hashing with semi-supervised graph construction
CN111008224A (en) Time sequence classification and retrieval method based on deep multitask representation learning
CN112651940A (en) Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network
Kim et al. Temporal attention mechanism with conditional inference for large-scale multi-label video classification
CN114596456B (en) Image set classification method based on aggregated hash learning
CN115731498A (en) Video abstract generation method combining reinforcement learning and contrast learning
CN114299362A (en) Small sample image classification method based on k-means clustering
Lin et al. Scene recognition using multiple representation network
Zeng et al. Pyramid hybrid pooling quantization for efficient fine-grained image retrieval
CN115795065A (en) Multimedia data cross-modal retrieval method and system based on weighted hash code
Xu et al. Idhashgan: deep hashing with generative adversarial nets for incomplete data retrieval
Dai et al. Multi-granularity association learning for on-the-fly fine-grained sketch-based image retrieval
Mithun et al. Generating diverse image datasets with limited labeling
CN114333062A (en) Pedestrian re-recognition model training method based on heterogeneous dual networks and feature consistency
CN114168773A (en) Semi-supervised sketch image retrieval method based on pseudo label and reordering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant