CN109885728B - Video abstraction method based on meta-learning - Google Patents

Video abstraction method based on meta-learning

Info

Publication number
CN109885728B
CN109885728B
Authority
CN
China
Prior art keywords
video
model
theta
learner
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910037959.5A
Other languages
Chinese (zh)
Other versions
CN109885728A (en)
Inventor
李学龙 (Li Xuelong)
李红丽 (Li Hongli)
董永生 (Dong Yongsheng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN201910037959.5A priority Critical patent/CN109885728B/en
Publication of CN109885728A publication Critical patent/CN109885728A/en
Application granted granted Critical
Publication of CN109885728B publication Critical patent/CN109885728B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Landscapes

  • Image Analysis (AREA)

Abstract

The invention relates to a video abstraction method based on meta-learning. Following the idea of meta-learning, the summarization problem of each video is treated as an independent video summarization task, and a learner model learns in the space of video summarization tasks so as to improve the generalization performance of the model and explore the video summarization mechanism. Specifically, the invention uses a video-summarization Long Short-Term Memory network (vsLSTM) as the learner model. The method mainly comprises the following steps: (1) randomly dividing all tasks (the summarization problem of each video in the data set) into a training set and a test set; (2) the learner model learns across the tasks of the training set according to the two-stage learning scheme provided by the method and explores a video summarization mechanism; (3) the model is tested on the test set to complete the performance evaluation.

Description

Video abstraction method based on meta-learning
Technical Field
The invention belongs to the technical field of computer vision and also concerns a key problem of machine learning and pattern recognition. The invention summarizes a video by extracting its key frames, which reduces the time people spend browsing the video and can be applied to video retrieval, video management and other tasks.
Background
With the widespread use of devices capable of shooting video, such as mobile phones and portable cameras, a large amount of video data is generated and distributed every day. On one hand, these data provide people with rich information; on the other hand, the time consumed browsing and retrieving them is considerable. Against this background, video summarization, as a video compression technique, has received much attention from researchers in the field of computer vision.
Video summarization analyzes the temporal-spatial redundancy in video structure and content, removes redundant segments (frames) from the original video, and extracts meaningful segments (frames) in a semi-automatic or automatic manner. It can improve the efficiency of video browsing, lay a foundation for subsequent video analysis and processing, and has been widely applied to video retrieval, video management and the like. Since its emergence it has received much attention, and many representative methods have appeared. However, because different people focus on different content when browsing videos, no video summarization method so far is universal or fully satisfies users' needs; therefore, research on video summarization algorithms still has a wide space for exploration.
Because of the inherent structure and sequential nature of video data and the excellent sequence-modeling ability of Long Short-Term Memory networks (LSTM), most recent methods use LSTM as the basic model. The video-summarization LSTM (vsLSTM) and the determinantal-point-process LSTM (dppLSTM) proposed by Zhang et al. in K. Zhang, W. L. Chao, F. Sha, and K. Grauman, "Video summarization with long short-term memory," in Proc. Eur. Conf. Comput. Vis., pp. 766-782, 2016, are two typical video summarization network models built on the basic LSTM in recent years and can model temporal dependencies of different lengths in video well. The unsupervised and supervised Deep Summarization Networks (DSNs) proposed by Zhou and Qiao in K. Zhou and Y. Qiao, "Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward," arXiv:1801.00054, 2017, integrate deep reinforcement learning into the training of LSTM networks to better capture the structural characteristics of video data. The attention-based encoder-decoder video summarization architecture (AVS) proposed by Ji et al. in Z. Ji, K. Xiong, Y. Pang, and X. Li, "Video summarization with attention-based encoder-decoder networks," arXiv:1708.09545, 2017, combines an encoder that uses LSTM as the basic model with an attention-based decoder to extract video key frames.
The existing methods have the following problems:
1) much attention is paid to the structural or sequential nature of video data, not to the video summarization task itself;
2) the model is not explicitly required to explore the video summarization mechanism, and the generalization ability of the model is not good enough.
Disclosure of Invention
Technical problem to be solved
In view of the above shortcomings of existing methods, the present invention provides a video summarization method based on meta-learning. Following the idea of meta-learning, the summarization problem of each video is treated as an independent video summarization task, and the model learns in the space of video summarization tasks so that it pays more attention to the video summarization task itself; by learning in this task space, the method explicitly requires the model to explore the video summarization mechanism, so as to improve its generalization performance.
Technical scheme
A video abstraction method based on meta-learning is characterized by comprising the following steps:
step 1: preparing a data set
Using the open source video summary datasets SumMe, TVSum, Youtube and OVP: when SumMe is used as a test set, Youtube and OVP are used as training sets, and TVSum is used as a verification set; when TVSum is taken as a test set, Youtube and OVP are taken as training sets, and SumMe is taken as a verification set;
step 2: extracting video frame features
Inputting the video frames into a GoogLeNet network and taking the output of the penultimate layer of the network as the deep feature of each video frame; using a color histogram, GIST, HOG and dense SIFT as traditional features, wherein the color histogram is extracted from the RGB form of the video frame and the other traditional features are extracted from the corresponding grayscale map of the video frame;
and step 3: training video abstract model
Learning of a learner model vsLSTM network f theta parameter theta by adopting a two-stage network training algorithm based on a meta-learning thought, and randomly initializing the model parameter theta to theta before training0The ith iteration of the method is to change the model parameters from thetai-1Is updated to thetaiEach iteration in the training consists of a two-stage random gradient descent process:
the first stage is to change the parameter from thetai-1Is updated to
Figure BDA0001946541880000031
Randomly selecting a task from a training set
Figure BDA0001946541880000032
Calculating the current parameter theta of the learneri-1Performance on the task under State
Figure BDA0001946541880000033
And loss function
Figure BDA0001946541880000034
To find
Figure BDA0001946541880000035
To thetai-1Derivative and update the learner parameter thetai-1To is that
Figure BDA0001946541880000036
The learner model may then be recalculated
Figure BDA0001946541880000037
Rendering on the task and updating its parameters
Figure BDA0001946541880000038
This parameter update may be performed n times, where n is a positive integer as shown in the following equation:
Figure BDA0001946541880000039
wherein, alpha represents the learning rate,
Figure BDA00019465418800000310
and
Figure BDA00019465418800000311
is a learner model
Figure BDA00019465418800000312
And
Figure BDA00019465418800000313
at task
Figure BDA00019465418800000314
L of1Loss function in which the parameters of the learner model are each θi-1And
Figure BDA00019465418800000315
L1the loss function is defined as:
Figure BDA00019465418800000316
wherein y represents the output vector of the model, x represents the ground truth vector, and N represents the number of elements in the vector;
the second stage is to make the parameters from
Figure BDA00019465418800000317
Is updated to thetai: randomly selecting a task from a training set
Figure BDA00019465418800000318
Calculating learner's present parameters
Figure BDA00019465418800000319
Performance on the task under State
Figure BDA00019465418800000320
And loss function
Figure BDA00019465418800000321
To find
Figure BDA00019465418800000322
To thetai-1And updating the learner parameter to θiAs shown in formula (3):
Figure BDA0001946541880000041
wherein β represents the meta-learning rate, which is used as a hyper-parameter in the method of the invention;
Figure BDA0001946541880000042
is a learner model
Figure BDA0001946541880000043
At task
Figure BDA0001946541880000044
L of1A loss function in which the parameters of the learner model are
Figure BDA0001946541880000045
The two-stage training algorithm is used as a meta learner model to guide the training of the learner model vsLSTM so as to explore a video abstraction mechanism, and the parameter theta of the learner model can be obtained in multiple iterations by maximizing the generalization ability of the learner model on a test set, namely minimizing the expected generalization error of the learner model on the test set;
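As a worked special case, when n = 1 the two stages combine into a single composite update, obtained by substituting equation (1) into equation (3); this makes explicit that the meta-gradient is taken through the first-stage adaptation:

\theta_i = \theta_{i-1} - \beta \, \nabla_{\theta_{i-1}} L_{T_q}\big( f_{\,\theta_{i-1} - \alpha \nabla_{\theta_{i-1}} L_{T_s}(f_{\theta_{i-1}})} \big)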
step 4: Inputting the video frame features from step 2 into the learner model vsLSTM network trained in step 3 to obtain the probability of each frame being selected into the video summary.
The specific steps of step 4 are as follows: firstly, the video is divided into temporally disjoint segments according to the probabilities or scores output by vsLSTM; then the average of the frame scores within each segment is taken as the score of that segment, and the segments are sorted in descending order of score; the highest-scoring segments are retained in turn and, to avoid an overly long summary, retention stops when the total length of the retained segments reaches 15% of the length of the original video; the selected segments are taken as the summary of the original video.
Advantageous effects
The video abstraction method based on meta-learning provided by the invention has the following beneficial effects:
1) the concept of meta-learning is applied for the first time to solve the video abstraction problem;
2) a simple and effective video abstract model training method is provided, so that the model focuses more on the video abstract task;
3) the video summarization model is explicitly required to explore the video summarization mechanism, with the aim of improving the generalization ability of the model;
4) qualitative and quantitative experimental comparisons show that the algorithm is advanced and effective and has high practical application value.
Drawings
FIG. 1 is a conceptual overall flow diagram of the present invention
FIG. 2 is a schematic diagram of an iterative process of the training method of the present invention
FIG. 3 is a graph showing the performance of the present invention under different hyper-parameters
FIG. 4 is a visualization result chart of the present invention
Detailed Description
The invention will now be further described with reference to the following examples and drawings:
the technical scheme for realizing the invention comprises the following steps:
1) preparing a data set
The method uses the open-source video summarization datasets SumMe (M. Gygli, H. Grabner, H. Riemenschneider, and L. Van Gool, "Creating summaries from user videos," in Proc. Eur. Conf. Comput. Vis., pp. 505-520, 2014), TVSum (Y. Song, J. Vallmitjana, A. Stent, and A. Jaimes, "TVSum: Summarizing web videos using titles," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 5179-5187, 2015), Youtube (S. E. F. De Avila, A. P. B. Lopes, A. da Luz Jr., and A. de Albuquerque Araújo, "VSUMM: A mechanism designed to produce static video summaries and a novel evaluation method," Pattern Recognit. Lett., vol. 32, no. 1, pp. 56-68, 2011) and OVP (Open Video Project, http://www.open-video.org/).
to explore the generalization performance of the model, SumMe or TVSum was used as the test set, followed by three as the training and validation sets. When SumMe is used as a test set, TVSum, Youtube and OVP are used as training and verification sets; when SumMe is used as a test set, Youtube and OVP are used as training sets, and TVSum is used as a verification set; when TVSum is taken as the test set, Youtube and OVP are taken as the training set, and SumMe is taken as the verification set.
2) Extracting video frame features
The method of the invention uses two types of features, deep and traditional, to verify the effectiveness of the model. Video frames are input into a GoogLeNet network (C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich et al., "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015), and the output of the penultimate layer is taken as the deep feature of each frame. The traditional features are a color histogram, GIST, HOG (Histogram of Oriented Gradients) and dense SIFT (Scale-Invariant Feature Transform), where the color histogram is extracted from the RGB form of the video frame and the other traditional features are extracted from the corresponding grayscale map of the video frame.
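For illustration only, and not the authors' exact pipeline, the following minimal sketch assumes PyTorch/torchvision and uses the ImageNet-pretrained GoogLeNet, replacing the final classifier with an identity so that the penultimate-layer (1024-dimensional) activation is returned; the helper name extract_deep_features and the preprocessing choices are assumptions.

```python
import torch
import torchvision
from torchvision import transforms

def extract_deep_features(frames):
    """Return penultimate-layer (1024-d) GoogLeNet features for a list of PIL frames."""
    model = torchvision.models.googlenet(weights="IMAGENET1K_V1")
    model.fc = torch.nn.Identity()          # drop the classifier, keep the pooled 1024-d output
    model.eval()

    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])
    batch = torch.stack([preprocess(f) for f in frames])
    with torch.no_grad():
        feats = model(batch)                 # shape: (num_frames, 1024)
    return feats
```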
3) Training video abstract model
The method provides a two-stage network training algorithm, MetaL-VS, based on the meta-learning idea. Each iteration of training consists of a two-stage stochastic gradient descent procedure; the two-stage training algorithm serves as the meta-learner model that guides the training of the learner model, and vsLSTM is used as the learner model to explore the video summarization mechanism.
As shown in FIG. 1, the method is based on the idea of meta-learning: the summarization problem of each video is treated as an independent video summarization task, and the model learns in the space of video summarization tasks. Finally, by treating the summarization problem of a test video as a new task, the model can produce the corresponding summary of that video. Specifically, the method provides a two-stage network training algorithm based on the meta-learning idea to learn the parameter θ of the learner model f_θ (the method uses a vsLSTM network as the learner model). As shown in FIG. 2, the i-th iteration updates the model parameters from θ_{i-1} to θ_i (the model parameters are randomly initialized to θ_0 before training), and each iteration of training consists of a two-stage stochastic gradient descent process.
The first stage updates the parameters from θ_{i-1} to θ_i^n (in the illustrated case n = 2): a task T_s is randomly selected from the training set, the performance of the learner with its current parameters θ_{i-1} on this task and the loss function L_{T_s}(f_{θ_{i-1}}) are computed, the derivative of L_{T_s}(f_{θ_{i-1}}) with respect to θ_{i-1} is taken, and the learner parameters are updated from θ_{i-1} to θ_i^1; the learner model f_{θ_i^1} can then be evaluated on the task again and its parameters updated to θ_i^2. Theoretically, this parameter update can be performed n times (n a positive integer), as shown in equation (1):

\theta_i^j = \theta_i^{j-1} - \alpha \nabla_{\theta_i^{j-1}} L_{T_s}(f_{\theta_i^{j-1}}), \quad j = 1, \ldots, n, \quad \theta_i^0 = \theta_{i-1}    (1)

where α denotes the learning rate, which is used as a hyper-parameter in the method of the invention; L_{T_s}(f_{θ_{i-1}}) and L_{T_s}(f_{θ_i^{j-1}}) are the L1 losses of the learner models f_{θ_{i-1}} and f_{θ_i^{j-1}} on task T_s, i.e. the losses of the learner model with parameters θ_{i-1} and θ_i^{j-1}, respectively; the L1 loss function is defined as:

L(y, x) = \frac{1}{N} \sum_{k=1}^{N} |y_k - x_k|    (2)
wherein y represents the output vector of the model, x represents the ground truth vector, and N represents the number of elements in the vector.
The second stage updates the parameters from θ_i^n to θ_i: a task T_q is randomly selected from the training set, the performance of the learner with its current parameters θ_i^n on this task and the loss function L_{T_q}(f_{θ_i^n}) are computed, the derivative of L_{T_q}(f_{θ_i^n}) with respect to θ_{i-1} is taken, and the learner parameters are updated to θ_i, as shown in equation (3):

\theta_i = \theta_{i-1} - \beta \nabla_{\theta_{i-1}} L_{T_q}(f_{\theta_i^n})    (3)

where β denotes the meta-learning rate, which is used as a hyper-parameter in the method of the invention, and L_{T_q}(f_{θ_i^n}) is the L1 loss of the learner model f_{θ_i^n}, whose parameters are θ_i^n, on task T_q.
This two-stage training algorithm acts as the meta-learner model that guides the training of the learner model (vsLSTM) toward exploring the video summarization mechanism. By maximizing the generalization ability of the learner model on the test set (i.e. minimizing its expected generalization error on the test set), the parameter θ of the learner model is obtained over multiple iterations.
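For concreteness, the two-stage update of equations (1)-(3) can be sketched as follows. This is a minimal, first-order sketch assuming PyTorch: `model` stands for the vsLSTM-style learner, a task is a (features, ground-truth importance scores) pair, the function name `metal_vs_iteration` is illustrative, the first-order treatment of the meta-gradient is a simplification, and the default hyper-parameters follow the values reported in the experiments (lr = 0.0001, mlr = 0.001, n = 1).

```python
import copy
import random
import torch
import torch.nn.functional as F

def metal_vs_iteration(model, train_tasks, alpha=1e-4, beta=1e-3, n=1):
    """One iteration of the two-stage update (first-order sketch of Eqs. (1)-(3)).

    model       -- the learner network with parameters theta_{i-1}
    train_tasks -- list of (features, gt_scores) tensor pairs, one per training video
    """
    # Stage 1, Eq. (1): adapt a copy of the learner on a randomly chosen task T_s.
    x_s, y_s = random.choice(train_tasks)
    adapted = copy.deepcopy(model)                      # parameters start at theta_{i-1}
    inner_opt = torch.optim.SGD(adapted.parameters(), lr=alpha)
    for _ in range(n):                                  # theta_i^{j-1} -> theta_i^j
        loss = F.l1_loss(adapted(x_s), y_s)             # L1 loss, Eq. (2)
        inner_opt.zero_grad()
        loss.backward()
        inner_opt.step()

    # Stage 2, Eq. (3): evaluate the adapted learner on another task T_q and apply
    # the resulting gradient to the original parameters theta_{i-1}.
    # (First-order approximation: the gradient is taken at theta_i^n rather than
    # back-propagated through the inner updates.)
    x_q, y_q = random.choice(train_tasks)
    meta_loss = F.l1_loss(adapted(x_q), y_q)
    grads = torch.autograd.grad(meta_loss, adapted.parameters())
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p -= beta * g                               # theta_{i-1} -> theta_i
    return meta_loss.item()
```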
4) Outputting video summaries
The input of the video summarization model is the video frame features (deep or traditional), and the output is the probability of each frame being selected into the summary (the output is a vector whose elements lie between 0 and 1 and whose length equals the number of frames; each element represents the probability that the corresponding frame is selected into the video summary, and can also be understood as the importance score of that frame). This output can be converted into a summary following K. Zhang, W. L. Chao, F. Sha, and K. Grauman, "Video summarization with long short-term memory," in Proc. Eur. Conf. Comput. Vis., pp. 766-782, 2016. Inputting the features of each frame of a test video into the trained learner model and processing the output yields the video summary.
The specific steps are as follows: first, based on the probabilities or scores output by vsLSTM, the video is divided into temporally disjoint segments using Kernel Temporal Segmentation (KTS) (following K. Zhang, W. L. Chao, F. Sha, and K. Grauman, "Video summarization with long short-term memory," in Proc. Eur. Conf. Comput. Vis., pp. 766-782, 2016); then the average of the frame scores within each segment is taken as the score of that segment, and the segments are sorted in descending order of score; the highest-scoring segments are retained in turn (in order of segment score from high to low) and, to avoid an overly long summary, retention stops when the total length of the retained segments reaches 15% of the length of the original video; the selected segments are taken as the summary of the original video.
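A minimal sketch of this selection step (segment scoring plus the 15% length budget) is given below; the segment boundaries are assumed to come from a KTS-style shot detector, which is not reproduced here, and the function name `select_summary` is illustrative.

```python
import numpy as np

def select_summary(frame_scores, segment_bounds, budget_ratio=0.15):
    """Pick segments whose total length stays within budget_ratio of the video.

    frame_scores   -- 1-D array of per-frame importance scores from vsLSTM
    segment_bounds -- list of (start, end) frame indices (end exclusive), e.g. from KTS
    Returns a boolean per-frame mask marking the summary frames.
    """
    n_frames = len(frame_scores)
    # Score each segment by the mean score of its frames.
    seg_scores = [frame_scores[s:e].mean() for s, e in segment_bounds]
    # Sort segments by score, highest first.
    order = np.argsort(seg_scores)[::-1]

    keep = np.zeros(n_frames, dtype=bool)
    used = 0
    budget = budget_ratio * n_frames
    for idx in order:
        s, e = segment_bounds[idx]
        if used + (e - s) > budget:      # stop once the 15% budget would be exceeded
            break
        keep[s:e] = True
        used += e - s
    return keep
```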
1) Simulation conditions
The simulations were implemented as Python programs using Anaconda software on a machine with an Intel(R) Core(TM) i5-3470 3.2 GHz CPU, 16 GB of memory and the CentOS operating system. The datasets used in the experiments were obtained from public databases:
SumMe dataset (http://classif.ai/dataset/ethz-cvl-video-summe)
TVSum dataset (https://github.com/yalesong/tvsum)
Youtube dataset (http://www.npdi.dcc.ufmg.br/VSUMM)
OVP dataset (http://www.open-video.org)
The SumMe dataset contains 25 annotated videos, and TVSum, Youtube and OVP each contain 50 annotated videos. When training the learner model, the training set contains the ground truth, while the ground truth of the test set is hidden. When SumMe is used as the test set, 10 videos are randomly selected from TVSum as the validation set, and the remaining TVSum videos together with the Youtube and OVP videos form the training set; when TVSum is used as the test set, 25 videos are randomly selected from TVSum as the test set, the remaining TVSum videos are used as the validation set, and the other three datasets form the training set. In our experiments the test set is used to verify the effectiveness of the method. The performance evaluation index is the F-score F:
F = \frac{2 \times P \times R}{P + R}

where P denotes precision and R denotes recall:

P = \frac{|A \cap B|}{|A|}, \qquad R = \frac{|A \cap B|}{|B|}
wherein A represents the summary generated by the model, B represents the ground truth, and |A ∩ B| denotes their temporal overlap.
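A small sketch of this evaluation, assuming both the generated summary and the ground truth are given as boolean per-frame masks (the helper name `f_score` is illustrative):

```python
import numpy as np

def f_score(pred_mask, gt_mask):
    """F-score between a predicted summary (A) and a ground-truth summary (B)."""
    pred_mask = np.asarray(pred_mask, dtype=bool)
    gt_mask = np.asarray(gt_mask, dtype=bool)
    overlap = np.logical_and(pred_mask, gt_mask).sum()   # |A ∩ B|
    if pred_mask.sum() == 0 or gt_mask.sum() == 0:
        return 0.0
    precision = overlap / pred_mask.sum()                # |A ∩ B| / |A|
    recall = overlap / gt_mask.sum()                     # |A ∩ B| / |B|
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```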
2) Emulated content
(1) To show the process of searching for the hyper-parameters (learning rate lr, meta-learning rate mlr and the number n of first-stage parameter updates) that give the best performance of the method, in the first experiment we evaluated the model performance under different hyper-parameters.
FIG. 3 shows the performance of the model under different hyper-parameters. It can be seen from the figure that when lr takes 0.0001 and mlr takes 0.001, the model performs best on both datasets.
Table 1 shows the F-scores of the model on the two datasets for different values of the hyper-parameter n, with the best indicators in bold. Because of the GPU memory limit of the graphics card used in the experiments, the maximum value of n is 2; when n is greater than 2 an out-of-memory error occurs. It can be seen from the table that the model performs best on both datasets when n is 1.
TABLE 1. Performance (F-score) of the model on the two datasets for different values of the hyper-parameter n

n       1       2
SumMe   44.1%   42.5%
TVSum   58.2%   58.1%
(2) To demonstrate the effectiveness of the present algorithm, in the second experiment we compared it with typical methods of recent years. The first comparative method was proposed by Gygli et al. in 2015; see: M. Gygli, H. Grabner, and L. Van Gool, "Video summarization by learning submodular mixtures of objectives," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 3090-3098. The second comparative method is vsLSTM; see: K. Zhang, W. L. Chao, F. Sha, and K. Grauman, "Video summarization with long short-term memory," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 766-782. The third comparative method was proposed by Zhang et al. in 2016; see: K. Zhang, W. L. Chao, F. Sha, and K. Grauman, "Summary Transfer: Exemplar-based subset selection for video summarization," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 1059-1067. The fourth comparative method is SUM-GAN_sup; see: B. Mahasseni, M. Lam, and S. Todorovic, "Unsupervised video summarization with adversarial LSTM networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017. The fifth comparative method is DR-DSN_sup; see: K. Zhou and Y. Qiao, "Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward," arXiv:1801.00054, 2017. The sixth comparative method was proposed by Li et al. in 2017; see: X. Li, B. Zhao, and X. Lu, "A general framework for edited video and raw video summarization," IEEE Trans. Image Process., vol. 26, no. 8, pp. 3652-3664, 2017. Table 2 compares the quantitative index F-score, with the best indicators in bold. As can be seen from the table, the method MetaL-VS presented herein performs best in the comparison. The advancement of the invention is thus further demonstrated by comparison with recent representative methods in this field.
FIG. 4 shows visualization results of MetaL-VS, where the Air_Force_One and car_over_camera videos come from the SumMe dataset and the AwmHb44_ouw and qqR6AEXwxoQ videos come from the TVSum dataset. The blue part of each histogram is the ground truth, i.e. the manually annotated probability that each frame is a summary frame; the red part is the result of MetaL-VS; the pictures below each histogram are example frames from the MetaL-VS summary. It can be seen that, although there is some deviation, MetaL-VS selects the frames of high importance from the original video and ignores frames of insufficient importance. The effectiveness of the invention can be seen from the visualization.
TABLE 2. Comparison of the video summarization index F-score for the 7 methods

Method                  SumMe    TVSum
Gygli et al.            39.7%    -
vsLSTM                  40.7%    56.9%
Zhang et al.            40.9%    -
SUM-GAN_sup             41.7%    56.3%
DR-DSN_sup              42.1%    58.1%
Li et al.               43.1%    52.7%
MetaL-VS (invention)    44.1%    58.2%
(3) To test the robustness of MetaL-VS on traditional features, a comparative experiment on video summarization performance with traditional features was carried out against two representative methods of the last two years. The first comparative method is SUM-GAN_sup; see: B. Mahasseni, M. Lam, and S. Todorovic, "Unsupervised video summarization with adversarial LSTM networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017. The second comparative method is dppLSTM; see: K. Zhang, W. L. Chao, F. Sha, and K. Grauman, "Video summarization with long short-term memory," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 766-782. Table 3 compares the quantitative index F-score, with the best indicators in bold. As can be seen from the table, MetaL-VS achieves performance comparable to these classical methods of the last two years and, on the SumMe dataset, exceeds the two comparative methods by 4 and 2.8 percentage points, respectively. The performance of MetaL-VS on traditional features shows that the invention has a certain robustness and generalization ability with respect to traditional features.
TABLE 3. Comparison of F-score performance using traditional features

Method                  SumMe    TVSum
SUM-GAN_sup             39.5%    59.5%
dppLSTM                 40.7%    57.9%
MetaL-VS (invention)    43.5%    57.9%
The method is the first to explore meta-learning in the field of video summarization. Based on the idea of meta-learning, the video summarization model learns in the space of video summarization tasks, which helps the model pay more attention to the video summarization task itself rather than only to the structured, sequential video data, helps the model explore the video summarization mechanism, and helps improve its generalization ability. Qualitative and quantitative experimental comparisons show that the algorithm is advanced and effective.

Claims (2)

1. A video abstraction method based on meta-learning is characterized by comprising the following steps:
step 1: preparing a data set
Using the open source video summary datasets SumMe, TVSum, Youtube and OVP: when SumMe is used as a test set, Youtube and OVP are used as training sets, and TVSum is used as a verification set; when TVSum is taken as a test set, Youtube and OVP are taken as training sets, and SumMe is taken as a verification set;
step 2: extracting video frame features
Inputting the video frame into a GoogLeNet network and taking the output of the penultimate layer of the network as the deep feature of the video frame; using a color histogram, GIST, HOG and dense SIFT as traditional features, wherein the color histogram is extracted from the RGB form of the video frame and the other traditional features are extracted from the corresponding grayscale map of the video frame;
and step 3: training video abstract model
Two-stage network training algorithm based on meta-learning thought is adopted to carry out learner model vsLSTM network fθLearning parameter theta, and randomly initializing model parameter theta to theta before training0The ith iteration thereof willModel parameters are represented byi-1Is updated to thetaiEach iteration in the training consists of a two-stage random gradient descent process:
the first stage is to change the parameter from thetai-1Is updated to thetai n: randomly selecting a task from a training set
Figure FDA0003591471900000011
Calculating the current parameter theta of the learneri-1Performance on the task under State
Figure FDA0003591471900000012
And loss function
Figure FDA0003591471900000013
To find
Figure FDA0003591471900000014
To thetai-1And updating the learner parameter thetai-1To thetai lThe learner model may then be calculated again
Figure FDA0003591471900000015
Rendering on the task and updating its parameters
Figure FDA00035914719000000115
This parameter update may be performed n times, where n is a positive integer as shown in the following equation:
Figure FDA0003591471900000016
wherein, alpha represents the learning rate,
Figure FDA0003591471900000017
and
Figure FDA0003591471900000018
is a learner model
Figure FDA0003591471900000019
And
Figure FDA00035914719000000110
at task
Figure FDA00035914719000000111
L on1Loss functions in which the parameters of the learner model are each θi-1And thetai j-1;L1The loss function is defined as:
Figure FDA00035914719000000112
wherein y represents the output vector of the model, x represents the ground truth vector, and N represents the number of elements in the vector;
the second stage is to make the parameters from
Figure FDA0003591471900000029
Is updated to thetai: randomly selecting a task from a training set
Figure FDA0003591471900000021
Calculating learner's present parameters
Figure FDA00035914719000000210
Performance on the task under State
Figure FDA0003591471900000022
And loss function
Figure FDA0003591471900000023
To find
Figure FDA0003591471900000024
To thetai-1And updating the learner parameter to θiAs shown in formula (3):
Figure FDA0003591471900000025
wherein β represents the meta-learning rate, as a hyper-parameter in the above method;
Figure FDA0003591471900000026
is a learner model
Figure FDA0003591471900000027
At task
Figure FDA0003591471900000028
L of1A loss function in which the parameters of the learner model are
Figure FDA00035914719000000211
the two-stage training algorithm serves as the meta-learner model and guides the training of the learner model vsLSTM so as to explore the video summarization mechanism; by maximizing the generalization ability of the learner model on the test set, namely minimizing its expected generalization error on the test set, the parameter θ of the learner model is obtained over multiple iterations;
step 4: Inputting the video frame features from step 2 into the learner model vsLSTM network trained in step 3 to obtain the probability of each frame being selected into the video summary.
2. The video abstraction method based on meta-learning as claimed in claim 1, wherein the specific steps of step 4 are as follows: firstly, the video is divided into temporally disjoint segments according to the probabilities or scores output by vsLSTM; then the average of the frame scores within each segment is taken as the score of that segment, and the segments are sorted in descending order of score; the highest-scoring segments are retained in turn and, to avoid an overly long summary, retention stops when the total length of the retained segments reaches 15% of the length of the original video; the selected segments are taken as the summary of the original video.
CN201910037959.5A 2019-01-16 2019-01-16 Video abstraction method based on meta-learning Active CN109885728B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910037959.5A CN109885728B (en) 2019-01-16 2019-01-16 Video abstraction method based on meta-learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910037959.5A CN109885728B (en) 2019-01-16 2019-01-16 Video abstraction method based on meta-learning

Publications (2)

Publication Number Publication Date
CN109885728A CN109885728A (en) 2019-06-14
CN109885728B true CN109885728B (en) 2022-06-07

Family

ID=66926054

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910037959.5A Active CN109885728B (en) 2019-01-16 2019-01-16 Video abstraction method based on meta-learning

Country Status (1)

Country Link
CN (1) CN109885728B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111062284B (en) * 2019-12-06 2023-09-29 浙江工业大学 Visual understanding and diagnosis method for interactive video abstract model
CN111031390B (en) * 2019-12-17 2022-10-21 南京航空航天大学 Method for summarizing process video of outputting determinant point with fixed size
CN111526434B (en) * 2020-04-24 2021-05-18 西北工业大学 Converter-based video abstraction method
CN112884160B (en) * 2020-12-31 2024-03-12 北京爱笔科技有限公司 Meta learning method and related device
KR20230129724A (en) 2022-02-03 2023-09-11 인하대학교 산학협력단 Method and Apparatus for Summarization of Unsupervised Video with Efficient Keyframe Selection Reward Functions

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104239501A (en) * 2014-09-10 2014-12-24 中国电子科技集团公司第二十八研究所 Mass video semantic annotation method based on Spark
CN107103614A (en) * 2017-04-12 2017-08-29 合肥工业大学 The dyskinesia detection method encoded based on level independent element
CN107590505A (en) * 2017-08-01 2018-01-16 天津大学 The learning method of joint low-rank representation and sparse regression
CN109064493A (en) * 2018-08-01 2018-12-21 北京飞搜科技有限公司 A kind of method for tracking target and device based on meta learning
CN109213896A (en) * 2018-08-06 2019-01-15 杭州电子科技大学 Underwater video abstraction generating method based on shot and long term memory network intensified learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180357543A1 (en) * 2016-01-27 2018-12-13 Bonsai AI, Inc. Artificial intelligence system configured to measure performance of artificial intelligence over time

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104239501A (en) * 2014-09-10 2014-12-24 中国电子科技集团公司第二十八研究所 Mass video semantic annotation method based on Spark
CN107103614A (en) * 2017-04-12 2017-08-29 合肥工业大学 The dyskinesia detection method encoded based on level independent element
CN107590505A (en) * 2017-08-01 2018-01-16 天津大学 The learning method of joint low-rank representation and sparse regression
CN109064493A (en) * 2018-08-01 2018-12-21 北京飞搜科技有限公司 A kind of method for tracking target and device based on meta learning
CN109213896A (en) * 2018-08-06 2019-01-15 杭州电子科技大学 Underwater video abstraction generating method based on shot and long term memory network intensified learning

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Meta Learning for Task-Driven Video; Xuelong Li et al.; IEEE Transactions on Industrial Electronics; 2020-07-31; vol. 67, no. 7; pp. 5778-5786 *
Optimization as a Model for Few-Shot Learning; Sachin Ravi et al.; published as a conference paper at ICLR 2017; 2017-03-01; pp. 1-11 *
Unsupervised Video Summarization with Adversarial LSTM Networks; Behrooz Mahasseni et al.; 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2017-07-26; pp. 2982-2991 *
Video summarization with long short-term memory; K. Zhang et al.; Proceedings of the European Conference on Computer Vision; 2016-08-16; pp. 766-782 *
Framework for automatic selection of optimization algorithms based on meta-learning recommendation and empirical analysis; Cui Jianshuang et al.; Journal of Computer Applications (计算机应用); 2017-04-10; vol. 37, no. 4; pp. 1105-1110 *

Also Published As

Publication number Publication date
CN109885728A (en) 2019-06-14

Similar Documents

Publication Publication Date Title
CN109885728B (en) Video abstraction method based on meta-learning
Kang et al. Shakeout: A new approach to regularized deep neural network training
CN107239565B (en) Image retrieval method based on saliency region
CN106033426A (en) A latent semantic min-Hash-based image retrieval method
CN109783691B (en) Video retrieval method for deep learning and Hash coding
Tan et al. Selective dependency aggregation for action classification
CN113177141B (en) Multi-label video hash retrieval method and device based on semantic embedded soft similarity
CN110598022B (en) Image retrieval system and method based on robust deep hash network
CN111080551B (en) Multi-label image complement method based on depth convolution feature and semantic neighbor
Li et al. DAHP: Deep attention-guided hashing with pairwise labels
Song et al. Deep and fast: Deep learning hashing with semi-supervised graph construction
CN111008224A (en) Time sequence classification and retrieval method based on deep multitask representation learning
CN112651940A (en) Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network
Kim et al. Temporal attention mechanism with conditional inference for large-scale multi-label video classification
CN114596456B (en) Image set classification method based on aggregated hash learning
CN115731498A (en) Video abstract generation method combining reinforcement learning and contrast learning
CN114299362A (en) Small sample image classification method based on k-means clustering
Lin et al. Scene recognition using multiple representation network
Zeng et al. Pyramid hybrid pooling quantization for efficient fine-grained image retrieval
CN115795065A (en) Multimedia data cross-modal retrieval method and system based on weighted hash code
Xu et al. Idhashgan: deep hashing with generative adversarial nets for incomplete data retrieval
Dai et al. Multi-granularity association learning for on-the-fly fine-grained sketch-based image retrieval
Mithun et al. Generating diverse image datasets with limited labeling
CN114333062A (en) Pedestrian re-recognition model training method based on heterogeneous dual networks and feature consistency
CN114168773A (en) Semi-supervised sketch image retrieval method based on pseudo label and reordering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant