CN111526434A - Converter-based video abstraction method - Google Patents

Converter-based video abstraction method

Info

Publication number
CN111526434A
CN111526434A (application CN202010329511.3A)
Authority
CN
China
Prior art keywords
video
frame
encoder
attention mechanism
decoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010329511.3A
Other languages
Chinese (zh)
Other versions
CN111526434B (en)
Inventor
梁国强
张艳宁
吕艳兵
李书成
吉时雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University
Priority to CN202010329511.3A
Publication of CN111526434A
Application granted
Publication of CN111526434B
Expired - Fee Related
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85 Assembly of content; Generation of multimedia applications
    • H04N21/854 Content authoring
    • H04N21/8549 Creating video summaries, e.g. movie trailer
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Computer Security & Cryptography (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a converter (Transformer)-based video summary extraction method. First, a selected data set is processed to obtain a training data set for the model. Then, a video summary converter neural network model containing a self-attention mechanism is constructed: the self-attention mechanism computes the similarity between video frames, and adding the importance score of the previous frame strengthens the model's ability to capture the global dependencies of the video frame sequence; the model is trained with the training data set. Finally, the trained model processes the video to be summarized to obtain the importance score of each frame, and the video summary is selected according to these scores. The invention captures the temporal information between video frame sequences well and predicts the importance of video frames through scoring; the network can be trained on frame sequences in a fully parallel manner, so training is fast and the resulting video summary is complete yet short.

Description

Converter-based video abstraction method
Technical Field
The invention belongs to the technical field of computer vision and deep representation learning, and particularly relates to a converter (Transformer)-based video summarization method.
Background
With the rapid development of cameras and video sharing technology, the number of videos is growing explosively. Faced with massive video data, efficiently extracting useful information from video has become an important problem. Video summarization, an important technology for addressing this problem, aims to generate a complete yet short summary video for an original video; the summary conveys the information of the original video within a short duration, and has become a research hotspot in multimedia, computer vision and related fields. Video summarization combines techniques such as machine learning and artificial intelligence, and plays an important role in video retrieval, storage, recommendation and other applications.
Currently, most video summarization methods consist of two stages: the first stage predicts the importance scores of all video frames, and the second stage uses the results of the first stage to select the key shots of the video and obtain the final summary. The first stage is the key stage of a video summarization method, most current research aims at predicting frame importance scores, and many methods perform well. For example, the document "Ke Zhang, Wei-Lun Chao, Fei Sha, et al. Video Summarization with Long Short-Term Memory [C]// European Conference on Computer Vision. Springer, Cham, 2016" uses two LSTM networks, one running front to back and one back to front, to extract the sequence information of the video frames and predict frame importance scores; the network structure is simple and can extract key sequence information, but the recurrent neural network has difficulty capturing long-term dependencies, and early sequence dependencies are easily lost when processing long videos. A further method is described in the document "Ji, Zhong, Xiong, Kailin, Pang, Yanwei, et al.".
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a converter (Transformer)-based video summarization method. An attention-based converter optimizes the information flow from the features to the decoder; the importance scores output by the decoder are weighted with the original features to predict the importance score of the next frame, which strengthens the relation between model input and output, enables fully parallelized training, and better captures global dependency information.
A converter-based video summarization method includes the following steps:
Step 1: downsampling the videos in the selected data set, and extracting the feature vector h_f ∈ R^d of each frame of the video using a pre-trained neural network, where f is the frame number, f = 1, 2, …, F, F is the total length of the downsampled video, and d is the length of the feature vector; the feature vectors of all frames of a video together with the corresponding importance scores form one sample of the training set; the selected data sets include TvSum and SumMe;
step 2: a position vector for the video frame is generated using:
PE_f(2i) = sin(f / 10000^(2i/d)),   PE_f(2i+1) = cos(f / 10000^(2i/d))

where PE_f(i) denotes the i-th element value, i = 1, 2, …, d, of the position vector of the f-th frame of the video;
then, the position vector of each frame of the video is added element by element to its feature vector, and each frame obtains a new vector x_f with the position vector added;
Step 3: constructing a video summary converter neural network model comprising an encoder and a decoder. The encoder is formed by connecting two encoder units of identical structure in sequence; each encoder unit consists, in order, of a multi-head self-attention mechanism module, a residual connection and normalization module 1, a two-layer feedforward network, and a residual connection and normalization module 2. The video frame sequence with position vectors added is input into the first encoder unit, and the second encoder unit outputs an intermediate variable Y that carries sequence information and has the same dimension as the input;
the decoder is formed by connecting two decoder units of identical structure in sequence; each decoder unit consists, in order, of a masked multi-head self-attention mechanism module, a residual connection and normalization module 1, a multi-head self-attention mechanism module, a residual connection and normalization module 2, a two-layer feedforward network, and a residual connection and normalization module 3. The decoder has two inputs: when predicting the importance score of the k-th frame, the products of the already predicted importance scores of the first k-1 video frames and their feature vectors are the input of the masked multi-head self-attention mechanism module in the first decoder unit, and the intermediate variable output by the encoder is input into the multi-head self-attention mechanism module of each decoder unit. A linear layer and a sigmoid function are connected after the second decoder unit to output the importance score prediction result of each frame;
Initializing the input of the network model specifically includes: the input of the n-th head of the multi-head self-attention mechanism module in the encoder unit is initialized as Q_n = Q_0 W_n^Q ∈ R^(F×d), K_n = K_0 W_n^K ∈ R^(F×d), V_n = V_0 W_n^V ∈ R^(F×d), where n = 1, 2, 3, 4; in the first encoder unit Q_0 = K_0 = V_0 = X, the video frame features with position vectors added obtained in step 2, and in the second encoder unit Q_0, K_0, V_0 are the output of the first encoder unit;
W_n^Q, W_n^K, W_n^V ∈ R^(d×d) are randomly generated matrices of size d × d that are learned in the training process. The inputs Q_n, K_n and V_n of the n-th head of the masked multi-head self-attention mechanism module in the decoder unit are initialized in the same way as for the multi-head self-attention mechanism module in the encoder, except that in the first decoder unit

Q_0 = K_0 = V_0 = (s_1·h_1, s_2·h_2, …, s_(k-1)·h_(k-1))

where h_f is the feature vector of the f-th frame obtained in step 1 and s_f is the predicted importance score corresponding to the f-th frame, and in the second decoder unit Q_0, K_0, V_0 are the output of the first decoder unit; the inputs Q_n, K_n and V_n of the n-th head of the multi-head self-attention mechanism module in the decoder unit are initialized in the same way as for the multi-head self-attention mechanism module in the encoder, except that K_0 = V_0 = Y and Q_0 = Z, where Y is the intermediate variable output by the encoder and Z is the variable output by the residual connection and normalization module 1 in the decoder unit;
and 4, step 4: training the neural network model of the video abstract converter constructed in the step 3 by using the training data set obtained in the step 1, and setting the loss function of the network as a mean square loss function
L = (1/F) Σ_f (s_f - s'_f)^2

where the sum runs over f = 1, 2, …, F, L denotes the network loss, and s_f and s'_f are respectively the importance score of the f-th frame of the video predicted by the model and the manually annotated importance score in the data set;
and 5: preprocessing a video data set to be processed, including segment extraction, down-sampling, feature extraction and position vector addition, to obtain feature representation of each frame; then, extracting and obtaining the importance score of each frame of video by using the network model trained in the step 4; dividing the video into a plurality of scene shots by using a KTS algorithm, selecting an important video shot as a video abstract according to the importance score of the video frame by using a knapsack algorithm, wherein the length of the selected video abstract is not more than 15% of the length of the original video.
The invention has the following beneficial effects. Because the recurrent neural network is abandoned and a multi-head self-attention mechanism is used in the encoder-decoder, associations between video frames are modeled directly; during training, a mask is applied to the input of the masked multi-head self-attention mechanism module in the decoder (the product of the manually annotated scores and the feature vectors), so training over the video frame sequence is fully parallelized and training is fast. The input at the bottom of the decoder is the product of the feature vectors and the importance scores: when predicting the importance score of the k-th frame, the products of the already predicted importance scores of the first k-1 video frames and their feature vectors are used as the input of the masked multi-head self-attention mechanism module in the first decoder unit, which links the outputs at different time steps during decoding; the output at the previous moment improves the output at the next moment, so the sequence information is more complete and the importance score prediction performance is better. Overall, the model is built entirely on the self-attention mechanism and avoids recurrent structures and excessive convolution operations, so it is simple and easy to implement; the self-attention mechanism lets the model attend to the detailed information within the sequence, and the encoder-decoder structure keeps the global sequence information more complete.
Drawings
FIG. 1 is a flow chart of a converter-based video summarization method of the present invention.
Detailed Description
The present invention will be further described with reference to the drawings and embodiments, which include, but are not limited to, the following example.
As shown in fig. 1, the present invention provides a converter-based video summarization method, which is implemented as follows:
1. data processing
Downsample the videos in the selected data set, and extract the feature vector h_f ∈ R^d of each frame of the video using a pre-trained neural network, where f is the frame number, f = 1, 2, …, F, F is the total length of the downsampled video, and d is the length of the feature vector. The feature vectors of all frames of a video together with the corresponding importance scores form one sample of the training set. The selected data sets include TvSum and SumMe, which contain several videos together with the manually annotated importance score s'_f of each frame.
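For illustration only, the following is a minimal sketch of this preprocessing step. The torchvision GoogLeNet backbone, the 2 fps sampling rate, and the helper name extract_features are assumptions made for the example; the description above only specifies downsampling and "a pre-trained neural network".

```python
# Illustrative sketch of step 1 (assumed backbone: torchvision GoogLeNet;
# assumed downsampling rate: 2 fps; neither is fixed by the description).
import cv2
import torch
import torchvision.models as models
import torchvision.transforms as T

device = "cuda" if torch.cuda.is_available() else "cpu"
backbone = models.googlenet(weights="DEFAULT").to(device).eval()
backbone.fc = torch.nn.Identity()            # keep the pooled feature as h_f

preprocess = T.Compose([
    T.ToPILImage(), T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_features(video_path, fps_out=2):
    """Return an (F, d) tensor with one feature vector h_f per sampled frame."""
    cap = cv2.VideoCapture(video_path)
    fps_in = cap.get(cv2.CAP_PROP_FPS) or 30
    step = max(int(round(fps_in / fps_out)), 1)
    feats, idx = [], 0
    with torch.no_grad():
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % step == 0:
                rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                x = preprocess(rgb).unsqueeze(0).to(device)
                feats.append(backbone(x).squeeze(0).cpu())
            idx += 1
    cap.release()
    return torch.stack(feats)                # h_1 ... h_F, each in R^d
```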
2. Adding position vectors
In order to represent the position information of each frame, a position representation vector needs to be added. A position vector for the video frame is generated using:
PE_f(2i) = sin(f / 10000^(2i/d)),   PE_f(2i+1) = cos(f / 10000^(2i/d))

where PE_f(i) denotes the i-th element value, i = 1, 2, …, d, of the position vector of the f-th frame of the video.
Then the position vector of each frame of the video is added element by element to its feature vector, and each frame obtains a new vector x_f = h_f + PE_f after the position vector is added; the resulting vectors are used as the input of the encoder of the subsequent network model.
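For concreteness, a minimal sketch of computing the position vectors and adding them to the frame features, assuming the standard sinusoidal Transformer encoding and an even feature length d:

```python
# Sinusoidal position vectors PE_f and the sum x_f = h_f + PE_f (sketch;
# d is assumed to be even, e.g. 1024).
import torch

def positional_encoding(F, d):
    """Return an (F, d) matrix whose f-th row is PE_f."""
    pe = torch.zeros(F, d)
    pos = torch.arange(F, dtype=torch.float32).unsqueeze(1)        # frame index f
    div = torch.pow(10000.0, torch.arange(0, d, 2, dtype=torch.float32) / d)
    pe[:, 0::2] = torch.sin(pos / div)                             # even elements
    pe[:, 1::2] = torch.cos(pos / div)                             # odd elements
    return pe

def add_position(H):
    """H: (F, d) frame features h_f; returns X with x_f = h_f + PE_f."""
    return H + positional_encoding(H.size(0), H.size(1))
```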
3. Constructing a video abstract converter neural network model
The present invention designs a converter model for video summarization, including an encoder and a decoder, using which importance scores of video frames are obtained.
The encoder is formed by connecting two encoder units of identical structure in sequence; each unit consists, in order, of a multi-head self-attention mechanism module, a residual connection and normalization module 1, a two-layer feedforward network, and a residual connection and normalization module 2. The video frame feature sequence with position vectors added obtained in step 2 is denoted
X = (x_1, x_2, …, x_F)

and is input into the first encoder unit; the second encoder unit finally outputs an intermediate variable Y that carries sequence information and has the same dimension as the input X. The multi-head attention mechanism is described in the document "Ashish Vaswani, Noam Shazeer, Niki Parmar, et al. Attention Is All You Need [C]// Advances in Neural Information Processing Systems, 2017".
The decoder is formed by connecting two decoder units of identical structure in sequence; each unit consists, in order, of a masked multi-head self-attention mechanism module, a residual connection and normalization module 1, a multi-head self-attention mechanism module, a residual connection and normalization module 2, a two-layer feedforward network, and a further residual connection and normalization module 3. The decoder has two inputs: when predicting the importance score of the k-th frame (k = 1, 2, …, F), the products of the already predicted importance scores of the first k-1 video frames and their feature vectors are used as the input of the masked multi-head self-attention mechanism module in the first decoder unit, and the intermediate variable output by the encoder is input into the multi-head self-attention mechanism module of each decoder unit. A linear layer and a sigmoid function are added after the last decoder unit, and the importance score prediction result of each frame is output.
The processing procedure of the encoder is as follows: first, the encoder receives X and initializes the input of the n-th (n = 1, 2, 3, 4) head of the multi-head self-attention mechanism module:
Q_n = Q_0 W_n^Q ∈ R^(F×d)
K_n = K_0 W_n^K ∈ R^(F×d)
V_n = V_0 W_n^V ∈ R^(F×d)

where Q_0 = K_0 = V_0 = X, and W_n^Q, W_n^K, W_n^V ∈ R^(d×d) are randomly generated matrices of size d × d that are learned during training. Then Q_n, K_n, V_n are processed according to the multi-head self-attention mechanism:

H_n = Attention(Q_n, K_n, V_n) = softmax(Q_n K_n^T / √d) V_n
M(Q_0, K_0, V_0) = Concat(H_1, …, H_4) W^O    (7)

where Concat is the concatenation function, W^O is a randomly generated matrix of size 4d × d that is learned during training, and M(Q_0, K_0, V_0) is the final output of the multi-head self-attention mechanism module. Residual connection and normalization are then applied. Finally, a two-layer feedforward network and a residual connection and normalization module further map the features; the resulting variables are input into the second encoder unit, which finally outputs the intermediate variable Y with sequence information and the same dimension as the input;
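A minimal PyTorch sketch of one encoder unit following the formulas above is given below. Concrete choices such as the feedforward width d_ff, the ReLU activation, and the absence of bias terms are assumptions that the description does not fix.

```python
# Sketch of the multi-head self-attention module and one encoder unit
# (4 heads of width d, Concat followed by W^O, residual connection and
# layer normalization, then a two-layer feedforward network).
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d, n_heads=4):
        super().__init__()
        self.W_q = nn.ModuleList([nn.Linear(d, d, bias=False) for _ in range(n_heads)])
        self.W_k = nn.ModuleList([nn.Linear(d, d, bias=False) for _ in range(n_heads)])
        self.W_v = nn.ModuleList([nn.Linear(d, d, bias=False) for _ in range(n_heads)])
        self.W_o = nn.Linear(n_heads * d, d, bias=False)       # W^O, size 4d x d

    def forward(self, Q0, K0, V0, mask=None):
        heads = []
        for W_q, W_k, W_v in zip(self.W_q, self.W_k, self.W_v):
            Qn, Kn, Vn = W_q(Q0), W_k(K0), W_v(V0)
            scores = Qn @ Kn.transpose(-2, -1) / math.sqrt(Qn.size(-1))
            if mask is not None:                   # used only by the decoder
                scores = scores.masked_fill(mask, float("-inf"))
            heads.append(torch.softmax(scores, dim=-1) @ Vn)   # H_n
        return self.W_o(torch.cat(heads, dim=-1))              # Concat(H_1..H_4) W^O

class EncoderUnit(nn.Module):
    def __init__(self, d, d_ff=2048):
        super().__init__()
        self.attn = MultiHeadSelfAttention(d)
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.ff = nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(), nn.Linear(d_ff, d))

    def forward(self, x):                          # x: (F, d)
        x = self.norm1(x + self.attn(x, x, x))     # residual connection + normalization 1
        return self.norm2(x + self.ff(x))          # residual connection + normalization 2
```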
the decoder processes as follows: first, when predicting the importance score of the kth frame, the product of the importance scores of the first k-1 predicted video frames and their feature vectors is used as the input of the first decoder unit, i.e.:
Q_0 = K_0 = V_0 = (s_1·h_1, s_2·h_2, …, s_(k-1)·h_(k-1))    (8)
it should be noted that the first decoder unit uses the product of all the artificial labeling scores and the feature vectors as input in the training process to implement parallelization of training, so a mask needs to be added in the self-attention mechanism module to ensure that the prediction of the importance score of the current frame only depends on the output before the frame, and the processing process of the self-attention mechanism module is the same as that in the above-mentioned encoder;
then, residual connection and normalization are applied to the output of the masked self-attention mechanism module to obtain Z, and Z together with the intermediate variable Y obtained from the encoder is input into the self-attention mechanism module of the decoder unit:
K_0 = V_0 = Y,   Q_0 = Z    (9)
then, adding and normalizing the features output by the self-attention mechanism module in the previous step and the original features, inputting the features into a two-layer feedforward network, finally performing residual connection and normalization operation again, and inputting the obtained variables into a second decoder unit;
finally, the output of the second decoder unit is processed by a linear layer and a sigmoid function to obtain the importance score prediction result of each frame.
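Continuing the sketch, one possible decoder unit is shown below, reusing the MultiHeadSelfAttention class from the encoder sketch. The causal mask realizes the masked self-attention, and the second attention module takes K_0 = V_0 = Y and Q_0 = Z; the exact handling of the score-feature products and the placement of normalization are illustrative assumptions.

```python
# Sketch of one decoder unit: masked self-attention over the products
# s_f * h_f, attention against the encoder output Y, then a two-layer
# feedforward network, each followed by residual connection and layer
# normalization.
class DecoderUnit(nn.Module):
    def __init__(self, d, d_ff=2048):
        super().__init__()
        self.masked_attn = MultiHeadSelfAttention(d)
        self.cross_attn = MultiHeadSelfAttention(d)
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(d), nn.LayerNorm(d), nn.LayerNorm(d)
        self.ff = nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(), nn.Linear(d_ff, d))

    def forward(self, s_times_h, Y):               # both (F, d)
        F_len = s_times_h.size(0)
        # True above the diagonal: positions after the current frame are hidden
        causal = torch.triu(torch.ones(F_len, F_len, dtype=torch.bool,
                                       device=s_times_h.device), diagonal=1)
        Z = self.norm1(s_times_h +
                       self.masked_attn(s_times_h, s_times_h, s_times_h, mask=causal))
        U = self.norm2(Z + self.cross_attn(Z, Y, Y))   # K_0 = V_0 = Y, Q_0 = Z
        return self.norm3(U + self.ff(U))

# During training, s_times_h is built from the manually annotated scores
# (s'_f * h_f) so the whole sequence can be processed in parallel; at
# inference the scores predicted so far are used instead.
```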
4. Training network model
Training the neural network model of the video abstract converter introduced in the step 3 by using the training data set obtained in the step 1, and setting the loss function of the network as a mean square loss function
L = (1/F) Σ_f (s_f - s'_f)^2

where the sum runs over f = 1, 2, …, F, L denotes the network loss, and s_f and s'_f are respectively the importance score of the f-th frame of the video predicted by the model and the manually annotated importance score in the data set; iterative training over multiple epochs yields the trained model.
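A minimal training-loop sketch under the mean square error loss above; the Adam optimizer, learning rate and epoch count are assumptions, and `model` stands for a network assembled from the encoder and decoder units sketched earlier that maps the position-encoded features and the score-weighted features to per-frame scores in (0, 1).

```python
# Illustrative training loop for step 4 (mean square error loss; optimizer
# and hyperparameters are assumptions, not taken from the description).
import torch
import torch.nn as nn

def train(model, dataset, epochs=50, lr=1e-4):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    mse = nn.MSELoss()
    for _ in range(epochs):
        for X, H, gt_scores in dataset:            # X: (F, d) with PE, H: (F, d), gt: (F,)
            decoder_in = gt_scores.unsqueeze(-1) * H   # s'_f * h_f (masked inside the model)
            pred = model(X, decoder_in)                # predicted scores s_f, shape (F,)
            loss = mse(pred, gt_scores)                # L = (1/F) * sum_f (s_f - s'_f)^2
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```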
5. obtaining video summary using model
Preprocess the video data to be processed, including segment extraction, downsampling, feature extraction and position vector addition, to obtain the feature representation of each frame; then use the network model trained in step 4 to obtain the importance score of each video frame. Finally, divide the video into several scene shots with the KTS algorithm, and use a knapsack algorithm to select the important video shots, i.e. the video summary, according to the frame importance scores. The length of the selected video summary cannot exceed 15% of the original video length.
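As a rough illustration of this final selection step, the sketch below picks shots by 0/1 knapsack under the 15% length budget, assuming the value of a shot is the mean frame score inside it; the KTS segmentation itself is not reproduced, and the shot boundaries are taken as given.

```python
# Shot selection by 0/1 knapsack (sketch). frame_scores: per-frame importance
# scores from the model; shot_boundaries: (start, end) index pairs from KTS,
# end exclusive; the summary length is capped at 15% of the frame count.
def select_shots(frame_scores, shot_boundaries, budget_ratio=0.15):
    lengths = [end - start for start, end in shot_boundaries]
    values = [sum(frame_scores[start:end]) / max(end - start, 1)
              for start, end in shot_boundaries]
    budget = int(len(frame_scores) * budget_ratio)

    # dp[c] = (best total value, chosen shots) using at most c frames
    dp = [(0.0, []) for _ in range(budget + 1)]
    for i in range(len(lengths)):
        for c in range(budget, lengths[i] - 1, -1):
            cand = dp[c - lengths[i]][0] + values[i]
            if cand > dp[c][0]:
                dp[c] = (cand, dp[c - lengths[i]][1] + [i])
    return dp[budget][1]                           # indices of the selected shots
```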

Claims (1)

1. A converter-based video summarization method, comprising the following steps:
Step 1: downsampling the videos in the selected data set, and extracting the feature vector h_f ∈ R^d of each frame of the video using a pre-trained neural network, where f is the frame number, f = 1, 2, …, F, F is the total length of the downsampled video, and d is the length of the feature vector; the feature vectors of all frames of a video together with the corresponding importance scores form one sample of the training set; the selected data sets include TvSum and SumMe;
step 2: a position vector for the video frame is generated using:
PE_f(2i) = sin(f / 10000^(2i/d)),   PE_f(2i+1) = cos(f / 10000^(2i/d))

where PE_f(i) denotes the i-th element value, i = 1, 2, …, d, of the position vector of the f-th frame of the video;
then, the position vector of each frame of the video is added element by element to its feature vector, and each frame obtains a new vector x_f with the position vector added;
Step 3: constructing a video summary converter neural network model comprising an encoder and a decoder. The encoder is formed by connecting two encoder units of identical structure in sequence; each encoder unit consists, in order, of a multi-head self-attention mechanism module, a residual connection and normalization module 1, a two-layer feedforward network, and a residual connection and normalization module 2. The video frame sequence with position vectors added is input into the first encoder unit, and the second encoder unit outputs an intermediate variable Y that carries sequence information and has the same dimension as the input;
the decoder is formed by connecting two decoder units of identical structure in sequence; each decoder unit consists, in order, of a masked multi-head self-attention mechanism module, a residual connection and normalization module 1, a multi-head self-attention mechanism module, a residual connection and normalization module 2, a two-layer feedforward network, and a residual connection and normalization module 3. The decoder has two inputs: when predicting the importance score of the k-th frame, the products of the already predicted importance scores of the first k-1 video frames and their feature vectors are the input of the masked multi-head self-attention mechanism module in the first decoder unit, and the intermediate variable output by the encoder is input into the multi-head self-attention mechanism module of each decoder unit. A linear layer and a sigmoid function are connected after the second decoder unit to output the importance score prediction result of each frame;
initializing the input of the network model, specifically including: the input of the nth head of the multi-head self-attention mechanism module in the encoder unit is initialized as follows:
Q_n = Q_0 W_n^Q ∈ R^(F×d),  K_n = K_0 W_n^K ∈ R^(F×d),  V_n = V_0 W_n^V ∈ R^(F×d)

where n = 1, 2, 3, 4; in the first encoder unit Q_0 = K_0 = V_0 = X, the video frame features with position vectors added obtained in step 2, and in the second encoder unit Q_0, K_0, V_0 are the output of the first encoder unit; W_n^Q, W_n^K, W_n^V ∈ R^(d×d) are randomly generated matrices of size d × d that are learned in the training process; the inputs Q_n, K_n and V_n of the n-th head of the masked multi-head self-attention mechanism module in the decoder unit are initialized in the same way as for the multi-head self-attention mechanism module in the encoder, except that in the first decoder unit

Q_0 = K_0 = V_0 = (s_1·h_1, s_2·h_2, …, s_(k-1)·h_(k-1))

where h_f is the feature vector of the f-th frame obtained in step 1 and s_f is the predicted importance score corresponding to the f-th frame, and in the second decoder unit Q_0, K_0, V_0 are the output of the first decoder unit; the inputs Q_n, K_n and V_n of the n-th head of the multi-head self-attention mechanism module in the decoder unit are initialized in the same way as for the multi-head self-attention mechanism module in the encoder, except that K_0 = V_0 = Y and Q_0 = Z, where Y is the intermediate variable output by the encoder and Z is the variable output by the residual connection and normalization module 1 in the decoder unit;
Step 4: training the video summary converter neural network model constructed in step 3 with the training data set obtained in step 1, and setting the loss function of the network to the mean square error loss
L = (1/F) Σ_f (s_f - s'_f)^2

where the sum runs over f = 1, 2, …, F, L denotes the network loss, and s_f and s'_f are respectively the importance score of the f-th frame of the video predicted by the model and the manually annotated importance score in the data set;
Step 5: preprocessing the video data to be processed, including segment extraction, downsampling, feature extraction and position vector addition, to obtain the feature representation of each frame; then obtaining the importance score of each video frame with the network model trained in step 4; dividing the video into several scene shots with the KTS algorithm and selecting the important video shots as the video summary with a knapsack algorithm according to the frame importance scores, where the length of the selected video summary does not exceed 15% of the length of the original video.
CN202010329511.3A 2020-04-24 2020-04-24 Converter-based video abstraction method Expired - Fee Related CN111526434B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010329511.3A CN111526434B (en) 2020-04-24 2020-04-24 Converter-based video abstraction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010329511.3A CN111526434B (en) 2020-04-24 2020-04-24 Converter-based video abstraction method

Publications (2)

Publication Number Publication Date
CN111526434A true CN111526434A (en) 2020-08-11
CN111526434B CN111526434B (en) 2021-05-18

Family

ID=71903775

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010329511.3A Expired - Fee Related CN111526434B (en) 2020-04-24 2020-04-24 Converter-based video abstraction method

Country Status (1)

Country Link
CN (1) CN111526434B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111986181A (en) * 2020-08-24 2020-11-24 中国科学院自动化研究所 Intravascular stent image segmentation method and system based on double-attention machine system
CN112231516A (en) * 2020-09-29 2021-01-15 北京三快在线科技有限公司 Training method of video abstract generation model, video abstract generation method and device
CN112257572A (en) * 2020-10-20 2021-01-22 神思电子技术股份有限公司 Behavior identification method based on self-attention mechanism
CN112380949A (en) * 2020-11-10 2021-02-19 大连理工大学 Microseismic wave arrival time point detection method and system
CN113438509A (en) * 2021-06-23 2021-09-24 腾讯音乐娱乐科技(深圳)有限公司 Video abstract generation method, device and storage medium
CN113657257A (en) * 2021-08-16 2021-11-16 浙江大学 End-to-end sign language translation method and system
CN115002559A (en) * 2022-05-10 2022-09-02 上海大学 Video abstraction algorithm and system based on gated multi-head position attention mechanism

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105357594A (en) * 2015-11-19 2016-02-24 南京云创大数据科技股份有限公司 Massive video abstraction generation method based on cluster and H264 video concentration algorithm
CN105530554A (en) * 2014-10-23 2016-04-27 中兴通讯股份有限公司 Video abstraction generation method and device
CN107484017A (en) * 2017-07-25 2017-12-15 天津大学 Supervision video abstraction generating method is had based on attention model
CN108427713A (en) * 2018-02-01 2018-08-21 宁波诺丁汉大学 A kind of video summarization method and system for homemade video
US10289912B1 (en) * 2015-04-29 2019-05-14 Google Llc Classifying videos using neural networks
CN109889923A (en) * 2019-02-28 2019-06-14 杭州一知智能科技有限公司 Utilize the method for combining the layering of video presentation to summarize video from attention network
CN109885728A (en) * 2019-01-16 2019-06-14 西北工业大学 Video summarization method based on meta learning
CN110110140A (en) * 2019-04-19 2019-08-09 天津大学 Video summarization method based on attention expansion coding and decoding network
CN110287374A (en) * 2019-06-14 2019-09-27 天津大学 It is a kind of based on distribution consistency from attention video summarization method

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105530554A (en) * 2014-10-23 2016-04-27 中兴通讯股份有限公司 Video abstraction generation method and device
US10289912B1 (en) * 2015-04-29 2019-05-14 Google Llc Classifying videos using neural networks
CN105357594A (en) * 2015-11-19 2016-02-24 南京云创大数据科技股份有限公司 Massive video abstraction generation method based on cluster and H264 video concentration algorithm
CN107484017A (en) * 2017-07-25 2017-12-15 天津大学 Supervision video abstraction generating method is had based on attention model
CN108427713A (en) * 2018-02-01 2018-08-21 宁波诺丁汉大学 A kind of video summarization method and system for homemade video
CN109885728A (en) * 2019-01-16 2019-06-14 西北工业大学 Video summarization method based on meta learning
CN109889923A (en) * 2019-02-28 2019-06-14 杭州一知智能科技有限公司 Utilize the method for combining the layering of video presentation to summarize video from attention network
CN110110140A (en) * 2019-04-19 2019-08-09 天津大学 Video summarization method based on attention expansion coding and decoding network
CN110287374A (en) * 2019-06-14 2019-09-27 天津大学 It is a kind of based on distribution consistency from attention video summarization method

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111986181A (en) * 2020-08-24 2020-11-24 中国科学院自动化研究所 Intravascular stent image segmentation method and system based on double-attention machine system
CN112231516A (en) * 2020-09-29 2021-01-15 北京三快在线科技有限公司 Training method of video abstract generation model, video abstract generation method and device
CN112231516B (en) * 2020-09-29 2024-02-27 北京三快在线科技有限公司 Training method of video abstract generation model, video abstract generation method and device
CN112257572B (en) * 2020-10-20 2022-02-01 神思电子技术股份有限公司 Behavior identification method based on self-attention mechanism
CN112257572A (en) * 2020-10-20 2021-01-22 神思电子技术股份有限公司 Behavior identification method based on self-attention mechanism
WO2022083335A1 (en) * 2020-10-20 2022-04-28 神思电子技术股份有限公司 Self-attention mechanism-based behavior recognition method
CN112380949A (en) * 2020-11-10 2021-02-19 大连理工大学 Microseismic wave arrival time point detection method and system
CN112380949B (en) * 2020-11-10 2024-03-26 大连理工大学 Method and system for detecting arrival time of micro-seismic waves
CN113438509A (en) * 2021-06-23 2021-09-24 腾讯音乐娱乐科技(深圳)有限公司 Video abstract generation method, device and storage medium
CN113657257A (en) * 2021-08-16 2021-11-16 浙江大学 End-to-end sign language translation method and system
CN113657257B (en) * 2021-08-16 2023-12-19 浙江大学 End-to-end sign language translation method and system
CN115002559A (en) * 2022-05-10 2022-09-02 上海大学 Video abstraction algorithm and system based on gated multi-head position attention mechanism
CN115002559B (en) * 2022-05-10 2024-01-05 上海大学 Video abstraction algorithm and system based on gating multi-head position attention mechanism

Also Published As

Publication number Publication date
CN111526434B (en) 2021-05-18

Similar Documents

Publication Publication Date Title
CN111526434B (en) Converter-based video abstraction method
CN107480261B (en) Fine-grained face image fast retrieval method based on deep learning
Wu et al. 3-D PersonVLAD: Learning deep global representations for video-based person reidentification
Wu et al. A compact dnn: approaching googlenet-level accuracy of classification and domain adaptation
Lorre et al. Temporal contrastive pretraining for video action recognition
CN111626245B (en) Human behavior identification method based on video key frame
CN111144448A (en) Video barrage emotion analysis method based on multi-scale attention convolutional coding network
CN111400494B (en) Emotion analysis method based on GCN-Attention
CN111104555A (en) Video hash retrieval method based on attention mechanism
Gao et al. Co-saliency detection with co-attention fully convolutional network
CN111242033A (en) Video feature learning method based on discriminant analysis of video and character pairs
Zhang et al. Learning implicit class knowledge for RGB-D co-salient object detection with transformers
CN110852295B (en) Video behavior recognition method based on multitasking supervised learning
CN114493014A (en) Multivariate time series prediction method, multivariate time series prediction system, computer product and storage medium
CN112115796A (en) Attention mechanism-based three-dimensional convolution micro-expression recognition algorithm
CN112200096B (en) Method, device and storage medium for realizing real-time abnormal behavior identification based on compressed video
CN114037930A (en) Video action recognition method based on space-time enhanced network
CN115022711B (en) System and method for ordering shot videos in movie scene
CN112766378A (en) Cross-domain small sample image classification model method focusing on fine-grained identification
Xu et al. Graphical modeling for multi-source domain adaptation
CN114780767A (en) Large-scale image retrieval method and system based on deep convolutional neural network
Savadi Hosseini et al. A hybrid deep learning architecture using 3d cnns and grus for human action recognition
CN110942463B (en) Video target segmentation method based on generation countermeasure network
Lee et al. Capturing long-range dependencies in video captioning
CN115797827A (en) ViT human body behavior identification method based on double-current network architecture

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee (granted publication date: 20210518)