CN110929092A - Multi-event video description method based on dynamic attention mechanism - Google Patents

Multi-event video description method based on dynamic attention mechanism

Info

Publication number
CN110929092A
CN110929092A CN201911136308.8A CN201911136308A CN 110929092 A
Authority
CN
China
Prior art keywords
event
layer
video
prediction
description
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911136308.8A
Other languages
Chinese (zh)
Other versions
CN110929092B (en)
Inventor
谢洪平
刘迪
诸雅琴
黄涛
陈勇
杜长青
吴威
王昊
林东阳
陈喆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinmao New Energy Group Co Ltd
Jiangsu Electric Power Engineering Consulting Co Ltd
Southeast University
State Grid Jiangsu Electric Power Co Ltd
Original Assignee
Jinmao New Energy Group Co Ltd
Jiangsu Electric Power Engineering Consulting Co Ltd
Southeast University
State Grid Jiangsu Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinmao New Energy Group Co Ltd, Jiangsu Electric Power Engineering Consulting Co Ltd, Southeast University, State Grid Jiangsu Electric Power Co Ltd filed Critical Jinmao New Energy Group Co Ltd
Priority to CN201911136308.8A priority Critical patent/CN110929092B/en
Publication of CN110929092A publication Critical patent/CN110929092A/en
Application granted granted Critical
Publication of CN110929092B publication Critical patent/CN110929092B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/71Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/75Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention discloses a multi-event video description method based on a dynamic attention mechanism, which comprises the following steps: inputting the video sequence into a three-dimensional convolutional neural network to extract the visual features of the video; encoding the visual features with an attention-based video encoding layer and inputting the feature encodings into an event prediction layer; predicting each event with the event prediction layer according to the video encoding information; and generating, with the event description layer, the text description of each event from the visual features of that event obtained according to the event prediction result, dynamically combined with the context information of the event description layer. The method overcomes the poor parallelism and low efficiency of existing multi-event video description methods, ensures the accuracy of the generated video descriptions, and allows the model to be trained in an end-to-end manner.

Description

Multi-event video description method based on dynamic attention mechanism
Technical Field
The invention relates to a multi-event video description method based on a dynamic attention mechanism, and belongs to the field of video description in computer vision.
Background
Video tagging is a technique for analyzing video content and producing classification tags; video tags can effectively extract the key information of a video and are widely used in video storage and retrieval. However, a video tag cannot represent the more detailed information of a video. Video description (video captioning) is the process of automatically generating a natural-language description of a video by computer. It not only extracts the key elements of the video but also expresses the associations among those elements through sentences, so it has important application value and development prospects in video storage and retrieval, human-computer interaction, knowledge extraction and other fields.
Unlike image description (image captioning), video contains a great deal of constantly changing spatio-temporal information, and how to efficiently acquire useful information to describe a video accurately is a major challenge in computer vision. The S2VT (Sequence to Sequence - Video to Text) algorithm proposed by Venugopalan et al. was the first successful application of deep learning in video description. It extracts 2D convolutional features and optical-flow features of the video and feeds them into a two-layer stacked LSTM network to generate the description, laying the foundation for video description algorithms based on the encoder-decoder architecture. The video description field now has many research results, but most of them are improvements on the S2VT algorithm, such as extracting video features with a 3D CNN, using multi-modal fused features, or decoding with an improved GRU network.
A long video may contain multiple events, and the single sentence generated by traditional video description methods is too coarse to describe part of the information; to solve this problem, dense video captioning was proposed. Dense video captioning was introduced by Z. Shen et al. in the article "Weakly Supervised Dense Video Captioning": for a video, different region sequences are extracted, and a sentence description is generated for each region sequence, which is the prototype of the event proposal - caption generation architecture commonly adopted by dense video captioning today. Compared with traditional video description algorithms, the per-region-sequence descriptions are more refined and richer in information, opening up a brand-new research direction.
Recent research on dense video captioning has mainly focused on efficiently extracting and representing the information in a video and on improving the accuracy of event prediction. For the first problem, attention mechanisms (e.g., "Describing Videos by Exploiting Temporal Structure") replace the original average-pooling method for generating the video representation and largely solve the loss of temporal information during encoding. Wang et al. ("Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning") point out that most methods extract only the backward context information of the video sequence during encoding and ignore the forward context, so the event prediction method cannot distinguish highly overlapping events. They therefore propose a bidirectional video encoding method that uses two LSTM networks to encode the forward and backward context of the video and performs event prediction on the fused context information, improving the accuracy of event prediction.
However, existing dense video description generation methods still have problems: when decoding, most methods simply concatenate the context features and the visual features to obtain the decoder input, so the generated descriptions are not accurate. Meanwhile, the widely adopted LSTM video encoder parallelizes poorly. An efficient dense video description generation method is therefore needed that can quickly and accurately locate and describe the events in a video.
Disclosure of Invention
The invention provides a multi-event video description method based on a dynamic attention mechanism, aiming to solve the poor parallelism and low accuracy of existing dense video description generation algorithms and to accurately locate and describe the events in a video. To achieve this purpose, the technical scheme provided by the invention is as follows: a multi-event video description method based on a dynamic attention mechanism, characterized by comprising the following steps:
step one, extracting visual characteristics V of a target video sequence X by adopting a convolutional neural network;
step two, inputting the visual features V of the video into an L-layer self-attention video encoding layer to obtain the video encodings F_i;
step three, using the event prediction layer to generate event predictions Φ_i from the video encodings F_i, and selecting the prediction of the layer with the highest prediction confidence as the final prediction result Φ_k;
Step four, predicting the result based on the event prediction layer
Figure BDA0002279689890000021
Generating a mask for event j
Figure BDA0002279689890000022
Intercepting visual features of event j using a mask
Figure BDA0002279689890000023
The sequence is as follows:
Figure BDA0002279689890000024
wherein ⊙ denotes the matrix elements multiplied one by one;
obtaining visual feature vector C of event j by average poolingj
Figure BDA0002279689890000025
Wherein
Figure BDA0002279689890000026
n is the length of the characteristic sequence;
fusing the visual characteristic vector and the context vector H of the event to obtain an adjusted final characteristic vector
Figure BDA0002279689890000027
Suppose a description S of an event jjFrom TsA word is formed, i.e.
Figure BDA0002279689890000028
The encoder generates a word w as a time period, SjGeneration of (2) requires TsA period of time, then
Figure BDA0002279689890000029
Combining visual and contextual characteristics h of an eventt-1Mapping to the same feature space:
Figure BDA00022796898900000210
h′t-1=tanh(Wcht-1),Wvand WcIs a mapping matrix of visual features and contextual features,
Figure BDA00022796898900000211
context feature ht-1Is the hidden state of the LSTM unit at the last instant. h istBy the feature vector E of the currently input wordtInputting visual feature vectors
Figure BDA00022796898900000212
Hidden state h of previous momentt-1Jointly determining:
Figure BDA0002279689890000031
wherein Et=E[wt-1]The amount of the solvent to be used is, in particular,
Figure BDA0002279689890000032
E0=E[<BOS>];
threshold value for computing context feature
Figure BDA0002279689890000033
EtFor the input word w of the decoder at time tt-1The embedded vector of (2);
fusing visual features and contextual features by adopting a threshold mechanism:
Figure BDA0002279689890000034
final feature representation of event j
Figure BDA0002279689890000035
Will be of event jFinal feature representation
Figure BDA0002279689890000036
Input LSTM decoder decodes to obtain description S of event jj
The encoding of the video in step two comprises the following steps:

The visual feature V is taken as the input of the first encoder layer, whose output is F_1 = E(V); every other encoder layer takes the output of the previous layer as input, and its encoded output is F_{l+1} = E(F_l).

Each encoder layer comprises a multi-head attention layer and a point-wise feed-forward layer.

The multi-head attention layer is computed as:

Ω(F_l) = LN(MultiHead(F_l, F_l, F_l), F_l)

The point-wise feed-forward layer is computed as:

E(F_l) = LN(FF(Ω(F_l)), Ω(F_l))
FF(x) = W_2 · max(0, W_1 x + b_1) + b_2

where LN(p, q) = LayerNorm(p + q) denotes layer normalization applied to the residual sum, FF(·) is a two-layer feed-forward neural network whose first layer uses a ReLU activation, W_1 and W_2 are the weight matrices of the network, and b_1 and b_2 are the bias terms. The definition of Ω(·) uses the self-attention mechanism: during the encoding of step t, f_t^l serves as the query to the attention layer, and the resulting output is a weighted sum of f_i^l (i = 1, 2, …, T).
The specific method by which the event prediction layer in step three generates event predictions from the video encoding F_i is as follows:
Step 3.1, first input the video encoding F_i into the base layer of the event prediction layer;
step 3.2, inputting the output characteristics of the basic layer into an anchor layer of the event prediction layer, and gradually reducing the time dimension of the characteristics;
and 3.3, inputting the output of each anchor layer into a prediction layer, and generating a set of fixed event predictions at one time.
The prediction p_j^i of the j-th event at the i-th layer in step three is computed as follows:

the boundary of the event is calculated by

c_j = c_a + w_a · Δc_j
w_j = w_a · exp(Δw_j)
t_j^start = c_j − w_j / 2
t_j^end = c_j + w_j / 2

where c_a and w_a denote the center position and width of the anchor before optimization, Δc_j is the temporal offset of the anchor's center position before optimization, Δw_j is the temporal offset of the anchor's width before optimization, exp(·) is an exponential function, and c_j and w_j denote the center position and width of the anchor after optimization;

the confidence s_j of the event prediction is calculated by

s_j = s_j^c + λ · s_j^d

where s_j^c and s_j^d denote the classification confidence and the language-description confidence of the event, respectively, and λ is a hyper-parameter.

The final prediction result Φ_k in step three is obtained by selecting the layer whose event predictions have the largest sum of confidences as the final prediction result:

k = argmax_i Σ_j s_j^i,  Φ_k = {p_j^k}.
the invention has the beneficial effects that:
the invention provides a multi-event video description method based on a dynamic attention mechanism, which adopts a self-attention mechanism to encode visual characteristics and inputs the characteristic codes into an event prediction layer; the event prediction layer predicts each event in the video according to the video coding information. And the event description layer generates the text description of each event according to the result of the event prediction and the video characteristics and dynamically fusing the context information of the decoder. The method combines a self-attention mechanism and a feedforward neural network to replace an LSTM network-based video encoder, overcomes the defects of poor encoding parallelism and low efficiency of the LSTM network, and ensures the accuracy of video description generation.
The invention intercepts the visual characteristics corresponding to the event according to the mask matrix generated by the event prediction result when the visual characteristics of the event are acquired, thereby leading the decoder to acquire effective information, eliminating the interference of other information and having high robustness and stability.
The invention dynamically fuses the visual characteristics of the event and the context information of the decoder during video decoding, and adopts the threshold to dynamically adjust the proportion of the visual characteristics and the context characteristics in the input of the decoder, thereby generating more accurate and coherent event description.
The method provided by the invention performs end-to-end model training by minimizing total loss (including event prediction loss and sentence generation loss), has high training efficiency and stability, and reduces training cost.
Drawings
FIG. 1 is a flow chart of a method of an embodiment of the present invention;
FIG. 2 is a block diagram of a method according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of a method according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating a dynamic fusion mechanism of visual features and context information in an embodiment of the present invention;
FIG. 5 shows the results of the method of the present invention on the ActivityNet Captions dataset.
Detailed Description
The invention is described in detail below with reference to the figures and the specific embodiments.
Examples
As shown in fig. 1, fig. 2 and fig. 3, the present invention designs a multiple-event video description method based on a dynamic attention mechanism, which specifically includes the following steps:
Step 1: extract the visual feature sequence V = {v_1, v_2, …, v_T} of the video sequence X = {x_1, x_2, …, x_L} using a convolutional neural network (in this embodiment, a 3D-CNN).
For a video sequence X = {x_1, x_2, …, x_L} of L frames, feature extraction is performed on the video frames with a 3D CNN pre-trained on the Sports-1M video dataset. The temporal resolution of the extracted C3D features is δ = 16 frames, so the input video stream is discretized into T = L/δ steps, and the final feature sequence is V = {v_1, v_2, …, v_T}.
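For illustration only, this clip-level discretization can be sketched as follows in PyTorch. The `r3d_18` backbone from torchvision is used purely as a stand-in for the Sports-1M-pretrained C3D network of the embodiment; the frame count, resolution and use of randomly initialized weights are assumptions, and in practice the pretrained C3D weights would be loaded.

```python
import torch
from torchvision.models.video import r3d_18

def extract_clip_features(video: torch.Tensor, delta: int = 16) -> torch.Tensor:
    """Split an (L, C, H, W) frame tensor into T = L // delta clips and encode each
    clip with a 3D CNN, yielding a (T, D) feature sequence.  r3d_18 stands in for
    the Sports-1M-pretrained C3D backbone described in the embodiment."""
    backbone = r3d_18(weights=None)          # load pretrained weights in practice
    backbone.fc = torch.nn.Identity()        # keep the pooled feature, drop the classifier
    backbone.eval()

    L = video.shape[0]
    T = L // delta
    clips = video[: T * delta].reshape(T, delta, *video.shape[1:])  # (T, delta, C, H, W)
    clips = clips.permute(0, 2, 1, 3, 4)     # (T, C, delta, H, W), as 3D CNNs expect
    with torch.no_grad():
        feats = backbone(clips)              # (T, 512)
    return feats

# Example: 480 frames of 112x112 RGB -> 30 feature steps
features = extract_clip_features(torch.rand(480, 3, 112, 112))
print(features.shape)                        # torch.Size([30, 512])
```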
Step 2: input the visual features V of the video into an L-layer self-attention video encoding layer to obtain the encoded representations {F_1, F_2, …, F_L} of the video.
The feature sequence obtained in step 1 is taken as the input of L encoder layers, where each encoder layer consists of a multi-head attention layer and a point-wise feed-forward layer. Each encoder layer takes the output of the previous layer as input and produces the output of the current layer by encoding, F_{l+1} = E(F_l); in particular, F_1 = E(V). The specific method is as follows:
Multi-head attention layer:

Ω(F_l) = LN(MultiHead(F_l, F_l, F_l), F_l)

Point-wise feed-forward layer:

E(F_l) = LN(FF(Ω(F_l)), Ω(F_l))
FF(x) = W_2 · max(0, W_1 x + b_1) + b_2

where LN(p, q) = LayerNorm(p + q) denotes layer normalization applied to the residual sum, FF(·) is a two-layer feed-forward neural network whose first layer uses a ReLU activation, W_1 and W_2 are the weight matrices of the network, and b_1 and b_2 are the bias terms. The definition of Ω(·) uses the self-attention mechanism: during the encoding of step t, f_t^l serves as the query to the attention layer, and the resulting output is a weighted sum of f_i^l (i = 1, 2, …, T). Thus step t encodes not only the information of the current step but also the information of the other steps, so every step encoded through the self-attention mechanism contains the full context information.
The multi-head attention mechanism consists of N scaled dot-product attention layers:

MultiHead(Q, K, V) = Concat(head_1, head_2, …, head_N) W^O

Each layer is a "head", defined as:

head_i = Attention(W_i^Q Q, W_i^K K, W_i^V V)

where W_i^Q, W_i^K, W_i^V are mapping matrices. Scaled dot-product attention is defined as:

Attention(q, k, v) = softmax(q k^T / sqrt(d)) v

where q, k, v denote the query matrix, key matrix and value matrix, respectively, and q_i, k_i, v_i all have dimension d. The attention layer computes the similarity between the query q and each key k_t (t = 1, …, T) to obtain the weight of each value v_t, and the output is the weighted sum of the v_t.
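As a minimal sketch of the encoder layer described above (multi-head self-attention followed by a point-wise feed-forward network, each wrapped with a residual connection and layer normalization), the following PyTorch module may be used; the model dimension, head count, feed-forward width and layer count are illustrative assumptions, and `nn.MultiheadAttention` stands in for the scaled dot-product multi-head attention defined above.

```python
import torch
import torch.nn as nn

class SelfAttentionEncoderLayer(nn.Module):
    """One encoder layer: Omega(F) = LN(MultiHead(F, F, F) + F),
    E(F) = LN(FF(Omega(F)) + Omega(F)), with FF a two-layer ReLU feed-forward net."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # queries, keys and values all come from the previous layer's output
        attn_out, _ = self.attn(f, f, f)
        omega = self.ln1(attn_out + f)       # LayerNorm over the residual sum
        return self.ln2(self.ff(omega) + omega)

# Stack L encoder layers: F_1 = E(V), F_{l+1} = E(F_l)
encoder = nn.Sequential(*[SelfAttentionEncoderLayer() for _ in range(2)])
video_features = torch.rand(1, 30, 512)      # (batch, T, d_model)
encoded = encoder(video_features)
print(encoded.shape)                         # torch.Size([1, 30, 512])
```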
Step 3: the event prediction layer predicts the events in the video from the video encoding F_i of each layer, obtaining each encoding layer's event predictions Φ_i, and selects the prediction of the layer k = argmax_i(·) with the highest prediction confidence as the final prediction result. The concrete implementation is as follows:
(1) The video encoding F_i is first input into the base layers of the event prediction layer (conv_1 and conv_2), which reduce the time dimension of the video encoding and enlarge the temporal receptive field; the final output feature dimension is T/2 × 1024.
(2) The output features of the base layers are then input into nine anchor layers (conv_3 to conv_11; each anchor layer has kernel size 3, stride 2 and 512 filters), which reduce the time dimension of the features step by step so that events can be predicted over multiple time scales.
(3) The output of each anchor layer is received by the prediction layer, and a fixed set of event predictions is generated at once. Specifically, for an input feature f_j of size T_j × D_j, each event prediction result

p_j = (s_j^c, s_j^d, Δc_j, Δw_j)

is generated from a 1 × D_j feature unit through a fully connected layer, where s_j^c is the classification score of the event/background classification, s_j^d is a descriptiveness score representing the confidence that the predicted event can be well described, and Δc_j and Δw_j are the offsets of the center position c_a and width w_a of event j relative to its associated anchor.
(4) The prediction p_j^i of the j-th event at the i-th layer is computed as follows:
a. The boundary of the event is calculated by the following formulas:

c_j = c_a + w_a · Δc_j
w_j = w_a · exp(Δw_j)
t_j^start = c_j − w_j / 2
t_j^end = c_j + w_j / 2

where c_a and w_a denote the center position and width of the anchor before optimization, Δc_j is the temporal offset of the anchor's center position before optimization, Δw_j is the temporal offset of the anchor's width before optimization, exp(·) is an exponential function, and c_j and w_j denote the center position and width of the anchor after optimization.
b. The confidence s_j of the event prediction is

s_j = s_j^c + λ · s_j^d

combining the visual classification confidence s_j^c of the event and its language-description confidence s_j^d, where λ is a hyper-parameter.
(5) The final prediction result Φ_k is obtained by selecting the layer whose event predictions have the largest sum of confidences as the final prediction result:

k = argmax_i Σ_j s_j^i,  Φ_k = {p_j^k}.
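A minimal sketch of the event prediction layer described in step 3 is given below. The two base convolutions (output T/2 × 1024) and the nine anchor layers (kernel size 3, stride 2, 512 filters) follow the text; the per-layer anchor scale, the sigmoid applied to the two scores, the value of λ and the anchor-offset decoding c = c_a + w_a·Δc, w = w_a·exp(Δw) are assumptions where the patent's formula images are not reproduced.

```python
import torch
import torch.nn as nn

class EventProposalHead(nn.Module):
    """Base conv layers, a stack of anchor conv layers and a shared prediction layer
    that emits, per time step, a classification score, a descriptiveness score and
    the temporal offsets (dc, dw) of the associated anchor.  A 1x1 convolution plays
    the role of the per-step fully connected prediction layer."""
    def __init__(self, d_in: int = 512, n_anchor_layers: int = 9):
        super().__init__()
        self.base = nn.Sequential(
            nn.Conv1d(d_in, 512, 3, stride=1, padding=1), nn.ReLU(),
            nn.Conv1d(512, 1024, 3, stride=2, padding=1), nn.ReLU(),   # -> T/2 x 1024
        )
        self.anchors = nn.ModuleList(
            nn.Conv1d(1024 if i == 0 else 512, 512, 3, stride=2, padding=1)
            for i in range(n_anchor_layers)
        )
        self.pred = nn.Conv1d(512, 4, kernel_size=1)   # (cls score, desc score, dc, dw)

    def forward(self, f: torch.Tensor, lam: float = 0.3):
        """f: (batch, T, d_in) encoded video; returns one proposal set per anchor layer."""
        x = self.base(f.transpose(1, 2))
        proposals = []
        for conv in self.anchors:
            x = torch.relu(conv(x))
            out = self.pred(x)                         # (batch, 4, T_a)
            cls, desc, dc, dw = out.unbind(dim=1)
            t_a = out.shape[-1]
            stride = f.shape[1] / t_a                  # anchor scale on the T-step grid (assumed)
            c_a = (torch.arange(t_a) + 0.5) * stride   # anchor centers
            w_a = torch.full((t_a,), stride)           # anchor widths
            c = c_a + w_a * dc                         # refined center
            w = w_a * torch.exp(dw)                    # refined width
            conf = torch.sigmoid(cls) + lam * torch.sigmoid(desc)
            proposals.append({"start": c - w / 2, "end": c + w / 2, "confidence": conf})
        return proposals

head = EventProposalHead()
props = head(torch.rand(1, 128, 512))
# pick the anchor layer whose proposals have the largest total confidence
best = max(props, key=lambda p: p["confidence"].sum().item())
```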
and 4, step 4: the event description layer predicts the prediction result of the layer according to the event
Figure BDA00022796898900000712
Generating a mask for event j
Figure BDA00022796898900000713
Intercepting the visual feature sequence V by using a mask to obtain the visual feature corresponding to the event j
Figure BDA00022796898900000714
Then to
Figure BDA00022796898900000715
Applying average pooling to obtain visual feature vector C of videojAnd fusing the visual feature vector and the context vector H to obtain a final feature vector
Figure BDA00022796898900000716
Finally, the event j is input into an LSTM decoder for decoding to obtain the description S of the event jj. The specific method comprises the following steps:
(1) Based on the prediction result Φ_k = {p_j} of the event prediction layer, a mask M_j is generated for event j; the coverage of the mask is determined by the boundary of event j:

M_j(t) = 1 if t ∈ [t_j^start, t_j^end], and 0 otherwise.

(2) The visual feature sequence of event j is cropped using the mask:

V_j = M_j ⊙ V

where ⊙ denotes element-wise multiplication of matrices.

(3) The visual feature vector C_j of event j is obtained by average pooling:

C_j = (1/n) Σ_{i=1}^{n} v_i^j

where v_i^j ∈ V_j and n is the length of the feature sequence.
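A short sketch of items (1)-(3) — building the event mask from the predicted boundaries, masking the feature sequence and average-pooling it — is given below; rounding the real-valued boundaries to feature steps is an assumption.

```python
import torch

def event_feature(v: torch.Tensor, t_start: float, t_end: float) -> torch.Tensor:
    """v: (T, D) visual feature sequence.  Build the binary mask M_j covering the
    predicted boundaries of event j, mask the sequence element-wise and average-pool
    the covered steps to obtain the event feature vector C_j."""
    T = v.shape[0]
    steps = torch.arange(T)
    mask = ((steps >= int(t_start)) & (steps < int(t_end))).float().unsqueeze(1)  # (T, 1)
    v_j = mask * v                       # masked visual features of event j
    n = mask.sum().clamp(min=1.0)        # number of steps inside the event
    return v_j.sum(dim=0) / n            # C_j, shape (D,)

c_j = event_feature(torch.rand(30, 512), t_start=4.2, t_end=11.7)
print(c_j.shape)                         # torch.Size([512])
```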
(4) The visual feature vector of the event and the context vector H are fused to obtain the adjusted final feature vector F^j = {F_1^j, F_2^j, …, F_{T_s}^j}.

Suppose the description S_j of event j consists of T_s words, i.e. S_j = {w_1, w_2, …, w_{T_s}}. Taking the generation of one word w as one time step, the generation of S_j requires T_s time steps, so F^j = {F_1^j, …, F_{T_s}^j}.

The output of the video decoder at time t is related not only to the visual features of the event but also to the context information of the decoder. The visual feature vector C_j of the event and the context feature h_{t-1} of the decoder are dynamically fused, adjusting the proportion of the visual feature and the context feature in the decoder input and thereby generating a more accurate event description. The specific method is shown in FIG. 4:

a. Map the visual feature and the context feature of the event into the same feature space:

C'_j = tanh(W_v C_j),  h'_{t-1} = tanh(W_c h_{t-1})

where W_v and W_c are the mapping matrices of the visual feature and the context feature.

b. Compute the gate γ_t of the context feature, where E_t is the embedding vector of the decoder input word w_{t-1} at time t.

c. Fuse the visual feature and the context feature through the gating mechanism:

F_t^j = γ_t ⊙ h'_{t-1} + (1 − γ_t) ⊙ C'_j

The final feature representation of event j is F^j = {F_1^j, …, F_{T_s}^j}.
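The dynamic fusion of item (4) can be sketched as follows. Only the tanh projections and the gated blend are stated explicitly in the text, so the parameterization of the gate (a sigmoid over a linear map of the word embedding E_t and the hidden state h_{t-1}) and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class DynamicFusion(nn.Module):
    """Gated fusion of the event's visual feature C_j and the decoder context h_{t-1}.
    The gate's exact parameterization is an assumption; the tanh projections and the
    gated blend follow the description in item (4)."""
    def __init__(self, d_visual: int, d_hidden: int, d_embed: int, d_model: int):
        super().__init__()
        self.w_v = nn.Linear(d_visual, d_model)    # W_v: visual mapping
        self.w_c = nn.Linear(d_hidden, d_model)    # W_c: context mapping
        self.gate = nn.Linear(d_embed + d_hidden, d_model)

    def forward(self, c_j, h_prev, e_t):
        c_proj = torch.tanh(self.w_v(c_j))         # C'_j
        h_proj = torch.tanh(self.w_c(h_prev))      # h'_{t-1}
        g = torch.sigmoid(self.gate(torch.cat([e_t, h_prev], dim=-1)))   # gate of the context
        return g * h_proj + (1.0 - g) * c_proj     # F_t^j: fused decoder input feature

fusion = DynamicFusion(d_visual=512, d_hidden=512, d_embed=300, d_model=512)
f_t = fusion(torch.rand(1, 512), torch.rand(1, 512), torch.rand(1, 300))
```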
(5) The final feature representation F^j of event j is input into the LSTM decoder to obtain the description S_j of event j.

a. The LSTM is the basic building block of the video decoder. Its hidden state h_t is jointly determined by the feature vector E_t of the currently input word, the input visual feature vector F_t^j and the hidden state h_{t-1} of the previous time step:

h_t = LSTM(E_t, F_t^j, h_{t-1})

where E_t = E[w_{t-1}] and, in particular, E_0 = E[<BOS>].

Specifically, the LSTM cell is composed of an input gate Gi_t, a forget gate Gf_t, an output gate Go_t and an input unit g_t, computed as:

[Gi_t; Gf_t; Go_t; g_t] = [σ; σ; σ; tanh](W · [E_t; F_t^j; h_{t-1}])

where σ denotes the sigmoid activation function, tanh denotes the hyperbolic tangent, and W is a transformation matrix determined by model training. The memory cell c_t and hidden state h_t of the LSTM are updated by:

c_t = Gf_t ⊙ c_{t-1} + Gi_t ⊙ g_t
h_t = Go_t ⊙ tanh(c_t)

where ⊙ denotes element-wise multiplication of matrices.

b. From the hidden state h_t of the LSTM at time t, the probability distribution over the candidate words is computed as P_t = softmax(U_p ψ(W_p h_t + b_p) + d), where the parameters U_p, W_p, b_p and d are determined by model training.

c. The probability distribution P_t output by the decoder's softmax layer serves as the probability distribution of each word, P(w_t | w_{<t}, F^j; θ), where θ denotes the parameters of the whole model. The model parameters θ are learned by minimizing the negative log-probability of the words:

L_c(S_j) = − Σ_{t=1}^{T_s} log P(w_t | w_{<t}, F^j; θ)

where T_s is the length of the video description.

d. After the model parameters are determined, the description of an event is generated with a beam search algorithm: at time t, the k best partial descriptions up to time t are kept as candidates for the description at time t, and this process iterates until the description is complete. Finally, S_j = argmax_S (−L_c(S)) is selected as the description of event j.
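A minimal sketch of one decoding step of item (5) is shown below: the previous word is embedded, concatenated with the fused event feature F_t^j, passed through an LSTM cell, and projected to a vocabulary distribution P_t = softmax(U_p · tanh(W_p · h_t)). For brevity, greedy decoding replaces the beam search described above, the bias terms are folded into the linear layers, and the vocabulary size, embedding dimension and <BOS> index are assumptions.

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """One-step LSTM decoder: hidden state h_t is driven by the word embedding E_t,
    the fused event feature F_t^j and the previous hidden state h_{t-1}."""
    def __init__(self, vocab_size: int, d_embed: int = 300, d_model: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_embed)
        self.lstm = nn.LSTMCell(d_embed + d_model, d_model)
        self.w_p = nn.Linear(d_model, d_model)
        self.u_p = nn.Linear(d_model, vocab_size)

    def step(self, prev_word, f_t, h, c):
        e_t = self.embed(prev_word)                          # E_t = E[w_{t-1}]
        h, c = self.lstm(torch.cat([e_t, f_t], dim=-1), (h, c))
        logits = self.u_p(torch.tanh(self.w_p(h)))           # U_p * tanh(W_p h_t)
        return torch.log_softmax(logits, dim=-1), h, c

decoder = CaptionDecoder(vocab_size=10000)
h = c = torch.zeros(1, 512)
word = torch.tensor([1])                                     # <BOS> index (assumed)
caption = []
for _ in range(20):                                          # greedy decoding
    log_probs, h, c = decoder.step(word, torch.rand(1, 512), h, c)
    word = log_probs.argmax(dim=-1)
    caption.append(word.item())
```

In practice the fused feature F_t^j produced by the DynamicFusion sketch above would replace the random tensor passed to `decoder.step`.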
Training of the model:
corresponding to two modules of event prediction and event description, the model has two loss functions: event prediction loss LpAnd event description loss Lc
For a packet containing NEVideo feature sequence V ═ V for an event1,v2,…vTY, a real tag sequence V ═ y1,y2,…yTCorresponds to it. Each ytAre all one NEDimension vectors, each element of which takes the value 0 or 1. When v istWhen the time domain intersection ratio (tIoU) of the corresponding event prediction boundary and the real boundary of the event j is more than 0.5, the method will be used
Figure BDA0002279689890000093
Set to 1, otherwise set to 0.
The invention adopts a weighted multi-label cross entropy as the event prediction loss function:

L_p = −(1/T) Σ_{t=1}^{T} Σ_{i=1}^{N_E} [ w^+ · y_t^i · log ŝ_t^i + w^− · (1 − y_t^i) · log(1 − ŝ_t^i) ]

where the weights w^+ and w^− are determined by the numbers of positive and negative predictions, respectively, and ŝ_t^i is the prediction confidence for event i at time t. L_p is obtained by averaging the event prediction losses of all video sequences.
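For illustration, the weighted multi-label cross entropy can be sketched as below; the exact determination of the weights w+ and w− (here taken as the reciprocals of the positive and negative counts) is an assumption where the patent's formula image is not reproduced.

```python
import torch

def event_proposal_loss(scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Weighted multi-label cross entropy over time steps and events.
    scores: (T, N_E) predicted confidences in (0, 1); labels: (T, N_E) 0/1 targets
    (1 where the anchor's tIoU with event j exceeds 0.5)."""
    n_pos = labels.sum().clamp(min=1.0)
    n_neg = (1.0 - labels).sum().clamp(min=1.0)
    w_pos, w_neg = 1.0 / n_pos, 1.0 / n_neg            # assumed balancing scheme
    eps = 1e-8
    loss = -(w_pos * labels * torch.log(scores + eps)
             + w_neg * (1.0 - labels) * torch.log(1.0 - scores + eps))
    return loss.sum()

lp = event_proposal_loss(torch.rand(30, 5), (torch.rand(30, 5) > 0.8).float())
```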
The description loss L_c(S) of a single event has been defined in step 4-(5)-c; L_c is obtained by averaging the description losses of all events of all video sequences.
The total loss function is L = λ_p L_p + λ_c L_c, where λ_p and λ_c are coefficients balancing the contribution of each loss in the total loss. The entire model is trained in an end-to-end fashion by minimizing the loss function.
Fig. 5 shows the results of the method on the ActivityNet Captions dataset. The results show that, compared with traditional attention-based dense video captioning algorithms, the method generates more specific event descriptions, makes consecutive sentences more coherent, and expresses richer information. At the same time, the method effectively overcomes the poor parallelism and low efficiency of existing dense video description generation algorithms and has strong practicability and robustness.
The technical solutions of the present invention are not limited to the above embodiments, and all technical solutions obtained by using equivalent substitution modes fall within the scope of the present invention.

Claims (6)

1. A multi-event video description method based on a dynamic attention mechanism is characterized by comprising the following steps:
step one, extracting visual characteristics V of a target video sequence X by adopting a convolutional neural network;
step two, inputting the visual features V of the video into an L-layer self-attention video encoding layer to obtain the video encodings F_i;
step three, using the event prediction layer to generate event predictions Φ_i from the video encodings F_i, and selecting the prediction of the layer with the highest prediction confidence as the final prediction result Φ_k;
step four, based on the prediction result Φ_k = {p_j} of the event prediction layer, generating a mask M_j for event j and using the mask to crop the visual feature sequence of event j:

V_j = M_j ⊙ V

where ⊙ denotes element-wise multiplication of matrices;

obtaining the visual feature vector C_j of event j by average pooling:

C_j = (1/n) Σ_{i=1}^{n} v_i^j

where v_i^j ∈ V_j and n is the length of the feature sequence;

fusing the visual feature vector with the context vector H of the event to obtain the adjusted final feature vector F^j = {F_1^j, F_2^j, …, F_{T_s}^j};

supposing the description S_j of event j consists of T_s words, i.e. S_j = {w_1, w_2, …, w_{T_s}}, and taking the generation of one word w as one time step, the generation of S_j requires T_s time steps, so F^j = {F_1^j, …, F_{T_s}^j};

mapping the visual feature and the context feature h_{t-1} of the event into the same feature space:

C'_j = tanh(W_v C_j),  h'_{t-1} = tanh(W_c h_{t-1})

where W_v and W_c are the mapping matrices of the visual feature and the context feature; the context feature h_{t-1} is the hidden state of the LSTM unit at the previous time step, and h_t is jointly determined by the feature vector E_t of the currently input word, the input visual feature vector F_t^j and the hidden state h_{t-1} of the previous time step:

h_t = LSTM(E_t, F_t^j, h_{t-1})

where E_t = E[w_{t-1}] and, in particular, E_0 = E[<BOS>];

computing the gate γ_t of the context feature, where E_t is the embedding vector of the decoder input word w_{t-1} at time t;

fusing the visual feature and the context feature through the gating mechanism:

F_t^j = γ_t ⊙ h'_{t-1} + (1 − γ_t) ⊙ C'_j

and inputting the final feature representation F^j = {F_1^j, …, F_{T_s}^j} of event j into the LSTM decoder for decoding to obtain the description S_j of event j.
2. The method for describing multiple event videos based on the dynamic attention mechanism as claimed in claim 1, wherein: the encoding step of the video in the second step comprises the following steps:
the visual feature V is taken as the input of the first encoder layer, whose output is F_1 = E(V); every other encoder layer takes the output of the previous layer as input, and its encoded output is F_{l+1} = E(F_l).
3. The method for describing multiple event videos based on the dynamic attention mechanism as claimed in claim 2, wherein: each encoder layer comprises a multi-head attention layer and a point type feedforward layer;
the multi-head attention layer is computed as:

Ω(F_l) = LN(MultiHead(F_l, F_l, F_l), F_l)

the point-wise feed-forward layer is computed as:

E(F_l) = LN(FF(Ω(F_l)), Ω(F_l))
FF(x) = W_2 · max(0, W_1 x + b_1) + b_2

where LN(p, q) = LayerNorm(p + q) denotes layer normalization applied to the residual sum, FF(·) is a two-layer feed-forward neural network whose first layer uses a ReLU activation, W_1 and W_2 are the weight matrices of the network, and b_1 and b_2 are the bias terms; the definition of Ω(·) uses the self-attention mechanism: during the encoding of step t, f_t^l serves as the query to the attention layer, and the resulting output is a weighted sum of f_i^l (i = 1, 2, …, T).
4. The method for describing multiple-event videos based on the dynamic attention mechanism as claimed in claim 1, wherein: the specific method by which the event prediction layer in step three generates event predictions from the video encoding F_i is as follows:
step 3.1, inputting the video encoding F_i into the base layer of the event prediction layer;
step 3.2, inputting the output characteristics of the basic layer into an anchor layer of the event prediction layer, and gradually reducing the time dimension of the characteristics;
and 3.3, inputting the output of each anchor layer into a prediction layer, and generating a set of fixed event predictions at one time.
5. The method for describing multiple-event videos based on the dynamic attention mechanism as claimed in claim 4, wherein: the prediction p_j^i of the j-th event at the i-th layer in step three is computed as follows:

the boundary of the event is calculated by

c_j = c_a + w_a · Δc_j
w_j = w_a · exp(Δw_j)
t_j^start = c_j − w_j / 2
t_j^end = c_j + w_j / 2

where c_a and w_a denote the center position and width of the anchor before optimization, Δc_j is the temporal offset of the anchor's center position before optimization, Δw_j is the temporal offset of the anchor's width before optimization, exp(·) is an exponential function, and c_j and w_j denote the center position and width of the anchor after optimization;

the confidence s_j of the event prediction is calculated by

s_j = s_j^c + λ · s_j^d

where s_j^c and s_j^d denote the classification confidence and the language-description confidence of the event, respectively, and λ is a hyper-parameter.
6. The method for describing multiple event videos based on the dynamic attention mechanism as claimed in claim 5, wherein:
the final prediction result Φ_k in step three is obtained by selecting the layer whose event predictions have the largest sum of confidences as the final prediction result:

k = argmax_i Σ_j s_j^i,  Φ_k = {p_j^k}.
CN201911136308.8A 2019-11-19 2019-11-19 Multi-event video description method based on dynamic attention mechanism Active CN110929092B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911136308.8A CN110929092B (en) 2019-11-19 2019-11-19 Multi-event video description method based on dynamic attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911136308.8A CN110929092B (en) 2019-11-19 2019-11-19 Multi-event video description method based on dynamic attention mechanism

Publications (2)

Publication Number Publication Date
CN110929092A true CN110929092A (en) 2020-03-27
CN110929092B CN110929092B (en) 2023-07-04

Family

ID=69850335

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911136308.8A Active CN110929092B (en) 2019-11-19 2019-11-19 Multi-event video description method based on dynamic attention mechanism

Country Status (1)

Country Link
CN (1) CN110929092B (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111488815A (en) * 2020-04-07 2020-08-04 中山大学 Basketball game goal event prediction method based on graph convolution network and long-time and short-time memory network
CN111652357A (en) * 2020-08-10 2020-09-11 浙江大学 Method and system for solving video question-answer problem by using specific target network based on graph
CN112308402A (en) * 2020-10-29 2021-02-02 复旦大学 Power time series data abnormity detection method based on long and short term memory network
CN113312980A (en) * 2021-05-06 2021-08-27 华南理工大学 Video intensive description method, device and medium
CN113392717A (en) * 2021-05-21 2021-09-14 杭州电子科技大学 Video dense description generation method based on time sequence characteristic pyramid
CN113469260A (en) * 2021-07-12 2021-10-01 天津理工大学 Visual description method based on convolutional neural network, attention mechanism and self-attention converter
CN113505659A (en) * 2021-02-02 2021-10-15 黑芝麻智能科技有限公司 Method for describing time event
CN113593603A (en) * 2021-07-27 2021-11-02 浙江大华技术股份有限公司 Audio category determination method and device, storage medium and electronic device
WO2022116420A1 (en) * 2020-12-01 2022-06-09 平安科技(深圳)有限公司 Speech event detection method and apparatus, electronic device, and computer storage medium
CN114627413A (en) * 2022-03-11 2022-06-14 电子科技大学 Video intensive event content understanding method
CN114661953A (en) * 2022-03-18 2022-06-24 北京百度网讯科技有限公司 Video description generation method, device, equipment and storage medium
CN114998673A (en) * 2022-05-11 2022-09-02 河海大学 Dam defect time sequence image description method based on local self-attention mechanism
CN115190332A (en) * 2022-07-08 2022-10-14 西安交通大学医学院第二附属医院 Dense video subtitle generation method based on global video characteristics

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120008868A1 (en) * 2010-07-08 2012-01-12 Compusensor Technology Corp. Video Image Event Attention and Analysis System and Method
CN108960063A (en) * 2018-06-01 2018-12-07 清华大学深圳研究生院 It is a kind of towards event relation coding video in multiple affair natural language description algorithm
CN109344288A (en) * 2018-09-19 2019-02-15 电子科技大学 A kind of combination video presentation method based on multi-modal feature combination multilayer attention mechanism
CN110072142A (en) * 2018-01-24 2019-07-30 腾讯科技(深圳)有限公司 Video presentation generation method, device, video broadcasting method, device and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120008868A1 (en) * 2010-07-08 2012-01-12 Compusensor Technology Corp. Video Image Event Attention and Analysis System and Method
CN110072142A (en) * 2018-01-24 2019-07-30 腾讯科技(深圳)有限公司 Video presentation generation method, device, video broadcasting method, device and storage medium
CN108960063A (en) * 2018-06-01 2018-12-07 清华大学深圳研究生院 It is a kind of towards event relation coding video in multiple affair natural language description algorithm
CN109344288A (en) * 2018-09-19 2019-02-15 电子科技大学 A kind of combination video presentation method based on multi-modal feature combination multilayer attention mechanism

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LIANLI GAO et al.: "Video Captioning with Attention-based LSTM and Semantic Consistency", IEEE Transactions on Multimedia *
NING XU et al.: "Attention-In-Attention Networks for Surveillance Video Understanding in IoT", IEEE Internet of Things Journal *
JI Zhong et al.: "Video Summarization Based on a Decoder Attention Mechanism", Journal of Tianjin University *

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111488815B (en) * 2020-04-07 2023-05-09 中山大学 Event prediction method based on graph convolution network and long-short-time memory network
CN111488815A (en) * 2020-04-07 2020-08-04 中山大学 Basketball game goal event prediction method based on graph convolution network and long-time and short-time memory network
CN111652357A (en) * 2020-08-10 2020-09-11 浙江大学 Method and system for solving video question-answer problem by using specific target network based on graph
CN112308402B (en) * 2020-10-29 2022-04-12 复旦大学 Power time series data abnormity detection method based on long and short term memory network
CN112308402A (en) * 2020-10-29 2021-02-02 复旦大学 Power time series data abnormity detection method based on long and short term memory network
WO2022116420A1 (en) * 2020-12-01 2022-06-09 平安科技(深圳)有限公司 Speech event detection method and apparatus, electronic device, and computer storage medium
CN113505659A (en) * 2021-02-02 2021-10-15 黑芝麻智能科技有限公司 Method for describing time event
US11887384B2 (en) 2021-02-02 2024-01-30 Black Sesame Technologies Inc. In-cabin occupant behavoir description
CN113312980B (en) * 2021-05-06 2022-10-14 华南理工大学 Video intensive description method, device and medium
CN113312980A (en) * 2021-05-06 2021-08-27 华南理工大学 Video intensive description method, device and medium
CN113392717A (en) * 2021-05-21 2021-09-14 杭州电子科技大学 Video dense description generation method based on time sequence characteristic pyramid
CN113392717B (en) * 2021-05-21 2024-02-13 杭州电子科技大学 Video dense description generation method based on time sequence feature pyramid
CN113469260A (en) * 2021-07-12 2021-10-01 天津理工大学 Visual description method based on convolutional neural network, attention mechanism and self-attention converter
CN113593603A (en) * 2021-07-27 2021-11-02 浙江大华技术股份有限公司 Audio category determination method and device, storage medium and electronic device
CN114627413A (en) * 2022-03-11 2022-06-14 电子科技大学 Video intensive event content understanding method
CN114627413B (en) * 2022-03-11 2022-09-13 电子科技大学 Video intensive event content understanding method
CN114661953A (en) * 2022-03-18 2022-06-24 北京百度网讯科技有限公司 Video description generation method, device, equipment and storage medium
CN114998673B (en) * 2022-05-11 2023-10-13 河海大学 Dam defect time sequence image description method based on local self-attention mechanism
WO2023217163A1 (en) * 2022-05-11 2023-11-16 华能澜沧江水电股份有限公司 Dam defect time-sequence image description method based on local self-attention mechanism
CN114998673A (en) * 2022-05-11 2022-09-02 河海大学 Dam defect time sequence image description method based on local self-attention mechanism
CN115190332A (en) * 2022-07-08 2022-10-14 西安交通大学医学院第二附属医院 Dense video subtitle generation method based on global video characteristics

Also Published As

Publication number Publication date
CN110929092B (en) 2023-07-04

Similar Documents

Publication Publication Date Title
CN110929092A (en) Multi-event video description method based on dynamic attention mechanism
CN109389091B (en) Character recognition system and method based on combination of neural network and attention mechanism
Kim et al. Efficient dialogue state tracking by selectively overwriting memory
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
CN107484017B (en) Supervised video abstract generation method based on attention model
EP4073787B1 (en) System and method for streaming end-to-end speech recognition with asynchronous decoders
CN111488807B (en) Video description generation system based on graph rolling network
CN110309732B (en) Behavior identification method based on skeleton video
CN112329760B (en) Method for recognizing and translating Mongolian in printed form from end to end based on space transformation network
CN109543820B (en) Image description generation method based on architecture phrase constraint vector and double vision attention mechanism
CN109919174A (en) A kind of character recognition method based on gate cascade attention mechanism
CN111079532A (en) Video content description method based on text self-encoder
CN112329794B (en) Image description method based on dual self-attention mechanism
CN110991290A (en) Video description method based on semantic guidance and memory mechanism
CN109829495A (en) Timing image prediction method based on LSTM and DCGAN
CN111814844A (en) Intensive video description method based on position coding fusion
CN115690152A (en) Target tracking method based on attention mechanism
CN116524593A (en) Dynamic gesture recognition method, system, equipment and medium
Chen et al. Training full spike neural networks via auxiliary accumulation pathway
CN113868451B (en) Cross-modal conversation method and device for social network based on up-down Wen Jilian perception
CN114973416A (en) Sign language recognition algorithm based on three-dimensional convolution network
CN115422369B (en) Knowledge graph completion method and device based on improved TextRank
CN109918484B (en) Dialog generation method and device
CN116189284A (en) Human motion prediction method, device, equipment and storage medium
CN115311598A (en) Video description generation system based on relation perception

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant