CN110929092A - Multi-event video description method based on dynamic attention mechanism - Google Patents

Multi-event video description method based on dynamic attention mechanism

Info

Publication number
CN110929092A
CN110929092A CN201911136308.8A CN201911136308A CN 110929092 A
Authority
CN
China
Prior art keywords
event
layer
video
prediction
description
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911136308.8A
Other languages
Chinese (zh)
Other versions
CN110929092B (en)
Inventor
谢洪平
刘迪
诸雅琴
黄涛
陈勇
杜长青
吴威
王昊
林东阳
陈喆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinmao New Energy Group Co Ltd
Jiangsu Electric Power Engineering Consulting Co Ltd
Southeast University
State Grid Jiangsu Electric Power Co Ltd
Original Assignee
Jinmao New Energy Group Co Ltd
Jiangsu Electric Power Engineering Consulting Co Ltd
Southeast University
State Grid Jiangsu Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinmao New Energy Group Co Ltd, Jiangsu Electric Power Engineering Consulting Co Ltd, Southeast University, State Grid Jiangsu Electric Power Co Ltd filed Critical Jinmao New Energy Group Co Ltd
Priority to CN201911136308.8A priority Critical patent/CN110929092B/en
Publication of CN110929092A publication Critical patent/CN110929092A/en
Application granted granted Critical
Publication of CN110929092B publication Critical patent/CN110929092B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/71Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/75Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention discloses a multi-event video description method based on a dynamic attention mechanism, which comprises the following steps: inputting the video sequence into a three-dimensional convolutional neural network to extract the visual features of the video; encoding the visual features with an attention-based video encoding layer and inputting the feature encodings into an event prediction layer; predicting each event with the event prediction layer according to the video encoding information; and generating, with the event description layer, the text description of each event from the visual features of that event obtained according to the event prediction result, dynamically combined with the context information of the event description layer. The method overcomes the poor parallelism and low efficiency of existing multi-event video description methods, ensures the accuracy of the generated video descriptions, and allows the model to be trained in an end-to-end manner.

Description

Multi-event video description method based on dynamic attention mechanism
Technical Field
The invention relates to a multi-event video description method based on a dynamic attention mechanism, and belongs to the field of video description in computer vision.
Background
Video tagging is a technique for analyzing video content and producing classification tags; video tags can effectively extract the key information of a video and are widely used in video storage and retrieval. However, a video tag cannot represent the more detailed information of a video. Video description (video captioning) is the process of automatically generating a natural-language description of a video by computer. It not only extracts the key elements of the video but also expresses the associations among those elements through sentences, so it has important application value and development prospects in video storage and retrieval, human-computer interaction, knowledge extraction and other fields.
Unlike image description (image captioning), video contains a great deal of constantly changing spatio-temporal information, and how to efficiently acquire useful information to describe a video accurately is a major challenge in computer vision. The S2VT (Sequence to Sequence - Video to Text) algorithm proposed by Venugopalan et al. was the first successful application of deep learning in video description. It extracts 2D convolutional features and optical-flow features of the video and feeds them into a two-layer stacked LSTM network to generate the description, laying the foundation for video description algorithms based on the encoder-decoder architecture. The video description field now has many research results, but most of them are improvements on the S2VT algorithm, such as extracting video features with a 3D CNN, using multi-modal fused features, or decoding with an improved GRU network.
A long video may contain multiple events, and the single sentence generated by traditional video description methods is too coarse to describe part of the information; to solve this problem, dense video captioning was proposed. Dense video captioning was introduced by Z. Shen et al. in the article "Weakly Supervised Dense Video Captioning": for a video, different region sequences are extracted, and a sentence description is generated for each region sequence, which is the prototype of the event proposal - caption generation architecture commonly adopted by dense video captioning today. Compared with traditional video description algorithms, the per-region-sequence descriptions are more refined and richer in information, opening up a brand-new research direction.
Recent research on dense video captioning has mainly focused on efficiently extracting and representing the information in a video and on improving the accuracy of event prediction. For the first problem, attention mechanisms (e.g., "Describing Videos by Exploiting Temporal Structure") replace the original average-pooling method for generating the video representation and largely solve the loss of temporal information during encoding. Wang et al. ("Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning") point out that most methods extract only the backward context information of the video sequence during encoding and ignore the forward context, so the event prediction method cannot distinguish highly overlapping events. They therefore propose a bidirectional video encoding method that uses two LSTM networks to encode the forward and backward context of the video and performs event prediction on the fused context information, improving the accuracy of event prediction.
However, existing dense video description generation methods still have problems: when decoding, most methods simply concatenate the context features and the visual features to obtain the decoder input, so the generated descriptions are not accurate. Meanwhile, the widely adopted LSTM video encoder parallelizes poorly. An efficient dense video description generation method is therefore needed that can quickly and accurately locate and describe the events in a video.
Disclosure of Invention
The invention provides a multi-event video description method based on a dynamic attention mechanism, aiming to solve the poor parallelism and low accuracy of existing dense video description generation algorithms and to accurately locate and describe the events in a video. To achieve this purpose, the technical scheme provided by the invention is as follows: a multi-event video description method based on a dynamic attention mechanism, characterized by comprising the following steps:
step one, extracting visual characteristics V of a target video sequence X by adopting a convolutional neural network;
step two, inputting the visual features V of the video into an L-layer self-attention video encoding layer to obtain the video encodings F_i;
step three, using the event prediction layer to generate event predictions Φ_i from the video encodings F_i, and selecting the prediction of the layer with the highest prediction confidence as the final prediction result Φ_k;
Step four, predicting the result based on the event prediction layer
Figure BDA0002279689890000021
Generating a mask for event j
Figure BDA0002279689890000022
Intercepting visual features of event j using a mask
Figure BDA0002279689890000023
The sequence is as follows:
Figure BDA0002279689890000024
wherein ⊙ denotes the matrix elements multiplied one by one;
obtaining visual feature vector C of event j by average poolingj
Figure BDA0002279689890000025
Wherein
Figure BDA0002279689890000026
n is the length of the characteristic sequence;
fusing the visual characteristic vector and the context vector H of the event to obtain an adjusted final characteristic vector
Figure BDA0002279689890000027
Suppose a description S of an event jjFrom TsA word is formed, i.e.
Figure BDA0002279689890000028
The encoder generates a word w as a time period, SjGeneration of (2) requires TsA period of time, then
Figure BDA0002279689890000029
Combining visual and contextual characteristics h of an eventt-1Mapping to the same feature space:
Figure BDA00022796898900000210
h′t-1=tanh(Wcht-1),Wvand WcIs a mapping matrix of visual features and contextual features,
Figure BDA00022796898900000211
context feature ht-1Is the hidden state of the LSTM unit at the last instant. h istBy the feature vector E of the currently input wordtInputting visual feature vectors
Figure BDA00022796898900000212
Hidden state h of previous momentt-1Jointly determining:
Figure BDA0002279689890000031
wherein Et=E[wt-1]The amount of the solvent to be used is, in particular,
Figure BDA0002279689890000032
E0=E[<BOS>];
threshold value for computing context feature
Figure BDA0002279689890000033
EtFor the input word w of the decoder at time tt-1The embedded vector of (2);
fusing visual features and contextual features by adopting a threshold mechanism:
Figure BDA0002279689890000034
final feature representation of event j
Figure BDA0002279689890000035
Will be of event jFinal feature representation
Figure BDA0002279689890000036
Input LSTM decoder decodes to obtain description S of event jj
The encoding of the video in step two comprises the following steps:

The visual feature V is taken as the input of the first encoder layer, whose output is F_1 = E(V); every other encoder layer takes the output of the previous layer as input, and its encoded output is F_{l+1} = E(F_l).

Each encoder layer comprises a multi-head attention layer and a point-wise feed-forward layer.

The multi-head attention layer is computed as:

Ω(F_l) = LN(MultiHead(F_l, F_l, F_l), F_l)

The point-wise feed-forward layer is computed as:

E(F_l) = LN(FF(Ω(F_l)), Ω(F_l))
FF(x) = W_2 · max(0, W_1 x + b_1) + b_2

where LN(p, q) = LayerNorm(p + q) denotes layer normalization applied to the residual sum, FF(·) is a two-layer feed-forward neural network whose first layer uses a ReLU activation, W_1 and W_2 are the weight matrices of the network, and b_1 and b_2 are the bias terms. The definition of Ω(·) uses the self-attention mechanism: during the encoding of step t, f_t^l serves as the query to the attention layer, and the resulting output is a weighted sum of f_i^l (i = 1, 2, …, T).
The specific method by which the event prediction layer in step three generates event predictions from the video encoding F_i is as follows:
Step 3.1, first input the video encoding F_i into the base layer of the event prediction layer;
step 3.2, inputting the output characteristics of the basic layer into an anchor layer of the event prediction layer, and gradually reducing the time dimension of the characteristics;
and 3.3, inputting the output of each anchor layer into a prediction layer, and generating a set of fixed event predictions at one time.
The prediction p_j^i of the j-th event at the i-th layer in step three is computed as follows:

the boundary of the event is calculated by

c_j = c_a + w_a · Δc_j
w_j = w_a · exp(Δw_j)
t_j^start = c_j − w_j / 2
t_j^end = c_j + w_j / 2

where c_a and w_a denote the center position and width of the anchor before optimization, Δc_j is the temporal offset of the anchor's center position before optimization, Δw_j is the temporal offset of the anchor's width before optimization, exp(·) is an exponential function, and c_j and w_j denote the center position and width of the anchor after optimization;

the confidence s_j of the event prediction is calculated by

s_j = s_j^c + λ · s_j^d

where s_j^c and s_j^d denote the classification confidence and the language-description confidence of the event, respectively, and λ is a hyper-parameter.

The final prediction result Φ_k in step three is obtained by selecting the layer whose event predictions have the largest sum of confidences as the final prediction result:

k = argmax_i Σ_j s_j^i,  Φ_k = {p_j^k}.
the invention has the beneficial effects that:
the invention provides a multi-event video description method based on a dynamic attention mechanism, which adopts a self-attention mechanism to encode visual characteristics and inputs the characteristic codes into an event prediction layer; the event prediction layer predicts each event in the video according to the video coding information. And the event description layer generates the text description of each event according to the result of the event prediction and the video characteristics and dynamically fusing the context information of the decoder. The method combines a self-attention mechanism and a feedforward neural network to replace an LSTM network-based video encoder, overcomes the defects of poor encoding parallelism and low efficiency of the LSTM network, and ensures the accuracy of video description generation.
The invention intercepts the visual characteristics corresponding to the event according to the mask matrix generated by the event prediction result when the visual characteristics of the event are acquired, thereby leading the decoder to acquire effective information, eliminating the interference of other information and having high robustness and stability.
The invention dynamically fuses the visual characteristics of the event and the context information of the decoder during video decoding, and adopts the threshold to dynamically adjust the proportion of the visual characteristics and the context characteristics in the input of the decoder, thereby generating more accurate and coherent event description.
The method provided by the invention performs end-to-end model training by minimizing total loss (including event prediction loss and sentence generation loss), has high training efficiency and stability, and reduces training cost.
Drawings
FIG. 1 is a flow chart of a method of an embodiment of the present invention;
FIG. 2 is a block diagram of a method according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of a method according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating a dynamic fusion mechanism of visual features and context information in an embodiment of the present invention;
FIG. 5 shows the results of the method of the present invention on the ActivityNet Captions dataset.
Detailed Description
The invention is described in detail below with reference to the figures and the specific embodiments.
Examples
As shown in fig. 1, fig. 2 and fig. 3, the present invention designs a multiple-event video description method based on a dynamic attention mechanism, which specifically includes the following steps:
Step 1: extract the visual feature sequence V = {v_1, v_2, …, v_T} of the video sequence X = {x_1, x_2, …, x_L} using a convolutional neural network (in this embodiment, a 3D-CNN).
For a video sequence X = {x_1, x_2, …, x_L} of L frames, feature extraction is performed on the video frames with a 3D CNN pre-trained on the Sports-1M video dataset. The temporal resolution of the extracted C3D features is δ = 16 frames, so the input video stream is discretized into T = L/δ steps, and the final feature sequence is V = {v_1, v_2, …, v_T}.
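For illustration only, this clip-level discretization can be sketched as follows in PyTorch. The `r3d_18` backbone from torchvision is used purely as a stand-in for the Sports-1M-pretrained C3D network of the embodiment; the frame count, resolution and use of randomly initialized weights are assumptions, and in practice the pretrained C3D weights would be loaded.

```python
import torch
from torchvision.models.video import r3d_18

def extract_clip_features(video: torch.Tensor, delta: int = 16) -> torch.Tensor:
    """Split an (L, C, H, W) frame tensor into T = L // delta clips and encode each
    clip with a 3D CNN, yielding a (T, D) feature sequence.  r3d_18 stands in for
    the Sports-1M-pretrained C3D backbone described in the embodiment."""
    backbone = r3d_18(weights=None)          # load pretrained weights in practice
    backbone.fc = torch.nn.Identity()        # keep the pooled feature, drop the classifier
    backbone.eval()

    L = video.shape[0]
    T = L // delta
    clips = video[: T * delta].reshape(T, delta, *video.shape[1:])  # (T, delta, C, H, W)
    clips = clips.permute(0, 2, 1, 3, 4)     # (T, C, delta, H, W), as 3D CNNs expect
    with torch.no_grad():
        feats = backbone(clips)              # (T, 512)
    return feats

# Example: 480 frames of 112x112 RGB -> 30 feature steps
features = extract_clip_features(torch.rand(480, 3, 112, 112))
print(features.shape)                        # torch.Size([30, 512])
```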
Step 2: input the visual features V of the video into an L-layer self-attention video encoding layer to obtain the encoded representations {F_1, F_2, …, F_L} of the video.
The feature sequence obtained in step 1 is taken as the input of L encoder layers, where each encoder layer consists of a multi-head attention layer and a point-wise feed-forward layer. Each encoder layer takes the output of the previous layer as input and produces the output of the current layer by encoding, F_{l+1} = E(F_l); in particular, F_1 = E(V). The specific method is as follows:
Multi-head attention layer:

Ω(F_l) = LN(MultiHead(F_l, F_l, F_l), F_l)

Point-wise feed-forward layer:

E(F_l) = LN(FF(Ω(F_l)), Ω(F_l))
FF(x) = W_2 · max(0, W_1 x + b_1) + b_2

where LN(p, q) = LayerNorm(p + q) denotes layer normalization applied to the residual sum, FF(·) is a two-layer feed-forward neural network whose first layer uses a ReLU activation, W_1 and W_2 are the weight matrices of the network, and b_1 and b_2 are the bias terms. The definition of Ω(·) uses the self-attention mechanism: during the encoding of step t, f_t^l serves as the query to the attention layer, and the resulting output is a weighted sum of f_i^l (i = 1, 2, …, T). Thus step t encodes not only the information of the current step but also the information of the other steps, so every step encoded through the self-attention mechanism contains the full context information.
The multi-head attention mechanism consists of N scaled dot-product attention layers:

MultiHead(Q, K, V) = Concat(head_1, head_2, …, head_N) W^O

Each layer is a "head", defined as:

head_i = Attention(W_i^Q Q, W_i^K K, W_i^V V)

where W_i^Q, W_i^K, W_i^V are mapping matrices. Scaled dot-product attention is defined as:

Attention(q, k, v) = softmax(q k^T / sqrt(d)) v

where q, k, v denote the query matrix, key matrix and value matrix, respectively, and q_i, k_i, v_i all have dimension d. The attention layer computes the similarity between the query q and each key k_t (t = 1, …, T) to obtain the weight of each value v_t, and the output is the weighted sum of the v_t.
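As a minimal sketch of the encoder layer described above (multi-head self-attention followed by a point-wise feed-forward network, each wrapped with a residual connection and layer normalization), the following PyTorch module may be used; the model dimension, head count, feed-forward width and layer count are illustrative assumptions, and `nn.MultiheadAttention` stands in for the scaled dot-product multi-head attention defined above.

```python
import torch
import torch.nn as nn

class SelfAttentionEncoderLayer(nn.Module):
    """One encoder layer: Omega(F) = LN(MultiHead(F, F, F) + F),
    E(F) = LN(FF(Omega(F)) + Omega(F)), with FF a two-layer ReLU feed-forward net."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # queries, keys and values all come from the previous layer's output
        attn_out, _ = self.attn(f, f, f)
        omega = self.ln1(attn_out + f)       # LayerNorm over the residual sum
        return self.ln2(self.ff(omega) + omega)

# Stack L encoder layers: F_1 = E(V), F_{l+1} = E(F_l)
encoder = nn.Sequential(*[SelfAttentionEncoderLayer() for _ in range(2)])
video_features = torch.rand(1, 30, 512)      # (batch, T, d_model)
encoded = encoder(video_features)
print(encoded.shape)                         # torch.Size([1, 30, 512])
```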
Step 3: the event prediction layer predicts the events in the video from the video encoding F_i of each layer, obtaining each encoding layer's event predictions Φ_i, and selects the prediction of the layer k = argmax_i(·) with the highest prediction confidence as the final prediction result. The concrete implementation is as follows:
(1) The video encoding F_i is first input into the base layers of the event prediction layer (conv_1 and conv_2), which reduce the time dimension of the video encoding and enlarge the temporal receptive field; the final output feature dimension is T/2 × 1024.
(2) The output features of the base layers are then input into nine anchor layers (conv_3 to conv_11; each anchor layer has kernel size 3, stride 2 and 512 filters), which reduce the time dimension of the features step by step so that events can be predicted over multiple time scales.
(3) The output of each anchor layer is received by the prediction layer, and a fixed set of event predictions is generated at once. Specifically, for an input feature f_j of size T_j × D_j, each event prediction result

p_j = (s_j^c, s_j^d, Δc_j, Δw_j)

is generated from a 1 × D_j feature unit through a fully connected layer, where s_j^c is the classification score of the event/background classification, s_j^d is a descriptiveness score representing the confidence that the predicted event can be well described, and Δc_j and Δw_j are the offsets of the center position c_a and width w_a of event j relative to its associated anchor.
(4) The prediction p_j^i of the j-th event at the i-th layer is computed as follows:
a. The boundary of the event is calculated by the following formulas:

c_j = c_a + w_a · Δc_j
w_j = w_a · exp(Δw_j)
t_j^start = c_j − w_j / 2
t_j^end = c_j + w_j / 2

where c_a and w_a denote the center position and width of the anchor before optimization, Δc_j is the temporal offset of the anchor's center position before optimization, Δw_j is the temporal offset of the anchor's width before optimization, exp(·) is an exponential function, and c_j and w_j denote the center position and width of the anchor after optimization.
b. The confidence s_j of the event prediction is

s_j = s_j^c + λ · s_j^d

combining the visual classification confidence s_j^c of the event and its language-description confidence s_j^d, where λ is a hyper-parameter.
(5) The final prediction result Φ_k is obtained by selecting the layer whose event predictions have the largest sum of confidences as the final prediction result:

k = argmax_i Σ_j s_j^i,  Φ_k = {p_j^k}.
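A minimal sketch of the event prediction layer described in step 3 is given below. The two base convolutions (output T/2 × 1024) and the nine anchor layers (kernel size 3, stride 2, 512 filters) follow the text; the per-layer anchor scale, the sigmoid applied to the two scores, the value of λ and the anchor-offset decoding c = c_a + w_a·Δc, w = w_a·exp(Δw) are assumptions where the patent's formula images are not reproduced.

```python
import torch
import torch.nn as nn

class EventProposalHead(nn.Module):
    """Base conv layers, a stack of anchor conv layers and a shared prediction layer
    that emits, per time step, a classification score, a descriptiveness score and
    the temporal offsets (dc, dw) of the associated anchor.  A 1x1 convolution plays
    the role of the per-step fully connected prediction layer."""
    def __init__(self, d_in: int = 512, n_anchor_layers: int = 9):
        super().__init__()
        self.base = nn.Sequential(
            nn.Conv1d(d_in, 512, 3, stride=1, padding=1), nn.ReLU(),
            nn.Conv1d(512, 1024, 3, stride=2, padding=1), nn.ReLU(),   # -> T/2 x 1024
        )
        self.anchors = nn.ModuleList(
            nn.Conv1d(1024 if i == 0 else 512, 512, 3, stride=2, padding=1)
            for i in range(n_anchor_layers)
        )
        self.pred = nn.Conv1d(512, 4, kernel_size=1)   # (cls score, desc score, dc, dw)

    def forward(self, f: torch.Tensor, lam: float = 0.3):
        """f: (batch, T, d_in) encoded video; returns one proposal set per anchor layer."""
        x = self.base(f.transpose(1, 2))
        proposals = []
        for conv in self.anchors:
            x = torch.relu(conv(x))
            out = self.pred(x)                         # (batch, 4, T_a)
            cls, desc, dc, dw = out.unbind(dim=1)
            t_a = out.shape[-1]
            stride = f.shape[1] / t_a                  # anchor scale on the T-step grid (assumed)
            c_a = (torch.arange(t_a) + 0.5) * stride   # anchor centers
            w_a = torch.full((t_a,), stride)           # anchor widths
            c = c_a + w_a * dc                         # refined center
            w = w_a * torch.exp(dw)                    # refined width
            conf = torch.sigmoid(cls) + lam * torch.sigmoid(desc)
            proposals.append({"start": c - w / 2, "end": c + w / 2, "confidence": conf})
        return proposals

head = EventProposalHead()
props = head(torch.rand(1, 128, 512))
# pick the anchor layer whose proposals have the largest total confidence
best = max(props, key=lambda p: p["confidence"].sum().item())
```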
and 4, step 4: the event description layer predicts the prediction result of the layer according to the event
Figure BDA00022796898900000712
Generating a mask for event j
Figure BDA00022796898900000713
Intercepting the visual feature sequence V by using a mask to obtain the visual feature corresponding to the event j
Figure BDA00022796898900000714
Then to
Figure BDA00022796898900000715
Applying average pooling to obtain visual feature vector C of videojAnd fusing the visual feature vector and the context vector H to obtain a final feature vector
Figure BDA00022796898900000716
Finally, the event j is input into an LSTM decoder for decoding to obtain the description S of the event jj. The specific method comprises the following steps:
(1) Based on the prediction result Φ_k = {p_j} of the event prediction layer, a mask M_j is generated for event j; the coverage of the mask is determined by the boundary of event j:

M_j(t) = 1 if t ∈ [t_j^start, t_j^end], and 0 otherwise.

(2) The visual feature sequence of event j is cropped using the mask:

V_j = M_j ⊙ V

where ⊙ denotes element-wise multiplication of matrices.

(3) The visual feature vector C_j of event j is obtained by average pooling:

C_j = (1/n) Σ_{i=1}^{n} v_i^j

where v_i^j ∈ V_j and n is the length of the feature sequence.
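A short sketch of items (1)-(3) — building the event mask from the predicted boundaries, masking the feature sequence and average-pooling it — is given below; rounding the real-valued boundaries to feature steps is an assumption.

```python
import torch

def event_feature(v: torch.Tensor, t_start: float, t_end: float) -> torch.Tensor:
    """v: (T, D) visual feature sequence.  Build the binary mask M_j covering the
    predicted boundaries of event j, mask the sequence element-wise and average-pool
    the covered steps to obtain the event feature vector C_j."""
    T = v.shape[0]
    steps = torch.arange(T)
    mask = ((steps >= int(t_start)) & (steps < int(t_end))).float().unsqueeze(1)  # (T, 1)
    v_j = mask * v                       # masked visual features of event j
    n = mask.sum().clamp(min=1.0)        # number of steps inside the event
    return v_j.sum(dim=0) / n            # C_j, shape (D,)

c_j = event_feature(torch.rand(30, 512), t_start=4.2, t_end=11.7)
print(c_j.shape)                         # torch.Size([512])
```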
(4) The visual feature vector of the event and the context vector H are fused to obtain the adjusted final feature vector F^j = {F_1^j, F_2^j, …, F_{T_s}^j}.

Suppose the description S_j of event j consists of T_s words, i.e. S_j = {w_1, w_2, …, w_{T_s}}. Taking the generation of one word w as one time step, the generation of S_j requires T_s time steps, so F^j = {F_1^j, …, F_{T_s}^j}.

The output of the video decoder at time t is related not only to the visual features of the event but also to the context information of the decoder. The visual feature vector C_j of the event and the context feature h_{t-1} of the decoder are dynamically fused, adjusting the proportion of the visual feature and the context feature in the decoder input and thereby generating a more accurate event description. The specific method is shown in FIG. 4:

a. Map the visual feature and the context feature of the event into the same feature space:

C'_j = tanh(W_v C_j),  h'_{t-1} = tanh(W_c h_{t-1})

where W_v and W_c are the mapping matrices of the visual feature and the context feature.

b. Compute the gate γ_t of the context feature, where E_t is the embedding vector of the decoder input word w_{t-1} at time t.

c. Fuse the visual feature and the context feature through the gating mechanism:

F_t^j = γ_t ⊙ h'_{t-1} + (1 − γ_t) ⊙ C'_j

The final feature representation of event j is F^j = {F_1^j, …, F_{T_s}^j}.
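The dynamic fusion of item (4) can be sketched as follows. Only the tanh projections and the gated blend are stated explicitly in the text, so the parameterization of the gate (a sigmoid over a linear map of the word embedding E_t and the hidden state h_{t-1}) and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class DynamicFusion(nn.Module):
    """Gated fusion of the event's visual feature C_j and the decoder context h_{t-1}.
    The gate's exact parameterization is an assumption; the tanh projections and the
    gated blend follow the description in item (4)."""
    def __init__(self, d_visual: int, d_hidden: int, d_embed: int, d_model: int):
        super().__init__()
        self.w_v = nn.Linear(d_visual, d_model)    # W_v: visual mapping
        self.w_c = nn.Linear(d_hidden, d_model)    # W_c: context mapping
        self.gate = nn.Linear(d_embed + d_hidden, d_model)

    def forward(self, c_j, h_prev, e_t):
        c_proj = torch.tanh(self.w_v(c_j))         # C'_j
        h_proj = torch.tanh(self.w_c(h_prev))      # h'_{t-1}
        g = torch.sigmoid(self.gate(torch.cat([e_t, h_prev], dim=-1)))   # gate of the context
        return g * h_proj + (1.0 - g) * c_proj     # F_t^j: fused decoder input feature

fusion = DynamicFusion(d_visual=512, d_hidden=512, d_embed=300, d_model=512)
f_t = fusion(torch.rand(1, 512), torch.rand(1, 512), torch.rand(1, 300))
```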
(5) The final feature representation F^j of event j is input into the LSTM decoder to obtain the description S_j of event j.

a. The LSTM is the basic building block of the video decoder. Its hidden state h_t is jointly determined by the feature vector E_t of the currently input word, the input visual feature vector F_t^j and the hidden state h_{t-1} of the previous time step:

h_t = LSTM(E_t, F_t^j, h_{t-1})

where E_t = E[w_{t-1}] and, in particular, E_0 = E[<BOS>].

Specifically, the LSTM cell is composed of an input gate Gi_t, a forget gate Gf_t, an output gate Go_t and an input unit g_t, computed as:

[Gi_t; Gf_t; Go_t; g_t] = [σ; σ; σ; tanh](W · [E_t; F_t^j; h_{t-1}])

where σ denotes the sigmoid activation function, tanh denotes the hyperbolic tangent, and W is a transformation matrix determined by model training. The memory cell c_t and hidden state h_t of the LSTM are updated by:

c_t = Gf_t ⊙ c_{t-1} + Gi_t ⊙ g_t
h_t = Go_t ⊙ tanh(c_t)

where ⊙ denotes element-wise multiplication of matrices.

b. From the hidden state h_t of the LSTM at time t, the probability distribution over the candidate words is computed as P_t = softmax(U_p ψ(W_p h_t + b_p) + d), where the parameters U_p, W_p, b_p and d are determined by model training.

c. The probability distribution P_t output by the decoder's softmax layer serves as the probability distribution of each word, P(w_t | w_{<t}, F^j; θ), where θ denotes the parameters of the whole model. The model parameters θ are learned by minimizing the negative log-probability of the words:

L_c(S_j) = − Σ_{t=1}^{T_s} log P(w_t | w_{<t}, F^j; θ)

where T_s is the length of the video description.

d. After the model parameters are determined, the description of an event is generated with a beam search algorithm: at time t, the k best partial descriptions up to time t are kept as candidates for the description at time t, and this process iterates until the description is complete. Finally, S_j = argmax_S (−L_c(S)) is selected as the description of event j.
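A minimal sketch of one decoding step of item (5) is shown below: the previous word is embedded, concatenated with the fused event feature F_t^j, passed through an LSTM cell, and projected to a vocabulary distribution P_t = softmax(U_p · tanh(W_p · h_t)). For brevity, greedy decoding replaces the beam search described above, the bias terms are folded into the linear layers, and the vocabulary size, embedding dimension and <BOS> index are assumptions.

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """One-step LSTM decoder: hidden state h_t is driven by the word embedding E_t,
    the fused event feature F_t^j and the previous hidden state h_{t-1}."""
    def __init__(self, vocab_size: int, d_embed: int = 300, d_model: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_embed)
        self.lstm = nn.LSTMCell(d_embed + d_model, d_model)
        self.w_p = nn.Linear(d_model, d_model)
        self.u_p = nn.Linear(d_model, vocab_size)

    def step(self, prev_word, f_t, h, c):
        e_t = self.embed(prev_word)                          # E_t = E[w_{t-1}]
        h, c = self.lstm(torch.cat([e_t, f_t], dim=-1), (h, c))
        logits = self.u_p(torch.tanh(self.w_p(h)))           # U_p * tanh(W_p h_t)
        return torch.log_softmax(logits, dim=-1), h, c

decoder = CaptionDecoder(vocab_size=10000)
h = c = torch.zeros(1, 512)
word = torch.tensor([1])                                     # <BOS> index (assumed)
caption = []
for _ in range(20):                                          # greedy decoding
    log_probs, h, c = decoder.step(word, torch.rand(1, 512), h, c)
    word = log_probs.argmax(dim=-1)
    caption.append(word.item())
```

In practice the fused feature F_t^j produced by the DynamicFusion sketch above would replace the random tensor passed to `decoder.step`.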
Training of the model:
corresponding to two modules of event prediction and event description, the model has two loss functions: event prediction loss LpAnd event description loss Lc
For a packet containing NEVideo feature sequence V ═ V for an event1,v2,…vTY, a real tag sequence V ═ y1,y2,…yTCorresponds to it. Each ytAre all one NEDimension vectors, each element of which takes the value 0 or 1. When v istWhen the time domain intersection ratio (tIoU) of the corresponding event prediction boundary and the real boundary of the event j is more than 0.5, the method will be used
Figure BDA0002279689890000093
Set to 1, otherwise set to 0.
The invention adopts a weighted multi-label cross entropy as the event prediction loss function:

L_p = −(1/T) Σ_{t=1}^{T} Σ_{i=1}^{N_E} [ w^+ · y_t^i · log ŝ_t^i + w^− · (1 − y_t^i) · log(1 − ŝ_t^i) ]

where the weights w^+ and w^− are determined by the numbers of positive and negative predictions, respectively, and ŝ_t^i is the prediction confidence for event i at time t. L_p is obtained by averaging the event prediction losses of all video sequences.
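For illustration, the weighted multi-label cross entropy can be sketched as below; the exact determination of the weights w+ and w− (here taken as the reciprocals of the positive and negative counts) is an assumption where the patent's formula image is not reproduced.

```python
import torch

def event_proposal_loss(scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Weighted multi-label cross entropy over time steps and events.
    scores: (T, N_E) predicted confidences in (0, 1); labels: (T, N_E) 0/1 targets
    (1 where the anchor's tIoU with event j exceeds 0.5)."""
    n_pos = labels.sum().clamp(min=1.0)
    n_neg = (1.0 - labels).sum().clamp(min=1.0)
    w_pos, w_neg = 1.0 / n_pos, 1.0 / n_neg            # assumed balancing scheme
    eps = 1e-8
    loss = -(w_pos * labels * torch.log(scores + eps)
             + w_neg * (1.0 - labels) * torch.log(1.0 - scores + eps))
    return loss.sum()

lp = event_proposal_loss(torch.rand(30, 5), (torch.rand(30, 5) > 0.8).float())
```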
The description loss L_c(S) of a single event has been defined in step 4-(5)-c; L_c is obtained by averaging the description losses of all events of all video sequences.
The total loss function is L = λ_p L_p + λ_c L_c, where λ_p and λ_c are coefficients balancing the contribution of each loss in the total loss. The entire model is trained in an end-to-end fashion by minimizing the loss function.
Fig. 5 shows the results of the method on the ActivityNet Captions dataset. The results show that, compared with traditional attention-based dense video captioning algorithms, the method generates more specific event descriptions, makes consecutive sentences more coherent, and expresses richer information. At the same time, the method effectively overcomes the poor parallelism and low efficiency of existing dense video description generation algorithms and has strong practicability and robustness.
The technical solutions of the present invention are not limited to the above embodiments, and all technical solutions obtained by using equivalent substitution modes fall within the scope of the present invention.

Claims (6)

1. A multi-event video description method based on a dynamic attention mechanism is characterized by comprising the following steps:
step one, extracting visual characteristics V of a target video sequence X by adopting a convolutional neural network;
step two, inputting the visual features V of the video into an L-layer self-attention video encoding layer to obtain the video encodings F_i;
step three, using the event prediction layer to generate event predictions Φ_i from the video encodings F_i, and selecting the prediction of the layer with the highest prediction confidence as the final prediction result Φ_k;
step four, based on the prediction result Φ_k = {p_j} of the event prediction layer, generating a mask M_j for event j and using the mask to crop the visual feature sequence of event j:

V_j = M_j ⊙ V

where ⊙ denotes element-wise multiplication of matrices;

obtaining the visual feature vector C_j of event j by average pooling:

C_j = (1/n) Σ_{i=1}^{n} v_i^j

where v_i^j ∈ V_j and n is the length of the feature sequence;

fusing the visual feature vector with the context vector H of the event to obtain the adjusted final feature vector F^j = {F_1^j, F_2^j, …, F_{T_s}^j};

supposing the description S_j of event j consists of T_s words, i.e. S_j = {w_1, w_2, …, w_{T_s}}, and taking the generation of one word w as one time step, the generation of S_j requires T_s time steps, so F^j = {F_1^j, …, F_{T_s}^j};

mapping the visual feature and the context feature h_{t-1} of the event into the same feature space:

C'_j = tanh(W_v C_j),  h'_{t-1} = tanh(W_c h_{t-1})

where W_v and W_c are the mapping matrices of the visual feature and the context feature; the context feature h_{t-1} is the hidden state of the LSTM unit at the previous time step, and h_t is jointly determined by the feature vector E_t of the currently input word, the input visual feature vector F_t^j and the hidden state h_{t-1} of the previous time step:

h_t = LSTM(E_t, F_t^j, h_{t-1})

where E_t = E[w_{t-1}] and, in particular, E_0 = E[<BOS>];

computing the gate γ_t of the context feature, where E_t is the embedding vector of the decoder input word w_{t-1} at time t;

fusing the visual feature and the context feature through the gating mechanism:

F_t^j = γ_t ⊙ h'_{t-1} + (1 − γ_t) ⊙ C'_j

and inputting the final feature representation F^j = {F_1^j, …, F_{T_s}^j} of event j into the LSTM decoder for decoding to obtain the description S_j of event j.
2. The method for describing multiple event videos based on the dynamic attention mechanism as claimed in claim 1, wherein: the encoding step of the video in the second step comprises the following steps:
the visual feature V is taken as the input of the first encoder layer, whose output is F_1 = E(V); every other encoder layer takes the output of the previous layer as input, and its encoded output is F_{l+1} = E(F_l).
3. The method for describing multiple event videos based on the dynamic attention mechanism as claimed in claim 2, wherein: each encoder layer comprises a multi-head attention layer and a point type feedforward layer;
the multi-head attention layer is computed as:

Ω(F_l) = LN(MultiHead(F_l, F_l, F_l), F_l)

the point-wise feed-forward layer is computed as:

E(F_l) = LN(FF(Ω(F_l)), Ω(F_l))
FF(x) = W_2 · max(0, W_1 x + b_1) + b_2

where LN(p, q) = LayerNorm(p + q) denotes layer normalization applied to the residual sum, FF(·) is a two-layer feed-forward neural network whose first layer uses a ReLU activation, W_1 and W_2 are the weight matrices of the network, and b_1 and b_2 are the bias terms; the definition of Ω(·) uses the self-attention mechanism: during the encoding of step t, f_t^l serves as the query to the attention layer, and the resulting output is a weighted sum of f_i^l (i = 1, 2, …, T).
4. The method for describing multiple-event videos based on the dynamic attention mechanism as claimed in claim 1, wherein: the specific method by which the event prediction layer in step three generates event predictions from the video encoding F_i is as follows:
step 3.1, inputting the video encoding F_i into the base layer of the event prediction layer;
step 3.2, inputting the output characteristics of the basic layer into an anchor layer of the event prediction layer, and gradually reducing the time dimension of the characteristics;
and 3.3, inputting the output of each anchor layer into a prediction layer, and generating a set of fixed event predictions at one time.
5. The method for describing multiple-event videos based on the dynamic attention mechanism as claimed in claim 4, wherein: the prediction p_j^i of the j-th event at the i-th layer in step three is computed as follows:

the boundary of the event is calculated by

c_j = c_a + w_a · Δc_j
w_j = w_a · exp(Δw_j)
t_j^start = c_j − w_j / 2
t_j^end = c_j + w_j / 2

where c_a and w_a denote the center position and width of the anchor before optimization, Δc_j is the temporal offset of the anchor's center position before optimization, Δw_j is the temporal offset of the anchor's width before optimization, exp(·) is an exponential function, and c_j and w_j denote the center position and width of the anchor after optimization;

the confidence s_j of the event prediction is calculated by

s_j = s_j^c + λ · s_j^d

where s_j^c and s_j^d denote the classification confidence and the language-description confidence of the event, respectively, and λ is a hyper-parameter.
6. The method for describing multiple event videos based on the dynamic attention mechanism as claimed in claim 5, wherein:
the final prediction result Φ_k in step three is obtained by selecting the layer whose event predictions have the largest sum of confidences as the final prediction result:

k = argmax_i Σ_j s_j^i,  Φ_k = {p_j^k}.
CN201911136308.8A 2019-11-19 2019-11-19 Multi-event video description method based on dynamic attention mechanism Active CN110929092B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911136308.8A CN110929092B (en) 2019-11-19 2019-11-19 Multi-event video description method based on dynamic attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911136308.8A CN110929092B (en) 2019-11-19 2019-11-19 Multi-event video description method based on dynamic attention mechanism

Publications (2)

Publication Number Publication Date
CN110929092A true CN110929092A (en) 2020-03-27
CN110929092B CN110929092B (en) 2023-07-04

Family

ID=69850335

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911136308.8A Active CN110929092B (en) 2019-11-19 2019-11-19 Multi-event video description method based on dynamic attention mechanism

Country Status (1)

Country Link
CN (1) CN110929092B (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111488815A (en) * 2020-04-07 2020-08-04 中山大学 Basketball game goal event prediction method based on graph convolution network and long-time and short-time memory network
CN111652357A (en) * 2020-08-10 2020-09-11 浙江大学 Method and system for solving video question-answer problem by using specific target network based on graph
CN112308402A (en) * 2020-10-29 2021-02-02 复旦大学 Power time series data abnormity detection method based on long and short term memory network
CN113312980A (en) * 2021-05-06 2021-08-27 华南理工大学 Video intensive description method, device and medium
CN113392717A (en) * 2021-05-21 2021-09-14 杭州电子科技大学 Video dense description generation method based on time sequence characteristic pyramid
CN113469260A (en) * 2021-07-12 2021-10-01 天津理工大学 Visual description method based on convolutional neural network, attention mechanism and self-attention converter
CN113505659A (en) * 2021-02-02 2021-10-15 黑芝麻智能科技有限公司 Method for describing time event
CN113593603A (en) * 2021-07-27 2021-11-02 浙江大华技术股份有限公司 Audio category determination method and device, storage medium and electronic device
WO2022116420A1 (en) * 2020-12-01 2022-06-09 平安科技(深圳)有限公司 Speech event detection method and apparatus, electronic device, and computer storage medium
CN114627413A (en) * 2022-03-11 2022-06-14 电子科技大学 Video intensive event content understanding method
CN114661953A (en) * 2022-03-18 2022-06-24 北京百度网讯科技有限公司 Video description generation method, device, equipment and storage medium
CN114998673A (en) * 2022-05-11 2022-09-02 河海大学 Dam defect time sequence image description method based on local self-attention mechanism
CN115190332A (en) * 2022-07-08 2022-10-14 西安交通大学医学院第二附属医院 Dense video subtitle generation method based on global video characteristics

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120008868A1 (en) * 2010-07-08 2012-01-12 Compusensor Technology Corp. Video Image Event Attention and Analysis System and Method
CN108960063A (en) * 2018-06-01 2018-12-07 清华大学深圳研究生院 It is a kind of towards event relation coding video in multiple affair natural language description algorithm
CN109344288A (en) * 2018-09-19 2019-02-15 电子科技大学 A kind of combination video presentation method based on multi-modal feature combination multilayer attention mechanism
CN110072142A (en) * 2018-01-24 2019-07-30 腾讯科技(深圳)有限公司 Video presentation generation method, device, video broadcasting method, device and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120008868A1 (en) * 2010-07-08 2012-01-12 Compusensor Technology Corp. Video Image Event Attention and Analysis System and Method
CN110072142A (en) * 2018-01-24 2019-07-30 腾讯科技(深圳)有限公司 Video presentation generation method, device, video broadcasting method, device and storage medium
CN108960063A (en) * 2018-06-01 2018-12-07 清华大学深圳研究生院 It is a kind of towards event relation coding video in multiple affair natural language description algorithm
CN109344288A (en) * 2018-09-19 2019-02-15 电子科技大学 A kind of combination video presentation method based on multi-modal feature combination multilayer attention mechanism

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LIANLI GAO et al.: "Video Captioning with Attention-based LSTM and Semantic Consistency", IEEE Transactions on Multimedia *
NING XU et al.: "Attention-In-Attention Networks for Surveillance Video Understanding in IoT", IEEE Internet of Things Journal *
JI Zhong et al.: "Video Summarization Based on a Decoder Attention Mechanism", Journal of Tianjin University *

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111488815B (en) * 2020-04-07 2023-05-09 中山大学 Event prediction method based on graph convolution network and long-short-time memory network
CN111488815A (en) * 2020-04-07 2020-08-04 中山大学 Basketball game goal event prediction method based on graph convolution network and long-time and short-time memory network
CN111652357A (en) * 2020-08-10 2020-09-11 浙江大学 Method and system for solving video question-answer problem by using specific target network based on graph
CN112308402B (en) * 2020-10-29 2022-04-12 复旦大学 Power time series data abnormity detection method based on long and short term memory network
CN112308402A (en) * 2020-10-29 2021-02-02 复旦大学 Power time series data abnormity detection method based on long and short term memory network
WO2022116420A1 (en) * 2020-12-01 2022-06-09 平安科技(深圳)有限公司 Speech event detection method and apparatus, electronic device, and computer storage medium
CN113505659A (en) * 2021-02-02 2021-10-15 黑芝麻智能科技有限公司 Method for describing time event
US11887384B2 (en) 2021-02-02 2024-01-30 Black Sesame Technologies Inc. In-cabin occupant behavoir description
CN113312980B (en) * 2021-05-06 2022-10-14 华南理工大学 Video intensive description method, device and medium
CN113312980A (en) * 2021-05-06 2021-08-27 华南理工大学 Video intensive description method, device and medium
CN113392717A (en) * 2021-05-21 2021-09-14 杭州电子科技大学 Video dense description generation method based on time sequence characteristic pyramid
CN113392717B (en) * 2021-05-21 2024-02-13 杭州电子科技大学 Video dense description generation method based on time sequence feature pyramid
CN113469260A (en) * 2021-07-12 2021-10-01 天津理工大学 Visual description method based on convolutional neural network, attention mechanism and self-attention converter
CN113593603A (en) * 2021-07-27 2021-11-02 浙江大华技术股份有限公司 Audio category determination method and device, storage medium and electronic device
CN114627413A (en) * 2022-03-11 2022-06-14 电子科技大学 Video intensive event content understanding method
CN114627413B (en) * 2022-03-11 2022-09-13 电子科技大学 Video intensive event content understanding method
CN114661953A (en) * 2022-03-18 2022-06-24 北京百度网讯科技有限公司 Video description generation method, device, equipment and storage medium
CN114998673B (en) * 2022-05-11 2023-10-13 河海大学 Dam defect time sequence image description method based on local self-attention mechanism
WO2023217163A1 (en) * 2022-05-11 2023-11-16 华能澜沧江水电股份有限公司 Dam defect time-sequence image description method based on local self-attention mechanism
CN114998673A (en) * 2022-05-11 2022-09-02 河海大学 Dam defect time sequence image description method based on local self-attention mechanism
CN115190332A (en) * 2022-07-08 2022-10-14 西安交通大学医学院第二附属医院 Dense video subtitle generation method based on global video characteristics

Also Published As

Publication number Publication date
CN110929092B (en) 2023-07-04

Similar Documents

Publication Publication Date Title
CN110929092A (en) Multi-event video description method based on dynamic attention mechanism
CN109389091B (en) Character recognition system and method based on combination of neural network and attention mechanism
Kim et al. Efficient dialogue state tracking by selectively overwriting memory
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
CN107484017B (en) Supervised video abstract generation method based on attention model
EP4073787B1 (en) System and method for streaming end-to-end speech recognition with asynchronous decoders
CN111488807B (en) Video description generation system based on graph rolling network
CN110309732B (en) Behavior identification method based on skeleton video
CN112329760B (en) Method for recognizing and translating Mongolian in printed form from end to end based on space transformation network
CN109543820B (en) Image description generation method based on architecture phrase constraint vector and double vision attention mechanism
CN109919174A (en) A kind of character recognition method based on gate cascade attention mechanism
CN111079532A (en) Video content description method based on text self-encoder
CN112329794B (en) Image description method based on dual self-attention mechanism
CN110991290A (en) Video description method based on semantic guidance and memory mechanism
CN109829495A (en) Timing image prediction method based on LSTM and DCGAN
CN111814844A (en) Intensive video description method based on position coding fusion
CN115690152A (en) Target tracking method based on attention mechanism
CN116524593A (en) Dynamic gesture recognition method, system, equipment and medium
Chen et al. Training full spike neural networks via auxiliary accumulation pathway
CN113868451B (en) Cross-modal conversation method and device for social network based on up-down Wen Jilian perception
CN114973416A (en) Sign language recognition algorithm based on three-dimensional convolution network
CN115422369B (en) Knowledge graph completion method and device based on improved TextRank
CN109918484B (en) Dialog generation method and device
CN116189284A (en) Human motion prediction method, device, equipment and storage medium
CN115311598A (en) Video description generation system based on relation perception

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant