CN110929092A - Multi-event video description method based on dynamic attention mechanism - Google Patents
- Publication number: CN110929092A
- Application number: CN201911136308.8A
- Authority
- CN
- China
- Prior art keywords
- event
- layer
- video
- prediction
- description
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F16/71 — Information retrieval of video data: indexing; data structures therefor; storage structures
- G06F16/75 — Information retrieval of video data: clustering; classification
- G06F16/7867 — Retrieval of video data characterised by manually generated metadata, e.g. tags, keywords, comments, title and artist information
- G06N3/045 — Neural network architectures: combinations of networks
- Y02T10/40 — Engine management systems
Abstract
The invention discloses a multi-event video description method based on a dynamic attention mechanism, comprising the following steps: input the video sequence into a three-dimensional convolutional neural network and extract the visual features of the video; encode the visual features with an attention-based video encoding layer and input the feature encodings into an event prediction layer; the event prediction layer predicts each event from the video encoding information; and the event description layer acquires the visual features of each event from the event prediction results and dynamically combines them with its own context information to generate the textual description of each event. The method overcomes the poor parallelism and low efficiency of existing multi-event video description methods, preserves the accuracy of the generated video descriptions, and allows the model to be trained end-to-end.
Description
Technical Field
The invention relates to a multi-event video description method based on a dynamic attention mechanism, and belongs to the field of video description in computer vision.
Background
Video tagging is a technique that analyzes video content and produces classification tags; such tags effectively extract the key information of a video and are widely applied in video storage and retrieval. A video tag, however, cannot represent the more detailed information of a video. Video description (video captioning) is the process of automatically generating a natural language description of a video by computer. A video description not only extracts the key elements of the video but also conveys the associations among these elements through sentences, so it has significant application value and development prospects in video storage and retrieval, human-computer interaction, knowledge extraction, and related fields.
Unlike image description (image captioning), video contains a great deal of constantly changing spatio-temporal information, and efficiently acquiring the useful information to describe a video accurately is a major challenge in computer vision. The S2VT (Sequence to Sequence - Video to Text) algorithm proposed by Venugopalan et al. was the first successful application of deep learning in the video description field. The method extracts 2D convolutional features and optical flow features of the video and feeds them into a two-layer stacked LSTM network to generate the description, laying the foundation for video description algorithms built on the Encoder-Decoder architecture. The field now has many research results, but most are improvements on the S2VT algorithm, such as extracting video features with a 3D CNN, using multi-modal fused features, or decoding with an improved GRU network.
A long video may contain multiple events, and the single sentence generated by a traditional video description method is too coarse to cover more than part of the information; dense video description (dense video captioning) was proposed to solve this problem. Dense video description was introduced by Z. Shen et al. in the article "Weakly Supervised Dense Video Captioning": for a video, different region sequences are extracted, and a sentence description is generated for each region sequence. This is the prototype of the event proposal - caption generation architecture commonly adopted by dense video description today. Compared with traditional video description algorithms, the per-region-sequence descriptions it provides are more refined and richer in information, opening a brand new research direction.
Recent research on dense video description has mainly focused on efficiently extracting and representing the information in a video and on improving the accuracy of event prediction. For the first problem, attention mechanisms (e.g., "Describing Videos by Exploiting Temporal Structure") replaced the original average-pooling method for generating the video information representation, largely solving the loss of temporal video information during encoding. Wang et al. ("Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning") pointed out that most methods extract only the backward context of the video sequence during encoding and ignore the forward context, leaving the event prediction unable to distinguish highly overlapping events. They therefore proposed a bidirectional video encoding method that uses two LSTM layers to encode the forward and backward context of the video and performs event prediction on the fused context, improving the accuracy of event prediction.
However, existing dense video description generation methods still have problems. During video decoding, most methods simply concatenate the context features and the visual features to form the decoder input, so the generated descriptions are not accurate. Meanwhile, the widely adopted LSTM video encoder suffers from poor parallelism. An efficient dense video description generation method that can quickly and accurately localize and describe the events in a video is therefore needed.
Disclosure of Invention
The invention provides a multi-event video description method based on a dynamic attention mechanism, aiming to solve the poor parallelism and low accuracy of existing dense video description generation algorithms and to achieve accurate localization and description of the events in a video. To this end, the technical solution provided by the invention is as follows: a multi-event video description method based on a dynamic attention mechanism, characterized by comprising the following steps:
Step one, extracting the visual features V of a target video sequence X with a convolutional neural network;
Step two, inputting the visual features V of the video into an L-layer self-attention video encoding layer to obtain the video encodings F_i;
Step three, using an event prediction layer to generate event predictions φ_i from the video encodings F_i, and selecting the prediction of the layer with the highest confidence as the final prediction result φ_k;
Step four, based on the prediction result φ_k of the event prediction layer, generating a mask M_j for event j and using the mask to intercept the visual feature sequence of event j: V_j = M_j ⊙ V, where ⊙ denotes element-wise matrix multiplication;
obtaining the visual feature vector C_j of event j by average pooling: C_j = AvgPool(V_j);
fusing the visual feature vector of the event with the context vector H to obtain the adjusted final feature vector Ĉ_j;
Suppose the description S_j of event j consists of T_s words, i.e., S_j = {w_1, w_2, …, w_{T_s}}. Treating the generation of one word w by the decoder as one time step, generating S_j requires T_s time steps.
The visual feature of the event and the context feature h_{t-1} are mapped into the same feature space: C'_j = tanh(W_v C_j), h'_{t-1} = tanh(W_c h_{t-1}), where W_v and W_c are the mapping matrices of the visual and context features, and the context feature h_{t-1} is the hidden state of the LSTM unit at the previous time step. The hidden state h_t is jointly determined by the feature vector E_t of the currently input word, the input visual feature vector Ĉ_j, and the hidden state h_{t-1} of the previous time step: h_t = LSTM(E_t, Ĉ_j, h_{t-1}), where E_t = E[w_{t-1}]; in particular, E_0 = E[<BOS>].
The gate g_t of the context feature is computed from h'_{t-1} and E_t, where E_t is the embedding vector of the decoder input word w_{t-1} at time t;
the visual feature and the context feature are fused through a gating mechanism, the gate g_t adjusting the proportion of C'_j and h'_{t-1} to give the final feature representation Ĉ_j of event j.
The final feature representation Ĉ_j of event j is input into an LSTM decoder to obtain the description S_j of event j.
The video encoding in step two proceeds as follows:
the visual feature V is the input of the first encoder layer, whose output is F^1 = E(V); each other encoder layer takes the output of the previous layer as input, and its encoded output is F^{l+1} = E(F^l).
Each encoder layer comprises a multi-head attention sub-layer and a position-wise feedforward sub-layer.
The multi-head attention sub-layer computes: Ω(F^l) = LN(MultiHead(F^l, F^l, F^l), F^l)
The position-wise feedforward sub-layer computes: E(F^l) = LN(FF(Ω(F^l)), Ω(F^l))
where LN(p, q) = LayerNorm(p + q) denotes the layer normalization operation applied to the residual output; FF(·) is a two-layer feedforward network whose first layer uses a nonlinear ReLU activation, with weight matrices W_1, W_2 and bias terms b_1, b_2; and Ω(·) is defined with the self-attention mechanism: during step t of encoding, f_t^l serves as the query of the attention layer, and the output is a weighted sum of f_i^l (i = 1, 2, …, T).
The event prediction layer in step three generates event predictions from the video encodings F_i as follows:
Step 3.1, first input the video encoding F_i into the base layer of the event prediction layer;
Step 3.2, feed the output features of the base layer into the anchor layers of the event prediction layer, which progressively reduce the temporal dimension of the features;
Step 3.3, feed the output of each anchor layer into a prediction layer, which generates a fixed set of event predictions at once.
The prediction of the jth event at the ith layer in step three is computed as:
c_j = c_a + θ_c · w_a,  w_j = w_a · exp(θ_w),
where c_a and w_a denote the center position and width of the anchor before refinement, θ_c is the temporal offset of the center position, θ_w is the temporal offset of the width, exp(·) is the exponential function, and c_j and w_j denote the center position and width of the anchor after refinement.
The prediction confidence s_j of the event is computed as s_j = s^c_j + λ·s^d_j, where s^c_j and s^d_j denote the classification confidence and the language description confidence of the event respectively, and λ is a hyper-parameter.
The final prediction result φ_k in step three is selected by taking the predictions of the layer whose event prediction confidences sum to the largest value: k = argmax_i Σ_j s^i_j.
The beneficial effects of the invention are as follows:
The invention provides a multi-event video description method based on a dynamic attention mechanism that encodes the visual features with a self-attention mechanism and feeds the feature encodings into an event prediction layer; the event prediction layer predicts each event in the video from the video encoding information, and the event description layer generates the textual description of each event from the event prediction results and the video features, dynamically fused with the context information of the decoder. The method combines a self-attention mechanism with a feedforward neural network to replace the LSTM-based video encoder, overcoming the poor parallelism and low efficiency of LSTM encoding while preserving the accuracy of the generated video descriptions.
When acquiring the visual features of an event, the invention intercepts the visual features corresponding to the event with a mask matrix generated from the event prediction result, so that the decoder receives the effective information while interference from other information is excluded, giving high robustness and stability.
During video decoding, the invention dynamically fuses the visual features of the event with the context information of the decoder and uses a gate to dynamically adjust the proportions of the visual and context features in the decoder input, thereby generating more accurate and coherent event descriptions.
The method performs end-to-end model training by minimizing the total loss (comprising the event prediction loss and the sentence generation loss); training is efficient and stable, and the training cost is reduced.
Drawings
FIG. 1 is a flow chart of a method of an embodiment of the present invention;
FIG. 2 is a block diagram of a method according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of a method according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating a dynamic fusion mechanism of visual features and context information in an embodiment of the present invention;
FIG. 5 shows the results of the method of the present invention on the ActivityNet Captions dataset.
Detailed Description
The invention is described in detail below with reference to the figures and the specific embodiments.
Embodiment
As shown in FIG. 1, FIG. 2 and FIG. 3, the invention designs a multi-event video description method based on a dynamic attention mechanism, which comprises the following steps:
Step 1: use a convolutional neural network (a 3D-CNN in this embodiment) to extract the visual features V = {v_1, v_2, …, v_T} of a video sequence X = {x_1, x_2, …, x_L}.
For a video sequence X = {x_1, x_2, …, x_L} of L frames, features are extracted from the video frames with a 3D CNN pre-trained on the Sports-1M video dataset. The temporal resolution of the extracted C3D features is δ = 16 frames, so the input video stream is discretized into T = L/δ steps, and the final feature sequence is V = {v_1, v_2, …, v_T}.
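The discretization and per-clip feature extraction can be sketched as follows. This is a minimal illustration rather than the implementation: `C3DBackbone` is a hypothetical stand-in for the Sports-1M pre-trained C3D network, and the feature dimension of 500 is an assumed value.

```python
import torch
import torch.nn as nn

class C3DBackbone(nn.Module):
    """Hypothetical stand-in for a Sports-1M pre-trained C3D network."""
    def __init__(self, feature_dim: int = 500):
        super().__init__()
        # A real C3D stacks eight 3D conv layers; one layer keeps the sketch short.
        self.conv = nn.Conv3d(3, 64, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.fc = nn.Linear(64, feature_dim)

    def forward(self, clips):                     # clips: (T, 3, delta, H, W)
        x = self.pool(torch.relu(self.conv(clips)))
        return self.fc(x.flatten(1))              # (T, feature_dim)

def extract_features(frames, delta=16):
    """Discretize an L-frame video into T = L // delta clips and encode each clip."""
    L = frames.shape[0]
    T = L // delta
    # group delta consecutive frames into one clip, channels-first for Conv3d
    clips = frames[: T * delta].reshape(T, delta, *frames.shape[1:])
    clips = clips.permute(0, 2, 1, 3, 4)          # (T, 3, delta, H, W)
    with torch.no_grad():
        return C3DBackbone()(clips)               # V = {v_1, ..., v_T}

V = extract_features(torch.rand(160, 3, 112, 112))   # toy video: L = 160 frames
```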
Step 2: inputting visual characteristics V of the video into an L-layer self-attention video coding layer to obtain a coded representation { F } of the video1,F2,…Fl}。
And (3) taking the characteristic sequence obtained in the step (1) as input of L encoder layers, wherein each encoder layer consists of a multi-head attention layer and a dot type feedforward layer. Each layer of coder takes the output of the previous layer as input and obtains the output F of the current layer by codingl+1=E(Fl) In particular, F1E (v). The specific method comprises the following steps:
Multi-head attention sub-layer: Ω(F^l) = LN(MultiHead(F^l, F^l, F^l), F^l)
Position-wise feedforward sub-layer: E(F^l) = LN(FF(Ω(F^l)), Ω(F^l))
where LN(p, q) = LayerNorm(p + q) denotes the layer normalization (Layer Normalization) operation applied to the residual output, and FF(·) is a two-layer feedforward network whose first layer uses a nonlinear ReLU activation, with weight matrices W_1, W_2 and bias terms b_1, b_2. The definition of Ω(·) uses the self-attention mechanism: during step t of encoding, f_t^l serves as the query of the attention layer, and the output is a weighted sum of f_i^l (i = 1, 2, …, T). Step t therefore encodes not only the information of the current step but also that of the other steps, so every encoding step produced through the self-attention mechanism contains the full context information.
The multi-head attention mechanism consists of N scaled dot-product attention layers: MultiHead(Q, K, V) = Concat(head_1, …, head_N) W^O.
Each layer is one "head", defined as: head_i = Attention(W_i^Q Q, W_i^K K, W_i^V V), where W_i^Q, W_i^K, W_i^V are mapping matrices.
Scaled dot-product attention is defined as: Attention(Q, K, V) = softmax(QK^T / √d) V, where Q, K and V are the query, key and value matrices respectively, and q_i, k_i, v_i all have dimension d. The similarity between the query q and each key k_t (t = 1, …, T) is computed to obtain a weight for each value v_t, and the weighted sum of the v_t gives the output.
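As an illustration of the encoder layer above, the following sketch uses PyTorch's `nn.MultiheadAttention` in the role of MultiHead(·); the model width, head count, and feedforward size are assumed values, not ones given by the text.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Ω(F) = LN(MultiHead(F, F, F) + F);  E(F) = LN(FF(Ω(F)) + Ω(F))."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # two-layer feedforward network, ReLU after the first layer
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, F):                          # F: (B, T, d_model)
        omega = self.ln1(self.attn(F, F, F, need_weights=False)[0] + F)
        return self.ln2(self.ff(omega) + omega)

layers = nn.ModuleList(EncoderLayer() for _ in range(3))   # L = 3 assumed
F, codes = torch.rand(1, 10, 512), []                      # toy encoded video, T = 10
for layer in layers:
    F = layer(F)
    codes.append(F)                 # {F^1, F^2, F^3} feed the event prediction layer
```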
Step 3: the event prediction layer predicts the events in the video from the per-layer video encodings F_i, obtaining each encoding layer's event predictions φ_i, and selects the predictions of the layer k = argmax_i(φ_i) with the highest confidence as the final prediction result. The concrete implementation is as follows:
(1) The video encoding F_i is first input into the base layer of the event prediction layer (conv_1 and conv_2), which reduces the temporal dimension of the video encoding and enlarges the temporal receptive field; the final output feature dimension is T/2 × 1024.
(2) The output features of the base layer are then fed into nine anchor layers (conv_3 to conv_11; each anchor layer has kernel size 3, stride 2, and 512 filters), which progressively reduce the temporal dimension of the features so that events can be predicted over multiple time scales.
(3) The prediction layer receives the output of each anchor layer and generates a fixed set of event predictions at once. Specifically, for an input feature f_j of size T_j × D_j, the event prediction result (θ_c, θ_w, s^c_j, s^d_j) is generated through a fully connected layer from a 1 × D_j feature unit, where s^c_j is the classification score of the event/background classification, s^d_j is the description score expressing the confidence that the predicted event can be well described, and θ_c and θ_w are the offsets of the center position and width of event j relative to its associated anchor.
(4) a. The prediction of the jth event at the ith layer is computed as: c_j = c_a + θ_c · w_a, w_j = w_a · exp(θ_w), where c_a and w_a denote the center position and width of the anchor before refinement, θ_c is the temporal offset of the center position, θ_w is the temporal offset of the width, exp(·) is the exponential function, and c_j and w_j denote the center position and width of the anchor after refinement.
b. The prediction confidence s_j of the event combines the visual classification confidence s^c_j and the language description confidence s^d_j: s_j = s^c_j + λ·s^d_j, where λ is a hyper-parameter.
(5) The final prediction result φ_k is selected by taking the predictions of the layer whose event prediction confidences sum to the largest value: k = argmax_i Σ_j s^i_j.
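The following condensed sketch walks through steps (1)–(5). The base/anchor layer shapes and the additive confidence s_j = s^c_j + λ·s^d_j follow the text; the anchor geometry (one anchor per temporal cell, centered on the cell, width equal to the cell span) and λ = 0.3 are assumptions made only to keep the sketch self-contained.

```python
import torch
import torch.nn as nn

class EventProposal(nn.Module):
    def __init__(self, d_in=512, n_anchor_layers=9):
        super().__init__()
        # base layer (conv_1, conv_2): halves the time axis, widens the receptive field
        self.base = nn.Sequential(
            nn.Conv1d(d_in, 1024, 3, stride=1, padding=1), nn.ReLU(),
            nn.Conv1d(1024, 1024, 3, stride=2, padding=1), nn.ReLU())
        # anchor layers (conv_3 .. conv_11): kernel 3, stride 2, 512 filters each
        self.anchors = nn.ModuleList(
            nn.Conv1d(1024 if i == 0 else 512, 512, 3, stride=2, padding=1)
            for i in range(n_anchor_layers))
        # per-position prediction head: [theta_c, theta_w, s_cls, s_desc]
        self.head = nn.Conv1d(512, 4, kernel_size=1)

    def forward(self, F, lam=0.3):                 # F: (1, d_in, T)
        layer_preds = []
        x = self.base(F)                           # (1, 1024, T/2)
        for conv in self.anchors:
            x = torch.relu(conv(x))                # time dimension shrinks layer by layer
            out = self.head(x)[0]                  # (4, T_a): one anchor per position
            theta_c, theta_w = out[0], out[1]
            s = torch.sigmoid(out[2]) + lam * torch.sigmoid(out[3])  # s = s_cls + λ·s_desc
            T_a = out.shape[-1]
            c_a = (torch.arange(T_a) + 0.5) / T_a  # assumed anchor centers (normalized)
            w_a = torch.full((T_a,), 1.0 / T_a)    # assumed anchor widths
            c = c_a + theta_c * w_a                # refined center
            w = w_a * torch.exp(theta_w)           # refined width
            layer_preds.append((c, w, s))
        # final result: the layer whose prediction confidences sum highest
        k = max(range(len(layer_preds)), key=lambda i: layer_preds[i][2].sum().item())
        return layer_preds[k]

proposals = EventProposal()(torch.rand(1, 512, 512))   # toy encoding with T = 512
```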
and 4, step 4: the event description layer predicts the prediction result of the layer according to the eventGenerating a mask for event jIntercepting the visual feature sequence V by using a mask to obtain the visual feature corresponding to the event jThen toApplying average pooling to obtain visual feature vector C of videojAnd fusing the visual feature vector and the context vector H to obtain a final feature vectorFinally, the event j is input into an LSTM decoder for decoding to obtain the description S of the event jj. The specific method comprises the following steps:
(1) event based predictionPrediction result of layerGenerating a mask for event jThe coverage of this mask is determined by the boundaries of event j:
where ⊙ denotes matrix element multiplication.
(3) Obtaining visual feature vector C of event j by average poolingj:
(4) Fusing the visual characteristic vector and the context vector H of the event to obtain an adjusted final characteristic vector
Suppose the description S_j of event j consists of T_s words, i.e., S_j = {w_1, w_2, …, w_{T_s}}. Treating the generation of one word w by the decoder as one time step, generating S_j requires T_s time steps; the output of the video decoder at time t is therefore related not only to the visual features of the event but also to the context information of the decoder. The visual feature vector C_j of the event and the context feature h_{t-1} of the decoder are dynamically fused, adjusting the proportions of the visual feature and the context feature in the decoder input and thereby generating a more accurate event description. The specific method, shown in FIG. 4, is:
a. The visual feature and the context feature of the event are mapped into the same feature space: C'_j = tanh(W_v C_j), h'_{t-1} = tanh(W_c h_{t-1}), where W_v and W_c are the mapping matrices of the visual and context features.
b. The gate g_t of the context feature is computed from h'_{t-1} and E_t, where E_t is the embedding vector of the input word w_{t-1} of the LSTM decoder at time t.
c. The visual feature and the context feature are fused through a gating mechanism, the gate g_t adjusting the proportion of C'_j and h'_{t-1} to give the final feature representation Ĉ_j of event j.
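A sketch of the masking, pooling, and gated fusion steps above. The text does not spell out the exact inputs of the gate or the direction of the mixing, so feeding [E_t; h'_{t-1}; C'_j] to a sigmoid layer and mixing as g_t ⊙ C'_j + (1 − g_t) ⊙ h'_{t-1} are assumptions of one plausible instantiation.

```python
import torch
import torch.nn as nn

def event_visual_vector(V, start, end):
    """Mask the feature sequence V (T, d) to event j's boundaries, then average-pool."""
    M = torch.zeros(V.shape[0], 1)
    M[start:end] = 1.0                # mask M_j covers event j only
    Vj = M * V                        # V_j = M_j ⊙ V (element-wise)
    return Vj.sum(0) / M.sum()        # C_j: mean over the event span

class GatedFusion(nn.Module):
    def __init__(self, d_vis=512, d_hid=512, d_emb=300):
        super().__init__()
        self.Wv = nn.Linear(d_vis, d_hid)   # C'_j = tanh(W_v C_j)
        self.Wc = nn.Linear(d_hid, d_hid)   # h'_{t-1} = tanh(W_c h_{t-1})
        self.gate = nn.Linear(d_emb + 2 * d_hid, d_hid)  # gate inputs: an assumption

    def forward(self, C_j, h_prev, E_t):
        Cp = torch.tanh(self.Wv(C_j))
        hp = torch.tanh(self.Wc(h_prev))
        g = torch.sigmoid(self.gate(torch.cat([E_t, hp, Cp], dim=-1)))
        return g * Cp + (1.0 - g) * hp      # gated final feature for event j

C_j = event_visual_vector(torch.rand(30, 512), start=5, end=14)
fused = GatedFusion()(C_j.unsqueeze(0), torch.zeros(1, 512), torch.zeros(1, 300))
```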
(5) The final feature representation Ĉ_j of event j is input into the LSTM decoder to obtain the description S_j of event j:
a. The LSTM is the basic building block of the video decoder. Its hidden state h_t is jointly determined by the feature vector E_t of the currently input word, the input visual feature vector Ĉ_j, and the hidden state h_{t-1} of the previous time step: h_t = LSTM(E_t, Ĉ_j, h_{t-1}), where E_t = E[w_{t-1}]; in particular, E_0 = E[<BOS>].
Specifically, the LSTM unit consists of an input gate Gi_t, a forget gate Gf_t, an output gate Go_t and an input unit g_t, each computed from the current input [E_t, Ĉ_j] and the previous hidden state h_{t-1}:
Gi_t = σ(W_i[E_t, Ĉ_j, h_{t-1}]), Gf_t = σ(W_f[E_t, Ĉ_j, h_{t-1}]), Go_t = σ(W_o[E_t, Ĉ_j, h_{t-1}]), g_t = tanh(W_g[E_t, Ĉ_j, h_{t-1}]),
where σ denotes the sigmoid activation function, tanh the hyperbolic tangent, and the W are transformation matrices determined through model training. The memory cell c_t and hidden state h_t of the LSTM are updated by:
c_t = Gf_t ⊙ c_{t-1} + Gi_t ⊙ g_t
h_t = Go_t ⊙ tanh(c_t)
where ⊙ denotes the element-wise matrix multiplication operation.
b. From the hidden state h_t of the LSTM at time t, the probability distribution over the set of possible words is computed: P_t = softmax(U_p ψ(W_p h_t + b_p) + d), where the parameters U_p, W_p, b_p and d are determined by training the model.
c. The probability distribution P_t output by the softmax layer of the decoder gives the probability of each word: P(w_t | V, w_1, …, w_{t-1}; θ) = P_t(w_t), where θ denotes the parameters of the entire model. The model parameters θ are learned by minimizing the negative log-probability of the words: L_c(S) = −Σ_{t=1}^{T_s} log P(w_t | V, w_1, …, w_{t-1}; θ), where T_s is the length of the video description.
d. After the model parameters are determined, the description of an event is generated with a beam search algorithm: the k best partial descriptions up to time t are kept as candidates for the description at time t, and the process iterates until the description is complete. Finally S_j = argmax_S(−L_c(S)) is selected as the description of event j.
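The decoding loop can be sketched as below, reusing the `GatedFusion` module from the previous sketch. The projection layer collapses U_p, W_p, b_p and d into a single linear map, and greedy decoding stands in for beam search purely to keep the example short.

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    def __init__(self, vocab_size, d_emb=300, d_hid=512, d_vis=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_emb)       # E_t = E[w_{t-1}]
        self.fuse = GatedFusion(d_vis, d_hid, d_emb)       # from the sketch above
        self.cell = nn.LSTMCell(d_emb + d_hid, d_hid)
        self.proj = nn.Linear(d_hid, vocab_size)           # U_p, W_p, b_p, d collapsed

    def step(self, w_prev, C_j, h, c):
        E_t = self.embed(w_prev)
        feat = self.fuse(C_j, h, E_t)                      # dynamic fusion per time step
        h, c = self.cell(torch.cat([E_t, feat], dim=-1), (h, c))
        return torch.log_softmax(self.proj(h), dim=-1), h, c   # log P_t

def greedy_decode(dec, C_j, bos_id=1, eos_id=2, max_len=20):
    h, c = torch.zeros(1, 512), torch.zeros(1, 512)
    w, words = torch.tensor([bos_id]), []                  # w_0 = <BOS>
    for _ in range(max_len):
        logp, h, c = dec.step(w, C_j, h, c)
        w = logp.argmax(dim=-1)                            # beam search keeps top-k instead
        if w.item() == eos_id:
            break
        words.append(w.item())
    return words

dec = CaptionDecoder(vocab_size=10000)
print(greedy_decode(dec, torch.rand(1, 512)))              # token ids of one description
```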
Training of the model:
Corresponding to the two modules of event prediction and event description, the model has two loss functions: the event prediction loss L_p and the event description loss L_c.
For a video feature sequence V = {v_1, v_2, …, v_T} containing N_E events, there is a corresponding ground-truth label sequence Y = {y_1, y_2, …, y_T}. Each y_t is an N_E-dimensional vector whose elements take value 0 or 1: y_t^j is set to 1 when the temporal intersection-over-union (tIoU) between the event prediction boundary corresponding to v_t and the ground-truth boundary of event j is greater than 0.5, and to 0 otherwise.
The invention adopts a weighted multi-label cross entropy as the event prediction loss function:
L_p(t) = −Σ_{i=1}^{N_E} [w^+ · y_t^i · log s_t^i + w^− · (1 − y_t^i) · log(1 − s_t^i)],
where the weights w^+ and w^− are the reciprocals of the numbers of positive and negative labels respectively, and s_t^i is the prediction confidence for event i at time t. L_p is obtained by averaging the event prediction losses of all video sequences.
The description loss L_c(S) of a single event was defined in step 4-(5)-c; L_c is obtained by averaging the description losses of all events of all video sequences.
The total loss function is L = λ_p·L_p + λ_c·L_c, where λ_p and λ_c are coefficients balancing the contributions of the individual losses to the total loss. The entire model is trained end-to-end by minimizing this loss function.
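A sketch of the training objective. Using the reciprocals of the positive and negative label counts as the weights is our reading of the weighted multi-label cross entropy described above, and the λ_p, λ_c defaults are assumptions.

```python
import torch

def proposal_loss(scores, labels):
    """Weighted multi-label cross entropy. scores, labels: (T, N_E);
    labels are 0/1 under the tIoU > 0.5 rule."""
    n_pos = labels.sum().clamp(min=1.0)
    n_neg = (1.0 - labels).sum().clamp(min=1.0)
    eps = 1e-8
    return -((1.0 / n_pos) * labels * torch.log(scores + eps)
             + (1.0 / n_neg) * (1.0 - labels) * torch.log(1.0 - scores + eps)).sum()

def total_loss(scores, labels, word_logps, lam_p=1.0, lam_c=1.0):
    L_p = proposal_loss(scores, labels)
    L_c = -word_logps.sum()           # L_c(S) = -Σ_t log P(w_t | V, w_<t; θ)
    return lam_p * L_p + lam_c * L_c  # L = λ_p L_p + λ_c L_c, minimized end-to-end
```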
FIG. 5 shows the results of the method on the ActivityNet Captions dataset. Compared with traditional attention-based dense video description algorithms, the method generates more specific event descriptions, makes consecutive sentences more coherent, and expresses richer information. It also effectively overcomes the poor parallelism and low efficiency of existing dense video description generation algorithms, showing strong practicality and robustness.
The technical solutions of the present invention are not limited to the above embodiment; all technical solutions obtained by equivalent substitution fall within the scope of the present invention.
Claims (6)
1. A multi-event video description method based on a dynamic attention mechanism, characterized by comprising the following steps:
Step one, extracting the visual features V of a target video sequence X with a convolutional neural network;
Step two, inputting the visual features V of the video into an L-layer self-attention video encoding layer to obtain the video encodings F_i;
Step three, using an event prediction layer to generate event predictions φ_i from the video encodings F_i, and selecting the prediction of the layer with the highest confidence as the final prediction result φ_k;
Step four, based on the prediction result φ_k of the event prediction layer, generating a mask M_j for event j and using the mask to intercept the visual feature sequence of event j: V_j = M_j ⊙ V, where ⊙ denotes element-wise matrix multiplication;
obtaining the visual feature vector C_j of event j by average pooling: C_j = AvgPool(V_j);
fusing the visual feature vector of the event with the context vector H to obtain the adjusted final feature vector Ĉ_j;
supposing the description S_j of event j consists of T_s words, i.e., S_j = {w_1, w_2, …, w_{T_s}}, and treating the generation of one word w by the decoder as one time step, so that generating S_j requires T_s time steps;
mapping the visual feature of the event and the context feature h_{t-1} into the same feature space: C'_j = tanh(W_v C_j), h'_{t-1} = tanh(W_c h_{t-1}), where W_v and W_c are the mapping matrices of the visual and context features, and the context feature h_{t-1} is the hidden state of the LSTM unit at the previous time step; the hidden state h_t is jointly determined by the feature vector E_t of the currently input word, the input visual feature vector Ĉ_j, and the hidden state h_{t-1} of the previous time step: h_t = LSTM(E_t, Ĉ_j, h_{t-1}), where E_t = E[w_{t-1}]; in particular, E_0 = E[<BOS>];
computing the gate g_t of the context feature from h'_{t-1} and E_t, where E_t is the embedding vector of the decoder input word w_{t-1} at time t;
fusing the visual feature and the context feature through a gating mechanism, the gate g_t adjusting the proportion of C'_j and h'_{t-1} to give the final feature representation Ĉ_j of event j.
2. The multi-event video description method based on a dynamic attention mechanism as claimed in claim 1, wherein the video encoding in step two comprises:
the visual feature V is the input of the first encoder layer, whose output is F^1 = E(V); each other encoder layer takes the output of the previous layer as input, and its encoded output is F^{l+1} = E(F^l).
3. The multi-event video description method based on a dynamic attention mechanism as claimed in claim 2, wherein each encoder layer comprises a multi-head attention sub-layer and a position-wise feedforward sub-layer;
the multi-head attention sub-layer computes: Ω(F^l) = LN(MultiHead(F^l, F^l, F^l), F^l);
the position-wise feedforward sub-layer computes: E(F^l) = LN(FF(Ω(F^l)), Ω(F^l));
where LN(p, q) = LayerNorm(p + q) denotes the layer normalization operation applied to the residual output, FF(·) is a two-layer feedforward network whose first layer uses a nonlinear ReLU activation, with weight matrices W_1, W_2 and bias terms b_1, b_2, and Ω(·) is defined with the self-attention mechanism: during step t of encoding, f_t^l serves as the query of the attention layer and the output is a weighted sum of f_i^l (i = 1, 2, …, T).
4. The multi-event video description method based on a dynamic attention mechanism as claimed in claim 1, wherein the event prediction layer in step three generates event predictions from the video encodings F_i as follows:
Step 3.1, input the video encoding F_i into the base layer of the event prediction layer;
Step 3.2, feed the output features of the base layer into the anchor layers of the event prediction layer, which progressively reduce the temporal dimension of the features;
Step 3.3, feed the output of each anchor layer into a prediction layer, which generates a fixed set of event predictions at once.
5. The multi-event video description method based on a dynamic attention mechanism as claimed in claim 4, wherein the prediction of the jth event at the ith layer in step three is computed as:
c_j = c_a + θ_c · w_a,  w_j = w_a · exp(θ_w),
where c_a and w_a denote the center position and width of the anchor before refinement, θ_c is the temporal offset of the center position, θ_w is the temporal offset of the width, exp(·) is the exponential function, and c_j and w_j denote the center position and width of the anchor after refinement;
and the prediction confidence s_j of the event is computed as s_j = s^c_j + λ·s^d_j, where s^c_j and s^d_j denote the classification confidence and the language description confidence of the event respectively and λ is a hyper-parameter.
6. The multi-event video description method based on a dynamic attention mechanism as claimed in claim 5, wherein the final prediction result φ_k in step three is selected by taking the predictions of the layer whose event prediction confidences sum to the largest value: k = argmax_i Σ_j s^i_j.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911136308.8A CN110929092B (en) | 2019-11-19 | 2019-11-19 | Multi-event video description method based on dynamic attention mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911136308.8A CN110929092B (en) | 2019-11-19 | 2019-11-19 | Multi-event video description method based on dynamic attention mechanism |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110929092A true CN110929092A (en) | 2020-03-27 |
CN110929092B CN110929092B (en) | 2023-07-04 |
Family
ID=69850335
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911136308.8A Active CN110929092B (en) | 2019-11-19 | 2019-11-19 | Multi-event video description method based on dynamic attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110929092B (en) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111488815A (en) * | 2020-04-07 | 2020-08-04 | 中山大学 | Basketball game goal event prediction method based on graph convolution network and long-time and short-time memory network |
CN111652357A (en) * | 2020-08-10 | 2020-09-11 | 浙江大学 | Method and system for solving video question-answer problem by using specific target network based on graph |
CN112308402A (en) * | 2020-10-29 | 2021-02-02 | 复旦大学 | Power time series data abnormity detection method based on long and short term memory network |
CN113312980A (en) * | 2021-05-06 | 2021-08-27 | 华南理工大学 | Video intensive description method, device and medium |
CN113392717A (en) * | 2021-05-21 | 2021-09-14 | 杭州电子科技大学 | Video dense description generation method based on time sequence characteristic pyramid |
CN113469260A (en) * | 2021-07-12 | 2021-10-01 | 天津理工大学 | Visual description method based on convolutional neural network, attention mechanism and self-attention converter |
CN113505659A (en) * | 2021-02-02 | 2021-10-15 | 黑芝麻智能科技有限公司 | Method for describing time event |
CN113593603A (en) * | 2021-07-27 | 2021-11-02 | 浙江大华技术股份有限公司 | Audio category determination method and device, storage medium and electronic device |
WO2022116420A1 (en) * | 2020-12-01 | 2022-06-09 | 平安科技(深圳)有限公司 | Speech event detection method and apparatus, electronic device, and computer storage medium |
CN114627413A (en) * | 2022-03-11 | 2022-06-14 | 电子科技大学 | Video intensive event content understanding method |
CN114661953A (en) * | 2022-03-18 | 2022-06-24 | 北京百度网讯科技有限公司 | Video description generation method, device, equipment and storage medium |
CN114998673A (en) * | 2022-05-11 | 2022-09-02 | 河海大学 | Dam defect time sequence image description method based on local self-attention mechanism |
CN115190332A (en) * | 2022-07-08 | 2022-10-14 | 西安交通大学医学院第二附属医院 | Dense video subtitle generation method based on global video characteristics |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120008868A1 (en) * | 2010-07-08 | 2012-01-12 | Compusensor Technology Corp. | Video Image Event Attention and Analysis System and Method |
CN108960063A (en) * | 2018-06-01 | 2018-12-07 | 清华大学深圳研究生院 | It is a kind of towards event relation coding video in multiple affair natural language description algorithm |
CN109344288A (en) * | 2018-09-19 | 2019-02-15 | 电子科技大学 | A kind of combination video presentation method based on multi-modal feature combination multilayer attention mechanism |
CN110072142A (en) * | 2018-01-24 | 2019-07-30 | 腾讯科技(深圳)有限公司 | Video presentation generation method, device, video broadcasting method, device and storage medium |
- 2019-11-19: Application CN201911136308.8A filed in China; patent CN110929092B granted and active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120008868A1 (en) * | 2010-07-08 | 2012-01-12 | Compusensor Technology Corp. | Video Image Event Attention and Analysis System and Method |
CN110072142A (en) * | 2018-01-24 | 2019-07-30 | 腾讯科技(深圳)有限公司 | Video presentation generation method, device, video broadcasting method, device and storage medium |
CN108960063A (en) * | 2018-06-01 | 2018-12-07 | 清华大学深圳研究生院 | It is a kind of towards event relation coding video in multiple affair natural language description algorithm |
CN109344288A (en) * | 2018-09-19 | 2019-02-15 | 电子科技大学 | A kind of combination video presentation method based on multi-modal feature combination multilayer attention mechanism |
Non-Patent Citations (3)
Title |
---|
Lianli Gao et al.: "Video Captioning with Attention-based LSTM and Semantic Consistency", IEEE Transactions on Multimedia *
Ning Xu et al.: "Attention-In-Attention Networks for Surveillance Video Understanding in IoT", IEEE Internet of Things Journal *
Ji Zhong et al.: "Video Summarization Based on Decoder Attention Mechanism", Journal of Tianjin University *
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111488815B (en) * | 2020-04-07 | 2023-05-09 | 中山大学 | Event prediction method based on graph convolution network and long-short-time memory network |
CN111488815A (en) * | 2020-04-07 | 2020-08-04 | 中山大学 | Basketball game goal event prediction method based on graph convolution network and long-time and short-time memory network |
CN111652357A (en) * | 2020-08-10 | 2020-09-11 | 浙江大学 | Method and system for solving video question-answer problem by using specific target network based on graph |
CN112308402B (en) * | 2020-10-29 | 2022-04-12 | 复旦大学 | Power time series data abnormity detection method based on long and short term memory network |
CN112308402A (en) * | 2020-10-29 | 2021-02-02 | 复旦大学 | Power time series data abnormity detection method based on long and short term memory network |
WO2022116420A1 (en) * | 2020-12-01 | 2022-06-09 | 平安科技(深圳)有限公司 | Speech event detection method and apparatus, electronic device, and computer storage medium |
CN113505659A (en) * | 2021-02-02 | 2021-10-15 | 黑芝麻智能科技有限公司 | Method for describing time event |
US11887384B2 (en) | 2021-02-02 | 2024-01-30 | Black Sesame Technologies Inc. | In-cabin occupant behavoir description |
CN113312980B (en) * | 2021-05-06 | 2022-10-14 | 华南理工大学 | Video intensive description method, device and medium |
CN113312980A (en) * | 2021-05-06 | 2021-08-27 | 华南理工大学 | Video intensive description method, device and medium |
CN113392717A (en) * | 2021-05-21 | 2021-09-14 | 杭州电子科技大学 | Video dense description generation method based on time sequence characteristic pyramid |
CN113392717B (en) * | 2021-05-21 | 2024-02-13 | 杭州电子科技大学 | Video dense description generation method based on time sequence feature pyramid |
CN113469260A (en) * | 2021-07-12 | 2021-10-01 | 天津理工大学 | Visual description method based on convolutional neural network, attention mechanism and self-attention converter |
CN113593603A (en) * | 2021-07-27 | 2021-11-02 | 浙江大华技术股份有限公司 | Audio category determination method and device, storage medium and electronic device |
CN114627413A (en) * | 2022-03-11 | 2022-06-14 | 电子科技大学 | Video intensive event content understanding method |
CN114627413B (en) * | 2022-03-11 | 2022-09-13 | 电子科技大学 | Video intensive event content understanding method |
CN114661953A (en) * | 2022-03-18 | 2022-06-24 | 北京百度网讯科技有限公司 | Video description generation method, device, equipment and storage medium |
CN114998673B (en) * | 2022-05-11 | 2023-10-13 | 河海大学 | Dam defect time sequence image description method based on local self-attention mechanism |
WO2023217163A1 (en) * | 2022-05-11 | 2023-11-16 | 华能澜沧江水电股份有限公司 | Dam defect time-sequence image description method based on local self-attention mechanism |
CN114998673A (en) * | 2022-05-11 | 2022-09-02 | 河海大学 | Dam defect time sequence image description method based on local self-attention mechanism |
CN115190332A (en) * | 2022-07-08 | 2022-10-14 | 西安交通大学医学院第二附属医院 | Dense video subtitle generation method based on global video characteristics |
Also Published As
Publication number | Publication date |
---|---|
CN110929092B (en) | 2023-07-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110929092A (en) | Multi-event video description method based on dynamic attention mechanism | |
CN109389091B (en) | Character recognition system and method based on combination of neural network and attention mechanism | |
Kim et al. | Efficient dialogue state tracking by selectively overwriting memory | |
CN110263912B (en) | Image question-answering method based on multi-target association depth reasoning | |
CN107484017B (en) | Supervised video abstract generation method based on attention model | |
EP4073787B1 (en) | System and method for streaming end-to-end speech recognition with asynchronous decoders | |
CN111488807B (en) | Video description generation system based on graph rolling network | |
CN110309732B (en) | Behavior identification method based on skeleton video | |
CN112329760B (en) | Method for recognizing and translating Mongolian in printed form from end to end based on space transformation network | |
CN109543820B (en) | Image description generation method based on architecture phrase constraint vector and double vision attention mechanism | |
CN109919174A (en) | A kind of character recognition method based on gate cascade attention mechanism | |
CN111079532A (en) | Video content description method based on text self-encoder | |
CN112329794B (en) | Image description method based on dual self-attention mechanism | |
CN110991290A (en) | Video description method based on semantic guidance and memory mechanism | |
CN109829495A (en) | Timing image prediction method based on LSTM and DCGAN | |
CN111814844A (en) | Intensive video description method based on position coding fusion | |
CN115690152A (en) | Target tracking method based on attention mechanism | |
CN116524593A (en) | Dynamic gesture recognition method, system, equipment and medium | |
Chen et al. | Training full spike neural networks via auxiliary accumulation pathway | |
CN113868451B (en) | Cross-modal conversation method and device for social network based on up-down Wen Jilian perception | |
CN114973416A (en) | Sign language recognition algorithm based on three-dimensional convolution network | |
CN115422369B (en) | Knowledge graph completion method and device based on improved TextRank | |
CN109918484B (en) | Dialog generation method and device | |
CN116189284A (en) | Human motion prediction method, device, equipment and storage medium | |
CN115311598A (en) | Video description generation system based on relation perception |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |