CN108647255A - Video temporal sentence localization method and device based on attention regression - Google Patents
- Publication number
- CN108647255A (application CN201810367989.8A)
- Authority
- CN
- China
- Prior art keywords
- sentence
- attention
- video
- content
- sequential
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Image Analysis (AREA)
- Processing Or Creating Images (AREA)
Abstract
The invention discloses a video temporal sentence localization method and device based on attention regression, wherein the method comprises the following steps: encoding video clips and the sentence by means of a 3D convolutional neural network and GloVe word vectors, with a bidirectional long short-term memory (LSTM) network applied on top of both, so as to characterize the video clip content and the sentence content; establishing a symmetric association between the video and the sentence through a multi-modal attention mechanism, according to the video clip content and the sentence content, so as to obtain attention weight vectors and attention-weighted features for the video and the sentence; and, according to the attention weight vectors or attention-weighted features of the video and the sentence, outputting the localization result of the temporal sentence through either a regression mechanism based on attention weights or a regression mechanism based on attention-weighted features. The method preserves the contextual information in both the video and the sentence and makes the sentence localization process more efficient, thereby improving localization speed, accuracy, and robustness.
Description
Technical field
The present invention relates to the technical field of computer vision, and more particularly to a video temporal sentence localization method and device based on attention regression.
Background technology
In the prior art, video temporal sentence localization methods mainly work in one of two ways. The first builds a unified representation space between video and sentence: the video is scanned to generate several candidate localization segments, and the sentence and each candidate segment are projected into the unified representation space, where they are compared to produce the localization. The second also scans the video to generate candidate segments, fuses the visual features of each candidate segment with the text features of the sentence to produce a multi-modal feature, performs temporal regression on top of the multi-modal feature to predict the offset between the candidate segment and the predicted segment, and finally shifts the candidate segment to the predicted position.

The prior-art methods have the following shortcomings. Scanning the video to generate candidate segments incurs a high computational cost and cannot cope with long videos, so the above methods scale poorly. Treating each candidate segment in isolation from the global video also blocks the interaction between specific video content and the video's contextual information, even though that contextual information is crucial for localizing the sentence; the accuracy of the above methods is therefore limited. Finally, the above methods all extract sentence features directly with a generic LSTM network, ignoring the cues in the sentence that are key to temporal localization, so their mining of sentence information is insufficient.
Summary of the invention
The present invention aims to solve at least one of the technical problems in the related art.

To this end, one object of the present invention is to propose a video temporal sentence localization method based on attention regression that improves sentence localization speed, accuracy, and robustness.

Another object of the present invention is to propose a video temporal sentence localization device based on attention regression.
To achieve the above objects, an embodiment of one aspect of the present invention proposes a video temporal sentence localization method based on attention regression, comprising the following steps: encoding video clips and the sentence by means of a 3D convolutional neural network and GloVe word vectors, with a bidirectional LSTM network applied on top of both, so as to characterize the video clip content and the sentence content; establishing a symmetric association between the video and the sentence through a multi-modal attention mechanism, according to the video clip content and the sentence content, so as to obtain attention weight vectors and attention-weighted features for the video and the sentence; and, according to the attention weight vectors or attention-weighted features of the video and the sentence, outputting the localization result of the temporal sentence through either a regression mechanism based on attention weights or a regression mechanism based on attention-weighted features.
The video temporal sentence localization method based on attention regression of this embodiment of the present invention preserves the contextual information of the video clip content and the sentence content while characterizing them, establishes the association between video and sentence through the multi-modal attention mechanism, and then regresses the localization result of the temporal sentence from the obtained attention weight vectors and attention-weighted features, thereby improving sentence localization speed, accuracy, and robustness.
In addition, the video temporal sentence localization method based on attention regression according to the above embodiment of the present invention may further have the following additional technical features:
Further, in one embodiment of the present invention, the step of encoding video clips and the sentence according to the 3D convolutional neural network and GloVe word vectors with a bidirectional LSTM network, so as to characterize the video clip content and the sentence content, further comprises: characterizing the video clip content while fusing the global video context information, and characterizing each word of the sentence with GloVe word vectors and a bidirectional LSTM network according to the context of the sentence.
Further, in one embodiment of the present invention, the multi-modal attention mechanism comprises: generating the video attention weight vector and the attention-weighted video feature under the guidance of the sentence features, so as to obtain the key video content closely associated with the sentence semantics; and generating the sentence attention weight vector and the attention-weighted sentence feature under the guidance of the video clip content, so as to obtain the key cues in the sentence about temporal localization.
Further, in one embodiment of the present invention, the step of outputting the localization result of the temporal sentence according to the attention weight vectors or attention-weighted features of the video and the sentence, through the regression mechanism based on attention weights or the regression mechanism based on attention-weighted features, further comprises: the regression based on attention weights takes the video attention weight vector as input and uses multi-layer fully connected operations to regress the relative position, within the global video, of the video content described by the sentence; the regression based on attention-weighted features first fuses the attention-weighted video feature and the attention-weighted sentence feature into a multi-modal attention-weighted feature, then takes the multi-modal attention-weighted feature as input and uses multi-layer fully connected operations to regress the relative position, within the global video, of the video content described by the sentence.
Further, in one embodiment of the present invention, the video temporal sentence localization method based on attention regression further comprises: iteratively training the model parameters through the back-propagation algorithm according to an attention regression loss function and an attention calibration loss function, so as to obtain the model of the video temporal sentence localization method based on attention regression.
To achieve the above objects, an embodiment of another aspect of the present invention proposes a video temporal sentence localization device based on attention regression, comprising: a characterization module, configured to encode video clips and the sentence content according to the 3D convolutional neural network and GloVe word vectors, further through a bidirectional LSTM network, so as to characterize the video clip content and the sentence content; an acquisition module, configured to establish a symmetric association between the video and the sentence through the multi-modal attention mechanism according to the video clip content and the sentence content, so as to obtain the attention weight vectors and attention-weighted features of the video and the sentence; and a localization module, configured to output the localization result of the temporal sentence through the regression mechanism based on attention weights or the regression mechanism based on attention-weighted features, according to the attention weight vectors or attention-weighted features of the video and the sentence.
The video temporal sentence localization device based on attention regression of this embodiment of the present invention preserves the contextual information of the video clip content and the sentence content while characterizing them, establishes the association between video and sentence through the multi-modal attention mechanism, and then regresses the localization result of the temporal sentence from the obtained attention weight vectors and attention-weighted features, thereby improving sentence localization speed, accuracy, and robustness.
In addition, the video temporal sentence localization device based on attention regression according to the above embodiment of the present invention may further have the following additional technical features:
Further, in one embodiment of the present invention, the characterization module is further configured to characterize the video clip content while fusing the global video context information, and to characterize each word of the sentence with GloVe word vectors and a bidirectional LSTM network according to the context of the sentence.
Further, in one embodiment of the present invention, the acquisition module is further configured to generate the video attention weight vector and the attention-weighted video feature under the guidance of the sentence features, so as to obtain the key video content closely associated with the sentence semantics, and to generate the sentence attention weight vector and the attention-weighted sentence feature according to the video clip content, so as to obtain the key cues in the sentence about temporal localization.
Further, in one embodiment of the present invention, the localization module is further configured so that the regression based on attention weights takes the video attention weight vector as input and uses multi-layer fully connected operations to regress the relative position, within the global video, of the video content described by the sentence, and the regression based on attention-weighted features first fuses the attention-weighted video feature and the attention-weighted sentence feature into a multi-modal attention-weighted feature, then takes the multi-modal attention-weighted feature as input and uses multi-layer fully connected operations to regress the relative position, within the global video, of the video content described by the sentence.
Further, in one embodiment of the present invention, the video temporal sentence localization device based on attention regression further comprises a training module, configured to iteratively train the model parameters through the back-propagation algorithm according to the attention regression loss function and the attention calibration loss function, so as to obtain the model of the video temporal sentence localization method based on attention regression.
Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from the following description or be learned by practice of the present invention.
Description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a flowchart of a video temporal sentence localization method based on attention regression according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of the model structure of a video temporal sentence localization device based on attention regression according to one embodiment of the present invention; and

Fig. 3 is a schematic structural diagram of a video temporal sentence localization device based on attention regression according to an embodiment of the present invention.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, where identical or similar reference numerals denote identical or similar elements, or elements with identical or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary and intended to explain the present invention, and shall not be construed as limiting the present invention.
The video temporal sentence localization method and device based on attention regression proposed according to embodiments of the present invention are described below with reference to the accompanying drawings; the method is described first.
Fig. 1 is a flowchart of the video temporal sentence localization method based on attention regression according to an embodiment of the present invention. As shown in Fig. 1, the method comprises the following steps.

In step S101, video clips and the sentence are encoded according to the 3D convolutional neural network and GloVe word vectors, with a bidirectional LSTM network applied on top of both, so as to characterize the video clip content and the sentence content.
It can be understood that, while the video clip content is characterized, the global video context information is fused, and each word of the sentence is characterized with GloVe word vectors and a bidirectional LSTM network according to the context of the sentence; in this way, the obtained video clip content and sentence content are more comprehensive and robust.
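As an illustration of this encoding step, the sketch below runs a bidirectional LSTM over a sequence of clip features (standing in for 3D-CNN outputs) and concatenates the forward and backward hidden states, so that each clip's representation carries context from the whole video; the same routine can encode a sequence of GloVe word vectors. This is a minimal NumPy sketch under assumed dimensions and random parameters, not the patent's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step; the 4 gates are stacked as [input, forget, output, candidate]."""
    d = h.size
    z = W @ x + U @ h + b
    i, f, o = sigmoid(z[:d]), sigmoid(z[d:2*d]), sigmoid(z[2*d:3*d])
    g = np.tanh(z[3*d:])
    c = f * c + i * g          # update the cell state
    h = o * np.tanh(c)         # emit the hidden state
    return h, c

def bilstm_encode(X, fw, bw, d):
    """Encode a sequence X of shape (T, d_in) into (T, 2d): at each position the
    forward and backward hidden states are concatenated, so every element sees
    both past and future context."""
    T = X.shape[0]
    h, c, fwd = np.zeros(d), np.zeros(d), []
    for t in range(T):                      # left-to-right pass
        h, c = lstm_step(X[t], h, c, *fw)
        fwd.append(h)
    h, c, bwd = np.zeros(d), np.zeros(d), [None] * T
    for t in reversed(range(T)):            # right-to-left pass
        h, c = lstm_step(X[t], h, c, *bw)
        bwd[t] = h
    return np.concatenate([np.stack(fwd), np.stack(bwd)], axis=1)

# Illustrative use: M = 6 clips with 10-dim features (stand-ins for 3D-CNN outputs).
rng = np.random.default_rng(0)
M, d_in, d = 6, 10, 4
clips = rng.standard_normal((M, d_in))
make = lambda: (0.1 * rng.standard_normal((4*d, d_in)),
                0.1 * rng.standard_normal((4*d, d)),
                np.zeros(4*d))
encoded = bilstm_encode(clips, make(), make(), d)
print(encoded.shape)  # (6, 8): context-aware clip representations
```

Concatenating the two passes is what lets each clip's code reflect the global video context, as required by the step above.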
In step S102, a symmetric association between the video and the sentence is established through the multi-modal attention mechanism according to the video clip content and the sentence content, so as to obtain the attention weight vectors and attention-weighted features of the video and the sentence.
It can be understood that the multi-modal attention mechanism comprises: generating the video attention weight vector and the attention-weighted video feature under the guidance of the sentence features, so as to obtain the key video content closely associated with the sentence semantics; and generating the sentence attention weight vector and the attention-weighted sentence feature according to the video clip content, so as to obtain the key cues in the sentence about temporal localization.
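The symmetric guidance described above can be sketched as a single soft-attention routine applied in both directions: clips attended under sentence guidance, and words attended under video guidance. The additive scoring form, the mean-pooled guiding features, and all dimensions below are illustrative assumptions, not the patent's exact parameterization.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def guided_attention(H, g, W_h, W_g, w):
    """Soft attention over the rows of H (one modality) under the guidance
    of feature g (the other modality).

    H : (M, d) features of the attended modality (e.g. M video clips)
    g : (d,)   guiding feature (e.g. a pooled sentence encoding)
    Returns the attention weight vector (M,) and the attention-weighted feature (d,)."""
    scores = np.tanh(H @ W_h.T + g @ W_g.T) @ w   # additive attention scores
    a = softmax(scores)                            # weights sum to 1
    return a, a @ H                                # attention-weighted feature

rng = np.random.default_rng(1)
M, N, d = 6, 5, 8                 # 6 clips, 5 words, feature size 8
clips = rng.standard_normal((M, d))
words = rng.standard_normal((N, d))
params = lambda: (rng.standard_normal((d, d)),
                  rng.standard_normal((d, d)),
                  rng.standard_normal(d))
# Symmetric association: video attended under sentence guidance, and vice versa.
a_video, v_feat = guided_attention(clips, words.mean(axis=0), *params())
a_sent,  s_feat = guided_attention(words, clips.mean(axis=0), *params())
```

Here `a_video` highlights the clips tied to the sentence semantics, while `a_sent` highlights the words carrying temporal cues, mirroring the two guidance directions in the text.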
In step S103, the localization result of the temporal sentence is output through the regression mechanism based on attention weights or the regression mechanism based on attention-weighted features, according to the attention weight vectors or attention-weighted features of the video and the sentence.
It can be understood that the position regression network includes two regression strategies: regression based on attention weights and regression based on attention-weighted features. The regression based on attention weights takes the video attention weight vector as input and uses multi-layer fully connected operations to regress the relative position, within the global video, of the video content described by the sentence. The regression based on attention-weighted features first fuses the attention-weighted video feature and the attention-weighted sentence feature into a multi-modal attention-weighted feature, then takes the multi-modal attention-weighted feature as input and uses multi-layer fully connected operations to regress the relative position, within the global video, of the video content described by the sentence.
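The two regression strategies can be sketched as small fully connected heads; a sigmoid keeps the output pair (start, end) inside [0, 1], matching coordinates normalized by video duration. The layer sizes, the ReLU/sigmoid choices, and concatenation as the fusion step are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fc_head(x, W1, b1, W2, b2):
    """Multi-layer fully connected regression: one ReLU hidden layer, then a
    sigmoid output pair (t_start, t_end) constrained to [0, 1]."""
    h = np.maximum(0.0, W1 @ x + b1)
    return sigmoid(W2 @ h + b2)

rng = np.random.default_rng(2)
M, d, hidden = 6, 8, 16

# Strategy 1: regress directly from the video attention weight vector (M,).
a_video = rng.random(M); a_video /= a_video.sum()
W1, b1 = rng.standard_normal((hidden, M)), np.zeros(hidden)
W2, b2 = rng.standard_normal((2, hidden)), np.zeros(2)
t_from_weights = fc_head(a_video, W1, b1, W2, b2)

# Strategy 2: fuse the attention-weighted video and sentence features into one
# multi-modal feature (here by concatenation, one plausible fusion) and regress.
v_feat, s_feat = rng.standard_normal(d), rng.standard_normal(d)
fused = np.concatenate([v_feat, s_feat])          # multi-modal feature (2d,)
W1f, b1f = rng.standard_normal((hidden, 2*d)), np.zeros(hidden)
t_from_features = fc_head(fused, W1f, b1f, W2, b2)
```

Either head yields a normalized (start, end) pair that can be rescaled by the video duration to recover the localized time window.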
As shown in Fig. 2, in one embodiment of the present invention, the model of the video temporal sentence localization method based on attention regression is divided into three modules: feature encoding with context fusion, the multi-modal attention mechanism, and the attention-based position regression network. The training procedure is as follows.

The training set is expressed as $D = \{(V_i, S_i, \tau_i^s, \tau_i^e)\}_{i=1}^{K}$, where $V_i$ denotes the $i$-th video in the training set, whose duration is $\tau_i$; $S_i$ is a sentence describing the content of video $V_i$; $\tau_i^s$ and $\tau_i^e$ are the start and end time coordinates in the video of the content described by $S_i$; and $K$ is the number of samples in the training set.
Each video is evenly divided into $M$ video clips, and each sentence is likewise characterized as a sequence of words. The start and end time coordinates of the sentence are normalized by the video duration to give the ground-truth sentence coordinates $(t_i^s, t_i^e) = (\tau_i^s / \tau_i, \; \tau_i^e / \tau_i)$, which serve as the prediction target of the position regression network.
This scheme devises two loss functions to guide the learning of the whole model: the attention regression loss function and the attention calibration loss function. Feeding the video and the sentence into the video temporal sentence localization model based on attention regression outputs the predicted sentence coordinates $(\hat{t}_i^s, \hat{t}_i^e)$. The attention regression loss function is defined as the smooth L1 distance $R(\cdot)$ between the predicted and ground-truth sentence coordinates:

$$L_{reg} = \sum_{i=1}^{K} \left[ R(\hat{t}_i^s - t_i^s) + R(\hat{t}_i^e - t_i^e) \right]$$

The attention calibration loss function constrains the video clips lying within the ground-truth time window $[\tau_i^s, \tau_i^e]$ of the sentence to have attention weights that are as large as possible: if the $j$-th clip of video $V_i$ falls within the time window $[\tau_i^s, \tau_i^e]$, then $m_{i,j} = 1$; otherwise $m_{i,j} = 0$.
The attention regression loss function and the attention calibration loss function jointly guide the learning of the model, whose parameters are trained iteratively through the classical back-propagation algorithm.
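Under the definitions above, the training targets can be sketched as follows: a smooth L1 penalty on the normalized coordinate error, plus a calibration mask marking which of the $M$ clips fall inside the ground-truth window (the exact penalty applied to the masked attention weights is left abstract here, since the text only fixes the mask). This is an illustrative sketch, not the patent's code.

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 distance R(.): quadratic near zero, linear beyond 1."""
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x**2, x - 0.5)

def regression_loss(t_pred, t_true):
    """Attention regression loss for one sample: R on start and end errors."""
    return smooth_l1(np.asarray(t_pred) - np.asarray(t_true)).sum()

def calibration_mask(M, tau_s, tau_e, tau):
    """m_j = 1 iff the j-th of M equal clips of a video of duration tau
    lies inside the ground-truth window [tau_s, tau_e]."""
    edges = np.linspace(0.0, tau, M + 1)
    return ((edges[:-1] >= tau_s) & (edges[1:] <= tau_e)).astype(float)

# Example: duration 10 s, ground-truth window [2 s, 5 s] -> target (0.2, 0.5).
tau, tau_s, tau_e, M = 10.0, 2.0, 5.0, 10
t_true = (tau_s / tau, tau_e / tau)
loss = regression_loss((0.3, 0.7), t_true)      # penalty on a sample prediction
m = calibration_mask(M, tau_s, tau_e, tau)      # clips 2..4 lie inside the window
```

The mask `m` is what the calibration loss uses to push the attention weights of in-window clips upward during back-propagation.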
It can be understood that the model of the video temporal sentence localization method based on attention regression is jointly optimized and trained in an end-to-end framework, which reduces redundant computation cost and improves sentence localization accuracy. Based on this scheme, the video temporal sentence localization problem can be solved so as to better serve various online video applications; it is suitable for scenarios such as fast sentence-based video content localization, video retrieval, and video summarization.
The video temporal sentence localization method based on attention regression of this embodiment of the present invention preserves the contextual information of the video clip content and the sentence content while characterizing them, establishes the association between video and sentence through the multi-modal attention mechanism, and then regresses the localization result of the temporal sentence from the obtained attention weight vectors and attention-weighted features, thereby improving sentence localization speed, accuracy, and robustness.
The video temporal sentence localization device based on attention regression proposed according to embodiments of the present invention is described next with reference to the accompanying drawings.
Fig. 3 is a schematic structural diagram of the video temporal sentence localization device based on attention regression according to an embodiment of the present invention. As shown in Fig. 3, the video temporal sentence localization device 10 based on attention regression comprises: a characterization module 100, configured to encode video clips and the sentence content according to the 3D convolutional neural network and GloVe word vectors, further through a bidirectional LSTM network, so as to characterize the video clip content and the sentence content; an acquisition module 200, configured to establish a symmetric association between the video and the sentence through the multi-modal attention mechanism according to the video clip content and the sentence content, so as to obtain the attention weight vectors and attention-weighted features of the video and the sentence; and a localization module 300, configured to output the localization result of the temporal sentence through the regression mechanism based on attention weights or the regression mechanism based on attention-weighted features, according to the attention weight vectors or attention-weighted features of the video and the sentence.
Further, in one embodiment of the present invention, the characterization module 100 is further configured to characterize the video clip content while fusing the global video context information, and to characterize each word of the sentence with GloVe word vectors and a bidirectional LSTM network according to the context of the sentence.
Further, in one embodiment of the present invention, the acquisition module 200 is further configured to generate the video attention weight vector and the attention-weighted video feature under the guidance of the sentence features, so as to obtain the key video content closely associated with the sentence semantics, and to generate the sentence attention weight vector and the attention-weighted sentence feature according to the video clip content, so as to obtain the key cues in the sentence about temporal localization.
Further, in one embodiment of the present invention, the localization module 300 is further configured so that the regression based on attention weights takes the video attention weight vector as input and uses multi-layer fully connected operations to regress the relative position, within the global video, of the video content described by the sentence, and the regression based on attention-weighted features first fuses the attention-weighted video feature and the attention-weighted sentence feature into a multi-modal attention-weighted feature, then takes the multi-modal attention-weighted feature as input and uses multi-layer fully connected operations to regress the relative position, within the global video, of the video content described by the sentence.
Further, in one embodiment of the present invention, the video temporal sentence localization device 10 based on attention regression further comprises a training module, configured to iteratively train the model parameters through the back-propagation algorithm according to the attention regression loss function and the attention calibration loss function, so as to obtain the model of the video temporal sentence localization method based on attention regression.
The video temporal sentence localization device based on attention regression of this embodiment of the present invention preserves the contextual information of the video clip content and the sentence content while characterizing them, establishes the association between video and sentence through the multi-modal attention mechanism, and then regresses the localization result of the temporal sentence from the obtained attention weight vectors and attention-weighted features, thereby improving sentence localization speed, accuracy, and robustness.
It should be noted that the foregoing explanation of the embodiment of the video temporal sentence localization method based on attention regression also applies to the device of this embodiment, and details are not repeated here.
In the description of the present invention, it should be understood that terms such as "center", "longitudinal", "transverse", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", "axial", "radial", and "circumferential" indicate orientations or positional relationships based on those shown in the drawings; they are used merely for convenience and simplicity of description and do not indicate or imply that the device or element referred to must have a particular orientation or be constructed and operated in a particular orientation, and therefore shall not be construed as limiting the present invention.
In addition, the terms "first" and "second" are used for descriptive purposes only and shall not be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. Thus, a feature defined by "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "plurality" means at least two, for example two or three, unless otherwise specifically defined.
In the present invention, unless otherwise expressly specified and limited, terms such as "mounted", "connected", "coupled", and "fixed" shall be understood broadly: for example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; it may be direct, or indirect through an intermediary, or it may be an internal communication between two elements or an interaction between two elements, unless otherwise expressly limited. For those of ordinary skill in the art, the specific meanings of the above terms in the present invention can be understood according to the specific circumstances.
In the present invention, unless otherwise expressly specified and limited, a first feature being "on" or "under" a second feature may mean that the first and second features are in direct contact, or that they are in indirect contact through an intermediary. Moreover, the first feature being "on", "above", or "over" the second feature may mean that the first feature is directly above or obliquely above the second feature, or merely that the first feature is at a higher level than the second feature; the first feature being "under", "below", or "beneath" the second feature may mean that the first feature is directly below or obliquely below the second feature, or merely that the first feature is at a lower level than the second feature.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "an example", "a specific example", "some examples", and the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic expressions of the above terms do not necessarily refer to the same embodiment or example. Moreover, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, provided they do not contradict each other, those skilled in the art may join and combine the features of different embodiments or examples described in this specification.
Although embodiments of the present invention have been shown and described above, it can be understood that the above embodiments are exemplary and shall not be construed as limiting the present invention; those of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present invention.
Claims (10)
1. A video temporal sentence localization method based on attention regression, characterized in that it comprises the following steps:
encoding video clips and the sentence according to a 3D convolutional neural network and GloVe word vectors, with a bidirectional LSTM network applied on top of both, so as to characterize the video clip content and the sentence content;
establishing a symmetric association between the video and the sentence through a multi-modal attention mechanism according to the video clip content and the sentence content, so as to obtain the attention weight vectors and attention-weighted features of the video and the sentence; and
outputting the localization result of the temporal sentence through a regression mechanism based on attention weights or a regression mechanism based on attention-weighted features, according to the attention weight vectors or attention-weighted features of the video and the sentence.
2. The video temporal sentence localization method based on attention regression according to claim 1, wherein encoding the video clips and the sentence according to the three-dimensional convolutional neural network and the GloVe word-vector mechanism, further using the bidirectional long short-term memory network on this basis, so as to characterize the video clip content and the sentence content, further comprises:
characterizing the video clip content fused with the contextual information of the global video, and characterizing each word of the sentence according to the contextual information of the sentence by using GloVe word vectors and the bidirectional long short-term memory network.
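The sentence-encoding step described above can be illustrated with a minimal NumPy sketch: word vectors (random stand-ins here, since the pretrained GloVe embeddings are not part of this document) are passed through a small bidirectional LSTM so that each word's representation carries sentence context. All dimensions, weight shapes, and initializations below are illustrative assumptions, not the patent's actual configuration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, b):
    """One LSTM cell step; W has shape (4*H, D+H) for gates i, f, o, g."""
    z = W @ np.concatenate([x, h]) + b
    i, f, o, g = np.split(z, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

def bilstm_encode(X, Wf, bf, Wb, bb, H):
    """Encode a sequence X of shape (T, D) with a forward and a backward
    LSTM pass; each step is the concatenated hidden states, shape (T, 2H)."""
    T = X.shape[0]
    out_f, out_b = [], []
    h, c = np.zeros(H), np.zeros(H)
    for t in range(T):                       # forward pass over the words
        h, c = lstm_step(X[t], h, c, Wf, bf)
        out_f.append(h)
    h, c = np.zeros(H), np.zeros(H)
    for t in reversed(range(T)):             # backward pass over the words
        h, c = lstm_step(X[t], h, c, Wb, bb)
        out_b.append(h)
    out_b.reverse()
    return np.stack([np.concatenate([f_, b_]) for f_, b_ in zip(out_f, out_b)])

rng = np.random.default_rng(0)
T, D, H = 6, 8, 4                            # 6 words, 8-dim "GloVe" stand-ins
X = rng.normal(size=(T, D))                  # stand-in for GloVe word vectors
Wf = rng.normal(scale=0.1, size=(4 * H, D + H)); bf = np.zeros(4 * H)
Wb = rng.normal(scale=0.1, size=(4 * H, D + H)); bb = np.zeros(4 * H)
word_features = bilstm_encode(X, Wf, bf, Wb, bb, H)
print(word_features.shape)  # (6, 8): each word fused with sentence context
```

The same bidirectional encoding would apply analogously to the sequence of video-clip features (e.g., C3D features, assumed precomputed) to fuse global video context.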
3. The video temporal sentence localization method based on attention regression according to claim 1, wherein the multi-modal attention mechanism comprises:
generating the video attention weight vector and the attention-weighted video feature under the guidance of the sentence feature, so as to obtain the key video content closely associated with the sentence semantics;
generating the sentence attention weight vector and the attention-weighted sentence feature under the guidance of the video clip content, so as to obtain the key cues in the sentence relevant to temporal localization.
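The symmetric association in the multi-modal attention mechanism can be sketched as follows: each modality's global feature guides a softmax attention over the other modality's elements, yielding an attention weight vector and an attention-weighted feature in each direction. The dot-product scoring function and mean-pooled global features below are illustrative assumptions; the claim does not specify the scoring form.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def guided_attention(guide, items):
    """Attend over `items` (N, D) under the guidance of `guide` (D,):
    returns the attention weight vector and the attention-weighted feature."""
    scores = items @ guide                 # dot-product scoring (an assumption)
    weights = softmax(scores)              # attention weight vector, sums to 1
    weighted = weights @ items             # attention-weighted feature (D,)
    return weights, weighted

rng = np.random.default_rng(1)
D = 8
clip_feats = rng.normal(size=(10, D))      # 10 video-clip features
word_feats = rng.normal(size=(6, D))       # 6 contextual word features
sent_feat = word_feats.mean(axis=0)        # crude global sentence feature
vid_feat = clip_feats.mean(axis=0)         # crude global video feature

# Symmetric association: each modality guides attention over the other.
v_weights, v_weighted = guided_attention(sent_feat, clip_feats)
s_weights, s_weighted = guided_attention(vid_feat, word_feats)
print(v_weights.shape, s_weights.shape)    # (10,) (6,)
```

High values in `v_weights` mark video clips relevant to the sentence semantics, while high values in `s_weights` mark words carrying temporal-localization cues.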
4. The video temporal sentence localization method based on attention regression according to claim 1, wherein outputting the video temporal sentence localization result through the regression mechanism based on attention weights or the regression mechanism based on attention-weighted features, according to the attention weight vectors or attention-weighted features of the video and the sentence, further comprises:
in the regression based on attention weights, taking the video attention weight vector as input and regressing, through multilayer fully connected operations, the relative position within the global video of the video content described by the sentence;
in the regression based on attention-weighted features, first fusing the attention-weighted video feature and the attention-weighted sentence feature to obtain a multi-modal attention-weighted feature, and then taking the multi-modal attention-weighted feature as input and regressing, through multilayer fully connected operations, the relative position within the global video of the video content described by the sentence.
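The two regression variants above can be sketched as small fully connected networks. The two-layer depth, ReLU activation, concatenation-based fusion, and sigmoid output interpreted as (start, end) fractions of the video length are all illustrative assumptions; the claim only specifies multilayer fully connected operations regressing a relative position.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp_regress(x, W1, b1, W2, b2):
    """Two fully connected layers; the sigmoid keeps the regressed
    (start, end) pair as relative positions in [0, 1] of the global video."""
    h = np.maximum(0.0, W1 @ x + b1)       # ReLU hidden layer
    return sigmoid(W2 @ h + b2)            # shape (2,): relative start / end

rng = np.random.default_rng(2)
N, D, H = 10, 8, 16
v_weights = rng.dirichlet(np.ones(N))      # video attention weight vector
v_weighted = rng.normal(size=(D,))         # attention-weighted video feature
s_weighted = rng.normal(size=(D,))         # attention-weighted sentence feature

# Variant 1: regress directly from the attention weight vector.
W1 = rng.normal(scale=0.1, size=(H, N)); b1 = np.zeros(H)
W2 = rng.normal(scale=0.1, size=(2, H)); b2 = np.zeros(2)
pos_from_weights = mlp_regress(v_weights, W1, b1, W2, b2)

# Variant 2: fuse the two attention-weighted features first (here by
# concatenation, an assumption), then regress from the fused feature.
fused = np.concatenate([v_weighted, s_weighted])
W1f = rng.normal(scale=0.1, size=(H, 2 * D)); b1f = np.zeros(H)
pos_from_features = mlp_regress(fused, W1f, b1f, W2, b2)
print(pos_from_weights, pos_from_features)
```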
5. The video temporal sentence localization method based on attention regression according to any one of claims 1-4, further comprising:
iteratively training the model parameters through a back-propagation algorithm according to an attention regression loss function and an attention calibration loss function, so as to obtain a model for the video temporal sentence localization method based on attention regression.
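The two loss terms named above can be sketched as follows. The claim does not define their exact forms, so the choices here are assumptions: a smooth-L1 regression loss on the predicted versus annotated relative positions, and a calibration loss that rewards attention mass falling on clips inside the annotated segment.

```python
import numpy as np

def smooth_l1(pred, target):
    """Attention regression loss on predicted vs. annotated positions."""
    d = np.abs(pred - target)
    return np.where(d < 1.0, 0.5 * d ** 2, d - 0.5).sum()

def calibration_loss(weights, inside_mask, eps=1e-8):
    """Attention calibration loss: maximize the (log) attention mass that
    falls inside the annotated segment, penalizing mass outside it."""
    inside_mass = weights[inside_mask].sum()
    return -np.log(inside_mass + eps)

weights = np.array([0.05, 0.05, 0.3, 0.4, 0.1, 0.1])   # video attention weights
inside = np.array([False, False, True, True, True, False])
pred = np.array([0.30, 0.70])                           # predicted (start, end)
target = np.array([0.33, 0.66])                         # annotated (start, end)

# Total training loss; the 0.5 weighting of the two terms is an assumption.
loss = smooth_l1(pred, target) + 0.5 * calibration_loss(weights, inside)
print(float(loss))
```

In training, gradients of this combined loss would be propagated back through the regression layers, the attention mechanism, and the encoders, per claim 5.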
6. A video temporal sentence localization device based on attention regression, characterized by comprising:
a characterization module, configured to encode video clips and the sentence content according to a three-dimensional convolutional neural network and the GloVe word-vector mechanism, further through a bidirectional long short-term memory network, so as to characterize the video clip content and the sentence content;
an acquisition module, configured to establish a symmetric association between the video and the sentence through a multi-modal attention mechanism according to the video clip content and the sentence content, so as to obtain attention weight vectors and attention-weighted features of the video and the sentence;
a localization module, configured to output the video temporal sentence localization result through a regression mechanism based on attention weights or a regression mechanism based on attention-weighted features, according to the attention weight vectors or attention-weighted features of the video and the sentence.
7. The video temporal sentence localization device based on attention regression according to claim 6, wherein the characterization module is further configured to:
characterize the video clip content fused with the contextual information of the global video, and characterize each word of the sentence according to the contextual information of the sentence by using GloVe word vectors and the bidirectional long short-term memory network.
8. The video temporal sentence localization device based on attention regression according to claim 6, wherein the acquisition module is further configured to:
generate the video attention weight vector and the attention-weighted video feature under the guidance of the sentence feature, so as to obtain the key video content closely associated with the sentence semantics;
generate the sentence attention weight vector and the attention-weighted sentence feature under the guidance of the video clip content, so as to obtain the key cues in the sentence relevant to temporal localization.
9. The video temporal sentence localization device based on attention regression according to claim 6, wherein the localization module is further configured to:
in the regression based on attention weights, take the video attention weight vector as input and regress, through multilayer fully connected operations, the relative position within the global video of the video content described by the sentence;
in the regression based on attention-weighted features, first fuse the attention-weighted video feature and the attention-weighted sentence feature to obtain a multi-modal attention-weighted feature, and then take the multi-modal attention-weighted feature as input and regress, through multilayer fully connected operations, the relative position within the global video of the video content described by the sentence.
10. The video temporal sentence localization device based on attention regression according to any one of claims 6-9, further comprising a training module configured to:
iteratively train the model parameters through a back-propagation algorithm according to an attention regression loss function and an attention calibration loss function, so as to obtain a model for the video temporal sentence localization device based on attention regression.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810367989.8A CN108647255A (en) | 2018-04-23 | 2018-04-23 | The video sequential sentence localization method and device returned based on attention |
PCT/CN2018/113805 WO2019205562A1 (en) | 2018-04-23 | 2018-11-02 | Attention regression-based method and device for positioning sentence in video timing sequence |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810367989.8A CN108647255A (en) | 2018-04-23 | 2018-04-23 | The video sequential sentence localization method and device returned based on attention |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108647255A true CN108647255A (en) | 2018-10-12 |
Family
ID=63747336
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810367989.8A Pending CN108647255A (en) | 2018-04-23 | 2018-04-23 | The video sequential sentence localization method and device returned based on attention |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN108647255A (en) |
WO (1) | WO2019205562A1 (en) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109889923A (en) * | 2019-02-28 | 2019-06-14 | 杭州一知智能科技有限公司 | Utilize the method for combining the layering of video presentation to summarize video from attention network |
CN109948691A (en) * | 2019-03-14 | 2019-06-28 | 齐鲁工业大学 | Iamge description generation method and device based on depth residual error network and attention |
CN110188360A (en) * | 2019-06-06 | 2019-08-30 | 北京百度网讯科技有限公司 | Model training method and device |
CN110225368A (en) * | 2019-06-27 | 2019-09-10 | 腾讯科技(深圳)有限公司 | A kind of video locating method, device and electronic equipment |
WO2019205562A1 (en) * | 2018-04-23 | 2019-10-31 | 清华大学 | Attention regression-based method and device for positioning sentence in video timing sequence |
CN110688446A (en) * | 2019-08-23 | 2020-01-14 | 重庆兆光科技股份有限公司 | Sentence meaning mathematical space representation method, system, medium and equipment |
CN110717054A (en) * | 2019-09-16 | 2020-01-21 | 清华大学 | Method and system for generating video by crossing modal characters based on dual learning |
CN110933518A (en) * | 2019-12-11 | 2020-03-27 | 浙江大学 | Method for generating query-oriented video abstract by using convolutional multi-layer attention network mechanism |
WO2020113468A1 (en) * | 2018-12-05 | 2020-06-11 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method and apparatus for grounding a target video clip in a video |
CN111368870A (en) * | 2019-10-31 | 2020-07-03 | 杭州电子科技大学 | Video time sequence positioning method based on intra-modal collaborative multi-linear pooling |
CN111836111A (en) * | 2019-04-17 | 2020-10-27 | 微软技术许可有限责任公司 | Technique for generating barrage |
CN112015955A (en) * | 2020-09-01 | 2020-12-01 | 清华大学 | Multi-mode data association method and device |
CN112560811A (en) * | 2021-02-19 | 2021-03-26 | 中国科学院自动化研究所 | End-to-end automatic detection research method for audio-video depression |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110866938B (en) * | 2019-11-21 | 2021-04-27 | 北京理工大学 | Full-automatic video moving object segmentation method |
CN112200250A (en) * | 2020-10-14 | 2021-01-08 | 重庆金山医疗器械有限公司 | Digestive tract segmentation identification method, device and equipment of capsule endoscope image |
CN113762322B (en) * | 2021-04-22 | 2024-06-25 | 腾讯科技(北京)有限公司 | Video classification method, device and equipment based on multi-modal representation and storage medium |
CN116363817B (en) * | 2023-02-02 | 2024-01-02 | 淮阴工学院 | Chemical plant dangerous area invasion early warning method and system |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104199933B (en) * | 2014-09-04 | 2017-07-07 | 华中科技大学 | The football video event detection and semanteme marking method of a kind of multimodal information fusion |
US11409791B2 (en) * | 2016-06-10 | 2022-08-09 | Disney Enterprises, Inc. | Joint heterogeneous language-vision embeddings for video tagging and search |
CN107038221B (en) * | 2017-03-22 | 2020-11-17 | 杭州电子科技大学 | Video content description method based on semantic information guidance |
CN107066973B (en) * | 2017-04-17 | 2020-07-21 | 杭州电子科技大学 | Video content description method using space-time attention model |
CN108647255A (en) * | 2018-04-23 | 2018-10-12 | 清华大学 | The video sequential sentence localization method and device returned based on attention |
- 2018-04-23 CN CN201810367989.8A patent/CN108647255A/en active Pending
- 2018-11-02 WO PCT/CN2018/113805 patent/WO2019205562A1/en active Application Filing
Non-Patent Citations (1)
Title |
---|
YUAN, YITIAN et al.: "To Find Where You Talk: Temporal Sentence Localization in Video with Attention Based Location Regression", 《HTTPS://ARXIV.ORG/ABS/1804.07014V1》 * |
Cited By (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019205562A1 (en) * | 2018-04-23 | 2019-10-31 | 清华大学 | Attention regression-based method and device for positioning sentence in video timing sequence |
WO2020113468A1 (en) * | 2018-12-05 | 2020-06-11 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method and apparatus for grounding a target video clip in a video |
CN111480166A (en) * | 2018-12-05 | 2020-07-31 | 北京百度网讯科技有限公司 | Method and device for positioning target video clip from video |
US11410422B2 (en) | 2018-12-05 | 2022-08-09 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method and apparatus for grounding a target video clip in a video |
CN109889923A (en) * | 2019-02-28 | 2019-06-14 | 杭州一知智能科技有限公司 | Utilize the method for combining the layering of video presentation to summarize video from attention network |
CN109889923B (en) * | 2019-02-28 | 2021-03-26 | 杭州一知智能科技有限公司 | Method for summarizing videos by utilizing layered self-attention network combined with video description |
CN109948691A (en) * | 2019-03-14 | 2019-06-28 | 齐鲁工业大学 | Iamge description generation method and device based on depth residual error network and attention |
US11877016B2 (en) | 2019-04-17 | 2024-01-16 | Microsoft Technology Licensing, Llc | Live comments generating |
CN111836111A (en) * | 2019-04-17 | 2020-10-27 | 微软技术许可有限责任公司 | Technique for generating barrage |
CN110188360B (en) * | 2019-06-06 | 2023-04-25 | 北京百度网讯科技有限公司 | Model training method and device |
CN110188360A (en) * | 2019-06-06 | 2019-08-30 | 北京百度网讯科技有限公司 | Model training method and device |
CN110225368B (en) * | 2019-06-27 | 2020-07-10 | 腾讯科技(深圳)有限公司 | Video positioning method and device and electronic equipment |
CN110225368A (en) * | 2019-06-27 | 2019-09-10 | 腾讯科技(深圳)有限公司 | A kind of video locating method, device and electronic equipment |
CN110688446A (en) * | 2019-08-23 | 2020-01-14 | 重庆兆光科技股份有限公司 | Sentence meaning mathematical space representation method, system, medium and equipment |
CN110717054B (en) * | 2019-09-16 | 2022-07-15 | 清华大学 | Method and system for generating video by crossing modal characters based on dual learning |
CN110717054A (en) * | 2019-09-16 | 2020-01-21 | 清华大学 | Method and system for generating video by crossing modal characters based on dual learning |
CN111368870A (en) * | 2019-10-31 | 2020-07-03 | 杭州电子科技大学 | Video time sequence positioning method based on intra-modal collaborative multi-linear pooling |
CN111368870B (en) * | 2019-10-31 | 2023-09-05 | 杭州电子科技大学 | Video time sequence positioning method based on inter-modal cooperative multi-linear pooling |
CN110933518B (en) * | 2019-12-11 | 2020-10-02 | 浙江大学 | Method for generating query-oriented video abstract by using convolutional multi-layer attention network mechanism |
CN110933518A (en) * | 2019-12-11 | 2020-03-27 | 浙江大学 | Method for generating query-oriented video abstract by using convolutional multi-layer attention network mechanism |
CN112015955A (en) * | 2020-09-01 | 2020-12-01 | 清华大学 | Multi-mode data association method and device |
CN112015955B (en) * | 2020-09-01 | 2021-07-30 | 清华大学 | Multi-mode data association method and device |
CN112560811A (en) * | 2021-02-19 | 2021-03-26 | 中国科学院自动化研究所 | End-to-end automatic detection research method for audio-video depression |
US11963771B2 (en) | 2021-02-19 | 2024-04-23 | Institute Of Automation, Chinese Academy Of Sciences | Automatic depression detection method based on audio-video |
Also Published As
Publication number | Publication date |
---|---|
WO2019205562A1 (en) | 2019-10-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108647255A (en) | The video sequential sentence localization method and device returned based on attention | |
CN108052937B (en) | Based on Weakly supervised character machining device training method, device, system and medium | |
CN111738111B (en) | Road extraction method of high-resolution remote sensing image based on multi-branch cascade cavity space pyramid | |
CN109271646A (en) | Text interpretation method, device, readable storage medium storing program for executing and computer equipment | |
CN108665506A (en) | Image processing method, device, computer storage media and server | |
CN109635204A (en) | Online recommender system based on collaborative filtering and length memory network | |
CN109671102A (en) | A kind of composite type method for tracking target based on depth characteristic fusion convolutional neural networks | |
US20220092297A1 (en) | Facial beauty prediction method and device based on multi-task migration | |
CN105740984A (en) | Product concept performance evaluation method based on performance prediction | |
CN112199600A (en) | Target object identification method and device | |
CN114925238B (en) | Federal learning-based video clip retrieval method and system | |
CN110442741A (en) | A kind of mutual search method of cross-module state picture and text for merging and reordering based on tensor | |
US20230368500A1 (en) | Time-series image description method for dam defects based on local self-attention | |
CN113870312B (en) | Single target tracking method based on twin network | |
CN113255701B (en) | Small sample learning method and system based on absolute-relative learning framework | |
Huang et al. | An incremental SAR target recognition framework via memory-augmented weight alignment and enhancement discrimination | |
CN117213470A (en) | Multi-machine fragment map aggregation updating method and system | |
CN111651577A (en) | Cross-media data association analysis model training method, data association analysis method and system | |
CN117453949A (en) | Video positioning method and device | |
CN115017377B (en) | Method, device and computing equipment for searching target model | |
Chen et al. | Application of Data‐Driven Iterative Learning Algorithm in Transmission Line Defect Detection | |
Chaalal et al. | Mobility prediction for aerial base stations for a coverage extension in 5G networks | |
Zha et al. | [Retracted] Research on the Prediction of Port Economic Synergy Development Trend Based on Deep Neural Networks | |
CN113239219A (en) | Image retrieval method, system, medium and equipment based on multi-modal query | |
CN115935001A (en) | Frame-level fine-grained natural language video time positioning method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20181012 |