CN106934352A - A video description method based on a two-way fractal network and LSTM - Google Patents
Classifications
- G06V20/46 — Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
- G06F18/214 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06N3/02 — Neural networks
Abstract
The invention discloses a video description method based on a two-way fractal network and LSTM. The method first samples key frames from the video to be described and extracts optical-flow features between adjacent frames of the original video; two fractal networks then learn high-level feature representations of the video frames and the optical-flow features respectively; these are fed into two separate recurrent neural network models based on LSTM units; finally, the outputs of the two independent models at each time step are weighted and averaged to obtain a descriptive sentence corresponding to the video. The invention exploits both the original video frames and the optical flow of the video to be described: the added optical-flow features compensate for the dynamic information inevitably lost by frame sampling, so that variation of the video along both the spatial and temporal dimensions is taken into account. Furthermore, the novel fractal networks turn low-level features into abstract visual feature representations, so that the people, objects, actions and spatial relationships involved in the video can be analyzed and mined more accurately.
Description
Technical field
The invention belongs to the technical fields of video description and deep learning, and in particular relates to a video description method based on a two-way fractal network and LSTM.
Background art
With the progress of science and technology and the development of society, camera-equipped terminals of all kinds, especially smartphones, have become ubiquitous, and hardware prices keep falling, so that multimedia information streams are growing exponentially. Faced with massive volumes of video, how to analyze, recognize and understand video information automatically and efficiently, with as little manual intervention as possible, and thereby describe it at the semantic level, has become a hot topic in present-day image processing and computer vision research. For most people, describing a brief video in language after watching it may be a very simple task. For a machine, however, extracting the pixel information of each frame of the video, analyzing and processing it, and generating a natural-language description from it is a challenging task.
Enabling machines to describe video efficiently and automatically has wide applications in computer vision fields such as video retrieval, human-computer interaction and traffic surveillance, and will further promote research into the semantic description of video.
Content of the invention
The primary object of the present invention is to overcome the shortcomings and deficiencies of the prior art by providing a video description method based on a two-way fractal network and LSTM.
To achieve the above object, the present invention adopts the following technical scheme:
A video description method based on a two-way fractal network and LSTM, characterized in that key frames are first sampled from the video to be described and optical-flow features between adjacent frames of the original video are extracted; two fractal networks then learn high-level feature representations of the key frames and the optical-flow features respectively; these are fed into two recurrent neural network models based on LSTM units; finally, the outputs of the two independent recurrent models at each time step are weighted and averaged to obtain a descriptive sentence corresponding to the video. The method specifically comprises the following steps:
S1: sample key frames from the video to be described, and extract optical-flow features between adjacent frames of the original video;
S2: learn high-level feature representations of the video frames and the optical-flow features with two fractal networks respectively;
S3: feed the high-level feature vectors obtained in the previous step into two recurrent neural networks based on LSTM units respectively;
S4: weight and average the outputs of the two independent models at each time step to obtain the descriptive sentence corresponding to the video.
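The four steps S1–S4 can be sketched end to end as a small pipeline. This is only an illustrative skeleton, not the patent's implementation: every callable passed in (`sample_keyframes`, `optical_flow`, the two networks, the two LSTMs, `fuse_decode`) is a hypothetical placeholder.

```python
def describe_video(frames, sample_keyframes, optical_flow,
                   spatial_net, temporal_net, rgb_lstm, flow_lstm,
                   fuse_decode):
    """Sketch of steps S1-S4; all callables are hypothetical placeholders."""
    keys = sample_keyframes(frames)                        # S1: key frames
    flows = [optical_flow(a, b)                            # S1: flow between
             for a, b in zip(frames, frames[1:])]          #     adjacent frames
    v_feats = [spatial_net(k) for k in keys]               # S2: spatial stream
    m_feats = [temporal_net(f) for f in flows]             # S2: temporal stream
    z_rgb = rgb_lstm(v_feats)                              # S3: two LSTM models
    z_flow = flow_lstm(m_feats)
    return fuse_decode(z_rgb, z_flow)                      # S4: fuse and decode
```

With trivial stand-ins for each stage the data flow can be traced by hand, which is the only point of the sketch.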
Preferably, extracting optical-flow features from the video to be described in step S1 specifically comprises:
S1.1: computing the optical-flow values in the x and y directions for each pair of adjacent frames, and normalizing them to the pixel range [0, 255];
S1.2: computing the magnitude of the optical flow, and combining it with the optical-flow values obtained in the previous step into a single flow image.
Preferably, obtaining the high-level feature representations of the key frames and optical-flow features in step S2 specifically comprises:
S2.1: feeding the video key frames obtained in step S1, in temporal order, into a first fractal network that handles spatial relationships, which generates the corresponding visual feature vectors in turn through the nonlinear mappings of the network;
S2.2: feeding the flow images obtained in step S1, in temporal order, into a second fractal network that handles temporal relationships, which generates the corresponding motion feature vectors in turn through the nonlinear mappings of the network.
Preferably, in steps S2.1 and S2.2 a very deep network is generated by repeated application of a single expansion rule, and its layout is a truncated fractal; the network contains interacting subpaths of different lengths, but no pass-through connections; meanwhile, to make it possible to extract high-performance fixed-depth subnetworks, a path-dropping method is used to regularize the co-adaptation of subpaths within the fractal architecture; for a fractal network, matching the simplicity of its design, training is simple as well: a single loss function connected to the last layer is sufficient to drive internal behavior that mimics deep supervision; the fractal networks used are deep convolutional neural networks based on a fractal structure.
Preferably, the very deep network generated in steps S2.1 and S2.2 by repeated application of a single expansion rule, whose layout is a truncated fractal, is specifically as follows:
The base case f_1(z) contains a single layer of a chosen type between input and output. Let C denote the index of the truncated fractal f_C(·); f_C(·) defines the network architecture, its connections and its layer types. The base case, a network containing a single convolutional layer, is given by formula (1-1):
f_1(z) = conv(z)   (1-1)
Successive fractals are then defined recursively by formula (1-2):
f_{C+1}(z) = [ (f_C ∘ f_C)(z) ] ⊕ [ conv(z) ]   (1-2)
In formula (1-2), ∘ denotes composition and ⊕ denotes the join operation; C corresponds to the number of columns, i.e. the width of the network f_C(·). Depth, defined as the number of conv layers on the longest path from input to output, is proportional to 2^{C-1}. Convolutional networks for classification usually intersperse pooling layers; to the same end, f_C(·) is used as a building block and stacked B times with interleaved pooling layers, giving a total depth of B·2^{C-1}. The join operation ⊕ merges two feature blocks into one; a feature block is the output of a conv layer: a tensor holding activations for a fixed number of channels over a spatial region, the number of channels corresponding to the number of filters of the preceding conv layer. When a fractal is expanded, adjacent joins are merged into a single join layer, which merges all of its input feature blocks into a single output block.
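The expansion rule can be sketched structurally in a few lines. This is a toy sketch, not a trainable network: `conv` is any callable standing in for a convolutional layer, and the join is assumed here to average its two inputs (one common realization; the patent does not fix the join's arithmetic).

```python
def fractal(C, conv):
    """Build f_C by the expansion rule:
    f_1 = conv;  f_{C+1}(z) = join(f_C(f_C(z)), conv(z))."""
    if C == 1:
        return lambda z: conv(z)
    inner = fractal(C - 1, conv)
    # Join realized as the elementwise mean of the two branch outputs
    # (an assumption; any merge of feature blocks would fit the rule).
    return lambda z: 0.5 * (inner(inner(z)) + conv(z))

def longest_path(C):
    """Conv layers on the longest input-to-output path of f_C:
    d(1) = 1, d(C+1) = 2 * d(C), i.e. 2**(C-1)."""
    return 1 if C == 1 else 2 * longest_path(C - 1)
```

Using `conv = lambda z: z + 1.0` (each "layer" adds 1) makes the branch structure visible by hand-tracing the output, and `longest_path` reproduces the 2^{C-1} depth claim.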
Preferably, the path-dropping method used in steps S2.1 and S2.2 to regularize the co-adaptation of subpaths within the fractal architecture is specifically as follows: because a fractal network contains additional large-scale structure, a coarse-grained regularization strategy similar to dropout and drop-connect is used; drop-path forbids the co-adaptation of parallel paths by randomly dropping inputs of the join layers, which effectively prevents the over-fitting behavior in which the network uses one path as an anchor and another path as a correction. Two sampling policies are used:
Local: a join layer drops each input with fixed probability, but is guaranteed to keep at least one input;
Global: a single path is selected for the whole network; by restricting this path to a single column, each column is encouraged to become a strong predictor.
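The two sampling policies can be sketched as follows; this is an illustrative sketch only, and the averaging of surviving join inputs is an assumption the patent does not spell out.

```python
import random

def local_join(inputs, p_drop=0.15, rng=random):
    """Local sampling: drop each join input with a fixed probability,
    but always keep at least one; average the survivors."""
    kept = [x for x in inputs if rng.random() >= p_drop]
    if not kept:                      # guarantee at least one input
        kept = [rng.choice(inputs)]
    return sum(kept) / len(kept)

def global_column(num_columns, rng=random):
    """Global sampling: pick one column for the entire network, so
    that single column alone must act as a strong predictor."""
    return rng.randrange(num_columns)
```

With `p_drop=0.0` nothing is dropped and the join is a plain average; with `p_drop=1.0` the at-least-one guarantee kicks in, which is the behavior the local policy requires.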
Preferably, the two recurrent neural network models based on LSTM units into which the high-level feature vectors are fed in step S3 are specifically as follows:
Each LSTM-based recurrent neural network contains two layers of LSTM units; the first and second layers each contain 1000 neurons, and the forward propagation of each LSTM unit can be expressed as:
i_t = σ(W_{xi} x_t + W_{hi} h_{t-1} + b_i)   (1-3)
f_t = σ(W_{xf} x_t + W_{hf} h_{t-1} + b_f)   (1-4)
o_t = σ(W_{xo} x_t + W_{ho} h_{t-1} + b_o)   (1-5)
g_t = tanh(W_{xg} x_t + W_{hg} h_{t-1} + b_g)   (1-6)
c_t = f_t * c_{t-1} + i_t * g_t   (1-7)
where σ(x) = (1 + e^{-x})^{-1} is the sigmoid nonlinear activation function and tanh(x) = (e^x − e^{-x}) / (e^x + e^{-x}) is the hyperbolic-tangent nonlinear activation function; i_t, f_t, o_t, c_t denote the states at time t of the input gate, forget gate, output gate and memory cell respectively; for each gate, W_{xi}, W_{xf}, W_{xo}, W_{xg} denote the weight matrices applied to the input for the input gate, forget gate, output gate and cell candidate respectively, W_{hi}, W_{hf}, W_{ho}, W_{hg} denote the weight matrices applied to the hidden-layer variable h_{t-1} of the previous time step, and b_i, b_f, b_o, b_g denote the corresponding bias vectors.
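One forward step of formulas (1-3)–(1-7) can be written directly in NumPy. The hidden-state update h_t = o_t * tanh(c_t) is not shown in the excerpt's equations but is implied by the use of h_{t-1}, so it is included here as the standard completion; the parameter dictionaries are an illustrative layout, not the patent's.

```python
import numpy as np

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM forward step per formulas (1-3)-(1-7).
    W, U, b hold per-gate parameters keyed 'i', 'f', 'o', 'g'."""
    sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
    i = sigma(W['i'] @ x_t + U['i'] @ h_prev + b['i'])    # input gate  (1-3)
    f = sigma(W['f'] @ x_t + U['f'] @ h_prev + b['f'])    # forget gate (1-4)
    o = sigma(W['o'] @ x_t + U['o'] @ h_prev + b['o'])    # output gate (1-5)
    g = np.tanh(W['g'] @ x_t + U['g'] @ h_prev + b['g'])  # candidate   (1-6)
    c = f * c_prev + i * g                                # cell state  (1-7)
    h = o * np.tanh(c)                                    # standard h_t update
    return h, c

# Example with all-zero parameters: every gate evaluates to sigma(0) = 0.5.
n, d = 2, 3
W = {k: np.zeros((n, d)) for k in 'ifog'}
U = {k: np.zeros((n, n)) for k in 'ifog'}
b = {k: np.zeros(n) for k in 'ifog'}
h, c = lstm_step(np.zeros(d), np.zeros(n), np.array([2.0, 0.0]), W, U, b)
```

With zero parameters the cell simply halves the previous cell state (f = 0.5, g = 0), which makes the gating arithmetic easy to verify by hand.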
Preferably, the neural network structure in step S3 is:
a recurrent neural network built from two stacked layers of LSTM units, as in the structure diagram based on two-layer LSTM units, which encodes and decodes the input feature vectors and thereby realizes the conversion into natural-language text. The first LSTM layer encodes the visual feature vector input at each time step, and its hidden-state output at each time step serves as the input of the second LSTM layer; once the feature vectors of all video frames have been fed into the first LSTM layer, the second LSTM layer receives an indicator and begins the decoding task. During decoding the network loses information, so the goal of training and learning the model parameters is to maximize the log-likelihood of the whole predicted output sentence, given the hidden representation and the output of the previous time step. For a model with parameters θ and output sentence Y = (y_1, y_2, …, y_m), the parameter optimization objective can be expressed as:
θ* = argmax_θ Σ_{t=1}^{m} log p(y_t | h, y_{t-1}; θ)
where θ are the parameters, Y is the predicted output sentence and h is the hidden representation; the objective function is optimized with stochastic gradient descent, and the error of the whole network is accumulated and propagated along the time dimension by the back-propagation algorithm.
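The training objective above amounts to minimizing the negative log-likelihood of the target sentence under the per-step word distributions. A minimal sketch, assuming the per-step distributions have already been computed:

```python
import numpy as np

def sentence_nll(probs_seq, target_ids):
    """Negative log-likelihood of a target sentence under per-step
    word distributions p(y_t | h, y_{t-1}; theta); maximizing the
    log-likelihood equals minimizing this quantity."""
    return -sum(np.log(p[y]) for p, y in zip(probs_seq, target_ids))
```

An SGD loop would differentiate this loss through the unrolled network, which is what backpropagation through time accumulates along the time dimension.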
Preferably, in step S4 the weighted averaging of the per-time-step outputs of the two independent neural network models to obtain the descriptive sentence corresponding to the video specifically comprises:
S4.1: weighting and averaging, at each time step, the outputs of the second-layer LSTM neurons of the two independent recurrent neural network models;
S4.2: computing the occurrence probability of each word in the vocabulary V with the softmax function, expressed as:
p(y | z_t) = exp(W_y z_t) / Σ_{y'∈V} exp(W_{y'} z_t)
where y denotes the predicted word, z_t the output of the recurrent neural network at time t, and W_y the weight of that word in the vocabulary;
S4.3: at each decoding time step, taking the word with the highest probability in the softmax output, thereby obtaining the corresponding video description sentence.
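Steps S4.1–S4.3 for a single time step can be sketched as follows; the fusion weight `w` and the vocabulary projection matrix `W` (|V| × d) are illustrative assumptions, since the patent does not fix the averaging weights.

```python
import numpy as np

def softmax(z):
    z = z - z.max()               # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def decode_step(z_rgb, z_flow, W, w=0.5):
    """One decoding step: weight-average the two streams' LSTM
    outputs (S4.1), apply the vocabulary projection and softmax
    (S4.2), and take the argmax word (S4.3)."""
    z_t = w * z_rgb + (1.0 - w) * z_flow
    probs = softmax(W @ z_t)
    return int(np.argmax(probs)), probs
```

Greedy argmax decoding is what S4.3 describes; beam search would be a natural extension but is not claimed by the patent.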
Compared with the prior art, the present invention has the following advantages and beneficial effects:
1. The added optical-flow features compensate for the dynamic information inevitably lost by frame sampling, taking into account the variation of the video along both the spatial and temporal dimensions.
2. The video description method based on a two-way fractal network and LSTM provided by the invention processes an arbitrary input video and automatically generates a description of its content end to end; it can be applied in fields such as video retrieval, video surveillance and human-computer interaction.
3. The invention turns low-level features into abstract visual feature representations through novel fractal networks, so that the people, objects, actions and spatial relationships involved in the video can be analyzed and mined more accurately.
Brief description of the drawings
Fig. 1 is a flow diagram of the video description method based on a two-way fractal network and LSTM provided by the invention;
Fig. 2 is a schematic diagram of the fractal sub-network used in an embodiment of the invention;
Fig. 3 is a schematic diagram of the LSTM-based recurrent neural network used in an embodiment of the invention.
Specific embodiment
The present invention is described in further detail below with reference to embodiments and the accompanying drawings, but the embodiments of the present invention are not limited thereto.
Key frames are sampled from the video to be described and optical-flow features between adjacent frames of the original video are extracted; two fractal networks then learn high-level feature representations of the key frames and the optical-flow features respectively; these are fed into two recurrent neural network models based on LSTM units; finally, the outputs of the two independent recurrent models at each time step are weighted and averaged to obtain a descriptive sentence corresponding to the video.
Fig. 1 is the overall flow chart of the invention, comprising the following steps:
(1) Sample key frames from the video to be described, and extract optical-flow features between adjacent frames of the original video. The concrete operations for extracting the optical-flow features are:
1. compute the optical-flow values in the x and y directions for each pair of adjacent frames, and normalize them to the pixel range [0, 255];
2. compute the magnitude of the optical flow, and combine it with the optical-flow values obtained in the previous step into a single flow image.
(2) Learn high-level feature representations of the video frames and the optical-flow features with two fractal networks respectively. The sampled frames of the video obtained in the previous step are fed, in temporal order, into a first fractal network that handles spatial relationships, which generates the corresponding visual feature vectors in turn through its nonlinear mappings; the flow images are fed, in temporal order, into a second fractal network that handles temporal relationships, which generates the corresponding motion feature vectors in turn through its nonlinear mappings.
A fractal network essentially introduces a self-similarity-based layout strategy on the macro-architecture of a neural network: a very deep network is generated by repeated application of a single expansion rule, and its layout is a truncated fractal. The network contains interacting subpaths of different lengths, but no pass-through connections. Meanwhile, to make it possible to extract high-performance fixed-depth subnetworks, a path-dropping method is used to regularize the co-adaptation of subpaths within the fractal architecture. For a fractal network, matching the simplicity of its design, training is simple as well: a single loss function connected to the last layer is sufficient to drive internal behavior that mimics deep supervision. The fractal networks used in the present invention are deep convolutional neural networks based on a fractal structure.
As in the schematic diagram of the fractal structure given in Fig. 2, the base case f_1(z) contains a single layer of a chosen type between input and output; let C denote the index of the truncated fractal f_C(·); f_C(·) defines the network architecture, its connections and its layer types. The base case, a network containing a single convolutional layer, is given by formula (1-1):
f_1(z) = conv(z)   (1-1)
Successive fractal structures are then defined recursively by formula (1-2):
f_{C+1}(z) = [ (f_C ∘ f_C)(z) ] ⊕ [ conv(z) ]   (1-2)
In formula (1-2), ∘ denotes composition and ⊕ denotes the join operation; C corresponds to the number of columns, i.e. the width of the network f_C(·). Depth, defined as the number of conv layers on the longest path from input to output, is proportional to 2^{C-1}. Convolutional networks for classification usually intersperse pooling layers; to the same end, f_C(·) is used as a building block and stacked B times with interleaved pooling layers, giving a total depth of B·2^{C-1}. The join operation ⊕ merges two feature blocks into one; a feature block is the output of a convolutional layer: a tensor holding activations for a fixed number of channels over a spatial region, the number of channels corresponding to the number of filters of the preceding convolutional layer. When a fractal is expanded, adjacent joins are merged into a single join layer; as shown on the right of Fig. 2, this join layer spans multiple columns and merges all of its input feature blocks into a single output block.
Because a fractal network contains additional large-scale structure, a coarse-grained regularization strategy similar to dropout and drop-connect is proposed. Drop-path forbids the co-adaptation of parallel paths by randomly dropping inputs of the join layers, which effectively prevents the over-fitting behavior in which the network uses one path as an anchor and another path as a correction. Two sampling policies are mainly used here:
Local: a join layer drops each input with fixed probability, but is guaranteed to keep at least one input;
Global: a single path is selected for the whole network; by restricting this path to a single column, each column is encouraged to become a strong predictor.
(3) The high-level feature vectors obtained in the previous step are fed into two recurrent neural networks based on LSTM units respectively. Each LSTM-based recurrent neural network contains two layers of LSTM units; the first and second layers each contain 1000 neurons, and the forward propagation of each LSTM unit can be expressed as:
i_t = σ(W_{xi} x_t + W_{hi} h_{t-1} + b_i)   (1-3)
f_t = σ(W_{xf} x_t + W_{hf} h_{t-1} + b_f)   (1-4)
o_t = σ(W_{xo} x_t + W_{ho} h_{t-1} + b_o)   (1-5)
g_t = tanh(W_{xg} x_t + W_{hg} h_{t-1} + b_g)   (1-6)
c_t = f_t * c_{t-1} + i_t * g_t   (1-7)
where σ(x) = (1 + e^{-x})^{-1} is the sigmoid nonlinear activation function and tanh(x) = (e^x − e^{-x}) / (e^x + e^{-x}) is the hyperbolic-tangent nonlinear activation function; i_t, f_t, o_t, c_t denote the states at time t of the input gate, forget gate, output gate and memory cell respectively; for each gate, W_{xi}, W_{xf}, W_{xo}, W_{xg} denote the weight matrices applied to the input for the input gate, forget gate, output gate and cell candidate respectively, W_{hi}, W_{hf}, W_{ho}, W_{hg} denote the weight matrices applied to the hidden-layer variable h_{t-1} of the previous time step, and b_i, b_f, b_o, b_g denote the corresponding bias vectors.
As in the structure diagram of the recurrent neural network based on two layers of LSTM units given in Fig. 3, we use this two-layer stacked LSTM recurrent network to encode and decode the input feature vectors and thereby realize the conversion into natural-language text. The first LSTM layer encodes the visual feature vector input at each time step, and its hidden-state output at each time step serves as the input of the second LSTM layer; once the feature vectors of all video frames have been fed into the first LSTM layer, the second LSTM layer receives an indicator and begins the decoding task. During decoding the network loses information, so the goal of training and learning the model parameters is to maximize the log-likelihood of the whole predicted output sentence, given the hidden representation and the output of the previous time step. For a model with parameters θ and output sentence Y = (y_1, y_2, …, y_m), the parameter optimization objective can be expressed as:
θ* = argmax_θ Σ_{t=1}^{m} log p(y_t | h, y_{t-1}; θ)
where θ are the parameters, Y is the predicted output sentence and h is the hidden representation; the objective function is optimized with stochastic gradient descent, and the error of the whole network is accumulated and propagated along the time dimension by the back-propagation algorithm.
(4) The outputs of the two independent models at each time step are weighted and averaged to obtain the descriptive sentence corresponding to the video. The concrete operations are:
1. weight and average, at each time step, the outputs of the second-layer LSTM neurons of the two independent models;
2. compute the occurrence probability of each word in the vocabulary V with the softmax function, expressed as:
p(y | z_t) = exp(W_y z_t) / Σ_{y'∈V} exp(W_{y'} z_t)
where y denotes the predicted word, z_t the output of the recurrent neural network at time t, and W_y the weight of that word in the vocabulary;
3. at each decoding time step, take the word with the highest probability in the softmax output, thereby obtaining the corresponding video description sentence.
The above embodiment is a preferred implementation of the present invention, but the implementations of the present invention are not limited by the above embodiment; any other change, modification, substitution, combination or simplification made without departing from the spirit and principles of the present invention shall be deemed an equivalent replacement and is included within the scope of protection of the present invention.
Claims (9)
1. A video description method based on a two-way fractal network and LSTM, characterized in that key frames are first sampled from the video to be described and optical-flow features between adjacent frames of the original video are extracted; two fractal networks then learn high-level feature representations of the key frames and the optical-flow features respectively; these are fed into two recurrent neural network models based on LSTM units; finally, the outputs of the two independent recurrent models at each time step are weighted and averaged to obtain a descriptive sentence corresponding to the video; specifically comprising the following steps:
S1: sampling key frames from the video to be described, and extracting optical-flow features between adjacent frames of the original video;
S2: learning high-level feature representations of the key frames and the optical-flow features with two fractal networks respectively, wherein the fractal networks are generated by repeated application of a single expansion rule;
S3: feeding the high-level feature vectors obtained in the previous step into two recurrent neural network models based on LSTM units respectively;
S4: weighting and averaging the outputs of the two independent recurrent neural network models at each time step to obtain the descriptive sentence corresponding to the video.
2. The video description method based on a two-way fractal network and LSTM according to claim 1, characterized in that extracting optical-flow features from the video to be described in step S1 specifically comprises:
S1.1: computing the optical-flow values in the x and y directions for each pair of adjacent frames, and normalizing them to the pixel range [0, 255];
S1.2: computing the magnitude of the optical flow, and combining it with the optical-flow values obtained in the previous step into a single flow image.
3. The video description method based on a two-way fractal network and LSTM according to claim 1, characterized in that obtaining the high-level feature representations of the key frames and optical-flow features in step S2 specifically comprises:
S2.1: feeding the video key frames obtained in step S1, in temporal order, into a first fractal network that handles spatial relationships, which generates the corresponding visual feature vectors in turn through the nonlinear mappings of the network;
S2.2: feeding the flow images obtained in step S1, in temporal order, into a second fractal network that handles temporal relationships, which generates the corresponding motion feature vectors in turn through the nonlinear mappings of the network.
4. The video description method based on a two-way fractal network and LSTM according to claim 3, characterized in that in steps S2.1 and S2.2 a very deep network is generated by repeated application of a single expansion rule, its layout being a truncated fractal; the network contains interacting subpaths of different lengths, but no pass-through connections; meanwhile, to make it possible to extract high-performance fixed-depth subnetworks, a path-dropping method is used to regularize the co-adaptation of subpaths within the fractal architecture; for a fractal network, matching the simplicity of its design, training is simple as well: a single loss function connected to the last layer is sufficient to drive internal behavior that mimics deep supervision; the fractal networks used are deep convolutional neural networks based on a fractal structure.
5. The video description method based on a two-way fractal network and LSTM according to claim 4, characterized in that the very deep network generated in steps S2.1 and S2.2 by repeated application of a single expansion rule, whose layout is a truncated fractal, is specifically:
the base case f_1(z) contains a single layer of a chosen type between input and output; let C denote the index of the truncated fractal f_C(·); f_C(·) defines the network architecture, its connections and its layer types; wherein the base case, a network containing a single convolutional layer, is given by formula (1-1):
f_1(z) = conv(z)   (1-1)
and successive fractals are defined recursively by formula (1-2):
f_{C+1}(z) = [ (f_C ∘ f_C)(z) ] ⊕ [ conv(z) ]   (1-2)
in formula (1-2), ∘ denotes composition and ⊕ the join operation; C corresponds to the number of columns, i.e. the width of the network f_C(·); depth, defined as the number of conv layers on the longest path from input to output, is proportional to 2^{C-1}; convolutional networks for classification usually intersperse pooling layers; to the same end, f_C(·) is used as a building block and stacked B times with interleaved pooling layers, giving a total depth of B·2^{C-1}; the join operation ⊕ merges two feature blocks into one; a feature block is the output of a conv layer: a tensor holding activations for a fixed number of channels over a spatial region, the number of channels corresponding to the number of filters of the preceding conv layer; when a fractal is expanded, adjacent joins are merged into a single join layer, which merges all of its input feature blocks into a single output block.
6. A video description method based on a bidirectional fractal network and LSTM according to claim 4, characterized in that the path-dropping rule for regularizing co-adaptation of paths within the fractal architecture in steps S2.1 and S2.2 is specifically:

Because the fractal network contains additional large-scale structure, a coarse-grained regularization strategy similar to dropout and drop-connect is adopted: path dropping discourages co-adaptation of parallel paths by randomly discarding operands of the join layers. This prevents the overfitting behaviour that may arise when the network uses one path as an anchor and another path merely as a correction. Two sampling strategies are used:

Local: a join layer drops each input with a fixed probability, but at least one input is always retained;

Global: a single path is selected for the entire network, and restricting this path to a single column encourages each column to become a strong predictor in its own right.
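The two sampling strategies above can be illustrated with a small Python sketch. This is an assumption-laden illustration rather than the patent's code; the drop probability and column count are hypothetical values.

```python
import random

def local_sample(inputs, drop_prob=0.15):
    """Local drop-path: drop each join input with a fixed probability,
    but always retain at least one input."""
    kept = [x for x in inputs if random.random() >= drop_prob]
    if not kept:                       # guarantee at least one survivor
        kept = [random.choice(inputs)]
    return kept

def global_sample(num_columns):
    """Global drop-path: select one column for the entire network, so
    every join passes through only that column for this iteration."""
    return random.randrange(num_columns)

random.seed(0)
# A join layer with three incoming feature blocks (named abstractly).
survivors = local_sample(['col1', 'col2', 'col3'])
assert 1 <= len(survivors) <= 3

column = global_sample(4)
assert 0 <= column < 4
```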
7. A video description method based on a bidirectional fractal network and LSTM according to claim 1, characterized in that inputting the high-level feature vectors into the two recurrent neural network models based on LSTM units in step S3 is specifically:

Each recurrent neural network based on LSTM units comprises two layers of LSTM units, the first and second layers each containing 1000 neurons. The forward propagation of each LSTM neural unit can be expressed as:

i_t = σ(W_xi·x_t + W_hi·h_{t-1} + b_i) (1-3)

f_t = σ(W_xf·x_t + W_hf·h_{t-1} + b_f) (1-4)

o_t = σ(W_xo·x_t + W_ho·h_{t-1} + b_o) (1-5)

g_t = φ(W_xg·x_t + W_hg·h_{t-1} + b_g) (1-6)

c_t = f_t * c_{t-1} + i_t * g_t (1-7)

where σ(x) = 1/(1 + e^{-x}) is the sigmoid nonlinear activation function and φ(x) = (e^x − e^{-x})/(e^x + e^{-x}) is the hyperbolic tangent nonlinear activation function; i_t, f_t, o_t and c_t denote the state values at time t of the input gate, forget gate, output gate and memory cell respectively; W_xi, W_xf, W_xo and W_xg denote the weight matrices applied to the input for the input gate, forget gate, output gate and memory cell respectively; W_hi, W_hf, W_ho and W_hg denote the corresponding weight matrices applied to the hidden-layer variable h_{t-1} of time t−1; and b_i, b_f, b_o and b_g denote the corresponding bias vectors.
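The forward step above can be written directly in NumPy. A minimal sketch, not the patent's implementation: the gate and cell equations follow the formulas above, while the hidden-state update h_t = o_t·tanh(c_t) is the standard LSTM form, supplied here as an assumption since the excerpt lists only the gate and cell equations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_x, W_h, b):
    """One forward step of an LSTM unit.

    W_x, W_h, b are dicts keyed by gate name: 'i', 'f', 'o', 'g'
    (hypothetical parameter layout for illustration).
    Returns the new cell state c_t and hidden state h_t.
    """
    i_t = sigmoid(W_x['i'] @ x_t + W_h['i'] @ h_prev + b['i'])  # input gate
    f_t = sigmoid(W_x['f'] @ x_t + W_h['f'] @ h_prev + b['f'])  # forget gate
    o_t = sigmoid(W_x['o'] @ x_t + W_h['o'] @ h_prev + b['o'])  # output gate
    g_t = np.tanh(W_x['g'] @ x_t + W_h['g'] @ h_prev + b['g'])  # candidate
    c_t = f_t * c_prev + i_t * g_t          # memory-cell update
    h_t = o_t * np.tanh(c_t)                # standard hidden-state update
    return c_t, h_t

rng = np.random.default_rng(0)
n_in, n_hid = 8, 4                          # toy sizes (the claim uses 1000)
W_x = {k: rng.standard_normal((n_hid, n_in)) * 0.1 for k in 'ifog'}
W_h = {k: rng.standard_normal((n_hid, n_hid)) * 0.1 for k in 'ifog'}
b = {k: np.zeros(n_hid) for k in 'ifog'}
c, h = np.zeros(n_hid), np.zeros(n_hid)
c, h = lstm_step(rng.standard_normal(n_in), h, c, W_x, W_h, b)
assert c.shape == (n_hid,) and h.shape == (n_hid,)
```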
8. A video description method based on a bidirectional fractal network and LSTM according to claim 7, characterized in that the artificial neural network structure in step S3 is:

A recurrent neural network based on two layers of LSTM units realizes the conversion into natural-language text. The first-layer LSTM neurons encode the input visual feature vector at each moment, and the hidden-layer representation output at each moment serves as input to the second-layer LSTM neurons. After the feature vectors of all video frames have been input to the first-layer LSTM neurons, the second-layer LSTM neurons receive an indicator and begin the decoding task. Because information is lost in the decoding stage, the goal of model parameter training and learning is to maximize the log-likelihood of the whole predicted output sentence, given the hidden-layer representation and the output of the previous moment. For a model with parameters θ and output sentence Y = (y_1, y_2, …, y_m), the parameter optimization objective can be expressed as:

θ* = argmax_θ Σ_{t=1}^{m} log p(y_t | h, y_{t-1}; θ) (1-8)

Here θ denotes the parameters, Y the predicted output sentence, and h the hidden-layer representation. The objective function is optimized with stochastic gradient descent, and the error of the whole network is accumulated and propagated over the time dimension by the back-propagation algorithm.
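The training objective described above, the log-likelihood of the whole output sentence, reduces to a sum of per-word log-probabilities. A minimal sketch with an assumed interface: `word_probs[t]` maps each vocabulary word to its predicted probability at step t.

```python
import math

def sentence_log_likelihood(word_probs, sentence):
    """Log-likelihood of an output sentence Y = (y_1, ..., y_m), where
    word_probs[t][y] stands in for p(y_t | h, y_{t-1}; theta).
    Training maximizes this quantity over the parameters theta.
    """
    return sum(math.log(word_probs[t][y]) for t, y in enumerate(sentence))

# Toy example with a three-word vocabulary (hypothetical probabilities).
probs = [{'a': 0.7, 'man': 0.2, 'runs': 0.1},
         {'a': 0.1, 'man': 0.8, 'runs': 0.1},
         {'a': 0.1, 'man': 0.1, 'runs': 0.8}]
ll = sentence_log_likelihood(probs, ['a', 'man', 'runs'])
assert ll == math.log(0.7) + math.log(0.8) + math.log(0.8)
```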
9. A video description method based on a bidirectional fractal network and LSTM according to claim 1, characterized in that weighted-averaging the output values of the two independent neural network models at each moment to obtain the descriptive sentence corresponding to the video in step S4 is specifically:

S4.1, the output values of the second-layer LSTM neurons of the two independent recurrent neural network models are weighted-averaged at each moment;

S4.2, the probability of occurrence of each word in the vocabulary V is computed with the softmax function, expressed as:

p(y | z_t) = exp(W_y·z_t) / Σ_{y'∈V} exp(W_{y'}·z_t) (1-9)

where y denotes the predicted word, z_t denotes the output value of the recurrent neural network at time t, and W_y denotes the weight of the word in the vocabulary;

S4.3, in the decoding stage at each moment, the word with the highest probability in the softmax output is taken, thereby obtaining the corresponding video description sentence.
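Steps S4.1 to S4.3 for one decoding step can be sketched as follows. An illustrative sketch with assumed equal fusion weights for the two streams; stream names and the toy vocabulary are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())            # numerically stable softmax
    return e / e.sum()

def decode_step(z_stream1, z_stream2, vocab):
    """Fuse the two models' layer-2 outputs at one time step and pick
    the most probable word (steps S4.1-S4.3)."""
    z_t = 0.5 * z_stream1 + 0.5 * z_stream2  # S4.1: weighted average
    p = softmax(z_t)                         # S4.2: probabilities over V
    return vocab[int(np.argmax(p))]          # S4.3: most probable word

vocab = ['a', 'man', 'is', 'running']
z1 = np.array([0.1, 2.0, 0.3, 0.2])
z2 = np.array([0.2, 1.5, 0.1, 0.4])
assert decode_step(z1, z2, vocab) == 'man'
```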
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710111507.8A CN106934352A (en) | 2017-02-28 | 2017-02-28 | A kind of video presentation method based on two-way fractal net work and LSTM |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106934352A true CN106934352A (en) | 2017-07-07 |
Family
ID=59424160
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710111507.8A Pending CN106934352A (en) | 2017-02-28 | 2017-02-28 | A kind of video presentation method based on two-way fractal net work and LSTM |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106934352A (en) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105279495A (en) * | 2015-10-23 | 2016-01-27 | 天津大学 | Video description method based on deep learning and text summarization |
CN106407649A (en) * | 2016-08-26 | 2017-02-15 | 中国矿业大学(北京) | Onset time automatic picking method of microseismic signal on the basis of time-recursive neural network |
Non-Patent Citations (4)
Title |
---|
GUSTAV LARSSON ET AL.: "FractalNet: Ultra-Deep Neural Networks without Residuals", 《arXiv:1605.07648v2》 *
JOE YUE-HEI NG ET AL.: "Beyond Short Snippets: Deep Networks for Video Classification", 《IEEE》 *
KAREN SIMONYAN ET AL.: "Two-Stream Convolutional Networks for Action Recognition in Videos", 《arXiv:1406.2199v2》 *
SUBHASHINI VENUGOPALAN ET AL.: "Improving LSTM-based Video Description with Linguistic Knowledge Mined from Text", 《arXiv:1604.01729v1》 *
Cited By (29)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109460812A (en) * | 2017-09-06 | 2019-03-12 | 富士通株式会社 | Average information analytical equipment, the optimization device, feature visualization device of neural network |
CN110019952B (en) * | 2017-09-30 | 2023-04-18 | 华为技术有限公司 | Video description method, system and device |
CN110019952A (en) * | 2017-09-30 | 2019-07-16 | 华为技术有限公司 | Video presentation method, system and device |
CN107644519A (en) * | 2017-10-09 | 2018-01-30 | 中电科新型智慧城市研究院有限公司 | A kind of intelligent alarm method and system based on video human Activity recognition |
CN107909014A (en) * | 2017-10-31 | 2018-04-13 | 天津大学 | A kind of video understanding method based on deep learning |
CN107909115A (en) * | 2017-12-04 | 2018-04-13 | 上海师范大学 | A kind of image Chinese subtitle generation method |
CN108235116A (en) * | 2017-12-27 | 2018-06-29 | 北京市商汤科技开发有限公司 | Feature propagation method and device, electronic equipment, program and medium |
CN108235116B (en) * | 2017-12-27 | 2020-06-16 | 北京市商汤科技开发有限公司 | Feature propagation method and apparatus, electronic device, and medium |
CN110008789A (en) * | 2018-01-05 | 2019-07-12 | ***通信有限公司研究院 | Multiclass object detection and knowledge method for distinguishing, equipment and computer readable storage medium |
CN108198202A (en) * | 2018-01-23 | 2018-06-22 | 北京易智能科技有限公司 | A kind of video content detection method based on light stream and neural network |
CN108470212A (en) * | 2018-01-31 | 2018-08-31 | 江苏大学 | A kind of efficient LSTM design methods that can utilize incident duration |
CN108470212B (en) * | 2018-01-31 | 2020-02-21 | 江苏大学 | Efficient LSTM design method capable of utilizing event duration |
CN108536735A (en) * | 2018-03-05 | 2018-09-14 | 中国科学院自动化研究所 | Multi-modal lexical representation method and system based on multichannel self-encoding encoder |
CN108536735B (en) * | 2018-03-05 | 2020-12-15 | 中国科学院自动化研究所 | Multi-mode vocabulary representation method and system based on multi-channel self-encoder |
CN110475129A (en) * | 2018-03-05 | 2019-11-19 | 腾讯科技(深圳)有限公司 | Method for processing video frequency, medium and server |
CN108228915A (en) * | 2018-03-29 | 2018-06-29 | 华南理工大学 | A kind of video retrieval method based on deep learning |
CN109284682A (en) * | 2018-08-21 | 2019-01-29 | 南京邮电大学 | A kind of gesture identification method and system based on STT-LSTM network |
CN109522451B (en) * | 2018-12-13 | 2024-02-27 | 连尚(新昌)网络科技有限公司 | Repeated video detection method and device |
CN109522451A (en) * | 2018-12-13 | 2019-03-26 | 连尚(新昌)网络科技有限公司 | Repeat video detecting method and device |
CN109785336A (en) * | 2018-12-18 | 2019-05-21 | 深圳先进技术研究院 | Image partition method and device based on multipath convolutional neural networks model |
CN109785336B (en) * | 2018-12-18 | 2020-11-27 | 深圳先进技术研究院 | Image segmentation method and device based on multipath convolutional neural network model |
CN109753897A (en) * | 2018-12-21 | 2019-05-14 | 西北工业大学 | Based on memory unit reinforcing-time-series dynamics study Activity recognition method |
CN109753897B (en) * | 2018-12-21 | 2022-05-27 | 西北工业大学 | Behavior recognition method based on memory cell reinforcement-time sequence dynamic learning |
CN110084259B (en) * | 2019-01-10 | 2022-09-20 | 谢飞 | Facial paralysis grading comprehensive evaluation system combining facial texture and optical flow characteristics |
CN110084259A (en) * | 2019-01-10 | 2019-08-02 | 谢飞 | A kind of facial paralysis hierarchical synthesis assessment system of combination face texture and Optical-flow Feature |
CN111767765A (en) * | 2019-04-01 | 2020-10-13 | Oppo广东移动通信有限公司 | Video processing method and device, storage medium and electronic equipment |
CN110197195B (en) * | 2019-04-15 | 2022-12-23 | 深圳大学 | Novel deep network system and method for behavior recognition |
CN110197195A (en) * | 2019-04-15 | 2019-09-03 | 深圳大学 | A kind of novel deep layer network system and method towards Activity recognition |
CN110531163A (en) * | 2019-04-18 | 2019-12-03 | 中国人民解放军国防科技大学 | Bus capacitance state monitoring method for suspension chopper of maglev train |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106934352A (en) | A kind of video presentation method based on two-way fractal net work and LSTM | |
CN111985245B (en) | Relationship extraction method and system based on attention cycle gating graph convolution network | |
CN113011499B (en) | Hyperspectral remote sensing image classification method based on double-attention machine system | |
WO2021043193A1 (en) | Neural network structure search method and image processing method and device | |
CN107766324A (en) | A kind of text coherence analysis method based on deep neural network | |
Lei et al. | Shallow convolutional neural network for image classification | |
CN107679580A (en) | A kind of isomery shift image feeling polarities analysis method based on the potential association of multi-modal depth | |
CN104850890B (en) | Instance-based learning and the convolutional neural networks parameter regulation means of Sadowsky distributions | |
CN107562784A (en) | Short text classification method based on ResLCNN models | |
CN112784964A (en) | Image classification method based on bridging knowledge distillation convolution neural network | |
CN110473592B (en) | Multi-view human synthetic lethal gene prediction method | |
CN106886543A (en) | The knowledge mapping of binding entity description represents learning method and system | |
CN109817276A (en) | A kind of secondary protein structure prediction method based on deep neural network | |
Ruiz et al. | Gated graph convolutional recurrent neural networks | |
CN109740655B (en) | Article scoring prediction method based on matrix decomposition and neural collaborative filtering | |
CN105787557A (en) | Design method of deep nerve network structure for computer intelligent identification | |
CN106570522A (en) | Object recognition model establishment method and object recognition method | |
Irfan et al. | A novel lifelong learning model based on cross domain knowledge extraction and transfer to classify underwater images | |
Feng et al. | One-dimensional VGGNet for high-dimensional data | |
CN111460818A (en) | Web page text classification method based on enhanced capsule network and storage medium | |
CN112884045B (en) | Classification method of random edge deletion embedded model based on multiple visual angles | |
CN106991049A (en) | A kind of Software Defects Predict Methods and forecasting system | |
CN114077659A (en) | Knowledge graph question-answering method and system based on neighbor interaction network | |
CN113887328A (en) | Method for extracting space-time characteristics of photonic crystal space transmission spectrum in parallel by ECA-CNN fusion dual-channel RNN | |
Zhao et al. | Building damage evaluation from satellite imagery using deep learning |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20170707 |