CN106934352A - A video description method based on a two-way fractal network and LSTM - Google Patents

A video description method based on a two-way fractal network and LSTM

Info

Publication number
CN106934352A
Authority
CN
China
Prior art keywords
video
lstm
network
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710111507.8A
Other languages
Chinese (zh)
Inventor
李楚怡
袁东芝
余卫宇
胡丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201710111507.8A priority Critical patent/CN106934352A/en
Publication of CN106934352A publication Critical patent/CN106934352A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video description method based on a two-way fractal network and LSTM. The method first samples key frames from the video to be described and extracts the optical-flow features between each pair of adjacent frames of the original video; two fractal networks then learn high-level feature representations of the video frames and of the optical-flow features, respectively; these representations are fed into two separate recurrent neural network models based on LSTM units; finally, the outputs of the two independent models at each time step are averaged with weights, yielding a descriptive sentence for the video. The invention exploits both the original frames and the optical flow of the video to be described: the added optical-flow features compensate for the dynamic information that frame sampling inevitably loses, so the variation of the video along both the spatial and the temporal dimension is taken into account. Moreover, a novel fractal network turns low-level features into abstract visual feature representations, so that the people, objects, actions and spatial relationships involved in the video, and the connections among them, can be analyzed and mined more accurately.

Description

A video description method based on a two-way fractal network and LSTM
Technical field
The invention belongs to the technical fields of video description and deep learning, and relates in particular to a video description method based on a two-way fractal network and LSTM.
Background technology
With the progress of science and technology and the development of society, camera terminals of all kinds, smartphones in particular, have become ubiquitous, and the price of hardware storage keeps falling, so multimedia information streams are growing exponentially. Faced with this flood of video, how to analyze, recognize and understand massive video content automatically and efficiently, with as little manual intervention as possible, and then describe it at the semantic level, has become a hot topic in image processing and computer vision research. For most people, describing a short video in language after watching it is a very simple matter. For a machine, however, extracting the pixel information of every frame of the video, analyzing and processing it, and then generating a natural-language description is a challenging task.
A machine that can describe video efficiently and automatically has wide applications in computer vision fields such as video retrieval, human-computer interaction and traffic security, which will further advance research on the semantic description of video.
Summary of the invention
The primary object of the present invention is to overcome the shortcomings and deficiencies of the prior art by providing a video description method based on a two-way fractal network and LSTM.
To achieve the above object, the present invention adopts the following technical scheme:
A video description method based on a two-way fractal network and LSTM, characterised in that key frames are first sampled from the video to be described and the optical-flow features between adjacent frames of the original video are extracted; two fractal networks then learn high-level feature representations of the key frames and of the optical-flow features, respectively; these are fed into two recurrent neural network models based on LSTM units; finally, the outputs of the two independent recurrent neural network models at each time step are averaged with weights to obtain a descriptive sentence for the video. The method specifically comprises the following steps:
S1: sample key frames from the video to be described and extract the optical-flow features between adjacent frames of the original video;
S2: learn high-level feature representations of the video frames and of the optical-flow features with two separate fractal networks;
S3: feed the high-level feature vectors obtained in the previous step into two recurrent neural networks based on LSTM units, one per stream;
S4: average the outputs of the two independent models at each time step with weights to obtain the descriptive sentence corresponding to the video.
Preferably, extracting the optical-flow features of the video to be described in step S1 consists of:
S1.1: computing, for every pair of adjacent frames, the optical-flow values in the x direction and the y direction, and normalizing them to the pixel range [0, 255];
S1.2: computing the amplitude of the optical flow and combining it with the optical-flow values obtained in the previous step into a single optical-flow image, as in the sketch below.
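As a concrete illustration of steps S1.1 and S1.2, the following minimal Python sketch builds one optical-flow image from a pair of grayscale frames. The patent does not fix a particular flow algorithm, so OpenCV's dense Farneback flow and a per-channel min-max normalization are assumed here purely for illustration:

```python
import cv2
import numpy as np

def flow_image(prev_gray, curr_gray):
    """Build a 3-channel optical-flow image for one adjacent frame pair:
    channels are the x-flow, the y-flow and the flow amplitude, each
    mapped to the [0, 255] pixel range."""
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, curr_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    fx, fy = flow[..., 0], flow[..., 1]
    mag = np.sqrt(fx ** 2 + fy ** 2)        # amplitude of the flow (S1.2)
    def to_u8(a):                           # normalize one channel to [0, 255] (S1.1)
        a = a - a.min()
        return (255.0 * a / (a.max() + 1e-8)).astype(np.uint8)
    return np.dstack([to_u8(fx), to_u8(fy), to_u8(mag)])
```

Applying flow_image to every adjacent frame pair of the original video yields the time-ordered sequence of flow images consumed by the temporal network in step S2.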
Preferably, obtaining the high-level feature representations of the key frames and optical-flow features in step S2 consists of:
S2.1: feeding the key frames of the video obtained in step S1, in order of their time points, into a first fractal network that handles spatial relationships, the nonlinear mapping of the network generating the corresponding visual feature vectors in sequence;
S2.2: feeding the optical-flow images obtained in step S1, in order of their time points, into a second fractal network that handles temporal relationships, the nonlinear mapping of the network generating the corresponding motion feature vectors in sequence.
Preferably, in steps S2.1 and S2.2 a very deep network is generated by the repeated application of a single expansion rule, and its topological layout is a truncated fractal. The network contains interacting subpaths of different lengths but no pass-through connections. At the same time, to make it possible to extract high-performance fixed-depth subnetworks, a path-dropping method is employed to regularize the co-adaptation of the paths inside the fractal architecture. For a fractal network, the simplicity of training matches the simplicity of the design: a single loss function attached to the final layer is enough to drive the internal behaviour to mimic deep supervision. The fractal network used is a deep convolutional neural network based on a fractal structure.
Preferably, the very deep network generated in steps S2.1 and S2.2 by the repeated application of a single expansion rule, whose topological layout is a truncated fractal, is specified as follows:
The base case f_1(z) contains a single layer of a chosen type between input and output. Let C denote the index of the truncated fractal f_C(·); f_C(·) defines the network architecture, its connections and its layer types. The base case, a network containing a single convolutional layer, is written as formula (1-1):
f_1(z) = conv(z)   (1-1)
The next fractal is defined recursively by formula (1-2):
f_{C+1}(z) = [f_C ∘ f_C](z) ⊕ conv(z)   (1-2)
In formula (1-2), ∘ denotes composition and ⊕ denotes the join operation; C corresponds to the number of columns, in other words the width of the network f_C(·). Depth, defined as the number of conv layers on the longest path from input to output, is proportional to 2^{C-1}. Convolutional networks used for classification usually intersperse pooling layers; to the same end, f_C(·) is used as a building block and stacked B times with subsequent pooling layers, giving a total depth of B·2^{C-1}. The join operation ⊕ merges two feature blocks into one. A feature block is the result of one conv layer: a tensor holding activations for a fixed number of channels over a spatial region, the number of channels corresponding to the number of filters of the preceding conv layer. When the fractal is expanded, adjacent joins are merged into a single join layer, which merges all of its input feature blocks into a single output block.
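The expansion rule (1-1)/(1-2) can be made concrete with a short recursive module. The sketch below is an assumption-laden illustration in PyTorch, not the patented network itself: channel counts are kept constant, the join is realized as an element-wise mean of the incoming feature blocks (one of the joins used in the FractalNet paper), and batch normalization with ReLU stands in for unspecified layer details:

```python
import torch
import torch.nn as nn

class Fractal(nn.Module):
    """Fractal block f_C built by the expansion rule:
    f_1(z) = conv(z);  f_{C+1}(z) = [f_C o f_C](z) join conv(z)."""
    def __init__(self, channels, C):
        super().__init__()
        self.conv = nn.Sequential(            # the conv unit of the base case (1-1)
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True))
        # the deeper column: two copies of f_{C-1} composed in sequence
        self.sub = (nn.Sequential(Fractal(channels, C - 1), Fractal(channels, C - 1))
                    if C > 1 else None)

    def forward(self, z):
        if self.sub is None:                  # base case f_1(z) = conv(z)
            return self.conv(z)
        # join: merge the two paths' feature blocks, here by element-wise mean
        return torch.stack([self.conv(z), self.sub(z)], dim=0).mean(dim=0)
```

The longest path through Fractal(channels, C) traverses 2^{C-1} conv units, matching the depth stated above; stacking B such blocks with pooling between them gives the total depth B·2^{C-1}.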
Preferably, the rule for regularizing the co-adaptation of paths inside the fractal architecture with the path-dropping method in steps S2.1 and S2.2 is as follows. Because the fractal network contains additional large-scale structure, a coarse-grained regularization strategy similar to dropout and drop-connect is used: path dropping forbids the co-adaptation of parallel paths by randomly discarding operands of the join layers. This effectively prevents the over-fitting behaviour that may arise when the network uses one path as an anchor and another path as a correction. Two sampling policies are used (a sketch of both follows the list):
Local: a join layer drops each input with fixed probability, but is guaranteed to keep at least one input;
Global: a single column is selected for the whole network, restricting it to that one path, which encourages each column to become a strong predictor on its own.
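The two sampling policies can be sketched as a join layer that drops inputs at training time. This is a hedged illustration under stated assumptions: the drop probability of 0.15 is not fixed by the patent, and whether a forward pass uses local or global sampling is decided here by whether the caller passes a globally sampled column index:

```python
import random
import torch
import torch.nn as nn

class DropPathJoin(nn.Module):
    """Join layer with path dropping. Local: each input is dropped with a
    fixed probability, but at least one always survives. Global: a single
    column index, sampled once for the whole network, decides which input
    survives at every join."""
    def __init__(self, drop_prob=0.15):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, paths, global_col=None):
        # paths: list of feature tensors entering this join, one per column
        if not self.training:                    # no dropping at inference time
            return torch.stack(paths).mean(dim=0)
        if global_col is not None:               # global sampling: keep one column
            keep = [paths[global_col % len(paths)]]
        else:                                    # local sampling
            keep = [p for p in paths if random.random() >= self.drop_prob]
            if not keep:                         # guarantee at least one input
                keep = [random.choice(paths)]
        return torch.stack(keep).mean(dim=0)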
Preferably, feeding the high-level feature vectors into the two recurrent neural network models based on LSTM units in step S3 is specified as follows:
Each recurrent neural network based on LSTM units contains two layers of LSTM units, the first and the second layer each containing 1000 neurons. The forward propagation of each LSTM unit can be written as (a step-by-step sketch follows the equations):
i_t = σ(W_{xi} x_t + W_{hi} h_{t-1} + b_i)   (1-3)
f_t = σ(W_{xf} x_t + W_{hf} h_{t-1} + b_f)   (1-4)
o_t = σ(W_{xo} x_t + W_{ho} h_{t-1} + b_o)   (1-5)
g_t = tanh(W_{xg} x_t + W_{hg} h_{t-1} + b_g)   (1-6)
c_t = f_t * c_{t-1} + i_t * g_t   (1-7)
h_t = o_t * tanh(c_t)   (1-8)
Here σ(x) = (1 + e^{-x})^{-1} is the sigmoid nonlinear activation function and tanh(x) = (e^x - e^{-x}) / (e^x + e^{-x}) is the hyperbolic-tangent nonlinear activation function; i_t, f_t, o_t and c_t denote the states at time t of the input gate, the forget gate, the output gate and the memory cell, g_t being the candidate cell gate. For the gates, W_{xi}, W_{xf}, W_{xo} and W_{xg} are the weight matrices applied to the input x_t for the input, forget, output and cell gates respectively; W_{hi}, W_{hf}, W_{ho} and W_{hg} are the weight matrices applied to the hidden-layer variable h_{t-1} of time t-1; and b_i, b_f, b_o and b_g are the corresponding bias vectors.
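A single forward step of one LSTM unit, written out directly from equations (1-3) to (1-8), looks as follows. This NumPy sketch stores the eight weight matrices and four bias vectors in dictionaries keyed by the subscripts used above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One forward step of an LSTM unit, following equations (1-3)-(1-8).
    W maps "xi", "hi", ... to weight matrices; b maps "i", "f", "o", "g"
    to bias vectors."""
    i_t = sigmoid(W["xi"] @ x_t + W["hi"] @ h_prev + b["i"])   # input gate   (1-3)
    f_t = sigmoid(W["xf"] @ x_t + W["hf"] @ h_prev + b["f"])   # forget gate  (1-4)
    o_t = sigmoid(W["xo"] @ x_t + W["ho"] @ h_prev + b["o"])   # output gate  (1-5)
    g_t = np.tanh(W["xg"] @ x_t + W["hg"] @ h_prev + b["g"])   # cell gate    (1-6)
    c_t = f_t * c_prev + i_t * g_t                             # cell state   (1-7)
    h_t = o_t * np.tanh(c_t)                                   # hidden state (1-8)
    return h_t, c_t
```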
Preferably, the structure of the neural network in step S3 is as follows:
Built as a recurrent neural network of two stacked layers of LSTM units, the network encodes and decodes the input feature vectors, realizing the conversion to natural-language text. The first layer of LSTM neurons encodes the visual feature vector input at each time step, and the hidden-layer representation output at each step serves as the input of the second layer of LSTM neurons. Once the feature vectors of all video frames have been fed into the first LSTM layer, the second layer receives an indicator token and starts the decoding task. During decoding the network loses information, so the goal of training and learning the model parameters is to maximize the log-likelihood of the whole predicted output sentence, given the hidden representation and the prediction output at the previous step. For a model with parameters θ and output sentence Y = (y_1, y_2, …, y_m), the parameter optimization target can be written as:
θ* = argmax_θ Σ_{(h,Y)} log p(Y | h; θ)   (1-9)
Here θ are the parameters, Y is the predicted output sentence and h is the hidden-layer representation. The objective is optimized with stochastic gradient descent, and the error of the whole network is accumulated and propagated along the time dimension by the back-propagation algorithm.
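For one training sentence, the objective (1-9) reduces to summing the log-probabilities the model assigns to the ground-truth words, which is the negation of the usual cross-entropy loss; a minimal sketch:

```python
import numpy as np

def sentence_log_likelihood(word_probs, target_ids):
    """log p(Y | h; theta) for one sentence, per (1-9): the sum over time
    steps of the log-probability assigned to each ground-truth word.
    word_probs: (T, |V|) array of per-step softmax outputs;
    target_ids: length-T array of ground-truth word indices."""
    picked = word_probs[np.arange(len(target_ids)), target_ids]
    return float(np.sum(np.log(picked + 1e-12)))   # epsilon guards against log(0)

# Stochastic gradient descent maximizes the sum of this quantity over the
# training pairs (h, Y), i.e. minimizes its negation as a cross-entropy
# loss, with gradients accumulated through time by BPTT.
```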
Preferably, obtaining the descriptive sentence of the video in step S4 by weighted-averaging the outputs of the two independent neural network models at each time step consists of:
S4.1: taking a weighted average of the outputs of the second-layer LSTM neurons of the two independent recurrent neural network models at each time step;
S4.2: computing the occurrence probability of every word in the vocabulary V with the softmax function, expressed as:
P(y | z_t) = exp(W_y z_t) / Σ_{y′∈V} exp(W_{y′} z_t)   (1-10)
where y is the predicted word, z_t is the output of the recurrent neural network at time t, and W_y is the weight of that word in the vocabulary;
S4.3: at each decoding step, taking the word with the highest probability in the softmax output, so as to obtain the corresponding video description sentence.
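One decoding step of S4.1 to S4.3 can be written compactly as below. The equal stream weights (alpha = 0.5) are an assumption; the patent only states that the two streams' outputs are weighted-averaged:

```python
import numpy as np

def fused_word(z_rgb, z_flow, W, alpha=0.5):
    """One decoding step: weighted-average the two streams' second-layer LSTM
    outputs (S4.1), apply the softmax of (1-10) over the vocabulary (S4.2),
    and take the most probable word (S4.3). W has one row W_y per word."""
    z_t = alpha * z_rgb + (1.0 - alpha) * z_flow   # S4.1: weighted average
    scores = W @ z_t                               # per-word scores W_y z_t
    scores -= scores.max()                         # stabilize the exponentials
    probs = np.exp(scores) / np.exp(scores).sum()  # S4.2: P(y | z_t), eq. (1-10)
    return int(np.argmax(probs)), probs            # S4.3: greedy word choice
```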
Compared with the prior art, the present invention has the following advantages and beneficial effects:
1. The added optical-flow features compensate for the dynamic information that frame sampling inevitably loses, taking into account the variation of the video along both the spatial and the temporal dimension.
2. The video description method based on a two-way fractal network and LSTM provided by the invention can process any input video and automatically generate a description of its content end to end, and can be applied in fields such as video retrieval, video surveillance and human-computer interaction.
3. The invention uses a novel fractal network to turn low-level features into abstract visual feature representations, so that the people, objects, actions and spatial relationships involved in the video, and the connections among them, can be analyzed and mined more accurately.
Brief description of the drawings
Fig. 1 is the flow diagram of the video description method based on a two-way fractal network and LSTM provided by the invention;
Fig. 2 is a schematic diagram of the fractal network used in an embodiment of the invention;
Fig. 3 is a schematic diagram of the recurrent neural network based on LSTM units used in an embodiment of the invention.
Specific embodiment
The present invention is described in further detail below with reference to an embodiment and the accompanying drawings, but the embodiments of the invention are not limited thereto.
Key frames are sampled from the video to be described and the optical-flow features between adjacent frames of the original video are extracted; two fractal networks then learn high-level feature representations of the key frames and of the optical-flow features, respectively; these are fed into two recurrent neural network models based on LSTM units; finally, the outputs of the two independent recurrent neural network models at each time step are averaged with weights, yielding a descriptive sentence for the video.
Fig. 1 is the overall flow diagram of the invention, comprising the following steps:
(1) Sample key frames from the video to be described and extract the optical-flow features between adjacent frames of the original video. Concretely, extracting the optical-flow features of the video to be described consists of:
1. computing, for every pair of adjacent frames, the optical-flow values in the x direction and the y direction, and normalizing them to the pixel range [0, 255];
2. computing the amplitude of the optical flow and combining it with the optical-flow values obtained in the previous step into a single optical-flow image.
(2) Learn the high-level feature representations of the video frames and of the optical-flow features with two separate fractal networks. The sampled frames of the video obtained in the previous step are fed, in order of their time points, into a first fractal network that handles spatial relationships, whose nonlinear mapping generates the corresponding visual feature vectors in sequence; the optical-flow images are fed, in order of their time points, into a second fractal network that handles temporal relationships, whose nonlinear mapping generates the corresponding motion feature vectors in sequence, as in the sketch below.
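A minimal sketch of this wiring, assuming two already-constructed fractal networks spatial_net and temporal_net (hypothetical names) that each map an image tensor to a feature vector:

```python
import torch

def extract_feature_sequences(frames, flows, spatial_net, temporal_net):
    """Feed the time-ordered key frames through the spatial fractal network
    and the time-ordered flow images through the temporal fractal network,
    yielding one visual and one motion feature vector per time point."""
    visual_seq, motion_seq = [], []
    with torch.no_grad():
        for frame in frames:                    # frames: list of CxHxW tensors
            visual_seq.append(spatial_net(frame.unsqueeze(0)).flatten())
        for flow in flows:                      # flows: list of CxHxW flow images
            motion_seq.append(temporal_net(flow.unsqueeze(0)).flatten())
    return torch.stack(visual_seq), torch.stack(motion_seq)
```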
The fractal network mainly introduces a layout strategy based on self-similarity into the macro-architecture of the neural network: a very deep network is generated by the repeated application of a single expansion rule, and its topological layout is a truncated fractal. The network contains interacting subpaths of different lengths but no pass-through connections. At the same time, to make it possible to extract high-performance fixed-depth subnetworks, a path-dropping method is employed to regularize the co-adaptation of the paths inside the fractal architecture. For a fractal network, the simplicity of training matches the simplicity of the design: a single loss function attached to the final layer is enough to drive the internal behaviour to mimic deep supervision. The fractal network used in the present invention is a deep convolutional neural network based on a fractal structure.
As in the schematic diagram of the fractal structure given in Fig. 2, the base case f_1(z) contains a single layer of a chosen type between input and output. Let C denote the index of the truncated fractal f_C(·); f_C(·) defines the network architecture, its connections and its layer types. The base case, a network containing a single convolutional layer, is written as formula (1-1):
f_1(z) = conv(z)   (1-1)
The next fractal structure is then defined recursively by formula (1-2):
f_{C+1}(z) = [f_C ∘ f_C](z) ⊕ conv(z)   (1-2)
In formula (1-2), ∘ denotes composition and ⊕ denotes the join operation; C corresponds to the number of columns, in other words the width of the network f_C(·). Depth, defined as the number of conv layers on the longest path from input to output, is proportional to 2^{C-1}. Convolutional networks used for classification usually intersperse pooling layers; to the same end, f_C(·) is used as a building block and stacked B times with subsequent pooling layers, giving a total depth of B·2^{C-1}. The join operation ⊕ merges two feature blocks into one; a feature block is the result of one convolutional layer, a tensor holding activations for a fixed number of channels over a spatial region, the number of channels corresponding to the number of filters of the preceding convolutional layer. When the fractal is expanded, adjacent joins are merged into a single join layer; as shown on the right of Fig. 2, this join layer spans multiple columns and merges all of its input feature blocks into a single output block.
Because the fractal network contains additional large-scale structure, a coarse-grained regularization strategy similar to dropout and drop-connect is proposed: path dropping forbids the co-adaptation of parallel paths by randomly discarding operands of the join layers. This effectively prevents the over-fitting behaviour that may arise when the network uses one path as an anchor and another path as a correction. Two sampling policies are mainly used here:
Local: a join layer drops each input with fixed probability, but is guaranteed to keep at least one input;
Global: a single column is selected for the whole network, restricting it to that one path, which encourages each column to become a strong predictor on its own.
(3) Feed the high-level feature vectors obtained in the previous step into two recurrent neural networks based on LSTM units, one per stream. Each recurrent neural network based on LSTM units contains two layers of LSTM units, the first and the second layer each containing 1000 neurons. The forward propagation of each LSTM unit can be written as:
i_t = σ(W_{xi} x_t + W_{hi} h_{t-1} + b_i)   (1-3)
f_t = σ(W_{xf} x_t + W_{hf} h_{t-1} + b_f)   (1-4)
o_t = σ(W_{xo} x_t + W_{ho} h_{t-1} + b_o)   (1-5)
g_t = tanh(W_{xg} x_t + W_{hg} h_{t-1} + b_g)   (1-6)
c_t = f_t * c_{t-1} + i_t * g_t   (1-7)
h_t = o_t * tanh(c_t)   (1-8)
Here σ(x) = (1 + e^{-x})^{-1} is the sigmoid nonlinear activation function and tanh(x) = (e^x - e^{-x}) / (e^x + e^{-x}) is the hyperbolic-tangent nonlinear activation function; i_t, f_t, o_t and c_t denote the states at time t of the input gate, the forget gate, the output gate and the memory cell, g_t being the candidate cell gate. W_{xi}, W_{xf}, W_{xo} and W_{xg} are the weight matrices applied to the input for the input, forget, output and cell gates respectively; W_{hi}, W_{hf}, W_{ho} and W_{hg} are the weight matrices applied to the hidden-layer variable h_{t-1} of time t-1; and b_i, b_f, b_o and b_g are the corresponding bias vectors.
As in the structure diagram of the recurrent neural network based on two layers of LSTM units given in Fig. 3, we use this two-layer stack of LSTM units to encode and decode the input feature vectors, thereby realizing the conversion to natural-language text. The first layer of LSTM neurons encodes the visual feature vector input at each time step, and the hidden-layer representation output at each step serves as the input of the second layer of LSTM neurons. Once the feature vectors of all video frames have been fed into the first LSTM layer, the second layer receives an indicator token and starts the decoding task. During decoding the network loses information, so the goal of training and learning the model parameters is to maximize the log-likelihood of the whole predicted output sentence, given the hidden representation and the prediction output at the previous step. For a model with parameters θ and output sentence Y = (y_1, y_2, …, y_m), the parameter optimization target can be written as:
θ* = argmax_θ Σ_{(h,Y)} log p(Y | h; θ)   (1-9)
Here θ are the parameters, Y is the predicted output sentence and h is the hidden-layer representation. The objective is optimized with stochastic gradient descent, and the error of the whole network is accumulated and propagated along the time dimension by the back-propagation algorithm.
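The encode-then-decode behaviour of the two stacked LSTM layers can be sketched as follows. This is an illustrative greedy decoder under stated assumptions: the begin-of-sentence indicator is word index 0, word embeddings share the hidden size, and during encoding the word slot of the second layer is zero-padded, an S2VT-style convention not spelled out in the patent:

```python
import torch
import torch.nn as nn

class TwoLayerLSTMCaptioner(nn.Module):
    """Two stacked LSTM layers of 1000 neurons each: layer 1 consumes one
    feature vector per time step; its hidden state, concatenated with the
    previous word's embedding, feeds layer 2, which emits the sentence."""
    def __init__(self, feat_dim, vocab_size, hidden=1000):
        super().__init__()
        self.lstm1 = nn.LSTMCell(feat_dim, hidden)
        self.lstm2 = nn.LSTMCell(hidden + hidden, hidden)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.out = nn.Linear(hidden, vocab_size)
        self.feat_dim, self.hidden = feat_dim, hidden

    def forward(self, feats, max_words=20):
        zeros = lambda n: feats.new_zeros(1, n)
        h1, c1 = zeros(self.hidden), zeros(self.hidden)
        h2, c2 = zeros(self.hidden), zeros(self.hidden)
        for x in feats:                              # encoding: frame features in
            h1, c1 = self.lstm1(x.unsqueeze(0), (h1, c1))
            h2, c2 = self.lstm2(torch.cat([h1, zeros(self.hidden)], dim=1), (h2, c2))
        word = torch.zeros(1, dtype=torch.long)      # <BOS> indicator starts decoding
        sentence = []
        for _ in range(max_words):                   # decoding: no more visual input
            h1, c1 = self.lstm1(zeros(self.feat_dim), (h1, c1))
            h2, c2 = self.lstm2(torch.cat([h1, self.embed(word)], dim=1), (h2, c2))
            word = self.out(h2).argmax(dim=1)        # greedy word choice per step
            sentence.append(int(word))
        return sentence
```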
(4) output valve at two each moment of independent model is weighted mean deviation and obtains the corresponding description language of video , concrete operations are:
1st, the output valve of two second layer LSTM neurons at each moment of independent model is weighted averagely;
2nd, the probability of occurrence of each word in vocabulary V is calculated using softmax functions, is expressed as:
Wherein, y represents the word of prediction, ztRepresent output valve of the recurrent neural network in t, WyRepresent that the word exists Weighted value in vocabulary.
3rd, in the decoding stage at each moment, the word of maximum probability in softmax function-outputs is taken, so as to obtain right The video presentation sentence answered.
The above embodiment is a preferred implementation of the present invention, but the implementations of the invention are not limited by it; any other change, modification, substitution, combination or simplification made without departing from the spirit and principle of the invention shall be an equivalent replacement and is included within the scope of protection of the invention.

Claims (9)

1. A video description method based on a two-way fractal network and LSTM, characterised in that key frames are first sampled from the video to be described and the optical-flow features between adjacent frames of the original video are extracted; two fractal networks then learn and obtain high-level feature representations of the key frames and of the optical-flow features, respectively; these are fed into two recurrent neural network models based on LSTM units; finally, the outputs of the two independent recurrent neural network models at each time step are averaged with weights to obtain a descriptive sentence corresponding to the video; the method specifically comprises the following steps:
S1: sampling key frames from the video to be described and extracting the optical-flow features between adjacent frames of the original video;
S2: learning and obtaining high-level feature representations of the key frames and of the optical-flow features with two separate fractal networks, wherein each fractal network is generated by the repeated application of a single expansion rule;
S3: feeding the high-level feature vectors obtained in the previous step into two recurrent neural network models based on LSTM units, respectively;
S4: weighted-averaging the outputs of the two independent recurrent neural network models at each time step to obtain the descriptive sentence corresponding to the video.
2. The video description method based on a two-way fractal network and LSTM according to claim 1, characterised in that extracting the optical-flow features of the video to be described in step S1 is specifically:
S1.1: computing, for every pair of adjacent frames, the optical-flow values in the x direction and the y direction, and normalizing them to the pixel range [0, 255];
S1.2: computing the amplitude of the optical flow and combining it with the optical-flow values obtained in the previous step into a single optical-flow image.
3. The video description method based on a two-way fractal network and LSTM according to claim 1, characterised in that obtaining the high-level feature representations of the key frames and optical-flow features in step S2 specifically comprises:
S2.1: feeding the key frames of the video obtained in step S1, in order of their time points, into a first fractal network that handles spatial relationships, the nonlinear mapping of the network generating the corresponding visual feature vectors in sequence;
S2.2: feeding the optical-flow images obtained in step S1, in order of their time points, into a second fractal network that handles temporal relationships, the nonlinear mapping of the network generating the corresponding motion feature vectors in sequence.
4. The video description method based on a two-way fractal network and LSTM according to claim 3, characterised in that in steps S2.1 and S2.2 a very deep network is generated by the repeated application of a single expansion rule, its topological layout being a truncated fractal; the network contains interacting subpaths of different lengths but no pass-through connections; at the same time, to make it possible to extract high-performance fixed-depth subnetworks, a path-dropping method is employed to regularize the co-adaptation of the paths inside the fractal architecture; for a fractal network the simplicity of training matches the simplicity of the design, a single loss function attached to the final layer being enough to drive the internal behaviour to mimic deep supervision; the fractal network used is a deep convolutional neural network based on a fractal structure.
5. The video description method based on a two-way fractal network and LSTM according to claim 4, characterised in that the very deep network generated in steps S2.1 and S2.2 by the repeated application of a single expansion rule, whose topological layout is a truncated fractal, is specifically:
the base case f_1(z) contains a single layer of a chosen type between input and output; let C denote the index of the truncated fractal f_C(·); f_C(·) defines the network architecture, its connections and its layer types; the base case, a network containing a single convolutional layer, is written as formula (1-1):
f_1(z) = conv(z)   (1-1)
the next fractal is defined recursively by formula (1-2):
f_{C+1}(z) = [f_C ∘ f_C](z) ⊕ conv(z)   (1-2)
in formula (1-2), ∘ denotes composition and ⊕ denotes the join operation; C corresponds to the number of columns, in other words the width of the network f_C(·); depth, defined as the number of conv layers on the longest path from input to output, is proportional to 2^{C-1}; convolutional networks used for classification usually intersperse pooling layers; to the same end, f_C(·) is used as a building block and stacked B times with subsequent pooling layers, giving a total depth of B·2^{C-1}; the join operation ⊕ merges two feature blocks into one; a feature block is the result of one conv layer, a tensor holding activations for a fixed number of channels over a spatial region, the number of channels corresponding to the number of filters of the preceding conv layer; when the fractal is expanded, adjacent joins are merged into a single join layer, which merges all of its input feature blocks into a single output block.
6. The video description method based on a two-way fractal network and LSTM according to claim 4, characterised in that the rule for regularizing the co-adaptation of paths inside the fractal architecture with the path-dropping method in steps S2.1 and S2.2 is specifically: because the fractal network contains additional large-scale structure, a coarse-grained regularization strategy similar to dropout and drop-connect is used; path dropping forbids the co-adaptation of parallel paths by randomly discarding operands of the join layers, which effectively prevents the over-fitting behaviour that may arise when the network uses one path as an anchor and another path as a correction; two sampling policies are used:
local: a join layer drops each input with fixed probability, but is guaranteed to keep at least one input;
global: a single column is selected for the whole network, restricting it to that one path, which encourages each column to become a strong predictor on its own.
7. The video description method based on a two-way fractal network and LSTM according to claim 1, characterised in that feeding the high-level feature vectors into the two recurrent neural network models based on LSTM units in step S3 is specifically: each recurrent neural network based on LSTM units contains two layers of LSTM units, the first and the second layer each containing 1000 neurons, and the forward propagation of each LSTM unit can be written as:
i_t = σ(W_{xi} x_t + W_{hi} h_{t-1} + b_i)   (1-3)
f_t = σ(W_{xf} x_t + W_{hf} h_{t-1} + b_f)   (1-4)
o_t = σ(W_{xo} x_t + W_{ho} h_{t-1} + b_o)   (1-5)
g_t = tanh(W_{xg} x_t + W_{hg} h_{t-1} + b_g)   (1-6)
c_t = f_t * c_{t-1} + i_t * g_t   (1-7)
h_t = o_t * tanh(c_t)   (1-8)
where σ(x) = (1 + e^{-x})^{-1} is the sigmoid nonlinear activation function and tanh(x) = (e^x - e^{-x}) / (e^x + e^{-x}) is the hyperbolic-tangent nonlinear activation function; i_t, f_t, o_t and c_t denote the states at time t of the input gate, the forget gate, the output gate and the memory cell, g_t being the candidate cell gate; W_{xi}, W_{xf}, W_{xo} and W_{xg} are the weight matrices applied to the input for the input, forget, output and cell gates respectively; W_{hi}, W_{hf}, W_{ho} and W_{hg} are the weight matrices applied to the hidden-layer variable h_{t-1} of time t-1; and b_i, b_f, b_o and b_g are the corresponding bias vectors.
8. The video description method based on a two-way fractal network and LSTM according to claim 7, characterised in that the structure of the neural network in step S3 is:
a recurrent neural network of two layers of LSTM units realizes the conversion to natural-language text; the first layer of LSTM neurons encodes the visual feature vector input at each time step, the hidden-layer representation output at each step serving as the input of the second layer of LSTM neurons; after the feature vectors of all video frames have been fed into the first LSTM layer, the second layer receives an indicator token and starts the decoding task; during decoding the network loses information, so the goal of training and learning the model parameters is to maximize the log-likelihood of the whole predicted output sentence given the hidden representation and the prediction output at the previous step; for a model with parameters θ and output sentence Y = (y_1, y_2, …, y_m), the parameter optimization target can be written as:
θ* = argmax_θ Σ_{(h,Y)} log p(Y | h; θ)   (1-9)
where θ are the parameters, Y is the predicted output sentence and h is the hidden-layer representation; the objective is optimized with stochastic gradient descent, and the error of the whole network is accumulated and propagated along the time dimension by the back-propagation algorithm.
9. The video description method based on a two-way fractal network and LSTM according to claim 1, characterised in that in step S4 the outputs of the two independent neural network models at each time step are weighted-averaged to obtain the descriptive sentence corresponding to the video, specifically:
S4.1: taking a weighted average of the outputs of the second-layer LSTM neurons of the two independent recurrent neural network models at each time step;
S4.2: computing the occurrence probability of every word in the vocabulary V with the softmax function, expressed as:
P(y | z_t) = exp(W_y z_t) / Σ_{y′∈V} exp(W_{y′} z_t)   (1-10)
where y is the predicted word, z_t is the output of the recurrent neural network at time t, and W_y is the weight of that word in the vocabulary;
S4.3: at each decoding step, taking the word with the highest probability in the softmax output, so as to obtain the corresponding video description sentence.
CN201710111507.8A 2017-02-28 2017-02-28 A video description method based on a two-way fractal network and LSTM Pending CN106934352A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710111507.8A 2017-02-28 2017-02-28 A video description method based on a two-way fractal network and LSTM

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710111507.8A 2017-02-28 2017-02-28 A video description method based on a two-way fractal network and LSTM

Publications (1)

Publication Number Publication Date
CN106934352A true CN106934352A (en) 2017-07-07

Family

ID=59424160

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710111507.8A A video description method based on a two-way fractal network and LSTM 2017-02-28 2017-02-28 Pending

Country Status (1)

Country Link
CN (1) CN106934352A (en)

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107644519A (en) * 2017-10-09 2018-01-30 中电科新型智慧城市研究院有限公司 A kind of intelligent alarm method and system based on video human Activity recognition
CN107909115A (en) * 2017-12-04 2018-04-13 上海师范大学 A kind of image Chinese subtitle generation method
CN107909014A (en) * 2017-10-31 2018-04-13 天津大学 A kind of video understanding method based on deep learning
CN108198202A (en) * 2018-01-23 2018-06-22 北京易智能科技有限公司 A kind of video content detection method based on light stream and neural network
CN108228915A (en) * 2018-03-29 2018-06-29 华南理工大学 A kind of video retrieval method based on deep learning
CN108235116A (en) * 2017-12-27 2018-06-29 北京市商汤科技开发有限公司 Feature propagation method and device, electronic equipment, program and medium
CN108470212A (en) * 2018-01-31 2018-08-31 江苏大学 A kind of efficient LSTM design methods that can utilize incident duration
CN108536735A (en) * 2018-03-05 2018-09-14 中国科学院自动化研究所 Multi-modal lexical representation method and system based on multichannel self-encoding encoder
CN109284682A (en) * 2018-08-21 2019-01-29 南京邮电大学 A kind of gesture identification method and system based on STT-LSTM network
CN109460812A (en) * 2017-09-06 2019-03-12 富士通株式会社 Average information analytical equipment, the optimization device, feature visualization device of neural network
CN109522451A (en) * 2018-12-13 2019-03-26 连尚(新昌)网络科技有限公司 Repeat video detecting method and device
CN109753897A (en) * 2018-12-21 2019-05-14 西北工业大学 Based on memory unit reinforcing-time-series dynamics study Activity recognition method
CN109785336A (en) * 2018-12-18 2019-05-21 深圳先进技术研究院 Image partition method and device based on multipath convolutional neural networks model
CN110008789A (en) * 2018-01-05 2019-07-12 ***通信有限公司研究院 Multiclass object detection and knowledge method for distinguishing, equipment and computer readable storage medium
CN110019952A (en) * 2017-09-30 2019-07-16 华为技术有限公司 Video presentation method, system and device
CN110084259A (en) * 2019-01-10 2019-08-02 谢飞 A kind of facial paralysis hierarchical synthesis assessment system of combination face texture and Optical-flow Feature
CN110197195A (en) * 2019-04-15 2019-09-03 深圳大学 A kind of novel deep layer network system and method towards Activity recognition
CN110475129A (en) * 2018-03-05 2019-11-19 腾讯科技(深圳)有限公司 Method for processing video frequency, medium and server
CN110531163A (en) * 2019-04-18 2019-12-03 中国人民解放军国防科技大学 Bus capacitance state monitoring method for suspension chopper of maglev train
CN111767765A (en) * 2019-04-01 2020-10-13 Oppo广东移动通信有限公司 Video processing method and device, storage medium and electronic equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105279495A (en) * 2015-10-23 2016-01-27 天津大学 Video description method based on deep learning and text summarization
CN106407649A (en) * 2016-08-26 2017-02-15 中国矿业大学(北京) Onset time automatic picking method of microseismic signal on the basis of time-recursive neural network

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105279495A (en) * 2015-10-23 2016-01-27 天津大学 Video description method based on deep learning and text summarization
CN106407649A (en) * 2016-08-26 2017-02-15 中国矿业大学(北京) Onset time automatic picking method of microseismic signal on the basis of time-recursive neural network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
GUSTAV LARSSON ET AL.: "FractalNet: Ultra-Deep Neural Networks without Residuals", arXiv:1605.07648v2 *
JOE YUE-HEI NG ET AL.: "Beyond Short Snippets: Deep Networks for Video Classification", IEEE *
KAREN SIMONYAN ET AL.: "Two-Stream Convolutional Networks for Action Recognition in Videos", arXiv:1406.2199v2 *
SUBHASHINI VENUGOPALAN ET AL.: "Improving LSTM-based Video Description with Linguistic Knowledge Mined from Text", arXiv:1604.01729v1 *

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109460812A (en) * 2017-09-06 2019-03-12 富士通株式会社 Average information analytical equipment, the optimization device, feature visualization device of neural network
CN110019952B (en) * 2017-09-30 2023-04-18 华为技术有限公司 Video description method, system and device
CN110019952A (en) * 2017-09-30 2019-07-16 华为技术有限公司 Video presentation method, system and device
CN107644519A (en) * 2017-10-09 2018-01-30 中电科新型智慧城市研究院有限公司 A kind of intelligent alarm method and system based on video human Activity recognition
CN107909014A (en) * 2017-10-31 2018-04-13 天津大学 A kind of video understanding method based on deep learning
CN107909115A (en) * 2017-12-04 2018-04-13 上海师范大学 A kind of image Chinese subtitle generation method
CN108235116A (en) * 2017-12-27 2018-06-29 北京市商汤科技开发有限公司 Feature propagation method and device, electronic equipment, program and medium
CN108235116B (en) * 2017-12-27 2020-06-16 北京市商汤科技开发有限公司 Feature propagation method and apparatus, electronic device, and medium
CN110008789A (en) * 2018-01-05 2019-07-12 ***通信有限公司研究院 Multiclass object detection and knowledge method for distinguishing, equipment and computer readable storage medium
CN108198202A (en) * 2018-01-23 2018-06-22 北京易智能科技有限公司 A kind of video content detection method based on light stream and neural network
CN108470212A (en) * 2018-01-31 2018-08-31 江苏大学 A kind of efficient LSTM design methods that can utilize incident duration
CN108470212B (en) * 2018-01-31 2020-02-21 江苏大学 Efficient LSTM design method capable of utilizing event duration
CN108536735A (en) * 2018-03-05 2018-09-14 中国科学院自动化研究所 Multi-modal lexical representation method and system based on multichannel self-encoding encoder
CN108536735B (en) * 2018-03-05 2020-12-15 中国科学院自动化研究所 Multi-mode vocabulary representation method and system based on multi-channel self-encoder
CN110475129A (en) * 2018-03-05 2019-11-19 腾讯科技(深圳)有限公司 Method for processing video frequency, medium and server
CN108228915A (en) * 2018-03-29 2018-06-29 华南理工大学 A kind of video retrieval method based on deep learning
CN109284682A (en) * 2018-08-21 2019-01-29 南京邮电大学 A kind of gesture identification method and system based on STT-LSTM network
CN109522451B (en) * 2018-12-13 2024-02-27 连尚(新昌)网络科技有限公司 Repeated video detection method and device
CN109522451A (en) * 2018-12-13 2019-03-26 连尚(新昌)网络科技有限公司 Repeat video detecting method and device
CN109785336A (en) * 2018-12-18 2019-05-21 深圳先进技术研究院 Image partition method and device based on multipath convolutional neural networks model
CN109785336B (en) * 2018-12-18 2020-11-27 深圳先进技术研究院 Image segmentation method and device based on multipath convolutional neural network model
CN109753897A (en) * 2018-12-21 2019-05-14 西北工业大学 Based on memory unit reinforcing-time-series dynamics study Activity recognition method
CN109753897B (en) * 2018-12-21 2022-05-27 西北工业大学 Behavior recognition method based on memory cell reinforcement-time sequence dynamic learning
CN110084259B (en) * 2019-01-10 2022-09-20 谢飞 Facial paralysis grading comprehensive evaluation system combining facial texture and optical flow characteristics
CN110084259A (en) * 2019-01-10 2019-08-02 谢飞 A kind of facial paralysis hierarchical synthesis assessment system of combination face texture and Optical-flow Feature
CN111767765A (en) * 2019-04-01 2020-10-13 Oppo广东移动通信有限公司 Video processing method and device, storage medium and electronic equipment
CN110197195B (en) * 2019-04-15 2022-12-23 深圳大学 Novel deep network system and method for behavior recognition
CN110197195A (en) * 2019-04-15 2019-09-03 深圳大学 A kind of novel deep layer network system and method towards Activity recognition
CN110531163A (en) * 2019-04-18 2019-12-03 中国人民解放军国防科技大学 Bus capacitance state monitoring method for suspension chopper of maglev train

Similar Documents

Publication Publication Date Title
CN106934352A (en) A video description method based on a two-way fractal network and LSTM
CN111985245B (en) Relationship extraction method and system based on attention cycle gating graph convolution network
CN113011499B (en) Hyperspectral remote sensing image classification method based on double-attention machine system
WO2021043193A1 (en) Neural network structure search method and image processing method and device
CN107766324A (en) A kind of text coherence analysis method based on deep neural network
Lei et al. Shallow convolutional neural network for image classification
CN107679580A (en) A kind of isomery shift image feeling polarities analysis method based on the potential association of multi-modal depth
CN104850890B (en) Instance-based learning and the convolutional neural networks parameter regulation means of Sadowsky distributions
CN107562784A (en) Short text classification method based on ResLCNN models
CN112784964A (en) Image classification method based on bridging knowledge distillation convolution neural network
CN110473592B (en) Multi-view human synthetic lethal gene prediction method
CN106886543A (en) The knowledge mapping of binding entity description represents learning method and system
CN109817276A (en) A kind of secondary protein structure prediction method based on deep neural network
Ruiz et al. Gated graph convolutional recurrent neural networks
CN109740655B (en) Article scoring prediction method based on matrix decomposition and neural collaborative filtering
CN105787557A (en) Design method of deep nerve network structure for computer intelligent identification
CN106570522A (en) Object recognition model establishment method and object recognition method
Irfan et al. A novel lifelong learning model based on cross domain knowledge extraction and transfer to classify underwater images
Feng et al. One-dimensional VGGNet for high-dimensional data
CN111460818A (en) Web page text classification method based on enhanced capsule network and storage medium
CN112884045B (en) Classification method of random edge deletion embedded model based on multiple visual angles
CN106991049A (en) A kind of Software Defects Predict Methods and forecasting system
CN114077659A (en) Knowledge graph question-answering method and system based on neighbor interaction network
CN113887328A (en) Method for extracting space-time characteristics of photonic crystal space transmission spectrum in parallel by ECA-CNN fusion dual-channel RNN
Zhao et al. Building damage evaluation from satellite imagery using deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20170707)