CN106934352A - A video description method based on a two-way fractal network and LSTM - Google Patents

A video description method based on a two-way fractal network and LSTM

Info

Publication number
CN106934352A
Authority
CN
China
Prior art keywords
video
lstm
network
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710111507.8A
Other languages
Chinese (zh)
Inventor
李楚怡
袁东芝
余卫宇
胡丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201710111507.8A priority Critical patent/CN106934352A/en
Publication of CN106934352A publication Critical patent/CN106934352A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video description method based on a two-way fractal network and LSTM. The method first samples key frames from the video to be described and extracts the optical-flow features between each pair of adjacent frames of the original video; two fractal networks then learn high-level feature representations of the video frames and of the optical-flow features, respectively; these representations are fed into two separate recurrent neural network models based on LSTM units; finally, the outputs of the two independent models at each time step are averaged with weights, yielding a descriptive sentence for the video. The invention exploits both the original frames and the optical flow of the video to be described: the added optical-flow features compensate for the dynamic information that frame sampling inevitably loses, so the variation of the video along both the spatial and the temporal dimension is taken into account. Moreover, a novel fractal network turns low-level features into abstract visual feature representations, so that the people, objects, actions and spatial relationships involved in the video, and the connections among them, can be analyzed and mined more accurately.

Description

A video description method based on a two-way fractal network and LSTM
Technical field
The invention belongs to the technical fields of video description and deep learning, and relates in particular to a video description method based on a two-way fractal network and LSTM.
Background technology
With the progress of science and technology and the development of society, camera terminals of all kinds, smartphones in particular, have become ubiquitous, and the price of hardware storage keeps falling, so multimedia information streams are growing exponentially. Faced with this flood of video, how to analyze, recognize and understand massive video content automatically and efficiently, with as little manual intervention as possible, and then describe it at the semantic level, has become a hot topic in image processing and computer vision research. For most people, describing a short video in language after watching it is a very simple matter. For a machine, however, extracting the pixel information of every frame of the video, analyzing and processing it, and then generating a natural-language description is a challenging task.
A machine that can describe video efficiently and automatically has wide applications in computer vision fields such as video retrieval, human-computer interaction and traffic security, which will further advance research on the semantic description of video.
Summary of the invention
The primary object of the present invention is to overcome the shortcomings and deficiencies of the prior art by providing a video description method based on a two-way fractal network and LSTM.
To achieve the above object, the present invention adopts the following technical scheme:
A video description method based on a two-way fractal network and LSTM, characterised in that key frames are first sampled from the video to be described and the optical-flow features between adjacent frames of the original video are extracted; two fractal networks then learn high-level feature representations of the key frames and of the optical-flow features, respectively; these are fed into two recurrent neural network models based on LSTM units; finally, the outputs of the two independent recurrent neural network models at each time step are averaged with weights to obtain a descriptive sentence for the video. The method specifically comprises the following steps:
S1: sample key frames from the video to be described and extract the optical-flow features between adjacent frames of the original video;
S2: learn high-level feature representations of the video frames and of the optical-flow features with two separate fractal networks;
S3: feed the high-level feature vectors obtained in the previous step into two recurrent neural networks based on LSTM units, one per stream;
S4: average the outputs of the two independent models at each time step with weights to obtain the descriptive sentence corresponding to the video.
Preferably, extracting the optical-flow features of the video to be described in step S1 consists of:
S1.1: computing, for every pair of adjacent frames, the optical-flow values in the x direction and the y direction, and normalizing them to the pixel range [0, 255];
S1.2: computing the amplitude of the optical flow and combining it with the optical-flow values obtained in the previous step into a single optical-flow image, as in the sketch below.
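As a concrete illustration of steps S1.1 and S1.2, the following minimal Python sketch builds one optical-flow image from a pair of grayscale frames. The patent does not fix a particular flow algorithm, so OpenCV's dense Farneback flow and a per-channel min-max normalization are assumed here purely for illustration:

```python
import cv2
import numpy as np

def flow_image(prev_gray, curr_gray):
    """Build a 3-channel optical-flow image for one adjacent frame pair:
    channels are the x-flow, the y-flow and the flow amplitude, each
    mapped to the [0, 255] pixel range."""
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, curr_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    fx, fy = flow[..., 0], flow[..., 1]
    mag = np.sqrt(fx ** 2 + fy ** 2)        # amplitude of the flow (S1.2)
    def to_u8(a):                           # normalize one channel to [0, 255] (S1.1)
        a = a - a.min()
        return (255.0 * a / (a.max() + 1e-8)).astype(np.uint8)
    return np.dstack([to_u8(fx), to_u8(fy), to_u8(mag)])
```

Applying flow_image to every adjacent frame pair of the original video yields the time-ordered sequence of flow images consumed by the temporal network in step S2.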
Preferably, obtaining the high-level feature representations of the key frames and optical-flow features in step S2 consists of:
S2.1: feeding the key frames of the video obtained in step S1, in order of their time points, into a first fractal network that handles spatial relationships, the nonlinear mapping of the network generating the corresponding visual feature vectors in sequence;
S2.2: feeding the optical-flow images obtained in step S1, in order of their time points, into a second fractal network that handles temporal relationships, the nonlinear mapping of the network generating the corresponding motion feature vectors in sequence.
Preferably, in steps S2.1 and S2.2 a very deep network is generated by the repeated application of a single expansion rule, and its topological layout is a truncated fractal. The network contains interacting subpaths of different lengths but no pass-through connections. At the same time, to make it possible to extract high-performance fixed-depth subnetworks, a path-dropping method is employed to regularize the co-adaptation of the paths inside the fractal architecture. For a fractal network, the simplicity of training matches the simplicity of the design: a single loss function attached to the final layer is enough to drive the internal behaviour to mimic deep supervision. The fractal network used is a deep convolutional neural network based on a fractal structure.
Preferably, the very deep network generated in steps S2.1 and S2.2 by the repeated application of a single expansion rule, whose topological layout is a truncated fractal, is specified as follows:
The base case f_1(z) contains a single layer of a chosen type between input and output. Let C denote the index of the truncated fractal f_C(·); f_C(·) defines the network architecture, its connections and its layer types. The base case, a network containing a single convolutional layer, is written as formula (1-1):
f_1(z) = conv(z)   (1-1)
The next fractal is defined recursively by formula (1-2):
f_{C+1}(z) = [f_C ∘ f_C](z) ⊕ conv(z)   (1-2)
In formula (1-2), ∘ denotes composition and ⊕ denotes the join operation; C corresponds to the number of columns, in other words the width of the network f_C(·). Depth, defined as the number of conv layers on the longest path from input to output, is proportional to 2^{C-1}. Convolutional networks used for classification usually intersperse pooling layers; to the same end, f_C(·) is used as a building block and stacked B times with subsequent pooling layers, giving a total depth of B·2^{C-1}. The join operation ⊕ merges two feature blocks into one. A feature block is the result of one conv layer: a tensor holding activations for a fixed number of channels over a spatial region, the number of channels corresponding to the number of filters of the preceding conv layer. When the fractal is expanded, adjacent joins are merged into a single join layer, which merges all of its input feature blocks into a single output block.
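The expansion rule (1-1)/(1-2) can be made concrete with a short recursive module. The sketch below is an assumption-laden illustration in PyTorch, not the patented network itself: channel counts are kept constant, the join is realized as an element-wise mean of the incoming feature blocks (one of the joins used in the FractalNet paper), and batch normalization with ReLU stands in for unspecified layer details:

```python
import torch
import torch.nn as nn

class Fractal(nn.Module):
    """Fractal block f_C built by the expansion rule:
    f_1(z) = conv(z);  f_{C+1}(z) = [f_C o f_C](z) join conv(z)."""
    def __init__(self, channels, C):
        super().__init__()
        self.conv = nn.Sequential(            # the conv unit of the base case (1-1)
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True))
        # the deeper column: two copies of f_{C-1} composed in sequence
        self.sub = (nn.Sequential(Fractal(channels, C - 1), Fractal(channels, C - 1))
                    if C > 1 else None)

    def forward(self, z):
        if self.sub is None:                  # base case f_1(z) = conv(z)
            return self.conv(z)
        # join: merge the two paths' feature blocks, here by element-wise mean
        return torch.stack([self.conv(z), self.sub(z)], dim=0).mean(dim=0)
```

The longest path through Fractal(channels, C) traverses 2^{C-1} conv units, matching the depth stated above; stacking B such blocks with pooling between them gives the total depth B·2^{C-1}.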
Preferably, the rule for regularizing the co-adaptation of paths inside the fractal architecture with the path-dropping method in steps S2.1 and S2.2 is as follows. Because the fractal network contains additional large-scale structure, a coarse-grained regularization strategy similar to dropout and drop-connect is used: path dropping forbids the co-adaptation of parallel paths by randomly discarding operands of the join layers. This effectively prevents the over-fitting behaviour that may arise when the network uses one path as an anchor and another path as a correction. Two sampling policies are used (a sketch of both follows the list):
Local: a join layer drops each input with fixed probability, but is guaranteed to keep at least one input;
Global: a single column is selected for the whole network, restricting it to that one path, which encourages each column to become a strong predictor on its own.
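The two sampling policies can be sketched as a join layer that drops inputs at training time. This is a hedged illustration under stated assumptions: the drop probability of 0.15 is not fixed by the patent, and whether a forward pass uses local or global sampling is decided here by whether the caller passes a globally sampled column index:

```python
import random
import torch
import torch.nn as nn

class DropPathJoin(nn.Module):
    """Join layer with path dropping. Local: each input is dropped with a
    fixed probability, but at least one always survives. Global: a single
    column index, sampled once for the whole network, decides which input
    survives at every join."""
    def __init__(self, drop_prob=0.15):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, paths, global_col=None):
        # paths: list of feature tensors entering this join, one per column
        if not self.training:                    # no dropping at inference time
            return torch.stack(paths).mean(dim=0)
        if global_col is not None:               # global sampling: keep one column
            keep = [paths[global_col % len(paths)]]
        else:                                    # local sampling
            keep = [p for p in paths if random.random() >= self.drop_prob]
            if not keep:                         # guarantee at least one input
                keep = [random.choice(paths)]
        return torch.stack(keep).mean(dim=0)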
Preferably, feeding the high-level feature vectors into the two recurrent neural network models based on LSTM units in step S3 is specified as follows:
Each recurrent neural network based on LSTM units contains two layers of LSTM units, the first and the second layer each containing 1000 neurons. The forward propagation of each LSTM unit can be written as (a step-by-step sketch follows the equations):
i_t = σ(W_{xi} x_t + W_{hi} h_{t-1} + b_i)   (1-3)
f_t = σ(W_{xf} x_t + W_{hf} h_{t-1} + b_f)   (1-4)
o_t = σ(W_{xo} x_t + W_{ho} h_{t-1} + b_o)   (1-5)
g_t = tanh(W_{xg} x_t + W_{hg} h_{t-1} + b_g)   (1-6)
c_t = f_t * c_{t-1} + i_t * g_t   (1-7)
h_t = o_t * tanh(c_t)   (1-8)
Here σ(x) = (1 + e^{-x})^{-1} is the sigmoid nonlinear activation function and tanh(x) = (e^x - e^{-x}) / (e^x + e^{-x}) is the hyperbolic-tangent nonlinear activation function; i_t, f_t, o_t and c_t denote the states at time t of the input gate, the forget gate, the output gate and the memory cell, g_t being the candidate cell gate. For the gates, W_{xi}, W_{xf}, W_{xo} and W_{xg} are the weight matrices applied to the input x_t for the input, forget, output and cell gates respectively; W_{hi}, W_{hf}, W_{ho} and W_{hg} are the weight matrices applied to the hidden-layer variable h_{t-1} of time t-1; and b_i, b_f, b_o and b_g are the corresponding bias vectors.
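A single forward step of one LSTM unit, written out directly from equations (1-3) to (1-8), looks as follows. This NumPy sketch stores the eight weight matrices and four bias vectors in dictionaries keyed by the subscripts used above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One forward step of an LSTM unit, following equations (1-3)-(1-8).
    W maps "xi", "hi", ... to weight matrices; b maps "i", "f", "o", "g"
    to bias vectors."""
    i_t = sigmoid(W["xi"] @ x_t + W["hi"] @ h_prev + b["i"])   # input gate   (1-3)
    f_t = sigmoid(W["xf"] @ x_t + W["hf"] @ h_prev + b["f"])   # forget gate  (1-4)
    o_t = sigmoid(W["xo"] @ x_t + W["ho"] @ h_prev + b["o"])   # output gate  (1-5)
    g_t = np.tanh(W["xg"] @ x_t + W["hg"] @ h_prev + b["g"])   # cell gate    (1-6)
    c_t = f_t * c_prev + i_t * g_t                             # cell state   (1-7)
    h_t = o_t * np.tanh(c_t)                                   # hidden state (1-8)
    return h_t, c_t
```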
Preferably, the structure of the neural network in step S3 is as follows:
Built as a recurrent neural network of two stacked layers of LSTM units, the network encodes and decodes the input feature vectors, realizing the conversion to natural-language text. The first layer of LSTM neurons encodes the visual feature vector input at each time step, and the hidden-layer representation output at each step serves as the input of the second layer of LSTM neurons. Once the feature vectors of all video frames have been fed into the first LSTM layer, the second layer receives an indicator token and starts the decoding task. During decoding the network loses information, so the goal of training and learning the model parameters is to maximize the log-likelihood of the whole predicted output sentence, given the hidden representation and the prediction output at the previous step. For a model with parameters θ and output sentence Y = (y_1, y_2, …, y_m), the parameter optimization target can be written as:
θ* = argmax_θ Σ_{(h,Y)} log p(Y | h; θ)   (1-9)
Here θ are the parameters, Y is the predicted output sentence and h is the hidden-layer representation. The objective is optimized with stochastic gradient descent, and the error of the whole network is accumulated and propagated along the time dimension by the back-propagation algorithm.
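For one training sentence, the objective (1-9) reduces to summing the log-probabilities the model assigns to the ground-truth words, which is the negation of the usual cross-entropy loss; a minimal sketch:

```python
import numpy as np

def sentence_log_likelihood(word_probs, target_ids):
    """log p(Y | h; theta) for one sentence, per (1-9): the sum over time
    steps of the log-probability assigned to each ground-truth word.
    word_probs: (T, |V|) array of per-step softmax outputs;
    target_ids: length-T array of ground-truth word indices."""
    picked = word_probs[np.arange(len(target_ids)), target_ids]
    return float(np.sum(np.log(picked + 1e-12)))   # epsilon guards against log(0)

# Stochastic gradient descent maximizes the sum of this quantity over the
# training pairs (h, Y), i.e. minimizes its negation as a cross-entropy
# loss, with gradients accumulated through time by BPTT.
```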
Preferably, obtaining the descriptive sentence of the video in step S4 by weighted-averaging the outputs of the two independent neural network models at each time step consists of:
S4.1: taking a weighted average of the outputs of the second-layer LSTM neurons of the two independent recurrent neural network models at each time step;
S4.2: computing the occurrence probability of every word in the vocabulary V with the softmax function, expressed as:
P(y | z_t) = exp(W_y z_t) / Σ_{y′∈V} exp(W_{y′} z_t)   (1-10)
where y is the predicted word, z_t is the output of the recurrent neural network at time t, and W_y is the weight of that word in the vocabulary;
S4.3: at each decoding step, taking the word with the highest probability in the softmax output, so as to obtain the corresponding video description sentence.
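One decoding step of S4.1 to S4.3 can be written compactly as below. The equal stream weights (alpha = 0.5) are an assumption; the patent only states that the two streams' outputs are weighted-averaged:

```python
import numpy as np

def fused_word(z_rgb, z_flow, W, alpha=0.5):
    """One decoding step: weighted-average the two streams' second-layer LSTM
    outputs (S4.1), apply the softmax of (1-10) over the vocabulary (S4.2),
    and take the most probable word (S4.3). W has one row W_y per word."""
    z_t = alpha * z_rgb + (1.0 - alpha) * z_flow   # S4.1: weighted average
    scores = W @ z_t                               # per-word scores W_y z_t
    scores -= scores.max()                         # stabilize the exponentials
    probs = np.exp(scores) / np.exp(scores).sum()  # S4.2: P(y | z_t), eq. (1-10)
    return int(np.argmax(probs)), probs            # S4.3: greedy word choice
```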
Compared with the prior art, the present invention has the following advantages and beneficial effects:
1. The added optical-flow features compensate for the dynamic information that frame sampling inevitably loses, taking into account the variation of the video along both the spatial and the temporal dimension.
2. The video description method based on a two-way fractal network and LSTM provided by the invention can process any input video and automatically generate a description of its content end to end, and can be applied in fields such as video retrieval, video surveillance and human-computer interaction.
3. The invention uses a novel fractal network to turn low-level features into abstract visual feature representations, so that the people, objects, actions and spatial relationships involved in the video, and the connections among them, can be analyzed and mined more accurately.
Brief description of the drawings
Fig. 1 is the flow diagram of the video description method based on a two-way fractal network and LSTM provided by the invention;
Fig. 2 is a schematic diagram of the fractal network used in an embodiment of the invention;
Fig. 3 is a schematic diagram of the recurrent neural network based on LSTM units used in an embodiment of the invention.
Specific embodiment
The present invention is described in further detail below with reference to an embodiment and the accompanying drawings, but the embodiments of the invention are not limited thereto.
Key frames are sampled from the video to be described and the optical-flow features between adjacent frames of the original video are extracted; two fractal networks then learn high-level feature representations of the key frames and of the optical-flow features, respectively; these are fed into two recurrent neural network models based on LSTM units; finally, the outputs of the two independent recurrent neural network models at each time step are averaged with weights, yielding a descriptive sentence for the video.
Fig. 1 is the overall flow diagram of the invention, comprising the following steps:
(1) Sample key frames from the video to be described and extract the optical-flow features between adjacent frames of the original video. Concretely, extracting the optical-flow features of the video to be described consists of:
1. computing, for every pair of adjacent frames, the optical-flow values in the x direction and the y direction, and normalizing them to the pixel range [0, 255];
2. computing the amplitude of the optical flow and combining it with the optical-flow values obtained in the previous step into a single optical-flow image.
(2) Learn the high-level feature representations of the video frames and of the optical-flow features with two separate fractal networks. The sampled frames of the video obtained in the previous step are fed, in order of their time points, into a first fractal network that handles spatial relationships, whose nonlinear mapping generates the corresponding visual feature vectors in sequence; the optical-flow images are fed, in order of their time points, into a second fractal network that handles temporal relationships, whose nonlinear mapping generates the corresponding motion feature vectors in sequence, as in the sketch below.
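A minimal sketch of this wiring, assuming two already-constructed fractal networks spatial_net and temporal_net (hypothetical names) that each map an image tensor to a feature vector:

```python
import torch

def extract_feature_sequences(frames, flows, spatial_net, temporal_net):
    """Feed the time-ordered key frames through the spatial fractal network
    and the time-ordered flow images through the temporal fractal network,
    yielding one visual and one motion feature vector per time point."""
    visual_seq, motion_seq = [], []
    with torch.no_grad():
        for frame in frames:                    # frames: list of CxHxW tensors
            visual_seq.append(spatial_net(frame.unsqueeze(0)).flatten())
        for flow in flows:                      # flows: list of CxHxW flow images
            motion_seq.append(temporal_net(flow.unsqueeze(0)).flatten())
    return torch.stack(visual_seq), torch.stack(motion_seq)
```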
The fractal network mainly introduces a layout strategy based on self-similarity into the macro-architecture of the neural network: a very deep network is generated by the repeated application of a single expansion rule, and its topological layout is a truncated fractal. The network contains interacting subpaths of different lengths but no pass-through connections. At the same time, to make it possible to extract high-performance fixed-depth subnetworks, a path-dropping method is employed to regularize the co-adaptation of the paths inside the fractal architecture. For a fractal network, the simplicity of training matches the simplicity of the design: a single loss function attached to the final layer is enough to drive the internal behaviour to mimic deep supervision. The fractal network used in the present invention is a deep convolutional neural network based on a fractal structure.
As in the schematic diagram of the fractal structure given in Fig. 2, the base case f_1(z) contains a single layer of a chosen type between input and output. Let C denote the index of the truncated fractal f_C(·); f_C(·) defines the network architecture, its connections and its layer types. The base case, a network containing a single convolutional layer, is written as formula (1-1):
f_1(z) = conv(z)   (1-1)
The next fractal structure is then defined recursively by formula (1-2):
f_{C+1}(z) = [f_C ∘ f_C](z) ⊕ conv(z)   (1-2)
In formula (1-2), ∘ denotes composition and ⊕ denotes the join operation; C corresponds to the number of columns, in other words the width of the network f_C(·). Depth, defined as the number of conv layers on the longest path from input to output, is proportional to 2^{C-1}. Convolutional networks used for classification usually intersperse pooling layers; to the same end, f_C(·) is used as a building block and stacked B times with subsequent pooling layers, giving a total depth of B·2^{C-1}. The join operation ⊕ merges two feature blocks into one; a feature block is the result of one convolutional layer, a tensor holding activations for a fixed number of channels over a spatial region, the number of channels corresponding to the number of filters of the preceding convolutional layer. When the fractal is expanded, adjacent joins are merged into a single join layer; as shown on the right of Fig. 2, this join layer spans multiple columns and merges all of its input feature blocks into a single output block.
Because the fractal network contains additional large-scale structure, a coarse-grained regularization strategy similar to dropout and drop-connect is proposed: path dropping forbids the co-adaptation of parallel paths by randomly discarding operands of the join layers. This effectively prevents the over-fitting behaviour that may arise when the network uses one path as an anchor and another path as a correction. Two sampling policies are mainly used here:
Local: a join layer drops each input with fixed probability, but is guaranteed to keep at least one input;
Global: a single column is selected for the whole network, restricting it to that one path, which encourages each column to become a strong predictor on its own.
(3) Feed the high-level feature vectors obtained in the previous step into two recurrent neural networks based on LSTM units, one per stream. Each recurrent neural network based on LSTM units contains two layers of LSTM units, the first and the second layer each containing 1000 neurons. The forward propagation of each LSTM unit can be written as:
i_t = σ(W_{xi} x_t + W_{hi} h_{t-1} + b_i)   (1-3)
f_t = σ(W_{xf} x_t + W_{hf} h_{t-1} + b_f)   (1-4)
o_t = σ(W_{xo} x_t + W_{ho} h_{t-1} + b_o)   (1-5)
g_t = tanh(W_{xg} x_t + W_{hg} h_{t-1} + b_g)   (1-6)
c_t = f_t * c_{t-1} + i_t * g_t   (1-7)
h_t = o_t * tanh(c_t)   (1-8)
Here σ(x) = (1 + e^{-x})^{-1} is the sigmoid nonlinear activation function and tanh(x) = (e^x - e^{-x}) / (e^x + e^{-x}) is the hyperbolic-tangent nonlinear activation function; i_t, f_t, o_t and c_t denote the states at time t of the input gate, the forget gate, the output gate and the memory cell, g_t being the candidate cell gate. W_{xi}, W_{xf}, W_{xo} and W_{xg} are the weight matrices applied to the input for the input, forget, output and cell gates respectively; W_{hi}, W_{hf}, W_{ho} and W_{hg} are the weight matrices applied to the hidden-layer variable h_{t-1} of time t-1; and b_i, b_f, b_o and b_g are the corresponding bias vectors.
As in the structure diagram of the recurrent neural network based on two layers of LSTM units given in Fig. 3, we use this two-layer stack of LSTM units to encode and decode the input feature vectors, thereby realizing the conversion to natural-language text. The first layer of LSTM neurons encodes the visual feature vector input at each time step, and the hidden-layer representation output at each step serves as the input of the second layer of LSTM neurons. Once the feature vectors of all video frames have been fed into the first LSTM layer, the second layer receives an indicator token and starts the decoding task. During decoding the network loses information, so the goal of training and learning the model parameters is to maximize the log-likelihood of the whole predicted output sentence, given the hidden representation and the prediction output at the previous step. For a model with parameters θ and output sentence Y = (y_1, y_2, …, y_m), the parameter optimization target can be written as:
θ* = argmax_θ Σ_{(h,Y)} log p(Y | h; θ)   (1-9)
Here θ are the parameters, Y is the predicted output sentence and h is the hidden-layer representation. The objective is optimized with stochastic gradient descent, and the error of the whole network is accumulated and propagated along the time dimension by the back-propagation algorithm.
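The encode-then-decode behaviour of the two stacked LSTM layers can be sketched as follows. This is an illustrative greedy decoder under stated assumptions: the begin-of-sentence indicator is word index 0, word embeddings share the hidden size, and during encoding the word slot of the second layer is zero-padded, an S2VT-style convention not spelled out in the patent:

```python
import torch
import torch.nn as nn

class TwoLayerLSTMCaptioner(nn.Module):
    """Two stacked LSTM layers of 1000 neurons each: layer 1 consumes one
    feature vector per time step; its hidden state, concatenated with the
    previous word's embedding, feeds layer 2, which emits the sentence."""
    def __init__(self, feat_dim, vocab_size, hidden=1000):
        super().__init__()
        self.lstm1 = nn.LSTMCell(feat_dim, hidden)
        self.lstm2 = nn.LSTMCell(hidden + hidden, hidden)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.out = nn.Linear(hidden, vocab_size)
        self.feat_dim, self.hidden = feat_dim, hidden

    def forward(self, feats, max_words=20):
        zeros = lambda n: feats.new_zeros(1, n)
        h1, c1 = zeros(self.hidden), zeros(self.hidden)
        h2, c2 = zeros(self.hidden), zeros(self.hidden)
        for x in feats:                              # encoding: frame features in
            h1, c1 = self.lstm1(x.unsqueeze(0), (h1, c1))
            h2, c2 = self.lstm2(torch.cat([h1, zeros(self.hidden)], dim=1), (h2, c2))
        word = torch.zeros(1, dtype=torch.long)      # <BOS> indicator starts decoding
        sentence = []
        for _ in range(max_words):                   # decoding: no more visual input
            h1, c1 = self.lstm1(zeros(self.feat_dim), (h1, c1))
            h2, c2 = self.lstm2(torch.cat([h1, self.embed(word)], dim=1), (h2, c2))
            word = self.out(h2).argmax(dim=1)        # greedy word choice per step
            sentence.append(int(word))
        return sentence
```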
(4) output valve at two each moment of independent model is weighted mean deviation and obtains the corresponding description language of video , concrete operations are:
1st, the output valve of two second layer LSTM neurons at each moment of independent model is weighted averagely;
2nd, the probability of occurrence of each word in vocabulary V is calculated using softmax functions, is expressed as:
Wherein, y represents the word of prediction, ztRepresent output valve of the recurrent neural network in t, WyRepresent that the word exists Weighted value in vocabulary.
3rd, in the decoding stage at each moment, the word of maximum probability in softmax function-outputs is taken, so as to obtain right The video presentation sentence answered.
The above embodiment is a preferred implementation of the present invention, but the implementations of the invention are not limited by it; any other change, modification, substitution, combination or simplification made without departing from the spirit and principle of the invention shall be an equivalent replacement and is included within the scope of protection of the invention.

Claims (9)

1. A video description method based on a two-way fractal network and LSTM, characterised in that key frames are first sampled from the video to be described and the optical-flow features between adjacent frames of the original video are extracted; two fractal networks then learn and obtain high-level feature representations of the key frames and of the optical-flow features, respectively; these are fed into two recurrent neural network models based on LSTM units; finally, the outputs of the two independent recurrent neural network models at each time step are averaged with weights to obtain a descriptive sentence corresponding to the video; the method specifically comprises the following steps:
S1: sampling key frames from the video to be described and extracting the optical-flow features between adjacent frames of the original video;
S2: learning and obtaining high-level feature representations of the key frames and of the optical-flow features with two separate fractal networks, wherein each fractal network is generated by the repeated application of a single expansion rule;
S3: feeding the high-level feature vectors obtained in the previous step into two recurrent neural network models based on LSTM units, respectively;
S4: weighted-averaging the outputs of the two independent recurrent neural network models at each time step to obtain the descriptive sentence corresponding to the video.
2. The video description method based on a two-way fractal network and LSTM according to claim 1, characterised in that extracting the optical-flow features of the video to be described in step S1 is specifically:
S1.1: computing, for every pair of adjacent frames, the optical-flow values in the x direction and the y direction, and normalizing them to the pixel range [0, 255];
S1.2: computing the amplitude of the optical flow and combining it with the optical-flow values obtained in the previous step into a single optical-flow image.
3. The video description method based on a two-way fractal network and LSTM according to claim 1, characterised in that obtaining the high-level feature representations of the key frames and optical-flow features in step S2 specifically comprises:
S2.1: feeding the key frames of the video obtained in step S1, in order of their time points, into a first fractal network that handles spatial relationships, the nonlinear mapping of the network generating the corresponding visual feature vectors in sequence;
S2.2: feeding the optical-flow images obtained in step S1, in order of their time points, into a second fractal network that handles temporal relationships, the nonlinear mapping of the network generating the corresponding motion feature vectors in sequence.
4. The video description method based on a two-way fractal network and LSTM according to claim 3, characterised in that in steps S2.1 and S2.2 a very deep network is generated by the repeated application of a single expansion rule, its topological layout being a truncated fractal; the network contains interacting subpaths of different lengths but no pass-through connections; at the same time, to make it possible to extract high-performance fixed-depth subnetworks, a path-dropping method is employed to regularize the co-adaptation of the paths inside the fractal architecture; for a fractal network the simplicity of training matches the simplicity of the design, a single loss function attached to the final layer being enough to drive the internal behaviour to mimic deep supervision; the fractal network used is a deep convolutional neural network based on a fractal structure.
5. The video description method based on a two-way fractal network and LSTM according to claim 4, characterised in that the very deep network generated in steps S2.1 and S2.2 by the repeated application of a single expansion rule, whose topological layout is a truncated fractal, is specifically:
the base case f_1(z) contains a single layer of a chosen type between input and output; let C denote the index of the truncated fractal f_C(·); f_C(·) defines the network architecture, its connections and its layer types; the base case, a network containing a single convolutional layer, is written as formula (1-1):
f_1(z) = conv(z)   (1-1)
the next fractal is defined recursively by formula (1-2):
f_{C+1}(z) = [f_C ∘ f_C](z) ⊕ conv(z)   (1-2)
in formula (1-2), ∘ denotes composition and ⊕ denotes the join operation; C corresponds to the number of columns, in other words the width of the network f_C(·); depth, defined as the number of conv layers on the longest path from input to output, is proportional to 2^{C-1}; convolutional networks used for classification usually intersperse pooling layers; to the same end, f_C(·) is used as a building block and stacked B times with subsequent pooling layers, giving a total depth of B·2^{C-1}; the join operation ⊕ merges two feature blocks into one; a feature block is the result of one conv layer, a tensor holding activations for a fixed number of channels over a spatial region, the number of channels corresponding to the number of filters of the preceding conv layer; when the fractal is expanded, adjacent joins are merged into a single join layer, which merges all of its input feature blocks into a single output block.
6. The video description method based on a two-way fractal network and LSTM according to claim 4, characterised in that the rule for regularizing the co-adaptation of paths inside the fractal architecture with the path-dropping method in steps S2.1 and S2.2 is specifically: because the fractal network contains additional large-scale structure, a coarse-grained regularization strategy similar to dropout and drop-connect is used; path dropping forbids the co-adaptation of parallel paths by randomly discarding operands of the join layers, which effectively prevents the over-fitting behaviour that may arise when the network uses one path as an anchor and another path as a correction; two sampling policies are used:
local: a join layer drops each input with fixed probability, but is guaranteed to keep at least one input;
global: a single column is selected for the whole network, restricting it to that one path, which encourages each column to become a strong predictor on its own.
7. The video description method based on a two-way fractal network and LSTM according to claim 1, characterised in that feeding the high-level feature vectors into the two recurrent neural network models based on LSTM units in step S3 is specifically: each recurrent neural network based on LSTM units contains two layers of LSTM units, the first and the second layer each containing 1000 neurons, and the forward propagation of each LSTM unit can be written as:
i_t = σ(W_{xi} x_t + W_{hi} h_{t-1} + b_i)   (1-3)
f_t = σ(W_{xf} x_t + W_{hf} h_{t-1} + b_f)   (1-4)
o_t = σ(W_{xo} x_t + W_{ho} h_{t-1} + b_o)   (1-5)
g_t = tanh(W_{xg} x_t + W_{hg} h_{t-1} + b_g)   (1-6)
c_t = f_t * c_{t-1} + i_t * g_t   (1-7)
h_t = o_t * tanh(c_t)   (1-8)
where σ(x) = (1 + e^{-x})^{-1} is the sigmoid nonlinear activation function and tanh(x) = (e^x - e^{-x}) / (e^x + e^{-x}) is the hyperbolic-tangent nonlinear activation function; i_t, f_t, o_t and c_t denote the states at time t of the input gate, the forget gate, the output gate and the memory cell, g_t being the candidate cell gate; W_{xi}, W_{xf}, W_{xo} and W_{xg} are the weight matrices applied to the input for the input, forget, output and cell gates respectively; W_{hi}, W_{hf}, W_{ho} and W_{hg} are the weight matrices applied to the hidden-layer variable h_{t-1} of time t-1; and b_i, b_f, b_o and b_g are the corresponding bias vectors.
8. The video description method based on a two-way fractal network and LSTM according to claim 7, characterised in that the structure of the neural network in step S3 is:
a recurrent neural network of two layers of LSTM units realizes the conversion to natural-language text; the first layer of LSTM neurons encodes the visual feature vector input at each time step, the hidden-layer representation output at each step serving as the input of the second layer of LSTM neurons; after the feature vectors of all video frames have been fed into the first LSTM layer, the second layer receives an indicator token and starts the decoding task; during decoding the network loses information, so the goal of training and learning the model parameters is to maximize the log-likelihood of the whole predicted output sentence given the hidden representation and the prediction output at the previous step; for a model with parameters θ and output sentence Y = (y_1, y_2, …, y_m), the parameter optimization target can be written as:
θ* = argmax_θ Σ_{(h,Y)} log p(Y | h; θ)   (1-9)
where θ are the parameters, Y is the predicted output sentence and h is the hidden-layer representation; the objective is optimized with stochastic gradient descent, and the error of the whole network is accumulated and propagated along the time dimension by the back-propagation algorithm.
9. The video description method based on a two-way fractal network and LSTM according to claim 1, characterised in that in step S4 the outputs of the two independent neural network models at each time step are weighted-averaged to obtain the descriptive sentence corresponding to the video, specifically:
S4.1: taking a weighted average of the outputs of the second-layer LSTM neurons of the two independent recurrent neural network models at each time step;
S4.2: computing the occurrence probability of every word in the vocabulary V with the softmax function, expressed as:
P(y | z_t) = exp(W_y z_t) / Σ_{y′∈V} exp(W_{y′} z_t)   (1-10)
where y is the predicted word, z_t is the output of the recurrent neural network at time t, and W_y is the weight of that word in the vocabulary;
S4.3: at each decoding step, taking the word with the highest probability in the softmax output, so as to obtain the corresponding video description sentence.
CN201710111507.8A 2017-02-28 2017-02-28 A video description method based on a two-way fractal network and LSTM Pending CN106934352A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710111507.8A 2017-02-28 2017-02-28 A video description method based on a two-way fractal network and LSTM

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710111507.8A 2017-02-28 2017-02-28 A video description method based on a two-way fractal network and LSTM

Publications (1)

Publication Number Publication Date
CN106934352A true CN106934352A (en) 2017-07-07

Family

ID=59424160

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710111507.8A A video description method based on a two-way fractal network and LSTM 2017-02-28 2017-02-28 Pending

Country Status (1)

Country Link
CN (1) CN106934352A (en)

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107644519A (en) * 2017-10-09 2018-01-30 中电科新型智慧城市研究院有限公司 A kind of intelligent alarm method and system based on video human Activity recognition
CN107909115A (en) * 2017-12-04 2018-04-13 上海师范大学 A kind of image Chinese subtitle generation method
CN107909014A (en) * 2017-10-31 2018-04-13 天津大学 A kind of video understanding method based on deep learning
CN108198202A (en) * 2018-01-23 2018-06-22 北京易智能科技有限公司 A kind of video content detection method based on light stream and neural network
CN108228915A (en) * 2018-03-29 2018-06-29 华南理工大学 A kind of video retrieval method based on deep learning
CN108235116A (en) * 2017-12-27 2018-06-29 北京市商汤科技开发有限公司 Feature propagation method and device, electronic equipment, program and medium
CN108470212A (en) * 2018-01-31 2018-08-31 江苏大学 A kind of efficient LSTM design methods that can utilize incident duration
CN108536735A (en) * 2018-03-05 2018-09-14 中国科学院自动化研究所 Multi-modal lexical representation method and system based on multichannel self-encoding encoder
CN109284682A (en) * 2018-08-21 2019-01-29 南京邮电大学 A kind of gesture identification method and system based on STT-LSTM network
CN109460812A (en) * 2017-09-06 2019-03-12 富士通株式会社 Average information analytical equipment, the optimization device, feature visualization device of neural network
CN109522451A (en) * 2018-12-13 2019-03-26 连尚(新昌)网络科技有限公司 Repeat video detecting method and device
CN109753897A (en) * 2018-12-21 2019-05-14 西北工业大学 Based on memory unit reinforcing-time-series dynamics study Activity recognition method
CN109785336A (en) * 2018-12-18 2019-05-21 深圳先进技术研究院 Image partition method and device based on multipath convolutional neural networks model
CN110008789A (en) * 2018-01-05 2019-07-12 ***通信有限公司研究院 Multiclass object detection and knowledge method for distinguishing, equipment and computer readable storage medium
CN110019952A (en) * 2017-09-30 2019-07-16 华为技术有限公司 Video presentation method, system and device
CN110084259A (en) * 2019-01-10 2019-08-02 谢飞 A kind of facial paralysis hierarchical synthesis assessment system of combination face texture and Optical-flow Feature
CN110197195A (en) * 2019-04-15 2019-09-03 深圳大学 A kind of novel deep layer network system and method towards Activity recognition
CN110475129A (en) * 2018-03-05 2019-11-19 腾讯科技(深圳)有限公司 Method for processing video frequency, medium and server
CN110531163A (en) * 2019-04-18 2019-12-03 中国人民解放军国防科技大学 Bus capacitance state monitoring method for suspension chopper of maglev train
CN111767765A (en) * 2019-04-01 2020-10-13 Oppo广东移动通信有限公司 Video processing method and device, storage medium and electronic equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105279495A (en) * 2015-10-23 2016-01-27 天津大学 Video description method based on deep learning and text summarization
CN106407649A (en) * 2016-08-26 2017-02-15 中国矿业大学(北京) Onset time automatic picking method of microseismic signal on the basis of time-recursive neural network

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105279495A (en) * 2015-10-23 2016-01-27 天津大学 Video description method based on deep learning and text summarization
CN106407649A (en) * 2016-08-26 2017-02-15 中国矿业大学(北京) Onset time automatic picking method of microseismic signal on the basis of time-recursive neural network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
GUSTAV LARSSON ET AL.: "FractalNet: Ultra-Deep Neural Networks without Residuals", arXiv:1605.07648v2 *
JOE YUE-HEI NG ET AL.: "Beyond Short Snippets: Deep Networks for Video Classification", IEEE *
KAREN SIMONYAN ET AL.: "Two-Stream Convolutional Networks for Action Recognition in Videos", arXiv:1406.2199v2 *
SUBHASHINI VENUGOPALAN ET AL.: "Improving LSTM-based Video Description with Linguistic Knowledge Mined from Text", arXiv:1604.01729v1 *

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109460812A (en) * 2017-09-06 2019-03-12 富士通株式会社 Average information analytical equipment, the optimization device, feature visualization device of neural network
CN110019952B (en) * 2017-09-30 2023-04-18 华为技术有限公司 Video description method, system and device
CN110019952A (en) * 2017-09-30 2019-07-16 华为技术有限公司 Video presentation method, system and device
CN107644519A (en) * 2017-10-09 2018-01-30 中电科新型智慧城市研究院有限公司 A kind of intelligent alarm method and system based on video human Activity recognition
CN107909014A (en) * 2017-10-31 2018-04-13 天津大学 A kind of video understanding method based on deep learning
CN107909115A (en) * 2017-12-04 2018-04-13 上海师范大学 A kind of image Chinese subtitle generation method
CN108235116A (en) * 2017-12-27 2018-06-29 北京市商汤科技开发有限公司 Feature propagation method and device, electronic equipment, program and medium
CN108235116B (en) * 2017-12-27 2020-06-16 北京市商汤科技开发有限公司 Feature propagation method and apparatus, electronic device, and medium
CN110008789A (en) * 2018-01-05 2019-07-12 ***通信有限公司研究院 Multiclass object detection and knowledge method for distinguishing, equipment and computer readable storage medium
CN108198202A (en) * 2018-01-23 2018-06-22 北京易智能科技有限公司 A kind of video content detection method based on light stream and neural network
CN108470212A (en) * 2018-01-31 2018-08-31 江苏大学 A kind of efficient LSTM design methods that can utilize incident duration
CN108470212B (en) * 2018-01-31 2020-02-21 江苏大学 Efficient LSTM design method capable of utilizing event duration
CN108536735A (en) * 2018-03-05 2018-09-14 中国科学院自动化研究所 Multi-modal lexical representation method and system based on multichannel self-encoding encoder
CN108536735B (en) * 2018-03-05 2020-12-15 中国科学院自动化研究所 Multi-mode vocabulary representation method and system based on multi-channel self-encoder
CN110475129A (en) * 2018-03-05 2019-11-19 腾讯科技(深圳)有限公司 Method for processing video frequency, medium and server
CN108228915A (en) * 2018-03-29 2018-06-29 华南理工大学 A kind of video retrieval method based on deep learning
CN109284682A (en) * 2018-08-21 2019-01-29 南京邮电大学 A kind of gesture identification method and system based on STT-LSTM network
CN109522451B (en) * 2018-12-13 2024-02-27 连尚(新昌)网络科技有限公司 Repeated video detection method and device
CN109522451A (en) * 2018-12-13 2019-03-26 连尚(新昌)网络科技有限公司 Repeat video detecting method and device
CN109785336A (en) * 2018-12-18 2019-05-21 深圳先进技术研究院 Image partition method and device based on multipath convolutional neural networks model
CN109785336B (en) * 2018-12-18 2020-11-27 深圳先进技术研究院 Image segmentation method and device based on multipath convolutional neural network model
CN109753897A (en) * 2018-12-21 2019-05-14 西北工业大学 Based on memory unit reinforcing-time-series dynamics study Activity recognition method
CN109753897B (en) * 2018-12-21 2022-05-27 西北工业大学 Behavior recognition method based on memory cell reinforcement-time sequence dynamic learning
CN110084259B (en) * 2019-01-10 2022-09-20 谢飞 Facial paralysis grading comprehensive evaluation system combining facial texture and optical flow characteristics
CN110084259A (en) * 2019-01-10 2019-08-02 谢飞 A kind of facial paralysis hierarchical synthesis assessment system of combination face texture and Optical-flow Feature
CN111767765A (en) * 2019-04-01 2020-10-13 Oppo广东移动通信有限公司 Video processing method and device, storage medium and electronic equipment
CN110197195B (en) * 2019-04-15 2022-12-23 深圳大学 Novel deep network system and method for behavior recognition
CN110197195A (en) * 2019-04-15 2019-09-03 深圳大学 A kind of novel deep layer network system and method towards Activity recognition
CN110531163A (en) * 2019-04-18 2019-12-03 中国人民解放军国防科技大学 Bus capacitance state monitoring method for suspension chopper of maglev train

Similar Documents

Publication Publication Date Title
CN106934352A (en) A video description method based on a two-way fractal network and LSTM
CN111985245B (en) Relationship extraction method and system based on attention cycle gating graph convolution network
CN113011499B (en) Hyperspectral remote sensing image classification method based on double-attention machine system
WO2021043193A1 (en) Neural network structure search method and image processing method and device
CN107766324A (en) A kind of text coherence analysis method based on deep neural network
Lei et al. Shallow convolutional neural network for image classification
CN107679580A (en) A kind of isomery shift image feeling polarities analysis method based on the potential association of multi-modal depth
CN104850890B (en) Instance-based learning and the convolutional neural networks parameter regulation means of Sadowsky distributions
CN107562784A (en) Short text classification method based on ResLCNN models
CN112784964A (en) Image classification method based on bridging knowledge distillation convolution neural network
CN110473592B (en) Multi-view human synthetic lethal gene prediction method
CN106886543A (en) The knowledge mapping of binding entity description represents learning method and system
CN109817276A (en) A kind of secondary protein structure prediction method based on deep neural network
Ruiz et al. Gated graph convolutional recurrent neural networks
CN109740655B (en) Article scoring prediction method based on matrix decomposition and neural collaborative filtering
CN105787557A (en) Design method of deep nerve network structure for computer intelligent identification
CN106570522A (en) Object recognition model establishment method and object recognition method
Irfan et al. A novel lifelong learning model based on cross domain knowledge extraction and transfer to classify underwater images
Feng et al. One-dimensional VGGNet for high-dimensional data
CN111460818A (en) Web page text classification method based on enhanced capsule network and storage medium
CN112884045B (en) Classification method of random edge deletion embedded model based on multiple visual angles
CN106991049A (en) A kind of Software Defects Predict Methods and forecasting system
CN114077659A (en) Knowledge graph question-answering method and system based on neighbor interaction network
CN113887328A (en) Method for extracting space-time characteristics of photonic crystal space transmission spectrum in parallel by ECA-CNN fusion dual-channel RNN
Zhao et al. Building damage evaluation from satellite imagery using deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20170707)