CN110288029A - Image Description Methods based on Tri-LSTMs model - Google Patents

Image Description Methods based on Tri-LSTMs model Download PDF

Info

Publication number
CN110288029A
CN110288029A (application CN201910565977.0A)
Authority
CN
China
Prior art keywords
lstm
indicate
image
neural networks
convolutional neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910565977.0A
Other languages
Chinese (zh)
Other versions
CN110288029B (en)
Inventor
王爽
侯彪
张磊
孟芸
叶秀眺
田敬贤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN201910565977.0A priority Critical patent/CN110288029B/en
Publication of CN110288029A publication Critical patent/CN110288029A/en
Application granted granted Critical
Publication of CN110288029B publication Critical patent/CN110288029B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image description method based on a Tri-LSTMs model. The steps are: generate a training set and map word vectors; build and train an RPN convolutional neural network and a Faster-RCNN convolutional neural network; extract the fully connected layer features of the image; construct and train the Tri-LSTMs model; and generate the image description. By combining multiple long short-term memory (LSTM) networks and simultaneously using the image's fully connected layer features and the 300-dimensional GLOVE word vectors of the words, the invention effectively increases the diversity of the generated captions and produces more accurate image descriptions.

Description

Image Description Methods based on Tri-LSTMs model
Technical field
The invention belongs to the technical field of image processing, and further relates to an image description method based on a Tri-LSTMs model within the field of image description. The present invention can be used to generate accurate and diverse sentences that describe the content of a given image. Here, Tri-LSTMs denotes a model composed of three modules: a semantic LSTM module, a visual LSTM module and a language LSTM module.
Background art
Image description takes an image as input and generates a sentence describing its content. The generated sentence should not only be fluent but also accurately describe the attributes and positions of the objects in the image and the relationships between objects. The generated image descriptions can be used to find images that match a given description, which facilitates image retrieval. In addition, after converting the generated image descriptions into Braille, they can help blind people understand image content.
The patent "An image description method and system based on a bag-of-words model" owned by Shenzhen University (application number 201410491596X, grant publication number CN104299010B) proposes an image description method based on a bag-of-words model. It mainly addresses the problems of information loss and low accuracy in conventional methods. Its steps are: (1) extract feature points from the image to be described; (2) compute the set of distances between each feature point and the visual words in a codebook, and apply a Gaussian membership function to this distance set to obtain a set of membership degrees between the feature point and the visual words; (3) use the membership set to accumulate, for each feature point, the membership degrees of the visual words that describe it, forming a histogram vector that describes the image. Although this patented technology improves on traditional image description techniques and achieves higher description accuracy, it still has shortcomings: feature points must be extracted manually, different extraction methods significantly affect the results, the extraction process is cumbersome, and the diversity of the final image descriptions is insufficient.
The patent "A generation method from structured text to image description" owned by Tianjin University (application number 2016108541692, grant publication number CN106503055B) proposes a method for generating image descriptions from structured text. It mainly addresses the low accuracy and insufficient diversity of image descriptions generated by the prior art. Its steps are: (1) download pictures from the Internet to form a picture training set; (2) perform lexical analysis on the descriptions corresponding to the images in the training set and construct structured text; (3) use an existing neural network model to extract convolutional neural network features of the training set images, and build a multi-task recognition model with <image feature, structured text> pairs as input; (4) use the structured text extracted from the training set and the corresponding descriptions as input to a recurrent neural network, and train the parameters of the recurrent neural network model; (5) input the convolutional neural network features of the image to be described into the multi-task recognition model to obtain the predicted structured text; (6) input the predicted structured text into the recurrent neural network model to obtain the image description. Although this patented technology alleviates the insufficient diversity of the generated image descriptions, it still has the shortcoming that it uses only image features and does not use other effective information to guide the decoding process, which affects the accuracy of the final image descriptions.
Paper " the Show and Tell:A Neural Image Caption that Oriol Vinyals et al. is delivered at it The Image Description Methods based on coder-decoder model are proposed in Generator " (2015 meeting paper of cvpr).This method It is to extract characteristics of image first with convolutional neural networks (ConvolutionalNeural Network, CNN), is then delivered to length The corresponding description of image is generated in short-term memory network (Long Short-TermMemory, LSTM).This method is for the first time using volume Code device-decoder structure solves the problems, such as iamge description, and still, this method, which still has, to be disadvantageous in that, model structure It is too simple, the iamge description inaccuracy of generation.
Paper " Show, Attend and Tell:Neural the Image Caption that Kelvin Xu et al. is delivered at it Iing is proposed in Generation with Visual Attention " (2015 meeting paper of cvpr) will long memory network in short-term The Image Description Methods of (Long Short-TermMemory, LSTM) in conjunction with attention mechanism.This method is in decoding process Different weights is distributed the different location of image, to give different attention rates to the object of different location.This method is raw At more accurate iamge description, it was demonstrated that long memory network (Long Short-TermMemory, LSTM) and attention in short-term The validity that mechanism combines.But the shortcoming that this method still has is, the long memory network (Long in short-term of single layer Short-TermMemory, LSTM) a variety of responsibilities such as sentence generates, image weights are distributed are undertaken simultaneously, responsibility, which is obscured, to cause to give birth to At iamge description it is not accurate enough.
Paper " the Image Captioning with Semantic that Quanzeng You et al. is delivered at it The middle proposition of Attention " (2016 meeting paper of cvpr) is by semantic attribute, characteristics of image simultaneously in conjunction with attention mechanism Image Description Methods.As semantic attribute, then this method chooses highest 1000 words of the frequency of occurrences in lexicon first Semantic attribute after the input layer and output layer of decoder introduce weighting.This method demonstrates while by semantic attribute, image Validity of the feature in conjunction with attention mechanism.But the shortcoming that this method still has is, different images are corresponding to be retouched Otherness is too small between stating, and the description of generation is stiff, templating.
Summary of the invention
The object of the present invention is to overcome the above-mentioned deficiencies of the prior art and propose an image description method based on a Tri-LSTMs model. The present invention can effectively improve the accuracy and diversity of image descriptions.
The technical idea of the invention is: first, build and train an RPN convolutional neural network model and a Faster-RCNN network model; then, build and train the Tri-LSTMs model; finally, use the pre-trained Faster-RCNN network model to extract image regions, input the image regions into the Tri-LSTMs model, and generate the image description.
The specific steps for realizing the object of the invention are as follows:
(1) Generate the training set and map word vectors:
(1a) Select at least 80,000 samples from an image dataset with image descriptions to form the training set; each selected sample is an image-description pair, and each pair contains one image and five corresponding image descriptions;
(1b) If the image description of each sample in the training set consists of multiple English words, count the frequency of occurrence of all English words in the image descriptions of all samples and sort them in descending order; select the top 1000 words, map each selected word to its corresponding 300-dimensional GLOVE word vector, and store the vectors in the computer;
(2) Build the RPN convolutional neural network model and the Faster-RCNN network model:
(2a) Build an RPN convolutional neural network model consisting of eight convolutional layers and one Softmax layer, and set the parameters of each layer;
(2b) Build a Faster-RCNN network model consisting of five convolutional layers, one ROI pooling layer, four fully connected layers and one Softmax layer, and set the parameters of each layer;
(3) Train the RPN convolutional neural network and the Fast-RCNN convolutional neural network:
Using an alternating training method, alternately train the RPN convolutional neural network and the Fast-RCNN convolutional neural network to obtain the trained RPN convolutional neural network and Fast-RCNN convolutional neural network;
(4) Extract the fully connected layer features of each sample image in the training set:
(4a) Input each sample image in the training set in turn into the trained RPN convolutional neural network, and output the positions of all target candidate boxes in each sample image and the type of target in each box;
(4b) Input the image region in each target candidate box into a resnet101 network pre-trained on the ImageNet database, and store all fully connected layer features output by the last fully connected layer of the network in the computer;
(5) Construct the Tri-LSTMs model:
(5a) Form the semantic LSTM module from one long short-term memory network LSTM followed by one attention network, the LSTM containing 1024 neurons;
(5b) Form the visual LSTM module from one long short-term memory network LSTM followed by one attention network, the LSTM containing 1024 neurons;
(5c) Form the language LSTM module from one long short-term memory network LSTM followed by one fully connected layer, the LSTM containing 1024 neurons and the number of neurons of the fully connected layer being set to the total number of words contained in all image descriptions of the training set;
(5d) Form the Tri-LSTMs model by connecting the semantic LSTM module, the visual LSTM module and the language LSTM module in sequence;
(6) Train the Tri-LSTMs model:
(6a) At each time step, take the word at the corresponding position in the training sample's image description as input and, starting from time zero, train the Tri-LSTMs model;
(6b) Read all the fully connected layer features output by the last layer of the resnet101 network stored in the computer in step (4b), and take the average of all fully connected layer features as the feature vector;
(6c) Add the feature vector to the word vector of the word at the current time step in the image description, and input the sum into the long short-term memory network LSTM in the semantic LSTM module; the LSTM propagates forward and outputs its hidden state;
(6d) Read the 1000 300-dimensional GLOVE word vectors stored in the computer in step (1), and input them into the attention network of the semantic LSTM module; after forward propagation, the attention network outputs the weighted GLOVE word vector;
(6e) Add the hidden state of the semantic LSTM module at the current time step to the output of the attention network in the semantic LSTM module, and take the resulting sum vector as the output of the semantic LSTM module;
(6f) Input the sum vector output by the semantic LSTM module into the long short-term memory network LSTM in the visual LSTM module; the LSTM propagates forward and outputs its hidden state;
(6g) Read all the fully connected layer features output by the last layer of the resnet101 network stored in the computer in step (4b), and input them into the attention network of the visual LSTM module; after forward propagation, the attention network outputs the weighted fully connected layer feature vector;
(6h) Add the hidden state of the visual LSTM module at the current time step to the output of the attention network in the visual LSTM module, and take the resulting sum vector as the output of the visual LSTM module;
(6i) Input the sum vector output by the semantic LSTM module into the long short-term memory network LSTM in the language LSTM module; the LSTM propagates forward and outputs its hidden state, which is input into the fully connected layer to output the probability vector of the word at the next time step;
(6j) Judge whether there is a word at the next time step of the image description; if so, compute the cross-entropy loss between the word probability vector and the word vector at the next time step of the image description and then return to step (6b); otherwise, go to step (6k);
(6k) Add the cross-entropy losses of all time steps to obtain the total loss, optimize all parameters of the model with the BP algorithm to minimize the total loss, and stop training when the total loss converges, obtaining the trained Tri-LSTMs model;
(7) Generate the image description:
(7a) Input a natural image into the pre-trained Faster-RCNN, and output target candidate boxes;
(7b) Input the image regions in the target candidate boxes into the trained resnet101 network, and output fully connected layer image features;
(7c) Input the fully connected layer image features into the Tri-LSTMs model, and generate the image description.
The present invention compared with prior art, has the advantage that
First, the Tri-LSTMs model constructed by the present invention combines three long short-term memory networks (LSTMs). This overcomes the shortcoming of the prior art, which generates image descriptions with only a single LSTM, so that the model structure is too simple and sufficiently accurate image descriptions cannot be generated. By combining multiple LSTMs, the present invention can effectively improve the accuracy of image descriptions and has the advantage of stronger generalization ability.
Second, the present invention uses both the fully connected layer features of the image and the 300-dimensional GLOVE word vectors of the words as inputs to the Tri-LSTMs model. This overcomes the problem of the prior art, which uses only the image's fully connected layer features as model input, so that the usable effective information is too limited and the generated image descriptions lack diversity. As a result, the image descriptions generated by the present invention have the advantage of being more diverse.
Brief description of the drawings
Fig. 1 is the flow chart of the present invention;
Fig. 2 is the structure chart of the Tri-LSTMs model constructed in the present invention.
Fig. 3 is analogous diagram of the invention.
Specific embodiment
The present invention will be further described with reference to the accompanying drawings.
Referring to Fig. 1, the steps for realizing the present invention are further described.
Step 1: generate the training set and map word vectors.
Select at least 80,000 samples from an image dataset with image descriptions to form the training set. Each selected sample is an image-description pair, and each pair contains one image and five corresponding image descriptions. An image description refers to the attributes and positions of the objects in the image and the relationships between them.
If the image description of each sample in the training set consists of multiple English words, count the frequency of occurrence of all English words in the image descriptions of all samples and sort them in descending order. Select the top 1000 words, map each selected word to its corresponding 300-dimensional GLOVE word vector, and store the vectors in the computer.
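Purely as an illustration, a minimal Python sketch of this step is given below; the GLOVE file name, the whitespace tokenization and the zero-vector fallback are assumptions, not details specified by the patent.

```python
from collections import Counter

import numpy as np

def build_semantic_vocab(captions, glove_path="glove.6B.300d.txt", top_k=1000):
    """captions: list of caption strings (five per training image)."""
    # Count word frequencies over all captions and keep the top_k most frequent.
    counts = Counter(word for cap in captions for word in cap.lower().split())
    top_words = [w for w, _ in counts.most_common(top_k)]
    wanted = set(top_words)

    # Load the pre-trained 300-dimensional GLOVE vectors of the selected words.
    glove = {}
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if parts[0] in wanted:
                glove[parts[0]] = np.asarray(parts[1:], dtype=np.float32)

    # Stack into a (top_k, 300) matrix; words missing from GLOVE fall back to zeros.
    vectors = np.stack([glove.get(w, np.zeros(300, np.float32)) for w in top_words])
    return top_words, vectors
```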
Step 2: build the RPN convolutional neural network model and the Faster-RCNN network model.
Build an RPN convolutional neural network model consisting of eight convolutional layers and one Softmax layer, and set the parameters of each layer; the convolution kernel size of each layer is 3*3.
Build a Faster-RCNN network model consisting of five convolutional layers, one ROI pooling layer, four fully connected layers and one Softmax layer, and set the parameters of each layer; the convolution kernel size of each layer is 3*3.
Step 3: train the RPN convolutional neural network and the Fast-RCNN convolutional neural network.
Using an alternating training method, alternately train the RPN convolutional neural network and the Fast-RCNN convolutional neural network to obtain the trained RPN convolutional neural network and Fast-RCNN convolutional neural network.
The steps of the alternating training method are as follows (a sketch of the full schedule is given after this list):
Step 1: choose a random value for each parameter of the RPN convolutional neural network to perform random initialization.
Step 2: input the training sample images into the initialized RPN convolutional neural network, train the network with the backpropagation BP algorithm, and adjust the RPN convolutional neural network parameters until all parameters converge, obtaining the first trained RPN convolutional neural network.
Step 3: input the training sample images into the trained RPN convolutional neural network, and output the target candidate boxes on the training sample images.
Step 4: choose a random value for each parameter of the Fast-RCNN convolutional neural network to perform random initialization.
Step 5: input the training sample images and the target candidate boxes obtained in Step 3 into the initialized Fast-RCNN convolutional neural network, train the network with the backpropagation BP algorithm, and adjust the Fast-RCNN convolutional neural network parameters until all parameters converge, obtaining the first trained Fast-RCNN convolutional neural network.
Step 6: fix the parameters of the first five convolutional layers of the RPN convolutional neural network trained in Step 2 of this step and the parameters of the Fast-RCNN convolutional neural network trained in Step 5 of this step; input the training sample images into the trained RPN convolutional neural network, and fine-tune the unfixed parameters of the RPN convolutional neural network with the backpropagation BP algorithm until convergence, obtaining the finally trained RPN convolutional neural network model.
Step 7: input the training sample images into the RPN convolutional neural network finally trained in Step 6, and obtain again the target candidate boxes on the sample images.
Step 8: fix the parameters of the first five convolutional layers of the Fast-RCNN convolutional neural network trained in Step 5 and the parameters of the RPN convolutional neural network finally trained in Step 6; input the training sample images and the target candidate boxes obtained again in Step 7 into the Fast-RCNN convolutional neural network, and fine-tune the unfixed parameters of the Fast-RCNN convolutional neural network with the backpropagation BP algorithm until convergence, obtaining the finally trained Fast-RCNN convolutional neural network.
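As an illustrative sketch only, the alternating schedule above can be summarized in Python as follows; every helper (init_random_rpn, train_rpn, propose_boxes, ...) is a hypothetical placeholder standing in for the corresponding training procedure, not an API defined by the patent.

```python
def alternating_training(train_images, annotations):
    rpn = init_random_rpn()                                  # Step 1: random initialization
    rpn = train_rpn(rpn, train_images, annotations)          # Step 2: BP training to convergence
    boxes = propose_boxes(rpn, train_images)                 # Step 3: target candidate boxes

    frcnn = init_random_fast_rcnn()                          # Step 4: random initialization
    frcnn = train_fast_rcnn(frcnn, train_images, boxes, annotations)   # Step 5

    # Step 6: fix the first five conv layers of the RPN and the Fast-RCNN
    # parameters, fine-tune only the remaining RPN parameters.
    rpn = finetune_rpn(rpn, frcnn, train_images, annotations, freeze_shared_conv=True)

    boxes = propose_boxes(rpn, train_images)                 # Step 7: re-extract candidate boxes
    # Step 8: keep the shared layers fixed, fine-tune the remaining Fast-RCNN parameters.
    frcnn = finetune_fast_rcnn(frcnn, rpn, train_images, boxes, annotations,
                               freeze_shared_conv=True)
    return rpn, frcnn
```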
Step 4: extract the fully connected layer features of the images.
Input the sample images in the training set in turn into the trained RPN convolutional neural network, and output the positions of all target candidate boxes in each sample image and the type of target in each box.
Input the image region in each target candidate box into a resnet101 network pre-trained on the ImageNet database, and store all fully connected layer features output by the last fully connected layer of the network in the computer.
Step 5: construct the Tri-LSTMs model.
Form the semantic LSTM module from one long short-term memory network LSTM followed by one attention network; the LSTM contains 1024 neurons.
Form the visual LSTM module from one long short-term memory network LSTM followed by one attention network; the LSTM contains 1024 neurons.
Form the language LSTM module from one long short-term memory network LSTM followed by one fully connected layer; the LSTM contains 1024 neurons, and the number of neurons of the fully connected layer is set to the total number of words contained in all image descriptions of the training set.
Form the Tri-LSTMs model by connecting the semantic LSTM module, the visual LSTM module and the language LSTM module in sequence, as shown in Fig. 2.
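Purely as a structural illustration of the data flow through the three modules of Fig. 2 over one time step (the module and helper names are assumptions, not the patent's implementation), the model can be sketched in Python as:

```python
# semantic_lstm / visual_lstm / language_lstm stand for 1024-unit LSTMs, the
# *_attention helpers for the attention networks, and word_fc for the
# vocabulary-sized fully connected layer; all names are illustrative.
def tri_lstms_step(model, mean_fc_feature, word_vec, glove_vectors, fc_features, state):
    # Semantic LSTM module: LSTM over (mean image feature + current word vector),
    # plus attention over the 1000 GLOVE word vectors; its output is their sum.
    h_sem, state.sem = model.semantic_lstm(mean_fc_feature + word_vec, state.sem)
    sem_out = h_sem + model.semantic_attention(glove_vectors, h_sem)

    # Visual LSTM module: LSTM over the semantic output, plus attention over
    # the fully connected region features; its output is their sum.
    h_vis, state.vis = model.visual_lstm(sem_out, state.vis)
    vis_out = h_vis + model.visual_attention(fc_features, h_vis)

    # Language LSTM module: LSTM followed by a fully connected layer that outputs
    # the probability vector of the next word (Step 6 of the method feeds it the
    # sum vector output by the semantic LSTM module).
    h_lang, state.lang = model.language_lstm(sem_out, state.lang)
    next_word_probs = model.word_fc(h_lang)
    return next_word_probs, vis_out, state
```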
Step 6: train the Tri-LSTMs model.
Step 1: at each time step, take the word at the corresponding position in the training sample's image description as input and, starting from time zero, train the Tri-LSTMs model.
Step 2: read all the fully connected layer features output by the last layer of the resnet101 network stored in the computer in Step 4, and take the average of all fully connected layer features as the feature vector.
Step 3: add the feature vector to the word vector of the word at the current time step in the image description, and input the sum into the long short-term memory network LSTM in the semantic LSTM module; the LSTM propagates forward and outputs its hidden state.
The forward propagation of the long short-term memory network LSTM is realized according to the following formulas:
i_t = sigmoid(W_{ix} x_t + W_{ih} h_{t-1})
f_t = sigmoid(W_{fx} x_t + W_{fh} h_{t-1})
o_t = sigmoid(W_{ox} x_t + W_{oh} h_{t-1})
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_{cx} x_t + W_{ch} h_{t-1})
h_t = o_t ⊙ tanh(c_t)
where i_t denotes the input gate of the LSTM at time t; sigmoid denotes the activation function sigmoid(z) = 1/(1 + e^{-z}); e denotes exponentiation with the natural constant e as base; W_{ix} denotes the weight-transfer matrix of the input gate; x_t denotes the input of the LSTM at time t; W_{ih} denotes the weight-transfer matrix of the hidden state for the input gate; h_{t-1} denotes the hidden state of the LSTM at time t-1; f_t denotes the forget gate of the LSTM at time t; W_{fx} denotes the weight-transfer matrix of the forget gate; W_{fh} denotes the weight-transfer matrix of the hidden state for the forget gate; o_t denotes the output gate of the LSTM at time t; W_{ox} denotes the weight-transfer matrix of the output gate; W_{oh} denotes the weight-transfer matrix of the hidden state for the output gate; c_t denotes the state cell of the LSTM at time t; ⊙ denotes the element-wise product; c_{t-1} denotes the state cell of the LSTM at time t-1; tanh denotes the activation function tanh(z) = (e^z - e^{-z})/(e^z + e^{-z}); W_{cx} denotes the weight-transfer matrix of the state cell; W_{ch} denotes the weight-transfer matrix of the hidden state for the state cell; and h_t denotes the hidden state of the LSTM at time t.
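The following numpy sketch mirrors the forward-propagation formulas above (bias terms are omitted, exactly as in the formulas); it is illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W):
    """W maps the names 'ix', 'ih', ..., 'ch' to the weight-transfer matrices."""
    i_t = sigmoid(W["ix"] @ x_t + W["ih"] @ h_prev)                        # input gate
    f_t = sigmoid(W["fx"] @ x_t + W["fh"] @ h_prev)                        # forget gate
    o_t = sigmoid(W["ox"] @ x_t + W["oh"] @ h_prev)                        # output gate
    c_t = f_t * c_prev + i_t * np.tanh(W["cx"] @ x_t + W["ch"] @ h_prev)   # state cell
    h_t = o_t * np.tanh(c_t)                                               # hidden state
    return h_t, c_t
```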
Step 4: read the 1000 300-dimensional GLOVE word vectors stored in the computer in Step 1, and input them into the attention network of the semantic LSTM module; after forward propagation, the attention network outputs the weighted GLOVE word vector.
The forward propagation of the attention network is realized according to the following formulas:
a_{i,t} = tanh(W_s s_i + W_h h_t)
α_{i,t} = exp(a_{i,t}) / Σ_{k=1}^{K} exp(a_{k,t})
ŝ_t = Σ_{i=1}^{K} α_{i,t} s_i
where a_{i,t} denotes the weight value of the i-th of the 1000 300-dimensional GLOVE word vectors at time t; tanh denotes the activation function tanh(z) = (e^z - e^{-z})/(e^z + e^{-z}); e denotes exponentiation with the natural constant e as base; W_s denotes the weight-transfer matrix of the 300-dimensional GLOVE word vectors; s_i denotes the i-th of the 1000 input 300-dimensional GLOVE word vectors; W_h denotes the weight-transfer matrix of the hidden state output by the LSTM in the semantic LSTM module; h_t denotes the hidden state output by the LSTM in the semantic LSTM module at time t; ŝ_t denotes the feature vector output by the attention network of the semantic LSTM module at time t; K denotes the total number of 300-dimensional GLOVE word vectors; Σ denotes summation; and i is the index of each word vector.
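A numpy sketch of this attention network follows; the projection vector w_a that turns each tanh activation into the scalar weight a_{i,t} is an assumption, and the same form serves the visual LSTM module's attention over region features in Step 7 below.

```python
import numpy as np

def soft_attention(vectors, h_t, W_s, W_h, w_a):
    """vectors: (K, d) GLOVE word vectors (or region features); h_t: LSTM hidden state."""
    scores = np.tanh(vectors @ W_s.T + h_t @ W_h.T) @ w_a   # a_{i,t} for i = 1..K
    weights = np.exp(scores - scores.max())                  # softmax numerator (stabilized)
    weights = weights / weights.sum()                        # alpha_{i,t}
    return weights @ vectors                                 # weighted feature vector
```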
Step 5: add the hidden state of the semantic LSTM module at the current time step to the output of the attention network in the semantic LSTM module, and take the resulting sum vector as the output of the semantic LSTM module.
Step 6: input the sum vector output by the semantic LSTM module into the long short-term memory network LSTM in the visual LSTM module; the LSTM propagates forward and outputs its hidden state.
The forward propagation of the long short-term memory network LSTM is realized according to the following formulas:
i_t = sigmoid(W_{ix} x_t + W_{ih} h_{t-1})
f_t = sigmoid(W_{fx} x_t + W_{fh} h_{t-1})
o_t = sigmoid(W_{ox} x_t + W_{oh} h_{t-1})
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_{cx} x_t + W_{ch} h_{t-1})
h_t = o_t ⊙ tanh(c_t)
where i_t denotes the input gate of the LSTM at time t; sigmoid denotes the activation function sigmoid(z) = 1/(1 + e^{-z}); e denotes exponentiation with the natural constant e as base; W_{ix} denotes the weight-transfer matrix of the input gate; x_t denotes the input of the LSTM at time t; W_{ih} denotes the weight-transfer matrix of the hidden state for the input gate; h_{t-1} denotes the hidden state of the LSTM at time t-1; f_t denotes the forget gate of the LSTM at time t; W_{fx} denotes the weight-transfer matrix of the forget gate; W_{fh} denotes the weight-transfer matrix of the hidden state for the forget gate; o_t denotes the output gate of the LSTM at time t; W_{ox} denotes the weight-transfer matrix of the output gate; W_{oh} denotes the weight-transfer matrix of the hidden state for the output gate; c_t denotes the state cell of the LSTM at time t; ⊙ denotes the element-wise product; c_{t-1} denotes the state cell of the LSTM at time t-1; tanh denotes the activation function tanh(z) = (e^z - e^{-z})/(e^z + e^{-z}); W_{cx} denotes the weight-transfer matrix of the state cell; W_{ch} denotes the weight-transfer matrix of the hidden state for the state cell; and h_t denotes the hidden state of the LSTM at time t.
Step 7: read all the fully connected layer features output by the last layer of the resnet101 network stored in the computer in Step 4, and input them into the attention network of the visual LSTM module; after forward propagation, the attention network outputs the weighted fully connected layer feature vector.
The forward propagation of the attention network is realized according to the following formulas:
a_{i,t} = tanh(W_v v_i + W_h h_t)
α_{i,t} = exp(a_{i,t}) / Σ_{k=1}^{K} exp(a_{k,t})
v̂_t = Σ_{i=1}^{K} α_{i,t} v_i
where a_{i,t} denotes the weight of the i-th feature among all fully connected layer features at time t; tanh denotes the activation function tanh(z) = (e^z - e^{-z})/(e^z + e^{-z}); e denotes exponentiation with the natural constant e as base; W_v denotes the weight-transfer matrix of the fully connected layer features; v_i denotes the i-th of all fully connected layer features; W_h denotes the weight matrix of the hidden state of the LSTM in the visual LSTM module; h_t denotes the hidden state output by the LSTM in the visual LSTM module at time t; v̂_t denotes the output of the attention network in the visual LSTM module at time t; K denotes the total number of fully connected layer feature vectors; Σ denotes summation; and i is the index of each feature vector.
Step 8: add the hidden state of the visual LSTM module at the current time step to the output of the attention network in the visual LSTM module, and take the resulting sum vector as the output of the visual LSTM module.
Step 9: input the sum vector output by the semantic LSTM module into the long short-term memory network LSTM in the language LSTM module; the LSTM propagates forward and outputs its hidden state, which is input into the fully connected layer to output the probability vector of the word at the next time step.
The forward propagation of the long short-term memory network LSTM is realized according to the following formulas:
i_t = sigmoid(W_{ix} x_t + W_{ih} h_{t-1})
f_t = sigmoid(W_{fx} x_t + W_{fh} h_{t-1})
o_t = sigmoid(W_{ox} x_t + W_{oh} h_{t-1})
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_{cx} x_t + W_{ch} h_{t-1})
h_t = o_t ⊙ tanh(c_t)
where i_t denotes the input gate of the LSTM at time t; sigmoid denotes the activation function sigmoid(z) = 1/(1 + e^{-z}); e denotes exponentiation with the natural constant e as base; W_{ix} denotes the weight-transfer matrix of the input gate; x_t denotes the input of the LSTM at time t; W_{ih} denotes the weight-transfer matrix of the hidden state for the input gate; h_{t-1} denotes the hidden state of the LSTM at time t-1; f_t denotes the forget gate of the LSTM at time t; W_{fx} denotes the weight-transfer matrix of the forget gate; W_{fh} denotes the weight-transfer matrix of the hidden state for the forget gate; o_t denotes the output gate of the LSTM at time t; W_{ox} denotes the weight-transfer matrix of the output gate; W_{oh} denotes the weight-transfer matrix of the hidden state for the output gate; c_t denotes the state cell of the LSTM at time t; ⊙ denotes the element-wise product; c_{t-1} denotes the state cell of the LSTM at time t-1; tanh denotes the activation function tanh(z) = (e^z - e^{-z})/(e^z + e^{-z}); W_{cx} denotes the weight-transfer matrix of the state cell; W_{ch} denotes the weight-transfer matrix of the hidden state for the state cell; and h_t denotes the hidden state of the LSTM at time t.
Step 10: judge whether there is a word at the next time step of the image description; if so, compute the cross-entropy loss between the word probability vector and the word vector at the next time step of the image description and then return to Step 2 of this step; otherwise, go to Step 11 of this step.
The cross-entropy loss between the word probability vector and the word vector at the next time step of the image description is calculated according to the following formula:
loss = -Σ_{t=1}^{N} log P(s_t | I; θ)
where loss denotes the cross-entropy loss between the word probability vector and the word vector at the next time step of the training set image description; N denotes the total number of words in the training set image description; Σ denotes summation; t is the index of the words in the training set image description; log denotes the logarithm with the natural constant e as base; P(s_t | I; θ) denotes the word probability vector output by the Tri-LSTMs model at time t when the average of all fully connected layer features of the training set image is input; I denotes the average of all fully connected layer features of the training set image; and θ denotes all parameters of the Tri-LSTMs model.
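The following numpy sketch illustrates this loss; indexing the ground-truth word out of each probability vector and the small numerical epsilon are implementation assumptions.

```python
import numpy as np

def caption_cross_entropy(probs, targets):
    """probs: (N, vocab_size) word probability vectors; targets: (N,) ground-truth word indices."""
    picked = probs[np.arange(len(targets)), targets]   # P(s_t | I; theta) for each step t
    return -np.sum(np.log(picked + 1e-12))             # total loss over the N words
```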
Step 11: add the cross-entropy losses of all time steps to obtain the total loss, optimize all parameters of the model with the BP algorithm to minimize the total loss, and stop training when the total loss converges, obtaining the trained Tri-LSTMs model.
Step 7: generate the image description.
Input a natural image into the pre-trained Faster-RCNN, and output the target candidate boxes.
Input the image regions in the target candidate boxes into the trained resnet101 network, and output the fully connected layer image features.
Input the fully connected layer image features into the Tri-LSTMs model, and generate the image description.
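As an illustration, this inference step can be sketched as a greedy decoding loop in Python; detect_regions, extract_fc_features, word_vector and id_to_word are hypothetical placeholders, and tri_lstms_step refers to the structural sketch given after Step 5.

```python
def describe_image(image, faster_rcnn, resnet101, model, max_len=20):
    boxes = detect_regions(faster_rcnn, image)                  # target candidate boxes
    fc_features = extract_fc_features(resnet101, image, boxes)  # (K, d) region features
    mean_feature = fc_features.mean(axis=0)

    words, word_vec, state = [], model.start_word_vector, model.initial_state()
    for _ in range(max_len):
        probs, _, state = tri_lstms_step(model, mean_feature, word_vec,
                                         model.glove_vectors, fc_features, state)
        next_id = int(probs.argmax())                           # greedy word choice
        if next_id == model.end_token_id:
            break
        words.append(id_to_word(next_id))
        word_vec = word_vector(next_id)                         # feed the chosen word back in
    return " ".join(words)
```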
The effect of the present invention is further described below in combination with simulations.
1. Simulation experiment conditions:
The hardware platform of the simulation experiment of the present invention is a Dell computer with an Intel(R) Core5 processor, a main frequency of 3.20 GHz and 64 GB of memory;
The software platform of the simulation experiment of the present invention is Python 3.5 and TensorFlow 1.2.
The dataset used in the simulation experiment of the present invention is the COCO dataset, built by a Microsoft team in 2014, which can be used for image description generation; its training set and test set contain 123,287 and 40,775 images respectively.
2. Simulation content and result analysis:
In the simulation experiment, the present invention and two prior-art methods (the adaptive attention mechanism method and the scst method) are compared: the 123,287 training samples of the COCO dataset are input into the respective models for training, the 40,775 test images are then input into the trained models, and each of the three methods generates an image description for every test-set image.
The two prior arts used in the simulation experiment are as follows:
The adaptive attention mechanism method of the prior art refers to the image description generation method proposed by Jiasen Lu et al. in the paper "Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning" (CVPR 2017), referred to as the adaptive attention mechanism method.
The scst method of the prior art refers to the image description generation method proposed by Rennie et al. in the paper "Self-critical Sequence Training for Image Captioning" (CVPR 2017), referred to as the scst method.
To compare the quality of the image descriptions generated by the three methods, four evaluation metrics (BLEU-4, METEOR, ROUGE-L, CIDEr) are used to evaluate the image descriptions generated by the three methods on the COCO test set images. The results are listed in Table 1, where Net-1 denotes the image description method based on the Tri-LSTMs model of the present invention, Net-2 denotes the adaptive attention mechanism method, and Net-3 denotes the scst method.
Table 1. Quantitative comparison of the results of the present invention and the two prior arts in the simulation experiment
It can be seen from Table 1 that the network of the present invention obtains higher scores on all evaluation metrics than the adaptive attention mechanism method and the scst method, so it performs better and can generate more accurate image descriptions.
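As an illustration only (the patent does not name the evaluation toolkit), a BLEU-4 score of the kind reported in Table 1 could be computed with NLTK as follows:

```python
from nltk.translate.bleu_score import corpus_bleu

def bleu4(references, hypotheses):
    """references: per image, a list of tokenized reference captions;
    hypotheses: per image, one tokenized generated caption."""
    return corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25))

# Toy example: one image with two reference captions and one generated caption.
refs = [[["a", "dog", "runs", "on", "the", "grass"],
         ["a", "dog", "is", "running", "across", "a", "field"]]]
hyps = [["a", "dog", "runs", "on", "the", "green", "grass"]]
print(round(bleu4(refs, hyps), 4))
```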
To describe the effect of the present invention more intuitively, two images were randomly chosen from the simulation results of the present invention on the COCO test set, as shown in Fig. 3, where Fig. 3(a) and Fig. 3(b) are a natural image from the COCO test set and the corresponding image description.
It can be seen from the simulation diagram of Fig. 3 that the image descriptions generated by the present invention are accurate and describe the content of the images in detail.

Claims (7)

1. An image description method based on a Tri-LSTMs model, characterized in that a Tri-LSTMs model composed of a semantic LSTM module, a visual LSTM module and a language LSTM module is built, and a sentence describing the image content is generated for any natural image; the steps of the method are as follows:
(1) Generate the training set and map word vectors:
(1a) Select at least 80,000 samples from an image dataset with image descriptions to form the training set; each selected sample is an image-description pair, and each pair contains one image and five corresponding image descriptions;
(1b) If the image description of each sample in the training set consists of multiple English words, count the frequency of occurrence of all English words in the image descriptions of all samples and sort them in descending order; select the top 1000 words, map each selected word to its corresponding 300-dimensional GLOVE word vector, and store the vectors in the computer;
(2) Build the RPN convolutional neural network model and the Faster-RCNN network model:
(2a) Build an RPN convolutional neural network model consisting of eight convolutional layers and one Softmax layer, and set the parameters of each layer;
(2b) Build a Faster-RCNN network model consisting of five convolutional layers, one ROI pooling layer, four fully connected layers and one Softmax layer, and set the parameters of each layer;
(3) Train the RPN convolutional neural network and the Fast-RCNN convolutional neural network:
Using an alternating training method, alternately train the RPN convolutional neural network and the Fast-RCNN convolutional neural network to obtain the trained RPN convolutional neural network and Fast-RCNN convolutional neural network;
(4) Extract the fully connected layer features of each sample image in the training set:
(4a) Input each sample image in the training set in turn into the trained RPN convolutional neural network, and output the positions of all target candidate boxes in each sample image and the type of target in each box;
(4b) Input the image region in each target candidate box into a resnet101 network pre-trained on the ImageNet database, and store all fully connected layer features output by the last fully connected layer of the network in the computer;
(5) Construct the Tri-LSTMs model:
(5a) Form the semantic LSTM module from one long short-term memory network LSTM followed by one attention network, the LSTM containing 1024 neurons;
(5b) Form the visual LSTM module from one long short-term memory network LSTM followed by one attention network, the LSTM containing 1024 neurons;
(5c) Form the language LSTM module from one long short-term memory network LSTM followed by one fully connected layer, the LSTM containing 1024 neurons and the number of neurons of the fully connected layer being set to the total number of words contained in all image descriptions of the training set;
(5d) Form the Tri-LSTMs model by connecting the semantic LSTM module, the visual LSTM module and the language LSTM module in sequence;
(6) Train the Tri-LSTMs model:
(6a) At each time step, take the word at the corresponding position in the training sample's image description as input and, starting from time zero, train the Tri-LSTMs model;
(6b) Read all the fully connected layer features output by the last layer of the resnet101 network stored in the computer in step (4b), and take the average of all fully connected layer features as the feature vector;
(6c) Add the feature vector to the word vector of the word at the current time step in the image description, and input the sum into the long short-term memory network LSTM in the semantic LSTM module; the LSTM propagates forward and outputs its hidden state;
(6d) Read the 1000 300-dimensional GLOVE word vectors stored in the computer in step (1), and input them into the attention network of the semantic LSTM module; after forward propagation, the attention network outputs the weighted GLOVE word vector;
(6e) Add the hidden state of the semantic LSTM module at the current time step to the output of the attention network in the semantic LSTM module, and take the resulting sum vector as the output of the semantic LSTM module;
(6f) Input the sum vector output by the semantic LSTM module into the long short-term memory network LSTM in the visual LSTM module; the LSTM propagates forward and outputs its hidden state;
(6g) Read all the fully connected layer features output by the last layer of the resnet101 network stored in the computer in step (4b), and input them into the attention network of the visual LSTM module; after forward propagation, the attention network outputs the weighted fully connected layer feature vector;
(6h) Add the hidden state of the visual LSTM module at the current time step to the output of the attention network in the visual LSTM module, and take the resulting sum vector as the output of the visual LSTM module;
(6i) Input the sum vector output by the semantic LSTM module into the long short-term memory network LSTM in the language LSTM module; the LSTM propagates forward and outputs its hidden state, which is input into the fully connected layer to output the probability vector of the word at the next time step;
(6j) Judge whether there is a word at the next time step of the image description; if so, compute the cross-entropy loss between the word probability vector and the word vector at the next time step of the image description and then return to step (6b); otherwise, go to step (6k);
(6k) Add the cross-entropy losses of all time steps to obtain the total loss, optimize all parameters of the model with the BP algorithm to minimize the total loss, and stop training when the total loss converges, obtaining the trained Tri-LSTMs model;
(7) Generate the image description:
(7a) Input a natural image into the pre-trained Faster-RCNN, and output target candidate boxes;
(7b) Input the image regions in the target candidate boxes into the trained resnet101 network, and output fully connected layer image features;
(7c) Input the fully connected layer image features into the Tri-LSTMs model, and generate the image description.
2. The image description method based on the Tri-LSTMs model according to claim 1, characterized in that the image description in step (1a) refers to the attributes and positions of the objects in the image and the relationships between them.
3. The image description method based on the Tri-LSTMs model according to claim 1, characterized in that the steps of the alternating training method in step (3) are as follows:
First step: choose a random value for each parameter of the RPN convolutional neural network to perform random initialization;
Second step: input the training sample images into the initialized RPN convolutional neural network, train the network with the backpropagation BP algorithm, and adjust the RPN convolutional neural network parameters until all parameters converge, obtaining the first trained RPN convolutional neural network;
Third step: input the training sample images into the trained RPN convolutional neural network, and output the target candidate boxes on the training sample images;
Fourth step: choose a random value for each parameter of the Fast-RCNN convolutional neural network to perform random initialization;
Fifth step: input the training sample images and the target candidate boxes obtained in the third step into the initialized Fast-RCNN convolutional neural network, train the network with the backpropagation BP algorithm, and adjust the Fast-RCNN convolutional neural network parameters until all parameters converge, obtaining the first trained Fast-RCNN convolutional neural network;
Sixth step: fix the parameters of the first five convolutional layers of the RPN convolutional neural network trained in the second step and the parameters of the Fast-RCNN convolutional neural network trained in the fifth step, input the training sample images into the trained RPN convolutional neural network, and fine-tune the unfixed parameters of the RPN convolutional neural network with the backpropagation BP algorithm until convergence, obtaining the finally trained RPN convolutional neural network model;
Seventh step: input the training sample images into the RPN convolutional neural network finally trained in the sixth step, and obtain again the target candidate boxes on the sample images;
Eighth step: fix the parameters of the first five convolutional layers of the Fast-RCNN convolutional neural network trained in the fifth step and the parameters of the RPN convolutional neural network finally trained in the sixth step, input the training sample images and the target candidate boxes obtained again in the seventh step into the Fast-RCNN convolutional neural network, and fine-tune the unfixed parameters of the Fast-RCNN convolutional neural network with the backpropagation BP algorithm until convergence, obtaining the finally trained Fast-RCNN convolutional neural network.
4. The image description method based on the Tri-LSTMs model according to claim 1, characterized in that the forward propagation of the long short-term memory network LSTM described in step (6c), step (6f) and step (6i) is realized according to the following formulas:
i_t = sigmoid(W_{ix} x_t + W_{ih} h_{t-1})
f_t = sigmoid(W_{fx} x_t + W_{fh} h_{t-1})
o_t = sigmoid(W_{ox} x_t + W_{oh} h_{t-1})
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_{cx} x_t + W_{ch} h_{t-1})
h_t = o_t ⊙ tanh(c_t)
where i_t denotes the input gate of the LSTM at time t; sigmoid denotes the activation function sigmoid(z) = 1/(1 + e^{-z}); e denotes exponentiation with the natural constant e as base; W_{ix} denotes the weight-transfer matrix of the input gate; x_t denotes the input of the LSTM at time t; W_{ih} denotes the weight-transfer matrix of the hidden state for the input gate; h_{t-1} denotes the hidden state of the LSTM at time t-1; f_t denotes the forget gate of the LSTM at time t; W_{fx} denotes the weight-transfer matrix of the forget gate; W_{fh} denotes the weight-transfer matrix of the hidden state for the forget gate; o_t denotes the output gate of the LSTM at time t; W_{ox} denotes the weight-transfer matrix of the output gate; W_{oh} denotes the weight-transfer matrix of the hidden state for the output gate; c_t denotes the state cell of the LSTM at time t; ⊙ denotes the element-wise product; c_{t-1} denotes the state cell of the LSTM at time t-1; tanh denotes the activation function tanh(z) = (e^z - e^{-z})/(e^z + e^{-z}); W_{cx} denotes the weight-transfer matrix of the state cell; W_{ch} denotes the weight-transfer matrix of the hidden state for the state cell; and h_t denotes the hidden state of the LSTM at time t.
5. The image description method based on the Tri-LSTMs model according to claim 1, characterized in that the forward propagation of the attention network described in step (6d) is realized according to the following formulas:
a_{i,t} = tanh(W_s s_i + W_h h_t)
α_{i,t} = exp(a_{i,t}) / Σ_{k=1}^{K} exp(a_{k,t})
ŝ_t = Σ_{i=1}^{K} α_{i,t} s_i
where a_{i,t} denotes the weight value of the i-th of the 1000 300-dimensional GLOVE word vectors at time t; tanh denotes the activation function tanh(z) = (e^z - e^{-z})/(e^z + e^{-z}); e denotes exponentiation with the natural constant e as base; W_s denotes the weight-transfer matrix of the 300-dimensional GLOVE word vectors; s_i denotes the i-th of the 1000 input 300-dimensional GLOVE word vectors; W_h denotes the weight-transfer matrix of the hidden state output by the LSTM in the semantic LSTM module; h_t denotes the hidden state output by the LSTM in the semantic LSTM module at time t; ŝ_t denotes the feature vector output by the attention network of the semantic LSTM module at time t; K denotes the total number of 300-dimensional GLOVE word vectors; Σ denotes summation; and i is the index of each word vector.
6. The image description method based on the Tri-LSTMs model according to claim 1, characterized in that the forward propagation of the attention network described in step (6g) is realized according to the following formulas:
a_{i,t} = tanh(W_v v_i + W_h h_t)
α_{i,t} = exp(a_{i,t}) / Σ_{k=1}^{K} exp(a_{k,t})
v̂_t = Σ_{i=1}^{K} α_{i,t} v_i
where a_{i,t} denotes the weight of the i-th feature among all fully connected layer features at time t; tanh denotes the activation function tanh(z) = (e^z - e^{-z})/(e^z + e^{-z}); e denotes exponentiation with the natural constant e as base; W_v denotes the weight-transfer matrix of the fully connected layer features; v_i denotes the i-th of all fully connected layer features; W_h denotes the weight matrix of the hidden state of the LSTM in the visual LSTM module; h_t denotes the hidden state output by the LSTM in the visual LSTM module at time t; v̂_t denotes the output of the attention network in the visual LSTM module at time t; K denotes the total number of fully connected layer feature vectors; Σ denotes summation; and i is the index of each feature vector.
7. The image description method based on the Tri-LSTMs model according to claim 1, characterized in that the cross-entropy loss between the word probability vector and the word vector at the next time step of the image description described in step (6j) is calculated according to the following formula:
loss = -Σ_{t=1}^{N} log P(s_t | I; θ)
where loss denotes the cross-entropy loss between the word probability vector and the word vector at the next time step of the training set image description; N denotes the total number of words in the training set image description; Σ denotes summation; t is the index of the words in the training set image description; log denotes the logarithm with the natural constant e as base; P(s_t | I; θ) denotes the word probability vector output by the Tri-LSTMs model at time t when the average of all fully connected layer features of the training set image is input; I denotes the average of all fully connected layer features of the training set image; and θ denotes all parameters of the Tri-LSTMs model.
CN201910565977.0A 2019-06-27 2019-06-27 Tri-LSTMs model-based image description method Active CN110288029B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910565977.0A CN110288029B (en) 2019-06-27 2019-06-27 Tri-LSTMs model-based image description method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910565977.0A CN110288029B (en) 2019-06-27 2019-06-27 Tri-LSTMs model-based image description method

Publications (2)

Publication Number Publication Date
CN110288029A true CN110288029A (en) 2019-09-27
CN110288029B CN110288029B (en) 2022-12-06

Family

ID=68007639

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910565977.0A Active CN110288029B (en) 2019-06-27 2019-06-27 Tri-LSTMs model-based image description method

Country Status (1)

Country Link
CN (1) CN110288029B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110968725A (en) * 2019-12-03 2020-04-07 咪咕动漫有限公司 Image content description information generation method, electronic device, and storage medium
CN111144553A (en) * 2019-12-28 2020-05-12 北京工业大学 Image description method based on space-time memory attention
CN111159454A (en) * 2019-12-30 2020-05-15 浙江大学 Picture description generation method and system based on Actor-Critic generation type countermeasure network
CN111242059A (en) * 2020-01-16 2020-06-05 合肥工业大学 Method for generating unsupervised image description model based on recursive memory network
CN111275780A (en) * 2020-01-09 2020-06-12 北京搜狐新媒体信息技术有限公司 Method and device for generating person image
CN112580658A (en) * 2019-09-29 2021-03-30 ***通信集团辽宁有限公司 Image semantic description method and device, computing equipment and computer storage medium
CN113836985A (en) * 2020-06-24 2021-12-24 富士通株式会社 Image processing apparatus, image processing method, and computer-readable storage medium
CN116543289A (en) * 2023-05-10 2023-08-04 南通大学 Image description method based on encoder-decoder and Bi-LSTM attention model

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA3040165A1 (en) * 2016-11-18 2018-05-24 Salesforce.Com, Inc. Spatial attention model for image captioning
CN108875807A (en) * 2018-05-31 2018-11-23 陕西师范大学 A kind of Image Description Methods multiple dimensioned based on more attentions
CN109711465A (en) * 2018-12-26 2019-05-03 西安电子科技大学 Image method for generating captions based on MLL and ASCA-FR

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA3040165A1 (en) * 2016-11-18 2018-05-24 Salesforce.Com, Inc. Spatial attention model for image captioning
CN108875807A (en) * 2018-05-31 2018-11-23 陕西师范大学 A kind of Image Description Methods multiple dimensioned based on more attentions
CN109711465A (en) * 2018-12-26 2019-05-03 西安电子科技大学 Image method for generating captions based on MLL and ASCA-FR

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112580658A (en) * 2019-09-29 2021-03-30 ***通信集团辽宁有限公司 Image semantic description method and device, computing equipment and computer storage medium
CN112580658B (en) * 2019-09-29 2024-03-12 ***通信集团辽宁有限公司 Image semantic description method, device, computing equipment and computer storage medium
CN110968725A (en) * 2019-12-03 2020-04-07 咪咕动漫有限公司 Image content description information generation method, electronic device, and storage medium
CN110968725B (en) * 2019-12-03 2023-04-28 咪咕动漫有限公司 Image content description information generation method, electronic device and storage medium
CN111144553A (en) * 2019-12-28 2020-05-12 北京工业大学 Image description method based on space-time memory attention
CN111159454A (en) * 2019-12-30 2020-05-15 浙江大学 Picture description generation method and system based on Actor-Critic generation type countermeasure network
CN111275780A (en) * 2020-01-09 2020-06-12 北京搜狐新媒体信息技术有限公司 Method and device for generating person image
CN111275780B (en) * 2020-01-09 2023-10-17 北京搜狐新媒体信息技术有限公司 Character image generation method and device
CN111242059A (en) * 2020-01-16 2020-06-05 合肥工业大学 Method for generating unsupervised image description model based on recursive memory network
CN113836985A (en) * 2020-06-24 2021-12-24 富士通株式会社 Image processing apparatus, image processing method, and computer-readable storage medium
CN116543289A (en) * 2023-05-10 2023-08-04 南通大学 Image description method based on encoder-decoder and Bi-LSTM attention model
CN116543289B (en) * 2023-05-10 2023-11-21 南通大学 Image description method based on encoder-decoder and Bi-LSTM attention model

Also Published As

Publication number Publication date
CN110288029B (en) 2022-12-06

Similar Documents

Publication Publication Date Title
CN110288029A (en) Image Description Methods based on Tri-LSTMs model
Aneja et al. Convolutional image captioning
CN110083705B (en) Multi-hop attention depth model, method, storage medium and terminal for target emotion classification
Jiang et al. Fingerspelling Identification for Chinese Sign Language via AlexNet‐Based Transfer Learning and Adam Optimizer
Zhang et al. More is better: Precise and detailed image captioning using online positive recall and missing concepts mining
Yao et al. Describing videos by exploiting temporal structure
CN107918782A (en) A kind of method and system for the natural language for generating description picture material
CN110516085A (en) The mutual search method of image text based on two-way attention
CN108830287A (en) The Chinese image, semantic of Inception network integration multilayer GRU based on residual error connection describes method
US20220222918A1 (en) Image retrieval method and apparatus, storage medium, and device
CN108829719A (en) The non-true class quiz answers selection method of one kind and system
CN110288665A (en) Image Description Methods, computer readable storage medium based on convolutional neural networks, electronic equipment
CN108416065A (en) Image based on level neural network-sentence description generates system and method
CN111126488A (en) Image identification method based on double attention
CN109817276A (en) A kind of secondary protein structure prediction method based on deep neural network
CN111950455A (en) Motion imagery electroencephalogram characteristic identification method based on LFFCNN-GRU algorithm model
CN110276274A (en) A kind of depth characteristic spatial attitude face identification method of multitask
CN115222998B (en) Image classification method
CN108985370A (en) Automatic generation method of image annotation sentences
Hu et al. Sketch-a-classifier: Sketch-based photo classifier generation
CN117521672A (en) Method for generating continuous pictures by long text based on diffusion model
CN110096991A (en) A kind of sign Language Recognition Method based on convolutional neural networks
CN116129141A (en) Medical data processing method, apparatus, device, medium and computer program product
CN116229179A (en) Dual-relaxation image classification method based on width learning system
CN111695455A (en) Low-resolution face recognition method based on coupling discrimination manifold alignment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant