CN106650813A - Image understanding method based on depth residual error network and LSTM - Google Patents

Image understanding method based on depth residual error network and LSTM

Info

Publication number
CN106650813A
Authority
CN
China
Prior art keywords
image
lstm
residual error
depth residual
error network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611226528.6A
Other languages
Chinese (zh)
Other versions
CN106650813B (en)
Inventor
胡丹
袁东芝
余卫宇
李楚怡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201611226528.6A priority Critical patent/CN106650813B/en
Publication of CN106650813A publication Critical patent/CN106650813A/en
Application granted granted Critical
Publication of CN106650813B publication Critical patent/CN106650813B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image understanding method based on a deep residual network and an LSTM. The method comprises the following steps: first, a deep residual network model is built to extract abstract image features, which are stored as a feature matrix; an LSTM model with a dynamic attention mechanism then dynamically forms a suitable feature vector from the feature matrix; finally, the LSTM model generates natural language (English) from the feature vector. The method exploits the advantages of the deep residual network in image feature extraction and of the LSTM in time-series modeling; together they form an encoder-decoder framework that converts image content information into natural language, thereby extracting deep information from the image.

Description

An image understanding method based on a deep residual network and LSTM
Technical field
The present invention relates to image semantic understanding and the field of deep learning, and in particular to an image understanding method based on a deep residual network and LSTM (Long Short-Term Memory).
Background art
Image understanding refers to the understanding of image semantics. It is a science that takes the image as its object and knowledge as its core, studying what targets exist in an image, the relationships between those targets, and what scene the image depicts.
The input of image understanding is image data, and its output is knowledge, which belongs to the high-level content of the image processing research field. On the basis of image recognition, it further studies the properties of the individual targets in an image and their relationships, derives an understanding of the meaning of the image content and an explanation of the original objective scene, and in turn guides and plans behavior.
At present, commonly used image understanding methods are mainly based on low-level features combined with classifiers. Image processing algorithms such as the wavelet transform, the scale-invariant feature transform (SIFT), and edge extraction first extract features from the image; image recognition and reasoning algorithms such as latent Dirichlet allocation (LDA), hidden Markov models (HMM), and support vector machines (SVM) then classify the extracted features and build a semantic model. From the standpoint of algorithm implementation, the image understanding algorithms in common use suffer from poor generalization, low robustness, strong local dependence, difficult implementation, and low recognition rates.
Summary of the invention
The invention discloses an image understanding method based on a deep residual network and LSTM. The method exploits the advantages of the deep residual network in image feature extraction and of the LSTM in time-series modeling: the deep residual network and the LSTM model form an encoder-decoder framework that converts image content information into natural language, achieving the goal of extracting the deep information of an image.
The object of the present invention is achieved by the following technical scheme: an image understanding method based on a deep residual network and LSTM, characterized in that it applies a deep residual network model for extracting abstract features from the input image and an LSTM model for generating natural language from the abstract features. The method specifically comprises the following steps:
S1: download the training data sets;
S2: preprocess the data in the data sets of step S1;
S3: train the deep residual network model;
S4: train the LSTM model;
S5: extract the abstract features of the image to be recognized with the deep residual network model trained in step S3;
S6: input the features extracted in step S5 into the LSTM model trained in step S4; the LSTM model generates natural language from the features.
Preferably, the training data sets are downloaded in step S1 as follows: the ImageNet and MS-COCO public image data sets are downloaded from the two websites http://www.image-net.org and http://mscoco.org respectively. The ImageNet data set is divided into a training image set and a test image set; the MS-COCO data set is likewise divided into a training image set and a test image set, and each picture has five corresponding natural language sentences describing its content information.
Preferably, the preprocessing of step S2 covers two cases, the ImageNet data set and the MS-COCO data set:
For the ImageNet data set: each image is scaled to 256 × 256, five standard-size 224 × 224 crops are then taken from the top, middle, bottom, left, and right of the image, and each standard-size image is saved in a pair with its corresponding class; one "standard-size image-class" pair serves as one datum (a sketch of this cropping is given below);
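A minimal sketch of this five-crop preprocessing in Python with Pillow; the patent does not pin down the exact crop offsets, so the positions below are illustrative assumptions:

```python
from PIL import Image

def imagenet_crops(path):
    """Scale an image to 256 x 256, then take five 224 x 224 standard-size
    crops from the top, middle, bottom, left, and right of the image."""
    img = Image.open(path).convert("RGB").resize((256, 256))
    boxes = {                      # (left, upper, right, lower), each 224 x 224
        "top":    (16, 0, 240, 224),
        "middle": (16, 16, 240, 240),
        "bottom": (16, 32, 240, 256),
        "left":   (0, 16, 224, 240),
        "right":  (32, 16, 256, 240),
    }
    return {name: img.crop(box) for name, box in boxes.items()}
```

Each crop is then saved in a pair with the image's class label to form one "standard-size image-class" datum.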
For the MS-COCO data set, the preprocessing steps are as follows:
S2.1: each natural language sentence is saved in a pair with its corresponding image; one "image-natural sentence" pair serves as one datum;
S2.2: the image of each "image-natural sentence" pair is scaled with its aspect ratio kept unchanged and cropped to a 224 × 224 standard-size image, and the standard-size image is saved in a pair with its natural sentence; one "standard-size image-natural sentence" pair serves as one datum;
S2.3: the words occurring in all natural sentences are counted, de-duplicated, and sorted, and the total number of words is denoted K; each word is represented by a K-dimensional one-hot column vector whose entry at the word's index is 1 and whose other entries are 0; such a vector is called a word vector, and all "word, word vector" pairs constitute a dictionary DIC of length K;
S2.4: the natural sentences of the "image-natural sentence" pairs are represented with word vectors based on the dictionary DIC, so that a natural sentence y of length C can be expressed as y = {y_1, y_2, …, y_C}, y_i ∈ R^K (a sketch of steps S2.3-S2.4 follows).
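A minimal sketch of the dictionary construction and word-vector encoding of steps S2.3-S2.4 in Python; the toy sentences and helper names are illustrative, not from the patent:

```python
def build_dictionary(sentences):
    """Count, de-duplicate, and sort the words of all natural sentences, then
    map each word to a K-dimensional one-hot word vector."""
    words = sorted({w for s in sentences for w in s.lower().split()})
    K = len(words)
    dic = {}
    for idx, w in enumerate(words):
        vec = [0] * K
        vec[idx] = 1               # 1 at the word's index, 0 elsewhere
        dic[w] = vec
    return dic, K

def encode_sentence(sentence, dic):
    """Represent a natural sentence y of length C as {y_1, ..., y_C}, y_i in R^K."""
    return [dic[w] for w in sentence.lower().split()]

sentences = ["a dog runs on the grass", "a man rides a horse"]
dic, K = build_dictionary(sentences)       # dictionary DIC of length K
y = encode_sentence("a dog runs", dic)     # three one-hot vectors of length K
```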
Preferably, the deep residual network model trained in step S3 comprises 46 convolution blocks (denoted "conv + subscript"), 2 pooling layers, 1 fully connected layer, and 1 softmax classifier. In each convolution block, the data are first normalized with batch normalization (BN), a nonlinear transformation is then applied with the rectified linear unit (ReLU), and the convolution operation is carried out last (a sketch of such a block follows). Training uses stochastic gradient descent (SGD) and back-propagation (BP), with the preprocessed ImageNet data set ("standard-size image-class" pairs) as samples. For each sample, the standard-size image is propagated forward through the network and a predicted class is output after the softmax layer; the difference between the predicted class and the true class is then propagated back to the head of the network, and the stochastic gradient descent algorithm adjusts the network parameters during back-propagation. The sample-input process is repeated until the network converges.
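As a sketch, one convolution block in the BN, then ReLU, then convolution order described above might look like the following in PyTorch; the channel counts and input size are illustrative (the patent's exact configuration is given in Fig. 2):

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """One convolution block: batch normalization, then ReLU, then convolution."""
    def __init__(self, in_channels, out_channels, kernel_size, stride=1):
        super().__init__()
        self.bn = nn.BatchNorm2d(in_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size,
                              stride=stride, padding=kernel_size // 2)

    def forward(self, x):
        return self.conv(self.relu(self.bn(x)))

# e.g. "conv2_1_a, 1*1, 64, 1": 1 x 1 kernel, 64 output feature maps, stride 1
block = ConvBlock(in_channels=256, out_channels=64, kernel_size=1)
x = torch.randn(1, 256, 56, 56)
print(block(x).shape)    # torch.Size([1, 64, 56, 56])
```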
Preferably, the LSTM model trained in step S4 has a basic structure composed of LSTM neurons. The LSTM model contains C layers of LSTM neurons (C is the preset maximum length of a natural sentence) and outputs C words in sequence. Here, the preprocessed MS-COCO data set ("standard-size image-natural sentence" pairs) serves as the samples. The steps of training the LSTM model are as follows:
S4.1: the standard-size image is input into the deep residual network of step S3, and an abstract feature matrix of size 7*7*2048 = 49*2048 is extracted from the end of the conv5_3_c convolution block; it is denoted a = {a_1, …, a_L}, a_i ∈ R^D, where L = 49 and D = 2048 (a sketch of this extraction follows);
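Sketched in PyTorch, the extraction of the L × D feature matrix from the last convolution stage might look like this; torchvision's resnet50 stands in here for the patent's own 50-layer network, which is an assumption for illustration:

```python
import torch
import torch.nn as nn
import torchvision

# Keep everything up to the last convolution stage (drop avgpool and fc):
# a 224 x 224 input then yields a 7 x 7 x 2048 feature map.
resnet = torchvision.models.resnet50()
backbone = nn.Sequential(*list(resnet.children())[:-2])

x = torch.randn(1, 3, 224, 224)           # one standard-size image
fmap = backbone(x)                        # (1, 2048, 7, 7)
a = fmap.flatten(2).squeeze(0).t()        # (49, 2048): a_1..a_L, a_i in R^D
```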
S4.2: for each moment t, an image content vector is dynamically generated according to the following equations (a sketch of this attention step follows):
e_ti = f_att(a_i, h_{t-1})
α_ti = exp(e_ti) / Σ_{k=1}^{L} exp(e_tk)
ẑ_t = Σ_{i=1}^{L} α_ti a_i
where a_i is a vector of the abstract feature matrix a, h_{t-1} is the hidden state of the previous moment, f_att is an attention model based on a multilayer perceptron that automatically determines which abstract features moment t attends to more, α_ti is the weight corresponding to a_i, and ẑ_t is the dynamically generated image content vector;
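A minimal PyTorch sketch of this soft attention, assuming a single-hidden-layer perceptron for f_att; the layer widths are illustrative:

```python
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """e_ti = f_att(a_i, h_{t-1}); alpha = softmax(e); z_hat = sum_i alpha_ti * a_i."""
    def __init__(self, feat_dim=2048, hidden_dim=512, attn_dim=256):
        super().__init__()
        self.proj_a = nn.Linear(feat_dim, attn_dim)    # f_att as a small MLP
        self.proj_h = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, a, h_prev):
        # a: (L, D) feature matrix; h_prev: (hidden_dim,) previous hidden state
        e = self.score(torch.tanh(self.proj_a(a) + self.proj_h(h_prev))).squeeze(-1)
        alpha = torch.softmax(e, dim=0)                # (L,) weights, sum to 1
        z_hat = (alpha.unsqueeze(-1) * a).sum(dim=0)   # (D,) image content vector
        return z_hat, alpha

attn = SoftAttention()
a = torch.randn(49, 2048)          # L = 49, D = 2048 from the 7 x 7 x 2048 map
h_prev = torch.randn(512)
z_hat, alpha = attn(a, h_prev)
```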
S4.3: for each moment t, the forward conduction of the LSTM neuron can be expressed as (a sketch of one such step follows):
i_t = σ(W_i E y_{t-1} + U_i h_{t-1} + Z_i ẑ_t + b_i)
f_t = σ(W_f E y_{t-1} + U_f h_{t-1} + Z_f ẑ_t + b_f)
c_t = f_t c_{t-1} + i_t tanh(W_c E y_{t-1} + U_c h_{t-1} + Z_c ẑ_t + b_c)
o_t = σ(W_o E y_{t-1} + U_o h_{t-1} + Z_o ẑ_t + b_o)
h_t = o_t tanh(c_t)
where σ is the sigmoid function, σ(x) = (1 + e^{-x})^{-1}; i_t, f_t, c_t, o_t, and h_t denote the state variables of the input gate, forget gate, memory cell, output gate, and hidden layer at moment t; W_i, U_i, Z_i, W_f, U_f, Z_f, W_o, U_o, Z_o, W_c, U_c, and Z_c are weight matrices learned by the LSTM model; b_i, b_f, b_c, and b_o are bias terms learned by the LSTM model; E ∈ R^{m×K} is a randomly initialized embedding matrix, m is a constant, and y_{t-1} is the word output by the LSTM model at the previous moment. At t = 0, c_t and h_t are initialized by the following formulas:
c_0 = f_init,c((1/L) Σ_{i=1}^{L} a_i)
h_0 = f_init,h((1/L) Σ_{i=1}^{L} a_i)
where f_init,c and f_init,h are two independent multilayer perceptrons;
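Sketched in PyTorch, one forward step of this decoder and its initialization might look like the following; a custom cell is written out so the Z·ẑ_t context term is explicit, and all dimensions are illustrative:

```python
import torch
import torch.nn as nn

class AttnLSTMCell(nn.Module):
    """LSTM cell whose four gates also receive the image content vector z_hat."""
    def __init__(self, embed_dim, hidden_dim, feat_dim):
        super().__init__()
        self.w = nn.Linear(embed_dim, 4 * hidden_dim)               # W_*, b_*
        self.u = nn.Linear(hidden_dim, 4 * hidden_dim, bias=False)  # U_*
        self.z = nn.Linear(feat_dim, 4 * hidden_dim, bias=False)    # Z_*

    def forward(self, Ey_prev, h_prev, c_prev, z_hat):
        gates = self.w(Ey_prev) + self.u(h_prev) + self.z(z_hat)
        i, f, o, g = gates.chunk(4, dim=-1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c_prev + i * torch.tanh(g)   # c_t = f_t c_{t-1} + i_t tanh(...)
        h = o * torch.tanh(c)                # h_t = o_t tanh(c_t)
        return h, c

# c_0 and h_0 come from two independent MLPs applied to the mean feature vector.
feat_dim, hidden_dim, embed_dim = 2048, 512, 256
f_init_c = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.Tanh())
f_init_h = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.Tanh())
a = torch.randn(49, feat_dim)
c0, h0 = f_init_c(a.mean(dim=0)), f_init_h(a.mean(dim=0))

cell = AttnLSTMCell(embed_dim, hidden_dim, feat_dim)
h1, c1 = cell(torch.randn(embed_dim), h0, c0, torch.randn(feat_dim))
```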
S4.4: for each moment t, the output word y_t is obtained by solving the following optimization problem:
min(−log p(y_t | a, y_{t-1}) + λ Σ_{i=1}^{L} (1 − Σ_{t=1}^{C} α_ti)^2)
where λ is a constant and C is the maximum length of a natural sentence in the samples;
S4.5: the difference between the predicted natural sentence and the natural sentence in the sample is computed with the cross-entropy loss, and the model is then trained with the back-propagation (BP) algorithm and the RMSProp-based stochastic gradient descent (SGD) algorithm so as to minimize the cross entropy (a sketch of this loss follows).
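A minimal sketch of this objective in PyTorch: the cross entropy over the C output words plus the λ Σ_i (1 − Σ_t α_ti)² attention term from step S4.4; the tensor shapes and λ value are illustrative:

```python
import torch
import torch.nn.functional as F

def caption_loss(logits, targets, alphas, lam=1.0):
    """Cross entropy between predicted and sample sentences, plus the
    regularizer lam * sum_i (1 - sum_t alpha_ti)^2, which pushes the total
    attention each image location receives over the sentence toward 1."""
    # logits: (C, K) word scores; targets: (C,) word indices; alphas: (C, L)
    ce = F.cross_entropy(logits, targets)         # -log p(y_t | a, y_{t-1}), averaged
    reg = ((1.0 - alphas.sum(dim=0)) ** 2).sum()  # sum over the L locations
    return ce + lam * reg

C, K, L = 16, 10000, 49
logits = torch.randn(C, K, requires_grad=True)
targets = torch.randint(0, K, (C,))
alphas = torch.softmax(torch.randn(C, L), dim=1)
loss = caption_loss(logits, targets, alphas)
loss.backward()   # BP; in training, step an RMSProp optimizer afterwards
```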
S4.6: steps S4.1-S4.5 are repeated for each sample in the MS-COCO data set;
S4.7: steps S4.1-S4.6 are repeated 20 times.
Preferably, the specific steps of extracting the features of the image to be recognized in step S5 are:
S7.1: the image is preprocessed in the manner used for the ImageNet data set in step S2;
S7.2: the preprocessed image is input into the deep residual network trained in step S3, and the abstract feature matrix is extracted from the end of the bottom convolution block; its size is 7*7*2048 = 49*2048.
Preferably, in step S6 the LSTM model generates a natural sentence from the image features: for each moment t, where 0 ≤ t < C, a word is generated using steps S4.1-S4.4, and all the words are connected in sequence to form the natural sentence (an end-to-end decoding sketch follows).
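A compact greedy-decoding sketch of steps S5-S6, assuming the simplified attention of the earlier sketch (a linear score instead of the full MLP f_att); with randomly initialized weights it emits arbitrary words, so it only illustrates the control flow:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def generate_sentence(a, idx2word, C=16, hid=512, emb=256, K=10000):
    """For t = 0..C-1: attend over the feature matrix a (L x D), advance the
    LSTM one step, take the most probable word, and join the words."""
    feat_dim = a.size(1)
    attn_h = nn.Linear(hid, feat_dim)          # simplified attention score
    cell = nn.LSTMCell(emb + feat_dim, hid)    # gates see [E y_{t-1}; z_hat]
    embed = nn.Embedding(K, emb)
    out = nn.Linear(hid, K)

    h, c = torch.zeros(1, hid), torch.zeros(1, hid)
    word = torch.zeros(1, dtype=torch.long)    # start-of-sentence token index
    words = []
    for t in range(C):
        e = a @ attn_h(h).squeeze(0)                            # (L,) scores
        alpha = torch.softmax(e, dim=0)
        z_hat = (alpha.unsqueeze(-1) * a).sum(0, keepdim=True)  # (1, D)
        h, c = cell(torch.cat([embed(word), z_hat], dim=-1), (h, c))
        word = out(h).argmax(dim=-1)                            # y_t
        words.append(idx2word.get(word.item(), "<unk>"))
    return " ".join(words)

a = torch.randn(49, 2048)                      # feature matrix from step S5
print(generate_sentence(a, idx2word={}))
```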
Compared with the prior art, the present invention has the following advantages and beneficial effects:
1. The method applies deep learning theory and trains the deep residual network model and the LSTM model with a large number of image samples; it can automatically learn general patterns in images, and it is robust and widely applicable.
2. The deep residual network adopted by the method has a deep 50-layer structure that can fully extract the abstract features of an image. At the same time, the method employs an LSTM model, which can properly model time series such as natural language and convert the feature vector into natural language. Combining the deep residual network with the LSTM network clearly improves the accuracy of image understanding.
3. The invention introduces a dynamic attention mechanism that can dynamically generate a suitable feature vector from the feature matrix extracted by the deep residual network, so that the LSTM can dynamically focus on different positions of the image.
Description of the drawings
Fig. 1 is a flow chart of an image understanding method based on a deep residual network and LSTM according to an embodiment of the present invention;
Fig. 2 shows the structure of the deep residual network of step (3) of the method;
Fig. 3 shows the concrete structure of a convolution block in the deep residual network model of step (3);
Fig. 4 shows the structure of an LSTM neuron in the LSTM model of step (4).
Specific embodiment
The present invention is described in further detail below with reference to an embodiment and the accompanying drawings, but the embodiments of the present invention are not limited thereto.
Embodiment
The flow chart of the method of the present invention is shown in Fig. 1; the method comprises the following steps:
(1) Download the training data sets: download the ImageNet and MS-COCO public image data sets from the two websites http://www.image-net.org and http://mscoco.org respectively. The ImageNet data set is divided into a training image set and a test image set; the training image set contains 1000 classes with 1300 pictures per class, and the test image set contains 50000 pictures. The MS-COCO data set is divided into a training image set and a test image set; the training image set contains 82783 pictures, the test image set contains 40504 pictures, and each picture has five corresponding natural language sentences describing its content information.
(2) Preprocess:
For the ImageNet data set: each image is scaled to 256 × 256, five standard-size 224 × 224 crops are then taken from the top, middle, bottom, left, and right of the image, and each standard-size image is saved in a pair with its corresponding class; one "standard-size image-class" pair serves as one datum;
For the MS-COCO data set, the preprocessing steps are as follows:
2.1. Each natural language sentence is saved in a pair with its corresponding image; one "image-natural sentence" pair serves as one datum;
2.2. The image of each "image-natural sentence" pair is scaled with its aspect ratio kept unchanged and cropped to a 224 × 224 standard-size image, and the standard-size image is saved in a pair with its natural sentence; one "standard-size image-natural sentence" pair serves as one datum;
2.3. The words occurring in all natural sentences are counted, de-duplicated, and sorted, and the total number of words is denoted K; each word is represented by a K-dimensional one-hot column vector whose entry at the word's index is 1 and whose other entries are 0; such a vector is called a word vector, and all "word, word vector" pairs constitute a dictionary DIC of length K;
2.4. The natural sentences of the "image-natural sentence" pairs are represented with word vectors based on the dictionary DIC, so that a natural sentence y of length C can be expressed as y = {y_1, y_2, …, y_C}, y_i ∈ R^K.
(3) Train the deep residual network model: the structure of the deep residual network is shown in Fig. 2. It comprises 46 convolution blocks (denoted "conv + subscript"), 2 pooling layers, 1 fully connected layer, and a softmax classifier. In each convolution block, the data are first normalized with batch normalization (BN), a nonlinear transformation is then applied with the rectified linear unit (ReLU), and the convolution operation is carried out last. Training uses stochastic gradient descent (SGD) and back-propagation (BP), with the preprocessed ImageNet data set ("standard-size image-class" pairs) as samples. The concrete parameters are indicated in Fig. 2; for example, "conv2_1_a, 1*1, 64, 1" means that the convolution block is named conv2_1_a, its convolution kernel size is 1 × 1, its stride is 1, and it outputs 64 feature maps. A sketch of the training loop follows.
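A minimal sketch of this SGD + back-propagation loop in PyTorch; torchvision's resnet50 and the unspecified data loader stand in for the patent's 50-layer network and the preprocessed "standard-size image-class" pairs, which is an assumption for illustration:

```python
import torch
import torch.nn as nn
import torchvision

model = torchvision.models.resnet50(num_classes=1000)  # stand-in for Fig. 2's network
criterion = nn.CrossEntropyLoss()   # difference between predicted and true class
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

def train_epoch(loader):
    """One pass over the 'standard-size image-class' samples; repeat until convergence."""
    for images, labels in loader:
        optimizer.zero_grad()
        logits = model(images)       # forward propagation, class scores at the end
        loss = criterion(logits, labels)
        loss.backward()              # propagate the error back to the network head
        optimizer.step()             # SGD adjusts the network parameters
```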
(4) Train the LSTM model: Fig. 4 shows the LSTM neuron, of which the basic structure of the LSTM model is composed. The LSTM model contains C layers of LSTM neurons (C is the preset maximum length of a natural sentence) and outputs C words in sequence. Here, the preprocessed MS-COCO data set ("standard-size image-natural sentence" pairs) serves as the samples. The steps of training the LSTM model are as follows:
4.1. The standard-size image is input into the deep residual network of step (3), and an abstract feature matrix of size 7*7*2048 = 49*2048 is extracted from the end of the conv5_3_c convolution block; it is denoted a = {a_1, …, a_L}, a_i ∈ R^D, where L = 49 and D = 2048;
4.2. For each moment t, an image content vector is dynamically generated according to the following equations:
e_ti = f_att(a_i, h_{t-1})
α_ti = exp(e_ti) / Σ_{k=1}^{L} exp(e_tk)
ẑ_t = Σ_{i=1}^{L} α_ti a_i
where a_i is a vector of the abstract feature matrix a, h_{t-1} is the hidden state of the previous moment, f_att is an attention model based on a multilayer perceptron that automatically determines which abstract features moment t attends to more, α_ti is the weight corresponding to a_i, and ẑ_t is the dynamically generated image content vector;
4.3. For each moment t, the forward conduction of the LSTM neuron can be expressed as:
i_t = σ(W_i E y_{t-1} + U_i h_{t-1} + Z_i ẑ_t + b_i)
f_t = σ(W_f E y_{t-1} + U_f h_{t-1} + Z_f ẑ_t + b_f)
c_t = f_t c_{t-1} + i_t tanh(W_c E y_{t-1} + U_c h_{t-1} + Z_c ẑ_t + b_c)
o_t = σ(W_o E y_{t-1} + U_o h_{t-1} + Z_o ẑ_t + b_o)
h_t = o_t tanh(c_t)
where σ is the sigmoid function, σ(x) = (1 + e^{-x})^{-1}; i_t, f_t, c_t, o_t, and h_t denote the state variables of the input gate, forget gate, memory cell, output gate, and hidden layer at moment t; W_i, U_i, Z_i, W_f, U_f, Z_f, W_o, U_o, Z_o, W_c, U_c, and Z_c are weight matrices learned by the LSTM model; b_i, b_f, b_c, and b_o are bias terms learned by the LSTM model; E ∈ R^{m×K} is a randomly initialized embedding matrix, m is a constant, and y_{t-1} is the word output by the LSTM model at the previous moment. At t = 0, c_t and h_t are initialized by the following formulas:
c_0 = f_init,c((1/L) Σ_{i=1}^{L} a_i)
h_0 = f_init,h((1/L) Σ_{i=1}^{L} a_i)
where f_init,c and f_init,h are two independent multilayer perceptrons;
4.4. For each moment t, the output word y_t is obtained by solving the following optimization problem:
min(−log p(y_t | a, y_{t-1}) + λ Σ_{i=1}^{L} (1 − Σ_{t=1}^{C} α_ti)^2)
where λ is a constant and C is the maximum length of a natural sentence in the samples;
4.5. The difference between the predicted natural sentence and the natural sentence in the sample is computed with the cross-entropy loss, and the model is then trained with the back-propagation (BP) algorithm and the RMSProp-based stochastic gradient descent (SGD) algorithm so as to minimize the cross entropy.
4.6. Steps 4.1-4.5 are repeated for each sample in the MS-COCO data set.
4.7. Steps 4.1-4.6 are repeated 20 times.
(5) Extract the abstract features of the image to be recognized with the deep residual network model trained in step (3): the image is first preprocessed in the manner used for the ImageNet data set in step (2), the preprocessed image is then input into the deep residual network trained in step (3), and the abstract feature matrix is extracted from the end of the bottom convolution block; its size is 7*7*2048 = 49*2048.
(6) Input the abstract features extracted in step (5) into the LSTM model trained in step (4): for each moment t, where 0 ≤ t < C, a word is generated using steps 4.1-4.4, and all the words are connected in sequence to form the natural sentence.
The above embodiment is a preferred embodiment of the present invention, but the embodiments of the present invention are not limited by the above embodiment. Any other change, modification, substitution, combination, or simplification made without departing from the spirit and principles of the present invention is an equivalent replacement and is included within the protection scope of the present invention.

Claims (8)

1. An image understanding method based on a deep residual network and LSTM, characterized in that it applies a deep residual network model for extracting abstract features from an input image and an LSTM model for generating natural language from the abstract features; the method specifically comprises the following steps:
S1: download the training data sets;
S2: preprocess the data in the data sets of step S1;
S3: train the deep residual network model;
S4: train the LSTM model;
S5: extract the abstract features of the image to be recognized with the deep residual network model trained in step S3;
S6: input the features extracted in step S5 into the LSTM model trained in step S4; the LSTM model generates a natural sentence from the features.
2. The image understanding method based on a deep residual network and LSTM according to claim 1, characterized in that the data sets of step S1 are the two downloaded public image data sets ImageNet and MS-COCO.
3. The image understanding method based on a deep residual network and LSTM according to claim 1, characterized in that the preprocessing of step S2 covers two cases, the ImageNet data set and the MS-COCO data set:
For the ImageNet data set: each image is scaled to 256 × 256, five standard-size 224 × 224 crops are then taken from the top, middle, bottom, left, and right of the image, and each standard-size image is saved in a pair with its corresponding class; one "standard-size image-class" pair serves as one datum;
For the MS-COCO data set, the preprocessing steps are as follows:
S2.1: each natural language sentence is saved in a pair with its corresponding image; one "image-natural sentence" pair serves as one datum;
S2.2: the image of each "image-natural sentence" pair is scaled with its aspect ratio kept unchanged and cropped to a 224 × 224 standard-size image, and the standard-size image is saved in a pair with its natural sentence; one "standard-size image-natural sentence" pair serves as one datum;
S2.3: the words occurring in all natural sentences are counted, de-duplicated, and sorted, and the total number of words is denoted K; each word is represented by a K-dimensional one-hot column vector whose entry at the word's index is 1 and whose other entries are 0; such a vector is called a word vector, and all "word, word vector" pairs constitute a dictionary DIC of length K;
S2.4: the natural sentences of the "image-natural sentence" pairs are represented with word vectors based on the dictionary DIC, so that a natural sentence y of length C can be expressed as y = {y_1, y_2, …, y_C}, y_i ∈ R^K.
4. The image understanding method based on a deep residual network and LSTM according to claim 1, characterized in that the structure of the deep residual network model of step S3 comprises multiple convolution blocks, pooling layers, a fully connected layer, and a softmax classifier; in each convolution block, the data are first normalized with batch normalization, a nonlinear transformation is then applied with the rectified linear unit, and the convolution operation is carried out last.
5. The image understanding method based on a deep residual network and LSTM according to claim 1 or 4, characterized in that training the deep residual network model in step S3 uses stochastic gradient descent and back-propagation, with the "standard-size image-class" pairs of the preprocessed ImageNet data set as samples; for each sample, the standard-size image is propagated forward through the network and a predicted class is output after the softmax layer; the difference between the predicted class and the true class is then propagated back to the head of the network, and the stochastic gradient descent algorithm adjusts the network parameters during back-propagation; the sample-input process is repeated until the network converges.
6. The image understanding method based on a deep residual network and LSTM according to claim 1, characterized in that in step S4 the LSTM model contains C layers of LSTM neurons, where C is the preset maximum length of a natural sentence, and outputs C words in sequence; the "standard-size image-natural sentence" pairs of the preprocessed MS-COCO data set serve as samples; the steps of training the LSTM model are as follows:
S4.1: the standard-size image is input into the deep residual network of step S3, and an abstract feature matrix of size 7*7*2048 = 49*2048 is extracted from the end of the bottom convolution block; it is denoted a = {a_1, …, a_L}, a_i ∈ R^D, where L = 49, D = 2048, and 0 ≤ i ≤ L;
S4.2: for each moment t, an image content vector is dynamically generated according to the following equations:
e_ti = f_att(a_i, h_{t-1})
α_ti = exp(e_ti) / Σ_{k=1}^{L} exp(e_tk)
ẑ_t = Σ_{i=1}^{L} α_ti a_i
where a_i is a vector of the abstract feature matrix a, h_{t-1} is the hidden state of the previous moment, f_att is an attention model based on a multilayer perceptron that automatically determines which abstract features moment t attends to more, α_ti is the weight corresponding to a_i, and ẑ_t is the dynamically generated image content vector;
S4.3: for each moment t, the forward conduction of the LSTM neuron can be expressed as:
i_t = σ(W_i E y_{t-1} + U_i h_{t-1} + Z_i ẑ_t + b_i)
f_t = σ(W_f E y_{t-1} + U_f h_{t-1} + Z_f ẑ_t + b_f)
c_t = f_t c_{t-1} + i_t tanh(W_c E y_{t-1} + U_c h_{t-1} + Z_c ẑ_t + b_c)
o_t = σ(W_o E y_{t-1} + U_o h_{t-1} + Z_o ẑ_t + b_o)
h_t = o_t tanh(c_t)
where σ is the sigmoid function, σ(x) = (1 + e^{-x})^{-1}; i_t, f_t, c_t, o_t, and h_t denote the state variables of the input gate, forget gate, memory cell, output gate, and hidden layer at moment t; W_i, U_i, Z_i, W_f, U_f, Z_f, W_o, U_o, Z_o, W_c, U_c, and Z_c are weight matrices learned by the LSTM model; b_i, b_f, b_c, and b_o are bias terms learned by the LSTM model; E ∈ R^{m×K} is a randomly initialized embedding matrix, m is a constant, and y_{t-1} is the word output by the LSTM model at the previous moment; at t = 0, c_t and h_t are initialized by the following formulas:
c_0 = f_init,c((1/L) Σ_{i=1}^{L} a_i)
h_0 = f_init,h((1/L) Σ_{i=1}^{L} a_i)
where f_init,c and f_init,h are two independent multilayer perceptrons;
S4.4: for each moment t, the output word y_t is obtained by solving the following optimization problem:
min(−log(p(y_t | a, y_{t-1})) + λ Σ_{i=1}^{L} (1 − Σ_{t=1}^{C} α_ti)^2)
where λ is a constant and C is the maximum length of a natural sentence in the samples;
S4.5: for each moment t, the difference between the predicted natural sentence and the natural sentence in the sample is computed with the cross-entropy loss, and the model is then trained with the back-propagation algorithm and the RMSProp-based stochastic gradient descent algorithm so as to minimize the cross entropy;
S4.6: steps S4.1-S4.5 are repeated for each sample in the MS-COCO data set;
S4.7: steps S4.1-S4.6 are repeated 20 times.
7. The image understanding method based on a deep residual network and LSTM according to claim 1, characterized in that the specific steps of extracting the features of the image to be recognized in step S5 are:
S7.1: the image is preprocessed in the manner used for the ImageNet data set in step S2;
S7.2: the preprocessed image is input into the deep residual network trained in step S3, and the abstract feature matrix is extracted from the end of the bottom convolution block; its size is 7*7*2048 = 49*2048.
8. The image understanding method based on a deep residual network and LSTM according to claim 1, characterized in that in step S6 the LSTM model generates a natural sentence from the features: for each moment t, where 0 ≤ t < C, a word is generated using steps S4.1-S4.4, and all the words are connected in sequence to form the natural sentence.
CN201611226528.6A 2016-12-27 2016-12-27 A kind of image understanding method based on depth residual error network and LSTM Active CN106650813B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611226528.6A CN106650813B (en) 2016-12-27 2016-12-27 A kind of image understanding method based on depth residual error network and LSTM


Publications (2)

Publication Number Publication Date
CN106650813A true CN106650813A (en) 2017-05-10
CN106650813B CN106650813B (en) 2019-11-15

Family

ID=58832759

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611226528.6A Active CN106650813B (en) 2016-12-27 2016-12-27 A kind of image understanding method based on depth residual error network and LSTM

Country Status (1)

Country Link
CN (1) CN106650813B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150117760A1 (en) * 2013-10-30 2015-04-30 Nec Laboratories America, Inc. Regionlets with Shift Invariant Neural Patterns for Object Detection
CN104463878A (en) * 2014-12-11 2015-03-25 南京理工大学 Novel depth image local descriptor method
CN105631479A (en) * 2015-12-30 2016-06-01 中国科学院自动化研究所 Imbalance-learning-based depth convolution network image marking method and apparatus
CN105930841A (en) * 2016-05-13 2016-09-07 百度在线网络技术(北京)有限公司 Method and device for automatic semantic annotation of image, and computer equipment

Cited By (42)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109101984B (en) * 2017-06-20 2022-04-08 北京中科奥森数据科技有限公司 Image identification method and device based on convolutional neural network
CN109101984A (en) * 2017-06-20 2018-12-28 北京中科奥森数据科技有限公司 A kind of image-recognizing method and device based on convolutional neural networks
CN107368831A (en) * 2017-07-19 2017-11-21 中国人民解放军国防科学技术大学 English words and digit recognition method in a kind of natural scene image
CN107368831B (en) * 2017-07-19 2019-08-02 中国人民解放军国防科学技术大学 English words and digit recognition method in a kind of natural scene image
CN107590443A (en) * 2017-08-23 2018-01-16 上海交通大学 Limiter stage live video automatic testing method and system based on the study of depth residual error
CN107657271B (en) * 2017-09-02 2019-11-15 西安电子科技大学 Hyperspectral image classification method based on long memory network in short-term
CN107657271A (en) * 2017-09-02 2018-02-02 西安电子科技大学 Hyperspectral image classification method based on long memory network in short-term
CN107608943A (en) * 2017-09-08 2018-01-19 中国石油大学(华东) Merge visual attention and the image method for generating captions and system of semantic notice
CN109558774A (en) * 2017-09-27 2019-04-02 中国海洋大学 Object automatic recognition system based on depth residual error network and support vector machines
CN107633520A (en) * 2017-09-28 2018-01-26 福建帝视信息科技有限公司 A kind of super-resolution image method for evaluating quality based on depth residual error network
CN107844743B (en) * 2017-09-28 2020-04-28 浙江工商大学 Image multi-subtitle automatic generation method based on multi-scale hierarchical residual error network
CN107844743A (en) * 2017-09-28 2018-03-27 浙江工商大学 A kind of image multi-subtitle automatic generation method based on multiple dimensioned layering residual error network
CN107742128A (en) * 2017-10-20 2018-02-27 百度在线网络技术(北京)有限公司 Method and apparatus for output information
CN107766894B (en) * 2017-11-03 2021-01-22 吉林大学 Remote sensing image natural language generation method based on attention mechanism and deep learning
CN107766894A (en) * 2017-11-03 2018-03-06 吉林大学 Remote sensing images spatial term method based on notice mechanism and deep learning
CN108090558A (en) * 2018-01-03 2018-05-29 华南理工大学 A kind of automatic complementing method of time series missing values based on shot and long term memory network
CN108090558B (en) * 2018-01-03 2021-06-08 华南理工大学 Automatic filling method for missing value of time sequence based on long-term and short-term memory network
CN108111860A (en) * 2018-01-11 2018-06-01 安徽优思天成智能科技有限公司 Video sequence lost frames prediction restoration methods based on depth residual error network
CN108111860B (en) * 2018-01-11 2020-04-14 安徽优思天成智能科技有限公司 Video sequence lost frame prediction recovery method based on depth residual error network
CN108427729A (en) * 2018-02-23 2018-08-21 浙江工业大学 Large-scale picture retrieval method based on depth residual error network and Hash coding
WO2019169816A1 (en) * 2018-03-09 2019-09-12 中山大学 Deep neural network for fine recognition of vehicle attributes, and training method thereof
CN108416059B (en) * 2018-03-22 2021-05-18 北京市商汤科技开发有限公司 Training method and device of image description model, equipment and medium
CN108416059A (en) * 2018-03-22 2018-08-17 北京市商汤科技开发有限公司 Training method and device, equipment, medium, the program of image description model
CN110321755A (en) * 2018-03-28 2019-10-11 中移(苏州)软件技术有限公司 A kind of recognition methods and device
CN109670164A (en) * 2018-04-11 2019-04-23 东莞迪赛软件技术有限公司 Healthy the analysis of public opinion method based on the more word insertion Bi-LSTM residual error networks of deep layer
CN108648195A (en) * 2018-05-09 2018-10-12 联想(北京)有限公司 A kind of image processing method and device
CN108921911B (en) * 2018-08-01 2021-03-09 中国科学技术大学 Method for automatically converting structured picture into source code
CN108921911A (en) * 2018-08-01 2018-11-30 中国科学技术大学 The method that structuring picture is automatically converted to source code
CN109146858B (en) * 2018-08-03 2021-09-17 诚亿电子(嘉兴)有限公司 Secondary checking method for problem points of automatic optical checking equipment
CN109146858A (en) * 2018-08-03 2019-01-04 诚亿电子(嘉兴)有限公司 The secondary method of calibration of automatic optical inspection device problem
CN109117781A (en) * 2018-08-07 2019-01-01 北京飞搜科技有限公司 Method for building up, device and the more attribute recognition approaches of more attribute Recognition Models
CN109117781B (en) * 2018-08-07 2020-09-08 北京一维大成科技有限公司 Multi-attribute identification model establishing method and device and multi-attribute identification method
CN109559799A (en) * 2018-10-12 2019-04-02 华南理工大学 The construction method and the model of medical image semantic description method, descriptive model
CN109543699A (en) * 2018-11-28 2019-03-29 北方工业大学 Image abstract generation method based on target detection
CN109846477A (en) * 2019-01-29 2019-06-07 北京工业大学 A kind of brain electricity classification method based on frequency band attention residual error network
CN109948691B (en) * 2019-03-14 2022-02-18 齐鲁工业大学 Image description generation method and device based on depth residual error network and attention
CN109948691A (en) * 2019-03-14 2019-06-28 齐鲁工业大学 Iamge description generation method and device based on depth residual error network and attention
CN110032739A (en) * 2019-04-18 2019-07-19 清华大学 Chinese electronic health record name entity abstracting method and system
WO2020248841A1 (en) * 2019-06-13 2020-12-17 平安科技(深圳)有限公司 Au detection method and apparatus for image, and electronic device and storage medium
CN111667495A (en) * 2020-06-08 2020-09-15 北京环境特性研究所 Image scene analysis method and device
CN114338199A (en) * 2021-12-30 2022-04-12 广东工业大学 Attention mechanism-based malicious flow detection method and system
CN114338199B (en) * 2021-12-30 2024-01-09 广东工业大学 Malicious traffic detection method and system based on attention mechanism

Also Published As

Publication number Publication date
CN106650813B (en) 2019-11-15

Similar Documents

Publication Publication Date Title
CN106650813A (en) Image understanding method based on depth residual error network and LSTM
US20200380213A1 (en) Multitask Learning As Question Answering
Nwankpa et al. Activation functions: Comparison of trends in practice and research for deep learning
Caterini et al. Deep neural networks in a mathematical framework
LeCun et al. Deep learning
US20220043972A1 (en) Answer generating device, answer learning device, answer generating method, and answer generating program
Abe Neural networks and fuzzy systems: theory and applications
CN107871014A (en) A kind of big data cross-module state search method and system based on depth integration Hash
CN107688850A (en) A kind of deep neural network compression method
US20210232753A1 (en) Ml using n-gram induced input representation
CN112131886A (en) Method for analyzing aspect level emotion of text
CN111353040A (en) GRU-based attribute level emotion analysis method
Du et al. Efficient network construction through structural plasticity
CN114254645A (en) Artificial intelligence auxiliary writing system
CN109948163B (en) Natural language semantic matching method for dynamic sequence reading
CN107562729A (en) The Party building document representation method strengthened based on neutral net and theme
Varshitha et al. Natural language processing using convolutional neural network
CN112732879B (en) Downstream task processing method and model of question-answering task
JPWO2019187696A1 (en) Vectorizers, language processing methods and programs
Harikrishnan et al. Handwritten digit recognition with feed-forward multi-layer perceptron and convolutional neural network architectures
Jin et al. Improving deep belief networks via delta rule for sentiment classification
Habeeb et al. Reducing error rate of deep learning using auto encoder and genetic algorithms
Damadi et al. The Backpropagation algorithm for a math student
WO2022164613A1 (en) Ml using n-gram induced input representation
CN114464267A (en) Method and device for model training and product prediction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant