CN111737511B - Image description method based on self-adaptive local concept embedding - Google Patents

Image description method based on self-adaptive local concept embedding

Info

Publication number
CN111737511B
CN111737511B (application CN202010554218.7A)
Authority
CN
China
Prior art keywords
concept
local
sentence
image
adaptive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010554218.7A
Other languages
Chinese (zh)
Other versions
CN111737511A (en)
Inventor
王溢
王振宁
许金泉
曾尔曼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanqiang Zhishi Xiamen Technology Co ltd
Original Assignee
Nanqiang Zhishi Xiamen Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanqiang Zhishi Xiamen Technology Co ltd filed Critical Nanqiang Zhishi Xiamen Technology Co ltd
Priority to CN202010554218.7A priority Critical patent/CN111737511B/en
Publication of CN111737511A publication Critical patent/CN111737511A/en
Application granted granted Critical
Publication of CN111737511B publication Critical patent/CN111737511B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50: Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/5866: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/216: Parsing using statistical methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image description method based on adaptive local concept embedding, belonging to the technical field of artificial intelligence and comprising the following steps: step 1, using a target detector to extract a plurality of candidate regions of the image to be described and the features corresponding to those candidate regions; step 2, feeding the features extracted in step 1 into a trained neural network, which outputs a description of the image to be described. Conventional attention-based image description methods do not explicitly model the relationship between local regions and concepts. To address this shortcoming, the method adaptively generates visual regions and visual concepts through a context mechanism, thereby strengthening the link between vision and language and improving the accuracy of the generated descriptions.

Description

Image description method based on self-adaptive local concept embedding
Technical Field
The invention relates to automatic image description in the field of artificial intelligence, and in particular to an image description model based on adaptive local concept embedding that describes the objective content of an image in natural language.
Background
Automatic image description (image captioning) is a task proposed in recent years in the artificial intelligence field as a test of machine intelligence: given an image, describe its objective content in natural language. With the development of computer vision, completing tasks such as object detection, recognition and segmentation no longer satisfies practical production needs, and there is an urgent demand for automatically and objectively describing image content. Unlike object detection or semantic segmentation, automatic image description must describe, as a whole and in natural language, the objects in the image, their attributes, the relationships among them, and the corresponding scene. The task is one of the important directions of computer vision understanding and is regarded as an important milestone of artificial intelligence.
In the past, automatic image description was achieved mainly by template-based and retrieval-based methods. Inspired by advances in natural language processing, the task has made great progress recently, starting with the use of encoder-decoder frameworks, attention mechanisms, and reinforcement-learning-based objective functions.
Xu et al. [1] first introduced an attention mechanism into the image description task to embed important visual attributes and scenes into the description generator. Since then, much work has focused on improving attention mechanisms. For example, Chen et al. [2] proposed a spatial and channel-wise attention mechanism to select salient regions and salient semantic patterns; Lu et al. [3] proposed the concept of a visual sentinel to decide whether to attend to visual or textual information at the next step, greatly improving model accuracy; Anderson et al. [4] first obtained regions with a pre-trained object detector and then fed them into the model to generate image captions. However, these methods only focus on task-specific context and visual features, and do not explicitly model the relationship between visual features and concepts.
The references referred to are as follows:
[1] Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhutdinov, R.; Zemel, R.; and Bengio, Y. 2015. Show, attend and tell: Neural image caption generation with visual attention. In ICML.
[2] Chen, L.; Zhang, H.; Xiao, J.; Nie, L.; Shao, J.; Liu, W.; and Chua, T.-S. 2017. SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning. In CVPR.
[3] Lu, J.; Xiong, C.; Parikh, D.; and Socher, R. 2017. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In CVPR.
[4] Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; and Zhang, L. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR.
Disclosure of the Invention
The invention aims to provide an image description method based on adaptive local concept embedding. Addressing the shortcoming that conventional attention-based image description methods do not explicitly model the relationship between local regions and concepts, it proposes a scheme that adaptively generates visual regions and the corresponding visual concepts through a context mechanism, thereby strengthening the link between vision and language and improving description accuracy.
In order to achieve the above purpose, the solution of the invention is as follows:
An image description method based on adaptive local concept embedding comprises the following steps:
step 1, using a target detector to extract a plurality of candidate regions of the image to be described and the features corresponding to the candidate regions;
step 2, feeding the features extracted in step 1 into a trained neural network, which outputs the description of the image to be described; wherein the global loss function of the neural network is obtained as follows:
step A1, preprocessing the text content in the training set to obtain sentence sequences; for the images in the training set, using the target detector to extract a plurality of candidate regions and the features V = {v_1, v_2, ..., v_k} corresponding to each candidate region, where v_i ∈ R^d, i = 1, 2, ..., k, and d is the dimension of each feature vector;
step A2, feeding the features V into the adaptive guidance-signal generation layer to generate the adaptive guidance signal;
step A3, using an attention mechanism with the adaptive guidance signal to obtain local visual features and, from them, the local concept;
step A4, embedding the local concept into the generation model by a vector-splitting method to obtain the current output word;
step A5, iteratively generating the whole sentence and defining the loss function of the generated sentence.
In step 1, the target detector is trained as follows: the detector uses the Faster R-CNN framework with a deep convolutional residual network as its backbone; it is first trained end to end on the classical object detection dataset PASCAL VOC 2007, and the network parameters are then fine-tuned on the multi-modal dataset Visual Genome.
In step A1, the text content in the training set is preprocessed into sentence sequences as follows: first, stop-word processing is applied to the text and all English words are lowercased; the text is then split on spaces, and words whose frequency in the dataset descriptions is below a threshold are removed and replaced with "<UNK>"; finally, a start token "<BOS>" and an end token "<END>" are added at the beginning and end of each sentence, respectively.
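For illustration, a minimal Python sketch of this preprocessing step is given below. The special tokens "<UNK>", "<BOS>" and "<END>" come from the text above; the function name, the stop-word handling, and the default threshold of five occurrences (the value used later in the detailed description) are assumptions of this sketch, not part of the claimed method.

```python
from collections import Counter

def build_vocab_and_sequences(captions, min_freq=5, stop_words=()):
    """Preprocess training captions as described in step A1 (illustrative sketch).

    captions: list of raw English caption strings.
    min_freq: words appearing fewer than this many times become "<UNK>".
    """
    # Lowercase, drop stop words, and split on spaces.
    tokenized = []
    for caption in captions:
        tokens = [w for w in caption.lower().split(" ") if w and w not in stop_words]
        tokenized.append(tokens)

    # Count word frequencies over the whole training set.
    counts = Counter(w for tokens in tokenized for w in tokens)

    # Replace rare words with "<UNK>" and add sentence boundary tokens.
    sequences = []
    for tokens in tokenized:
        kept = [w if counts[w] >= min_freq else "<UNK>" for w in tokens]
        sequences.append(["<BOS>"] + kept + ["<END>"])

    vocab = sorted({w for seq in sequences for w in seq})
    word_to_index = {w: i for i, w in enumerate(vocab)}
    return sequences, word_to_index
```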
In step A2, the features V are fed into the adaptive guidance-signal generation layer to produce the adaptive guidance signal. The corresponding formulas are reproduced only as images in the original publication and are not shown here; in them, t denotes the t-th word of the sentence sequence, W_e is the word-embedding matrix, x_t denotes the index of the word input at time t, the layer input is built from these quantities, and the layer outputs the guidance signal for step t.
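Because the step A2 formulas survive only as images, the following PyTorch sketch shows one common way such a guidance-signal layer is built in attention-based captioning models: an LSTM cell fed with the embedded input word, the mean region feature, and the previous decoder hidden state, whose hidden state serves as the guidance signal. The class name, the input composition, and all dimensions are assumptions, not the patent's exact equations.

```python
import torch
import torch.nn as nn

class GuidanceSignalLayer(nn.Module):
    """Illustrative sketch of an adaptive guidance-signal generation layer (step A2)."""

    def __init__(self, vocab_size, embed_dim=512, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.word_embedding = nn.Embedding(vocab_size, embed_dim)  # plays the role of W_e
        self.lstm = nn.LSTMCell(embed_dim + feat_dim + hidden_dim, hidden_dim)

    def forward(self, x_t, V, h_dec_prev, state=None):
        # x_t: (batch,) word indices at time t; V: (batch, k, feat_dim) region features.
        v_mean = V.mean(dim=1)                      # global image context
        w_t = self.word_embedding(x_t)              # embedded input word
        lstm_input = torch.cat([w_t, v_mean, h_dec_prev], dim=1)
        h_t, c_t = self.lstm(lstm_input, state)     # h_t is the guidance signal
        return h_t, (h_t, c_t)
```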
The specific process of step A3 is as follows (the formulas themselves are given only as images in the original publication):
First, an attention distribution over the k candidate regions is computed from the region features V and the adaptive guidance signal, where W_v1 ∈ R^{k×d} and W_h1 ∈ R^{k×d} are parameters to be learned, I ∈ R^k is a vector whose elements are all 1, and the Softmax function is the normalized exponential function. This yields the importance of each candidate region, from which the local visual feature that the current model attends to is obtained.
The attended local visual feature is then passed through a pre-trained concept detection layer W_vc with activation function σ, giving the visual concept that the model attends to.
The obtained visual concept is used to modify the adaptive guidance signal: it is concatenated with the guidance signal (concatenation is denoted [;]) and mapped by a parameter matrix W_h that needs to be trained.
The attention step is then iterated with the modified guidance signal until the final local concept is obtained, where W_v2 ∈ R^{k×d} and W_h2 ∈ R^{k×d} are parameters to be learned, I ∈ R^k is again a vector of ones, and Softmax is the normalized exponential function.
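The step A3 formulas are likewise available only as images; the sketch below gives a standard additive-attention reading of the surrounding text: learned projections of the region features and of the guidance signal, a softmax over the k regions, a weighted sum giving the attended visual feature, a concept-detection layer producing the visual concept, and a second attention pass with the concept-refined guidance signal. Layer names, dimensions, and the choice of sigmoid for the activation σ are assumptions.

```python
import torch
import torch.nn as nn

class LocalConceptExtractor(nn.Module):
    """Illustrative sketch of local concept extraction (step A3)."""

    def __init__(self, feat_dim=2048, hidden_dim=512, attn_dim=512, num_concepts=1000):
        super().__init__()
        self.proj_v = nn.Linear(feat_dim, attn_dim)       # role of W_v1 / W_v2
        self.proj_h = nn.Linear(hidden_dim, attn_dim)     # role of W_h1 / W_h2
        self.attn_score = nn.Linear(attn_dim, 1)
        self.concept_layer = nn.Linear(feat_dim, num_concepts)          # role of W_vc
        self.refine = nn.Linear(hidden_dim + num_concepts, hidden_dim)  # role of W_h

    def attend(self, V, h):
        # V: (batch, k, feat_dim), h: (batch, hidden_dim)
        scores = self.attn_score(torch.tanh(self.proj_v(V) + self.proj_h(h).unsqueeze(1)))
        alpha = torch.softmax(scores, dim=1)                 # importance of each region
        v_hat = (alpha * V).sum(dim=1)                       # attended local visual feature
        concept = torch.sigmoid(self.concept_layer(v_hat))   # attended visual concept
        return v_hat, concept

    def forward(self, V, h_guidance):
        # First pass: attend with the raw guidance signal.
        _, concept_1 = self.attend(V, h_guidance)
        # Modify the guidance signal with the first-pass concept ([h; concept] -> W_h).
        h_refined = torch.tanh(self.refine(torch.cat([h_guidance, concept_1], dim=1)))
        # Second pass with the refined guidance signal gives the final local concept.
        v_hat, concept_final = self.attend(V, h_refined)
        return v_hat, concept_final
```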
The specific process of step A4 is as follows (the formulas are given only as images in the original publication):
First, vector splitting is performed, where diag(.) denotes vector diagonalization and x_t denotes the index of the word input at time t; the local concept is split into two components whose information is embedded into the input word and the hidden state, respectively.
The module input that embeds the local concept is then formed by a vector concatenation operation (denoted [;]).
Next, the embedded input is mapped through the gates of the recurrent unit:
i_t = σ(W_i E_i), f_t = σ(W_f E_f)
o_t = σ(W_o E_o), c_t = σ(W_c E_c)
followed by the cell-state and hidden-state updates (given only as images in the original publication), where W_i, E_i, W_f, E_f, W_o, E_o, W_c and E_c are all parameter matrices that need to be trained.
Finally, the probability distribution of the next word is obtained by mapping the hidden state to the vocabulary with the parameter matrix W_y that needs to be trained.
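Since the step A4 equations are only available as images, the sketch below shows one plausible reading of the vector-splitting idea: the local concept is projected into two factors that modulate the embedded input word and the previous hidden state before they enter an LSTM-style cell, whose hidden state is then mapped to the vocabulary. The modulation form, layer names, and dimensions are assumptions rather than the patent's exact formulas.

```python
import torch
import torch.nn as nn

class ConceptSplitLSTMCell(nn.Module):
    """Illustrative sketch of the local concept splitting and embedding module (step A4)."""

    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512, num_concepts=1000):
        super().__init__()
        self.word_embedding = nn.Embedding(vocab_size, embed_dim)
        # Two projections "split" the concept into a word-side and a state-side factor.
        self.concept_to_word = nn.Linear(num_concepts, embed_dim)
        self.concept_to_state = nn.Linear(num_concepts, hidden_dim)
        self.lstm = nn.LSTMCell(embed_dim + hidden_dim, hidden_dim)
        self.to_vocab = nn.Linear(hidden_dim, vocab_size)  # role of W_y

    def forward(self, x_t, concept, h_prev, c_prev):
        # Split the local concept and embed its information into the input word
        # and the hidden state (elementwise modulation stands in for diag(.)).
        word_factor = torch.sigmoid(self.concept_to_word(concept))
        state_factor = torch.sigmoid(self.concept_to_state(concept))
        e_word = self.word_embedding(x_t) * word_factor
        e_state = h_prev * state_factor
        # E = [e_word; e_state] feeds the i/f/o/c gates inside the LSTM cell.
        h_t, c_t = self.lstm(torch.cat([e_word, e_state], dim=1), (h_prev, c_prev))
        # Probability distribution of the next word.
        log_probs = torch.log_softmax(self.to_vocab(h_t), dim=1)
        return log_probs, h_t, c_t
```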
The specific process of step A5 is as follows:
For a predicted sentence Y_{1:T}, the probability of generating the whole sentence is the product of the probabilities of its words, where T is the sentence length.
The model is trained in two stages, supervised learning and reinforcement learning. In the supervised-learning stage, the cross-entropy loss with respect to a given target sentence is used. In the reinforcement-learning stage, the model is trained by reinforcement learning, with a loss defined in terms of a sentence sampled by a greedy method and a sentence sampled by the Monte Carlo method (the exact loss formulas are given only as images in the original publication).
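For reference, the standard forms of the quantities described in step A5, under the assumption that the patent follows the usual cross-entropy and self-critical (greedy-baseline) formulations, are:

```latex
% Sentence probability as a product of word probabilities
P(Y_{1:T}) = \prod_{t=1}^{T} p\big(y_t \mid y_{1:t-1}\big)

% Supervised (cross-entropy) stage, for a ground-truth sentence y^*_{1:T}
L_{XE}(\theta) = -\sum_{t=1}^{T} \log p_\theta\big(y^*_t \mid y^*_{1:t-1}\big)

% Reinforcement-learning stage (self-critical style): a Monte Carlo sample Y^{s}
% is rewarded relative to the greedily decoded sentence \hat{Y}
\nabla_\theta L_{RL}(\theta) \approx
    -\big(r(Y^{s}) - r(\hat{Y})\big)\,\nabla_\theta \log p_\theta\big(Y^{s}\big)
```

In this stage the reward r(.) is typically a sentence-level caption metric such as CIDEr, although the patent does not name the reward explicitly.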
After the above scheme is adopted, the invention has the following outstanding advantages:
(1) the method explicitly models the relation between local visual regions and semantic concepts, providing an accurate link between vision and language, greatly reducing the semantic gap in the image description task, and substantially improving the accuracy and completeness of the generated sentences;
(2) the method is highly transferable: it can be applied to any attention-based image description model to improve its performance;
(3) with its improved completeness and accuracy, the method is mainly applied to understanding the visual concepts of a given picture and automatically generating a description for it, and has broad application prospects in image retrieval, navigation for the blind, automatic generation of medical reports, and early education.
Drawings
FIG. 1 is a flow chart of the automatic image description method based on adaptive local concept embedding of the present invention;
in FIG. 1, RAM is the local concept extraction module, LCFM is the local concept splitting and embedding module, and Attention is the attention module;
FIG. 2 compares sentences generated by different image description models;
in FIG. 2, UP-DOWN denotes the Up-Down (bottom-up and top-down attention) baseline method;
FIG. 3 shows the column-wise similarities of the mapping matrix used when embedding local concepts, computed and visualized;
FIG. 4 visualizes the regions adaptively selected by the framework of the present invention and the semantic concepts mapped from those regions;
FIG. 5 visualizes the correspondence between a given semantic concept and the visual region.
Detailed Description
The technical solution and the advantages of the present invention will be described in detail with reference to the accompanying drawings.
The invention provides an image description method based on adaptive local concept embedding. Addressing the shortcoming that conventional attention-based image description methods do not explicitly model the relationship between local regions and concepts, it adaptively generates visual regions and the corresponding visual concepts through a context mechanism, thereby strengthening the link between vision and language and improving description accuracy. The specific algorithm flow is shown in FIG. 1.
The invention comprises the following steps:
1) for the images in the image library, first extract the corresponding image features with a convolutional neural network;
2) use a recurrent neural network to map the current input word and the global image features to a hidden-layer output, which serves as the guidance signal;
3) use an attention mechanism with the guidance signal to obtain the weight of each local image feature, adaptively obtain the local visual features, and extract local concepts with a trained concept extractor;
4) build a local concept splitting module, embed the local concept into the generation model, and obtain the current output word;
5) iteratively generate the whole sentence and define the loss function of the generated sentence.
Each module is specifically as follows:
1. deep convolution feature extraction and description data preprocessing
Stop-word processing is applied to the text content of all training data, and all English words are lowercased; the text is then split on spaces, yielding a vocabulary of 9487 words; words appearing fewer than five times in the dataset descriptions are removed and replaced with "<UNK>", and a start token "<BOS>" and an end token "<END>" are added at the beginning and end of each description sentence, respectively.
First, a pre-trained target detector is used to extract a fixed set of 36 candidate regions, and a deep residual convolutional network is used to extract the feature V = {v_1, v_2, ..., v_k} corresponding to each candidate region, where v_i ∈ R^d, i = 1, 2, ..., k; here k = 36 and d = 2048.
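A minimal sketch of how such detector outputs can be packaged for the captioning model is shown below; only the shape contract (k = 36 regions, d = 2048 per region) comes from the text, while the helper name and the use of PyTorch tensors are assumptions.

```python
import torch

def pack_region_features(region_features, k=36, d=2048):
    """Stack per-region detector features into the matrix V used by the model.

    region_features: iterable of k feature vectors (one per candidate region),
    each of dimension d, e.g. pooled ResNet features from a Faster R-CNN head.
    Returns a tensor V of shape (k, d).
    """
    V = torch.stack([torch.as_tensor(f, dtype=torch.float32) for f in region_features])
    assert V.shape == (k, d), f"expected ({k}, {d}), got {tuple(V.shape)}"
    return V
```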
2. Adaptive guidance-signal generation layer
The first layer is a recurrent network that generates the adaptive guidance signal, which later guides the extraction of local visual features. Its input and processing are defined by formulas that are given only as images in the original publication; in them, t is the t-th word of the sentence sequence, W_e is the word-embedding matrix, x_t denotes the index of the word input at time t, the layer input is built from these quantities, and the layer outputs the guidance signal for step t.
3. Local concept extraction
As shown in FIG. 1, the local concept extraction layer comes next. The invention first uses the guidance signal to obtain local visual information, and from it the adaptive local concept (the formulas are given only as images in the original publication). An attention distribution over the candidate regions is computed from the region features and the guidance signal, where W_v1 ∈ R^{k×d} and W_h1 ∈ R^{k×d} are parameters to be learned, I ∈ R^k is a vector whose elements are all 1, and the Softmax function is the normalized exponential function; this yields the importance of each candidate region and hence the local visual feature that the current model attends to. The attended feature is passed through a pre-trained concept detection layer W_vc with activation function σ to obtain the visual concept the model attends to. Because the obtained concept reflects the quality of the attention step, this information is used to modify the guidance signal and thereby improve attention: the concept is concatenated with the guidance signal (concatenation is denoted [;]) and mapped by a parameter matrix W_h that needs to be trained. The same attention procedure is then repeated with the modified guidance signal to obtain the final local concept, where W_v2 ∈ R^{k×d} and W_h2 ∈ R^{k×d} are parameters to be learned, I ∈ R^k is again a vector of ones, and Softmax is the normalized exponential function.
4. Local concept splitting and embedding module
The local concept obtained above is embedded into the model by a vector-splitting method so that its information is used effectively when generating the image description (the splitting formulas are given only as images in the original publication). In the splitting step, diag(.) denotes vector diagonalization and x_t denotes the index of the word input at time t; the local concept is split into two components whose information is embedded into the input word and the hidden state, respectively. The module input that embeds the local concept is then formed by a vector concatenation operation (denoted [;]). The embedded input is mapped through the gates of the recurrent unit:
i_t = σ(W_i E_i), f_t = σ(W_f E_f)
followed by the output gate, cell-state and hidden-state updates (given only as images in the original publication), where W_i, E_i, W_f, E_f, W_o, E_o, W_c and E_c are all parameter matrices that need to be trained. Finally, the probability distribution of the next word is obtained by mapping the hidden state to the vocabulary with the parameter matrix W_y that needs to be trained.
5. Global loss function construction
For a predicted sentence Y_{1:T}, the probability of generating the whole sentence is the product of the probabilities of its words, where T is the sentence length. The invention trains the model in two stages: supervised learning and reinforcement learning. The former uses the cross-entropy loss with respect to a given target sentence; the latter is trained by reinforcement learning, with a loss defined in terms of a sentence sampled by a greedy method and a sentence sampled by the Monte Carlo method (the exact loss formulas are given only as images in the original publication).
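As a usage-level illustration of this two-stage schedule, a sketch is given below. The optimizer, epoch counts, reward function, sampling helpers, and the model's forward signature are all assumptions left unspecified by the patent; the sketch only mirrors the described structure of a cross-entropy stage followed by a reinforcement-learning stage with a greedy baseline.

```python
import torch

def train_two_stages(model, loader, xe_epochs=30, rl_epochs=30, lr=5e-4,
                     reward_fn=None, sample_fn=None, greedy_fn=None):
    """Illustrative two-stage schedule: cross-entropy first, then reinforcement learning.

    reward_fn(captions, refs) -> tensor of sentence-level rewards (e.g. CIDEr),
    sample_fn(model, images)  -> (Monte Carlo sampled captions, their summed log-probs),
    greedy_fn(model, images)  -> greedily decoded captions.
    All three are placeholders for components the patent leaves unspecified.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    # Stage 1: supervised learning with cross-entropy against the target sentence.
    for _ in range(xe_epochs):
        for images, target_tokens in loader:
            log_probs = model(images, target_tokens)          # (batch, T, vocab) log-probs
            loss = torch.nn.functional.nll_loss(
                log_probs.flatten(0, 1), target_tokens.flatten())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    # Stage 2: reinforcement learning with a greedy baseline.
    for _ in range(rl_epochs):
        for images, refs in loader:
            sampled, sampled_log_prob = sample_fn(model, images)   # Monte Carlo sample
            with torch.no_grad():
                baseline = reward_fn(greedy_fn(model, images), refs)
                reward = reward_fn(sampled, refs)
            loss = -((reward - baseline) * sampled_log_prob).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```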
The specific experimental results are as follows:
to verify the feasibility and advancement of the proposed model, we performed the evaluation of the model in the generic data set MSCOCO. The quantitative comparison with the latest image automatic description method is shown in table 1, and we can see that the performance of the proposed model has high advantages on various evaluation indexes. In addition, we can see that the text description generated by visualizing the input image, the description given by way of example is in english, and the chinese description is generated by the same automatic generation process (as shown in fig. 2), and that the model models the local visual information display, so that the model achieves obvious improvement on the image description. FIG. 3 is a pair of W*a TW*aThe results show that the method of the present invention embeds local concepts well into the model. Fig. 4 shows the visual regions concerned by the two module layers when each word is generated and the visual concept generated by the visual regions, and it can be seen that a more accurate visual concept can be obtained by modification. FIG. 5 labels the region of greatest model interest after the generation of a particular concept, which indicates that the method of the present invention can help overcome the semantic gap problem. The descriptions and concepts in fig. 2 to fig. 4 are all in english as an example, but the present invention can be directly extended to chinese description with the same mechanism.
TABLE 1. Comparison of the method of the invention with current state-of-the-art methods (B-1/B-4: BLEU-1/BLEU-4; M: METEOR; R: ROUGE-L; C: CIDEr; S: SPICE)
Model B-1 B-4 M R C S
LSTM-A 78.6 35.5 27.3 56.8 118.3 20.8
GCN-LSTM 80.5 38.2 28.5 58.5 128.3 22.0
Stack-Cap 78.6 36.1 27.4 56.9 120.4 20.9
SGAE 80.8 38.4 28.4 58.6 127.8 22.1
Up-Down 79.8 36.3 27.7 56.9 120.1 21.4
The method of the invention 80.6 39.0 28.6 58.8 128.3 22.3
The above embodiments are only intended to illustrate the technical idea of the present invention and do not thereby limit its scope of protection; any modification made to the technical solution on the basis of the technical idea of the present invention falls within the scope of protection of the present invention.

Claims (4)

1. An image description method based on self-adaptive local concept embedding, characterized by comprising the following steps:
step 1, using a target detector to extract a plurality of candidate regions of the image to be described and the features corresponding to the candidate regions;
step 2, feeding the features extracted in step 1 into a trained neural network, which outputs the description of the image to be described; wherein the global loss function of the neural network is obtained as follows:
step A1, preprocessing the text content in the training set to obtain sentence sequences; for the images in the training set, using the target detector to extract a plurality of candidate regions and the features V = {v_1, v_2, ..., v_k} corresponding to each candidate region, where v_i ∈ R^d, i = 1, 2, ..., k, and d is the dimension of each feature vector;
step A2, feeding the features V into the adaptive guidance-signal generation layer to generate the adaptive guidance signal;
step A3, using an attention mechanism with the adaptive guidance signal to obtain local visual features and, from them, the local concept;
step A4, embedding the local concept into the generation model by a vector-splitting method to obtain the current output word;
step A5, iteratively generating the whole sentence and defining the loss function of the generated sentence;
in step A2, the adaptive guidance signal is generated from the features V by formulas that are given only as images in the original publication; in them, t denotes the t-th word of the sentence sequence, W_e is the word-embedding matrix, x_t denotes the index of the word input at time t, the layer input is built from these quantities, and the layer outputs the guidance signal for step t;
the specific process of step A3 is as follows (the formulas are given only as images in the original publication): first, an attention distribution over the candidate regions is computed from the region features and the adaptive guidance signal, with parameters W_v1 and W_h1 to be learned, a vector whose elements are all 1, and the Softmax normalized exponential function, which yields the importance of each candidate region and the local visual feature that the current model attends to; the attended feature is passed through a pre-trained concept detection layer with activation function σ to obtain the visual concept the model attends to; the obtained visual concept is used to modify the adaptive guidance signal by vector concatenation with the guidance signal and a parameter matrix W_h that needs to be trained; the attention step is then iterated with the modified guidance signal until the final local concept is obtained, with parameters W_v2 and W_h2 to be learned, a vector of ones, and the Softmax normalized exponential function;
the specific process of step A4 is as follows (the formulas are given only as images in the original publication): vector splitting is first performed, where diag(.) denotes vector diagonalization and x_t denotes the index of the word input at time t; the local concept is split into two components whose information is embedded into the input word and the hidden state, respectively; the module input that embeds the local concept is then formed by a vector concatenation operation; the embedded input is mapped through the gates of the recurrent unit, where W_i, E_i, W_f, E_f, W_o, E_o, W_c and E_c are all parameter matrices that need to be trained; finally, the probability distribution of the next word is obtained by mapping the hidden state to the vocabulary with the parameter matrix W_y that needs to be trained.
2. The image description method based on adaptive local concept embedding according to claim 1, characterized in that: in step 1, the target detector is trained as follows: the detector uses the Faster R-CNN framework with a deep convolutional residual network as its backbone; it is first trained end to end on the classical object detection dataset PASCAL VOC 2007, and the network parameters are then fine-tuned on the multi-modal dataset Visual Genome.
3. The image description method based on adaptive local concept embedding according to claim 1, characterized in that: in step A1, the text content in the training set is preprocessed into sentence sequences as follows: first, stop-word processing is applied to the text and all English words are lowercased; the text is then split on spaces, and words whose frequency in the dataset descriptions is below a threshold are removed and replaced with "<UNK>"; finally, a start token "<BOS>" and an end token "<END>" are added at the beginning and end of each sentence, respectively.
4. The image description method based on adaptive local concept embedding according to claim 1, characterized in that the specific process of step A5 is as follows: for a predicted sentence Y = Y_{1:T}, the probability of generating the whole sentence is the product of the probabilities of its words, where T is the sentence length; the model is trained in two stages, supervised learning and reinforcement learning; in the supervised-learning stage, the cross-entropy loss with respect to a given target sentence is used; in the reinforcement-learning stage, the model is trained by reinforcement learning, with a loss defined in terms of a sentence sampled by a greedy method and a sentence sampled by the Monte Carlo method (the exact loss formulas are given only as images in the original publication).
CN202010554218.7A 2020-06-17 2020-06-17 Image description method based on self-adaptive local concept embedding Active CN111737511B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010554218.7A CN111737511B (en) 2020-06-17 2020-06-17 Image description method based on self-adaptive local concept embedding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010554218.7A CN111737511B (en) 2020-06-17 2020-06-17 Image description method based on self-adaptive local concept embedding

Publications (2)

Publication Number Publication Date
CN111737511A CN111737511A (en) 2020-10-02
CN111737511B true CN111737511B (en) 2022-06-07

Family

ID=72649581

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010554218.7A Active CN111737511B (en) 2020-06-17 2020-06-17 Image description method based on self-adaptive local concept embedding

Country Status (1)

Country Link
CN (1) CN111737511B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112329794B (en) * 2020-11-06 2024-03-12 北京工业大学 Image description method based on dual self-attention mechanism
CN112819013A (en) * 2021-01-29 2021-05-18 厦门大学 Image description method based on intra-layer and inter-layer joint global representation
CN112819012B (en) * 2021-01-29 2022-05-03 厦门大学 Image description generation method based on multi-source cooperative features
CN112861988B (en) * 2021-03-04 2022-03-11 西南科技大学 Feature matching method based on attention-seeking neural network
CN113158791B (en) * 2021-03-15 2022-08-16 上海交通大学 Human-centered image description labeling method, system, terminal and medium
CN113139378B (en) * 2021-03-18 2022-02-18 杭州电子科技大学 Image description method based on visual embedding and condition normalization
CN113283248B (en) * 2021-04-29 2022-06-21 桂林电子科技大学 Automatic natural language generation method and device for scatter diagram description
CN113837233B (en) * 2021-08-30 2023-11-17 厦门大学 Image description method of self-attention mechanism based on sample self-adaptive semantic guidance
CN117423108B (en) * 2023-09-28 2024-05-24 中国科学院自动化研究所 Image fine granularity description method and system for instruction fine adjustment multi-mode large model

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR2296197A1 (en) * 1974-12-24 1976-07-23 Thomson Csf METHOD AND DEVICE USING A THERMO-OPTICAL EFFECT IN A THIN LAYER IN SMECTIC PHASE FOR THE REPRODUCTION OF IMAGES WITH MEMORY
DE102008008707A1 (en) * 2008-02-11 2009-08-13 Deutsches Zentrum für Luft- und Raumfahrt e.V. Digital image processing method, involves forming mixed model description depending upon verification, and calculating image values of processed images by considering imaging function from result of mixed model description
CN107066973A (en) * 2017-04-17 2017-08-18 杭州电子科技大学 A kind of video content description method of utilization spatio-temporal attention model
CN109376610A (en) * 2018-09-27 2019-02-22 南京邮电大学 Pedestrian's unsafe acts detection method in video monitoring based on image concept network
CN110268712A (en) * 2017-02-07 2019-09-20 皇家飞利浦有限公司 Method and apparatus for handling image attributes figure
CN110598713A (en) * 2019-08-06 2019-12-20 厦门大学 Intelligent image automatic description method based on deep neural network


Also Published As

Publication number Publication date
CN111737511A (en) 2020-10-02

Similar Documents

Publication Publication Date Title
CN111737511B (en) Image description method based on self-adaptive local concept embedding
CN111159454A (en) Picture description generation method and system based on Actor-Critic generation type countermeasure network
CN112819013A (en) Image description method based on intra-layer and inter-layer joint global representation
CN113035311B (en) Medical image report automatic generation method based on multi-mode attention mechanism
CN113837233B (en) Image description method of self-attention mechanism based on sample self-adaptive semantic guidance
CN111444367A (en) Image title generation method based on global and local attention mechanism
CN115982350A (en) False news detection method based on multi-mode Transformer
CN110968725B (en) Image content description information generation method, electronic device and storage medium
CN110991290A (en) Video description method based on semantic guidance and memory mechanism
CN113901170A (en) Event extraction method and system combining Bert model and template matching and electronic equipment
CN113283336A (en) Text recognition method and system
CN117746078B (en) Object detection method and system based on user-defined category
Wang et al. Recognizing handwritten mathematical expressions as LaTex sequences using a multiscale robust neural network
CN111680684A (en) Method, device and storage medium for recognizing spine text based on deep learning
CN110889276B (en) Method, system and computer medium for extracting pointer type extraction triplet information by complex fusion characteristics
CN112084788A (en) Automatic marking method and system for implicit emotional tendency of image captions
CN110929013A (en) Image question-answer implementation method based on bottom-up entry and positioning information fusion
CN113837231B (en) Image description method based on data enhancement of mixed sample and label
CN115982629A (en) Image description method based on semantic guidance feature selection
CN113221870B (en) OCR (optical character recognition) method, device, storage medium and equipment for mobile terminal
Rafi et al. A linear sub-structure with co-variance shift for image captioning
CN115496134A (en) Traffic scene video description generation method and device based on multi-modal feature fusion
CN113052156A (en) Optical character recognition method, device, electronic equipment and storage medium
CN113934922A (en) Intelligent recommendation method, device, equipment and computer storage medium
CN112329803A (en) Natural scene character recognition method based on standard font generation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant