CN111737511B - Image description method based on self-adaptive local concept embedding - Google Patents

Image description method based on self-adaptive local concept embedding

Info

Publication number
CN111737511B
CN111737511B (application CN202010554218.7A)
Authority
CN
China
Prior art keywords
concept
local
sentence
image
adaptive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010554218.7A
Other languages
Chinese (zh)
Other versions
CN111737511A (en)
Inventor
王溢
王振宁
许金泉
曾尔曼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanqiang Zhishi Xiamen Technology Co ltd
Original Assignee
Nanqiang Zhishi Xiamen Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanqiang Zhishi Xiamen Technology Co ltd filed Critical Nanqiang Zhishi Xiamen Technology Co ltd
Priority to CN202010554218.7A priority Critical patent/CN111737511B/en
Publication of CN111737511A publication Critical patent/CN111737511A/en
Application granted granted Critical
Publication of CN111737511B publication Critical patent/CN111737511B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50: Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/5866: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/216: Parsing using statistical methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image description method based on adaptive local concept embedding, belonging to the technical field of artificial intelligence and comprising the following steps: step 1, using a target detector to extract a plurality of candidate regions of the image to be described and the features corresponding to those candidate regions; step 2, feeding the features extracted in step 1 into a trained neural network, which outputs a description of the image to be described. Conventional attention-based image description methods do not explicitly model the relationship between local regions and concepts. To address this shortcoming, the method adaptively generates visual regions and visual concepts through a context mechanism, thereby strengthening the link between vision and language and improving the accuracy of the generated descriptions.

Description

Image description method based on self-adaptive local concept embedding
Technical Field
The invention relates to automatic image description in the field of artificial intelligence, and in particular to an image description model based on adaptive local concept embedding that describes the objective content of an image in natural language.
Background
Automatic image description (image captioning) is a task proposed in recent years in the artificial intelligence field as a test of machine intelligence: given an image, describe its objective content in natural language. With the development of computer vision, completing tasks such as object detection, recognition and segmentation no longer satisfies practical production needs, and there is an urgent demand for automatically and objectively describing image content. Unlike object detection or semantic segmentation, automatic image description must describe, as a whole and in natural language, the objects in the image, their attributes, the relationships among them, and the corresponding scene. The task is one of the important directions of computer vision understanding and is regarded as an important milestone of artificial intelligence.
In the past, automatic image description was achieved mainly by template-based and retrieval-based methods. Inspired by advances in natural language processing, the task has made great progress recently, starting with the use of encoder-decoder frameworks, attention mechanisms, and reinforcement-learning-based objective functions.
Xu et al. [1] first introduced an attention mechanism into the image description task to embed important visual attributes and scenes into the description generator. Since then, much work has focused on improving attention mechanisms. For example, Chen et al. [2] proposed a spatial and channel-wise attention mechanism to select salient regions and salient semantic patterns; Lu et al. [3] proposed the concept of a visual sentinel to decide whether to attend to visual or textual information at the next step, greatly improving model accuracy; Anderson et al. [4] first obtained regions with a pre-trained object detector and then fed them into the model to generate image captions. However, these methods only focus on task-specific context and visual features, and do not explicitly model the relationship between visual features and concepts.
The references referred to are as follows:
[1] Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhutdinov, R.; Zemel, R.; and Bengio, Y. 2015. Show, attend and tell: Neural image caption generation with visual attention. In ICML.
[2] Chen, L.; Zhang, H.; Xiao, J.; Nie, L.; Shao, J.; Liu, W.; and Chua, T.-S. 2017. SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning. In CVPR.
[3] Lu, J.; Xiong, C.; Parikh, D.; and Socher, R. 2017. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In CVPR.
[4] Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; and Zhang, L. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR.
Disclosure of the Invention
The invention aims to provide an image description method based on adaptive local concept embedding. Addressing the shortcoming that conventional attention-based image description methods do not explicitly model the relationship between local regions and concepts, it proposes a scheme that adaptively generates visual regions and the corresponding visual concepts through a context mechanism, thereby strengthening the link between vision and language and improving description accuracy.
In order to achieve the above purpose, the solution of the invention is as follows:
An image description method based on adaptive local concept embedding comprises the following steps:
step 1, using a target detector to extract a plurality of candidate regions of the image to be described and the features corresponding to the candidate regions;
step 2, feeding the features extracted in step 1 into a trained neural network, which outputs the description of the image to be described; wherein the global loss function of the neural network is obtained as follows:
step A1, preprocessing the text content in the training set to obtain sentence sequences; for the images in the training set, using the target detector to extract a plurality of candidate regions and the features V = {v_1, v_2, ..., v_k} corresponding to each candidate region, where v_i ∈ R^d, i = 1, 2, ..., k, and d is the dimension of each feature vector;
step A2, feeding the features V into the adaptive guidance-signal generation layer to generate the adaptive guidance signal;
step A3, using an attention mechanism with the adaptive guidance signal to obtain local visual features and, from them, the local concept;
step A4, embedding the local concept into the generation model by a vector-splitting method to obtain the current output word;
step A5, iteratively generating the whole sentence and defining the loss function of the generated sentence.
In step 1, the target detector is trained as follows: the detector uses the Faster R-CNN framework with a deep convolutional residual network as its backbone; it is first trained end to end on the classical object detection dataset PASCAL VOC 2007, and the network parameters are then fine-tuned on the multi-modal dataset Visual Genome.
In step A1, the text content in the training set is preprocessed into sentence sequences as follows: first, stop-word processing is applied to the text and all English words are lowercased; the text is then split on spaces, and words whose frequency in the dataset descriptions is below a threshold are removed and replaced with "<UNK>"; finally, a start token "<BOS>" and an end token "<END>" are added at the beginning and end of each sentence, respectively.
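For illustration, a minimal Python sketch of this preprocessing step is given below. The special tokens "<UNK>", "<BOS>" and "<END>" come from the text above; the function name, the stop-word handling, and the default threshold of five occurrences (the value used later in the detailed description) are assumptions of this sketch, not part of the claimed method.

```python
from collections import Counter

def build_vocab_and_sequences(captions, min_freq=5, stop_words=()):
    """Preprocess training captions as described in step A1 (illustrative sketch).

    captions: list of raw English caption strings.
    min_freq: words appearing fewer than this many times become "<UNK>".
    """
    # Lowercase, drop stop words, and split on spaces.
    tokenized = []
    for caption in captions:
        tokens = [w for w in caption.lower().split(" ") if w and w not in stop_words]
        tokenized.append(tokens)

    # Count word frequencies over the whole training set.
    counts = Counter(w for tokens in tokenized for w in tokens)

    # Replace rare words with "<UNK>" and add sentence boundary tokens.
    sequences = []
    for tokens in tokenized:
        kept = [w if counts[w] >= min_freq else "<UNK>" for w in tokens]
        sequences.append(["<BOS>"] + kept + ["<END>"])

    vocab = sorted({w for seq in sequences for w in seq})
    word_to_index = {w: i for i, w in enumerate(vocab)}
    return sequences, word_to_index
```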
In step A2, the features V are fed into the adaptive guidance-signal generation layer to produce the adaptive guidance signal. The corresponding formulas are reproduced only as images in the original publication and are not shown here; in them, t denotes the t-th word of the sentence sequence, W_e is the word-embedding matrix, x_t denotes the index of the word input at time t, the layer input is built from these quantities, and the layer outputs the guidance signal for step t.
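Because the step A2 formulas survive only as images, the following PyTorch sketch shows one common way such a guidance-signal layer is built in attention-based captioning models: an LSTM cell fed with the embedded input word, the mean region feature, and the previous decoder hidden state, whose hidden state serves as the guidance signal. The class name, the input composition, and all dimensions are assumptions, not the patent's exact equations.

```python
import torch
import torch.nn as nn

class GuidanceSignalLayer(nn.Module):
    """Illustrative sketch of an adaptive guidance-signal generation layer (step A2)."""

    def __init__(self, vocab_size, embed_dim=512, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.word_embedding = nn.Embedding(vocab_size, embed_dim)  # plays the role of W_e
        self.lstm = nn.LSTMCell(embed_dim + feat_dim + hidden_dim, hidden_dim)

    def forward(self, x_t, V, h_dec_prev, state=None):
        # x_t: (batch,) word indices at time t; V: (batch, k, feat_dim) region features.
        v_mean = V.mean(dim=1)                      # global image context
        w_t = self.word_embedding(x_t)              # embedded input word
        lstm_input = torch.cat([w_t, v_mean, h_dec_prev], dim=1)
        h_t, c_t = self.lstm(lstm_input, state)     # h_t is the guidance signal
        return h_t, (h_t, c_t)
```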
The specific process of step A3 is as follows (the formulas themselves are given only as images in the original publication):
First, an attention distribution over the k candidate regions is computed from the region features V and the adaptive guidance signal, where W_v1 ∈ R^{k×d} and W_h1 ∈ R^{k×d} are parameters to be learned, I ∈ R^k is a vector whose elements are all 1, and the Softmax function is the normalized exponential function. This yields the importance of each candidate region, from which the local visual feature that the current model attends to is obtained.
The attended local visual feature is then passed through a pre-trained concept detection layer W_vc with activation function σ, giving the visual concept that the model attends to.
The obtained visual concept is used to modify the adaptive guidance signal: it is concatenated with the guidance signal (concatenation is denoted [;]) and mapped by a parameter matrix W_h that needs to be trained.
The attention step is then iterated with the modified guidance signal until the final local concept is obtained, where W_v2 ∈ R^{k×d} and W_h2 ∈ R^{k×d} are parameters to be learned, I ∈ R^k is again a vector of ones, and Softmax is the normalized exponential function.
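The step A3 formulas are likewise available only as images; the sketch below gives a standard additive-attention reading of the surrounding text: learned projections of the region features and of the guidance signal, a softmax over the k regions, a weighted sum giving the attended visual feature, a concept-detection layer producing the visual concept, and a second attention pass with the concept-refined guidance signal. Layer names, dimensions, and the choice of sigmoid for the activation σ are assumptions.

```python
import torch
import torch.nn as nn

class LocalConceptExtractor(nn.Module):
    """Illustrative sketch of local concept extraction (step A3)."""

    def __init__(self, feat_dim=2048, hidden_dim=512, attn_dim=512, num_concepts=1000):
        super().__init__()
        self.proj_v = nn.Linear(feat_dim, attn_dim)       # role of W_v1 / W_v2
        self.proj_h = nn.Linear(hidden_dim, attn_dim)     # role of W_h1 / W_h2
        self.attn_score = nn.Linear(attn_dim, 1)
        self.concept_layer = nn.Linear(feat_dim, num_concepts)          # role of W_vc
        self.refine = nn.Linear(hidden_dim + num_concepts, hidden_dim)  # role of W_h

    def attend(self, V, h):
        # V: (batch, k, feat_dim), h: (batch, hidden_dim)
        scores = self.attn_score(torch.tanh(self.proj_v(V) + self.proj_h(h).unsqueeze(1)))
        alpha = torch.softmax(scores, dim=1)                 # importance of each region
        v_hat = (alpha * V).sum(dim=1)                       # attended local visual feature
        concept = torch.sigmoid(self.concept_layer(v_hat))   # attended visual concept
        return v_hat, concept

    def forward(self, V, h_guidance):
        # First pass: attend with the raw guidance signal.
        _, concept_1 = self.attend(V, h_guidance)
        # Modify the guidance signal with the first-pass concept ([h; concept] -> W_h).
        h_refined = torch.tanh(self.refine(torch.cat([h_guidance, concept_1], dim=1)))
        # Second pass with the refined guidance signal gives the final local concept.
        v_hat, concept_final = self.attend(V, h_refined)
        return v_hat, concept_final
```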
The specific process of step A4 is as follows (the formulas are given only as images in the original publication):
First, vector splitting is performed, where diag(.) denotes vector diagonalization and x_t denotes the index of the word input at time t; the local concept is split into two components whose information is embedded into the input word and the hidden state, respectively.
The module input that embeds the local concept is then formed by a vector concatenation operation (denoted [;]).
Next, the embedded input is mapped through the gates of the recurrent unit:
i_t = σ(W_i E_i), f_t = σ(W_f E_f)
o_t = σ(W_o E_o), c_t = σ(W_c E_c)
followed by the cell-state and hidden-state updates (given only as images in the original publication), where W_i, E_i, W_f, E_f, W_o, E_o, W_c and E_c are all parameter matrices that need to be trained.
Finally, the probability distribution of the next word is obtained by mapping the hidden state to the vocabulary with the parameter matrix W_y that needs to be trained.
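Since the step A4 equations are only available as images, the sketch below shows one plausible reading of the vector-splitting idea: the local concept is projected into two factors that modulate the embedded input word and the previous hidden state before they enter an LSTM-style cell, whose hidden state is then mapped to the vocabulary. The modulation form, layer names, and dimensions are assumptions rather than the patent's exact formulas.

```python
import torch
import torch.nn as nn

class ConceptSplitLSTMCell(nn.Module):
    """Illustrative sketch of the local concept splitting and embedding module (step A4)."""

    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512, num_concepts=1000):
        super().__init__()
        self.word_embedding = nn.Embedding(vocab_size, embed_dim)
        # Two projections "split" the concept into a word-side and a state-side factor.
        self.concept_to_word = nn.Linear(num_concepts, embed_dim)
        self.concept_to_state = nn.Linear(num_concepts, hidden_dim)
        self.lstm = nn.LSTMCell(embed_dim + hidden_dim, hidden_dim)
        self.to_vocab = nn.Linear(hidden_dim, vocab_size)  # role of W_y

    def forward(self, x_t, concept, h_prev, c_prev):
        # Split the local concept and embed its information into the input word
        # and the hidden state (elementwise modulation stands in for diag(.)).
        word_factor = torch.sigmoid(self.concept_to_word(concept))
        state_factor = torch.sigmoid(self.concept_to_state(concept))
        e_word = self.word_embedding(x_t) * word_factor
        e_state = h_prev * state_factor
        # E = [e_word; e_state] feeds the i/f/o/c gates inside the LSTM cell.
        h_t, c_t = self.lstm(torch.cat([e_word, e_state], dim=1), (h_prev, c_prev))
        # Probability distribution of the next word.
        log_probs = torch.log_softmax(self.to_vocab(h_t), dim=1)
        return log_probs, h_t, c_t
```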
The specific process of step A5 is as follows:
For a predicted sentence Y_{1:T}, the probability of generating the whole sentence is the product of the probabilities of its words, where T is the sentence length.
The model is trained in two stages, supervised learning and reinforcement learning. In the supervised-learning stage, the cross-entropy loss with respect to a given target sentence is used. In the reinforcement-learning stage, the model is trained by reinforcement learning, with a loss defined in terms of a sentence sampled by a greedy method and a sentence sampled by the Monte Carlo method (the exact loss formulas are given only as images in the original publication).
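For reference, the standard forms of the quantities described in step A5, under the assumption that the patent follows the usual cross-entropy and self-critical (greedy-baseline) formulations, are:

```latex
% Sentence probability as a product of word probabilities
P(Y_{1:T}) = \prod_{t=1}^{T} p\big(y_t \mid y_{1:t-1}\big)

% Supervised (cross-entropy) stage, for a ground-truth sentence y^*_{1:T}
L_{XE}(\theta) = -\sum_{t=1}^{T} \log p_\theta\big(y^*_t \mid y^*_{1:t-1}\big)

% Reinforcement-learning stage (self-critical style): a Monte Carlo sample Y^{s}
% is rewarded relative to the greedily decoded sentence \hat{Y}
\nabla_\theta L_{RL}(\theta) \approx
    -\big(r(Y^{s}) - r(\hat{Y})\big)\,\nabla_\theta \log p_\theta\big(Y^{s}\big)
```

In this stage the reward r(.) is typically a sentence-level caption metric such as CIDEr, although the patent does not name the reward explicitly.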
After the above scheme is adopted, the invention has the following outstanding advantages:
(1) the method explicitly models the relation between local visual regions and semantic concepts, providing an accurate link between vision and language, greatly reducing the semantic gap in the image description task, and substantially improving the accuracy and completeness of the generated sentences;
(2) the method is highly transferable: it can be applied to any attention-based image description model to improve its performance;
(3) with its improved completeness and accuracy, the method is mainly applied to understanding the visual concepts of a given picture and automatically generating a description for it, and has broad application prospects in image retrieval, navigation for the blind, automatic generation of medical reports, and early education.
Drawings
FIG. 1 is a flow chart of the automatic image description method based on adaptive local concept embedding of the present invention;
in FIG. 1, RAM is the local concept extraction module, LCFM is the local concept splitting and embedding module, and Attention is the attention module;
FIG. 2 compares sentences generated by different image description models;
in FIG. 2, UP-DOWN denotes the Up-Down (bottom-up and top-down attention) baseline method;
FIG. 3 shows the column-wise similarities of the mapping matrix used when embedding local concepts, computed and visualized;
FIG. 4 visualizes the regions adaptively selected by the framework of the present invention and the semantic concepts mapped from those regions;
FIG. 5 visualizes the correspondence between a given semantic concept and the visual region.
Detailed Description
The technical solution and the advantages of the present invention will be described in detail with reference to the accompanying drawings.
The invention provides an image description method based on adaptive local concept embedding. Addressing the shortcoming that conventional attention-based image description methods do not explicitly model the relationship between local regions and concepts, it adaptively generates visual regions and the corresponding visual concepts through a context mechanism, thereby strengthening the link between vision and language and improving description accuracy. The specific algorithm flow is shown in FIG. 1.
The invention comprises the following steps:
1) for the images in the image library, first extract the corresponding image features with a convolutional neural network;
2) use a recurrent neural network to map the current input word and the global image features to a hidden-layer output, which serves as the guidance signal;
3) use an attention mechanism with the guidance signal to obtain the weight of each local image feature, adaptively obtain the local visual features, and extract local concepts with a trained concept extractor;
4) build a local concept splitting module, embed the local concept into the generation model, and obtain the current output word;
5) iteratively generate the whole sentence and define the loss function of the generated sentence.
Each module is specifically as follows:
1. deep convolution feature extraction and description data preprocessing
Stop-word processing is applied to the text content of all training data, and all English words are lowercased; the text is then split on spaces, yielding a vocabulary of 9487 words; words appearing fewer than five times in the dataset descriptions are removed and replaced with "<UNK>", and a start token "<BOS>" and an end token "<END>" are added at the beginning and end of each description sentence, respectively.
First, a pre-trained target detector is used to extract a fixed set of 36 candidate regions, and a deep residual convolutional network is used to extract the feature V = {v_1, v_2, ..., v_k} corresponding to each candidate region, where v_i ∈ R^d, i = 1, 2, ..., k; here k = 36 and d = 2048.
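A minimal sketch of how such detector outputs can be packaged for the captioning model is shown below; only the shape contract (k = 36 regions, d = 2048 per region) comes from the text, while the helper name and the use of PyTorch tensors are assumptions.

```python
import torch

def pack_region_features(region_features, k=36, d=2048):
    """Stack per-region detector features into the matrix V used by the model.

    region_features: iterable of k feature vectors (one per candidate region),
    each of dimension d, e.g. pooled ResNet features from a Faster R-CNN head.
    Returns a tensor V of shape (k, d).
    """
    V = torch.stack([torch.as_tensor(f, dtype=torch.float32) for f in region_features])
    assert V.shape == (k, d), f"expected ({k}, {d}), got {tuple(V.shape)}"
    return V
```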
2. Adaptive guidance-signal generation layer
The first layer is a recurrent network that generates the adaptive guidance signal, which later guides the extraction of local visual features. Its input and processing are defined by formulas that are given only as images in the original publication; in them, t is the t-th word of the sentence sequence, W_e is the word-embedding matrix, x_t denotes the index of the word input at time t, the layer input is built from these quantities, and the layer outputs the guidance signal for step t.
3. Local concept extraction
As shown in FIG. 1, the local concept extraction layer comes next. The invention first uses the guidance signal to obtain local visual information, and from it the adaptive local concept (the formulas are given only as images in the original publication). An attention distribution over the candidate regions is computed from the region features and the guidance signal, where W_v1 ∈ R^{k×d} and W_h1 ∈ R^{k×d} are parameters to be learned, I ∈ R^k is a vector whose elements are all 1, and the Softmax function is the normalized exponential function; this yields the importance of each candidate region and hence the local visual feature that the current model attends to. The attended feature is passed through a pre-trained concept detection layer W_vc with activation function σ to obtain the visual concept the model attends to. Because the obtained concept reflects the quality of the attention step, this information is used to modify the guidance signal and thereby improve attention: the concept is concatenated with the guidance signal (concatenation is denoted [;]) and mapped by a parameter matrix W_h that needs to be trained. The same attention procedure is then repeated with the modified guidance signal to obtain the final local concept, where W_v2 ∈ R^{k×d} and W_h2 ∈ R^{k×d} are parameters to be learned, I ∈ R^k is again a vector of ones, and Softmax is the normalized exponential function.
4. Local concept splitting and embedding module
The local concept obtained above is embedded into the model by a vector-splitting method so that its information is used effectively when generating the image description (the splitting formulas are given only as images in the original publication). In the splitting step, diag(.) denotes vector diagonalization and x_t denotes the index of the word input at time t; the local concept is split into two components whose information is embedded into the input word and the hidden state, respectively. The module input that embeds the local concept is then formed by a vector concatenation operation (denoted [;]). The embedded input is mapped through the gates of the recurrent unit:
i_t = σ(W_i E_i), f_t = σ(W_f E_f)
followed by the output gate, cell-state and hidden-state updates (given only as images in the original publication), where W_i, E_i, W_f, E_f, W_o, E_o, W_c and E_c are all parameter matrices that need to be trained. Finally, the probability distribution of the next word is obtained by mapping the hidden state to the vocabulary with the parameter matrix W_y that needs to be trained.
5. Global loss function construction
For a predicted sentence Y_{1:T}, the probability of generating the whole sentence is the product of the probabilities of its words, where T is the sentence length. The invention trains the model in two stages: supervised learning and reinforcement learning. The former uses the cross-entropy loss with respect to a given target sentence; the latter is trained by reinforcement learning, with a loss defined in terms of a sentence sampled by a greedy method and a sentence sampled by the Monte Carlo method (the exact loss formulas are given only as images in the original publication).
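As a usage-level illustration of this two-stage schedule, a sketch is given below. The optimizer, epoch counts, reward function, sampling helpers, and the model's forward signature are all assumptions left unspecified by the patent; the sketch only mirrors the described structure of a cross-entropy stage followed by a reinforcement-learning stage with a greedy baseline.

```python
import torch

def train_two_stages(model, loader, xe_epochs=30, rl_epochs=30, lr=5e-4,
                     reward_fn=None, sample_fn=None, greedy_fn=None):
    """Illustrative two-stage schedule: cross-entropy first, then reinforcement learning.

    reward_fn(captions, refs) -> tensor of sentence-level rewards (e.g. CIDEr),
    sample_fn(model, images)  -> (Monte Carlo sampled captions, their summed log-probs),
    greedy_fn(model, images)  -> greedily decoded captions.
    All three are placeholders for components the patent leaves unspecified.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    # Stage 1: supervised learning with cross-entropy against the target sentence.
    for _ in range(xe_epochs):
        for images, target_tokens in loader:
            log_probs = model(images, target_tokens)          # (batch, T, vocab) log-probs
            loss = torch.nn.functional.nll_loss(
                log_probs.flatten(0, 1), target_tokens.flatten())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    # Stage 2: reinforcement learning with a greedy baseline.
    for _ in range(rl_epochs):
        for images, refs in loader:
            sampled, sampled_log_prob = sample_fn(model, images)   # Monte Carlo sample
            with torch.no_grad():
                baseline = reward_fn(greedy_fn(model, images), refs)
                reward = reward_fn(sampled, refs)
            loss = -((reward - baseline) * sampled_log_prob).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```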
The specific experimental results are as follows:
to verify the feasibility and advancement of the proposed model, we performed the evaluation of the model in the generic data set MSCOCO. The quantitative comparison with the latest image automatic description method is shown in table 1, and we can see that the performance of the proposed model has high advantages on various evaluation indexes. In addition, we can see that the text description generated by visualizing the input image, the description given by way of example is in english, and the chinese description is generated by the same automatic generation process (as shown in fig. 2), and that the model models the local visual information display, so that the model achieves obvious improvement on the image description. FIG. 3 is a pair of W*a TW*aThe results show that the method of the present invention embeds local concepts well into the model. Fig. 4 shows the visual regions concerned by the two module layers when each word is generated and the visual concept generated by the visual regions, and it can be seen that a more accurate visual concept can be obtained by modification. FIG. 5 labels the region of greatest model interest after the generation of a particular concept, which indicates that the method of the present invention can help overcome the semantic gap problem. The descriptions and concepts in fig. 2 to fig. 4 are all in english as an example, but the present invention can be directly extended to chinese description with the same mechanism.
TABLE 1. Comparison of the method of the invention with current state-of-the-art methods (B-1/B-4: BLEU-1/BLEU-4; M: METEOR; R: ROUGE-L; C: CIDEr; S: SPICE)
Model B-1 B-4 M R C S
LSTM-A 78.6 35.5 27.3 56.8 118.3 20.8
GCN-LSTM 80.5 38.2 28.5 58.5 128.3 22.0
Stack-Cap 78.6 36.1 27.4 56.9 120.4 20.9
SGAE 80.8 38.4 28.4 58.6 127.8 22.1
Up-Down 79.8 36.3 27.7 56.9 120.1 21.4
The method of the invention 80.6 39.0 28.6 58.8 128.3 22.3
The above embodiments are only intended to illustrate the technical idea of the present invention and do not thereby limit its scope of protection; any modification made to the technical solution on the basis of the technical idea of the present invention falls within the scope of protection of the present invention.

Claims (4)

1. An image description method based on self-adaptive local concept embedding, characterized by comprising the following steps:
step 1, using a target detector to extract a plurality of candidate regions of the image to be described and the features corresponding to the candidate regions;
step 2, feeding the features extracted in step 1 into a trained neural network, which outputs the description of the image to be described; wherein the global loss function of the neural network is obtained as follows:
step A1, preprocessing the text content in the training set to obtain sentence sequences; for the images in the training set, using the target detector to extract a plurality of candidate regions and the features V = {v_1, v_2, ..., v_k} corresponding to each candidate region, where v_i ∈ R^d, i = 1, 2, ..., k, and d is the dimension of each feature vector;
step A2, feeding the features V into the adaptive guidance-signal generation layer to generate the adaptive guidance signal;
step A3, using an attention mechanism with the adaptive guidance signal to obtain local visual features and, from them, the local concept;
step A4, embedding the local concept into the generation model by a vector-splitting method to obtain the current output word;
step A5, iteratively generating the whole sentence and defining the loss function of the generated sentence;
in step A2, the adaptive guidance signal is generated from the features V by formulas that are given only as images in the original publication; in them, t denotes the t-th word of the sentence sequence, W_e is the word-embedding matrix, x_t denotes the index of the word input at time t, the layer input is built from these quantities, and the layer outputs the guidance signal for step t;
the specific process of step A3 is as follows (the formulas are given only as images in the original publication): first, an attention distribution over the candidate regions is computed from the region features and the adaptive guidance signal, with parameters W_v1 and W_h1 to be learned, a vector whose elements are all 1, and the Softmax normalized exponential function, which yields the importance of each candidate region and the local visual feature that the current model attends to; the attended feature is passed through a pre-trained concept detection layer with activation function σ to obtain the visual concept the model attends to; the obtained visual concept is used to modify the adaptive guidance signal by vector concatenation with the guidance signal and a parameter matrix W_h that needs to be trained; the attention step is then iterated with the modified guidance signal until the final local concept is obtained, with parameters W_v2 and W_h2 to be learned, a vector of ones, and the Softmax normalized exponential function;
the specific process of step A4 is as follows (the formulas are given only as images in the original publication): vector splitting is first performed, where diag(.) denotes vector diagonalization and x_t denotes the index of the word input at time t; the local concept is split into two components whose information is embedded into the input word and the hidden state, respectively; the module input that embeds the local concept is then formed by a vector concatenation operation; the embedded input is mapped through the gates of the recurrent unit, where W_i, E_i, W_f, E_f, W_o, E_o, W_c and E_c are all parameter matrices that need to be trained; finally, the probability distribution of the next word is obtained by mapping the hidden state to the vocabulary with the parameter matrix W_y that needs to be trained.
2. The image description method based on adaptive local concept embedding according to claim 1, characterized in that: in step 1, the target detector is trained as follows: the detector uses the Faster R-CNN framework with a deep convolutional residual network as its backbone; it is first trained end to end on the classical object detection dataset PASCAL VOC 2007, and the network parameters are then fine-tuned on the multi-modal dataset Visual Genome.
3. The image description method based on adaptive local concept embedding according to claim 1, characterized in that: in step A1, the text content in the training set is preprocessed into sentence sequences as follows: first, stop-word processing is applied to the text and all English words are lowercased; the text is then split on spaces, and words whose frequency in the dataset descriptions is below a threshold are removed and replaced with "<UNK>"; finally, a start token "<BOS>" and an end token "<END>" are added at the beginning and end of each sentence, respectively.
4. The image description method based on adaptive local concept embedding according to claim 1, characterized in that the specific process of step A5 is as follows: for a predicted sentence Y = Y_{1:T}, the probability of generating the whole sentence is the product of the probabilities of its words, where T is the sentence length; the model is trained in two stages, supervised learning and reinforcement learning; in the supervised-learning stage, the cross-entropy loss with respect to a given target sentence is used; in the reinforcement-learning stage, the model is trained by reinforcement learning, with a loss defined in terms of a sentence sampled by a greedy method and a sentence sampled by the Monte Carlo method (the exact loss formulas are given only as images in the original publication).
CN202010554218.7A 2020-06-17 2020-06-17 Image description method based on self-adaptive local concept embedding Active CN111737511B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010554218.7A CN111737511B (en) 2020-06-17 2020-06-17 Image description method based on self-adaptive local concept embedding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010554218.7A CN111737511B (en) 2020-06-17 2020-06-17 Image description method based on self-adaptive local concept embedding

Publications (2)

Publication Number Publication Date
CN111737511A CN111737511A (en) 2020-10-02
CN111737511B true CN111737511B (en) 2022-06-07

Family

ID=72649581

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010554218.7A Active CN111737511B (en) 2020-06-17 2020-06-17 Image description method based on self-adaptive local concept embedding

Country Status (1)

Country Link
CN (1) CN111737511B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112329794B (en) * 2020-11-06 2024-03-12 北京工业大学 Image description method based on dual self-attention mechanism
CN112819013A (en) * 2021-01-29 2021-05-18 厦门大学 Image description method based on intra-layer and inter-layer joint global representation
CN112819012B (en) * 2021-01-29 2022-05-03 厦门大学 Image description generation method based on multi-source cooperative features
CN112861988B (en) * 2021-03-04 2022-03-11 西南科技大学 Feature matching method based on attention-seeking neural network
CN113158791B (en) * 2021-03-15 2022-08-16 上海交通大学 Human-centered image description labeling method, system, terminal and medium
CN113139378B (en) * 2021-03-18 2022-02-18 杭州电子科技大学 Image description method based on visual embedding and condition normalization
CN113283248B (en) * 2021-04-29 2022-06-21 桂林电子科技大学 Automatic natural language generation method and device for scatter diagram description
CN113837233B (en) * 2021-08-30 2023-11-17 厦门大学 Image description method of self-attention mechanism based on sample self-adaptive semantic guidance
CN117423108B (en) * 2023-09-28 2024-05-24 中国科学院自动化研究所 Image fine granularity description method and system for instruction fine adjustment multi-mode large model

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR2296197A1 (en) * 1974-12-24 1976-07-23 Thomson Csf METHOD AND DEVICE USING A THERMO-OPTICAL EFFECT IN A THIN LAYER IN SMECTIC PHASE FOR THE REPRODUCTION OF IMAGES WITH MEMORY
DE102008008707A1 (en) * 2008-02-11 2009-08-13 Deutsches Zentrum für Luft- und Raumfahrt e.V. Digital image processing method, involves forming mixed model description depending upon verification, and calculating image values of processed images by considering imaging function from result of mixed model description
CN107066973A (en) * 2017-04-17 2017-08-18 杭州电子科技大学 A kind of video content description method of utilization spatio-temporal attention model
CN109376610A (en) * 2018-09-27 2019-02-22 南京邮电大学 Pedestrian's unsafe acts detection method in video monitoring based on image concept network
CN110268712A (en) * 2017-02-07 2019-09-20 皇家飞利浦有限公司 Method and apparatus for handling image attributes figure
CN110598713A (en) * 2019-08-06 2019-12-20 厦门大学 Intelligent image automatic description method based on deep neural network


Also Published As

Publication number Publication date
CN111737511A (en) 2020-10-02

Similar Documents

Publication Publication Date Title
CN111737511B (en) Image description method based on self-adaptive local concept embedding
CN111159454A (en) Picture description generation method and system based on Actor-Critic generation type countermeasure network
CN112819013A (en) Image description method based on intra-layer and inter-layer joint global representation
CN113035311B (en) Medical image report automatic generation method based on multi-mode attention mechanism
CN113837233B (en) Image description method of self-attention mechanism based on sample self-adaptive semantic guidance
CN111444367A (en) Image title generation method based on global and local attention mechanism
CN115982350A (en) False news detection method based on multi-mode Transformer
CN110968725B (en) Image content description information generation method, electronic device and storage medium
CN110991290A (en) Video description method based on semantic guidance and memory mechanism
CN113901170A (en) Event extraction method and system combining Bert model and template matching and electronic equipment
CN113283336A (en) Text recognition method and system
CN117746078B (en) Object detection method and system based on user-defined category
Wang et al. Recognizing handwritten mathematical expressions as LaTex sequences using a multiscale robust neural network
CN111680684A (en) Method, device and storage medium for recognizing spine text based on deep learning
CN110889276B (en) Method, system and computer medium for extracting pointer type extraction triplet information by complex fusion characteristics
CN112084788A (en) Automatic marking method and system for implicit emotional tendency of image captions
CN110929013A (en) Image question-answer implementation method based on bottom-up entry and positioning information fusion
CN113837231B (en) Image description method based on data enhancement of mixed sample and label
CN115982629A (en) Image description method based on semantic guidance feature selection
CN113221870B (en) OCR (optical character recognition) method, device, storage medium and equipment for mobile terminal
Rafi et al. A linear sub-structure with co-variance shift for image captioning
CN115496134A (en) Traffic scene video description generation method and device based on multi-modal feature fusion
CN113052156A (en) Optical character recognition method, device, electronic equipment and storage medium
CN113934922A (en) Intelligent recommendation method, device, equipment and computer storage medium
CN112329803A (en) Natural scene character recognition method based on standard font generation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant