CN111737511B - Image description method based on self-adaptive local concept embedding - Google Patents
Image description method based on self-adaptive local concept embedding Download PDFInfo
- Publication number
- CN111737511B CN111737511B CN202010554218.7A CN202010554218A CN111737511B CN 111737511 B CN111737511 B CN 111737511B CN 202010554218 A CN202010554218 A CN 202010554218A CN 111737511 B CN111737511 B CN 111737511B
- Authority
- CN
- China
- Prior art keywords
- concept
- local
- sentence
- image
- adaptive
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000000034 method Methods 0.000 title claims abstract description 58
- 230000000007 visual effect Effects 0.000 claims abstract description 37
- 230000007246 mechanism Effects 0.000 claims abstract description 16
- 238000013528 artificial neural network Methods 0.000 claims abstract description 6
- 230000006870 function Effects 0.000 claims description 28
- 230000003044 adaptive effect Effects 0.000 claims description 27
- 239000013598 vector Substances 0.000 claims description 27
- 230000008569 process Effects 0.000 claims description 16
- 238000012549 training Methods 0.000 claims description 15
- 239000011159 matrix material Substances 0.000 claims description 11
- 230000002787 reinforcement Effects 0.000 claims description 9
- 238000001514 detection method Methods 0.000 claims description 7
- 238000005336 cracking Methods 0.000 claims description 5
- 238000007781 pre-processing Methods 0.000 claims description 5
- 238000000342 Monte Carlo simulation Methods 0.000 claims description 3
- 230000004913 activation Effects 0.000 claims description 3
- 238000013527 convolutional neural network Methods 0.000 claims description 3
- 238000009826 distribution Methods 0.000 claims description 3
- 238000012545 processing Methods 0.000 claims description 3
- 230000009089 cytolysis Effects 0.000 claims description 2
- 230000007274 generation of a signal involved in cell-cell signaling Effects 0.000 claims description 2
- 240000004760 Pimpinella anisum Species 0.000 claims 1
- 238000013473 artificial intelligence Methods 0.000 abstract description 4
- 230000007547 defect Effects 0.000 abstract description 3
- 238000000605 extraction Methods 0.000 description 4
- 238000012986 modification Methods 0.000 description 3
- 230000004048 modification Effects 0.000 description 3
- 238000012800 visualization Methods 0.000 description 3
- 238000005516 engineering process Methods 0.000 description 2
- 238000011156 evaluation Methods 0.000 description 2
- 238000013507 mapping Methods 0.000 description 2
- 230000011218 segmentation Effects 0.000 description 2
- 238000010276 construction Methods 0.000 description 1
- 125000004122 cyclic group Chemical group 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000018109 developmental process Effects 0.000 description 1
- 230000006872 improvement Effects 0.000 description 1
- 238000004519 manufacturing process Methods 0.000 description 1
- 230000001537 neural effect Effects 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/5866—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biophysics (AREA)
- Software Systems (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Probability & Statistics with Applications (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Library & Information Science (AREA)
- Databases & Information Systems (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses an image description method based on self-adaptive local concept embedding, which belongs to the technical field of artificial intelligence and comprises the following steps: step 1, extracting a plurality of candidate areas of an image to be described and characteristics corresponding to the candidate areas by adopting a target detector; and 2, inputting the features extracted in the step 1 into the trained neural network, thereby outputting a description result of the image to be described. Aiming at the defect that the traditional image description method based on the attention mechanism does not explicitly model the relationship between the local area and the concept, the method provides a scheme for adaptively generating the visual area and the visual concept through a context mechanism, so that the connection between the vision and the language is strengthened, and the accuracy of the generation description is improved.
Description
Technical Field
The invention relates to automatic image description in the field of artificial intelligence, in particular to a method for researching an image description model based on adaptive local concept embedding and used for describing objective contents of images by natural language based on pictures.
Background
Image automatic description (Image capturing) is a machine ultimate intelligent task proposed in the artificial intelligence field in recent years, and the task is to describe the objective contents of an Image in a natural language for a given Image. With the development of computer vision technology, the task of completing target detection, identification, segmentation and the like cannot meet the production requirements of people, and the method has urgent need for automatically and objectively automatically describing image contents. Different from tasks such as target detection and semantic segmentation, the image automatic description is to integrally and objectively describe objects, attributes, relationships among the objects, corresponding scenes and the like in the image by using an automatic language, and the task is one of important directions of computer vision understanding and is regarded as an important mark of artificial intelligence.
The task of automatic description of images in the past, which was mainly achieved by template-based methods and retrieval-based methods, has been greatly advanced until recently inspired by natural language technology, starting with the use of encoder-decoder frameworks, attention mechanisms and objective functions based on reinforcement learning.
Xu et al [1] introduced for the first time a mechanism of attention in the picture description task to embed important visual attributes and scenes into the description generator. Following this, much work has been directed to improving attention mechanisms. For example, Chen [2] et al propose a spatial and channel attention mechanism to select salient regions and salient semantic patterns; lu et al [3] proposed the concept of visual sentinel to decide whether to pay attention to visual information or text information in the next step, greatly improving the accuracy of the model; anderson et al [4] first acquires the region by a pre-trained target detector and then adds this to the model to generate the image captions. However, these methods only focus on the context and visual characteristics of a specific task, and do not take into account the relationship between explicit modeled visual characteristics and concepts.
The references referred to are as follows:
[1].Xu,K.;Ba,J.;Kiros,R.;Cho,K.;Courville,A.;Salakhudinov,R.;Zemel,R.;and Bengio,Y.2015.Show,attend and tell:Neural image caption generation with visual attention.In ICML.
[2].Chen,L.;Zhang,H.;Xiao,J.;Nie,L.;Shao,J.;Liu,W.;and Chua,T.-S.2017b.Sca-cnn:Spatial and channel-wise attention in convolutional networks for image captioning.In CVPR.
[3].Lu,J.;Xiong,C.;Parikh,D.;and Socher,R.2017.Knowing when to look:Adaptive attention via a visual sentinel for image captioning.In CVPR.
[4].Anderson,P.;He,X.;Buehler,C.;Teney,D.;Johnson,M.;Gould,S.;and Zhang,L.2018.Bottom-up and top-down attention for image captioning and visual question answering.In CVPR.
disclosure of Invention
The invention aims to provide an image description method based on adaptive local concept embedding, and provides a scheme for adaptively generating a visual region and a visual concept thereby through a context mechanism aiming at the defect that the traditional image description method based on an attention mechanism does not explicitly model the relationship between a local region and a concept, so that the connection and the accuracy of vision to language are enhanced.
In order to achieve the above purpose, the solution of the invention is:
an image description method based on adaptive local concept embedding comprises the following steps:
step A1, preprocessing the text content in the training set to obtain a sentence sequence; for images in a training set, a target detector is adopted to extract a plurality of candidate areas, and characteristics V ═ V { V } corresponding to each candidate area are extracted1,v2...,vkIn which v isi∈Rd,i=1,2, k, d are dimensions of each feature vector;
step A2, sending the characteristic V into an adaptive pilot signal generation layer to generate an adaptive pilot signal;
step A3, acquiring local visual features by using an attention mechanism and using an adaptive pilot signal, and obtaining a local concept;
step A4, embedding the local concept into a generating model by a vector cracking method to obtain a current output word;
step a5, iteratively generate an entire sentence, and define a loss function that generates the sentence.
In step 1, the training method of the target detector includes: the target detector adopts an fast R-CNN framework, a skeleton network of the target detector is a deep convolution residual error network, an end-to-end method is adopted to train in a classical target detection data set PASCAL VOC2007, and then a multi-modal data set Visual Genome is further trained to fine-tune network parameters.
In the step a1, the specific process of preprocessing the text content in the training set to obtain the sentence sequence is as follows: firstly, performing stop word processing on text contents in a training set, and performing lowercase on all English words; then, segmenting the text content according to spaces, eliminating words with the occurrence frequency less than a threshold value in the description of the data set for the obtained words, and replacing the words with "< UNK >"; finally, the beginning and END of the sentence are added with the start "< BOS >" and the END "< END >" respectively.
In step a2, the correlation formula for generating the adaptive pilot signal based on the characteristic V is as follows:
wherein t is the t-th word of the sentence sequence,generating a layer input for the adaptive pilot signal, and WeIs a matrix of word vectors, which is,is the pilot signal, x, output by the layertIndicating the index corresponding to the word input at time t.
The specific process of the step a3 is as follows:
first according to the following formula:
wherein,Wv1∈Rk×d、Wh1∈Rk×dis a parameter to be learned, I is belonged to RkFor vectors with all elements being 1, the Softmax function is a normalized exponential function; thereby obtaining the importance of each candidate regionTo obtain the local visual features that the current model focuses on:
wherein,the visual concept obtained, WvcTo achieve a pre-trained concept detection layer,the vision concept concerned by the model is defined, and sigma is an activation function;
wherein [;]representing vector stitching, WhA parameter matrix needing to be trained;
the following iterations are then performed until the final local concept is obtained, as follows:
wherein,Wv2∈Rk×d、Wh2∈Rk×dis a parameter to be learned, I is belonged to RkFor vectors where all elements are 1, the Softmax function is a normalized exponential function.
The specific process of the step a4 is as follows:
the following vector lysis was first performed:
where diag (.) denotes vector diagonalization, xtIndicating the index corresponding to the word input at time t,andsplitting local concepts, and embedding information into input words and hidden states;
the following information definition module inputs for embedding local concepts:
wherein [; 1; 1; represents a vector stitching operation;
it=σ(WiEi),ft=σ(WfEf)
ot=σ(WoEo),ct=σ(WcEc)
Wherein, Wi、Ei、Wf、Ef、Wo、Eo、Wc、EcAll are parameter matrices that need to be trained;
finally, the probability distribution of the next word is obtained:
wherein WyThe hidden states are mapped to a vocabulary for the parameter matrix to be trained.
The specific process of the step a5 is as follows:
for a predicted sentence Y1∶TIn other words, the probability of generating an entire sentence is multiplied by the probability of each word, i.e.:
wherein T is the sentence length;
training the model through two stages of supervised learning and reinforcement learning; in the supervised learning phase, cross entropy is adopted for a given target sentenceIn terms of this, the loss function is defined as:
in the reinforcement learning stage, reinforcement learning is adopted for training, and the loss function is defined as:
whereinRepresents sentences sampled by greedy method, andrepresenting sentences sampled by the monte carlo method.
After the scheme is adopted, the invention has the following outstanding advantages:
(1) the method explicitly models the relation between the local visual area and the semantic concept, thereby providing accurate connection between vision and language, greatly reducing the semantic gap between image description tasks and greatly improving the accuracy and comprehensiveness of the generated sentences;
(2) the method has strong mobility, can be suitable for any image description model based on an attention mechanism, and improves the performance of the model;
(3) the improved image description integrity and accuracy are mainly applied to understanding the visual concept of a given picture, automatically generating description for the given picture, and having a great number of application prospects in the fields of image retrieval, blind navigation, automatic generation of medical reports and early education.
Drawings
FIG. 1 is a flow chart of the image automatic description method based on adaptive local concept embedding of the present invention;
wherein, RAM is a local concept extraction module, LCFM is a local concept cracking embedding module, and Attention is an Attention module;
FIG. 2 is a comparison of sentences generated by different image description models;
wherein UP-DOWN is a name named top-DOWN baseline method;
FIG. 3 is a result of similarity determination and visualization in column units of a mapping matrix used when embedding local concepts;
FIG. 4 is a semantic concept of the visualization of a region and the mapping of the region correspondingly visualized for the framework adaptive selection employed in the present invention;
fig. 5 is a visualization of correspondence of a certain semantic concept with a visual area.
Detailed Description
The technical solution and the advantages of the present invention will be described in detail with reference to the accompanying drawings.
The invention aims to provide an image description method based on adaptive local concept embedding, which aims at overcoming the defect that the traditional image description method based on an attention mechanism does not explicitly model the relationship between a local region and a concept, provides a scheme for adaptively generating a visual region and a visual concept by a context mechanism, strengthens the connection and accuracy of vision to language, and provides an image description method based on adaptive local concept embedding. The specific algorithm flow is shown in fig. 1.
The invention comprises the following steps:
1) for the images in the image library, firstly, extracting corresponding image features by using a convolutional neural network;
2) adopting a cyclic neural network to map the current input word sum and the global image characteristics to the hidden layer for output, and taking the hidden layer as a guide signal;
3) obtaining the weight of each local image feature by using the guide signal by adopting an attention mechanism, adaptively obtaining local visual features, and extracting local concepts by using a trained concept extractor;
4) establishing a local concept cracking module, embedding a local concept into a generation model, and acquiring a current output word;
5) the iteration generates the whole sentence and defines the loss function of the generated sentence.
Each module is specifically as follows:
1. deep convolution feature extraction and description data preprocessing
Performing stop word processing on text contents in all training data, and performing lowercase on all English words; then, the text content is segmented according to spaces to obtain 9487 words, the words with the occurrence frequency less than five in the description of the data set are removed and replaced by "< UNK >", and meanwhile, a start symbol "< BOS >" and an END symbol "< END >" are added at the beginning and the END of the description sentence respectively.
Firstly, extracting 36 fixed candidate regions by using a pre-trained target detector, and extracting a feature V ═ V corresponding to each candidate region by using a residual deep convolution network1,v2...,vkIn which v isi∈RdI 1, 2, k, d are dimensions of the respective feature vectors, k is 36 and d is 2048.
2. Adaptive pilot generation layer
First, the first layer is a convolution loop network for generating an adaptive pilot signal to provide guidance for extracting local visual features later, and the layer inputs and processes are defined as follows:
wherein t is the t-th word of the sentence sequence,generating a layer input for the adaptive pilot signal, and WeIs a matrix of word vectors and is,is the pilot signal, x, output by the layertIndicating the corresponding index of the word input at time tAnd (3) introducing.
3. Local concept extraction
As shown in FIG. 1, following the local concept extraction layer, the present invention first utilizesAs a guide, local visual information is obtained, and thus adaptive local concepts are obtained, the process is derived as follows:
wherein,Wv1∈Rk×d、Wh1∈Rk×dis a parameter to be learned, I is belonged to RkFor vectors where all elements are 1, the Softmax function is a normalized exponential function. Thus, the importance of each candidate region can be obtainedTo obtain the local visual features that the current model focuses on:
wherein,i.e. the visual concept obtained, WvcTo achieve a pre-trained concept detection layer,i.e. the visual concept that the model is focused on, σ is the activation function. ObtainedThe quality of the attention mechanism can be well reflected, so the information is used for modifying the guide signal to improve the attention level, and the modification is as follows:
wherein [;]representing vector stitching, WhFor the parameter matrix to be trained, the process is the same as the first process, so that the final local concept can be obtained, and the process is as follows:
wherein,Wv2∈Rk×d、Wh2∈Rk×dis a parameter to be learned, I is belonged to RkFor vectors where all elements are 1, the Softmax function is a normalized exponential function.
4. Local concept cracking embedded module
The local concept is obtained through the above process, and then the local concept is embedded into the model through a vector splitting method, so as to effectively use the information to generate the image description, wherein the vector splitting process is as follows:
where diag (.) denotes vector diagonalization, xtIndicating the index corresponding to the word input at time t,andthe local concepts are split and then information is embedded into the input words and hidden states. The information definition module input of the embedded local concept is as follows:
wherein [; 1; 1; .]A vector stitching operation is shown. Then, the input of the embedded information is mapped to obtain
it=σ(WiEi),ft=σ(WfEf)
Wherein, Wi、Ei、Wf、Ef、Wo、Eo、Wc、EcAll are parameter matrixes to be trained, and finally, we obtain the probability distribution of the next word through the information:
wherein WyThe hidden states are mapped to a vocabulary for the parameter matrix to be trained.
5. Global loss function construction
For a predicted sentence Y1∶TIn other words, the probability of generating the entire sentence can be multiplied by the probability of each word, i.e.:
where T is the sentence length. The invention trains the model in two stages, including supervised learning and reinforcement learning. The former employs cross entropy for a given target sentenceIn terms of this, the loss function is defined as:
the latter is trained by reinforcement learning, and the loss function is defined as:
whereinRepresents sentences sampled by greedy method, andrepresenting sentences sampled by the monte carlo method.
The specific experimental results are as follows:
to verify the feasibility and advancement of the proposed model, we performed the evaluation of the model in the generic data set MSCOCO. The quantitative comparison with the latest image automatic description method is shown in table 1, and we can see that the performance of the proposed model has high advantages on various evaluation indexes. In addition, we can see that the text description generated by visualizing the input image, the description given by way of example is in english, and the chinese description is generated by the same automatic generation process (as shown in fig. 2), and that the model models the local visual information display, so that the model achieves obvious improvement on the image description. FIG. 3 is a pair of W*a TW*aThe results show that the method of the present invention embeds local concepts well into the model. Fig. 4 shows the visual regions concerned by the two module layers when each word is generated and the visual concept generated by the visual regions, and it can be seen that a more accurate visual concept can be obtained by modification. FIG. 5 labels the region of greatest model interest after the generation of a particular concept, which indicates that the method of the present invention can help overcome the semantic gap problem. The descriptions and concepts in fig. 2 to fig. 4 are all in english as an example, but the present invention can be directly extended to chinese description with the same mechanism.
TABLE 1 comparison of the method of the invention with the currently most advanced methods
Model | B-1 | B-4 | M | R | C | S |
LSM-A | 78.6 | 35.5 | 27.3 | 56.8 | 118.3 | 20.8 |
GCN-LSTM | 80.5 | 38.2 | 28.5 | 58.5 | 128.3 | 22.0 |
Stack-Cap | 78.6 | 36.1 | 27.4 | 56.9 | 120.4 | 20.9 |
SGAE | 80.8 | 38.4 | 28.4 | 58.6 | 127.8 | 22.1 |
Up-Down | 79.8 | 36.3 | 27.7 | 56.9 | 120.1 | 21.4 |
The method of the invention | 80.6 | 39.0 | 28.6 | 58.8 | 128.3 | 22.3 |
The above embodiments are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modifications made on the basis of the technical scheme according to the technical idea of the present invention fall within the protection scope of the present invention.
Claims (4)
1. An image description method based on self-adaptive local concept embedding is characterized by comprising the following steps:
step 1, extracting a plurality of candidate areas of an image to be described and characteristics corresponding to the candidate areas by adopting a target detector;
step 2, inputting the features extracted in the step 1 into a trained neural network, thereby outputting a description result of the image to be described; wherein, the global loss function of the neural network is obtained by the following method;
step A1, preprocessing the text content in the training set to obtain a sentence sequence; for the images in the training set, a target detector is adopted to extract a plurality of candidate areas, and characteristics corresponding to the candidate areas are extractedWherein,,Dimensions of each feature vector;
step A2, characterizingSending the signal into an adaptive pilot signal generation layer to generate an adaptive pilot signal;
step A3, acquiring local visual features by using an attention mechanism and using an adaptive pilot signal, and obtaining a local concept;
step A4, embedding the local concept into a generating model by a vector cracking method to obtain a current output word;
step A5, generating the whole sentence by iteration, and defining the loss function of the generated sentence;
in the step A2, based on characteristicsThe correlation formula for generating the adaptive pilot signal is as follows:
wherein,is the first of a sentence sequenceThe number of the individual words,generating a layer input for the adaptive pilot signal, anIs a matrix of word vectors, which is,is the pilot signal output by the layer and,to representIndexes corresponding to words input at any moment;
the specific process of the step A3 is as follows:
first according to the following formula:
wherein,、、is a parameter that needs to be learned,for vectors with all elements being 1, the Softmax function is a normalized exponential function; thereby obtaining the importance of each candidate regionAnd is used for obtaining the local visual characteristics concerned by the current model:
wherein,i.e. the visual concept that is obtained,to achieve a pre-trained concept detection layer,i.e. the visual concept that the model focuses on,is an activation function;
whereinThe concatenation of the vectors is represented and,a parameter matrix needing to be trained;
the following iterations are then performed until the final local concept is obtained, as follows:
wherein,、、is a parameter that needs to be learned by the user,for vectors with all elements being 1, the Softmax function is a normalized exponential function;
the specific process of the step A4 is as follows:
the following vector lysis was first performed:
wherein,the representation vector is diagonalized and,to representThe index corresponding to the word input at the time,andis to crack the local concepts and thereafter embed the informationEntering an input word and a hidden state;
the following information definition module inputs for embedding local concepts:
wherein [ ] represents a vector splicing operation;
finally, the probability distribution of the next word is obtained:
2. The image description method based on adaptive local concept embedding of claim 1, characterized in that: in step 1, the training method of the target detector comprises the following steps: the target detector adopts an fast R-CNN framework, a skeleton network of the target detector is a deep convolution residual error network, an end-to-end method is adopted to train in a classical target detection data set PASCAL VOC2007, and then a multi-modal data set Visual Genome is further trained to fine-tune network parameters.
3. The image description method based on adaptive local concept embedding of claim 1, characterized in that: in step a1, the specific process of preprocessing the text content in the training set to obtain a sentence sequence is as follows: firstly, performing stop word processing on text contents in a training set, and performing lowercase on all English words; then, segmenting the text content according to spaces, eliminating words with the occurrence frequency less than a threshold value in the description of the data set for the obtained words, and replacing the words with "< UNK >"; finally, the beginning and END of the sentence are added with the start "< BOS >" and the END "< END >" respectively.
4. The image description method based on adaptive local concept embedding of claim 1, characterized in that: the specific process of the step A5 is as follows:
for predicted sentence Y = Y1:TIn other words, the probability of generating an entire sentence isThe probability of each word is multiplied by:
training the model through two stages of supervised learning and reinforcement learning; in the supervised learning phase, cross entropy is adopted for a given target sentenceIn terms of this, the loss function is defined as:
in the reinforcement learning stage, reinforcement learning is adopted for training, and the loss function is defined as:
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010554218.7A CN111737511B (en) | 2020-06-17 | 2020-06-17 | Image description method based on self-adaptive local concept embedding |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010554218.7A CN111737511B (en) | 2020-06-17 | 2020-06-17 | Image description method based on self-adaptive local concept embedding |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111737511A CN111737511A (en) | 2020-10-02 |
CN111737511B true CN111737511B (en) | 2022-06-07 |
Family
ID=72649581
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010554218.7A Active CN111737511B (en) | 2020-06-17 | 2020-06-17 | Image description method based on self-adaptive local concept embedding |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111737511B (en) |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112329794B (en) * | 2020-11-06 | 2024-03-12 | 北京工业大学 | Image description method based on dual self-attention mechanism |
CN112819013A (en) * | 2021-01-29 | 2021-05-18 | 厦门大学 | Image description method based on intra-layer and inter-layer joint global representation |
CN112819012B (en) * | 2021-01-29 | 2022-05-03 | 厦门大学 | Image description generation method based on multi-source cooperative features |
CN112861988B (en) * | 2021-03-04 | 2022-03-11 | 西南科技大学 | Feature matching method based on attention-seeking neural network |
CN113158791B (en) * | 2021-03-15 | 2022-08-16 | 上海交通大学 | Human-centered image description labeling method, system, terminal and medium |
CN113139378B (en) * | 2021-03-18 | 2022-02-18 | 杭州电子科技大学 | Image description method based on visual embedding and condition normalization |
CN113283248B (en) * | 2021-04-29 | 2022-06-21 | 桂林电子科技大学 | Automatic natural language generation method and device for scatter diagram description |
CN113837233B (en) * | 2021-08-30 | 2023-11-17 | 厦门大学 | Image description method of self-attention mechanism based on sample self-adaptive semantic guidance |
CN117423108B (en) * | 2023-09-28 | 2024-05-24 | 中国科学院自动化研究所 | Image fine granularity description method and system for instruction fine adjustment multi-mode large model |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
FR2296197A1 (en) * | 1974-12-24 | 1976-07-23 | Thomson Csf | METHOD AND DEVICE USING A THERMO-OPTICAL EFFECT IN A THIN LAYER IN SMECTIC PHASE FOR THE REPRODUCTION OF IMAGES WITH MEMORY |
DE102008008707A1 (en) * | 2008-02-11 | 2009-08-13 | Deutsches Zentrum für Luft- und Raumfahrt e.V. | Digital image processing method, involves forming mixed model description depending upon verification, and calculating image values of processed images by considering imaging function from result of mixed model description |
CN107066973A (en) * | 2017-04-17 | 2017-08-18 | 杭州电子科技大学 | A kind of video content description method of utilization spatio-temporal attention model |
CN109376610A (en) * | 2018-09-27 | 2019-02-22 | 南京邮电大学 | Pedestrian's unsafe acts detection method in video monitoring based on image concept network |
CN110268712A (en) * | 2017-02-07 | 2019-09-20 | 皇家飞利浦有限公司 | Method and apparatus for handling image attributes figure |
CN110598713A (en) * | 2019-08-06 | 2019-12-20 | 厦门大学 | Intelligent image automatic description method based on deep neural network |
-
2020
- 2020-06-17 CN CN202010554218.7A patent/CN111737511B/en active Active
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
FR2296197A1 (en) * | 1974-12-24 | 1976-07-23 | Thomson Csf | METHOD AND DEVICE USING A THERMO-OPTICAL EFFECT IN A THIN LAYER IN SMECTIC PHASE FOR THE REPRODUCTION OF IMAGES WITH MEMORY |
DE102008008707A1 (en) * | 2008-02-11 | 2009-08-13 | Deutsches Zentrum für Luft- und Raumfahrt e.V. | Digital image processing method, involves forming mixed model description depending upon verification, and calculating image values of processed images by considering imaging function from result of mixed model description |
CN110268712A (en) * | 2017-02-07 | 2019-09-20 | 皇家飞利浦有限公司 | Method and apparatus for handling image attributes figure |
CN107066973A (en) * | 2017-04-17 | 2017-08-18 | 杭州电子科技大学 | A kind of video content description method of utilization spatio-temporal attention model |
CN109376610A (en) * | 2018-09-27 | 2019-02-22 | 南京邮电大学 | Pedestrian's unsafe acts detection method in video monitoring based on image concept network |
CN110598713A (en) * | 2019-08-06 | 2019-12-20 | 厦门大学 | Intelligent image automatic description method based on deep neural network |
Also Published As
Publication number | Publication date |
---|---|
CN111737511A (en) | 2020-10-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111737511B (en) | Image description method based on self-adaptive local concept embedding | |
CN111159454A (en) | Picture description generation method and system based on Actor-Critic generation type countermeasure network | |
CN112819013A (en) | Image description method based on intra-layer and inter-layer joint global representation | |
CN113035311B (en) | Medical image report automatic generation method based on multi-mode attention mechanism | |
CN113837233B (en) | Image description method of self-attention mechanism based on sample self-adaptive semantic guidance | |
CN111444367A (en) | Image title generation method based on global and local attention mechanism | |
CN115982350A (en) | False news detection method based on multi-mode Transformer | |
CN110968725B (en) | Image content description information generation method, electronic device and storage medium | |
CN110991290A (en) | Video description method based on semantic guidance and memory mechanism | |
CN113901170A (en) | Event extraction method and system combining Bert model and template matching and electronic equipment | |
CN113283336A (en) | Text recognition method and system | |
CN117746078B (en) | Object detection method and system based on user-defined category | |
Wang et al. | Recognizing handwritten mathematical expressions as LaTex sequences using a multiscale robust neural network | |
CN111680684A (en) | Method, device and storage medium for recognizing spine text based on deep learning | |
CN110889276B (en) | Method, system and computer medium for extracting pointer type extraction triplet information by complex fusion characteristics | |
CN112084788A (en) | Automatic marking method and system for implicit emotional tendency of image captions | |
CN110929013A (en) | Image question-answer implementation method based on bottom-up entry and positioning information fusion | |
CN113837231B (en) | Image description method based on data enhancement of mixed sample and label | |
CN115982629A (en) | Image description method based on semantic guidance feature selection | |
CN113221870B (en) | OCR (optical character recognition) method, device, storage medium and equipment for mobile terminal | |
Rafi et al. | A linear sub-structure with co-variance shift for image captioning | |
CN115496134A (en) | Traffic scene video description generation method and device based on multi-modal feature fusion | |
CN113052156A (en) | Optical character recognition method, device, electronic equipment and storage medium | |
CN113934922A (en) | Intelligent recommendation method, device, equipment and computer storage medium | |
CN112329803A (en) | Natural scene character recognition method based on standard font generation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |