CN109712108B - Visual positioning method for generating network based on diversity discrimination candidate frame - Google Patents
- Publication number
- CN109712108B (application CN201811305577.8A)
- Authority
- CN
- China
- Prior art keywords
- vector
- candidate
- network
- frame
- formula
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Image Analysis (AREA)
Abstract
The invention discloses a visual positioning method based on a diversified and discriminative candidate box generation network. The invention comprises the following steps: 1. Train the diversified and discriminative candidate box generation network. 2. Extract features of the image using the trained DDPN network. 3. Extract text data features. 4. Construct the target vector and the target values of the regression box. 5. Construct a deep neural network. 6. Set the loss functions. 7. Train the model. 8. Calculate the network prediction. The algorithm provided by the invention, especially the DDPN-based image feature extraction, achieves a significant improvement on the image visual positioning task and greatly exceeds all current mainstream methods on this task. The feature extraction algorithm of the invention also has very important application value and great potential in other cross-modal fields such as image question answering and image captioning.
Description
Technical Field
The present invention relates to an algorithm based on deep neural networks for the image visual positioning (Visual Grounding) problem, and more particularly to an image feature extraction method based on a Diversified and Discriminative Proposal Network (DDPN) and a deep neural network structure for the image visual positioning problem.
Background
Visual positioning is a subtask in the field of "cross-media", and "cross-media" is a cross direction between computer vision and natural language processing research, aiming at bridging the "semantic gap" between different media (such as images and texts) and establishing a unified semantic expression. Based on theoretical methods of cross-media unified expression, several currently popular research directions have been derived, such as natural description generation (Image Captioning), Image-Text Cross-media Retrieval, Image Question Answering on image content, and image visual positioning (Visual Grounding). Natural description generation aims to summarize the content of an image in one or more sentences of natural language; image-text cross-media retrieval aims to find the best matching text description for an image from a database, or the best matching image for a text description; automatic question answering on image content takes a picture and a question described in natural language as input and outputs an answer in natural language; visual positioning of an image takes a picture and a natural language description text and selects the relevant region in the picture according to the description.
With the rapid development of deep learning in recent years, deep neural networks such as the deep Convolutional Neural Network (CNN) and the deep Recurrent Neural Network (RNN) have achieved quite good results on natural description generation and automatic question answering on image content. However, progress on the visual positioning problem has been slow, with very limited success. Therefore, using neural networks to solve the visual positioning problem is a research problem worthy of intensive study.
In the aspect of practical application, the image visual positioning algorithm has a very wide application scene. The text-based question-answering system has been widely applied to the operating systems of smart phones and PCs as an important way of man-machine interaction, such as Siri of apple, Cortana of microsoft, Alexa of amazon, and the like. With the rapid development of wearable smart hardware (such as Google glasses and microsoft HoloLens) and augmented reality technology, in the near future, an image content positioning system based on visual perception may become an important way for human-computer interaction.
In conclusion, the image visual positioning algorithm is a direction worthy of intensive research, and this patent starts from several key difficulties in the task to solve the problems existing in current methods.
Because image content in natural scenes is complex and the subjects are diverse, and natural-language description texts have a high degree of freedom, the image visual positioning algorithm faces huge challenges. Specifically, there are two main difficulties:
(1) Extracting appropriate features from the image: extracting proper features from images is a basic task of a neural network in solving cross-modal problems. Current mainstream algorithms for cross-modal problems such as image captioning, image question answering, and visual positioning preprocess images in advance to extract features, and much related work shows that the image feature extraction algorithm has a great influence on the performance of the neural network.
(2) Uniformly modeling the cross-media data of the problem and performing effective feature fusion: the multi-modal feature fusion problem is a classic and fundamental problem in cross-media expression; commonly used methods are feature splicing, feature summation, or feature fusion using a multi-layer neural network. In addition, feature fusion models based on the bilinear model have shown good results in many fields such as fine-grained image classification, natural language processing, and recommendation systems, but their high computational complexity makes model training very challenging. Therefore, selecting a proper strategy when fusing cross-media data features that ensures computational efficiency while improving the expressive capability of the fused features is a direction worthy of intensive research.
Disclosure of Invention
The invention provides an algorithm for extracting features of an image for the Visual Grounding task, comprising: a Diversified and Discriminative Proposal Network (DDPN) and a deep neural network algorithm for visual positioning, and makes a great breakthrough on the visual positioning problem.
The technical scheme adopted by the invention for solving the technical problem comprises the following steps:
Step (1), training the Diversified and Discriminative Proposal Network (DDPN)
Use Faster R-CNN (an image detection algorithm) and add prediction of object attribute values on top of it, as shown in FIG. 1. Train it on the Visual Genome data set until the network converges; the resulting converged network is called the DDPN network.
Step (2) extracting features of the image by using the trained DDPN network
For an input image, k candidate boxes containing objects in the image are calculated using the DDPN network trained in the previous step. For each candidate frame, the corresponding area of the candidate frame in the image is input into the DDPN network and the output of a certain layer of the network is extracted as the feature p_f of the candidate frame. The features of all candidate frames in a picture are spliced to generate an overall feature i'_f. Due to the characteristics of the DDPN network, these steps can be completed in a single forward computation, which ensures the practicality of the feature extraction algorithm.
Step (3) extracting text data characteristics
All texts in the image data set are segmented, a dictionary is built, the dictionary contains d words in total, and each input description text is converted into a dictionary sequence number list according to the dictionary, so that the texts are converted into a vector form.
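The dictionary construction and text-to-index conversion of this step can be sketched as follows. This is a minimal illustration; the function names, whitespace tokenization, and the use of index 0 for padding/unknown words are assumptions, not details from the patent:

```python
def build_dictionary(texts):
    """Build a word dictionary (word -> index) from all description texts.
    Index 0 is reserved for padding/unknown words (an assumption)."""
    vocab = {"<pad>": 0}
    for text in texts:
        for word in text.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def text_to_indices(text, vocab):
    """Convert a description text into a list of dictionary sequence numbers."""
    return [vocab.get(w, 0) for w in text.lower().split()]

vocab = build_dictionary(["the red car", "a red bus"])
indices = text_to_indices("the red bus", vocab)
```

Here `indices` becomes `[1, 2, 5]`: each word is replaced by its position in the dictionary built from the corpus.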
Step (4), constructing target vectors and target values of regression frames
For a picture and a given description text, a target vector l is constructed, with each element corresponding one-to-one to a candidate box from step (2). For the k candidate frames obtained in step (2), the overlapping degree (IOU) between each candidate frame and the real labeling frame is calculated, and the target vector l is set according to the overlapping degree. For the target values of the regression frame, the regression target vector b of each candidate frame is calculated from the difference between the coordinates of that candidate frame and the coordinates of the real labeling frame.
Step (5) constructing a deep neural network
The structure is shown in fig. 2. For the description text, first a word vectorization (word embedding) technique converts the text vector obtained in step (3) into a matrix q_e. The matrix q_e is input into a Long Short-Term Memory (LSTM) network, and the vector q' output by its last unit is selected. q' is copied k times and the copies are spliced to form the text feature vector q. The coordinates of each candidate frame in the image features are processed to generate the candidate-frame position feature vector f_sp. The generated position feature and the image feature i'_f of the corresponding candidate frame are spliced to generate the final image feature i_f. As cross-modal modeling, the text vector q and the image feature i_f are spliced to generate the joint expression feature z. A fully-connected function and an activation function map z to a hidden feature space to generate the feature z'; finally z' is input into two fully-connected functions respectively to output two prediction values: the matching scores s of the k candidate boxes and the regression values b of each candidate box.
Step (6), Loss Function (Loss Function)
The two prediction vectors output in step (5) and the corresponding target vectors are respectively input into their loss functions, and two loss values (loss) are output.
Step (7), training the model
According to the loss values generated by the loss functions in step (6), train the model parameters of the neural network in step (5) using the back-propagation algorithm until the whole network model converges.
Step (8), calculating network predicted value
Sort the candidate frames according to the s vector output in step (5), select the candidate frame with the highest score as the prediction frame, perform fine regression on the prediction frame according to the b vector output in step (5), and finally generate the network's prediction box b_p.
When training the DDPN network in step (1), the Visual Genome data set is preprocessed, and the 1600 most frequently occurring classes and the 400 most frequent attribute values are retained.
The step (2) of extracting features of the image by using the trained DDPN network is as follows:
2-1. Each candidate box corresponds to the feature p_f of its image area. The features of all candidate frames in one picture are spliced to generate the overall feature i'_f. The specific formula is as follows:
i'_f = (p_f^(1), p_f^(2), ..., p_f^(k)) (formula 1)
extracting text data features in the step (3), specifically as follows:
3-1. For the problem text, first split it into a word list q_w of fixed length t. The specific formula is as follows:
q_w = (w_1, w_2, ..., w_i, ..., w_t) (formula 2)
where w_i is a word string.
3-2. Convert the words in the word list q_w into index values according to the word dictionary, thereby converting the text into a fixed-length index vector q_i. The specific formula is as follows:
q_i = (index(w_1), index(w_2), ..., index(w_t)) (formula 3)
where index(w) denotes the sequence number of word w in the dictionary.
The target vector and the target value of the regression frame are constructed in the step (4), and the method specifically comprises the following steps:
4-1. Calculate the overlapping degree (IOU) between each candidate frame and the real labeling frame, set the elements of the l vector corresponding to candidate frames whose overlapping degree is larger than h to the corresponding IOU values, and finally normalize the l vector so that its elements sum to 1. The calculation formula of the overlapping degree between two frames is as follows:
IOU(A, B) = area(A ∩ B) / area(A ∪ B) (formula 4)
where A ∩ B is the intersection of frame A and frame B, and A ∪ B is the union of frame A and frame B.
The formula for vector normalization is as follows:
l = l / sum(l) (formula 5)
where sum(l) is the sum over the elements of l; its output is a scalar.
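Step 4-1 above (IOU computation, thresholding at h, and normalization of the target vector l) can be sketched in plain Python. Boxes are assumed to be (x1, y1, x2, y2) tuples; the function names are illustrative:

```python
def iou(box_a, box_b):
    """Overlapping degree (IOU) between two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union > 0 else 0.0

def target_vector(candidates, gt_box, h=0.5):
    """Set l[i] = IOU for candidates whose IOU with the ground-truth box
    exceeds threshold h, then normalize so the elements sum to 1."""
    l = [iou(c, gt_box) if iou(c, gt_box) > h else 0.0 for c in candidates]
    s = sum(l)
    return [x / s for x in l] if s > 0 else l
```

For example, two unit-overlapping 2×2 boxes have IOU 1/7, and a candidate identical to the ground truth dominates the normalized target vector.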
4-3. The calculation formulas of the regression target values of a candidate box are:
w_gt = x2_gt − x1_gt (formula 6)
h_gt = y2_gt − y1_gt (formula 7)
where x1_gt, y1_gt, x2_gt, y2_gt are the coordinate values of the lower-left and upper-right corners of the real labeling frame; w_gt represents the width of the real labeling frame and h_gt its height;
w = x2 − x1 (formula 8)
h = y2 − y1 (formula 9)
where x1, y1, x2, y2 are the coordinate values of the lower-left and upper-right corners of the candidate box; w represents the width of the candidate box and h its height;
x_ctr^gt = x1_gt + 0.5 × w_gt (formula 10)
y_ctr^gt = y1_gt + 0.5 × h_gt (formula 11)
x_ctr = x1 + 0.5 × w (formula 12)
y_ctr = y1 + 0.5 × h (formula 13)
where x_ctr, y_ctr and x_ctr^gt, y_ctr^gt are the center coordinate values of the candidate frame and the real labeling frame, respectively.
b ═ dx, dy, dw, dh) (equation 18)
Where b is the final regression target vector for the candidate box.
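The regression-target encoding of 4-3 can be sketched as below, assuming the standard box parameterization (center offsets normalized by box size, log width/height ratios) that formulas 6-18 appear to follow:

```python
import math

def regression_target(box, gt):
    """Encode the offset from a candidate box (x1, y1, x2, y2) to the
    ground-truth box as b = (dx, dy, dw, dh)."""
    x1, y1, x2, y2 = box
    gx1, gy1, gx2, gy2 = gt
    w, h = x2 - x1, y2 - y1                    # formulas 8-9
    w_gt, h_gt = gx2 - gx1, gy2 - gy1          # formulas 6-7
    xc, yc = x1 + 0.5 * w, y1 + 0.5 * h        # formulas 12-13
    xc_gt, yc_gt = gx1 + 0.5 * w_gt, gy1 + 0.5 * h_gt
    return ((xc_gt - xc) / w, (yc_gt - yc) / h,
            math.log(w_gt / w), math.log(h_gt / h))
```

A candidate identical to the ground truth encodes to all zeros; a ground truth twice as large gives dw = dh = ln 2.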
Constructing a neural network in the step (5), specifically as follows:
5-1. The word vector (word embedding) operation is as follows: convert the text index vector q_i obtained in step (3) into one-hot vectors q_o; a one-hot vector is a vector in which the element at the index position is 1 and all remaining elements are 0. Then input q_o into a fully-connected function whose output is a v-dimensional vector. The specific formula is as follows:
q_e = q_o · W_e (formula 19)
The obtained word vector matrix q_e is input into the LSTM to form a t × n dimensional output feature matrix. The specific formula is as follows:
q_lstm = LSTM(q_e) (formula 20)
Then q' (the output of the last LSTM unit) is copied k times and the copies are spliced to obtain the text vector q. The specific formula is as follows:
q = (q′, q′, ..., q′)^T (formula 21)
5-2. Calculate the position feature f_sp of each candidate box. The specific formula is as follows:
f_sp = (x1/w_img, y1/h_img, x2/w_img, y2/h_img, (x2 − x1)(y2 − y1)/(w_img · h_img)) (formula 22)
where x1, y1, x2, y2 are the coordinate values of the lower-left and upper-right corners of the candidate box, and w_img, h_img are the width and height of the input image, respectively.
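A sketch of the position feature computation, assuming the 5-dimensional form (normalized corner coordinates plus relative area) commonly used for such spatial features; the exact composition in the patent's figure is not reproduced here:

```python
def position_feature(box, w_img, h_img):
    """5-dimensional spatial feature of a candidate box: corners normalized
    by image size, plus the box area relative to the image area."""
    x1, y1, x2, y2 = box
    return (x1 / w_img, y1 / h_img, x2 / w_img, y2 / h_img,
            (x2 - x1) * (y2 - y1) / (w_img * h_img))
```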
5-3. Splice the image area feature i'_f corresponding to the candidate frames and the candidate-frame position features f_sp to generate the final image feature i_f. The formula is as follows:
i_f = (i'_f, f_sp) (formula 23)
5-4. Splice the text vector q and the image feature i_f to generate the joint expression feature z, which the fully-connected and activation functions map to the hidden feature space to generate the feature z'. The formulas are as follows:
z = (q, i_f) (formula 24)
z′ = ReLU(FC(z)) (formula 25)
Where FC is the full connection function and ReLU is the activation function.
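The splicing and hidden-space mapping of 5-4 can be sketched with a toy fully-connected layer. The weights and the tiny dimensions here are illustrative placeholders, not learned parameters:

```python
def fully_connected(x, W, bias):
    """FC(x) = x . W + bias for a single input vector x (list of floats);
    W has one row per input dimension and one column per output dimension."""
    return [sum(xi * wij for xi, wij in zip(x, col)) + b
            for col, b in zip(zip(*W), bias)]

def relu(x):
    return [max(0.0, v) for v in x]

# Splice a (toy) text vector q and image feature i_f, then map to the
# hidden space: z = (q, i_f); z' = ReLU(FC(z)).
q_vec = [1.0, -1.0]
i_f = [0.5]
z = q_vec + i_f                                  # splicing
W = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]         # 3 inputs -> 2 hidden units
bias = [0.0, 0.0]
z_hidden = relu(fully_connected(z, W, bias))     # -> [1.0, 0.0]
```

The negative component of the joint feature is zeroed by the ReLU, which is the point of the activation after the fully-connected map.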
5-5. Input z' into two fully-connected layers respectively and output two prediction vectors s and b, representing respectively the matching degree of each candidate box and the regression values of each candidate box towards the labeled box. The specific formulas are as follows:
s = FC(z′) (formula 26)
b = FC1(z′) (formula 27)
Where FC and FC1 represent two distinct fully connected layers.
The loss function in step (6) is specifically as follows:
6-1. The difference (loss) between the candidate-box matching scores s and the true values is calculated using the relative entropy (also called KL divergence). The specific formula is as follows:
L_s = Σ_i l_i · log(l_i / s_i) (formula 28)
where l_i, s_i are the i-th elements of l and s, respectively.
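A sketch of the matching loss of 6-1. Since KL divergence requires a probability distribution on both sides, the predicted scores are passed through a softmax here; that normalization is an assumption, as the patent does not state how s is normalized:

```python
import math

def kl_matching_loss(l, s):
    """L_s = sum_i l_i * log(l_i / p_i), where p = softmax(s) and l is the
    normalized target vector; terms with l_i = 0 contribute nothing."""
    m = max(s)
    exps = [math.exp(v - m) for v in s]
    total = sum(exps)
    p = [e / total for e in exps]
    return sum(li * math.log(li / pi) for li, pi in zip(l, p) if li > 0)
```

When the predicted distribution matches the target, the loss is zero; any mismatch yields a positive loss.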
6-2. The difference (loss) between the candidate-box regression values and the true values is calculated using the smooth L1 loss function (Smooth L1 Loss). The specific formula is as follows:
L_b = Σ_i smooth_L1(b_i − b*_i), where smooth_L1(x) = 0.5x² if |x| < 1, and |x| − 0.5 otherwise (formula 29)
where b_i, b*_i are the i-th elements of b and b*, respectively. L_b is the loss value measuring the difference between the final regression box and the true annotation box.
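The smooth L1 regression loss of 6-2 can be sketched as follows; the sum reduction over the four regression components is a standard choice, as the patent's exact reduction is not shown:

```python
def smooth_l1(x):
    """Smooth L1 penalty on a single difference: quadratic near zero,
    linear beyond |x| = 1."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def regression_loss(b, b_target):
    """L_b: sum of smooth-L1 differences between the predicted regression
    vector b and the target regression vector b*."""
    return sum(smooth_l1(bi - ti) for bi, ti in zip(b, b_target))
```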
Calculating the network predicted value in the step (8), specifically as follows:
8-1, sorting the candidate boxes according to the s vectors output in the step (5), and selecting the candidate box with the highest score as a prediction box.
8-2. Let the coordinate values of the prediction box and the corresponding regression values be (x1, y1, x2, y2) and (dx, dy, dw, dh), respectively. The final network prediction box b_p is calculated as follows:
w_p = e^dw × w (formula 33)
h_p = e^dh × h (formula 34)
x1_p = (x_ctr + dx × w) − 0.5 × w_p (formula 35)
y1_p = (y_ctr + dy × h) − 0.5 × h_p (formula 36)
x2_p = (x_ctr + dx × w) + 0.5 × w_p (formula 37)
y2_p = (y_ctr + dy × h) + 0.5 × h_p (formula 38)
b_p = (x1_p, y1_p, x2_p, y2_p) (formula 39)
where w, h are calculated by formulas (8), (9) and x_ctr, y_ctr by formulas (12), (13); e is the natural constant.
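The final box decoding of step 8-2 can be sketched as the inverse of the encoding assumed in step (4): scale the width and height by e^dw, e^dh and shift the center by dx·w, dy·h. The function name is illustrative:

```python
import math

def decode_box(box, reg):
    """Apply regression values (dx, dy, dw, dh) to a candidate box
    (x1, y1, x2, y2) to obtain the final prediction box b_p."""
    x1, y1, x2, y2 = box
    dx, dy, dw, dh = reg
    w, h = x2 - x1, y2 - y1
    xc, yc = x1 + 0.5 * w, y1 + 0.5 * h
    wp, hp = math.exp(dw) * w, math.exp(dh) * h   # scaled width/height
    xcp, ycp = xc + dx * w, yc + dy * h           # shifted center
    return (xcp - 0.5 * wp, ycp - 0.5 * hp,
            xcp + 0.5 * wp, ycp + 0.5 * hp)
```

A zero regression vector leaves the box unchanged, which is the expected consistency check against the encoding.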
The invention has the following beneficial effects:
The algorithm provided by the invention, especially the DDPN-based image feature extraction, achieves a significant improvement on the image visual positioning task and greatly exceeds all current mainstream methods on this task. In addition, the feature extraction algorithm of the invention also has very important application value and great potential in other cross-modal fields such as Image Question Answering (IQA) and Image Captioning.
Drawings
FIG. 1 is a diagram of the Faster R-CNN (image detection algorithm) network framework with added attribute value prediction according to the present invention;
FIG. 2 is a schematic diagram of the model based on the Diversified and Discriminative Proposal Network (DDPN).
Detailed Description
The following is a more detailed description of the detailed parameters of the present invention.
Step (1), training the Diversified and Discriminative Proposal Network (DDPN)
Use Faster R-CNN (an image detection algorithm) and add prediction of object attribute values on top of it, as shown in FIG. 1. Train it on the Visual Genome data set until the network converges; the resulting converged network is called the DDPN network.
The step (2) of extracting features of the image by using the DDPN network specifically comprises the following steps:
2-1. here, the DDPN network is used to predict 100 candidate boxes in the input image.
2-2. Input the image areas corresponding to the 100 candidate boxes into the DDPN network, extract the output data of the Pool5 layer as the features p_f corresponding to the candidate boxes, and splice the features of all candidate frames in one picture into i'_f.
Extracting text data features in step (3)
3-1. For the description text data, we first segment the texts and build a word dictionary of the description texts. Only the first 15 words are taken for each description text; if the text has fewer than 15 words, it is padded with null characters. Each word is then replaced with its index value in the word dictionary, so that each text is converted into a 15-dimensional word index vector.
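The fixed-length (15-word) truncation and padding described above can be sketched as follows; the use of index 0 for the null/padding character is an assumption:

```python
def to_fixed_indices(text, vocab, length=15, pad_index=0):
    """Truncate or pad a description to exactly `length` tokens, then map
    each word to its dictionary index (pad_index for padding/unknowns)."""
    words = text.lower().split()[:length]
    words += [None] * (length - len(words))
    return [vocab.get(w, pad_index) if w is not None else pad_index
            for w in words]

vocab = {"red": 1, "car": 2}
vec = to_fixed_indices("red car", vocab)   # 15-dimensional index vector
```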
Step (4) the target vector and the target value of the regression frame are constructed
4-1. Construct the target vector l according to the method described previously, where h = 0.5, and the regression target values b of the candidate boxes.
constructing the deep neural network in the step (5), as shown in fig. 2, specifically as follows:
5-1. For the text feature, the input here is the 15-dimensional index vector generated in step (3). Word embedding converts each word index into the corresponding word vector; the word vector size we use is 300. Each description text thus becomes a matrix of size 15×300, which we then feed into an LSTM (a recurrent neural network structure) whose output is set to a 2048-dimensional vector; the output of the last unit of the LSTM is taken as the text feature q'. Finally, q' is copied k times and the copies are spliced to form the text feature q.
5-2. Calculate the position feature vector f_sp of each candidate box according to the algorithm described above.
5-3. Splice the image area feature i'_f corresponding to the candidate frames and the candidate-frame position features f_sp to obtain the final feature i_f of the input image.
5-4. Splice the text vector q and the image feature i_f to obtain the joint expression feature z, and input it sequentially into a fully-connected function and a ReLU function whose output is a 512-dimensional vector, thereby mapping z to z'.
5-5. Input z' into a fully-connected function with 1-dimensional output to generate the candidate-box matching score prediction vector s; at the same time, input z' into a fully-connected function with 4-dimensional output to generate the candidate-box regression value vector b.
| Data set | Flickr30k-Entities | Referit | Refcoco | Refcoco+ |
| --- | --- | --- | --- | --- |
| val | 72.78% | 63.77% | 76.61% | 64.34% |
| test | 73.45% | 63.27% | 76.23% | 64.01% |
| testA | | | 79.99% | 71.24% |
| testB | | | 72.11% | 55.55% |
Table 1 the accuracy of the method described herein on each mainstream data set in the visual positioning task.
Where val, test, testA, testB are test sets in the data set. The open space indicates that the test set does not exist within the data set.
Claims (4)
1. A method for generating a network aiming at visual positioning based on diversity discrimination candidate boxes is characterized by comprising the following steps:
step (1), training the diversity discrimination candidate frame to generate network DDPN
Use Faster R-CNN and add prediction of object attribute values on top of it; train this Faster R-CNN with object attribute value prediction on the Visual Genome data set until the network converges; the resulting converged network is called the DDPN network;
step (2) extracting features of the image by using the trained DDPN network
Calculate k candidate frames containing objects in the input image I using the DDPN network; for each candidate frame, input its corresponding region of the input image I into the DDPN network and extract the output of a certain layer of the network as the feature of that candidate frame; splice the features of all candidate frames in the input image I to generate the overall feature i'_f;
Step (3) extracting text data characteristics
Dividing words of all texts in the image data set and constructing a dictionary, setting the dictionary to contain d words in total, converting each input description text into a dictionary sequence number list according to the dictionary, and converting the texts into a text vector form;
step (4), constructing target vectors and target values of regression frames
For each image and a given description text, construct a target vector l in which each element corresponds one-to-one to a candidate frame from step (2); calculate the overlapping degree between each of the k candidate frames obtained in step (2) and the real labeling frame, and set the target vector l according to the overlapping degree; for the target values of the regression frame, calculate the regression target vector b* of each candidate frame from the difference between the coordinate values of that candidate frame and the coordinate values of the real labeling frame;
Step (5) constructing a deep neural network
For the description text: first convert the text vector obtained in step (3) into the matrix q_e using a word vectorization technique; input q_e into a long short-term memory network and take the vector q' output by its last unit; copy q' k times and splice the copies to form the text feature vector q; process the coordinates of each candidate frame in the image features to generate the candidate-frame position feature vector f_sp; splice the generated position feature vector f_sp with the image feature i'_f of the corresponding candidate frame to generate the final image feature i_f;
As cross-modal modeling, splice the text feature vector q and the image feature i_f to generate the joint expression feature z; map z to a hidden feature space using a fully-connected function and an activation function to generate the feature z'; finally input z' into two fully-connected functions respectively to output two prediction vectors: the matching scores s of the k candidate frames and the regression values b of each candidate frame;
step (6), loss function
Inputting the two prediction vectors output in the step (5) and the corresponding target vectors into corresponding loss functions respectively, and outputting two loss values respectively;
step (7), training the model
Training the model parameters of the neural network in the step (5) by using a back propagation algorithm according to the loss value generated by the loss function in the step (6) until the whole network model converges;
step (8), calculating network predicted value
Sort the candidate frames according to the score vector s output in step (5), select the candidate frame with the highest score as the prediction frame, perform fine regression on the prediction frame according to the regression value vector b output in step (5), and finally generate the network's prediction box b_p;
The target vector and the target value of the regression frame are constructed in the step (4), and the method specifically comprises the following steps:
4-1. Calculate the overlapping degree between each candidate frame and the real labeling frame, set the elements of the l vector corresponding to candidate frames whose overlapping degree is larger than the set threshold h to the corresponding IOU values, and finally normalize the l vector so that its elements sum to 1; the calculation formula of the overlapping degree between two frames is as follows:
IOU(A, B) = area(A ∩ B) / area(A ∪ B) (formula 1)
where A ∩ B is the intersection of candidate frame A and candidate frame B, and A ∪ B is the union of candidate frame A and candidate frame B;
The formula for vector normalization is as follows:
l = l / sum(l) (formula 2)
where sum(l) is the sum over the elements of l; its output is a scalar;
4-3. The calculation formulas of the regression target values of a candidate frame are:
w_gt = x2_gt − x1_gt (formula 3)
h_gt = y2_gt − y1_gt (formula 4)
where x1_gt, y1_gt, x2_gt, y2_gt are the coordinate values of the lower-left and upper-right corners of the real labeling frame; w_gt represents the width of the real labeling frame and h_gt its height;
w = x2 − x1 (formula 5)
h = y2 − y1 (formula 6)
where x1, y1, x2, y2 are the coordinate values of the lower-left and upper-right corners of the candidate frame; w represents the width of the candidate box and h its height;
x_ctr^gt = x1_gt + 0.5 × w_gt (formula 7)
y_ctr^gt = y1_gt + 0.5 × h_gt (formula 8)
x_ctr = x1 + 0.5 × w (formula 9)
y_ctr = y1 + 0.5 × h (formula 10)
where x_ctr, y_ctr and x_ctr^gt, y_ctr^gt are the center coordinate values of the candidate frame and the real labeling frame, respectively;
dx = (x_ctr^gt − x_ctr) / w (formula 11)
dy = (y_ctr^gt − y_ctr) / h (formula 12)
dw = ln(w_gt / w) (formula 13)
dh = ln(h_gt / h) (formula 14)
b* = (dx, dy, dw, dh) (formula 15)
where b* is the final regression target vector of the candidate frame;
constructing a deep neural network in the step (5), which comprises the following specific steps:
5-1. The word vector operation is as follows: convert the text index vector q_i obtained in step (3) into one-hot vectors q_o; a one-hot vector is a vector in which the element at the index position is 1 and all other elements are 0; then input q_o into a fully-connected function whose output is a v-dimensional vector. The specific formula is as follows:
q_e = q_o · W_e (formula 16)
The obtained word vector matrix q_e is input into the LSTM to form a t × n dimensional output feature matrix. The specific formula is as follows:
q_lstm = LSTM(q_e) (formula 17)
Then q' (the output of the last LSTM unit) is copied k times and the copies are spliced to obtain the text vector q. The specific formula is as follows:
q = (q′, q′, ..., q′)^T (formula 18)
5-2. Calculate the position feature f_sp of each candidate frame. The specific formula is as follows:
f_sp = (x1/w_img, y1/h_img, x2/w_img, y2/h_img, (x2 − x1)(y2 − y1)/(w_img · h_img)) (formula 19)
where x1, y1, x2, y2 are the coordinate values of the lower-left and upper-right corners of the candidate frame, and w_img, h_img are the width and height of the input image, respectively;
5-3. Splice the image region feature i'_f corresponding to the candidate frames and the candidate-frame position feature f_sp to generate the final image feature i_f. The formula is as follows:
i_f = (i'_f, f_sp) (formula 20)
5-4. Splice the text vector q and the image feature i_f to generate the joint expression feature z, which the fully-connected and activation functions map to the hidden feature space to generate the feature z'. The formulas are as follows:
z = (q, i_f) (formula 21)
z′ = ReLU(FC(z)) (formula 22)
Where FC is the full connection function and ReLU is the activation function;
5-5. Input z' into two fully-connected layers respectively and output two prediction vectors s and b, representing respectively the matching degree of each candidate frame and the regression values of each candidate frame towards the labeling frame; the specific formulas are as follows:
s = FC(z′) (formula 23)
b = FC1(z′) (formula 24)
Wherein FC and FC1 represent two distinct fully connected layers;
calculating the network predicted value in the step (8), specifically as follows:
8-1, sorting the candidate frames according to the s vectors output in the step (5), and selecting the candidate frame with the highest score as a prediction frame;
8-2. Let the coordinate values of the prediction box and the corresponding regression values be (x1', y1', x2', y2') and (dx', dy', dw', dh'), respectively. The final network prediction box b_p is calculated as follows:
w_p = e^dw′ × w (formula 27)
h_p = e^dh′ × h (formula 28)
x1_p = (x_ctr + dx′ × w) − 0.5 × w_p (formula 29)
y1_p = (y_ctr + dy′ × h) − 0.5 × h_p (formula 30)
x2_p = (x_ctr + dx′ × w) + 0.5 × w_p (formula 31)
y2_p = (y_ctr + dy′ × h) + 0.5 × h_p (formula 32)
b_p = (x1_p, y1_p, x2_p, y2_p) (formula 33)
Wherein w, h are calculated by formulas (5), (6); e is a natural constant.
2. The method for visual localization according to claim 1, wherein the step (2) of extracting features from the image using the trained DDPN network comprises the following steps:
Each candidate frame corresponds to the feature p_f of its image region; splice the features of all candidate frames in one picture to generate the overall feature i'_f. The specific formula is as follows:
i'_f = (p_f^(1), p_f^(2), ..., p_f^(k)) (formula 34)
3. the method for visual localization based on the diversity-discrimination candidate box generation network according to claim 2, wherein the text data feature extraction in step (3) is as follows:
3-1. For the problem text, first split it into a word list q_w of fixed length t. The specific formula is as follows:
q_w = (w_1, w_2, ..., w_i, ..., w_t) (formula 35)
where w_i is a word string;
3-2, listing words q according to word dictionarywConverts the words in (a) to index values, thereby converting the text to a fixed-length index vector qiThe concrete formula is as follows:
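Steps 3-1 and 3-2 can be sketched as follows; the padding and unknown-word index values (`pad`, `unk`) are illustrative assumptions, since the claim does not specify them.

```python
def text_to_indices(question, vocab, t, unk=1, pad=0):
    """Sketch of steps 3-1/3-2: split the question into a fixed-length
    word list q_w of length t (formula 35), then map each word to its
    dictionary index to get the index vector q_i."""
    words = question.lower().split()[:t]          # q_w, truncated to length t
    q_i = [vocab.get(w, unk) for w in words]      # word -> index (unk if absent)
    q_i += [pad] * (t - len(q_i))                 # pad short questions to length t
    return q_i
```

For example, with a two-word dictionary, a three-word question maps to two known indices, one unknown index, and padding.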
4. The method for visual localization according to claim 3, wherein the loss function in step (6) is as follows:
6-1, calculate the difference between the candidate-box matching scores s and the true values using relative entropy, i.e. KL divergence. The specific formula is as follows:
where l_i and s_i are the i-th elements of l and s, respectively;
6-2, calculate the difference between the candidate-box regression values and the true values using the smooth L1 loss function. The specific formula is as follows:
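A minimal sketch of both loss terms follows. The exact formulas are not reproduced above, so the softmax normalisation of s inside the KL term is an assumption; the smooth L1 form (quadratic below 1, linear above) is the standard one.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def kl_score_loss(l, s):
    """Step 6-1 sketch: KL divergence between the ground-truth score
    distribution l and the softmax-normalised predicted scores s."""
    p = softmax(s)
    mask = l > 0                       # treat 0 * log(0) as 0
    return float(np.sum(l[mask] * np.log(l[mask] / p[mask])))

def smooth_l1(reg, target):
    """Step 6-2 sketch: smooth L1 loss between predicted regression
    values and targets."""
    d = np.abs(np.asarray(reg) - np.asarray(target))
    return float(np.sum(np.where(d < 1, 0.5 * d ** 2, d - 0.5)))
```

The KL term is zero when the predicted distribution matches the target exactly; smooth L1 behaves like 0.5x² for small errors and |x| − 0.5 for large ones.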
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811305577.8A CN109712108B (en) | 2018-11-05 | 2018-11-05 | Visual positioning method for generating network based on diversity discrimination candidate frame |
Publications (2)
Publication Number | Publication Date
---|---
CN109712108A (en) | 2019-05-03
CN109712108B (en) | 2021-02-02
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |