CN109712108B - Visual positioning method for generating network based on diversity discrimination candidate frame - Google Patents

Visual positioning method for generating network based on diversity discrimination candidate frame

Info

Publication number
CN109712108B
CN109712108B (application CN201811305577.8A)
Authority
CN
China
Prior art keywords
vector
candidate
network
frame
formula
Prior art date
Legal status
Active
Application number
CN201811305577.8A
Other languages
Chinese (zh)
Other versions
CN109712108A (en)
Inventor
Jun Yu (俞俊)
Zhou Yu (余宙)
Chenchao Xiang (项晨钞)
Current Assignee
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN201811305577.8A priority Critical patent/CN109712108B/en
Publication of CN109712108A publication Critical patent/CN109712108A/en
Application granted granted Critical
Publication of CN109712108B publication Critical patent/CN109712108B/en

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a visual positioning method based on a diversity-discriminative candidate box generation network. The invention comprises the following steps: 1. training the diversity-discriminative candidate box generation network (DDPN); 2. extracting features of the image with the trained DDPN network; 3. extracting text data features; 4. constructing the target vector and the target values of the regression boxes; 5. constructing a deep neural network; 6. setting the loss functions; 7. training the model; 8. computing the network prediction. The proposed algorithm, in particular the DDPN-based image feature extraction, achieves a significant improvement on the image visual positioning task and greatly exceeds the current mainstream methods on this task. The feature extraction algorithm of the invention also has very important application value and great potential in other cross-modal fields such as image question answering and image captioning.

Description

Visual positioning method for generating network based on diversity discrimination candidate frame
Technical Field
The present invention relates to a deep-neural-network-based algorithm for the image visual localization (Visual Grounding) problem, and more particularly to an image feature extraction method based on a Diversified and Discriminative candidate box (Proposal) generation Network (DDPN) together with a deep neural network structure for the image visual localization problem.
Background
Visual grounding is a subtask in the field of "cross-media". "Cross-media" is a research direction at the intersection of computer vision and natural language processing that aims at bridging the "semantic gap" between different media (such as images and texts) and establishing a unified semantic representation. Based on theoretical methods for unified cross-media representation, several popular research directions have been derived, such as image caption generation (Image Captioning), image-text cross-media retrieval (Image-Text Cross-media Retrieval), question answering on image content (Image Question Answering), and image visual positioning (Visual Grounding). Image caption generation aims to summarize the content of an image in one or more sentences of natural language; image-text cross-media retrieval aims to find the best matching text description for an image in a database, or the best matching image for a text description; automatic question answering on image content takes a picture and a question described in natural language as input and outputs an answer in natural language; visual positioning of an image takes a picture and a natural language description as input and selects the relevant region of the picture according to the description.
With the rapid development of deep learning in recent years, deep neural networks such as deep convolutional neural networks (CNN) and deep recurrent neural networks (RNN) have achieved quite good results on image caption generation and automatic question answering on image content. Progress on the visual positioning problem, however, has been slow and the gains very limited. Solving the visual positioning problem with neural networks is therefore a problem worthy of intensive research.
In terms of practical applications, image visual positioning algorithms have very broad application scenarios. Text-based question-answering systems have already been widely deployed in the operating systems of smartphones and PCs as an important mode of human-computer interaction, such as Apple's Siri, Microsoft's Cortana and Amazon's Alexa. With the rapid development of wearable smart hardware (such as Google Glass and Microsoft HoloLens) and augmented reality technology, an image content positioning system based on visual perception may become an important mode of human-computer interaction in the near future.
In conclusion, image visual positioning is a direction worthy of intensive research, and this patent starts from several key difficulties in the task in order to solve the problems of current methods.
Image content in natural scenes is complex and contains diverse subjects, and natural language descriptions have a high degree of freedom, which makes image visual positioning very challenging. Specifically, there are two main difficulties:
(1) Extracting appropriate features for the image: extracting suitable features from images is a basic task when neural networks are used to solve cross-modal problems. Current mainstream algorithms for cross-modal problems such as image captioning, image question answering and visual positioning all preprocess the image in advance to extract features, and much related work shows that the image feature extraction algorithm has a great influence on the performance of the neural network.
(2) Uniformly modeling the cross-media data of image and text and performing effective feature fusion: multi-modal feature fusion is a classic and fundamental problem in cross-media representation; commonly used methods are feature concatenation, feature summation, or feature fusion with a multi-layer neural network. In addition, feature fusion models based on bilinear models work well in many fields such as fine-grained image classification, natural language processing and recommendation systems, but their high computational complexity makes model training challenging. Therefore, selecting a suitable fusion strategy for cross-media data features that guarantees computational efficiency while improving the expressive power of the fused features is a direction worthy of intensive research.
Disclosure of Invention
The invention provides, for the Visual Grounding task, an image feature extraction algorithm based on a Diversified and Discriminative candidate box (Proposal) generation Network (DDPN) together with a deep neural network algorithm for visual positioning, and achieves a great breakthrough on the visual positioning problem.
The technical scheme adopted by the invention for solving the technical problem comprises the following steps:
step (1), training diversity and Discriminative candidate box generation network (DDPN)
Using the fast-RCNN (an image detection algorithm) and adding the prediction of the object property values on the basis thereof, as shown in FIG. 1, it is trained on the Visual Genome data set until the network converges, and the resulting converged network is called DDPN network.
Step (2) extracting features of the image by using the trained DDPN network
For an input image, k candidate boxes containing objects in the image are computed with the DDPN network trained in the previous step. For each candidate box, the corresponding region of the candidate box in the image is fed into the DDPN network and the output of a chosen layer of the network is extracted as the feature p_f of that candidate box. The features of all candidate boxes in a picture are concatenated to generate the overall feature i'_f.
Owing to the structure of the DDPN network, these steps can be completed in a single forward pass, which keeps the feature extraction algorithm practical.
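A minimal Python sketch of this extraction step is given below. It is an illustration only: the ddpn object and its propose_boxes and pooled_features methods are hypothetical names standing in for the trained DDPN detector described above; the patent does not define a programming interface, and in practice the proposals and their pooled features come out of a single forward pass.

import numpy as np

def extract_image_features(ddpn, image, k=100):
    # 1. predict k candidate boxes containing objects (hypothetical ddpn interface)
    boxes = ddpn.propose_boxes(image, top_k=k)            # list of k (x1, y1, x2, y2) tuples
    # 2. take the feature p_f of a chosen network layer for every candidate box
    feats = [ddpn.pooled_features(image, box) for box in boxes]
    # 3. concatenate all per-box features into the overall feature i'_f
    i_f = np.stack(feats, axis=0)                         # shape: (k, feature_dim)
    return boxes, i_f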
Step (3) extracting text data characteristics
All texts in the image data set are segmented, a dictionary is built, the dictionary contains d words in total, and each input description text is converted into a dictionary sequence number list according to the dictionary, so that the texts are converted into a vector form.
Step (4), constructing target vectors and target values of regression frames
For a picture and a given description text, a target vector l is constructed whose k elements correspond one-to-one to the candidate boxes of step (2). According to the k candidate boxes obtained in step (2), the overlap (IoU) between each candidate box and the ground-truth box is computed, and the target vector l is set according to this overlap. For the target values of the regression boxes, the regression target vector b* of each candidate box is computed from the difference between the coordinates of the candidate box and the coordinates of the ground-truth box.
Step (5) constructing a deep neural network
The structure of the network is shown in FIG. 2. For the description text, the text vector obtained in step (3) is first converted into a matrix q_e using word vectorization (word embedding). The matrix q_e is fed into a Long Short-Term Memory (LSTM) network and the vector q' output by the last unit is selected; q' is copied k times and the copies are concatenated into the text feature vector q. The coordinates of each candidate box in the image features are processed to generate the candidate-box position feature vector f_sp, and the generated position features are concatenated with the image features i'_f of the corresponding candidate boxes to generate the final image feature i_f. Concatenation is used as the cross-modal modeling to generate the joint representation z of the text vector q and the image feature i_f. z is mapped into a hidden feature space with a fully connected layer and an activation function to generate the feature z'; finally z' is fed into two fully connected layers that output two predictions, namely the matching scores s of the k candidate boxes and the regression values b of each candidate box.
Step (6), setting the loss functions (Loss Function)
The two prediction vectors output in step (5) and the corresponding target vectors are input into their respective loss functions, and two loss values (loss) are output.
Step (7), training the model
The model parameters of the neural network in step (5) are trained with the back-propagation algorithm according to the loss values generated by the loss functions in step (6), until the whole network model converges.
Step (8), calculating network predicted value
The candidate boxes are sorted according to the score vector s output in step (5), the candidate box with the highest score is selected as the prediction box, fine regression is applied to the prediction box according to the regression vector b output in step (5), and the final prediction box b_p of the network is generated.
For training the DDPN network in step (1), the Visual Genome data set is preprocessed so that the 1600 most frequent object classes and the 400 most frequent attribute values are retained.
The step (2) of extracting features of the image by using the trained DDPN network is as follows:
2-1. Each candidate box corresponds to a feature p_f of its image region. The features of all candidate boxes in one picture are concatenated to generate the overall feature i'_f. The specific formula is as follows:
i'_f = (p_f^1, p_f^2, ..., p_f^k)    (formula 1)
The text data features are extracted in step (3) as follows:
3-1. The question text is first split into a word list q_w of fixed length t. The specific formula is as follows:
q_w = (w_1, w_2, ..., w_i, ..., w_t)    (formula 2)
where w_i is a word string.
3-2. According to the word dictionary, the words in q_w are converted into index values, so that the text is converted into a fixed-length index vector q_i. The specific formula is as follows:
q_i = (i_{w_1}, i_{w_2}, ..., i_{w_k}, ..., i_{w_t})    (formula 3)
where i_{w_k} is the index value of w_k in the word dictionary.
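A minimal Python sketch of sub-steps 3-1 and 3-2, assuming the word dictionary has already been built from the data set; the padding and unknown-word indices are illustrative choices rather than values fixed by the patent (the detailed description below pads texts shorter than the fixed length with null characters).

def text_to_index_vector(text, word_to_index, t=15, pad_index=0, unk_index=1):
    # 3-1: split the description text into at most t words (simple whitespace tokenisation)
    words = text.lower().split()[:t]
    # 3-2: replace every word by its index value in the word dictionary
    indices = [word_to_index.get(w, unk_index) for w in words]
    # pad short texts with a reserved index so that q_i always has length t
    indices += [pad_index] * (t - len(indices))
    return indices

# example: q_i = text_to_index_vector("the man in the red shirt", word_to_index)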
The target vector and the target value of the regression frame are constructed in the step (4), and the method specifically comprises the following steps:
4-1. The overlap (IoU) between each candidate box and the ground-truth box is computed; the elements of the l vector corresponding to candidate boxes whose overlap is greater than h are set to the corresponding IoU values, and finally the l vector is normalized so that its elements sum to 1. The overlap between two boxes is computed as follows:
IoU(A, B) = (A ∩ B) / (A ∪ B)    (formula 4)
where A ∩ B is the area of the intersection of box A and box B, and A ∪ B is the area of the union of box A and box B.
The formula for vector normalization is as follows:
l = l / sum(l)    (formula 5)
where sum(l) is the sum over the elements of l, whose output is a scalar.
4-3. The regression target values of the candidate boxes are computed as follows:
w_gt = x2_gt - x1_gt    (formula 6)
h_gt = y2_gt - y1_gt    (formula 7)
where x1_gt, y1_gt, x2_gt, y2_gt are the coordinates of the lower-left and upper-right corners of the ground-truth box; w_gt is the width and h_gt the height of the ground-truth box;
w = x2 - x1    (formula 8)
h = y2 - y1    (formula 9)
where x1, y1, x2, y2 are the coordinates of the lower-left and upper-right corners of the candidate box; w is the width and h the height of the candidate box;
x_ctr_gt = x1_gt + 0.5 × w_gt    (formula 10)
y_ctr_gt = y1_gt + 0.5 × h_gt    (formula 11)
x_ctr = x1 + 0.5 × w    (formula 12)
y_ctr = y1 + 0.5 × h    (formula 13)
where x_ctr, y_ctr and x_ctr_gt, y_ctr_gt are the center coordinates of the candidate box and the ground-truth box respectively;
dx = (x_ctr_gt - x_ctr) / w    (formula 14)
dy = (y_ctr_gt - y_ctr) / h    (formula 15)
dw = log(w_gt / w)    (formula 16)
dh = log(h_gt / h)    (formula 17)
b* = (dx, dy, dw, dh)    (formula 18)
where b* is the final regression target vector of the candidate box.
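A NumPy sketch of sub-steps 4-1 and 4-3 under the formulas above. The (dx, dy, dw, dh) encoding written out here is the standard Faster R-CNN box parameterisation, which is consistent with the inverse transform of step (8); boxes are assumed to be given as (x1, y1, x2, y2) tuples.

import numpy as np

def iou(box_a, box_b):
    # overlap of formula 4: intersection area divided by union area
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def build_targets(boxes, gt_box, h_thresh=0.5):
    # target vector l (formulas 4-5) and regression targets b* (formulas 6-18)
    k = len(boxes)
    l = np.zeros(k)
    b_star = np.zeros((k, 4))
    w_gt = gt_box[2] - gt_box[0]
    h_gt = gt_box[3] - gt_box[1]
    x_ctr_gt = gt_box[0] + 0.5 * w_gt
    y_ctr_gt = gt_box[1] + 0.5 * h_gt
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        overlap = iou((x1, y1, x2, y2), gt_box)
        if overlap > h_thresh:
            l[i] = overlap                      # keep the IoU value as the matching target
        w, h = x2 - x1, y2 - y1
        x_ctr, y_ctr = x1 + 0.5 * w, y1 + 0.5 * h
        b_star[i] = [(x_ctr_gt - x_ctr) / w, (y_ctr_gt - y_ctr) / h,
                     np.log(w_gt / w), np.log(h_gt / h)]
    if l.sum() > 0:
        l = l / l.sum()                         # normalise l so that its elements sum to 1
    return l, b_star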
Constructing a neural network in the step (5), specifically as follows:
5-1. The word vector (word embedding) operation: the text index vector q_i obtained in step (3) is converted into a one-hot matrix q_o. A one-hot vector means that, for each vector q_o^k in q_o, the i_{w_k}-th element is 1 and all the remaining elements are 0. The obtained q_o is then fed into a fully connected layer that outputs a v-dimensional vector; the specific formula is as follows:
q_e = q_o · W_e    (formula 19)
where W_e ∈ R^(d×v) is a parameter to be learned and the output is the word vector matrix q_e ∈ R^(t×v).
The obtained word vector matrix q_e is fed into the LSTM to form a t × n dimensional output feature matrix; the specific formula is as follows:
q_lstm = LSTM(q_e)    (formula 20)
where q_lstm ∈ R^(t×n). The output of the last unit of the LSTM is taken as the text feature q'; q' is copied k times and the copies are concatenated to obtain the text vector q, the specific formula being as follows:
q = (q', q', ..., q')^T    (formula 21)
5-2. The position feature f_sp of each candidate box is computed; the specific formula is as follows:
f_sp = ( x1/w_img, y1/h_img, x2/w_img, y2/h_img, ((x2 - x1)·(y2 - y1))/(w_img·h_img) )    (formula 22)
where x1, y1, x2, y2 are the coordinates of the lower-left and upper-right corners of the candidate box, and w_img, h_img are the width and height of the input image respectively.
5-3. The image region features i'_f of the candidate boxes and the candidate-box position features f_sp are concatenated to generate the final image feature i_f; the formula is as follows:
i_f = (i'_f, f_sp)    (formula 23)
5-4. The text vector q and the image feature i_f are concatenated to generate the joint representation z, which is mapped into the hidden feature space by the fully connected and activation functions to generate the feature z'; the formulas are as follows:
z = (q, i_f)    (formula 24)
z' = ReLU(FC(z))    (formula 25)
where FC is the fully connected function and ReLU is the activation function.
5-5. z' is fed into two fully connected layers that output two prediction vectors s and b respectively, which represent the matching score of each candidate box and the regression values of each candidate box with respect to the ground-truth box. The specific formulas are as follows:
s = FC(z')    (formula 26)
b = FC1(z')    (formula 27)
where FC and FC1 represent two distinct fully connected layers.
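Sub-steps 5-1 to 5-5 can be sketched as a single PyTorch module. This is an illustrative reimplementation rather than the patented network itself: the embedding, LSTM and hidden sizes follow the detailed description further below (300 / 2048 / 512), while the 5-dimensional spatial feature and the use of nn.Embedding in place of the explicit one-hot multiplication of formula 19 are assumptions.

import torch
import torch.nn as nn

class GroundingNet(nn.Module):
    def __init__(self, dict_size, emb_dim=300, lstm_dim=2048,
                 img_dim=2048, sp_dim=5, hidden_dim=512, k=100):
        super().__init__()
        self.k = k
        self.embed = nn.Embedding(dict_size, emb_dim)                   # word vectorization (formula 19)
        self.lstm = nn.LSTM(emb_dim, lstm_dim, batch_first=True)        # formula 20
        self.fuse = nn.Linear(lstm_dim + img_dim + sp_dim, hidden_dim)  # FC of formula 25
        self.score = nn.Linear(hidden_dim, 1)                           # matching scores s (formula 26)
        self.regress = nn.Linear(hidden_dim, 4)                         # regression values b (formula 27)

    def forward(self, q_i, box_feats, f_sp):
        # q_i: (t,) word indices; box_feats: (k, img_dim) per-box features i'_f; f_sp: (k, sp_dim)
        q_e = self.embed(q_i.unsqueeze(0))                              # (1, t, emb_dim)
        q_lstm, _ = self.lstm(q_e)
        q_prime = q_lstm[:, -1, :]                                      # output of the last LSTM unit, q'
        q = q_prime.expand(self.k, -1)                                  # copy q' k times (formula 21)
        z = torch.cat([q, box_feats, f_sp], dim=1)                      # joint representation z (formula 24)
        z_prime = torch.relu(self.fuse(z))                              # formula 25
        s = self.score(z_prime).squeeze(1)                              # (k,) matching scores
        b = self.regress(z_prime)                                       # (k, 4) regression values
        return s, b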
The loss function in step (6) is specifically as follows:
6-1. The difference (loss) between the candidate-box matching scores s and the target values is computed with the relative entropy (also called KL divergence); the specific formula is as follows:
L_s = Σ_{i=1..k} l_i · log(l_i / s_i)    (formula 28)
where l_i, s_i are the i-th elements of l and s respectively.
6-2. The difference (loss) between the regression values of the candidate boxes and the target values is computed with the smooth L1 loss function (Smooth L1 Loss); the specific formulas are as follows:
L_b = Σ_i smooth_L1(b_i - b*_i)    (formula 29)
smooth_L1(x) = 0.5·x², if |x| < 1;  |x| - 0.5, otherwise    (formula 30)
where b_i, b*_i are the i-th elements of b and b* respectively. L_b is the loss value measuring the difference between the final regression box and the ground-truth box.
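A PyTorch sketch of the two losses. It assumes that the scores s are turned into a distribution with a softmax before the KL term and that the regression loss is summed over all candidate boxes; neither choice is stated explicitly in the text, so both are assumptions.

import torch
import torch.nn.functional as F

def matching_loss(s, l, eps=1e-8):
    # relative entropy (KL divergence) between the target distribution l and the scores s (formula 28)
    p = F.softmax(s, dim=0)                 # assumed normalisation of the raw scores
    return torch.sum(l * torch.log((l + eps) / (p + eps)))

def regression_loss(b, b_star):
    # smooth L1 loss between predicted regression values b and targets b* (formulas 29-30)
    return F.smooth_l1_loss(b, b_star, reduction="sum")

# total loss back-propagated in step (7):
# loss = matching_loss(s, l) + regression_loss(b, b_star)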
Calculating the network predicted value in the step (8), specifically as follows:
8-1, sorting the candidate boxes according to the s vectors output in the step (5), and selecting the candidate box with the highest score as a prediction box.
8-2. Let the coordinates of the selected prediction box and its corresponding regression values be (x1, y1, x2, y2) and (dx, dy, dw, dh) respectively; the final prediction box b_p of the network is computed as follows:
x_ctr_p = dx × w + x_ctr    (formula 31)
y_ctr_p = dy × h + y_ctr    (formula 32)
w_p = e^dw × w    (formula 33)
h_p = e^dh × h    (formula 34)
x1_p = x_ctr_p - 0.5 × w_p    (formula 35)
y1_p = y_ctr_p - 0.5 × h_p    (formula 36)
x2_p = x_ctr_p + 0.5 × w_p    (formula 37)
y2_p = y_ctr_p + 0.5 × h_p    (formula 38)
b_p = (x1_p, y1_p, x2_p, y2_p)    (formula 39)
where w, h, x_ctr, y_ctr are computed by formulas (8), (9), (12) and (13), and e is the natural constant.
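A NumPy sketch of the decoding in 8-1 and 8-2: the highest-scoring candidate box is selected and refined with its regression values, applying the inverse of the encoding of step (4) as written out in formulas 31-39.

import numpy as np

def decode_prediction(boxes, s, b):
    i = int(np.argmax(s))                       # 8-1: candidate box with the highest matching score
    x1, y1, x2, y2 = boxes[i]
    dx, dy, dw, dh = b[i]
    w, h = x2 - x1, y2 - y1                     # formulas 8-9
    x_ctr, y_ctr = x1 + 0.5 * w, y1 + 0.5 * h   # formulas 12-13
    x_ctr_p = dx * w + x_ctr                    # formula 31
    y_ctr_p = dy * h + y_ctr                    # formula 32
    w_p = np.exp(dw) * w                        # formula 33
    h_p = np.exp(dh) * h                        # formula 34
    return (x_ctr_p - 0.5 * w_p, y_ctr_p - 0.5 * h_p,   # formulas 35-36
            x_ctr_p + 0.5 * w_p, y_ctr_p + 0.5 * h_p)   # formulas 37-38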
The invention has the following beneficial effects:
the algorithm provided by the invention, especially the DDPN network-based algorithm for extracting the features of the image, achieves a significant improvement effect on an image visual positioning task, and greatly exceeds all mainstream methods on the task at present. In addition, the feature extraction algorithm of the invention also has very important application value and great potential in other cross-modal related fields such as Image Question Answering (IQA) and Image description (Image capture).
Drawings
FIG. 1 is a diagram of the Faster R-CNN (image detection algorithm) network framework with added attribute value prediction according to the present invention;
FIG. 2 is a schematic diagram of the visual positioning network based on the diversified and discriminative candidate box generation network (DDPN).
Detailed Description
The detailed parameters of the present invention are described in more detail below.
Step (1), training the diversified and discriminative candidate box generation network (DDPN)
The Faster R-CNN (an image detection algorithm) is used and prediction of object attribute values is added on top of it, as shown in FIG. 1. It is trained on the Visual Genome data set until the network converges; the resulting converged network is called the DDPN network.
The step (2) of extracting features of the image by using the DDPN network specifically comprises the following steps:
2-1. here, the DDPN network is used to predict 100 candidate boxes in the input image.
2-2. The image regions corresponding to the 100 candidate boxes are fed into the DDPN network, the output of the Pool5 layer is extracted as the feature p_f of each candidate box, and the features of all candidate boxes in one picture are concatenated into i'_f.
Extracting text data features in step (3)
3-1. For the description text data, we first segment the text into words and build a word dictionary over the description texts. Only the first 15 words are taken for each description text; if the text has fewer than 15 words, it is padded with null characters. Each word is then replaced with its index value in the word dictionary, so that each description is converted into a 15-dimensional word index vector.
Step (4) the target vector and the target value of the regression frame are constructed
4-1. The target vector l is constructed according to the method described above, with the threshold h set to 0.5, and the regression target values b* of the candidate boxes are computed.
constructing the deep neural network in the step (5), as shown in fig. 2, specifically as follows:
5-1. For the description text feature, the text input here is the 15-dimensional index vector generated in step (3). The word embedding technique is used to convert each word index into the corresponding word vector; the word vector size used here is 300. Each description text thus becomes a matrix of size 15 × 300, which is then taken as the input of the LSTM, a recurrent neural network structure, whose output is set to a 2048-dimensional vector. The output of the last unit of the LSTM is taken as the text feature q'; finally, q' is copied k = 100 times and the copies are concatenated to form the text feature q.
5-2. The position feature vector f_sp of each candidate box is computed according to the algorithm described above.
5-3. The image region features i'_f of the candidate boxes and the candidate-box position features f_sp are concatenated to obtain the final feature i_f of the input image.
5-4. The text vector q and the image feature i_f are concatenated to obtain the joint representation z, which is fed sequentially into a fully connected function and a ReLU function whose output is a 512-dimensional vector, thereby mapping z to z'.
5-5. z' is fed into a fully connected function with 1-dimensional output to generate the candidate-box matching score prediction vector s; at the same time, z' is fed into a fully connected function with 4-dimensional output to generate the candidate-box regression value vector b.
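For reference, the concrete settings quoted in this detailed description can be collected in one place (all values are taken from the text above; the dictionary size depends on the data set and is therefore left open):

# Hyper-parameters stated in the detailed description
DDPN_GROUNDING_CONFIG = {
    "visual_genome_classes": 1600,     # most frequent object classes kept for DDPN training
    "visual_genome_attributes": 400,   # most frequent attribute values kept for DDPN training
    "num_candidate_boxes": 100,        # k candidate boxes per image, Pool5 features
    "max_text_words": 15,              # texts truncated / padded to 15 words
    "word_embedding_dim": 300,
    "lstm_output_dim": 2048,
    "fused_hidden_dim": 512,           # output size of the FC + ReLU mapping to z'
    "iou_threshold_h": 0.5,            # threshold used when building the target vector l
}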
Data set    Flickr30k-Entities    Referit    Refcoco    Refcoco+
val         72.78%                63.77%     76.61%     64.34%
test        73.45%                63.27%     76.23%     64.01%
testA       -                     -          79.99%     71.24%
testB       -                     -          72.11%     55.55%
Table 1. Accuracy of the method described herein on the mainstream data sets of the visual positioning task, where val, test, testA and testB are the evaluation splits of each data set; a dash indicates that the split does not exist for that data set.

Claims (4)

1. A visual positioning method based on a diversity-discriminative candidate box generation network, characterized by comprising the following steps:
step (1), training the diversity discrimination candidate frame to generate network DDPN
Using the Faster-RCNN and adding prediction of the object attribute values on top of it, the Faster-RCNN with object attribute value prediction is trained on the Visual Genome data set until the network converges, and the resulting converged network is called the DDPN network;
step (2) extracting features of the image by using the trained DDPN network
calculating k candidate boxes containing objects in the input image I by using the DDPN network; for each candidate box, inputting the corresponding region of the candidate box in the input image I into the DDPN network and extracting the output of a chosen layer of the network as the feature p_f of the candidate box; concatenating the features of all candidate boxes in the input image I to generate the overall feature i'_f;
Step (3) extracting text data characteristics
Dividing words of all texts in the image data set and constructing a dictionary, setting the dictionary to contain d words in total, converting each input description text into a dictionary sequence number list according to the dictionary, and converting the texts into a text vector form;
step (4), constructing target vectors and target values of regression frames
for each image and a given description text, constructing a target vector l whose elements correspond one-to-one to the candidate boxes in step (2); calculating the overlap between each candidate box and the ground-truth box according to the k candidate boxes obtained in step (2), and setting the target vector l according to the overlap; for the target values of the regression boxes, respectively calculating the regression target vector b* of each candidate box from the difference between the coordinate values of the candidate box and the coordinate values of the ground-truth box;
Step (5) constructing a deep neural network
for the description text: firstly converting the text vector obtained in step (3) into a question matrix q_e by using the word vectorization technique; inputting the matrix q_e into a long short-term memory network and selecting the vector q' output by the last unit; copying q' k times and concatenating the copies into the text feature vector q; processing the coordinates of each candidate box in the image features to generate the candidate-box position feature vector f_sp; concatenating the generated position feature vector f_sp with the image features i'_f of the corresponding candidate boxes to generate the final image feature i_f; using concatenation as the cross-modal modeling to generate the joint representation z of the text feature vector q and the image feature i_f; mapping z into a hidden feature space with a fully connected function and an activation function to generate the feature z'; finally inputting z' into two fully connected functions respectively to output two prediction vectors, namely the matching scores s of the k candidate boxes and the regression values b of each candidate box;
step (6), loss function
Inputting the two prediction vectors output in the step (5) and the corresponding target vectors into corresponding loss functions respectively, and outputting two loss values respectively;
step (7), training the model
Training the model parameters of the neural network in the step (5) by using a back propagation algorithm according to the loss value generated by the loss function in the step (6) until the whole network model converges;
step (8), calculating network predicted value
sorting the candidate boxes according to the score vector s output in step (5), selecting the candidate box with the highest score as the prediction box, performing fine regression on the prediction box according to the regression vector b output in step (5), and finally generating the prediction box b_p of the network;
The target vector and the target value of the regression frame are constructed in the step (4), and the method specifically comprises the following steps:
4-1. calculating the overlap between each candidate box and the ground-truth box, setting the elements in the l vector corresponding to candidate boxes whose overlap is greater than the set threshold h to the corresponding IoU values, and finally normalizing the l vector so that its elements sum to 1; the overlap between two boxes is calculated as follows:
IoU(A, B) = (A ∩ B) / (A ∪ B)    (formula 1)
wherein A ∩ B is the area of the intersection of candidate box A and candidate box B, and A ∪ B is the area of the union of candidate box A and candidate box B;
the formula for vector normalization is as follows:
l = l / sum(l)    (formula 2)
where sum(l) is the sum over the elements of l, whose output is a scalar;
4-3. the regression target values of the candidate box are calculated as follows:
w_gt = x2_gt - x1_gt    (formula 3)
h_gt = y2_gt - y1_gt    (formula 4)
wherein x1_gt, y1_gt, x2_gt, y2_gt are the coordinate values of the lower-left and upper-right corners of the ground-truth box respectively; w_gt represents the width and h_gt the height of the ground-truth box;
w = x2 - x1    (formula 5)
h = y2 - y1    (formula 6)
wherein x1, y1, x2, y2 are the coordinate values of the lower-left and upper-right corners of the candidate box respectively; w represents the width of the candidate box and h represents the height of the candidate box;
x_ctr_gt = x1_gt + 0.5 × w_gt    (formula 7)
y_ctr_gt = y1_gt + 0.5 × h_gt    (formula 8)
x_ctr = x1 + 0.5 × w    (formula 9)
y_ctr = y1 + 0.5 × h    (formula 10)
wherein x_ctr, y_ctr and x_ctr_gt, y_ctr_gt are the central coordinate values of the candidate box and the ground-truth box respectively;
dx = (x_ctr_gt - x_ctr) / w    (formula 11)
dy = (y_ctr_gt - y_ctr) / h    (formula 12)
dw = log(w_gt / w)    (formula 13)
dh = log(h_gt / h)    (formula 14)
b* = (dx, dy, dw, dh)    (formula 15)
wherein b* is the final regression target vector of the candidate box;
constructing a deep neural network in the step (5), which comprises the following specific steps:
5-1. the word vector operation: converting the text index vector q_i obtained in step (3) into a one-hot matrix q_o; a one-hot vector means that, for each vector q_o^k in q_o, the i_{w_k}-th element is 1 and all the remaining elements are 0; the obtained q_o is then input into a fully connected function whose output is a v-dimensional vector, and the specific formula is as follows:
q_e = q_o · W_e    (formula 16)
wherein W_e ∈ R^(d×v) is a parameter to be learned, and the output is the word vector matrix q_e ∈ R^(t×v);
inputting the obtained word vector matrix q_e into the LSTM to form a t × n dimensional output feature matrix, the specific formula being as follows:
q_lstm = LSTM(q_e)    (formula 17)
wherein q_lstm ∈ R^(t×n); taking the output of the last unit of the LSTM as the text feature q', copying q' k times and concatenating the copies to obtain the text vector q, the specific formula being as follows:
q = (q', q', ..., q')^T    (formula 18)
5-2. calculating the position feature f_sp of each candidate box, the specific formula being as follows:
f_sp = ( x1/w_img, y1/h_img, x2/w_img, y2/h_img, ((x2 - x1)·(y2 - y1))/(w_img·h_img) )    (formula 19)
wherein x1, y1, x2, y2 are the coordinate values of the lower-left and upper-right corners of the candidate box, and w_img, h_img are the width and height of the input image respectively;
5-3. concatenating the image region features i'_f of the candidate boxes and the candidate-box position features f_sp to generate the final image feature i_f, the formula being as follows:
i_f = (i'_f, f_sp)    (formula 20)
5-4. concatenating the text vector q and the image feature i_f to generate the joint representation z, and mapping it into the hidden feature space with the fully connected and activation functions to generate the feature z', the formulas being as follows:
z = (q, i_f)    (formula 21)
z' = ReLU(FC(z))    (formula 22)
wherein FC is the fully connected function and ReLU is the activation function;
5-5. inputting z' into two fully connected layers respectively and outputting two prediction vectors s and b respectively, which represent the matching degree of each candidate box and the regression values of each candidate box with respect to the ground-truth box; the specific formulas are as follows:
s = FC(z')    (formula 23)
b = FC1(z')    (formula 24)
wherein FC and FC1 represent two distinct fully connected layers;
calculating the network predicted value in the step (8), specifically as follows:
8-1, sorting the candidate frames according to the s vectors output in the step (5), and selecting the candidate frame with the highest score as a prediction frame;
8-2. setting the coordinate values of the prediction box and the corresponding regression values as (x1', y1', x2', y2') and (dx', dy', dw', dh') respectively, the prediction box b_p of the final network is calculated as follows:
x_ctr_p = dx' × w + x_ctr    (formula 25)
y_ctr_p = dy' × h + y_ctr    (formula 26)
w_p = e^dw' × w    (formula 27)
h_p = e^dh' × h    (formula 28)
x1_p = x_ctr_p - 0.5 × w_p    (formula 29)
y1_p = y_ctr_p - 0.5 × h_p    (formula 30)
x2_p = x_ctr_p + 0.5 × w_p    (formula 31)
y2_p = y_ctr_p + 0.5 × h_p    (formula 32)
b_p = (x1_p, y1_p, x2_p, y2_p)    (formula 33)
wherein w and h are calculated by formulas (5) and (6), and x_ctr, y_ctr by formulas (9) and (10); e is the natural constant.
2. The method for visual localization according to claim 1, wherein the step (2) of extracting features from the image using the trained DDPN network comprises the following steps:
each candidate box corresponds to a feature p_f of its image region; the features of all candidate boxes in one picture are concatenated to generate the overall feature i'_f; the specific formula is as follows:
i'_f = (p_f^1, p_f^2, ..., p_f^k)    (formula 34)
3. the method for visual localization based on the diversity-discrimination candidate box generation network according to claim 2, wherein the text data feature extraction in step (3) is as follows:
3-1. for the question text, firstly splitting it into a word list q_w of fixed length, the fixed length being set to t, the specific formula being as follows:
q_w = (w_1, w_2, ..., w_i, ..., w_t)    (formula 35)
wherein w_i is a word string;
3-2. converting the words in the word list q_w into index values according to the word dictionary, thereby converting the text into a fixed-length index vector q_i, the specific formula being as follows:
q_i = (i_{w_1}, i_{w_2}, ..., i_{w_k}, ..., i_{w_t})    (formula 36)
wherein i_{w_k} is the index value of w_k in the word dictionary.
4. The method for visual localization according to claim 3, wherein the loss function in step (6) is as follows:
6-1. calculating the difference between the candidate-box matching scores s and the target values by using the relative entropy, i.e. the KL divergence, with the following specific formula:
L_s = Σ_{i=1..k} l_i · log(l_i / s_i)    (formula 37)
wherein l_i, s_i are the i-th elements of l and s respectively;
6-2. calculating the difference between the regression values of the candidate boxes and the target values by using the smooth L1 loss function, with the following specific formulas:
L_b = Σ_i smooth_L1(b_i - b*_i)    (formula 38)
smooth_L1(x) = 0.5·x², if |x| < 1;  |x| - 0.5, otherwise    (formula 39)
wherein b_i, b*_i are the i-th elements of b and b* respectively; L_b is the loss value measuring the difference between the final regression box and the ground-truth box.
CN201811305577.8A 2018-11-05 2018-11-05 Visual positioning method for generating network based on diversity discrimination candidate frame Active CN109712108B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811305577.8A CN109712108B (en) 2018-11-05 2018-11-05 Visual positioning method for generating network based on diversity discrimination candidate frame

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811305577.8A CN109712108B (en) 2018-11-05 2018-11-05 Visual positioning method for generating network based on diversity discrimination candidate frame

Publications (2)

Publication Number Publication Date
CN109712108A CN109712108A (en) 2019-05-03
CN109712108B true CN109712108B (en) 2021-02-02

Family

ID=66254676

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811305577.8A Active CN109712108B (en) 2018-11-05 2018-11-05 Visual positioning method for generating network based on diversity discrimination candidate frame

Country Status (1)

Country Link
CN (1) CN109712108B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110263912B (en) * 2019-05-14 2021-02-26 杭州电子科技大学 Image question-answering method based on multi-target association depth reasoning
CN110287814A (en) * 2019-06-04 2019-09-27 北方工业大学 Visual question-answering method based on image target characteristics and multilayer attention mechanism
CN110234018B (en) * 2019-07-09 2022-05-31 腾讯科技(深圳)有限公司 Multimedia content description generation method, training method, device, equipment and medium
CN112581723A (en) * 2020-11-17 2021-03-30 芜湖美的厨卫电器制造有限公司 Method and device for recognizing user gesture, processor and water heater
CN112464016B (en) * 2020-12-17 2022-04-01 杭州电子科技大学 Scene graph generation method based on depth relation self-attention network
CN113204666B (en) * 2021-05-26 2022-04-05 杭州联汇科技股份有限公司 Method for searching matched pictures based on characters
CN113887585A (en) * 2021-09-16 2022-01-04 南京信息工程大学 Image-text multi-mode fusion method based on coding and decoding network

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9965705B2 (en) * 2015-11-03 2018-05-08 Baidu Usa Llc Systems and methods for attention-based configurable convolutional neural networks (ABC-CNN) for visual question answering
CN107239801B (en) * 2017-06-28 2020-07-28 安徽大学 Video attribute representation learning method and video character description automatic generation method
CN107391609B (en) * 2017-07-01 2020-07-31 南京理工大学 Image description method of bidirectional multi-mode recursive network
CN107480206B (en) * 2017-07-25 2020-06-12 杭州电子科技大学 Multi-mode low-rank bilinear pooling-based image content question-answering method
CN107832765A (en) * 2017-09-13 2018-03-23 百度在线网络技术(北京)有限公司 Picture recognition to including word content and picture material

Also Published As

Publication number Publication date
CN109712108A (en) 2019-05-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant