CN109712108B - Visual positioning method for generating network based on diversity discrimination candidate frame - Google Patents

Visual positioning method for generating network based on diversity discrimination candidate frame

Info

Publication number
CN109712108B
CN109712108B (application CN201811305577.8A)
Authority
CN
China
Prior art keywords
vector
candidate
network
frame
formula
Prior art date
Legal status
Active
Application number
CN201811305577.8A
Other languages
Chinese (zh)
Other versions
CN109712108A (en)
Inventor
Jun Yu (俞俊)
Zhou Yu (余宙)
Chenchao Xiang (项晨钞)
Current Assignee
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN201811305577.8A priority Critical patent/CN109712108B/en
Publication of CN109712108A publication Critical patent/CN109712108A/en
Application granted granted Critical
Publication of CN109712108B publication Critical patent/CN109712108B/en

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a visual positioning method based on a diversity-discriminative candidate box generation network. The invention comprises the following steps: 1. training the diversity-discriminative candidate box generation network (DDPN); 2. extracting features of the image with the trained DDPN network; 3. extracting text data features; 4. constructing the target vector and the target values of the regression boxes; 5. constructing a deep neural network; 6. setting the loss functions; 7. training the model; 8. computing the network prediction. The proposed algorithm, in particular the DDPN-based image feature extraction, achieves a significant improvement on the image visual positioning task and greatly exceeds the current mainstream methods on this task. The feature extraction algorithm of the invention also has very important application value and great potential in other cross-modal fields such as image question answering and image captioning.

Description

Visual positioning method for generating network based on diversity discrimination candidate frame
Technical Field
The present invention relates to a deep-neural-network-based algorithm for the image visual localization (Visual Grounding) problem, and more particularly to an image feature extraction method based on a Diversified and Discriminative candidate box (Proposal) generation Network (DDPN) together with a deep neural network structure for the image visual localization problem.
Background
Visual grounding is a subtask in the field of "cross-media". "Cross-media" is a research direction at the intersection of computer vision and natural language processing that aims at bridging the "semantic gap" between different media (such as images and texts) and establishing a unified semantic representation. Based on theoretical methods for unified cross-media representation, several popular research directions have been derived, such as image caption generation (Image Captioning), image-text cross-media retrieval (Image-Text Cross-media Retrieval), question answering on image content (Image Question Answering), and image visual positioning (Visual Grounding). Image caption generation aims to summarize the content of an image in one or more sentences of natural language; image-text cross-media retrieval aims to find the best matching text description for an image in a database, or the best matching image for a text description; automatic question answering on image content takes a picture and a question described in natural language as input and outputs an answer in natural language; visual positioning of an image takes a picture and a natural language description as input and selects the relevant region of the picture according to the description.
With the rapid development of deep learning in recent years, deep neural networks such as deep convolutional neural networks (CNN) and deep recurrent neural networks (RNN) have achieved quite good results on image caption generation and automatic question answering on image content. Progress on the visual positioning problem, however, has been slow and the gains very limited. Solving the visual positioning problem with neural networks is therefore a problem worthy of intensive research.
In terms of practical applications, image visual positioning algorithms have very broad application scenarios. Text-based question-answering systems have already been widely deployed in the operating systems of smartphones and PCs as an important mode of human-computer interaction, such as Apple's Siri, Microsoft's Cortana and Amazon's Alexa. With the rapid development of wearable smart hardware (such as Google Glass and Microsoft HoloLens) and augmented reality technology, an image content positioning system based on visual perception may become an important mode of human-computer interaction in the near future.
In conclusion, image visual positioning is a direction worthy of intensive research, and this patent starts from several key difficulties in the task in order to solve the problems of current methods.
Image content in natural scenes is complex and contains diverse subjects, and natural language descriptions have a high degree of freedom, which makes image visual positioning very challenging. Specifically, there are two main difficulties:
(1) Extracting appropriate features for the image: extracting suitable features from images is a basic task when neural networks are used to solve cross-modal problems. Current mainstream algorithms for cross-modal problems such as image captioning, image question answering and visual positioning all preprocess the image in advance to extract features, and much related work shows that the image feature extraction algorithm has a great influence on the performance of the neural network.
(2) Uniformly modeling the cross-media data of image and text and performing effective feature fusion: multi-modal feature fusion is a classic and fundamental problem in cross-media representation; commonly used methods are feature concatenation, feature summation, or feature fusion with a multi-layer neural network. In addition, feature fusion models based on bilinear models work well in many fields such as fine-grained image classification, natural language processing and recommendation systems, but their high computational complexity makes model training challenging. Therefore, selecting a suitable fusion strategy for cross-media data features that guarantees computational efficiency while improving the expressive power of the fused features is a direction worthy of intensive research.
Disclosure of Invention
The invention provides, for the Visual Grounding task, an image feature extraction algorithm based on a Diversified and Discriminative candidate box (Proposal) generation Network (DDPN) together with a deep neural network algorithm for visual positioning, and achieves a great breakthrough on the visual positioning problem.
The technical scheme adopted by the invention for solving the technical problem comprises the following steps:
step (1), training diversity and Discriminative candidate box generation network (DDPN)
Using the fast-RCNN (an image detection algorithm) and adding the prediction of the object property values on the basis thereof, as shown in FIG. 1, it is trained on the Visual Genome data set until the network converges, and the resulting converged network is called DDPN network.
Step (2) extracting features of the image by using the trained DDPN network
For an input image, k candidate boxes containing objects in the image are computed with the DDPN network trained in the previous step. For each candidate box, the corresponding region of the candidate box in the image is fed into the DDPN network and the output of a chosen layer of the network is extracted as the feature p_f of that candidate box. The features of all candidate boxes in a picture are concatenated to generate the overall feature i'_f.
Owing to the structure of the DDPN network, these steps can be completed in a single forward pass, which keeps the feature extraction algorithm practical.
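A minimal Python sketch of this extraction step is given below. It is an illustration only: the ddpn object and its propose_boxes and pooled_features methods are hypothetical names standing in for the trained DDPN detector described above; the patent does not define a programming interface, and in practice the proposals and their pooled features come out of a single forward pass.

import numpy as np

def extract_image_features(ddpn, image, k=100):
    # 1. predict k candidate boxes containing objects (hypothetical ddpn interface)
    boxes = ddpn.propose_boxes(image, top_k=k)            # list of k (x1, y1, x2, y2) tuples
    # 2. take the feature p_f of a chosen network layer for every candidate box
    feats = [ddpn.pooled_features(image, box) for box in boxes]
    # 3. concatenate all per-box features into the overall feature i'_f
    i_f = np.stack(feats, axis=0)                         # shape: (k, feature_dim)
    return boxes, i_f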
Step (3) extracting text data characteristics
All texts in the image data set are segmented, a dictionary is built, the dictionary contains d words in total, and each input description text is converted into a dictionary sequence number list according to the dictionary, so that the texts are converted into a vector form.
Step (4), constructing target vectors and target values of regression frames
For a picture and a given description text, a target vector l is constructed whose k elements correspond one-to-one to the candidate boxes of step (2). According to the k candidate boxes obtained in step (2), the overlap (IoU) between each candidate box and the ground-truth box is computed, and the target vector l is set according to this overlap. For the target values of the regression boxes, the regression target vector b* of each candidate box is computed from the difference between the coordinates of the candidate box and the coordinates of the ground-truth box.
Step (5) constructing a deep neural network
The structure of the network is shown in FIG. 2. For the description text, the text vector obtained in step (3) is first converted into a matrix q_e using word vectorization (word embedding). The matrix q_e is fed into a Long Short-Term Memory (LSTM) network and the vector q' output by the last unit is selected; q' is copied k times and the copies are concatenated into the text feature vector q. The coordinates of each candidate box in the image features are processed to generate the candidate-box position feature vector f_sp, and the generated position features are concatenated with the image features i'_f of the corresponding candidate boxes to generate the final image feature i_f. Concatenation is used as the cross-modal modeling to generate the joint representation z of the text vector q and the image feature i_f. z is mapped into a hidden feature space with a fully connected layer and an activation function to generate the feature z'; finally z' is fed into two fully connected layers that output two predictions, namely the matching scores s of the k candidate boxes and the regression values b of each candidate box.
Step (6), setting the loss functions (Loss Function)
The two prediction vectors output in step (5) and the corresponding target vectors are input into their respective loss functions, and two loss values (loss) are output.
Step (7), training the model
The model parameters of the neural network in step (5) are trained with the back-propagation algorithm according to the loss values generated by the loss functions in step (6), until the whole network model converges.
Step (8), calculating network predicted value
The candidate boxes are sorted according to the score vector s output in step (5), the candidate box with the highest score is selected as the prediction box, fine regression is applied to the prediction box according to the regression vector b output in step (5), and the final prediction box b_p of the network is generated.
For training the DDPN network in step (1), the Visual Genome data set is preprocessed so that the 1600 most frequent object classes and the 400 most frequent attribute values are retained.
The step (2) of extracting features of the image by using the trained DDPN network is as follows:
2-1. Each candidate box corresponds to a feature p_f of its image region. The features of all candidate boxes in one picture are concatenated to generate the overall feature i'_f. The specific formula is as follows:
i'_f = (p_f^1, p_f^2, ..., p_f^k)    (formula 1)
The text data features are extracted in step (3) as follows:
3-1. The question text is first split into a word list q_w of fixed length t. The specific formula is as follows:
q_w = (w_1, w_2, ..., w_i, ..., w_t)    (formula 2)
where w_i is a word string.
3-2. According to the word dictionary, the words in q_w are converted into index values, so that the text is converted into a fixed-length index vector q_i. The specific formula is as follows:
q_i = (i_{w_1}, i_{w_2}, ..., i_{w_k}, ..., i_{w_t})    (formula 3)
where i_{w_k} is the index value of w_k in the word dictionary.
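A minimal Python sketch of sub-steps 3-1 and 3-2, assuming the word dictionary has already been built from the data set; the padding and unknown-word indices are illustrative choices rather than values fixed by the patent (the detailed description below pads texts shorter than the fixed length with null characters).

def text_to_index_vector(text, word_to_index, t=15, pad_index=0, unk_index=1):
    # 3-1: split the description text into at most t words (simple whitespace tokenisation)
    words = text.lower().split()[:t]
    # 3-2: replace every word by its index value in the word dictionary
    indices = [word_to_index.get(w, unk_index) for w in words]
    # pad short texts with a reserved index so that q_i always has length t
    indices += [pad_index] * (t - len(indices))
    return indices

# example: q_i = text_to_index_vector("the man in the red shirt", word_to_index)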
The target vector and the target value of the regression frame are constructed in the step (4), and the method specifically comprises the following steps:
4-1. The overlap (IoU) between each candidate box and the ground-truth box is computed; the elements of the l vector corresponding to candidate boxes whose overlap is greater than h are set to the corresponding IoU values, and finally the l vector is normalized so that its elements sum to 1. The overlap between two boxes is computed as follows:
IoU(A, B) = (A ∩ B) / (A ∪ B)    (formula 4)
where A ∩ B is the area of the intersection of box A and box B, and A ∪ B is the area of the union of box A and box B.
The formula for vector normalization is as follows:
l = l / sum(l)    (formula 5)
where sum(l) is the sum over the elements of l, whose output is a scalar.
4-3. The regression target values of the candidate boxes are computed as follows:
w_gt = x2_gt - x1_gt    (formula 6)
h_gt = y2_gt - y1_gt    (formula 7)
where x1_gt, y1_gt, x2_gt, y2_gt are the coordinates of the lower-left and upper-right corners of the ground-truth box; w_gt is the width and h_gt the height of the ground-truth box;
w = x2 - x1    (formula 8)
h = y2 - y1    (formula 9)
where x1, y1, x2, y2 are the coordinates of the lower-left and upper-right corners of the candidate box; w is the width and h the height of the candidate box;
x_ctr_gt = x1_gt + 0.5 × w_gt    (formula 10)
y_ctr_gt = y1_gt + 0.5 × h_gt    (formula 11)
x_ctr = x1 + 0.5 × w    (formula 12)
y_ctr = y1 + 0.5 × h    (formula 13)
where x_ctr, y_ctr and x_ctr_gt, y_ctr_gt are the center coordinates of the candidate box and the ground-truth box respectively;
dx = (x_ctr_gt - x_ctr) / w    (formula 14)
dy = (y_ctr_gt - y_ctr) / h    (formula 15)
dw = log(w_gt / w)    (formula 16)
dh = log(h_gt / h)    (formula 17)
b* = (dx, dy, dw, dh)    (formula 18)
where b* is the final regression target vector of the candidate box.
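A NumPy sketch of sub-steps 4-1 and 4-3 under the formulas above. The (dx, dy, dw, dh) encoding written out here is the standard Faster R-CNN box parameterisation, which is consistent with the inverse transform of step (8); boxes are assumed to be given as (x1, y1, x2, y2) tuples.

import numpy as np

def iou(box_a, box_b):
    # overlap of formula 4: intersection area divided by union area
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def build_targets(boxes, gt_box, h_thresh=0.5):
    # target vector l (formulas 4-5) and regression targets b* (formulas 6-18)
    k = len(boxes)
    l = np.zeros(k)
    b_star = np.zeros((k, 4))
    w_gt = gt_box[2] - gt_box[0]
    h_gt = gt_box[3] - gt_box[1]
    x_ctr_gt = gt_box[0] + 0.5 * w_gt
    y_ctr_gt = gt_box[1] + 0.5 * h_gt
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        overlap = iou((x1, y1, x2, y2), gt_box)
        if overlap > h_thresh:
            l[i] = overlap                      # keep the IoU value as the matching target
        w, h = x2 - x1, y2 - y1
        x_ctr, y_ctr = x1 + 0.5 * w, y1 + 0.5 * h
        b_star[i] = [(x_ctr_gt - x_ctr) / w, (y_ctr_gt - y_ctr) / h,
                     np.log(w_gt / w), np.log(h_gt / h)]
    if l.sum() > 0:
        l = l / l.sum()                         # normalise l so that its elements sum to 1
    return l, b_star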
Constructing a neural network in the step (5), specifically as follows:
5-1. The word vector (word embedding) operation: the text index vector q_i obtained in step (3) is converted into a one-hot matrix q_o. A one-hot vector means that, for each vector q_o^k in q_o, the i_{w_k}-th element is 1 and all the remaining elements are 0. The obtained q_o is then fed into a fully connected layer that outputs a v-dimensional vector; the specific formula is as follows:
q_e = q_o · W_e    (formula 19)
where W_e ∈ R^(d×v) is a parameter to be learned and the output is the word vector matrix q_e ∈ R^(t×v).
The obtained word vector matrix q_e is fed into the LSTM to form a t × n dimensional output feature matrix; the specific formula is as follows:
q_lstm = LSTM(q_e)    (formula 20)
where q_lstm ∈ R^(t×n). The output of the last unit of the LSTM is taken as the text feature q'; q' is copied k times and the copies are concatenated to obtain the text vector q, the specific formula being as follows:
q = (q', q', ..., q')^T    (formula 21)
5-2. The position feature f_sp of each candidate box is computed; the specific formula is as follows:
f_sp = ( x1/w_img, y1/h_img, x2/w_img, y2/h_img, ((x2 - x1)·(y2 - y1))/(w_img·h_img) )    (formula 22)
where x1, y1, x2, y2 are the coordinates of the lower-left and upper-right corners of the candidate box, and w_img, h_img are the width and height of the input image respectively.
5-3. The image region features i'_f of the candidate boxes and the candidate-box position features f_sp are concatenated to generate the final image feature i_f; the formula is as follows:
i_f = (i'_f, f_sp)    (formula 23)
5-4. The text vector q and the image feature i_f are concatenated to generate the joint representation z, which is mapped into the hidden feature space by the fully connected and activation functions to generate the feature z'; the formulas are as follows:
z = (q, i_f)    (formula 24)
z' = ReLU(FC(z))    (formula 25)
where FC is the fully connected function and ReLU is the activation function.
5-5. z' is fed into two fully connected layers that output two prediction vectors s and b respectively, which represent the matching score of each candidate box and the regression values of each candidate box with respect to the ground-truth box. The specific formulas are as follows:
s = FC(z')    (formula 26)
b = FC1(z')    (formula 27)
where FC and FC1 represent two distinct fully connected layers.
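Sub-steps 5-1 to 5-5 can be sketched as a single PyTorch module. This is an illustrative reimplementation rather than the patented network itself: the embedding, LSTM and hidden sizes follow the detailed description further below (300 / 2048 / 512), while the 5-dimensional spatial feature and the use of nn.Embedding in place of the explicit one-hot multiplication of formula 19 are assumptions.

import torch
import torch.nn as nn

class GroundingNet(nn.Module):
    def __init__(self, dict_size, emb_dim=300, lstm_dim=2048,
                 img_dim=2048, sp_dim=5, hidden_dim=512, k=100):
        super().__init__()
        self.k = k
        self.embed = nn.Embedding(dict_size, emb_dim)                   # word vectorization (formula 19)
        self.lstm = nn.LSTM(emb_dim, lstm_dim, batch_first=True)        # formula 20
        self.fuse = nn.Linear(lstm_dim + img_dim + sp_dim, hidden_dim)  # FC of formula 25
        self.score = nn.Linear(hidden_dim, 1)                           # matching scores s (formula 26)
        self.regress = nn.Linear(hidden_dim, 4)                         # regression values b (formula 27)

    def forward(self, q_i, box_feats, f_sp):
        # q_i: (t,) word indices; box_feats: (k, img_dim) per-box features i'_f; f_sp: (k, sp_dim)
        q_e = self.embed(q_i.unsqueeze(0))                              # (1, t, emb_dim)
        q_lstm, _ = self.lstm(q_e)
        q_prime = q_lstm[:, -1, :]                                      # output of the last LSTM unit, q'
        q = q_prime.expand(self.k, -1)                                  # copy q' k times (formula 21)
        z = torch.cat([q, box_feats, f_sp], dim=1)                      # joint representation z (formula 24)
        z_prime = torch.relu(self.fuse(z))                              # formula 25
        s = self.score(z_prime).squeeze(1)                              # (k,) matching scores
        b = self.regress(z_prime)                                       # (k, 4) regression values
        return s, b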
The loss function in step (6) is specifically as follows:
6-1. The difference (loss) between the candidate-box matching scores s and the target values is computed with the relative entropy (also called KL divergence); the specific formula is as follows:
L_s = Σ_{i=1..k} l_i · log(l_i / s_i)    (formula 28)
where l_i, s_i are the i-th elements of l and s respectively.
6-2. The difference (loss) between the regression values of the candidate boxes and the target values is computed with the smooth L1 loss function (Smooth L1 Loss); the specific formulas are as follows:
L_b = Σ_i smooth_L1(b_i - b*_i)    (formula 29)
smooth_L1(x) = 0.5·x², if |x| < 1;  |x| - 0.5, otherwise    (formula 30)
where b_i, b*_i are the i-th elements of b and b* respectively. L_b is the loss value measuring the difference between the final regression box and the ground-truth box.
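A PyTorch sketch of the two losses. It assumes that the scores s are turned into a distribution with a softmax before the KL term and that the regression loss is summed over all candidate boxes; neither choice is stated explicitly in the text, so both are assumptions.

import torch
import torch.nn.functional as F

def matching_loss(s, l, eps=1e-8):
    # relative entropy (KL divergence) between the target distribution l and the scores s (formula 28)
    p = F.softmax(s, dim=0)                 # assumed normalisation of the raw scores
    return torch.sum(l * torch.log((l + eps) / (p + eps)))

def regression_loss(b, b_star):
    # smooth L1 loss between predicted regression values b and targets b* (formulas 29-30)
    return F.smooth_l1_loss(b, b_star, reduction="sum")

# total loss back-propagated in step (7):
# loss = matching_loss(s, l) + regression_loss(b, b_star)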
Calculating the network predicted value in the step (8), specifically as follows:
8-1, sorting the candidate boxes according to the s vectors output in the step (5), and selecting the candidate box with the highest score as a prediction box.
8-2. Let the coordinates of the selected prediction box and its corresponding regression values be (x1, y1, x2, y2) and (dx, dy, dw, dh) respectively; the final prediction box b_p of the network is computed as follows:
x_ctr_p = dx × w + x_ctr    (formula 31)
y_ctr_p = dy × h + y_ctr    (formula 32)
w_p = e^dw × w    (formula 33)
h_p = e^dh × h    (formula 34)
x1_p = x_ctr_p - 0.5 × w_p    (formula 35)
y1_p = y_ctr_p - 0.5 × h_p    (formula 36)
x2_p = x_ctr_p + 0.5 × w_p    (formula 37)
y2_p = y_ctr_p + 0.5 × h_p    (formula 38)
b_p = (x1_p, y1_p, x2_p, y2_p)    (formula 39)
where w, h, x_ctr, y_ctr are computed by formulas (8), (9), (12) and (13), and e is the natural constant.
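A NumPy sketch of the decoding in 8-1 and 8-2: the highest-scoring candidate box is selected and refined with its regression values, applying the inverse of the encoding of step (4) as written out in formulas 31-39.

import numpy as np

def decode_prediction(boxes, s, b):
    i = int(np.argmax(s))                       # 8-1: candidate box with the highest matching score
    x1, y1, x2, y2 = boxes[i]
    dx, dy, dw, dh = b[i]
    w, h = x2 - x1, y2 - y1                     # formulas 8-9
    x_ctr, y_ctr = x1 + 0.5 * w, y1 + 0.5 * h   # formulas 12-13
    x_ctr_p = dx * w + x_ctr                    # formula 31
    y_ctr_p = dy * h + y_ctr                    # formula 32
    w_p = np.exp(dw) * w                        # formula 33
    h_p = np.exp(dh) * h                        # formula 34
    return (x_ctr_p - 0.5 * w_p, y_ctr_p - 0.5 * h_p,   # formulas 35-36
            x_ctr_p + 0.5 * w_p, y_ctr_p + 0.5 * h_p)   # formulas 37-38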
The invention has the following beneficial effects:
the algorithm provided by the invention, especially the DDPN network-based algorithm for extracting the features of the image, achieves a significant improvement effect on an image visual positioning task, and greatly exceeds all mainstream methods on the task at present. In addition, the feature extraction algorithm of the invention also has very important application value and great potential in other cross-modal related fields such as Image Question Answering (IQA) and Image description (Image capture).
Drawings
FIG. 1 is a diagram of the Faster R-CNN (image detection algorithm) network framework with added attribute value prediction according to the present invention;
FIG. 2 is a schematic diagram of the visual positioning network based on the diversified and discriminative candidate box generation network (DDPN).
Detailed Description
The detailed parameters of the present invention are described in more detail below.
Step (1), training the diversified and discriminative candidate box generation network (DDPN)
The Faster R-CNN (an image detection algorithm) is used and prediction of object attribute values is added on top of it, as shown in FIG. 1. It is trained on the Visual Genome data set until the network converges; the resulting converged network is called the DDPN network.
The step (2) of extracting features of the image by using the DDPN network specifically comprises the following steps:
2-1. here, the DDPN network is used to predict 100 candidate boxes in the input image.
2-2. The image regions corresponding to the 100 candidate boxes are fed into the DDPN network, the output of the Pool5 layer is extracted as the feature p_f of each candidate box, and the features of all candidate boxes in one picture are concatenated into i'_f.
Extracting text data features in step (3)
3-1. For the description text data, we first segment the text into words and build a word dictionary over the description texts. Only the first 15 words are taken for each description text; if the text has fewer than 15 words, it is padded with null characters. Each word is then replaced with its index value in the word dictionary, so that each description is converted into a 15-dimensional word index vector.
Step (4) the target vector and the target value of the regression frame are constructed
4-1. The target vector l is constructed according to the method described above, with the threshold h set to 0.5, and the regression target values b* of the candidate boxes are computed.
constructing the deep neural network in the step (5), as shown in fig. 2, specifically as follows:
5-1. For the description text feature, the text input here is the 15-dimensional index vector generated in step (3). The word embedding technique is used to convert each word index into the corresponding word vector; the word vector size used here is 300. Each description text thus becomes a matrix of size 15 × 300, which is then taken as the input of the LSTM, a recurrent neural network structure, whose output is set to a 2048-dimensional vector. The output of the last unit of the LSTM is taken as the text feature q'; finally, q' is copied k = 100 times and the copies are concatenated to form the text feature q.
5-2. The position feature vector f_sp of each candidate box is computed according to the algorithm described above.
5-3. The image region features i'_f of the candidate boxes and the candidate-box position features f_sp are concatenated to obtain the final feature i_f of the input image.
5-4. The text vector q and the image feature i_f are concatenated to obtain the joint representation z, which is fed sequentially into a fully connected function and a ReLU function whose output is a 512-dimensional vector, thereby mapping z to z'.
5-5. z' is fed into a fully connected function with 1-dimensional output to generate the candidate-box matching score prediction vector s; at the same time, z' is fed into a fully connected function with 4-dimensional output to generate the candidate-box regression value vector b.
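For reference, the concrete settings quoted in this detailed description can be collected in one place (all values are taken from the text above; the dictionary size depends on the data set and is therefore left open):

# Hyper-parameters stated in the detailed description
DDPN_GROUNDING_CONFIG = {
    "visual_genome_classes": 1600,     # most frequent object classes kept for DDPN training
    "visual_genome_attributes": 400,   # most frequent attribute values kept for DDPN training
    "num_candidate_boxes": 100,        # k candidate boxes per image, Pool5 features
    "max_text_words": 15,              # texts truncated / padded to 15 words
    "word_embedding_dim": 300,
    "lstm_output_dim": 2048,
    "fused_hidden_dim": 512,           # output size of the FC + ReLU mapping to z'
    "iou_threshold_h": 0.5,            # threshold used when building the target vector l
}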
Data set    Flickr30k-Entities    Referit    Refcoco    Refcoco+
val         72.78%                63.77%     76.61%     64.34%
test        73.45%                63.27%     76.23%     64.01%
testA       -                     -          79.99%     71.24%
testB       -                     -          72.11%     55.55%
Table 1. Accuracy of the method described herein on the mainstream data sets of the visual positioning task, where val, test, testA and testB are the evaluation splits of each data set; a dash indicates that the split does not exist for that data set.

Claims (4)

1. A visual positioning method based on a diversity-discriminative candidate box generation network, characterized by comprising the following steps:
step (1), training the diversity discrimination candidate frame to generate network DDPN
Using the Faster-RCNN and adding prediction of the object attribute values on top of it, the Faster-RCNN with object attribute value prediction is trained on the Visual Genome data set until the network converges, and the resulting converged network is called the DDPN network;
step (2) extracting features of the image by using the trained DDPN network
calculating k candidate boxes containing objects in the input image I by using the DDPN network; for each candidate box, inputting the corresponding region of the candidate box in the input image I into the DDPN network and extracting the output of a chosen layer of the network as the feature p_f of the candidate box; concatenating the features of all candidate boxes in the input image I to generate the overall feature i'_f;
Step (3) extracting text data characteristics
Dividing words of all texts in the image data set and constructing a dictionary, setting the dictionary to contain d words in total, converting each input description text into a dictionary sequence number list according to the dictionary, and converting the texts into a text vector form;
step (4), constructing target vectors and target values of regression frames
for each image and a given description text, constructing a target vector l whose elements correspond one-to-one to the candidate boxes in step (2); calculating the overlap between each candidate box and the ground-truth box according to the k candidate boxes obtained in step (2), and setting the target vector l according to the overlap; for the target values of the regression boxes, respectively calculating the regression target vector b* of each candidate box from the difference between the coordinate values of the candidate box and the coordinate values of the ground-truth box;
Step (5) constructing a deep neural network
for the description text: firstly converting the text vector obtained in step (3) into a question matrix q_e by using the word vectorization technique; inputting the matrix q_e into a long short-term memory network and selecting the vector q' output by the last unit; copying q' k times and concatenating the copies into the text feature vector q; processing the coordinates of each candidate box in the image features to generate the candidate-box position feature vector f_sp; concatenating the generated position feature vector f_sp with the image features i'_f of the corresponding candidate boxes to generate the final image feature i_f; using concatenation as the cross-modal modeling to generate the joint representation z of the text feature vector q and the image feature i_f; mapping z into a hidden feature space with a fully connected function and an activation function to generate the feature z'; finally inputting z' into two fully connected functions respectively to output two prediction vectors, namely the matching scores s of the k candidate boxes and the regression values b of each candidate box;
step (6), loss function
Inputting the two prediction vectors output in the step (5) and the corresponding target vectors into corresponding loss functions respectively, and outputting two loss values respectively;
step (7), training the model
Training the model parameters of the neural network in the step (5) by using a back propagation algorithm according to the loss value generated by the loss function in the step (6) until the whole network model converges;
step (8), calculating network predicted value
sorting the candidate boxes according to the score vector s output in step (5), selecting the candidate box with the highest score as the prediction box, performing fine regression on the prediction box according to the regression vector b output in step (5), and finally generating the prediction box b_p of the network;
The target vector and the target value of the regression frame are constructed in the step (4), and the method specifically comprises the following steps:
4-1. calculating the overlap between each candidate box and the ground-truth box, setting the elements in the l vector corresponding to candidate boxes whose overlap is greater than the set threshold h to the corresponding IoU values, and finally normalizing the l vector so that its elements sum to 1; the overlap between two boxes is calculated as follows:
IoU(A, B) = (A ∩ B) / (A ∪ B)    (formula 1)
wherein A ∩ B is the area of the intersection of candidate box A and candidate box B, and A ∪ B is the area of the union of candidate box A and candidate box B;
the formula for vector normalization is as follows:
l = l / sum(l)    (formula 2)
where sum(l) is the sum over the elements of l, whose output is a scalar;
4-3. the regression target values of the candidate box are calculated as follows:
w_gt = x2_gt - x1_gt    (formula 3)
h_gt = y2_gt - y1_gt    (formula 4)
wherein x1_gt, y1_gt, x2_gt, y2_gt are the coordinate values of the lower-left and upper-right corners of the ground-truth box respectively; w_gt represents the width and h_gt the height of the ground-truth box;
w = x2 - x1    (formula 5)
h = y2 - y1    (formula 6)
wherein x1, y1, x2, y2 are the coordinate values of the lower-left and upper-right corners of the candidate box respectively; w represents the width of the candidate box and h represents the height of the candidate box;
x_ctr_gt = x1_gt + 0.5 × w_gt    (formula 7)
y_ctr_gt = y1_gt + 0.5 × h_gt    (formula 8)
x_ctr = x1 + 0.5 × w    (formula 9)
y_ctr = y1 + 0.5 × h    (formula 10)
wherein x_ctr, y_ctr and x_ctr_gt, y_ctr_gt are the central coordinate values of the candidate box and the ground-truth box respectively;
dx = (x_ctr_gt - x_ctr) / w    (formula 11)
dy = (y_ctr_gt - y_ctr) / h    (formula 12)
dw = log(w_gt / w)    (formula 13)
dh = log(h_gt / h)    (formula 14)
b* = (dx, dy, dw, dh)    (formula 15)
wherein b* is the final regression target vector of the candidate box;
constructing a deep neural network in the step (5), which comprises the following specific steps:
5-1. the word vector operation: converting the text index vector q_i obtained in step (3) into a one-hot matrix q_o; a one-hot vector means that, for each vector q_o^k in q_o, the i_{w_k}-th element is 1 and all the remaining elements are 0; the obtained q_o is then input into a fully connected function whose output is a v-dimensional vector, and the specific formula is as follows:
q_e = q_o · W_e    (formula 16)
wherein W_e ∈ R^(d×v) is a parameter to be learned, and the output is the word vector matrix q_e ∈ R^(t×v);
inputting the obtained word vector matrix q_e into the LSTM to form a t × n dimensional output feature matrix, the specific formula being as follows:
q_lstm = LSTM(q_e)    (formula 17)
wherein q_lstm ∈ R^(t×n); taking the output of the last unit of the LSTM as the text feature q', copying q' k times and concatenating the copies to obtain the text vector q, the specific formula being as follows:
q = (q', q', ..., q')^T    (formula 18)
5-2. calculating the position feature f_sp of each candidate box, the specific formula being as follows:
f_sp = ( x1/w_img, y1/h_img, x2/w_img, y2/h_img, ((x2 - x1)·(y2 - y1))/(w_img·h_img) )    (formula 19)
wherein x1, y1, x2, y2 are the coordinate values of the lower-left and upper-right corners of the candidate box, and w_img, h_img are the width and height of the input image respectively;
5-3. concatenating the image region features i'_f of the candidate boxes and the candidate-box position features f_sp to generate the final image feature i_f, the formula being as follows:
i_f = (i'_f, f_sp)    (formula 20)
5-4. concatenating the text vector q and the image feature i_f to generate the joint representation z, and mapping it into the hidden feature space with the fully connected and activation functions to generate the feature z', the formulas being as follows:
z = (q, i_f)    (formula 21)
z' = ReLU(FC(z))    (formula 22)
wherein FC is the fully connected function and ReLU is the activation function;
5-5. inputting z' into two fully connected layers respectively and outputting two prediction vectors s and b respectively, which represent the matching degree of each candidate box and the regression values of each candidate box with respect to the ground-truth box; the specific formulas are as follows:
s = FC(z')    (formula 23)
b = FC1(z')    (formula 24)
wherein FC and FC1 represent two distinct fully connected layers;
calculating the network predicted value in the step (8), specifically as follows:
8-1, sorting the candidate frames according to the s vectors output in the step (5), and selecting the candidate frame with the highest score as a prediction frame;
8-2. setting the coordinate values of the prediction box and the corresponding regression values as (x1', y1', x2', y2') and (dx', dy', dw', dh') respectively, the prediction box b_p of the final network is calculated as follows:
x_ctr_p = dx' × w + x_ctr    (formula 25)
y_ctr_p = dy' × h + y_ctr    (formula 26)
w_p = e^dw' × w    (formula 27)
h_p = e^dh' × h    (formula 28)
x1_p = x_ctr_p - 0.5 × w_p    (formula 29)
y1_p = y_ctr_p - 0.5 × h_p    (formula 30)
x2_p = x_ctr_p + 0.5 × w_p    (formula 31)
y2_p = y_ctr_p + 0.5 × h_p    (formula 32)
b_p = (x1_p, y1_p, x2_p, y2_p)    (formula 33)
wherein w and h are calculated by formulas (5) and (6), and x_ctr, y_ctr by formulas (9) and (10); e is the natural constant.
2. The method for visual localization according to claim 1, wherein the step (2) of extracting features from the image using the trained DDPN network comprises the following steps:
each candidate box corresponds to a feature p_f of its image region; the features of all candidate boxes in one picture are concatenated to generate the overall feature i'_f; the specific formula is as follows:
i'_f = (p_f^1, p_f^2, ..., p_f^k)    (formula 34)
3. the method for visual localization based on the diversity-discrimination candidate box generation network according to claim 2, wherein the text data feature extraction in step (3) is as follows:
3-1. for the question text, firstly splitting it into a word list q_w of fixed length, the fixed length being set to t, the specific formula being as follows:
q_w = (w_1, w_2, ..., w_i, ..., w_t)    (formula 35)
wherein w_i is a word string;
3-2. converting the words in the word list q_w into index values according to the word dictionary, thereby converting the text into a fixed-length index vector q_i, the specific formula being as follows:
q_i = (i_{w_1}, i_{w_2}, ..., i_{w_k}, ..., i_{w_t})    (formula 36)
wherein i_{w_k} is the index value of w_k in the word dictionary.
4. The method for visual localization according to claim 3, wherein the loss function in step (6) is as follows:
6-1. calculating the difference between the candidate-box matching scores s and the target values by using the relative entropy, i.e. the KL divergence, with the following specific formula:
L_s = Σ_{i=1..k} l_i · log(l_i / s_i)    (formula 37)
wherein l_i, s_i are the i-th elements of l and s respectively;
6-2. calculating the difference between the regression values of the candidate boxes and the target values by using the smooth L1 loss function, with the following specific formulas:
L_b = Σ_i smooth_L1(b_i - b*_i)    (formula 38)
smooth_L1(x) = 0.5·x², if |x| < 1;  |x| - 0.5, otherwise    (formula 39)
wherein b_i, b*_i are the i-th elements of b and b* respectively; L_b is the loss value measuring the difference between the final regression box and the ground-truth box.
CN201811305577.8A 2018-11-05 2018-11-05 Visual positioning method for generating network based on diversity discrimination candidate frame Active CN109712108B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811305577.8A CN109712108B (en) 2018-11-05 2018-11-05 Visual positioning method for generating network based on diversity discrimination candidate frame

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811305577.8A CN109712108B (en) 2018-11-05 2018-11-05 Visual positioning method for generating network based on diversity discrimination candidate frame

Publications (2)

Publication Number Publication Date
CN109712108A CN109712108A (en) 2019-05-03
CN109712108B true CN109712108B (en) 2021-02-02

Family

ID=66254676

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811305577.8A Active CN109712108B (en) 2018-11-05 2018-11-05 Visual positioning method for generating network based on diversity discrimination candidate frame

Country Status (1)

Country Link
CN (1) CN109712108B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110263912B (en) * 2019-05-14 2021-02-26 杭州电子科技大学 Image question-answering method based on multi-target association depth reasoning
CN110287814A (en) * 2019-06-04 2019-09-27 北方工业大学 Visual question-answering method based on image target characteristics and multilayer attention mechanism
CN110234018B (en) * 2019-07-09 2022-05-31 腾讯科技(深圳)有限公司 Multimedia content description generation method, training method, device, equipment and medium
CN112581723A (en) * 2020-11-17 2021-03-30 芜湖美的厨卫电器制造有限公司 Method and device for recognizing user gesture, processor and water heater
CN112464016B (en) * 2020-12-17 2022-04-01 杭州电子科技大学 Scene graph generation method based on depth relation self-attention network
CN113204666B (en) * 2021-05-26 2022-04-05 杭州联汇科技股份有限公司 Method for searching matched pictures based on characters
CN113887585A (en) * 2021-09-16 2022-01-04 南京信息工程大学 Image-text multi-mode fusion method based on coding and decoding network

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9965705B2 (en) * 2015-11-03 2018-05-08 Baidu Usa Llc Systems and methods for attention-based configurable convolutional neural networks (ABC-CNN) for visual question answering
CN107239801B (en) * 2017-06-28 2020-07-28 安徽大学 Video attribute representation learning method and video character description automatic generation method
CN107391609B (en) * 2017-07-01 2020-07-31 南京理工大学 Image description method of bidirectional multi-mode recursive network
CN107480206B (en) * 2017-07-25 2020-06-12 杭州电子科技大学 Multi-mode low-rank bilinear pooling-based image content question-answering method
CN107832765A (en) * 2017-09-13 2018-03-23 百度在线网络技术(北京)有限公司 Picture recognition to including word content and picture material

Also Published As

Publication number Publication date
CN109712108A (en) 2019-05-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant