CN112419310A

CN112419310A - Target detection method based on intersection and fusion frame optimization

Info

Publication number: CN112419310A
Application number: CN202011447204.1A
Authority: CN
Inventors: 惠国保; 田万勇; 王瑜; 郭褚冰
Original assignee: CETC 20 Research Institute
Current assignee: CETC 20 Research Institute
Priority date: 2020-12-08
Filing date: 2020-12-08
Publication date: 2021-02-26
Anticipated expiration: 2040-12-08
Also published as: CN112419310B

Abstract

The invention provides a target detection method based on intersection and fusion frame optimization, which is characterized in that a correct mask is selected based on an intersection and fusion frame coincidence rate calculation method, the offset vector position of a prediction frame closest to a calibration frame on a characteristic diagram is determined, the error value of the offset of the prediction frame and the reference offset is calculated in a targeted manner, and the target prediction precision and the positioning precision of a network are improved. The method can screen a more reliable prediction frame, calculate the back propagation error more accurately, and store the gradient value of the back propagation, so that the accuracy of the finally trained network model is higher, the intersection and fusion coincidence rate has invariance to the scale change, and the coincidence rate value can be mapped into a certain interval, which is only a subset of the traditional IOU method, so that the error regression can be truly optimized, and the space for improving the accuracy of the prediction frame is expanded.

Description

Target detection method based on intersection and fusion frame optimization

Technical Field

The invention relates to an image target detection technology, in particular to a target detection method for optimizing a training network by calculating a back propagation error according to a frame coincidence rate.

Background

An important aspect of object detection and identification on an image is accurate frame prediction. An input image is firstly used for framing an interested target from the image and carrying out the identification of the attribute of the target in the image. The predicted bounding box is required to frame the target as completely as possible. The important information source of the predicted frame is the feature vector of the region, the image feature region is divided according to a set mode and is determined by the pixel position of the feature image and the width and the height of the anchor frame.

How many feature boxes correspond to how many prediction boxes. The feature areas divided on the feature map are many, not every feature area covers the target, even if the target is not covered, the target is not covered exactly, and a prediction box deduced by the feature information has errors.

Therefore, a machine learning method is needed to improve the accuracy of the prediction frame in the learning process. In the learning process, the calibration frame is used as a target, a prediction frame or an anchor frame with high coincidence rate with the calibration frame is screened out, and the corresponding prediction frame and an offset error value thereof are selected as errors of back propagation. And performing iterative optimization learning for multiple times to enable the prediction frame to approach the calibration frame.

On the characteristic diagram, the number of prediction frames is large, each prediction frame corresponds to a plurality of masks (masks), and the key for calculating the back propagation error is to select the mask closest to the calibration frame. In the process of selecting the mask, the method involves calculating the coincidence rate of a prediction frame or an anchor frame and a calibration frame in a large quantity, sequencing the coincidence rates, and screening the mask corresponding to the frame with the highest coincidence rate.

The frame coincidence rate calculation is an important ring in the network model training and learning. Conventional coincidence rates are generally expressed in phase-to-phase ratios (IOU), but this presents some problems. The IOU is used as a measurement value of coincidence of the two frames, can be directly applied to back propagation and is used for objective function optimization, so people preferably use the IOU as an objective function to complete a two-dimensional target detection task.

The IOU can be either a direct loss function or an indirect loss function, but either has an important issue: if the two frames do not coincide, the IOU value is 0, and how far the two frames are apart cannot be reflected. In such a non-coincidence situation, if the IOU is used as a loss function, the gradient value is 0, and no optimization is performed.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a preferred target detection method based on an intersection and fusion frame. In order to solve the problem of the traditional IOU weakness, the invention expands the IOU weakness to the non-coincidence condition and provides an intersection and fusion coincidence rate calculation method. The calculation method of the intersection and fusion coincidence rate provided by the invention is used as a target detection core module, and the positioning precision is further promoted, because the frame regression loss is not directly represented by the coincidence rate. The frame regression loss is measured by using the frame position, the width and height offset and the attribute probability change error value.

The technical scheme adopted by the invention for solving the technical problem comprises the following steps:

1) extracting characteristics;

because the structure of the neural network and the scale of each layer are preset, firstly, the width and the height of an input initial image are adjusted to be suitable for the width and the height of a network input port, and the width and the height of the image are scaled to be suitable for the width and the height of the network input port, and the method specifically comprises the following steps:

a. firstly, the ratio of the width and height (W, H) of the initial image to the diameter and height (W ', H') of the network input port is calculated

b. Get

And

the middle and smaller ratio is a reference ratio, the initial image is zoomed according to the reference ratio, one side of the reference ratio is just zoomed to the width and the height of the input aperture of the network, and the margin is left after the other side is zoomed;

c. filling the remaining white part with a fixed pixel value, wherein the fixed pixel value is the highest half of the gray value, namely 0.5 × 256 to 128;

after the width and height of an input initial image are adjusted, a final feature mapping chart is obtained through multi-layer feature extraction of a neural network; the final feature map reflects the feature quantity required for predicting the frame, and the structural diagram of the final feature map is shown in fig. 1.

2) Obtaining a prediction frame;

on the final feature mapping chart obtained in the step 1), the corresponding feature vector of each feature point is divided into three equal-length segments, each segment corresponds to a mask (mask), and the mask is a film with an anchor frame covered on the image; each feature point on the feature map is used as the center of an anchor frame, the same anchor frame mask of all the feature points is constructed on the feature map, and three anchor frames have three masks, namely correspond to three prediction frames;

each segment of feature vector comprises two segments, one is a prediction frame offset segment and a prediction frame attribute probability value segment, wherein the prediction frame offset comprises four components, namely 2 horizontal and vertical offsets and 2 width and height offsets of a position, and the prediction frame attribute comprises a target judgment probability value (one component) and a target type prediction probability value (a plurality of components of a target type); the predicted frame position, width and height and attributes are scaled by the respective component values:

respectively and correspondingly adding the horizontal and vertical coordinates of the center point of the anchor frame mask on the final feature map with the horizontal and vertical offsets of the position of the prediction frame to convert the position of the prediction frame; multiplying the width and height of the anchor frame by the offset of the width and height of the prediction frame to convert the width and height of the prediction frame; judging the attribute of the prediction box by the probability value of the existence of the target, wherein the probability value is larger than a threshold value, and considering that the prediction box contains the target, otherwise, judging that the target does not exist; if there is a target, the type corresponding to the maximum component in the target type prediction probability values is determined as the attribute of the prediction frame, and in this way, the prediction frame (including the position, width, height and attribute of the prediction frame) is converted for each feature point vector on the final feature map.

3) Calculating the coincidence rate of the prediction frame and the calibration frame;

obtaining a prediction frame of each feature point on the final feature map by 2), screening the prediction frames, wherein the screening refers to calculating the coincidence rate of each feature point prediction frame and a calibration frame on the final feature map, sequencing a plurality of prediction frames according to the coincidence rate, and selecting the prediction frame with the maximum coincidence rate, namely selecting the prediction frame closest to the calibration frame, and the coincidence schematic diagram of two frames is shown in FIG. 3;

4) calculating a back propagation error;

during model training, calculating errors of the offset and the attribute probability value of the prediction frame, wherein the errors take the calibration frame as a benchmark, calculating the offset error of the position of the prediction frame, the offset error of width and height, the target-free judgment error and the type prediction error, and the meaning of each error corresponds to the characteristic point vector in the step 2); the specific steps for calculating the error are as follows:

a) calculating the offset (t) of the calibration frame on the feature map corresponding to the position of the feature point_x,t_y) And corresponding anchor frame width and height (A)_w,A_h) Amount of change (t) of_w,t_h) (ii) a The corresponding feature point is the upper left point (G) of the calibration frame on the original image_x,G_y) Mapping to the feature point coordinate with the closest distance on the feature map, wherein the anchor frame is the anchor frame corresponding to the prediction frame with the maximum coincidence rate of the calibration frames;

b) calculating an error value between the predicted value and the reference value;

the feature vector at the feature point coordinate (i, j) on the feature map, and the four foremost variables are the predicted values (b) of the frame position and the width and height variation_x,b_y,b_w,b_h) Then the error value is:

Δi＝s(t_i-b_i)，i∈{x,y,w,h}

wherein s is a scale factor for equalizing the proportion of the small boxes,

then, the presence or absence of a target determination error Δ in the prediction frame is calculated_objAnd the class prediction error Δ_ct(ct(t∈{1,2,3})；

And finally, the back propagation error is the sum of the error values of the position offset, the width and height offset and the attribute judgment of the predicted frame and the calibrated frame.

The feature extraction of the neural network is a convolution and pooling alternating operation.

The specific steps of selecting the prediction frame closest to the calibration frame are as follows:

firstly, calculating a minimum bounding frame of a prediction frame and a calibration frame, which refers to a fusion region of two frames P and GT in FIG. 3, namely, taking a rectangular region bounded by the horizontal coordinate of the two frames P and GT with the minimum left side, the maximum right side and the minimum upper side and the maximum lower side of the vertical coordinate, wherein the bounding region is called as a fusion region and is marked as U;

secondly, calculating an intersection area of the prediction frame and the calibration frame, which is an intersection of the P area and the GT area and is marked as I;

then, the ratio I/U of the intersection area of the prediction frame and the calibration frame and the fusion area is the coincidence ratio, and the range of the coincidence ratio is [0,1 ];

and aiming at one calibration frame, sequencing all the prediction frames according to the coincidence rate, and taking the prediction frame with the maximum coincidence rate as the closest calibration frame.

The step of mapping the coordinate points of the calibration frame to the final feature map comprises the following steps:

the first step is as follows: the original width and height (G) of the calibration frame_w,G_h) And (3) normalizing relative to the width and height (W, H) of the original image, namely, the width and height of the calibration frame are respectively higher than those of the original image, so as to obtain the normalized width and height of the calibration frame:

the second step is that: multiplying position coordinate point after normalization of calibration frame by feature diagram width and height (F)_w,F_h) Obtaining the horizontal and vertical coordinates (g) of the calibration frame on the feature map_x,g_y) Is a floating point number:

the third step: find the distance between the horizontal and vertical positions on the feature mapCoordinate (g)_x,g_y) The characteristic point with the closest distance is taken as a reference point, the coordinates of the reference point are (i, j), the coordinate values on the characteristic map are integers, and i, j are integers; and calculating to obtain:

t_x＝g_x-i

t_y＝g_y-j

wherein, the difference value between the position coordinate mapped on the characteristic diagram by the calibration frame and the characteristic point coordinate with the nearest distance is used as the position reference offset (t)_x,t_y) Calibrating the width and height of the frame and the maximum width and height t of the anchor frame with the coincidence rate_w,t_hLogarithm of the ratio as the width-height reference offset (t)_w,t_h)。

The presence or absence of the target determination error Δ_objAnd the class prediction error Δ_ctThe specific calculation method of (ct (t epsilon {1,2,3}) is as follows:

presence or absence of target determination error Δ_objDetermining the corresponding difference value of probability values for the existence of targets in attribute components in a prediction frame and a calibration frame on the feature point vector, and predicting the error delta of the type_ct(ct (t epsilon {1,2,3}) is the corresponding difference value of the target type prediction probability values in the attribute components in the prediction frame and the calibration frame on the feature point vector.

The threshold value is 0.5.

The method has the advantages that the IOU method is expanded, the intersection and fusion coincidence rate calculation method is provided, the method is compatible with the situation that frames are not coincident, more reliable prediction frames can be screened, the back propagation error can be calculated more accurately, the gradient value of the back propagation is stored, and the accuracy of the finally trained network model is higher.

The method for calculating the intersection and fusion coincidence rate supports the translation and scaling attributes of frame distance measurement, the intersection and fusion coincidence rate has invariance to scale change, and the coincidence rate value can be mapped into a certain interval, which is only a subset of the conventional IOU method.

A margin or a blind area still exists between the traditional IOU value and regression error optimization, and the calculation method of the intersection and fusion coincidence rate obtained by expanding on the basis of the IOU method can cover the blind area, so that the error regression can be truly optimized, and the space for improving the accuracy of the predicted frame is expanded.

Drawings

Fig. 1 is a schematic diagram of a final layer feature diagram.

FIG. 2 is a diagram of a feature map storage format in system memory.

FIG. 3 is a schematic diagram of the intersection and fusion of a prediction box and a calibration box.

Detailed Description

The invention is further illustrated with reference to the following figures and examples.

Specifically, the quantities to be adjusted are mapped onto feature vectors of a feature map, each segment of the vector corresponds to several masks (masks), each mask is a frame position, a width and height offset and an attribute probability value, and errors of the quantities are propagated reversely as shown in fig. 1. The intersection and fusion coincidence rate is used for selecting a mask to be adjusted from a plurality of masks. Because the back propagation error function is a linear function, the IOU or the intersection and fusion coincidence rate can be indirectly used as an error, and the optimization of the deep neural network model is realized.

However, in all non-coincident cases, the IOU has a gradient value of 0, which affects the training quality and convergence speed. In contrast, the cross-merge coincidence ratio has a gradient value in both coincident and non-coincident situations, which is characteristic of the IOU. Even when the IOU value is high, the intersection and fusion coincidence rate can have the IOU characteristic.

The intersection and fusion coincidence rate provided by the invention is defined as: for two rectangles with any shapes, finding out a smallest enclosure frame of two frames, wherein an area in the enclosure frame is called a fusion area C; then determining the intersection overlapping part of the two frames, wherein the area is called an intersection area I; and calculating the proportion I/C of the area of the intersection of the two frames in the area of the fusion area, namely the intersection and fusion coincidence rate.

The error value is 1-I/C obtained based on the intersection and fusion coincidence rate, the error value has close relation with the traditional IOU, and the relation between the error value and the traditional IOU can be deduced according to the following formula:

where U is the merged region of the two borders, U ═ P + GT-I, it can be seen that U is an amount that cannot be directly calculated.

As can be seen from the above equation, the non-union region proportion in the union region minus the union-union ratio multiplied by a coefficient, which is the IOU, is the error value of union-union.

The axial direction of the rectangular frame on the image is consistent with the axial direction of the image, and the rectangular frame is changed only in the aspects of position, width and height and not in the aspect of rotation. Both arbitrary rectangular boxes can find the bounding box that minimally encloses them. The smallest enclosing frame is the frame limited according to the smallest and largest boundaries of the horizontal and vertical coordinates in the two frames.

The intersection and fusion coincidence rate provided by the invention is calculated and the following generality is ensured: 1) keeping consistent with the definition related to the IOU, including shape description and area description; 2) the scale change is kept unchanged as the IOU; 3) the two frames overlapping condition that the IOU has is compatible.

The frame coincidence rate calculation method based on intersection and fusion is designed and realized, and guidance is provided for selecting a proper mask. And selecting a correct mask, determining the offset vector position of the prediction frame closest to the calibration frame on the characteristic diagram, and calculating the error value of the offset of the prediction frame and the reference offset in a targeted manner to improve the target prediction precision and positioning precision of the network. The embodiment provided by the invention comprises the following steps:

step 1, taking a feature vector at a feature point (i, j) on a feature map (shown in figure 1), wherein the feature vector is composed of 3 segments of masks (shown in figure 2), and each segment of mask corresponds to a prediction frame (p)_x,p_y,p_w,p_h) Position, width and height offset value (b)_x,b_y,b_w,b_h) And attribute prediction probability (b)_obj,b_c1,b_c2,b_c3) Where x, y, w, h correspond to the center position coordinates and width and height of the frame, and obj, c1, c2, and c3 correspond to probability values (in the present embodiment, the object types are set to three types) that the frame contains the object probability and the object belongs to (c1, c2, and c3), respectively.

Step 2, counting all calibration frames on the input image, wherein each calibration frame and the prediction frame P obtained in the step 1 calculate the coincidence rate (IOU) in the following way:

1) determining a fusion area of the prediction frame and the calibration frame, and calculating the area U of the fusion area;

2) then calculating the intersection area of the prediction frame and the calibration frame, and calculating the area of the intersection area as I;

3) and calculating the intersection and fusion coincidence rate of the prediction frame and the calibration frame as I/U, sequencing according to the coincidence rate, and recording the number of the calibration frame with the maximum coincidence rate.

Step 3, setting a prediction frame error value: if the maximum merging and merging coincidence rate obtained in the step 2 is greater than the neglect threshold (0.5), the target probability error value delta of the predicted frame obtained in the step 1 is used_objThe value is set to 0, meaning disregarded, otherwise Δ_obj＝0-b_obj. If the maximum coincidence rate obtained in the step 2 is greater than the real threshold value (0.9), setting as follows:

1) predicting the target probability of the frame by the error value delta_obj＝1-b_obj；

2) According to the number of the calibration frame obtained in the step 3) of the step 2), finding out a target type value ct (t is belonged to {1,2,3}) corresponding to the calibration frame from all calibration data, and a prediction frame target type prediction error value delta consistent with ct_ct＝1-b_ctThe prediction error value of other target types in the prediction box is set as delta_ct＝0-b_ct(ii) a The position and width and height of the frame are calibrated and converted into a reference position offset.

3) Finding the offset (t) relative to the reference position from all calibration data according to the calibration frame number obtained in the step 2_x,t_y) And an offset amount (t) from the reference width and height_w,t_h)：

The reference position offset is the corresponding difference between the position of the calibration frame mapped on the feature map and the feature point (i, j).

The reference width-height offset is a logarithmic value of the ratio of the width to the height of the calibration frame to the anchor frame.

The predicted frame position and width to height offset error value is (t)_x,t_y,t_w,t_h) And (b)_x,b_y,b_w,b_h) The difference of the corresponding term.

And 4, circularly executing the steps 1 to 3 on all the feature points on the feature map, and finally obtaining the prediction error value of each mask section of the feature vector on all the feature points.

Step 5, for each calibration frame, determining the position (k, l) of the frame mapped on the feature map, and taking an integer; and simultaneously calculating the coincidence rate of the frame and each anchor frame at the same central point, wherein the calculation mode of the coincidence rate is the same as 2 steps. And recording the anchor frame number corresponding to the maximum coincidence rate.

And 6, according to the anchor frame number obtained in the step 5, finding a number corresponding to the anchor frame number in the mask of the feature vector 3 section corresponding to the feature point (k, l), if the number exists, recording the mask number, and if the number does not exist, returning to exit.

Step 7, extracting the predicted frame offset amount (b ') corresponding to the mask number obtained in step 6'_x,b'_y,b'_w,b'_h) And attribute probability value (b'_obj,b′_c1,b′_c2,b′_c3) The error amounts of these values are calculated as follows:

1) the mask prediction box contains a target probability error value delta'_obj＝1-b'_obj；

2) The method for calculating the classification error of the mask prediction box is the same as the method of the step 2) in the step 3: target type values c't (t epsilon {1,2,3}) corresponding to the calibration frames and prediction frame target type prediction error values delta' consistent with c't are found from all calibration data'_ct＝1-b′_ctPredicted box target type prediction error value Δ ' inconsistent with c't '_ct＝0-b′_ct；

3) The method for calculating the error value of the offset of the prediction frame is the same as that in the step 3) of the step 3, except that the position of the feature point is (k, l) when the offset of the reference position is calculated, and the anchor frame is the anchor frame marked in the step 5 when the offset of the reference width and the height is calculated.

And 8, circularly executing the steps 5 to 7 until all the calibration frames are executed. And finally, obtaining vector error values of masks corresponding to all the calibration frames on the feature map.

And 9, according to the eigenvector error obtained in the step, performing back propagation by adopting a gradient descent method, and adjusting a network weight value.

Claims

1. A target detection method based on intersection and fusion frame optimization is characterized by comprising the following steps:

1) extracting characteristics;

b. Get

And

c. filling the remaining residual white part with a fixed pixel value, wherein the fixed pixel value is half of the highest gray value;

after the width and height of an input initial image are adjusted, a final feature mapping chart is obtained through multi-layer feature extraction of a neural network;

2) obtaining a prediction frame;

on the final feature mapping chart obtained in the step 1), the corresponding feature vector of each feature point is divided into three equal-length sections, each section corresponds to a mask, and the mask is a film with an anchor frame on the image; each feature point on the feature map is used as the center of an anchor frame, the same anchor frame mask of all the feature points is constructed on the feature map, and three anchor frames have three masks, namely correspond to three prediction frames;

each segment of feature vector comprises two segments, one segment is a prediction frame offset segment and a prediction frame attribute probability value segment, wherein the prediction frame offset comprises four components, namely 2 horizontal and vertical offsets and 2 width and height offsets of a position, and the prediction frame attribute comprises a target judgment probability value and a target type prediction probability value; the predicted frame position, width and height and attributes are scaled by the respective component values:

respectively and correspondingly adding the horizontal and vertical coordinates of the center point of the anchor frame mask on the final feature map with the horizontal and vertical offsets of the position of the prediction frame to convert the position of the prediction frame; multiplying the width and height of the anchor frame by the offset of the width and height of the prediction frame to convert the width and height of the prediction frame; judging the attribute of the prediction box by the probability value of the existence of the target, wherein the probability value is larger than a threshold value, and considering that the prediction box contains the target, otherwise, judging that the target does not exist; if the target exists, judging the type corresponding to the maximum component in the target type prediction probability value as the attribute of the prediction frame, and converting each characteristic point vector on the final characteristic graph into the prediction frame by the operation;

obtaining a prediction frame of each feature point on the final feature map by 2), and screening the prediction frames, wherein the screening refers to calculating the coincidence rate of each feature point prediction frame and a calibration frame on the final feature map, sequencing a plurality of prediction frames according to the coincidence rate, and selecting the prediction frame with the maximum coincidence rate, namely selecting the prediction frame closest to the calibration frame;

4) calculating a back propagation error;

Δi＝s(t_i-b_i)，i∈{x,y,w,h}

wherein s is a scale factor for equalizing the proportion of the small boxes,

2. The optimal target detection method based on the intersection and fusion frame as claimed in claim 1, wherein:

3. The optimal target detection method based on the intersection and fusion frame as claimed in claim 1, wherein:

firstly, calculating a minimum surrounding frame of a prediction frame and a calibration frame, wherein the minimum surrounding frame refers to a fusion region of the prediction frame and the calibration frame, namely a rectangular region surrounded by the prediction frame and the calibration frame is taken, and the surrounding region is called as a fusion region and is marked as U;

4. The optimal target detection method based on the intersection and fusion frame as claimed in claim 1, wherein:

the third step: finding the distance on the feature map from the horizontal and vertical coordinates (g)_x,g_y) The characteristic point with the closest distance is taken as a reference point, the coordinates of the reference point are (i, j), the coordinate values on the characteristic map are integers, and i, j are integers; and calculating to obtain:

t_x＝g_x-i

t_y＝g_y-j

5. The optimal target detection method based on the intersection and fusion frame as claimed in claim 1, wherein:

6. The optimal target detection method based on the intersection and fusion frame as claimed in claim 1, wherein:

the threshold value is 0.5.