CN113673540A - Target detection method based on positioning information guidance - Google Patents

Target detection method based on positioning information guidance

Info

Publication number
CN113673540A
Authority
CN
China
Prior art keywords: rectangular anchor frame, matrix, target detection, prior rectangular
Prior art date
Legal status: Pending (the status shown is an assumption and is not a legal conclusion)
Application number
CN202110960804.6A
Other languages
Chinese (zh)
Inventor
缪玲娟
明奇
周志强
Current Assignee
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date: 2021-08-20; Filing date: 2021-08-20; Publication date: 2021-11-19
Application filed by Beijing Institute of Technology BIT
Priority to CN202110960804.6A
Publication of CN113673540A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G06N3/08: Learning methods
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis
    • G06T7/10: Segmentation; Edge detection
    • G06T7/136: Segmentation; Edge detection involving thresholding
    • G06T2207/00: Indexing scheme for image analysis or image enhancement
    • G06T2207/20: Special algorithmic details
    • G06T2207/20081: Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a target detection method guided by positioning information, in which the positioning information represented by the intersection-over-union (IoU) ratio is embedded into the classification task as a training label, and a hierarchical IoU function discretizes the continuous IoU so that it suits the classification task better. On the one hand, compared with a discrete IoU, learning a continuous IoU in the classification task brings no large performance gain but makes the optimization process slower or even non-convergent; on the other hand, for anchor frames with low overlap with any object, learning the IoU is meaningless. The invention therefore improves the ground-truth label of the classification branch, guiding the classification task with an IoU index consistent with the localization accuracy of the detection frame, and applies the hierarchical IoU function to discretize the IoU vector into levels, thereby preserving the consistency of the classification and regression tasks and effectively improving detection accuracy.

Description

Target detection method based on positioning information guidance
Technical Field
The invention belongs to the technical fields of computer vision recognition, artificial intelligence and target detection, and particularly relates to a target detection method based on positioning information guidance.
Background
Over the last decade, deep learning has developed rapidly and gradually become the mainstream direction in the field of artificial intelligence. Artificial intelligence methods represented by deep learning have brought breakthrough progress to computer vision, natural language processing and other fields. Computer vision has long been a major area of computer technology and receives widespread attention in academia and industry. Thanks to the revolutionary progress of deep learning, artificial intelligence has also brought new ideas and methods to computer vision and achieved good results.
Object detection is a basic task of computer vision that aims to identify and localize the objects appearing in an image. Its task can be defined specifically as: given an input image, judge whether the image contains targets and, if so, give the position and category of each target. Target detection underlies applications such as target tracking, object segmentation and assisted navigation, so it has wide application scenarios and great significance. However, traditional target detection algorithms hit bottlenecks in both detection accuracy and running speed and cannot meet broader application requirements, so innovation in target detection technology is urgently needed.
The successful application of deep learning in this field has broken through the constraints of traditional methods and achieved great success. A family of algorithms represented by the convolutional neural network can efficiently and automatically extract image features for the target detection task, achieving speed and accuracy far beyond traditional algorithms. A model based on a convolutional neural network first extracts a feature map from the image to be detected through convolution and pooling operations. Then, on the extracted multi-scale features, it predicts the deviation of initially placed prior frames (also called anchor frames) relative to the targets in the image, and predicts the category of each target. When outputting the detection result, redundant detection frames are suppressed (non-maximum suppression) to obtain sparse, effective detection frames, completing the detection of targets in the image.
At present, most algorithms based on convolutional neural networks divide the target detection task into two subtasks, classification and localization, which are predicted separately without mutual interaction. In the non-maximum suppression stage, the classification score is used as the basis for selecting prediction frames, and the detection result is output. However, because the algorithm handles classification and localization separately, a large divergence arises between the two tasks: the classification score does not characterize the localization accuracy of the prediction frame. This makes the detection results output after non-maximum suppression unreliable and reduces detection performance.
Disclosure of Invention
To solve these problems, the invention provides a target detection method based on positioning information guidance, which adapts better to the classification task and improves detection accuracy.
A target detection method based on positioning information guidance inputs an image to be detected into a trained target detection model to obtain the category and the positioning result of an object contained in the image to be detected; the training method of the target detection model comprises the following steps:
s1: performing feature extraction on the sample image with a first convolution branch to obtain a feature map, and meanwhile setting a number of prior rectangular anchor frames on the feature map, wherein the number and positions of the prior rectangular anchor frames are sufficient to cover the whole feature map;
s2: performing convolution operations on the feature map with a second convolution branch and a third convolution branch respectively, correspondingly obtaining a classification matrix S of the prior rectangular anchor frames and an offset matrix D of the prior rectangular anchor frames, wherein D = B + P, B being the position matrix of the prior rectangular anchor frames and P the predicted offset of each detection frame relative to its prior rectangular anchor frame;
s3: constructing a total loss function L, calculating it and judging whether it is smaller than a set value; if not, changing the convolution layer parameters of the first, second and third convolution branches and repeating steps S1 to S3 until the total loss function L is smaller than the set value; once it is smaller, the first, second and third convolution branches constitute the final target detection model;
the total loss function L is constructed by the following method:
s31: respectively obtaining the intersection-over-union (IoU) between each prior rectangular anchor frame and each object bounding frame in the sample image to obtain an M × N IoU matrix IoU, wherein M is the number of prior rectangular anchor frames and N is the number of object bounding frames;
s32: extracting the maximum element of each row of the IoU matrix IoU to construct an M × 1 IoU vector IoU_max, and matching the prior rectangular anchor frame corresponding to the row of each maximum element with the position of the object bounding frame corresponding to its column, obtaining the matrix G′ of object positions each prior rectangular anchor frame is expected to learn; meanwhile, forming the class numbers of the objects in G′ into an M × 1 class vector CLS;
s33: processing the IoU vector IoU_max hierarchically with a hierarchical IoU function to obtain the hierarchical vector hIoU;
s34: constructing a total loss function L of the target detection model according to the hierarchical vector hIoU, the class vector CLS, the classification matrix S and the offset matrix D as follows:
L = |S - hIoU·hIoU^T·OneHot(CLS)| + |D[pos] - G′[pos]|
wherein ^T denotes transposition, OneHot(CLS) denotes converting CLS into one-hot codes, D[pos] denotes the shifted position coordinates, in the offset matrix D, of the prior rectangular anchor frames selected as positive samples, and G′[pos] denotes the position coordinates of the objects those prior rectangular anchor frames are expected to learn.
Further, the method for selecting the prior rectangular anchor frame as the positive sample comprises the following steps:
the cross-over-ratio vector IoU is cross-over-ratio using a hierarchical cross-over-ratio function as followsmaxCarrying out hierarchical processing to obtain a hierarchical vector hIoU:
hIoU(m) = δ·E(IoU_max(m)/δ), if IoU_max(m) > 0.5
hIoU(m) = 0, if IoU_max(m) ≤ 0.5
where IoU_max(m) is the m-th element of the IoU vector IoU_max, hIoU(m) is the m-th element of the hierarchical vector hIoU, δ is a set interval-division threshold, and E(·) is the round-down (floor) function;
prior rectangular anchor frames satisfying IoU_max(m) > 0.5 are taken as the positive samples used for training.
Further, the IoU between each prior rectangular anchor frame and each object bounding frame in the sample image is calculated as follows:
IoU(A, B) = |A ∩ B| / |A ∪ B|
wherein IoU denotes the intersection-over-union between a prior rectangular anchor frame A and an object bounding frame B, A ∩ B denotes their intersection, and A ∪ B denotes their union.
Further, after the image to be detected is input into the trained target detection model, non-maximum suppression and redundant-frame elimination are performed on the obtained categories and positioning results, and the results after non-maximum suppression and redundant-frame elimination are taken as the final categories and positioning results of the objects contained in the image to be detected.
Further, a classification matrix S of the prior rectangular anchor frame is an M x C dimensional matrix, wherein C is the number of all possible classes of the object, and each element in the classification matrix S represents the probability value of each prior rectangular anchor frame belonging to each class;
and taking the category corresponding to the maximum element in each row of the classification matrix S as the category of the prior rectangular anchor frame corresponding to each row, wherein each maximum element is the category score of the category of each prior rectangular anchor frame.
Beneficial effects:
1. The invention provides a target detection method based on positioning information guidance, in which the positioning information represented by the IoU is embedded into the classification task as a training label, and a hierarchical IoU function discretizes the continuous IoU so that it suits the classification task better. On the one hand, compared with a discrete IoU, learning a continuous IoU in the classification task brings no large performance gain but makes the optimization process slower or even non-convergent; on the other hand, for anchor frames with low overlap with any object, learning the IoU is meaningless. The invention therefore improves the ground-truth label of the classification branch, guiding the classification task with an IoU index consistent with the localization accuracy of the detection frame, and also applies the hierarchical IoU function to the IoU vector, thereby preserving the consistency of the classification and regression tasks and effectively improving detection accuracy.
2. The invention provides a target detection method based on positioning information guidance, which expands each class number in the class vector CLS into one-hot form via OneHot(CLS) and, after weighting by the hierarchical IoU, obtains the expected discrete IoU, thereby optimizing the representation of the IoU in the classification task; this IoU then replaces the traditional one-hot code as the ground-truth label, introducing fine-grained discrete positioning-information supervision into the classification task, adapting to it better and enabling a more accurate evaluation of the prediction results.
3. The invention provides a target detection method based on positioning information guidance in which the hierarchical IoU processing function directly sets the IoU labels of low-overlap anchor frames to 0, which aids the optimization and convergence of the algorithm without affecting the final detection performance.
4. The invention provides a target detection method based on positioning information guidance in which each element of the classification matrix S represents the probability that a prior rectangular anchor frame belongs to a given category, so the target detection model obtains not only the final category of each object contained in the image to be detected but also the score of that category, which supports a more accurate evaluation of the classification prediction.
Drawings
Fig. 1 is a flowchart of a training method of a target detection model provided in the present invention.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
Most existing target detection algorithms based on convolutional neural networks divide the detection task into an independent multi-class classification task and a target localization task, which splits the two tasks excessively, so misjudgments easily occur when non-maximum suppression screens out repeated detection frames. The invention provides a target detection method based on positioning information guidance that, on the one hand, improves the ground-truth label of the classification branch and guides the classification task with an IoU index consistent with the localization accuracy of the detection frame; on the other hand, it optimizes the representation of the IoU in the classification task, adopting discrete positioning-information supervision that suits the classification task better. In the end it preserves the consistency of the classification and regression tasks and improves detection accuracy.
A target detection method based on positioning information guidance comprises the following specific implementation steps:
Input the image to be detected into the trained target detection model to obtain the initial categories and initial positioning results of the objects contained in the image; perform non-maximum suppression and redundant-frame elimination on these initial results, and take the results after non-maximum suppression and redundant-frame elimination as the final category cls* and positioning result D* of each object contained in the image to be detected.
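For illustration only, the following is a minimal NumPy sketch of the greedy non-maximum suppression step just described; the function name nms, the corner-format boxes and the IoU threshold of 0.5 are assumptions for the example rather than values fixed by the patent.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression.

    boxes:  (K, 4) array of (x1, y1, x2, y2) corner coordinates.
    scores: (K,) class scores; higher means more confident.
    Returns the indices of the boxes that are kept.
    """
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # process highest-scoring box first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the kept box with every remaining box
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # discard the redundant boxes overlapping it above the threshold
        order = order[1:][iou <= iou_thresh]
    return keep
```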
As shown in fig. 1, the training method of the target detection model includes the following steps:
S1: performing feature extraction on the sample image with a first convolution branch to obtain a feature map, and meanwhile setting a number of prior rectangular anchor frames on the feature map, wherein the number and positions of the prior rectangular anchor frames are sufficient to cover the whole feature map.
It should be noted that, in the present invention, for an input image I, a convolution operation is adopted to extract features, and five times of downsampling are performed to obtain a feature map, where a formula of the convolution operation is:
o(i, j) = Σ_m Σ_n X(m, n) · K(i - m, j - n)
In the formula, X(m, n) denotes the value at each pixel position of the input image or feature map, and K(i - m, j - n) denotes the value of the convolution kernel at the corresponding position.
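As an illustration of the formula above, a minimal NumPy sketch that evaluates the double sum directly is given below (a real detector would use an optimized convolution library; the restriction to the "valid" output region and the function name are assumptions for the example):

```python
import numpy as np

def conv2d(X, K):
    """Direct evaluation of o(i, j) = sum_{m,n} X(m, n) * K(i - m, j - n)
    over the 'valid' output region (no padding).  The kernel is flipped
    so the sliding multiply-accumulate matches true convolution."""
    Kf = K[::-1, ::-1]                       # flip kernel in both axes
    H = X.shape[0] - K.shape[0] + 1
    W = X.shape[1] - K.shape[1] + 1
    O = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            O[i, j] = np.sum(X[i:i + K.shape[0], j:j + K.shape[1]] * Kf)
    return O
```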
Meanwhile, the invention denotes a single anchor frame as b = (x, y, w, h). b is a four-dimensional vector, where (x, y) represents the position of the center point of the rectangular anchor frame on the feature map and (w, h) represents its width and height. All anchor frames are represented as B, an M × 4 matrix, where M is the total number of anchor frames; all objects in the image are represented as G, an N × 4 matrix, where N is the total number of objects.
S2: performing convolution operations on the feature map with a second convolution branch and a third convolution branch respectively, correspondingly obtaining the classification matrix S of the prior rectangular anchor frames and the offset matrix D of the prior rectangular anchor frames, where D = B + P, B being the position matrix of the prior rectangular anchor frames and P the predicted offset of each detection frame relative to its prior rectangular anchor frame.
It should be noted that the second convolution branch serves as the classification branch and is used to obtain the classification prediction; its output S is an M × C matrix, where C is the number of all possible object classes. Each element of S represents the probability that an anchor frame belongs to a category. Further, the final category score of each anchor frame can be calculated by the following formulas:
s = max(S)
cls = arg max(S)
the above expression represents that the category corresponding to the largest element in each row of the classification matrix S is used as the category to which the prior rectangular anchor frame corresponding to each row belongs, and each largest element is the category score of the category to which each prior rectangular anchor frame belongs and the probability scores S of all anchor frames at the same time, and the corresponding category index cls is obtained. Therefore, the target detection model of the invention obtains the final class cls of the object contained in the image to be detected*The score, that is, the probability, of the category to which the object included in the image to be measured finally belongs can also be obtained.
The third convolution branch serves as the localization branch and is used to obtain the localization prediction; its output D is an M × 4 matrix representing the sum of the anchor frame matrix B and the offset P. That is, the third convolution branch first obtains the offset of each anchor frame and then superimposes it on the original position coordinates of the anchor frame to obtain the predicted coordinates.
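Taken together, the readout of the two branches can be sketched as follows (NumPy; the function and variable names are assumptions, while the row-wise max/argmax and the additive decoding D = B + P follow the description above):

```python
import numpy as np

def decode_predictions(S, B, P):
    """S: (M, C) class probabilities, B: (M, 4) anchor frames,
    P: (M, 4) predicted offsets.  Returns per-anchor score, class
    index, and the decoded box D = B + P as described in the text."""
    s = S.max(axis=1)        # category score: row-wise maximum of S
    cls = S.argmax(axis=1)   # category index of that maximum
    D = B + P                # anchor coordinates shifted by the offsets
    return s, cls, D
```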
S3: constructing a total loss function L, calculating it and judging whether it is smaller than a set value; if not, changing the convolution layer parameters of the first, second and third convolution branches and repeating steps S1 to S3 until the total loss function L is smaller than the set value; once it is smaller, the first, second and third convolution branches constitute the final target detection model.
The total loss function L is constructed by the following method:
S31: respectively obtaining the intersection-over-union (IoU) between each prior rectangular anchor frame and each object bounding frame in the sample image to obtain an M × N IoU matrix IoU, where M is the number of prior rectangular anchor frames and N is the number of object bounding frames.
It should be noted that the IoU measures the degree of overlap between two bounding boxes and is calculated as follows:
IoU(A, B) = |A ∩ B| / |A ∪ B|
where A and B represent two different rectangular bounding boxes, A ∩ B represents their intersection, and A ∪ B represents their union. The IoU between all anchor frames and all target objects in the image is obtained and denoted IoU, with the formula:
IoU=IoU(B,G)
where IoU is a matrix of dimension M x N.
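A minimal vectorized sketch of computing this M × N matrix is given below (NumPy; it assumes boxes stored as (x1, y1, x2, y2) corners, so the center/width/height anchors described above would be converted first):

```python
import numpy as np

def iou_matrix(B, G):
    """B: (M, 4) anchor frames, G: (N, 4) object frames, both as
    (x1, y1, x2, y2).  Returns the (M, N) matrix of pairwise IoU."""
    # pairwise intersection rectangle, via broadcasting to (M, N)
    xx1 = np.maximum(B[:, None, 0], G[None, :, 0])
    yy1 = np.maximum(B[:, None, 1], G[None, :, 1])
    xx2 = np.minimum(B[:, None, 2], G[None, :, 2])
    yy2 = np.minimum(B[:, None, 3], G[None, :, 3])
    inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
    area_b = (B[:, 2] - B[:, 0]) * (B[:, 3] - B[:, 1])
    area_g = (G[:, 2] - G[:, 0]) * (G[:, 3] - G[:, 1])
    union = area_b[:, None] + area_g[None, :] - inter
    return inter / union
```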
S32: extracting the maximum element of each row of the IoU matrix IoU to construct an M × 1 IoU vector IoU_max, and matching the prior rectangular anchor frame corresponding to the row of each maximum element with the position of the object bounding frame corresponding to its column, obtaining the matrix G′ of object positions each prior rectangular anchor frame is expected to learn; meanwhile, forming the class numbers of the objects in G′ into an M × 1 class vector CLS.
That is, the invention finds the largest element of each row together with its position index within the row; these elements form an M × 1 IoU vector, denoted IoU_max. According to the position indices, each anchor frame is assigned the target with which it has the largest IoU and is made responsible for predicting that target. These targets constitute an M × 4 matrix G′, which represents the position coordinates of the objects the anchor frames are expected to learn. Meanwhile, the class numbers of the objects in G′ form an M × 1 class vector, denoted CLS.
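The row-wise matching just described might be implemented as in the following sketch (NumPy; it assumes the iou_matrix helper above and an array gt_labels holding the class number of each of the N objects):

```python
import numpy as np

def assign_targets(IoU, G, gt_labels):
    """IoU: (M, N) anchor-to-object IoU matrix, G: (N, 4) object boxes,
    gt_labels: (N,) class numbers.  Returns IoU_max (M,), the matched
    boxes G' (M, 4) each anchor should learn, and the class vector CLS."""
    best = IoU.argmax(axis=1)                        # column of each row's maximum
    IoU_max = IoU[np.arange(IoU.shape[0]), best]     # value of that maximum
    G_prime = G[best]                                # object box assigned to each anchor
    CLS = gt_labels[best]                            # class number of that object
    return IoU_max, G_prime, CLS
```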
S33: processing the IoU vector IoU_max hierarchically with the hierarchical IoU function to obtain the hierarchical vector hIoU.
Wherein, the hierarchical intersection ratio function is as follows:
hIoU(m) = δ·E(IoU_max(m)/δ), if IoU_max(m) > 0.5
hIoU(m) = 0, if IoU_max(m) ≤ 0.5
where IoU_max(m) is the m-th element of the IoU vector IoU_max, hIoU(m) is the m-th element of the hierarchical vector hIoU, δ is a set interval-division threshold responsible for dividing the IoU into discrete intervals, taken here as 0.1, and E(·) is the round-down (floor) function, with the specific formula:
E(x) = ⌊x⌋, i.e. the largest integer not exceeding x
further, IoU will be satisfiedmaxAnd (m) > 0.5, taking a priori rectangular anchor box as a positive sample used for training, participating in the calculation of the positioning loss, and recording the index of the anchor box in all anchor boxes as pos.
S34: constructing a total loss function L of the target detection model according to the hierarchical vector hIoU, the class vector CLS, the classification matrix S and the offset matrix D as follows:
L = |S - hIoU·hIoU^T·OneHot(CLS)| + |D[pos] - G′[pos]|
where ^T denotes transposition, D[pos] denotes the shifted position coordinates, in the offset matrix D, of the prior rectangular anchor frames selected as positive samples, and G′[pos] denotes the position coordinates of the objects those anchor frames are expected to learn. OneHot(CLS) is an M × C matrix obtained by converting CLS into one-hot codes, i.e. row vectors that are 1 at the position of the class number and 0 elsewhere. After this matrix is weighted by the hierarchical IoU, the expected discrete IoU is obtained; using this IoU as the ground-truth label in place of the traditional one-hot code introduces fine-grained positioning-information supervision into the classification task and enables a more accurate evaluation of the prediction results.
It should be noted that the total loss function L is actually composed of a classification loss function and a positioning loss function, and specifically, the following is:
L = L_cls + L_loc
L_cls = |S - hIoU·hIoU^T·OneHot(CLS)|
L_loc = |D[pos] - G′[pos]|
where L_cls is the classification loss function, representing the deviation between the prediction of the classification task and the ground-truth label, and L_loc is the localization loss function, representing the deviation between the predicted positions of the anchor frames taken as positive samples and the positions of the objects they are expected to learn.
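Putting the pieces together, a sketch of the total loss is given below (NumPy; it reads |·| as a sum of absolute deviations and reuses the helpers sketched above, both of which are assumptions for the example):

```python
import numpy as np

def total_loss(S, D, hIoU, CLS, G_prime, pos, num_classes):
    """L = |S - hIoU . hIoU^T . OneHot(CLS)| + |D[pos] - G'[pos]|,
    applied literally, with |.| taken as a sum of absolute deviations."""
    onehot = np.eye(num_classes)[CLS]            # (M, C) one-hot codes
    target = np.outer(hIoU, hIoU) @ onehot       # hIoU . hIoU^T . OneHot(CLS)
    L_cls = np.abs(S - target).sum()             # classification loss
    L_loc = np.abs(D[pos] - G_prime[pos]).sum()  # localization loss
    return L_cls + L_loc
```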
In summary, common target detection algorithms compute the difference between the prediction and the true category using a cross-entropy loss over one-hot codes, but this index has no connection to the localization task, so a divergence exists between the two and detection performance suffers. The invention embeds the positioning information represented by the IoU into the classification task as a training label, and the hierarchical IoU function adapts the continuous IoU to the classification task better than using it directly. On the one hand, compared with a discrete IoU, learning a continuous IoU in the classification task brings no large performance gain but makes the optimization process slower or even non-convergent. On the other hand, learning the IoU is meaningless for anchor frames with low overlap with any object, so the hierarchical IoU processing function directly sets their IoU labels to 0, which aids the optimization and convergence of the algorithm without affecting the final detection performance.
The present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof, and it will be understood by those skilled in the art that various changes and modifications may be made herein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (5)

1. A target detection method based on positioning information guidance is characterized in that an image to be detected is input into a trained target detection model, and the category and the positioning result of an object contained in the image to be detected are obtained; the training method of the target detection model comprises the following steps:
s1: performing feature extraction on the sample image with a first convolution branch to obtain a feature map, and meanwhile setting a number of prior rectangular anchor frames on the feature map, wherein the number and positions of the prior rectangular anchor frames are sufficient to cover the whole feature map;
s2: performing convolution operations on the feature map with a second convolution branch and a third convolution branch respectively, correspondingly obtaining a classification matrix S of the prior rectangular anchor frames and an offset matrix D of the prior rectangular anchor frames, wherein D = B + P, B being the position matrix of the prior rectangular anchor frames and P the predicted offset of each detection frame relative to its prior rectangular anchor frame;
s3: constructing a total loss function L, calculating it and judging whether it is smaller than a set value; if not, changing the convolution layer parameters of the first, second and third convolution branches and repeating steps S1 to S3 until the total loss function L is smaller than the set value; once it is smaller, the first, second and third convolution branches constitute the final target detection model;
the total loss function L is constructed by the following method:
s31: respectively obtaining the intersection-over-union (IoU) between each prior rectangular anchor frame and each object bounding frame in the sample image to obtain an M × N IoU matrix IoU, wherein M is the number of prior rectangular anchor frames and N is the number of object bounding frames;
s32: extracting the maximum element of each row of the IoU matrix IoU to construct an M × 1 IoU vector IoU_max, and matching the prior rectangular anchor frame corresponding to the row of each maximum element with the position of the object bounding frame corresponding to its column, obtaining the matrix G′ of object positions each prior rectangular anchor frame is expected to learn; meanwhile, forming the class numbers of the objects in G′ into an M × 1 class vector CLS;
s33: processing the IoU vector IoU_max hierarchically with a hierarchical IoU function to obtain the hierarchical vector hIoU;
s34: constructing a total loss function L of the target detection model according to the hierarchical vector hIoU, the class vector CLS, the classification matrix S and the offset matrix D as follows:
L = |S - hIoU·hIoU^T·OneHot(CLS)| + |D[pos] - G′[pos]|
wherein ^T denotes transposition, OneHot(CLS) denotes converting CLS into one-hot codes, D[pos] denotes the shifted position coordinates, in the offset matrix D, of the prior rectangular anchor frames selected as positive samples, and G′[pos] denotes the position coordinates of the objects those prior rectangular anchor frames are expected to learn.
2. The target detection method based on the guidance of the positioning information as claimed in claim 1, wherein the prior rectangular anchor frame as the positive sample is selected by:
the cross-over-ratio vector IoU is cross-over-ratio using a hierarchical cross-over-ratio function as followsmaxCarrying out hierarchical processing to obtain a hierarchical vector hIoU:
hIoU(m) = δ·E(IoU_max(m)/δ), if IoU_max(m) > 0.5
hIoU(m) = 0, if IoU_max(m) ≤ 0.5
where IoU_max(m) is the m-th element of the IoU vector IoU_max, hIoU(m) is the m-th element of the hierarchical vector hIoU, δ is a set interval-division threshold, and E(·) is the round-down (floor) function;
will satisfy IoUmaxThe a priori rectangular anchor box with (m) > 0.5 serves as a positive sample for training.
3. The positioning-information-guided target detection method as claimed in claim 1, wherein the intersection-over-union between each prior rectangular anchor frame and each object bounding frame in the sample image is calculated as follows:
IoU(A, B) = |A ∩ B| / |A ∪ B|
wherein IoU denotes the intersection-over-union between a prior rectangular anchor frame A and an object bounding frame B, A ∩ B denotes their intersection, and A ∪ B denotes their union.
4. The method as claimed in claim 1, wherein after the image to be detected is input into the trained target detection model, non-maximum suppression and redundant-frame elimination are performed on the obtained categories and positioning results, and the results after non-maximum suppression and redundant-frame elimination are taken as the final categories and positioning results of the objects contained in the image to be detected.
5. The positioning information guidance-based object detection method as claimed in claim 1, wherein the classification matrix S of the prior rectangular anchor boxes is an M × C matrix, where C is the number of all possible classes of the object, and each element in the classification matrix S represents the probability value of each prior rectangular anchor box belonging to each class;
and taking the category corresponding to the maximum element in each row of the classification matrix S as the category of the prior rectangular anchor frame corresponding to each row, wherein each maximum element is the category score of the category of each prior rectangular anchor frame.
CN202110960804.6A 2021-08-20 2021-08-20 Target detection method based on positioning information guidance Pending CN113673540A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110960804.6A CN113673540A (en) 2021-08-20 2021-08-20 Target detection method based on positioning information guidance

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110960804.6A CN113673540A (en) 2021-08-20 2021-08-20 Target detection method based on positioning information guidance

Publications (1)

Publication Number Publication Date
CN113673540A true CN113673540A (en) 2021-11-19

Family

ID=78544538

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110960804.6A Pending CN113673540A (en) 2021-08-20 2021-08-20 Target detection method based on positioning information guidance

Country Status (1)

Country Link
CN (1) CN113673540A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114372502A (en) * 2021-12-02 2022-04-19 北京工业大学 Angle self-adaptive ellipse template target detector
CN114372502B (en) * 2021-12-02 2024-05-28 北京工业大学 Angle-adaptive elliptical template target detector
CN114638784A (en) * 2022-02-17 2022-06-17 中南大学 Method and device for detecting surface defects of copper pipe based on FE-YOLO

Similar Documents

Publication Publication Date Title
Gao et al. A mutually supervised graph attention network for few-shot segmentation: The perspective of fully utilizing limited samples
Wang et al. Hybrid feature aligned network for salient object detection in optical remote sensing imagery
CN108288088B (en) Scene text detection method based on end-to-end full convolution neural network
CN110930454B (en) Six-degree-of-freedom pose estimation algorithm based on boundary box outer key point positioning
CN110147743B (en) Real-time online pedestrian analysis and counting system and method under complex scene
CN110738207B (en) Character detection method for fusing character area edge information in character image
CN108154118B (en) A kind of target detection system and method based on adaptive combined filter and multistage detection
CN108399406B (en) Method and system for detecting weakly supervised salient object based on deep learning
CN113744311A (en) Twin neural network moving target tracking method based on full-connection attention module
CN113673540A (en) Target detection method based on positioning information guidance
CN111507222A (en) Three-dimensional object detection framework based on multi-source data knowledge migration
CN114913386A (en) Training method of multi-target tracking model and multi-target tracking method
CN117252904B (en) Target tracking method and system based on long-range space perception and channel enhancement
CN113298036A (en) Unsupervised video target segmentation method
CN112734803A (en) Single target tracking method, device, equipment and storage medium based on character description
Liu et al. Building outline delineation from VHR remote sensing images using the convolutional recurrent neural network embedded with line segment information
CN114241250A (en) Cascade regression target detection method and device and computer readable storage medium
Ding et al. Cf-yolo: Cross fusion yolo for object detection in adverse weather with a high-quality real snow dataset
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
CN114187653A (en) Behavior identification method based on multi-stream fusion graph convolution network
Peng et al. Semi-supervised bolt anomaly detection based on local feature reconstruction
Li A deep learning-based text detection and recognition approach for natural scenes
Xu et al. Representative feature alignment for adaptive object detection
CN112418358A (en) Vehicle multi-attribute classification method for strengthening deep fusion network
CN112651294A (en) Method for recognizing human body shielding posture based on multi-scale fusion

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination