CN117058476A - Target detection method based on random uncertainty - Google Patents

Target detection method based on random uncertainty

Info

Publication number
CN117058476A
CN117058476A (application CN202310778187.7A)
Authority
CN
China
Prior art keywords
detection frame
coordinates
classification
inputting
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310778187.7A
Other languages
Chinese (zh)
Inventor
Zhao Feng (赵峰)
Guo Xuesong (郭雪松)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202310778187.7A priority Critical patent/CN117058476A/en
Publication of CN117058476A publication Critical patent/CN117058476A/en
Pending legal-status Critical Current


Classifications

    • G06V 10/774 — Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N 3/0464 — Convolutional networks [CNN, ConvNet]
    • G06N 3/047 — Probabilistic or stochastic networks
    • G06N 3/048 — Activation functions
    • G06N 3/08 — Learning methods
    • G06V 10/40 — Extraction of image or video features
    • G06V 10/764 — Recognition using classification, e.g. of video objects
    • G06V 10/806 — Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
    • G06V 10/82 — Recognition using neural networks
    • G06V 2201/07 — Indexing scheme: target detection


Abstract

The application relates to the field of target detection, in particular to a target detection method based on random uncertainty. The method inputs an image to be identified into a constructed target detection model to obtain the categories and coordinates of the objects in the image. Training of the target detection model comprises the following steps: extracting original classification features and original regression features from the training data and inputting them into an adaptive feature alignment module to obtain optimized classification features; calculating the general distribution of the detection frame coordinates and the determined value of the detection frame coordinates from the original regression features; inputting the original regression features, the optimized classification features and the determined value of the detection frame coordinates into a prediction frame weighted average module to obtain the optimized detection frame coordinates; inputting the optimized classification features and the general distribution of the detection frame coordinates into a target class prediction network to obtain optimized category scores; and training the target detection model according to the loss functions. The application can improve detection precision in complex scenes and predict high-quality detection frames.

Description

Target detection method based on random uncertainty
Technical Field
The application relates to the field of target detection, in particular to a target detection method based on random uncertainty.
Background
Target detection is one of the important tasks in the field of computer vision and is widely applied in fields such as automatic driving and object tracking. In recent years, target detection methods based on deep learning have greatly improved model accuracy and inference speed. Mainstream target detection methods consist of two modules, a feature extraction module and a detector module, where the detector module usually consists of a classification branch and a regression branch. Most deep-learning-based target detection methods propose a deterministic target detection model: the coordinates of the detection frame are represented as a determined value, and the convolution sampling process of the detector's classification branch is modeled as a deterministic process.
However, due to factors inherent to the observed data, such as signal acquisition noise and data labeling errors, random uncertainty exists in deep learning methods. Target detection methods based on deep learning also suffer from random uncertainty, which can be further divided into spatial uncertainty and semantic uncertainty according to the regression task and the classification task in target detection.
First, for the regression task, objects may be truncated or occluded and input images may be blurred, so the boundary of the detection frame is uncertain; that is, spatial uncertainty exists in the target detection task. For the classification task, the shape of each object in the input image is random while the convolution receptive field of the detector's classification branch is fixed, so the convolution features are not aligned with the position of the object; uncertainty therefore exists in the category of the object, i.e. semantic uncertainty exists in the target detection task, which ultimately makes the category prediction of the object inaccurate.
Second, the parallel structure of the classification and regression branches of the target detector also leads to misaligned spatial predictions, affecting the detection performance of the model.
Finally, mainstream target detection methods use only the category score as the quality representation score of the detection frame and ignore the position quality of the detection frame, so they cannot accurately represent its quality; as a result, high-quality detection frames are mistakenly deleted during target detection post-processing, making the target detection results inaccurate and incomplete. Quality here refers to the accuracy and reliability of the detection frame: a high-quality detection frame is one with an accurate position, a proper size, and accurate predictions of the target object's category and confidence.
Disclosure of Invention
In order to solve the problems, the application provides a target detection method based on random uncertainty.
The method constructs a target detection model, inputs an image to be identified into the target detection model, and outputs the categories and coordinates of the objects in the image. Training the target detection model comprises the following steps:
step one, preparing image data, labeling target categories and category scores, labeling detection frame coordinates, and preprocessing the labeled images as training data;
step two, inputting the training data into a feature extraction network to extract its spatial semantic features;
step three, inputting the spatial semantic features into a classification branch feature extraction network and a regression branch feature extraction network, respectively, to obtain the original classification feature $X_{cls}$ and the original regression feature $X_{reg}$;
step four, inputting the original classification feature $X_{cls}$ and the original regression feature $X_{reg}$ into the adaptive feature alignment module to obtain the optimized classification feature $\hat{X}_{cls}$;
step five, calculating the general distribution $P(y)$ of the detection frame coordinates and the determined value $y_{dtrmd}$ of the detection frame coordinates from the original regression feature $X_{reg}$;
step six, inputting the original regression feature $X_{reg}$, the optimized classification feature $\hat{X}_{cls}$ and the determined value $y_{dtrmd}$ of the detection frame coordinates into the prediction frame weighted average module to obtain the optimized detection frame coordinates $r_{refine}$;
step seven, inputting the optimized classification feature $\hat{X}_{cls}$ and the general distribution $P(y)$ of the detection frame coordinates into the target class prediction network to obtain the optimized category scores;
step eight, training the target detection model according to the classification loss function FocalLoss and the regression loss function GIoULoss until a preset training completion condition is reached.
Further, the second step specifically includes inputting training data to a convolution feature extraction network to obtain multi-layer convolution features, and inputting the multi-layer convolution features to a spatial semantic feature enhancement network to obtain spatial semantic features.
Further, the convolution feature extraction network is ResNet-50 or ResNet-101.
Further, the spatial semantic feature enhancement network is a multi-level feature pyramid network FPN.
Further, the fourth step specifically includes:
inputting the original regression feature $X_{reg}$ into a convolution layer to obtain the random offset $P$;
performing a random sampling operation on the random offset $P$ and the original classification feature $X_{cls}$ to obtain the aligned classification feature
$$X_{align}(p_i) = \sum_{p_m \in R} w(p_m)\, X_{cls}(p_i + p_m + \Delta p_m),$$
where $m$ is the number of convolution sampling points, $p_i$ is the center position of the current convolution kernel, $R$ is the set of sampling positions of the convolution on the feature map, $p_m$ is each position in $R$, $\Delta p_m$ is the offset of $p_m$, and $w(p_m)$ is the convolution kernel weight at position $p_m$;
fusing the original classification feature $X_{cls}$ and the aligned classification feature $X_{align}$ to obtain the optimized classification feature
$$\hat{X}_{cls} = \alpha X_{cls} + (1 - \alpha)\, X_{align},$$
where $\alpha$ is the original classification feature coefficient.
Further, the fifth step specifically includes:
defining the general distribution approximation model of the detection frame coordinates as
$$\hat{y} = \sum_{i=0}^{n} P(y_i)\, y_i,$$
where $y_i$ indicates that the distance from the feature point position of the current detection frame to the detection frame boundary is $i$, $P(\cdot)$ is the probability density function, and $n$ is the number of discrete values of the general distribution;
inputting the original regression feature $X_{reg}$ into a one-layer convolution network according to the general distribution approximation model to obtain a feature map;
inputting the feature map into a Softmax activation function to obtain the general distribution $P(y)$ of the detection frame coordinates;
inputting the general distribution $P(y)$ into a mathematical expectation calculation module to obtain the determined value $y_{dtrmd}$ of the detection frame coordinates.
Further, the sixth step specifically includes:
concatenating the original regression feature $X_{reg}$ and the optimized classification feature $\hat{X}_{cls}$ in the channel dimension to obtain the fusion feature $X_{concat}$;
inputting the fusion feature $X_{concat}$ into a one-layer convolution network to generate the detection frame position sampling offset $O$;
inputting the detection frame position sampling offset $O$ and the determined value $y_{dtrmd}$ of the detection frame coordinates into a deformable convolution network to obtain the optimized detection frame coordinates
$$r_{refine}^{k} = \sum_{j=1}^{l} w_j\, r^{k}(x + \Delta x_j,\, y + \Delta y_j),$$
where $j$ indexes the current deformable convolution sampling point, $l$ is the number of deformable convolution sampling points, $w_j$ is the deformable convolution weight of the $j$-th sampling point, $r$ is the original predicted frame coordinate value, $x$ and $y$ are the horizontal and vertical coordinates of the current point, $\Delta x_j$ and $\Delta y_j$ are the horizontal and vertical offsets of the current point, and $k$ indexes the detection frame coordinates.
Further, the seventh step specifically includes:
inputting the optimized classification feature $\hat{X}_{cls}$ into a one-layer convolutional neural network to obtain logits, and inputting the logits into a Sigmoid activation function to obtain the category scores;
extracting the three largest probability values, the mean and the variance from the general distribution $P(y)$ of the detection frame coordinates, and inputting them into the probability guidance module to obtain the position quality estimate;
multiplying the position quality estimate by the category scores to obtain the optimized category scores.
Further, in the step eight, training the target detection model according to the classification loss function FocalLoss and the regression loss function GIoULoss, specifically including:
the inputs of the classification loss function FocalLoss are the target categories labeled in the training data, the category scores labeled in the training data, and the optimized category scores obtained in step seven;
the inputs of the regression loss function GIoULoss are the detection frame coordinates labeled in the training data and the optimized detection frame coordinates $r_{refine}$.
One or more technical solutions provided in the embodiments of the present application at least have the following technical effects or advantages:
1. Aiming at the spatial uncertainty in the target detection task, the application designs a detection frame coordinate modeling module based on general distribution. Modeling the detection frame coordinates as a probability distribution captures boundary uncertainty information in complex scenes, significantly improves the position quality of the detection frame, alleviates the spatial uncertainty problem in the target detection task, and improves the accuracy of detection frame position prediction.
2. Aiming at the semantic uncertainty in the target detection task, the application designs an adaptive feature alignment module based on random sampling, which adaptively learns the optimal offset of each sampling position of the convolution operation and aligns all feature points on the whole feature map. Aligning the classification features improves the accuracy of category prediction and alleviates the semantic uncertainty problem of the target detection task.
3. Aiming at the spatial prediction misalignment problem of the target detector, the application designs a prediction frame weighted average module based on random sampling, which uses higher-quality surrounding detection frame coordinates to optimize the detection frame coordinates of the current feature point, thereby improving the position quality of the detection frame and the model precision.
4. Aiming at the quality representation problem of the detection frame in the target detection task, the application designs a probability guidance module that uses the position information contained in the general distribution of the detection frame coordinates to obtain a position quality estimate, thereby optimizing the quality representation of the detection frame and improving the model precision.
In summary, the method provided by the application can improve detection precision in complex scenes, predict high-quality detection frames, and provide more accurate position information for downstream decisions of the target detection task.
Drawings
FIG. 1 is a schematic diagram of a target detection method based on random uncertainty provided in an embodiment of the present application;
FIG. 2 is a diagram of a random uncertainty based detector network architecture provided by an embodiment of the present application;
FIG. 3 is a block diagram of a random sampling based adaptive feature alignment module according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a network of detection frame coordinate modeling modules based on general distribution according to an embodiment of the present application;
FIG. 5 is a block diagram of a prediction block weighted average module based on random sampling according to an embodiment of the present application;
fig. 6 is a diagram of a target class prediction network according to an embodiment of the present application.
Detailed Description
The present application will be described in detail below with reference to the drawings and detailed embodiments. Before the technical solutions of the embodiments are described in detail, the terms involved are explained; in this specification, components with the same names or the same reference numerals denote similar or identical structures and are limited for illustrative purposes only.
Aiming at the spatial uncertainty and semantic uncertainty problems in target detection scenes in current practical applications, the application provides a high-performance target detection method based on random uncertainty, as shown in FIG. 1: a target detection model is constructed, an image to be identified is input into the target detection model, and the categories and coordinates of the objects in the image are output. The target detection model includes a feature extraction network and a detector network based on random uncertainty. Based on random uncertainty theory, the application introduces probability distributions and random sampling into the detector network. The method comprises the following steps:
1. data preparation
1.1 data set preparation
Firstly, image data in automatic driving scenes are collected. Then the object categories and category scores of objects such as pedestrians, cars, bicycles, traffic lights and trees are labeled, together with the detection frame coordinates; the category score labels refer to the confidence of the object categories in the image data. Finally, the data are divided into a training set and a test set at a ratio of 9:1, used for training and testing the model, respectively.
1.2 data enhancement
During training and testing, data enhancement needs to be performed on the input images to adapt to the different input image sizes in actual scenes and to enhance the robustness of the target detection model. During training, a multi-scale image enhancement technique is used: the short side is randomly scaled between 480 and 960 and the long side is scaled proportionally with a maximum of 1333, and one scale is selected as input to the target detection model. During testing, the multi-scale image enhancement technique is still used: the short side is scaled to 480, 600, 720, 840 and 960, respectively, and the long side is scaled proportionally with a maximum of 1333; the images at these five scales are input into the target detection model, and the detection results are then input into the post-processing module together to obtain the final test result.
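For illustration, a minimal PyTorch-style sketch of this multi-scale resizing follows; the function name and the assumption of PIL-format input images are illustrative, not taken from the application:

```python
import random
import torchvision.transforms.functional as TF

def multiscale_resize(img, train=True, test_short_side=480):
    """Resize so the short side hits a target scale while capping the long side at 1333."""
    # training: random short side in [480, 960]; testing: one of {480, 600, 720, 840, 960}
    short = random.randint(480, 960) if train else test_short_side
    w, h = img.size                        # PIL images report (width, height)
    scale = short / min(w, h)
    scale = min(scale, 1333 / max(w, h))   # keep the long side at or below 1333
    return TF.resize(img, [round(h * scale), round(w * scale)])
```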
2. Extracting spatial semantic features of an input image
As shown in fig. 1, the image subjected to data enhancement in step 1.2 is input into a feature extraction network for extracting spatial semantic features of the input image.
The feature extraction network comprises a backbone network and a spatial semantic feature enhancement network. An image is first input into the backbone network to obtain its multi-layer convolution features; the backbone uses a common convolutional feature extraction network such as ResNet-50 or ResNet-101. The multi-layer convolution features are then input into the spatial semantic feature enhancement network to obtain the spatial semantic features of the image; the spatial semantic feature enhancement network is a multi-level feature pyramid network (FPN), which provides low-level spatial features and high-level semantic features of the image and optimizes the detection precision for multi-scale targets.
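A minimal sketch of such a feature extraction network, using torchvision's ready-made ResNet-50 + FPN wrapper as a stand-in for the backbone and spatial semantic feature enhancement network (the input resolution is illustrative):

```python
import torch
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

# ResNet-50 backbone with a feature pyramid network on top; each level has 256 channels
backbone = resnet_fpn_backbone(backbone_name="resnet50", weights=None)
features = backbone(torch.randn(1, 3, 800, 1216))   # dummy image batch
for level, feat in features.items():
    print(level, tuple(feat.shape))                  # pyramid levels '0'..'3' and 'pool'
```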
3. Acquiring target detection results based on random uncertainty
As shown in fig. 1, the spatial semantic features obtained in the step 2 are input into a random uncertainty-based detector network to obtain a detection result, wherein the detection result comprises a target category, a score thereof and a detection frame coordinate.
As shown in fig. 2, the random uncertainty-based detector network comprises a classification branch and a regression branch, wherein the classification branch comprises a classification branch feature extraction network, a random sampling-based adaptive feature alignment module, a probability guidance module and a target class prediction network, and the regression branch comprises a regression branch feature extraction network, a general distribution-based detection frame coordinate modeling module and a random sampling-based prediction frame weighted average module;
3.1 extracting classification and regression characteristics
As shown in FIG. 2, the spatial semantic features obtained in step 2 are used as the input feature map and are input into the classification branch feature extraction network and the regression branch feature extraction network, respectively, which output the original classification feature $X_{cls}$ and the original regression feature $X_{reg}$. The classification branch and regression branch feature extraction networks are two parallel branches, each composed of a four-layer convolution network.
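A sketch of the two parallel four-layer convolution towers; the application specifies only four convolution layers per branch, so the normalization and activation choices below are assumptions:

```python
import torch.nn as nn

def make_branch_tower(channels=256, depth=4):
    """Four stacked 3x3 conv blocks, one tower each for the cls and reg branches."""
    layers = []
    for _ in range(depth):
        layers += [
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.GroupNorm(32, channels),   # assumed; the application does not name a norm layer
            nn.ReLU(inplace=True),
        ]
    return nn.Sequential(*layers)

cls_tower = make_branch_tower()   # produces X_cls from the spatial semantic features
reg_tower = make_branch_tower()   # produces X_reg
```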
3.2 optimizing classification characteristics based on random sampling
As shown in FIG. 2, the original classification feature $X_{cls}$ and the original regression feature $X_{reg}$ are simultaneously input into the adaptive feature alignment module based on random sampling, which outputs the optimized classification feature $\hat{X}_{cls}$. Using the position information contained in the original regression feature $X_{reg}$, the module adaptively learns the optimal offset of each sampling position of the convolution operation, so that all feature points on the whole feature map are aligned, alleviating the semantic uncertainty problem of the target detection task. The adaptive feature alignment module based on random sampling consists of one ordinary convolution layer, one deformable convolution layer and one point-wise weighted accumulation operation.
The specific process is shown in fig. 3, and is divided into three steps:
first, the original regression feature $X_{reg}$ is input into the ordinary convolution to obtain a random offset $P = \mathrm{Conv}(X_{reg})$, where $\mathrm{Conv}(\cdot)$ denotes an ordinary convolution operation;
then, the random offset $P$ and the original classification feature $X_{cls}$ are input into the deformable convolution for a random sampling operation, yielding the aligned classification feature $X_{align}(p_i) = \sum_{p_m \in R} w(p_m)\, X_{cls}(p_i + p_m + \Delta p_m)$, where $m$ is the number of convolution sampling points (set to 9 in the embodiment of the application), $p_i$ is the center position of the current convolution kernel, $R$ is the set of sampling positions of the ordinary convolution on the feature map, $p_m$ is each position in $R$, $\Delta p_m$ is the offset of $p_m$, and $w(p_m)$ is the convolution kernel weight at position $p_m$;
finally, the original classification feature $X_{cls}$ and the aligned classification feature $X_{align}$ are fused to obtain the optimized classification feature $\hat{X}_{cls} = \alpha X_{cls} + (1 - \alpha)\, X_{align}$ for the final classification task, where $\alpha$ is the original classification feature coefficient, set to 0.3 in the embodiment of the application.
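A sketch of this module using torchvision's deformable convolution; the class name and the exact fusion form (a convex combination weighted by α) are assumptions consistent with α being the original classification feature coefficient:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class AdaptiveFeatureAlignment(nn.Module):
    """Sketch of the random-sampling alignment module: offsets come from X_reg,
    deformable sampling is applied to X_cls, and the two features are fused."""
    def __init__(self, channels=256, alpha=0.3):
        super().__init__()
        # 3x3 kernel -> m = 9 sampling points, each needing an (x, y) offset: 18 channels
        self.offset_conv = nn.Conv2d(channels, 18, kernel_size=3, padding=1)
        self.deform_conv = DeformConv2d(channels, channels, kernel_size=3, padding=1)
        self.alpha = alpha   # original classification feature coefficient

    def forward(self, x_cls, x_reg):
        offsets = self.offset_conv(x_reg)            # random offset P
        x_align = self.deform_conv(x_cls, offsets)   # aligned classification feature
        # assumed fusion form: alpha weights the original classification feature
        return self.alpha * x_cls + (1.0 - self.alpha) * x_align

module = AdaptiveFeatureAlignment()
out = module(torch.randn(1, 256, 32, 32), torch.randn(1, 256, 32, 32))
```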
3.3 detection frame coordinates based on general distribution modeling
As shown in FIG. 2, the original regression feature $X_{reg}$ is input into the detection frame coordinate modeling module based on general distribution, which outputs, in turn, the general distribution of the detection frame coordinates and the determined value of the detection frame coordinates. Modeling the detection frame coordinates as a probability distribution captures boundary uncertainty information in complex scenes and alleviates the spatial uncertainty problem of the target detection task. The module consists of one ordinary convolution layer, a Softmax activation function and a mathematical expectation calculation module.
The specific process is shown in fig. 4, and is divided into three steps:
(1) Detection frame coordinates based on general distribution modeling
The general distribution model of the detection frame coordinates is expressed as $\hat{y} = \int_{y_0}^{y_n} P(q)\, q\, dq$, where $P(\cdot)$ is the probability density function of the general distribution and $q$ denotes the distance from the feature point position of the current detection frame to the detection frame boundary. The minimum $y_0$ of the integration limits is set to 0 and the maximum $y_n$ to 16, with $n = 16$ the number of discrete values of the general distribution. The continuous random distribution is then discretized: the interval between discrete random variables is 1, the number of discrete values is $n+1$, and the discrete values are $\{0, 1, 2, \dots, 16\}$. The general distribution approximation model of the detection frame coordinates is thus expressed as $\hat{y} = \sum_{i=0}^{n} P(y_i)\, y_i$, where $y_i$ indicates that the distance from the feature point position used to predict the current detection frame to the detection frame boundary is $i$, and $P(y_i)$ is the probability of the discrete value $y_i$;
(2) General distribution of output detection frame coordinates
According to the general distribution approximation expression derived in step (1), the original regression feature $X_{reg}$ is first input into a one-layer convolution network to obtain a feature map with $n+1$ output channels; the feature map is then input into a Softmax activation function to obtain the general distribution $P(y)$ of the detection frame coordinates. The Softmax activation function ensures that the probabilities of all discrete random variables predicted by the network sum to 1, i.e. $\sum_{i=0}^{n} P(y_i) = 1$;
(3) Calculating mathematical expectation of general distribution to obtain a determined value of the coordinates of the detection frame
The general distribution of the detection frame coordinates obtained in step (2) is input into the mathematical expectation calculation module to obtain the determined value $y_{dtrmd}$ of the detection frame coordinates; the mathematical expectation is computed approximately according to the general distribution approximation model expression in step (1).
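A sketch of steps (2) and (3): per-side logits are passed through Softmax to form the general distribution, whose expectation over the discrete values {0, …, 16} gives the determined coordinate (the tensor shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def distribution_to_coordinate(logits, n=16):
    """Turn per-side logits over the n+1 discrete values {0,...,n} into P(y) and y_dtrmd."""
    probs = F.softmax(logits, dim=-1)                       # general distribution, sums to 1
    support = torch.arange(n + 1, dtype=probs.dtype, device=probs.device)
    y_dtrmd = (probs * support).sum(dim=-1)                 # mathematical expectation
    return probs, y_dtrmd

logits = torch.randn(2, 4, 17)            # batch of 2, four box sides, 17 discrete values
probs, y_dtrmd = distribution_to_coordinate(logits)
print(y_dtrmd.shape)                       # (2, 4): one determined distance per side
```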
3.4 optimizing detection frame coordinates based on random sampling
As shown in FIG. 2, the original regression feature $X_{reg}$, the optimized classification feature $\hat{X}_{cls}$ and the determined value $y_{dtrmd}$ of the detection frame coordinates are input into the prediction frame weighted average module based on random sampling, which outputs the optimized detection frame coordinates. Using the target category information contained in the optimized classification feature $\hat{X}_{cls}$ and the position-related information contained in the original regression feature $X_{reg}$, the module adaptively captures the higher-quality surrounding detection frame coordinates of each feature point to optimize the detection frame coordinates of the current feature point, alleviating the spatial prediction misalignment problem. The module consists of a concatenation-fusion operation, one ordinary convolution layer and one deformable convolution layer.
The specific process is shown in fig. 5, and is divided into three steps:
first, the original regression feature $X_{reg}$ and the optimized classification feature $\hat{X}_{cls}$ are concatenated in the channel dimension to obtain the fusion feature $X_{concat} = \mathrm{concat}(X_{reg}, \hat{X}_{cls})$, where $\mathrm{concat}(\cdot)$ denotes the feature map concatenation operation;
then, the fusion feature $X_{concat}$ is input into a one-layer ordinary convolution network to generate the detection frame position sampling offset $O = \mathrm{Conv}(X_{concat})$;
finally, the detection frame position sampling offset $O$ and the determined value $y_{dtrmd}$ of the detection frame coordinates are input into a one-layer deformable convolution network, and the surrounding high-quality detection frames of the current detection frame are weighted and averaged to obtain the optimized detection frame coordinates $r_{refine}^{k} = \sum_{j=1}^{l} w_j\, r^{k}(x + \Delta x_j,\, y + \Delta y_j)$, where $j$ indexes the current deformable convolution sampling point, $l$ is the number of deformable convolution sampling points, $w_j$ is the deformable convolution weight of the $j$-th sampling point, $r$ is the original predicted frame coordinate value, $x$ and $y$ are the horizontal and vertical coordinates of the current point, $\Delta x_j$ and $\Delta y_j$ are the horizontal and vertical offsets of the current point, and $k$ indexes the detection frame coordinates; in the embodiment of the application, the set of $k$ is $\{0, 1, 2, 3\}$.
3.5 optimizing class scores with general distribution of detection boxes
As shown in FIG. 2, the optimized classification feature $\hat{X}_{cls}$ and the general distribution $P(y)$ of the detection frame coordinates are input into the target class prediction network, which outputs the optimized category scores. Using the position information contained in the general distribution $P(y)$ of the detection frame coordinates, a position quality estimate is obtained to optimize the quality representation of the detection frame, yielding a more accurate quality representation and alleviating the phenomenon of high-quality detection frames being mistakenly deleted during target detection post-processing. The target class prediction network consists of one ordinary convolution layer, a Sigmoid activation function, the probability guidance module and a dot product operation.
The specific process is shown in fig. 6, and is divided into three steps:
first, the optimized classification feature $\hat{X}_{cls}$ is input into a one-layer convolutional neural network to obtain logits, and the logits are input into a Sigmoid activation function to obtain the category scores;
then, the three largest probability values, the mean and the variance are extracted from the general distribution $P(y)$ of the detection frame coordinates as one-dimensional statistics and input into the probability guidance module to obtain the position quality estimate; the probability guidance module consists of a fully connected layer and a Sigmoid activation function;
finally, the position quality estimate is multiplied by the category scores to obtain the final optimized category scores.
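A sketch of the probability guidance module: per-side statistics (top-3 probabilities, mean, variance) of the general distribution are passed through one fully connected layer and a Sigmoid; the 4-sides-by-5-statistics layout is an assumption:

```python
import torch
import torch.nn as nn

class ProbabilityGuidance(nn.Module):
    """One FC layer + Sigmoid over per-side statistics of the general distribution P(y)."""
    def __init__(self, sides=4, stats_per_side=5):
        super().__init__()
        self.fc = nn.Linear(sides * stats_per_side, 1)

    def forward(self, probs):                  # probs: (N, 4, n+1) general distribution
        top3 = probs.topk(3, dim=-1).values    # three largest probability values per side
        mean = probs.mean(dim=-1, keepdim=True)
        var = probs.var(dim=-1, keepdim=True)
        stats = torch.cat([top3, mean, var], dim=-1).flatten(1)   # (N, 20)
        return torch.sigmoid(self.fc(stats))   # position quality estimate in (0, 1)

quality = ProbabilityGuidance()(torch.softmax(torch.randn(8, 4, 17), dim=-1))
# the optimized category score is quality * category_score, broadcast per prediction
```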
In summary, the outputs of the current model are the optimized detection frame coordinates $r_{refine}$ predicted in step 3.4 and the final optimized category scores predicted in step 3.5.
4. Loss function
The loss function of the application consists of two parts: the classification loss function FocalLoss used to train the classification branch and the regression loss function GIoULoss used to train the regression branch, both of which are general-purpose loss functions in target detection tasks. The inputs of the classification loss function FocalLoss are the target categories and their scores labeled in step 1.1 together with the final optimized category scores from step 3.5; the inputs of the regression loss function GIoULoss are the detection frame coordinates labeled in step 1.1 and the optimized detection frame coordinates $r_{refine}$ predicted in step 3.4.
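A sketch of the two losses using torchvision's implementations (sigmoid_focal_loss and, in torchvision ≥ 0.13, generalized_box_iou_loss); the toy tensors and the soft classification targets are assumptions for illustration:

```python
import torch
from torchvision.ops import sigmoid_focal_loss, generalized_box_iou_loss

# toy predictions/targets; real targets come from the annotations of step 1.1
cls_logits = torch.randn(8, 80, requires_grad=True)     # 8 predictions, 80 classes
cls_targets = torch.rand(8, 80)                         # soft score targets (assumed form)
pred_boxes = torch.tensor([[10., 10., 50., 60.]]).repeat(8, 1) + torch.rand(8, 4)
gt_boxes = torch.tensor([[8., 9., 52., 63.]]).repeat(8, 1)   # (x1, y1, x2, y2)

cls_loss = sigmoid_focal_loss(cls_logits, cls_targets,
                              alpha=0.25, gamma=2.0, reduction="mean")
reg_loss = generalized_box_iou_loss(pred_boxes, gt_boxes, reduction="mean")
total_loss = cls_loss + reg_loss
```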
5. Training a target detection model based on a loss function
As shown in FIG. 1, during training, the input images are enhanced according to the training-stage data enhancement method described in step 1.2, and the enhanced images are input into the feature extraction network and the detector network based on random uncertainty to obtain the model's predictions, namely the category scores and the detection frame coordinates. The model's loss is calculated with the loss function of step 4, and the network parameters are updated with an SGD (stochastic gradient descent) optimizer, with the initial learning rate set to 0.01 and the momentum to 0.9; a warmup training strategy is used and the model is trained for 12 epochs to obtain the trained target detection model.
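A sketch of the optimizer setup; the warmup length of 500 iterations and the stand-in model are assumptions, while the learning rate and momentum follow the values above:

```python
import torch

model = torch.nn.Conv2d(3, 3, 3)   # stand-in for the detector network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# linear warmup over the first 500 iterations (warmup length is an assumption)
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.001, total_iters=500)

for iteration in range(1000):       # stands in for the 12-epoch training loop
    optimizer.zero_grad()
    loss = model(torch.randn(1, 3, 8, 8)).sum()   # stands in for the step-4 loss
    loss.backward()
    optimizer.step()
    warmup.step()                   # scheduler stepped once per iteration
```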
6. The model is put into practical use
During testing, as shown in FIG. 1, the input image is enhanced according to the test-stage data enhancement method described in step 1.2 and then input into the trained target detection model to obtain the model's predictions, namely the category scores and the detection frame coordinates. Finally, redundant detection frames are removed through target detection post-processing to obtain accurate category scores and high-quality detection frame positions; the NMS post-processing commonly used in target detection is applied, with the IoU threshold set to 0.6 in the embodiment of the application.
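A sketch of the NMS post-processing with torchvision, using the IoU threshold of 0.6 from this embodiment (the box values are illustrative):

```python
import torch
from torchvision.ops import nms

boxes = torch.tensor([[10., 10., 60., 60.],     # (x1, y1, x2, y2)
                      [12., 12., 62., 62.],     # heavy overlap with the first box
                      [100., 100., 150., 150.]])
scores = torch.tensor([0.90, 0.80, 0.75])       # optimized category scores
keep = nms(boxes, scores, iou_threshold=0.6)    # IoU threshold from this embodiment
print(keep)                                      # tensor([0, 2]): redundant box removed
```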
The above embodiments are merely illustrative of the preferred embodiments of the present application and are not intended to limit the scope of the present application, and various modifications and improvements made by those skilled in the art to the technical solution of the present application should fall within the protection scope defined by the claims of the present application without departing from the design spirit of the present application.

Claims (9)

1. A target detection method based on random uncertainty, which constructs a target detection model, inputs an image to be identified into the target detection model, and outputs the categories and coordinates of the objects in the image, wherein training the target detection model comprises the following steps:
step one, preparing image data, labeling target categories and category scores, labeling detection frame coordinates, and preprocessing the labeled images as training data;
step two, inputting the training data into a feature extraction network to extract its spatial semantic features;
step three, inputting the spatial semantic features into a classification branch feature extraction network and a regression branch feature extraction network, respectively, to obtain the original classification feature $X_{cls}$ and the original regression feature $X_{reg}$;
step four, inputting the original classification feature $X_{cls}$ and the original regression feature $X_{reg}$ into the adaptive feature alignment module to obtain the optimized classification feature $\hat{X}_{cls}$;
step five, calculating the general distribution $P(y)$ of the detection frame coordinates and the determined value $y_{dtrmd}$ of the detection frame coordinates from the original regression feature $X_{reg}$;
step six, inputting the original regression feature $X_{reg}$, the optimized classification feature $\hat{X}_{cls}$ and the determined value $y_{dtrmd}$ of the detection frame coordinates into the prediction frame weighted average module to obtain the optimized detection frame coordinates $r_{refine}$;
step seven, inputting the optimized classification feature $\hat{X}_{cls}$ and the general distribution $P(y)$ of the detection frame coordinates into the target class prediction network to obtain the optimized category scores;
step eight, training the target detection model according to the classification loss function FocalLoss and the regression loss function GIoULoss until a preset training completion condition is reached.
2. The method for detecting the target based on random uncertainty according to claim 1, wherein the step two specifically comprises inputting training data into a convolution feature extraction network to obtain multi-layer convolution features, and inputting the multi-layer convolution features into a spatial semantic feature enhancement network to obtain spatial semantic features.
3. The random uncertainty-based target detection method of claim 2, wherein the convolutional feature extraction network is ResNet-50 or ResNet-101.
4. The random uncertainty-based target detection method of claim 2, wherein the spatial semantic feature enhancement network is a multi-level feature pyramid network FPN.
5. The method for detecting a target based on random uncertainty as claimed in claim 1, wherein the fourth step comprises:
inputting the original regression feature $X_{reg}$ into a convolution layer to obtain the random offset $P$;
performing a random sampling operation on the random offset $P$ and the original classification feature $X_{cls}$ to obtain the aligned classification feature
$$X_{align}(p_i) = \sum_{p_m \in R} w(p_m)\, X_{cls}(p_i + p_m + \Delta p_m),$$
where $m$ is the number of convolution sampling points, $p_i$ is the center position of the current convolution kernel, $R$ is the set of sampling positions of the convolution on the feature map, $p_m$ is each position in $R$, $\Delta p_m$ is the offset of $p_m$, and $w(p_m)$ is the convolution kernel weight at position $p_m$;
fusing the original classification feature $X_{cls}$ and the aligned classification feature $X_{align}$ to obtain the optimized classification feature
$$\hat{X}_{cls} = \alpha X_{cls} + (1 - \alpha)\, X_{align},$$
where $\alpha$ is the original classification feature coefficient.
6. The method for detecting a target based on random uncertainty as claimed in claim 1, wherein the fifth step comprises:
defining the general distribution approximation model of the detection frame coordinates as
$$\hat{y} = \sum_{i=0}^{n} P(y_i)\, y_i,$$
where $y_i$ indicates that the distance from the feature point position of the current detection frame to the detection frame boundary is $i$, $P(\cdot)$ is the probability density function, and $n$ is the number of discrete values of the general distribution;
inputting the original regression feature $X_{reg}$ into a one-layer convolution network according to the general distribution approximation model to obtain a feature map;
inputting the feature map into a Softmax activation function to obtain the general distribution $P(y)$ of the detection frame coordinates;
inputting the general distribution $P(y)$ into a mathematical expectation calculation module to obtain the determined value $y_{dtrmd}$ of the detection frame coordinates.
7. The method for detecting a target based on random uncertainty as claimed in claim 1, wherein the sixth step specifically comprises:
concatenating the original regression feature $X_{reg}$ and the optimized classification feature $\hat{X}_{cls}$ in the channel dimension to obtain the fusion feature $X_{concat}$;
inputting the fusion feature $X_{concat}$ into a one-layer convolution network to generate the detection frame position sampling offset $O$;
inputting the detection frame position sampling offset $O$ and the determined value $y_{dtrmd}$ of the detection frame coordinates into a deformable convolution network to obtain the optimized detection frame coordinates
$$r_{refine}^{k} = \sum_{j=1}^{l} w_j\, r^{k}(x + \Delta x_j,\, y + \Delta y_j),$$
where $j$ indexes the current deformable convolution sampling point, $l$ is the number of deformable convolution sampling points, $w_j$ is the deformable convolution weight of the $j$-th sampling point, $r$ is the original predicted frame coordinate value, $x$ and $y$ are the horizontal and vertical coordinates of the current point, $\Delta x_j$ and $\Delta y_j$ are the horizontal and vertical offsets of the current point, and $k$ indexes the detection frame coordinates.
8. The method for detecting a target based on random uncertainty as claimed in claim 1, wherein the step seven specifically comprises:
inputting the optimized classification feature $\hat{X}_{cls}$ into a one-layer convolutional neural network to obtain logits, and inputting the logits into a Sigmoid activation function to obtain the category scores;
extracting the three largest probability values, the mean and the variance from the general distribution $P(y)$ of the detection frame coordinates, and inputting them into the probability guidance module to obtain the position quality estimate;
multiplying the position quality estimate by the category scores to obtain the optimized category scores.
9. The method for detecting an object based on random uncertainty as claimed in claim 1, wherein the training the object detection model according to the classification loss function FocalLoss and the regression loss function GIoULoss in step eight specifically comprises:
the inputs of the classification loss function FocalLoss are the target categories labeled in the training data, the category scores labeled in the training data, and the optimized category scores obtained in step seven;
the inputs of the regression loss function GIoULoss are the detection frame coordinates labeled in the training data and the optimized detection frame coordinates $r_{refine}$.
CN202310778187.7A 2023-06-28 2023-06-28 Target detection method based on random uncertainty Pending CN117058476A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310778187.7A CN117058476A (en) 2023-06-28 2023-06-28 Target detection method based on random uncertainty

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310778187.7A CN117058476A (en) 2023-06-28 2023-06-28 Target detection method based on random uncertainty

Publications (1)

Publication Number Publication Date
CN117058476A true CN117058476A (en) 2023-11-14

Family

ID=88657874

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310778187.7A Pending CN117058476A (en) 2023-06-28 2023-06-28 Target detection method based on random uncertainty

Country Status (1)

Country Link
CN (1) CN117058476A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118071999A (en) * 2024-04-17 2024-05-24 厦门大学 Multi-view 3D target detection method based on sampling self-adaption continuous NeRF



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination