CN115272814B - Long-distance space self-adaptive multi-scale small target detection method

Info

Publication number: CN115272814B (granted); application number CN202211188231.0A
Authority: CN (China)
Other languages: Chinese (zh)
Other versions: CN115272814A (application publication)
Inventors: 甘胜丰, 胡磊, 刘世超, 李露, 闵高, 雷维新, 张仁, 周蓓, 徐朝玉
Assignees: Huazhong Agricultural University; Nanchang Institute of Technology
Application filed by Huazhong Agricultural University and Nanchang Institute of Technology
Priority to CN202211188231.0A
Publication of CN115272814A, application granted, publication of CN115272814B
Legal status: Active (granted)

Classifications

    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06N3/08 Learning methods (neural networks; computing arrangements based on biological models)
    • G06T7/11 Region-based segmentation
    • G06T7/136 Segmentation; Edge detection involving thresholding
    • G06T7/194 Segmentation; Edge detection involving foreground-background segmentation
    • G06T7/62 Analysis of geometric attributes of area, perimeter, diameter or volume
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/765 Recognition using classification, e.g. of video objects, using rules for classification or partitioning the feature space
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V10/82 Recognition using pattern recognition or machine learning, using neural networks
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06T2207/20221 Image fusion; Image merging
    • G06V2201/07 Target detection


Abstract

The invention discloses a remote space self-adaptive multi-scale small target detection method comprising two stages: a multi-scale target detection model determination stage and a multi-scale target detection model prediction stage. In the determination stage, the data sets of different target detection tasks are analyzed to obtain the multi-scale target detection model structure suited to each task type. In the prediction stage, the structure corresponding to a known target detection task type is called directly; when the detection task type is unknown, the multi-scale target detection model structure for the detection task is obtained through the Otsu (OTSU) algorithm and a decision tree, and prediction is completed. The beneficial effects of the invention are that various kinds of targets can be detected in real time and adaptively, the universality of target detection is improved, and the precision of target detection is guaranteed.

Description

Long-distance space self-adaptive multi-scale small target detection method
Technical Field
The invention relates to the field of image target detection, in particular to a remote space self-adaptive multi-scale small target detection method.
Background
The target detection is one of important tasks of computer vision, and under the drive of deep learning, a target detection model gradually becomes mature and stable, and is successfully applied to the fields of national defense safety, intelligent transportation, industrial automation and the like. At present, a general target detection model is optimized on a public data set, and the quality of the model is judged by using detection indexes of the public data set. However, in an actual application scenario, the difference between the scene data set and the common data set is large, and it is often necessary to make the model more efficient by adjusting the model.
For example, in the detection of large workpieces, the targets to be identified have the characteristics of large area and small quantity, and a very good detection effect can be obtained with a universal model.
For another example, aerial remote sensing images and unmanned aerial vehicle high-altitude images are generally shot from a height of hundreds of meters to nearly ten thousand meters, many targets in the images are small targets (dozens of or even several pixels), so that the target information amount is not large, the image view field is large (usually, the coverage range of several square kilometers) and the view field may contain various backgrounds, strong interference is generated on target detection, and the target is difficult to distinguish from the background or similar targets.
At present, common high-performing target detection models include YOLOX, YOLOv5, Faster R-CNN, CenterNet and the like. There is a significant gap between their detection performance on small targets and on large targets: small-target performance is usually only about half that on large targets, making these models difficult to apply to target detection in the remote sensing field.
Disclosure of Invention
Aiming at the technical problems of poor universality, low precision and low efficiency of small target detection in aerial remote sensing images, the invention provides a remote space self-adaptive multi-scale small target detection method, which adjusts the feature fusion structure and the multi-scale detection heads according to the characteristics of the data sets of different detection targets, greatly optimizing model efficiency.
The method comprises two stages, which are respectively:
a multi-scale target detection model determining stage and a multi-scale target detection model predicting stage;
the multi-scale target detection model determining stage comprises the following processes:
s1, constructing a multi-scale target detection model, wherein the multi-scale target detection model comprises three parts, namely a two-layer feature fusion structure, a three-layer feature fusion structure and a four-layer feature fusion structure; the two-layer feature fusion structure, the three-layer feature fusion structure and the four-layer feature fusion structure are trained in advance;
S2, acquiring the type of a target detection task and a corresponding training set, and labeling the targets to be detected in the training set with target bounding boxes, obtaining for each target the upper-left corner coordinate (x1, y1) and the lower-right corner coordinate (x2, y2);
S3, calculating the ratio of the area of the target bounding box to the area of the image: (x2 - x1)*(y2 - y1)/(W*H), where W and H are respectively the width and the height of the images in the training set;
S4: when the square root of the ratio of the target bounding box area to the image area is less than a preset first threshold a1, the target is a small target; when the square root of the ratio is greater than a preset second threshold a2, the target is a large target; when the square root of the ratio lies between a1 and a2, the target is a common target;
s5, determining a multi-scale target detection model structure by adopting a decision tree method, which specifically comprises the following steps:
calculating the proportions of large targets, small targets and common targets, respectively C1, C2 and C3, and judging the adaptive structure of the multi-scale target detection model according to these proportions and a set proportion threshold, specifically: when the proportion of small targets exceeds a preset percentage p of the whole data set, the multi-scale target detection model is adjusted to the four-layer feature fusion structure; when the proportion of large targets exceeds the preset percentage p, the multi-scale target detection model is adjusted to the two-layer feature fusion structure; otherwise, the multi-scale target detection model is adjusted to the three-layer feature fusion structure;
a multi-scale target detection model prediction stage:
s6: acquiring target data to be predicted;
s7: if the target data to be predicted belong to the target detection task type of the multi-scale target detection model determining stage, calling a target detection model structure correspondingly determined by the multi-scale target detection model determining stage to directly predict to obtain a target prediction result;
S8: if the target data to be predicted does not belong to a target detection task type handled in the multi-scale target detection model determination stage, the target data to be predicted is processed with the Otsu (OTSU) threshold segmentation method: the image is divided into background and foreground according to its gray-level characteristics, the type of each target to be predicted is determined from the ratio of the foreground target pixel count to the total image pixel count, the proportions of the various predicted target types are counted, the structure of the multi-scale target detection model is determined again according to the method in step S5, and the corresponding structure is called to complete target detection.
The beneficial effects provided by the invention are as follows:
For large target detection, the method switches to two-layer scale prediction, which greatly reduces the number of parameters and enables real-time detection on edge devices; a novel feature fusion structure is also provided for small target detection, which greatly improves detection precision at the cost of only a small amount of extra time, and therefore has very high application value in actual industrial scenes;
in addition, the method can be applied to the dynamic detection process, for example, along with the rise of the aerial photographing height of the unmanned aerial vehicle, the initial small target is a bicycle, and along with the increase of the aerial photographing height, the small target gradually changes into a house, namely, the small target in the method is a dynamic or relative concept;
finally, the method and the device can carry out various target detection in real time and in a self-adaptive mode, improve the universality of target detection and ensure the precision of target detection.
Drawings
FIG. 1 is a simple flow diagram of the process of the present invention;
FIG. 2 is a schematic diagram of a two-layer feature fusion architecture;
FIG. 3 is a schematic diagram of a four-layer feature fusion architecture;
FIG. 4 is a diagram of context hopping connection feature fusion;
FIG. 5 is a schematic structural diagram of an SSHF receptive field superposition module;
FIG. 6 is a schematic diagram of classifying a data set into one of the categories after it passes through the decision tree;
FIG. 7 is a schematic diagram of a decision result;
FIG. 8 is a diagram illustrating the effect of the Otsu (OTSU) threshold segmentation algorithm;
FIG. 9 is a schematic diagram of a detailed process of the method of the present invention;
fig. 10 is a schematic diagram of the small object detection effect obtained by using a conventional object detection three-layer network structure;
fig. 11 is a schematic diagram of the small target detection effect obtained by using the improved four-layer feature fusion structure of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be further described with reference to the accompanying drawings.
Referring to fig. 1, fig. 1 is a simplified flow chart of the method of the present invention.
The invention provides a remote space self-adaptive multi-scale small target detection method, which comprises the following two stages:
a multi-scale target detection model determining stage and a multi-scale target detection model predicting stage;
the multi-scale target detection model determining stage comprises the following processes:
s1, constructing a multi-scale target detection model, wherein the multi-scale target detection model comprises three parts, namely a two-layer feature fusion structure, a three-layer feature fusion structure and a four-layer feature fusion structure; the two-layer feature fusion structure, the three-layer feature fusion structure and the four-layer feature fusion structure are trained in advance;
it should be noted that the multi-scale target detection model in the present application includes 3 parts, which are respectively a two-layer feature fusion structure, a three-layer feature fusion structure, and a four-layer feature fusion structure. The three different feature fusion structures are respectively used for large target detection, common target detection and small target detection;
three different configurations are set forth in sequence below.
Referring to fig. 2, fig. 2 is a schematic diagram of a two-layer feature fusion structure;
the two-layer feature fusion structure comprises: the system comprises a backbone network, a CA attention mechanism module, a two-layer feature fusion module and a decoupling output module;
the method comprises the steps that an input image is subjected to downsampling feature extraction through a backbone network, and two downsampling feature layers with different scales from shallow to deep are obtained and are respectively a first feature layer and a second feature layer;
the first characteristic layer and the second characteristic layer respectively pass through a CA attention mechanism module to obtain a first enhancement characteristic and a second enhancement characteristic;
the first enhanced feature and the second enhanced feature are subjected to feature fusion through a feature fusion module to obtain a fusion feature;
the fusion characteristics are processed by a decoupling output module to obtain a large target detection result.
Specifically, the two-layer feature fusion module includes: a convolution unit, a transposed convolution unit, two Concat + CSPLayer structures and a down-sampling unit; the specific feature fusion process of the feature fusion module is as follows:
after the second enhanced feature passes sequentially through the convolution unit and the transposed convolution unit, the obtained first convolution result is fused with the first enhanced feature to obtain a first fusion result;
the first fusion result passes through a Concat + CSPLayer structure and is divided into two branches: one branch is decoupled directly by the decoupling output module to obtain first decoupling information; the other branch is down-sampled by the down-sampling unit to obtain a down-sampled feature of the first fusion result;
the down-sampled feature of the first fusion result is fused with a second convolution result, obtained after the second enhanced feature passes through a convolution unit, to obtain a second fusion result;
after the second fusion result passes through the other Concat + CSPLayer structure, it is processed directly by the decoupling output module to obtain second decoupling information;
and superposing the first decoupling information and the second decoupling information to obtain a large target detection result.
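As an illustration of this two-layer fusion path, the following PyTorch-style sketch shows the data flow; the channel widths and the plain Conv-BN-SiLU block standing in for the Concat + CSPLayer structure are assumptions made for brevity, not the exact modules of the patent.

```python
import torch
import torch.nn as nn

def conv_bn_act(c_in, c_out, k=1, s=1):
    # basic Conv-BN-SiLU block used throughout this sketch
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(),
    )

class TwoLayerFusion(nn.Module):
    """Sketch of the two-layer fusion path: feat1 is the 40x40 (16x downsampled)
    enhanced feature, feat2 the 20x20 (32x downsampled) enhanced feature."""
    def __init__(self, c1=512, c2=1024, c_mid=512):
        super().__init__()
        self.lateral = conv_bn_act(c2, c_mid, k=1)            # "convolution unit": channel reduction
        self.up = nn.ConvTranspose2d(c_mid, c_mid, 2, 2)       # "transposed convolution unit": 2x upsampling
        self.fuse1 = conv_bn_act(c_mid + c1, c_mid, k=3)       # stand-in for the first Concat + CSPLayer
        self.down = conv_bn_act(c_mid, c_mid, k=3, s=2)        # "down-sampling unit"
        self.fuse2 = conv_bn_act(c_mid + c_mid, c_mid, k=3)    # stand-in for the second Concat + CSPLayer

    def forward(self, feat1, feat2):
        lat = self.lateral(feat2)
        f1 = self.fuse1(torch.cat([self.up(lat), feat1], dim=1))   # first fusion result -> decoupled head 1
        f2 = self.fuse2(torch.cat([self.down(f1), lat], dim=1))    # second fusion result -> decoupled head 2
        return f1, f2
```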
When the large targets in a data set satisfy the decision-tree condition, the two-layer network structure is enabled. The highest layer in the common three-layer scale mainly serves the detection of smaller targets, and experiments show that the three-layer output contains a large amount of computational redundancy on large target detection tasks; using the two-layer scale structure removes this redundancy and thereby accelerates inference.
As an example of the two-layer feature fusion structure, CSPDarknet53 is used as the backbone network for feature extraction and the input image size is 640 × 640. In fig. 2, feat1 denotes the feature layer (first feature layer) obtained by 16× downsampling of the input image through the backbone network, with size 40 × 40, and feat2 denotes the feature layer (second feature layer) obtained by 32× downsampling, with size 20 × 20.
In fig. 2, the convolution unit is used for reducing the number of channels; the transposed convolution unit is used for expanding the width and height of the feature layer, performing the up-sampling operation by convolution, which facilitates the splicing of upper-layer and lower-layer features;
the Contact + CSPLAyer structure divides the original input into two branches, respectively carries out convolution operation to reduce the number of channels by half, then carries out a plurality of residual error structure operations on one branch, and then splices the two branches to enable the model to learn more characteristics;
the down-sampling unit is used for compressing the width and height of the characteristic layer by down-sampling operation;
the decoupling output module is used for outputting a decoupling head, and different convolution operations are used for respectively obtaining the output of the category, the confidence coefficient and the coordinate predicted value.
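A minimal sketch of such a decoupled output head is shown below; the single shared convolution and the branch depths are simplifications, since an actual decoupled head usually stacks several convolutions per branch.

```python
import torch.nn as nn

class DecoupledHead(nn.Module):
    """Separate convolution branches produce class scores, confidence and box coordinates."""
    def __init__(self, channels, num_classes):
        super().__init__()
        self.stem = nn.Conv2d(channels, channels, 1)
        self.cls_branch = nn.Conv2d(channels, num_classes, 1)  # category output
        self.obj_branch = nn.Conv2d(channels, 1, 1)            # confidence output
        self.reg_branch = nn.Conv2d(channels, 4, 1)            # coordinate prediction output

    def forward(self, x):
        x = self.stem(x)
        return self.cls_branch(x), self.obj_branch(x), self.reg_branch(x)
```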
A lightweight CA (coordinate attention) mechanism is added behind feat1 and feat2. The CA attention module aims to enhance the expressive power of the features learned by a mobile network; it can transform any intermediate feature tensor in the network and output a tensor of the same size.
The CA attention mechanism enables the lightweight network to pay attention in a larger area by embedding position information into channel attention, avoids generating a large amount of calculation overhead, can capture not only cross-channel information, but also information of direction perception and position perception, enables the model to pay more attention to the area of a target, reduces background interference, improves model precision, and does not change the width and height of a feature layer and the number of channels.
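The following is a simplified PyTorch-style sketch of a coordinate-attention (CA) style module with these properties; the reduction ratio, activation function and layer layout are assumptions for illustration rather than the exact module used in the patent.

```python
import torch
import torch.nn as nn

class CoordAttention(nn.Module):
    """Simplified coordinate-attention block: embeds H/W position information into
    channel attention without changing the feature layer's width, height or channels."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # average over width  -> (B, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # average over height -> (B, C, 1, W)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        x_h = self.pool_h(x)                              # (B, C, H, 1)
        x_w = self.pool_w(x).permute(0, 1, 3, 2)          # (B, C, W, 1)
        y = self.act(self.bn(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                         # attention along height
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))     # attention along width
        return x * a_h * a_w                              # same shape as the input tensor
```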
After the two feature layers feat1 and feat2 pass through the CA attention module, the feature fusion operation shown in fig. 2 is performed, and the two feature layers obtained after feature fusion are then input into the YoloHead decoupled output heads to obtain the prediction information.
The three-layer feature fusion structure adopts a PAFPN structure in a yolo network series;
in the present application, a three-layer feature fusion structure is used to detect objects of common size; as an example, the three-layer feature fusion structure in the present application is a PAFPN structure in yolo network series, which is a general structure of yolo series, and therefore, it is not described in too much detail here, and those skilled in the art can select other types of general structures according to the actual situation.
Referring to fig. 3, fig. 3 is a schematic diagram of a four-layer feature fusion structure;
the four-layer feature fusion structure comprises: the system comprises a backbone network, a CA attention mechanism module, a four-layer feature fusion module, an SSHF receptive field superposition module and a decoupling output module;
the method comprises the steps that an input image is subjected to downsampling feature extraction through a backbone network, and four downsampling feature layers with different scales from shallow to deep are obtained and are respectively a first feature layer, a second feature layer, a third feature layer and a fourth feature layer;
the first characteristic layer and the second characteristic layer respectively pass through a CA attention mechanism module to obtain a first enhancement characteristic and a second enhancement characteristic;
after feature fusion is carried out on the first enhanced feature, the second enhanced feature, the third feature layer and the fourth feature layer through a four-layer feature fusion module, a fusion feature is obtained;
the fusion characteristics are processed by an SSHF receptive field superposition module and a decoupling output module in sequence to obtain a small target detection result.
Specifically, the four-layer feature fusion module includes: three convolution units, three transposed convolution units, six Concat + CSPLayer structures and three down-sampling units; the four-layer feature fusion module adopts a context jump connection mechanism, and the specific fusion process is as follows:
the fourth feature layer passes through a first convolution unit to obtain a first convolution result; the first convolution result passes through a first transposed convolution unit to obtain a first transposed convolution result; the first transposed convolution result and the third feature layer pass through a first Concat + CSPLayer structure to obtain a first fusion result;
the first fusion result is divided into two branches: in the first branch, the first fusion result passes through a second convolution unit to obtain a second convolution result; in the second branch, the first fusion result and the first convolution result are input together into a third Concat + CSPLayer structure;
the second convolution result passes through a second transposed convolution unit to obtain a second transposed convolution result; the second transposed convolution result, the second enhanced feature and the first convolution result pass through a second Concat + CSPLayer structure to obtain a second fusion result;
the second fusion result is divided into two branches: in the first branch, the second fusion result passes sequentially through a third convolution unit and a third transposed convolution unit to obtain a third transposed convolution result; in the other branch, the second fusion result is input into a fourth Concat + CSPLayer structure;
the first enhanced feature, the third transposed convolution result, the first fusion result and the first convolution result pass through the third Concat + CSPLayer structure to obtain a third fusion result;
the third fusion result is divided into two branches: in the first branch, the third fusion result passes sequentially through the SSHF receptive field superposition module and a first decoupling output module to obtain first decoupling information; in the other branch, the third fusion result undergoes a first down-sampling operation through a first down-sampling unit, and the resulting first down-sampling result is input into the fourth Concat + CSPLayer structure;
the second fusion result and the first down-sampling result pass through the fourth Concat + CSPLayer structure to obtain a fourth fusion result; the fourth fusion result is divided into two branches: in the first branch, the fourth fusion result passes sequentially through the SSHF receptive field superposition module and a second decoupling output module to obtain second decoupling information; in the other branch, the fourth fusion result undergoes a second down-sampling operation through a second down-sampling unit, and the resulting second down-sampling result is input into a fifth Concat + CSPLayer structure;
the second convolution result and the second down-sampling result pass through the fifth Concat + CSPLayer structure to obtain a fifth fusion result; the fifth fusion result is divided into two branches: in the first branch, the fifth fusion result passes through a third decoupling output module to obtain third decoupling information; in the other branch, the fifth fusion result undergoes a third down-sampling operation through a third down-sampling unit, and the resulting third down-sampling result is input into a sixth Concat + CSPLayer structure;
the first convolution result and the third down-sampling result pass through the sixth Concat + CSPLayer structure to obtain a sixth fusion result; the sixth fusion result passes through a fourth decoupling output module to obtain fourth decoupling information;
and superposing the first decoupling information, the second decoupling information, the third decoupling information and the fourth decoupling information to obtain a small target detection result.
As an embodiment of the four-layer feature fusion structure, the four-layer feature fusion structure capable of being used universally is specially designed for small target detection tasks, and the test effect on a data set of high-altitude images of the unmanned aerial vehicle is superior to that of an original structure.
CSPDarknet53 is used for backbone feature extraction. During backbone feature extraction an additional feature layer with a smaller downsampling multiple is extracted; compared with the other three feature layers its spatial information is richer, making it suitable for detecting tiny targets. The four feature layers are denoted, from shallow to deep, feat1 (first feature layer), feat2 (second feature layer), feat3 (third feature layer) and feat4 (fourth feature layer). That is, feat1-feat4 are the four feature layers of different scales obtained by feature extraction through the backbone network CSPDarknet53;
for the two shallow feature layers feat1 and feat2, a CA attention module is added to reduce background interference.
In fig. 3, the convolution unit is a general convolution module for reducing the number of channels; the transposition convolution unit is used for expanding the width and the height of the characteristic layer, and the up-sampling operation is carried out by utilizing a convolution mode, so that the splicing of the characteristics of the upper layer and the lower layer is facilitated;
the Concat + CSPLayer structure divides the original input into two branches, performs a convolution operation on each to halve the number of channels, then applies several residual structures to one of the branches, and finally splices the two branches so that the model can learn more features;
the down-sampling unit is used for compressing the width and height of the characteristic layer and performing down-sampling operation;
the SSHF receptive field superposition module is a characteristic strengthening module and is used for fusing the characteristics of receptive fields with different sizes;
the decoupling output module is used for outputting a decoupling head, and different convolution operations are used for respectively obtaining the output of the category, the confidence coefficient and the coordinate predicted value.
It should be noted that, in order to efficiently improve the precision of the tiny target and the small target, the present application focuses on the improvement of feat1 and feat2 in the feature fusion part.
Because the semantic information of the shallow feature layer is weak, the feature information of the shallow feature layer is extended by adopting a cross-layer fusion mode of context information, please refer to fig. 4, where fig. 4 is a schematic diagram of context jump connection feature fusion. Wherein the emphasized part in fig. 4 is the jumper connection of the right half. The left half of fig. 4 is divided into 4 feature layers with different scales, wherein feat1-feat2 are not connected to the right Concat + CSPlayer via the CA attention mechanism module, which does not mean that fig. 4 is inconsistent with fig. 3, and fig. 4 is only a schematic illustration, so the CA attention mechanism module is omitted.
The context jump join feature fusion is specifically described as follows:
firstly, transpose convolution is used for replacing common up-sampling operation, so that the feature loss of artificial feature engineering to small targets is reduced, besides layer-by-layer fusion of similar feature pyramids, feat4 is subjected to up-sampling for three times and then fused with feat1, feat3 is subjected to up-sampling for two times and then fused with feat2, and therefore spatial information and semantic information of feat1 and feat2 are rich. The multi-scale features of different network depths are extracted in the training process by using the up-sampling and jumping connection, the high-resolution images are processed by using the shallow network, and the low-resolution images are processed by using the deep network, so that more semantic information is extracted while the position information of the small target is kept as much as possible, and the detection performance of the small target is improved under the condition of reducing the calculation cost.
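A compact sketch of these context jump connections is given below (channel widths are assumptions; in the full structure of fig. 3 the concatenated results feed the Concat + CSPLayer units rather than being returned directly):

```python
import torch
import torch.nn as nn

def upsample2x(c):
    # transposed convolution used instead of plain interpolation (2x per step)
    return nn.ConvTranspose2d(c, c, kernel_size=2, stride=2)

class ContextSkipFusion(nn.Module):
    """Sketch of the context jump (skip) connections: deep layers are upsampled
    several times and concatenated directly with the shallow layers."""
    def __init__(self, c1=128, c2=256, c3=512, c4=1024):
        super().__init__()
        self.up4_to_1 = nn.Sequential(upsample2x(c4), upsample2x(c4), upsample2x(c4))  # 3 upsamplings: feat4 -> feat1 scale
        self.up3_to_2 = nn.Sequential(upsample2x(c3), upsample2x(c3))                   # 2 upsamplings: feat3 -> feat2 scale

    def forward(self, feat1, feat2, feat3, feat4):
        skip1 = torch.cat([feat1, self.up4_to_1(feat4)], dim=1)  # enrich feat1 with deep semantics
        skip2 = torch.cat([feat2, self.up3_to_2(feat3)], dim=1)  # enrich feat2 with deep semantics
        return skip1, skip2
```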
In addition, it should be noted that, in order to further enhance the detection performance of the small target by performing feature enhancement on the shallow feature layer, an SSHF receptive field superposition module is added in the present application.
The module uses convolution kernel serial structures of different sizes to stack, strengthens the receptive field on the characteristic layer, and makes the characteristic information of two shallow characteristic layers richer.
Please refer to fig. 5, fig. 5 is a schematic structural diagram of the SSHF receptive field stacking module.
The traditional SSHF receptive-field superposition module uses convolution kernels of 3x3, 5x5 and 7x7 to extract features. This kind of context modeling enlarges the receptive field of the corresponding layer in proportion to the stride of that layer, which in turn is proportional to the target scale handled by each detection module.
In the application, in order to reduce the number of model parameters and improve the inference speed, compared with the conventional SSHF receptive field superposition module, the application adopts a plurality of 3x3 convolution kernels stacked in series to replace 5x5 and 7x7 convolution kernels.
After feature fusion is completed, four feature layers of different sizes are obtained. Each fused feature layer feat is input into the SSHF receptive-field superposition module, and the feature layers extracted with different receptive fields are spliced (concat) to obtain a feature-enhanced output feature layer. The output feature layer then passes through the decoupled head of the decoupling output module to output the target prediction result.
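A sketch of such a receptive-field superposition block, using serially stacked 3x3 convolutions in place of 5x5 and 7x7 kernels, could look as follows; the branch layout and channel split are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SSHFBlock(nn.Module):
    """Sketch of the SSHF receptive-field superposition module: serially stacked 3x3
    convolutions emulate 5x5 and 7x7 receptive fields, and the branches are concatenated."""
    def __init__(self, channels):
        super().__init__()
        c = channels // 2
        self.branch3 = nn.Conv2d(channels, c, 3, padding=1)           # ~3x3 receptive field
        self.branch5 = nn.Sequential(                                  # two 3x3 convs ~ 5x5
            nn.Conv2d(channels, c, 3, padding=1), nn.Conv2d(c, c, 3, padding=1))
        self.branch7 = nn.Sequential(                                  # three 3x3 convs ~ 7x7
            nn.Conv2d(channels, c, 3, padding=1), nn.Conv2d(c, c, 3, padding=1),
            nn.Conv2d(c, c, 3, padding=1))
        self.fuse = nn.Conv2d(3 * c, channels, 1)                      # restore channel count after concat

    def forward(self, feat):
        out = torch.cat([self.branch3(feat), self.branch5(feat), self.branch7(feat)], dim=1)
        return self.fuse(out)   # feature-enhanced layer passed on to the decoupled head
```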
The above part explains three different structures and structural principles of the multi-scale target detection model of the present application; three different structures can be trained in advance and used;
the following further introduces an analysis process of the data set, and the purpose of analyzing the data set is to obtain which structure of the multi-scale target detection model the data set of the type is suitable for, so that the structure can be directly adopted for prediction in a prediction stage for the data of the same type;
S2, acquiring the type of a target detection task and a corresponding training set, and labeling the targets to be detected in the training set with target bounding boxes, obtaining for each target the upper-left corner coordinate (x1, y1) and the lower-right corner coordinate (x2, y2);
Regarding the target detection task type, the present application is explained as follows:
firstly, a high-altitude unmanned aerial vehicle is used for shooting a target image in an actual scene, and different detection targets under the visual angle of the unmanned aerial vehicle are respectively collected for distinguishing detection tasks.
When urban greening construction is set as a task (namely, the type of the target detection task is urban greening construction), target images such as trees, green lands and the like need to be acquired at high altitude through an unmanned aerial vehicle;
when an urban traffic intelligent planning task is used (namely, the type of the target detection task is urban traffic intelligent planning), a high-altitude unmanned aerial vehicle is required to be used for acquiring road conditions of various roads, and vehicles and pedestrians are used as detection targets.
Aiming at different target detection task types, the structures of the multi-scale target detection models adopted by the method are different, and the method relates to the process of analyzing a data set, namely steps S2-S5;
through the steps S2-S5, the structure of what multi-scale target detection model is adopted by the urban greening construction task can be determined, or the structure of what multi-scale target detection model is adopted by the urban traffic intelligent planning task can be determined; in the subsequent prediction stage, if the prediction task to be carried out is known in advance or is a task established by urban greening or an urban traffic intelligent planning task, the corresponding multi-scale target detection model structure is directly called.
In different detection task types, the method and the device distinguish according to different ratios of the area of the target boundary frame to the area of the image, analyze the corresponding data set, and accordingly divide the detection tasks into a large target detection task, a common target detection task and a small target detection task. The specific classification process is shown in the following steps S3-S5;
in addition, the image is preprocessed before the image is labeled. In an actual detection scene, illumination change is a very common factor which can affect the identification performance, so that data enhancement operation of brightness and contrast change needs to be performed on collected image data, the diversity of the image data is enhanced, the actual environment of a target is simulated, and the robustness of a model is improved.
The means employed for image enhancement in this application are as follows:
Generally, the contrast and brightness of an image are changed pixel by pixel. Let the input original image be f(x) and the image after contrast and brightness adjustment be g(x); the adjustment formula is g(x) = α·f(x) + β, where α adjusts the contrast and β adjusts the brightness. The brightness and contrast of the acquired images are changed by adjusting the values of α and β, completing the image enhancement operation according to the actual situation.
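A minimal sketch of this pixel-wise adjustment, assuming 8-bit images and clipping to the valid range:

```python
import numpy as np

def adjust_brightness_contrast(image, alpha=1.2, beta=10):
    """g(x) = alpha * f(x) + beta, applied pixel by pixel.
    alpha adjusts the contrast, beta adjusts the brightness."""
    g = alpha * image.astype(np.float32) + beta
    return np.clip(g, 0, 255).astype(np.uint8)

# example: a darker, lower-contrast copy to simulate a change in illumination
# augmented = adjust_brightness_contrast(original, alpha=0.8, beta=-15)
```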
S3, calculating the ratio of the area of the target bounding box to the area of the image: (x2 - x1)*(y2 - y1)/(W*H), where W and H are respectively the width and the height of the images in the training set;
S4: when the square root of the ratio of the target bounding box area to the image area is less than a preset first threshold a1, the target is a small target; when the square root of the ratio is greater than a preset second threshold a2, the target is a large target; when the square root of the ratio lies between a1 and a2, the target is a common target. In the embodiment of the present application, a1 is 0.03 and a2 is 0.2; of course, in some other embodiments, the thresholds on the ratio of the target bounding box area to the image area may be set according to the actual situation or the particular detection task, or may be set and adjusted automatically by an adaptive algorithm. This is only schematically illustrated in the present application.
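A small sketch of this classification rule, using the thresholds of this embodiment (a1 = 0.03, a2 = 0.2):

```python
import math

A1, A2 = 0.03, 0.2   # thresholds of this embodiment; adjustable per detection task

def classify_target(x1, y1, x2, y2, W, H):
    """Classify one annotated bounding box by the square root of the ratio
    of box area to image area."""
    r = math.sqrt((x2 - x1) * (y2 - y1) / (W * H))
    if r < A1:
        return "small"
    if r > A2:
        return "large"
    return "common"
```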
It should be noted that, in the steps S3 to S4, the number of various types of targets in the data set is determined, but it is not determined which target detection model the data set is suitable for; determining which detection model is set forth in step S5;
s5, determining a multi-scale target detection model structure by adopting a decision tree method, which specifically comprises the following steps:
calculating the proportions of large targets, small targets and common targets, respectively C1, C2 and C3, and judging the adaptive structure of the multi-scale target detection model according to these proportions and a set proportion threshold, specifically: when the proportion of small targets exceeds a preset percentage p of the whole data set, the multi-scale target detection model is adjusted to the four-layer feature fusion structure; when the proportion of large targets exceeds the preset percentage p, the multi-scale target detection model is adjusted to the two-layer feature fusion structure; otherwise it is adjusted to the three-layer feature fusion structure. In the embodiment of the present application, p is taken as 33.33%;
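The resulting decision rule can be sketched as follows (the common-target proportion is carried only for completeness, since the rule is driven by the small- and large-target proportions):

```python
def select_model_structure(c_small, c_large, c_common, p=1/3):
    """Choose a feature-fusion structure from the class proportions
    (small, large, common) and the threshold p."""
    if c_small > p:
        return "four-layer feature fusion"   # small-target dominated data set
    if c_large > p:
        return "two-layer feature fusion"    # large-target dominated data set
    return "three-layer feature fusion"      # default PAFPN-style structure

# example: 45% small, 20% large, 35% common targets -> four-layer structure
print(select_model_structure(c_small=0.45, c_large=0.20, c_common=0.35))
```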
referring to FIGS. 6-7, FIG. 6 is a schematic diagram of a data set that should be categorized into one type after being subjected to a decision tree;
Simply put, the inputs of the decision tree algorithm are the three target proportions of a detection task: the small target proportion, the common target proportion and the large target proportion;
the output of the decision tree algorithm is one of the three multi-scale target detection model structures: the two-layer structure model, the three-layer structure model or the four-layer structure model.
Regarding the decision tree, the decision tree is a basic classification and regression method, and is in a tree structure, and is composed of a root node, non-leaf nodes, branches and leaf nodes, wherein the root node represents a first selection point, the non-leaf nodes represent tests on a feature attribute, each branch represents the output of the feature attribute on a value range, and each leaf node stores a category to represent the final decision result.
Firstly, a decision tree is built for a past detection task data set, and a decision tree model is built according to a given training data set, so that the decision tree model can correctly classify examples; the ratio of the number of various targets of the detection task in the past period is a characteristic value, and the number of model layers is a label value.
In the present application, the decision tree algorithm uses ID3 to learn the decision tree, i.e. using information gain as a judgment condition.
Information gain = information entropy - conditional entropy
The entropy is usually used to describe the average value of the information amount brought by the whole random distribution, and has more statistical properties. The specific learning process is as follows:
set of hypothetical samplesDTo middleiThe proportion of the class sample isp i (i =1,2, \ 8943;, N), whereinNIs the total number of sample classes, then the sample setDThe information entropy of (a) is:
Figure 492463DEST_PATH_IMAGE004
assume that samples are assembledDBy attributeaPartition is made, assuming attributesaIs provided withvThe possible values are thenaWith split-out attributesvA subset (i.e. in a tree)vA branch) thereinVRepresenting the total number of subsets (branches), each possible set of values beingD v (|D v L and LD| represents the number of elements in the set), thenaThe method for calculating the conditional entropy of the attribute comprises the following steps:
Figure 95483DEST_PATH_IMAGE006
the information gain expression is then:
Figure 4533DEST_PATH_IMAGE008
and selecting the attribute with the maximum information gain as a classification attribute by using an information entropy principle, recursively expanding branches of the decision tree, and completing the construction of the decision tree.
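A small sketch of these quantities, computing the entropy, conditional entropy and information gain for one candidate attribute from lists of labels and attribute values:

```python
import math
from collections import Counter

def entropy(labels):
    """Ent(D) = -sum_i p_i * log2(p_i), over the class proportions in D."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """Gain(D, a) = Ent(D) - sum_v (|D^v| / |D|) * Ent(D^v),
    where D^v groups the samples taking the v-th value of attribute a."""
    n = len(labels)
    subsets = {}
    for value, label in zip(attribute_values, labels):
        subsets.setdefault(value, []).append(label)
    conditional = sum(len(sub) / n * entropy(sub) for sub in subsets.values())
    return entropy(labels) - conditional
```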
FIG. 7 is a schematic diagram of a decision result; generally speaking, the target number ratio is used as a root node, the hierarchical structure is used as a leaf node, a decision is made according to the target number ratio, and the decision result is as follows:
When the proportion of small targets exceeds 1/3 of the whole data set, the model is adjusted to the four-layer feature fusion structure; when the proportion of large targets exceeds 1/3, it is adjusted to the two-layer feature fusion structure; otherwise it is adjusted to the three-layer feature fusion structure. Of course, in some other embodiments the proportion threshold may be adjusted, and it is not intended to be limiting.
A multi-scale target detection model prediction stage:
s6: acquiring target data to be predicted;
s7: if the target data to be predicted belong to the target detection task type of the multi-scale target detection model determining stage, calling a target detection model structure correspondingly determined by the multi-scale target detection model determining stage to directly predict to obtain a target prediction result;
s8: if the target data to be predicted does not belong to the target detection task type in the multi-scale target detection model determination stage, the target data to be predicted is processed by adopting an Otsu threshold segmentation method, the image is divided into a background part and a foreground part according to the gray characteristic of the image, the type of the target to be predicted is determined according to the ratio of the foreground target pixel value to the whole image pixel value, the occupation ratio of various predicted targets is counted, the structure of the multi-scale target detection model is determined again according to the method in the step S5, and the corresponding structure is called to complete target detection.
Firstly, as an embodiment, for example, for field type detection in an unmanned aerial vehicle image, the task type is a task type corresponding to a multi-scale target detection model determining stage, and in the determining stage, the task for field type detection is determined and classified as a large target detection task, so that in a predicting stage, a two-layer feature fusion structure is directly called for prediction. The experimental results are shown in table 1 below.
TABLE 1 field test results
For a known target detection task type, for example the field type detection task described above, the decision tree classifies the task as a large target detection task and decides to use the two-layer model structure for its training and prediction. During training, after the feature layers are extracted by the backbone network, the data enter the feature fusion part; two-layer feature fusion is performed on the two feature layers extracted by the backbone network, which are then decoupled and output to obtain the predicted value pre. The two-layer model structure has 2000 prediction boxes in total: 400 boxes (20 × 20) with a corresponding anchor box size of 32 × 32 and 1600 boxes (40 × 40) with a corresponding anchor box size of 16 × 16. With the information of the 2000 prediction boxes and the labeled target boxes of each picture, the loss between the predicted value pre and the label value target can be calculated, giving the classification loss cls_loss and the regression loss reg_loss, with total loss Loss = cls_loss + reg_loss. This is fed into the optimizer for back-propagation and the model parameters are updated, completing the training process.
When predicting targets of this task, the weights of the two-layer model structure are loaded and the model is used for inference to obtain the target information.
Similarly, the three-layer network model structure has 8400 prediction boxes in total: the top branch has 400 boxes (20 × 20) with a corresponding anchor box size of 32 × 32, the middle branch has 1600 boxes (40 × 40) with anchor boxes of 16 × 16, and the lowest branch has 6400 boxes (80 × 80) with an anchor box size of 8 × 8;
the four-layer network model structure has 34000 prediction frames, wherein the first layer has 25600 frames (160 x 160), the corresponding anchor frame size is 4 x 4, and the other 8400 prediction frames have the same structure as the three-layer model structure.
For an unknown target detection task type, the input image to be detected is pre-processed and analyzed with the Otsu threshold segmentation algorithm, and the image is divided into a background part and a foreground part according to its gray-level characteristics; referring to fig. 8, fig. 8 is a schematic diagram of the effect of the Otsu threshold segmentation algorithm. In fig. 8 the preliminary foreground and background are distinguished; whether a foreground target is a large target or a small target is judged from the ratio of the foreground target pixel count to the image pixel count, the number proportions of the various targets are then counted, yielding the same parameters as in the data set analysis, these parameters are input into the decision tree for a decision, and the image is inferred with the model structure output by the decision tree.
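A minimal sketch of this pre-processing step using OpenCV's Otsu thresholding; treating the smaller of the two segments as the foreground is an assumption made for illustration:

```python
import cv2
import numpy as np

def foreground_ratio(image_path):
    """Split an image into foreground/background with Otsu thresholding and
    return the fraction of foreground pixels, used to judge target size."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # assume the foreground (targets) is the smaller of the two segments
    fg = min(np.count_nonzero(mask), np.count_nonzero(mask == 0))
    return fg / mask.size
```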
Referring to fig. 9 for a summary of the whole process, fig. 9 is a schematic diagram illustrating the detailed process of the method of the present invention;
For example, for the existing task types A, B, ..., X, each type has a corresponding data set, and the target detection model structures corresponding to the A, B, ..., X tasks are obtained through the foregoing steps S2 to S5 of the present application;
In the prediction stage, if the task type to be predicted is known to be one of A, B, ..., X, the correspondingly determined target detection model structure is called directly for prediction. If the type of the target detection task is unknown in the prediction stage, the proportions of large, medium and small targets in the data to be predicted are obtained with the Otsu threshold segmentation algorithm, and the target detection model structure used for prediction is then obtained through decision tree analysis.
Finally, please refer to fig. 10 and fig. 11 as an example; FIG. 10 is a schematic diagram of the small target detection effect obtained using a conventional target detection three-layer network architecture; FIG. 11 is a schematic diagram of the small target detection effect obtained by using the improved four-layer feature fusion structure of the present application; the method has the advantages that the prediction is carried out in the image shot by the unmanned aerial vehicle at the same height, when the target in the image is compact and the area is small, more targets can be detected by using the four-layer model structure than the original general detection model, and the accuracy is higher.
In general, the key technical points of the application are as follows:
1. adaptive multiscale structure adjustment method (adaptive target detection task in this application)
Current target detection models are general-purpose models optimized on public standard data sets, and they are well suited to the various everyday target detection tasks.
However, in practical engineering application, a target data set to be detected often has corresponding characteristics, and the efficiency of using a general model is often limited, so that the model efficiency can be greatly improved by designing a corresponding structure according to the target characteristics of the data set.
Before a data set enters the model, the data annotation information is read and the bounding box area of every target is calculated. Large, medium and small targets are distinguished by the ratio of the target bounding box area to the image area: usually, a target whose square root of this ratio is smaller than 0.03 is defined as a small target, and one whose square root is larger than 0.2 is defined as a large target. The numbers of all small and large targets are statistically analyzed, a corresponding decision tree structure is designed, and the scale structure of the model is decided and adjusted automatically.
2. Micro target detection model (four-layer feature fusion structure in the application)
In deep learning, the shallow layers of a convolutional neural network have a small receptive field, higher spatial resolution and accurate target positions, which makes them suitable for detecting small targets, but their semantic representation ability is weak and the recall rate is low. The deep layers, with a large receptive field, extract increasingly rich semantic information, but because small targets occupy few pixels, the feature map shrinks after several downsampling steps and the effective regions of small targets can no longer be distinguished. Therefore a feature map fusion mechanism focused on small target detection is designed, fusing high-resolution shallow feature maps with semantically rich deep feature maps to improve small target detection precision.
3. Cross-layer fusion method for context information of four-layer feature layer
When the backbone network features are extracted, an additional feature layer with a smaller downsampling multiple is extracted; compared with the other three feature layers its spatial information is richer, making it suitable for detecting tiny targets. The four feature layers are denoted, from shallow to deep, feat1, feat2, feat3 and feat4. feat4 is up-sampled three times and then fused with feat1, and feat3 is up-sampled twice and then fused with feat2, so that the spatial information and semantic information of feat1 and feat2 are both rich.
4. SSHF receptive field superposition mechanism
To reduce the number of model parameters and increase inference speed, a series stack of several 3×3 convolution kernels is used instead of 5×5 and 7×7 convolution kernels, as illustrated in the sketch below.
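The following sketch shows only the generic stacked-3×3 pattern behind this idea, not the actual SSHF module wiring; the channel width and the BatchNorm/SiLU choices are assumptions.

```python
import torch.nn as nn

def conv3x3(channels):
    # One 3x3 convolution followed by BatchNorm and SiLU activation.
    return nn.Sequential(
        nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(channels),
        nn.SiLU(inplace=True),
    )

class StackedReceptiveField(nn.Module):
    """Series of n 3x3 convolutions, giving a (2n + 1) x (2n + 1) receptive field."""
    def __init__(self, channels, n=3):
        super().__init__()
        self.blocks = nn.Sequential(*[conv3x3(channels) for _ in range(n)])

    def forward(self, x):
        return self.blocks(x)

# Parameter comparison for 256 channels (ignoring BatchNorm):
# one 7x7 convolution:   256 * 256 * 49 ≈ 3.2M weights
# three 3x3 convolutions: 3 * 256 * 256 * 9 ≈ 1.8M weights
```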
The beneficial effects of the invention are: the method can perform various kinds of target detection in real time and in a self-adaptive manner, which improves the universality of target detection while ensuring detection accuracy.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (6)

1. A remote space self-adaptive multi-scale small target detection method, characterized in that the method comprises two stages, respectively:
a multi-scale target detection model determining stage and a multi-scale target detection model predicting stage;
the multi-scale target detection model determining stage comprises the following processes:
S1, constructing a multi-scale target detection model, wherein the multi-scale target detection model comprises three parts, namely a two-layer feature fusion structure, a three-layer feature fusion structure and a four-layer feature fusion structure; the two-layer feature fusion structure, the three-layer feature fusion structure and the four-layer feature fusion structure are trained in advance;
S2, acquiring the type of a target detection task and a corresponding training set, and labeling the target to be detected in the training set with a target bounding box to obtain the target information: upper-left corner coordinates (x1, y1) and lower-right corner coordinates (x2, y2);
S3, calculating the ratio of the area of the target bounding box to the area of the image: (x2 - x1) * (y2 - y1) / (W * H), where W and H are respectively the width and the height of the images in the training set;
S4: when the square root of the ratio of the target bounding box area to the image area is less than a preset first threshold a1, the target is a small target; when the square root of the ratio of the target bounding box area to the image area is greater than a preset second threshold a2, the target is a large target; when the square root of the ratio of the target bounding box area to the image area lies between a1 and a2, the target is a common target;
S5, determining a multi-scale target detection model structure by adopting a decision tree method, which specifically comprises the following steps:
calculating the proportions C1, C2 and C3 of large targets, small targets and common targets respectively, and judging the self-adaptive structure of the multi-scale target detection model according to the proportion of each kind of target and a set proportion threshold, which specifically comprises the following steps: when the proportion of small targets in the whole data exceeds a preset percentage p, the multi-scale target detection model is adjusted to the four-layer feature fusion structure; when the proportion of large targets exceeds the preset percentage p, the multi-scale target detection model is adjusted to the two-layer feature fusion structure; otherwise, the multi-scale target detection model is adjusted to the three-layer feature fusion structure;
a multi-scale target detection model prediction stage:
S6: acquiring target data to be predicted;
S7: if the target data to be predicted belongs to the target detection task type of the multi-scale target detection model determining stage, calling the target detection model structure determined in the multi-scale target detection model determining stage to predict directly and obtain a target prediction result;
S8: if the target data to be predicted does not belong to the target detection task type of the multi-scale target detection model determining stage, processing the target data to be predicted with the Otsu threshold segmentation method: the image is divided into a background part and a foreground part according to its gray-scale characteristics, the type of each target to be predicted is determined from the ratio of foreground target pixels to the total pixels of the image, the proportions of the various predicted target types are counted, the structure of the multi-scale target detection model is determined again according to the method of step S5, and the corresponding structure is called to complete target detection.
2. A method for remote spatially adaptive multi-scale small object detection as claimed in claim 1, wherein: the two-layer feature fusion structure comprises: a backbone network, a CA attention mechanism module, a two-layer feature fusion module and a decoupling output module;
the three-layer feature fusion structure adopts the PAFPN structure of the YOLO network series;
the four-layer feature fusion structure comprises: a backbone network, a CA attention mechanism module, a four-layer feature fusion module, an SSHF receptive field superposition module and a decoupling output module; the SSHF receptive field superposition module is formed by three 3×3 convolution kernels connected in sequence.
3. A method for remote spatially adaptive multi-scale small object detection as claimed in claim 2, wherein: the two-layer feature fusion structure is as follows:
the method comprises the steps that an input image is subjected to downsampling feature extraction through a backbone network, and two downsampling feature layers with different scales from shallow to deep are obtained and are respectively a first feature layer and a second feature layer;
the first characteristic layer and the second characteristic layer respectively pass through a CA attention mechanism module to obtain a first enhancement characteristic and a second enhancement characteristic;
the first enhanced feature and the second enhanced feature are subjected to feature fusion through a feature fusion module to obtain a fusion feature;
the fusion characteristics are processed by a decoupling output module to obtain a large target detection result.
4. A method for remote spatially adaptive multi-scale small object detection as claimed in claim 3, wherein: the two-layer feature fusion module comprises: a convolution unit, a transposed convolution unit, two Concat + CSPLayer structures and a downsampling unit; the specific process of feature fusion by the feature fusion module is as follows:
after the second enhanced feature passes through the convolution unit and the transposed convolution unit in sequence, the obtained first convolution result is fused with the first enhanced feature to obtain a first fusion result;
the first fusion result passes through one Concat + CSPLayer structure and is divided into two branches: one branch is decoupled directly by the decoupling output module to obtain first decoupling information; the other branch is downsampled by the downsampling unit to obtain a downsampling feature of the first fusion result;
the downsampling feature of the first fusion result and a second convolution result, obtained after the second enhanced feature passes through a convolution unit, are fused to obtain a second fusion result;
after passing through the other Concat + CSPLayer structure, the second fusion result is processed directly by the decoupling output module to obtain second decoupling information;
and superposing the first decoupling information and the second decoupling information to obtain a large target detection result.
5. A method for remote spatially adaptive multi-scale small object detection as claimed in claim 2, wherein: the four-layer feature fusion structure is as follows:
the method comprises the steps that an input image is subjected to downsampling feature extraction through a backbone network, and four downsampling feature layers with different scales from shallow to deep are obtained and are respectively a first feature layer, a second feature layer, a third feature layer and a fourth feature layer;
the first characteristic layer and the second characteristic layer respectively pass through a CA attention mechanism module to obtain a first enhancement characteristic and a second enhancement characteristic;
after feature fusion is carried out on the first enhanced feature, the second enhanced feature, the third feature layer and the fourth feature layer through a four-layer feature fusion module, a fusion feature is obtained;
the fusion characteristics are processed by an SSHF receptive field superposition module and a decoupling output module in sequence to obtain a small target detection result.
6. A method for remote spatially adaptive multi-scale small object detection as claimed in claim 5, wherein: the four-layer feature fusion module comprises: three convolution units, three transposed convolution units, six Concat + CSPLayer structures and three downsampling units; the four-layer feature fusion module adopts a context jump connection mechanism, and the specific fusion process is as follows:
the fourth feature layer passes through a first convolution unit to obtain a first convolution result; the first convolution result passes through a first transposed convolution unit to obtain a first transposed convolution result; the first transposed convolution result and the third feature layer pass through a first Concat + CSPLayer structure to obtain a first fusion result;
the first fusion result is divided into two branches: in the first branch, the first fusion result passes through a second convolution unit to obtain a second convolution result; in the second branch, the first fusion result and the first convolution result are input together into a third Concat + CSPLayer structure;
the second convolution result passes through a second transposed convolution unit to obtain a second transposed convolution result; the second transposed convolution result, the second enhanced feature and the first convolution result pass through a second Concat + CSPLayer structure to obtain a second fusion result;
the second fusion result is divided into two branches: in the first branch, the second fusion result passes through a third convolution unit and a third transposed convolution unit in sequence to obtain a third transposed convolution result; in the other branch, the second fusion result is input into a fourth Concat + CSPLayer structure;
the first enhanced feature, the third transposed convolution result, the first fusion result and the first convolution result pass through the third Concat + CSPLayer structure to obtain a third fusion result;
the third fusion result is divided into two branches: in the first branch, the third fusion result passes through the SSHF receptive field superposition module and a first decoupling output module in sequence to obtain first decoupling information; in the other branch, the third fusion result undergoes a first downsampling operation through a first downsampling unit, and the obtained first downsampling result is input into the fourth Concat + CSPLayer structure;
the second fusion result and the first downsampling result pass through the fourth Concat + CSPLayer structure to obtain a fourth fusion result; the fourth fusion result is divided into two branches: in the first branch, the fourth fusion result passes through the SSHF receptive field superposition module and a second decoupling output module in sequence to obtain second decoupling information; in the other branch, the fourth fusion result undergoes a second downsampling operation through a second downsampling unit, and the obtained second downsampling result is input into a fifth Concat + CSPLayer structure;
the second convolution result and the second downsampling result pass through the fifth Concat + CSPLayer structure to obtain a fifth fusion result; the fifth fusion result is divided into two branches: in the first branch, the fifth fusion result passes through a third decoupling output module to obtain third decoupling information; in the other branch, the fifth fusion result undergoes a third downsampling operation through a third downsampling unit, and the obtained third downsampling result is input into a sixth Concat + CSPLayer structure;
the first convolution result and the third downsampling result pass through the sixth Concat + CSPLayer structure to obtain a sixth fusion result; the sixth fusion result passes through a fourth decoupling output module to obtain fourth decoupling information;
and superposing the first decoupling information, the second decoupling information, the third decoupling information and the fourth decoupling information to obtain a small target detection result.
CN202211188231.0A 2022-09-28 2022-09-28 Long-distance space self-adaptive multi-scale small target detection method Active CN115272814B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211188231.0A CN115272814B (en) 2022-09-28 2022-09-28 Long-distance space self-adaptive multi-scale small target detection method

Publications (2)

Publication Number Publication Date
CN115272814A CN115272814A (en) 2022-11-01
CN115272814B true CN115272814B (en) 2022-12-27

Family

ID=83755864

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211188231.0A Active CN115272814B (en) 2022-09-28 2022-09-28 Long-distance space self-adaptive multi-scale small target detection method

Country Status (1)

Country Link
CN (1) CN115272814B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112016512A (en) * 2020-09-08 2020-12-01 重庆市地理信息和遥感应用中心 Remote sensing image small target detection method based on feedback type multi-scale training
CN113313118A (en) * 2021-06-25 2021-08-27 哈尔滨工程大学 Self-adaptive variable-proportion target detection method based on multi-scale feature fusion
WO2021203863A1 (en) * 2020-04-10 2021-10-14 腾讯科技(深圳)有限公司 Artificial intelligence-based object detection method and apparatus, device, and storage medium
CN113673621A (en) * 2021-08-30 2021-11-19 哈尔滨工业大学(威海) Quasi-circular target detection method based on convolutional neural network and MAML algorithm
CN113762209A (en) * 2021-09-22 2021-12-07 重庆邮电大学 Multi-scale parallel feature fusion road sign detection method based on YOLO
US11205098B1 (en) * 2021-02-23 2021-12-21 Institute Of Automation, Chinese Academy Of Sciences Single-stage small-sample-object detection method based on decoupled metric
CN113989616A (en) * 2021-10-26 2022-01-28 北京锐安科技有限公司 Target detection method, device, equipment and storage medium
CN114494893A (en) * 2022-04-18 2022-05-13 成都理工大学 Remote sensing image feature extraction method based on semantic reuse context feature pyramid
CN114612835A (en) * 2022-03-15 2022-06-10 中国科学院计算技术研究所 Unmanned aerial vehicle target detection model based on YOLOv5 network
WO2022134624A1 (en) * 2020-12-22 2022-06-30 亿咖通(湖北)技术有限公司 Pedestrian target detection method, electronic device and storage medium
CN114782311A (en) * 2022-03-14 2022-07-22 华南理工大学 Improved multi-scale defect target detection method and system based on CenterNet
CN115063573A (en) * 2022-06-14 2022-09-16 湖北工业大学 Multi-scale target detection method based on attention mechanism

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114863301A (en) * 2022-05-07 2022-08-05 西南科技大学 Small target detection method for aerial image of unmanned aerial vehicle

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Multi-Scale Object Detection Using Feature Fusion Recalibration Network; Ziyuan Guo et al; 《IEEE》; 20200313; full text *
Multi-Scale Object Detection with Feature Fusion and Region Objectness Network; Wenjie Guan et al; 《2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)》; 20180913; full text *
Research on object detection algorithms with multi-scale and convolutional feature fusion under a single-stage model; Han Tongqiang; 《China Master's Theses Full-text Database》; 20220315; full text *
Metro pedestrian object detection algorithm based on a multi-scale weighted feature fusion network; Dong Xiaowei et al; 《Journal of Electronics & Information Technology》; 20211231; full text *
Adaptive UAV object detection based on multi-scale feature fusion; Liu Fang et al; 《Acta Optica Sinica》; 20200525 (No. 10); full text *

Similar Documents

Publication Publication Date Title
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
CN108961235B (en) Defective insulator identification method based on YOLOv3 network and particle filter algorithm
CN111489358B (en) Three-dimensional point cloud semantic segmentation method based on deep learning
CN110188705B (en) Remote traffic sign detection and identification method suitable for vehicle-mounted system
CN109886066B (en) Rapid target detection method based on multi-scale and multi-layer feature fusion
CN111784685A (en) Power transmission line defect image identification method based on cloud edge cooperative detection
CN109684922B (en) Multi-model finished dish identification method based on convolutional neural network
CN111950453A (en) Optional-shape text recognition method based on selective attention mechanism
CN110569814B (en) Video category identification method, device, computer equipment and computer storage medium
CN105550701A (en) Real-time image extraction and recognition method and device
CN111461083A (en) Rapid vehicle detection method based on deep learning
CN105574550A (en) Vehicle identification method and device
CN112464911A (en) Improved YOLOv 3-tiny-based traffic sign detection and identification method
CN111767927A (en) Lightweight license plate recognition method and system based on full convolution network
CN113159067A (en) Fine-grained image identification method and device based on multi-grained local feature soft association aggregation
CN116385958A (en) Edge intelligent detection method for power grid inspection and monitoring
CN111368775A (en) Complex scene dense target detection method based on local context sensing
CN117456480B (en) Light vehicle re-identification method based on multi-source information fusion
CN115311508A (en) Single-frame image infrared dim target detection method based on depth U-type network
CN113763364B (en) Image defect detection method based on convolutional neural network
CN109508639B (en) Road scene semantic segmentation method based on multi-scale porous convolutional neural network
CN112785610B (en) Lane line semantic segmentation method integrating low-level features
CN114550023A (en) Traffic target static information extraction device
CN117372853A (en) Underwater target detection algorithm based on image enhancement and attention mechanism
CN117058386A (en) Asphalt road crack detection method based on improved deep Labv3+ network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant