CN113723181B - Unmanned aerial vehicle aerial photographing target detection method and device - Google Patents

Unmanned aerial vehicle aerial photographing target detection method and device

Info

Publication number
CN113723181B
CN113723181B (application CN202110817078.2A)
Authority
CN
China
Prior art keywords
feature
feature map
unmanned aerial
aerial vehicle
loss function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110817078.2A
Other languages
Chinese (zh)
Other versions
CN113723181A (en)
Inventor
王嘉荣
李岩山
张坤华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen University
Original Assignee
Shenzhen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen University filed Critical Shenzhen University
Priority to CN202110817078.2A priority Critical patent/CN113723181B/en
Publication of CN113723181A publication Critical patent/CN113723181A/en
Application granted granted Critical
Publication of CN113723181B publication Critical patent/CN113723181B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application discloses a method and a device for detecting targets in unmanned aerial vehicle aerial images, comprising the following steps: acquiring an unmanned aerial vehicle aerial image to be input; performing feature extraction on the unmanned aerial vehicle aerial image by using a preset CSPDarknet53 extraction model, and performing feature fusion on the extracted features by using a preset IncreasedFFM feature fusion model; and positioning and classifying the fused features through a preset YOLOv3Head network model, and obtaining a detection result of the unmanned aerial vehicle aerial image after non-maximum suppression calculation. In this embodiment, a branch is added to the original network to concentrate the information of each feature layer and then distribute the concentrated information to the different feature layers, so that the features are fully fused. Furthermore, the loss function is improved for the data-set sample-imbalance problem.

Description

Unmanned aerial vehicle aerial photographing target detection method and device
Technical Field
The embodiment of the application relates to the field of electric digital data processing, and in particular to a method and a device for detecting targets in unmanned aerial vehicle aerial images.
Background
With the rapid development of unmanned aerial vehicle (UAV) technology, UAVs are widely applied in fields such as military reconnaissance, traffic control, geological survey and highway management, and play an increasingly important role in modern society. Target detection in UAV aerial images is a key technology, and its detection performance largely determines the intelligence level of the UAV. Therefore, UAV target detection has important research significance.
Object detection techniques aim to locate and classify objects in an image. Conventional target detection methods generally consist of three steps: image preprocessing, feature extraction and object classification. Image preprocessing methods mostly rely on pixel differences and texture information, separating foreground from background by screening and segmenting regions according to pixel variation trends; feature extraction methods include the Histogram of Oriented Gradients (HOG), the Scale-Invariant Feature Transform (SIFT), local binary patterns and the like. These methods show good performance and interpretability in some scenes, but they are not suitable for detection tasks in complex scenes and generalize poorly. With the development of artificial intelligence, researchers have applied deep learning to target detection tasks, where the detection effect greatly exceeds that of traditional methods, and deep-learning-based target detection has become a new research hotspot.
Deep-learning-based target detection techniques fall mainly into two categories. One is two-stage detection, with representative algorithms including R-CNN, Fast R-CNN, Mask R-CNN and Cascade R-CNN; the other is single-stage detection, with representative algorithms including YOLO, SSD, RetinaNet and EfficientDet. Two-stage detection algorithms achieve high detection accuracy but are slow, which makes them difficult to apply in real-time detection scenarios; single-stage detection algorithms are fast but lose some detection accuracy. A current research direction is to combine the accuracy advantage of two-stage methods with the speed advantage of single-stage methods and design high-performance target detection algorithms, which have achieved good results in natural-scene target detection tasks. However, UAV aerial images are characterized by complex backgrounds, many targets, small targets and large fields of view, and most ground targets are seen from an overhead viewing angle, so directly transferring a natural-scene target detection algorithm to aerial images cannot meet practical requirements.
In the prior art, YOLOv4-CSP is a model in Scaled-YOLOv4 that is suitable for common GPUs and is a typical single-stage object detector. YOLOv4-CSP introduces the CSPNet design idea and skillfully combines strategies such as Mosaic data augmentation, PANet feature fusion, cosine-annealing learning-rate decay and CIoU loss, obtaining excellent target detection performance. However, closer examination of the YOLOv4-CSP model reveals the following problems.
1) In the feature fusion part, information is transmitted through two paths, top-down and bottom-up. When top-layer information is passed down to the bottom layer it is diluted by the intermediate layers; likewise, bottom-layer information passed up to the top layer through the intermediate layers is also diluted. UAV aerial targets are mostly small targets, and the diluted features seriously affect the detection of small aerial targets.
2) In the loss function part, YOLOv4-CSP does not consider the class-imbalance problem: it assigns the same weight to every class, so each class contributes equally to the loss function. Most UAV aerial datasets have a long-tailed distribution with uneven sample numbers, and if large and small classes contribute equally to the loss function, the detector is disadvantaged when detecting targets of the minority classes.
Disclosure of Invention
In order to solve the technical problems, an embodiment of the present application provides an unmanned aerial vehicle aerial photographing target detection method, including:
acquiring an unmanned aerial vehicle aerial image to be input;
performing feature extraction on the unmanned aerial vehicle aerial image by using a preset CSPDarknet53 extraction model, and performing feature fusion on the extracted features by using a preset IncreasedFFM feature fusion model;
and positioning and classifying the fused features through a preset YOLOv3Head network model, and obtaining a detection result of the unmanned aerial vehicle aerial image after non-maximum suppression calculation.
Further, the feature fusion of the extracted features by using a preset IncreasedFFM feature fusion model includes:
transmitting top-layer information to the bottom layer from top to bottom and transmitting bottom-layer information to the top layer from bottom to top according to the channels in the IncreasedFFM feature fusion model to complete feature fusion;
carrying out feature concentration on the feature maps of the unmanned aerial vehicle aerial image, carrying out network-adaptive channel selection on the concentrated feature map to obtain channel weights, and respectively carrying out up-sampling, ordinary convolution and down-sampling to obtain feature maps of different sizes;
and adding the fused characteristic diagram and the concentrated characteristic diagram to obtain a characteristic diagram with fully fused characteristics.
Further, the transferring of top-layer information to the bottom layer from top to bottom and of bottom-layer information to the top layer from bottom to top according to the channels in the IncreasedFFM feature fusion model to complete feature fusion includes:
acquiring a small-scale feature map and a large-scale feature map from the feature maps obtained after feature extraction of the unmanned aerial vehicle aerial image, wherein the feature maps gradually become smaller as the feature extraction of the neural network deepens;
up-sampling the small-scale feature map from top to bottom according to the channels in the IncreasedFFM feature fusion model, adjusting the sampled feature map to the same size as the feature map of the adjacent next channel, and performing stacking and convolution operations to transfer top-layer information to the bottom layer and complete feature fusion;
and down-sampling the large-scale feature map from bottom to top according to the channels in the IncreasedFFM feature fusion model, adjusting the sampled feature map to the same size as the feature map of the adjacent previous channel, and performing stacking and convolution operations to transfer bottom-layer information to the top layer and complete feature fusion.
Further, the carrying out of feature concentration on the feature maps of the unmanned aerial vehicle aerial image, the network-adaptive channel selection on the concentrated feature map to obtain channel weights, and the respective up-sampling, ordinary convolution and down-sampling to obtain feature maps of different sizes include:
up-sampling upper layer features in the feature images of the unmanned aerial vehicle, adjusting the up-sampled feature images to be the same as the size of the middle layer feature images to obtain a first feature image, down-sampling lower layer features in the feature images of the unmanned aerial vehicle, and adjusting the down-sampled feature images to be the same as the size of the middle layer feature images to obtain a second feature image;
adding the first feature map, the second feature map and the intermediate layer feature map to obtain a concentrated feature map;
and inputting the concentrated feature map into a preset channel attention structure model for network self-adaptive channel selection to obtain a concentrated feature map with channel weight, and respectively carrying out up-sampling, common convolution and down-sampling on the concentrated feature map with the channel weight to obtain three feature maps with different sizes.
Further, during model training the model further comprises a loss function, which is obtained by weighted summation of a positioning loss function, a confidence loss function and a classification loss function, wherein the classification loss function is a BalancedLoss classification loss function, and the BalancedLoss classification loss function is calculated by the following steps:
counting the number of samples of each category, and normalizing to obtain the occurrence frequency of the samples of each category;
calculating the occurrence frequency by using an inverse correlation function to obtain the weight of each category;
multiplying the weight of each category by the cross entropy loss function for enhancing the proportion of the positive sample to obtain the BalancedLoss classification loss function.
An image data processing apparatus comprising:
the acquisition module is used for acquiring an unmanned aerial vehicle aerial image to be input;
the processing module is used for extracting features of the unmanned aerial vehicle aerial image by using a preset CSPDarknet53 extraction model and carrying out feature fusion on the extracted features by using a preset IncreasedFFM feature fusion model;
and the execution module is used for positioning and classifying the fused features through a preset YOLOv3Head network model and obtaining a detection result of the unmanned aerial vehicle aerial image after non-maximum suppression calculation.
Further, the acquisition module includes:
the first processing sub-module is used for transmitting top-layer information to the bottom layer from top to bottom and transmitting bottom-layer information to the top layer from bottom to top according to the channels in the IncreasedFFM feature fusion model to complete feature fusion;
the second processing sub-module is used for carrying out feature concentration on the feature maps of the unmanned aerial vehicle aerial image, carrying out network-adaptive channel selection on the concentrated feature map to obtain channel weights, and respectively carrying out up-sampling, ordinary convolution and down-sampling to obtain feature maps of different sizes;
and the first execution sub-module is used for adding the fused characteristic diagram and the concentrated characteristic diagram to obtain a characteristic diagram with fully fused characteristics.
Further, the first processing submodule includes:
the first acquisition submodule is used for acquiring a small-scale feature image and a large-scale feature image from the feature image obtained by the aerial image of the unmanned aerial vehicle after feature extraction;
the third processing sub-module is used for up-sampling the small-scale feature map from top to bottom according to the channels in the IncreasedFFM feature fusion model, adjusting the sampled feature map to the same size as the feature map of the adjacent next channel, and performing stacking and convolution operations to transfer top-layer information to the bottom layer and complete feature fusion;
and the fourth processing sub-module is used for down-sampling the large-scale feature map from bottom to top according to the channels in the IncreasedFFM feature fusion model, adjusting the sampled feature map to the same size as the feature map of the adjacent previous channel, and performing stacking and convolution operations to transfer bottom-layer information to the top layer and complete feature fusion.
Further, the second processing sub-module includes:
a fifth processing sub-module, configured to upsample an upper layer feature in the feature map of the unmanned aerial vehicle aerial image, adjust the upsampled feature map to be the same as a size of the middle layer feature map to obtain a first feature map, downsample a lower layer feature in the feature map of the unmanned aerial vehicle aerial image, and adjust the downsampled feature map to be the same as the size of the middle layer feature map to obtain a second feature map;
the sixth processing submodule is used for adding the first characteristic diagram, the second characteristic diagram and the middle layer characteristic diagram to obtain a concentrated characteristic diagram;
and the second execution sub-module is used for inputting the concentrated feature map into a preset channel attention structure model for network-adaptive channel selection to obtain a concentrated feature map with channel weights, and respectively performing up-sampling, ordinary convolution and down-sampling on the concentrated feature map with the channel weights to obtain three feature maps of different sizes.
Further, during model training the model further comprises a loss function, which is obtained by weighted summation of a positioning loss function, a confidence loss function and a classification loss function, the classification loss function being a BalancedLoss classification loss function, and the acquisition module further comprises:
a seventh processing sub-module, configured to count the number of samples in each category, and normalize the number to obtain the occurrence frequency of samples in each category;
an eighth processing sub-module, configured to calculate the occurrence frequency by using an inverse correlation function to obtain a weight of each category;
and the third execution submodule is used for multiplying the weight of each category with the cross entropy loss function for enhancing the proportion of the positive sample to obtain the BalancedLoss classification loss function.
The embodiment of the application has the beneficial effects that: the embodiment of the application provides an improved YOLOv4-CSP target detector. The model adopts a single-stage target detection framework: the input image is fed sequentially into the feature extraction, feature fusion, and positioning and classification networks, and the detection result is finally obtained through non-maximum suppression. The improved YOLOv4-CSP selects CSPDarknet53 as the feature extraction network and YOLOv3Head as the positioning and classification network. Because the feature fusion network suffers from information dilution during feature transmission, this scheme adds a branch to the original network to concentrate the information of each feature layer and then distribute the concentrated information to the different feature layers, so that the features are fully fused; furthermore, the loss function is improved for the data-set sample-imbalance problem. Compared with the reference model, the improved model shows a large improvement on all indicators.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of an aerial photographing target detection method of an unmanned aerial vehicle according to an embodiment of the present application;
FIG. 2 is a graph of experimental data provided by an embodiment of the present application;
fig. 3 is a basic structural block diagram of an aerial photographing target detection device of an unmanned aerial vehicle according to an embodiment of the present application;
fig. 4 is a basic structural block diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order to enable those skilled in the art to better understand the present application, the following description will make clear and complete descriptions of the technical solutions according to the embodiments of the present application with reference to the accompanying drawings.
In some of the flows described in the specification and claims of the present application and in the foregoing figures, a plurality of operations occurring in a particular order are included, but it should be understood that the operations may be performed out of order or performed in parallel, with the order of operations such as 101, 102, etc., being merely used to distinguish between the various operations, the order of the operations themselves not representing any order of execution. In addition, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that, the descriptions of "first" and "second" herein are used to distinguish different messages, devices, modules, etc., and do not represent a sequence, and are not limited to the "first" and the "second" being different types.
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to fall within the scope of the application.
Referring to fig. 1, fig. 1 shows an unmanned aerial vehicle aerial photographing target detection method provided by an embodiment of the present application; the method specifically includes the following steps (a sketch of the overall flow is given after the steps):
s110, acquiring an unmanned aerial vehicle aerial image to be input;
s120, performing feature extraction on the unmanned aerial vehicle aerial image by using a preset CSPdark 53 extraction model, and performing feature fusion on the extracted features by using a preset IncreedFFM feature fusion model;
s130, positioning and classifying the fused features through a preset YOLOv3Head network model, and obtaining a detection result of the unmanned aerial vehicle aerial image after non-maximum suppression calculation.
The embodiment of the application provides an improved YOLOv4-CSP target detector. The model adopts a single-stage target detection framework: the input image is fed sequentially into the feature extraction, feature fusion, and positioning and classification networks, and the detection result is finally obtained through non-maximum suppression. The improved YOLOv4-CSP selects CSPDarknet53 as the feature extraction network and YOLOv3Head as the positioning and classification network. Because the feature fusion network suffers from information dilution during feature transmission, this scheme adds a branch to the original network to concentrate the information of each feature layer and then distribute the concentrated information to the different feature layers, so that the features are fully fused; furthermore, the loss function is improved for the data-set sample-imbalance problem. Compared with the reference model, the improved model shows a large improvement on all indicators.
The embodiment of the application provides a method for carrying out feature fusion on the extracted features by using the preset IncreasedFFM feature fusion model, i.e. the fusion part of step S120, which includes the following steps:
Step one, transmitting top-layer information to the bottom layer from top to bottom and transmitting bottom-layer information to the top layer from bottom to top according to the channels in the IncreasedFFM feature fusion model, over the extracted feature maps of the unmanned aerial vehicle aerial image, to complete feature fusion.
In order to effectively cope with the feature dilution effect and improve small-target detection, this embodiment adopts an IncreasedFFM feature fusion network. The network comprises two branches: one branch performs bidirectional feature fusion, and the other concentrates the feature maps of different scales.
Specifically, the first branch, i.e. the branch performing bidirectional feature fusion (step one), can be realized as follows (a code sketch is given after these steps):
acquiring a small-scale feature map and a large-scale feature map from the feature maps finally obtained after feature extraction of the unmanned aerial vehicle aerial image;
in this embodiment, the feature maps are obtained by convolving the input image; the size of the large-scale feature map is 1/8 of the input image size, the size of the intermediate feature map is 1/16 of the input image size, and the size of the small-scale feature map is 1/32 of the input image size;
up-sampling the small-scale feature map from top to bottom according to the channels in the IncreasedFFM feature fusion model, adjusting the sampled feature map to the same size as the feature map of the adjacent next channel, and performing stacking and convolution operations to transfer top-layer information to the bottom layer and complete feature fusion;
and down-sampling the large-scale feature map from bottom to top according to the channels in the IncreasedFFM feature fusion model, adjusting the sampled feature map to the same size as the feature map of the adjacent previous channel, and performing stacking and convolution operations to transfer bottom-layer information to the top layer and complete feature fusion.
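As a concrete illustration of this branch, the sketch below follows the description: nearest-neighbour upsampling for the top-down path, a stride-2 convolution for the bottom-up path, and channel concatenation ("stacking") followed by a convolution at every fusion point. The channel counts, kernel sizes and exact sampling operators are assumptions of the sketch rather than the patent's fixed choices.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BidirectionalFusion(nn.Module):
    """Top-down / bottom-up fusion branch: top-layer information is passed downward
    by upsampling + concatenation + convolution, and bottom-layer information is
    passed upward by downsampling + concatenation + convolution."""
    def __init__(self, channels=(256, 512, 1024)):
        super().__init__()
        c3, c4, c5 = channels
        self.td4 = nn.Conv2d(c5 + c4, c4, 1)    # fuse upsampled 1/32 map into 1/16 map
        self.td3 = nn.Conv2d(c4 + c3, c3, 1)    # fuse upsampled 1/16 map into 1/8 map
        self.down3 = nn.Conv2d(c3, c3, 3, stride=2, padding=1)
        self.bu4 = nn.Conv2d(c3 + c4, c4, 1)    # fuse downsampled 1/8 map into 1/16 map
        self.down4 = nn.Conv2d(c4, c4, 3, stride=2, padding=1)
        self.bu5 = nn.Conv2d(c4 + c5, c5, 1)    # fuse downsampled 1/16 map into 1/32 map

    def forward(self, feats):
        p3, p4, p5 = feats                      # 1/8, 1/16, 1/32 of the input size
        # Top-down path: upsample the small-scale map to the neighbour's size,
        # stack (concatenate) along channels and convolve.
        p4 = self.td4(torch.cat([F.interpolate(p5, size=p4.shape[-2:]), p4], dim=1))
        p3 = self.td3(torch.cat([F.interpolate(p4, size=p3.shape[-2:]), p3], dim=1))
        # Bottom-up path: downsample the large-scale map and fuse it upward.
        p4 = self.bu4(torch.cat([self.down3(p3), p4], dim=1))
        p5 = self.bu5(torch.cat([self.down4(p4), p5], dim=1))
        return p3, p4, p5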
And secondly, carrying out feature concentration on the feature map of the unmanned aerial vehicle aerial image, carrying out network self-adaptive channel selection on the concentrated feature map to obtain channel weights, and respectively carrying out up-sampling, common convolution and down-sampling to obtain feature maps with different sizes.
The other branch in this embodiment of the application is responsible for feature concentration; this concentration, i.e. step two, can be realized as follows:
up-sampling upper layer features in the feature images of the unmanned aerial vehicle, adjusting the up-sampled feature images to be the same as the size of the middle layer feature images to obtain a first feature image, down-sampling lower layer features in the feature images of the unmanned aerial vehicle, and adjusting the down-sampled feature images to be the same as the size of the middle layer feature images to obtain a second feature image;
in one embodiment of the present application, as the neural network goes deep, the feature map is continuously downsampled, in the CSPDarknet53 network, five downsampling operations are performed, and each time the downsampling operation is performed, the image size becomes 1/2 of the original image size. Finally, the size of the feature map is 1/32 of the size of the input image. The three downsampled feature maps are selected for operation, and the size of the feature maps corresponds to 1/8, 1/16 and 1/32 of the original image.
Adding the first feature map, the second feature map and the intermediate layer feature map to obtain a concentrated feature map;
and inputting the concentrated feature map into a preset channel attention structure model for network self-adaptive channel selection to obtain a concentrated feature map with channel weight, and respectively carrying out up-sampling, common convolution and down-sampling on the concentrated feature map with the channel weight to obtain three feature maps with different sizes.
And step three, adding the fused feature maps and the concentrated feature maps to obtain feature maps in which the features are fully fused.
In the embodiment of the application, the feature maps obtained after feature fusion in step one are added to the feature maps of the same size obtained after feature concentration in step two, yielding three feature maps; in this way the IncreasedFFM alleviates the dilution effect in the information transmission process and produces fully fused features. It should be noted that the IncreasedFFM comprises two branches, one being the top-down and bottom-up feature fusion branch and the other the feature concentration branch. Each branch outputs three feature maps, and the corresponding feature maps of the two branches are added to obtain the three final feature maps, as sketched below.
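A sketch of the concentration branch and the final element-wise addition is given below. It assumes the three backbone maps have (256, 512, 1024) channels, uses 1x1 projections so they can be added at the middle scale, and uses a squeeze-and-excitation-style block as the "channel attention structure model"; all of these are illustrative assumptions rather than the patent's exact structure.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureConcentration(nn.Module):
    """Concentration branch: resize the upper and lower layers to the middle-layer
    size, add the three maps, re-weight the channels with a channel-attention block,
    and redistribute the result to the three scales by upsampling, ordinary
    convolution and downsampling."""
    def __init__(self, in_channels=(256, 512, 1024), mid_channels=512, reduction=16):
        super().__init__()
        c3, c4, c5 = in_channels
        # 1x1 projections so the three maps can be summed (an assumption of this sketch).
        self.proj3 = nn.Conv2d(c3, mid_channels, 1)
        self.proj4 = nn.Conv2d(c4, mid_channels, 1)
        self.proj5 = nn.Conv2d(c5, mid_channels, 1)
        # Channel attention for network-adaptive channel selection (SE-style assumption).
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(mid_channels, mid_channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels // reduction, mid_channels, 1), nn.Sigmoid())
        self.to_p3 = nn.Conv2d(mid_channels, c3, 3, padding=1)            # then upsampled
        self.to_p4 = nn.Conv2d(mid_channels, c4, 3, padding=1)            # ordinary conv
        self.to_p5 = nn.Conv2d(mid_channels, c5, 3, stride=2, padding=1)  # downsampling

    def forward(self, c3, c4, c5):
        mid = c4.shape[-2:]
        first = F.interpolate(self.proj5(c5), size=mid)                  # upsampled upper layer
        second = F.adaptive_max_pool2d(self.proj3(c3), output_size=mid)  # downsampled lower layer
        concentrated = first + second + self.proj4(c4)                   # concentrated map
        concentrated = concentrated * self.attn(concentrated)            # channel weights
        # Redistribute the concentrated map back to the three scales.
        q3 = F.interpolate(self.to_p3(concentrated), size=c3.shape[-2:])
        q4 = self.to_p4(concentrated)
        q5 = self.to_p5(concentrated)
        return q3, q4, q5

In this sketch the outputs q3, q4 and q5 would be added element-wise to the same-size outputs of the bidirectional fusion branch, giving the three final feature maps fed to the positioning and classification network, as described above.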
In the embodiment of the present application, the model is trained as a whole; the whole model includes: the preset CSPDarknet53 extraction model, the preset IncreasedFFM feature fusion model and the preset YOLOv3Head network model.
In this embodiment, the model further comprises a loss function during training, which is obtained by weighted summation of a positioning loss function, a confidence loss function and a classification loss function. The positioning loss is CIoU; compared with the IoU, GIoU and DIoU methods, it covers more of the possible relative distributions of prediction boxes and ground-truth boxes, and drives the position and shape of the prediction box to be as close as possible to the ground-truth box. The confidence loss evaluates whether a prior box contains a target; it adopts a cross-entropy loss function and at the same time increases the weight of positive samples, reducing misclassifications caused by the imbalance between positive and negative samples.
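CIoU is a published formulation rather than something specific to this patent; for reference, a direct sketch of the standard CIoU loss (IoU term, normalized centre-distance term and aspect-ratio consistency term) is given below, assuming boxes in (x1, y1, x2, y2) format. It is a sketch of the localization term named in the text, not the patent's own code.

import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """Standard CIoU loss for (N, 4) box tensors in (x1, y1, x2, y2) format."""
    # Intersection-over-union.
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    # Squared centre distance, normalized by the diagonal of the smallest enclosing box;
    # this penalizes boxes that are far apart even when they do not overlap.
    cx_p = (pred[:, 0] + pred[:, 2]) / 2
    cy_p = (pred[:, 1] + pred[:, 3]) / 2
    cx_t = (target[:, 0] + target[:, 2]) / 2
    cy_t = (target[:, 1] + target[:, 3]) / 2
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    rho2 = (cx_p - cx_t) ** 2 + (cy_p - cy_t) ** 2
    c2 = cw ** 2 + ch ** 2 + eps
    # Aspect-ratio consistency term, pulling the prediction's shape toward the target's.
    w_p, h_p = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w_t, h_t = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(w_t / (h_t + eps)) - torch.atan(w_p / (h_p + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return (1 - iou + rho2 / c2 + alpha * v).mean()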
In order to solve the problem of sample unbalance, the embodiment of the application improves the detection rate of small class targets, the classification loss function is a BalancedLoss classification loss function, and the calculation step of the BalancedLoss classification loss function comprises the following steps:
counting the number of samples of each category, and normalizing to obtain the occurrence frequency of the samples of each category;
calculating the occurrence frequency by using an inverse correlation function to obtain the weight of each category;
multiplying the weight of each category by the cross entropy loss function for enhancing the proportion of the positive sample to obtain the BalancedLoss classification loss function.
In one embodiment, the number of samples x_n of each class is counted and normalized by the total number of samples M to obtain the occurrence frequency f_n of each class.
The frequency values are then fed in turn into an inverse-correlation function to obtain the weight w_n of each class. The inverse-correlation function makes classes that occur more frequently receive lower weights. The inverse-correlation function used here is an exponential function, to which a weighting factor α is added to adjust the relationship between the frequency level and the weight magnitude.
Finally, the weights are multiplied into a cross-entropy loss function whose positive-sample proportion is enhanced, giving the improved classification loss. BalancedLoss is defined as follows:
L_BL = -w_n [ p_n y_n log σ(x_n) + (1 - y_n) log σ(1 - x_n) ]
where x_n is the predicted value, y_n is the ground-truth value, p_n is the positive-sample weight, and σ(·) denotes the Sigmoid operation.
In one embodiment, the VisDrone dataset collected by the AISKYEYE team of the machine learning and data mining laboratory at Tianjin University is selected. The dataset comprises four tasks: image object detection, video object detection, single-object tracking and multi-object tracking. VisDrone2019-DET is the image object detection dataset and comprises 10 classes: pedestrian, people, bicycle, car, van, truck, tricycle, awning-tricycle, bus and motor. The training set contains 6471 images, the validation set contains 548 images, and the test set contains 1580 images. VisDrone2019-DET was collected under different weather and illumination conditions, so the images differ greatly; the samples are extremely unbalanced, containing 25431 car targets but only 457 tricycle targets. These factors make the dataset very challenging, and target detection performance drops significantly compared with common datasets such as COCO and VOC.
The proposed scheme was evaluated on these data through experiments run on a Windows 10 platform. The CPU is an Intel i5-10400F, the GPU is an RTX 3060 with 12 GB of video memory, and the system memory is 16 GB. The evaluation indicators used are mAP@0.5, mAP@0.5:0.95, Precision and Recall. mAP (mean Average Precision) is the average of the APs of all classes. AP is the area under the PR curve and reflects the detection performance of the detector for a certain class; mAP reflects the detection performance over all classes. mAP@0.5 denotes the mAP when the IoU threshold equals 0.5, and mAP@0.5:0.95 denotes the average mAP over different IoU thresholds (from 0.5 to 0.95, step size 0.05). As the IoU threshold increases, the mAP gradually decreases, indicating that detection accuracy drops as the evaluation criterion becomes stricter. Precision and Recall are the basis for calculating mAP. Precision is the precision rate, the probability that a prediction box is correct; Recall is the recall rate, the probability that a ground-truth box is detected. Precision and Recall trade off against each other, so a balance is made between P and R: the PR curve is drawn and the AP is used to evaluate detector performance.
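As a small illustration of how mAP@0.5:0.95 relates to mAP@0.5, the snippet below simply averages the mAP over the ten IoU thresholds, assuming a hypothetical map_at(iou_thr) routine that returns the mAP for a single threshold.

def map_50_95(map_at):
    """Average mAP over IoU thresholds 0.50, 0.55, ..., 0.95 (step 0.05)."""
    thresholds = [0.5 + 0.05 * i for i in range(10)]
    return sum(map_at(t) for t in thresholds) / len(thresholds)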
The experimental results are shown in Table 1, from which it can be seen that our model clearly improves over the reference model: mAP@0.5:0.95 improves by 0.5%, mAP@0.5 improves by 1.4%, Precision improves by 0.4% and Recall improves by 2.1%.
Table 1 model performance comparison
Compared with YOLOv4-CSP, the YOLOv4-CSP with the IncreasedFFM structure shows a clear improvement. This is because IncreasedFFM distributes the concentrated features back into three feature maps of different scales, supplementing the parts lost in the information transmission process, so that the features are fully fused before being fed into the positioning and classification network, raising the performance of the detector to a new level.
When BalancedLoss is added on top of YOLOv4-CSP+IncreasedFFM, the detector obtains a higher recall rate. BalancedLoss multiplies the minority classes by large weights so that they obtain a loss-function contribution rate comparable to that of the majority classes, alleviating sample imbalance and recalling more targets. However, the precision of the model with BalancedLoss decreases, which means the false detections of the detector increase; but mAP@0.5:0.95 does not change and mAP@0.5 increases by 0.3%, which is a positive improvement.
Fig. 2 shows the visual results of the detection method. It can be seen that the improved model has a better visual effect than the reference model. Compared with Fig. 2(a), Fig. 2(b) and Fig. 2(c) contain more bounding boxes, so the recall rate is improved; after BalancedLoss is added to the model, the bounding boxes in Fig. 2(c) are more accurate than those in Fig. 2(b), with fewer misclassifications.
In the scheme of the embodiment of the application, the YOLOv4-CSP model is improved and an improved YOLOv4-CSP algorithm is proposed for detecting UAV aerial images. The main work includes proposing an IncreasedFFM feature fusion network to replace the original PANet, which alleviates the problem of information being diluted during transmission; and, for the sample imbalance of the dataset, improving the loss function by proposing the BalancedLoss classification loss, which multiplies each class by its inverse-correlation weight and adjusts each class's contribution rate to the loss function. The improved model shows a significant improvement over the baseline model: subjectively, the prediction-box boundaries of the improved model are more accurate and the classification error rate is lower; objectively, evaluation indicators such as the mAP of the improved model are improved.
As shown in fig. 3, in order to solve the above problems, an embodiment of the present application further provides an unmanned aerial vehicle aerial photographing target detection apparatus, comprising an acquisition module 2100, a processing module 2200 and an execution module 2300, wherein: the acquisition module 2100 is used for acquiring an unmanned aerial vehicle aerial image to be input; the processing module 2200 is configured to perform feature extraction on the unmanned aerial vehicle aerial image by using a preset CSPDarknet53 extraction model, and perform feature fusion on the extracted features by using a preset IncreasedFFM feature fusion model; and the execution module 2300 is configured to locate and classify the fused features through a preset YOLOv3Head network model, and obtain a detection result of the unmanned aerial vehicle aerial image after non-maximum suppression calculation.
In some embodiments, the acquisition module comprises: a first processing sub-module, used for transmitting top-layer information to the bottom layer from top to bottom and transmitting bottom-layer information to the top layer from bottom to top according to the channels in the IncreasedFFM feature fusion model to complete feature fusion; a second processing sub-module, used for carrying out feature concentration on the feature maps of the unmanned aerial vehicle aerial image, carrying out network-adaptive channel selection on the concentrated feature map to obtain channel weights, and respectively carrying out up-sampling, ordinary convolution and down-sampling to obtain feature maps of different sizes; and a first execution sub-module, used for adding the fused feature maps and the concentrated feature maps to obtain feature maps in which the features are fully fused.
In some embodiments, the first processing sub-module comprises: a first acquisition sub-module, used for acquiring a small-scale feature map and a large-scale feature map from the feature maps obtained after feature extraction of the unmanned aerial vehicle aerial image; a third processing sub-module, used for up-sampling the small-scale feature map from top to bottom according to the channels in the IncreasedFFM feature fusion model, adjusting the sampled feature map to the same size as the feature map of the adjacent next channel, and performing stacking and convolution operations to transfer top-layer information to the bottom layer and complete feature fusion; and a fourth processing sub-module, used for down-sampling the large-scale feature map from bottom to top according to the channels in the IncreasedFFM feature fusion model, adjusting the sampled feature map to the same size as the feature map of the adjacent previous channel, and performing stacking and convolution operations to transfer bottom-layer information to the top layer and complete feature fusion.
In some embodiments, the second processing sub-module comprises: a fifth processing sub-module, used for up-sampling the upper-layer features in the feature maps of the unmanned aerial vehicle aerial image, adjusting the up-sampled feature map to the same size as the middle-layer feature map to obtain a first feature map, down-sampling the lower-layer features in the feature maps of the unmanned aerial vehicle aerial image, and adjusting the down-sampled feature map to the same size as the middle-layer feature map to obtain a second feature map; a sixth processing sub-module, used for adding the first feature map, the second feature map and the middle-layer feature map to obtain a concentrated feature map; and a second execution sub-module, used for inputting the concentrated feature map into a preset channel attention structure model for network-adaptive channel selection to obtain a concentrated feature map with channel weights, and respectively performing up-sampling, ordinary convolution and down-sampling on the concentrated feature map with the channel weights to obtain three feature maps of different sizes.
In some embodiments, during model training the model further includes a loss function, which is a weighted sum of a positioning loss function, a confidence loss function and a classification loss function, the classification loss function being a BalancedLoss classification loss function, and the acquisition module further comprises: a seventh processing sub-module, used for counting the number of samples in each class and normalizing it to obtain the occurrence frequency of the samples of each class; an eighth processing sub-module, used for calculating the occurrence frequency with an inverse-correlation function to obtain the weight of each class; and a third execution sub-module, used for multiplying the weight of each class with the cross-entropy loss function whose positive-sample proportion is enhanced to obtain the BalancedLoss classification loss function.
In order to solve the technical problems, the embodiment of the application also provides computer equipment. Referring specifically to fig. 4, fig. 4 is a basic structural block diagram of a computer device according to the present embodiment.
As shown in fig. 4, the internal structure of the computer device is schematically shown. As shown in fig. 4, the computer device includes a processor, a non-volatile storage medium, a memory, and a network interface connected by a system bus. The non-volatile storage medium of the computer device stores an operating system, a database, and computer readable instructions, where the database may store a control information sequence, and the computer readable instructions, when executed by the processor, may cause the processor to implement an image processing method. The processor of the computer device is used to provide computing and control capabilities, supporting the operation of the entire computer device. The memory of the computer device may have stored therein computer readable instructions that, when executed by the processor, cause the processor to perform an image processing method. The network interface of the computer device is for communicating with a terminal connection. It will be appreciated by persons skilled in the art that the architecture shown in fig. 4 is merely a block diagram of some of the architecture relevant to the present inventive arrangements and is not limiting as to the computer device to which the present inventive arrangements are applicable, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
The processor in this embodiment is configured to execute the specific contents of the acquisition module 2100, the processing module 2200, and the execution module 2300 in fig. 3, and the memory stores program codes and various types of data required for executing the above modules. The network interface is used for data transmission between the user terminal or the server. The memory in the present embodiment stores program codes and data necessary for executing all the sub-modules in the image processing method, and the server can call the program codes and data of the server to execute the functions of all the sub-modules.
According to the computer equipment provided by the embodiment of the application, the reference feature map is obtained by extracting the features of the high-definition image set in the reference pool, and because of the diversification of the images in the high-definition image set, the reference feature map contains all possible local features, so that high-frequency texture information can be provided for each low-resolution image, the feature richness is ensured, and the memory burden is reduced. In addition, the reference feature map is searched according to the low-resolution image, and the selected reference feature map can adaptively shield or enhance various different features, so that the details of the low-resolution image are richer.
The present application also provides a storage medium storing computer readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the image processing method of any of the embodiments described above.
Those skilled in the art will appreciate that implementing all or part of the above-described methods in accordance with the embodiments may be accomplished by way of a computer program stored in a computer-readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. The storage medium may be a nonvolatile storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a random access Memory (Random Access Memory, RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited in order and may be performed in other orders, unless explicitly stated herein. Moreover, at least some of the steps in the flowcharts of the figures may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order of their execution not necessarily being sequential, but may be performed in turn or alternately with other steps or at least a portion of the other steps or stages.
The foregoing is only a partial embodiment of the present application, and it should be noted that it will be apparent to those skilled in the art that modifications and adaptations can be made without departing from the principles of the present application, and such modifications and adaptations are intended to be comprehended within the scope of the present application.

Claims (4)

1. An unmanned aerial vehicle aerial photographing target detection method, characterized by comprising the following steps:
acquiring an unmanned aerial vehicle aerial image to be input;
performing feature extraction on the unmanned aerial vehicle aerial image by using a preset CSPDarknet53 extraction model, and performing feature fusion on the extracted features by using a preset IncreasedFFM feature fusion model;
positioning and classifying the fused features through a preset YOLOv3Head network model, and obtaining a detection result of the unmanned aerial vehicle aerial image after non-maximum suppression calculation;
the feature fusion of the extracted features by using a preset incosedffm feature fusion model comprises the following steps:
transmitting top layer information to a bottom layer from top to bottom and transmitting bottom layer information to the top layer from bottom to top according to a channel in the incosedFFM feature fusion model to complete feature fusion;
carrying out feature concentration on the feature images of the unmanned aerial vehicle aerial images, carrying out network self-adaptive channel selection on the concentrated feature images to obtain channel weights, and respectively carrying out up-sampling, common rolling and down-sampling to obtain feature images with different sizes;
adding the fused characteristic diagram and the concentrated characteristic diagram to obtain a characteristic diagram with fully fused characteristics; the method for transmitting the top layer information to the bottom layer from top to bottom and transmitting the bottom layer information to the top layer from bottom to top according to the channel in the incosedffm feature fusion model to complete feature fusion comprises the following steps:
acquiring a small-scale feature map and a large-scale feature map from a feature map obtained by the aerial image of the unmanned aerial vehicle after feature extraction, wherein the feature map gradually becomes smaller along with deep feature extraction of the neural network;
up-sampling the small-scale feature map from top to bottom according to the channels in the IncreasedFFM feature fusion model, adjusting the sampled feature map to the same size as the feature map of the adjacent next channel, and performing stacking and convolution operations to transfer top-layer information to the bottom layer and complete feature fusion;
down-sampling the large-scale feature map from bottom to top according to the channels in the IncreasedFFM feature fusion model, adjusting the sampled feature map to the same size as the feature map of the adjacent previous channel, and performing stacking and convolution operations to transfer bottom-layer information to the top layer and complete feature fusion;
the carrying out of feature concentration on the feature maps of the unmanned aerial vehicle aerial image, the network-adaptive channel selection on the concentrated feature map to obtain channel weights, and the respective up-sampling, ordinary convolution and down-sampling to obtain feature maps of different sizes comprise the following steps:
up-sampling upper layer features in the feature images of the unmanned aerial vehicle, adjusting the up-sampled feature images to be the same as the size of the middle layer feature images to obtain a first feature image, down-sampling lower layer features in the feature images of the unmanned aerial vehicle, and adjusting the down-sampled feature images to be the same as the size of the middle layer feature images to obtain a second feature image;
adding the first feature map, the second feature map and the intermediate layer feature map to obtain a concentrated feature map;
and inputting the concentrated feature map into a preset channel attention structure model for network self-adaptive channel selection to obtain a concentrated feature map with channel weight, and respectively carrying out up-sampling, common convolution and down-sampling on the concentrated feature map with the channel weight to obtain three feature maps with different sizes.
2. The method of claim 1, further comprising a loss function in the model as the model is trained, the loss function being derived from a weighted sum of a positioning loss function, a confidence loss function, and a classification loss function, wherein the classification loss function is a BalancedLoss classification loss function, the BalancedLoss classification loss function calculating step comprising:
counting the number of samples of each category, and normalizing to obtain the occurrence frequency of the samples of each category;
calculating the occurrence frequency by using an inverse correlation function to obtain the weight of each category;
multiplying the weight of each category by the cross entropy loss function for enhancing the proportion of the positive sample to obtain the BalancedLoss classification loss function.
3. An unmanned aerial vehicle aerial photographing target detection apparatus, characterized by comprising:
the acquisition module is used for acquiring an unmanned aerial vehicle aerial image to be input;
the processing module is used for extracting features of the unmanned aerial vehicle aerial image by using a preset CSPDarknet53 extraction model and carrying out feature fusion on the extracted features by using a preset IncreasedFFM feature fusion model;
the execution module is used for positioning and classifying the fused features through a preset YOLOv3Head network model and obtaining a detection result of the unmanned aerial vehicle aerial image after non-maximum suppression calculation;
the acquisition module comprises:
the first processing sub-module is used for transmitting top-layer information to the bottom layer from top to bottom and transmitting bottom-layer information to the top layer from bottom to top according to the channels in the IncreasedFFM feature fusion model to complete feature fusion;
the second processing sub-module is used for carrying out feature concentration on the feature maps of the unmanned aerial vehicle aerial image, carrying out network-adaptive channel selection on the concentrated feature map to obtain channel weights, and respectively carrying out up-sampling, ordinary convolution and down-sampling to obtain feature maps of different sizes;
the first execution sub-module is used for adding the fused characteristic diagram and the concentrated characteristic diagram to obtain a characteristic diagram with fully fused characteristics;
the first processing submodule includes:
the first acquisition submodule is used for acquiring a small-scale feature map and a large-scale feature map from the feature map obtained by the aerial image of the unmanned aerial vehicle after feature extraction, wherein the feature map gradually becomes smaller along with deep feature extraction of the neural network;
the third processing sub-module is used for up-sampling the small-scale feature map from top to bottom according to the channels in the IncreasedFFM feature fusion model, adjusting the sampled feature map to the same size as the feature map of the adjacent next channel, and performing stacking and convolution operations to transfer top-layer information to the bottom layer and complete feature fusion;
a fourth processing sub-module, configured to down-sample the large-scale feature map from bottom to top according to the channels in the IncreasedFFM feature fusion model, adjust the sampled feature map to the same size as the feature map of the adjacent previous channel, and perform stacking and convolution operations to transfer bottom-layer information to the top layer and complete feature fusion;
the second processing sub-module comprises:
a fifth processing sub-module, configured to upsample an upper layer feature in the feature map of the unmanned aerial vehicle aerial image, adjust the upsampled feature map to be the same as a size of the middle layer feature map to obtain a first feature map, downsample a lower layer feature in the feature map of the unmanned aerial vehicle aerial image, and adjust the downsampled feature map to be the same as the size of the middle layer feature map to obtain a second feature map;
the sixth processing submodule is used for adding the first characteristic diagram, the second characteristic diagram and the middle layer characteristic diagram to obtain a concentrated characteristic diagram;
and the second execution sub-module is used for inputting the concentrated feature map into a preset channel attention structure model for network-adaptive channel selection to obtain a concentrated feature map with channel weights, and respectively performing up-sampling, ordinary convolution and down-sampling on the concentrated feature map with the channel weights to obtain three feature maps of different sizes.
4. The apparatus of claim 3, further comprising a loss function in the model as the model is trained, the loss function being derived from a weighted sum of a positioning loss function, a confidence loss function, and a classification loss function, wherein the classification loss function is a BalancedLoss classification loss function, wherein the obtaining module further comprises:
a seventh processing sub-module, configured to count the number of samples in each category, and normalize the number to obtain the occurrence frequency of samples in each category;
an eighth processing sub-module, configured to calculate the occurrence frequency by using an inverse correlation function to obtain a weight of each category;
and the third execution submodule is used for multiplying the weight of each category with the cross entropy loss function for enhancing the proportion of the positive sample to obtain the BalancedLoss classification loss function.
CN202110817078.2A 2021-07-20 2021-07-20 Unmanned aerial vehicle aerial photographing target detection method and device Active CN113723181B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110817078.2A CN113723181B (en) 2021-07-20 2021-07-20 Unmanned aerial vehicle aerial photographing target detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110817078.2A CN113723181B (en) 2021-07-20 2021-07-20 Unmanned aerial vehicle aerial photographing target detection method and device

Publications (2)

Publication Number Publication Date
CN113723181A CN113723181A (en) 2021-11-30
CN113723181B true CN113723181B (en) 2023-10-20

Family

ID=78673539

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110817078.2A Active CN113723181B (en) 2021-07-20 2021-07-20 Unmanned aerial vehicle aerial photographing target detection method and device

Country Status (1)

Country Link
CN (1) CN113723181B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114565862A (en) * 2022-02-17 2022-05-31 中国人民解放军国防科技大学 SSD convolutional network-based large-size aerial image target detection method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108416307A (en) * 2018-03-13 2018-08-17 北京理工大学 A kind of Aerial Images road surface crack detection method, device and equipment
CN108647655A (en) * 2018-05-16 2018-10-12 北京工业大学 Low latitude aerial images power line foreign matter detecting method based on light-duty convolutional neural networks
CN109344821A (en) * 2018-08-30 2019-02-15 西安电子科技大学 Small target detecting method based on Fusion Features and deep learning
CN112183236A (en) * 2020-09-10 2021-01-05 佛山聚卓科技有限公司 Unmanned aerial vehicle aerial photography video content identification method, device and system

Also Published As

Publication number Publication date
CN113723181A (en) 2021-11-30

Similar Documents

Publication Publication Date Title
Tian et al. A dual neural network for object detection in UAV images
CN112150493B (en) Semantic guidance-based screen area detection method in natural scene
US20220230282A1 (en) Image processing method, image processing apparatus, electronic device and computer-readable storage medium
CN110929578A (en) Anti-blocking pedestrian detection method based on attention mechanism
CN111079739B (en) Multi-scale attention feature detection method
CN113313082B (en) Target detection method and system based on multitask loss function
CN111797841B (en) Visual saliency detection method based on depth residual error network
CN113052006B (en) Image target detection method, system and readable storage medium based on convolutional neural network
CN113743505A (en) Improved SSD target detection method based on self-attention and feature fusion
CN114220126A (en) Target detection system and acquisition method
CN112580480A (en) Hyperspectral remote sensing image classification method and device
Liu et al. CAFFNet: channel attention and feature fusion network for multi-target traffic sign detection
CN115410081A (en) Multi-scale aggregated cloud and cloud shadow identification method, system, equipment and storage medium
CN115346071A (en) Image classification method and system for high-confidence local feature and global feature learning
CN114943893A (en) Feature enhancement network for land coverage classification
CN114005094A (en) Aerial photography vehicle target detection method, system and storage medium
CN115527096A (en) Small target detection method based on improved YOLOv5
CN113723181B (en) Unmanned aerial vehicle aerial photographing target detection method and device
CN112800932B (en) Method for detecting remarkable ship target in offshore background and electronic equipment
CN117542082A (en) Pedestrian detection method based on YOLOv7
CN112926667A (en) Method and device for detecting saliency target of depth fusion edge and high-level feature
CN116740516A (en) Target detection method and system based on multi-scale fusion feature extraction
CN114724175B (en) Pedestrian image detection network, pedestrian image detection method, pedestrian image training method, electronic device and medium
CN115578364A (en) Weak target detection method and system based on mixed attention and harmonic factor
Chen et al. Improved yolov3 algorithm for ship target detection

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant